The role
We are looking for a brilliant AI Research Engineer to build the brain and body of our Real-Time Avatar & Conversational Stack. This is a hands-on, deep-tech role in which you will design, train, and optimize the next generation of multimodal AI models. You will join an elite R&D unit working at the bleeding edge of generative video, speech synthesis, and large language models. Your mission is to solve one of the hardest problems in AI: creating a unified, ultra-low-latency agent that can see, hear, and speak with human-level fidelity. You won't just implement papers; you will architect the systems that define the state of the art for enterprise video.
The day-to-day
- Collaborate with Technical Leadership: Partner directly with the Head of AI to architect the long-term research roadmap. You will work shoulder-to-shoulder with other AI Research Engineers, brainstorming novel architectures and conducting peer reviews that raise the collective expertise of the team.
- Master Multimodal Architectures: Research and train large-scale models that fuse video generation (pixels), audio (speech/prosody), and text (semantics) into a cohesive experience.
- Next-Gen Video Synthesis: Develop and optimize advanced architectures—specifically Diffusion Transformers (DiT) and modern GANs—for photorealistic avatar synthesis, focusing on lip-sync accuracy and temporal consistency.
- Conquer Real-Time Constraints: Tackle the challenge of "in-the-wild" inference. You will optimize heavy foundation models to run within strict millisecond latency budgets, ensuring fluid, uninterrupted conversation.
- Advance the Speech Stack: Enhance our proprietary Streaming ASR and Neural TTS architectures to handle interruptions, emotional intonation, and multi-speaker dynamics seamlessly.
Ideally, we’re looking for:
- 5+ years of experience in deep learning research and engineering, with a strong track record of bringing research concepts to production.
- Advanced Academic Background: M.Sc. or Ph.D. in Computer Science, AI, or a related field, with a focus on Generative Models or Computer Vision.
- Generative Media Expertise: Deep understanding of modern architectures (Transformers, Diffusion, GANs) applied to video synthesis, neural rendering, or audio generation.
- Strong Engineering Skills: Proficiency in Python and deep learning frameworks (PyTorch is preferred), with the ability to write clean, modular, and scalable code.
- Inference Optimization: Experience optimizing models for low-latency, real-time inference (e.g., quantization, TensorRT, ONNX).
Nice to have:
- Top-Tier Publications: A record of published papers in major AI conferences (CVPR, NeurIPS, ICCV, etc.).
- Low-Level Optimization: Experience with CUDA or C++ for maximizing GPU performance.
- Streaming Knowledge: Familiarity with real-time media protocols like WebRTC.
The perks:
- Hybrid, flexible work environment
- Extended private health insurance (including mental health coverage)
- Personal and professional development programs
- Occasional cross-company long weekends