We are seeking an experienced Machine Learning Engineer to join our Platform and DevOps Engineering group. In this role, you will build and maintain high-availability, scalable model inference infrastructure that serves advanced machine learning models, including Large Language Models (LLMs) and anomaly detection systems.
The ideal candidate is passionate about building scalable, robust machine learning infrastructure and about applying advanced AI models to drive business impact. The work involves creating and maintaining high-performance inference systems that support both batch and real-time AI processing in cloud and on-premises environments.
Responsibilities:
- Design and implement scalable, high-availability machine learning inference architectures.
- Develop robust systems that efficiently manage the deployment and operation of complex models like LLMs and anomaly detectors.
- Use AWS services such as SageMaker, Lambda, SQS, and Redshift to optimize the performance and scalability of our machine learning infrastructure.
- Collaborate with data scientists and AI researchers to ensure seamless integration and optimal performance of machine learning models.
- Build and maintain monitoring systems to ensure the stability and efficiency of the machine learning infrastructure.
- Troubleshoot and resolve issues related to model performance, infrastructure bottlenecks, and system failures.
- Communicate clearly and collaborate effectively within a dynamic team environment.
Skills:
- Proven ability to work effectively in a team setting.
- At least 4 years of experience in machine learning engineering or a related field.
- Strong experience in building and maintaining scalable machine learning infrastructure.
- Proficiency with AWS services, particularly SageMaker, Lambda, SQS, and Redshift.
- Deep understanding of machine learning operations (MLOps) and best practices for model deployment.
- Expertise in Python, plus familiarity with other scripting languages such as Bash.
- Solid experience with big data technologies and data management tools.
- Working knowledge of continuous integration and continuous deployment (CI/CD) practices.
Advantages:
- Experience with real-time machine learning model deployment.
- Familiarity with cybersecurity applications of machine learning.
- Advanced skills in performance optimization for high-throughput systems.
Tech Stack:
AWS (SageMaker, Lambda, SQS, Redshift), DeepSpeed, TensorFlow, PyTorch, Scikit-learn, Airflow, Python, Docker, Kubernetes, Jenkins, Terraform, Ansible, GitHub, and more.