Senior Performance Engineer - OpenShift AI

Overview
About The Job


The Red Hat Performance and Scale Engineering team is looking for a Senior Performance Engineer to join the PSAP (Performance and Scale for AI Platforms) team. As recent advances in AI technologies have taken the world by storm, IBM and Red Hat are jointly engineering an enterprise-grade platform for leveraging the full potential of generative AI technologies. As part of this team, you will be responsible for the performance and scalability assessments of large-scale, multi-node, multi-GPU distributed training jobs. Our goal is to make OpenShift AI the platform of choice for our customers when leveraging generative AI technologies. You will help us achieve that goal through targeted improvements in the performance and scalability of the platform for large-scale distributed training.


You will be required to formulate and execute performance test plans; investigate Linux, OpenShift, cloud infrastructure, and OpenShift AI performance tuning knobs; triage and potentially fix performance issues; create new benchmarking tests and tools as needed; and share performance results on a regular basis. This role needs an engineer who thinks creatively, adapts to rapid change, and is willing to learn and apply new technologies. You will be joining a vibrant open source culture and helping promote performance and innovation in this Red Hat engineering team.


The broader mission of the Performance and Scale team is to establish performance and scale leadership of the Red Hat product and cloud services portfolio. The scope includes component-level, system, and solution analysis and targeted enhancements. The team collaborates with engineering, product management, product marketing, and customer support, as well as hardware and software partners.


What You Will Do


  • Execute performance and scalability benchmarks against OpenShift AI with a targeted focus on large-scale, multi-node, multi-GPU distributed training jobs
  • Collaborate with Development teams to resolve performance issues
  • Triage, debug, and solve customer cases related to AI performance
  • Publish results, conclusions, recommendations, and best practices via documents and blogs to the support team, partners, and customers
  • Participate in internal and external conferences about your work and results


What You Will Bring


  • 5+ years of relevant technical experience
  • Experience in running performance tests, data capture, data analysis, and visualization
  • Programming experience in Python or willingness to learn
  • Experience working with the Linux operating system (RHEL, Fedora or CentOS preferred)
  • Experience with AI/ML technologies and frameworks (classifiers, PyTorch, TensorFlow, etc.)
  • Good written and verbal language skills in English


The following are considered a plus


  • Bachelor’s degree or equivalent experience
  • Experience with container technologies (Podman, Kubernetes, Docker)
  • Experience with systems performance engineering and metrics collection tools such as iostat, vmstat, sar, perf, and Prometheus
  • Knowledge of AI/ML benchmarking suites such as MLPerf
  • Knowledge of generative AI (such as transformers) and distributed training technologies (such as Ray)


Red Hat