Senior System Software Engineer Platform - OpenBMC - NVIDIA - Raanana

No longer accepting applications

Overview

Job Type: On-site

Experience: 8 years

Job Position: Embedded

Updated: Aug 02, 2024

Location: Raanana

Salary: N/A

Skills

Bash
Python
Go
C++
C
Linux
RESTful API
PMBus
Mail-box
MCTP
NVMe
OAuth
OpenBMC
PCIe
PLDM
thermal management
power management
x86
ARM
Redfish
SMBus
SPI
U-Boot
I2C
BMC Firmware
BMC-BIOS communication
building Linux images
deploying Linux images
device monitoring
firmware security
firmware update
HTTPS
Linux upgrade mechanisms
I3C
IPMI
JSON
KCS
Linux distributions
Linux kernel
Linux packages

NVIDIA’s invention of the GPU in 1999 fueled the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing — with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as “the AI computing company.” We're looking to grow our company, and form teams with the smartest people in the world.

Are you ready to change the next generation of computing? Join us at the forefront of technological advancement.

What You’ll Be Doing:

Designing and implementing OpenBMC firmware for GPU server platforms, focusing on but not limited to the Arm architecture.
Hands-on work bringing up BMC firmware, analyzing performance, and coding various manageability features for NVIDIA’s server platforms.
Developing and reviewing code, writing and reviewing design documents, reviewing QA test plans, and working closely with all team members to reach consensus on design and testability per product requirements.
Designing solutions for errors, statistics, and configuration appropriate to the CPU, GPU, DIMM, SSDs, NICs, IB, PSU, BMC, FPGA, CPLD, etc., for enterprise readiness of NVIDIA server platforms.
Designing and developing performance-optimized active monitoring BMC solutions using DMTF standards, including the MCTP, Redfish, SPDM, and PLDM specifications.
Instrumenting code to ensure maximum code coverage, writing and automating unit tests for each implemented module, and maintaining detailed unit test case reports.
Providing software quality reports based on static analysis, code coverage, and CPU load.
Working with the security team to ensure developed code is in line with product security goals. Working closely with hardware teams to influence hardware design and review HW architecture and schematics.

What We Need To See:

A Bachelor of Science Degree (or higher) in Electrical Engineering or Computer Science or equivalent experience.
8+ years of experience.
Domain expertise in BMC firmware development on x86 or ARM platforms, including BMC-BIOS communication, thermal management, power management, firmware update, device monitoring, firmware security, etc.
Board bring-up expertise with hands-on experience in device drivers such as I2C/I3C, SPI, PCIe, SMBus, Mailbox, etc., as well as device trees for U-Boot and the Linux kernel.
OOB or in-band system management experience with exposure to standards such as IPMI, KCS, DMTF standards (PLDM, MCTP, Redfish, etc.), PMBus, NVMe, etc.
Understanding of the REST architectural style, especially JSON over HTTPS with OAuth.
Strong programming and scripting skills using C/C++, Bash, Python, Go, etc., for both Linux user-space programs and system programs, with thorough code reviewing skills. Strong in Linux fundamentals, various Linux distributions and packages, Linux upgrade mechanisms, and building and deploying Linux images.
You should possess excellent written and oral communication skills, a good work ethic, a strong sense of teamwork, a love of producing quality work, and a commitment to finishing your tasks every single day. You are a self-starter who loves to find creative solutions to challenging problems.

Ways To Stand Out From The Crowd:

Contributor to industry standards like Open Compute, OpenBMC, IPMI, DMTF Standards, and open source.
Expertise in system software and platform security for x86/ARM-based rack/blade server systems.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people on the planet working for us. If you're creative and autonomous, we want to hear from you!

The base salary range is 180,000 USD - 339,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Over the last two decades, GPU acceleration became the standard for scientific computing, with the fastest supercomputers in the world combining CPUs and GPUs. The emergence of Generative AI has created the largest GPU datacenters ever, and with them the most powerful supercomputers to date. These datacenters are sophisticated and thus challenging to operate. Yet, by analogy with the autonomous vehicle industry, we strive to automate their operation as much as possible. Therefore, today we are building an outstanding data scientist team to tackle these challenges. The emerging field of Artificial Intelligence for IT Operations (AIOps, https://arxiv.org/pdf/2304.04661.pdf) strives to apply AI to the big data generated by IT operations processes, particularly in cloud infrastructures, to provide practical insights with the primary goal of improving availability. There is a wide variety of problems to address, and multiple use cases where AI capabilities can be used to improve operational efficiency. We categorize the key AIOps tasks as incident detection (e.g. via anomaly detection), failure prediction, root cause analysis, and automated actions. Moreover, we are also working on optimization problems to increase not only the resiliency but also the performance of the datacenter.

What You'll Be Doing:

Explore high-level, undefined ideas and solve real-life problems using structured and unstructured data.
Craft proofs of concept rooted in first principles that apply modern data science techniques to operations use cases.
Collaborate in a multi-disciplinary environment with domain experts in various fields such as networking, high-performance computing for AI, telemetry, etc.
Develop a strategic vision for NVIDIA networking together with adjacent architects and research groups.
Define the data pipelines and ML architecture for SaaS for handling hyperscale data problems.
Support software developers in migrating prototypes to end-to-end pipelines that are suitable for deployment in production environments.

What We Need To See:

M.Sc. or Ph.D. in Science or Engineering.
12+ years of relevant experience
Validated excellence and industry experience in data science or machine learning, with a variety of ML/DL algorithms and their applications.
Consistent record of staying ahead of the technology envelope: understanding pioneering research and exploring new technologies to develop practical applications and generate innovative ideas.
Great motivation, with strong interpersonal skills and the ability to communicate highly technical concepts to non-technical audiences.
"Can-do" attitude: the ability to succeed in ambiguous settings where part of the challenge is defining the problem.
Strong programming skills in Python (including unit tests, CI/CD, etc.), as well as comfort using Linux and typical development tools (e.g., GitHub, Docker).
Experience in large scale data systems (on-prem and/or cloud).
Proficiency in deep learning frameworks.

Ways To Stand Out From The Crowd:

Past senior technical roles such as principal data scientist, team leader, tech lead, or head of ML in a startup. Publications in peer-reviewed journals or conferences. Previous real-world experience developing models for anomaly detection, predictive forecasting, and root cause analysis use cases.
Experience in developing and deploying ML pipelines at large scale (TB+). Beyond supervised learning: optimization using reinforcement learning and adaptive experimentation. Experience with the ML deployment lifecycle, including model monitoring and retraining.
Experience with networking, cloud, datacenter, and edge computing technologies.

NVIDIA is committed to encouraging a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
