NVIDIA is seeking a technical leader to define, craft, implement, and guide firmware architecture for reliability, availability, serviceability, and power management across next-generation NVIDIA Networking products and platforms. You will take a strong hands-on role, working with hardware, firmware, software, validation, customer engineering, and external partners to build robust, diagnosable, power-efficient systems for large-scale deployments.
NVIDIA's invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as the AI computing company. We are looking to grow our teams with the smartest people in the world. If you're creative and autonomous, we want to hear from you!
What you'll be doing
- Define platform-level firmware architecture for RAS and power management across SoCs, accelerators, DPUs, servers, embedded systems, and data center platforms.
- Own error detection, classification, containment, recovery, escalation, and reporting architecture.
- Define firmware architecture for power sequencing, power states, reset flows, thermal and power fault handling, idle management, and recovery from power-related failures.
- Create firmware specifications for hardware error handling, health monitoring, crash capture, telemetry, diagnostics, debug data, and field serviceability.
- Define interfaces and contracts between firmware, hardware, operating systems, BMCs, management controllers, platform software, and cloud/service infrastructure.
- Drive architecture reviews, tradeoff discussions, failure-mode analysis, validation strategy, and long-term RAS and power management roadmap planning.
- Establish standards for error logs, event schemas, telemetry flows, recovery policies, service diagnostics, and production debug infrastructure.
- Guide engineering teams through implementation, validation, silicon bring-up, platform integration, and production deployment of RAS and power management features.
- Analyze customer and field failures, identify architectural gaps, and feed lessons learned into future platform designs.
What we need to see
- BSc, MSc, or PhD in Electrical Engineering, Computer Science, or Computer Engineering, or equivalent experience.
- 7+ years of relevant experience in firmware, platform architecture, embedded systems, or low-level systems software.
- Deep understanding of RAS principles, fault modeling, error containment, recovery policies, diagnosability, and serviceability requirements.
- Experience architecting firmware for complex hardware platforms such as SoCs, accelerators, DPUs, servers, networking devices, or embedded systems.
- Strong knowledge of power management concepts, including power sequencing, reset architecture, thermal and power fault handling, power state transitions, and platform recovery flows.
- Familiarity with boot firmware, UEFI/BIOS, BMC, embedded controllers, RTOS, embedded Linux, or platform management stacks.
- Strong understanding of hardware/software interfaces, registers, interrupts, telemetry paths, debug infrastructure, and firmware-to-hardware contracts.
- Strong programming and debugging fundamentals in languages such as C/C++, Python or Perl scripting, Verilog, and assembly (including RISC-V).
- Ability to lead cross-functional architecture discussions and drive alignment across hardware, firmware, software, validation, product, and customer-facing teams.
- Excellent communication skills, strong technical leadership, and a real passion for working collaboratively.
Ways to stand out from the crowd
- Experience with PCIe AER, CXL RAS, memory RAS, ECC/parity, accelerator RAS, networking RAS, high-availability systems, or large-scale data center platforms.
- Knowledge of ACPI, SMBIOS, UEFI, PLDM, MCTP, Redfish, IPMI, or cloud telemetry systems.
- Experience with power/thermal fault handling, dynamic power management, platform power sequencing, low-power states, or autonomous recovery mechanisms.
- Background in silicon bring-up, platform validation, production diagnostics, or customer failure analysis.
- Prior technical leadership experience as a firmware architect, principal engineer, platform lead, or domain owner.
We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
JR2018727