Description
We are seeking an experienced Senior Software SDET Test Development Engineer to join our platform SWQA team. As a key member of the team, you will develop and execute NVIDIA HGX/DGX/MGX platform test plans from design documents, covering servers, OS, FW, and the CUDA SW stack. You will install and test various systems, OS, server firmware, and SW stack; drive support for root cause analysis of reliability and validation test failures; build, develop, and debug server- and OS-level automation front-end and back-end frameworks and tests; review partner and supplier test results; and prescribe additional reliability testing on components, servers, and packaging as needed. You will work in an agile software development team with high production quality standards, manage bug lifecycles, and collaborate across groups to drive solutions.
To succeed in this role, you will need a bachelor's degree in a STEM field, 5+ years of proven experience or a master's degree, and strong server and Linux troubleshooting and debugging experience in bare-metal and KVM/VMware/Hyper-V environments. You will also need good knowledge of and hands-on experience with model testing, AI tools/frameworks, NLP, and LLM benchmarking; experience using AI development tools for test plan creation, test case development, and test case automation; and strong experience with FW, BMC/OpenBMC, network protocols, internal/external enterprise storage devices, PCIe buses and devices, IO sub-devices, CPU and memory, ACPI, the UEFI spec, and Redfish. Proven experience with GitHub/GitLab/Gerrit, PXE, SLURM, and Stack/Kubernetes/Docker is also required.
You will be eligible for equity and benefits. Applications for this job will be accepted until April 14, 2026.
You will be responsible for:
- Developing and executing NVIDIA HGX/DGX/MGX platform test plans on servers, OS, FW, and CUDA SW stack from design documents.
- Installing and testing various systems, OS, server firmware, and SW stack.
- Driving support for root cause analysis of reliability and validation test failures.
- Building, developing, and debugging server- and OS-level automation front-end and back-end frameworks and tests.
- Reviewing partner and supplier test results and prescribing additional reliability testing on components, servers, and packaging as needed.
You will need:
- A bachelor's degree in a STEM field.
- 5+ years of proven experience or a master's degree.
- Strong server and Linux troubleshooting and debugging experience in bare-metal and KVM/VMware/Hyper-V environments.
- Good knowledge of and hands-on experience with model testing, AI tools/frameworks, NLP, and LLM benchmarking.
- Experience using AI development tools for test plan creation, test case development, and test case automation.
- Strong experience with FW, BMC/OpenBMC, network protocols, internal/external enterprise storage devices, PCIe buses and devices, IO sub-devices, CPU and memory, ACPI, the UEFI spec, and Redfish.
- Proven experience with GitHub/GitLab/Gerrit, PXE, SLURM, and Stack/Kubernetes/Docker.