Description

We are seeking a Senior Software Engineer to join our team, focusing on building and running reliable, large-scale infrastructure platform services. You will ensure that our internal- and external-facing EDA services atop NVIDIA hardware run as reliably as needed.

What you'll be doing:

Design, build, deploy, and run infrastructure services, and manage the software life cycle to meet our business goals.
Participate in defining internal-facing service level objectives and error budgets as part of our overall observability strategy.
Eliminate toil or automate it where the ROI of building and maintaining automation is worth it.
Practice sustainable, blameless incident prevention and incident response while being a member of an on-call rotation.
Consult with peer teams and advise them on systems design best practices.

What we need to see:

BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics) or equivalent experience.
12+ years of relevant experience.
A track record showing a good balance between initiating your own projects, convincing others to collaborate with you, and collaborating well on projects initiated by others.
Experience with infrastructure automation and distributed systems design, developing tools for running large-scale private or public cloud systems in production.
Experience in one or more of the following: Python, Go, Perl, or Ruby.
In-depth knowledge of one or more of the following: Linux, networking, storage, or containers.

Ways to stand out from the crowd:

Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
Experience accelerating positive impact on the business using coding assistants, MCP servers, or AI agents.
Experience working with or developing bare metal as a service (BMaaS) and associated systems.
Experience working with or developing multi-cloud infrastructure services and running private or public cloud systems based on one or more of Kubernetes, OpenStack, Docker, or Slurm.
Experience teaching reliability (e.g., SRE) or more general good practices for cloud systems to peers or to other companies (e.g., CRE).
Background with the NVIDIA Collective Communications Library (NCCL).

You will also be eligible for equity and benefits.
