NVIDIA

Senior Software Engineer, Infrastructure Automation and Distributed Systems

remote senior full-time

First indexed 18 Jun 2026

Description

We are seeking a Senior Software Engineer to join our team, focusing on building and running reliable large-scale infrastructure platform services. You will ensure that our internal- and external-facing EDA services atop NVIDIA hardware run as reliably as needed.

What you'll be doing:

  • Design, build, deploy, and run infrastructure services & manage the software life cycle to meet our business goals.
  • Participate in defining internal-facing service level objectives and error budgets as part of our overall observability strategy.
  • Eliminate toil or automate it where the ROI of building and maintaining automation is worth it.
  • Practice sustainable blameless incident prevention and incident response while being a member of an on-call rotation.
  • Consult with peer teams and advise them on systems design best practices.

What we need to see:

  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics) or equivalent experience.
  • 12+ years of relevant experience.
  • A track record showing a good balance between initiating your own projects, convincing others to collaborate with you, and collaborating well on projects initiated by others.
  • Experience with infrastructure automation and distributed systems design, developing tools for running large-scale private or public cloud systems in production.
  • Experience in one or more of the following: Python, Go, Perl, or Ruby.
  • In-depth knowledge of one or more of Linux, Networking, Storage, or Containers.

Ways to stand out from the crowd:

  • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive. Experience using coding assistants, MCP servers, or AI agents to accelerate positive business impact.
  • Experience working with or developing systems associated with bare metal as a service (BMaaS).
  • Experience working with or developing multi-cloud infrastructure services and running private or public cloud systems based on one or more of Kubernetes, OpenStack, Docker, or Slurm.
  • Experience teaching reliability (e.g., SRE) or more general good practices for cloud systems to peers or to other companies (e.g., CRE).
  • Background with NVIDIA Collective Communication Library (NCCL).

You will also be eligible for equity and benefits.