Description

We are looking for a Sr. Engineer to design, build, and scale the infrastructure powering NVIDIA's AI agent ecosystem. You will work at the intersection of distributed systems, developer platforms, and agentic AI, building the foundational services that enable teams across the company to develop, deploy, orchestrate, and operate autonomous AI agents at production scale.

Key Responsibilities:

Build platform services that own the full agent lifecycle, from registration through deployment, execution, and teardown
Architect Kubernetes-based execution environments with pod lifecycle management, namespace isolation, persistent storage, and identity propagation
Develop and maintain automated CI/CD pipelines using GitLab CI and ArgoCD, including reusable pipeline templates and deployment blueprints that standardize how agents are built across teams
Build framework-agnostic infrastructure supporting multiple agent SDKs (Claude Code, OpenAI Codex, LangGraph), working hands-on with harnesses, lifecycle hooks, skills configurability, observability (OTEL), and memory services
Build and operate Kafka-based message pipelines and real-time event streaming using Redis Pub/Sub and SSE
Develop data ingestion pipelines, access interfaces, and storage layers that power AI agent knowledge and context
Implement session management for state persistence, conversation history, and agent recovery across sessions
Develop multi-layer auth using OAuth 2.0, JWT validation, token exchange, and gateway integration, and manage secrets lifecycle with Vault (provisioning, rotation, container injection)

Requirements:

Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent experience), with 12+ years in software engineering, ideally in platform engineering, infrastructure, or developer tools
Experience building and scaling AI agents in production using frameworks like Claude Code, Codex, or LangGraph
Deep Kubernetes expertise including pod orchestration, persistent storage, RBAC, and multi-cluster management
Strong Python skills with production API experience using FastAPI, Flask, or similar async frameworks
Proven track record designing distributed systems with Kafka, Redis, and MongoDB or PostgreSQL
Expertise building and managing robust CI/CD pipelines using GitLab CI and ArgoCD for continuous delivery to Kubernetes
Experience designing AI data platform components (ingestion pipelines, vector stores, retrieval APIs, data preprocessing workflows) and building developer-facing platform APIs consumed by multiple engineering teams
Solid grasp of auth and identity: OAuth 2.0, JWT, token exchange, and secrets management with Vault
History of leading sophisticated technical projects such as migrations or greenfield platform builds, with strong interpersonal skills to drive alignment across teams and write clear design documents

Nice to Have:

Experience building or operating AI agent platforms or agentic workflow systems, with hands-on expertise in agent protocols and frameworks like MCP, A2A, LangChain, or LangGraph
Hands-on experience with RAG architectures, embedding pipelines, and vector databases (Milvus, Pinecone, or Weaviate)
Full-stack skills with React or Vue for building developer portals and dashboards
Contributions to open-source infrastructure or platform tooling

This listing is enriched and indexed by YubHub. To apply, use the employer's original posting: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-Staff-Software-Engineer---AI-Agent-Platform_JR2016997