Description
We are looking for a Sr. Engineer to design, build, and scale the infrastructure powering NVIDIA's AI agent ecosystem. You will work at the intersection of distributed systems, developer platforms, and agentic AI, building the foundational services that enable teams across the company to develop, deploy, orchestrate, and operate autonomous AI agents at production scale.
Key Responsibilities:
- Design and build platform services that own the full agent lifecycle, from registration through deployment, execution, and teardown
- Architect Kubernetes-based execution environments with pod lifecycle management, namespace isolation, persistent storage, and identity propagation
- Develop and maintain automated CI/CD pipelines using GitLab CI and ArgoCD, including reusable pipeline templates and deployment blueprints that standardize how agents are built across teams
- Build framework-agnostic infrastructure supporting multiple agent SDKs (Claude Code, OpenAI Codex, LangGraph), including harnesses, lifecycle hooks, configurable skills, observability (OTEL), and memory services
- Build and operate Kafka-based message pipelines and real-time event streaming using Redis Pub/Sub and Server-Sent Events (SSE)
- Develop data ingestion pipelines, access interfaces, and storage layers that power AI agent knowledge and context
- Implement session management for state persistence, conversation history, and agent recovery across sessions
- Develop multi-layer auth using OAuth 2.0, JWT validation, token exchange, and gateway integration, and manage the secrets lifecycle with Vault (provisioning, rotation, container injection)
Requirements:
- Bachelor's or Master's degree in Computer Science, Engineering, or related field (or equivalent experience), with 12+ years in software engineering, ideally in platform engineering, infrastructure, or developer tools
- Experience building and scaling AI agents in production using frameworks like Claude Code, Codex, or LangGraph
- Deep Kubernetes expertise including pod orchestration, persistent storage, RBAC, and multi-cluster management
- Strong Python skills with production API experience using FastAPI, Flask, or similar async frameworks
- Proven track record designing distributed systems with Kafka, Redis, and MongoDB or PostgreSQL
- Expertise building and managing robust CI/CD pipelines using GitLab CI and ArgoCD for continuous delivery to Kubernetes
- Experience designing AI data platform components (ingestion pipelines, vector stores, retrieval APIs, data preprocessing workflows) and building developer-facing platform APIs consumed by multiple engineering teams
- Solid grasp of auth and identity: OAuth 2.0, JWT, token exchange, and secrets management with Vault
- History of leading sophisticated technical projects such as migrations or greenfield platform builds, with the interpersonal skills to drive alignment across teams and the ability to write clear design documents
Nice to Have:
- Experience building or operating AI agent platforms or agentic workflow systems, with hands-on expertise in agent protocols and frameworks like MCP, A2A, LangChain, or LangGraph
- Hands-on experience with RAG architectures, embedding pipelines, and vector databases (Milvus, Pinecone, or Weaviate)
- Full-stack skills with React or Vue for building developer portals and dashboards
- Contributions to open-source infrastructure or platform tooling
To apply, use the employer's original posting:
https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-Staff-Software-Engineer---AI-Agent-Platform_JR2016997