# Solutions Architect, Inference Deployments

**Company**: NVIDIA
**Location**: Santa Clara
**Work arrangement**: onsite
**Experience**: senior
**Job type**: full-time
**Category**: Engineering
**Industry**: Technology

**Apply**: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Solutions-Architect--Inference-Deployments_JR2014105?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_1f2551e9-1a4

## Description

We're forming a team of innovators to roll out and enhance AI inference solutions at scale, demonstrating NVIDIA's GPU technology and Kubernetes. As a Solutions Architect focused on inference, you'll collaborate closely with our engineering and DevOps teams and with customers to develop enterprise AI solutions. Together, we'll deliver generative AI to production!

**Key Responsibilities:**

- Build inference pipelines with tools like NVIDIA Dynamo, distributing tasks among GPU workers to improve efficiency.

- Collaborate with DevOps teams to orchestrate disaggregated inference using Kubernetes for complex workloads.

- Accelerate inference pipelines using TensorRT-LLM, vLLM, SGLang, and other backends to ensure seamless integration with disaggregated inference.

- Provide mentorship and technical leadership to customers and internal teams, guiding them through the deployment of disaggregated inference systems and resolving complex issues.

**Requirements:**

- 5+ years in solutions architecture, with a proven track record of deploying distributed systems and AI inference workloads on Kubernetes.

- Experience with at least one of NVIDIA Dynamo, Triton Inference Server, or TensorRT-LLM for model optimization and serving.

- Experience with GPU orchestration using NVIDIA GPU Operator, NIM Operator, and Multi-Instance GPU (MIG) partitioning.

- Experience solving sophisticated problems in GPU allocation, memory hierarchies (HBM, DRAM, SSD), and low-latency networking (RDMA, UCX).

- Demonstrated success in tuning large language models for low-latency inference in enterprise environments.

- BS in CS/Engineering or equivalent experience.

**Nice to Have:**

- Prior experience deploying NVIDIA inference technologies such as Dynamo, NIM, NIXL, and Grove.

- Deep understanding of transformer neural networks and of inference acceleration techniques such as quantization, speculative decoding, and WideEP.

- NVIDIA Certified AI Engineer or similar credentials.

- Contributions to open-source projects including NVIDIA Dynamo, vLLM, KServe, or SGLang.

## Skills

### Required
- NVIDIA Dynamo
- Kubernetes
- TensorRT-LLM
- vLLM
- SGLang
- GPU orchestration
- NVIDIA GPU Operator
- NIM Operator
- Multi-Instance GPU (MIG) partitioning
- GPU allocation
- memory hierarchies
- low-latency networking

### Nice to have
- Prior experience deploying NVIDIA inference technologies
- Deep understanding of transformer neural networks
- Inference acceleration techniques such as quantization, speculative decoding, and WideEP
- NVIDIA Certified AI Engineer or similar credentials
- Contributions to open-source projects including NVIDIA Dynamo, vLLM, KServe, or SGLang

---

Source: [Apply at nvidia.wd5.myworkdayjobs.com](https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Solutions-Architect--Inference-Deployments_JR2014105?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)
