Description
NVIDIA is building a mission-critical Observability and Prediction platform to ensure the seamless operation of AI Factories. We're looking for a Senior Software Engineer to join the AIOps platform team and help build core distributed systems that ingest massive telemetry streams from GPU clusters and operationalize predictive AI models at scale.
Responsibilities:
- Architect and build an agentic AIOps system that autonomously monitors GPU fleet health, aggregates and correlates massive telemetry streams, surfaces intelligent alerts, and orchestrates multi-step diagnostic workflows and corrective actions.
- Research, evaluate, and prototype data storage strategies and data representations across diverse database technologies and modalities.
- Design distributed systems to handle the extreme telemetry density of large-scale AI clusters.
- Instrument services with deep observability to support rapid debugging and continuous performance improvement.
- Build and own the model-serving infrastructure that operationalizes predictive algorithms at scale.
- Contribute to the platform's core libraries and abstractions that accelerate development across the broader AIOps engineering team.
Requirements:
- B.Sc./M.Sc. in Computer Science, Computer Engineering, or a related technical field.
- 8+ years of software engineering experience building production distributed systems.
- Expert-level proficiency in languages such as Go, C++, or Rust.
- Solid understanding of Kubernetes and container-based deployments for production services.
- Experience deploying, monitoring, and maintaining ML models or data-intensive services in a production environment.
Nice to Have:
- Experience building ML model-serving platforms or MLOps tooling at scale.
- A track record of taking systems from prototype to stable, production-grade platform serving real enterprise customers.
- A systems-thinking mindset paired with practical innovation skills.
NVIDIA offers competitive salaries and a generous benefits package.
To apply, use the employer's original posting:
https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/Israel-Raanana/Senior-Software-Engineer--AIOps_JR2019710