Description
We're looking for an ambitious Systems/Platform Engineer to join a team at the intersection of SRE and low-latency distributed systems. This team will help power Pinterest's next generation of real-time ML and measurement infrastructure, with a focus on sub-millisecond decisioning, high-throughput data access, and tight integration with Pinterest's core tech stack.
In this role, you'll think about queries and RPCs in terms of syscalls, cache lines, and wire formats, and design systems that stay fast and predictable under load. You'll help define and harden the foundation for our training and serving stack: from storage and indexing strategies, to streaming and fanout, to backpressure and failure handling across services and regions.
You'll work closely with software engineering, data infra, and SRE partners to ensure our systems are observable, debuggable, and operable in production. If topics like IO scheduling and batching, lock-free or low-contention data structures, connection pooling, query planning, kernel and network tuning, on-disk layout and indexing, circuit-breaking, autoscaling, incident response, NixOS, Rust, and robust SLIs/SLOs sound interesting (even if it's just a subset), this role gives you a chance to apply that expertise to business-critical, high-leverage infrastructure at Pinterest scale.
What you'll do:
- Scale the decision-making process for the tvScientific AI team's tools, from our workflows to our training infrastructure to our Kubernetes deployments
- Improve the developer experience for the data science team
- Upgrade our observability tooling
- Make every deployment smooth as our infrastructure evolves
What we're looking for:
- Deep understanding of Linux
- Excellent writing skills
- A systems-oriented mindset
- Experience building high-performance software (e.g., real-time bidding or high-frequency trading)
- Software engineering experience plus reliability expertise (e.g., CI/CD)
- Strong observability instincts
- Demonstrated ability to use AI to improve the speed and quality of relevant outputs in your day-to-day workflow
- Strong track record of critical evaluation and verification of AI-assisted work (e.g., testing, source-checking, data validation, peer review)
- High integrity and ownership: you protect sensitive data, avoid over-reliance on AI, and remain accountable for final decisions and deliverables
Nice-to-haves:
- Reverse-engineering experience
- Terraform, EKS, or MLOps experience
- Python, Scala, or Zig experience
- NixOS experience
- Adtech or CTV experience
- Experience deploying a distributed system across multiple clouds
- Experience with hard real-time, low-latency systems