Description
Site Reliability Engineer, Frontier Systems Infrastructure
Location
San Francisco
Employment Type
Full-time
Department
Scaling
Compensation
- $255K – $490K • Offers Equity
The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.
- Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
- Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
- 401(k) retirement plan with employer match
- Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
- Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
- 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
- Mental health and wellness support
- Employer-paid basic life and disability coverage
- Annual learning and development stipend to fuel your professional growth
- Daily meals in our offices, and meal delivery credits as eligible
- Relocation support for eligible employees
- Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.
More details about our benefits are available to candidates during the hiring process.
This role is at-will and OpenAI reserves the right to modify base pay and other compensation components at any time based on individual performance, team or company results, or market conditions.
About the Team
The Frontier Systems team at OpenAI builds, launches, and supports the world's largest supercomputers, which OpenAI uses for its most cutting-edge model training.
We take data center designs, turn them into real, working systems, and build whatever software is needed to run large-scale frontier model training.
Our mission is to bring up, stabilize, and keep these hyperscale supercomputers reliable and efficient throughout the training of frontier models.
About the Role
We are looking for engineers to operate the next generation of compute clusters that power OpenAI’s frontier research.
This role blends distributed systems engineering with hands-on infrastructure work in our largest data centers. You will grow Kubernetes clusters to massive scale, automate bare-metal bring-up, and build the software layer that hides the complexity of vast fleets of nodes spanning multiple data centers.
You will work at the intersection of hardware and software, where speed and reliability are critical. Expect to manage fast-moving operations, quickly diagnose and fix issues when things are on fire, and continuously raise the bar for automation and uptime.
In this role, you will:
- Spin up and scale large Kubernetes clusters, including automation for provisioning, bootstrapping, and cluster lifecycle management
- Build software abstractions that unify multiple clusters and present a seamless interface to training workloads
- Own node bring-up from bare metal through firmware upgrades, ensuring fast, repeatable deployment at massive scale
- Improve operational metrics such as reducing cluster restart times (e.g., from hours to minutes) and accelerating firmware or OS upgrade cycles
- Integrate networking and hardware health systems to deliver end-to-end reliability across servers, switches, and data center infrastructure
- Develop monitoring and observability systems to detect issues early and keep clusters stable under extreme load
- Execute at the same level as a software engineer
You might thrive in this role if you:
- Have deep experience operating or scaling Kubernetes clusters or similar container orchestration systems in high-growth or hyperscale environments
- Bring strong programming or scripting skills (Python, Go, or similar) and familiarity with Infrastructure-as-Code tools such as Terraform or CloudFormation
- Are comfortable with bare-metal Linux environments, GPU hardware, and large-scale networking
- Enjoy solving fast-moving, high-impact operational problems and building automation to eliminate manual work
- Can balance careful engineering with the urgency of keeping mission-critical systems running
Qualifications
- Experience as an infrastructure, systems, or distributed systems engineer in large-scale or high-availability environments
- Strong knowledge of Kubernetes internals, cluster scaling patterns, and containerized workloads
- Proficiency in cloud infrastructure concepts (compute, networking, storage, security) and in automating cluster or data center operations
_Bonus: background with GPU workloads, firmware management, or high-performance computing_