# Senior HPC Engineer

**Company**: IT Infrastructure
**Location**: New York, New York, United States of America
**Work arrangement**: onsite
**Experience**: senior
**Job type**: full-time
**Salary**: $175,000 to $250,000
**Category**: Engineering
**Industry**: Technology

**Apply**: https://mlp.eightfold.ai/careers/job/755955818333
**Canonical**: https://yubhub.co/jobs/job_8eda17fb-432

## Description

We are seeking a Senior HPC Engineer to join our team in a senior, hands-on role building and evolving large-scale, high-throughput HPC and GPU platforms that underpin AI- and machine-learning-driven research.

In this role, you will be part of a small, senior HPC team, taking end-to-end ownership of a significant area of the platform while collaborating closely with other subject-matter experts.

You will be a systems-level engineer comfortable owning complex technical decisions and designing and building production infrastructure, rather than advising from the sidelines.

We aim to build infrastructure that is reliable, understandable, and adaptable, and we value engineers who care about simplicity, clarity, and maintainability as much as raw performance.

Responsibilities:

- Design, build, and operate large-scale, high-throughput HPC and GPU clusters (for example, tens of thousands of CPU cores and hundreds of GPUs) supporting AI and machine-learning workloads.

- Collaborate with other HPC engineers and subject-matter experts to co-design system architectures, review designs, and share knowledge.

- Partner with storage specialists to architect and maintain high-performance, low-latency storage solutions, including parallel or scale-out file systems.

- Work closely with researchers, data scientists, and engineers to understand computational needs and translate them into effective, scalable system designs.

- Monitor, analyze, and optimize performance across compute, scheduling, networking, and storage layers.

- Build and maintain automation and infrastructure-as-code for provisioning, configuration, monitoring, and lifecycle management, with an emphasis on repeatability and simplicity.

- Participate in design reviews, operational discussions, and post-incident reviews with a focus on learning, collaboration, and system improvement rather than blame.

- Explore alternative approaches to scheduling, data layout, cluster architectures, and GPU utilization through small experiments or prototypes, using data to guide decisions.

- Produce clear documentation, diagrams, and reusable tooling that enable others to operate, debug, and extend the platform.

- Stay current with advancements in HPC, GPU computing, networking, and storage, and help assess where new technologies can add real value.

What you'll bring:

- Bachelor’s degree in Computer Science, Engineering, or a related technical field; a Master’s or PhD is a plus.

- Typically 7+ years of hands-on experience designing, building, and operating HPC or large-scale compute environments.

- Deep, practical experience with at least one major HPC scheduler (such as Slurm), including using it to operate large-scale or high-throughput clusters in production.

- Hands-on experience with GPU-accelerated computing, including NVIDIA GPUs and associated software ecosystems.

- Strong Linux systems engineering skills and comfort working close to the operating system, drivers, and hardware.

- Experience designing or operating high-performance storage systems, including parallel or scale-out file systems.

- Curious, evidence-driven problem solving, including experimenting with different approaches and using data to inform decisions.

- A collaborative working style that values listening, respectful discussion, and incorporating different perspectives, whether you are more quiet and reflective or more vocal in group settings.

- Clear written and verbal communication skills, and an ability to explain complex ideas in a way that works for different audiences.

- A strong sense of ownership for outcomes, paired with openness to feedback, learning, and evolving systems over time.

Additional experience that may be helpful:

- Experience with Kubernetes, Run:ai, or other workload orchestration platforms alongside traditional HPC schedulers.

- Familiarity with Lustre, GPFS/Spectrum Scale, or similar high-performance storage technologies.

- Exposure to cloud-based HPC environments (e.g., GCP or other major cloud providers).

- Experience supporting quantitative research, finance, or other demanding compute-intensive workloads.

- Interest in applying AI or ML techniques to infrastructure (for example, optimization, anomaly detection, or predictive analysis).

The estimated base salary range for this position is $175,000 to $250,000; this range is specific to New York and may change in the future. Millennium offers a total compensation package that includes a base salary, a discretionary performance bonus, and a comprehensive benefits package.

## Skills

### Required
- HPC
- GPU
- Linux
- Slurm
- NVIDIA
- GPU-accelerated computing
- High-performance storage systems
- Parallel or scale-out file systems
- Automation and infrastructure-as-code
- Provisioning
- Configuration
- Monitoring
- Lifecycle management
