# Senior HPC Software Engineer

**Company**: Ford Motor Company
**Location**: United States
**Work arrangement**: remote
**Experience**: senior
**Job type**: full-time
**Salary**: $113,580-$192,900
**Category**: Engineering
**Industry**: Automotive
**Wikidata**: https://www.wikidata.org/wiki/Q44294

**Apply**: https://efds.fa.em5.oraclecloud.com/hcmUI/CandidateExperience/en/sites/CX_1/job/64140?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_ce0ebba5-c8c

## Description

We are seeking a senior technical contributor to help support, modernize, and scale our on-premises high-performance computing platform. The role spans Linux systems administration, HPC operations, Kubernetes-based services, automation, observability, software tooling, and user-facing platform delivery.

The ideal candidate has deep experience administering RHEL-based systems in complex compute environments and is comfortable troubleshooting issues across operating systems, schedulers, storage, networking, containers, applications, and user workloads.

This person will play a key role in improving the reliability, usability, and operational maturity of the platform. They will help develop and maintain core HPC services, support users running demanding engineering and AI/ML workloads, and create tooling, scripts, APIs, and integrations.

Strong software engineering fundamentals are important, including experience with Python, Go, or similar languages, Git-based development workflows, code reviews, testing practices, CI/CD pipelines, documentation, and maintainable code design. Experience with Slurm or other workload managers is highly valued.

We are looking for someone who can balance strong technical depth with a user-focused delivery mindset. This role requires the ability to work collaboratively with platform engineers, application teams, and technical users to identify pain points, resolve production issues, document repeatable processes, and build durable improvements.

The right candidate will be pragmatic, a team player, comfortable in a fast-moving environment, and motivated by making complex, massive on-prem infrastructure easier to operate, automate, observe, and continuously improve.

Responsibilities:

- Administer, troubleshoot, and improve RHEL-based high-performance computing environments supporting CPU and GPU workloads.
- Create and maintain HPC services across compute, storage, networking, scheduling, Kubernetes, and observability.
- Develop tools, scripts, APIs, integrations, and automation using Python, Go, Bash, or similar languages.
- Apply software engineering best practices, including Git workflows, code reviews, testing, modular design, and CI/CD.
- Support and help update HPC scheduling environments, with Slurm experience preferred.
- Improve monitoring, alerting, dashboards, and operational visibility using Grafana, Prometheus, Dynatrace, and related tools.
- Partner with users, customers, and internal engineering teams to understand requirements, resolve issues, and improve platform usability.
- Create and maintain documentation, architecture notes, user guides, and operational procedures.
- Drive platform modernization focused on reliability, scalability, automation, security, and maintainability.

Qualifications:

- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience.
- 10+ years of experience in systems engineering, infrastructure engineering, platform engineering, or a related technical role.
- Strong Linux systems administration experience, preferably with RHEL.
- Experience with Slurm, PBS, or another HPC workload manager.
- Experience creating APIs, applications, and services that support platform operations and user workflows.
- Experience supporting production compute, infrastructure, and large-scale technical environments.
- Hands-on experience with scripting and software development using Python, Go, Bash, or similar languages.
- Familiarity with CI/CD concepts, GitHub, and modern software delivery practices.
- Strong troubleshooting skills across operating systems, services, networking, storage, and application layers.
- Ability to write clear documentation and communicate effectively with both technical and non-technical stakeholders.
- Strong ownership mindset with the ability to drive issues to resolution.
- Ability to use independent judgment to make sound technical decisions.

## Skills

### Required
- Linux systems administration
- HPC operations
- Kubernetes-based services
- Automation
- Observability
- Software tooling
- User-facing platform delivery
- Python
- Go
- Git-based development workflows
- Code reviews
- Testing practices
- CI/CD pipelines
- Documentation
- Maintainable code design
- Slurm
- Workload managers

---

Source: [Apply at efds.fa.em5.oraclecloud.com](https://efds.fa.em5.oraclecloud.com/hcmUI/CandidateExperience/en/sites/CX_1/job/64140?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)