# Software Engineer, Data Infrastructure

**Company**: Thinking Machines Lab
**Location**: San Francisco
**Work arrangement**: onsite
**Experience**: entry|mid|senior
**Job type**: full-time
**Salary**: $350,000 - $475,000 USD
**Category**: Engineering
**Industry**: Technology

**Apply**: https://job-boards.greenhouse.io/thinkingmachines/jobs/5013919008
**Canonical**: https://yubhub.co/jobs/job_9be280f4-cbc

## Description

We're looking for an engineer to join our small, high-impact team responsible for architecting and scaling the core infrastructure behind distributed training pipelines, multimodal data catalogs, and intelligent processing systems that operate over petabytes of data.

As a software engineer on our data infrastructure team, you'll design, build, and operate scalable, fault-tolerant infrastructure for LLM research: distributed compute, data orchestration, and storage across modalities. You'll develop high-throughput systems for data ingestion, processing, and transformation, including training data catalogs, deduplication, quality checks, and search. You'll also build systems for traceability, reproducibility, and robust quality control at every stage of the data lifecycle.

You'll collaborate with research teams to unlock new features, improve data quality, and accelerate training cycles. You'll implement and maintain monitoring and alerting to support platform reliability and performance.

If you're excited by distributed systems, large-scale data mining, and open-source tools like Spark, Kafka, Beam, Ray, and Delta Lake, and you enjoy building from the ground up, we'd love to hear from you.

## Skills

### Required
- backend language (Python or Rust)
- distributed compute frameworks (Apache Spark or Ray)
- cloud infrastructure
- data lake architectures
- batch and streaming pipelines

### Nice to have
- Kafka
- dbt
- Terraform
- Airflow
- web crawling
- deduplication
- data mining
- search
- file formats and storage systems