# Site Reliability Engineer

**Company**: Dropbox
**Location**: Remote - Mexico
**Work arrangement**: remote
**Experience**: senior
**Job type**: full-time
**Category**: Engineering
**Industry**: Technology

**Apply**: https://job-boards.greenhouse.io/dropbox/jobs/7539817?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_5532e584-85e

## Description

## Role Description

As a Corporate Site Reliability Engineer (SRE) at Dropbox, you will help lead the infrastructure strategy and technical direction of one of the most innovative technology companies globally. Successful candidates will have a growth mindset and strong accountability, and will be passionate about designing, building, and securing scalable infrastructure services in a dynamic environment. You will drive improvement projects in automation and observability and handle incidents promptly but in a measured way. In this role, you'll serve as a technical lead for programs related to monitoring, metrics, alerting, and reliability throughout the IT Services organization, and contribute to the evolution of our world-class infrastructure while ensuring its security and scalability.

## Responsibilities

- Ensure the reliability, scalability, and performance of Dropbox's infrastructure and services

- Collaborate with cross-functional teams to develop and maintain best practices for monitoring, logging, and incident response

- Build, implement, and maintain automation and infrastructure-as-code tooling, specifically Terraform, Ansible, and GitHub Actions, as well as custom code platforms

- Use container orchestration platforms, such as Kubernetes, Amazon ECS, and Red Hat OpenShift, to manage containers at scale

- Manage and optimize monitoring and logging pipelines using tools like Datadog and Cribl LogStream

- Drive improvement projects related to service health and visibility for our stakeholders, ranging from developers to business service owners to C-level executives

- Develop and maintain custom tooling and automation scripts in Bash, Python, and other scripting languages

On-call work may be necessary occasionally to help address bugs, outages, or other operational issues, with the goal of maintaining a stable and high-quality experience for our customers.

## Requirements

- 5+ years of experience in site reliability engineering or a similar engineering role, with hands-on coding experience

- Strong knowledge of AWS services, including EC2, S3, RDS, Route 53, Lambda, and others

- Strong knowledge of Linux administration, internals, filesystems, and volume management; of specific distros such as Ubuntu and RHEL; and of DNS and DHCP

- Experience with monitoring and logging tools such as Datadog, and with logging pipeline tools such as Vector or Cribl LogStream

- Experience driving one or more transformational programs related to metrics and observability

- Experience with scripting in a higher-level language (Python preferred)

- Experience developing automation to solve infrastructure-related tasks with tools such as Chef, Ansible, or Terraform

- Experience with log analysis and building metrics, alerts, and visualizations from log data

- Strong proficiency in infrastructure-as-code tools, such as Terraform

- Strong proficiency in configuration management tools, specifically Ansible Automation Platform and Chef

- Experience with containerization technologies, such as Docker, and container orchestration platforms like Kubernetes or Amazon ECS

- Knowledge of LDAP, REST APIs, and current authentication methods

- Familiarity with GitHub and Git-based workflows

- Understanding of RDS databases and network security technologies, such as WAF

- Strong problem-solving skills and the ability to work well in a fast-paced, collaborative environment

- Excellent written and verbal communication skills

## Preferred Qualifications

- Experience managing large-scale multi-cloud or hybrid infrastructure

- Strong background in Infrastructure as Code (Terraform, Ansible) and GitOps workflows

- Familiarity with Kubernetes, Docker, and serverless platforms

- Proven track record of improving observability, reliability, and incident response

- Understanding of compliance and security frameworks (SOC 2, ISO 27001, FedRAMP)

- Experience implementing Zero Trust security and access models

## Skills

### Required
- AWS
- Linux administration
- Monitoring and logging tools
- Datadog
- Cribl LogStream
- Terraform
- Ansible
- Kubernetes
- Docker
- Python
- Bash
- LDAP
- REST APIs
- GitHub

### Nice to have
- Large-scale multi-cloud or hybrid infrastructure
- Infrastructure as Code
- GitOps workflows
- Serverless platforms
- Compliance and security frameworks
- Zero Trust security and access models

---

Source: [Apply at job-boards.greenhouse.io](https://job-boards.greenhouse.io/dropbox/jobs/7539817?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)
