Description
We are seeking a Sr Production Engineer to join our team in Virginia. In this role, you will design, automate, and operate the IAM, account/subscription, and project lifecycles across AWS, Azure, and GCP, enforcing least-privilege and standardized access patterns at scale.
You will also review, implement, and continuously improve cloud identity and access policies (IAM, Okta, Opal) to align with Databricks security standards and audit requirements.
Additionally, you will build and maintain reliable, observable automation and tooling to apply cloud changes (roles, policies, accounts, networking) safely and repeatedly.
You will treat operational and security issues as software problems: eliminate toil, drive root-cause analysis, and codify fixes into infrastructure and tooling.
You will own and improve security and audit logging data pipelines from cloud providers into our internal systems, ensuring timely, accurate data for detection, investigations, and audits.
You will partner with Security, Compliance, and Audit teams to provide evidence, clarifications, and policy updates that keep our environments aligned with evolving standards.
You will operate and improve specialized, highly regulated environments (e.g., FedRAMP / GovCloud), including release management, patching cadences, and support for secure access workflows (e.g., secure admin workstations, or SAW).
You will ensure high availability and resiliency for critical security and access infrastructure across these environments.
You will participate in a 24x7 on-call rotation for high-severity incidents impacting cloud accounts, IAM, or security data pipelines.
You will act as a key partner to product engineering, security engineering, and field teams during incidents to restore service and harden systems for the future.
To be successful in this role, you will need deep cloud and infrastructure expertise with AWS, Azure, or GCP, along with experience in infrastructure-as-code and automation.
You will also need a proven track record of working in or with security-sensitive or regulated environments and of translating their requirements into concrete technical controls.
You will need demonstrated success running high-availability, security-critical services, including on-call responsibilities and incident management.
You will need to have strong debugging and problem-solving skills across distributed systems, with the ability to navigate ambiguous issues spanning multiple teams and platforms.
Experience with Okta, Opal, or similar identity/access tooling, a background operating secure admin workstations (SAW) or comparable hardened access patterns, or experience migrating cloud accounts or subscriptions during M&A or large-scale reorganizations would be a bonus.