About the Role

We’re looking for an experienced Site Reliability Engineer (SRE) to help us scale our platform with reliability, observability, and operational excellence at the core. You’ll partner with engineers and data scientists to build, automate, and maintain the infrastructure that powers our core platform—including data pipelines, ML workloads, and real-time analytics systems.

This is a hands-on, high-impact role with visibility across the stack and the opportunity to shape the future of our infrastructure and operations.

Key Responsibilities

Design, build, and maintain scalable infrastructure to support real-time analytics and machine learning workloads
Improve system reliability and performance through automation, observability, and proactive capacity planning
Own and evolve CI/CD pipelines, deployment automation, rollback mechanisms, and config management
Implement and maintain monitoring, alerting, and incident response processes (SLOs, runbooks, on-call rotations)
Collaborate across engineering and data science teams to drive a culture of performance and reliability
Ensure security, compliance, and operational readiness across our cloud infrastructure
Drive post-incident analysis and continuous improvement initiatives

Must-Have Qualifications

8+ years of experience in SRE, DevOps, or infrastructure engineering roles
5+ years of experience with datacenter operations and/or system and network administration
Experience with containerization (Docker), and orchestration (Kubernetes)
Strong knowledge of Linux systems, networking, and systems performance tuning
Solid understanding of infrastructure-as-code (e.g., Terraform, Ansible)

Alembic is solving one of AI's defining challenges: proving cause and effect in proprietary enterprise data at scale. While generic models converge in capability, we're building the systems that process private corporate data to answer questions ChatGPT cannot—connecting marketing exposure to revenue, operations to outcomes, strategy to results.

Our customers include Nvidia, Delta Air Lines, Mars, and leading Fortune 500 companies making multi-million dollar decisions with our platform. The technical demands were so extreme we literally melted GPUs building this, which is why we're now operating one of the world's fastest private supercomputers. Backed by $145M from Prysm Capital, Accenture, WndrCo, Silver Lake Waterman, and others, we're scaling the team redefining enterprise competitive advantage in the age of AI.

Senior Site Reliability Engineer

About the role

About Alembic

Other roles at Alembic

Job details

Company

Funding

Founders

What happens next.

Confirm the fit

I pitch you to the company

A meeting lands on your calendar