We run one of the largest self-managed ClickHouse installations on AWS, already at petabyte scale, and we're actively preparing it for the next 10–50× of growth. This role sits at the centre of that effort.
You won't be in a typical "keep the lights on" SRE role. The work is about turning a fast-growing, stateful system into a predictable, well-automated platform — provisioning, scaling, rebalancing, recovery. That means reducing operational stress, designing safe automation for data-heavy workloads, and building the tooling and patterns that let the system scale without scaling human effort.
You'll work on the kind of problems that only show up at large scale (petabytes of data, thousands of cores, constant ingestion).
What you'll do:
• Managing large fleets of EC2-based VMs, disks, and networking for data-intensive workloads
• Improving operational tooling around deploys, schema changes, backups, restores, and incident response
• Working closely with ClickHouse engineers to turn database-level needs into infra-level solutions
• Reducing operational load by identifying repeat pain points and eliminating them through code and self-healing automation
• Participating in on-call and incident response, with a strong focus on making incidents rarer over time
• You'll have room to design and automate, not just respond to alerts
Requirements:
• Strong experience operating production infrastructure on AWS
• Hands-on experience with VM-based systems (EC2), not just managed PaaS
• Experience automating infrastructure using tools like Terraform, Ansible, or similar
• Solid understanding of Linux systems (disk, memory, networking, failure modes)
• Experience supporting stateful systems (databases, queues, storage systems, etc.)
• Prior experience with ClickHouse or other analytical databases
• Ability to debug and reason about performance and reliability issues in production
• Comfortable owning systems end-to-end, including on-call responsibilities
Team mission: Build and maintain a scalable, cost-efficient storage and query engine that meets both current and future product needs — including optimizing ClickHouse, supporting multiple query types with tunable performance, and ensuring data is stored once, durably, and efficiently accessible across tools.
PostHog is an open-source, all-in-one developer platform founded in 2020, designed to help engineering and product teams build successful products. The platform has expanded from a single product analytics tool into a suite of 14+ products including session replay, feature flags, A/B testing, error tracking, LLM observability, a built-in data warehouse, a CDP, and an AI assistant called Max AI. PostHog differentiates on open-source transparency, privacy-friendly self-hosting options, generous free tiers, and a developer-centric culture with no outbound sales. The company achieved unicorn status in September 2025 at a $1.4B valuation.
Salary: $100,000 - $250,000
Location: Remote
Experience: 5+ years
Total raised: $182.0M
Last stage: Series E
Cory Watilo
Lead Designer & PostHog.com Webmaster