LiteLLM is an open-source AI gateway (36K+ GitHub stars) that routes hundreds of millions of LLM API calls daily for companies like NASA, Adobe, Netflix, Stripe, and Nvidia. We're at $7M ARR, 10 people, YC W23.
When LiteLLM goes down, our customers' entire AI stack goes down. We need someone who makes sure that doesn't happen.
You'd be the first dedicated reliability hire. You'll own reliability, performance, and production stability end-to-end. Nobody will tell you how to do it.
We'll be straight with you: this role is roughly 60% operational reliability and 40% deep performance engineering. On any given week you might be:
- Chasing down why `update_database()` does 7 deep copies per request in the spend-tracking hot path

If you're looking for a pure optimization role where you sit in a profiler all day, this isn't it. If you want to own production health for one of the most widely deployed AI infrastructure projects in the world, keep reading.
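To make the hot-path point concrete, here is an illustrative micro-benchmark. The payload and both helper functions are hypothetical, not LiteLLM's actual data model; the point is that a defensive `copy.deepcopy` per request costs far more than copying only the structure being mutated.

```python
# Hypothetical spend-tracking payload (illustrative only, not LiteLLM's schema).
import copy
import time

spend_entry = {
    "user": "u-123",
    "model": "gpt-4o",
    "usage": {"prompt_tokens": 512, "completion_tokens": 128},
    "metadata": {"tags": ["prod", "batch"], "team": "ml-platform"},
}

def update_deepcopy(entry):
    # Defensive deep copy before mutation: safe, but walks the whole payload.
    snapshot = copy.deepcopy(entry)
    snapshot["usage"]["prompt_tokens"] += 1
    return snapshot

def update_targeted(entry):
    # Copy only the nested dict actually being mutated.
    usage = dict(entry["usage"])
    usage["prompt_tokens"] += 1
    return {**entry, "usage": usage}

for fn in (update_deepcopy, update_targeted):
    start = time.perf_counter()
    for _ in range(50_000):
        fn(spend_entry)
    print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s")
```

At a few hundred million requests a day, the difference between these two strategies is the difference between a quiet pager and a capacity problem.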
We route traffic for some of the largest AI deployments on the planet. One customer is scaling from 20M to 200M daily AI calls through our gateway. Another has 150K users hitting us daily. When we ship a bad release, it doesn't just break a dashboard — it breaks production AI systems at companies you've heard of.
The problems here are genuinely hard, and you won't run out of interesting ones.
You'll own three areas:
- Production reliability
- Performance engineering
- Observability & release safety
Must have:
Strong signals:
LiteLLM provides an open-source Python SDK and a Python FastAPI server that let you call 100+ LLM APIs (Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic) in the OpenAI format.
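A minimal sketch of what "the OpenAI format" means in practice, assuming LiteLLM's documented `completion` entry point: the request shape stays the same and only the `model` string selects the provider. The `LITELLM_DEMO` environment-variable guard is our own addition for this sketch, not part of the library.

```python
import os

# OpenAI-format request body; the same shape is reused across providers.
messages = [{"role": "user", "content": "Summarize this deploy log."}]

if os.environ.get("LITELLM_DEMO") == "1":  # demo guard (ours), needs API keys
    from litellm import completion

    # Swapping providers means swapping the model string, nothing else:
    resp = completion(model="gpt-4o-mini", messages=messages)
    print(resp.choices[0].message.content)
```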
We have raised $1.6M in seed funding from top investors (Y Combinator, Gravity Fund, and Pioneer Fund), generate $10M+ in ARR with exponential growth, and are meaningfully profitable.
You can find more information on our website, GitHub, and technical documentation.
Salary: $200,000 - $270,000
Equity: 0.25% - 0.75%
Location: San Francisco, CA, US
Experience: 3+ years