LiteLLM is an open-source AI gateway (36K+ GitHub stars) that routes hundreds of millions of LLM API calls daily for companies like NASA, Adobe, Netflix, Stripe, and Nvidia. We're at $7M ARR, 10 people, YC W23.
When LiteLLM goes down, our customers' entire AI stack goes down. We need someone who makes sure that doesn't happen.
You'd be the first dedicated reliability hire. You'll own reliability, performance, and production stability end-to-end. Nobody will tell you how to do it.
We'll be straight with you: this role is roughly 60% operational reliability and 40% deep performance engineering. On any given week you might be:
- Chasing down why `update_database()` does 7 deep copies per request in the spend-tracking hot path

If you're looking for a pure optimization role where you sit in a profiler all day, this isn't it. If you want to own production health for one of the most widely deployed AI infrastructure projects in the world, keep reading.
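To make the hot-path point concrete, here is an illustrative micro-benchmark. The payload and both helper functions are hypothetical, not LiteLLM's actual data model; the point is that a defensive `copy.deepcopy` per request costs far more than copying only the structure being mutated.

```python
# Hypothetical spend-tracking payload (illustrative only, not LiteLLM's schema).
import copy
import time

spend_entry = {
    "user": "u-123",
    "model": "gpt-4o",
    "usage": {"prompt_tokens": 512, "completion_tokens": 128},
    "metadata": {"tags": ["prod", "batch"], "team": "ml-platform"},
}

def update_deepcopy(entry):
    # Defensive deep copy before mutation: safe, but walks the whole payload.
    snapshot = copy.deepcopy(entry)
    snapshot["usage"]["prompt_tokens"] += 1
    return snapshot

def update_targeted(entry):
    # Copy only the nested dict actually being mutated.
    usage = dict(entry["usage"])
    usage["prompt_tokens"] += 1
    return {**entry, "usage": usage}

for fn in (update_deepcopy, update_targeted):
    start = time.perf_counter()
    for _ in range(50_000):
        fn(spend_entry)
    print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s")
```

At a few hundred million requests a day, the difference between these two strategies is the difference between a quiet pager and a capacity problem.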
We route traffic for some of the largest AI deployments on the planet. One customer is scaling from 20M to 200M daily AI calls through our gateway. Another has 150K users hitting us daily. When we ship a bad release, it doesn't just break a dashboard — it breaks production AI systems at companies you've heard of.
The problems here are genuinely hard, and you won't run out of interesting ones.
You'll own three areas:
- Production reliability
- Performance engineering
- Observability & release safety
Must have:
Strong signals:
LiteLLM provides an open-source Python SDK and a Python FastAPI server that let you call 100+ LLM APIs (Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic) in the OpenAI format.
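A minimal sketch of what "the OpenAI format" means in practice, assuming LiteLLM's documented `completion` entry point: the request shape stays the same and only the `model` string selects the provider. The `LITELLM_DEMO` environment-variable guard is our own addition for this sketch, not part of the library.

```python
import os

# OpenAI-format request body; the same shape is reused across providers.
messages = [{"role": "user", "content": "Summarize this deploy log."}]

if os.environ.get("LITELLM_DEMO") == "1":  # demo guard (ours), needs API keys
    from litellm import completion

    # Swapping providers means swapping the model string, nothing else:
    resp = completion(model="gpt-4o-mini", messages=messages)
    print(resp.choices[0].message.content)
```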
We have raised $1.6M in seed funding from top investors (Y Combinator, Gravity Fund, and Pioneer Fund), generate $10M+ in ARR with exponential growth, and are meaningfully profitable.
You can find more information on our website, GitHub, and technical documentation.
Salary: $200,000 - $270,000
Equity: 0.25% - 0.75%
Location: San Francisco, CA, US
Experience: 3+ years