At The San Francisco Tensor Company, we believe the future of AI and high-performance computing depends on rethinking the entire software and infrastructure stack. Today's developers face bottlenecks across hardware, cloud, and code optimization that slow progress before ideas can reach their full potential. Our mission is to remove those barriers and make compute faster, cheaper, and universally portable.
We are building three things: a Kernel Optimizer that automatically transforms code into its most efficient form, Tensor Cloud for adaptive, cross-cloud compute, and Emma Lang, a new programming language for high-performance, hardware-aware computation. Together, these technologies reinvent the foundations of AI and HPC.
SF Tensor is proudly backed by Susa Ventures and Y Combinator, along with a group of angels including Max Mullen and Paul Graham and founders and executives of Neuralink, Notion, and AMD. We are partnering with researchers, engineers, and organizations who share our belief that the next breakthroughs in AI require breakthroughs in compute.
We're looking for a Founding GPU Kernel Engineer who lives at the boundary between hardware and software: someone who thinks in warps, occupancy, and memory hierarchies, and who can squeeze every last FLOP out of a GPU.
Your job is to go deeper than anyone else. You'll hand-tune kernels to figure out what's actually possible on the hardware, and then turn that knowledge into compiler optimization passes that help every model we compile.
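For a flavor of that starting point, here is a minimal shared-memory tiled SGEMM kernel, purely illustrative and not SF Tensor's code: the kind of baseline you would write before layering on tensor cores, vectorized loads, swizzled layouts, and software pipelining, and before measuring it against cuBLAS. The kernel name, tile size, and the assumption that N is a multiple of the tile size are placeholders chosen for brevity.

```cuda
// Illustrative baseline only: shared-memory tiled SGEMM, C = A * B,
// with square row-major matrices and N assumed divisible by TILE.
#include <cuda_runtime.h>

#define TILE 32

__global__ void sgemm_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Stage one tile of A and one tile of B in shared memory to cut global-memory traffic.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Accumulate the partial dot product contributed by this tile.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * N + col] = acc;
}

// Launch with dim3 block(TILE, TILE) and dim3 grid(N / TILE, N / TILE).
// Profiling this against cuBLAS is where the real work in this role begins.
```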
Write and hand-optimize GPU kernels for ML workloads (matmuls, attention, normalization, etc.) to establish performance ceilings
Profile at the microarchitectural level: look into SM utilization, warp stalls, memory bank conflicts, register pressure, instruction throughput
Debug performance issues by digging deep into things like clock speeds, thermal throttling, driver behavior, hardware errata
Turn your hand-optimization insights into automated compiler passes (working closely with our compiler team)
Develop performance models that predict how kernels will behave across different GPU architectures
Build tools and methods for systematic kernel optimization
Work with NVIDIA, AMD, and emerging AI accelerators, and map out what is common across vendors and what is vendor-specific
Deep expertise in GPU architecture
Proven track record of hand-writing kernels that match or beat vendor libraries (cuBLAS, cuDNN, CUTLASS)
Strong skills with low-level profiling tools: Nsight Compute, Nsight Systems, rocprof, or equivalents
Experience reading and reasoning about PTX/SASS or GPU assembly
Solid systems programming in C++ and CUDA (or ROCm/HIP)
Good understanding of how high-level ML operations map to hardware execution
Experience with distributed training systems: collective ops like all-reduce and all-gather, NCCL/RCCL, multi-node communication patterns
HPC background: experience with large-scale scientific computing, MPI, or work in supercomputing
Background in electrical engineering, computer architecture, or hardware design
Driver development experience (NVIDIA, AMD, or other accelerators)
Experience with MLIR, LLVM, or compiler backends
Deep knowledge of distributed ML training: gradient accumulation, activation checkpointing, pipeline/tensor parallelism, ZeRO-style optimizations
Familiarity with custom accelerators: TPUs, Trainium, Inferentia, or similar
Knowledge of high-speed interconnects: NVLink, NVSwitch, InfiniBand, RoCE
Publications or contributions in GPU optimization, HPC, or ML systems
Experience at NVIDIA, AMD, a national lab, or an AI hardware/infrastructure company
This role is for someone who wants to know why things are fast or slow on the hardware. You'll have a direct impact on the performance of large-scale AI training, tackling problems that need real depth. If you've ever been annoyed that your hard-won optimization knowledge is stuck in your head and not baked into a compiler, here's your shot to change that.
We believe in the power of in-person collaboration to solve the hardest problems and foster a strong team culture. We offer relocation assistance and look forward to you joining us in our San Francisco office.
The base salary range for this full-time position is $285,000 - $315,000 + bonus + equity + benefits.
AI researchers should be pushing the boundaries of what's possible with new architectures and training methods. Instead, they waste weeks configuring cloud infrastructure, debugging distributed systems, and optimizing their GPU code. We know because we lived it: while training our own models across thousands of GPUs earlier this year, we spent more time fighting our infrastructure than doing actual research.
That's why we're building two things. First, Elastic Cloud: a managed platform that automatically finds the cheapest GPUs across all providers, handles spot instance preemption, and cuts compute costs by up to 80%. Second, automatic kernel optimization that makes training code run faster by modeling hardware topology, often beating hand-tuned implementations.
The problem is that getting high performance across different hardware is genuinely hard. NVIDIA's CUDA moat exists because writing fast kernels requires deep expertise. Most teams either accept vendor lock-in or hire expensive kernel engineers. Our goal is to break the CUDA moat.
The compute bottleneck is the biggest constraint on AI progress. NVIDIA can't manufacture enough GPUs, and their monopoly keeps prices astronomical. Meanwhile, AMD, Google, and Amazon are shipping capable alternative hardware that nobody uses because the software is too hard. We're breaking that moat. If we succeed, anyone will be able to train state-of-the-art models without thinking past their PyTorch code.
Salary: $285,000 - $315,000
Equity: 1.25% - 2%
Location: San Francisco
Last stage: Seed
Investors: Susa Ventures, Y Combinator