ABOUT THE ROLE

You would be working in our pre-training team focused on building out our distributed training and inference of Large Language Models (LLMs). This is a hands-on role that focuses on software reliability and fault tolerance. You will work on cross-platform checkpointing, NCCL recovery, and hardware fault detection. You will make high-level tools. You will not be afraid of debugging Linux kernel modules. You will have access to thousands of GPUs to test changes.

Strong engineering skills are a prerequisite. We assume good knowledge of Torch, NVIDIA GPU architecture, reliability concepts, distributed systems, and best coding practices. A basic understanding of LLM training and inference principles is required. We look for fast learners who are prepared for a steep learning curve and are not afraid to step out of their comfort zone.

YOUR MISSION

To help train the best foundational models for source code generation in the world

RESPONSIBILITIES

Identify, study, and troubleshoot hardware problems during training at scale
Minimize the GPU idle time during faults, both operationally and strategically
Design and develop tools and add-ons to accelerate the training recovery
Improve the performance and reliability of checkpointing
Write high-quality Python (PyTorch), Cython, C/C++, CUDA API code

SKILLS & EXPERIENCE

Understanding of Large Language Models (LLM)
Basic knowledge of Transformers
Knowledge of deep learning fundamentals
Strong engineering background
Programming experience
Linux API, Linux kernel
Strong algorithmic skills
Python with numpy, PyTorch, or Jax

Member of Engineering, Inference (Remote)*

About the role

About Poolside

Required skills

Other roles at Poolside

Job details

Company

Funding

Founders

What happens next.

Confirm the fit

I pitch you to the company

A meeting lands on your calendar