About the role

About LILT

AI is changing how the world communicates, and LILT is leading that transformation. We're on a mission to make the world's information accessible to everyone, regardless of the language they speak. We use cutting-edge AI, machine translation, and human-in-the-loop expertise to translate content faster, more accurately, and more cost-effectively without compromising on brand, voice, or quality. At LILT, we empower our teammates with leading tools, global collaboration, and growth opportunities to do their best work. Our company virtues (Work together, win together; Find a way or make one; Quicker than they expect; Quality is Job 1) guide everything we do. We are trusted by Intel Corporation, Canva, the United States Department of Defense, the United States Air Force, ASICS, and hundreds of global enterprises. Backed by Sequoia, Intel Capital, and Redpoint, we’re building a category-defining company in a $50B+ global translation market being redefined by AI.

About the Role

As a Research Engineer focused on Model Evaluation, you are the final arbiter of technical quality for our frontier AI deliverables. You will design sophisticated evaluation suites and serve as the lead calibrator, reviewing and refining the contributions of other engineers to ensure our data samples and model outputs meet the exacting standards of the world’s leading AI labs. This is a highly technical role for someone who enjoys getting in the weeds of model behavior, RAG performance, and RLHF alignment.

Key Responsibilities

Eval Architecture & Benchmarking: Design and implement automated and human-in-the-loop evaluation frameworks to measure model performance across multiple modalities (text, code, image, etc.).

Calibration & Peer Review: Act as the Gold Standard reviewer for other engineers. You will calibrate their data generation and evaluation contributions, providing technical feedback to ensure scientific consistency and high-fidelity output.

Frontier Sample Generation: Write and refine complex prompts and golden response pairs for frontier-model training, specifically focusing on edge cases in reasoning and multilingual contexts.

Quality Control (End-to-End): Develop the logic for multi-modal QC checks, ensuring that high-volume data samples are correct across diverse domains and languages.

Technical Mentorship: Bring new knowledge and best practices to our established delivery and forward-deployed engineering teams on model evaluations.

Research Engineer, Evaluations, Applied AI
