The intern will design, evaluate and bring up to the state of the art Windmill's internal agentic loop for generating scripts, flows and full-stack apps, and will build the benchmarking system that measures its progress. The work tackles several open questions: how to objectively evaluate a generated workflow or app beyond "it compiles" (functional tests, end-to-end execution, UX quality, semantic correctness); how an agent should decompose a natural-language specification into coherent atomic steps; how to efficiently inject Windmill-specific context (hub, types, resource schemas) without saturating the context window; how to exploit execution feedback for self-correction; how to keep a dependency graph of scripts, flows and apps coherent across iterative multi-file edits; and how to detect hallucinations, silent regressions and "fake successes" where tests pass for the wrong reasons.
Expected deliverables: the Windmill benchmark (corpus, harness, tracking dashboard); an improved agentic loop shipped to production with documented progression metrics; a weekly lab notebook; the final thesis report; and possibly a publication or open-source release. The intern works directly with Ruben Fiszel (co-founder & CEO) and the Windmill R&D / AI team, with daily interaction, weekly reviews and full access to the codebase, to anonymized usage data, to frontier-model API budgets and to GPU infrastructure for fine-tuning experiments.
Code-generation agents:
Reference benchmarks:
Limitations of these benchmarks for our use case: none covers workflow generation (step composition, branching, parallelism, state management); none tests generation of full-stack apps with interactive UI; none integrates the specifics of Windmill (type system, resources, variables, hub, multi-language runtime).
Scientific and technical challenges:
Phase 1 - Mapping & state of the art (weeks 1–3): audit of Windmill's current agentic loop (architecture, prompts, tool-use); systematic review of existing literature and benchmarks; selection / reproduction of 2–3 reference baselines.
Phase 2 - Benchmark (weeks 3–8): design of the evaluation task corpus (isolated scripts, multi-step flows, full-stack apps); design of the evaluation harness (sandboxed execution, multi-criteria scoring); setup of continuous regression tracking; possible open-source release of the benchmark.
Phase 3 - Improvement of the agentic loop (weeks 8–20): iterative experimentation on prompts, planning strategies, tool design, retrieval and execution feedback; comparison of frontier models with open-weights models; targeted exploration of supervised fine-tuning and RL approaches; progressive production deployment.
Phase 4 - Consolidation & deliverables (weeks 20–24): writing of the thesis / final-year report; internal technical documentation; possible paper submission.
M2 / final-year student in computer science or applied mathematics. Solid programming foundations (Python, TypeScript; Rust is a plus), strong interest in LLMs, agents and evaluation methodology, and an empirical, rigorous approach.
Required skills: proficiency in Python and TypeScript; concrete understanding of how LLMs work (tokenization, context window, prompting, tool use, function calling); hands-on experience with at least one agentic assistant; design of controlled experiments and reproducible metrics; Git, testing, code review, CI; fluent English.
Nice-to-have: Rust; Svelte / modern frontend; fine-tuning & RL experience (SFT, DPO, RLHF, RLAIF); agent/benchmark evaluation experience; prior publication or significant open-source contribution; Docker, PostgreSQL, sandboxing, observability.
Education: Master’s student (M2) or final-year student (PFE) in computer science or applied mathematics: MPRI, École Polytechnique (X), École Normale Supérieure (ENS Ulm / Paris-Saclay / Lyon), Télécom Paris, CentraleSupélec, Mines, ENSIMAG, EPITA, 42, EPFL, or equivalent.
Windmill is an open-source developer platform that turns scripts into workflows, internal tools, and full-stack apps.
Write scripts in Python, TypeScript, Go, Bash, Rust or SQL. Windmill auto-generates UIs from their parameters and handles dependencies, credentials, permissions, and scheduling, so you focus on business logic, not infra.
Open-source alternative to Airflow, Temporal, Retool and n8n. Chain scripts into flows with branching, parallelism, and retries. Build dashboards with the app builder. Trigger via cron, webhook, or UI. All-in-one runtime, editor, secret manager, and OAuth platform, enterprise-ready out of the box.
Stack: Rust / TypeScript + Svelte / PostgreSQL. Self-hostable, easy to deploy, built for performance and DX.
Location
Paris, Île-de-France, FR
Total raised
$500K
Investors
Ruben Fiszel