Hi, I'm Tyler Romero.
In the recent past, I worked on large-scale recommender systems at Twitter, where I built all of the ML and ran the A/B tests behind the experimentally successful yet sadly short-lived downvote button. I also researched, trained, and shipped model architecture improvements for ranking Twitter’s home timeline and conversation reply trees. Some of my work at Twitter is now open source, although the git blame has been sanitized. Before Twitter, I worked as a Research Scientist at Microsoft, building out greenfield ML projects.
My academic background includes a Master’s in computer science and machine learning from Stanford and a Bachelor’s in computer engineering from Texas A&M. As an undergraduate, I researched novel implementations of parallel algorithms in C/Cilk and interned as a Software Engineer at Bloomberg and Microsoft. I made a few contributions to Bloomberg’s Asset and Investment Management function and, at Microsoft, wrote a data retrieval package for R that is still supported 8 years later.
Posts
- NanoGPT Speedrun Living Worklog
  February 18, 2025 - How fast can I train GPT-2 on two RTX 4090 GPUs? This is a living worklog of my progress.
- Reducing VRAM Footprint in PPO and GRPO Using Selective Log-Softmax
  February 6, 2025 - Reduce VRAM usage by half when computing log probabilities by selectively applying log-softmax to only the necessary tokens. Perfect for many RLHF post-training algorithms (such as PPO and GRPO), where typically only one token's log probability is needed from the entire vocabulary at each sequence position. A minimal sketch of the core idea appears after this list.
- An Extension to BADGE Active Learning for Variable-Sized Batches
  January 5, 2025 - We show how BADGE's batch selection strategy can be adapted to handle flexible batch sizes without compromising its ability to select diverse, informative samples, enabling more practical active learning workflows.
- Direct Preference Optimization Explained In-depth
  April 13, 2024 - Covering DPO, a recently-proposed alternative to RLHF for preference tuning.
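
For the selective log-softmax post above, here is a minimal PyTorch sketch of the core idea (function and variable names are illustrative, not the post's actual code): rather than materializing a full log-softmax over the vocabulary and then gathering, gather the selected logits and subtract a logsumexp over the vocab dimension.

```python
import torch

def selective_log_softmax(logits: torch.Tensor, index: torch.Tensor) -> torch.Tensor:
    """Log-probabilities of only the selected tokens.

    logits: (batch, seq_len, vocab_size) raw model outputs
    index:  (batch, seq_len) token ids whose log-probabilities are needed
    """
    # Gather the logit of each selected token: shape (batch, seq_len).
    token_logits = torch.gather(logits, dim=-1, index=index.unsqueeze(-1)).squeeze(-1)
    # log p(token) = logit[token] - logsumexp(logits). logsumexp reduces over the vocab
    # dimension directly, so no (batch, seq_len, vocab_size) log-softmax tensor is created.
    return token_logits - torch.logsumexp(logits, dim=-1)
```

Compared to `torch.log_softmax(logits, dim=-1)` followed by a gather, this avoids allocating a second logits-sized tensor, which is where the memory savings come from.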
Projects
- Liger-Kernel
  Recently I've been contributing to Liger-Kernel, a collection of custom Triton kernels for efficient LLM training. I've found these kernels very useful for training LLMs/VLMs on my RTX 4090 (see the short usage sketch after this list). My contributions, as well as those of other top collaborators, were recently featured in a post on the LinkedIn Engineering Blog.
- microR1
  A micro-scale DeepSeek-R1 reproduction in the style of Karpathy's nanoGPT. Intended to be easy to understand and to hack on top of.
- seahorse
  I've also been building seahorse, a small vision-language model meant for research. It's still in its early stages, but it's extensible and designed to train quickly on a single RTX 4090.
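
As a quick illustration of how I use Liger-Kernel, here is a hedged sketch based on my memory of the project's README (the model name is just an example, and the exact API may have changed between releases):

```python
from liger_kernel.transformers import AutoLigerKernelForCausalLM

# Drop-in replacement for AutoModelForCausalLM: supported architectures are patched
# with Liger's fused Triton kernels (e.g. RMSNorm, RoPE, SwiGLU, cross-entropy)
# when the model is loaded. Model name below is only an example.
model = AutoLigerKernelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
```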
Favorite Reads
- siboehm
  An excellent ML engineering blog by Simon Boehm, and a large part of the inspiration for this site. I especially recommend Simon’s posts on optimizing multidimensional matrix multiplication on CPU and pipeline parallelism for distributed training.
- The Ultra-Scale Playbook: Training LLMs on GPU Clusters
  A detailed guide to large-scale training of LLMs, covering 1D through 5D training parallelism, GPU kernel fusion and threading, and more.
- How to Scale Your Model: A Systems View of LLMs on TPUs
  An online book that explains how TPU and GPU hardware works and how the Transformer architecture has evolved to perform well on current hardware.
- Google Research's Tuning Playbook
  A collection of valuable advice and practical guidelines for training deep learning models.
- The 37 Implementation Details of Proximal Policy Optimization
  A legendary ICLR blog post diving into the (unreported/underreported) implementation details of PPO. Necessary reading for anyone working on LLM post-training with PPO or GRPO.
- Learning CUDA by optimizing softmax: A worklog
  A nice post by Maharshi Pandya on optimizing a softmax CUDA kernel.
- Michael Nielsen's Principles of Effective Research
  A concise and thoughtful guide on cultivating habits, vision, and discipline to maximize research impact and personal growth.
Fun Stuff
- Recipe Box
  I find great joy in the process of cooking, and I like to keep a recipe box of my favorite dishes.
- How this startup used AI to keep raccoons from invading my house
  My friends and I help a Seattle tech reporter keep some curious raccoons out of his living room. Related: Found a raccoon in the living room — now seeking a tech solution so it doesn’t happen again
Website
This website is made with 11ty, Tufte CSS, and eleventufte. Custom figures are made with Excalidraw. The combination of Tufte CSS and Excalidraw to achieve a notebook-like appearance was borrowed from Simon Boehm's website, because having a visually appealing site helps motivate me to write.