Hi, I'm Tyler Romero.

Previously, I was Lead ML Engineer at Groundlight, a startup building multimodal question-answering systems. Prior to that, I worked on large-scale recommender systems at Twitter, where I developed the ML and ran the A/B tests for the experimentally successful yet short-lived downvote button. I also researched, trained, and shipped model architecture improvements for ranking Twitter's home timeline and conversation reply trees. Some of my work at Twitter is now open-source (although the git blame has been sanitized). Earlier in my career, I worked as a Research Scientist at Microsoft, where I built greenfield ML projects.
My academic background includes a Master’s in computer science and machine learning from Stanford and a Bachelor’s in computer engineering from Texas A&M. As an undergraduate, I researched novel implementations of parallel algorithms in C/Cilk and interned as a software engineer at Bloomberg and Microsoft. I made a few contributions to Bloomberg’s Asset and Investment Management function and wrote Microsoft a data-retrieval package for R that is still supported over a decade later.
Posts
-
January 10, 2026
Thinking about Scaling Laws
How can we use scaling laws to train stronger LLMs?
-
March 8, 2025
NanoGPT Speedrun Living Worklog
How fast can I train GPT-2 on two RTX 4090 GPUs? This is a living worklog of my progress.
-
February 6, 2025
Reducing VRAM Footprint in PPO and GRPO Using Selective Log-Softmax
Reduce VRAM usage by half when computing log probabilities by applying log-softmax only to the tokens you actually need. Perfect for many RLHF post-training algorithms (such as PPO and GRPO), where typically only one token's log probability is needed from the entire vocabulary at each sequence position. (A minimal sketch follows at the end of this list.)
-
January 5, 2025
An Extension to BADGE Active Learning for Variable-Sized Batches
We show how BADGE's batch selection strategy can be adapted to handle flexible batch sizes without compromising its ability to select diverse, informative samples, enabling more practical active learning workflows.
-
April 13, 2024
Direct Preference Optimization Explained In-depth
Covering DPO, a recently proposed alternative to RLHF for preference tuning.
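
For a flavor of the selective log-softmax trick from the post above, here is a minimal PyTorch sketch (the naming and shapes are my own, not code from the post): rather than materializing a full log-softmax over the vocabulary, gather each position's chosen logit and subtract a logsumexp over the vocab dimension, so the only vocabulary-sized tensor held in memory is the logits themselves.

```python
import torch


def selective_log_softmax(logits: torch.Tensor, index: torch.Tensor) -> torch.Tensor:
    """Log-probability of the chosen token at each sequence position.

    logits: (batch, seq, vocab) raw model outputs
    index:  (batch, seq) ids of the tokens whose log-probs we want
    """
    # Naive approach: logits.log_softmax(dim=-1).gather(-1, index.unsqueeze(-1))
    # materializes a second (batch, seq, vocab) tensor, roughly doubling peak VRAM.
    # Selective approach: pick out only the needed logits first, then reduce.
    chosen = torch.gather(logits, dim=-1, index=index.unsqueeze(-1)).squeeze(-1)
    # log_softmax(x)[y] = x[y] - logsumexp(x); the reduction output is just (batch, seq).
    return chosen - torch.logsumexp(logits, dim=-1)
```

The savings come from the reduction: the logsumexp output is only (batch, seq), so no second vocabulary-sized intermediate ever exists.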
Publications & Preprints
-
2025
Olmo 3 — Team Olmo. arXiv preprint arXiv:2512.13961, 2025.
We introduce Olmo 3, a family of state-of-the-art, fully open language models at the 7B and 32B parameter scales. Olmo 3 model construction targets long-context reasoning, function calling, coding, instruction following, general chat, and knowledge recall. This release includes the entire model flow, i.e., the full lifecycle of the family of models, including every stage, checkpoint, data point, and dependency used to build it. Our flagship model, Olmo 3 32B Think, is the strongest fully open thinking model released to date. Check out Olmo 3 32B Think on HuggingFace and the pretraining codebase I work on: OLMo-core.
Projects
-
Liger-Kernel
I've been contributing to Liger-Kernel, a collection of custom Triton kernels for efficient LLM training. I've found these kernels very useful for training LLMs/VLMs on my RTX 4090. My contributions, as well as those of other top collaborators, were recently featured in a post on the LinkedIn Engineering Blog.
-
microR1
A micro-scale DeepSeek-R1 reproduction in the style of Karpathy's nanoGPT. Intended to be easy to understand and to hack on top of.
Favorite Reads
-
The Ultra-Scale Playbook: Training LLMs on GPU Clusters
A detailed guide to large-scale training of LLMs, covering 1D through 5D training parallelism, GPU kernel fusion and threading, and more.
-
How to Scale Your Model: A Systems View of LLMs on TPUs
An online book that explains how TPU and GPU hardware works and how the Transformer architecture has evolved to perform well on current hardware.
-
Making Deep Learning Go Brrrr From First Principles
A great post by Horace He explaining how to speed up single-GPU training based on whether jobs are compute-, bandwidth-, or overhead-bound. See also What Shapes Do Matrix Multiplications Like?
-
siboehm
An excellent ML engineering blog by Simon Boehm, and a large part of the inspiration for this site. I especially recommend Simon’s posts on optimizing multidimensional matrix multiplication on CPU and pipeline parallelism for distributed training.
-
Simon Willison’s Weblog
An insightful collection of links, quotes, and short blog posts that helps navigate the firehose of ML news.
-
Interconnects
A Substack with long-form technical posts about AI R&D by Nathan Lambert.
-
The 37 Implementation Details of Proximal Policy Optimization
A legendary ICLR blog post diving into the often-unreported or underreported implementation details of PPO. Necessary reading for anyone working on LLM post-training with PPO or GRPO.
-
Learning CUDA by optimizing softmax: A worklog
A nice post by Maharshi Pandya on optimizing a softmax CUDA kernel.
Fun Stuff
-
Recipe Box
I find great joy in the process of cooking, and I like to keep a recipe box of my favorite dishes.
-
How this startup used AI to keep raccoons from invading my house
My friends and I help a Seattle tech reporter keep some curious raccoons out of his living room. Related: Found a raccoon in the living room — now seeking a tech solution so it doesn’t happen again