Colin Raffel

I'm an associate professor at the University of Toronto and an associate research director at the Vector Institute. My lab does research in the field of machine learning, which is in an era of scale — larger models trained on larger datasets are producing major advances across many applications. This scale comes at significant cost, which in turn prevents most researchers from participating in the development of state-of-the-art models. My lab therefore works on the following problems:

Enabling decentralized collaborative development of models, including modular architectures, cheaply-communicable updates, and merging methods
Developing more efficient training recipes
Identifying and mitigating risks associated with large-scale models

Interested in joining our lab? Great! Please fill out this form (there's no need to contact me separately).

Group members

Marco Ciccone, Postdoc at the Vector Institute
Malikeh Ehghaghi, PhD student at the University of Toronto
Gyung Hyun Je, PhD student at the University of Toronto
Gül Sena Altintaş, PhD student at the University of Toronto
Brian Lester, PhD student at the University of Toronto
Haokun Liu, PhD student at the University of Toronto
Nikhil Kandpal, PhD student at the University of Toronto
Derek Tam, PhD student at the University of Toronto
Michael Matena, PhD student at UNC
Fengyuan Liu, Undergraduate at the University of Toronto
Yu Xin Li, Undergraduate at the University of Toronto

Recent publications

(full list)

AttriBoT: A Bag of Tricks for Efficiently Approximating Leave-One-Out Context Attribution
Fengyuan Liu, Nikhil Kandpal, and Colin Raffel
13th International Conference on Learning Representations, 2025.

ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization
Prateek Yadav, Leshem Choshen, Colin Raffel and Mohit Bansal
Transactions on Machine Learning Research (TMLR), 2025.

A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning
Prateek Yadav, Colin Raffel, Mohammed Muqeeth, Lucas Caccia, Haokun Liu, Tianlong Chen, Mohit Bansal, Leshem Choshen, and Alessandro Sordoni
Transactions on Machine Learning Research (TMLR), 2025.

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, and 18 others including Colin Raffel
arXiv preprint arXiv:2502.02737, 2025.

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf
Neural Information Processing Systems 38 (NeurIPS), 2024.

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, and 386 others including Colin Raffel
Journal of Machine Learning Research (JMLR), 2024.

Realistic Evaluation of Model Merging for Compositional Generalization
Derek Tam, Yash Kant, Brian Lester, Igor Gilitschenski, and Colin Raffel
arXiv preprint arXiv:2409.18314, 2024.

DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows
Ajay Patel, Colin Raffel, and Chris Callison-Burch
62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024.

Learning to Route Among Specialized Experts for Zero-Shot Generalization
Mohammed Muqeeth, Haokun Liu, Yufan Liu, and Colin Raffel
41st International Conference on Machine Learning (ICML), 2024.

Soft Merging of Experts with Adaptive Routing
Mohammed Muqeeth, Haokun Liu, and Colin Raffel
Transactions on Machine Learning Research (TMLR), 2024.
Featured Certification

Talks

Merging and MoErging for Compositional Generalization at NeurIPS 2024 Workshop on Compositional Learning and ETH Zurich SPCL_Bcast(), 2024.

The Most Expensive Part of an LLM Should Be Its Training Data at Toronto LLM x Law Hackathon, 2024.

Progress on a Permissively Licensed Text Dataset at Vector ML Security and Privacy Workshop and University of Pennsylvania CLunch, 2024.

Build an Ecosystem, Not a Monolith at Simons Institute Workshop on Large Language Models and Transformers, Google Responsible Machine Learning Reading Group, University of Edinburgh ILCC Seminar, Stanford NLP Seminar, UCSD AI Seminar, Yale CPSC 488/588 Lecture, University of Toronto CL Colloquium, and Open AGI Summit@EthCC, 2023.

Collaborative, Communal, & Continual Machine Learning at Faculty Job Talk, 2023.

Building Better Language Models: Insights from BigScience at Stanford Center for Research on Foundation Models, 2022.

Weird Things About Professorship at EMNLP Share Stories and Lessons Learned Workshop, 2022.

Building Better Language Models at Johns Hopkins University CSCI 601.771 Lecture, Mosaic.ml, and Vector Institute Research Symposium, 2022.

Infrastructure and Progress Towards the First Community-Built and Continually-Improved Model at Microsoft Research Efficient Large-Scale AI Workshop, 2022.

Building Machine Learning Models Like Open-Source Software at Microsoft Research Summit, World Artificial Intelligence Conference, Technische Universität Darmstadt, UT Austin Forum for Artificial Intelligence, Korea AI Summit, Stanford CS324 Lecture, Stanford MLSys Seminar Series, and MLSys Symposium on Decentralized and Collaborative Learning, 2022.

How to Be an Academic Machine Learning Researcher in the Era of Scale at CIFAR Deep Learning and Reinforcement Learning Summer School, 2022.

Less Data, More ___? Data Augmentation and Semi-Supervised Learning for Natural Language Processing at 60th Annual Meeting of the Association for Computational Linguistics Tutorials, 2022.

A call to build models like we build open-source software at Cornell University Artificial Intelligence Seminar, Georgia Tech NLP Seminar, UMass Amherst Machine Learning & Friends Lunch, and UC Santa Barbara NLP Seminar, 2021.

A few possibly controversial opinions about large language models at Carnegie Mellon University Language Technologies Topical Seminar, 2021.

The Sweet Lesson at SustaiNLP Workshop, 2021.

What do language models learn from language modeling? at Stanford University CS 330 Lecture and Advanced Language Processing Winter School, 2021.

How and why should(n't) we scale machine learning? at IBM AI Hardware Forum Keynote, 2021.

A better way to get language models to do what you ask at AKBC 2021 Unstructured and Structured Knowledge Bases Workshop and Cohere.ai, 2021.

Scaling up Models and Data at CIFAR Deep Learning and Reinforcement Learning Summer School, Nepal Winter School in AI, and Advanced Language Processing Winter School, 2021.

Explicit and Implicit Entropy Minimization in Proxy-Label-Based Semi-Supervised Learning at CVPR Workshop on Learning with Limited and Imperfect Data, 2021.

The benefits of unified frameworks for language understanding at Conceptual Understanding of Deep Learning Workshop, 2021.

T5 and large language models: The good, the bad, and the ugly at Stanford University CS 224n Lecture, CU Boulder Applied Mathematics Colloquium, Twitter Machine Learning Seminar, Google Graduate Symposium & TTIC NLP Seminar, 2020.

Responsible publication: NLP case study at Navigating the Broader Impacts of AI Research Workshop Panel, 2020.

What Can MIR Learn From Transfer Learning in NLP? at NLP for Music and Audio Workshop Keynote, 2020.

Transfer Learning for NLP: T5 and Beyond at Montreal Institute for Learning Algorithms Tea Talk & Spotify Research Seminar, 2020.

Answering Questions by Querying the Implicit Knowledge Base Inside T5 at AKBC 2020 Unstructured and Structured Knowledge Bases Workshop, 2020.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer at Allen Institute for Artificial Intelligence & New York University CILVR Seminar, 2019.

Outskirts of Deep Generative Modeling at Faculty Job Talk, 2019.

Why are GANs Interesting? at New York University CILVR Seminar, 2018.

A Few Unusual Autoencoders at Vector Institute, New York University & San Francisco State University, 2018.

Leveraging MIDI Files for Music Information Retrieval at 18th International Society for Music Information Retrieval Conference Tutorials, 2017.

Doing Strange Things with Attention at AI With The Best & 1st USF Data Institute Conference, 2017.

The Lakh MIDI Dataset: How It Was Made, and How to Use It at BISH Bash Meetup, Centre for Digital Music Seminar & Jukedeck Lunch and Learn, 2016.

Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching at 2nd ICML Machine Learning for Music Discovery Workshop, 2016.

Accelerating Large-Scale Sequence Retrieval with Convolutional Networks at IIT Bombay Electrical Engineering Seminar, 2015.

Learning Efficient Representations for Sequence Retrieval at Boston Data Festival, 2015.

Using Convolutional Networks (with Attention) for Orders-of-Magnitude Speedup of DTW-Based Sequence Retrieval at Spotify Machine Learning Seminar, 2015.

Recurrent Networks in Lasagne at Mount Sinai Hammer Lab Seminar, 2015.

Lasagne Tutorial at Next.ml Boston, 2015.

Theano Tutorial at Next.ml Boston, 2015.

mir_eval at Objective Evaluation in Semantic Audio Analysis and Processing Panel at the 138th Convention of the Audio Engineering Society, 2015.

Large-Scale Content-Based Matching of Audio and MIDI Data at Stanford University DSP Seminar, 2015.

Advances and Challenges in Large-Scale Music Information Retrieval at Digital Music Research Network+8, 2013.

Quantifying Rhythmic Synchrony at Midwestern Music Cognition Symposium, 2013.

A Sequential Approach to Musical Event Detection at Carnegie Mellon University Music and Technology Seminar, 2011.

ROW-mp3: An Enhanced MP3-Compatible Audio Codec at Stanford University DSP Seminar, 2010.

An Effective Model of Bucket-Brigade Device-Based Audio Circuits at Stanford University DSP Seminar, 2010.

Voltage-Controlled Resistance: Modulate Anything at Circuitastrophe Circuit Bending Music Festival, 2008.