Metadata Conditioning Accelerates Language Model Pre-training
Summary
This research paper introduces Metadata Conditioning then Cooldown (MeCo), a method for making large language model pre-training both faster and more controllable. MeCo prepends readily available metadata, such as source URLs, to training documents, helping the model associate text with its provenance, and then uses a final "cooldown" phase on plain text so the model functions normally without metadata at inference time. Experiments show that MeCo significantly accelerates pre-training: models trained with MeCo match the downstream performance of standard pre-training while using notably less data (the paper reports 33% less for a 1.6B model). Conditioning inference prompts on metadata also steers model outputs, for example reducing harmful generations by prepending a URL such as wikipedia.org. The study evaluates several metadata types and ablates key design choices to explain MeCo's effectiveness, and it compares MeCo to existing techniques for data selection and metadata conditioning.
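To make the data-preparation side concrete, below is a minimal sketch of MeCo-style example construction, assuming a HuggingFace-style tokenizer whose encode method returns a list of token ids. The "URL: ..." template, the helper names build_example and in_cooldown, and the exact cooldown fraction are illustrative assumptions, not the paper's released code; masking the metadata tokens out of the loss reflects the paper's design of conditioning on, rather than predicting, the prefix.

```python
def build_example(tokenizer, text, url=None, cooldown=False):
    """Tokenize one pre-training document, optionally prefixed with
    its URL metadata (MeCo-style conditioning).

    Returns (input_ids, labels). Metadata tokens are masked out of the
    loss with the conventional ignore index -100, so the URL conditions
    the model without being a prediction target.
    """
    if cooldown or url is None:
        # Cooldown phase (or missing metadata): plain text only, so the
        # model learns to operate without a metadata prefix at inference.
        ids = tokenizer.encode(text)
        return ids, list(ids)

    # Conditioning phase: prepend the source URL as ordinary text.
    meta_ids = tokenizer.encode(f"URL: {url}\n\n")  # template is an assumption
    doc_ids = tokenizer.encode(text)
    return meta_ids + doc_ids, [-100] * len(meta_ids) + doc_ids


def in_cooldown(step, total_steps, cooldown_frac=0.1):
    # The final fraction of training drops the metadata prefix
    # (the paper's cooldown covers roughly the last 10% of steps).
    return step >= int((1.0 - cooldown_frac) * total_steps)
```

At inference time, the same idea supports steering: prepending a prefix like "URL: wikipedia.org" to a prompt conditions generation on that source, while prompts without any prefix behave normally thanks to the cooldown phase.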