Data Ingestion Stage for LLMs
Data Ingestion & Normalization → Tokenization → Metadata
A survey of how researchers and companies handle this stage in the LLM lifecycle.
1. Normalization
Goal: Standardize raw text into a consistent, model-ready format.
- Common Techniques
- Lowercasing (skipped when case carries meaning, e.g. proper nouns, acronyms, or code).
- Unicode normalization (NFC/NFD).
- Removing boilerplate (ads, HTML tags, navigation).
- Handling punctuation, whitespace, accents, emojis.
- Sentence and paragraph segmentation.
- Approaches in Practice
- Web-scale pipelines (C4, The Pile): boilerplate removal via heuristic line filters (C4) or extraction tools such as jusText (used for Pile-CC) and trafilatura.
- Curated corpora (Phi-3 training data, Dolma): aggressive filtering and cleaning, in the Phi series aimed at textbook-style clarity.
- Code datasets (CodeParrot, StarCoder): normalize tabs/spaces, comments, and formatting.
- Multilingual models (XLM-R, mBERT): Unicode normalization that retains diacritics and native scripts.
- Trade-offs
- ✅ Cleaner data → better generalization.
- ❌ Over-cleaning can remove useful signals (capitalization, formatting).
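A few of the techniques above can be sketched with the Python standard library alone. This is a minimal illustration, not how any of the named pipelines is implemented: `normalize_text`, the regexes, and the `lowercase` flag are hypothetical names chosen for this example, and the tag-stripping regex is far cruder than a real extractor such as jusText or trafilatura.

```python
import html
import re
import unicodedata

# Assumption: a crude tag matcher is enough for illustration; real pipelines
# use a proper HTML extractor instead of a regex.
TAG_RE = re.compile(r"<[^>]+>")
# Runs of spaces, tabs, and non-breaking spaces within a line.
WS_RE = re.compile(r"[ \t\u00a0]+")


def normalize_text(raw: str, lowercase: bool = False) -> str:
    """Hypothetical helper: strip tags, unescape entities, apply NFC, tidy whitespace."""
    text = TAG_RE.sub(" ", raw)                # drop HTML tags
    text = html.unescape(text)                 # "&amp;" -> "&"
    text = unicodedata.normalize("NFC", text)  # compose "e" + combining accent
    # Collapse whitespace per line and drop lines that end up empty.
    lines = [WS_RE.sub(" ", line).strip() for line in text.splitlines()]
    text = "\n".join(line for line in lines if line)
    # Lowercasing is opt-in so that case is preserved by default.
    return text.lower() if lowercase else text


sample = "<p>Cafe\u0301  &amp;   bar</p>\n\n<div>Hello\tWorld</div>"
print(normalize_text(sample))
print(normalize_text(sample, lowercase=True))
```

Keeping lowercasing behind a flag reflects the trade-off listed above: the default path preserves capitalization, and the caller decides when that signal can be discarded.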
2. Tokenization
Goal: Convert text into tokens that the model understands.
- Methods
- Whitespace/word-based (rare today)
- ✅ Intuitive
- ❌ Huge vocab, poor handling of rare words
- Character-level
- ✅ Covers any string; no out-of-vocabulary tokens
- ❌ Long sequences → inefficient
- Subword-based (BPE, WordPiece, UnigramLM)
- ✅ Balance between coverage & efficiency
- ❌ Can split awkwardly in morphologically rich languages
- Byte-level (byte-level BPE in GPT-2 and GPT-4; SentencePiece BPE with byte fallback in LLaMA 1/2)
- ✅ Universal UTF-8 coverage, no OOV
- ❌ Longer sequences for non-Latin scripts
- Morpheme-aware (linguistic segmentation, limited adoption)
- ✅ Good for morphologically rich languages
- ❌ Hard to scale, resource-intensive
- Hybrid / tokenizer-free (character- or byte-level input, sometimes with learned subword blocks) (CANINE, ByT5, Charformer)
- ✅ Robust to typos, OCR noise
- ❌ Computationally heavier
- Trends
- Subword remains dominant.
- Byte-level increasingly popular for robustness.
- Hybrids emerging for noisy/OCR-heavy settings.
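The sequence-length trade-offs listed under Methods can be made concrete with a small sketch. It compares whitespace, character, and byte-level splits of the same string, and then performs a single BPE-style merge step (count adjacent symbol pairs, merge the most frequent one). All function names are hypothetical, and the merge step is a toy version of BPE training, not the tokenizer of any model named above.

```python
from collections import Counter


def whitespace_tokens(text: str) -> list[str]:
    return text.split()


def char_tokens(text: str) -> list[str]:
    return list(text)


def byte_tokens(text: str) -> list[int]:
    # Byte-level view: every string maps to values 0..255, so nothing is OOV.
    return list(text.encode("utf-8"))


def most_frequent_pair(words: list[list[str]]) -> tuple[str, str]:
    """Count adjacent symbol pairs across all words and return the most common."""
    pairs: Counter = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(words: list[list[str]], pair: tuple[str, str]) -> list[list[str]]:
    """Replace each occurrence of `pair` with one merged symbol (one BPE step)."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged


text = "na\u00efve \u65e5\u672c\u8a9e"  # Latin text with a diacritic plus Japanese
print(len(whitespace_tokens(text)), len(char_tokens(text)), len(byte_tokens(text)))

words = [list(w) for w in whitespace_tokens("low lower lowest")]
pair = most_frequent_pair(words)
print(pair, merge_pair(words, pair))
```

On the mixed-script string the byte view is noticeably longer than the character view, which is the "longer sequences for non-Latin scripts" cost noted above; repeated merge steps are how subword vocabularies recover some of that efficiency.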
3. Metadata
Goal: Attach structured information to each text/document for filtering, weighting, and governance.
- Examples of Metadata
- Source domain, timestamp, language
- Quality score, deduplication hash
- Copyright/licensing info
- Safety labels (toxicity, violence, etc.)
- Approaches in Practice
- C4 (T5): domain + URL metadata for filtering.
- Dolma (AI2): detailed metadata schema (source type, content type, quality).
- Anthropic: reportedly safety tags (violence, hate, toxicity categories).
- OpenAI internal pipelines: reportedly use metadata for data weighting (e.g., upweighting Q&A and textbooks).
- RAG systems (KILT, RA-DIT): fine-grained retrieval indices as metadata.
- Trade-offs
- ✅ Enables flexible sampling, safety filters, reproducibility.
- ❌ Maintaining consistent schemas is costly; storage overhead.
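As a rough sketch of how such metadata might drive deduplication, filtering, and weighted sampling, the example below attaches a few of the fields listed above to each document. The `DocRecord` schema, the field names, the thresholds, and the weighting rule (source weight times quality score) are all assumptions made for illustration; they are not the schema of C4, Dolma, or any other pipeline mentioned here.

```python
import hashlib
import random
from dataclasses import dataclass, field


@dataclass
class DocRecord:
    """Hypothetical per-document metadata record."""
    text: str
    source: str
    language: str
    quality: float                      # assumed to lie in [0, 1]
    license: str = "unknown"
    safety_labels: list[str] = field(default_factory=list)

    @property
    def dedup_hash(self) -> str:
        # Exact-match hash; near-duplicate detection would need e.g. MinHash.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()


def deduplicate(records: list[DocRecord]) -> list[DocRecord]:
    """Keep the first record for each distinct text hash."""
    seen, unique = set(), []
    for rec in records:
        if rec.dedup_hash not in seen:
            seen.add(rec.dedup_hash)
            unique.append(rec)
    return unique


def passes_filters(rec: DocRecord, min_quality: float = 0.5,
                   blocked_labels: frozenset = frozenset({"toxicity"})) -> bool:
    """Illustrative filter: quality threshold plus a safety-label blocklist."""
    return rec.quality >= min_quality and not (set(rec.safety_labels) & blocked_labels)


def sample_weighted(records: list[DocRecord], source_weights: dict[str, float],
                    k: int, seed: int = 0) -> list[DocRecord]:
    """Sample with replacement, weighting each record by source weight * quality."""
    rng = random.Random(seed)
    weights = [source_weights.get(r.source, 1.0) * r.quality for r in records]
    return rng.choices(records, weights=weights, k=k)


records = [
    DocRecord("a b c", "textbook", "en", 0.9),
    DocRecord("a b c", "web", "en", 0.4),  # exact duplicate of the first text
    DocRecord("x y z", "web", "en", 0.8, safety_labels=["toxicity"]),
]
kept = [r for r in deduplicate(records) if passes_filters(r)]
print([r.source for r in sample_weighted(kept, {"textbook": 2.0}, k=3)])
```

Because every decision reads from the record rather than from the text, the same corpus can be re-filtered or re-weighted without reprocessing documents, which is the flexibility the trade-offs above refer to; the cost is keeping the schema consistent across sources.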
Key Insights
- Normalization: Must balance cleanliness with preserving linguistic signals.
- Tokenization: Subword is standard, byte-level is rising, and hybrids target robustness to noise.
- Metadata: Moving toward richer, structured metadata for governance and weighted sampling.
