Tokenizer Design: Choosing BPE, Unigram, and Vocabulary Size
The tokenizer is the bridge between human-readable text and model-interpretable tokens. Tokenizer choice affects efficiency, cross-lingual performance, generalization, and robustness to noise.
1. Goal
Transform raw text into tokens (subwords, characters, or bytes) in a way that:
- Efficiently represents all possible text (no out-of-vocabulary issues).
- Balances vocabulary size vs. sequence length.
- Supports multilingual text and special domains (code, math, emoji).
- Preserves semantic and morphological structure where helpful.
2. Why Tokenizer Design Matters
- Determines context window efficiency: smaller tokens → longer sequences; larger tokens → shorter but coarser inputs.
- Shapes embedding layer size (directly tied to vocab size).
- Affects multilingual handling, robustness to noise, and downstream compression.
- Influences training cost per token and output granularity in generation.
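To make the context-window point concrete, the back-of-the-envelope sketch below converts a fixed token budget into approximate word counts for a few assumed tokens-per-word ratios (the ratios are illustrative, not measured from any particular tokenizer):

```python
# How many English words fit in a fixed context window, for different
# tokenizer "fertility" (average tokens per word)? Ratios are illustrative.
context_tokens = 8192
for tokens_per_word in (1.1, 1.3, 2.0, 4.5):  # efficient subword ... character-ish
    approx_words = context_tokens / tokens_per_word
    print(f"tokens/word={tokens_per_word:3.1f} -> "
          f"~{approx_words:,.0f} words per {context_tokens}-token window")
```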
3. Major Tokenization Paradigms
| Type | Description | Examples | Pros | Cons |
|---|---|---|---|---|
| Word-level | Split by whitespace and punctuation. | Early NLP models (Word2Vec, GloVe). | Simple, interpretable. | Fails on rare words, inflected forms, or new words. |
| Character-level | Each character is a token. | CharRNNs, ByT5 variants. | No OOV issues; handles typos/OCR errors. | Long sequences → inefficient. |
| Subword-based | Split words into frequently occurring subunits. | BPE (GPT), WordPiece (BERT), Unigram (T5). | Compact, efficient, covers rare words. | Language bias (Latin-centric). |
| Byte-level | Operates directly on UTF-8 bytes. | GPT-2, GPT-4, LLaMA-3. | Universal coverage, no OOV. | Longer sequences for non-Latin scripts. |
| Morpheme-aware | Linguistically segmented tokens. | M-BERT extensions, PolyLM. | Excellent for morphologically rich languages. | Complex preprocessing; not universal. |
| Character-subword hybrids | Learn both granular and composed tokens. | CANINE, Charformer, ByT5. | Robust to spelling noise. | Slower training; complex encoding. |
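A quick way to feel these trade-offs is to count units at different granularities. The toy Python snippet below involves no trained vocabulary; it just contrasts whitespace, character, and UTF-8 byte counts for one sentence:

```python
# Toy comparison of tokenization granularities (no trained subword model involved).
text = "Unbelievably, the tokenizer mis-tokenized 'retokenization'."

word_tokens = text.split()                 # word-level: whitespace split
char_tokens = list(text)                   # character-level: one token per codepoint
byte_tokens = list(text.encode("utf-8"))   # byte-level: one token per UTF-8 byte

print(len(word_tokens), len(char_tokens), len(byte_tokens))
```

Subword schemes such as BPE and Unigram land between the word-level and character-level extremes, which is exactly why they dominate current LLMs.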
4. Dominant Algorithms
1. Byte Pair Encoding (BPE)
- Iteratively merges the most frequent adjacent symbol pair into a new subword token (a toy version of this merge loop is sketched after this list).
- Used by: GPT family, LLaMA-1/2, Falcon, Mixtral.
- ✅ Pros: Efficient, easy to implement, compact.
- ❌ Cons: Deterministic merges can overfit to training-corpus frequencies; hard to adapt to multilingual text.
2. WordPiece
- Greedily merges the symbol pair that most increases training-corpus likelihood, rather than the most frequent pair.
- Used by: BERT, DistilBERT, ELECTRA.
- ✅ Pros: Likelihood-driven merges give linguistically smoother subword splits than raw frequency.
- ❌ Cons: Slightly slower training; vocabulary harder to interpret.
3. Unigram Language Model (SentencePiece)
- Starts from a large seed vocabulary and iteratively prunes the subwords that contribute least to corpus likelihood under a unigram language model.
- Used by: T5, Flan-T5, XLM-R, mBART.
- ✅ Pros: Better coverage; probabilistic merges; multilingual friendly.
- ❌ Cons: More complex training (EM over segmentations); segmentation can vary between runs when subword sampling is enabled.
4. Byte-Level BPE
- BPE applied directly to raw bytes (0–255).
- Used by: GPT-2/3/4, RoBERTa, LLaMA-3.
- ✅ Pros: Universal UTF-8 support, no special casing for languages.
- ❌ Cons: Creates more tokens for non-Latin text (e.g., Chinese, Arabic).
5. SentencePiece (Framework)
- Framework supporting Unigram or BPE with built-in normalization and pre/post-processing.
- Used by: T5, LLaMA, Gemma, Phi-3.
- ✅ Pros: Language-independent, reproducible, standalone.
- ❌ Cons: Detokenization outside the library requires handling its ▁ whitespace marker correctly to reproduce the original text.
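For reference, the core of BPE training (item 1 above) is just a greedy merge loop. The sketch below is a toy, character-level version in the spirit of the original algorithm, not a production implementation; real tokenizers add byte-level alphabets, regex pre-tokenization, and far faster pair counting:

```python
from collections import Counter

def learn_bpe_merges(corpus_words, num_merges):
    """Toy BPE trainer: greedily merge the most frequent adjacent symbol pair."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = Counter()
    for word in corpus_words:
        vocab[tuple(word) + ("</w>",)] += 1   # </w> marks the word boundary

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)

        # Apply the chosen merge to every word in the working vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe_merges(["low", "lower", "lowest", "newer", "wider"], num_merges=5))
```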
5. Vocabulary Size Selection
| Model | Vocab Size | Rationale |
|---|---|---|
| GPT-3 (OpenAI) | 50,257 | Byte-BPE optimized for English + symbols. |
| LLaMA-3 (Meta) | 128,000 | Expanded multilingual coverage. |
| T5 (Google) | 32,000 | Tradeoff between quality and efficiency. |
| Mistral (MistralAI) | 32,000 | Same as LLaMA SentencePiece vocabulary. |
| Phi-3 (Microsoft) | ~64,000 | Balanced for code + text + reasoning. |
| Gemini (Google) | 100,000+ | Multi-modal and multilingual tokens. |
Key Trade-offs:
- Smaller vocab → longer sequences, cheaper embeddings.
- Larger vocab → shorter sequences, higher embedding/memory cost.
- Rule of thumb: there is no exact formula; in practice, vocabularies are sized so the embedding table stays a small fraction of total parameters while keeping average tokens-per-word low across the target languages and domains.
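To see the embedding-cost side of the trade-off in numbers, the sketch below estimates how large the embedding table becomes for a few candidate vocabulary sizes; the hidden size and parameter budget are assumptions for illustration, not any specific model's figures:

```python
# Rough embedding-cost estimate for candidate vocabulary sizes.
# hidden_size and total_params are illustrative assumptions, not a real model.
hidden_size = 4096
total_params = 8e9

for vocab_size in (32_000, 64_000, 128_000, 256_000):
    # Input embedding table; an untied output head would add the same again.
    embed_params = vocab_size * hidden_size
    share = embed_params / total_params
    print(f"vocab={vocab_size:>7,}  embeddings={embed_params / 1e6:7.1f}M params "
          f"({share:.1%} of an {total_params / 1e9:.0f}B-parameter model)")
```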
6. Normalization & Preprocessing in Tokenization
Tokenizer design also includes text normalization before token splitting.
- Unicode normalization (NFC/NFD).
- Lowercasing (if language-safe).
- Removing control characters.
- Unicode escape handling (\uXXXX).
- Whitespace collapsing.
- Script detection for multilingual tokenizers.
Examples:
- SentencePiece handles normalization internally.
- Byte-level tokenizers (e.g., GPT-2/GPT-4's) apply little or no normalization, so encoding and decoding remain lossless.
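A minimal normalization pass of this kind can be written with the Python standard library alone, as sketched below (NFC composition, control-character removal, whitespace collapsing); production tokenizers such as SentencePiece bake equivalent rules into the model file:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Minimal normalization pass applied before token splitting."""
    text = unicodedata.normalize("NFC", text)            # canonical Unicode composition
    text = "".join(ch for ch in text
                   if unicodedata.category(ch) != "Cc" or ch in "\n\t")  # drop control chars
    text = re.sub(r"[ \t]+", " ", text)                  # collapse runs of spaces/tabs
    return text.strip()

print(normalize("Cafe\u0301   \x00menu"))   # -> "Café menu"
```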
7. Multilingual and Domain-Specific Tokenization
Multilingual
- Mix multiple languages into the same tokenizer training corpus.
- Reduce bias toward high-frequency Latin scripts.
- Use Unigram or Byte-BPE for open-vocabulary support.
- Examples:
- XLM-R uses SentencePiece Unigram across 100 languages.
- LLaMA-3 moves to a ~128k-token vocabulary that encodes non-English text markedly more efficiently than LLaMA-2's 32k vocabulary.
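The script-bias issue is easy to quantify for byte-level schemes: before any merges are learned, the floor on token count is the UTF-8 byte length, which differs sharply across scripts. The snippet below uses short illustrative sentences, not benchmark data:

```python
# UTF-8 bytes per character across scripts: a lower bound on byte-level
# token counts before any merges are learned.
sentences = {
    "English": "The weather is nice today.",
    "Chinese": "今天天气很好。",
    "Arabic":  "الطقس جميل اليوم.",
    "Hindi":   "आज मौसम अच्छा है।",
}

for lang, text in sentences.items():
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    print(f"{lang:8s} chars={n_chars:3d} bytes={n_bytes:3d} bytes/char={n_bytes / n_chars:.2f}")
```

Merges learned mostly from Latin-script data do little to close this gap, which is why multilingual tokenizers rebalance their training corpora across scripts.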
Code and Math Domains
- Tokenize identifiers, operators, and whitespace meaningfully.
- May use specialized subword vocabularies for programming symbols.
- Examples:
- CodeParrot and StarCoder train BPE vocabularies directly on code and add special tokens for file metadata and fill-in-the-middle; CodeLLaMA reuses the LLaMA-2 tokenizer with extra infill tokens.
- DeepSeek-Coder likewise trains its byte-level BPE tokenizer on a code-heavy corpus; syntax-aware (AST-level) segmentation remains an active research direction for adding structural awareness.
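As a sketch of how a domain-specific tokenizer might be built, the example below trains a small byte-level BPE on a toy code corpus with the Hugging Face `tokenizers` library (assumed to be installed); the special tokens mirror the fill-in-the-middle convention used by code models such as StarCoder but are purely illustrative here:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

# Tiny illustrative corpus; a real run would stream millions of source files.
code_corpus = [
    "def add(a, b):\n    return a + b\n",
    "for i in range(10):\n    print(i)\n",
    "class Point:\n    def __init__(self, x, y):\n        self.x, self.y = x, y\n",
]

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)  # operate on raw bytes
tokenizer.decoder = ByteLevelDecoder()

trainer = BpeTrainer(
    vocab_size=1000,
    initial_alphabet=ByteLevel.alphabet(),  # keep full byte coverage (no OOV)
    special_tokens=["<|endoftext|>", "<fim_prefix>", "<fim_middle>", "<fim_suffix>"],
)
tokenizer.train_from_iterator(code_corpus, trainer=trainer)

print(tokenizer.encode("def mul(a, b):\n    return a * b\n").tokens)
```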
8. Evaluation Criteria for Tokenizers
| Criterion | Metric / Approach |
|---|---|
| Coverage | % of text represented without fallback or rare tokens |
| Efficiency | Avg. tokens per word or per character |
| Compression | Total tokens needed per dataset |
| Reversibility | Can detokenize exactly to original text? |
| Cross-lingual balance | Equal token efficiency across scripts |
| Robustness to noise | Tolerance to typos, OCR errors, emoji, code-mixing |
Benchmarks: Token-per-word ratios, downstream loss differences, multilingual perplexity comparisons.
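Most of these criteria reduce to a handful of counts once you have an encode/decode pair. The harness below is a hypothetical helper that works with any tokenizer exposing those two callables; the whitespace tokenizer at the bottom is only a stand-in to make it runnable:

```python
def evaluate_tokenizer(encode, decode, texts):
    """Report fertility (tokens/word), compression (tokens/char), and reversibility."""
    total_tokens = total_words = total_chars = 0
    lossless = True
    for text in texts:
        ids = encode(text)
        total_tokens += len(ids)
        total_words += max(len(text.split()), 1)
        total_chars += len(text)
        lossless = lossless and (decode(ids) == text)
    return {
        "tokens_per_word": total_tokens / total_words,
        "tokens_per_char": total_tokens / total_chars,
        "lossless": lossless,
    }

# Stand-in tokenizer: whitespace split with a growing lookup table (illustrative only).
vocab = {}
def ws_encode(text):
    return [vocab.setdefault(w, len(vocab)) for w in text.split(" ")]
def ws_decode(ids):
    inv = {i: w for w, i in vocab.items()}
    return " ".join(inv[i] for i in ids)

print(evaluate_tokenizer(ws_encode, ws_decode, ["hello world", "a b  c"]))
```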
9. Emerging Tokenization Directions
- Tokenizer-Free Models
- Character or byte embedding models (e.g., ByT5, CANINE, Charformer).
- Goal: eliminate pre-tokenization bias.
- Tradeoff: higher sequence length, more compute.
- Learned Tokenizers / Adapters
- Tokenizers co-trained with the model.
- Example: MEGABYTE (Meta AI, 2023) models raw bytes in fixed-size patches with local and global modules, learning the compression end-to-end.
- Dynamic Vocabularies
- Vocab evolves as new data appears (domain adaptation, code updates).
- Requires retraining embeddings or using adapters.
- Semantic Tokenization
- Future idea: tokenize based on meaning units or dependency trees rather than frequency.
- Promising for reasoning and dialogue tasks.
10. Key Takeaways
- BPE and Unigram remain the industry standard, with Byte-BPE dominating new general-purpose LLMs.
- Vocabulary size is a performance-efficiency tradeoff; roughly 32k–128k tokens is the current sweet spot for general-purpose models.
- Multilinguality and domain adaptation drive new tokenizer innovations.
- Emerging trend: moving toward byte-level or tokenizer-free models to reduce human preprocessing bias.
