⚙️ Input & Controls

Vocabulary size: 1000

Smaller vocabulary = more splitting; larger vocabulary = less splitting

Character-Level
Split every character individually
Word-Level
Keep words intact (fails on rare words that are missing from the vocabulary)
Subword (BPE)
Balances the two: common words stay intact, rare words are split into subword pieces
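
The three modes above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind this demo: the function names (char_tokens, word_tokens, learn_bpe_merges, merge_pair, bpe_tokens), the "<unk>" placeholder, and the toy corpus are all assumptions made for the example. The number of learned merges stands in for vocabulary size, so more merges means less splitting.

```python
from collections import Counter


def char_tokens(text):
    """Character-level: every character is its own token."""
    return list(text)


def word_tokens(text, vocab):
    """Word-level: whole words only; unknown words become "<unk>" (assumed placeholder)."""
    return [w if w in vocab else "<unk>" for w in text.split()]


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules by repeatedly merging the most frequent pair."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges


def bpe_tokens(text, merges):
    """Subword (BPE): start from characters, then apply the merges in learned order."""
    tokens = []
    for word in text.split():
        symbols = tuple(word)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens


if __name__ == "__main__":
    corpus = "low low low lower"  # toy corpus, for illustration only
    text = "low lowest"
    print(char_tokens(text))
    print(word_tokens(text, vocab={"low", "lower"}))
    for n in (0, 1, 2):  # more merges ~ larger vocabulary ~ less splitting
        print(n, bpe_tokens(text, learn_bpe_merges(corpus, n)))
```

With no merges the subword tokenizer behaves like the character-level one; as merges are added, frequent words collapse into single tokens while rare words remain split into smaller pieces.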
Tokens: 0
Vocab Used: 0
Efficiency: 0%

🔤 Tokenization Results

Tokenized Output:

Common words
Medium frequency
Rare/split tokens

Power Law Distribution

Key Insights

  • Enter text to see tokenization analysis