Multrenizer
A bilingual Turkish-English Unigram tokenizer with lossless round-trip decode, case preservation, and a correct utility-token taxonomy — punctuation, currency, math symbols, and emoji are kept atomic in the vocabulary AND preserved through decode(skip_special_tokens=True).
Links
- Repository: github.com/fzengin19/multrenizer
- Hugging Face: huggingface.co/fzengin18/multrenizer
Why Multrenizer?
Standard multilingual tokenizers commonly fail on Turkish in three ways: (1) they break agglutinative morphemes at suboptimal boundaries, (2) they fold case (İstanbul → istanbul) so a downstream model can never produce proper nouns again, and (3) they misclassify punctuation and emoji as "special tokens", which means decode(skip_special_tokens=True) silently strips them and the model output ends up missing commas, apostrophes, and 😀.
Multrenizer is built to fix all three without giving up Turkish efficiency.
Core design:
- Case preserving. `İstanbul` stays `İstanbul`, `TÜRKİYE` stays `TÜRKİYE`. The model learns to produce real proper nouns, not lowercased shadows.
- Lossless round-trip. Metaspace pipeline with `prepend_scheme="first"` plus `AddedToken(normalized=True)` — `decode(encode(x)) == x` for case, punctuation, currency, math, and compound emoji.
- Correct utility taxonomy. Only model-control symbols (`<s>`, `<|user|>`, `<|reserved_*|>`, etc.) are flagged `special=True`. Punctuation, currency, math symbols, and emoji are atomic vocab entries with `special=False` — they survive `skip_special_tokens=True`.
- Compact vocabulary. ~26K total budget, Turkish-first morpheme coverage.
- Code-switching aware. Trained on a TR/EN/CS interleave; mixed text like `merge'lemek istediğim branch` segments cleanly.
Pipeline
```
Raw text
-> Quote canonicalization (’ ‘ ʼ ' -> ')
-> NFKC normalization
-> Strip whitespace
-> Pre-tokenizer: Metaspace(replacement="▁", prepend_scheme="first", split=True)
-> Unigram model (~26K vocab)
-> Decoder: Metaspace(prepend_scheme="first") [mirror of pre-tokenizer]
-> Post-processor (<s> ... </s>)
```
No lowercasing. No locale-specific I/i replacement; that would only be a workaround for the dotted/dotless İ casefolding problem Python introduces when you call `.lower()`, which this pipeline never does.
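For orientation, here is a minimal sketch of how a pipeline with these components could be assembled with the Hugging Face `tokenizers` library. The trainer settings and special-token list below are illustrative placeholders; `train_tokenizer.py` remains the authoritative configuration.

```python
# Illustrative sketch of the pipeline above using the `tokenizers` library.
# Vocab size and special tokens are placeholders, not the shipped config.
from tokenizers import Tokenizer, normalizers, pre_tokenizers, decoders, processors
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())

# Quote canonicalization + NFKC + whitespace strip; no lowercasing.
tokenizer.normalizer = normalizers.Sequence([
    normalizers.Replace("’", "'"),
    normalizers.Replace("‘", "'"),
    normalizers.Replace("ʼ", "'"),
    normalizers.NFKC(),
    normalizers.Strip(),
])

# Pre-tokenizer and decoder mirror each other.
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
    replacement="▁", prepend_scheme="first", split=True
)
tokenizer.decoder = decoders.Metaspace(replacement="▁", prepend_scheme="first")

# Post-processor wraps sequences in <s> ... </s> (IDs 1 and 2 per the taxonomy below).
tokenizer.post_processor = processors.TemplateProcessing(
    single="<s> $A </s>",
    special_tokens=[("<s>", 1), ("</s>", 2)],
)

trainer = UnigramTrainer(
    vocab_size=26_000,
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
    unk_token="<unk>",
)
# tokenizer.train(files=[...], trainer=trainer)
```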
Quick Start
Installation
```bash
git clone https://github.com/fzengin19/multrenizer.git
cd multrenizer
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
Use the shipped tokenizer locally
```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")
encoded = tok.encode("İstanbul'da %50 indirim 😀")

print(encoded.tokens)
# ['<s>', "▁İstanbul'da", '▁%', '50', '▁indirim', '▁', '😀', '</s>']

# Round-trip is lossless — case, punctuation, emoji all survive.
print(tok.decode(encoded.ids))
# "İstanbul'da %50 indirim 😀"
```
Load from Hugging Face
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("fzengin18/multrenizer")
ids = tok.encode("Türkiye'de %50 indirim, 100₺ ❤️", add_special_tokens=False)

# `skip_special_tokens=True` removes ONLY <...>-formatted control tokens.
# Punctuation, currency, and emoji are preserved.
print(tok.decode(ids, skip_special_tokens=True))
# "Türkiye'de %50 indirim, 100₺ ❤️"
```
Train from scratch
```bash
# 1. Download and prepare corpus (Wikipedia TR+EN, OPUS-100, synthetic CS)
python prepare_data.py --size large

# 2. Train tokenizer
python train_tokenizer.py --data-dir data/

# 3. Optional: push to Hugging Face Hub
python train_tokenizer.py --data-dir data/ \
    --repo-id fzengin18/multrenizer \
    --hf-token "$HF_TOKEN"
```
Run benchmarks
```bash
python benchmark.py --tr-lines 5000 --en-lines 5000
```
Architecture
Token Taxonomy
| Tier | Count | special flag | Behavior on skip_special_tokens=True |
|---|---|---|---|
| Named specials (<s>, <|user|>, <think>, ...) | 32 | true | Removed |
| Reserved (<|reserved_0|> ... <|reserved_511|>) | 512 | true | Removed |
| Utility — punctuation/symbols (,, !, ', %, $, →) | 169 | false | Preserved |
| Utility — emoji (corpus-learned + curated fallback) | ~240 | false | Preserved |
| Learned subwords (Unigram) | ~24,700 | false | Preserved |
This is the central correctness fix vs the prior artifact, where 292 utility tokens were incorrectly flagged special=true and erased on every decode.
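You can verify this split on the shipped artifact yourself; a small check, assuming the local `tokenizer.json` path and a recent `tokenizers` release that exposes `get_added_tokens_decoder`:

```python
# Split the added tokens by their `special` flag. Control tokens are removed by
# skip_special_tokens=True; utility tokens are kept.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")
added = tok.get_added_tokens_decoder()              # {id: AddedToken}
control = [t.content for t in added.values() if t.special]
utility = [t.content for t in added.values() if not t.special]
print(len(control), len(utility))
```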
Vocabulary Budget
Target: 26,000. Actual shipped artifact: 25,679 (Unigram convergence undershoot is normal).
The model trained downstream should set its vocab_size to the tokenizer's actual size, not the 26K target.
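A quick way to pick up the actual size at runtime (a sketch; the downstream config field name depends on your model framework and is hypothetical here):

```python
# Read the artifact's true vocabulary size instead of hardcoding the 26K target.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")
actual_vocab_size = tok.get_vocab_size(with_added_tokens=True)  # 25,679 for the shipped artifact
# model_config.vocab_size = actual_vocab_size   # hypothetical downstream config field
```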
Special Tokens
| Category | IDs | Tokens |
|---|---|---|
| Core | 0-3 | <unk> <s> </s> <pad> |
| Chat | 4-8 | <|system|> <|user|> <|assistant|> <|end|> <|sep|> |
| Reasoning | 9-12 | <think> </think> <|step|> <|reflection|> |
| Tool Use | 13-16 | <tool_call> </tool_call> <tool_response> </tool_response> |
| Code/FIM | 17-20 | <|code|> <|fim_prefix|> <|fim_middle|> <|fim_suffix|> |
| Language | 21-22 | <|tr|> <|en|> |
| RAG | 23-24 | <|context|> <|/context|> |
| Multi-modal | 25-28 | <|image|> <|audio|> <|video|> <|file|> |
| Structured | 29-31 | <|json|> <|table|> <|cite|> |
| Reserved | 32-543 | <|reserved_0|> ... <|reserved_511|> |
Look these up at runtime with `tokenizer.token_to_id("<...>")` — do not hardcode IDs in downstream code.
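For example (token names from the table above; the path assumes the shipped artifact):

```python
# Resolve control-token IDs at runtime; token_to_id returns None for unknown tokens.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")
bos_id = tok.token_to_id("<s>")
eos_id = tok.token_to_id("</s>")
user_id = tok.token_to_id("<|user|>")
assert None not in (bos_id, eos_id, user_id)
```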
Utility Tokens (special=False, atomic)
| Category | Approx. count | Examples |
|---|---|---|
| Punctuation | 38 | . , ! ? ; : - ( ) [ ] { } / \ " ' ... – — … ‽ ‼ ⁇ ¿ ¡ |
| Currency & business | 23 | ₺ $ € £ ¥ ₹ ₽ ¢ ฿ ₪ ₸ ₣ ₮ ₩ % ‰ ° § ¶ № @ # & |
| Math & science | 33 | ± × ÷ ≠ ≤ ≥ ≈ ∞ √ ∑ ∫ ∂ Δ π α β γ δ ε θ λ μ σ φ ω ∀ ∃ ∈ ⊂ ⊆ ∪ ∩ ⇔ ↦ ∝ ∇ ∮ ∧ ∨ ¬ Α Β Γ Σ Φ Ω |
| Programming digraphs | 11 | => != == <= >= -> :: ** // && || |
| Arrows & symbols | 15 | → ← ↑ ↓ ↔ ⇒ • · ★ ☆ ✓ ✗ © ® ™ |
| Box drawing & UI | 17 | ─ │ ┌ ┐ └ ┘ ├ ┤ ┬ ┴ ┼ ▪ ▫ ◆ ▶ ◀ ▼ ▲ |
| Typography | 10 | « » " " ' ' ‹ › „ ‚ |
| Emoji — faces | 70 | 😀 😂 🤣 😊 😍 🤔 😭 😡 💀 🤖 |
| Emoji — hands | 28 | 👋 👍 👎 👏 🙏 💪 ✊ ✌️ |
| Emoji — hearts | 18 | ❤️ 💛 💚 💙 💜 🖤 💔 |
| Emoji — symbols | 36 | 🔥 ✨ ⭐ ✅ ❌ ⚠️ 💯 🚀 |
| Emoji — objects | 36 | 💻 📱 🎯 🏆 📊 ☕ 🔗 💰 |
| Emoji — nature & common | 17 | 🌍 🎉 🎁 ⏰ 📅 📌 🚨 🛒 💳 📞 📧 🏻🏼🏽🏾🏿 |
| Emoji — flags | 26 | 🇹🇷 🇺🇸 🇬🇧 🇩🇪 🇫🇷 🇨🇳 🇰🇷 🇮🇳 🇸🇦 🇮🇱 🇷🇺 🇧🇷 🇨🇦 🇲🇽 🇦🇺 🇳🇱 |
| Emoji — compounds (ZWJ/VS16 sequences) | 12 | ❤️ 👨💻 👩💻 🏃♂️ 🏃♀️ 👨👩👧 👨🏫 👩🏫 👍🏻–👍🏿 |
ZWJ (U+200D) and VS16 (U+FE0F) are NOT added as standalone tokens; they exist only inside whole compound-emoji sequences.
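A quick way to confirm this behavior on the shipped artifact (the sample string is illustrative):

```python
# The ZWJ (U+200D) compound 👨‍💻 must decode back intact, not as stripped parts.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")
text = "Geliştirici 👨\u200d💻 olarak çalışıyorum."
assert tok.decode(tok.encode(text).ids) == text
```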
Data Mix
The released artifact is trained on the corpus produced by prepare_data.py --size large:
| Stream | Approx. lines | Share | Source |
|---|---|---|---|
| Turkish | 1.30M | ~59% | Wikipedia TR streaming |
| English | 0.71M | ~32% | Wikipedia EN streaming |
| Code-switching | 0.20M | ~9% | OPUS-100 EN-TR pairs + synthetic templates |
An additional 5K synthetic emoji-rich CS lines are appended to ensure curated emojis make it into the learned vocabulary instead of the post-train fallback list.
Correctness Tests
The repo ships a pytest suite (tests/test_tokenizer_correctness.py) covering:
- Round-trip lossless decode — `decode(encode(x)) == normalize(x)` over Turkish, English, and code-switching samples.
- `skip_special_tokens=True` preserves utility — punctuation, currency, emoji must survive; only `<...>`-formatted control tokens are removed.
- Compound emoji round-trip — VS16, ZWJ, regional-indicator pairs decode back to the original sequence.
- Turkish morpheme integrity — words like `şehirdir`, `kitabımı`, `evlerinden` decode without internal whitespace.
- UNK rate < 0.5% on a bilingual bench (line-by-line).
- Special token ID stability — IDs 0-7 (`<unk>`, `<s>`, `</s>`, `<pad>`, `<|system|>`, `<|user|>`, `<|assistant|>`, `<|end|>`).
- Case preservation — `İstanbul`, `TÜRKİYE`, `ABD`, `AKP` all round-trip without folding.
Current artifact: 60/60 PASS.
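The core round-trip property looks roughly like this; a condensed sketch, not the shipped suite, which covers normalization details, emoji, morphemes, ID stability, and the UNK-rate check listed above:

```python
# Condensed sketch of the round-trip property checked by tests/test_tokenizer_correctness.py.
import pytest
from tokenizers import Tokenizer

SAMPLES = [
    "İstanbul'da güzel bir gün geçirdim.",
    "Türkiye'de %50 indirim, 100₺ ❤️",
    "merge'lemek istediğim branch conflict veriyor.",
]

@pytest.fixture(scope="module")
def tok():
    return Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")

@pytest.mark.parametrize("text", SAMPLES)
def test_round_trip_lossless(tok, text):
    ids = tok.encode(text).ids
    # decode() skips control tokens (<s>, </s>) by default; utility tokens
    # such as %, ₺, and emoji must survive for the assertion to hold.
    assert tok.decode(ids) == text
```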
Benchmark Results
Evaluated on 5,000 Turkish sentences, 5,000 English sentences, and 500 code-switching sentences from the prepared corpus, against five reference tokenizers.
Compared Tokenizers
| Tokenizer | Source | Vocab Size | Algorithm | Type |
|---|---|---|---|---|
| Multrenizer | This project | 25,679 | Unigram | Bilingual EN-TR, purpose-built |
| Kumru-2B | vngrs-ai/Kumru-2B | 50,176 | BPE | Turkish LLM (VNGRS, Sep 2025, Mistral-based) |
| Turkcell-7B | TURKCELL/Turkcell-LLM-7b-v1 | 48,351 | BPE | Turkish LLM (Turkcell, Apr 2024, Mistral-based) |
| GPT-2 | openai-community/gpt2 | 50,257 | BPE | English-centric baseline (OpenAI, 2019) |
| Qwen-3 | Qwen/Qwen3-0.6B | 151,643 | BPE | Multilingual (Alibaba, 2025) |
| Mistral-3.1 | mistralai/Mistral-Small-3.1-24B-Base-2503 | 131,072 | BPE/SP | Multilingual (Mistral AI, Mar 2025) |
Fertility, Compression, and Token Count
Lower fertility = fewer tokens per word. Higher compression = more characters per token.
| Metric | Multrenizer | Kumru-2B | Turkcell-7B | GPT-2 | Qwen-3 | Mistral-3.1 |
|---|---|---|---|---|---|---|
| Vocab Size | 25,679 | 50,176 | 48,351 | 50,257 | 151,643 | 131,072 |
| TR Fertility | 1.63 | 1.65 | 1.92 | 3.79 | 2.62 | 2.38 |
| EN Fertility | 1.51 | 2.15 | 1.55 | 1.31 | 1.37 | 1.38 |
| TR Compression | 4.79 | 4.72 | 4.06 | 2.06 | 2.98 | 3.27 |
| EN Compression | 4.18 | 2.94 | 4.07 | 4.82 | 4.61 | 4.58 |
| TR Total Tokens (5K sent) | 130,725 | 132,637 | 154,166 | 304,345 | 210,334 | 191,682 |
| EN Total Tokens (5K sent) | 155,848 | 221,420 | 160,121 | 135,235 | 141,275 | 142,196 |
| Round-trip Lossless | 11/11 | 11/11 | 11/11 | 11/11 | 11/11 | 11/11 |
Headline:
- Best Turkish efficiency in this set on every TR metric (fertility, compression, total tokens), with the smallest vocabulary by a wide margin.
- Competitive English — second only to GPT-2 on EN total tokens, and that's an English-native baseline trained on far more English text.
- Round-trip lossless on all 11 cases. This is a minimum-correctness bar; modern BPE tokenizers also pass it. The previous Multrenizer artifact did not (it dropped utility tokens via `skip_special_tokens=True`); fixing that is one of the main reasons this version exists.
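For reference, the fertility and compression numbers above follow the usual per-word / per-character definitions. A sketch of how they can be computed is below; whitespace word splitting is an assumption here, and benchmark.py remains the authoritative implementation.

```python
# Sketch of the fertility / compression definitions used in the table above.
from tokenizers import Tokenizer

def fertility_and_compression(tok: Tokenizer, lines: list[str]) -> tuple[float, float]:
    n_tokens = sum(len(tok.encode(line, add_special_tokens=False).ids) for line in lines)
    n_words = sum(len(line.split()) for line in lines)
    n_chars = sum(len(line) for line in lines)
    fertility = n_tokens / n_words      # tokens per word: lower is better
    compression = n_chars / n_tokens    # characters per token: higher is better
    return fertility, compression
```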
Morphological Splitting
Total tokens needed to represent ten difficult Turkish words:
| Tokenizer | Vocab Size | Total Tokens | Avg per Word |
|---|---|---|---|
| Kumru-2B | 50,176 | 35 | 3.5 |
| Multrenizer | 25,679 | 36 | 3.6 |
| Turkcell-7B | 48,351 | 38 | 3.8 |
| Mistral-3.1 | 131,072 | 71 | 7.1 |
| Qwen-3 | 151,643 | 73 | 7.3 |
| GPT-2 | 50,257 | 105 | 10.5 |
A few selected words:
İstanbul'da

```
Multrenizer  [ 1 tok] İstanbul'da   ← single token, atomic
Kumru-2B     [ 3 tok] İstanbul + ' + da
Turkcell-7B  [ 3 tok] İstanbul + ' + da
GPT-2        [ 5 tok] Ä + ° + stanbul + 'd + a
Qwen-3       [ 4 tok] İ + stanbul + 'd + a
Mistral-3.1  [ 4 tok] İ + stanbul + 'd + a
```

Afyonkarahisarlılaştıramadıklarımızdan

```
Multrenizer  [ 8 tok] Afyonkarahisar + lı + laştı + ram + a + dıkları + mızda + n
Kumru-2B     [ 8 tok] Af + yonkarahisar + lı + laştır + ama + dık + larımız + dan
Turkcell-7B  [ 9 tok] Afyon + kar + ah + is + arlı + laştır + a + madık + larımızdan
Qwen-3       [16 tok] (16-piece byte-level fragmentation)
Mistral-3.1  [16 tok] (16-piece byte-level fragmentation)
GPT-2        [21 tok] (21-piece byte-level fragmentation)
```

düşünülebileceğini

```
Multrenizer  [ 2 tok] düşünül + ebileceğini
Kumru-2B     [ 3 tok] düşünül + ebil + eceğini
Turkcell-7B  [ 4 tok] düş + ünü + le + bileceğini
GPT-2        [12 tok] d + ü + ş + ü + n + ü + le + b + ile + ce + ğ + ini
```
Round-Trip Lossless Decode
11 representative inputs covering case preservation, apostrophe handling, currency, math/measurement symbols, VS16 / ZWJ / regional-indicator emoji, and mixed-content sentences. Multrenizer passes 11/11. All compared modern tokenizers also pass — the value here is being on the right side of the bar (the previous Multrenizer artifact wasn't), not winning a race.
```
Input                                        Multrenizer ...
İstanbul'da güzel bir gün geçirdim.          OK
TÜRKİYE büyük harflerle yazılır.             OK
It's a beautiful day, isn't it?              OK
Fiyat: 100₺ ve $50, €30 ve £25.              OK
Türkiye'de %50 indirim ve %25 KDV.           OK
Hava 22°C, yağış %30, basınç 1013hPa.        OK
❤️ ile 💔 farklı duygular ifade eder.        OK
🇹🇷 ve 🇺🇸 bayrakları yan yana asıldı.        OK
Geliştirici 👨💻 olarak çalışıyorum.         OK
Selam 👋! Türkiye'de %50 indirim 100₺.       OK
```
Code-Switching Tokenization
"Bu feature'ı implement ederken edge case'leri handle etmeyi unutmayalım."
```
Multrenizer  [12 tok] Bu | feature | 'ı | implement | ederken | edge | case | 'leri | handle | etmeyi | unutmay | alım.
Kumru-2B     [20 tok] Bu | fe | ature | ' | ı | imp | lement | ederken | ed | ge | cas | e | ' | leri | hand | le | etmeyi | unutma | yalım | .
Turkcell-7B  [15 tok] Bu | feature | ' | ı | implement | ederken | edge | case | ' | leri | handle | etmeyi | unut | mayalım | .
```
"merge'lemek istediğim branch conflict veriyor."
```
Multrenizer  [ 7 tok] merge'le | mek | istediğim | branch | conflict | veriyor | .
Kumru-2B     [14 tok] mer | ge | ' | lemek | istediğim | b | ran | ch | con | f | lic | t | veriyor | .
Turkcell-7B  [ 8 tok] merge | ' | lemek | istediğim | branch | conflict | veriyor | .
GPT-2        [16 tok] mer | ge | ' | lem | ek | is | ted | i | ğ | im | branch | conflict | ver | iy | or | .
```
Full benchmark output (all sections, all sentences, machine-readable) is in `benchmark_results.json`. Regenerate with `python benchmark.py`.
Project Structure
```
multrenizer/
├── multrenizer-tokenizer/              # Trained tokenizer artifact
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── prepare_data.py                     # Corpus download (Wikipedia + OPUS-100 + synthetic CS)
├── train_tokenizer.py                  # Tokenizer training script
├── benchmark.py                        # Benchmark vs 5 reference tokenizers
├── benchmark_results.json              # Full benchmark output
├── tests/
│   └── test_tokenizer_correctness.py   # Round-trip + utility + ID-stability suite
├── RETRAIN_PLAN.md                     # Design / pivot history (audit trail)
├── MINDAI_INTEGRATION.md               # Downstream integration recipe
├── requirements.txt
└── pyproject.toml
```
References
- Tokens with Meaning: A Hybrid Tokenization Approach for Turkish
- Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark
- Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE
- Vocabulary Trimming: An Easy and Effective Method for SLM Acceleration
License
Apache 2.0