Multrenizer

A bilingual Turkish-English Unigram tokenizer with lossless round-trip decode, case preservation, and a correct utility-token taxonomy — punctuation, currency, math symbols, and emoji are kept atomic in the vocabulary AND preserved through decode(skip_special_tokens=True).

Why Multrenizer?

Standard multilingual tokenizers commonly fail on Turkish in three ways: (1) they break agglutinative morphemes at suboptimal boundaries, (2) they fold case (İstanbul → istanbul), so a downstream model can never produce proper nouns again, and (3) they misclassify punctuation and emoji as "special tokens", which means decode(skip_special_tokens=True) silently strips them and the model output ends up missing commas, apostrophes, and 😀.

Multrenizer is built to fix all three without giving up Turkish efficiency.

Core design:

  • Case preserving. İstanbul stays İstanbul, TÜRKİYE stays TÜRKİYE. The model learns to produce real proper nouns, not lowercased shadows.
  • Lossless round-trip. Metaspace pipeline with prepend_scheme="first" plus AddedToken(normalized=True) → decode(encode(x)) == x for case, punctuation, currency, math, and compound emoji.
  • Correct utility taxonomy. Only model-control symbols (<s>, <|user|>, <|reserved_*|>, etc.) are flagged special=True. Punctuation, currency, math symbols, and emoji are atomic vocab entries with special=False — they survive skip_special_tokens=True (see the sketch after this list).
  • Compact vocabulary. ~26K total budget, Turkish-first morpheme coverage.
  • Code-switching aware. Trained on a TR/EN/CS interleave; mixed text like "merge'lemek istediğim branch" segments cleanly.
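
A minimal sketch of how the taxonomy maps onto the tokenizers library's AddedToken flag (assuming a recent tokenizers release where AddedToken exposes special); illustrative only, not the project's training code:

from tokenizers import Tokenizer, AddedToken

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")

# Control tokens carry special=True, so skip_special_tokens=True removes them.
tok.add_special_tokens([AddedToken("<|user|>", special=True)])

# Utility tokens are atomic vocab entries with special=False,
# so they survive decode(..., skip_special_tokens=True).
tok.add_tokens([AddedToken("😀", normalized=True, special=False)])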

Pipeline

Raw text
  -> Quote canonicalization (’ ‘ ʼ ' -> ')
  -> NFKC normalization
  -> Strip whitespace
  -> Pre-tokenizer: Metaspace(replacement="▁", prepend_scheme="first", split=True)
  -> Unigram model (~26K vocab)
  -> Decoder: Metaspace(prepend_scheme="first")  [mirror of pre-tokenizer]
  -> Post-processor (<s> ... </s>)

No lowercasing. No locale-specific I/i replacement (that would be a workaround for the Turkish dotted/dotless-I bug Python introduces only when you call .lower(), which we never do).
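
The stack above, sketched with the tokenizers library (a sketch assuming a recent tokenizers release with prepend_scheme support; the corpus path is hypothetical and details may differ from train_tokenizer.py):

from tokenizers import (Tokenizer, decoders, models, normalizers,
                        pre_tokenizers, processors, trainers)

tok = Tokenizer(models.Unigram())

# Quote canonicalization + NFKC + strip, in pipeline order. No Lowercase().
tok.normalizer = normalizers.Sequence([
    normalizers.Replace("’", "'"),
    normalizers.Replace("‘", "'"),
    normalizers.Replace("ʼ", "'"),
    normalizers.NFKC(),
    normalizers.Strip(),
])

# Metaspace pre-tokenizer and its mirror decoder.
tok.pre_tokenizer = pre_tokenizers.Metaspace(
    replacement="▁", prepend_scheme="first", split=True)
tok.decoder = decoders.Metaspace(
    replacement="▁", prepend_scheme="first", split=True)

# Core specials only; the full 544-token control taxonomy is added afterwards.
trainer = trainers.UnigramTrainer(
    vocab_size=26_000,
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
    unk_token="<unk>",
)
tok.train(["data/corpus.txt"], trainer)  # hypothetical corpus path

# <s> ... </s> wrapping; IDs looked up at runtime, never hardcoded.
tok.post_processor = processors.TemplateProcessing(
    single="<s> $A </s>",
    special_tokens=[("<s>", tok.token_to_id("<s>")),
                    ("</s>", tok.token_to_id("</s>"))],
)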

Quick Start

Installation

git clone https://github.com/fzengin19/multrenizer.git
cd multrenizer
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Use the shipped tokenizer locally

from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")

encoded = tok.encode("İstanbul'da %50 indirim 😀")
print(encoded.tokens)
# ['<s>', "▁İstanbul'da", '▁%', '50', '▁indirim', '▁', '😀', '</s>']

# Round-trip is lossless — case, punctuation, emoji all survive.
print(tok.decode(encoded.ids))
# "İstanbul'da %50 indirim 😀"

Load from Hugging Face

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("fzengin18/multrenizer")
ids = tok.encode("Türkiye'de %50 indirim, 100₺ ❤️", add_special_tokens=False)

# `skip_special_tokens=True` removes ONLY <...>-formatted control tokens.
# Punctuation, currency, and emoji are preserved.
print(tok.decode(ids, skip_special_tokens=True))
# "Türkiye'de %50 indirim, 100₺ ❤️"

Train from scratch

# 1. Download and prepare corpus (Wikipedia TR+EN, OPUS-100, synthetic CS)
python prepare_data.py --size large

# 2. Train tokenizer
python train_tokenizer.py --data-dir data/

# 3. Optional: push to Hugging Face Hub
python train_tokenizer.py --data-dir data/ \
  --repo-id fzengin18/multrenizer \
  --hf-token "$HF_TOKEN"

Run benchmarks

python benchmark.py --tr-lines 5000 --en-lines 5000

Architecture

Token Taxonomy

Tier                                                   Count   special flag   Behavior on skip_special_tokens=True
Named specials (<s>, <|user|>, <think>, ...)              32   true           Removed
Reserved (<|reserved_0|> ... <|reserved_511|>)           512   true           Removed
Utility — punctuation/symbols (, ! ' % $ ...)            169   false          Preserved
Utility — emoji (corpus-learned + curated fallback)     ~240   false          Preserved
Learned subwords (Unigram)                           ~24,700   false          Preserved

This is the central correctness fix vs the prior artifact, where 292 utility tokens were incorrectly flagged special=true and erased on every decode.

Vocabulary Budget

Target: 26,000. Actual shipped artifact: 25,679 (Unigram convergence undershoot is normal).

The model trained downstream should set its vocab_size to the tokenizer's actual size, not the 26K target.
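
For example (get_vocab_size is the standard tokenizers call; the path assumes the shipped artifact):

from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")
actual = tok.get_vocab_size()  # 25,679 for this artifact, not 26,000

# When configuring a downstream model, use the measured value:
# model_config.vocab_size = actual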

Special Tokens

Category      IDs      Tokens
Core          0-3      <unk> <s> </s> <pad>
Chat          4-8      <|system|> <|user|> <|assistant|> <|end|> <|sep|>
Reasoning     9-12     <think> </think> <|step|> <|reflection|>
Tool Use      13-16    <tool_call> </tool_call> <tool_response> </tool_response>
Code/FIM      17-20    <|code|> <|fim_prefix|> <|fim_middle|> <|fim_suffix|>
Language      21-22    <|tr|> <|en|>
RAG           23-24    <|context|> <|/context|>
Multi-modal   25-28    <|image|> <|audio|> <|video|> <|file|>
Structured    29-31    <|json|> <|table|> <|cite|>
Reserved      32-543   <|reserved_0|> ... <|reserved_511|>

Look these up at runtime with tokenizer.token_to_id("<...>") — do not hardcode IDs in downstream code.
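
For instance:

from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")

BOS_ID = tok.token_to_id("<s>")
EOS_ID = tok.token_to_id("</s>")
USER_ID = tok.token_to_id("<|user|>")
assert None not in (BOS_ID, EOS_ID, USER_ID)  # None means the token is absent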

Utility Tokens (special=False, atomic)

Category                                 Approx. count   Examples
Punctuation                                         38   . , ! ? ; : - ( ) [ ] { } / \ " ' ... – — … ‽ ‼ ⁇ ¿ ¡
Currency & business                                 23   ₺ $ € £ ¥ ₹ ₽ ¢ ฿ ₪ ₸ ₣ ₮ ₩ % ‰ ° § ¶ № @ # &
Math & science                                      33   ± × ÷ ≠ ≤ ≥ ≈ ∞ √ ∑ ∫ ∂ Δ π α β γ δ ε θ λ μ σ φ ω ∀ ∃ ∈ ⊂ ⊆ ∪ ∩ ⇔ ↦ ∝ ∇ ∮ ∧ ∨ ¬ Α Β Γ Σ Φ Ω
Programming digraphs                                11   => != == <= >= -> :: ** // && ||
Arrows & symbols                                    15   → ← ↑ ↓ ↔ ⇒ • · ★ ☆ ✓ ✗ © ® ™
Box drawing & UI                                    17   ─ │ ┌ ┐ └ ┘ ├ ┤ ┬ ┴ ┼ ▪ ▫ ◆ ▶ ◀ ▼ ▲
Typography                                          10   « » " " ' ' ‹ › „ ‚
Emoji — faces                                       70   😀 😂 🤣 😊 😍 🤔 😭 😡 💀 🤖
Emoji — hands                                       28   👋 👍 👎 👏 🙏 💪 ✊ ✌️
Emoji — hearts                                      18   ❤️ 💛 💚 💙 💜 🖤 💔
Emoji — symbols                                     36   🔥 ✨ ⭐ ✅ ❌ ⚠️ 💯 🚀
Emoji — objects                                     36   💻 📱 🎯 🏆 📊 ☕ 🔗 💰
Emoji — nature & common                             17   🌍 🎉 🎁 ⏰ 📅 📌 🚨 🛒 💳 📞 📧 🏻🏼🏽🏾🏿
Emoji — flags                                       26   🇹🇷 🇺🇸 🇬🇧 🇩🇪 🇫🇷 🇨🇳 🇰🇷 🇮🇳 🇸🇦 🇮🇱 🇷🇺 🇧🇷 🇨🇦 🇲🇽 🇦🇺 🇳🇱
Emoji — compounds (ZWJ/VS16 sequences)              12   ❤️ 👨‍💻 👩‍💻 🏃‍♂️ 🏃‍♀️ 👨‍👩‍👧 👨‍🏫 👩‍🏫 👍🏻–👍🏿

ZWJ (U+200D) and VS16 (U+FE0F) are NOT added as standalone tokens; they exist only inside whole compound-emoji sequences.
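
To illustrate (a sketch against the shipped artifact; the exact subword split may vary):

from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")

dev = "👨\u200d💻"  # MAN + ZWJ + LAPTOP: one compound emoji
ids = tok.encode(dev).ids
# The ZWJ never surfaces as a standalone token; the compound round-trips whole.
assert tok.decode(ids) == dev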

Data Mix

The released artifact is trained on the corpus produced by prepare_data.py --size large:

Stream           Approx. lines   Share   Source
Turkish          1.30M           ~59%    Wikipedia TR streaming
English          0.71M           ~32%    Wikipedia EN streaming
Code-switching   0.20M           ~9%     OPUS-100 EN-TR pairs + synthetic templates

An additional 5K synthetic emoji-rich CS lines are appended to ensure curated emojis make it into the learned vocabulary instead of the post-train fallback list.

Correctness Tests

The repo ships a pytest suite (tests/test_tokenizer_correctness.py) covering:

  1. Round-trip lossless decode — decode(encode(x)) == normalize(x) over Turkish, English, and code-switching samples.
  2. skip_special_tokens=True preserves utility — punctuation, currency, emoji must survive; only <...>-formatted control tokens are removed.
  3. Compound emoji round-trip — VS16, ZWJ, regional-indicator pairs decode back to the original sequence.
  4. Turkish morpheme integrity — words like şehirdir, kitabımı, evlerinden decode without internal whitespace.
  5. UNK rate < 0.5% on a bilingual bench (line-by-line).
  6. Special token ID stability — IDs 0-7 (<unk>, <s>, </s>, <pad>, <|system|>, <|user|>, <|assistant|>, <|end|>).
  7. Case preservation — İstanbul, TÜRKİYE, ABD, AKP all round-trip without folding.

Current artifact: 60/60 PASS.
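
A minimal standalone check in the same spirit (not the shipped suite; names and fixtures in tests/test_tokenizer_correctness.py differ):

import pytest
from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")

@pytest.mark.parametrize("text", [
    "İstanbul'da güzel bir gün geçirdim.",
    "Türkiye'de %50 indirim, 100₺ ❤️",
])
def test_skip_special_tokens_preserves_utility(text):
    ids = tok.encode(text).ids
    # Only <...>-formatted control tokens may be dropped on decode.
    assert tok.decode(ids, skip_special_tokens=True) == text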

Benchmark Results

Evaluated on 5,000 Turkish sentences, 5,000 English sentences, and 500 code-switching sentences from the prepared corpus, against five reference tokenizers.

Compared Tokenizers

Tokenizer     Source                                      Vocab Size   Algorithm   Type
Multrenizer   This project                                    25,679   Unigram     Bilingual EN-TR, purpose-built
Kumru-2B      vngrs-ai/Kumru-2B                               50,176   BPE         Turkish LLM (VNGRS, Sep 2025, Mistral-based)
Turkcell-7B   TURKCELL/Turkcell-LLM-7b-v1                     48,351   BPE         Turkish LLM (Turkcell, Apr 2024, Mistral-based)
GPT-2         openai-community/gpt2                           50,257   BPE         English-centric baseline (OpenAI, 2019)
Qwen-3        Qwen/Qwen3-0.6B                                151,643   BPE         Multilingual (Alibaba, 2025)
Mistral-3.1   mistralai/Mistral-Small-3.1-24B-Base-2503      131,072   BPE/SP      Multilingual (Mistral AI, Mar 2025)

Fertility, Compression, and Token Count

Fertility is tokens per word (lower is better); compression is characters per token (higher is better). A computation sketch follows the table.

Metric                      Multrenizer   Kumru-2B   Turkcell-7B     GPT-2    Qwen-3   Mistral-3.1
Vocab Size                       25,679     50,176        48,351    50,257   151,643       131,072
TR Fertility                       1.63       1.65          1.92      3.79      2.62          2.38
EN Fertility                       1.51       2.15          1.55      1.31      1.37          1.38
TR Compression                     4.79       4.72          4.06      2.06      2.98          3.27
EN Compression                     4.18       2.94          4.07      4.82      4.61          4.58
TR Total Tokens (5K sent)       130,725    132,637       154,166   304,345   210,334       191,682
EN Total Tokens (5K sent)       155,848    221,420       160,121   135,235   141,275       142,196
Round-trip Lossless               11/11      11/11         11/11     11/11     11/11         11/11
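
How the two headline metrics are computed, sketched with the usual whitespace-word definition (benchmark.py may differ in detail):

def fertility(tok, lines):
    # Tokens per whitespace-delimited word (lower is better).
    words = sum(len(line.split()) for line in lines)
    tokens = sum(len(tok.encode(line, add_special_tokens=False).ids)
                 for line in lines)
    return tokens / words

def compression(tok, lines):
    # Characters per token (higher is better).
    chars = sum(len(line) for line in lines)
    tokens = sum(len(tok.encode(line, add_special_tokens=False).ids)
                 for line in lines)
    return chars / tokens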

Headline:

  • Best Turkish efficiency in this set on every TR metric (fertility, compression, total tokens), with the smallest vocabulary by a wide margin.
  • Competitive English — within ~15% of GPT-2 (an English-native baseline trained on far more English text) and ~10% of Qwen-3 and Mistral-3.1 on EN total tokens, ahead of both Turkish-focused baselines, all with the smallest vocabulary in the set.
  • Round-trip lossless on all 11 cases. This is a minimum-correctness bar; modern BPE tokenizers also pass it. The previous Multrenizer artifact did not (it dropped utility tokens via skip_special_tokens=True); fixing that is one of the main reasons this version exists.

Morphological Splitting

Total tokens needed to represent ten difficult Turkish words:

Tokenizer     Vocab Size   Total Tokens   Avg per Word
Kumru-2B          50,176             35            3.5
Multrenizer       25,679             36            3.6
Turkcell-7B       48,351             38            3.8
Mistral-3.1      131,072             71            7.1
Qwen-3           151,643             73            7.3
GPT-2             50,257            105           10.5

A few selected words:

İstanbul'da
  Multrenizer  [1 tok]   İstanbul'da                     ← single token, atomic
  Kumru-2B     [3 tok]   İstanbul + ' + da
  Turkcell-7B  [3 tok]   İstanbul + ' + da
  GPT-2        [5 tok]   Ä + ° + stanbul + 'd + a
  Qwen-3       [4 tok]   İ + stanbul + 'd + a
  Mistral-3.1  [4 tok]   İ + stanbul + 'd + a

Afyonkarahisarlılaştıramadıklarımızdan
  Multrenizer  [8 tok]   Afyonkarahisar + lı + laştı + ram + a + dıkları + mızda + n
  Kumru-2B     [8 tok]   Af + yonkarahisar + lı + laştır + ama + dık + larımız + dan
  Turkcell-7B  [9 tok]   Afyon + kar + ah + is + arlı + laştır + a + madık + larımızdan
  Qwen-3      [16 tok]   (16-piece byte-level fragmentation)
  Mistral-3.1 [16 tok]   (16-piece byte-level fragmentation)
  GPT-2       [21 tok]   (21-piece byte-level fragmentation)

düşünülebileceğini
  Multrenizer  [2 tok]   düşünül + ebileceğini
  Kumru-2B     [3 tok]   düşünül + ebil + eceğini
  Turkcell-7B  [4 tok]   düş + ünü + le + bileceğini
  GPT-2       [12 tok]   d + ü + ş + ü + n + ü + le + b + ile + ce + ğ + ini

Round-Trip Lossless Decode

11 representative inputs covering case preservation, apostrophe handling, currency, math/measurement symbols, VS16 / ZWJ / regional-indicator emoji, and mixed-content sentences. Multrenizer passes 11/11. All compared modern tokenizers also pass — the value here is being on the right side of the bar (the previous Multrenizer artifact wasn't), not winning a race.

Input                                                                Multrenizer  ...
İstanbul'da güzel bir gün geçirdim.                                    OK
TÜRKİYE büyük harflerle yazılır.                                       OK
It's a beautiful day, isn't it?                                        OK
Fiyat: 100₺ ve $50, €30 ve £25.                                        OK
Türkiye'de %50 indirim ve %25 KDV.                                     OK
Hava 22°C, yağış %30, basınç 1013hPa.                                  OK
❤️ ile 💔 farklı duygular ifade eder.                                   OK
🇹🇷 ve 🇺🇸 bayrakları yan yana asıldı.                                   OK
Geliştirici 👨‍💻 olarak çalışıyorum.                                    OK
Selam 👋! Türkiye'de %50 indirim 100₺.                                  OK

Code-Switching Tokenization

"Bu feature'ı implement ederken edge case'leri handle etmeyi unutmayalım."

  Multrenizer  [12 tok]  Bu | feature | 'ı | implement | ederken | edge | case | 'leri | handle | etmeyi | unutmay | alım.
  Kumru-2B     [20 tok]  Bu | fe | ature | ' | ı | imp | lement | ederken | ed | ge | cas | e | ' | leri | hand | le | etmeyi | unutma | yalım | .
  Turkcell-7B  [15 tok]  Bu | feature | ' | ı | implement | ederken | edge | case | ' | leri | handle | etmeyi | unut | mayalım | .

"merge'lemek istediğim branch conflict veriyor."

  Multrenizer  [ 7 tok]  merge'le | mek | istediğim | branch | conflict | veriyor | .
  Kumru-2B     [14 tok]  mer | ge | ' | lemek | istediğim | b | ran | ch | con | f | lic | t | veriyor | .
  Turkcell-7B  [ 8 tok]  merge | ' | lemek | istediğim | branch | conflict | veriyor | .
  GPT-2        [16 tok]  mer | ge | ' | lem | ek | is | ted | i | ğ | im | branch | conflict | ver | iy | or | .

Full benchmark output (all sections, all sentences, machine-readable) is in benchmark_results.json. Regenerate with python benchmark.py.

Project Structure

multrenizer/
├── multrenizer-tokenizer/     # Trained tokenizer artifact
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── prepare_data.py            # Corpus download (Wikipedia + OPUS-100 + synthetic CS)
├── train_tokenizer.py         # Tokenizer training script
├── benchmark.py               # Benchmark vs 5 reference tokenizers
├── benchmark_results.json     # Full benchmark output
├── tests/
│   └── test_tokenizer_correctness.py   # Round-trip + utility + ID-stability suite
├── RETRAIN_PLAN.md            # Design / pivot history (audit trail)
├── MINDAI_INTEGRATION.md      # Downstream integration recipe
├── requirements.txt
└── pyproject.toml

License

Apache 2.0
