Multrenizer

A bilingual Turkish-English Unigram tokenizer with lossless round-trip decode, case preservation, and a correct utility-token taxonomy — punctuation, currency, math symbols, and emoji are kept atomic in the vocabulary AND preserved through decode(skip_special_tokens=True).

Why Multrenizer?

Standard multilingual tokenizers commonly fail on Turkish in three ways: (1) they break agglutinative morphemes at suboptimal boundaries, (2) they fold case (İstanbul → istanbul), so a downstream model can never produce proper nouns again, and (3) they misclassify punctuation and emoji as "special tokens", which means decode(skip_special_tokens=True) silently strips them and the model output ends up missing commas, apostrophes, and 😀.

Multrenizer is built to fix all three without giving up Turkish efficiency.

Core design:

  • Case preserving. İstanbul stays İstanbul, TÜRKİYE stays TÜRKİYE. The model learns to produce real proper nouns, not lowercased shadows.
  • Lossless round-trip. Metaspace pipeline with prepend_scheme="first" plus AddedToken(normalized=True) → decode(encode(x)) == x for case, punctuation, currency, math, and compound emoji.
  • Correct utility taxonomy. Only model-control symbols (<s>, <|user|>, <|reserved_*|>, etc.) are flagged special=True. Punctuation, currency, math symbols, and emoji are atomic vocab entries with special=False — they survive skip_special_tokens=True (see the sketch after this list).
  • Compact vocabulary. ~26K total budget, Turkish-first morpheme coverage.
  • Code-switching aware. Trained on a TR/EN/CS interleave; mixed text like "merge'lemek istediğim branch" segments cleanly.
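
A minimal sketch of how the taxonomy maps onto the tokenizers library's AddedToken flag (assuming a recent tokenizers release where AddedToken exposes special); illustrative only, not the project's training code:

from tokenizers import Tokenizer, AddedToken

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")

# Control tokens carry special=True, so skip_special_tokens=True removes them.
tok.add_special_tokens([AddedToken("<|user|>", special=True)])

# Utility tokens are atomic vocab entries with special=False,
# so they survive decode(..., skip_special_tokens=True).
tok.add_tokens([AddedToken("😀", normalized=True, special=False)])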

Pipeline

Raw text
  -> Quote canonicalization (’ ‘ ʼ ' -> ')
  -> NFKC normalization
  -> Strip whitespace
  -> Pre-tokenizer: Metaspace(replacement="▁", prepend_scheme="first", split=True)
  -> Unigram model (~26K vocab)
  -> Decoder: Metaspace(prepend_scheme="first")  [mirror of pre-tokenizer]
  -> Post-processor (<s> ... </s>)

No lowercasing. No locale-specific I/i replacement (that would be a workaround for the Turkish dotted/dotless-I bug Python introduces only when you call .lower(), which we never do).
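
The stack above, sketched with the tokenizers library (a sketch assuming a recent tokenizers release with prepend_scheme support; the corpus path is hypothetical and details may differ from train_tokenizer.py):

from tokenizers import (Tokenizer, decoders, models, normalizers,
                        pre_tokenizers, processors, trainers)

tok = Tokenizer(models.Unigram())

# Quote canonicalization + NFKC + strip, in pipeline order. No Lowercase().
tok.normalizer = normalizers.Sequence([
    normalizers.Replace("’", "'"),
    normalizers.Replace("‘", "'"),
    normalizers.Replace("ʼ", "'"),
    normalizers.NFKC(),
    normalizers.Strip(),
])

# Metaspace pre-tokenizer and its mirror decoder.
tok.pre_tokenizer = pre_tokenizers.Metaspace(
    replacement="▁", prepend_scheme="first", split=True)
tok.decoder = decoders.Metaspace(
    replacement="▁", prepend_scheme="first", split=True)

# Core specials only; the full 544-token control taxonomy is added afterwards.
trainer = trainers.UnigramTrainer(
    vocab_size=26_000,
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
    unk_token="<unk>",
)
tok.train(["data/corpus.txt"], trainer)  # hypothetical corpus path

# <s> ... </s> wrapping; IDs looked up at runtime, never hardcoded.
tok.post_processor = processors.TemplateProcessing(
    single="<s> $A </s>",
    special_tokens=[("<s>", tok.token_to_id("<s>")),
                    ("</s>", tok.token_to_id("</s>"))],
)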

Quick Start

Installation

git clone https://github.com/fzengin19/multrenizer.git
cd multrenizer
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Use the shipped tokenizer locally

from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")

encoded = tok.encode("İstanbul'da %50 indirim 😀")
print(encoded.tokens)
# ['<s>', "▁İstanbul'da", '▁%', '50', '▁indirim', '▁', '😀', '</s>']

# Round-trip is lossless — case, punctuation, emoji all survive.
print(tok.decode(encoded.ids))
# "İstanbul'da %50 indirim 😀"

Load from Hugging Face

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("fzengin18/multrenizer")
ids = tok.encode("Türkiye'de %50 indirim, 100₺ ❤️", add_special_tokens=False)

# `skip_special_tokens=True` removes ONLY <...>-formatted control tokens.
# Punctuation, currency, and emoji are preserved.
print(tok.decode(ids, skip_special_tokens=True))
# "Türkiye'de %50 indirim, 100₺ ❤️"

Train from scratch

# 1. Download and prepare corpus (Wikipedia TR+EN, OPUS-100, synthetic CS)
python prepare_data.py --size large

# 2. Train tokenizer
python train_tokenizer.py --data-dir data/

# 3. Optional: push to Hugging Face Hub
python train_tokenizer.py --data-dir data/ \
  --repo-id fzengin18/multrenizer \
  --hf-token "$HF_TOKEN"

Run benchmarks

python benchmark.py --tr-lines 5000 --en-lines 5000

Architecture

Token Taxonomy

Tier                                                   Count   special flag   Behavior on skip_special_tokens=True
Named specials (<s>, <|user|>, <think>, ...)              32   true           Removed
Reserved (<|reserved_0|> ... <|reserved_511|>)           512   true           Removed
Utility — punctuation/symbols (, ! ' % $ ...)            169   false          Preserved
Utility — emoji (corpus-learned + curated fallback)     ~240   false          Preserved
Learned subwords (Unigram)                           ~24,700   false          Preserved

This is the central correctness fix vs the prior artifact, where 292 utility tokens were incorrectly flagged special=true and erased on every decode.

Vocabulary Budget

Target: 26,000. Actual shipped artifact: 25,679 (Unigram convergence undershoot is normal).

The model trained downstream should set its vocab_size to the tokenizer's actual size, not the 26K target.
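
For example (get_vocab_size is the standard tokenizers call; the path assumes the shipped artifact):

from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")
actual = tok.get_vocab_size()  # 25,679 for this artifact, not 26,000

# When configuring a downstream model, use the measured value:
# model_config.vocab_size = actual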

Special Tokens

Category      IDs      Tokens
Core          0-3      <unk> <s> </s> <pad>
Chat          4-8      <|system|> <|user|> <|assistant|> <|end|> <|sep|>
Reasoning     9-12     <think> </think> <|step|> <|reflection|>
Tool Use      13-16    <tool_call> </tool_call> <tool_response> </tool_response>
Code/FIM      17-20    <|code|> <|fim_prefix|> <|fim_middle|> <|fim_suffix|>
Language      21-22    <|tr|> <|en|>
RAG           23-24    <|context|> <|/context|>
Multi-modal   25-28    <|image|> <|audio|> <|video|> <|file|>
Structured    29-31    <|json|> <|table|> <|cite|>
Reserved      32-543   <|reserved_0|> ... <|reserved_511|>

Look these up at runtime with tokenizer.token_to_id("<...>") — do not hardcode IDs in downstream code.
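
For instance:

from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")

BOS_ID = tok.token_to_id("<s>")
EOS_ID = tok.token_to_id("</s>")
USER_ID = tok.token_to_id("<|user|>")
assert None not in (BOS_ID, EOS_ID, USER_ID)  # None means the token is absent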

Utility Tokens (special=False, atomic)

Category                                 Approx. count   Examples
Punctuation                                         38   . , ! ? ; : - ( ) [ ] { } / \ " ' ... – — … ‽ ‼ ⁇ ¿ ¡
Currency & business                                 23   ₺ $ € £ ¥ ₹ ₽ ¢ ฿ ₪ ₸ ₣ ₮ ₩ % ‰ ° § ¶ № @ # &
Math & science                                      33   ± × ÷ ≠ ≤ ≥ ≈ ∞ √ ∑ ∫ ∂ Δ π α β γ δ ε θ λ μ σ φ ω ∀ ∃ ∈ ⊂ ⊆ ∪ ∩ ⇔ ↦ ∝ ∇ ∮ ∧ ∨ ¬ Α Β Γ Σ Φ Ω
Programming digraphs                                11   => != == <= >= -> :: ** // && ||
Arrows & symbols                                    15   → ← ↑ ↓ ↔ ⇒ • · ★ ☆ ✓ ✗ © ® ™
Box drawing & UI                                    17   ─ │ ┌ ┐ └ ┘ ├ ┤ ┬ ┴ ┼ ▪ ▫ ◆ ▶ ◀ ▼ ▲
Typography                                          10   « » " " ' ' ‹ › „ ‚
Emoji — faces                                       70   😀 😂 🤣 😊 😍 🤔 😭 😡 💀 🤖
Emoji — hands                                       28   👋 👍 👎 👏 🙏 💪 ✊ ✌️
Emoji — hearts                                      18   ❤️ 💛 💚 💙 💜 🖤 💔
Emoji — symbols                                     36   🔥 ✨ ⭐ ✅ ❌ ⚠️ 💯 🚀
Emoji — objects                                     36   💻 📱 🎯 🏆 📊 ☕ 🔗 💰
Emoji — nature & common                             17   🌍 🎉 🎁 ⏰ 📅 📌 🚨 🛒 💳 📞 📧 🏻🏼🏽🏾🏿
Emoji — flags                                       26   🇹🇷 🇺🇸 🇬🇧 🇩🇪 🇫🇷 🇨🇳 🇰🇷 🇮🇳 🇸🇦 🇮🇱 🇷🇺 🇧🇷 🇨🇦 🇲🇽 🇦🇺 🇳🇱
Emoji — compounds (ZWJ/VS16 sequences)              12   ❤️ 👨‍💻 👩‍💻 🏃‍♂️ 🏃‍♀️ 👨‍👩‍👧 👨‍🏫 👩‍🏫 👍🏻–👍🏿

ZWJ (U+200D) and VS16 (U+FE0F) are NOT added as standalone tokens; they exist only inside whole compound-emoji sequences.
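
To illustrate (a sketch against the shipped artifact; the exact subword split may vary):

from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")

dev = "👨\u200d💻"  # MAN + ZWJ + LAPTOP: one compound emoji
ids = tok.encode(dev).ids
# The ZWJ never surfaces as a standalone token; the compound round-trips whole.
assert tok.decode(ids) == dev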

Data Mix

The released artifact is trained on the corpus produced by prepare_data.py --size large:

Stream           Approx. lines   Share   Source
Turkish          1.30M           ~59%    Wikipedia TR streaming
English          0.71M           ~32%    Wikipedia EN streaming
Code-switching   0.20M           ~9%     OPUS-100 EN-TR pairs + synthetic templates

An additional 5K synthetic emoji-rich CS lines are appended to ensure curated emojis make it into the learned vocabulary instead of the post-train fallback list.

Correctness Tests

The repo ships a pytest suite (tests/test_tokenizer_correctness.py) covering:

  1. Round-trip lossless decode — decode(encode(x)) == normalize(x) over Turkish, English, and code-switching samples.
  2. skip_special_tokens=True preserves utility — punctuation, currency, emoji must survive; only <...>-formatted control tokens are removed.
  3. Compound emoji round-trip — VS16, ZWJ, regional-indicator pairs decode back to the original sequence.
  4. Turkish morpheme integrity — words like şehirdir, kitabımı, evlerinden decode without internal whitespace.
  5. UNK rate < 0.5% on a bilingual bench (line-by-line).
  6. Special token ID stability — IDs 0-7 (<unk>, <s>, </s>, <pad>, <|system|>, <|user|>, <|assistant|>, <|end|>).
  7. Case preservation — İstanbul, TÜRKİYE, ABD, AKP all round-trip without folding.

Current artifact: 60/60 PASS.
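
A minimal standalone check in the same spirit (not the shipped suite; names and fixtures in tests/test_tokenizer_correctness.py differ):

import pytest
from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")

@pytest.mark.parametrize("text", [
    "İstanbul'da güzel bir gün geçirdim.",
    "Türkiye'de %50 indirim, 100₺ ❤️",
])
def test_skip_special_tokens_preserves_utility(text):
    ids = tok.encode(text).ids
    # Only <...>-formatted control tokens may be dropped on decode.
    assert tok.decode(ids, skip_special_tokens=True) == text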

Benchmark Results

Evaluated on 5,000 Turkish sentences, 5,000 English sentences, and 500 code-switching sentences from the prepared corpus, against five reference tokenizers.

Compared Tokenizers

Tokenizer     Source                                      Vocab Size   Algorithm   Type
Multrenizer   This project                                    25,679   Unigram     Bilingual EN-TR, purpose-built
Kumru-2B      vngrs-ai/Kumru-2B                               50,176   BPE         Turkish LLM (VNGRS, Sep 2025, Mistral-based)
Turkcell-7B   TURKCELL/Turkcell-LLM-7b-v1                     48,351   BPE         Turkish LLM (Turkcell, Apr 2024, Mistral-based)
GPT-2         openai-community/gpt2                           50,257   BPE         English-centric baseline (OpenAI, 2019)
Qwen-3        Qwen/Qwen3-0.6B                                151,643   BPE         Multilingual (Alibaba, 2025)
Mistral-3.1   mistralai/Mistral-Small-3.1-24B-Base-2503      131,072   BPE/SP      Multilingual (Mistral AI, Mar 2025)

Fertility, Compression, and Token Count

Fertility is tokens per word (lower is better); compression is characters per token (higher is better). A computation sketch follows the table.

Metric                      Multrenizer   Kumru-2B   Turkcell-7B     GPT-2    Qwen-3   Mistral-3.1
Vocab Size                       25,679     50,176        48,351    50,257   151,643       131,072
TR Fertility                       1.63       1.65          1.92      3.79      2.62          2.38
EN Fertility                       1.51       2.15          1.55      1.31      1.37          1.38
TR Compression                     4.79       4.72          4.06      2.06      2.98          3.27
EN Compression                     4.18       2.94          4.07      4.82      4.61          4.58
TR Total Tokens (5K sent)       130,725    132,637       154,166   304,345   210,334       191,682
EN Total Tokens (5K sent)       155,848    221,420       160,121   135,235   141,275       142,196
Round-trip Lossless               11/11      11/11         11/11     11/11     11/11         11/11
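
How the two headline metrics are computed, sketched with the usual whitespace-word definition (benchmark.py may differ in detail):

def fertility(tok, lines):
    # Tokens per whitespace-delimited word (lower is better).
    words = sum(len(line.split()) for line in lines)
    tokens = sum(len(tok.encode(line, add_special_tokens=False).ids)
                 for line in lines)
    return tokens / words

def compression(tok, lines):
    # Characters per token (higher is better).
    chars = sum(len(line) for line in lines)
    tokens = sum(len(tok.encode(line, add_special_tokens=False).ids)
                 for line in lines)
    return chars / tokens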

Headline:

  • Best Turkish efficiency in this set on every TR metric (fertility, compression, total tokens), with the smallest vocabulary by a wide margin.
  • Competitive English — within ~15% of GPT-2 (an English-native baseline trained on far more English text) and ~10% of Qwen-3 and Mistral-3.1 on EN total tokens, ahead of both Turkish-focused baselines, all with the smallest vocabulary in the set.
  • Round-trip lossless on all 11 cases. This is a minimum-correctness bar; modern BPE tokenizers also pass it. The previous Multrenizer artifact did not (it dropped utility tokens via skip_special_tokens=True); fixing that is one of the main reasons this version exists.

Morphological Splitting

Total tokens needed to represent ten difficult Turkish words:

Tokenizer     Vocab Size   Total Tokens   Avg per Word
Kumru-2B          50,176             35            3.5
Multrenizer       25,679             36            3.6
Turkcell-7B       48,351             38            3.8
Mistral-3.1      131,072             71            7.1
Qwen-3           151,643             73            7.3
GPT-2             50,257            105           10.5

A few selected words:

İstanbul'da
  Multrenizer  [1 tok]   İstanbul'da                     ← single token, atomic
  Kumru-2B     [3 tok]   İstanbul + ' + da
  Turkcell-7B  [3 tok]   İstanbul + ' + da
  GPT-2        [5 tok]   Ä + ° + stanbul + 'd + a
  Qwen-3       [4 tok]   İ + stanbul + 'd + a
  Mistral-3.1  [4 tok]   İ + stanbul + 'd + a

Afyonkarahisarlılaştıramadıklarımızdan
  Multrenizer  [8 tok]   Afyonkarahisar + lı + laştı + ram + a + dıkları + mızda + n
  Kumru-2B     [8 tok]   Af + yonkarahisar + lı + laştır + ama + dık + larımız + dan
  Turkcell-7B  [9 tok]   Afyon + kar + ah + is + arlı + laştır + a + madık + larımızdan
  Qwen-3      [16 tok]   (16-piece byte-level fragmentation)
  Mistral-3.1 [16 tok]   (16-piece byte-level fragmentation)
  GPT-2       [21 tok]   (21-piece byte-level fragmentation)

düşünülebileceğini
  Multrenizer  [2 tok]   düşünül + ebileceğini
  Kumru-2B     [3 tok]   düşünül + ebil + eceğini
  Turkcell-7B  [4 tok]   düş + ünü + le + bileceğini
  GPT-2       [12 tok]   d + ü + ş + ü + n + ü + le + b + ile + ce + ğ + ini

Round-Trip Lossless Decode

11 representative inputs covering case preservation, apostrophe handling, currency, math/measurement symbols, VS16 / ZWJ / regional-indicator emoji, and mixed-content sentences. Multrenizer passes 11/11. All compared modern tokenizers also pass — the value here is being on the right side of the bar (the previous Multrenizer artifact wasn't), not winning a race.

Input                                                                Multrenizer  ...
İstanbul'da güzel bir gün geçirdim.                                    OK
TÜRKİYE büyük harflerle yazılır.                                       OK
It's a beautiful day, isn't it?                                        OK
Fiyat: 100₺ ve $50, €30 ve £25.                                        OK
Türkiye'de %50 indirim ve %25 KDV.                                     OK
Hava 22°C, yağış %30, basınç 1013hPa.                                  OK
❤️ ile 💔 farklı duygular ifade eder.                                   OK
🇹🇷 ve 🇺🇸 bayrakları yan yana asıldı.                                   OK
Geliştirici 👨‍💻 olarak çalışıyorum.                                    OK
Selam 👋! Türkiye'de %50 indirim 100₺.                                  OK

Code-Switching Tokenization

"Bu feature'ı implement ederken edge case'leri handle etmeyi unutmayalım."

  Multrenizer  [12 tok]  Bu | feature | 'ı | implement | ederken | edge | case | 'leri | handle | etmeyi | unutmay | alım.
  Kumru-2B     [20 tok]  Bu | fe | ature | ' | ı | imp | lement | ederken | ed | ge | cas | e | ' | leri | hand | le | etmeyi | unutma | yalım | .
  Turkcell-7B  [15 tok]  Bu | feature | ' | ı | implement | ederken | edge | case | ' | leri | handle | etmeyi | unut | mayalım | .

"merge'lemek istediğim branch conflict veriyor."

  Multrenizer  [ 7 tok]  merge'le | mek | istediğim | branch | conflict | veriyor | .
  Kumru-2B     [14 tok]  mer | ge | ' | lemek | istediğim | b | ran | ch | con | f | lic | t | veriyor | .
  Turkcell-7B  [ 8 tok]  merge | ' | lemek | istediğim | branch | conflict | veriyor | .
  GPT-2        [16 tok]  mer | ge | ' | lem | ek | is | ted | i | ğ | im | branch | conflict | ver | iy | or | .

Full benchmark output (all sections, all sentences, machine-readable) is in benchmark_results.json. Regenerate with python benchmark.py.

Project Structure

multrenizer/
├── multrenizer-tokenizer/     # Trained tokenizer artifact
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── prepare_data.py            # Corpus download (Wikipedia + OPUS-100 + synthetic CS)
├── train_tokenizer.py         # Tokenizer training script
├── benchmark.py               # Benchmark vs 5 reference tokenizers
├── benchmark_results.json     # Full benchmark output
├── tests/
│   └── test_tokenizer_correctness.py   # Round-trip + utility + ID-stability suite
├── RETRAIN_PLAN.md            # Design / pivot history (audit trail)
├── MINDAI_INTEGRATION.md      # Downstream integration recipe
├── requirements.txt
└── pyproject.toml

License

Apache 2.0
