# DiffutronLM-0.3B-1st-Stage
DiffutronLM-0.3B-1st-Stage is an intermediate checkpoint of the Diffutron series, a parameter-efficient Masked Diffusion Language Model (MDLM) designed for the Turkish language.
This specific model represents the completion of the first stage of instruction fine-tuning. It has been trained to grasp the fundamentals of instruction-following in Turkish, serving as a robust foundation before more complex, domain-specific specialization (which is handled in the final Instruct model).
## Model Details
- Model Type: Masked Diffusion Language Model (MDLM)
- Base Architecture: jhu-clsp/mmBERT-base (Multilingual Encoder)
- Language: Turkish
- Parameter Count: 307M (0.3B)
- Context Length: 256 tokens
- Training Libraries: dllm, PyTorch
- Status: Intermediate Checkpoint (Stage 1 SFT)
## Training Pipeline for This Checkpoint
Diffutron replaces traditional next-token autoregressive generation with a discrete diffusion process, generating text by iteratively refining sequences in parallel. To reach this checkpoint, the model underwent two main phases:
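The iterative parallel refinement can be illustrated with a toy loop (a minimal sketch in plain Python; the `denoise_step` function and the hard-coded target tokens are stand-ins for the real model's predictions, not dllm's API):

```python
import random

MASK = "[MASK]"
TARGET = ["diffutron", "üretir", "türkçe", "metin"]  # toy stand-in for model predictions

def denoise_step(seq, reveal_count):
    """Reveal `reveal_count` masked positions in parallel (a toy stand-in
    for one reverse-diffusion step of the real model)."""
    masked = [i for i, tok in enumerate(seq) if tok == MASK]
    for i in random.sample(masked, min(reveal_count, len(masked))):
        seq[i] = TARGET[i]  # the real model predicts a token for position i here
    return seq

# Start from an all-mask sequence and refine over a few steps,
# instead of emitting one token at a time left to right.
seq = [MASK] * len(TARGET)
steps = 4
per_step = len(seq) // steps
for _ in range(steps):
    seq = denoise_step(seq, per_step)

print(seq)  # all positions are filled after `steps` iterations
```

Unlike autoregressive decoding, every step can commit multiple positions anywhere in the sequence, which is what allows the step count (rather than the sequence length) to bound the number of forward passes.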
### 1. Continual Pre-training (CPT)
The multilingual backbone was adapted to Turkish using a high-rank LoRA strategy (r=256, α=256) on ~2 million sequences sourced from Havadis, Temiz-OSCAR, and Turkish Wikipedia. This effectively modeled Turkish morphological nuances without catastrophic forgetting.
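As a back-of-the-envelope check on the adapter size: a rank-256 LoRA on a single 768-dimensional projection (mmBERT-base's hidden size) adds two low-rank factors, and with α = r the LoRA scaling factor α/r is 1.0. This is a sketch of the arithmetic only; which weight matrices were targeted is not stated above and is left open here.

```python
hidden = 768      # mmBERT-base hidden size
r, alpha = 256, 256

# LoRA replaces a frozen weight W (hidden x hidden) with
# W + (alpha / r) * B @ A, where A is (r x hidden) and B is (hidden x r).
params_per_matrix = 2 * hidden * r   # trainable params added per adapted matrix
scaling = alpha / r                  # strength applied to the low-rank update

print(params_per_matrix)  # 393216 per adapted weight matrix
print(scaling)            # 1.0 -> updates applied at full strength
```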
### 2. Stage 1: Foundational Instruction Tuning
Following CPT, the model underwent full supervised fine-tuning (SFT) to align it with human intent.
- Dataset: metunlp/LlamaTurk-Instruction-Set
- Objective: Introduce the model to a broad range of general instructions and establish basic response coherence.
- Hyperparameters: 20 Epochs, Batch Size 16, AdamW optimizer (lr=1e-4), Max Sequence Length 256.
(Note: For the most advanced instruction-following capabilities, including complex reasoning, we recommend using the final DiffutronLM-0.3B-Instruct model, which includes a second stage of tuning on InstrucTurca.)
## Evaluation Results
Despite being an intermediate checkpoint, the 1st-Stage model demonstrates competitive performance against much larger autoregressive baselines on several tasks of the CETVEL Benchmark Suite.
| Benchmark | Diffutron-1st-Stage (0.3B) | Diffutron-2nd-Stage (0.3B) | TURNA (1.1B) | Kumru (2B) | Kanarya (2B) | Llama-3.2 (3B) | Trendyol (7B) | Aya-101 (13B) |
|---|---|---|---|---|---|---|---|---|
| Belebele_TR | 22.22 | 27.00 | 22.56 | 29.00 | 28.11 | 55.78 | 36.22 | 22.89 |
| EXAMS_TR | 25.95 | 27.74 | 23.66 | 30.03 | 30.03 | 26.21 | 28.50 | 22.90 |
| IronyTR | 50.67 | 52.00 | 48.33 | 51.00 | 50.00 | 50.17 | 50.00 | 52.17 |
| News_Cat | 23.20 | 32.40 | 32.80 | 26.40 | 66.80 | 64.00 | 81.20 | 20.00 |
| MNLI_TR | 33.29 | 32.81 | 34.94 | 36.42 | 33.40 | 34.76 | 35.19 | 27.90 |
| STS_TR | 17.77 | 18.78 | 14.21 | 11.75 | 12.91 | 12.91 | 15.52 | 16.97 |
| XCOPA_TR | 53.80 | 52.00 | 55.80 | 54.00 | 64.20 | 54.60 | 61.00 | 59.60 |
| Average | 32.41 | 34.68 | 33.19 | 34.09 | 40.78 | 42.63 | 43.95 | 31.78 |
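The Average row is the unweighted mean of the seven benchmark rows; for the 1st-Stage column, for example:

```python
# Per-benchmark scores for Diffutron-1st-Stage (0.3B), from the table above.
scores = {
    "Belebele_TR": 22.22, "EXAMS_TR": 25.95, "IronyTR": 50.67,
    "News_Cat": 23.20, "MNLI_TR": 33.29, "STS_TR": 17.77,
    "XCOPA_TR": 53.80,
}

avg = sum(scores.values()) / len(scores)
print(round(avg, 2))  # 32.41, matching the table's Average row
```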
## Usage
Inference requires generating text via a discrete diffusion process rather than causal next-token prediction. We recommend using the dllm library.
Recommended Generation Parameters:
- Steps: 64 to 128
- Temperature: 0.1
- Block Length: 32
- Repetition Penalty: 1.2
- Remask Strategy: `low_conf`
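These parameters map onto a block-wise decoding loop roughly as follows (a minimal sketch with a dummy confidence model and smaller step counts for brevity; temperature and repetition penalty are omitted, and dllm's actual API will differ):

```python
import random

MASK = None

def model_confidences(block):
    """Stand-in for the real model: returns a (token, confidence) pair
    for every position in the block."""
    return [(f"tok{i}", random.random()) for i in range(len(block))]

def decode_block(block_len=32, steps=8, remask_frac=0.5):
    block = [MASK] * block_len
    for _ in range(steps):
        preds = model_confidences(block)
        # "low_conf" remasking: keep the most confident predictions and
        # remask the lowest-confidence fraction so they are re-predicted.
        ranked = sorted(range(block_len), key=lambda i: preds[i][1])
        keep = set(ranked[int(block_len * remask_frac):])
        block = [preds[i][0] if i in keep else MASK for i in range(block_len)]
        remask_frac *= 0.5  # anneal: remask fewer positions each step
    # Final pass: fill any still-masked positions with the best guess.
    preds = model_confidences(block)
    return [preds[i][0] if tok is MASK else tok for i, tok in enumerate(block)]

out = decode_block()
print(len(out), all(t is not None for t in out))  # 32 True
```

With the recommended settings, a 256-token generation would run this loop over eight 32-token blocks, spending 64-128 refinement steps in total.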
## ⚠️ Limitations
- Intermediate State: This model has not undergone the final specialization phase and may struggle with highly complex or multi-turn instructions compared to the final Instruct model.
- Context Window: Restricted to a 256-token context window.
- Multilingual Backbone: Inherits representations from a multilingual encoder, not a natively trained Turkish foundation model.
## Citation
```bibtex
@misc{diffutron2026,
  author = {Kocabay, Şuayp Talha and Akkuş, Talha Rüzgar},
  title = {Diffutron: A Masked Diffusion Language Model for Turkish Language},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/collections/diffutron/diffutronlm}}
}
```