Baseer-Nakba HTR: A State-of-the-Art VLM for Arabic Handwritten Text Recognition

Overview

This repository contains the model weights and inference pipeline for our submission to the NAKBA NLP 2026 Arabic Handwritten Text Recognition (HTR) competition.

Our approach adapts the 3B-parameter Baseer Vision-Language Model (VLM) to effectively parse and recognize highly cursive, historical Arabic manuscripts. Through a progressive training pipeline, domain-matched data augmentation, and advanced checkpoint merging, this unified model mitigates the challenges of varying writer styles, age-related document degradation, and morphological complexity.

To try our Baseer model for document extraction, please visit: Baseer, the state-of-the-art model for Arabic document extraction.


πŸ† Competition Results

Our final model (Misraj AI) secured 1st place on the official Nakba hidden test set leaderboard.

| Rank | Team | CER | WER |
|------|------|-----|-----|
| 🥇 1st | Misraj AI | 0.0790 | 0.2440 |
| 🥈 2nd | Oblevit | 0.0925 | 0.3268 |
| 🥉 3rd | 3reeq | 0.0938 | 0.2996 |
| 4th | Latent Narratives | 0.1050 | 0.3106 |
| 5th | Al-Warraq | 0.1142 | 0.3780 |
| 6th | Not Gemma | 0.1217 | 0.3063 |
| 7th | NAMAA-Qari | 0.1950 | 0.5194 |
| 8th | Fahras | 0.2269 | 0.5223 |
| — | Baseline | 0.3683 | 0.6905 |
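For reference, CER and WER as reported above are conventionally computed as Levenshtein edit distance normalized by reference length, at the character and word level respectively. A minimal, stdlib-only sketch (not the official scoring script):

```python
# Sketch of the standard CER/WER metrics: edit distance over characters
# (CER) or whitespace-split tokens (WER), divided by reference length.

def edit_distance(ref, hyp):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)
```

The official evaluation may differ in normalization details (e.g. handling of diacritics or punctuation).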

Training Methodology

Our model was trained using a multi-stage Supervised Fine-Tuning (SFT) curriculum.

  1. Data Augmentation: The Muharaf enhancement dataset was converted to grayscale to match the visual complexity and tonal distribution of the Nakba competition data.
  2. Decoder-Only SFT: We first trained the text decoder autoregressively on the structurally similar Muharaf dataset to condition the language modeling head.
  3. Full Encoder-Decoder Tuning: We subsequently unfroze the vision encoder and trained the full architecture on the Nakba dataset using differential learning rates, a key step that yielded a >5% improvement in WER over decoder-only tuning.
  4. Checkpoint Merging: To stabilize predictions and maximize generalization, we merged our top-performing checkpoints (Epoch 1 and Epoch 5) using SLERP interpolation.
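The differential-learning-rate step can be sketched as two optimizer parameter groups, one per sub-module. This is an illustrative sketch, not the authors' code; the name prefixes `vision_encoder.` and `decoder.` are assumptions about the model's parameter naming:

```python
# Sketch: split (name, param) pairs into two AdamW parameter groups so the
# vision encoder trains at a smaller learning rate (9e-6) than the text
# decoder (1e-4), matching the full-tuning stage described above.
# The "vision_encoder." prefix is a hypothetical naming convention.

ENCODER_LR = 9e-6
DECODER_LR = 1e-4

def build_param_groups(named_params):
    """Partition parameters by name prefix into encoder/decoder LR groups."""
    encoder, decoder = [], []
    for name, param in named_params:
        (encoder if name.startswith("vision_encoder.") else decoder).append(param)
    return [
        {"params": encoder, "lr": ENCODER_LR},
        {"params": decoder, "lr": DECODER_LR},
    ]

# With PyTorch this list would be passed directly to the optimizer:
#   optimizer = torch.optim.AdamW(
#       build_param_groups(model.named_parameters()), weight_decay=0.01)
```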

Training Hyperparameters

All supervised experiments were conducted with standardized hyperparameters across configurations.

| Parameter | Value |
|-----------|-------|
| Hardware | 2× NVIDIA H100 GPUs |
| Base Model | 3B-parameter Baseer |
| Epochs | 5 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Learning Rate Schedule | Cosine |
| Batch Size | 128 |
| Max Sequence Length | 1200 tokens |
| Input Image Resolution | 644 × 644 pixels |
| Decoder-Only Learning Rate | 1e-4 |
| Encoder Learning Rate | 9e-6 |
| Decoder Learning Rate (Full Tuning) | 1e-4 |
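The grayscale conversion (used to match the Muharaf data to Nakba's tonal distribution) and the 644 × 644 input resolution above suggest a simple preprocessing step. A minimal sketch with Pillow, assuming bicubic resizing; the actual pipeline may differ:

```python
# Illustrative preprocessing: convert a manuscript scan to grayscale and
# resize it to the model's 644 x 644 input resolution. The resampling
# filter (bicubic) is an assumption, not stated in the model card.
from PIL import Image

def preprocess(img: Image.Image, size=(644, 644)) -> Image.Image:
    """Grayscale + resize, matching the training-time input format."""
    return img.convert("L").resize(size, Image.BICUBIC)
```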

Image Examples

The model works reliably on images from the Nakba dataset and visually similar historical manuscripts.



Merge Method

This model was merged using the SLERP merge method.

Models Merged

  • Baseer_Nakba_ep_1
  • Baseer_Nakba_ep_5

Configuration

```yaml
merge_method: slerp
base_model: Baseer_Nakba_ep_1
models:
  - model: Baseer_Nakba_ep_1
  - model: Baseer_Nakba_ep_5
parameters:
  t:
    - value: 0.50
dtype: bfloat16
```
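For intuition, SLERP interpolates along the great circle between two weight vectors rather than along the straight line, preserving vector magnitude better than plain averaging. A minimal sketch of the math (not the merge tool's internals); real merges apply this per tensor in bfloat16:

```python
# Minimal SLERP sketch: spherical linear interpolation between two
# flattened weight vectors at interpolation factor t (t = 0.50 in the
# configuration above). Falls back to linear interpolation when the
# vectors are nearly parallel, where the formula is ill-conditioned.
import math

def slerp(v0, v1, t):
    """Interpolate along the great circle between vectors v0 and v1."""
    dot = sum(a * b for a, b in zip(v0, v1))
    n0 = math.sqrt(sum(a * a for a in v0))
    n1 = math.sqrt(sum(b * b for b in v1))
    cos_theta = max(-1.0, min(1.0, dot / (n0 * n1)))
    theta = math.acos(cos_theta)
    if theta < 1e-6:  # nearly parallel: plain lerp is numerically safer
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]
```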

Citation

If you use this model or find our work helpful, please consider citing our paper:

```bibtex
@inproceedings{misrajai2026nakba,
  title     = {Adapting Vision-Language Models for Historical Arabic Handwritten Text Recognition},
  author    = {Misraj AI},
  booktitle = {Nakba OCR Competition, NLP 2026},
  year      = {2026}
}
```
