lfm2_350m_commit_diff_summarizer (LoRA)

A lightweight helper model that turns Git diffs into Conventional Commit–style messages. It outputs strict JSON with a short title (≤ 65 chars) and up to 3 bullets, so CLI tools and agents can parse it deterministically.

Model Details

Model Description

  • Purpose: Summarize git diff patches into concise, Conventional Commit–compliant titles with optional bullets.

  • I/O format:

    • Input: prompt containing the diff (plain text).
    • Output: JSON object: {"title": "...", "bullets": ["...", "..."]}.
  • Model type: LoRA adapter for causal LM (text generation)

  • Language(s): English (commit message conventions)

  • Finetuned from: unsloth/LFM2-350M-unsloth-bnb-4bit (a 4-bit quantization of LiquidAI/LFM2-350M; trained with QLoRA)

Model Sources

  • Repository: This model card + adapter on the Hub under ethanke/lfm2_350m_commit_diff_summarizer

Uses

Direct Use

  • Convert patch diffs into Conventional Commit messages for PR titles, commits, and changelogs.
  • Provide human-readable summaries in agent UIs with a predictable JSON structure.

Recommendations

  • Enforce JSON validation; if the output is invalid, retry with a JSON-repair prompt.
  • Keep a regex gate for Conventional Commit titles in your pipeline (a sketch covering both checks follows this list).
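
A minimal validation gate might look like the following sketch. The helper name and error messages are illustrative; the title regex is the same one used to filter the training data (see Training Data below).

import json
import re

# Conventional Commit title pattern (same regex used to filter the training data)
CC_TITLE = re.compile(
    r"^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(\([^)]+\))?(!)?:\s.+$"
)

def validate_commit_json(raw: str) -> dict:
    """Parse model output and enforce the schema; raise ValueError on failure."""
    obj = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    title = obj.get("title", "")
    bullets = obj.get("bullets", [])
    if not isinstance(title, str) or not (0 < len(title) <= 65):
        raise ValueError("title missing, non-string, or longer than 65 chars")
    if not CC_TITLE.match(title):
        raise ValueError("title is not Conventional Commit compliant")
    if not isinstance(bullets, list) or len(bullets) > 3:
        raise ValueError("bullets must be a list with at most 3 items")
    return obj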

How to Get Started

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch, json

BASE = "unsloth/LFM2-350M-unsloth-bnb-4bit"
ADAPTER = "ethanke/lfm2_350m_commit_diff_summarizer"  # or your own fine-tuned adapter

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.float16)

tok = AutoTokenizer.from_pretrained(BASE, use_fast=True)
mdl = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")
mdl = PeftModel.from_pretrained(mdl, ADAPTER)
mdl.eval()  # disable dropout; inference only

diff = "...your git diff text..."
prompt = (
  "You are a commit message summarizer.\n"
  "Return a concise JSON object with fields 'title' (<=65 chars) and 'bullets' (0-3 items).\n"
  "Follow the Conventional Commit style for the title.\n\n"
  "### DIFF\n" + diff + "\n\n### OUTPUT JSON\n"
)

inputs = tok(prompt, return_tensors="pt").to(mdl.device)
with torch.no_grad():
    out = mdl.generate(**inputs, max_new_tokens=200, do_sample=False)

# decode only the newly generated tokens, not the echoed prompt
gen = out[0][inputs["input_ids"].shape[-1]:]
text = tok.decode(gen, skip_special_tokens=True)

# naive JSON extraction: first "{" to last "}" of the generated text
js = text[text.find("{"): text.rfind("}") + 1]
obj = json.loads(js)
print(obj)
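
If the bare json.loads(js) call above fails, a single repair retry often recovers the output. This continues from the snippet above; the repair-prompt wording is an assumption, not part of the training format:

try:
    obj = json.loads(js)
except json.JSONDecodeError:
    # ask the model to fix its own output, then re-extract
    repair = (
        prompt + text +
        "\n\nThe JSON above is invalid. Return only the corrected JSON object:\n"
    )
    inputs = tok(repair, return_tensors="pt").to(mdl.device)
    with torch.no_grad():
        out = mdl.generate(**inputs, max_new_tokens=200, do_sample=False)
    text = tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
    obj = json.loads(text[text.find("{"): text.rfind("}") + 1])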

Training Details

Training Data

  • Dataset: Maxscha/commitbench (diff → commit message).
  • Filtering: kept only samples whose first non-empty line of the message matches Conventional Commits (a filter sketch follows this list): ^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)(\([^)]+\))?(!)?:\s.+$
  • Note: The dataset card indicates non-commercial licensing. Confirm before commercial deployment.
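
The filtering step might look like this sketch; the "message" field name is an assumption, so check the CommitBench dataset card for the exact column names:

import re
from datasets import load_dataset

CC_TITLE = re.compile(
    r"^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(\([^)]+\))?(!)?:\s.+$"
)

ds = load_dataset("Maxscha/commitbench", split="train")

def first_line_is_cc(example):
    # keep only samples whose first non-empty message line is CC-compliant
    first = next((l for l in example["message"].splitlines() if l.strip()), "")
    return bool(CC_TITLE.match(first))

ds = ds.filter(first_line_is_cc)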

Training Procedure

  • Method: Supervised fine-tuning (SFT) with TRL SFTTrainer + QLoRA (PEFT).

  • Prompting: Instruction + ### DIFF + ### OUTPUT JSON target (title/bullets).

  • Precision: fp16 compute on 4-bit base.

  • Hyperparameters (v0.1; sketched as code after this list):

    • max_length=2048, per_device_train_batch_size=2, grad_accum=4
    • lr=2e-4, scheduler=cosine, warmup_ratio=0.03
    • epochs=1 over capped subset
    • LoRA: r=16, alpha=32, dropout=0.05, targets: q/k/v/o + MLP proj
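
Those hyperparameters map onto TRL/PEFT configs roughly as follows. The target module names and output_dir are assumptions; inspect the base model's named_modules() to confirm LFM2's actual projection names before training:

from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    # assumed names for the q/k/v/o and MLP projections; verify against
    # the real LFM2 module names via model.named_modules()
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="lfm2-commit-summarizer",  # assumption
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=1,
    max_length=2048,  # called max_seq_length in older TRL releases
    fp16=True,
)

# trainer = SFTTrainer(model=model, args=args, train_dataset=train_ds, peft_config=lora)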

Evaluation

  • Validation: filtered split from CommitBench.

  • Metrics (example run):

    • eval_loss ≈ 1.18 → perplexity = exp(eval_loss) ≈ 3.26
    • eval_mean_token_accuracy ≈ 0.77
    • Suggested task metrics: JSON validity rate, CC-title compliance, title length ≤ 65 chars, bullets ≤ 3 (a measurement sketch follows).
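
A sketch for computing those task metrics over a list of raw model outputs, reusing the CC_TITLE regex from the validation sketch above (function and key names are illustrative):

import json

def task_metrics(outputs):
    """Fraction of outputs passing each suggested check."""
    n = max(len(outputs), 1)
    valid = cc = length_ok = bullets_ok = 0
    for raw in outputs:
        try:
            obj = json.loads(raw[raw.find("{"): raw.rfind("}") + 1])
        except ValueError:  # covers json.JSONDecodeError too
            continue
        valid += 1
        title = str(obj.get("title", ""))
        bl = obj.get("bullets", [])
        cc += bool(CC_TITLE.match(title))
        length_ok += len(title) <= 65
        bullets_ok += isinstance(bl, list) and len(bl) <= 3
    return {"json_validity": valid / n, "cc_title_compliance": cc / n,
            "title_len_le_65": length_ok / n, "bullets_le_3": bullets_ok / n}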

Environmental Impact

  • Hardware: 1× NVIDIA RTX 3060 12 GB (local)
  • Hours used: ~2 h (prototype)

Technical Specifications

  • Architecture: LFM2-350M (decoder-only) + LoRA adapter
  • Libraries: transformers, trl, peft, bitsandbytes, datasets, unsloth

Contact

  • Open an issue on the Hub repo or message ethanke on Hugging Face.

Framework versions

  • PEFT 0.17.1
  • TRL (SFTTrainer)
  • Transformers (recent version)