Liquid AI
Try LFM • Documentation • LEAP

LFM2-2.6B-Transcript-ONNX

ONNX export of LFM2-2.6B-Transcript for cross-platform inference.

LFM2-2.6B-Transcript is optimized for processing and summarizing meeting transcripts, extracting key points, action items, and decisions from conversational text.

Recommended Variants

Precision   Size     Platform         Use Case
Q4          ~2.0GB   WebGPU, Server   Recommended for most uses
FP16        ~4.8GB   WebGPU, Server   Higher quality
Q8          ~3.0GB   Server only      Balance of quality and size
  • WebGPU: Use Q4 or FP16 (Q8 not supported)
  • Server: All variants supported
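
The variant names above map directly onto the files listed under Model Files below; as a purely illustrative lookup (filenames follow that listing):

# Illustrative only: pick the ONNX graph file for the variant chosen from the table above.
VARIANT_TO_FILE = {
    "q4":   "onnx/model_q4.onnx",    # WebGPU + server, recommended
    "fp16": "onnx/model_fp16.onnx",  # WebGPU + server, higher quality
    "q8":   "onnx/model_q8.onnx",    # server only
}
model_file = VARIANT_TO_FILE["q4"]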

Model Files

onnx/
├── model.onnx              # FP32 model graph
├── model.onnx_data*        # FP32 weights
├── model_fp16.onnx         # FP16 model graph
├── model_fp16.onnx_data*   # FP16 weights
├── model_q4.onnx           # Q4 model graph (recommended)
├── model_q4.onnx_data      # Q4 weights
├── model_q8.onnx           # Q8 model graph
└── model_q8.onnx_data      # Q8 weights

* Large models (>2GB) split weights across multiple files:
  model.onnx_data, model.onnx_data_1, model.onnx_data_2, etc.
  All data files must be in the same directory as the .onnx file.
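
To fetch a variant together with all of its split weight files in one call, huggingface_hub's snapshot_download with an allow_patterns filter also works; a minimal sketch (the Q4 pattern is just an example):

from huggingface_hub import snapshot_download

# Downloads model_q4.onnx plus every matching model_q4.onnx_data* shard into one snapshot directory.
local_dir = snapshot_download(
    "LiquidAI/LFM2-2.6B-Transcript-ONNX",
    allow_patterns=["onnx/model_q4.onnx*"],
)
model_path = f"{local_dir}/onnx/model_q4.onnx"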

Python

Installation

pip install onnxruntime transformers numpy huggingface_hub
# or with GPU support:
pip install onnxruntime-gpu transformers numpy huggingface_hub
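
To check which execution providers your installed build exposes (the onnxruntime-gpu wheel adds CUDA, the CPU wheel does not), a quick sanity check:

import onnxruntime as ort

# CPU-only builds list just CPUExecutionProvider; the GPU wheel also lists CUDAExecutionProvider.
print(ort.__version__)
print(ort.get_available_providers())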

Inference

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download, list_repo_files
from transformers import AutoTokenizer

# Download model (Q4 recommended)
model_id = "LiquidAI/LFM2-2.6B-Transcript-ONNX"
model_path = hf_hub_download(model_id, "onnx/model_q4.onnx")

# Download all data files (handles multiple splits for large models)
for f in list_repo_files(model_id):
    if f.startswith("onnx/model_q4.onnx_data"):
        hf_hub_download(model_id, f)

# Load model and tokenizer
session = ort.InferenceSession(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Prepare chat input
messages = [{"role": "user", "content": "Summarize this meeting transcript: ..."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = np.array([tokenizer.encode(prompt, add_special_tokens=False)], dtype=np.int64)

# Initialize KV cache
ONNX_DTYPE = {"tensor(float)": np.float32, "tensor(float16)": np.float16, "tensor(int64)": np.int64}
cache = {}
for inp in session.get_inputs():
    if inp.name in {"input_ids", "attention_mask", "position_ids"}:
        continue
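    # Symbolic (string) dims default to 1 (e.g. batch); any dim named like "sequence" starts at 0 so the cache is empty before prefill.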
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    for i, d in enumerate(inp.shape):
        if isinstance(d, str) and "sequence" in d.lower():
            shape[i] = 0
    cache[inp.name] = np.zeros(shape, dtype=ONNX_DTYPE.get(inp.type, np.float32))

# Check if model uses position_ids
input_names = {inp.name for inp in session.get_inputs()}
use_position_ids = "position_ids" in input_names

# Generate tokens
seq_len = input_ids.shape[1]
generated_tokens = []

for step in range(100):  # max tokens
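    # Step 0 is the prefill pass over the full prompt; later steps feed only the newest token and reuse the cache.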
    if step == 0:
        ids = input_ids
        pos = np.arange(seq_len, dtype=np.int64).reshape(1, -1)
    else:
        ids = np.array([[generated_tokens[-1]]], dtype=np.int64)
        pos = np.array([[seq_len + len(generated_tokens) - 1]], dtype=np.int64)

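    # The attention mask spans the prompt plus every token generated so far; earlier context comes from the cache.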
    attn_mask = np.ones((1, seq_len + len(generated_tokens)), dtype=np.int64)
    feed = {"input_ids": ids, "attention_mask": attn_mask, **cache}
    if use_position_ids:
        feed["position_ids"] = pos

    outputs = session.run(None, feed)
    next_token = int(np.argmax(outputs[0][0, -1]))
    generated_tokens.append(next_token)

    # Update cache
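    # Outputs include conv states (present_conv*) and attention KV caches (present.*);
    # rename each back to its matching past_* input so the next step can reuse it.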
    for i, out in enumerate(session.get_outputs()[1:], 1):
        name = out.name.replace("present_conv", "past_conv").replace("present.", "past_key_values.")
        if name in cache:
            cache[name] = outputs[i]

    if next_token == tokenizer.eos_token_id:
        break

print(tokenizer.decode(generated_tokens, skip_special_tokens=True))
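
If you installed onnxruntime-gpu, you can request a CUDA-backed session and fall back to CPU when no GPU is usable; the rest of the script is unchanged. A minimal sketch using standard ONNX Runtime session options (nothing model-specific):

# Prefer CUDA when the GPU build and a device are available, otherwise fall back to CPU.
available = ort.get_available_providers()
providers = [p for p in ("CUDAExecutionProvider", "CPUExecutionProvider") if p in available]
session = ort.InferenceSession(model_path, providers=providers)
print("Active providers:", session.get_providers())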

WebGPU (Browser)

Installation

npm install @huggingface/transformers

Enable WebGPU

WebGPU is required for browser inference. To enable:

  1. Chrome/Edge: Navigate to chrome://flags/#enable-unsafe-webgpu, enable, and restart
  2. Verify: Check chrome://gpu for "WebGPU" status
  3. Test: Run navigator.gpu.requestAdapter() in DevTools console

Inference

import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";

const modelId = "LiquidAI/LFM2-2.6B-Transcript-ONNX";

// Load model and tokenizer
const tokenizer = await AutoTokenizer.from_pretrained(modelId);
const model = await AutoModelForCausalLM.from_pretrained(modelId, {
  device: "webgpu",
  dtype: "q4",  // or "fp16"
});

// Prepare input
const messages = [{ role: "user", content: "Summarize this meeting transcript: ..." }];
const input = tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  return_dict: true,
});

// Generate with streaming
const streamer = new TextStreamer(tokenizer, { skip_prompt: true });
const output = await model.generate({
  ...input,
  max_new_tokens: 256,
  do_sample: false,
  streamer,
});

console.log(tokenizer.decode(output[0], { skip_special_tokens: true }));

WebGPU Notes

  • Supported: Q4, FP16 (Q8 not supported on WebGPU)

License

This model is released under the LFM 1.0 License.
