Apple MLX for AI/Large Language Models—Day One
Author: Uche Ogbuji
Rev 1, Dec 2025 ← Rev 0, Apr 2024
When I wrote the first revision of this article I had been using llama.cpp on Apple Silicon for some six months, and my brother, Chimezie, had been nudging me to give MLX a go. When I finally set aside time to get started, my initial goal was to add support for MLX model loading & usage in OgbujiPT. Back then MLX was rough around the edges, and yet unmistakably exciting. I kept notes as I went along, hoping they might help anyone else looking to get into MLX.
MLX is very interesting because honestly, Apple has the most coherently engineered consumer and small-business-level hardware for AI workloads, with Apple Silicon (the M1 through M5 chips) and its unified memory. The news lately is all about Apple's AI fumbles, but I suspect their clever plan is to empower a community of developers to take the arrows in their back and build things out for them. The MLX community is bright and active, a fact Chimezie spotted early on. If, like me, you're trying to develop products on this new GenAI frontier without leaving the most important bits to far-flung, black-box providers, MLX is a compelling avenue.
My initial notes are just on inferencing. There are other community write-ups and projects on fine-tuning and other more advanced topics in MLX, and I'll get to my own take on those in time. There's plenty to dig into just on the inference side, though.
The velocity of change in MLX has slowed as it's matured, but all contemporary AI topics are subject to flux, so bits of this article might still go out of date. I'll try to make revisions when I can.
First steps with inference
I installed the mlx and mlx-lm Python packages. The former provides the core array processing tools, and the latter the language model-specific features. After switching to a suitable Python virtual environment I ran the following.
pip install mlx mlx-lm
For more information, see the mlx-lm README.
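Before downloading any model weights, you can make sure the core mlx array layer is working with a couple of lines of array math. This is just a quick sanity-check sketch: mlx builds computations lazily, and mx.eval forces them to run on the default device.

import mlx.core as mx

a = mx.array([1.0, 2.0, 3.0])
b = a * 2 + 1   # builds the computation lazily
mx.eval(b)      # forces evaluation on the default device (the GPU on Apple Silicon)
print(b)        # array([3, 5, 7], dtype=float32)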
Navigating the model garden maze
Thousands of open-weight AI models are now available, especially from HuggingFace, which has become the most popular home for these assets. When I was primarily using llama.cpp I would find a model, usually quantized to the GGUF format, which helps squeeze models so that they can fit in reduced VRAM. I like to use OpenAI's GPT-OSS as an example when advocating locally hosted models. It's provided by OpenAI, which offers a reputational boost, and it has the added advantage of being a pretty good model.
The vendor (e.g. OpenAI) usually provides their own full-sized version. You can then find quantized versions, which might be provided by third parties, and come in different flavors and parameterizations from different processes. The Unsloth 4-bit quantizations are a good example. A side benefit is that though many vendor model versions make you log in and accept an agreement to use the model, variant versions such as quantizations allow you to bypass that. Almost certainly a legal grey area, but a handy convenience when quickly trying out various models.
MLX-LM has its own format for model weights, including quantized model weights. The easiest way to get an MLX-compatible version of a model you want to try is to check the thousands which have been contributed by the MLX community. You can use MLX-LM to convert other model weight formats for MLX, but I'll leave that topic for another article.
For example, here's a community MLX version of GPT-OSS's 20B (20 billion parameter) variant, the smaller one: mlx-community/gpt-oss-20b-MXFP4-Q4.
I've glossed over a lot of key concepts for working with local LLMs. I have a Concepts Primer section at the end of this article which you might find useful, including as touchpoints for further learning.
Loading and prompting a model
You can just prompt a model using the MLX-LM CLI. Notice the model selection arg:
mlx_lm.generate --model mlx-community/gpt-oss-20b-MXFP4-Q4 --prompt "Do you have any advice for a fresh graduate?"
This will take a while the first time as the gpt-oss weights are downloaded to your machine. The repository will be cached, by default in ~/.cache/huggingface/hub, so subsequent loads will be much faster.
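If you'd rather pre-fetch the weights, or find out exactly where they landed on disk, you can use the huggingface_hub package (it comes along with mlx-lm's dependencies) rather than waiting for the first generate call. A minimal sketch; the repo ID is the same one passed to mlx_lm.generate above.

from huggingface_hub import snapshot_download

# Downloads the repo if needed, or reuses the copy cached under ~/.cache/huggingface/hub
local_path = snapshot_download(repo_id='mlx-community/gpt-oss-20b-MXFP4-Q4')
print(local_path)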
There's a lot going on in that output, isn't there? The first thing to be aware of is that GPT-OSS is a reasoning model. This means that it goes into a "thinking" phase before delivering its final response. Most new LLMs are reasoning models, though many of the tools you might have used to access them omit the reasoning output and just give you the final response. We're at a lower level here, so we see everything. <|channel|> is a special token that basically routes LLM output. The analysis channel is part of the reasoning. Later on you see the final channel, for the final response.
But that final response appears to be truncated, and indeed it is. By default this CLI will only generate 100 tokens before cutting off. We can adjust that with the --max-tokens arg.
mlx_lm.generate --model mlx-community/gpt-oss-20b-MXFP4-Q4 --prompt "Do you have any concise advice for a fresh graduate?" --max-tokens 1024
Notice I added "concise" to the prompt because at first the LLM shoveled over 1000 tokens at me!
Similarly, in Python you can run:
from mlx_lm import load, generate
model, tokenizer = load('mlx-community/Llama-3.2-3B-Instruct-4bit')
You might notice I changed the model selection. It's easier for now not to have to worry about the model's reasoning behavior, so I went to good old Llama 3.2, a decent and small non-reasoning model.
The basic, underlying behavior of an LLM is to generate tokens to complete the prompt. There are many other features added to this foundation to create well-known AI chat bots such as ChatGPT, but it's useful to never lose sight of the basic nature of LLM generation. The following example illustrates.
response = generate(model, tokenizer, prompt="The capital of Nigeria:", verbose=True, max_tokens=128)
You can see the completion response being streamed. 107 tokens out of the 128 we allowed.
The first step from basic completion to a chat bot is to apply what's called a chat template, which allows you to use simple Python constructs to work with the low-level way modern LLMs are trained to follow a user prompt with an assistant response.
messages = [
{'role': 'system', 'content': 'You are a friendly chatbot who always responds in the style of a cave man'},
{'role': 'user', 'content': 'Where do you live?'}]
chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
response = generate(model, tokenizer, prompt=chat_prompt, verbose=True, max_tokens=128)
response is a plain old string with the LLM completion/response. It will already have been streamed to the console thanks to verbose=True, appearing right after the converted prompt. This is useful so you can see how the ChatML-style message structure has been converted into special, low-level LLM tokens such as <|start_header_id|> and <|end_header_id|>. More on LLM tokens in a future article. Obviously I used an outlandish system prompt, but using a chat template and a well-considered system message goes a long way towards making LLMs more coherent.
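Putting those pieces together, here's a rough sketch of a multi-turn chat loop, built from nothing but the load, apply_chat_template and generate calls we've already used. The loop just keeps appending to the messages list so the model sees the whole conversation each turn. Note I've added add_generation_prompt=True, which asks the tokenizer to append the assistant header so the model knows it's its turn to speak; the chat_loop function name is my own, not anything from mlx-lm.

from mlx_lm import load, generate

def chat_loop(model_id='mlx-community/Llama-3.2-3B-Instruct-4bit'):
    model, tokenizer = load(model_id)
    messages = [{'role': 'system', 'content': 'You are a helpful, concise assistant'}]
    while True:
        user_text = input('You: ')
        if not user_text:
            break  # empty input ends the chat
        messages.append({'role': 'user', 'content': user_text})
        # Render the running conversation through the model's chat template
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        reply = generate(model, tokenizer, prompt=prompt, max_tokens=256)
        print('Assistant:', reply)
        messages.append({'role': 'assistant', 'content': reply})

chat_loop()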
Sum-up and Upcoming
I've always thought locally hosted, or at least locally-hostable GenAI models are the future. Not only do they avoid the privacy problems of shoveling all your data at a third party, but they're also how we learn to engineer more efficient AI, which is an important consideration. With all the talk about the alarming greed of the AI industry for electricity and water, it's good to know that so much of my own usage consumes only about what my pretty efficient MacBook Pro laptop draws anyway. Apple, which has sometimes been cited as an AI laggard, has at least done an important service in equipping engineers and more sophisticated practitioners with this powerful and flexible set of tools.
LLM inference is just a spoonful of the yummy soup in the MLX pot. More servings coming right up in subsequent articles.
Plug: As I've suggested, Chimezie has blazed this trail before me, and was quite helpful. He's one of those few software engineers who juggles as many projects as I do, and you can check out his work on MLX and other topics at HuggingFace and GitHub. See also the Oori Data HuggingFace org page.
Concepts primer
A paragraph each on concepts mentioned in this article. Nothing in-depth, but enough to help you find resources for further study.
HuggingFace: A platform that hosts thousands of open-source AI models and has become the de facto model repository for the AI community. When looking for models to run locally, you'll typically search HuggingFace for your desired model, then find a version converted to your preferred format (GGUF for llama.cpp, MLX format for Apple's MLX framework).
Model Weights: The learned parameters that define how an AI model behaves. Think of them as the "knowledge" the model gained during training. These weights are stored in files that you download and load to run the model locally. Different frameworks use different formats for these weight files.
Open-Weight Models: AI models whose weights are publicly available for download and use, unlike proprietary models from companies like OpenAI or Anthropic. "Open-weight" is distinct from "open-source" because while you can use the weights, you may not have access to the training code or data. These models enable private, local AI development without sending data to third parties.
Model Families and Variants: Large language models often come in multiple sizes (e.g., 7B, 13B, 20B, 70B parameters) and versions. The "B" stands for billion parameters—more parameters generally mean better capabilities but higher hardware requirements. "Instruct" versions are fine-tuned to follow instructions better and are usually what you want for practical applications.
Reasoning Models: A newer category of LLMs specifically designed to generate tokens to express a step-by-step thinking process before generating tokens representing a final response. Unlike standard models that generate responses directly, reasoning models explicitly work through problems, breaking them down into logical steps. This makes them particularly useful for complex tasks like mathematics, coding, and multi-step analysis where you want to verify the model's logic. OpenAI's o1 and o3 models are well-known examples, though most major models have at least optional reasoning capability. DeepSeek was famous for bringing high-quality reasoning to open-weights models. Such models are trained with chain-of-thought or reasoning-focused approaches.
Quantization: A technique that reduces the precision of model weights to decrease memory requirements and improve inference speed. Instead of using full 32-bit or 16-bit floating point numbers, quantized models use 4-bit or 8-bit representations. This makes it possible to run large models on consumer hardware with limited VRAM/RAM. For example, a 4-bit quantized version of a 20 billion parameter model can fit on devices that couldn't handle the full-precision version.
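The memory arithmetic behind that is easy to sanity-check yourself. This little sketch just multiplies parameter count by bytes per parameter; real weight files add overhead for things like embeddings, quantization scales and metadata, so treat the results as ballpark figures.

def approx_weight_size_gb(n_params, bits_per_param):
    # bits -> bytes -> gigabytes
    return n_params * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):
    print(f'20B parameters at {bits}-bit: ~{approx_weight_size_gb(20e9, bits):.0f} GB')
# 20B parameters at 16-bit: ~40 GB
# 20B parameters at 8-bit: ~20 GB
# 20B parameters at 4-bit: ~10 GB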
llama.cpp: A highly optimized C++ implementation of Meta's LLaMA architecture that has become the de facto standard for running LLMs efficiently on consumer hardware. Originally designed for LLaMA models, it now supports dozens of model architectures. Its Python bindings make it accessible for developers while maintaining the performance benefits of the C++ core.
GGUF Format: A file format designed specifically for efficient loading and inference of LLMs, used primarily by llama.cpp (and named after its author). Models in GGUF format are optimized for running on consumer hardware and are available in various quantization levels. If you're using llama.cpp, you'll want models in this format.
Inference vs. Fine-tuning: Inference is using a pre-trained model to generate outputs (like answering questions or writing text). Fine-tuning is the process of further training a model on your specific data. For most applications, you'll start with inference using existing models before considering fine-tuning.
Context Window: The maximum amount of text (measured in tokens) that a model can process at once. This includes both your prompt and the model's response. Larger context windows (like 32k tokens) allow you to work with longer documents or conversations without losing information. One token is roughly 3-4 characters of English text.
Token: The basic unit that LLMs process and generate. A token can be a word, part of a word, or even a single character, depending on how the model was trained. In English text, one token typically represents about 3-4 characters, or put another way, three words generally comprise around four tokens. When you see specifications like max_tokens=1000 or "32k context window", these refer to token counts, not word or character counts. Understanding tokens is important for estimating costs, setting limits, and working within model constraints.
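You can check token counts for yourself with the tokenizer object returned by mlx-lm's load, the same one used earlier in this article. A quick sketch (loading a whole model just to count tokens is overkill, but it mirrors the earlier example); exact counts will vary from model to model because each has its own tokenizer.

from mlx_lm import load

model, tokenizer = load('mlx-community/Llama-3.2-3B-Instruct-4bit')
text = 'Do you have any concise advice for a fresh graduate?'
token_ids = tokenizer.encode(text)
print(len(text.split()), 'words ->', len(token_ids), 'tokens')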
Chat Template: The specific formatting structure that different models expect for conversational interactions. Each model family has its own template syntax for distinguishing between system instructions, user messages, and assistant responses. For example, Mixtral uses <s> [INST] Instruction [/INST] Model answer</s> while other models might use different delimiters. Using the wrong chat template can significantly degrade model performance, so it's important to match your prompt format to what the model was trained with.
If you're curious about how LLM chat templates work, read the HuggingFace doc.



