MLX · Edge AI · On-Device ML

Running LLMs on iPhone with Apple's MLX Framework

June 2, 2026 · 11 min read · By Samith Wijesinghe
MLX · Edge AI · iOS · Core ML

Two years ago, "run a language model on a phone" was a research demo. Today, with Apple's MLX framework and 4-bit quantization, a 3-billion-parameter model runs comfortably on an iPhone 15 Pro: fully offline, with no cloud round-trip. Here's how it actually works, and what I've learned shipping it.

Why on-device, and why MLX

MLX is Apple's open-source array framework built specifically for Apple silicon. Unlike most ML stacks, it leans on the unified memory architecture of the M-series and A-series chips: the CPU, GPU and Neural Engine share one pool of RAM, so tensors don't have to be copied across a PCIe bus the way they are on discrete GPUs. For edge AI that's the whole game — it's what makes running an LLM on a phone feasible at all.

The case for keeping inference on-device rather than calling a cloud API comes down to four things: privacy (user data never leaves the phone), latency (no network round-trip), offline availability, and cost (no per-token server bill). For a large class of features — summarization, classification, autocomplete, on-device search — that trade is overwhelmingly worth it.

The memory math you can't ignore

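The first thing to internalize is that an LLM's footprint is dominated by its weights. A 3B-parameter model in 16-bit precision is ~6 GB, too much for an iPhone. The lever is quantization: at 8 bits per weight the same model is ~3 GB, and at 4 bits it drops to ~1.5 GB, plus a small overhead for the quantization scales.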

MLX has 4-bit quantization built in. On top of the weights you need headroom for the KV cache, which grows linearly with context length — at long contexts this can rival the model itself, so budget for it explicitly.
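The arithmetic behind those numbers fits in a few lines of Swift. This is a rough sketch, not a profiler: `weightBytes` and `kvCacheBytes` are names made up for illustration, the weight estimate ignores the per-group scales that quantized formats store, and the layer, head and dimension values in the usage note are hypothetical rather than taken from any specific model. Read the real ones from your model's config.json.

```swift
import Foundation

/// Approximate weight footprint in bytes at a given bit width.
/// Ignores the per-group scales and biases that quantized formats add.
func weightBytes(parameters: Double, bitsPerWeight: Double) -> Double {
    parameters * bitsPerWeight / 8
}

/// Approximate KV-cache size in bytes: one key and one value vector
/// (the factor of 2) per layer, per KV head, per cached token.
/// bytesPerElement defaults to 2, assuming a 16-bit cache.
func kvCacheBytes(layers: Int, kvHeads: Int, headDim: Int,
                  contextTokens: Int, bytesPerElement: Int = 2) -> Int {
    2 * layers * kvHeads * headDim * contextTokens * bytesPerElement
}

// Weight footprint of a 3B-parameter model at three precisions.
for bits in [16.0, 8.0, 4.0] {
    let gigabytes = weightBytes(parameters: 3e9, bitsPerWeight: bits) / 1e9
    print("\(Int(bits))-bit weights: ~\(gigabytes) GB")
}

// Hypothetical architecture (32 layers, 8 KV heads, head dimension 128)
// with an 8,192-token context and a 16-bit cache.
let cache = kvCacheBytes(layers: 32, kvHeads: 8, headDim: 128, contextTokens: 8192)
print("KV cache: ~\(Double(cache) / 1e9) GB")
```

For the 3B model this gives 6, 3 and 1.5 GB of weights. With the hypothetical architecture above, an 8,192-token cache comes to about 1.07 GB, the same order as the 4-bit weights, which is why capping context length matters as much as quantizing.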

Quantizing and loading a model

The MLX ecosystem (via mlx-lm) makes conversion a one-liner. You quantize on a Mac, then bundle the result with your app:

# On your Mac: convert and 4-bit quantize a Hugging Face model
python -m mlx_lm.convert \
    --hf-path Qwen/Qwen2.5-3B-Instruct \
    --quantize --q-bits 4 \
    --mlx-path ./Qwen2.5-3B-4bit

On the device, the Swift side loads the weights into unified memory and streams tokens. With the mlx-swift bindings the core loop is small:

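import MLX
import MLXLLM

// bundledModelURL: file URL of the quantized model folder shipped in the app bundle.
// loadModel / generate are shown in simplified form; exact names and
// signatures depend on the mlx-swift package version you build against.
let model = try await loadModel(directory: bundledModelURL)
let prompt = "Summarize this review in one sentence: \(reviewText)"

var output = ""                // accumulated text shown in the UI
for await token in generate(model: model, prompt: prompt, maxTokens: 128) {
    output += token            // stream straight into the UI
}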

Because everything lives in unified memory, there's no explicit .to(device) dance: the same array is addressable by both the CPU and the GPU, with no copy in between.

Keeping the UI alive

On a phone, the thing that makes or breaks the experience isn't tokens-per-second — it's never blocking the main thread. Three rules I follow:

  1. Stream tokens as they're generated rather than waiting for the full completion. Perceived latency drops dramatically.
  2. Run generation off the main actor and hop back to @MainActor only to publish UI updates.
  3. Watch thermals. Sustained generation heats the SoC; iOS will throttle. Cap context length and token counts to what the feature actually needs.
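Here is one way the three rules can fit together, as a minimal sketch using only Foundation and Swift concurrency. Everything named here is hypothetical: `tokenStream(for:)` is a stand-in that emits canned tokens where a real app would call its MLX generation API, `SummaryViewModel` is an illustrative name, and the thermal scaling factors in `tokenBudget` are placeholders, not measured values.

```swift
import Foundation

// Hypothetical stand-in for the model: emits canned tokens from a
// background task. A real app would call its MLX generation API here.
func tokenStream(for prompt: String) -> AsyncStream<String> {
    AsyncStream { continuation in
        let producer = Task.detached(priority: .userInitiated) {
            for token in ["On-device ", "inference ", "works ", "offline."] {
                if Task.isCancelled { break }
                continuation.yield(token)
            }
            continuation.finish()
        }
        // Stop producing as soon as the consumer stops listening.
        continuation.onTermination = { _ in producer.cancel() }
    }
}

// Rule 3: shrink the token budget as the device heats up.
// The scaling factors are illustrative only.
func tokenBudget(requested: Int, thermalState: ProcessInfo.ThermalState) -> Int {
    switch thermalState {
    case .nominal, .fair: return requested
    case .serious: return requested / 2
    case .critical: return 0
    @unknown default: return requested / 2
    }
}

@MainActor
final class SummaryViewModel {
    private(set) var output = ""

    func summarize(_ prompt: String, maxTokens: Int) async {
        output = ""
        let budget = tokenBudget(requested: maxTokens,
                                 thermalState: ProcessInfo.processInfo.thermalState)
        guard budget > 0 else { return }

        var produced = 0
        // Rules 1 and 2: tokens are produced off the main actor; this loop
        // suspends between tokens, so the main actor is only occupied for
        // the brief moment it appends each one to `output`.
        for await token in tokenStream(for: prompt) {
            output += token
            produced += 1
            if produced >= budget { break }
        }
    }
}
```

Breaking out of the loop ends the stream, which fires `onTermination` and cancels the producer, so hitting the budget also stops the work that heats the SoC.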

MLX or Core ML?

This is the question I get most. Short version: Core ML is the better fit for fixed, ahead-of-time-compiled models (vision, audio, classic transformers) where you want the Neural Engine and tight OS integration. MLX shines for generative LLMs where you want dynamic shapes, custom sampling, KV-cache control, and the flexibility to quantize and swap models without recompiling a .mlpackage. I cover the full trade-off in MLX vs Core ML: choosing an on-device inference stack.

What this unlocks

Once inference is local, you can build features that were previously impossible or too expensive: private on-device chat, smart reply, semantic search over a user's own data, and offline assistants. That's the promise of edge AI — and on Apple silicon, MLX is the most direct path to it today. If you want the bigger argument for why this matters, read Edge AI on iOS: why on-device beats the cloud.

The future of AI isn't only in massive data centers — a lot of it runs silently on the device in your pocket.
Written by Samith Wijesinghe — iOS engineer & ML researcher working on on-device ML, MLX and edge AI in Colombo, Sri Lanka.

Keep reading

MLX vs Core ML: choosing an on-device inference stack
When to reach for each, with a decision checklist.
Edge AI on iOS: why on-device beats the cloud
Privacy, latency, cost — the case for local inference.