Running LLMs on iPhone with Apple's MLX Framework
Two years ago, "run a language model on a phone" was a research demo. Today, with Apple's MLX framework and 4-bit quantization, a 3-billion-parameter model runs comfortably on an iPhone 15 Pro, fully offline, with no cloud round-trip. Here's how it actually works, and what I've learned shipping it.
Why on-device, and why MLX
MLX is Apple's open-source array framework built specifically for Apple silicon. Unlike most ML stacks, it leans on the unified memory architecture of the M-series and A-series chips: the CPU and GPU share one pool of RAM, so tensors don't have to be copied across a PCIe bus the way they are on discrete GPUs. (The Neural Engine shares that memory too, but MLX itself runs on the CPU and GPU.) For edge AI that's the whole game: it's what makes running an LLM on a phone feasible at all.
The case for keeping inference on-device rather than calling a cloud API comes down to four things: privacy (user data never leaves the phone), latency (no network round-trip), offline availability, and cost (no per-token server bill). For a large class of features — summarization, classification, autocomplete, on-device search — that trade is overwhelmingly worth it.
The memory math you can't ignore
The first thing to internalize is that an LLM's footprint is dominated by its weights. A 3B-parameter model in 16-bit precision is ~6 GB — too much for an iPhone. The lever is quantization:
- FP16 → ~2 bytes/param → 3B ≈ 6 GB. Not shippable on mobile.
- INT8 → ~1 byte/param → 3B ≈ 3 GB. Borderline.
- INT4 → ~0.5 byte/param → 3B ≈ 1.5 GB of raw weights (closer to 1.7 GB once the per-group quantization scales and biases are stored). This is the sweet spot for phones.
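The arithmetic behind that list is easy to reproduce. Here is a minimal sketch in Python; `weight_footprint_gb` is a hypothetical helper, and the group-overhead model (one 16-bit scale plus one 16-bit bias per group of 64 weights) is an assumption about how group-wise quantization stores its metadata, so treat the last figure as an estimate:

```python
from typing import Optional


def weight_footprint_gb(params: int, bits: int, group_size: Optional[int] = None) -> float:
    """Estimate weight storage in GB (1 GB = 1e9 bytes) for a given precision."""
    bits_per_weight = float(bits)
    if group_size:
        # Assumed overhead: one 16-bit scale and one 16-bit bias per group.
        bits_per_weight += 32 / group_size
    return params * bits_per_weight / 8 / 1e9


PARAMS = 3_000_000_000  # the 3B model used throughout this article

for label, bits, group in [
    ("FP16", 16, None),
    ("INT8", 8, None),
    ("INT4 (raw)", 4, None),
    ("INT4 + group metadata", 4, 64),
]:
    print(f"{label:>22}: {weight_footprint_gb(PARAMS, bits, group):.2f} GB")
```

The gap between the last two rows is why a "4-bit" model on disk is usually a little larger than params × 0.5 bytes.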
MLX has 4-bit quantization built in. On top of the weights you need headroom for the KV cache, which grows linearly with context length — at long contexts this can rival the model itself, so budget for it explicitly.
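To budget for the KV cache explicitly, the usual estimate is two tensors (keys and values) per layer, per KV head, per token. The sketch below uses illustrative architecture numbers (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries); these are assumptions chosen for the example, not values from any particular model card, and both function names are hypothetical:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `context_tokens` tokens."""
    # Factor of 2: one tensor for keys, one for values.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens


def max_context_tokens(budget_bytes: int, layers: int, kv_heads: int,
                       head_dim: int, bytes_per_value: int = 2) -> int:
    """Largest context length whose KV cache fits in `budget_bytes`."""
    per_token = kv_cache_bytes(layers, kv_heads, head_dim, 1, bytes_per_value)
    return budget_bytes // per_token


# Assumed example architecture (not taken from a real model card).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for tokens in (1024, 4096, 8192):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, tokens) / 2**30
    print(f"{tokens:>5} tokens -> {gib:.3f} GiB of KV cache")

# Inverse question: how much context fits in a 512 MiB cache budget?
print(max_context_tokens(512 * 2**20, LAYERS, KV_HEADS, HEAD_DIM))
```

Running the inverse calculation first, then choosing the context cap from it, tends to be safer than picking a context length and hoping it fits.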
Quantizing and loading a model
The MLX ecosystem (via mlx-lm) makes conversion a one-liner. You quantize on a Mac, then bundle the result with your app:
# On your Mac — convert + 4-bit quantize a Hugging Face model
python -m mlx_lm.convert \
--hf-path Qwen/Qwen2.5-3B-Instruct \
--quantize --q-bits 4 \
--mlx-path ./Qwen2.5-3B-4bit
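Before bundling the result, it can be worth checking what the converter actually produced. The following is a minimal sketch, assuming the output folder holds a `config.json` with a `"quantization"` entry and one or more `.safetensors` weight files; that layout is an assumption about the converter's output, and `inspect_converted_model` is a hypothetical helper, not part of mlx-lm:

```python
import json
from pathlib import Path


def inspect_converted_model(model_dir: str) -> dict:
    """Summarize a converted model folder: quantization settings and weight size."""
    root = Path(model_dir)
    config = json.loads((root / "config.json").read_text())
    # Assumption: quantization settings are recorded under a "quantization" key.
    quant = config.get("quantization") or {}
    # Sum every weight shard so sharded checkpoints are counted in full.
    weight_bytes = sum(p.stat().st_size for p in root.glob("*.safetensors"))
    return {
        "bits": quant.get("bits"),
        "group_size": quant.get("group_size"),
        "weight_bytes": weight_bytes,
        "weight_gb": weight_bytes / 1e9,
    }
```

Calling `inspect_converted_model("./Qwen2.5-3B-4bit")` on the Mac gives a quick check that the bundle is the size the memory math predicts, and that the quantization settings were actually applied.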
On the device, the Swift side loads the weights into unified memory and streams tokens. With the mlx-swift bindings the core loop is small:
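import MLX
import MLXLLM

// Simplified sketch: the exact loading and generation calls differ between
// releases of the mlx-swift LLM packages, so check the version you ship.
let model = try await loadModel(directory: bundledModelURL)
let prompt = "Summarize this review in one sentence: \(reviewText)"

var output = ""  // accumulated text shown in the UI
for await token in generate(model: model, prompt: prompt, maxTokens: 128) {
    output += token  // stream straight into the UI
}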
Because everything lives in unified memory, there's no explicit .to(device) dance: arrays sit in memory the GPU can already address, and MLX runs operations on the GPU by default.
Keeping the UI alive
On a phone, the thing that makes or breaks the experience isn't tokens-per-second — it's never blocking the main thread. Three rules I follow:
- Stream tokens as they're generated rather than waiting for the full completion. Perceived latency drops dramatically.
- Run generation off the main actor and hop back to @MainActor only to publish UI updates.
- Watch thermals. Sustained generation heats the SoC; iOS will throttle. Cap context length and token counts to what the feature actually needs.
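The first two rules describe a producer/consumer split, and the shape is the same in any language. Here is a language-neutral sketch in Python, where the asyncio event loop stands in for the main actor and a worker thread stands in for the background task. `fake_generate` is a hypothetical stand-in for a blocking model call, and none of these names come from MLX:

```python
import asyncio
from typing import Callable, Iterator


def fake_generate(prompt: str, max_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for a blocking, token-by-token model call."""
    for word in prompt.split()[:max_tokens]:  # hard cap on generated tokens
        yield word + " "


async def stream_to_ui(prompt: str, max_tokens: int,
                       on_token: Callable[[str], None]) -> None:
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()
    done = object()  # sentinel marking the end of the stream

    def worker() -> None:
        # Runs on a background thread, so the UI loop never blocks on the model.
        for token in fake_generate(prompt, max_tokens):
            loop.call_soon_threadsafe(queue.put_nowait, token)
        loop.call_soon_threadsafe(queue.put_nowait, done)

    background = loop.run_in_executor(None, worker)
    while (item := await queue.get()) is not done:
        on_token(item)  # back on the UI loop: publish each token as it arrives
    await background


shown = []
asyncio.run(stream_to_ui("one two three four", max_tokens=3, on_token=shown.append))
print("".join(shown))
```

On iOS the equivalent split is a background task doing the generation, with only the `on_token` step running on `@MainActor`. The `max_tokens` cap is the same lever the thermal rule asks for.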
MLX or Core ML?
This is the question I get most. Short version: Core ML is the better fit for fixed, ahead-of-time-compiled models (vision, audio, classic transformers) where you want the Neural Engine and tight OS integration. MLX shines for generative LLMs where you want dynamic shapes, custom sampling, KV-cache control, and the flexibility to quantize and swap models without recompiling a .mlpackage. I cover the full trade-off in MLX vs Core ML: choosing an on-device inference stack.
What this unlocks
Once inference is local, you can build features that were previously impossible or too expensive: private on-device chat, smart reply, semantic search over a user's own data, and offline assistants. That's the promise of edge AI — and on Apple silicon, MLX is the most direct path to it today. If you want the bigger argument for why this matters, read Edge AI on iOS: why on-device beats the cloud.
The future of AI isn't only in massive data centers — a lot of it runs silently on the device in your pocket.