Lablup Releases 'mlxcel,' an Open-Source AI Inference Engine Optimized for Apple Silicon

Releases

Release and updates of Backend.AI

May 18, 2026

Releases

Lablup Releases 'mlxcel,' an Open-Source AI Inference Engine Optimized for Apple Silicon

Lablup
Lablup

May 18, 2026

Releases

Lablup Releases 'mlxcel,' an Open-Source AI Inference Engine Optimized for Apple Silicon

Lablup
Lablup

Lablup Releases mlxcel, an Open-Source AI Inference Engine Optimized for Apple Silicon

Lablup is releasing mlxcel as an open-source AI inference engine built for Apple Silicon. mlxcel is a proprietary AI inference engine that Lablup has been developing since 2025 to maximize LLM and VLM (vision-language model) inference performance on Apple Silicon devices (M1 through M5 series) as well as Linux-based CUDA environments. Its most distinctive feature is that it is written entirely in Rust on top of Apple's MLX C++ bindings, so it runs without a Python runtime. It achieves an average decode speed of 119% compared to mlx-lm and delivers state-of-the-art performance by outperforming mlx-lm on 95% of comparable models. Run large language models effortlessly on your MacBook or Mac Studio with mlxcel.

Key Technical Features

80+ model architecture support: Covers transformers, MoE, SSM/RNN, and hybrid models including Llama, Qwen, Gemma, Phi, Mistral/Mixtral, DeepSeek, and Mamba
Multimodal support: Supports 20+ VLMs including Gemma 4, Llama 4, and Qwen3-VL, with audio and video input processing
TurboQuant KV cache compression: Compresses the KV cache to 3 to 4.25 bits per value, reducing memory usage to approximately 74%
Distributed inference: Supports in-process tensor parallelism and multi-machine pipeline parallelism with mDNS-based auto-discovery
Speculative decoding: Accelerates inference using MTP and DFlash drafter models
OpenAI-compatible server: Provides REST API and SSE streaming as a drop-in replacement for llama-server

Performance Benchmarks

As of May 19, 2026, we measured the actual inference performance of mlxcel on a MacBook Pro M5 Max and a Mac Studio M1 Ultra (each with 128GB of integrated memory). We directly compared the text models (mlx-lm) and vision-language models (VLM) (mlx-vlm) on the same host using the Python base stack.

Host	Mode	Baseline	Pairs	Prefill Median	Decode Median	Decode Parity Achieved
M5 Max	Text	mlx-lm	66	2.70x	99%	62 of 66 (≥90% threshold)
M1 Ultra	Text	mlx-lm	74	1.76x	99%	64 of 74 (≥90% threshold)
M5 Max	VLM	mlx-vlm	22	0.94x	101%	18 of 22 (≥90% threshold)
M1 Ultra	VLM	mlx-vlm	18	1.33x	98%	12 of 18 (≥90% threshold)

mlxcel consistently achieves higher inference throughput than mlx-lm while sharing the same MLX backend. On the SmolLM-135M-4bit model, it reached 905 tok/s compared to mlx-lm's 712 tok/s, which is 127% of the baseline. Across diverse architectures including Qwen2.5-0.5B, it consistently delivers the performance. Notably, MoE-based models such as Phi-3.5-MoE-4bit and MiniMax-M2-3bit also show meaningful speed improvements. See the table below for representative model results.

Host	Model	Class	mlxcel prefill	mlxcel decode	mlx-lm decode	vs mlx-lm
M5 Max	smollm-135m-4bit	Small dense	6,058.41	905.24	711.54	127%
M5 Max	qwen2.5-7b-4bit	Dense 7B	917.38	126.36	123.59	102%
M5 Max	gpt-oss-120b-4bit	Large MoE	334.68	114.03	110.35	103%
M5 Max	solar-open-100b-4bit	Large MoE	210.91	65.36	66.30	99%
M5 Max	qwen3.5-35b-a3b-4bit	Hybrid MoE	480.89	151.63	152.96	99%
M5 Max	nemotron-h-30b-4bit	Hybrid SSM/MoE	414.31	177.18	178.80	99%
M1 Ultra	phi-3.5-moe-4bit	MoE	112.10	77.71	69.28	112%
M1 Ultra	minicpm3-4b-4bit	MLA	241.44	80.78	73.26	110%
M1 Ultra	qwen2.5-0.5b-4bit	Small dense	1,243.98	349.52	315.48	111%
M1 Ultra	gpt-oss-120b-4bit	Large MoE	114.12	61.19	57.58	106%
M1 Ultra	command-r7b-4bit	Dense 7B	81.17	114.34	107.75	106%
M1 Ultra	solar-open-100b-4bit	Large MoE	75.37	36.26	35.69	102%

VLM decode benchmarks on representative models also show consistent gains over mlx-lm.

Host	Model	Class	mlxcel prefill	mlxcel decode	mlx-vlm decode	vs mlx-vlm
M5 Max	qwen3.5-0.8b-4bit	Hybrid GatedDeltaNet VLM	1,294.94	505.94	410.96	123%
M5 Max	qwen3.5-35b-a3b-4bit	Hybrid MoE VLM	355.32	151.34	128.80	117%
M5 Max	gemma-4-e2b-it-4bit	Gemma 4 VLM	2,787.47	217.32	201.70	108%
M5 Max	gemma3n-e2b-4bit	Gemma 3n VLM	2,893.48	151.36	124.63	121%
M5 Max	molmo2-4b	Molmo2 vision encoder	2,512.31	64.01	66.80	96%
M1 Ultra	llava-interleave-qwen-0.5b-bf16	SigLIP + Qwen2	3,961.62	265.57	225.15	118%
M1 Ultra	aya-vision-8b	SigLIP + Cohere2	444.01	113.59	103.74	110%
M1 Ultra	molmo2-4b	Molmo2 vision encoder	727.26	60.31	60.87	99%
M1 Ultra	phi-3.5-vision-4bit	CLIP + HD tiling	991.67	122.63	92.53	133%
M1 Ultra	pixtral-12b-4bit	Pixtral ViT + Mistral	447.97	60.25	—	—

Why Lablup Is Open-Sourcing mlxcel

High-performance AI inference infrastructure has long been concentrated among big tech companies that can secure NVIDIA GPU servers and large-scale cloud resources. As of 2026, the supply imbalance is severe: H100 GPUs require a 4 to 6 month wait after ordering, and Blackwell (B200) units take more than 12 months. Against this backdrop, the inference software ecosystem for individuals and small organizations has remained limited. If the Apple devices that many developers and researchers already own, such as MacBooks and Mac Studios, can be turned into capable AI inference platforms, meaningful AI research and development becomes possible without the high barrier of GPU availability.

Lablup is open-sourcing mlxcel in pursuit of a broader goal: the decentralization of AI research capacity. We plan to continuously develop mlxcel so that startups, students, and individual developers can freely build, train, and deploy AI on Apple hardware.

Supported Models and Environments

As of May 2026, mlxcel supports 89 or more model architectures.

Text families: Llama, Qwen, Gemma, Phi, Mistral/Mixtral, DeepSeek, Cohere, InternLM, GLM, ExaOne, OLMo, ERNIE, Hunyuan, Mamba, RWKV, Jamba, Nemotron-H, MiniMax, Step, Kimi, and more

VLMs: Gemma 3/4 VLM, LLaVA, Llama 4, MiniCPM-O, Molmo, Moondream, Phi3-Vision, Phi4MM, Pixtral, Qwen2-VL, Qwen2.5-VL, Qwen3-VL, Qwen3.5-VL, and more

System Requirements

Apple device with Apple Silicon (M1/M2/M3/M4/M5 series) running macOS
NVIDIA CUDA-capable device running Linux (such as NVIDIA DGX Spark) For detailed installation requirements, refer to the installation guide.

Getting Started

On macOS, you can get started with the following commands.

Source build (macOS Apple Silicon)
git clone [https://github.com/lablup/mlxcel.git](https://github.com/lablup/mlxcel.git)
cd mlxcel
cargo build --release --features metal,accelerate
Model download
mlxcel download mlx-community/Meta-Llama-3.1-8B-Instruct-4bit

Text generation
mlxcel generate -m ./models/Meta-Llama-3.1-8B-Instruct-4bit -p "Hello, world!" -n 100

Run OpenAI compatible server
mlxcel serve -m ./models/Meta-Llama-3.1-8B-Instruct-4bit --port 8080

Pre-built binaries are also available for Linux CUDA environments. Build requirements on macOS are Rust 1.85 or later, Xcode Command Line Tools, and the Metal toolchain. On Linux with CUDA, CUDA Toolkit 13.0 or later and cuDNN 9 or later are required. If you prefer a GUI, you can download AI:GO and connect it to mlxcel as a backend.

Using mlxcel with AI:GO

mlxcel can also serve as the backend inference server for Lablup's local AI platform AI:GO. AI:GO is a desktop-based AI model management platform that runs on macOS, Windows, and Linux. By connecting mlxcel as the backend server, you can handle model management, chat, and API serving all in one place through a GUI. Users who are not comfortable with the CLI can still take full advantage of mlxcel's high-performance inference through AI:GO, making it straightforward to set up an on-premises AI environment on a MacBook or Mac Studio. Download and installation instructions are available on the official AI:GO manual page.

View detailed benchmark results

mlxcel GitHub: github.com/lablup/mlxcel | License: Apache License 2.0

AI:GO Manual: go.backend.ai/en/manual

Blog

Releases

Lablup Releases 'mlxcel,' an Open-Source AI Inference Engine Optimized for Apple Silicon

Lablup Releases 'mlxcel,' an Open-Source AI Inference Engine Optimized for Apple Silicon

Lablup Releases mlxcel, an Open-Source AI Inference Engine Optimized for Apple Silicon

Key Technical Features

Performance Benchmarks

Why Lablup Is Open-Sourcing mlxcel

Supported Models and Environments

System Requirements

Getting Started

Using mlxcel with AI:GO

We value your privacy