Releases

Release and updates of Backend.AI

May 18, 2026

Releases

Lablup Releases 'mlxcel,' an Open-Source AI Inference Engine Optimized for Apple Silicon

  • Lablup

    Lablup

    Lablup

May 18, 2026

Releases

Lablup Releases 'mlxcel,' an Open-Source AI Inference Engine Optimized for Apple Silicon

  • Lablup

    Lablup

    Lablup

Lablup Releases mlxcel, an Open-Source AI Inference Engine Optimized for Apple Silicon

Lablup is releasing mlxcel as an open-source AI inference engine built for Apple Silicon. mlxcel is a proprietary AI inference engine that Lablup has been developing since 2025 to maximize LLM and VLM (vision-language model) inference performance on Apple Silicon devices (M1 through M5 series) as well as Linux-based CUDA environments. Its most distinctive feature is that it is written entirely in Rust on top of Apple's MLX C++ bindings, so it runs without a Python runtime. It achieves an average decode speed of 119% compared to mlx-lm and delivers state-of-the-art performance by outperforming mlx-lm on 95% of comparable models. Run large language models effortlessly on your MacBook or Mac Studio with mlxcel.

Key Technical Features

  • 80+ model architecture support: Covers transformers, MoE, SSM/RNN, and hybrid models including Llama, Qwen, Gemma, Phi, Mistral/Mixtral, DeepSeek, and Mamba
  • Multimodal support: Supports 20+ VLMs including Gemma 4, Llama 4, and Qwen3-VL, with audio and video input processing
  • TurboQuant KV cache compression: Compresses the KV cache to 3 to 4.25 bits per value, reducing memory usage to approximately 74%
  • Distributed inference: Supports in-process tensor parallelism and multi-machine pipeline parallelism with mDNS-based auto-discovery
  • Speculative decoding: Accelerates inference using MTP and DFlash drafter models
  • OpenAI-compatible server: Provides REST API and SSE streaming as a drop-in replacement for llama-server

Performance Benchmarks

mlxcel consistently achieves higher inference throughput than mlx-lm while sharing the same MLX backend. On the SmolLM-135M-4bit model, it reached 918.55 tok/s compared to mlx-lm's 711.54 tok/s, which is 129% of the baseline. Across diverse architectures including Qwen2.5-0.5B and Phi-3.5-MoE, it consistently delivers 104 to 107% of mlx-lm performance. Notably, MoE-based models such as Phi-3.5-MoE-4bit and MiniMax-M2-3bit also show meaningful speed improvements. See the table below for representative model results.

ModelImprovementmlxcelmlx-lm
1smollm-135m-4bit+29.1%918.55711.54
2qwen2.5-0.5b-4bit+7.4%684.57637.17
3phi-3.5-moe-4bit+6.5%114.60107.56
4minimax-m2-3bit+5.7%72.8768.94
5qwen3-30b-a3b-4bit+3.6%152.54147.22
6qwen3-moe-4bit+3.5%151.66146.51
7command-r7b-4bit+3.5%114.53110.67
8gpt-oss-120b-4bit+2.2%112.83110.35
9qwen2.5-7b-8bit+1.7%68.6267.44
10qwen2.5-7b-4bit+1.6%125.55123.59
11minicpm-2b-4bit+1.3%231.47228.46
12qwen2.5-0.5b-bf16+0.7%405.73402.73
13deepseek-r1-distill-7b-4bit+0.4%126.13125.63

VLM decode benchmarks on representative models also show consistent gains over mlx-lm.

ModelImprovementmlxcelmlx-lm
1gemma-4-e2b-it-4bit+6.9%215.53201.70
2qwen3.6-35b-a3b-4bit+3.1%127.58123.70
3qwen3.5-35b-a3b-4bit+2.1%131.46128.80
4gemma-4-e4b-it-4bit+1.7%133.42131.24

In terms of compatibility, mlxcel also runs models that mlx-lm does not support. These include multimodal, vision, and SSM-based models such as ERNIE-4.5-0.3B, ExaOne4-1.2B, LLaVA-Interleave, OLMo, Gemma-4, Mamba2, and Phi-3.5-Vision. Their benchmark throughput values are shown below.

Modelmlxcel decode
1ERNIE-4.5-0.3B-4bit1056.88
2ExaOne4-1.2B-4bit417.34
3LLaVA-Interleave-0.5B-bf16395.35
4OLMo-1B-4bit237.50
5Gemma-4-E2B-4bit221.62
6DeepSeek-Coder-1.3B-4bit189.58
7Mamba2-1.3B-4bit171.69
8Phi-3.5-Vision-4bit160.68

Why Lablup Is Open-Sourcing mlxcel

High-performance AI inference infrastructure has long been concentrated among big tech companies that can secure NVIDIA GPU servers and large-scale cloud resources. As of 2026, the supply imbalance is severe: H100 GPUs require a 4 to 6 month wait after ordering, and Blackwell (B200) units take more than 12 months. Against this backdrop, the inference software ecosystem for individuals and small organizations has remained limited. If the Apple devices that many developers and researchers already own, such as MacBooks and Mac Studios, can be turned into capable AI inference platforms, meaningful AI research and development becomes possible without the high barrier of GPU availability.

Lablup is open-sourcing mlxcel in pursuit of a broader goal: the decentralization of AI research capacity. We plan to continuously develop mlxcel so that startups, students, and individual developers can freely build, train, and deploy AI on Apple hardware.

Supported Models and Environments

As of May 2026, mlxcel supports 89 or more model architectures.

Text families: Llama, Qwen, Gemma, Phi, Mistral/Mixtral, DeepSeek, Cohere, InternLM, GLM, ExaOne, OLMo, ERNIE, Hunyuan, Mamba, RWKV, Jamba, Nemotron-H, MiniMax, Step, Kimi, and more

VLMs: Gemma 3/4 VLM, LLaVA, Llama 4, MiniCPM-O, Molmo, Moondream, Phi3-Vision, Phi4MM, Pixtral, Qwen2-VL, Qwen2.5-VL, Qwen3-VL, Qwen3.5-VL, and more

System Requirements

  • Apple device with Apple Silicon (M1/M2/M3/M4/M5 series) running macOS
  • NVIDIA CUDA-capable device running Linux (such as NVIDIA DGX Spark) For detailed installation requirements, refer to the installation guide.

Getting Started

On macOS, you can get started with the following commands.

Source build (macOS Apple Silicon) git clone [https://github.com/lablup/mlxcel.git](https://github.com/lablup/mlxcel.git) cd mlxcel cargo build --release --features metal,accelerate Model download mlxcel download mlx-community/Meta-Llama-3.1-8B-Instruct-4bit Text generation mlxcel generate -m ./models/Meta-Llama-3.1-8B-Instruct-4bit -p "Hello, world!" -n 100 Run OpenAI compatible server mlxcel serve -m ./models/Meta-Llama-3.1-8B-Instruct-4bit --port 8080

Pre-built binaries are also available for Linux CUDA environments. Build requirements on macOS are Rust 1.85 or later, Xcode Command Line Tools, and the Metal toolchain. On Linux with CUDA, CUDA Toolkit 13.0 or later and cuDNN 9 or later are required. If you prefer a GUI, you can download AI:GO and connect it to mlxcel as a backend.

Using mlxcel with AI:GO

mlxcel can also serve as the backend inference server for Lablup's local AI platform AI:GO. AI:GO is a desktop-based AI model management platform that runs on macOS, Windows, and Linux. By connecting mlxcel as the backend server, you can handle model management, chat, and API serving all in one place through a GUI. Users who are not comfortable with the CLI can still take full advantage of mlxcel's high-performance inference through AI:GO, making it straightforward to set up an on-premises AI environment on a MacBook or Mac Studio. Download and installation instructions are available on the official AI:GO manual page.

View detailed benchmark results

mlxcel GitHub: github.com/lablup/mlxcel | License: Apache License 2.0

AI:GO Manual: go.backend.ai/manual

We're here for you!

Complete the form and we'll be in touch soon

Contact Us

Headquarter & HPC Lab

KR Office: 8F, 577, Seolleung-ro, Gangnam-gu, Seoul, Republic of Korea US Office: 3003 N First st, Suite 221, San Jose, CA 95134

© Lablup Inc. All rights reserved.

We value your privacy

We use cookies to enhance your browsing experience, analyze site traffic, and understand where our visitors are coming from. By clicking "Accept All", you consent to our use of cookies. Learn more