Unified GPU/NPU Monitoring for Enterprise AI Infrastructure
All-SMI is a system visibility tool for diverse AI accelerator hardware, including NVIDIA, AMD, Apple Silicon, Intel Gaudi, Google TPU, and more. It monitors heterogeneous data centers precisely, down to the thermals of each node and chassis.
| GPU | Model | Util | VRAM | Temp | Power |
|---|---|---|---|---|---|
| 0 | H200 141GB | 79.1% | 84.2/141GB | 78°C | 511/700W |
| 1 | H200 141GB | 28.8% | 90.0/141GB | 63°C | 420/700W |
| 2 | H200 141GB | 36.1% | 61.6/141GB | 63°C | 385/700W |
| 3 | H200 141GB | 66.2% | 69.3/141GB | 75°C | 502/700W |
| 4 | H200 141GB | 91.3% | 120.4/141GB | 82°C | 645/700W |
| 5 | H200 141GB | 45.6% | 72.1/141GB | 59°C | 378/700W |
| 6 | H200 141GB | 53.2% | 95.8/141GB | 67°C | 462/700W |
| 7 | H200 141GB | 12.4% | 38.5/141GB | 51°C | 285/700W |
Key features
Control your entire data center from a single interface
Integrated solution for efficient data center monitoring
Monitor GPU, CPU, memory, and chassis thermal data from each node for precise control. The Cluster Overview shows total nodes, total GPUs, total VRAM, average temperature, and total power consumption at a glance.
Unified metrics for simultaneous multi-platform operation
Manage 9+ platforms, including NVIDIA, AMD, Apple Silicon, Intel Gaudi, and Google TPU, through a single UI. No need to run separate monitoring tools for each vendor.
Resource visibility across your system
Monitor 256+ remote systems simultaneously, with real-time GPU utilization, CPU load, system memory, disk usage, temperature, and power consumption. Export 100+ Prometheus metrics for observability-stack integration.
Enterprise-grade stable remote monitoring
Connection pooling, concurrent connection limits, automatic retry, and TCP Keep-alive provide stable monitoring even in large distributed environments.
Detailed process tracking to identify resource waste
Track GPU memory usage, CPU utilization, and process status to quickly diagnose performance bottlenecks. Verify per-PID which process occupies which GPU.
Intuitive interactive UI for rapid response
Intuitive color-coded accelerator and chassis status indicators with real-time graphs help operations teams make rapid decisions.
Why All-SMI
All-SMI sees every AI accelerator
Eliminates the hassle of running nvidia-smi, rocm-smi, and hl-smi separately for different accelerators, comparing outputs in different formats, and managing nodes individually.
| Feature | nvidia-smi | rocm-smi | hl-smi | All-SMI |
|---|---|---|---|---|
| NVIDIA GPU | ✓ | — | — | ✓ |
| AMD GPU | — | ✓ | — | ✓ |
| Intel Gaudi | — | — | ✓ | ✓ |
| Google TPU | — | — | — | ✓ |
| Korean NPUs (Rebellions, FuriosaAI) | — | — | — | ✓ |
| Remote cluster monitoring | — | — | — | ✓ 256+ nodes |
| Built-in Prometheus metrics | — | — | — | ✓ 100+ metrics |
| CPU / Memory / Chassis integration | — | — | — | ✓ |
| Per-process GPU tracking | ✓ NVIDIA only | ✓ AMD only | — | ✓ All accelerators |
| Color-coded interactive UI | — | — | — | ✓ |
Supported accelerators
Broad AI accelerator hardware support
Supports GPU, NPU, and TPU alike. As new AI semiconductors arrive, All-SMI coverage grows with them.
Operating modes
Three operating modes
From single-node inspection to large-scale cluster monitoring to observability stack integration
Local
Terminal-based real-time monitoring. Instantly check GPU, CPU, and memory status of your local system. Think of it as nvidia-smi for all accelerators.
API
Provides Prometheus-compatible metric endpoints. Connect to existing observability stacks like Grafana and Alertmanager.
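A Prometheus scrape configuration for nodes running in API mode might look like the following sketch. The port number, `/metrics` path, and hostnames are assumptions for illustration; check your actual deployment.

```yaml
# prometheus.yml fragment: scrape All-SMI API-mode nodes.
# Port 9090 and the /metrics path are assumptions; adjust to your setup.
scrape_configs:
  - job_name: "all-smi"
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets:
          - "gpu-node-01:9090"   # hypothetical hostnames
          - "gpu-node-02:9090"
```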
View
Remote cluster dashboard. View nodes running in API mode from a single screen.
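The three modes above map naturally to CLI subcommands. The command shapes below are illustrative assumptions based on the mode names; consult `all-smi --help` for the exact flags.

```shell
# Local: interactive terminal monitoring of this machine
all-smi local

# API: expose a Prometheus-compatible metrics endpoint
# (port number is an illustrative assumption)
all-smi api --port 9090

# View: aggregate dashboard over nodes running in API mode
# (hostnames are hypothetical)
all-smi view --hosts http://gpu-node-01:9090 http://gpu-node-02:9090
```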
Quick Start
One-line install
Supports macOS, Linux, and Windows. Choose Homebrew, pip, APT, Cargo, or binary download.
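Assuming package names match the project name (a convention-based assumption for the Homebrew, pip, and Cargo channels listed above), installation looks like:

```shell
# Pick one channel; package names are assumed to match the project name.
brew install all-smi       # macOS / Linux via Homebrew
pip install all-smi        # Python environments
cargo install all-smi      # build from crates.io
# APT packages and prebuilt binaries are also available; see the project docs.
```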
Also available as a Rust crate for building custom monitoring applications.
Enterprise Products
Lablup enterprise products using All-SMI
All-SMI's open-source monitoring capabilities are embedded in Lablup's commercial products, providing consistent accelerator visibility from desktop to data center.
Backend.AI:GO
Lightweight AI platform for desktop and AI PCs. Built on All-SMI's local monitoring engine, it lets you check your personal workstation's GPU/NPU status in real time.
- All-SMI based local accelerator monitoring
- Real-time NVIDIA, AMD, Apple Silicon status
- GPU utilization, temperature, memory dashboard
- Per-process GPU tracking
- Free distribution
Backend.AI Monitoring Dashboard
Backend.AI's web-based integrated monitoring dashboard. Built on All-SMI's metric collection engine and Prometheus integration, it visualizes accelerator status across the entire cluster.
- All-SMI + Prometheus metric pipeline
- Heterogeneous accelerator cluster dashboard
- Per-node / per-GPU utilization, temp, power
- Grafana custom dashboard integration
- Anomaly alerts and history tracking
All AI accelerators in one interface
All-SMI is enough for monitoring. If you need operations too, there's Backend.AI.