Moxel Benchmark Results
These are AuvaLabs internal benchmark results demonstrating Moxel's VRAM pooling technology on the same consumer GPU hardware available to the community. The methodology is fully public and auditable — only the pooling implementation remains proprietary.
What is Moxel?
Moxel is a VRAM pooling layer that presents multiple physical GPUs as a single unified memory space to the inference engine. This allows models that exceed a single GPU's VRAM to run without tensor parallelism overhead, and maintains throughput under concurrent load where naive TP collapses.
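To see why pooling matters for a model of this class, a back-of-the-envelope memory check helps. The parameter count below (~46.7B for Mixtral 8x7B) and the 2-byte fp16 weight size are outside assumptions, not figures from this report; the 24 GB per-GPU and 96 GB pooled capacities come from the benchmark setup.

```python
# Rough sanity check: Mixtral 8x7B weights vs. single-GPU and pooled VRAM.
# Parameter count (~46.7B) is an assumption from public Mixtral specs.
PARAMS = 46.7e9
BYTES_PER_PARAM = 2        # fp16/bf16 weights
GPU_VRAM_GB = 24           # one RTX 4090
NUM_GPUS = 4

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
pooled_gb = GPU_VRAM_GB * NUM_GPUS

print(f"weights:    {weights_gb:.1f} GB")
print(f"single GPU: {GPU_VRAM_GB} GB  -> fits: {weights_gb <= GPU_VRAM_GB}")
print(f"pooled:     {pooled_gb} GB -> fits: {weights_gb <= pooled_gb}")
```

The weights alone (~93 GB in fp16) overflow any single 24 GB card but fit inside the 96 GB pooled address space, which is consistent with the 92/96 GB usage reported in the table below.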
Key Numbers: 4× RTX 4090 (96 GB pooled)
| Metric | Result | Detail |
|---|---|---|
| Naive TP · c=4 batch latency | 6.1 s | per batch |
| Moxel · c=4 batch latency | 2.4 s | per batch |
| Improvement at c=4 | 2.5× | lower latency |
| Addressable VRAM | 96 GB | 4× 24 GB |
Mixtral 8x7B — Batch Latency by Concurrency
| Configuration | Hardware | c=1 latency | c=2 latency | c=4 latency | c=8 latency | VRAM Used |
|---|---|---|---|---|---|---|
| Moxel Pooling | 4× RTX 4090 | 0.9s | 1.4s | 2.4s | 4.1s | 92 / 96 GB |
| Naive TP | 4× RTX 4090 | 1.1s | 2.3s | 6.1s | —* | 92 / 96 GB |
* Naive TP at c=8 caused OOM errors on this model. Moxel's pooled address space handles the KV cache growth gracefully.
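The headline 2.5× figure can be recomputed directly from the latencies in the table above; the same arithmetic also shows the gap widening as concurrency grows:

```python
# Speedup of Moxel pooling over naive TP at each concurrency level,
# using the batch latencies (seconds) from the table above.
moxel    = {1: 0.9, 2: 1.4, 4: 2.4, 8: 4.1}
naive_tp = {1: 1.1, 2: 2.3, 4: 6.1}   # no c=8 entry: naive TP OOMed

for c, tp_latency in naive_tp.items():
    speedup = tp_latency / moxel[c]
    print(f"c={c}: {speedup:.2f}x lower latency with pooling")
```

At c=1 the two configurations are close (about 1.2×), but by c=4 naive TP's latency has grown superlinearly while pooling degrades more gently, yielding the ~2.5× gap.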
Methodology note: These results were produced on AuvaLabs hardware using the same CLI benchmark suite available to the community (v1.0 methodology). The only difference is the `moxel_pooling` configuration, which uses proprietary VRAM pooling instead of vLLM's built-in tensor parallelism. All results are stored with `status = internal_verified` in the leaderboard database. The full methodology specification is available at `methodology/spec.md`.
An arXiv paper with detailed analysis is forthcoming.