Moxel VRAM Pooling

Moxel Benchmark Results

These are AuvaLabs internal benchmark results demonstrating Moxel's VRAM pooling technology on the same consumer GPU hardware available to the community. The methodology is fully public and auditable — only the pooling implementation remains proprietary.

What is Moxel?

Moxel is a VRAM pooling layer that presents multiple physical GPUs as a single unified memory space to the inference engine. This allows models that exceed a single GPU's VRAM to run without tensor parallelism overhead, and maintains throughput under concurrent load where naive TP collapses.
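The pooling idea can be illustrated with a toy address translator. This is a hypothetical sketch only — Moxel's actual implementation is proprietary — but it shows what "presenting multiple GPUs as a single memory space" means at its simplest: a flat pooled offset maps to a (device, local offset) pair across four 24 GB cards.

```python
# Toy illustration of VRAM pooling (hypothetical; not Moxel's real design):
# a flat 96 GB "pooled" address space mapped onto four 24 GB local spaces.

GPU_VRAM_BYTES = 24 * 2**30              # 24 GB per RTX 4090
NUM_GPUS = 4
POOL_BYTES = NUM_GPUS * GPU_VRAM_BYTES   # 96 GB pooled address space

def translate(pooled_offset: int) -> tuple[int, int]:
    """Map a pooled offset to (gpu_index, local_offset)."""
    if not 0 <= pooled_offset < POOL_BYTES:
        raise ValueError("offset outside pooled address space")
    return divmod(pooled_offset, GPU_VRAM_BYTES)

# A buffer at pooled offset 30 GB lands on GPU 1, 6 GB into its local VRAM.
gpu, local = translate(30 * 2**30)
```

A real pooling layer must additionally handle cross-device transfers and allocation policy; the point here is only the unified address space the inference engine sees.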

Key Numbers: 4× RTX 4090 (96 GB pooled)

- Naive TP · c=4 batch latency: 6.1 s per batch
- Moxel · c=4 batch latency: 2.4 s per batch
- Improvement at c=4: 2.5× lower latency
- Addressable VRAM: 96 GB (4× 24 GB)

Mixtral 8x7B — Batch Latency by Concurrency

Configuration   Hardware      c=1    c=2    c=4    c=8    VRAM Used
Moxel Pooling   4× RTX 4090   0.9s   1.4s   2.4s   4.1s   92 / 96 GB
Naive TP        4× RTX 4090   1.1s   2.3s   6.1s   —*     92 / 96 GB

* Naive TP at c=8 caused OOM errors on this model. Moxel's pooled address space handles the KV cache growth gracefully.
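The 2.5× headline figure at c=4 follows directly from the table; the per-concurrency speedups can be checked in a few lines:

```python
# Batch latencies (seconds) from the table above; c=8 is omitted because
# the naive-TP run hit OOM at that concurrency.
naive_tp = {1: 1.1, 2: 2.3, 4: 6.1}
moxel    = {1: 0.9, 2: 1.4, 4: 2.4}

speedup = {c: naive_tp[c] / moxel[c] for c in naive_tp}
# Speedup widens with concurrency: ~1.2× at c=1, ~1.6× at c=2, ~2.5× at c=4.
```

The widening gap is the substance of the claim: pooling matters most exactly where naive TP's overhead compounds under load.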

Methodology note: These results were produced on AuvaLabs hardware using the same CLI benchmark suite available to the community (v1.0 methodology). The only difference is the moxel_pooling configuration, which uses proprietary VRAM pooling instead of vLLM's built-in tensor parallelism.
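For reference, the naive-TP baseline corresponds to a standard vLLM launch with tensor parallelism across the four cards. This is a sketch of the baseline configuration only — the exact harness flags and the model checkpoint are assumptions, and the `moxel_pooling` configuration itself is not public:

```shell
# Naive-TP baseline: vLLM's built-in tensor parallelism over 4 GPUs.
# Checkpoint name is an assumption; substitute whichever Mixtral 8x7B
# weights the benchmark suite targets.
vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size 4
```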

All results are stored with status = internal_verified in the leaderboard database. The full methodology specification is available at methodology/spec.md. An arXiv paper with detailed analysis is forthcoming.