
The DMA Streaming Framework: Kernel-Level Buffer Orchestration for High-Performance AI Data Paths


Marco Graziano

arXiv:2603.10030v1 Announce Type: cross

Abstract: AI transport libraries move bytes efficiently, but they commonly assume that buffers are already correctly allocated, placed, shared, registered, and safe under completion and teardown pressure. This paper presents dmaplane, a Linux kernel module that makes this missing layer explicit as buffer orchestration. dmaplane exposes a stable kernel UAPI via /dev/dmaplane and composes ring-based command channels, DMA buffer lifecycle management, dma-buf export for cross-device sharing, a kernel-space RDMA engine, NUMA-aware allocation and verification, credit-based flow control, low-overhead observability, and GPU memory integration via PCIe BAR pinning. We evaluate orchestration sensitivity with measurements of NUMA cross-node penalties at DRAM scale, completion-safe flow control under sustained RDMA load, and GPU BAR mapping tiers versus cudaMemcpy. We also demonstrate end-to-end disaggregated inference by transferring KV-cache chunks between two machines using RDMA WRITE WITH IMMEDIATE and reconstructing tensor views on the receiver. RDMA measurements use Soft-RoCE; we distinguish measured results from provider-independent properties by construction.

Executive Summary

The article introduces dmaplane, a Linux kernel module that addresses a gap in AI data-path infrastructure by formalizing buffer orchestration at the kernel level. Traditional AI transport libraries assume that buffer allocation, placement, sharing, and safety are already resolved; dmaplane makes these operations explicit through a stable UAPI at /dev/dmaplane, integrating ring-based command channels, DMA buffer lifecycle management, cross-device sharing via dma-buf, a kernel-space RDMA engine, NUMA-aware allocation, credit-based flow control, low-overhead observability, and PCIe BAR pinning for GPU memory. The evaluation quantifies NUMA cross-node penalties, completion-safe flow control under sustained RDMA load, GPU BAR mapping performance, and end-to-end disaggregated inference via RDMA WRITE WITH IMMEDIATE. This work fills a significant architectural void by enabling deterministic, observable, and scalable buffer management for heterogeneous workloads.

Key Points

  • Introduction of dmaplane as a kernel-level buffer orchestration layer
  • Integration of UAPI via /dev/dmaplane for stable kernel interface
  • Evaluation of performance impacts on NUMA, RDMA, and GPU memory sharing

Merits

Architectural Completeness

dmaplane fills a critical architectural gap by making buffer orchestration explicit, enabling deterministic and observable data-path behavior that transport libraries previously left implicit.

Performance Validation

The evaluation quantifies real-world impacts on NUMA cross-node penalties, flow control under RDMA load, and GPU BAR mapping efficiency, lending empirical credibility to the implementation.

Heterogeneous Support

Support for GPU memory via PCIe BAR pinning and cross-device sharing via dma-buf demonstrates adaptability across compute tiers and device architectures.

Demerits

Implementation Complexity

The addition of a new kernel module introduces potential maintenance overhead and complexity for distribution and compatibility across kernel versions.

Scope Limitation

While the evaluation is comprehensive, its focus on RDMA and GPU integration may underrepresent applicability to non-RDMA or CPU-only workloads.

Expert Commentary

dmaplane represents a paradigm shift in kernel-level abstraction for AI data paths by codifying buffer orchestration as a first-class concern. Historically, buffer management has been treated as a black box—assumed correct by transport libraries without explicit support. This paper’s contribution is not merely architectural; it is pedagogical. By exposing a stable UAPI and integrating NUMA awareness, RDMA, and GPU BAR pinning into a unified orchestration framework, dmaplane elevates the conversation from performance optimization to architectural integrity. The evaluation’s focus on measurable outcomes—specifically NUMA penalties and completion-safe flow control—provides compelling evidence that buffer orchestration directly affects end-to-end latency and reliability. Furthermore, the explicit separation of Soft-RoCE measurements from provider-independent properties demonstrates methodological rigor. This is not an incremental improvement; it is a foundational enhancement that may inspire similar abstractions in other domains, such as HPC or edge AI. The potential for dmaplane to become a standard component in AI-optimized kernel stacks is substantial.

Recommendations

  • Integrate dmaplane into mainstream AI kernel distributions as a default buffer orchestration module for high-performance workloads.
  • Publish benchmark suites tailored to dmaplane’s UAPI to enable standardized evaluation of buffer orchestration impacts across diverse AI stacks.
