
# Hardware Backends (1.0.0-rewrite)

Deeplearning4j 1.0.0-rewrite extends ND4J's backend system well beyond the original CPU and CUDA pairing. The rewrite introduces dedicated backends for Google Cloud TPUs, Qualcomm Hexagon DSPs, AMD and Intel GPUs via ZLUDA, Snapdragon X, and a substantially expanded set of CPU acceleration libraries. It also ships a new multi-backend infrastructure layer that unifies device switching, memory tracking, and workspace management across all of these targets.

This page documents every new and expanded backend, their configuration, and the shared infrastructure that coordinates them.

## 1. Overview and Backend Selection

ND4J uses the standard Java SPI mechanism to discover backends at startup. Each backend JAR ships a `META-INF/services/org.nd4j.linalg.factory.Nd4jBackend` registration file. When `Nd4j` is first referenced, `ServiceLoader` collects all registered backends, calls `isAvailable()` on each, and selects the one with the highest `getPriority()` return value that reports itself as available.

The priority ladder in the rewrite:

| Backend                  | Priority | Notes                                                      |
| ------------------------ | -------- | ---------------------------------------------------------- |
| `nd4j-native` (CPU)      | 0        | Always available; baseline fallback                        |
| `nd4j-tpu`               | 50       | Selected over CPU when TPU hardware is detected            |
| `nd4j-hexagon`           | 60       | Selected on Snapdragon SoCs when QNN runtime is present    |
| `nd4j-cuda` (CUDA/ZLUDA) | 100      | Highest priority; used when NVIDIA or ZLUDA GPU is present |
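
The selection rule described above can be sketched in plain Java. This is a minimal illustration, not ND4J code: the `Candidate` class and the `select` method are hypothetical stand-ins for a discovered backend and the loader logic. Only the backend names and priority values come from the table.

```java
import java.util.ArrayList;
import java.util.List;

final class BackendSelectionSketch {

    // Hypothetical stand-in for a backend found through ServiceLoader; not an ND4J type.
    static final class Candidate {
        final String name;
        final int priority;
        final boolean available;

        Candidate(String name, int priority, boolean available) {
            this.name = name;
            this.priority = priority;
            this.available = available;
        }
    }

    // Returns the available candidate with the highest priority, or null if none is available.
    static Candidate select(List<Candidate> candidates) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (!c.available) {
                continue; // unavailable backends never win, whatever their priority
            }
            if (best == null || c.priority > best.priority) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Candidate> found = new ArrayList<>();
        found.add(new Candidate("nd4j-native", 0, true));
        found.add(new Candidate("nd4j-tpu", 50, false));
        found.add(new Candidate("nd4j-hexagon", 60, true));
        found.add(new Candidate("nd4j-cuda", 100, false));
        System.out.println(select(found).name); // prints "nd4j-hexagon"
    }
}
```

With CUDA and TPU reported as unavailable, the Hexagon entry wins over the CPU fallback because 60 is greater than 0.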

When multiple backends are on the classpath but only one is desired, override selection with:

```
-Dbackend.type=CPU        # force CPU regardless of available hardware
-Dbackend.type=TPU        # force TPU
-Dbackend.type=HEXAGON    # force Hexagon DSP
```

Or programmatically before the first `Nd4j` call:

```java
System.setProperty("backend.type", "CPU");
```

The new `DeviceType` enum enumerates every supported target:

```java
DeviceType.CPU
DeviceType.CUDA
DeviceType.ROCM
DeviceType.TPU
DeviceType.HEXAGON
DeviceType.OPENCL
DeviceType.METAL
DeviceType.VULKAN
```

## 2. Classifier Variants: Base vs. `-compile`

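Every ND4J backend ships in two classifier variants that control how much of the DSP compilation stack is bundled into the native binary. Here DSP stands for Dynamic Shape Plan, not the digital signal processor targeted by the Hexagon backend in section 5. The two variants are: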

| Variant        | Example Classifier     | What It Includes                                                                                                                                                                                                 |
| -------------- | ---------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Base**       | `linux-x86_64`         | Standard ND4J ops, OpenBLAS/MKL, CUDA kernels (for `nd4j-cuda`), cuBLAS. Graph execution runs slot-by-slot or with CUDA graph capture/replay, but **no JIT kernel fusion**.                                      |
| **`-compile`** | `linux-x86_64-compile` | Everything in base **plus** the Triton MLIR GPU JIT compiler, NVRTC runtime compiler, PTX string-template backend, and the MLIR CPU JIT backend. Enables full DSP kernel fusion and graph-level JIT compilation. |

### When to Use Each Variant

**Use the base classifier when:**

* You want the **smallest possible binary size**. The base native library excludes the Triton/LLVM compiler stack, which adds substantial weight to the binary.
* Your workload does **not** benefit from JIT kernel fusion — for example, classical ML pipelines, small models, or workloads dominated by a few large BLAS operations where cuBLAS/OpenBLAS already achieves peak throughput.
* You need **simpler deployment** with fewer native dependencies. The base classifier has no dependency on LLVM, Triton, or MLIR libraries.
* You are deploying to **resource-constrained environments** (edge devices, containers with tight image size budgets).

**Use the `-compile` classifier when:**

* You are running **transformer models or LLMs** where the DSP graph optimizer and Triton kernel fusion deliver significant speedups (often 2–5x for inference at low batch sizes due to eliminated kernel launch overhead and fused element-wise chains).
* You want the **full DSP execution mode hierarchy**: `TRITON → NVRTC → PTX → CUDA_GRAPHS → SLOT_BY_SLOT` on CUDA, or `oneDNN Graph → MLIR CPU → SLOT_BY_SLOT` on Intel CPUs.
* You are using the **DSP Runtime SDK** (`sdx` bindings) and want all backend targets available.
* **Maximum performance** is more important than binary size.

### Trade-Off Summary

```
┌──────────────────────────────────────────────────────────────────┐
│                        Base Classifier                           │
│  ✓ Smaller binary        ✓ Simpler deployment                   │
│  ✓ Fewer dependencies    ✓ Faster startup (no JIT warmup)       │
│  ✗ No Triton fusion      ✗ No MLIR CPU JIT                     │
│  ✗ No NVRTC/PTX JIT      ✗ Slot-by-slot or CUDA graphs only    │
├──────────────────────────────────────────────────────────────────┤
│                      -compile Classifier                         │
│  ✓ Triton kernel fusion  ✓ NVRTC + PTX JIT fallbacks           │
│  ✓ MLIR CPU JIT          ✓ Full DSP optimization (26 passes)   │
│  ✓ Maximum throughput    ✓ All GraphExecutionMode values        │
│  ✗ Larger binary         ✗ More native dependencies (LLVM)     │
│  ✗ Longer first-call     ✗ Higher memory footprint              │
│    (JIT compilation)                                             │
└──────────────────────────────────────────────────────────────────┘
```

### Maven Configuration

To use the base classifier (default — no extra configuration needed):

```xml
<!-- CPU base: standard ops, no JIT -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>${dl4j.version}</version>
    <classifier>linux-x86_64</classifier>
</dependency>
```

To use the `-compile` classifier:

```xml
<!-- CPU with Triton/MLIR JIT -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>${dl4j.version}</version>
    <classifier>linux-x86_64-compile</classifier>
</dependency>
```

```xml
<!-- CUDA with Triton JIT -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-cuda-12.9</artifactId>
    <version>${dl4j.version}</version>
    <classifier>linux-x86_64-cuda-12.9-compile</classifier>
</dependency>
```

When using the `-platform` artifact, steer JavaCPP to the `-compile` variant via a system property:

```
-Djavacpp.platform.extension=-compile
```

### Available `-compile` Classifiers

| Artifact         | Classifier                       | Platform                                    |
| ---------------- | -------------------------------- | ------------------------------------------- |
| `nd4j-native`    | `linux-x86_64-compile`           | Linux x86-64 with Triton + MLIR + oneDNN    |
| `nd4j-native`    | `linux-arm64-compile`            | Linux ARM64 with MLIR                       |
| `nd4j-native`    | `macosx-arm64-compile`           | macOS Apple Silicon with MLIR + MLX         |
| `nd4j-native`    | `android-arm64-compile`          | Android ARM64 with MLIR                     |
| `nd4j-native`    | `android-arm64-compile-nnapi`    | Android ARM64 with MLIR + NNAPI             |
| `nd4j-cuda-12.9` | `linux-x86_64-cuda-12.9-compile` | Linux x86-64 CUDA with Triton + NVRTC + PTX |

### What Happens Without `-compile`

Without the `-compile` classifier, DSP still operates — it compiles the graph into a `DynamicShapePlan`, runs the 26-pass optimizer, freezes shapes, and captures CUDA graphs for replay. The difference is that JIT kernel fusion (Triton, NVRTC, PTX, MLIR) is unavailable:

* `GraphExecutionMode.TRITON` falls back to `CUDA_GRAPHS`
* `GraphExecutionMode.NVRTC` falls back to `CUDA_GRAPHS`
* `GraphExecutionMode.PTX` falls back to `CUDA_GRAPHS`
* `GraphExecutionMode.MLIR_CPU` falls back to `SLOT_BY_SLOT`
* `GraphExecutionMode.AUTO` selects the best *available* mode — on CUDA this means `CUDA_GRAPHS`, on CPU this means `SLOT_BY_SLOT` (or `oneDNN Graph` if helpers are present)

CUDA graph capture/replay alone still provides substantial speedups over pure slot-by-slot execution by eliminating per-kernel launch overhead. The additional JIT fusion from `-compile` provides further gains by fusing element-wise op chains into single kernels — reducing global memory traffic between ops.
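
The fallback rules listed above can be expressed as a small lookup. The sketch below is illustrative only: `ExecMode` is a hypothetical mirror of the `GraphExecutionMode` values named on this page, and `resolve` is not a library method. It also assumes that a requested mode is honored as-is when the `-compile` classifier is present, which ignores hardware availability.

```java
import java.util.EnumSet;
import java.util.Set;

final class ModeFallbackSketch {

    // Hypothetical mirror of the GraphExecutionMode values named on this page.
    enum ExecMode { TRITON, NVRTC, PTX, MLIR_CPU, CUDA_GRAPHS, SLOT_BY_SLOT }

    // Modes that need the JIT stack bundled only in the -compile classifier.
    static final Set<ExecMode> JIT_MODES =
            EnumSet.of(ExecMode.TRITON, ExecMode.NVRTC, ExecMode.PTX, ExecMode.MLIR_CPU);

    // Returns the mode that would actually run for a requested mode.
    static ExecMode resolve(ExecMode requested, boolean compileClassifier) {
        if (compileClassifier || !JIT_MODES.contains(requested)) {
            return requested;
        }
        // The CPU JIT mode degrades to slot-by-slot; the GPU JIT modes degrade to CUDA graphs.
        return requested == ExecMode.MLIR_CPU ? ExecMode.SLOT_BY_SLOT : ExecMode.CUDA_GRAPHS;
    }

    public static void main(String[] args) {
        for (ExecMode mode : ExecMode.values()) {
            System.out.println(mode + " -> " + resolve(mode, false));
        }
    }
}
```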

### Choosing the Right Variant: Decision Guide

| Scenario                                           | Recommended Variant                                     |
| -------------------------------------------------- | ------------------------------------------------------- |
| Training or inference with CNNs / classical models | Base — cuBLAS and cuDNN dominate; JIT adds little       |
| LLM inference at batch size 1–8                    | `-compile` — Triton fusion dramatically reduces latency |
| LLM inference at large batch sizes (32+)           | Either — compute-bound; CUDA graphs alone may suffice   |
| Edge deployment (ARM, Android)                     | Base — minimize binary size                             |
| Edge deployment needing NNAPI                      | `-compile` (with `-nnapi` on Android)                   |
| Server-side model serving                          | `-compile` — maximize throughput per dollar             |
| CI/CD test pipelines                               | Base — faster dependency resolution, smaller images     |
| Development / prototyping                          | Base — faster builds, simpler debugging                 |

## 3. CUDA Backend (with cuDNN Expansion)

The existing `nd4j-cuda` backend is unchanged in its public API. The rewrite adds 20 new and updated cuDNN helper files under `deeplearning4j-cuda`, along with structural changes for stream-capture safety.

### New cuDNN Operations

The following cuDNN-backed op implementations are new in this release:

| File                            | Op                                                                       |
| ------------------------------- | ------------------------------------------------------------------------ |
| `CudnnFlashAttentionHelper`     | Flash Attention stub (multi-head attention with memory-efficient kernel) |
| `CudnnBiasAddHelper`            | Bias-add fused with activation                                           |
| `CudnnConv1dHelper`             | 1-D convolution                                                          |
| `CudnnDeconv2dHelper`           | 2-D transposed convolution (deconvolution)                               |
| `CudnnDeconv3dHelper`           | 3-D transposed convolution                                               |
| `CudnnDropoutHelper`            | Stateful cuDNN dropout                                                   |
| `CudnnGlobalPoolingHelper`      | Global average/max pooling                                               |
| `CudnnGruHelper`                | GRU cell forward and backward                                            |
| `CudnnInstanceNormHelper`       | Instance normalization                                                   |
| `CudnnLayerNormHelper`          | Layer normalization                                                      |
| `CudnnLogSoftmaxHelper`         | Log-softmax                                                              |
| `CudnnLrnHelper`                | Local response normalization                                             |
| `CudnnOpTensorHelper`           | Pointwise tensor operations (add, mul, min, max)                         |
| `CudnnReduceHelper`             | Reduce over arbitrary axes                                               |
| `CudnnSimpleRnnHelper`          | Simple RNN cell                                                          |
| `CudnnSpatialTransformerHelper` | Spatial transformer network                                              |

### Updated cuDNN Operations

The following existing helpers were updated for CUDA stream-capture safety and DSP compatibility:

* `CudnnBatchNormHelper` — per-stream handle caching; correct behavior when CUDA graph capture is active
* `CudnnConv2dHelper` and `CudnnConv3dHelper` — stream-safe workspace allocation
* `CudnnCtcHelper` — CTC loss; rewritten to avoid illegal API calls inside CUDA graph capture
* `CudnnLSTMHelper` — updated to the cuDNN v8 RNN API
* `CudnnDepthwiseConv2dHelper` — aligned with updated depthwise conv semantics

### Per-Stream cuDNN Handle Caching

The rewrite introduces centralized cuDNN handle caching keyed by CUDA stream. Each time a cuDNN operation is dispatched, the infrastructure checks whether a handle already exists for the current stream; if not, it creates and registers one. This eliminates handle creation overhead in tight loops and is the underlying change that makes stream capture safe across all cuDNN helpers.
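
The lookup-or-create pattern behind this cache can be sketched as follows. This is a plain-Java model, not the actual helper code: stream and handle identifiers are represented as `long` values, and the class and method names are hypothetical. In the native layer the creation step would correspond to `cudnnCreate` followed by `cudnnSetStream`.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class StreamHandleCacheSketch {

    // Key: stream id (stand-in for a CUDA stream). Value: handle id (stand-in for a cuDNN handle).
    private final ConcurrentHashMap<Long, Long> handles = new ConcurrentHashMap<>();

    // Counts handle creations so the effect of caching is observable.
    private final AtomicInteger created = new AtomicInteger();

    // Returns the cached handle for the stream, creating and registering one on first use.
    long handleFor(long streamId) {
        return handles.computeIfAbsent(streamId, id -> {
            created.incrementAndGet();
            return 1000L + id; // placeholder for the native handle creation
        });
    }

    int createdCount() {
        return created.get();
    }

    public static void main(String[] args) {
        StreamHandleCacheSketch cache = new StreamHandleCacheSketch();
        for (int i = 0; i < 1000; i++) {
            cache.handleFor(7L); // tight loop on a single stream
        }
        cache.handleFor(8L);     // a second stream gets its own handle
        System.out.println("handles created: " + cache.createdCount()); // prints 2
    }
}
```

A thousand dispatches on one stream create a single handle, which is the overhead reduction the paragraph above describes.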

### Maven Setup

```xml
<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-cuda-12.9-platform</artifactId>
  <version>1.0.0-rewrite</version>
</dependency>

<!-- cuDNN helpers (optional, for neural network layer acceleration) -->
<dependency>
  <groupId>org.deeplearning4j</groupId>
  <artifactId>deeplearning4j-cuda-12.9</artifactId>
  <version>1.0.0-rewrite</version>
</dependency>
```

See [CUDA Backend (nd4j-cuda)](https://github.com/KonduitAI/deeplearning4j-docs/blob/en-1.0.0-rewrite/docs/m2.1/nd4j/backends/cuda/README.md) for full CUDA setup, multi-GPU configuration, and memory management.

## 4. TPU Backend (nd4j-tpu)

`nd4j-tpu` is a new backend targeting Google Cloud TPU v4 and v5 hardware. It uses Google's **PJRT** (Portable JIT Runtime) API through JNI, so the Java layer never calls XLA or HLO directly.

### Java Components (7 files)

**`JTpuBackend`** — The `Nd4jBackend` subclass registered with the SPI. On `isAvailable()`, it fires a JNI probe that calls `PjrtClientManager::HasTpuDevice()` on the native side. If that returns true and a PJRT client can be created, the backend is considered available. Priority is 50, placing it above the CPU backend and below CUDA.

**`JTpuNDArray`** — The `INDArray` implementation for TPU. Array data is held in XLA buffer handles allocated through PJRT rather than in CPU or GPU memory. Operations on `JTpuNDArray` are routed through `TpuExecutioner` which compiles and dispatches HLO programs.

**`TpuEnvironment`** — Holds TPU-wide configuration. Key defaults:

* Data type default: `bfloat16` (the native TPU format; training in bfloat16 is strongly recommended)
* Compilation cache: HLO programs are cached by signature to avoid recompilation per step
* Device count: read from the PJRT client at startup

**`TpuExecutioner`** — Routes ND4J op calls to PJRT XLA execution. Each op call results in an HLO program fragment that is compiled (or retrieved from cache) and executed on a TPU device via `PjrtClientManager`.

### Native Components

**`TpuGraphBackend`** — The C++ entry point called from `TpuExecutioner` via JNI. Manages the top-level execution pipeline.

**`HloIRBuilder`** — Translates op descriptors into XLA HLO (High Level Optimizer) programs. Each op generates the appropriate HLO computation; multiple ops in a SameDiff graph can be fused into a single HLO program before dispatch.

**`PjrtClientManager`** — Manages the PJRT client lifecycle: device enumeration, memory allocation on TPU HBM (High Bandwidth Memory), and execution submission. The manager is a singleton per process.

### Setup

TPU support requires:

1. A Google Cloud VM with a TPU v4 or v5 pod slice attached, or a Cloud TPU node accessible via PJRT network endpoint.
2. The `libtpu.so` shared library, available from the Google Cloud TPU apt repository or bundled inside `nd4j-tpu`.
3. The environment variable `TPU_NAME` set to the TPU resource name (e.g., `local` for a TPU VM, or `projects/PROJECT/locations/ZONE/nodes/NODE_NAME` for a TPU node).

```xml
<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-tpu</artifactId>
  <version>1.0.0-rewrite</version>
</dependency>
```

```bash
export TPU_NAME=local
java -jar myapp.jar
```

### Configuration

```java
import org.nd4j.linalg.tpu.TpuEnvironment;

// Check detected TPU device count
System.out.println(TpuEnvironment.getInstance().getDeviceCount());

// Override default data type (bfloat16 by default)
TpuEnvironment.getInstance().setDefaultDataType(DataType.FLOAT);

// Flush the HLO compilation cache (useful during development)
TpuEnvironment.getInstance().clearCompilationCache();
```

At runtime, the active backend reports the `JTpuBackend` class name:

```java
System.out.println(Nd4j.getBackend().getClass().getName());
// org.nd4j.linalg.tpu.JTpuBackend
```

### HLO Compilation and bfloat16

TPUs execute XLA HLO programs compiled ahead of execution. The first call to an op compiles an HLO program and caches it; subsequent calls with the same shapes hit the cache. Shape changes invalidate the cache entry and trigger recompilation.
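
A shape-keyed compilation cache of this kind can be modelled in a few lines. The sketch below is an assumption-laden illustration: the class, the key format, and the string "program" are all hypothetical, and the real cache signature may include more than the op name and shape (for example the data type). A new shape simply produces a new key and a new compilation; eviction of the old entry is not modelled.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class CompilationCacheSketch {

    private final Map<String, String> cache = new HashMap<>();
    private int compilations = 0;

    // Builds a cache key from the op name and the input shape.
    static String signature(String op, long... shape) {
        return op + Arrays.toString(shape);
    }

    // Returns a placeholder compiled program, compiling only on a cache miss.
    String getOrCompile(String op, long... shape) {
        return cache.computeIfAbsent(signature(op, shape), key -> {
            compilations++;           // stands in for an expensive HLO compile
            return "program:" + key;
        });
    }

    int compilations() {
        return compilations;
    }

    public static void main(String[] args) {
        CompilationCacheSketch c = new CompilationCacheSketch();
        c.getOrCompile("mmul", 1024, 1024); // miss: compiles
        c.getOrCompile("mmul", 1024, 1024); // hit: reuses the cached program
        c.getOrCompile("mmul", 512, 1024);  // new shape: compiles again
        System.out.println("compilations: " + c.compilations()); // prints 2
    }
}
```

The practical consequence is that workloads with stable shapes pay the compile cost once, while variable-length inputs may benefit from padding to a small set of shapes.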

bfloat16 is the recommended data type for TPU. It has the same dynamic range as float32 (8-bit exponent) but reduces mantissa precision to 7 bits. TPU matrix units run bfloat16 multiplications natively. Accumulations inside the matrix unit use float32, so effective precision for large matrix multiplications is higher than the storage format implies.

```java
// Create a bfloat16 array on TPU
INDArray x = Nd4j.rand(DataType.BFLOAT16, 1024, 1024);
INDArray y = Nd4j.rand(DataType.BFLOAT16, 1024, 1024);
INDArray z = x.mmul(y);  // dispatched through PJRT as HLO dot_general
```
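
To make the precision trade-off concrete, the following sketch converts float32 values to bfloat16 bit patterns and back using only the JDK. It is a standalone illustration of the format, not ND4J or TPU code; the class and method names are hypothetical. The conversion keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits, rounding to nearest with ties to even.

```java
final class Bfloat16Sketch {

    // Converts a float32 to bfloat16 bits by rounding away the low 16 bits.
    static short toBfloat16(float value) {
        int bits = Float.floatToIntBits(value);
        // Adding 0x7FFF plus the lowest kept bit implements round-to-nearest-even.
        int roundingBias = 0x7FFF + ((bits >>> 16) & 1);
        return (short) ((bits + roundingBias) >>> 16);
    }

    // Widens bfloat16 bits back to float32 by zero-filling the low 16 bits.
    static float fromBfloat16(short bf16) {
        return Float.intBitsToFloat((bf16 & 0xFFFF) << 16);
    }

    public static void main(String[] args) {
        float[] samples = {1.0f, 3.14159f, 1.0e38f};
        for (float x : samples) {
            float roundTripped = fromBfloat16(toBfloat16(x));
            System.out.println(x + " -> " + roundTripped);
        }
    }
}
```

The value `3.14159f` comes back as `3.140625f`, showing the reduced mantissa, while `1.0e38f` stays finite because the exponent width is unchanged from float32.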

## 5. Hexagon DSP Backend (nd4j-hexagon)

`nd4j-hexagon` targets Qualcomm Hexagon DSPs available on Snapdragon SoCs. It dispatches through the **Qualcomm Neural Network (QNN)** runtime, which in turn can use SNPE (Snapdragon Neural Processing Engine) or the newer QNN SDK.

### Java Components (6 files)

**`HexagonBackend`** — `Nd4jBackend` subclass. `isAvailable()` probes for the QNN shared libraries (`libQnnHtp.so`, `libQnnSystem.so`) via JNI. Returns available when running on a Snapdragon device with Hexagon support and the QNN runtime installed.

**`HexagonExecutioner`** — Dispatches ops to the QNN runtime. Converts ND4J op calls into Hexagon network graph operations and submits them for DSP execution.

### Native Components

**`HexagonGraphBackend`** — C++ entry point. Coordinates graph-level compilation and execution.

**`HexagonIRBuilder`** — Generates Hexagon network graph descriptors from op calls. Each ND4J op is translated into the corresponding QNN graph node type.

**`HexagonRuntimeManager`** — Manages QNN context handles, Hexagon DSP session lifecycle, and memory handles for DSP-accessible buffers.

### Setup

QNN setup requires:

1. A Snapdragon 8 Gen 2 or later SoC (or compatible Hexagon DSP).
2. Qualcomm QNN SDK installed, with `libQnnHtp.so` on `LD_LIBRARY_PATH`.
3. The `nd4j-hexagon` artifact on the classpath.

```xml
<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-hexagon</artifactId>
  <version>1.0.0-rewrite</version>
</dependency>
```

```bash
export LD_LIBRARY_PATH=/opt/qcom/qnn/lib:$LD_LIBRARY_PATH
java -jar myapp.jar
```

Check the active backend:

```java
System.out.println(Nd4j.getBackend().getClass().getName());
// org.nd4j.linalg.hexagon.HexagonBackend
```

### Quantization

Hexagon DSPs deliver peak performance on INT8 and INT16 fixed-point operations. The QNN backend supports PTQ (post-training quantization) directly in the `HexagonIRBuilder` layer. Inputs are quantized per-tensor; the quantization parameters (scale and zero-point) are derived from calibration data passed before compilation.
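
For readers unfamiliar with per-tensor quantization, the sketch below shows one common affine INT8 scheme: a scale and zero-point are derived from the minimum and maximum of the calibration data. This is an assumption for illustration; the source does not state which scheme `HexagonIRBuilder` uses, and the class and method names here are hypothetical.

```java
final class Int8QuantSketch {

    final float scale;
    final int zeroPoint;

    // Derives per-tensor parameters from calibration values.
    Int8QuantSketch(float[] calibration) {
        float min = 0f;
        float max = 0f; // the range always includes 0 so that 0.0 is exactly representable
        for (float v : calibration) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        float range = max - min;
        this.scale = range == 0f ? 1f : range / 255f;         // 256 INT8 levels span the range
        this.zeroPoint = clamp(Math.round(-128f - min / scale));
    }

    static int clamp(int q) {
        return Math.max(-128, Math.min(127, q));
    }

    // Maps a real value to its INT8 code.
    byte quantize(float x) {
        return (byte) clamp(Math.round(x / scale) + zeroPoint);
    }

    // Maps an INT8 code back to an approximate real value.
    float dequantize(byte q) {
        return (q - zeroPoint) * scale;
    }

    public static void main(String[] args) {
        Int8QuantSketch q = new Int8QuantSketch(new float[] {-1f, 0f, 3f});
        System.out.println("scale=" + q.scale + " zeroPoint=" + q.zeroPoint);
        for (float x : new float[] {-1f, 0f, 0.5f, 3f}) {
            System.out.println(x + " -> " + q.quantize(x) + " -> " + q.dequantize(q.quantize(x)));
        }
    }
}
```

Values inside the calibrated range round-trip with an error of at most one quantization step; values outside it saturate at -128 or 127, which is why representative calibration data matters.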

## 6. ZLUDA (AMD and Intel GPU Support)

ZLUDA is a drop-in CUDA compatibility layer that translates CUDA API calls at runtime to AMD HIP/ROCm (for AMD GPUs) or Intel Level Zero (for Intel GPUs). The rewrite integrates ZLUDA support into the `nd4j-cuda` backend so that AMD and Intel GPUs become supported targets without requiring a separate backend JAR.

### How ZLUDA Works

ZLUDA intercepts calls to the CUDA driver library (`libcuda.so`, `nvcuda.dll`) and redirects them to the appropriate native GPU SDK. From ND4J's perspective, the CUDA backend loads and operates normally; ZLUDA handles the translation transparently.

* **AMD GPUs (HIP/ROCm):** cuDNN calls are translated to **MIOpen** equivalents. cuBLAS calls are translated to **rocBLAS**.
* **Intel GPUs (Level Zero):** cuDNN calls are translated to **oneDNN** equivalents.

### Auto-Download

When the CUDA backend initializes and detects an AMD or Intel GPU, the native side (`ZludaConfiguration.cmake`) automatically downloads the appropriate ZLUDA build for the detected hardware. No manual installation of ZLUDA is required.

### Build Configuration

```cmake
# ZludaConfiguration.cmake — automatically included when targeting AMD/Intel
# Sets:
#   ZLUDA_ENABLED=ON
#   ZLUDA_TARGET=HIP    # or LEVEL_ZERO
#   CUDNN_SUBSTITUTE=MIOpen  # or oneDNN
```

From the Java side there is no configuration change; just add the CUDA backend dependency and target an AMD or Intel GPU machine.

```xml
<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-cuda-12.9-platform</artifactId>
  <version>1.0.0-rewrite</version>
</dependency>
```

### Limitations

ZLUDA translation is not zero-overhead. Workloads that are heavily bottlenecked on cuBLAS or cuDNN will see near-native performance because ROCm and oneDNN are mature. Workloads that use custom CUDA kernels (some advanced sampler or attention kernels) may fall back to a slower translated path.

## 7. Snapdragon X (SDX) Cross-Device Dispatch

The Snapdragon X backend (`nd4j-sdx`) is a cross-device dispatch backend for Snapdragon X Elite and Snapdragon X Plus platforms. Rather than implementing a new execution engine, SDX routes ops to the most appropriate available device on the SoC: the ARM CPU, the Hexagon DSP, or the Adreno GPU, based on op type and tensor size heuristics.

Build support is provided by `BuildSDX.cmake`. No separate Java configuration is required; the SDX backend registers itself and its routing logic is internal.

```xml
<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-sdx</artifactId>
  <version>1.0.0-rewrite</version>
</dependency>
```

## 8. ARM Compute Library (ACL) Backend

ARM Compute Library is a highly optimized collection of functions for ARM CPUs (Cortex-A) and Mali GPUs. The rewrite adds approximately 124 new op implementations under the ACL platform backend. These are registered through the `DECLARE_PLATFORM` / `PLATFORM_IMPL` / `PLATFORM_CHECK` macro system and dispatch on `ENGINE_CPU` when running on ARM hardware.

### Op Coverage

#### Activations

| Op       | Notes                                 |
| -------- | ------------------------------------- |
| relu     | Standard and leaky variants           |
| elu      | Exponential linear unit               |
| gelu     | Gaussian error linear unit            |
| selu     | Scaled exponential linear unit        |
| sigmoid  | Logistic sigmoid                      |
| silu     | Sigmoid linear unit (x \* sigmoid(x)) |
| softmax  | Row-wise softmax                      |
| softplus | log(1 + exp(x))                       |
| swish    | x \* sigmoid(beta \* x)               |
| tanh     | Hyperbolic tangent                    |

#### Reductions

| Op           | Notes                              |
| ------------ | ---------------------------------- |
| reduce\_max  | Reduce to max along specified axes |
| reduce\_mean | Reduce to mean                     |
| reduce\_min  | Reduce to min                      |
| reduce\_prod | Reduce to product                  |
| reduce\_sum  | Reduce to sum                      |

#### Convolutions and Attention

| Op                        | Notes                                           |
| ------------------------- | ----------------------------------------------- |
| conv1d                    | 1-D convolution                                 |
| depthwiseConv2d           | Depthwise separable 2-D convolution             |
| grouped\_query\_attention | Multi-head attention with grouped queries (GQA) |

#### Normalization

| Op             | Notes                                         |
| -------------- | --------------------------------------------- |
| batchnorm      | Batch normalization (inference and training)  |
| instance\_norm | Instance normalization                        |
| layer\_norm    | Layer normalization                           |
| l2\_normalize  | L2 normalization along specified axis         |
| rms\_norm      | Root mean square normalization (LLM-specific) |

#### LLM-Specific

| Op        | Notes                      |
| --------- | -------------------------- |
| rope      | Rotary position embeddings |
| rms\_norm | See Normalization above    |

#### Embeddings and Gather

| Op                | Notes                                      |
| ----------------- | ------------------------------------------ |
| embedding\_lookup | Embedding table lookup                     |
| gather            | Gather slices along an axis                |
| gather\_nd        | Gather slices at multi-dimensional indices |

#### Scatter and Shape Ops

All scatter variants (`scatter_add`, `scatter_update`, `scatter_mul`, etc.) and all shape manipulation ops (`reshape`, `transpose`, `squeeze`, `unsqueeze`, `tile`, `repeat`, `stack`, `unstack`, `split`, `concat`) are covered.

#### Binary and Comparison Ops

All arithmetic binary ops and all comparison ops (`equal`, `not_equal`, `greater`, `greater_equal`, `less`, `less_equal`) are covered by ACL implementations.

### Maven Setup

ACL support is included in the ARM64 variant of `nd4j-native`. On AArch64 Linux or macOS Apple Silicon, ACL ops are used automatically when ARM Compute Library is detected.

```xml
<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-native</artifactId>
  <version>1.0.0-rewrite</version>
  <classifier>linux-arm64</classifier>
</dependency>
```

## 9. Apple Accelerate Backend

The Apple Accelerate framework provides hardware-optimized math routines on macOS and iOS. The rewrite adds 28 new op implementations using Accelerate APIs. These are active on the `macosx-arm64` and `macosx-x86_64` classifiers of `nd4j-native`.

### Op Coverage

#### BLAS

| Op                   | Accelerate API |
| -------------------- | -------------- |
| mmul (matrix-matrix) | `cblas_sgemm`  |
| mmul (matrix-vector) | `cblas_sgemv`  |
| dot                  | `cblas_sdot`   |
| nrm2                 | `cblas_snrm2`  |
| scale                | `cblas_sscal`  |

#### FFT

| Op                  | Accelerate API  |
| ------------------- | --------------- |
| fft (real, radix-2) | `vDSP_fft_zrip` |

#### Convolutions

| Op     | Accelerate API           |
| ------ | ------------------------ |
| conv1d | `vDSP_conv`              |
| conv2d | `vDSP_conv` (via tiling) |

#### Normalization

| Op          | Accelerate API                      |
| ----------- | ----------------------------------- |
| layer\_norm | `vDSP` vector mean and variance ops |
| batchnorm   | `vDSP` vector mean and variance ops |

#### Element-Wise Math

| Op   | Accelerate API |
| ---- | -------------- |
| sin  | `vvsin`        |
| cos  | `vvcos`        |
| exp  | `vvexp`        |
| log  | `vvlog`        |
| sqrt | `vvsqrt`       |
| pow  | `vvpow`        |

#### Additional Coverage

Pooling operations (max pool, avg pool), comparison ops, cumulative sum (`cumsum`), cumulative product (`cumprod`), rounding ops (`floor`, `ceil`, `round`), conditional selection (`where`), gradient accumulation, and linear algebra operations (`svd`, `solve`) are all provided by Accelerate-backed implementations.

### Maven Setup

Accelerate support is included in the macOS classifier variants automatically. No additional dependency is required beyond `nd4j-native-platform` or the `macosx-arm64` / `macosx-x86_64` classifier.

## 10. llama.cpp / GGML Backend

The rewrite introduces a 60-file native backend for executing GGML (the tensor library underlying llama.cpp) models directly from ND4J. This backend enables loading and running quantized LLM weights (GGUF format) on CPU, Metal, and CUDA without converting them to ND4J's native format first.

The GGML backend sits alongside the standard `nd4j-native` execution path. When a GGUF model is loaded, ops that GGML can handle natively (matrix multiplication with quantized weights, attention, feed-forward blocks) are dispatched to the GGML execution path; the result arrays are then materialized as standard `INDArray` instances for the rest of the DL4J graph.

```xml
<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-ggml</artifactId>
  <version>1.0.0-rewrite</version>
</dependency>
```

## 11. MLIR JIT, Apple MPS, MIOpen, and oneDNN

### MLIR JIT

`MlirCpuGraphBackend` is a new native backend module that compiles SameDiff graphs to MLIR (Multi-Level Intermediate Representation) and executes them via the MLIR Linalg and arith dialects. This path is used when the native side detects that JIT compilation via MLIR would be advantageous (e.g., operator fusion across a large subgraph).

The MLIR JIT path is transparent to the Java layer. Ops dispatched through SameDiff may be compiled into MLIR programs and executed; the results are returned as standard `INDArray` values.

### Apple Metal Performance Shaders (MPS)

`nd4j-mps` targets Apple Silicon GPUs via Metal Performance Shaders. MPS provides GPU-accelerated matrix operations and neural network primitives on M1/M2/M3 Macs. The backend uses the Metal command queue for dispatch and shares the zero-copy buffer model with the CPU backend on unified-memory Apple Silicon systems.

```xml
<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-mps</artifactId>
  <version>1.0.0-rewrite</version>
</dependency>
```

MPS is selected automatically on `macosx-arm64` when the MPS framework is available and the backend JAR is on the classpath.

### MIOpen (AMD GPU)

MIOpen is AMD's alternative to cuDNN. When ZLUDA routes CUDA traffic to HIP/ROCm, cuDNN calls are translated to MIOpen. The `nd4j-cuda` backend plus ZLUDA is the supported path; there is no separate `nd4j-miopen` artifact.

### oneDNN (Intel, formerly MKL-DNN)

oneDNN provides optimized operator implementations for Intel CPUs and Intel GPUs. In the rewrite, oneDNN is updated for DSP integration — the oneDNN execution path can be called from the Hexagon SDX dispatch layer when the target is an Intel CPU on a mixed platform. On x86 Intel CPUs, oneDNN is accessed through the existing MKL integration in `nd4j-native`; see [CPU Backend](https://github.com/KonduitAI/deeplearning4j-docs/blob/en-1.0.0-rewrite/docs/m2.1/nd4j/backends/cpu/README.md) for setup.

### OpenVINO

OpenVINO integration allows `nd4j-native` on Intel hardware to dispatch inference graphs through the OpenVINO runtime. This is activated when `libopenvino.so` is detected on `LD_LIBRARY_PATH` and the model has been exported in a compatible format.

## 12. Multi-Backend Infrastructure

PR #10447 introduces a shared infrastructure layer used by all backends. These classes are in `nd4j-api` and are implemented by each backend.

### DeviceType and DeviceDescriptor

`DeviceType` is an enum with values for every supported target:

```java
DeviceType.CPU
DeviceType.CUDA
DeviceType.ROCM
DeviceType.TPU
DeviceType.HEXAGON
DeviceType.OPENCL
DeviceType.METAL
DeviceType.VULKAN
```

`DeviceDescriptor` is the base interface for describing a specific device. The concrete implementations are:

* **`CudaDeviceDescriptor`** — wraps a CUDA device index and CUDA stream handle
* **`CpuDeviceDescriptor`** — wraps a CPU thread identifier and NUMA node
* **`StubDeviceDescriptor`** — no-op implementation used in unit tests

### CudaDeviceContextProvider

`CudaDeviceContextProvider` consolidates the 15+ scattered device-switch call sites that existed in the previous codebase into a single canonical path. All code that needs to switch the active CUDA device now goes through this provider.

```java
// Previous pattern (scattered, now replaced):
// JCudaDriver.cuCtxSetCurrent(ctx);
// ... work ...
// JCudaDriver.cuCtxSetCurrent(prevCtx);

// New canonical pattern:
DeviceContextProvider provider = new CudaDeviceContextProvider();
try (DeviceContext ctx = provider.acquireContext(deviceDescriptor)) {
    // all work here; ctx.close() restores previous device automatically
}
```

### DeviceMemoryManager

`DeviceMemoryManager` provides per-device allocation tracking with configurable caps:

```java
import org.nd4j.linalg.device.DeviceMemoryManager;

DeviceMemoryManager mgr = DeviceMemoryManager.getInstance();

// Get total allocated bytes on device 0
long allocatedBytes = mgr.getAllocatedBytes(DeviceType.CUDA, 0);

// Set a cap for device 1 (16 GB)
mgr.setAllocationCap(DeviceType.CUDA, 1, 16L * 1024 * 1024 * 1024);

// Check remaining headroom
long available = mgr.getRemainingCapacity(DeviceType.CUDA, 1);
```

When an allocation would exceed the cap, `DeviceMemoryManager` throws `DeviceOutOfMemoryException` with a clear message showing current usage and the configured limit, rather than propagating an opaque native OOM error.
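
The cap check can be pictured with a small, self-contained sketch. It is not the `DeviceMemoryManager` implementation: the class `AllocationCapSketch`, its string device keys, and the use of `IllegalStateException` in place of `DeviceOutOfMemoryException` are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative per-device cap tracker; names are hypothetical, not the ND4J API. */
final class AllocationCapSketch {
    private final Map<String, Long> allocated = new HashMap<>();
    private final Map<String, Long> caps = new HashMap<>();

    void setCap(String device, long capBytes) {
        caps.put(device, capBytes);
    }

    long allocatedBytes(String device) {
        return allocated.getOrDefault(device, 0L);
    }

    /** Headroom left under the cap; devices without a cap report a very large value. */
    long remaining(String device) {
        return caps.getOrDefault(device, Long.MAX_VALUE) - allocatedBytes(device);
    }

    /** Records an allocation, or fails with current usage and the limit in the message. */
    void allocate(String device, long bytes) {
        long used = allocatedBytes(device);
        long cap = caps.getOrDefault(device, Long.MAX_VALUE);
        if (bytes > cap - used) { // overflow-safe form of "used + bytes > cap"
            throw new IllegalStateException("Device " + device + ": requested " + bytes
                    + " bytes, in use " + used + ", cap " + cap);
        }
        allocated.put(device, used + bytes);
    }

    void release(String device, long bytes) {
        allocated.put(device, Math.max(0L, allocatedBytes(device) - bytes));
    }
}
```

Rejecting the request before touching the native allocator is what makes a readable error message possible; a failed request leaves the tracked usage unchanged.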

### DeviceContextProvider and DeviceContext

`DeviceContextProvider` is an interface implemented by each backend:

```java
public interface DeviceContextProvider {
    DeviceContext acquireContext(DeviceDescriptor descriptor);
}

public interface DeviceContext extends AutoCloseable {
    DeviceDescriptor getDescriptor();
    Object getNativeStreamHandle();  // CUDA stream, Metal command queue, etc.
    void close();  // restores previous device context
}
```

Using `try-with-resources` on `DeviceContext` guarantees that the previous context is always restored, even if the work block throws.
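
The restore guarantee comes from Java's `AutoCloseable` contract rather than from anything device-specific. The following minimal sketch shows the pattern with a thread-local integer standing in for the active device; `ScopedDeviceSketch` and its methods are hypothetical and do not call any driver API.

```java
/** Illustrative stand-in for a scoped device switch; not the ND4J DeviceContext API. */
final class ScopedDeviceSketch implements AutoCloseable {
    // Hypothetical per-thread "active device" index; a real backend would call the driver here.
    private static final ThreadLocal<Integer> ACTIVE = ThreadLocal.withInitial(() -> 0);

    private final int previous;

    private ScopedDeviceSketch(int device) {
        this.previous = ACTIVE.get();   // remember what was active before the switch
        ACTIVE.set(device);
    }

    static ScopedDeviceSketch acquire(int device) {
        return new ScopedDeviceSketch(device);
    }

    static int activeDevice() {
        return ACTIVE.get();
    }

    @Override
    public void close() {
        ACTIVE.set(previous);           // restore on normal exit and on exceptions alike
    }

    /** Runs work that always fails on the given device, then reports the active device. */
    static int activeAfterFailingWork(int device) {
        try (ScopedDeviceSketch ctx = acquire(device)) {
            throw new IllegalStateException("simulated failure on device " + activeDevice());
        } catch (IllegalStateException e) {
            return activeDevice();      // close() has already run by the time we get here
        }
    }
}
```

Because each scope stores the device that was active when it was opened, nested scopes unwind in the right order without a separate stack.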

### MultiBackendWorkspace

`MultiBackendWorkspace` extends the existing ND4J workspace concept to span multiple devices. It maintains MSI (Modified/Shared/Invalid) coherence: when an array is accessed on a device where it was not most recently written, the workspace layer transparently copies the data before the access proceeds.

```java
import org.nd4j.linalg.device.MultiBackendWorkspace;

try (MultiBackendWorkspace ws = MultiBackendWorkspace.open("train-step")) {
    INDArray x = Nd4j.rand(DataType.FLOAT, 1024, 1024);  // allocated on default device
    ws.migrateToDevice(x, DeviceType.TPU, 0);           // move to TPU 0
    INDArray z = x.mmul(x.transpose());                  // executed on TPU
    ws.migrateToDevice(z, DeviceType.CPU, 0);            // bring result to CPU
    System.out.println(z.meanNumber());
}
```
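
The copy-on-stale-access rule can be modeled in a few lines. This sketch is an assumption about the bookkeeping, not the workspace implementation: `CoherenceSketch` tracks, per array name, which devices hold a valid copy, and counts how many transfers the rule would trigger.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative coherence bookkeeping; names are hypothetical. */
final class CoherenceSketch {
    private final Map<String, Set<String>> validCopies = new HashMap<>();
    private int copies = 0;

    /** A write leaves the writing device as the only holder of valid data. */
    void write(String array, String device) {
        Set<String> holders = new HashSet<>();
        holders.add(device);
        validCopies.put(array, holders);
    }

    /** A read first copies the data if this device holds no valid copy. */
    void read(String array, String device) {
        Set<String> holders = validCopies.get(array);
        if (holders == null) {
            throw new IllegalArgumentException("unknown array: " + array);
        }
        if (holders.add(device)) {
            copies++; // add() returns true only when the device was missing
        }
    }

    int copyCount() {
        return copies;
    }
}
```

Repeated reads on the same device cost one transfer at most; a later write elsewhere invalidates the copy and the next read pays again.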

### DeviceWorkspaceManager

`DeviceWorkspaceManager` is a thread-local registry of open `MultiBackendWorkspace` instances. Each thread has its own workspace stack; opening a workspace on thread A does not affect thread B.

```java
DeviceWorkspaceManager.getInstance().openWorkspace("scope-name");
try {
    // ... work ...
} finally {
    DeviceWorkspaceManager.getInstance().closeWorkspace("scope-name");  // runs even if the work throws
}
```
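
Thread isolation of this kind is typically built on `ThreadLocal`. The sketch below assumes a per-thread stack of scope names and strict last-opened, first-closed ordering; `WorkspaceStackSketch` is hypothetical and the ordering rule is an assumption, since this page does not specify how out-of-order closes are handled.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative per-thread scope stack; not the DeviceWorkspaceManager implementation. */
final class WorkspaceStackSketch {
    private static final ThreadLocal<Deque<String>> STACK =
            ThreadLocal.withInitial(ArrayDeque::new);

    static void open(String name) {
        STACK.get().push(name);
    }

    /** Closes the innermost scope; rejects a name that is not on top of the stack. */
    static void close(String name) {
        Deque<String> stack = STACK.get();
        if (!name.equals(stack.peek())) {
            throw new IllegalStateException("'" + name + "' is not the innermost open scope");
        }
        stack.pop();
    }

    static int depth() {
        return STACK.get().size();
    }

    /** Opens one scope on a fresh thread and reports the depth observed there. */
    static int depthOnOtherThread() {
        int[] seen = new int[1];
        Thread worker = new Thread(() -> {
            open("other");
            seen[0] = depth();
        });
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }
}
```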

### DeviceRoutingConfiguration and MultiGpuTracer

`DeviceRoutingConfiguration` allows the application to specify routing rules — which device types are eligible for which op categories, and what fallback order to use when the preferred device is unavailable:

```java
DeviceRoutingConfiguration config = new DeviceRoutingConfiguration.Builder()
    .preferDevice(OpCategory.MATRIX_MULTIPLY, DeviceType.CUDA)
    .fallbackDevice(OpCategory.MATRIX_MULTIPLY, DeviceType.CPU)
    .preferDevice(OpCategory.ATTENTION, DeviceType.TPU)
    .build();

DeviceAwareOpExecutioner executioner = new DeviceAwareOpExecutioner(config);
```
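
How a prefer-then-fallback rule resolves at dispatch time can be sketched without any ND4J types. The class and enums below are hypothetical stand-ins, and the rule that "earlier entries win" is an assumption about how preference and fallback combine.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Illustrative routing table; not the DeviceRoutingConfiguration implementation. */
final class RoutingRulesSketch {
    enum Device { CPU, CUDA, TPU }
    enum Category { MATRIX_MULTIPLY, ATTENTION }

    private final Map<Category, List<Device>> order = new EnumMap<>(Category.class);

    /** Appends a device to the category's list; earlier entries are preferred. */
    RoutingRulesSketch route(Category category, Device device) {
        order.computeIfAbsent(category, k -> new ArrayList<>()).add(device);
        return this;
    }

    /** Returns the first listed device that is currently available, if any. */
    Optional<Device> resolve(Category category, Set<Device> available) {
        return order.getOrDefault(category, Collections.<Device>emptyList()).stream()
                .filter(available::contains)
                .findFirst();
    }
}
```

A category with no eligible device resolves to an empty `Optional`, which leaves the caller to decide between failing fast and applying a global default.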

`MultiGpuTracer` is a diagnostic utility that logs device transitions, allocation events, and cross-device copies during a traced execution window:

```java
try (MultiGpuTracer tracer = MultiGpuTracer.start()) {
    // ... operations ...
} // prints trace summary on close
```

### DeviceAwareNDArrayFactory and BackendRoutingStrategy

`DeviceAwareNDArrayFactory` is an `NDArrayFactory` implementation that consults the active `BackendRoutingStrategy` when creating arrays, routing allocation to the appropriate device:

```java
INDArray x = Nd4j.create(DataType.FLOAT, 1024, 1024);
// If routing strategy says CUDA for this size, x is allocated on GPU
// If it says TPU, x is an XLA buffer
// Application code does not change
```

## 13. Device Auto-Detection

When the backend is not forced via `backend.type`, ND4J probes available hardware in order of priority:

1. **CUDA:** calls `cudaGetDeviceCount()`. If one or more CUDA devices are found and the CUDA runtime is the expected version, `nd4j-cuda` is selected.
2. **Hexagon:** probes for `libQnnHtp.so` on `LD_LIBRARY_PATH`.
3. **TPU:** runs the `PjrtClientManager::HasTpuDevice()` JNI probe. Requires the `TPU_NAME` environment variable to be set and `libtpu.so` to be accessible.
4. **CPU:** always available.

To inspect which backend was selected at runtime:

```java
System.out.println(Nd4j.getBackend().getClass().getName());

// Check device type via DeviceDescriptor
DeviceDescriptor desc = Nd4j.getBackend().getActiveDeviceDescriptor();
System.out.println(desc.getDeviceType());   // e.g. DeviceType.CUDA
System.out.println(desc.getDeviceIndex());  // e.g. 0
```
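
The selection rule itself, "highest priority among the backends that report themselves available", reduces to a filter and a maximum. The sketch below uses a hypothetical `Candidate` value type in place of real `Nd4jBackend` instances, so it runs without any backend on the classpath.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Illustrative priority-based selection; not the ND4J loader code. */
final class BackendSelectionSketch {
    /** Hypothetical summary of one registered backend. */
    static final class Candidate {
        final String name;
        final int priority;
        final boolean available;

        Candidate(String name, int priority, boolean available) {
            this.name = name;
            this.priority = priority;
            this.available = available;
        }
    }

    /** Returns the available candidate with the highest priority, if any. */
    static Optional<Candidate> select(List<Candidate> candidates) {
        return candidates.stream()
                .filter(c -> c.available)
                .max(Comparator.comparingInt(c -> c.priority));
    }
}
```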

## 14. GraphExecutionMode Reference

SameDiff graph execution supports 17 execution modes. Modes are set per-graph and control the tradeoff between compilation overhead, runtime speed, device placement, and fallback behavior. This is documented in full in the [DSP Execution Engine](https://github.com/KonduitAI/deeplearning4j-docs/blob/en-1.0.0-rewrite/docs/m2.1/nd4j/samediff/dsp/README.md) page; a condensed reference follows.

| Mode             | Description                                                      |
| ---------------- | ---------------------------------------------------------------- |
| `EAGER`          | Execute each op immediately as it is added; no graph compilation |
| `GRAPH`          | Build full graph first, then execute; enables fusion             |
| `GRAPH_CACHED`   | `GRAPH` with compiled program cached by input shapes             |
| `JIT_CPU`        | JIT-compile graph for CPU; uses MLIR Linalg/arith                |
| `JIT_CUDA`       | JIT-compile for CUDA; produces PTX                               |
| `JIT_TPU`        | JIT-compile for TPU; produces HLO programs                       |
| `JIT_HEXAGON`    | JIT-compile for Hexagon DSP; produces QNN graph                  |
| `STREAMING`      | Process inputs as a stream; constant memory footprint            |
| `BATCHED`        | Accumulate inputs and execute in one batched pass                |
| `DISTRIBUTED`    | Partition graph across multiple devices                          |
| `ONNX_EXPORT`    | Execute and simultaneously export to ONNX                        |
| `ONNX_IMPORT`    | Execute an imported ONNX graph                                   |
| `DEBUG`          | Execute with per-op shape and value checks                       |
| `PROFILE`        | Execute with timing and memory usage instrumentation             |
| `FALLBACK_CPU`   | Attempt preferred device; fall back to CPU on failure            |
| `FALLBACK_CHAIN` | Attempt preferred device, then each fallback in priority order   |
| `DRY_RUN`        | Trace execution without computing output values                  |

Fallback chain example:

```java
SameDiff sd = SameDiff.create();
sd.setExecutionMode(GraphExecutionMode.FALLBACK_CHAIN);
// Attempts CUDA → Hexagon → TPU → CPU in priority order
```
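
Conceptually, a fallback chain is a loop that tries each device in turn and stops at the first success. The following sketch illustrates that control flow only; `FallbackChainSketch` is hypothetical, devices are plain strings, and treating any `RuntimeException` as "try the next device" is an assumption.

```java
import java.util.List;
import java.util.function.Function;

/** Illustrative fallback loop; not the SameDiff execution engine. */
final class FallbackChainSketch {
    /** Tries each device in order and returns the first successful result. */
    static <T> T run(List<String> devices, Function<String, T> attempt) {
        RuntimeException last = null;
        for (String device : devices) {
            try {
                return attempt.apply(device);
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next device
            }
        }
        throw new IllegalStateException("no device in the chain succeeded", last);
    }
}
```

Keeping the last failure as the cause of the final exception preserves the most recent diagnostic when every device in the chain fails.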

## 15. Kernel Auto-Tuning and Dynamic Plugin Loading (ADR 0055, 0058)

ND4J includes a runtime kernel selection system that benchmarks available backend helpers for each op/shape combination and routes to the fastest one. The system is opt-in via environment variable or Java API and persists benchmark results across JVM restarts.

### How It Works

When an op executes with multiple available helpers (for example, both a cuDNN helper and a CUDA helper for `conv2d`), the dispatch layer:

1. Checks `SD_KERNEL_FORCE_ENGINE` — if set, only that engine is considered.
2. Filters out engines listed in `SD_KERNEL_DISABLE_ENGINES`.
3. If `SD_KERNEL_AUTOTUNE=1`, benchmarks all remaining usable helpers and caches the fastest one keyed by op hash + shape bucket + data type.
4. Otherwise falls back to the first usable helper (original behavior).

Shape bucketing (powers of 2) keeps the cache compact; benchmark results persist in the file specified by `SD_KERNEL_CACHE_PATH`.
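
The dispatch steps and the bucketing rule can be sketched together. This is a simplified model under stated assumptions: timings are supplied as a map instead of being measured, dimensions are rounded up to the next power of two (the text only says "powers of 2"), and a forced engine takes precedence over the disable list. `KernelSelectionSketch` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Illustrative helper selection; not the libnd4j dispatch layer. */
final class KernelSelectionSketch {
    /** Rounds a dimension up to the next power of two (1 for inputs below 2). */
    static long bucket(long dim) {
        if (dim <= 1) {
            return 1;
        }
        return Long.highestOneBit(dim - 1) << 1;
    }

    /**
     * Picks an engine following the four dispatch steps.
     * timingsNanos preserves helper registration order; forced may be null.
     * Returns null when no helper qualifies.
     */
    static String select(LinkedHashMap<String, Long> timingsNanos, String forced,
                         Set<String> disabled, boolean autotune) {
        if (forced != null) {                       // step 1: forced engine only
            return timingsNanos.containsKey(forced) ? forced : null;
        }
        String first = null;
        String fastest = null;
        for (Map.Entry<String, Long> e : timingsNanos.entrySet()) {
            if (disabled.contains(e.getKey())) {    // step 2: drop disabled engines
                continue;
            }
            if (first == null) {
                first = e.getKey();
            }
            if (fastest == null || e.getValue() < timingsNanos.get(fastest)) {
                fastest = e.getKey();
            }
        }
        return autotune ? fastest : first;          // step 3 or step 4
    }
}
```

With bucketing, shapes such as 1000 and 1024 share one cache entry, which is what keeps the persisted benchmark file small.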

### Environment Variables

| Variable                    | Description                                       | Example                       |
| --------------------------- | ------------------------------------------------- | ----------------------------- |
| `SD_KERNEL_AUTOTUNE`        | Enable runtime auto-tuning                        | `1`                           |
| `SD_KERNEL_FORCE_ENGINE`    | Force a single engine for all ops                 | `cuda`, `onednn`, `cpu`       |
| `SD_KERNEL_DISABLE_ENGINES` | Comma-separated list of engines to skip           | `onednn,mps`                  |
| `SD_KERNEL_CACHE_PATH`      | Path for persistent benchmark cache               | `/tmp/nd4j_kernel_cache.json` |
| `SD_KERNEL_WARMUP_RUNS`     | Warmup iterations before benchmarking             | `2`                           |
| `SD_KERNEL_BENCHMARK_RUNS`  | Benchmark iterations per helper                   | `5`                           |
| `SD_KERNEL_VERBOSE`         | Verbose logging of kernel selection               | `1`                           |
| `SD_KERNEL_PLUGIN_PATH`     | Colon-separated paths for custom kernel plugins   | `/opt/plugins:/usr/lib/nd4j`  |
| `SD_KERNEL_PLUGIN_AUTO`     | Auto-load all plugins found in plugin path        | `1`                           |
| `SD_KERNEL_STRATEGY`        | Selection strategy (`FASTEST`, `FIRST_AVAILABLE`) | `FASTEST`                     |

### Java API

```java
import org.nd4j.linalg.api.ops.executioner.KernelManager;
import org.nd4j.autodiff.samediff.config.KernelConfiguration;
import org.nd4j.autodiff.samediff.SameDiff;

// Fine-grained configuration through SameDiff
SameDiff sd = SameDiff.create();
sd.kernelConfiguration()
    .preferCuda()                          // prefer CUDA for all ops
    .disableEngine(Engine.ONEDNN)          // skip oneDNN globally
    .forConvolutions().useCudnn()          // cuDNN for all conv ops
    .and()
    .forLinearAlgebra().useOneDnn()        // oneDNN for matmul/gemm
    .and()
    .apply();

// Quick presets (GPU_OPTIMIZED, INTEL_OPTIMIZED, APPLE_SILICON_OPTIMIZED, etc.)
sd.kernelConfiguration()
    .usePreset(KernelConfiguration.Preset.GPU_OPTIMIZED)
    .apply();

// Query what kernels are available for an op
KernelManager km = KernelManager.getInstance();
km.searchOperations("conv*").forEach(op -> {
    System.out.println(op.getOpName());
    op.getAvailableKernels().forEach(k ->
        System.out.println("  " + k.getEngine() + " enabled=" + k.isEnabled()));
});
```

### Global enable/disable without SameDiff

```bash
# Force the CUDA engine for all ops during inference, disable oneDNN entirely
export SD_KERNEL_FORCE_ENGINE=cuda
export SD_KERNEL_DISABLE_ENGINES=onednn

# Enable auto-tuning and persist results
export SD_KERNEL_AUTOTUNE=1
export SD_KERNEL_CACHE_PATH=/var/cache/nd4j/kernels.json
```

### Dynamic Plugin Loading

Custom kernel implementations can be packaged as shared libraries and loaded at runtime without recompiling ND4J:

```java
// Load a single plugin explicitly at runtime
Nd4j.loadKernelPlugin("/opt/myorg/libmy_kernels.so");

// Or let ND4J auto-discover plugins in a directory
// (set SD_KERNEL_PLUGIN_PATH and SD_KERNEL_PLUGIN_AUTO=1)
```

A plugin must export a class derived from `SimpleKernelPlugin` and declare itself with the `SD_DECLARE_KERNEL_PLUGIN` macro. The plugin registers `PlatformHelper` subclasses for the ops it accelerates. ND4J calls `sd_plugin_api_version()` at load time to verify ABI compatibility; plugins compiled against an incompatible version of libnd4j are silently skipped.

```cpp
// Skeleton of a custom plugin (my_kernels.cpp)
#include <helpers/DynamicKernelLoader.h>
#include <ops/declarable/PlatformHelper.h>

class MyOptimizedConv2d : public PlatformHelper {
public:
    MyOptimizedConv2d() : PlatformHelper("conv2d", samediff::ENGINE_CPU) {}
    bool isUsable(graph::Context& ctx) override { /* shape/type checks */ return true; }
    Status invokeHelper(graph::Context& ctx) override { /* fast implementation */ return Status::OK; }
};

class MyKernelPlugin : public SimpleKernelPlugin {
public:
    MyKernelPlugin() : SimpleKernelPlugin("MyKernels", {1, 0, 0}) {}
    bool initialize() override {
        registerKernel("conv2d", samediff::ENGINE_CPU,
            []() { return new MyOptimizedConv2d(); }, /*priority=*/150);
        return true;
    }
};

SD_DECLARE_KERNEL_PLUGIN(MyKernelPlugin)
```

Compile on Linux:

```bash
g++ -shared -fPIC -o libmy_kernels.so my_kernels.cpp \
    -I/path/to/libnd4j/include -L/path/to/libnd4j/lib -lnd4j
```

## 16. OpenVINO CPU Graph Backend (ADR 0098)

OpenVINO is integrated as a `GraphBackend` in the CPU execution chain, providing broad op-level fusion for Intel hardware.

### CPU GraphBackend Chain

When DSP executes a SameDiff graph on CPU, it partitions the graph into segments and passes each segment through a priority-ordered chain of `GraphBackend` implementations:

```
MLX → OpenVINO → OneDNN → ACL → NNAPI → ArmHybrid → MLIR
```

OpenVINO sits before OneDNN in the chain because it covers approximately 200 ops from OpenVINO opset13, while OneDNN's graph fusion covers roughly 40 ops (primarily SDPA and related attention patterns). When the execution mode is `SLOT_BY_SLOT` the chain is bypassed entirely.

### Island (Mixed-Segment) Execution

Some segments contain ops that fall outside OpenVINO's opset13 (for example, SSM recurrence layers or custom ops). In these cases the segment is split into alternating "OV islands" (contiguous mappable ops compiled to `ov::InferRequest`) and "NativeRange" blocks (unmappable ops executed by `NativeSlotExecutor`). Execution interleaves compiled OV requests with native callbacks — no segment is forced to fall back entirely to unfused execution just because one op is unmappable.
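
The split into alternating blocks is a run-length partition over a "mappable" predicate. The sketch below shows only that partitioning step, with hypothetical names (`IslandSplitSketch`, `Block`) and op names as plain strings; compiling and executing the blocks is out of scope.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Illustrative segment partitioning; not the OpenVinoGraphBackend code. */
final class IslandSplitSketch {
    /** A contiguous run of op indices [start, end) that is entirely mappable or entirely native. */
    static final class Block {
        final boolean mappable;
        final int start;
        final int end;

        Block(boolean mappable, int start, int end) {
            this.mappable = mappable;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits ops into maximal runs that agree on the predicate, preserving order. */
    static List<Block> split(List<String> ops, Predicate<String> isMappable) {
        List<Block> blocks = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= ops.size(); i++) {
            boolean boundary = i == ops.size()
                    || isMappable.test(ops.get(i)) != isMappable.test(ops.get(start));
            if (boundary) {
                blocks.add(new Block(isMappable.test(ops.get(start)), start, i));
                start = i;
            }
        }
        return blocks;
    }
}
```

A single unmappable op in the middle of a segment therefore yields three blocks: the fused run before it, the native op on its own, and the fused run after it.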

### Runtime Configuration

`OpenVinoGraphBackend` configures `ov::Core` for single-request autoregressive decode:

| Setting          | Value                    | Rationale                                                                                                     |
| ---------------- | ------------------------ | ------------------------------------------------------------------------------------------------------------- |
| Performance mode | `LATENCY`                | Single stream, all threads intra-op. `THROUGHPUT` mode replicates model buffers, causing OOM on large models. |
| Hyper-threading  | Disabled                 | Reduces contention on shared cache                                                                            |
| CPU pinning      | Enabled                  | Prevents thread migration overhead                                                                            |
| Core selection   | P-cores only             | On Intel hybrid CPUs (12th gen+)                                                                              |
| Disk cache       | `~/.nd4j/openvino_cache` | Compiled model cache, avoids recompilation across runs                                                        |

### Compilation Caching

OpenVINO uses a two-level cache:

* **Segment-level LRU** — caches `ov::InferRequest` objects per segment. There is no entry cap (a prior 772-entry limit caused eviction thrashing on a 1913-slot Qwen model).
* **Topology-level** — shares a single `ov::CompiledModel` across transformer layers that have identical op structure. Each segment holds only a lightweight `ov::InferRequest` pointer into the shared compiled model. This significantly reduces memory when the same transformer block is repeated many times (e.g., 32 or 128 layers).
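
The topology-level idea, one compiled artifact per distinct op structure, can be sketched as a map keyed by a structural signature. Everything here is an assumption for illustration: `TopologyCacheSketch` is a hypothetical class, the signature is simply the op names joined with `>`, and the compiled model is an opaque type parameter.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Illustrative structure-keyed cache; not the OpenVINO backend's cache. */
final class TopologyCacheSketch<M> {
    private final Map<String, M> compiledBySignature = new HashMap<>();
    private int compilations = 0;

    /** Returns the shared artifact for this op sequence, compiling at most once per signature. */
    M get(List<String> opNames, Function<List<String>, M> compiler) {
        String signature = String.join(">", opNames);
        return compiledBySignature.computeIfAbsent(signature, s -> {
            compilations++;
            return compiler.apply(opNames);
        });
    }

    int compilations() {
        return compilations;
    }
}
```

In this model, 32 structurally identical layers trigger one compilation and share one artifact, which is the memory saving the topology-level cache is after. A real signature would also need to account for shapes and attributes, which this sketch ignores.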

### FP16 Handling

At startup, OpenVINO queries the CPU ISA via OneDNN (`dnnl_get_effective_cpu_isa`). If the CPU lacks AVX512-FP16 or AMX-FP16 (e.g., AMD Ryzen), all FP16 parameters are promoted to FP32 before inference. Promoted tensors are cached per island to avoid per-token allocation overhead.
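
For reference, promoting FP16 to FP32 is a lossless bit-level widening. The sketch below implements the standard IEEE 754 binary16-to-binary32 conversion in plain Java; it is not taken from the backend, which this page does not describe at that level, and `Fp16PromotionSketch` is a hypothetical name.

```java
/** Illustrative FP16 to FP32 widening; not the OpenVINO backend's promotion code. */
final class Fp16PromotionSketch {
    /** Converts one IEEE 754 binary16 bit pattern to a float. */
    static float halfToFloat(short half) {
        int bits = half & 0xFFFF;
        int sign = (bits >>> 15) << 31;
        int exponent = (bits >>> 10) & 0x1F;
        int mantissa = bits & 0x3FF;
        if (exponent == 0x1F) {                 // infinity or NaN
            return Float.intBitsToFloat(sign | 0x7F800000 | (mantissa << 13));
        }
        if (exponent == 0) {
            if (mantissa == 0) {                // signed zero
                return Float.intBitsToFloat(sign);
            }
            // Subnormal half: shift the mantissa up until its implicit bit appears.
            exponent = 1;
            while ((mantissa & 0x400) == 0) {
                mantissa <<= 1;
                exponent--;
            }
            mantissa &= 0x3FF;
        }
        // Re-bias the exponent from 15 (half) to 127 (float): 127 - 15 = 112.
        return Float.intBitsToFloat(sign | ((exponent + 112) << 23) | (mantissa << 13));
    }

    /** Widens a whole buffer of half-precision values. */
    static float[] promote(short[] halves) {
        float[] out = new float[halves.length];
        for (int i = 0; i < halves.length; i++) {
            out[i] = halfToFloat(halves[i]);
        }
        return out;
    }
}
```

Every FP16 value is exactly representable in FP32, so promotion changes memory footprint and compute path but not the stored parameter values.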

### When OpenVINO Is Selected

OpenVINO processes a segment when:

1. The graph is running in DSP mode (not `SLOT_BY_SLOT`).
2. The segment contains at least one op that maps to OpenVINO opset13.
3. OpenVINO runtime (`libopenvino.so`) is available on `LD_LIBRARY_PATH` or the system library path.

No user configuration is required to activate it — the backend chain is probed automatically. If `libopenvino.so` is not found, the chain falls through to OneDNN.

### Maven and Library Setup

No separate Maven artifact is needed. OpenVINO integration is part of `nd4j-native` on x86-64 Linux and Windows. Ensure the OpenVINO runtime is installed:

```bash
# Install OpenVINO runtime (Linux)
# https://docs.openvino.ai/latest/openvino_docs_install_guides_installing_openvino.html
source /opt/intel/openvino/setupvars.sh

# Verify ND4J can find it
java -Dorg.nd4j.log.initialization=true -jar myapp.jar
```

### Environment Variables

| Variable          | Description                                                                              |
| ----------------- | ---------------------------------------------------------------------------------------- |
| `LD_LIBRARY_PATH` | Must include OpenVINO runtime libs (`libopenvino.so`, `libopenvino_intel_cpu_plugin.so`) |
| `OV_CACHE_DIR`    | Override compilation cache directory (default: `~/.nd4j/openvino_cache`)                 |

## 17. Multi-Backend Dynamic Op Routing (ADR 0059)

The multi-backend system enables concurrent CPU and GPU execution within a single JVM process. Rather than selecting one backend at startup, it loads both `libnd4jcpu.so` and `libnd4jcuda.so` and routes each op execution to the device that best matches the data's current location.

### Core Components

**`MultiBackendNativeOpsHolder`** — loads and manages multiple `NativeOps` implementations at runtime. Probes for both CPU and CUDA backends; gracefully skips unavailable ones.

```java
import org.nd4j.nativeblas.MultiBackendNativeOpsHolder;
import org.nd4j.linalg.api.ops.executioner.DeviceAwareOpExecutioner;

// Enable multi-backend mode (call once before any Nd4j operations)
MultiBackendNativeOpsHolder.enableMultiBackend();

// Retrieve ops for a specific device type
NativeOps cpuOps = MultiBackendNativeOpsHolder.getInstance()
    .getOpsForDeviceType(DeviceType.CPU);
NativeOps gpuOps = MultiBackendNativeOpsHolder.getInstance()
    .getOpsForDeviceType(DeviceType.CUDA);
```

**`BackendRoutingStrategy`** — interface that determines the target device for each op execution. The default `DefaultBackendRoutingStrategy` votes based on where the majority of input arrays currently reside, available device memory, and user-configured preferences.

```java
interface BackendRoutingStrategy {
    DeviceDescriptor selectTargetDevice(Op op);
    NativeOps getNativeOpsForDevice(DeviceDescriptor device);
    INDArray ensureOnDevice(INDArray array, DeviceDescriptor device);
}
```
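
The "majority of inputs" part of the vote can be sketched in isolation. This simplified model ignores the memory and user-preference factors mentioned above, represents device locations as strings, and breaks ties in favor of the device seen first; all of these are assumptions, and `MajorityVoteSketch` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative location vote; not DefaultBackendRoutingStrategy. */
final class MajorityVoteSketch {
    /** Returns the device holding the most inputs; ties go to the device seen first. */
    static String vote(List<String> inputLocations, String defaultDevice) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String location : inputLocations) {
            counts.merge(location, 1, Integer::sum);
        }
        String best = defaultDevice;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

Routing to the device that already holds most inputs minimizes the number of arrays that have to be transferred before the op can run.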

**`DeviceAwareOpExecutioner`** — wraps multiple per-device `OpExecutioner` instances and routes each call through the active `BackendRoutingStrategy`. Automatically transfers input arrays to the target device before execution.

```java
// Register backend executioners
DeviceAwareOpExecutioner executioner = DeviceAwareOpExecutioner.getInstance();
executioner.registerBackendExecutioner(DeviceType.CPU, cpuExecutioner);
executioner.registerBackendExecutioner(DeviceType.CUDA, cudaExecutioner);

// From this point, normal ND4J ops automatically route to the right device
INDArray gpuArray = DeviceAwareNd4j.createOnGpu(new long[]{1000, 1000});
INDArray cpuArray = Nd4j.zeros(1000, 1000);

// cpuArray is automatically transferred to GPU; result lives on GPU
INDArray result = gpuArray.add(cpuArray);
```

### Simplified Setup with `DeviceAwareNd4j`

```java
import org.nd4j.linalg.device.DeviceAwareNd4j;

// Basic: enable routing with defaults (prefer GPU, auto-transfer enabled)
DeviceAwareNd4j.enableDeviceRouting();

// Advanced: customize routing policy
DeviceRoutingConfiguration config = DeviceRoutingConfiguration.builder()
    .defaultPolicy(DeviceRoutingPolicy.PREFER_GPU)
    .autoTransferEnabled(true)
    .gpuMemoryCapFraction(0.9)
    .build();
DeviceAwareNd4j.enableDeviceRouting(config);

// Create device-specific arrays
INDArray onGpu  = DeviceAwareNd4j.createOnGpu(new long[]{512, 512});
INDArray onCpu  = DeviceAwareNd4j.createOnCpu(new long[]{512, 512});
INDArray routed = DeviceAwareNd4j.createRouted(new long[]{512, 512}); // auto placement
```

### Classpath Requirements

Multi-backend requires both native libraries on the classpath:

```xml
<!-- CPU backend -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native-platform</artifactId>
    <version>${dl4j.version}</version>
</dependency>

<!-- CUDA backend (add alongside CPU — do not replace it) -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-cuda-12.9-platform</artifactId>
    <version>${dl4j.version}</version>
</dependency>
```

When only one backend is available on the classpath, `MultiBackendNativeOpsHolder` degrades gracefully to single-backend mode and all routing decisions resolve to the available backend.

### Performance Notes

* Op routing adds a per-op device-detection step. For workloads with many small ops, prefer explicit device placement over automatic routing.
* Cross-device transfers (CPU array passed to a GPU op) involve a host-to-device copy. Use `DeviceAwareNd4j.createOnGpu` to allocate arrays where they will be used.
* `MultiGpuTracer` can profile routing decisions and flag unexpected cross-device transfers:

```java
try (MultiGpuTracer tracer = MultiGpuTracer.start()) {
    // ... your inference code ...
} // prints device-transition summary on close
```

## 18. Configuration Reference

### System Properties

| Property                        | Default                  | Description                                               |
| ------------------------------- | ------------------------ | --------------------------------------------------------- |
| `backend.type`                  | (auto)                   | Force a specific backend: `CPU`, `CUDA`, `TPU`, `HEXAGON` |
| `nd4j.tpu.name`                 | (from `TPU_NAME` env)    | TPU resource name for PJRT client                         |
| `nd4j.tpu.default.dtype`        | `BFLOAT16`               | Default data type for TPU arrays                          |
| `nd4j.hexagon.lib.path`         | (from `LD_LIBRARY_PATH`) | Override path to QNN libraries                            |
| `nd4j.zluda.auto.download`      | `true`                   | Whether to auto-download ZLUDA on AMD/Intel GPU           |
| `nd4j.device.memory.cap.CUDA.0` | (unlimited)              | Per-device allocation cap in bytes                        |
| `nd4j.multibackend.trace`       | `false`                  | Enable `MultiGpuTracer` for all executions                |
| `org.bytedeco.javacpp.maxbytes` | (JVM max heap)           | Off-heap/VRAM cap passed to JavaCPP                       |

### Environment Variables

| Variable                    | Description                                                              |
| --------------------------- | ------------------------------------------------------------------------ |
| `TPU_NAME`                  | TPU resource name (`local` for TPU VM, full path for TPU node)           |
| `CUDA_VISIBLE_DEVICES`      | Restrict CUDA device set visible to the process                          |
| `LD_LIBRARY_PATH`           | Must include QNN libs for Hexagon, cuDNN libs for CUDA, MKL for oneDNN   |
| `ZLUDA_DEVICE`              | Selects the AMD/Intel device when ZLUDA is active                        |
| `SD_KERNEL_AUTOTUNE`        | Enable kernel auto-tuning (`1`) — see Section 15                         |
| `SD_KERNEL_FORCE_ENGINE`    | Force a single engine for all ops (e.g., `cuda`, `onednn`)               |
| `SD_KERNEL_DISABLE_ENGINES` | Comma-separated engines to skip (e.g., `onednn,mps`)                     |
| `SD_KERNEL_CACHE_PATH`      | Path for persistent benchmark cache                                      |
| `SD_KERNEL_PLUGIN_PATH`     | Plugin search directories for custom kernels                             |
| `OV_CACHE_DIR`              | OpenVINO compilation cache directory (default: `~/.nd4j/openvino_cache`) |

### Backend Priority Summary

```
CUDA (100) > HEXAGON (60) > TPU (50) > CPU (0)
```

When the ZLUDA path is active, the CUDA backend handles AMD and Intel GPUs at the same priority (100). When the SDX backend is present, it dispatches internally to CPU/Hexagon/Adreno depending on op type, so its effective priority in the SPI chain is separate from those individual backends.

### Kernel Engine Variables Quick Reference

For convenience, the kernel selection environment variables from Section 15 are also captured here:

```bash
SD_KERNEL_AUTOTUNE=1              # enable auto-tuning
SD_KERNEL_FORCE_ENGINE=cuda       # force a single engine
SD_KERNEL_DISABLE_ENGINES=onednn  # disable one or more engines
SD_KERNEL_CACHE_PATH=/tmp/kc.json # benchmark cache location
SD_KERNEL_VERBOSE=1               # log kernel dispatch decisions
```

## See Also

* [Backends Overview](https://github.com/KonduitAI/deeplearning4j-docs/blob/en-1.0.0-rewrite/docs/m2.1/nd4j/backends/overview/README.md) — SPI mechanism, backend discovery, classpath rules
* [CPU Backend (nd4j-native)](https://github.com/KonduitAI/deeplearning4j-docs/blob/en-1.0.0-rewrite/docs/m2.1/nd4j/backends/cpu/README.md) — AVX tuning, BLAS configuration, threading
* [CUDA Backend (nd4j-cuda)](https://github.com/KonduitAI/deeplearning4j-docs/blob/en-1.0.0-rewrite/docs/m2.1/nd4j/backends/cuda/README.md) — CUDA version matrix, cuDNN, multi-GPU, memory management
* [Memory and Workspaces](https://github.com/KonduitAI/deeplearning4j-docs/blob/en-1.0.0-rewrite/docs/m2.1/nd4j/workspaces/README.md) — off-heap memory, workspace scopes
* [DSP Execution Engine](https://github.com/KonduitAI/deeplearning4j-docs/blob/en-1.0.0-rewrite/docs/m2.1/nd4j/samediff/dsp/README.md) — full GraphExecutionMode documentation