# 1.0.0-rewrite

The 1.0.0-rewrite is a ground-up rearchitecture of the Deeplearning4j execution stack. Where M2.1 was a batch-training framework modeled on early deep learning toolkits, the rewrite repositions DL4J as a general-purpose, inference-first JVM runtime capable of running modern LLM and VLM workloads at competitive throughput.

This page covers what changed and why. For API-level documentation, each section links to the relevant guide page.

> **This is a transitional release.** Existing `org.deeplearning4j` and `org.nd4j` imports continue to work. A namespace consolidation is underway — the **next release** will complete the import cleanup. See [Namespace Consolidation](#namespace-consolidation) below.

***

## What Changed from M2.1

### The problem the rewrite solves

M2.1's execution model was **interpreted and eager**: every `SameDiff.output()` call re-analyzed the graph, resolved variable references by string key, allocated intermediates from scratch, and dispatched ops one at a time through JNI. For batch training on CNNs and small RNNs this was adequate. For autoregressive LLM decoding — where each token generation step runs the full graph and latency is measured in milliseconds — it was unusable. The per-step overhead of graph re-analysis and Java↔C++ round-trips dominated actual compute time.

The rewrite replaces this with a **compiled execution model** centered on the DSP (Dynamic Shape Plan) engine. The graph is analyzed once, compiled into a flat slot array, and replayed with zero per-step interpretation overhead. On CUDA, contiguous segments of the plan are captured into CUDA graphs, and each captured graph is replayed with a single graph launch. On CPU, the same plan dispatches through a chain of graph-level backends, chosen according to the available hardware: OpenVINO (\~200 fused ops), oneDNN Graph (\~80 fused ops), ARM Compute Library (\~150 ops with Conv+BN+ReLU fusion), Apple MLX/MPS (Metal-accelerated matmul and attention), or MLIR CPU JIT (x86 AMX, ARM NEON/SVE/SME). The result is that SameDiff goes from "research-grade graph framework" to "production inference runtime" without changing the user-facing API.

### Trade-offs made

Every architecture decision in the rewrite reflects a clear set of trade-offs:

**Inference-first, training-compatible.** The DSP engine is optimized for the inference hot path (fixed topology, stable shapes, CUDA graph replay). Training still works through the same graph, with gradients flowing through DSP-compiled plans, but the optimization effort prioritized decode latency over training throughput. If your primary workload is batch CNN training on ImageNet, the rewrite may not be faster than M2.1 for that specific case. If your workload involves any form of autoregressive generation, the speedup over M2.1 is in the 10-100x range.

**Compiled graphs vs. dynamic control flow.** DSP compiles static-topology subgraphs. Ops with truly data-dependent shapes (`where`, `unique`, `nonzero`, dynamic reshapes from runtime values) break CUDA graph capture and force fallback to slot-by-slot execution within that segment. The graph optimizer (26 passes) works to maximize the capturable fraction — for typical transformer models, 95%+ of ops land in captured segments — but models with heavy dynamic control flow will see less benefit. Python-style eager control flow (if/else branching on tensor values) must be expressed as SameDiff control flow ops (`sd.whileLoop`, `sd.ifCond`) for the plan compiler to reason about them.

**SameDiff as the single execution path.** The rewrite unifies everything through SameDiff. DL4J's `MultiLayerNetwork` and `ComputationGraph` still work, but underneath they now convert to SameDiff for execution. New features (PEFT, GGUF import, LLM generation, hardware backends) are SameDiff-native and do not have DL4J-layer equivalents. This means: if you want DSP, Triton JIT, TPU execution, or any of the new hardware backends, your model needs to be expressible as a SameDiff graph. Converter utilities exist for migrating existing DL4J models — see [Migration Guide](#migration-guide).

**Multi-backend binary size vs. simplicity.** The `-platform` Maven artifact bundles native binaries for all supported OS/arch combinations in one JAR (\~200MB for CPU, larger for CUDA). The new `-lite` and `-compile` classifier variants let you trade binary size for capability: `-lite` strips unused data types for edge deployment, `-compile` adds the full Triton/MLIR/NVRTC JIT stack for maximum throughput. You choose which trade-off fits your deployment. See [Maven Setup — Classifiers](/en-1.0.0-rewrite/configuration/maven.md#platform-classifiers).

**CUDA 12.9 default, compilable for other versions.** M2.1 shipped separate `nd4j-cuda-11.4` and `nd4j-cuda-11.6` artifact IDs. The rewrite ships `nd4j-cuda-12.9` as the default — the pre-built binaries target CUDA 12.9 because that version provides the `cudaMallocAsync` pooling, CUDA graph capture/replay APIs, and Triton/PTX JIT integration that DSP depends on. However, `cuda.version` is a Maven property: if you build from source, you can compile against a different CUDA toolkit version by passing `-Dcuda.version=12.6` (or another 12.x release). The pre-built `-platform` JARs published to Maven Central target 12.9. CUDA 11.x is no longer supported — the `cudaMallocAsync` API (introduced in CUDA 11.2 but production-ready in 12.x) and CUDA graph stream-capture improvements in 12.x are hard requirements for DSP.

***

## DSP: Dynamic Shape Plan Execution Engine

**Docs:** [DSP Execution Engine](/en-1.0.0-rewrite/nd4j/overview-2/dsp.md)

DSP is the centerpiece of the rewrite. It replaces the old interpreted `GraphExecutioner` and `NativeGraphExecutioner` (both deleted in this release) with a compiled graph runtime.

### How it works

1. **Graph analysis.** `ForwardExecutionDAGBuilder` traverses the SameDiff graph once, resolving all variable dependencies, control flow frames, and cross-frame references. The result is a `ForwardExecutionDAG` cached in a `DAGCache`. This replaces the old `initSubgraph` method, which had fundamental convergence bugs that caused complex graphs to either fail to initialize or produce incorrect results.
2. **Plan compilation.** The DAG is compiled into a flat array of **slots**. Each slot is a self-contained op descriptor with integer-indexed inputs (no string lookups), frozen `iArgs/tArgs/bArgs/dArgs`, and a target device ID. Input sources are sign-encoded: a value `N >= 0` means "output of slot N", and a value `N < 0` means "external input at index `-(N+1)`", so `-1` refers to external input 0.
3. **Segmentation.** Contiguous runs of slots with the same capturability and target device form **segments**. Capturable segments (pure arithmetic, matmul, attention, normalization) go through: warmup → shape freeze → CUDA graph capture → replay. Non-capturable segments (dynamic-shape ops, host-device sync points) execute slot-by-slot.
4. **Graph optimization.** Before compilation, a 26-pass `GraphOptimizer` transforms the graph: constant folding, dead code elimination, broadcast elimination, common subexpression elimination, algebraic simplification (`pow(x,2)→square`, `sigmoid(x)*x→swish`), replacement of decomposed normalization patterns with fused ops (`rms_norm`, fused layer norm), and attention fusion (separate Q/K/V matmuls → fused SDPA). The optimizer is critical: without fusion, too many non-capturable ops interrupt CUDA graph segments.
5. **Shape-keyed caching.** Compiled plans are cached by an FNV-1a hash of segment bounds + input shapes. Same shape key = reuse the captured CUDA graph. Plans persist to disk (`~/.kompile/cache/dsp/`) across JVM restarts.
6. **Buffer coloring.** Compile-time analysis identifies non-overlapping slot lifetimes and assigns them to shared physical buffers, reducing intermediate buffer count by 10-20x. A per-device buffer pool enables cross-plan reuse, and LRU passivation releases GPU memory from inactive plans while keeping them in cache.
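
Steps 2 and 5 can be sketched in a few lines. This is a minimal illustration, not the DSP implementation: `SlotPlanSketch`, its method names, and the decision to feed each value into the hash as eight little-endian bytes are assumptions. Only the sign encoding and the use of FNV-1a over segment bounds plus input shapes come from the description above.

```java
/** Illustrative sketch only: names and byte order are assumptions, not the DL4J API. */
final class SlotPlanSketch {

    private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L; // standard 64-bit FNV offset basis
    private static final long FNV_PRIME = 0x100000001b3L;             // standard 64-bit FNV prime

    /** Encodes external input index i as -(i + 1), so external input 0 becomes -1. */
    static int encodeExternal(int externalIndex) {
        return -(externalIndex + 1);
    }

    /** A non-negative source refers to the output of the slot with that index. */
    static boolean isSlotOutput(int source) {
        return source >= 0;
    }

    /** Inverse of encodeExternal: -1 -> 0, -2 -> 1, and so on. */
    static int decodeExternal(int source) {
        if (source >= 0) {
            throw new IllegalArgumentException("not an external input: " + source);
        }
        return -(source + 1);
    }

    /** Feeds the 8 bytes of value through FNV-1a: XOR each byte in, then multiply by the prime. */
    private static long mix(long hash, long value) {
        for (int i = 0; i < Long.BYTES; i++) {
            hash ^= (value >>> (8 * i)) & 0xFFL;
            hash *= FNV_PRIME;
        }
        return hash;
    }

    /** Hashes the segment bounds and every input shape (rank first, then each dimension). */
    static long shapeKey(int segmentStart, int segmentEnd, long[][] inputShapes) {
        long h = FNV_OFFSET_BASIS;
        h = mix(h, segmentStart);
        h = mix(h, segmentEnd);
        for (long[] shape : inputShapes) {
            h = mix(h, shape.length); // rank, so [2,3],[4] and [2],[3,4] hash different byte streams
            for (long dim : shape) {
                h = mix(h, dim);
            }
        }
        return h;
    }

    public static void main(String[] args) {
        long a = shapeKey(0, 12, new long[][] {{1, 128}});
        long b = shapeKey(0, 12, new long[][] {{1, 129}});
        System.out.println("same shape, same key: " + (a == shapeKey(0, 12, new long[][] {{1, 128}})));
        System.out.println("new shape, new key:   " + (a != b));
        System.out.println("source -3 is external input " + decodeExternal(-3));
    }
}
```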

### GPU segment execution

For CUDA segments, DSP dispatches in two tiers; the second tier is itself a three-backend JIT stack:

1. **CUDA graph capture/replay** (base classifier). Contiguous capturable segments are warmed up slot-by-slot, then captured into a CUDA graph and replayed with a single graph launch. This is available in the base `nd4j-cuda-12.9` artifact; no `-compile` classifier is needed.
2. **Triton/NVRTC/PTX JIT** (`-compile` classifier). For segments that benefit from kernel fusion (element-wise chains, reduction patterns, attention variants), DSP dispatches to a JIT compilation stack:

| Priority | Backend              | What it does                                                                                                                                                      |
| -------- | -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1        | Triton MLIR          | Full Triton compilation pipeline via MLIR lowering — produces highly optimized fused kernels. Multi-target: NVIDIA (PTX), AMD (AMDGCN via ZLUDA), Intel (SPIR-V). |
| 2        | NVRTC                | Runtime CUDA C compilation for patterns Triton doesn't cover.                                                                                                     |
| 3        | PTX string templates | Fastest compilation, least optimization. Fallback for simple element-wise patterns.                                                                               |

The JIT stack requires the `-compile` classifier variant. See [Maven Setup — `-compile` Classifier](/en-1.0.0-rewrite/configuration/maven.md#dsp-jit-classifier--compile).

### CPU segment execution

DSP is not CUDA-only. On CPU, the same plan compilation and segmentation logic applies — the difference is which `GraphBackend` executes each segment. CPU segments are dispatched through a priority chain of graph-level backends, each implementing `canFuseSegment()` / `compileSegment()` / `executeSegment()`:

| Priority | Backend                 | What it covers                                                                                                                                                                                                             |
| -------- | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1        | **MLX**                 | Apple Silicon via Metal Performance Shaders. Routes to `MPSGraph` for matmul, attention (SDPA), conv, normalization.                                                                                                       |
| 2        | **OpenVINO**            | Intel CPU graph fusion. \~200 ops via opset13. Configured for latency mode with P-core pinning. FP16 promoted to FP32 on CPUs without AVX512-FP16/AMX-FP16. Disk cache at `~/.nd4j/openvino_cache`.                        |
| 3        | **oneDNN Graph**        | Intel oneDNN `dnnl::graph` API. Auto-partitions and fuses \~80 ops (conv, pooling, LSTM, matmul, activations, reductions). Mixed segments interleave `dnnl::graph` partitions with native slot execution for unmapped ops. |
| 4        | **ARM Compute Library** | ARM Cortex-A / Neoverse. \~150 ops with built-in ACL fusion (Conv+BN+ReLU, MatMul+Bias+Activation). Zero-copy via `import_memory()` when array is contiguous.                                                              |
| 5        | **NNAPI**               | Android Neural Networks API (API 27+). Routes to Hexagon DSP, Mali GPU, or NPU depending on the device.                                                                                                                    |
| 6        | **ARM Hybrid**          | MLIR with ARM-tuned tile sizes + optional Vulkan GPU offload on mobile. CPU path uses NEON/SVE vectorization; GPU path emits SPIR-V for Mali/Adreno compute shaders. Activated on Android NDK aarch64 or Linux ARM64.      |
| 7        | **MLIR CPU JIT**        | General-purpose MLIR JIT. Generates `memref`/`scf`/`arith`/`math` IR from slot segments, lowers through LLVM JIT. Supports x86 AMX, ARM NEON/SVE/SME via `MLIRCompileOptions`. Available with the `-compile` classifier.   |

Each backend in the chain is tried in order. If a backend's `canFuseSegment()` returns false for a given segment (e.g., OpenVINO can't handle a custom op), the segment falls through to the next backend. If no graph backend claims the segment, it executes slot-by-slot through the standard native op dispatcher.
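
The fall-through rule amounts to a first-match walk over an ordered list. The sketch below assumes a simplified, hypothetical `GraphBackend` interface that exposes only a name and a `canFuseSegment()` check over op names; the real method signatures are not shown on this page, and `compileSegment()` / `executeSegment()` are left out.

```java
import java.util.List;
import java.util.Set;

/** Illustrative sketch only: this interface is a simplified stand-in, not the DL4J API. */
final class BackendChainSketch {

    interface GraphBackend {
        String name();
        boolean canFuseSegment(List<String> opNames);
    }

    /** Toy backend that claims a segment only if it supports every op in it. */
    static GraphBackend backend(String name, Set<String> supportedOps) {
        return new GraphBackend() {
            @Override public String name() { return name; }
            @Override public boolean canFuseSegment(List<String> opNames) {
                return supportedOps.containsAll(opNames);
            }
        };
    }

    /** Returns the first backend in priority order that claims the segment. */
    static String claim(List<GraphBackend> chain, List<String> segmentOps) {
        for (GraphBackend b : chain) {
            if (b.canFuseSegment(segmentOps)) {
                return b.name();
            }
        }
        return "slot-by-slot"; // no graph backend claimed the segment
    }

    public static void main(String[] args) {
        List<GraphBackend> chain = List.of(
                backend("openvino", Set.of("matmul", "relu")),
                backend("onednn", Set.of("matmul", "relu", "lstm")));
        System.out.println(claim(chain, List.of("matmul", "relu")));   // first backend claims it
        System.out.println(claim(chain, List.of("matmul", "lstm")));   // falls through to the second
        System.out.println(claim(chain, List.of("my_custom_op")));     // nobody claims it
    }
}
```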

**Op-level dispatch (below the graph level).** Independent of DSP segments, individual ops can dispatch to platform-specific implementations via `PlatformHelper` and the `PLATFORM_IMPL` macro. Over 100 ops have oneDNN platform implementations (matmul, conv, LSTM, attention, all activations). \~150 ops have ARM Compute Library implementations. Apple Accelerate provides BLAS (vecLib), FFT/element-wise (vDSP), transcendentals (vForce), and neural network primitives (BNNS). The `MultiPlatformDispatcher` selects the best available implementation per op, with auto-tuning via `DispatchMode.BENCHMARK`.

**Buffer coloring works on CPU too.** The `DspBufferColorMap` that assigns non-overlapping slot lifetimes to shared physical buffers is not GPU-specific — it runs on all platforms, reducing intermediate allocation count by 10-20x regardless of backend.
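
One common way to do this kind of lifetime-based sharing is a greedy interval assignment, sketched below. Whether `DspBufferColorMap` uses this exact strategy is an assumption, as are the class name and the `{firstSlot, lastSlot}` lifetime representation. Buffer sizes and data types, which a real allocator must also match, are ignored here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only: a greedy interval assignment, not the DL4J implementation. */
final class BufferColoringSketch {

    /**
     * lifetimes[i] = {firstSlot, lastSlot} (inclusive) during which intermediate i is live.
     * The array must be sorted by firstSlot. Returns the physical buffer index for each intermediate.
     */
    static int[] color(int[][] lifetimes) {
        int[] assignment = new int[lifetimes.length];
        List<Integer> busyUntil = new ArrayList<>(); // busyUntil.get(b): last slot that still uses buffer b
        for (int i = 0; i < lifetimes.length; i++) {
            int first = lifetimes[i][0];
            int last = lifetimes[i][1];
            int chosen = -1;
            for (int b = 0; b < busyUntil.size(); b++) {
                if (busyUntil.get(b) < first) { // previous occupant is dead before this one is born
                    chosen = b;
                    break;
                }
            }
            if (chosen < 0) {
                chosen = busyUntil.size(); // no reusable buffer: open a new one
                busyUntil.add(last);
            } else {
                busyUntil.set(chosen, last);
            }
            assignment[i] = chosen;
        }
        return assignment;
    }

    public static void main(String[] args) {
        // A simple chain: each intermediate is produced at slot i and consumed at slot i + 1.
        int[][] lifetimes = {{0, 1}, {1, 2}, {2, 3}, {3, 4}};
        System.out.println(Arrays.toString(color(lifetimes))); // prints [0, 1, 0, 1]
    }
}
```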

### GraphExecutionMode — all 19 values

The `GraphExecutionMode` enum controls which backend DSP targets. On CPU-only builds, GPU modes automatically remap to `EMULATED_REPLAY` (slot-by-slot with full lifecycle diagnostics).

| Mode                   | Code | Target                                                                     |
| ---------------------- | ---- | -------------------------------------------------------------------------- |
| `AUTO`                 | 0    | Probes the backend chain — picks the best available                        |
| `SLOT_BY_SLOT`         | 1    | No fusion, one op at a time                                                |
| `CUDA_GRAPHS`          | 2    | CUDA graph capture/replay; on CPU, remaps to oneDNN Graph / ACL fusion     |
| `NVRTC_JIT`            | 3    | NVRTC runtime CUDA C; on non-CUDA, falls through to Triton                 |
| `PTX_JIT`              | 4    | PTX string templates; on non-CUDA, falls through to Triton                 |
| `TRITON`               | 5    | Triton MLIR pipeline (GPU); on CPU, falls through to CPU graph backends    |
| `MLX`                  | 6    | Apple Silicon Metal Performance Shaders                                    |
| `ARM_HYBRID`           | 7    | MLIR with NEON/SVE/SME + optional Vulkan GPU offload                       |
| `NNAPI`                | 8    | Android Neural Networks API                                                |
| `HIP_GRAPHS`           | 9    | AMD ROCm HIP graph capture/replay                                          |
| `LEVEL_ZERO`           | 10   | Intel Level Zero mutable command list                                      |
| `VULKAN`               | 11   | Cross-vendor Vulkan compute command buffers                                |
| `METAL`                | 12   | Apple Metal indirect command buffers                                       |
| `TPU`                  | 13   | Google Cloud TPU via PJRT/HLO                                              |
| `HEXAGON`              | 14   | Qualcomm Hexagon NPU via hexagon-mlir                                      |
| `OPENVINO`             | 15   | Intel OpenVINO CPU (\~200 ops; also ARM via OpenVINO ARM plugin)           |
| `TVM`                  | 16   | Deprecated — kept for serialization compatibility                          |
| `EMULATED_REPLAY`      | 17   | Slot-by-slot with full replay lifecycle diagnostics; works on CPU and CUDA |
| `SHAPE_INFERENCE_ONLY` | 18   | Shape propagation only, no compute kernels                                 |

```java
// Intel CPU — route through OpenVINO graph fusion
sd.setGraphExecutionMode(GraphExecutionMode.OPENVINO);

// Apple Silicon — route through MLX/Metal
sd.setGraphExecutionMode(GraphExecutionMode.MLX);

// ARM server (Graviton, Ampere): route through ARM-tuned MLIR (NEON/SVE)
sd.setGraphExecutionMode(GraphExecutionMode.ARM_HYBRID);

// Let DSP pick the best available
sd.setGraphExecutionMode(GraphExecutionMode.AUTO);
```

### What was removed for DSP

| Removed                                                    | Replaced by                                                                         |
| ---------------------------------------------------------- | ----------------------------------------------------------------------------------- |
| `GraphExecutioner.java`                                    | `DynamicShapePlanExecutor`                                                          |
| `NativeGraphExecutioner.java`                              | DSP-based `InferenceSession`                                                        |
| 10 `LogicXxx.h/.cpp` files (LogicWhile, LogicSwitch, etc.) | Control flow resolved at Java level before plan serialization                       |
| `GraphProfilingHelper.cpp`                                 | `OpTimingTracker` with Chrome trace export                                          |
| `initSubgraph` method                                      | `ForwardExecutionDAGBuilder` + `DAGCache`                                           |
| `ExecutionPhase` enum                                      | `SegmentLifecycleState` (single state machine replacing two that diverged silently) |
| Static `ExecutionPlan` class                               | DSP shape-keyed plan cache                                                          |

***

## SameDiff as the Core Runtime

**Docs:** [SameDiff Overview](/en-1.0.0-rewrite/nd4j/overview-2.md)

The rewrite makes SameDiff the single execution substrate. Everything — DL4J neural networks, GGUF models, ONNX imports, PEFT-adapted models, LLM generation — converges on a SameDiff graph that DSP compiles and executes.

### What this means in practice

* **New features are SameDiff-native.** The LLM generation pipeline (`GenerationPipeline`), PEFT adapters (`LoraConfig`, `QLoraConfig`), RL alignment trainers (`GRPOTrainer`, `DPOTrainer`), and all new ops (flash attention, RoPE, RMSNorm, Mamba) operate on SameDiff graphs. They have no `MultiLayerNetwork` equivalents.
* **DL4J models convert to SameDiff.** `MultiLayerNetworkSameDiffConverter` and `ComputationGraphSameDiffConverter` convert initialized DL4J models to SameDiff graphs. This is a one-way bridge — SameDiff models cannot convert back to DL4J layer configs. See [Migration Guide](#migration-guide) for supported layer types.
* **Graph tracing bridges eager and compiled.** `Nd4j.graphScope()` lets you write eager-style ND4J code (`Nd4j.matmul(a, b)`, `Nd4j.nn.relu(x)`) inside a tracing scope. The scope records the operations as a SameDiff graph, compiles it through DSP, and replays the optimized version. This is the recommended way to get DSP benefits without rewriting existing imperative code.
* **User-defined ops extend the graph.** `@UserDefinedOp` + `UserDefinedCustomOp` lets you register custom ops that participate in SameDiff graph construction, DSP compilation, and serialization. Annotated ops are discovered at startup via classpath scanning.

### Execution analysis and debugging

The new execution framework (ADR 0048) adds analysis tools that were impossible with the old interpreted model:

* **`VariableEvolutionAnalysis`**: Classifies variable behavior across loop iterations as CONVERGING, DIVERGING, OSCILLATING, STABLE, or CHAOTIC. Useful for diagnosing training instability.
* **`LoopTerminationAnalyzer`**: Analyzes whether loops will terminate and estimates remaining iterations. Diagnoses infinite loops with `diagnoseInfiniteLoop()`.
* **`ExecutionTrace`**: Records every step the executor takes, including control flow decisions (Enter/Switch/Merge). Exportable to JSON for visualization.

```java
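// Enable execution analysis before running the graph.
sd.enableExecutionAnalysis(AnalysisLevel.FULL);
// `inputs`: placeholder name -> array; `outputs`: names of the variables to compute.
Map<String, INDArray> result = sd.output(inputs, outputs);
// Inspect how the "loss" variable evolved across iterations.
VariableEvolutionAnalysis analysis = sd.getAnalysis().getVariableEvolution("loss");
System.out.println(analysis.getPattern());  // CONVERGING, DIVERGING, etc.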
```

***

## Multi-GPU Device Execution

**Docs:** [Hardware Backends](/en-1.0.0-rewrite/nd4j/overview-1/hardware-backends.md) | [CUDA Backend](/en-1.0.0-rewrite/nd4j/overview-1/cuda.md)

M2.1's multi-GPU support was ad-hoc: device selection based on `cudaMemGetInfo` free memory (wrong — CUDA pool reservations reduce reported free memory without blocking allocation), no failover chain, and device-switch logic duplicated across 15+ call sites with subtle inconsistencies.

The rewrite introduces a structured multi-GPU execution model:

### Device selection and memory management

* **Selection metric: total memory, not free memory.** CUDA pool reservations pollute free memory reports. The rewrite uses total device memory as the primary allocation target, with soft-limit checking against actual availability via `cudaMemGetInfo` as a guard.
* **Five-stage allocation failover:** proactive soft-limit check → trim pool + retry same device → try peer devices (NVLink/P2P, sorted by free memory) → try non-peer devices → pinned host memory → OOM error.
* **Non-P2P compute budget = 0% by default.** Multi-GPU systems without NVLink (e.g., RTX 3070 Ti + RTX 4090) caused OOM crashes in M2.1 because non-peer failover was absent. The rewrite adds non-P2P failover for memory spillover but does not route compute to non-peer GPUs by default — host-staged D2H+H2D round-trips cause 100x slowdowns that trigger emergency reclaim cycles.
* **`CudaMemoryPool`**: `cudaMallocAsync`-based pool with device-safe free (saves/restores current CUDA device before `cudaFreeAsync` to prevent cross-device double-frees).
* **`HybridDataBuffer` with coherence tracking**: MESI-style coherence protocol (INVALID/SHARED/EXCLUSIVE/MODIFIED) per device. `syncToHost()`/`syncToDevice()` are no-ops when state is already valid, eliminating redundant transfers.
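
The effect of coherence tracking on transfers can be illustrated with a two-location model (one host copy, one device copy). This is a simplified sketch under assumptions: `CoherenceSketch` and its transition rules are hypothetical, and the real `HybridDataBuffer` tracks state per device rather than for a single device.

```java
/** Illustrative sketch only: a two-location coherence model, not the DL4J implementation. */
final class CoherenceSketch {

    enum State { INVALID, SHARED, EXCLUSIVE, MODIFIED }

    private State host = State.EXCLUSIVE; // data starts on the host only
    private State device = State.INVALID;
    private int transfers = 0;

    /** A host-side write makes the host copy the only valid one. */
    void writeOnHost() {
        host = State.MODIFIED;
        device = State.INVALID;
    }

    /** A device-side write (for example a kernel output) invalidates the host copy. */
    void writeOnDevice() {
        device = State.MODIFIED;
        host = State.INVALID;
    }

    /** Copies host -> device only when the device copy is stale. Returns true if a copy happened. */
    boolean syncToDevice() {
        if (device != State.INVALID) {
            return false; // already valid: no-op
        }
        transfers++;
        host = State.SHARED;
        device = State.SHARED;
        return true;
    }

    /** Copies device -> host only when the host copy is stale. Returns true if a copy happened. */
    boolean syncToHost() {
        if (host != State.INVALID) {
            return false; // already valid: no-op
        }
        transfers++;
        host = State.SHARED;
        device = State.SHARED;
        return true;
    }

    int transfers() {
        return transfers;
    }

    public static void main(String[] args) {
        CoherenceSketch buf = new CoherenceSketch();
        buf.syncToDevice();   // real copy
        buf.syncToDevice();   // no-op
        buf.writeOnDevice();
        buf.syncToHost();     // real copy
        buf.syncToHost();     // no-op
        System.out.println("transfers: " + buf.transfers()); // prints transfers: 2
    }
}
```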

### Multi-GPU in DSP

When device placement assigns different slots to different GPUs, DSP inserts segment boundaries automatically. Each segment's compiled kernel or CUDA graph runs on its assigned device. Cross-segment data transfers happen at boundaries — NVLink peer transfers when available, host-staged otherwise.

```java
// VLM example: vision encoder on GPU 0, decoder on GPU 1
MultiPartModelLoader loader = MultiPartModelLoader.builder()
    .modelDirectory("/path/to/vlm-parts")
    .deviceMapping(Map.of(
        "vision_encoder", 0,
        "decoder", 1
    ))
    .build();
```
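
Automatic boundary insertion can be pictured as grouping contiguous slots that share both capturability and device. The sketch below is illustrative only: `SegmentationSketch` and the two parallel arrays are hypothetical stand-ins for the plan's slot metadata.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only: groups slots into segments, not the DL4J implementation. */
final class SegmentationSketch {

    /**
     * Splits slots into maximal contiguous runs with the same capturability and device.
     * Each segment is returned as {startInclusive, endExclusive}.
     */
    static List<int[]> segment(boolean[] capturable, int[] device) {
        List<int[]> segments = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= capturable.length; i++) {
            boolean boundary = i == capturable.length
                    || capturable[i] != capturable[start]
                    || device[i] != device[start];
            if (boundary) {
                segments.add(new int[] {start, i});
                start = i;
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        boolean[] capturable = {true, true, false, true, true};
        int[] device = {0, 0, 0, 0, 1}; // the last slot is placed on a second GPU
        for (int[] s : segment(capturable, device)) {
            System.out.println(Arrays.toString(s));
        }
    }
}
```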

### Tensor and pipeline parallelism

New parallel execution primitives for large model inference:

* **`ColumnParallelLinear` / `RowParallelLinear`**: Split weight matrices across GPUs with NCCL all-reduce for communication.
* **`ModelParallelConfig`**: 6 parallelism strategies — DATA, TENSOR, PIPELINE, HYBRID, EXPERT, SEQUENCE.
* **DSP thread isolation**: Each thread gets its own plan instance. Mutable slot state is never shared across threads.
* **CUDA graph concurrent capture**: Per-device atomics (`g_captureActive[16]`) serialize concurrent captures to prevent CUDA error 900 (`cudaErrorStreamCaptureUnsupported`).
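
The arithmetic behind row-parallel splitting can be checked on plain arrays. The sketch assumes the conventional definition: the weight matrix is split along its input dimension, each device multiplies its slice of the activations by its shard, and the partial products are summed by the all-reduce. The page does not spell out how `RowParallelLinear` shards its weights, and no DL4J or NCCL API is used here.

```java
import java.util.Arrays;

/** Illustrative sketch only: simulates two "devices" with plain arrays. */
final class RowParallelSketch {

    /** Dense matmul: (m x k) times (k x n). */
    static double[][] matmul(double[][] a, double[][] b) {
        int m = a.length, k = b.length, n = b[0].length;
        double[][] out = new double[m][n];
        for (int i = 0; i < m; i++) {
            for (int p = 0; p < k; p++) {
                for (int j = 0; j < n; j++) {
                    out[i][j] += a[i][p] * b[p][j];
                }
            }
        }
        return out;
    }

    /** Columns [from, to) of x, i.e. the activation slice that one device sees. */
    static double[][] columns(double[][] x, int from, int to) {
        double[][] out = new double[x.length][];
        for (int i = 0; i < x.length; i++) {
            out[i] = Arrays.copyOfRange(x[i], from, to);
        }
        return out;
    }

    /** Splits w by rows across two shards, multiplies each, then sums the partial results. */
    static double[][] rowParallel(double[][] x, double[][] w) {
        int k = w.length;
        int half = k / 2; // requires k >= 2 so that neither shard is empty
        double[][] partial0 = matmul(columns(x, 0, half), Arrays.copyOfRange(w, 0, half));
        double[][] partial1 = matmul(columns(x, half, k), Arrays.copyOfRange(w, half, k));
        double[][] sum = new double[partial0.length][partial0[0].length];
        for (int i = 0; i < sum.length; i++) {
            for (int j = 0; j < sum[i].length; j++) {
                sum[i][j] = partial0[i][j] + partial1[i][j]; // the all-reduce (sum) step
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] x = {{1, 2, 3, 4}};
        double[][] w = {{1, 0}, {0, 1}, {1, 1}, {2, 0}};
        System.out.println(Arrays.deepToString(matmul(x, w)));      // prints [[12.0, 5.0]]
        System.out.println(Arrays.deepToString(rowParallel(x, w))); // prints [[12.0, 5.0]]
    }
}
```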

***

## Hardware Backend Expansion

**Docs:** [Hardware Backends](/en-1.0.0-rewrite/nd4j/overview-1/hardware-backends.md)

M2.1 supported two backends: CPU (nd4j-native) and CUDA (nd4j-cuda). The rewrite introduces a backend abstraction layer with pluggable `GraphBackend` implementations, and adds support for additional hardware targets.

### Production-ready backends

**CPU (`nd4j-native-platform`)**

The CPU backend ships with multiple graph-level and op-level backend integrations that DSP selects automatically based on the host platform:

| Integration                | Platform                         | Scope                                                 | Details                                                                                                                                                                                 |
| -------------------------- | -------------------------------- | ----------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **oneDNN** (Intel MKL-DNN) | x86 (Intel, AMD)                 | \~80 graph-fused ops + 100+ op-level platform helpers | `OneDnnGraphBackend` uses `dnnl::graph` API for segment fusion. Op-level `PLATFORM_IMPL` covers conv, pooling, LSTM, matmul, attention, all activations, reductions, binary/unary math. |
| **OpenVINO**               | x86 (Intel, AMD), ARM via plugin | \~200 ops via opset13                                 | `OpenVinoGraphBackend` with latency-mode config, P-core pinning, disk cache. Higher priority than oneDNN due to broader op coverage.                                                    |
| **ARM Compute Library**    | ARM Cortex-A, Neoverse           | \~150 ops                                             | `AclGraphBackend` with built-in ACL fusion (Conv+BN+ReLU, MatMul+Bias+Activation). Zero-copy via `import_memory()`.                                                                     |
| **Apple Accelerate**       | macOS (Intel + Apple Silicon)    | BLAS, FFT, element-wise, neural net                   | vecLib (matmul, SVD, QR, solve), vDSP (element-wise, FFT, reductions), vForce (transcendentals), BNNS (conv2d, pooling, batchnorm, activations). Op-level only — no graph backend.      |
| **Apple MPS / MLX**        | macOS Apple Silicon (M1+)        | Graph-level matmul, attention, conv, normalization    | `MLX` graph backend routes to `MPSGraph` for Metal-accelerated compute. 21 Objective-C++ files, zero-copy `MTLBuffer`, SDPA via `MPSGraphScaledDotProductAttentionOp`.                  |
| **MLIR CPU JIT**           | x86 (AMX), ARM (NEON/SVE/SME)    | General-purpose graph fusion                          | `MlirCpuGraphBackend` generates MLIR IR from slot segments, lowers through LLVM JIT. Requires `-compile` classifier.                                                                    |
| **ARM Hybrid**             | Android aarch64, Linux ARM64     | CPU MLIR + optional Vulkan GPU offload                | `ArmHybridGraphBackend` with ARM-tuned tile sizes, NEON/SVE vectorization. GPU path emits SPIR-V for Mali/Adreno.                                                                       |
| **NNAPI**                  | Android (API 27+)                | Hardware-accelerated NN inference                     | `NnapiGraphBackend` routes to Hexagon DSP, Mali GPU, or NPU depending on the Android device.                                                                                            |

**CUDA (`nd4j-cuda-12.9-platform`)**

| Integration                   | Scope                                | Details                                                                                          |
| ----------------------------- | ------------------------------------ | ------------------------------------------------------------------------------------------------ |
| **CUDA graph capture/replay** | All capturable segments              | Base classifier. Warmup → freeze → capture → replay. Zero per-step Java overhead.                |
| **Triton MLIR**               | Fused kernel compilation             | `-compile` classifier. Full Triton pipeline. Multi-target: NVIDIA PTX, AMD AMDGCN, Intel SPIR-V. |
| **NVRTC**                     | Runtime CUDA C compilation           | `-compile` classifier. Fallback for patterns Triton doesn't cover.                               |
| **PTX string templates**      | Simple element-wise patterns         | `-compile` classifier. Fastest compilation, least optimization.                                  |
| **cuDNN**                     | Conv, LSTM, attention, normalization | Via `deeplearning4j-cuda-12.9` helper module. Op-level platform helpers.                         |
| **CUTLASS**                   | GEMM                                 | `CutlassGemmHelper` for optimized matrix multiply.                                               |

### WIP / experimental backends

**TPU (PJRT)** — `nd4j-tpu`

The TPU backend targets Google Cloud TPU v4 and v5 hardware via Google's PJRT C API (not C++ API — chosen for ABI stability). The approach compiles SameDiff ops to XLA HLO IR via `HloIRBuilder`, caches compiled executables in `TpuReplayHandle`, and manages device lifecycle through `PjrtClientManager`. BF16 is the default dtype. Shape inference runs on host CPU to avoid device round-trips.

The Java backend (`JTpuBackend`) is discovered via SPI with priority 50 (higher than CPU, lower than CUDA). `GraphExecutionMode.TPU` maps to native execution-mode code 13, which is distinct from `ENGINE_TPU = 2` in the C++ `Engine` enum.

**What works:** C++ infrastructure (`HloIRBuilder`, `PjrtClientManager`, `TpuReplayHandle`, `TpuGraphBackend`), Java SPI discovery, `GraphExecutionMode.TPU` routing.

**What's incomplete:** JavaCPP PJRT native binding generation, end-to-end integration tests on actual TPU hardware, multi-chip data/model parallelism, performance benchmarking. Cloud-only — no on-premise TPU path. HLO compilation overhead is 100ms–10s per graph.

**ZLUDA (AMD ROCm)** — reuses `nd4j-cuda-12.9` + ZLUDA runtime

The AMD path takes a pragmatic approach: instead of maintaining a separate HIP/ROCm codebase, it relies on ZLUDA to translate CUDA API calls to HIP at runtime. This means the existing CUDA/JCublas backend runs unchanged on AMD GPUs. The engine is registered as `ENGINE_ZLUDA_AMD = 3` in `Engine.h`, and MIOpen replaces cuDNN for DNN layer acceleration.

`JZludaBackend` detects AMD hardware via `rocminfo`, priority = GPU - 10 (lower than native CUDA, so NVIDIA GPUs are preferred when both are present). ZLUDA is auto-downloaded at build time if not found.

**What works:** Full CUDA codebase reuse, MIOpen DNN layer replacement, SPI discovery, `GraphExecutionMode.HIP_GRAPHS` routing.

**What's incomplete:** Test coverage is limited. Expected performance is 80-95% of native HIP on AMD. CUDA dynamic parallelism may not be supported, and unified memory behavior may differ. There is no independent HIP code path: the backend depends entirely on the ZLUDA translation layer being installed.

**ZLUDA (Intel Level Zero)** — reuses `nd4j-cuda-12.9` + ZLUDA runtime

This backend takes the same approach as AMD ZLUDA but targets Intel GPUs via the Level Zero API (`ENGINE_ZLUDA_INTEL = 4`). oneDNN is used for DNN layers instead of MIOpen, and hardware is detected via `sycl-ls`. Test coverage is even thinner than for AMD, and expected performance is 70-90% of native. Routing goes through `GraphExecutionMode.LEVEL_ZERO`.

**Hexagon DSP (QNN)** — `nd4j-hexagon`

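This backend targets the Qualcomm Hexagon NPU via the SNPE/QNN SDK, with a focus on INT8/INT16 mobile inference. On Android, `NnapiGraphBackend` routes to Hexagon when available; in libnd4j, `HexagonGraphBackend` uses hexagon-mlir for op compilation. It is an early scaffold with minimal functionality. Note that "Hexagon DSP" refers to Qualcomm's digital signal processor, not the Dynamic Shape Plan engine.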

**Important:** TPU and ZLUDA should not be treated as production-ready for 1.0.0-rewrite. They participate in `AUTO` backend selection when hardware is detected and the module is on the classpath, but they have not been validated end-to-end. Pin your `GraphExecutionMode` explicitly if you need deterministic backend selection:

```java
// Explicit CUDA — won't accidentally route to ZLUDA or TPU
sd.setGraphExecutionMode(GraphExecutionMode.CUDA_GRAPHS);

// Explicit OpenVINO CPU
sd.setGraphExecutionMode(GraphExecutionMode.OPENVINO);
```

### The GraphBackend extensibility model

The hardware backend system operates at two levels:

**Graph-level (`GraphBackend` interface).** Each backend implements `canFuseSegment()`, `compileSegment()`, `executeSegment()`. DSP walks the priority chain until a backend claims each segment. Adding a new backend (custom ASIC, future hardware) requires implementing these three methods without touching SameDiff or DSP core.

**Op-level (`PlatformHelper` + `PLATFORM_IMPL` macro).** Individual ops register platform-specific implementations tagged with an `Engine` (e.g., `ENGINE_ONEDNN`, `ENGINE_ARM`, `ENGINE_ACCELERATE`). The `MultiPlatformDispatcher` holds multiple helpers per op and selects the best via `DispatchMode` — `AUTO` (default), `FIXED`, `ROUND_ROBIN`, or `BENCHMARK` (runtime auto-tuning with persistent `KernelPerformanceRegistry`).

Backend discovery uses Java SPI (`Nd4jBackend`) with priority-based selection. The C++ `Engine` enum maps all backends:

```
ENGINE_CPU = 0, ENGINE_CUDA = 1, ENGINE_TPU = 2, ENGINE_ZLUDA_AMD = 3,
ENGINE_ZLUDA_INTEL = 4, ENGINE_METAL = 5, ENGINE_VULKAN = 6, ENGINE_OPENCL = 7,
ENGINE_LLAMACPP = 8, ENGINE_VLM = 9, ENGINE_MPS = 10, ENGINE_ACCELERATE = 11,
ENGINE_ONEDNN = 12, ENGINE_ARM = 13, ENGINE_ANY = 99
```
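As a rough sketch of the graph-level contract described above, the following shows how a priority chain might hand a segment to the first backend that claims it. The method names come from the text, but the parameter and return types, the `name()` method, and the `OpSetBackend` and `BackendChain` classes are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical mirror of the three-method contract; signatures are assumptions. */
interface GraphBackend {
    String name();
    boolean canFuseSegment(List<String> ops);
    Object compileSegment(List<String> ops);
    String executeSegment(Object compiled);
}

/** Toy backend that claims a segment only if it supports every op in it. */
final class OpSetBackend implements GraphBackend {
    private final String name;
    private final List<String> supported;

    OpSetBackend(String name, String... supportedOps) {
        this.name = name;
        this.supported = Arrays.asList(supportedOps);
    }

    public String name() { return name; }
    public boolean canFuseSegment(List<String> ops) { return supported.containsAll(ops); }
    public Object compileSegment(List<String> ops) { return new ArrayList<>(ops); }
    public String executeSegment(Object compiled) { return name + " ran " + compiled; }
}

final class BackendChain {
    /** Walks backends in priority order; the first one that claims the segment runs it. */
    static String run(List<GraphBackend> chain, List<String> segment) {
        for (GraphBackend backend : chain) {
            if (backend.canFuseSegment(segment)) {
                return backend.executeSegment(backend.compileSegment(segment));
            }
        }
        throw new IllegalStateException("No backend claimed segment " + segment);
    }
}
```

In this toy model a specialized backend listed first takes the segments it can fuse, and a more general backend later in the chain picks up the rest, which is the behavior the priority chain is described as providing.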

***

## Model Import Coverage

**Docs:** [Model Import Overview](/en-1.0.0-rewrite/model-import/overview.md)

M2.1 supported Keras import (via `deeplearning4j-modelimport`), basic ONNX import, and TensorFlow frozen graph import. The rewrite dramatically expands format and architecture coverage.

### New import formats

| Format              | Module                          | Architectures                                                                                                                         | Notes                                                                                                                                                        |
| ------------------- | ------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **GGUF/GGML**       | `nd4j-ggml`                     | LLaMA 1/2/3/4, Gemma 2/3, Mistral, Phi-3/3.5, ChatGLM, Granite, LFM2, Nemotron, OLMo, OpenELM, SmolVLM2, Qwen3-VL, MiniCPM-V, Whisper | Full GGUF v1/v2/v3 parser. All quantization formats (Q2\_K through Q8\_K, all IQ variants, TQ ternary). Round-trip export. Memory-mapped I/O for 7B+ models. |
| **SafeTensors**     | `samediff-pipeline-safetensors` | SmolVLM2, Qwen3-VL                                                                                                                    | HuggingFace-format weight loading                                                                                                                            |
| **TorchScript**     | `nd4j-torchscript`              | ResNet, EfficientNet, VGG                                                                                                             | ZIP archive reader, pure-Java pickle parser                                                                                                                  |
| **ONNX** (expanded) | `nd4j-onnx-import`              | 120+ new ops, full ML domain                                                                                                          | Microsoft contrib LLM ops (GQA, RoPE, MoE). Bidirectional export: SameDiff → ONNX ModelProto.                                                                |

### AutoModel — unified entry point

`AutoModel.fromPretrained()` detects format from file extension or directory contents and routes to the correct importer:

```java
// Auto-detects GGUF, SafeTensors, ONNX, SDZ, TorchScript
SameDiff model = AutoModel.fromPretrained("/path/to/model");
```

All import paths produce a `SameDiff` graph as output, which feeds directly into DSP compilation.
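To make the extension-based half of that detection concrete, here is a minimal sketch. It is not the `AutoModel` implementation: the `FormatSniffer` class, its `Format` enum, and the extension table (including `.pt`/`.pth` for TorchScript) are assumptions, and detection from directory contents is not covered.

```java
import java.util.Locale;

/** Hypothetical format detector; the enum and extension table are illustrative assumptions. */
final class FormatSniffer {
    enum Format { GGUF, SAFETENSORS, ONNX, SDZ, TORCHSCRIPT, UNKNOWN }

    /** Maps a file name to a format by its extension, case-insensitively. */
    static Format fromFileName(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".gguf")) return Format.GGUF;
        if (lower.endsWith(".safetensors")) return Format.SAFETENSORS;
        if (lower.endsWith(".onnx")) return Format.ONNX;
        if (lower.endsWith(".sdz")) return Format.SDZ;
        if (lower.endsWith(".pt") || lower.endsWith(".pth")) return Format.TORCHSCRIPT;
        return Format.UNKNOWN;
    }
}
```

A detector of this shape returns `UNKNOWN` rather than guessing, which leaves room for a second pass that inspects directory contents or file headers.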

### Pipeline SPI

Model loading is pluggable via `samediff-pipeline-core` SPI. Four pipeline modules ship: `samediff-pipeline-ggml`, `samediff-pipeline-safetensors`, `samediff-pipeline-onnx`, and `samediff-pipeline-core`. Custom formats can be added by implementing the pipeline interface and placing the JAR on the classpath.
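The classpath-discovery mechanism can be illustrated with the standard `java.util.ServiceLoader`. The sketch below is an assumption about how such an SPI could look: the `ModelPipeline` interface, its methods, and the `PipelineLookup` class are hypothetical and are not the actual `samediff-pipeline-core` types.

```java
import java.util.Optional;
import java.util.ServiceLoader;

/** Hypothetical pipeline contract; the real interface in samediff-pipeline-core may differ. */
interface ModelPipeline {
    boolean supports(String path);
    String formatName();
}

final class PipelineLookup {
    /** Returns the first provider on the classpath that accepts the given path. */
    static Optional<ModelPipeline> find(String path) {
        for (ModelPipeline pipeline : ServiceLoader.load(ModelPipeline.class)) {
            if (pipeline.supports(path)) {
                return Optional.of(pipeline);
            }
        }
        return Optional.empty();
    }
}
```

With `ServiceLoader`, a provider JAR registers its implementation in a `META-INF/services/` file named after the fully qualified interface name. When no provider is registered, the lookup simply returns an empty result.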

***

## Refactoring and Removed Features

### Arbiter (removed)

Arbiter (`arbiter-core`, `arbiter-deeplearning4j`) has been removed from the project. It was a hyperparameter optimization framework built on top of DL4J's `MultiLayerNetwork` and `ComputationGraph`. With the shift to SameDiff as the core runtime, Arbiter's tight coupling to the old DL4J layer API made it unmaintainable. Alternatives: [Optuna](https://optuna.org/) via Python4J, [Ray Tune](https://docs.ray.io/en/latest/tune/index.html), or manual search over SameDiff training configs.

### Execution infrastructure (replaced)

The old interpreted execution stack was deleted, not deprecated:

* `GraphExecutioner`, `NativeGraphExecutioner` → replaced by DSP `DynamicShapePlanExecutor`
* `initSubgraph` → replaced by `ForwardExecutionDAGBuilder` + `DAGCache`
* 10 C++ `LogicXxx` control-flow handlers → control flow resolved at Java level before plan serialization
* `ExecutionPhase` enum → collapsed into `SegmentLifecycleState` (the old design had two parallel state machines that could diverge silently)
* `GraphProfilingHelper` → replaced by `OpTimingTracker` with Chrome trace export
* Old `gemm.h`/`gemm.cpp` → replaced by `MmulHelper` + `CutlassGemmHelper`
* Monolithic `NativeOps.cpp`/`NativeOps.cu` → split into \~50 focused translation units by op category
* `OnnxIRGraphRunner.kt`, `TensorflowIRGraphRunner.kt` → dead code, deleted

### DL4J layer API (retained but not the focus)

`MultiLayerNetwork` and `ComputationGraph` are **not removed** — they still work, and new layer types were added (`Deconvolution1D`, `SeparableConvolution1D`, `EinsumDense`, `LayerNormalization`, `GroupNormalization`, `UnitNormalization`). But no new infrastructure features target them. The recommended path for new projects is SameDiff directly.

### Old model zoo (superseded)

`deeplearning4j-zoo` is superseded by OmniHub (`org.eclipse.deeplearning4j.omnihub`), which provides `AutoModel.fromPretrained()` with format auto-detection across GGUF, SafeTensors, ONNX, and SDZ.

### CUDA 11.x backends (dropped)

`nd4j-cuda-11.4` and `nd4j-cuda-11.6` are no longer published as pre-built artifacts. The default CUDA version is 12.9 (minimum driver 525.60). You can build from source against other CUDA 12.x versions by setting `-Dcuda.version=12.x`, but pre-built Maven Central JARs target 12.9.

***

## New Native Operations (\~130 ops)

**Docs:** [New Operations Reference](/en-1.0.0-rewrite/nd4j/new-operations.md)

62 new LLM/VLM ops at the C++ level with matching Java SameDiff bindings:

**Attention:** FlashAttention-2, Grouped Query Attention (GQA), Multi-Latent Attention (MLA, DeepSeek-V3 style), Paged Attention, Cascade Attention, Lightning Attention, Sliding Window Attention, ONNX-compatible Multi-Head Attention.

**SSM / Recurrent:** Mamba selective scan, Mamba-2 SSD, Gated Delta Rule Networks (ICLR 2025).

**PEFT linear layers:** LoRA, DoRA, LoHa (Hadamard product), LoKr (Kronecker product) — all with forward and backward ops for training.

**Quantization:** FP8 E4M3/E5M2 matmul, AWQ dequantize, GGML block dequantize (Q4/Q8), INT4/INT8 matmul, SmoothQuant.

**Normalization:** RMSNorm, fused RMSNorm+SwiGLU, fused LayerNorm.

**Positional encoding:** RoPE, fused RoPE, multi-modal RoPE, ALiBi.

**Generation:** `AutoregressiveDecode` — full native decode loop as a single JNI call, eliminating per-step Java overhead.

**Audio/Signal:** 14 audio ops (mel spectrogram, MFCC, Griffin-Lim, pitch detection, spectral features), 5 signal ops (DFT, STFT, windowing).

**MoE:** Sparse routing + gating for Mixture of Experts models.
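Of the ops listed above, RMSNorm is small enough to state as a plain-Java reference, which can be handy for checking numerical output. This is a sketch of the standard formula, `y[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i]`, over a single vector. The `RmsNormReference` class is hypothetical, and the native kernel may differ in accumulation precision or batching.

```java
/** Reference RMSNorm over one vector: y[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i]. */
final class RmsNormReference {
    static float[] rmsNorm(float[] x, float[] weight, float eps) {
        // Accumulate in double to limit rounding error on long vectors.
        double sumSquares = 0.0;
        for (float v : x) {
            sumSquares += (double) v * v;
        }
        double invRms = 1.0 / Math.sqrt(sumSquares / x.length + eps);

        float[] y = new float[x.length];
        for (int i = 0; i < x.length; i++) {
            y[i] = (float) (x[i] * invRms * weight[i]);
        }
        return y;
    }
}
```

Unlike LayerNorm, this formula subtracts no mean and adds no bias, so it needs only one reduction per vector.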

***

## LLM & VLM Stack

**Docs:** [LLM & VLM Overview](/en-1.0.0-rewrite/deeplearning4j/overview-4.md)

New Maven modules for running large language models and vision-language models:

* **`samediff-llm`**: `GenerationPipeline` with paged/quantized/MLA/prefix-tree KV cache strategies, speculative decoding (2-5x throughput), continuous batching, and streaming token output.
* **`samediff-vlm`**: Vision-language model support with `VLMPipelineExecutor`, `MultiPartModelLoader` (vision encoder + decoder on separate GPUs), and `ImageTiler` for multi-page document processing.
* **`samediff-audio`**: Whisper ASR pipeline with GGUF model loading, mel filterbank, and beam search.
* **`nd4j-tokenizers`**: Rust-backed HuggingFace/SentencePiece/CLIP tokenizers.
* **LLM eval framework**: `EvalRunner` with MMLU, ARC, GSM8K, HellaSwag, TruthfulQA, Winogrande benchmarks.

***

## Training: PEFT & RL Alignment

**Docs:** [PEFT & RL Alignment](/en-1.0.0-rewrite/deeplearning4j/peft-and-rl.md)

12 parameter-efficient fine-tuning methods: LoRA, QLoRA, DoRA, AdaLoRA, DyLoRA, LoHa, LoKr, IA3, VeRA, Prefix Tuning, Prompt Tuning, LoftQ.
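For readers new to these methods, the LoRA forward pass is the common starting point: the frozen weight `W` is augmented by a low-rank update, `y = W x + (alpha / r) * B (A x)`. The sketch below implements that textbook formula on plain arrays. The `LoraReference` class is hypothetical and is unrelated to the library's `LoraConfig` or native LoRA ops.

```java
final class LoraReference {
    /** Dense matrix-vector product: m is rows x cols, v has length cols. */
    static double[] matVec(double[][] m, double[] v) {
        double[] out = new double[m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < v.length; j++) {
                out[i] += m[i][j] * v[j];
            }
        }
        return out;
    }

    /** y = W x + (alpha / r) * B (A x), with A of shape r x in and B of shape out x r. */
    static double[] forward(double[][] w, double[][] a, double[][] b, double alpha, double[] x) {
        int rank = a.length;
        double[] y = matVec(w, x);                 // frozen base projection
        double[] delta = matVec(b, matVec(a, x));  // low-rank path: down-project, then up-project
        double scale = alpha / rank;
        for (int i = 0; i < y.length; i++) {
            y[i] += scale * delta[i];
        }
        return y;
    }
}
```

Only `A` and `B` are trained, so the number of trainable values per layer is `r * (in + out)` instead of `in * out`. With `B` initialized to zero, the output equals the base model's output.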

9 RL alignment trainers: GRPO, DPO, DAPO, Dr.GRPO, PPO, KTO, ORPO, SimPO, GSPO. Plus VLM GRPO for vision-language RL.

Mixed precision: FP8 E4M3/E5M2 with dynamic loss scaling. 8-bit Adam optimizer (4x state memory reduction). Knowledge distillation trainer (logit/feature/attention KD). Dataset curation toolkit (dedup, decontamination, quality filtering, curriculum learning, sequence packing).
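The 4x figure for the 8-bit Adam optimizer follows from simple arithmetic: Adam keeps two moment tensors per parameter, so storing them in 1 byte instead of 4 bytes cuts state memory by a factor of four. The sketch below works that out; the `AdamStateMemory` class is hypothetical, and it ignores any per-block quantization scales, which add a small overhead in practice.

```java
final class AdamStateMemory {
    /** Bytes needed for Adam's two moment tensors (m and v) at the given value width. */
    static long stateBytes(long parameterCount, int bytesPerValue) {
        return 2L * parameterCount * bytesPerValue;
    }

    /** Ratio of FP32 optimizer state to 8-bit optimizer state. */
    static double reductionFactor(long parameterCount) {
        return (double) stateBytes(parameterCount, 4) / stateBytes(parameterCount, 1);
    }
}
```

For a 7-billion-parameter model, this gives 56 GB of FP32 optimizer state versus 14 GB at 8 bits (decimal gigabytes).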

***

## Build System Changes

* **CUDA 12.9 default** (dropped pre-built 11.4, 11.6 artifacts). Compilable for other 12.x versions via `-Dcuda.version`. Minimum driver 525.60.
* **CMake overhaul**: 21 new modules for Triton, MLIR, ZLUDA, Hexagon, TPU, SDX.
* **Backend namespace isolation** (`SD_BACKEND_NAMESPACE`): Enables loading `nd4jcpu.so` and `nd4jcuda.so` in the same JVM process without symbol conflicts.
* **FlatBuffers 25.2.10** vendored (previously downloaded). New `BufferChunk` table for >2 GB arrays.
* **New dtypes**: `BFLOAT16 = 17`, `UTF8 = 50`, `UTF16 = 51`, `UTF32 = 52`.
* **Named test suites** via `run-tests.yml`, including `quick`, `sanity`, `nd4j`, `samediff`, `dl4jcore`, `keras`, `datavec`, `onnx`, `tensorflow`, `integration`, `libnd4j`, `llm`, `vlm`, `ggml`, `zoo`, `longrunning`, `all`.

***

## Namespace Consolidation

### Current State (1.0.0-rewrite)

This release uses **three package roots** in parallel. All three work and are fully supported:

| Root                           | Where Used                                                                               | Example                                              |
| ------------------------------ | ---------------------------------------------------------------------------------------- | ---------------------------------------------------- |
| `org.nd4j.*`                   | Core ND4J, SameDiff, PEFT, RL trainers, GGML, dataset curation, execution infrastructure | `org.nd4j.autodiff.samediff.config.LoraConfig`       |
| `org.eclipse.deeplearning4j.*` | New high-level application modules (LLM, VLM, audio, OmniHub, pipelines, SafeTensors)    | `org.eclipse.deeplearning4j.llm.GenerationPipeline`  |
| `org.deeplearning4j.*`         | Legacy DL4J neural network APIs (MultiLayerNetwork, ComputationGraph, Keras import)      | `org.deeplearning4j.nn.multilayer.MultiLayerNetwork` |

This split is intentional for this transitional release — existing code continues to work unchanged.

### Next Release: Import Cleanup

The **next release** will consolidate these into a unified namespace. What to expect:

* **`org.eclipse.deeplearning4j.*`** becomes the canonical root for all application-level modules
* **`org.nd4j.*`** remains for core array/tensor/SameDiff APIs (these are stable)
* **`org.deeplearning4j.*`** legacy imports will be deprecated with re-export shims — existing code compiles with deprecation warnings, not errors
* The next release will also ship an OpenRewrite recipe that automates the migration

**Action for this release**: Use whichever imports work. Do not refactor your imports to anticipate the cleanup.

***

## Examples

The [`deeplearning4j-examples`](https://github.com/eclipse/deeplearning4j-examples) repository contains **34 example files** demonstrating the new features.

### DSP Execution Engine (5 examples)

| Example                                     | What It Shows                                                                                    |
| ------------------------------------------- | ------------------------------------------------------------------------------------------------ |
| `DSPExecutionExample.java`                  | DSP introduction — dynamic-shape inference, `GraphExecutionMode`, shape-keyed plan caching       |
| `DSPAdvancedExample.java`                   | Full DSP API — `DspHandle`, slot introspection, Chrome trace export, tensor/pipeline parallelism |
| `DSPBackendsAndKernelSelectionExample.java` | All 19 `GraphExecutionMode` backends, `KernelSelectionConfig`, 24 optimization passes            |
| `DSPDiskCacheAndTritonExample.java`         | `DspPlanDiskCache`, Triton kernel cache, `TritonCacheTool` CLI                                   |
| `DSPDiagnosticsAndDebuggingExample.java`    | 20 diagnostic categories, `DspDebugger`, `DspHandle` live introspection                          |

### LLM / VLM / Audio (6 examples)

| Example                             | What It Shows                                                                           |
| ----------------------------------- | --------------------------------------------------------------------------------------- |
| `LLMGenerationPipelineExample.java` | `GenerationPipeline`, KV cache strategies, speculative decoding, tensor parallelism     |
| `QwenTextGenerationExample.java`    | End-to-end: GGUF download → import → tokenize → generate with `ChatTemplate`            |
| `GraphOptimizerExample.java`        | `GraphOptimizer` — algebraic simplification, CSE, attention fusion                      |
| `SmolDoclingVLMExample.java`        | Vision-language model — ONNX components, image tiling, vision encoding, text generation |
| `VideoVLMExample.java`              | Video VLM — `VideoFrameSampler` strategies, SmolVLM2/Qwen3-VL                           |
| `WhisperSpeechToTextExample.java`   | Whisper ASR — mel spectrogram, transcription, timestamps                                |

### GGML/GGUF Import (4 examples)

| Example                         | What It Shows                                                                           |
| ------------------------------- | --------------------------------------------------------------------------------------- |
| `GGMLImportExportExample.java`  | Full GGML API — `ConversionOptions`, `GGUFReader`/`GGUFWriter`, round-trip quantization |
| `GGMLModelImportExample.java`   | Low-level import — architecture detection, quantization types                           |
| `HuggingFaceGGUFImport.java`    | `HuggingFaceHubDownloader` → `AutoModel` → format auto-detection                        |
| `SafeTensorsImportExample.java` | 3-level SafeTensors API — `AutoModel`, pipeline loader, raw reader                      |

### PEFT & RL (4 examples)

| Example                             | What It Shows                                                             |
| ----------------------------------- | ------------------------------------------------------------------------- |
| `SFTLoRATrainingConfigExample.java` | `LoraConfig`, `QLoraConfig`, `SFTConfig`, `GRPOConfig`, `DPOConfig`, BF16 |
| `AdvancedPEFTConfigExample.java`    | AdaLoRA, DoRA, IA3, Prefix/Prompt Tuning, PPO, KTO, ORPO, SimPO           |
| `SpecializedPEFTConfigExample.java` | LoftQ, LoHa, LoKr, VeRA, DyLoRA, multi-adapter serving                    |
| `RLAlignmentConfigExample.java`     | All 10 RL methods, reward model config, pipeline config                   |

### Training Infrastructure (5 examples)

| Example                                   | What It Shows                                                          |
| ----------------------------------------- | ---------------------------------------------------------------------- |
| `MixedPrecisionTrainingExample.java`      | FP16/BF16/FP8, `LossScaleConfig`, `GradientAccumulator`                |
| `DataCurationPipelineExample.java`        | Dedup, quality filtering, instruction formatting, stratified splitting |
| `KnowledgeDistillationExample.java`       | Logit/feature/attention KD, self-distillation                          |
| `TransferLearningAndFreezingExample.java` | Variable freezing, gradient checkpointing, continued pretraining       |
| `NewOptimizersExample.java`               | `Adam8bit` (4x memory reduction), `AdaBelief`                          |

### New Operations (3 examples)

| Example                      | What It Shows                                                                               |
| ---------------------------- | ------------------------------------------------------------------------------------------- |
| `TransformerOpsExample.java` | FlashAttention, SlidingWindow, FusedRoPE, RmsNorm, Mamba, FP8Matmul, tensor parallel linear |
| `SameDiffOpsExample.java`    | All op namespaces including `sd.audio()`, `sd.signal()`, mixed precision                    |
| `AudioOpsExample.java`       | 13 audio DSP ops — mel spectrogram, MFCC, Griffin-Lim, pitch detection                      |

***

## New Maven Modules

```xml
<!-- LLM generation pipeline -->
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-llm</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- Vision-language model support -->
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-vlm</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- Whisper ASR / audio -->
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-audio</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- GGUF model import -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-ggml</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- HuggingFace-compatible tokenizers -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-tokenizers</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- TorchScript/PyTorch model import -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-torchscript</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- Pipeline SPI modules -->
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-pipeline-core</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- DSP Runtime SDK (Java) -->
<dependency>
    <groupId>org.eclipse.deeplearning4j</groupId>
    <artifactId>nd4j-dsp-runtime-java</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- Hardware backends (add if hardware is present) -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-tpu</artifactId>          <!-- Google Cloud TPU (WIP) -->
    <version>1.0.0-rewrite</version>
</dependency>
```

***

## Bug Fixes & Improvements

Merged since M2.1:

### Memory Leaks

* Fix CUDA lstmLayer permute/transpose memory leak ([#10404](https://github.com/deeplearning4j/deeplearning4j/pull/10404))
* Fix lstmLayer.cu weight transformation memory leak ([#10403](https://github.com/deeplearning4j/deeplearning4j/pull/10403))
* Fix MmulHelper::mmulNxN memory leak ([#10394](https://github.com/deeplearning4j/deeplearning4j/pull/10394))
* Fix BaseNDArray.toFlatArray() memory leak for view arrays ([#10410](https://github.com/deeplearning4j/deeplearning4j/pull/10410))

### Correctness

* Fix COORDS2INDEX macro to use strides instead of shapes ([#10393](https://github.com/deeplearning4j/deeplearning4j/pull/10393))
* Fix DataType inconsistency in float\[] constant buffer handling ([#10411](https://github.com/deeplearning4j/deeplearning4j/pull/10411))
* Fix inverted boolean logic in DeallocatorService listener delegation ([#10412](https://github.com/deeplearning4j/deeplearning4j/pull/10412))
* Fix byte order handling in DataTypeConversions ([#10401](https://github.com/deeplearning4j/deeplearning4j/pull/10401))

### Security

* Fix command injection vulnerabilities in Windows bat scripts ([#10409](https://github.com/deeplearning4j/deeplearning4j/pull/10409), [#10407](https://github.com/deeplearning4j/deeplearning4j/pull/10407))

### API & Infrastructure

* Simplify batched GEMM API ([#10361](https://github.com/deeplearning4j/deeplearning4j/pull/10361))
* JavaCPP resource configuration for GraalVM native image support ([#10287](https://github.com/deeplearning4j/deeplearning4j/pull/10287))
* Autodiff core improvements ([#10280](https://github.com/deeplearning4j/deeplearning4j/pull/10280))
* CMake modernization ([#10245](https://github.com/deeplearning4j/deeplearning4j/pull/10245))
* Maven version updates for Java 25 support ([#10243](https://github.com/deeplearning4j/deeplearning4j/pull/10243))
* SameDiff file format scaling improvements ([#10209](https://github.com/deeplearning4j/deeplearning4j/pull/10209))

***

## Migration Guide

### DL4J → SameDiff Model Conversion

If you have existing `MultiLayerNetwork` or `ComputationGraph` models and want to run them through DSP (CUDA graph replay, Triton JIT, TPU/Hexagon backends), two converter utilities bridge the gap:

```java
// MultiLayerNetwork
MultiLayerNetwork net = MultiLayerNetwork.load(new File("mymodel.zip"), true);
net.init();
SameDiff sd = MultiLayerNetworkSameDiffConverter.toSameDiff(net);
sd.setGraphExecutionMode(GraphExecutionMode.TRITON);
Map<String, INDArray> result = sd.output(
    Collections.singletonMap("input", inputArray), "output");

// ComputationGraph (distinct variable name so both snippets compile in one scope)
ComputationGraph cg = ComputationGraph.load(new File("mygraph.zip"), true);
cg.init();
SameDiff cgSd = ComputationGraphSameDiffConverter.toSameDiff(cg);
cgSd.save(new File("converted.sdz"), true);  // Save as SDZ for fast future loads
```

**Supported layer types:** DenseLayer, OutputLayer, ConvolutionLayer, Convolution1DLayer, Deconvolution2D, SeparableConvolution2D, DepthwiseConvolution2D, SubsamplingLayer, Subsampling1DLayer, BatchNormalization, ActivationLayer, DropoutLayer (pass-through), GlobalPoolingLayer, EmbeddingLayer, EmbeddingSequenceLayer, LSTM, SimpleRnn, LocalResponseNormalization, ZeroPaddingLayer, ZeroPadding1DLayer, Upsampling2D, RepeatVector, MergeVertex (concat), ElementWiseVertex (add/sub/product/avg/max).

**Limitations:**

* One-way only — SameDiff cannot convert back to DL4J layer configs.
* Unsupported layer types throw `UnsupportedOperationException`.
* Dropout is pass-through in the converted graph.

### Opting In to DSP Optimization

```java
// System property: pass as a JVM flag on the command line
//   -Dnd4j.execution.mode=TRITON

// Programmatic
sd.setGraphExecutionMode(GraphExecutionMode.TRITON);
```
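If the mode is supplied through the system property, an application may want to validate the value at startup so that a typo does not go unnoticed. The following is a sketch of such a check on the application side. The `ExecutionModeResolver` class, its reduced `Mode` enum, and the fallback to `AUTO` are assumptions and do not describe how the library itself parses the property.

```java
import java.util.Locale;

final class ExecutionModeResolver {
    /** Hypothetical subset of modes; the real GraphExecutionMode enum has more values. */
    enum Mode { AUTO, TRITON, CUDA_GRAPHS, OPENVINO }

    /** Maps a raw property value to a mode, falling back to AUTO when unset or unrecognized. */
    static Mode parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Mode.AUTO;
        }
        try {
            return Mode.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return Mode.AUTO;
        }
    }

    /** Reads the same property name used by the JVM flag above. */
    static Mode fromSystemProperty() {
        return parse(System.getProperty("nd4j.execution.mode"));
    }
}
```

An application could log a warning when `parse` falls back to `AUTO` for a non-empty value, instead of silently ignoring it.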

To debug optimizer behavior:

```
-Dnd4j.optimizer.skip=AttentionFusion
-Dnd4j.optimizer.logApplied=true
```

### New Backend Discovery

Hardware backends (TPU, Hexagon, ZLUDA) are discovered automatically if the corresponding Maven module is on the classpath and the hardware is present. No code changes required. Pin `GraphExecutionMode` explicitly for deterministic backend selection.

### Serialization Compatibility

* Existing FlatBuffers-based `SameDiff.load()`/`SameDiff.save()` continue to work.
* New SDNB (section-based binary) and SDZ (ZIP-wrapped SDNB) formats support models >2 GB via sharding.
* GGUF models imported via `GGMLModelImport` can be converted to SDZ via `convertToSDZ()` for faster subsequent loads.
* DSP compiled plans cache to `~/.kompile/cache/dsp/` and invalidate automatically on model changes.
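Because SDZ is described as a ZIP-wrapped container, it should be possible to inspect one with the JDK's standard `java.util.zip` classes, assuming the file is a conventional ZIP archive. The sketch below lists entry names and sizes. The `SdzInspector` class is hypothetical, and nothing here assumes particular entry names inside the archive.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class SdzInspector {
    /** Lists entry names and uncompressed sizes of a ZIP container. */
    static List<String> listEntries(File archive) throws IOException {
        List<String> lines = new ArrayList<>();
        try (ZipFile zip = new ZipFile(archive)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                lines.add(entry.getName() + " (" + entry.getSize() + " bytes)");
            }
        }
        return lines;
    }
}
```

Calling `SdzInspector.listEntries(new File("converted.sdz"))` on a saved model gives a quick way to confirm what was written without loading the full graph.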