
# 1.0.0-rewrite

> **This is a transitional release.** The 1.0.0-rewrite introduces the full new feature set (DSP, LLM/VLM, PEFT, hardware backends) while maintaining backward compatibility with existing `org.deeplearning4j` and `org.nd4j` imports. A namespace consolidation is underway — the **next release** will complete the import cleanup, unifying packages under a consistent structure. See [Namespace Consolidation](#namespace-consolidation) below for what to expect.

## Highlights

The 22 PRs ([#10435](https://github.com/deeplearning4j/deeplearning4j/pull/10435)–[#10456](https://github.com/deeplearning4j/deeplearning4j/pull/10456)) in this release introduce capabilities that reposition DL4J as a first-class runtime for modern LLM and VLM workloads. Most feature areas have their own documentation page; the links in the table below lead to guides with API references, code examples, and configuration details.

| Feature                                                           | Description                                                                                                                                | Scale                                                   |
| ----------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------- |
| [DSP Execution Engine](/nd4j/overview-2/dsp.md)                   | Compiled graph runtime with CUDA graph capture/replay, Triton/NVRTC/PTX JIT, 26-pass graph optimizer, and multi-backend dispatch           | \~240 new files                                         |
| [LLM & VLM Stack](/deeplearning4j/overview-4.md)                  | Generation pipelines, paged KV cache, speculative decoding, continuous batching, tokenizers, eval framework, model editing                 | 456 new files                                           |
| [GGML/GGUF Import](/model-import/overview-3.md)                   | Import quantized LLMs from `.gguf` files with 20+ architecture handlers, complete quantization codecs, round-trip export                   | 121 new files                                           |
| [PEFT & RL Alignment](/deeplearning4j/peft-and-rl.md)             | 12 PEFT methods, 9 RL alignment trainers, FP8 mixed precision, 8-bit Adam, knowledge distillation, dataset curation                        | \~80 new files                                          |
| [ONNX Import & Export](/model-import/overview-2/onnx-expanded.md) | \~120 new ONNX ops including Microsoft LLM contrib, bidirectional SameDiff-to-ONNX export                                                  | \~130 files                                             |
| [New Operations (\~130)](/nd4j/new-operations.md)                 | Fused attention, KV cache, PEFT linear layers, normalization, positional encoding, quantization, SSM/recurrent, MoE, audio/signal          | \~80 new files                                          |
| [Hardware Backends](/nd4j/overview-1/hardware-backends.md)        | TPU (PJRT), Hexagon DSP (QNN), ZLUDA (AMD/Intel), SDX, ARM ACL (124 ops), Apple Accelerate (28 ops), cuDNN expansion, llama.cpp, MPS, MLIR | 382+ files                                              |
| DSP Runtime SDK                                                   | Stable C ABI with Java, Kotlin, Python, Rust, Swift, C# bindings                                                                           | See [DSP docs](/nd4j/overview-2/dsp.md#dsp-runtime-sdk) |
| Build System & CI                                                 | CUDA 12.9, 21 CMake modules, backend namespace isolation, 18 test suites                                                                   | \~60 files                                              |

***

## Namespace Consolidation

### Current State (1.0.0-rewrite)

This release uses **three package roots** in parallel. All three work and are fully supported:

| Root                           | Where Used                                                                               | Example                                              |
| ------------------------------ | ---------------------------------------------------------------------------------------- | ---------------------------------------------------- |
| `org.nd4j.*`                   | Core ND4J, SameDiff, PEFT, RL trainers, GGML, dataset curation, execution infrastructure | `org.nd4j.autodiff.samediff.config.LoraConfig`       |
| `org.eclipse.deeplearning4j.*` | New high-level application modules (LLM, VLM, audio, OmniHub, pipelines, SafeTensors)    | `org.eclipse.deeplearning4j.llm.GenerationPipeline`  |
| `org.deeplearning4j.*`         | Legacy DL4J neural network APIs (MultiLayerNetwork, ComputationGraph, Keras import)      | `org.deeplearning4j.nn.multilayer.MultiLayerNetwork` |

This split is intentional for this transitional release — existing code continues to work unchanged.

### Next Release: Import Cleanup

The **next release** will consolidate these into a unified namespace structure. What to expect:

* **`org.eclipse.deeplearning4j.*`** becomes the canonical root for all application-level modules
* **`org.nd4j.*`** remains for core array/tensor/SameDiff APIs (these are stable)
* **`org.deeplearning4j.*`** legacy imports will be deprecated with re-export shims — existing code compiles with deprecation warnings, not errors
* PEFT, RL trainers, and training configs currently under `org.nd4j.autodiff.samediff.config.*` will move to a more discoverable location
* The `org.nd4j.ggml.*` package may be consolidated under the pipeline framework

**Action for this release**: Keep the imports you already use. Do not refactor them in anticipation of the cleanup; the next release will provide clear migration tooling.

***

## Examples

The [`deeplearning4j-examples`](https://github.com/eclipse/deeplearning4j-examples) repository contains **34 example files** demonstrating the new features. All examples are in the `samediff-examples`, `onnx-import-examples`, and `dl4j-examples` modules.

### DSP Execution Engine (5 examples)

| Example                                     | What It Shows                                                                                                            |
| ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `DSPExecutionExample.java`                  | DSP introduction — dynamic-shape inference, `GraphExecutionMode`, shape-keyed plan caching                               |
| `DSPAdvancedExample.java`                   | Full DSP API — `DspHandle`, slot introspection, Chrome trace export, tensor/pipeline parallelism                         |
| `DSPBackendsAndKernelSelectionExample.java` | All 19 `GraphExecutionMode` backends, `KernelSelectionConfig`, 24 optimization passes, TPU/Hexagon/ZLUDA/Metal/MLX/NNAPI |
| `DSPDiskCacheAndTritonExample.java`         | `DspPlanDiskCache`, Triton kernel cache, `TritonCacheTool` CLI, plan binary format                                       |
| `DSPDiagnosticsAndDebuggingExample.java`    | 20 diagnostic categories, `DspDebugger`, `DspHandle` live introspection, `DspPlanAssertions`                             |

Path: `samediff-examples/src/main/java/org/nd4j/examples/samediff/advanced/execution/`

### LLM Generation (3 examples)

| Example                             | What It Shows                                                                                                            |
| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `LLMGenerationPipelineExample.java` | `GenerationPipeline` API, KV cache strategies, speculative decoding, tensor parallelism, VLM integration                 |
| `QwenTextGenerationExample.java`    | End-to-end: GGUF download → `GGMLModelImport` → `HuggingFaceTokenizer` → `GenerationPipeline` → chat with `ChatTemplate` |
| `GraphOptimizerExample.java`        | `GraphOptimizer` — algebraic simplification, peephole, strength reduction, CSE, attention fusion                         |

Path: `samediff-examples/src/main/java/org/nd4j/examples/samediff/quickstart/modeling/`

### VLM and Audio (3 examples)

| Example                           | What It Shows                                                                                                    |
| --------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| `SmolDoclingVLMExample.java`      | Vision-language model — ONNX components, image tiling, vision encoding, embedding merging, text generation       |
| `VideoVLMExample.java`            | Video VLM — `VideoFrameSampler` strategies (UNIFORM/FIXED\_FPS/KEYFRAME), `VideoPreprocessor`, SmolVLM2/Qwen3-VL |
| `WhisperSpeechToTextExample.java` | Whisper ASR — model download, mel spectrogram, transcription, timestamps                                         |

### GGML/GGUF Import (4 examples)

| Example                         | What It Shows                                                                                            |
| ------------------------------- | -------------------------------------------------------------------------------------------------------- |
| `GGMLImportExportExample.java`  | Full GGML API — `ConversionOptions`, `ExportOptions`, `GGUFReader`/`GGUFWriter`, round-trip quantization |
| `GGMLModelImportExample.java`   | Low-level import — architecture detection, quantization types Q2\_K through IQ2\_XS                      |
| `HuggingFaceGGUFImport.java`    | `HuggingFaceHubDownloader` → `AutoModel` → format auto-detection (GGUF/SafeTensors/ONNX/PyTorch)         |
| `SafeTensorsImportExample.java` | 3-level SafeTensors API — `AutoModel`, `SafeTensorsPipelineLoader`, `SafeTensorsReader`                  |

Path: `onnx-import-examples/src/main/java/org/deeplearning4j/modelimportexamples/omnihub/`

### PEFT (3 examples)

| Example                             | What It Shows                                                                                                                                 |
| ----------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
| `SFTLoRATrainingConfigExample.java` | `LoraConfig`, `QLoraConfig`, `SFTConfig`, `GRPOConfig`, `DPOConfig`, BF16 mixed precision                                                     |
| `AdvancedPEFTConfigExample.java`    | `AdaLoraConfig`, `DoraConfig`, `IA3Config`, `PrefixTuningConfig`, `PromptTuningConfig`, `PPOConfig`, `KTOConfig`, `ORPOConfig`, `SimPOConfig` |
| `SpecializedPEFTConfigExample.java` | `LoftQConfig`, `LohaConfig`, `LokrConfig`, `VeraConfig`, `DyLoraConfig`, `AdapterConfig`, `LoraAdapterCache` (multi-adapter serving)          |

Path: `samediff-examples/src/main/java/org/nd4j/examples/samediff/quickstart/training/`

### RL Alignment (1 example)

| Example                         | What It Shows                                                                                                                                               |
| ------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `RLAlignmentConfigExample.java` | All 9 RL methods: DPO (standard/IPO/RDPO), GRPO, PPO, KTO, ORPO, SimPO, DAPO (asymmetric clipping), GSPO, Dr.GRPO; plus `RewardModelConfig` and `RLPipelineConfig` |

### Training Infrastructure (5 examples)

| Example                                   | What It Shows                                                                                                                             |
| ----------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
| `MixedPrecisionTrainingExample.java`      | FP16/BF16/FP8 mixed precision, `LossScaleConfig`, `GradientAccumulator`, `FP8TrainingConfig`                                              |
| `DataCurationPipelineExample.java`        | `TextDeduplicator`, `TextQualityFilter`, `InstructionDataFormatter`, `StratifiedSplitter`, `LengthBucketingIterator`, `WeightedDataMixer` |
| `KnowledgeDistillationExample.java`       | `DistillationTrainer` — logit KD, feature KD, attention KD, combined                                                                      |
| `TransferLearningAndFreezingExample.java` | `PeftModel`, variable freezing, `GradientCheckpointConfig`, `ContinuedPretrainingConfig`                                                  |
| `NewOptimizersExample.java`               | `Adam8bit` (4x memory reduction), `AdaBelief`                                                                                             |
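
The 4x figure quoted for `Adam8bit` is consistent with simple arithmetic on optimizer state, assuming it refers to the two Adam moment tensors stored at 1 byte per value instead of 4. The sketch below is plain JDK code, not DL4J API: the class `AdamStateMemory` and the 7-billion-parameter model size are hypothetical, and the calculation ignores any per-block quantization constants an 8-bit optimizer may also store.

```java
// Hypothetical helper (not part of DL4J): back-of-the-envelope optimizer-state sizing.
final class AdamStateMemory {

    // Adam keeps two state tensors per parameter: first and second moment estimates.
    static final int STATE_TENSORS = 2;

    // Bytes needed for Adam state, given a parameter count and bytes per stored value.
    static long stateBytes(long paramCount, int bytesPerValue) {
        return paramCount * STATE_TENSORS * bytesPerValue;
    }

    public static void main(String[] args) {
        long params = 7_000_000_000L;            // assumed model size, for illustration only
        long fp32 = stateBytes(params, 4);       // 32-bit state: 4 bytes per value
        long int8 = stateBytes(params, 1);       // 8-bit state: 1 byte per value
        System.out.println("FP32 state bytes:  " + fp32);
        System.out.println("8-bit state bytes: " + int8);
        System.out.println("Reduction factor:  " + (fp32 / int8));
    }
}
```

The reduction applies to optimizer state only; weights, gradients, and activations are sized separately.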

### New Operations (3 examples)

| Example                      | What It Shows                                                                                                                                                                            |
| ---------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `TransformerOpsExample.java` | `FlashAttention`, `SlidingWindowAttention`, `FusedRoPE`/`FusedMRoPE`, `RmsNorm`, `SiLU`, `SelectiveScan` (Mamba), `TokenSample`, `FP8Matmul`, `ColumnParallelLinear`/`RowParallelLinear` |
| `SameDiffOpsExample.java`    | All op namespaces including new `sd.audio()`, `sd.signal()`, mixed precision types                                                                                                       |
| `AudioOpsExample.java`       | 13 audio DSP ops — mel spectrogram, MFCC, Griffin-Lim, pitch detection, spectral features                                                                                                |

### Additional Examples

| Example                                   | What It Shows                                                                                   |
| ----------------------------------------- | ----------------------------------------------------------------------------------------------- |
| `TorchScriptImportExample.java`           | `TorchScriptModelImport` — import PyTorch `.pt` models via `nd4j-torchscript`                   |
| `OmniHubPretrainedModels.java`            | `OmniHubUtils`, `Pretrained`, `HuggingFaceHubDownloader` — model zoo with format auto-detection |
| `TtsTrainingPipelineExample.java`         | TTS fine-tuning with LoRA — `TtsTrainingPipeline`, `AudioDataProcessor`                         |
| `LRScheduleConfigExample.java`            | `CosineWarmupSchedule` + 9 other schedules                                                      |
| `KnowledgeDistillationConfigExample.java` | `DistillationConfig` builder variants                                                           |
| `TransferLearningConfigExample.java`      | `TransferLearning`, `FineTuneConfiguration`, `VariableGroup`                                    |

***

## New Maven Modules

```xml
<!-- LLM generation pipeline -->
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-llm</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- Vision-language model support -->
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-vlm</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- Whisper ASR / audio -->
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-audio</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- GGUF model import -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-ggml</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- HuggingFace-compatible tokenizers -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-tokenizers</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- TorchScript/PyTorch model import -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-torchscript</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- Pipeline SPI modules -->
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-pipeline-core</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-pipeline-ggml</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-pipeline-safetensors</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-pipeline-onnx</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- DSP Runtime SDK (Java) -->
<dependency>
    <groupId>org.eclipse.deeplearning4j</groupId>
    <artifactId>nd4j-dsp-runtime-java</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- Hardware backends (add if hardware is present) -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-tpu</artifactId>          <!-- Google Cloud TPU -->
    <version>1.0.0-rewrite</version>
</dependency>
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-hexagon</artifactId>       <!-- Qualcomm Hexagon DSP -->
    <version>1.0.0-rewrite</version>
</dependency>
```

***

## Bug Fixes & Improvements

The following issues were resolved in PRs merged since M2.1:

### Memory Leaks

* Fix CUDA lstmLayer permute/transpose memory leak ([#10404](https://github.com/deeplearning4j/deeplearning4j/pull/10404))
* Fix lstmLayer.cu weight transformation memory leak ([#10403](https://github.com/deeplearning4j/deeplearning4j/pull/10403))
* Fix MmulHelper::mmulNxN memory leak ([#10394](https://github.com/deeplearning4j/deeplearning4j/pull/10394))
* Fix BaseNDArray.toFlatArray() memory leak for view arrays ([#10410](https://github.com/deeplearning4j/deeplearning4j/pull/10410))

### Correctness

* Fix COORDS2INDEX macro to use strides instead of shapes ([#10393](https://github.com/deeplearning4j/deeplearning4j/pull/10393))
* Fix DataType inconsistency in float\[] constant buffer handling ([#10411](https://github.com/deeplearning4j/deeplearning4j/pull/10411))
* Fix inverted boolean logic in DeallocatorService listener delegation ([#10412](https://github.com/deeplearning4j/deeplearning4j/pull/10412))
* Fix byte order handling in DataTypeConversions ([#10401](https://github.com/deeplearning4j/deeplearning4j/pull/10401))

### Security

* Fix command injection vulnerabilities in Windows bat scripts ([#10409](https://github.com/deeplearning4j/deeplearning4j/pull/10409), [#10407](https://github.com/deeplearning4j/deeplearning4j/pull/10407))

### API & Infrastructure

* Simplify batched GEMM API ([#10361](https://github.com/deeplearning4j/deeplearning4j/pull/10361))
* JavaCPP resource configuration for native image (GraalVM) support ([#10287](https://github.com/deeplearning4j/deeplearning4j/pull/10287))
* Autodiff core improvements ([#10280](https://github.com/deeplearning4j/deeplearning4j/pull/10280))
* CMake modernization ([#10245](https://github.com/deeplearning4j/deeplearning4j/pull/10245))
* Maven version updates for Java 25 support ([#10243](https://github.com/deeplearning4j/deeplearning4j/pull/10243))
* SameDiff file format scaling improvements ([#10209](https://github.com/deeplearning4j/deeplearning4j/pull/10209))

***

## Migration Guide

### This Is a Transitional Release

1.0.0-rewrite is designed to be **additive** — existing M2.1 code continues to work. The new features are opt-in via new Maven modules and new API entry points. No existing imports or APIs are removed.

However, the **next release** will carry out the namespace consolidation described above. To prepare:

* Avoid deep coupling to internal package paths (e.g., `org.nd4j.autodiff.samediff.config.*` may move)
* Prefer the high-level entry points (`GenerationPipeline`, `GGMLModelImport`, `PeftModelFactory`) over assembling components manually
* The examples in `deeplearning4j-examples` show the intended usage patterns

### Opting In to DSP Optimization

Graph execution mode defaults to `AUTO`. To explicitly select a mode:

```java
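// Option 1: pass a JVM system property on the command line, for example:
//   java -Dnd4j.execution.mode=TRITON -jar your-app.jar

// Option 2: set it programmatically (restricted to BenchmarkConfigApplier in production).
// `sd` is assumed to be an existing SameDiff instance.
sd.setGraphExecutionMode(GraphExecutionMode.TRITON);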
```

To skip specific optimizer passes during debugging, and to log the passes that were applied, set these JVM system properties:

```
-Dnd4j.optimizer.skip=AttentionFusion
-Dnd4j.optimizer.logApplied=true
```
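
Because these settings are ordinary JVM system properties, a launcher or test harness can inspect them with plain `System.getProperty`. The sketch below uses only the JDK. The property names and the `AUTO` default come from this page; the helper class `ExecutionFlags` and the comma-separated format for `nd4j.optimizer.skip` are assumptions for illustration and do not reflect how DL4J itself parses these values.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper (not part of DL4J): reads the flags described above.
final class ExecutionFlags {

    // Returns the requested execution mode, or "AUTO" when the property is unset.
    static String executionMode() {
        return System.getProperty("nd4j.execution.mode", "AUTO");
    }

    // Returns the pass names listed in nd4j.optimizer.skip.
    // Assumption: multiple pass names are separated by commas.
    static List<String> skippedPasses() {
        String raw = System.getProperty("nd4j.optimizer.skip", "");
        return Arrays.stream(raw.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println("mode = " + executionMode());
        System.out.println("skipped passes = " + skippedPasses());
    }
}
```

Printing these values at startup is a cheap way to confirm that the flags reached the JVM that actually runs the graph, rather than a wrapper process.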

### New Backend Discovery

Hardware backends (TPU, Hexagon, ZLUDA) are discovered automatically if the corresponding Maven module is on the classpath and the hardware is present. No code changes are required.
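
Automatic discovery depends on two conditions: the module is on the classpath and the hardware is present. When a backend is not picked up, the first condition can be checked with a plain JDK classpath probe, sketched below. This is not DL4J's discovery mechanism, which this page does not describe; `ClasspathProbe` is a hypothetical helper and the class name in `main` is a placeholder to replace with a class from the backend module you added.

```java
// Hypothetical helper (not part of DL4J): checks whether a class is loadable.
final class ClasspathProbe {

    // Returns true if the named class can be found by this class's class loader.
    // initialize=false avoids running static initializers such as native library loading.
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder name; substitute a class that ships in the backend artifact.
        String backendClass = "com.example.HypotheticalBackend";
        System.out.println(backendClass + " present: " + isPresent(backendClass));
    }
}
```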

### Serialization Compatibility

* `DspPlanDiskCache` stores compiled plans in `~/.kompile/cache/dsp/` using a content-based key. Because the key is derived from the model content, the cache is invalidated automatically when the model changes.
* New model formats (SDNB, SDZ) are supported alongside the existing FlatBuffers serialization. Both `SameDiff.load()` and `SameDiff.save()` continue to work.
* GGUF models imported via `GGMLModelImport` can be converted to native SDZ format via `convertToSDZ()` for faster subsequent loads.
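
To illustrate why a content-based key makes invalidation automatic, the sketch below derives a cache file name from a hash of the model bytes, so any change to the model yields a different key and the stale entry is simply never looked up again. It uses only the JDK. The actual key derivation and file layout of `DspPlanDiskCache` are not documented on this page, so the SHA-256 choice, the `.plan` extension, and the class `ContentKeySketch` are all assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch (not DspPlanDiskCache itself): content-addressed cache paths.
final class ContentKeySketch {

    // Hex-encoded SHA-256 of the model bytes; identical content gives an identical key.
    static String contentKey(byte[] modelBytes) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(modelBytes);
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory for every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    // Cache location under the directory named on this page; ".plan" is an assumed suffix.
    static Path cachePath(byte[] modelBytes) {
        return Paths.get(System.getProperty("user.home"), ".kompile", "cache", "dsp",
                contentKey(modelBytes) + ".plan");
    }

    public static void main(String[] args) {
        byte[] v1 = "model-v1".getBytes(StandardCharsets.UTF_8);
        byte[] v2 = "model-v2".getBytes(StandardCharsets.UTF_8);
        System.out.println(cachePath(v1));
        System.out.println(cachePath(v2)); // different content, different file
    }
}
```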

### Build System

* CUDA 12.9 is now supported alongside 12.6
* Backend namespace isolation (`SD_BACKEND_NAMESPACE`) enables true multi-backend co-loading in a single JVM process
* 18 named test suites are available via `run-tests.yml`, including: `quick`, `sanity`, `nd4j`, `samediff`, `dl4jcore`, `keras`, `datavec`, `onnx`, `tensorflow`, `integration`, `libnd4j`, `llm`, `vlm`, `ggml`, `zoo`, `longrunning`, `all`