# 1.0.0-rewrite

> **This is a transitional release.** The 1.0.0-rewrite introduces the full new feature set (DSP, LLM/VLM, PEFT, hardware backends) while maintaining backward compatibility with existing `org.deeplearning4j` and `org.nd4j` imports. A namespace consolidation is underway — the **next release** will complete the import cleanup, unifying packages under a consistent structure. See [Namespace Consolidation](#namespace-consolidation) below for what to expect.

## Highlights

The 22 PRs ([#10435](https://github.com/deeplearning4j/deeplearning4j/pull/10435)–[#10456](https://github.com/deeplearning4j/deeplearning4j/pull/10456)) introduce capabilities that reposition DL4J as a first-class runtime for modern LLM and VLM workloads. Each feature area has its own full documentation page — the links below take you to comprehensive guides with API references, code examples, and configuration details.

| Feature                                                           | Description                                                                                                                                | Scale                                                   |
| ----------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------- |
| [DSP Execution Engine](/nd4j/overview-2/dsp.md)                   | Compiled graph runtime with CUDA graph capture/replay, Triton/NVRTC/PTX JIT, 26-pass graph optimizer, and multi-backend dispatch           | \~240 new files                                         |
| [LLM & VLM Stack](/deeplearning4j/overview-4.md)                  | Generation pipelines, paged KV cache, speculative decoding, continuous batching, tokenizers, eval framework, model editing                 | 456 new files                                           |
| [GGML/GGUF Import](/model-import/overview-3.md)                   | Import quantized LLMs from `.gguf` files with 20+ architecture handlers, complete quantization codecs, round-trip export                   | 121 new files                                           |
| [PEFT & RL Alignment](/deeplearning4j/peft-and-rl.md)             | 12 PEFT methods, 9 RL alignment trainers, FP8 mixed precision, 8-bit Adam, knowledge distillation, dataset curation                        | \~80 new files                                          |
| [ONNX Import & Export](/model-import/overview-2/onnx-expanded.md) | \~120 new ONNX ops including Microsoft LLM contrib, bidirectional SameDiff-to-ONNX export                                                  | \~130 files                                             |
| [New Operations (\~130)](/nd4j/new-operations.md)                 | Fused attention, KV cache, PEFT linear layers, normalization, positional encoding, quantization, SSM/recurrent, MoE, audio/signal          | \~80 new files                                          |
| [Hardware Backends](/nd4j/overview-1/hardware-backends.md)        | TPU (PJRT), Hexagon DSP (QNN), ZLUDA (AMD/Intel), SDX, ARM ACL (124 ops), Apple Accelerate (28 ops), cuDNN expansion, llama.cpp, MPS, MLIR | 382+ files                                              |
| DSP Runtime SDK                                                   | Stable C ABI with Java, Kotlin, Python, Rust, Swift, C# bindings                                                                           | See [DSP docs](/nd4j/overview-2/dsp.md#dsp-runtime-sdk) |
| Build System & CI                                                 | CUDA 12.9, 21 CMake modules, backend namespace isolation, 18 test suites                                                                   | \~60 files                                              |

***

## Namespace Consolidation

### Current State (1.0.0-rewrite)

This release uses **three package roots** in parallel. All three work and are fully supported:

| Root                           | Where Used                                                                               | Example                                              |
| ------------------------------ | ---------------------------------------------------------------------------------------- | ---------------------------------------------------- |
| `org.nd4j.*`                   | Core ND4J, SameDiff, PEFT, RL trainers, GGML, dataset curation, execution infrastructure | `org.nd4j.autodiff.samediff.config.LoraConfig`       |
| `org.eclipse.deeplearning4j.*` | New high-level application modules (LLM, VLM, audio, OmniHub, pipelines, SafeTensors)    | `org.eclipse.deeplearning4j.llm.GenerationPipeline`  |
| `org.deeplearning4j.*`         | Legacy DL4J neural network APIs (MultiLayerNetwork, ComputationGraph, Keras import)      | `org.deeplearning4j.nn.multilayer.MultiLayerNetwork` |

This split is intentional for this transitional release — existing code continues to work unchanged.

### Next Release: Import Cleanup

The **next release** will consolidate these into a unified namespace structure. What to expect:

* **`org.eclipse.deeplearning4j.*`** becomes the canonical root for all application-level modules
* **`org.nd4j.*`** remains for core array/tensor/SameDiff APIs (these are stable)
* **`org.deeplearning4j.*`** legacy imports will be deprecated with re-export shims — existing code compiles with deprecation warnings, not errors
* PEFT, RL trainers, and training configs currently under `org.nd4j.autodiff.samediff.config.*` will move to a more discoverable location
* The `org.nd4j.ggml.*` package may be consolidated under the pipeline framework

**Action for this release**: Keep the imports your code already uses. Do not refactor your imports to anticipate the cleanup; the next release will provide clear migration tooling.
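If you want to see which of the three roots your build actually pulls in, a small classpath probe is enough. The sketch below uses only the JDK and the example class names from the table above; it makes no assumption about those classes beyond their names, and the `NamespaceProbe` class itself is a hypothetical helper, not part of DL4J.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Reports which of the three package roots are reachable on the current classpath. */
final class NamespaceProbe {

    /** Returns true if the named class can be located (it is not initialized). */
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, NamespaceProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Root -> example class, copied from the "Current State" table.
        Map<String, String> probes = new LinkedHashMap<>();
        probes.put("org.nd4j.*", "org.nd4j.autodiff.samediff.config.LoraConfig");
        probes.put("org.eclipse.deeplearning4j.*", "org.eclipse.deeplearning4j.llm.GenerationPipeline");
        probes.put("org.deeplearning4j.*", "org.deeplearning4j.nn.multilayer.MultiLayerNetwork");

        probes.forEach((root, cls) ->
                System.out.printf("%-30s %s%n", root, isPresent(cls) ? "present" : "missing"));
    }
}
```

Run it with the same classpath as your application. A root reported as missing only means the corresponding module is not a dependency of that build.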

***

## Examples

The [`deeplearning4j-examples`](https://github.com/eclipse/deeplearning4j-examples) repository contains **34 example files** demonstrating the new features. All examples are in the `samediff-examples`, `onnx-import-examples`, and `dl4j-examples` modules.

### DSP Execution Engine (5 examples)

| Example                                     | What It Shows                                                                                                            |
| ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `DSPExecutionExample.java`                  | DSP introduction — dynamic-shape inference, `GraphExecutionMode`, shape-keyed plan caching                               |
| `DSPAdvancedExample.java`                   | Full DSP API — `DspHandle`, slot introspection, Chrome trace export, tensor/pipeline parallelism                         |
| `DSPBackendsAndKernelSelectionExample.java` | All 19 `GraphExecutionMode` backends, `KernelSelectionConfig`, 24 optimization passes, TPU/Hexagon/ZLUDA/Metal/MLX/NNAPI |
| `DSPDiskCacheAndTritonExample.java`         | `DspPlanDiskCache`, Triton kernel cache, `TritonCacheTool` CLI, plan binary format                                       |
| `DSPDiagnosticsAndDebuggingExample.java`    | 20 diagnostic categories, `DspDebugger`, `DspHandle` live introspection, `DspPlanAssertions`                             |

Path: `samediff-examples/src/main/java/org/nd4j/examples/samediff/advanced/execution/`

### LLM Generation (3 examples)

| Example                             | What It Shows                                                                                                            |
| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `LLMGenerationPipelineExample.java` | `GenerationPipeline` API, KV cache strategies, speculative decoding, tensor parallelism, VLM integration                 |
| `QwenTextGenerationExample.java`    | End-to-end: GGUF download → `GGMLModelImport` → `HuggingFaceTokenizer` → `GenerationPipeline` → chat with `ChatTemplate` |
| `GraphOptimizerExample.java`        | `GraphOptimizer` — algebraic simplification, peephole, strength reduction, CSE, attention fusion                         |

Path: `samediff-examples/src/main/java/org/nd4j/examples/samediff/quickstart/modeling/`

### VLM and Audio (3 examples)

| Example                           | What It Shows                                                                                                    |
| --------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| `SmolDoclingVLMExample.java`      | Vision-language model — ONNX components, image tiling, vision encoding, embedding merging, text generation       |
| `VideoVLMExample.java`            | Video VLM — `VideoFrameSampler` strategies (UNIFORM/FIXED\_FPS/KEYFRAME), `VideoPreprocessor`, SmolVLM2/Qwen3-VL |
| `WhisperSpeechToTextExample.java` | Whisper ASR — model download, mel spectrogram, transcription, timestamps                                         |

### GGML/GGUF Import (4 examples)

| Example                         | What It Shows                                                                                            |
| ------------------------------- | -------------------------------------------------------------------------------------------------------- |
| `GGMLImportExportExample.java`  | Full GGML API — `ConversionOptions`, `ExportOptions`, `GGUFReader`/`GGUFWriter`, round-trip quantization |
| `GGMLModelImportExample.java`   | Low-level import — architecture detection, quantization types Q2\_K through IQ2\_XS                      |
| `HuggingFaceGGUFImport.java`    | `HuggingFaceHubDownloader` → `AutoModel` → format auto-detection (GGUF/SafeTensors/ONNX/PyTorch)         |
| `SafeTensorsImportExample.java` | 3-level SafeTensors API — `AutoModel`, `SafeTensorsPipelineLoader`, `SafeTensorsReader`                  |

Path: `onnx-import-examples/src/main/java/org/deeplearning4j/modelimportexamples/omnihub/`

### PEFT (3 examples)

| Example                             | What It Shows                                                                                                                                 |
| ----------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
| `SFTLoRATrainingConfigExample.java` | `LoraConfig`, `QLoraConfig`, `SFTConfig`, `GRPOConfig`, `DPOConfig`, BF16 mixed precision                                                     |
| `AdvancedPEFTConfigExample.java`    | `AdaLoraConfig`, `DoraConfig`, `IA3Config`, `PrefixTuningConfig`, `PromptTuningConfig`, `PPOConfig`, `KTOConfig`, `ORPOConfig`, `SimPOConfig` |
| `SpecializedPEFTConfigExample.java` | `LoftQConfig`, `LohaConfig`, `LokrConfig`, `VeraConfig`, `DyLoraConfig`, `AdapterConfig`, `LoraAdapterCache` (multi-adapter serving)          |

Path: `samediff-examples/src/main/java/org/nd4j/examples/samediff/quickstart/training/`

### RL Alignment (1 example)

| Example                         | What It Shows                                                                                                                                               |
| ------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `RLAlignmentConfigExample.java` | All 10 RL methods: DPO (standard/IPO/RDPO), GRPO, PPO, KTO, ORPO, SimPO, DAPO (asymmetric clipping), GSPO, Dr.GRPO, `RewardModelConfig`, `RLPipelineConfig` |

### Training Infrastructure (5 examples)

| Example                                   | What It Shows                                                                                                                             |
| ----------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
| `MixedPrecisionTrainingExample.java`      | FP16/BF16/FP8 mixed precision, `LossScaleConfig`, `GradientAccumulator`, `FP8TrainingConfig`                                              |
| `DataCurationPipelineExample.java`        | `TextDeduplicator`, `TextQualityFilter`, `InstructionDataFormatter`, `StratifiedSplitter`, `LengthBucketingIterator`, `WeightedDataMixer` |
| `KnowledgeDistillationExample.java`       | `DistillationTrainer` — logit KD, feature KD, attention KD, combined                                                                      |
| `TransferLearningAndFreezingExample.java` | `PeftModel`, variable freezing, `GradientCheckpointConfig`, `ContinuedPretrainingConfig`                                                  |
| `NewOptimizersExample.java`               | `Adam8bit` (4x memory reduction), `AdaBelief`                                                                                             |

### New Operations (3 examples)

| Example                      | What It Shows                                                                                                                                                                            |
| ---------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `TransformerOpsExample.java` | `FlashAttention`, `SlidingWindowAttention`, `FusedRoPE`/`FusedMRoPE`, `RmsNorm`, `SiLU`, `SelectiveScan` (Mamba), `TokenSample`, `FP8Matmul`, `ColumnParallelLinear`/`RowParallelLinear` |
| `SameDiffOpsExample.java`    | All op namespaces including new `sd.audio()`, `sd.signal()`, mixed precision types                                                                                                       |
| `AudioOpsExample.java`       | 13 audio DSP ops — mel spectrogram, MFCC, Griffin-Lim, pitch detection, spectral features                                                                                                |

### Additional Examples

| Example                                   | What It Shows                                                                                   |
| ----------------------------------------- | ----------------------------------------------------------------------------------------------- |
| `TorchScriptImportExample.java`           | `TorchScriptModelImport` — import PyTorch `.pt` models via `nd4j-torchscript`                   |
| `OmniHubPretrainedModels.java`            | `OmniHubUtils`, `Pretrained`, `HuggingFaceHubDownloader` — model zoo with format auto-detection |
| `TtsTrainingPipelineExample.java`         | TTS fine-tuning with LoRA — `TtsTrainingPipeline`, `AudioDataProcessor`                         |
| `LRScheduleConfigExample.java`            | `CosineWarmupSchedule` + 9 other schedules                                                      |
| `KnowledgeDistillationConfigExample.java` | `DistillationConfig` builder variants                                                           |
| `TransferLearningConfigExample.java`      | `TransferLearning`, `FineTuneConfiguration`, `VariableGroup`                                    |

***

## New Maven Modules

```xml
<!-- LLM generation pipeline -->
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-llm</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- Vision-language model support -->
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-vlm</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- Whisper ASR / audio -->
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-audio</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- GGUF model import -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-ggml</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- HuggingFace-compatible tokenizers -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-tokenizers</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- TorchScript/PyTorch model import -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-torchscript</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- Pipeline SPI modules -->
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-pipeline-core</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-pipeline-ggml</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-pipeline-safetensors</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-pipeline-onnx</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- DSP Runtime SDK (Java) -->
<dependency>
    <groupId>org.eclipse.deeplearning4j</groupId>
    <artifactId>nd4j-dsp-runtime-java</artifactId>
    <version>1.0.0-rewrite</version>
</dependency>

<!-- Hardware backends (add if hardware is present) -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-tpu</artifactId>          <!-- Google Cloud TPU -->
    <version>1.0.0-rewrite</version>
</dependency>
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-hexagon</artifactId>       <!-- Qualcomm Hexagon DSP -->
    <version>1.0.0-rewrite</version>
</dependency>
```

***

## Bug Fixes & Improvements

The following issues were resolved in merges since M2.1:

### Memory Leaks

* Fix CUDA lstmLayer permute/transpose memory leak ([#10404](https://github.com/deeplearning4j/deeplearning4j/pull/10404))
* Fix lstmLayer.cu weight transformation memory leak ([#10403](https://github.com/deeplearning4j/deeplearning4j/pull/10403))
* Fix MmulHelper::mmulNxN memory leak ([#10394](https://github.com/deeplearning4j/deeplearning4j/pull/10394))
* Fix BaseNDArray.toFlatArray() memory leak for view arrays ([#10410](https://github.com/deeplearning4j/deeplearning4j/pull/10410))

### Correctness

* Fix COORDS2INDEX macro to use strides instead of shapes ([#10393](https://github.com/deeplearning4j/deeplearning4j/pull/10393))
* Fix DataType inconsistency in float\[] constant buffer handling ([#10411](https://github.com/deeplearning4j/deeplearning4j/pull/10411))
* Fix inverted boolean logic in DeallocatorService listener delegation ([#10412](https://github.com/deeplearning4j/deeplearning4j/pull/10412))
* Fix byte order handling in DataTypeConversions ([#10401](https://github.com/deeplearning4j/deeplearning4j/pull/10401))

### Security

* Fix command injection vulnerabilities in Windows bat scripts ([#10409](https://github.com/deeplearning4j/deeplearning4j/pull/10409), [#10407](https://github.com/deeplearning4j/deeplearning4j/pull/10407))

### API & Infrastructure

* Simplify batched GEMM API ([#10361](https://github.com/deeplearning4j/deeplearning4j/pull/10361))
* JavaCPP resource configuration for native image (GraalVM) support ([#10287](https://github.com/deeplearning4j/deeplearning4j/pull/10287))
* Autodiff core improvements ([#10280](https://github.com/deeplearning4j/deeplearning4j/pull/10280))
* CMake modernization ([#10245](https://github.com/deeplearning4j/deeplearning4j/pull/10245))
* Maven version updates for Java 25 support ([#10243](https://github.com/deeplearning4j/deeplearning4j/pull/10243))
* SameDiff file format scaling improvements ([#10209](https://github.com/deeplearning4j/deeplearning4j/pull/10209))

***

## Migration Guide

### This Is a Transitional Release

1.0.0-rewrite is designed to be **additive** — existing M2.1 code continues to work. The new features are opt-in via new Maven modules and new API entry points. No existing imports or APIs are removed.

However, be aware that the **next release** will begin the namespace consolidation. To prepare:

* Avoid deep coupling to internal package paths (e.g., `org.nd4j.autodiff.samediff.config.*` may move)
* Prefer the high-level entry points (`GenerationPipeline`, `GGMLModelImport`, `PeftModelFactory`) over assembling components manually
* The examples in `deeplearning4j-examples` show the intended usage patterns

### Opting In to DSP Optimization

Graph execution mode defaults to `AUTO`. To explicitly select a mode:

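```java
// Option 1: set a JVM system property on the command line (a launcher flag, not Java code):
//   java -Dnd4j.execution.mode=TRITON ...

// Option 2: set it programmatically (restricted to BenchmarkConfigApplier in production).
// `sd` is your existing SameDiff instance.
sd.setGraphExecutionMode(GraphExecutionMode.TRITON);
```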
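When debugging a deployment it can help to log which mode was requested at startup. The following is a minimal sketch using only the JDK: it reads the `nd4j.execution.mode` property and falls back to `AUTO`, as described above. The `ExecutionModeProperty` class is hypothetical, and how the library itself parses the value (case sensitivity, blank handling) is an assumption here.

```java
/** Reads the execution-mode system property so an application can log it at startup. */
final class ExecutionModeProperty {

    // Property name and default value are taken from the release notes.
    static final String PROPERTY = "nd4j.execution.mode";
    static final String DEFAULT_MODE = "AUTO";

    /** Returns the requested mode name, or AUTO when the property is unset or blank. */
    static String configuredMode() {
        String value = System.getProperty(PROPERTY);
        return (value == null || value.trim().isEmpty()) ? DEFAULT_MODE : value.trim();
    }

    public static void main(String[] args) {
        System.out.println(PROPERTY + " = " + configuredMode());
    }
}
```

This only reports what was requested on the command line; it does not tell you which mode the runtime ultimately selected.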

To skip specific optimizer passes during debugging, set these JVM system properties:

```
-Dnd4j.optimizer.skip=AttentionFusion
-Dnd4j.optimizer.logApplied=true
```

### New Backend Discovery

Hardware backends (TPU, Hexagon, ZLUDA) are discovered automatically if the corresponding Maven module is on the classpath and the hardware is present. No code changes are required.
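The two conditions (module on the classpath, hardware present) can be pictured with the JDK's `java.util.ServiceLoader`. This is an illustrative sketch only: the `HardwareBackend` interface and its methods are hypothetical, and whether the new backends are actually registered through `ServiceLoader` is an assumption, not something these notes state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

/** Illustrates "on the classpath AND hardware present" discovery with the JDK only. */
final class BackendDiscoverySketch {

    /** Hypothetical SPI for illustration; this is not a DL4J interface. */
    interface HardwareBackend {
        String name();
        boolean isHardwarePresent();
    }

    /** Condition 2: keep only candidates whose hardware check passes. */
    static List<String> usable(Iterable<HardwareBackend> candidates) {
        List<String> names = new ArrayList<>();
        for (HardwareBackend backend : candidates) {
            if (backend.isHardwarePresent()) {
                names.add(backend.name());
            }
        }
        return names;
    }

    /** Condition 1: ServiceLoader only yields providers registered on the classpath. */
    static List<String> discover() {
        return usable(ServiceLoader.load(HardwareBackend.class));
    }

    public static void main(String[] args) {
        System.out.println("Usable backends: " + discover());
    }
}
```

With no provider registered the list is empty, which mirrors the case where a backend module is not a dependency of the build.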

### Serialization Compatibility

* `DspPlanDiskCache` stores compiled plans in `~/.kompile/cache/dsp/` using a content-based key. The cache is invalidated automatically when the model changes.
* New model formats (SDNB, SDZ) are supported alongside the existing FlatBuffers serialization. Both `SameDiff.load()` and `SameDiff.save()` continue to work.
* GGUF models imported via `GGMLModelImport` can be converted to native SDZ format via `convertToSDZ()` for faster subsequent loads.
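To see why a content-based key invalidates itself when the model changes, consider the sketch below. It hashes the model bytes with SHA-256 and derives a file path under the cache directory named above. The hash algorithm, the `.plan` suffix, and the `ContentKeySketch` class are all assumptions for illustration; the actual key derivation used by `DspPlanDiskCache` is not described in these notes.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Shows how a content-derived key maps changed content to a different cache entry. */
final class ContentKeySketch {

    /** Returns the hex-encoded SHA-256 digest of the given bytes. */
    static String contentKey(byte[] modelBytes) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(modelBytes);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    /** Hypothetical file layout under the cache directory mentioned in the notes. */
    static Path cacheEntry(byte[] modelBytes) {
        return Paths.get(System.getProperty("user.home"), ".kompile", "cache", "dsp",
                contentKey(modelBytes) + ".plan");
    }

    public static void main(String[] args) {
        byte[] v1 = "graph-v1".getBytes(StandardCharsets.UTF_8);
        byte[] v2 = "graph-v2".getBytes(StandardCharsets.UTF_8);
        System.out.println(cacheEntry(v1));
        System.out.println(cacheEntry(v2)); // different content, so a different entry
    }
}
```

Because the key depends only on the content, an edited model never matches a stale entry, and an unchanged model always finds its existing one.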

### Build System

* CUDA 12.9 is now supported alongside 12.6
* Backend namespace isolation (`SD_BACKEND_NAMESPACE`) enables true multi-backend co-loading in a single JVM process
* 18 named test suites are available via `run-tests.yml`, including `quick`, `sanity`, `nd4j`, `samediff`, `dl4jcore`, `keras`, `datavec`, `onnx`, `tensorflow`, `integration`, `libnd4j`, `llm`, `vlm`, `ggml`, `zoo`, `longrunning`, and `all`

