> For the complete documentation index, see [llms.txt](https://deeplearning4j.konduit.ai/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://deeplearning4j.konduit.ai/en-1.0.0-rewrite/deeplearning4j/contributing.md).

# Contributing

Contributions to Eclipse Deeplearning4j are welcome. This guide covers the full contributor workflow: legal requirements, build process, project architecture, how to add new ops or examples, and how to get a pull request merged.

### Eclipse Contributor Agreement (ECA)

Deeplearning4j is an Eclipse Foundation project. Before your first pull request can be merged, you must sign the **Eclipse Contributor Agreement**:

1. Create an account at [accounts.eclipse.org](https://accounts.eclipse.org/user/register).
2. Sign the ECA at [accounts.eclipse.org/user/eca](https://accounts.eclipse.org/user/eca).
3. **The email on your Eclipse account must exactly match the email on your GitHub account.** This is how the automated check identifies you.

The ECA incorporates the Developer Certificate of Origin (DCO) v1.1. By signing, you certify that your contributions are your own (or that you have the right to submit them) and grant Eclipse a non-exclusive, perpetual license. You retain copyright. A signed ECA covers all of your contributions for 3 years, after which it can be re-signed.

An Eclipse bot automatically checks every pull request. If your ECA is missing or your email doesn't match, the bot will comment with instructions.

**ECA FAQ:** [eclipse.org/legal/eca/faq](https://www.eclipse.org/legal/eca/faq/)

***

### Repository Structure

All DL4J libraries live in a single monorepo at [github.com/deeplearning4j/deeplearning4j](https://github.com/deeplearning4j/deeplearning4j).

#### Maven modules

| Module           | What it is                                                                                                                                  |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| `libnd4j`        | C++ native compute engine. All ops, kernels, DSP execution engine, graph backends. Built with CMake, invoked through Maven via JavaCPP.     |
| `nd4j`           | Java ND4J API, SameDiff autodiff, backend bindings (CPU, CUDA), ONNX import, GGML import, tokenizers, DSP runtime SDK                       |
| `deeplearning4j` | High-level DL4J layers (`MultiLayerNetwork`, `ComputationGraph`), Keras import, LLM/VLM pipelines, PEFT, RL alignment trainers, training UI |
| `datavec`        | Data pipeline — record readers, transforms, schema, serialization                                                                           |
| `python4j`       | Embedded CPython execution from the JVM                                                                                                     |
| `omnihub`        | Model hub — `AutoModel.fromPretrained()`, format auto-detection                                                                             |
| `codegen`        | Op code generation from op descriptors                                                                                                      |
| `platform-tests` | **All tests live here.** Tests are never placed in the modules being tested.                                                                |
| `resources`      | Shared test resources                                                                                                                       |

#### Key directories inside `libnd4j`

| Directory                          | Contents                                                                                                                     |
| ---------------------------------- | ---------------------------------------------------------------------------------------------------------------------------- |
| `include/ops/declarable/generic/`  | Op implementations (C++ templates, one file per op or per op group)                                                          |
| `include/ops/declarable/platform/` | Platform-specific op implementations: `mkldnn/` (oneDNN), `armcompute/` (ARM ACL), `accelerate/` (Apple), `mlir/` (MLIR JIT) |
| `include/ops/declarable/headers/`  | Op header declarations                                                                                                       |
| `include/graph/`                   | DSP execution engine, graph backends, plan compiler                                                                          |
| `include/array/`                   | `NDArray` C++ implementation                                                                                                 |
| `include/system/`                  | Platform macros (`SD_HOST`, `SD_DEVICE`, `SD_INLINE`), `platform_boilerplate.h`                                              |
| `include/helpers/`                 | BLAS helpers, `MmulHelper`, `LoopKind`                                                                                       |
| `include/loops/`                   | Kernel loop implementations (transform, reduce, broadcast, etc.)                                                             |

#### Companion repositories

| Repository                                                                    | What it is                                                                      |
| ----------------------------------------------------------------------------- | ------------------------------------------------------------------------------- |
| [deeplearning4j-examples](https://github.com/eclipse/deeplearning4j-examples) | Runnable example programs — see [Contributing Examples](#contributing-examples) |
| [deeplearning4j-docs](https://github.com/KonduitAI/deeplearning4j-docs)       | This documentation site (GitBook)                                               |

***

### Build Process

#### Prerequisites

* **JDK 11+** (JDK 17 recommended)
* **Maven 3.6.3+**
* **CMake 3.19+** and a C++17-capable compiler (GCC 9+, Clang 12+, MSVC 2019+)
* **ccache** — essential for iterative development. First native build: 30–45 minutes. With ccache, subsequent builds after small changes: \~30 seconds.
* **CUDA toolkit 12.9** (for GPU builds) + compatible NVIDIA driver (525.60+)
* **Project Lombok** IDE plugin — without it your IDE will show false compilation errors

#### CPU build

```bash
mvn -Pcpu \
  -Dlibnd4j.buildthreads=$(nproc) \
  -pl libnd4j,:nd4j-cpu-backend-common,:nd4j-native \
  clean install -DskipTests
```

#### CUDA build

```bash
mvn -Pcuda \
  -Dlibnd4j.chip=cuda \
  -Dlibnd4j.buildthreads=$(nproc) \
  -pl libnd4j,:nd4j-cuda-12.9 \
  clean install -DskipTests
```

To enable Triton JIT compilation (for the `-compile` classifier):

```bash
mvn -Pcuda \
  -Dlibnd4j.chip=cuda \
  -Dlibnd4j.triton=ON \
  -Dlibnd4j.buildthreads=$(nproc) \
  -pl libnd4j,:nd4j-cuda-12.9 \
  clean install -DskipTests
```

#### Java-only module build (no native compilation)

If you're only changing Java code and the native library is already built:

```bash
mvn install -DskipTests -pl <module>
```

#### Build rules

* Always use `install`, never just `compile` — downstream modules need the JAR in your local Maven repo.
* If building C++, always rebuild the Java bindings too (both `libnd4j` AND the backend module).
* Never invoke `make` directly — it skips Java binding regeneration and produces mismatched artifacts.
* **ccache is critical.** Never run `ccache -C` or `ccache --clear`. If you suspect stale results, touch the specific source file to force recompilation of just that file.

#### Building for a different CUDA version

The default CUDA version is 12.9, but `cuda.version` is a Maven property:

```bash
mvn -Pcuda -Dcuda.version=12.6 -Dlibnd4j.chip=cuda \
  -pl libnd4j,:nd4j-cuda-12.6 \
  clean install -DskipTests
```

***

### Platform Tests

**All tests live in `platform-tests/`.** Tests are never placed in the modules being tested — this is a hard project rule.

#### Why tests are centralized

The individual library modules (`nd4j/`, `deeplearning4j/`, `datavec/`) declare only compile-time dependencies and do not include a concrete backend. `platform-tests` is the single place where:

1. A concrete backend (`nd4j-native` or `nd4j-cuda`) is declared as a dependency, making execution possible.
2. Surefire is configured with the memory sizes, JVM flags, and native library hooks needed for testing.
3. JUnit 5 extensions enforce backend-appropriate test selection automatically.
4. The Maven Shade plugin builds a self-contained uber-JAR for benchmark/profiling runs outside Maven.

The root `pom.xml` does **not** include `platform-tests` by default. CI workflows `cd` into the `platform-tests` directory and run `mvn test` there directly.

#### Running tests

Always run from the `platform-tests` directory:

```bash
cd platform-tests

# Run a single test class
mvn test -Dtest=MyTestClass

# Run a single test method
mvn test -Dtest=MyTestClass#myTestMethod

# Run a parameterized test method (requires trailing wildcard)
mvn test -Dtest=MyTestClass#myParameterizedMethod*
```

**Never run `mvn test` from the project root** — it triggers full native rebuilds and runs every test suite, which takes hours.

#### Backend selection

Backend selection is entirely Maven property-driven. Two properties control what backend your tests run against:

| Property              | Default       | What it does                                                  |
| --------------------- | ------------- | ------------------------------------------------------------- |
| `backend.artifactId`  | `nd4j-native` | Selects the ND4J backend JAR (CPU or CUDA)                    |
| `platform.classifier` | Auto-detected | Selects the native binary variant (e.g., `linux-x86_64-avx2`) |

```bash
# CPU (default)
mvn test -Dtest=MyTestClass

# CUDA
mvn test -Dtest=MyTestClass -Dbackend.artifactId=nd4j-cuda-12.9

# CPU with specific classifier
mvn test -Dtest=MyTestClass -Dplatform.classifier=linux-x86_64-onednn-avx2
```

Setting `backend.artifactId` also activates Maven profiles that set backend priority system properties. When `nd4j-native` is selected, `org.nd4j.cpu.priority=10000` and GPU priority is 0, ensuring the CPU backend wins even if CUDA is on the classpath (and vice versa for `nd4j-cuda`).

#### Memory and JVM configuration

`platform-tests` configures Surefire with properties that control JVM heap, off-heap memory, and garbage collection:

| Property            | Default | What it does                                               |
| ------------------- | ------- | ---------------------------------------------------------- |
| `test.heap.size`    | `32g`   | JVM `-Xmx` per Surefire fork                               |
| `test.offheap.size` | `32g`   | JavaCPP max off-heap bytes                                 |
| `test.nogc`         | `true`  | Disables ND4J array GC and JavaCPP pointer GC during tests |
| `surefire.forks`    | `1`     | Number of forked JVM processes                             |
| `surefire.threads`  | `1`     | Threads per fork                                           |

The CUDA profile (`-Dbackend.artifactId=nd4j-cuda-12.9`) automatically reduces heap to `14g` and increases threads to `4`.

Override these for local runs if your machine has less memory:

```bash
mvn test -Dtest=MyTestClass -Dtest.heap.size=6g -Dtest.offheap.size=6g
```

Surefire also sets environment variables for deterministic behavior:

* `OMP_NUM_THREADS=1` — single-threaded OpenMP to avoid nondeterminism
* `OPENBLAS_CORETYPE=Haswell` — deterministic BLAS kernel selection
* `CUDA_LAUNCH_BLOCKING=1` — synchronous CUDA for debugging

#### Test organization

Tests are organized under `src/test/java/` (and `src/test/kotlin/` for import framework tests):

| Package                                                   | What it tests                                                  |
| --------------------------------------------------------- | -------------------------------------------------------------- |
| `org.eclipse.deeplearning4j.nd4j.*`                       | ND4J core: array ops, workspaces, datasets, shapes, data types |
| `org.eclipse.deeplearning4j.dl4jcore.*`                   | DL4J layers, training, gradient checks, model persistence      |
| `org.eclipse.deeplearning4j.frameworkimport.keras.*`      | Keras model import                                             |
| `org.eclipse.deeplearning4j.frameworkimport.onnx.*`       | ONNX import (Kotlin)                                           |
| `org.eclipse.deeplearning4j.frameworkimport.tensorflow.*` | TensorFlow import (Kotlin)                                     |
| `org.eclipse.deeplearning4j.integration.*`                | End-to-end integration tests                                   |
| `org.eclipse.deeplearning4j.longrunning.*`                | Long-running stress tests                                      |
| `org.eclipse.deeplearning4j.zoo.*`                        | Model zoo tests                                                |
| `org.datavec.*`                                           | DataVec: API, Arrow, Image, JDBC, Excel                        |
| `org.nd4j.*`                                              | Arrow serde, CUDA allocator, Python4J, TF-Lite                 |

#### Test tags

Tests use JUnit 5 tags (defined in `org.nd4j.common.tests.tags.TagNames`) for selective execution. Pass tags via Maven:

```bash
# Run only SameDiff tests
mvn test -Dtests=samediff

# Exclude long-running tests
mvn test -DexcludedTests="long-running-test,large-resources"
```

Common tags:

| Tag                 | What it selects                 |
| ------------------- | ------------------------------- |
| `samediff`          | SameDiff autodiff tests         |
| `training`          | Model training tests            |
| `onnx`              | ONNX import tests               |
| `keras`             | Keras import tests              |
| `tensorflow`        | TensorFlow import tests         |
| `dl4j-old-api`      | Legacy DL4J API tests           |
| `workspaces`        | Memory workspace tests          |
| `ndarray-indexing`  | Array indexing/slicing tests    |
| `long-running-test` | Tests that take minutes to run  |
| `large-resources`   | Tests that download large files |
| `downloads`         | Tests requiring network access  |
| `spark`             | Distributed training tests      |
| `python`            | Python4J bridge tests           |
| `multi-threaded`    | Concurrent tests                |

CI excludes `long-running-test`, `large-resources`, and `downloads` by default. The `BackendCheckerExtension` additionally disables `multi-threaded`, `spark`, and `python` when running on GPU.

#### JUnit 5 extensions

Three auto-registered extensions (via `META-INF/services`) manage test behavior:

| Extension                 | What it does                                                                                                                                                                                                        |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `BackendCheckerExtension` | Disables resource-heavy test tags when running on GPU. Checks `Nd4j.getEnvironment().isCPU()` and skips `large-resources`, `downloads`, `long-running-test`, `multi-threaded`, `spark`, and `python` tests on CUDA. |
| `TFGraphCheckerExtension` | Conditionally skips TensorFlow graph tests based on an allowlist. When `EXECUTE_ONLY_MODELS` is non-empty, only matching model tests run.                                                                           |
| `DeallocationExtension`   | Manages off-heap memory tracking between tests. Sets `CURRENT_TEST_*` system properties for allocation debugging.                                                                                                   |
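
The skip rule described for `BackendCheckerExtension` can be modeled in plain Java. This is a minimal sketch of the decision only, not the extension's actual source: the class name `TagSkipSketch` and the method `shouldSkip` are hypothetical, and the tag list is copied from the table above.

```java
import java.util.Set;

// Hypothetical sketch; not the actual BackendCheckerExtension source.
final class TagSkipSketch {

    // Tags listed above as skipped when running on CUDA.
    static final Set<String> SKIPPED_ON_GPU = Set.of(
        "large-resources", "downloads", "long-running-test",
        "multi-threaded", "spark", "python");

    /** Returns true when a test carrying these tags should be skipped. */
    static boolean shouldSkip(Set<String> testTags, boolean runningOnCpu) {
        if (runningOnCpu) {
            return false; // this rule only restricts GPU runs
        }
        for (String tag : testTags) {
            if (SKIPPED_ON_GPU.contains(tag)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(shouldSkip(Set.of("samediff", "spark"), false)); // GPU run
        System.out.println(shouldSkip(Set.of("samediff", "spark"), true));  // CPU run
    }
}
```

A test tagged only `samediff` is never skipped by this rule, so tagging a new test accurately is what keeps it running on both backends.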

#### Base test classes

Most tests extend one of these base classes (from the `nd4j-common-tests` and `deeplearning4j-common-tests` modules):

| Base class                 | Used by                  | What it does                                                                                                                                                |
| -------------------------- | ------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `BaseND4JTest`             | ND4J tests               | Sets profiling mode, default data types, thread count. After each test: destroys workspaces, checks for workspace leaks (exits on leak), logs memory stats. |
| `BaseNd4jTestWithBackends` | Parameterized ND4J tests | Extends `BaseND4JTest`. Adds backend parameterization via `@MethodSource("configs")` — tests run once per available backend.                                |
| `BaseDL4JTest`             | DL4J tests               | Configures profiling, data types, thread count. Provides `skipUnlessIntegrationTests()` gated by `DL4J_INTEGRATION_TESTS` env var.                          |

#### Test scripts

`platform-tests/` includes convenience scripts:

| Script                    | What it runs                                                                         |
| ------------------------- | ------------------------------------------------------------------------------------ |
| `run-onnx-tests.sh`       | ONNX SameDiff import tests (`org.nd4j.samediff.frameworkimport.onnx.**`)             |
| `run-tensorflow-tests.sh` | TensorFlow SameDiff import tests (`org.nd4j.samediff.frameworkimport.tensorflow.**`) |
| `run-keras-tests.sh`      | Keras model import tests (`org.deeplearning4j.nn.modelimport.keras.**`)              |
| `run-benchmarks.sh`       | Standalone JUnit console launcher with optional valgrind/compute-sanitizer support   |
| `bootstrap-onnx.sh`       | Downloads \~65 ONNX Zoo models and converts them (not a test runner — data setup)    |

#### Benchmarking and profiling

The Maven Shade plugin builds a self-contained JAR (`platform-tests-1.0.0-SNAPSHOT-shaded.jar`) at package phase. This enables running tests outside Maven Surefire, which is useful for profiling with external tools:

```bash
# Build the shaded JAR
mvn package -DskipTests

# Run a specific test class with the standalone JUnit launcher
java -cp junit-platform-console-standalone.jar \
  org.junit.platform.console.ConsoleLauncher \
  -cp=target/platform-tests-1.0.0-SNAPSHOT-shaded.jar \
  -c=org.eclipse.deeplearning4j.dl4jcore.gradientcheck.CNN1DGradientCheckTest
```

The `bin/java` wrapper script in `platform-tests/` is the injection point for memory analysis tools. Surefire's `<jvm>` config points to this wrapper instead of the system `java`. The wrapper reads `TEST_RUNNER_PREFIX` from the environment:

* **Valgrind**: Generates suppression files for libjvm.so, adds `--track-origins=yes --error-limit=no`
* **Compute-Sanitizer**: Adds `--tool=memcheck --report-api-errors all --show-backtrace yes` (for CUDA memory debugging)

```bash
# Run with valgrind
TEST_RUNNER_PREFIX=valgrind mvn test -Dtest=MyTestClass

# Run with CUDA compute-sanitizer
TEST_RUNNER_PREFIX=compute-sanitizer mvn test -Dtest=MyTestClass \
  -Dbackend.artifactId=nd4j-cuda-12.9
```

#### Test resources

Many tests require pre-trained model files and test fixtures from the external `dl4j-test-resources` artifact (`org.deeplearning4j:dl4j-test-resources`). This must be installed in your local Maven repo before those tests will pass. CI workflows fetch it automatically; for local development, clone and install from [KonduitAI/dl4j-test-resources](https://github.com/KonduitAI/dl4j-test-resources).

#### Numerical gradient checks

Any new layer, loss function, or custom op with a backward pass must pass a numerical gradient check:

```java
boolean passed = GradientCheckUtil.checkGradients(
    new GradientCheckUtil.MLNConfig()
        .net(net)
        .input(input)
        .labels(labels));
assertTrue(passed, "Gradient check failed");
```

Gradient checks confirm that analytic (backprop) gradients match finite-difference numerical gradients. A failing gradient check means there is a bug in the backward pass.
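
To make the comparison concrete, here is a minimal, dependency-free sketch of the same idea on a toy loss. It is not how `GradientCheckUtil` is implemented; the class `FiniteDifferenceSketch`, the loss function, the step size `1e-6`, and the error threshold are all assumptions chosen for illustration.

```java
// Illustrative only: a toy loss, not GradientCheckUtil's implementation.
final class FiniteDifferenceSketch {

    /** Toy loss f(w) = sum of w_i^2. */
    static double loss(double[] w) {
        double sum = 0.0;
        for (double v : w) {
            sum += v * v;
        }
        return sum;
    }

    /** Analytic gradient of the toy loss: df/dw_i = 2 * w_i. */
    static double[] analyticGradient(double[] w) {
        double[] g = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            g[i] = 2.0 * w[i];
        }
        return g;
    }

    /** Central difference: (f(w + eps*e_i) - f(w - eps*e_i)) / (2*eps). */
    static double[] numericGradient(double[] w, double eps) {
        double[] g = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            double original = w[i];
            w[i] = original + eps;
            double plus = loss(w);
            w[i] = original - eps;
            double minus = loss(w);
            w[i] = original; // restore the parameter before moving on
            g[i] = (plus - minus) / (2.0 * eps);
        }
        return g;
    }

    /** Largest relative error |a - n| / (|a| + |n|) across all parameters. */
    static double maxRelativeError(double[] analytic, double[] numeric) {
        double worst = 0.0;
        for (int i = 0; i < analytic.length; i++) {
            double denom = Math.abs(analytic[i]) + Math.abs(numeric[i]);
            double err = denom == 0.0 ? 0.0 : Math.abs(analytic[i] - numeric[i]) / denom;
            worst = Math.max(worst, err);
        }
        return worst;
    }

    public static void main(String[] args) {
        double[] w = {0.5, -1.5, 2.0};
        double err = maxRelativeError(analyticGradient(w), numericGradient(w, 1e-6));
        System.out.println(err < 1e-6 ? "gradient check passed" : "gradient check FAILED");
    }
}
```

If `analyticGradient` contained a sign error or a missing factor, the relative error would jump to order 1, which is the signal a failing gradient check gives for a real layer.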

***

### How Backends Work

Understanding the backend architecture is essential before contributing ops or backend-specific code.

#### Backend discovery (Java SPI)

ND4J uses Java's `ServiceLoader` to discover backends at runtime. Each backend JAR ships a `META-INF/services/org.nd4j.linalg.factory.Nd4jBackend` file naming its implementation class:

* **CPU**: `org.nd4j.linalg.cpu.nativecpu.CpuBackend` (in `nd4j-native`)
* **CUDA**: `org.nd4j.linalg.jcublas.JCublasBackend` (in `nd4j-cuda-12.9`)

At startup, `Nd4jBackend.load()` collects all backends via `ServiceLoader`, sorts them by priority (configurable via the system properties `nd4j.backend.priorityCPU` / `nd4j.backend.priorityGPU`), and calls `isAvailable()` on each in order. The first available backend wins. In practice, CUDA wins when GPUs are present because `JCublasBackend` calls `cudaGetDeviceCount` and succeeds; `CpuBackend.isAvailable()` always returns true, so the CPU backend serves as the fallback.
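
The sort-then-probe logic can be sketched without ND4J on the classpath. Everything below is a hypothetical stand-in: the `Backend` interface, the priority values, and the class name are assumptions, and a real implementation would obtain its candidates from `ServiceLoader.load(...)` rather than a hand-built list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; not the real Nd4jBackend API.
final class BackendSelectionSketch {

    /** Stand-in for a backend discovered through ServiceLoader. */
    interface Backend {
        String name();
        int priority();        // higher values are tried first
        boolean isAvailable(); // for example, "is a GPU present?"
    }

    static Backend backend(String name, int priority, boolean available) {
        return new Backend() {
            public String name() { return name; }
            public int priority() { return priority; }
            public boolean isAvailable() { return available; }
        };
    }

    /** Sorts by descending priority and returns the first available backend. */
    static Optional<Backend> select(List<Backend> candidates) {
        Comparator<Backend> byPriority = Comparator.comparingInt(Backend::priority);
        List<Backend> sorted = new ArrayList<>(candidates);
        sorted.sort(byPriority.reversed());
        return sorted.stream().filter(Backend::isAvailable).findFirst();
    }

    public static void main(String[] args) {
        List<Backend> found = List.of(
            backend("cpu", 0, true),
            backend("cuda", 100, false)); // pretend no GPU is present
        System.out.println(select(found).map(Backend::name).orElse("none"));
    }
}
```

With the GPU marked unavailable the sketch falls back to `cpu`; flipping that flag to `true` makes `cuda` win, which mirrors the behavior described above.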

#### Initialization chain

```
Nd4j (static init)
  → Nd4jBackend.load()          // SPI discovery, priority sort
  → initWithBackend(backend)    // reads backend .properties file
    → OpExecutioner             // NativeOpExecutioner (CPU) or CudaExecutioner (CUDA)
    → NativeOpsHolder           // loads JNI bridge: Nd4jCpu or Nd4jCuda
      → NativeOps               // JNI interface to libnd4j C++
        → libnd4j native code   // actual kernel execution
```

Each backend defines its classes in a properties file (`nd4j-native.properties` or `nd4j-jcublas.properties`). `Nd4j.initWithBackend()` reflectively instantiates:

* `opexec` → the `OpExecutioner` implementation
* `native.ops` → the `NativeOps` JNI bridge class

`NativeOpExecutioner` delegates every op call (`execReduceFloat`, `execScalar`, `execCustomOp`) through `NativeOps` to the C++ shared library.
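
The properties-driven instantiation step can be illustrated with the JDK alone. This is a sketch under stated assumptions, not the body of `Nd4j.initWithBackend()`: the class `ReflectiveInitSketch` is hypothetical, the properties content is inlined instead of read from a backend file, and `java.util.ArrayList` stands in for a real `OpExecutioner` class name.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical sketch; not the actual Nd4j.initWithBackend() code.
final class ReflectiveInitSketch {

    /** Instantiates the class named under `key` using its no-arg constructor. */
    static Object instantiate(Properties props, String key) {
        String className = props.getProperty(key);
        if (className == null) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // Stand-in for a backend properties file; ArrayList replaces a real executioner class.
        props.load(new StringReader("opexec=java.util.ArrayList\n"));
        Object executioner = instantiate(props, "opexec");
        System.out.println(executioner.getClass().getName());
    }
}
```

Because the class name is data rather than code, swapping backends only requires a different properties file on the classpath.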

#### Platform helper dispatch (C++ side)

Platform-specific op implementations (oneDNN, cuDNN, ACL, Apple Accelerate) plug in entirely at the C++ level. Java has no role in this dispatch.

The `PLATFORM_IMPL(op_name, ENGINE)` macro (in `libnd4j/include/system/platform_boilerplate.h`) uses a static struct initializer to auto-register the helper with `OpRegistrator` when the shared library loads. At op execution time, `OpRegistrator::getPlatformHelper(hash, engine)` looks up registered helpers. If `isUsable(context)` returns true (correct dtypes, shapes, library available), `invokeHelper(context)` runs the accelerated implementation instead of the generic kernel.

Key C++ files:

* `libnd4j/include/ops/declarable/PlatformHelper.h` — base class
* `libnd4j/include/system/platform_boilerplate.h` — `PLATFORM_IMPL` / `PLATFORM_CHECK` macros
* `libnd4j/include/ops/declarable/OpRegistrator.h` — registry
* `libnd4j/include/execution/Engine.h` — engine enum (`ENGINE_CPU=0`, `ENGINE_CUDA=1`, etc.)
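
The lookup-then-fallback order of platform helper dispatch can be modeled in a few lines. The real mechanism is C++ inside libnd4j; the Java below is a conceptual model only, with a hypothetical class `PlatformDispatchSketch`, a string key instead of an op hash, and a usability check reduced to the input rank.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntPredicate;

// Conceptual model only; the real dispatch is C++ inside libnd4j.
final class PlatformDispatchSketch {

    private final Map<String, IntPredicate> helpers = new HashMap<>();

    /** Registers a helper for (op, engine) with a usability check on input rank. */
    void register(String op, String engine, IntPredicate isUsable) {
        helpers.put(op + "@" + engine, isUsable);
    }

    /** Returns which implementation would run for the given input rank. */
    String dispatch(String op, String engine, int inputRank) {
        IntPredicate isUsable = helpers.get(op + "@" + engine);
        if (isUsable != null && isUsable.test(inputRank)) {
            return "platform helper";
        }
        return "generic kernel"; // no helper registered, or helper declined
    }

    public static void main(String[] args) {
        PlatformDispatchSketch registry = new PlatformDispatchSketch();
        // Hypothetical op whose helper only accepts 4D inputs.
        registry.register("some_op", "ENGINE_CPU", rank -> rank == 4);
        System.out.println(registry.dispatch("some_op", "ENGINE_CPU", 4));
        System.out.println(registry.dispatch("some_op", "ENGINE_CPU", 2));
        System.out.println(registry.dispatch("other_op", "ENGINE_CPU", 4));
    }
}
```

The important property is that a helper declining an input is not an error: execution falls through to the generic kernel, so a narrow `isUsable` check is always safe.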

***

### Op Codegen and SameDiff Namespaces

Ops are not hand-written Java classes — they are **code-generated** from a two-phase pipeline. Understanding this pipeline is essential for adding new ops.

#### Phase 1: C++ → Protobuf IR (`libnd4j-gen`)

The `codegen/libnd4j-gen` module scans C++ op source files and extracts argument signatures.

**What it scans:** All files in `libnd4j/include/ops/` containing op declaration macros:

* `CUSTOM_OP_IMPL(NAME, NIN, NOUT, INPLACEABLE, TARGS, IARGS)`
* `OP_IMPL`, `REDUCTION_OP_IMPL`, `BROADCASTABLE_OP_IMPL`
* `BOOLEAN_OP_IMPL`, `LIST_OP_IMPL`, `CONFIGURABLE_OP_IMPL`, `DIVERGENT_OP_IMPL`

**Entry point:** `ParseOpFile.java`, run via `codegen/libnd4j-gen/generate.sh`

**Output:** A protobuf text-format file (`op-ir.proto`) describing every op's argument names, types, and counts. The compiled proto class `OpNamespace.java` lives in `nd4j-api`. A bundled snapshot is stored at `nd4j/nd4j-backends/nd4j-api-parent/nd4j-api/src/main/resources/ops.proto`.

This IR is used at runtime for ONNX and TensorFlow import op mapping.
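
As a rough illustration of what "scanning for op declaration macros" means, the sketch below extracts the six `CUSTOM_OP_IMPL` arguments with a regular expression. It is not the parsing logic of `ParseOpFile.java`: the class `OpMacroScanSketch`, the pattern, and the output format are assumptions, and only the `CUSTOM_OP_IMPL` form is handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the actual ParseOpFile logic.
final class OpMacroScanSketch {

    // Matches CUSTOM_OP_IMPL(name, nIn, nOut, inplaceable, nTArgs, nIArgs).
    // The pattern also accepts negative counts.
    private static final Pattern CUSTOM_OP = Pattern.compile(
        "CUSTOM_OP_IMPL\\(\\s*(\\w+)\\s*,\\s*(-?\\d+)\\s*,\\s*(-?\\d+)\\s*,"
        + "\\s*(true|false)\\s*,\\s*(-?\\d+)\\s*,\\s*(-?\\d+)\\s*\\)");

    /** Returns one summary line per CUSTOM_OP_IMPL declaration found in the source text. */
    static List<String> scan(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = CUSTOM_OP.matcher(source);
        while (m.find()) {
            found.add(m.group(1)
                + " in=" + m.group(2)
                + " out=" + m.group(3)
                + " inplace=" + m.group(4)
                + " tArgs=" + m.group(5)
                + " iArgs=" + m.group(6));
        }
        return found;
    }

    public static void main(String[] args) {
        String cpp = "CUSTOM_OP_IMPL(my_new_op, 2, 1, false, 0, 0) {\n"
                   + "    return sd::Status::OK;\n"
                   + "}\n";
        scan(cpp).forEach(System.out::println);
    }
}
```

The takeaway for contributors is that the macro line itself is the source of truth for the IR, so its argument counts must match what the op body actually reads.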

#### Phase 2: Kotlin DSL → Java namespace classes (`op-codegen`)

The `codegen/op-codegen` module generates the Java API surface from a Kotlin DSL.

**Descriptor files:** One Kotlin file per namespace in `codegen/op-codegen/src/main/ops/org/nd4j/codegen/ops/`:

```
Math.kt          NeuralNetwork.kt    CNN.kt           RNN.kt
Random.kt        Linalg.kt           Bitwise.kt       Image.kt
SDBaseOps.kt     SDLoss.kt           Signal.kt        Audio.kt
Training.kt
```

**Example entry** (from `Math.kt`):

```kotlin
Op("abs", transformSame) {
    javaOpClass = "Abs"
    Doc(Language.ANY, DocScope.ALL) {
        "Elementwise absolute value operation: out = abs(x)"
    }
    Input(NUMERIC, "x") { description = "Input variable" }
    Output(NUMERIC, "output") { description = "Output variable" }
}
```

**Generator:** `Nd4jNamespaceGenerator.java` uses JavaPoet to emit `.java` source files.

**Entry point:** `CLI.java -dir <repo_root> -namespaces ALL -projects all`

**Output:** Generated Java classes in `nd4j/nd4j-backends/nd4j-api-parent/nd4j-api/src/main/java/`:

| Namespace      | SameDiff class (`sd.math()`, etc.) | ND4J class (`Nd4j.math()`, etc.) |
| -------------- | ---------------------------------- | -------------------------------- |
| Math           | `SDMath`                           | `NDMath`                         |
| Neural Network | `SDNN`                             | `NDNN`                           |
| CNN            | `SDCNN`                            | `NDCNN`                          |
| RNN            | `SDRNN`                            | `NDRNN`                          |
| Random         | `SDRandom`                         | `NDRandom`                       |
| Linear Algebra | `SDLinalg`                         | `NDLinalg`                       |
| Bitwise        | `SDBitwise`                        | `NDBitwise`                      |
| Image          | `SDImage`                          | `NDImage`                        |
| Base Ops       | `SDBaseOps`                        | `NDBase`                         |
| Loss           | `SDLoss`                           | `NDLoss`                         |
| Signal         | `SDSignal`                         | `NDSignal`                       |
| Audio          | `SDAudio`                          | `NDAudio`                        |
| Training       | `SDTraining`                       | `NDTraining`                     |

Users access ops through these namespaces:

```java
SameDiff sd = SameDiff.create();
SDVariable x = sd.var("x", Nd4j.rand(2, 3));

// sd.math() → SDMath
SDVariable abs = sd.math().abs(x);

// sd.nn() → SDNN
SDVariable relu = sd.nn().relu(x, 0);

// sd.linalg() → SDLinalg
SDVariable det = sd.linalg().det(x);

// sd.audio() → SDAudio
SDVariable mel = sd.audio().melSpectrogram(x, 16000, 512, 256, 80);
```

**Do not edit `SD*.java` or `ND*.java` files directly** — they are generated and will be overwritten. Edit the Kotlin DSL in `codegen/op-codegen/src/main/ops/` instead.

***

### Contributing New Ops

Adding a new native op is a multi-step process that spans C++, codegen, and Java. Here is the full end-to-end flow.

#### Step 1: C++ implementation (libnd4j)

Create the op in `libnd4j/include/ops/declarable/generic/` under the appropriate subdirectory (`nn/`, `transforms/`, `reduce/`, `linalg/`, etc.):

```cpp
// libnd4j/include/ops/declarable/generic/nn/my_new_op.cpp
#include <ops/declarable/CustomOperations.h>

namespace sd {
namespace ops {

CUSTOM_OP_IMPL(my_new_op, 2, 1, false, 0, 0) {
    auto input = INPUT_VARIABLE(0);
    auto weights = INPUT_VARIABLE(1);
    auto output = OUTPUT_VARIABLE(0);

    // Implementation
    // ...

    return sd::Status::OK;
}

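DECLARE_SHAPE_FN(my_new_op) {
    auto inShape = inputShape->at(0);
    auto weightsShape = inputShape->at(1);
    // Example shape rule (an assumption for this sketch):
    // [rows, k] x [k, cols] -> [rows, cols]
    auto outRows = shape::sizeAt(inShape, 0);
    auto outCols = shape::sizeAt(weightsShape, 1);
    return SHAPELIST(ConstantShapeHelper::getInstance().createShapeInfo(
        DataType::FLOAT32, 'c', {outRows, outCols}));
}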

DECLARE_TYPES(my_new_op) {
    getOpDescriptor()
        ->setAllowedInputTypes({ALL_FLOATS})
        ->setAllowedOutputTypes({ALL_FLOATS});
}

}  // namespace ops
}  // namespace sd
```

The macro arguments to `CUSTOM_OP_IMPL` are: `(name, numInputs, numOutputs, inPlaceable, numTArgs, numIArgs)`.

If the op needs a backward pass for training, also implement `my_new_op_bp`:

```cpp
CUSTOM_OP_IMPL(my_new_op_bp, 3, 2, false, 0, 0) {
    auto input = INPUT_VARIABLE(0);
    auto weights = INPUT_VARIABLE(1);
    auto gradOut = INPUT_VARIABLE(2);  // gradient from upstream
    auto gradInput = OUTPUT_VARIABLE(0);
    auto gradWeights = OUTPUT_VARIABLE(1);

    // Backward pass implementation
    return sd::Status::OK;
}
```

#### Step 2: Platform-specific implementations (optional)

For performance-critical ops, add accelerated implementations using the `PLATFORM_IMPL` macro. These are auto-registered at library load time — no Java-side wiring needed:

```cpp
// libnd4j/include/ops/declarable/platform/mkldnn/my_new_op_mkldnn.cpp
#include <ops/declarable/PlatformHelper.h>

namespace sd {
namespace ops {
namespace platforms {

PLATFORM_IMPL(my_new_op, ENGINE_CPU) {
    // oneDNN-optimized implementation
    return sd::Status::OK;
}

PLATFORM_CHECK(my_new_op, ENGINE_CPU) {
    auto input = INPUT_VARIABLE(0);
    return input->dataType() == DataType::FLOAT32
        && input->rankOf() == 4;  // only handle 4D inputs
}

}  // namespace platforms
}  // namespace ops
}  // namespace sd
```

Available engines for `PLATFORM_IMPL`:

| Engine constant                | Library             | Platform directory     |
| ------------------------------ | ------------------- | ---------------------- |
| `ENGINE_CPU` / `ENGINE_ONEDNN` | Intel oneDNN        | `platform/mkldnn/`     |
| `ENGINE_CUDA`                  | NVIDIA cuDNN        | `platform/cudnn/`      |
| `ENGINE_ARM`                   | ARM Compute Library | `platform/armcompute/` |
| `ENGINE_ACCELERATE`            | Apple Accelerate    | `platform/accelerate/` |
| `ENGINE_MPS`                   | Apple Metal         | `platform/mps/`        |

#### Step 3: Register launch dimensions (CUDA ops)

If the op runs on CUDA, register its launch configuration in `include/system/LaunchDims.h` and `LaunchDims.cu`.

#### Step 4: Regenerate the op IR

Run the libnd4j-gen scanner to pick up the new op's argument signature:

```bash
cd codegen/libnd4j-gen
bash generate.sh /path/to/libnd4j
```

This updates `op-ir.proto` with the new op's descriptor. The IR is used for ONNX/TF import mapping.

#### Step 5: Add to the Kotlin codegen DSL

Add the op to the appropriate Kotlin file in `codegen/op-codegen/src/main/ops/org/nd4j/codegen/ops/`. For a neural network op, add it to `NeuralNetwork.kt`:

```kotlin
Op("my_new_op") {
    javaOpClass = "MyNewOp"
    Doc(Language.ANY, DocScope.ALL) {
        "Description of what my_new_op does."
    }
    Input(NUMERIC, "input") { description = "Input tensor" }
    Input(NUMERIC, "weights") { description = "Weight tensor" }
    Output(NUMERIC, "output") { description = "Output tensor" }
}
```

#### Step 6: Run the code generator

```bash
cd codegen/op-codegen
# Regenerate all namespace classes
java -cp target/classes org.nd4j.codegen.cli.CLI \
  -dir /path/to/deeplearning4j -namespaces ALL -projects all
```

This regenerates `SDNN.java`, `NDNN.java` (or whichever namespace you added the op to) with your new op included. **Do not edit the generated files directly.**

#### Step 7: Test

Add a test in `platform-tests/`:

```java
@Test
void testMyNewOp() {
    SameDiff sd = SameDiff.create();
    SDVariable input = sd.var("input", Nd4j.rand(2, 3));
    SDVariable weights = sd.var("weights", Nd4j.rand(3, 4));
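    SDVariable output = sd.nn().myNewOp(input, weights);

    // Look the result up by the variable's actual name: the op output
    // is not necessarily named "output".
    Map<String, INDArray> result = sd.output(
        Collections.emptyMap(), output.name());
    assertNotNull(result.get(output.name()));
    assertArrayEquals(new long[]{2, 4}, result.get(output.name()).shape());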
}
```

For ops with backward passes, also add a gradient check test.

***

### Contributing Examples

Examples live in a separate repository: [github.com/eclipse/deeplearning4j-examples](https://github.com/eclipse/deeplearning4j-examples).

#### Repository structure

Each sub-project is a self-contained Maven project (no aggregate root POM):

| Module                               | Focus                                             |
| ------------------------------------ | ------------------------------------------------- |
| `dl4j-examples`                      | DL4J neural network examples                      |
| `samediff-examples`                  | SameDiff, DSP, LLM generation, PEFT, RL alignment |
| `nd4j-ndarray-examples`              | ND4J array operations                             |
| `data-pipeline-examples`             | DataVec ETL examples                              |
| `onnx-import-examples`               | ONNX and GGML model import, OmniHub               |
| `tensorflow-keras-import-examples`   | TensorFlow/Keras import                           |
| `dl4j-distributed-training-examples` | Spark distributed training                        |
| `android-examples`                   | Android deployment                                |
| `mvn-project-template`               | Minimal starter template                          |

#### Example conventions

Each example is a standalone runnable Java class with a `public static void main(String[] args)` method. Follow the existing pattern:

```java
package org.deeplearning4j.examples.quickstart.modeling;

// ... imports ...

/**
 * Brief description of what the example demonstrates.
 *
 * Key concepts:
 * - First concept
 * - Second concept
 *
 * @author Your Name
 */
public class MyExample {

    public static void main(String[] args) throws Exception {
        // Example code — runnable as-is
    }
}
```

**Guidelines:**

* **Runnable.** The example must compile and run without modification, external data downloads, or special hardware (unless clearly documented at the top).
* **Self-contained.** All configuration, model building, and data loading happen within the `main` method or private helper methods in the same class.
* **Well-commented.** Explain what each section does and why. Examples are learning tools — clarity beats brevity.
* **No test classes.** Examples run directly via `main()`, not as JUnit tests.
* **Apache 2.0 header.** Include the standard Apache 2.0 license header at the top of every file.

#### Organization

Place your example in the appropriate sub-project and tier:

* `quickstart/` — beginner-friendly, demonstrates one concept clearly
  * `modeling/` — building and training models
  * `features/` — specific DL4J features (early stopping, UI, save/load)
  * `datapipeline/` — loading and transforming data
* `advanced/` — more complex, may combine multiple concepts
  * `modelling/` — attention, seq2seq, object detection, style transfer
  * `features/` — custom layers, transfer learning, advanced configuration

#### Submitting example PRs

1. Fork `deeplearning4j-examples`, create a branch.
2. Add your example in the appropriate module and tier.
3. Verify it compiles: `mvn compile` in the sub-project directory.
4. Verify it runs: `mvn exec:java -Dexec.mainClass="org.deeplearning4j.examples...."`.
5. Open a PR to `eclipse/deeplearning4j-examples:master`.

***

### C++ Guide (libnd4j)

This section covers how to write C++ code in libnd4j. It is organized around what you'll actually do as a contributor — writing ops that work on NDArrays — rather than as an exhaustive macro catalog. All paths are relative to `libnd4j/` in the monorepo.

#### How most ops work: NDArray methods

Most ops don't touch raw buffers, loops, or CUDA kernels directly. NDArray has a rich method API that handles CPU/CUDA dispatch, threading, type promotion, and stride-aware iteration for you. **Start here** — only drop to lower levels when you need custom logic that NDArray methods don't cover.

**Element-wise transforms**

```cpp
// Apply a built-in transform (handles all types, CPU + CUDA, strided memory):
input->applyTransform(transform::Sqrt, output);
input->applyTransform(transform::Abs, output);
input->applyTransform(transform::Sigmoid, output);

// In-place (output == input):
input->applyTransform(transform::Tanh, input);
```

The `transform::*` enum covers all standard element-wise functions. The loop infrastructure handles LoopKind dispatch on CPU and kernel launches on CUDA — you get the optimized path automatically.

**Pairwise operations**

```cpp
// Broadcast-aware binary ops (handles shape broadcasting automatically):
input->applyBroadcast(broadcast::Add, {0, 1}, bias, output);

// Same-shape element-wise:
a->applyPairwiseTransform(pairwise::Multiply, b, output);

// Scalar operations:
input->applyScalar(scalar::Add, 5.0f, output);
```

**Reductions**

```cpp
// Full reduction to scalar:
auto sum = input->reduceNumber(reduce::Sum);
auto mean = input->reduceNumber(reduce::Mean);

// Reduce along dimensions:
std::vector<sd::LongType> dims = {1};
bool keepDims = false;  // true keeps each reduced axis as size 1
input->reduceAlongDimension(reduce::Sum, output, &dims, keepDims);

// Index reductions:
auto argmax = input->indexReduceNumber(indexreduce::IndexMax);
```

**Custom element-wise logic with lambdas**

When a built-in transform doesn't exist for what you need, use the `LAMBDA` macros. These create portable lambdas that work on both CPU and CUDA:

```cpp
// Square each element:
auto square = LAMBDA_T(x) { return x * x; };
input->applyLambda(square, output);

// Pairwise with custom logic:
auto myOp = LAMBDA_TT(x, y) {
    return x > y ? x - y : y - x;
};
input->applyPairwiseLambda(other, myOp, output);

// With index (e.g., positional encoding):
auto posEnc = ILAMBDA_T(x) {
    return x + static_cast<T>(index);
};
input->applyIndexedLambda(posEnc, output);
```

Lambda variants: `LAMBDA_T` (generic), `LAMBDA_D` (double), `LAMBDA_F` (float), `LAMBDA_H` (float16). Pairwise: `LAMBDA_TT`, `LAMBDA_DD`, `LAMBDA_FF`. Indexed: `ILAMBDA_T`, `ILAMBDA_D`, `ILAMBDA_F`.

**Choosing your approach**

| What you need                                          | Use this                                             | Why                                                     |
| ------------------------------------------------------ | ---------------------------------------------------- | ------------------------------------------------------- |
| Standard math op (abs, sqrt, sin, relu...)             | `applyTransform(transform::X, output)`               | Optimized loops on CPU, kernel launches on CUDA         |
| Binary op with broadcasting                            | `applyBroadcast(broadcast::X, dims, other, output)`  | Handles shape broadcast rules automatically             |
| Same-shape binary op                                   | `applyPairwiseTransform(pairwise::X, other, output)` | Simpler path when shapes are known to match             |
| Reduce to scalar or along dims                         | `reduceNumber()` / `reduceAlongDimension()`          | Optimized reduction with tree-reduce on CUDA            |
| Custom element-wise logic                              | `LAMBDA_T` + `applyLambda()`                         | Portable CPU + CUDA, type-dispatched                    |
| Completely custom kernel (attention, convolution, ...) | Drop to raw buffers + CUDA kernel                    | Only when nothing above fits                            |

#### Writing a complete op

Here are the patterns for the three most common op types. Parts that depend on your own op (shape parsing, the helper header) are elided with `...` comments or left out.

**Example 1: Simple element-wise op (NDArray methods)**

Most ops look like this — a few lines calling NDArray methods:

```cpp
// include/ops/declarable/generic/transforms/my_activation.cpp

#include <ops/declarable/CustomOperations.h>

namespace sd {
namespace ops {

OP_IMPL(my_activation, 1, 1, true) {
    auto input  = INPUT_VARIABLE(0);
    auto output = OUTPUT_VARIABLE(0);

    // x * sigmoid(x) — the SiLU/Swish activation
    NDArray sigmoid(input->shapeInfo(), input->dataType(), false, block.launchContext());
    input->applyTransform(transform::Sigmoid, &sigmoid);
    input->applyPairwiseTransform(pairwise::Multiply, &sigmoid, output);

    return sd::Status::OK;
}

DECLARE_TYPES(my_activation) {
    getOpDescriptor()->setAllowedInputTypes({ALL_FLOATS})->setSameMode(true);
}

}  // namespace ops
}  // namespace sd
```

`OP_IMPL` is used here because output shape == input shape (no need for `DECLARE_SHAPE_FN`). The `true` in the fourth argument (`INPLACEABLE`) means the op supports in-place execution.

**Example 2: Reduction with custom shape**

```cpp
// include/ops/declarable/generic/reduce/my_reduce.cpp

CUSTOM_OP_IMPL(reduce_variance, -1, 1, false, 0, 0) {
    auto input  = INPUT_VARIABLE(0);
    auto output = OUTPUT_VARIABLE(0);

    std::vector<sd::LongType> dimensions;
    if (block.width() > 1) {
        auto axesVector = INPUT_VARIABLE(1);
        helpers::adjustAxis(input->rankOf(), axesVector, dimensions);
    } else if (block.getIArguments()->size()) {
        dimensions = *block.getIArguments();
    }

    bool keepDims = block.getBArguments()->size() ? B_ARG(0) : false;

    // NDArray handles the entire reduction — CPU threading, CUDA kernel, all types:
    input->varianceAlongDimension(variance::SummaryStatsVariance, output,
                                   &dimensions, keepDims);
    return sd::Status::OK;
}

DECLARE_SHAPE_FN(reduce_variance) {
    auto inShape = inputShape->at(0);
    // ... parse dimensions, keepDims ...
    return SHAPELIST(ShapeUtils::evalReduceShapeInfo(
        shape::order(inShape), &dimensions, inShape, keepDims, false,
        block.getWorkspace()));
}

DECLARE_TYPES(reduce_variance) {
    getOpDescriptor()->setAllowedInputTypes({ALL_FLOATS})
                     ->setAllowedOutputTypes({ALL_FLOATS});
}
```

`CUSTOM_OP_IMPL` is used because reduction changes the output shape (requires `DECLARE_SHAPE_FN`). `-1` for NIN means variable number of inputs (the dims tensor is optional).
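To make the shape function's job concrete, here is a minimal standalone sketch of the rule it has to implement: reduced axes are dropped, or kept as size 1 when `keepDims` is set. `reducedShape` is a hypothetical illustration that works on plain vectors, not libnd4j code, and treating an empty axis list as "reduce everything" is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of libnd4j): output shape of a reduction over `dims`.
// Assumption of this sketch: an empty `dims` means "reduce over every axis".
std::vector<int64_t> reducedShape(const std::vector<int64_t>& shape,
                                  std::vector<int64_t> dims, bool keepDims) {
    const int64_t rank = static_cast<int64_t>(shape.size());
    std::vector<bool> reduced(shape.size(), dims.empty());
    for (int64_t d : dims) {
        if (d < 0) d += rank;        // normalize negative axes
        assert(d >= 0 && d < rank);  // axis must be in range
        reduced[static_cast<std::size_t>(d)] = true;
    }
    std::vector<int64_t> out;
    for (std::size_t i = 0; i < shape.size(); ++i) {
        if (!reduced[i]) out.push_back(shape[i]);  // untouched axis
        else if (keepDims) out.push_back(1);       // reduced axis kept as size 1
    }
    return out;
}
```

Under these assumptions, reducing shape `{2, 3, 4}` along axis 1 gives `{2, 4}`, or `{2, 1, 4}` with `keepDims`.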

**Example 3: Custom CUDA kernel (when NDArray methods aren't enough)**

Some ops need hand-written kernels — fused operations, complex indexing patterns, or algorithms that don't decompose into existing primitives. Here's the full three-layer pattern:

```cpp
// include/ops/declarable/helpers/cuda/my_custom_op.cu

#include <ops/declarable/helpers/my_custom_op.h>
#include <helpers/DebugHelper.h>
#include <execution/cuda/LaunchDims.h>

namespace sd {
namespace ops {
namespace helpers {

// Layer 1: Device kernel
template <typename T>
SD_KERNEL static void myCustomKernel(
    const void *vx, const LongType *xShapeInfo,
    const void *vy, const LongType *yShapeInfo,
    void *vz, const LongType *zShapeInfo) {

    auto x = reinterpret_cast<const T*>(vx);
    auto y = reinterpret_cast<const T*>(vy);
    auto z = reinterpret_cast<T*>(vz);

    // Thread 0 caches shape metadata in shared memory
    __shared__ LongType length;
    __shared__ LongType xRank, yRank, zRank;
    __shared__ const LongType *xShape, *xStride, *yShape, *yStride, *zShape, *zStride;
    if (threadIdx.x == 0) {
        length  = shape::length(zShapeInfo);
        xRank   = shape::rank(xShapeInfo);
        xShape  = shape::shapeOf(xShapeInfo);
        xStride = shape::stride(xShapeInfo);
        yRank   = shape::rank(yShapeInfo);
        yShape  = shape::shapeOf(yShapeInfo);
        yStride = shape::stride(yShapeInfo);
        zRank   = shape::rank(zShapeInfo);
        zShape  = shape::shapeOf(zShapeInfo);
        zStride = shape::stride(zShapeInfo);
    }
    __syncthreads();

    // Grid-stride loop — handles any array size
    auto tid = blockIdx.x * blockDim.x + threadIdx.x;
    auto totalThreads = gridDim.x * blockDim.x;
    for (LongType i = tid; i < length; i += totalThreads) {
        LongType coords[SD_MAX_RANK];
        LongType xOffset, yOffset, zOffset;

        // Convert linear index → coordinates → strided buffer offsets
        INDEX2COORDS(i, zRank, zShape, coords);
        COORDS2INDEX(xRank, xStride, coords, xOffset);
        COORDS2INDEX(yRank, yStride, coords, yOffset);
        COORDS2INDEX(zRank, zStride, coords, zOffset);

        z[zOffset] = /* your custom logic */ x[xOffset] + y[yOffset];
    }
}

// Layer 2: Host launcher (type-specific, launched from type dispatch)
template <typename T>
static void myCustomLauncher(dim3 dims, cudaStream_t *stream,
    const void *x, const LongType *xShape,
    const void *y, const LongType *yShape,
    void *z, const LongType *zShape) {

    myCustomKernel<T><<<dims.x, dims.y, dims.z, *stream>>>(
        x, xShape, y, yShape, z, zShape);
    DebugHelper::checkErrorCode(stream, "myCustomKernel failed");
}

// Layer 3: Public helper (called from the op body)
void myCustomHelper(LaunchContext *context, NDArray &input, NDArray &other, NDArray &output) {
    auto dims = getLaunchDims("myCustomOp");

    // Sync: ensure inputs are on device, output buffer is ready
    NDArray::prepareSpecialUse({&output}, {&input, &other});

    // Type-dispatch to the correct template
    BUILD_SINGLE_SELECTOR(input.dataType(), myCustomLauncher,
        (dims, context->getCudaStream(),
         input.specialBuffer(),  input.specialShapeInfo(),
         other.specialBuffer(),  other.specialShapeInfo(),
         output.specialBuffer(), output.specialShapeInfo()),
        SD_FLOAT_TYPES);

    // Register: mark output as written on device
    NDArray::registerSpecialUse({&output}, {&input, &other});
}

}  // namespace helpers
}  // namespace ops
}  // namespace sd
```

The op body just calls the helper:

```cpp
CUSTOM_OP_IMPL(my_custom_op, 2, 1, false, 0, 0) {
    auto input  = INPUT_VARIABLE(0);
    auto other  = INPUT_VARIABLE(1);
    auto output = OUTPUT_VARIABLE(0);

    helpers::myCustomHelper(block.launchContext(), *input, *other, *output);
    return sd::Status::OK;
}
```

#### Op declaration macros

**Defined in:** `include/system/op_boilerplate.h`

| When to use                      | Macro                                                              | Why                                 |
| -------------------------------- | ------------------------------------------------------------------ | ----------------------------------- |
| Output shape == input shape      | `OP_IMPL(NAME, NIN, NOUT, INPLACEABLE)`                            | No `DECLARE_SHAPE_FN` needed        |
| Output shape differs from input  | `CUSTOM_OP_IMPL(NAME, NIN, NOUT, INPLACEABLE, TARGS, IARGS)`       | You must write `DECLARE_SHAPE_FN`   |
| Binary op with broadcasting      | `BROADCASTABLE_OP_IMPL(NAME, TARGS, IARGS)`                        | Shape inference via broadcast rules |
| Reduction                        | `REDUCTION_OP_IMPL(NAME, NIN, NOUT, INPLACEABLE, TARGS, IARGS)`    | Reduction-specific base class       |
| Boolean check                    | `BOOLEAN_OP_IMPL(NAME, NIN, SCALAR)`                               | Returns true/false                  |
| Pass-through shape, has T/I args | `CONFIGURABLE_OP_IMPL(NAME, NIN, NOUT, INPLACEABLE, TARGS, IARGS)` | Like `OP_IMPL` but with typed args  |

Parameters: `NIN` = number of inputs (`-1` for variable), `NOUT` = number of outputs, `INPLACEABLE` = `true`/`false`, `TARGS` = float arg count, `IARGS` = integer arg count.

Every op also needs:

```cpp
// Type constraints (always required):
DECLARE_TYPES(my_op) {
    getOpDescriptor()
        ->setAllowedInputTypes({ALL_FLOATS})   // or sd::DataType::ANY, SD_COMMON_TYPES, etc.
        ->setAllowedOutputTypes({ALL_FLOATS});
}
// OR, if output type always matches input type:
DECLARE_SAME_TYPE(my_op)

// Shape function (required for CUSTOM_OP_IMPL only):
DECLARE_SHAPE_FN(my_op) {
    auto inShape = inputShape->at(0);
    // ... compute output shape ...
    return SHAPELIST(ConstantShapeHelper::getInstance().createShapeInfo(
        DataType::FLOAT32, 'c', {outDim0, outDim1}));
}
```

#### Input/output and argument access

Within any op body:

```cpp
auto input  = INPUT_VARIABLE(0);    // NDArray* — i-th input tensor
auto output = OUTPUT_VARIABLE(0);   // NDArray* — i-th output (pre-allocated)

auto alpha = T_ARG(0);              // double — i-th float argument
auto axis  = I_ARG(0);              // LongType — i-th integer argument
auto flag  = B_ARG(0);              // bool — i-th boolean argument

REQUIRE_TRUE(input->rankOf() == 4, 0,
    "my_op: expected rank 4, got %d", input->rankOf());
```

`REQUIRE_TRUE` is the standard way to validate inputs — it includes file/line in the error message automatically.

#### Two different "helpers" — don't confuse them

libnd4j has two distinct mechanisms that both get called "helpers." They work at different levels and solve different problems:

|                  | Helper methods (`helpers/`)                                               | Platform helper ops (`platform/`)                                              |
| ---------------- | ------------------------------------------------------------------------- | ------------------------------------------------------------------------------ |
| **What it is**   | Regular C++ functions with separate CPU and CUDA implementations          | Optional vendor-accelerated replacements for existing ops                      |
| **Selection**    | **Compile-time** — CMake picks `helpers/cpu/*.cpp` or `helpers/cuda/*.cu` | **Runtime** — `PLATFORM_CHECK` inspects dtypes, ranks, flags at execution time |
| **Fallback**     | None — exactly one implementation is linked                               | Yes — if check fails, the generic op body runs instead                         |
| **Namespace**    | `sd::ops::helpers::`                                                      | `sd::ops::platforms::`                                                         |
| **Location**     | `include/ops/declarable/helpers/{cpu,cuda,impl}/`                         | `include/ops/declarable/platform/{mkldnn,cudnn,armcompute,accelerate,mps}/`    |
| **Libraries**    | None (pure C++/CUDA)                                                      | oneDNN, cuDNN, ARM Compute Library, Apple Accelerate, Apple Metal              |
| **Who calls it** | The op body calls `helpers::myFunc()` unconditionally                     | The executor intercepts the op *before* its body runs                          |

#### Helper methods (compile-time CPU/CUDA split)

Most ops delegate their real work to helper functions. A header in `helpers/` declares the signature, and separate `.cpp` and `.cu` files provide CPU and CUDA implementations. CMake globs one or the other — there is no runtime dispatch.

**Directory structure:**

```
include/ops/declarable/helpers/
    activations.h          ← function signatures
    batchnorm.h
    convolutions.h
    gather.h
    transforms.h
    ... (~70 headers)
    cpu/
        activations.cpp    ← CPU implementation (OpenMP threading)
        batchnorm.cpp
        gather.cpp
        ...
    cuda/
        activations.cu     ← CUDA implementation (kernels)
        batchnorm.cu
        gather.cu
        ...
    impl/
        unique.cpp         ← shared code compiled in both CPU and CUDA builds
        listdiff.cpp
        ...
```

CPU and CUDA files mirror each other — same file stems, same function signatures, different implementations. `impl/` contains helpers that need no platform split (compiled in both builds).

**CMake selection** (from `cmake/MainBuildFlow.cmake`):

```cmake
if(SD_CUDA)
    file(GLOB_RECURSE CUSTOMOPS_HELPERS_SOURCES
        ./include/ops/declarable/helpers/cuda/*.cu
        ./include/ops/declarable/helpers/impl/*.cpp)
    # explicitly exclude cpu/ files
else()
    file(GLOB_RECURSE CUSTOMOPS_HELPERS_CPU_SOURCES
        ./include/ops/declarable/helpers/cpu/*.cpp)
    file(GLOB_RECURSE CUSTOMOPS_HELPERS_IMPL_SOURCES
        ./include/ops/declarable/helpers/impl/*.cpp)
endif()
```

No `#ifdef` guards inside the files — the build system ensures only one set is compiled.

**How ops call helpers:**

```cpp
// include/ops/declarable/generic/nn/softmax.cpp
#include <ops/declarable/helpers/activations.h>

CONFIGURABLE_OP_IMPL(softmax, 1, 1, true, 0, 0) {
    auto input  = INPUT_VARIABLE(0);
    auto output = OUTPUT_VARIABLE(0);
    const int dim = block.getIArguments()->size() > 0 ? INT_ARG(0) : input->rankOf() - 1;

    // Calls either the cpu/ or cuda/ implementation — whichever was compiled in
    helpers::softmax(block.launchContext(), input, output, dim);
    return sd::Status::OK;
}
```

The call is unconditional. `block.launchContext()` carries the CUDA stream on GPU builds or is a no-op context on CPU builds. The helper uses it to launch kernels:

```cpp
// In helpers/cuda/activations.cu:
void softmax(LaunchContext *context, NDArray *input, NDArray *output, int dim) {
    NDArray::prepareSpecialUse({output}, {input});
    auto dims = getLaunchDims("softmax");
    BUILD_SINGLE_SELECTOR(input->dataType(), softmaxCudaLauncher,
        (dims, context->getCudaStream(), input->specialBuffer(), ...),
        SD_FLOAT_TYPES);
    NDArray::registerSpecialUse({output}, {input});
}

// In helpers/cpu/activations.cpp:
void softmax(LaunchContext *context, NDArray *input, NDArray *output, int dim) {
    // OpenMP-parallelized implementation working on host buffers
    auto func = PRAGMA_THREADS_FOR {
        for (auto i = start; i < stop; i++) { ... }
    };
    samediff::Threads::parallel_for(func, 0, numTads);
}
```

**The `LaunchContext` pattern:** Helpers always take `LaunchContext*` as their first parameter. On CUDA builds it provides `getCudaStream()`, `getCublasHandle()`, `getCusolverHandle()`, and workspace access. On CPU builds the CUDA methods don't exist — the context just wraps a `Workspace*`. This lets the same function signature work on both platforms.

**When to write a new helper method:**

* Your op needs logic that can't be expressed with NDArray methods (applyTransform, reduceAlongDimension, etc.)
* You need a CUDA kernel for performance
* The same op needs to work on both CPU and CUDA builds

Create a header in `helpers/`, a `.cpp` in `helpers/cpu/`, and a `.cu` in `helpers/cuda/`. Call it from your op body with `helpers::myFunc(block.launchContext(), ...)`.

#### Platform helper ops (runtime vendor dispatch)

**Defined in:** `include/system/platform_boilerplate.h`, `include/ops/declarable/PlatformHelper.h`

Platform helper ops are a completely separate system. They provide **vendor-library-accelerated** replacements for ops that already have a generic implementation. The key difference: they are checked at **runtime**, and the op **falls back** to its generic body if the check fails.

**The dispatch chain** (from `DeclarableOp::execute()` in `impl/DeclarableOp.cpp`):

```
1. Are helpers allowed? (global flag + per-context flag)
   ↓ yes
2. Is a platform helper registered for (this op's hash, current engine)?
   ↓ yes (O(1) hash map lookup in OpRegistrator)
3. Does PLATFORM_CHECK return true? (runtime: inspects dtypes, ranks, strides, flags)
   ↓ yes
4. Run PLATFORM_IMPL (vendor-accelerated path)

If any gate fails → run the generic op body (validateAndExecute)
```

There is at most **one** platform helper per `(opHash, engine)` pair. Multiple backends for the same engine (e.g., oneDNN and ARM Compute both use `ENGINE_CPU`) are mutually exclusive at **build time** — CMake only compiles one into a given binary.
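The gate sequence above can be sketched in plain C++. Every name here (`MiniEngine`, `MiniContext`, `MiniPlatformHelper`, `execute`) is a hypothetical stand-in rather than a libnd4j type, and the sketch uses `std::map` for brevity where the text describes a hash map.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

// All names below are hypothetical stand-ins, not libnd4j types.
enum class MiniEngine { CPU, CUDA };

struct MiniContext {
    bool helpersAllowed = true;  // gate 1: global / per-context switch
    int rank = 0;                // data that the runtime check inspects
};

struct MiniPlatformHelper {
    std::function<bool(const MiniContext&)> check;        // role of PLATFORM_CHECK
    std::function<std::string(const MiniContext&)> impl;  // role of PLATFORM_IMPL
};

// At most one helper per (op hash, engine) pair.
using HelperKey = std::pair<std::uint64_t, MiniEngine>;
using HelperRegistry = std::map<HelperKey, MiniPlatformHelper>;

std::string genericBody(const MiniContext&) { return "generic"; }

std::string execute(const HelperRegistry& registry, std::uint64_t opHash,
                    MiniEngine engine, const MiniContext& ctx) {
    if (ctx.helpersAllowed) {                                       // gate 1
        const auto it = registry.find(HelperKey{opHash, engine});   // gate 2
        if (it != registry.end()) {
            assert(it->second.check && it->second.impl);            // both callbacks set
            if (it->second.check(ctx)) return it->second.impl(ctx); // gates 3 and 4
        }
    }
    return genericBody(ctx);  // any failed gate falls through to the generic op body
}
```

The fallback is a single `return` at the end, so every failed gate lands on the same generic path.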

**Writing a platform helper:**

```cpp
// include/ops/declarable/platform/mkldnn/softmax.cpp

#include <ops/declarable/PlatformHelper.h>
#include <ops/declarable/OpRegistrator.h>

namespace sd {
namespace ops {
namespace platforms {

// The runtime check — return a Requirements object (acts as bool)
PLATFORM_CHECK(softmax, ENGINE_CPU) {
    auto x = INPUT_VARIABLE(0);
    auto z = OUTPUT_VARIABLE(0);

    Requirements req("ONEDNN SOFTMAX OP");
    req.expectTrue(block.isUseONEDNN(), "isUseONEDNN")
    && req.expectFalse(makeInfoVariable(x->isEmpty(), "empty input"), "expected non-empty")
    && req.expectGreater(makeInfoVariable(x->rankOf(), "input rank"), 2)
    && req.expectLess(makeInfoVariable(x->rankOf(), "input rank"), 7)
    && req.expectEq(makeInfoVariable(x->dataType(), "input dtype"), DataType::FLOAT32);
    req.logTheSuccess();
    return req;
}

// The accelerated implementation (only runs if PLATFORM_CHECK returned true)
PLATFORM_IMPL(softmax, ENGINE_CPU) {
    auto input  = INPUT_VARIABLE(0);
    auto output = OUTPUT_VARIABLE(0);

    // oneDNN-specific code using dnnl::engine, dnnl::memory, dnnl::softmax_forward
    dnnl::engine eng(dnnl::engine::kind::cpu, 0);
    // ... set up memory descriptors, primitive, execute ...

    return sd::Status::OK;
}

}  // namespace platforms
}  // namespace ops
}  // namespace sd
```

The `PLATFORM_IMPL` macro auto-registers the helper with `OpRegistrator` at library load time via a static struct initializer — no manual registration needed.

**The Requirements system:** `PLATFORM_CHECK` returns a `Requirements` object (defined in `include/system/RequirementsHelper.h`). It chains conditions with `&&` and provides `expectEq`, `expectIn`, `expectTrue`, `expectLess`, `expectGreater`, etc. If any condition fails, the chain short-circuits and the whole check returns false. When debug+verbose mode is on, `logTheSuccess()` logs all passing conditions.
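The short-circuit behaviour can be illustrated with a miniature checker. `MiniRequirements` is a hypothetical class written for this guide; its method signatures are simplified and do not match the real `Requirements` API (for example, it has no `makeInfoVariable`).

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature of a requirements checker; not the libnd4j class.
class MiniRequirements {
public:
    explicit MiniRequirements(std::string name) : name_(std::move(name)) {}

    // Each expect* records the outcome and returns *this so calls can be chained.
    MiniRequirements& expectTrue(bool value, const std::string& label) {
        return record(value, label);
    }
    MiniRequirements& expectGreater(int value, int bound, const std::string& label) {
        return record(value > bound, label);
    }
    MiniRequirements& expectLess(int value, int bound, const std::string& label) {
        return record(value < bound, label);
    }

    // Makes `a && b` short-circuit: after a failure, later operands are not evaluated.
    explicit operator bool() const { return ok_; }

    const std::vector<std::string>& passed() const { return passed_; }
    const std::string& firstFailure() const { return failure_; }

private:
    MiniRequirements& record(bool ok, const std::string& label) {
        assert(!label.empty());   // labels make failures diagnosable
        if (!ok_) return *this;   // already failed: keep the first failure
        if (ok) passed_.push_back(label);
        else { ok_ = false; failure_ = name_ + ": " + label; }
        return *this;
    }

    std::string name_;
    bool ok_ = true;
    std::vector<std::string> passed_;
    std::string failure_;
};
```

Because the chain uses the built-in `&&`, an expensive condition placed late in the chain costs nothing when an earlier, cheap condition has already failed.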

**Engine constants:**

| Engine                         | Library             | Source directory       | Build-time constraint  |
| ------------------------------ | ------------------- | ---------------------- | ---------------------- |
| `ENGINE_CPU` / `ENGINE_ONEDNN` | Intel oneDNN        | `platform/mkldnn/`     | x86 builds with oneDNN |
| `ENGINE_CUDA`                  | NVIDIA cuDNN        | `platform/cudnn/`      | CUDA builds only       |
| `ENGINE_ARM`                   | ARM Compute Library | `platform/armcompute/` | ARM builds with ACL    |
| `ENGINE_ACCELERATE`            | Apple Accelerate    | `platform/accelerate/` | macOS/iOS builds       |
| `ENGINE_MPS`                   | Apple Metal         | `platform/mps/`        | macOS/iOS builds       |

**Op coverage (sampling):**

| Op                           | cuDNN | oneDNN | ARM Compute |
| ---------------------------- | ----- | ------ | ----------- |
| `conv2d` / `conv2d_bp`       | yes   | yes    | yes         |
| `conv3dnew` / `conv3dnew_bp` | yes   | yes    | —           |
| `depthwise_conv2d` / `_bp`   | yes   | yes    | —           |
| `avgpool2d` / `maxpool2d`    | yes   | yes    | yes         |
| `batchnorm` / `batchnorm_bp` | yes   | yes    | —           |
| `softmax`                    | —     | yes    | —           |
| `matmul`                     | —     | yes    | —           |
| `lstmLayer`                  | yes   | yes    | —           |
| `ctc_loss`                   | yes   | —      | —           |

**When to write a platform helper vs. a helper method:**

* **Helper method** — you're writing the primary implementation of an op that needs to work on both CPU and CUDA. This is the common case.
* **Platform helper** — you're adding a vendor-optimized fast-path for an op that *already works*. The generic implementation must exist first. The platform helper is a bonus that kicks in only when the runtime conditions are met (right dtype, right rank, library available, etc.)

#### How they interact

A typical op has both:

```
softmax op body
    ├─ PLATFORM_CHECK(softmax, ENGINE_CPU) → oneDNN path (if available + float32 + rank 3-6)
    │   (checked by DeclarableOp::execute BEFORE the op body runs)
    │
    └─ if no platform helper matched:
        op body runs → calls helpers::softmax(context, input, output, dim)
            ├─ helpers/cpu/activations.cpp  (if CPU build)
            └─ helpers/cuda/activations.cu  (if CUDA build)
```

The platform helper completely **replaces** the op body — it doesn't call the helper method. It's an alternative path, not a wrapper. The helper method is the fallback that runs when no platform helper is available or when the platform check fails (wrong dtype, wrong rank, etc.).

#### Platform macros

**Defined in:** `include/system/common.h`

Do not use raw CUDA/compiler annotations. The project macros compile on both CPU and CUDA builds:

| Banned                     | Use instead               | CPU expansion     | CUDA expansion           |
| -------------------------- | ------------------------- | ----------------- | ------------------------ |
| `__host__`                 | `SD_HOST`                 | *(empty)*         | `__host__`               |
| `__device__`               | `SD_DEVICE`               | *(empty)*         | `__device__`             |
| `__global__`               | `SD_KERNEL`               | *(empty)*         | `__global__`             |
| `__host__ __device__`      | `SD_HOST_DEVICE`          | *(empty)*         | `__host__ __device__`    |
| `__forceinline__`          | `SD_INLINE`               | `inline`          | `__forceinline__ inline` |
| `#pragma omp parallel for` | `PRAGMA_OMP_PARALLEL_FOR` | `#pragma omp ...` | *(empty)*                |

Composite qualifiers: `SD_OP_DEF` (host+device inline, SIMD hint on CPU), `SD_META_DEF` (host-only inline).
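As a rough picture of how such a macro stays portable, here is a minimal sketch. `MINI_HOST_DEVICE` and `clampValue` are hypothetical names for illustration; the project's real macro is `SD_HOST_DEVICE`, and its actual definition may differ.

```cpp
#include <cassert>

// Hypothetical macro mirroring the table above; the real one is SD_HOST_DEVICE.
#ifdef __CUDACC__
#define MINI_HOST_DEVICE __host__ __device__  // compiled by nvcc
#else
#define MINI_HOST_DEVICE                      // plain C++ compiler: expands to nothing
#endif

// The same source compiles in a CPU build and in a CUDA build.
template <typename T>
MINI_HOST_DEVICE inline T clampValue(T value, T lo, T hi) {
    assert(!(hi < lo));  // the bounds must form a valid interval
    return value < lo ? lo : (value > hi ? hi : value);
}
```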

#### OpenMP macros

**Defined in:** `include/system/openmp_pragmas.h`

Never use raw `#pragma omp`. On MSVC most of these expand to nothing (limited OpenMP support), so the macros are required for portability.

Most useful in practice:

```cpp
// Basic parallel for:
PRAGMA_OMP_PARALLEL_FOR
for (LongType i = 0; i < length; i++) { ... }

// With thread count control:
PRAGMA_OMP_PARALLEL_FOR_THREADS(numThreads)
for (LongType i = 0; i < length; i++) { ... }

// With reduction:
double sum = 0;
PRAGMA_OMP_PARALLEL_FOR_REDUCTION(+:sum)
for (LongType i = 0; i < length; i++) { sum += values[i]; }

// Combined parallel for + SIMD:
PRAGMA_OMP_PARALLEL_FOR_SIMD
for (LongType i = 0; i < length; i++) { ... }

// Collapsed nested loops:
PRAGMA_OMP_PARALLEL_FOR_SIMD_COLLAPSE(2)
for (LongType i = 0; i < rows; i++)
    for (LongType j = 0; j < cols; j++) { ... }

// Atomic update:
PRAGMA_OMP_ATOMIC
counter++;
```

**In practice, you rarely need these directly.** NDArray methods and the loop infrastructure handle threading for you. You'll only write explicit OpenMP when implementing a helper function that works on raw buffers.
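For intuition about why the macro layer helps portability, the sketch below defines one such macro with the standard `_Pragma` operator. The `MINI_*` names are hypothetical, and the real definitions in `openmp_pragmas.h` may be structured differently.

```cpp
#include <cassert>

// Hypothetical macros; the project's own versions live in openmp_pragmas.h.
#if defined(_OPENMP) && !defined(_MSC_VER)
#define MINI_PRAGMA(x) _Pragma(#x)
#define MINI_OMP_PARALLEL_FOR_REDUCTION(args) MINI_PRAGMA(omp parallel for reduction(args))
#else
#define MINI_OMP_PARALLEL_FOR_REDUCTION(args)  // expands to nothing: the loop runs serially
#endif

// Sums a raw host buffer; parallel when OpenMP is enabled, serial otherwise.
double sumAll(const double* values, long length) {
    assert(values != nullptr || length == 0);
    double sum = 0.0;
    MINI_OMP_PARALLEL_FOR_REDUCTION(+:sum)
    for (long i = 0; i < length; i++) {
        sum += values[i];
    }
    return sum;
}
```

The loop body is identical in both configurations; only the macro expansion changes, so the function still builds when OpenMP is unavailable.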

#### Type dispatch

**Defined in:** `include/system/type_boilerplate.h`

Type dispatch is needed when you drop to raw buffers (e.g., in CUDA kernels or helper functions). NDArray methods handle this internally — you only need `BUILD_SINGLE_SELECTOR` when calling a templated helper from non-templated op code.

```cpp
// Runtime dispatch: call myHelper<T>() for the correct T based on DataType:
BUILD_SINGLE_SELECTOR(input->dataType(), myHelper,
    (input->buffer(), output->buffer(), length),
    SD_FLOAT_TYPES);
// Expands to: switch(dataType) { case FLOAT32: myHelper<float>(...); break; ... }
```

**Use the narrowest type list.** `SD_FLOAT_TYPES` (4 types) instead of `SD_COMMON_TYPES` (13 types) when only floats are supported. Each type instantiation adds to binary size and compile time.

| Type list           | Count | Use when                                            |
| ------------------- | ----- | --------------------------------------------------- |
| `SD_FLOAT_TYPES`    | 4     | Op only makes sense on floats (most neural net ops) |
| `SD_NUMERIC_TYPES`  | 12    | Op works on floats and integers                     |
| `SD_COMMON_TYPES`   | 13    | Op works on any type including bool                 |
| `SD_INTEGER_TYPES`  | 8     | Op only works on integers                           |
| `SD_INDEXING_TYPES` | 2     | Op uses indices (INT32, INT64 only)                 |

For template instantiation in `.cpp`/`.cu` files (forces the compiler to emit code for each type):

```cpp
BUILD_SINGLE_TEMPLATE(template void myHelper, (const void*, void*, LongType), SD_FLOAT_TYPES);
```
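
To make the dispatch idea concrete, here is a hand-written sketch of the kind of switch a single-selector dispatch stands for. It does not use the libnd4j macros: `SketchDataType`, `doubleValues`, and `dispatchDoubleValues` are hypothetical names, and the enum is a three-member stand-in for the real data type enum.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

using LongType = std::int64_t;  // stand-in for the project's LongType

// Assumption: a three-member stand-in for the real data type enum.
enum class SketchDataType { FLOAT32, DOUBLE, INT32 };

// Templated helper on type-erased buffers: out[i] = in[i] * 2.
template <typename T>
void doubleValues(const void* in, void* out, LongType length) {
    const T* x = static_cast<const T*>(in);
    T* z = static_cast<T*>(out);
    for (LongType i = 0; i < length; i++) {
        z[i] = x[i] * static_cast<T>(2);
    }
}

// Picks the template instantiation from the runtime type tag. Only the float
// types are listed, so only two instantiations are compiled.
void dispatchDoubleValues(SketchDataType type, const void* in, void* out, LongType length) {
    assert(in != nullptr && out != nullptr);
    switch (type) {
        case SketchDataType::FLOAT32: doubleValues<float>(in, out, length); break;
        case SketchDataType::DOUBLE:  doubleValues<double>(in, out, length); break;
        default: throw std::invalid_argument("unsupported data type for this helper");
    }
}
```

Each extra `case` is another full copy of the helper in the binary, which is why the type list should be as narrow as the op allows.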

#### CUDA kernel patterns in detail

Only write custom kernels when NDArray methods can't express your logic. When you do, follow these patterns exactly.

**Shared memory for shape info**

Thread 0 reads shape metadata once; all threads use it. This avoids redundant global memory reads:

```cpp
__shared__ LongType length, rank;
__shared__ const LongType *shapePtr, *stridePtr;
if (threadIdx.x == 0) {
    length    = shape::length(shapeInfo);
    rank      = shape::rank(shapeInfo);
    shapePtr  = shape::shapeOf(shapeInfo);
    stridePtr = shape::stride(shapeInfo);
}
__syncthreads();
```

**Grid-stride loop**

Always use this pattern — never assume the array fits in one grid:

```cpp
auto tid = blockIdx.x * blockDim.x + threadIdx.x;
auto totalThreads = gridDim.x * blockDim.x;
for (LongType i = tid; i < length; i += totalThreads) {
    // process element i
}
```
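
If the coverage guarantee is not obvious, a host-side simulation can help. The sketch below is plain C++, not a CUDA kernel: it replays the loop above for every (block, thread) pair and counts how often each element is visited. `simulateGridStride` and its parameters are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using LongType = std::int64_t;  // stand-in for the project's LongType

// Replays the grid-stride loop on the host for every simulated thread and
// returns the number of visits per element.
std::vector<int> simulateGridStride(LongType length, LongType gridSize, LongType blockSize) {
    assert(length >= 0 && gridSize > 0 && blockSize > 0);
    std::vector<int> visits(static_cast<std::size_t>(length), 0);
    const LongType totalThreads = gridSize * blockSize;
    for (LongType block = 0; block < gridSize; block++) {
        for (LongType thread = 0; thread < blockSize; thread++) {
            const LongType tid = block * blockSize + thread;
            // Same loop shape as in the kernel: start at tid, step by the grid size.
            for (LongType i = tid; i < length; i += totalThreads) {
                visits[static_cast<std::size_t>(i)]++;
            }
        }
    }
    return visits;
}
```

Every element is visited exactly once, whether the array is larger than the grid (1000 elements, 64 threads) or smaller (5 elements, 32 threads).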

**INDEX2COORDS / COORDS2INDEX**

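Used together, these macros turn a linear index into a strided buffer offset: `INDEX2COORDS` maps the index to coordinates, and `COORDS2INDEX` maps the coordinates to an offset. They handle both C and Fortran ordering. **Always use them**, and never assume contiguous memory layout: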

```cpp
LongType coords[SD_MAX_RANK], offset;
INDEX2COORDS(linearIndex, rank, shapePtr, coords);   // index → coordinates
COORDS2INDEX(rank, stridePtr, coords, offset);        // coordinates → buffer offset
T value = buffer[offset];
```
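
The arithmetic behind the two steps can be sketched in plain C++. This is not the macros' implementation: `indexToCoords` and `coordsToOffset` are hypothetical functions, and the sketch assumes C ordering only, whereas the real macros also cover Fortran ordering.

```cpp
#include <cassert>
#include <cstdint>

using LongType = std::int64_t;  // stand-in for the project's LongType

// Linear index -> coordinates, assuming C order (last dimension varies fastest).
void indexToCoords(LongType index, int rank, const LongType* shape, LongType* coords) {
    assert(rank > 0 && shape != nullptr && coords != nullptr);
    for (int d = rank - 1; d >= 0; d--) {
        coords[d] = index % shape[d];
        index /= shape[d];
    }
}

// Coordinates -> buffer offset: the dot product of coordinates and strides.
LongType coordsToOffset(int rank, const LongType* strides, const LongType* coords) {
    assert(rank > 0 && strides != nullptr && coords != nullptr);
    LongType offset = 0;
    for (int d = 0; d < rank; d++) {
        offset += coords[d] * strides[d];
    }
    return offset;
}
```

Take a 2×3 view that is the transpose of a 3×2 C-order buffer, so its strides are `{1, 2}`. Linear index 4 maps to coordinates `(1, 1)` and buffer offset 3. Code that assumed contiguous memory would read offset 4 and silently return the wrong element.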

**Launch dimensions**

**Defined in:** `include/execution/cuda/LaunchDims.h`, `LaunchDims.cu`

Launch dimensions are packed into a `dim3` with a project-specific convention: `.x` is blocks per grid, `.y` is threads per block, and `.z` is shared memory bytes. Retrieve them by name from the global registry:

```cpp
auto dims = getLaunchDims("myOp");
myKernel<<<dims.x, dims.y, dims.z, *stream>>>(...);
```

Every named entry supports environment-variable overrides (e.g., `GRID_SIZE_MY_OP`, `BLOCK_SIZE_MY_OP`) for runtime tuning without recompiling. **Always register your launch dims** — never hardcode `<<<256, 512>>>`.
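
The override idea can be sketched as "environment variable first, compiled-in default otherwise". The code below is an assumption about the general approach, not a copy of what `LaunchDims.cu` does: `parsePositiveOrDefault` and `resolveLaunchValue` are hypothetical helpers, and the real parsing and validation rules may differ.

```cpp
#include <cassert>
#include <cstdlib>
#include <limits>

// Parses a positive integer. Returns the fallback when the text is missing,
// malformed, non-positive, or too large for an int.
int parsePositiveOrDefault(const char* text, int fallback) {
    if (text == nullptr || *text == '\0') return fallback;
    char* end = nullptr;
    const long value = std::strtol(text, &end, 10);
    if (*end != '\0' || value <= 0 || value > std::numeric_limits<int>::max()) {
        return fallback;
    }
    return static_cast<int>(value);
}

// Resolves one launch parameter: an environment override wins, otherwise the
// compiled-in default is used.
int resolveLaunchValue(const char* envName, int compiledDefault) {
    assert(envName != nullptr && compiledDefault > 0);
    return parsePositiveOrDefault(std::getenv(envName), compiledDefault);
}
```

With this shape, a call such as `resolveLaunchValue("GRID_SIZE_MY_OP", 256)` returns 256 unless the variable is set to a valid positive integer.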

**CUDA coherence: prepareSpecialUse / registerSpecialUse**

Every CUDA op **must** bookend kernel launches with these calls:

```cpp
// Before: sync inputs to device, prepare output
NDArray::prepareSpecialUse({output}, {input1, input2});

// ... kernel launch ...

// After: mark output as written on device, inputs as read
NDArray::registerSpecialUse({output}, {input1, input2});
```

Forgetting this causes silent data corruption — the host and device copies of the buffer get out of sync.
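
A toy model shows why the second bookend matters. Everything below is an assumption made for illustration: `ToyBuffer`, its flags, and `toyKernel` are hypothetical and far simpler than libnd4j's real bookkeeping, and the "device" is just a second host vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model: one logical buffer with a host copy, a "device" copy, and flags
// recording which copy holds the latest data.
struct ToyBuffer {
    std::vector<float> host;
    std::vector<float> device;
    bool hostCurrent = true;
    bool deviceCurrent = false;

    explicit ToyBuffer(const std::vector<float>& initial)
        : host(initial), device(initial.size(), 0.0f) {}

    // "Before" bookend: bring the device copy up to date.
    void prepareForDevice() {
        if (!deviceCurrent) { device = host; deviceCurrent = true; }
    }

    // "After" bookend: record that the device copy was written.
    void markWrittenOnDevice() { deviceCurrent = true; hostCurrent = false; }

    // Host read: copies back only when the flags say the host copy is stale.
    float readHost(std::size_t i) {
        assert(i < host.size());
        if (!hostCurrent) { host = device; hostCurrent = true; }
        return host[i];
    }
};

// Stand-in for a kernel: adds 1 to every element of the device copy.
void toyKernel(ToyBuffer& buffer) {
    for (float& v : buffer.device) v += 1.0f;
}
```

If `markWrittenOnDevice()` is skipped after `toyKernel`, `readHost` still believes the host copy is current and returns the old value. No error is raised, which is the same failure mode as a missing `registerSpecialUse`.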

#### Loop infrastructure (internals)

**Directory:** `include/loops/`

You don't call these directly — NDArray methods dispatch to them. But understanding the architecture helps when debugging performance or contributing new loop types.

| Category            | Classes                                                                                                    | Example ops                  |
| ------------------- | ---------------------------------------------------------------------------------------------------------- | ---------------------------- |
| Element-wise        | `TransformSame<X>`, `TransformFloat<X,Z>`, `TransformBool<X,Z>`, `TransformStrict<X>`, `TransformAny<X,Z>` | Abs, Sqrt, Sigmoid, IsNan    |
| Reductions          | `ReduceSameFunction<X>`, `ReduceFloatFunction<X,Z>`, `ReduceBoolFunction<X,Z>`, `ReduceLongFunction<X,Z>`  | Sum, Mean, Any, CountNonZero |
| Index reductions    | `IndexReduce<X,Z>`                                                                                         | ArgMax, ArgMin               |
| Binary              | `PairWiseTransform<X,Y,Z>`, `Broadcast<X,Y,Z>`, `ScalarTransform<X,Y,Z>`                                   | Add, Multiply, broadcast ops |
| Pairwise reductions | `Reduce3<X,Z>`, `SummaryStatsReduce<X,Z>`                                                                  | CosineSimilarity, Variance   |

CPU implementations live in `include/loops/cpu/`, and CUDA implementations live in `include/loops/cuda/`.

**LoopKind** (`include/helpers/LoopKind.h`) classifies the fastest CPU loop strategy: `RANK1`–`RANK5` use direct stride arithmetic (no coordinate conversion), `BROADCAST_SCALAR_X/Y` handles scalar broadcast, and `COMMON` falls back to `INDEX2COORDS`/`COORDS2INDEX`. This dispatch happens automatically — do not hand-write rank-specialized loops.

#### NDArray shape and element access

Use the element accessors for debugging, small tensors, or setup code, never in hot paths. For performance-critical code, work on the raw buffers instead:

```cpp
// Shape queries:
input->rankOf()              // rank
input->sizeAt(dim)           // size of dimension
input->lengthOf()            // total elements
input->dataType()            // sd::DataType enum
input->ordering()            // 'c' or 'f'
input->shapeInfo()           // raw LongType* shape buffer

// Element access (host-side, slow — auto-syncs host/device):
T val = input->t<T>(i);            // by linear index
T val = input->t<T>(i, j);         // by 2D coordinates
T val = input->t<T>(i, j, k);      // by 3D coordinates
input->p<T>(i, newValue);          // write element

// Buffer access (for performance-critical code):
T *hostBuf   = input->bufferAsT<T>();      // host pointer
T *deviceBuf = input->specialBufferasT<T>();  // CUDA device pointer
```

#### Error handling and debugging

```cpp
// Validate inputs (preferred — includes file:line automatically):
REQUIRE_TRUE(input->rankOf() >= 2, 0,
    "my_op: expected rank >= 2, got %d", input->rankOf());

// Throw unconditionally:
THROW_EXCEPTION("my_op: unsupported configuration");

// Logging (sd_debug/sd_verbose are no-ops unless debug mode is on):
sd_printf("my_op: processing shape [%lld, %lld]\n", dim0, dim1);
sd_debug("my_op: entering branch for dtype %d\n", input->dataType());

// CUDA error check (after kernel launches):
DebugHelper::checkErrorCode(stream, "myKernel failed");
```
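
For readers wondering how a validation macro can report `file:line` without the caller passing it, here is a simplified sketch of the mechanism. `SKETCH_REQUIRE`, `sketchFail`, and `checkRank` are hypothetical. The real `REQUIRE_TRUE` takes a printf-style format with extra arguments, and its failure behavior may differ from the exception used here.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Builds "file:line: message" and throws it as a runtime_error.
[[noreturn]] inline void sketchFail(const char* file, int line, const std::string& message) {
    assert(file != nullptr);
    std::ostringstream out;
    out << file << ":" << line << ": " << message;
    throw std::runtime_error(out.str());
}

// Assumption: simplified stand-in for a validation macro. Because it is a
// macro, __FILE__ and __LINE__ expand at the call site, not here.
#define SKETCH_REQUIRE(condition, message)                               \
    do {                                                                 \
        if (!(condition)) sketchFail(__FILE__, __LINE__, (message));     \
    } while (0)

// Example validation: rejects ranks below 2.
void checkRank(int rank) {
    SKETCH_REQUIRE(rank >= 2, "my_op: expected rank >= 2, got " + std::to_string(rank));
}
```

A plain function could not capture the caller's location this way, which is one reason to prefer the project macro over a hand-rolled `if` + `throw`.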

#### Rules and conventions

1. **Start with NDArray methods.** `applyTransform`, `reduceAlongDimension`, `applyBroadcast`, `applyPairwiseLambda` — these handle threading, CUDA dispatch, type promotion, and stride-aware iteration. Only drop to raw buffers when you have custom logic that can't be expressed this way.
2. **Use project macros.** `SD_HOST`/`SD_DEVICE`/`SD_KERNEL`/`SD_INLINE` instead of raw CUDA annotations. `PRAGMA_OMP_*` instead of raw `#pragma omp`.
3. **Never assume contiguous memory.** Always use `INDEX2COORDS` / `COORDS2INDEX` in kernels. Even arrays that look contiguous may be views with non-trivial strides.
4. **Always bookend CUDA kernels with `prepareSpecialUse` / `registerSpecialUse`.** Forgetting this causes silent data corruption.
5. **Use the narrowest type list.** `SD_FLOAT_TYPES` instead of `SD_COMMON_TYPES` when only floats are supported — fewer instantiations means smaller binaries and faster compile times.
6. **Register launch dimensions.** Add entries to `LaunchDims.h`/`LaunchDims.cu`. Never hardcode grid/block sizes.
7. **Use grid-stride loops.** `for (i = tid; i < length; i += totalThreads)` — handles any array size.
8. **Cache shape info in `__shared__`.** Thread 0 reads, all threads sync, then all threads use shared copies.
9. **Do not use `ews()` / `elementWiseStride`.** Both are deprecated and return wrong results for views.
10. **Use `REQUIRE_TRUE` for input validation.** Do not use a raw `if` + `throw`.
11. **Do not use smart pointers.** libnd4j uses raw pointers. Smart pointers conflict with the workspace allocator and existing ownership model.
12. **Use the workspace macros for raw allocation.** On the rare occasion you need raw allocation (temporary index arrays, workspace scratch), use `ALLOCATE(ptr, workspace, length, Type)` / `RELEASE(ptr, workspace)` to integrate with the workspace system. Most ops never need this because NDArray manages its own memory.

***

### Java Conventions

* **Java 11** source compatibility. Do not use Java 17 language features in core modules.
* **4-space indent**, no tabs.
* **Lombok** annotations (`@Data`, `@Builder`, `@Slf4j`) — follow the style of surrounding code.
* No wildcard imports (for example, `import org.nd4j.*`).
* **Javadoc** on all public methods and classes.
* Generated code (JavaCPP presets) must never be edited directly — update the preset configuration instead.

***

### Pull Request Workflow

#### 1. Fork and branch

```bash
git clone https://github.com/YOUR_USERNAME/deeplearning4j.git
cd deeplearning4j
git remote add upstream https://github.com/deeplearning4j/deeplearning4j.git
git fetch upstream
git checkout -b my-feature upstream/master
```

#### 2. Make your changes

Keep commits small and focused. Each commit should compile and pass tests independently.

#### 3. Rebase before submitting

```bash
git fetch upstream
git rebase upstream/master
```

#### 4. Push and open PR

```bash
git push origin my-feature
```

Open a pull request to `deeplearning4j/deeplearning4j:master`. Include:

* **What** was changed and **why**
* How to test the change
* Any relevant issue numbers (`Fixes #1234`)

#### 5. CI and review

CI runs automatically. Address reviewer feedback by pushing additional commits — do not force-push a branch under review. A maintainer will merge once the PR is approved and CI passes.

#### PR checklist

* [ ] ECA signed and email matches GitHub account
* [ ] Branch is rebased on current `upstream/master`
* [ ] Code compiles with `mvn clean install -DskipTests`
* [ ] New or changed behavior is covered by tests in `platform-tests/`
* [ ] New public API has Javadoc
* [ ] Numerical gradient checks pass for any new op with a backward pass
* [ ] C++ code uses project macros (`SD_HOST`, `SD_DEVICE`, `PRAGMA_OMP_*`), not raw annotations

***

### CI/CD Build Environment

Understanding the CI pipeline helps when debugging build failures or adding new build targets. All CI configuration lives in `.github/workflows/`.

#### Build matrix

The project builds native artifacts across multiple platforms, each producing classifier-tagged Maven artifacts:

| Platform                  | Workflow                           | Runners                        |
| ------------------------- | ---------------------------------- | ------------------------------ |
| Linux x86\_64 (CPU)       | `build-deploy-linux-x86_64.yml`    | `ubuntu-22.04`                 |
| Linux x86\_64 (CUDA 12.6) | `build-deploy-linux-cuda-12.6.yml` | Self-hosted                    |
| Linux x86\_64 (CUDA 12.9) | `build-deploy-linux-cuda-12.9.yml` | Self-hosted                    |
| Linux ARM64               | `build-deploy-linux-arm64.yml`     | Self-hosted ARM64              |
| macOS ARM64               | `build-deploy-mac-arm64.yml`       | `macos-14`                     |
| Windows x86\_64           | `build-deploy-windows-x86_64.yml`  | `windows-2022`                 |
| Android ARM64             | `build-deploy-android-arm64.yml`   | `ubuntu-22.04` (cross-compile) |
| Android x86\_64           | `build-deploy-android-x86_64.yml`  | `ubuntu-22.04` (cross-compile) |

#### Classifier system

Each platform build produces artifacts with classifiers that encode the helper library and extension:

```
nd4j-native-1.0.0-SNAPSHOT-linux-x86_64.jar          # base
nd4j-native-1.0.0-SNAPSHOT-linux-x86_64-onednn.jar   # oneDNN helper
nd4j-native-1.0.0-SNAPSHOT-linux-x86_64-avx2.jar     # AVX2 extension
nd4j-native-1.0.0-SNAPSHOT-linux-x86_64-onednn-avx2.jar  # both
```

CPU builds use a **matrix** of helper × extension:

| Dimension     | Values                       | What it means                                                                  |
| ------------- | ---------------------------- | ------------------------------------------------------------------------------ |
| **helper**    | `onednn`, `compile`, (empty) | Helper library linked: oneDNN graph API, MLIR/Triton compile stack, or generic |
| **extension** | `avx2`, `avx512`, (empty)    | x86 SIMD extension level targeted                                              |

The matrix produces up to 9 combinations (3 × 3). The `compile` helper variant requires LLVM/MLIR at build time and produces the Triton JIT compilation backend.
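
The 3 × 3 count can be checked by enumerating the matrix. The sketch below is illustrative only: `buildClassifiers` is a hypothetical helper, and the `platform[-helper][-extension]` naming for the `compile` and `avx512` variants is extrapolated from the four example artifact names above, not confirmed by the build scripts.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds every classifier for one platform as platform[-helper][-extension].
// An empty string stands for the "(empty)" matrix value.
std::vector<std::string> buildClassifiers(const std::string& platform,
                                          const std::vector<std::string>& helpers,
                                          const std::vector<std::string>& extensions) {
    assert(!platform.empty());
    std::vector<std::string> classifiers;
    for (const std::string& helper : helpers) {
        for (const std::string& extension : extensions) {
            std::string classifier = platform;
            if (!helper.empty()) classifier += "-" + helper;
            if (!extension.empty()) classifier += "-" + extension;
            classifiers.push_back(classifier);
        }
    }
    return classifiers;
}
```

Calling it with helpers `{"", "onednn", "compile"}` and extensions `{"", "avx2", "avx512"}` yields nine classifiers, starting with the base `linux-x86_64` and including `linux-x86_64-onednn-avx2`.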

CUDA builds use a simpler matrix:

| Dimension  | Values                      | What it means                                       |
| ---------- | --------------------------- | --------------------------------------------------- |
| **helper** | `cudnn`, `compile`, (empty) | cuDNN helper, Triton compile stack, or generic CUDA |

#### CI environment details

The standard CI build environment for Linux x86\_64:

| Component      | Version / Config                                                           |
| -------------- | -------------------------------------------------------------------------- |
| OS             | Ubuntu 22.04                                                               |
| JDK            | Temurin 11 (build), 17 (test)                                              |
| Maven          | 3.9.x                                                                      |
| CMake          | Latest via `apt`                                                           |
| Compiler cache | **sccache** (not ccache — CI uses sccache for its S3 remote cache support) |
| Protobuf       | libprotobuf-dev (from apt)                                                 |
| Debug symbols  | `libdwarf-dev`, `libelf-dev`, `binutils-dev` (for DWARF stack traces)      |
| LLVM/MLIR      | LLVM 18 (only for `compile` helper variant)                                |
| Swap           | 12 GB swap file (native builds are memory-intensive)                       |

**Difference from local development:** CI uses **sccache** instead of ccache. sccache supports remote S3 caching, so the CI cache is shared across runs. Local developers should still use **ccache**, which is simpler and doesn't require S3 configuration.

CUDA CI builds add:

| Component            | Version / Config                                          |
| -------------------- | --------------------------------------------------------- |
| CUDA toolkit         | 12.6 or 12.9 (installed via `Jimver/cuda-toolkit` action) |
| Compute capabilities | `8.6 9.0` (Ampere + Hopper)                               |
| Build timeout        | 720 minutes (12 hours)                                    |
| Runners              | Self-hosted with NVIDIA GPUs                              |

#### Test infrastructure

Tests run via the `run-tests.yml` workflow, which supports 16 test suites:

| Suite              | What it tests                |
| ------------------ | ---------------------------- |
| `nd4j`             | ND4J core array operations   |
| `samediff`         | SameDiff autodiff engine     |
| `java-cpp`         | JavaCPP bindings             |
| `dl4j-core`        | DL4J neural network layers   |
| `datavec`          | Data pipeline (DataVec)      |
| `keras`            | Keras model import           |
| `onnx`             | ONNX import                  |
| `dl4j-spark`       | Distributed training (Spark) |
| `dsp`              | DSP execution engine         |
| `llm`              | LLM/VLM inference stack      |
| `peft`             | PEFT and RL alignment        |
| `ggml`             | GGML/GGUF model import       |
| `omnihub`          | OmniHub model loading        |
| `python4j`         | Python4J bridge              |
| `tokenizers`       | Tokenizer implementations    |
| `model-evaluation` | LLM evaluation benchmarks    |

The workflow accepts parameters:

```yaml
# Run a single suite
test_suite: "samediff"

# Run all 16 suites
test_suite: "all"

# Quick mode — runs 5 core suites in parallel (nd4j, samediff, java-cpp, dl4j-core, datavec)
test_suite: "quick"

# Backend selection
backend_artifact_id: "nd4j-cuda-12.9"
backend_classifier: "linux-x86_64-onednn-avx2"

# JVM config
heap_size: "6g"
```

Test resources (models, test data) are fetched from `dl4j-test-resources` at the start of each run. Test results are uploaded as artifacts (Surefire XML reports) for each suite.

#### Snapshot deployment

Successful builds on the `master` branch deploy Maven snapshots to [central.sonatype.com](https://central.sonatype.com) (the Sonatype Central snapshot repository). The deployment retries with exponential backoff (up to 3 attempts) to handle transient upload failures.

Deployed artifacts include:

* Backend JARs with platform classifiers
* DSP Runtime SDK (native shared libraries)
* SDK JARs (Java, with sources and javadoc)

#### Reproducing CI builds locally

To replicate what CI does for a CPU build with the oneDNN helper and AVX2 extension:

```bash
mvn -Pcpu \
  -Dlibnd4j.buildthreads=$(nproc) \
  -Dlibnd4j.helper=onednn \
  -Dlibnd4j.extension=avx2 \
  -pl libnd4j,:nd4j-cpu-backend-common,:nd4j-native \
  clean install -DskipTests
```

To replicate a CUDA build with cuDNN:

```bash
mvn -Pcuda \
  -Dlibnd4j.chip=cuda \
  -Dlibnd4j.helper=cudnn \
  -Dlibnd4j.buildthreads=$(nproc) \
  -pl libnd4j,:nd4j-cuda-12.9 \
  clean install -DskipTests
```

The key difference: CI uses sccache with remote caching and builds all matrix combinations. Locally you only need to build the one combination you're working on.

***

### Reporting Issues

File bugs and feature requests at [github.com/deeplearning4j/deeplearning4j/issues](https://github.com/deeplearning4j/deeplearning4j/issues).

A useful bug report includes:

* DL4J version (or commit hash if built from source)
* Java version and OS
* Backend (CPU or CUDA, and CUDA toolkit version)
* A minimal reproducible example
* Full stack trace
* Expected vs. actual behavior

For example-specific bugs, use [github.com/eclipse/deeplearning4j-examples/issues](https://github.com/eclipse/deeplearning4j-examples/issues).

***

### Community

| Channel                                                                            | Purpose                       |
| ---------------------------------------------------------------------------------- | ----------------------------- |
| [GitHub Issues](https://github.com/deeplearning4j/deeplearning4j/issues)           | Bug reports, feature requests |
| [GitHub Discussions](https://github.com/deeplearning4j/deeplearning4j/discussions) | Questions, design discussions |