> For the complete documentation index, see [llms.txt](https://deeplearning4j.konduit.ai/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://deeplearning4j.konduit.ai/en-1.0.0-rewrite/deeplearning4j/contributing.md).

# Contributing

Contributions to Eclipse Deeplearning4j are welcome. This guide covers the full contributor workflow: legal requirements, build process, project architecture, how to add new ops or examples, and how to get a pull request merged.

### Eclipse Contributor Agreement (ECA)

Deeplearning4j is an Eclipse Foundation project. Before your first pull request can be merged, you must sign the **Eclipse Contributor Agreement**:

1. Create an account at [accounts.eclipse.org](https://accounts.eclipse.org/user/register).
2. Sign the ECA at [accounts.eclipse.org/user/eca](https://accounts.eclipse.org/user/eca).
3. **The email on your Eclipse account must exactly match the email on your GitHub account.** This is how the automated check identifies you.

The ECA incorporates the Developer Certificate of Origin (DCO) v1.1. By signing, you certify that your contributions are your own (or that you have the right to submit them) and grant Eclipse a non-exclusive, perpetual license. You retain copyright. One signature covers all your contributions: the ECA is valid for 3 years, after which it can be re-signed.

An Eclipse bot automatically checks every pull request. If your ECA is missing or your email doesn't match, the bot will comment with instructions.

**ECA FAQ:** [eclipse.org/legal/eca/faq](https://www.eclipse.org/legal/eca/faq/)

***

### Repository Structure

All DL4J libraries live in a single monorepo at [github.com/deeplearning4j/deeplearning4j](https://github.com/deeplearning4j/deeplearning4j).

#### Maven modules

| Module           | What it is                                                                                                                                  |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| `libnd4j`        | C++ native compute engine. All ops, kernels, DSP execution engine, graph backends. Built with CMake, invoked through Maven via JavaCPP.     |
| `nd4j`           | Java ND4J API, SameDiff autodiff, backend bindings (CPU, CUDA), ONNX import, GGML import, tokenizers, DSP runtime SDK                       |
| `deeplearning4j` | High-level DL4J layers (`MultiLayerNetwork`, `ComputationGraph`), Keras import, LLM/VLM pipelines, PEFT, RL alignment trainers, training UI |
| `datavec`        | Data pipeline — record readers, transforms, schema, serialization                                                                           |
| `python4j`       | Embedded CPython execution from the JVM                                                                                                     |
| `omnihub`        | Model hub — `AutoModel.fromPretrained()`, format auto-detection                                                                             |
| `codegen`        | Op code generation from op descriptors                                                                                                      |
| `platform-tests` | **All tests live here.** Tests are never placed in the modules being tested.                                                                |
| `resources`      | Shared test resources                                                                                                                       |

#### Key directories inside `libnd4j`

| Directory                          | Contents                                                                                                                     |
| ---------------------------------- | ---------------------------------------------------------------------------------------------------------------------------- |
| `include/ops/declarable/generic/`  | Op implementations (C++ templates, one file per op or per op group)                                                          |
| `include/ops/declarable/platform/` | Platform-specific op implementations: `mkldnn/` (oneDNN), `armcompute/` (ARM ACL), `accelerate/` (Apple), `mlir/` (MLIR JIT) |
| `include/ops/declarable/headers/`  | Op header declarations                                                                                                       |
| `include/graph/`                   | DSP execution engine, graph backends, plan compiler                                                                          |
| `include/array/`                   | `NDArray` C++ implementation                                                                                                 |
| `include/system/`                  | Platform macros (`SD_HOST`, `SD_DEVICE`, `SD_INLINE`), `platform_boilerplate.h`                                              |
| `include/helpers/`                 | BLAS helpers, `MmulHelper`, `LoopKind`                                                                                       |
| `include/loops/`                   | Kernel loop implementations (transform, reduce, broadcast, etc.)                                                             |

#### Companion repositories

| Repository                                                                    | What it is                                                                      |
| ----------------------------------------------------------------------------- | ------------------------------------------------------------------------------- |
| [deeplearning4j-examples](https://github.com/eclipse/deeplearning4j-examples) | Runnable example programs — see [Contributing Examples](#contributing-examples) |
| [deeplearning4j-docs](https://github.com/KonduitAI/deeplearning4j-docs)       | This documentation site (GitBook)                                               |

***

### Build Process

#### Prerequisites

* **JDK 11+** (JDK 17 recommended)
* **Maven 3.6.3+**
* **CMake 3.19+** and a C++17-capable compiler (GCC 9+, Clang 12+, MSVC 2019+)
* **ccache** — essential for iterative development. First native build: 30–45 minutes. With ccache, subsequent builds after small changes: \~30 seconds.
* **CUDA toolkit 12.9** (for GPU builds) + compatible NVIDIA driver (525.60+)
* **Project Lombok** IDE plugin — without it your IDE will show false compilation errors

#### CPU build

```bash
mvn -Pcpu \
  -Dlibnd4j.buildthreads=$(nproc) \
  -pl libnd4j,:nd4j-cpu-backend-common,:nd4j-native \
  clean install -DskipTests
```

#### CUDA build

```bash
mvn -Pcuda \
  -Dlibnd4j.chip=cuda \
  -Dlibnd4j.buildthreads=$(nproc) \
  -pl libnd4j,:nd4j-cuda-12.9 \
  clean install -DskipTests
```

To enable Triton JIT compilation (for the `-compile` classifier):

```bash
mvn -Pcuda \
  -Dlibnd4j.chip=cuda \
  -Dlibnd4j.triton=ON \
  -Dlibnd4j.buildthreads=$(nproc) \
  -pl libnd4j,:nd4j-cuda-12.9 \
  clean install -DskipTests
```

#### Java-only module build (no native compilation)

If you're only changing Java code and the native library is already built:

```bash
mvn install -DskipTests -pl <module>
```

#### Build rules

* Always use `install`, never just `compile` — downstream modules need the JAR in your local Maven repo.
* If building C++, always rebuild the Java bindings too (both `libnd4j` AND the backend module).
* Never invoke `make` directly — it skips Java binding regeneration and produces mismatched artifacts.
* **ccache is critical.** Never run `ccache -C` or `ccache --clear`. If you suspect stale results, touch the specific source file to force recompilation of just that file.

#### Building for a different CUDA version

The default CUDA version is 12.9. To target a different version, override the `cuda.version` Maven property and select the matching backend module:

```bash
mvn -Pcuda -Dcuda.version=12.6 -Dlibnd4j.chip=cuda \
  -pl libnd4j,:nd4j-cuda-12.6 \
  clean install -DskipTests
```

***

### Platform Tests

**All tests live in `platform-tests/`.** Tests are never placed in the modules being tested — this is a hard project rule.

#### Why tests are centralized

The individual library modules (`nd4j/`, `deeplearning4j/`, `datavec/`) declare only compile-time dependencies and do not include a concrete backend. `platform-tests` is the single place where:

1. A concrete backend (`nd4j-native` or `nd4j-cuda`) is declared as a dependency, making execution possible.
2. Surefire is configured with the memory sizes, JVM flags, and native library hooks needed for testing.
3. JUnit 5 extensions enforce backend-appropriate test selection automatically.
4. The Maven Shade plugin builds a self-contained uber-JAR for benchmark/profiling runs outside Maven.

The root `pom.xml` does **not** include `platform-tests` by default. CI workflows `cd` into the `platform-tests` directory and run `mvn test` there directly.

#### Running tests

Always run from the `platform-tests` directory:

```bash
cd platform-tests

# Run a single test class
mvn test -Dtest=MyTestClass

# Run a single test method
mvn test -Dtest=MyTestClass#myTestMethod

# Run a parameterized test method (requires trailing wildcard)
mvn test -Dtest=MyTestClass#myParameterizedMethod*
```

**Never run `mvn test` from the project root** — it triggers full native rebuilds and runs every test suite, which takes hours.

#### Backend selection

Backend selection is entirely Maven property-driven. Two properties control what backend your tests run against:

| Property              | Default       | What it does                                                  |
| --------------------- | ------------- | ------------------------------------------------------------- |
| `backend.artifactId`  | `nd4j-native` | Selects the ND4J backend JAR (CPU or CUDA)                    |
| `platform.classifier` | Auto-detected | Selects the native binary variant (e.g., `linux-x86_64-avx2`) |

```bash
# CPU (default)
mvn test -Dtest=MyTestClass

# CUDA
mvn test -Dtest=MyTestClass -Dbackend.artifactId=nd4j-cuda-12.9

# CPU with specific classifier
mvn test -Dtest=MyTestClass -Dplatform.classifier=linux-x86_64-onednn-avx2
```

Setting `backend.artifactId` also activates Maven profiles that set backend priority system properties. When `nd4j-native` is selected, `org.nd4j.cpu.priority=10000` and GPU priority is 0, ensuring the CPU backend wins even if CUDA is on the classpath (and vice versa for `nd4j-cuda`).

#### Memory and JVM configuration

`platform-tests` configures Surefire with properties that control JVM heap, off-heap memory, and garbage collection:

| Property            | Default | What it does                                               |
| ------------------- | ------- | ---------------------------------------------------------- |
| `test.heap.size`    | `32g`   | JVM `-Xmx` per Surefire fork                               |
| `test.offheap.size` | `32g`   | JavaCPP max off-heap bytes                                 |
| `test.nogc`         | `true`  | Disables ND4J array GC and JavaCPP pointer GC during tests |
| `surefire.forks`    | `1`     | Number of forked JVM processes                             |
| `surefire.threads`  | `1`     | Threads per fork                                           |

The CUDA profile (activated when `backend.artifactId` selects a CUDA backend, e.g. `-Dbackend.artifactId=nd4j-cuda-12.9`) automatically reduces the heap to `14g` and increases threads to `4`.

Override these for local runs if your machine has less memory:

```bash
mvn test -Dtest=MyTestClass -Dtest.heap.size=6g -Dtest.offheap.size=6g
```

Surefire also sets environment variables for deterministic behavior:

* `OMP_NUM_THREADS=1` — single-threaded OpenMP to avoid nondeterminism
* `OPENBLAS_CORETYPE=Haswell` — deterministic BLAS kernel selection
* `CUDA_LAUNCH_BLOCKING=1` — synchronous CUDA for debugging

#### Test organization

Tests are organized under `src/test/java/` (and `src/test/kotlin/` for import framework tests):

| Package                                                   | What it tests                                                  |
| --------------------------------------------------------- | -------------------------------------------------------------- |
| `org.eclipse.deeplearning4j.nd4j.*`                       | ND4J core: array ops, workspaces, datasets, shapes, data types |
| `org.eclipse.deeplearning4j.dl4jcore.*`                   | DL4J layers, training, gradient checks, model persistence      |
| `org.eclipse.deeplearning4j.frameworkimport.keras.*`      | Keras model import                                             |
| `org.eclipse.deeplearning4j.frameworkimport.onnx.*`       | ONNX import (Kotlin)                                           |
| `org.eclipse.deeplearning4j.frameworkimport.tensorflow.*` | TensorFlow import (Kotlin)                                     |
| `org.eclipse.deeplearning4j.integration.*`                | End-to-end integration tests                                   |
| `org.eclipse.deeplearning4j.longrunning.*`                | Long-running stress tests                                      |
| `org.eclipse.deeplearning4j.zoo.*`                        | Model zoo tests                                                |
| `org.datavec.*`                                           | DataVec: API, Arrow, Image, JDBC, Excel                        |
| `org.nd4j.*`                                              | Arrow serde, CUDA allocator, Python4J, TF-Lite                 |

#### Test tags

Tests use JUnit 5 tags (defined in `org.nd4j.common.tests.tags.TagNames`) for selective execution. Pass tags via Maven:

```bash
# Run only SameDiff tests
mvn test -Dtests=samediff

# Exclude long-running tests
mvn test -DexcludedTests="long-running-test,large-resources"
```

Common tags:

| Tag                 | What it selects                 |
| ------------------- | ------------------------------- |
| `samediff`          | SameDiff autodiff tests         |
| `training`          | Model training tests            |
| `onnx`              | ONNX import tests               |
| `keras`             | Keras import tests              |
| `tensorflow`        | TensorFlow import tests         |
| `dl4j-old-api`      | Legacy DL4J API tests           |
| `workspaces`        | Memory workspace tests          |
| `ndarray-indexing`  | Array indexing/slicing tests    |
| `long-running-test` | Tests that take minutes to run  |
| `large-resources`   | Tests that download large files |
| `downloads`         | Tests requiring network access  |
| `spark`             | Distributed training tests      |
| `python`            | Python4J bridge tests           |
| `multi-threaded`    | Concurrent tests                |

CI excludes `long-running-test`, `large-resources`, and `downloads` by default. The `BackendCheckerExtension` additionally disables `multi-threaded`, `spark`, and `python` when running on GPU.
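
The include/exclude behavior can be modeled in a few lines of plain Java. This is a simplified sketch, not the Surefire or JUnit implementation: the class `TagFilterSketch` and its method are hypothetical, and it assumes a test runs when it carries at least one included tag (or no include list is given) and none of the excluded tags.

```java
import java.util.Set;

// Hypothetical model of tag-based selection; not the actual Surefire/JUnit code.
final class TagFilterSketch {

    // A test is selected when it matches the include list (or none is given)
    // and carries none of the excluded tags.
    static boolean isSelected(Set<String> testTags, Set<String> included, Set<String> excluded) {
        boolean includeOk = included.isEmpty()
                || testTags.stream().anyMatch(included::contains);
        boolean excludeHit = testTags.stream().anyMatch(excluded::contains);
        return includeOk && !excludeHit;
    }

    public static void main(String[] args) {
        // The three tags the guide says CI excludes by default.
        Set<String> ciExcluded = Set.of("long-running-test", "large-resources", "downloads");
        Set<String> included = Set.of("samediff");

        // Tagged only with "samediff": selected.
        System.out.println(isSelected(Set.of("samediff"), included, ciExcluded));
        // Also tagged "downloads": filtered out by the exclusion list.
        System.out.println(isSelected(Set.of("samediff", "downloads"), included, ciExcluded));
    }
}
```

Under these assumptions, exclusion always takes precedence over inclusion, which is why a `samediff` test that also needs network access does not run in CI.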

#### JUnit 5 extensions

Three auto-registered extensions (via `META-INF/services`) manage test behavior:

| Extension                 | What it does                                                                                                                                                                                                        |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `BackendCheckerExtension` | Disables resource-heavy test tags when running on GPU. Checks `Nd4j.getEnvironment().isCPU()` and skips `large-resources`, `downloads`, `long-running-test`, `multi-threaded`, `spark`, and `python` tests on CUDA. |
| `TFGraphCheckerExtension` | Conditionally skips TensorFlow graph tests based on an allowlist. When `EXECUTE_ONLY_MODELS` is non-empty, only matching model tests run.                                                                           |
| `DeallocationExtension`   | Manages off-heap memory tracking between tests. Sets `CURRENT_TEST_*` system properties for allocation debugging.                                                                                                   |

#### Base test classes

Most tests extend one of these base classes (from the `nd4j-common-tests` and `deeplearning4j-common-tests` modules):

| Base class                 | Used by                  | What it does                                                                                                                                                |
| -------------------------- | ------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `BaseND4JTest`             | ND4J tests               | Sets profiling mode, default data types, thread count. After each test: destroys workspaces, checks for workspace leaks (exits on leak), logs memory stats. |
| `BaseNd4jTestWithBackends` | Parameterized ND4J tests | Extends `BaseND4JTest`. Adds backend parameterization via `@MethodSource("configs")` — tests run once per available backend.                                |
| `BaseDL4JTest`             | DL4J tests               | Configures profiling, data types, thread count. Provides `skipUnlessIntegrationTests()` gated by `DL4J_INTEGRATION_TESTS` env var.                          |

#### Test scripts

`platform-tests/` includes convenience scripts:

| Script                    | What it runs                                                                         |
| ------------------------- | ------------------------------------------------------------------------------------ |
| `run-onnx-tests.sh`       | ONNX SameDiff import tests (`org.nd4j.samediff.frameworkimport.onnx.**`)             |
| `run-tensorflow-tests.sh` | TensorFlow SameDiff import tests (`org.nd4j.samediff.frameworkimport.tensorflow.**`) |
| `run-keras-tests.sh`      | Keras model import tests (`org.deeplearning4j.nn.modelimport.keras.**`)              |
| `run-benchmarks.sh`       | Standalone JUnit console launcher with optional valgrind/compute-sanitizer support   |
| `bootstrap-onnx.sh`       | Downloads \~65 ONNX Zoo models and converts them (not a test runner — data setup)    |

#### Benchmarking and profiling

The Maven Shade plugin builds a self-contained JAR (`platform-tests-1.0.0-SNAPSHOT-shaded.jar`) at package phase. This enables running tests outside Maven Surefire, which is useful for profiling with external tools:

```bash
# Build the shaded JAR
mvn package -DskipTests

# Run a specific test class with the standalone JUnit launcher
java -cp junit-platform-console-standalone.jar \
  org.junit.platform.console.ConsoleLauncher \
  -cp=target/platform-tests-1.0.0-SNAPSHOT-shaded.jar \
  -c=org.eclipse.deeplearning4j.dl4jcore.gradientcheck.CNN1DGradientCheckTest
```

The `bin/java` wrapper script in `platform-tests/` is the injection point for memory analysis tools. Surefire's `<jvm>` config points to this wrapper instead of the system `java`. The wrapper reads `TEST_RUNNER_PREFIX` from the environment:

* **Valgrind**: Generates suppression files for libjvm.so, adds `--track-origins=yes --error-limit=no`
* **Compute-Sanitizer**: Adds `--tool=memcheck --report-api-errors all --show-backtrace yes` (for CUDA memory debugging)

```bash
# Run with valgrind
TEST_RUNNER_PREFIX=valgrind mvn test -Dtest=MyTestClass

# Run with CUDA compute-sanitizer
TEST_RUNNER_PREFIX=compute-sanitizer mvn test -Dtest=MyTestClass \
  -Dbackend.artifactId=nd4j-cuda-12.9
```

#### Test resources

Many tests require pre-trained model files and test fixtures from the external `dl4j-test-resources` artifact (`org.deeplearning4j:dl4j-test-resources`). This must be installed in your local Maven repo before those tests will pass. CI workflows fetch it automatically; for local development, clone and install from [KonduitAI/dl4j-test-resources](https://github.com/KonduitAI/dl4j-test-resources).

#### Numerical gradient checks

Any new layer, loss function, or custom op with a backward pass must pass a numerical gradient check:

```java
boolean passed = GradientCheckUtil.checkGradients(
    new GradientCheckUtil.MLNConfig()
        .net(net)
        .input(input)
        .labels(labels));
assertTrue(passed, "Gradient check failed");
```

Gradient checks confirm that analytic (backprop) gradients match finite-difference numerical gradients. A failing gradient check means there is a bug in the backward pass.
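
To make the comparison concrete, here is a minimal, dependency-free sketch of a finite-difference check on a toy function. It does not use `GradientCheckUtil`; the class `FiniteDifferenceSketch`, the step size, and the relative-error formula are illustrative assumptions rather than the values DL4J uses.

```java
import java.util.function.Function;

// Hypothetical stand-alone illustration of a numerical gradient check.
final class FiniteDifferenceSketch {

    // Central difference: (f(w + eps) - f(w - eps)) / (2 * eps) for each coordinate.
    static double[] numericalGradient(Function<double[], Double> f, double[] w, double eps) {
        double[] grad = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            double original = w[i];
            w[i] = original + eps;
            double plus = f.apply(w);
            w[i] = original - eps;
            double minus = f.apply(w);
            w[i] = original; // restore before moving to the next coordinate
            grad[i] = (plus - minus) / (2 * eps);
        }
        return grad;
    }

    // Largest relative error between the two gradients; 0 when both entries are 0.
    static double maxRelativeError(double[] analytic, double[] numeric) {
        double max = 0.0;
        for (int i = 0; i < analytic.length; i++) {
            double denom = Math.abs(analytic[i]) + Math.abs(numeric[i]);
            double err = denom == 0.0 ? 0.0 : Math.abs(analytic[i] - numeric[i]) / denom;
            max = Math.max(max, err);
        }
        return max;
    }

    public static void main(String[] args) {
        // f(w) = sum(w_i^2), whose analytic gradient is 2 * w.
        Function<double[], Double> f = w -> {
            double s = 0.0;
            for (double v : w) s += v * v;
            return s;
        };
        double[] w = {0.5, -1.25, 2.0};
        double[] analytic = {2 * w[0], 2 * w[1], 2 * w[2]};
        double[] numeric = numericalGradient(f, w, 1e-6);
        System.out.println("max relative error: " + maxRelativeError(analytic, numeric));
    }
}
```

A correct backward pass yields a relative error close to zero; a sign error or a missing term in the analytic gradient pushes it toward 1.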

***

### How Backends Work

Understanding the backend architecture is essential before contributing ops or backend-specific code.

#### Backend discovery (Java SPI)

ND4J uses Java's `ServiceLoader` to discover backends at runtime. Each backend JAR ships a `META-INF/services/org.nd4j.linalg.factory.Nd4jBackend` file naming its implementation class:

* **CPU**: `org.nd4j.linalg.cpu.nativecpu.CpuBackend` (in `nd4j-native`)
* **CUDA**: `org.nd4j.linalg.jcublas.JCublasBackend` (in `nd4j-cuda-12.9`)

At startup, `Nd4jBackend.load()` collects all backends via `ServiceLoader`, sorts them by priority (configurable via the system properties `nd4j.backend.priorityCPU` / `nd4j.backend.priorityGPU`), and calls `isAvailable()` on each in order. The first available backend wins. In practice, when both backends are on the classpath and CUDA sorts ahead of CPU, CUDA wins if GPUs are present because `JCublasBackend` calls `cudaGetDeviceCount` and succeeds; `CpuBackend.isAvailable()` always returns true, so the CPU backend serves as the fallback.
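
The selection loop can be sketched without any ND4J dependency. The class below is a hypothetical model, not the code in `Nd4jBackend`: the `Backend` type, the priority numbers, and the assumption that a higher number sorts first are all illustrative.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical model of "sort by priority, take the first available backend".
final class BackendSelectionSketch {

    // Minimal stand-in for a backend: a name, a priority, and an availability flag.
    static final class Backend {
        final String name;
        final int priority;
        final boolean available;

        Backend(String name, int priority, boolean available) {
            this.name = name;
            this.priority = priority;
            this.available = available;
        }
    }

    // Sort by descending priority (assumed ordering), then return the first
    // backend that reports itself available.
    static Optional<Backend> select(List<Backend> discovered) {
        List<Backend> sorted = new ArrayList<>(discovered);
        sorted.sort(Comparator.comparingInt((Backend b) -> b.priority).reversed());
        return sorted.stream().filter(b -> b.available).findFirst();
    }

    public static void main(String[] args) {
        // Illustrative priorities only; the CPU backend is always available.
        Backend cpu = new Backend("cpu", 0, true);
        Backend cudaWithGpu = new Backend("cuda", 100, true);
        Backend cudaNoGpu = new Backend("cuda", 100, false);

        System.out.println(select(List.of(cpu, cudaWithGpu)).get().name); // cuda
        System.out.println(select(List.of(cpu, cudaNoGpu)).get().name);   // cpu
    }
}
```

In this model, overriding a priority changes only the sort order; availability is still checked afterwards, so an unavailable backend never wins regardless of its priority.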

#### Initialization chain

```
Nd4j (static init)
  → Nd4jBackend.load()          // SPI discovery, priority sort
  → initWithBackend(backend)    // reads backend .properties file
    → OpExecutioner             // NativeOpExecutioner (CPU) or CudaExecutioner (CUDA)
    → NativeOpsHolder           // loads JNI bridge: Nd4jCpu or Nd4jCuda
      → NativeOps               // JNI interface to libnd4j C++
        → libnd4j native code   // actual kernel execution
```

Each backend defines its classes in a properties file (`nd4j-native.properties` or `nd4j-jcublas.properties`). `Nd4j.initWithBackend()` reflectively instantiates:

* `opexec` → the `OpExecutioner` implementation
* `native.ops` → the `NativeOps` JNI bridge class

`NativeOpExecutioner` delegates every op call (`execReduceFloat`, `execScalar`, `execCustomOp`) through `NativeOps` to the C++ shared library.
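
The reflective step can be illustrated with plain JDK classes. This is a sketch under stated assumptions, not the body of `Nd4j.initWithBackend()`: only the property key `opexec` comes from the text above, while `ReflectiveInitSketch`, its nested `OpExecutioner` interface, and `FakeCpuExecutioner` are hypothetical stand-ins.

```java
import java.util.Properties;

// Hypothetical model of "read a class name from properties, instantiate it reflectively".
final class ReflectiveInitSketch {

    // Stand-in for the executioner interface; not the real ND4J type.
    public interface OpExecutioner {
        String describe();
    }

    // Stand-in for a backend-specific implementation named in a properties file.
    public static final class FakeCpuExecutioner implements OpExecutioner {
        public FakeCpuExecutioner() {}

        @Override
        public String describe() {
            return "cpu-executioner";
        }
    }

    // Look up a class name under `key` and create an instance via its no-arg constructor.
    static <T> T instantiate(Properties props, String key, Class<T> type) {
        String className = props.getProperty(key);
        if (className == null) {
            throw new IllegalStateException("Missing property: " + key);
        }
        try {
            return Class.forName(className)
                    .asSubclass(type)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        // Equivalent to a properties file containing: opexec=<implementation class name>
        Properties props = new Properties();
        props.setProperty("opexec", FakeCpuExecutioner.class.getName());

        OpExecutioner exec = instantiate(props, "opexec", OpExecutioner.class);
        System.out.println(exec.describe());
    }
}
```

Because the implementation class is named in configuration rather than referenced in code, the core API can compile without depending on any concrete backend.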

#### Platform helper dispatch (C++ side)

Platform-specific op implementations (oneDNN, cuDNN, ACL, Apple Accelerate) plug in entirely at the C++ level. Java has no role in this dispatch.

The `PLATFORM_IMPL(op_name, ENGINE)` macro (in `libnd4j/include/system/platform_boilerplate.h`) uses a static struct initializer to auto-register the helper with `OpRegistrator` when the shared library loads. At op execution time, `OpRegistrator::getPlatformHelper(hash, engine)` looks up registered helpers. If `isUsable(context)` returns true (correct dtypes, shapes, library available), `invokeHelper(context)` runs the accelerated implementation instead of the generic kernel.

Key C++ files:

* `libnd4j/include/ops/declarable/PlatformHelper.h` — base class
* `libnd4j/include/system/platform_boilerplate.h` — `PLATFORM_IMPL` / `PLATFORM_CHECK` macros
* `libnd4j/include/ops/declarable/OpRegistrator.h` — registry
* `libnd4j/include/execution/Engine.h` — engine enum (`ENGINE_CPU=0`, `ENGINE_CUDA=1`, etc.)
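
The lookup-then-fallback logic described above can be modeled conceptually. The real mechanism is C++ inside `libnd4j`; the Java class below is only an illustration, and every name in it (`PlatformDispatchSketch`, `Helper`, `Context`, the string key) is hypothetical. In particular, the real registry keys on an op hash rather than a name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Conceptual Java model of the C++ platform-helper dispatch; all names are hypothetical.
final class PlatformDispatchSketch {

    enum Engine { CPU, CUDA }

    // Stand-in for the op context: only what the usability check inspects.
    static final class Context {
        final String dataType;
        final int rank;

        Context(String dataType, int rank) {
            this.dataType = dataType;
            this.rank = rank;
        }
    }

    // A helper pairs a usability check with a label for the accelerated path.
    static final class Helper {
        final Predicate<Context> isUsable;
        final String label;

        Helper(Predicate<Context> isUsable, String label) {
            this.isUsable = isUsable;
            this.label = label;
        }
    }

    private final Map<String, Helper> helpers = new HashMap<>();

    // A string key keeps the sketch readable; the real registry uses an op hash.
    void register(String opName, Engine engine, Helper helper) {
        helpers.put(opName + "@" + engine, helper);
    }

    // Returns which implementation would run for this op, engine, and context.
    String dispatch(String opName, Engine engine, Context ctx) {
        Helper helper = helpers.get(opName + "@" + engine);
        if (helper != null && helper.isUsable.test(ctx)) {
            return helper.label;
        }
        return "generic";
    }

    public static void main(String[] args) {
        PlatformDispatchSketch registry = new PlatformDispatchSketch();
        registry.register("my_new_op", Engine.CPU,
                new Helper(c -> c.dataType.equals("FLOAT32") && c.rank == 4, "onednn"));

        // Usable: the accelerated helper runs.
        System.out.println(registry.dispatch("my_new_op", Engine.CPU, new Context("FLOAT32", 4)));
        // Not usable (rank 2): falls back to the generic kernel.
        System.out.println(registry.dispatch("my_new_op", Engine.CPU, new Context("FLOAT32", 2)));
    }
}
```

The point of the design is that an unusable or missing helper is never an error: the generic kernel remains the default, and helpers only replace it when their check passes.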

***

### Op Codegen and SameDiff Namespaces

Ops are not hand-written Java classes. Their Java API is **code-generated** by a two-phase pipeline, and understanding that pipeline is essential for adding new ops.

#### Phase 1: C++ → Protobuf IR (`libnd4j-gen`)

The `codegen/libnd4j-gen` module scans C++ op source files and extracts argument signatures.

**What it scans:** All files in `libnd4j/include/ops/` containing op declaration macros:

* `CUSTOM_OP_IMPL(NAME, NIN, NOUT, INPLACEABLE, TARGS, IARGS)`
* `OP_IMPL`, `REDUCTION_OP_IMPL`, `BROADCASTABLE_OP_IMPL`
* `BOOLEAN_OP_IMPL`, `LIST_OP_IMPL`, `CONFIGURABLE_OP_IMPL`, `DIVERGENT_OP_IMPL`

**Entry point:** `ParseOpFile.java`, run via `codegen/libnd4j-gen/generate.sh`

**Output:** A protobuf text-format file (`op-ir.proto`) describing every op's argument names, types, and counts. The compiled proto class `OpNamespace.java` lives in `nd4j-api`. A bundled snapshot is stored at `nd4j/nd4j-backends/nd4j-api-parent/nd4j-api/src/main/resources/ops.proto`.

This IR is used at runtime for ONNX and TensorFlow import op mapping.
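
As a rough illustration of what "extracting argument signatures" involves, the sketch below pulls the six `CUSTOM_OP_IMPL` arguments out of C++ source text with a regular expression. It is not how `ParseOpFile.java` is implemented; the class `OpMacroScanSketch`, its `OpSignature` type, and the regex are assumptions made for this example, and it handles only the `CUSTOM_OP_IMPL` macro with literal arguments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based scanner; the real libnd4j-gen parser is more involved.
final class OpMacroScanSketch {

    // Fields follow the macro order: NAME, NIN, NOUT, INPLACEABLE, TARGS, IARGS.
    static final class OpSignature {
        final String name;
        final int numInputs;
        final int numOutputs;
        final boolean inplaceable;
        final int numTArgs;
        final int numIArgs;

        OpSignature(String name, int numInputs, int numOutputs,
                    boolean inplaceable, int numTArgs, int numIArgs) {
            this.name = name;
            this.numInputs = numInputs;
            this.numOutputs = numOutputs;
            this.inplaceable = inplaceable;
            this.numTArgs = numTArgs;
            this.numIArgs = numIArgs;
        }
    }

    // Matches CUSTOM_OP_IMPL(name, int, int, true|false, int, int) with optional whitespace.
    private static final Pattern CUSTOM_OP = Pattern.compile(
            "\\bCUSTOM_OP_IMPL\\(\\s*(\\w+)\\s*,\\s*(-?\\d+)\\s*,\\s*(-?\\d+)\\s*,"
            + "\\s*(true|false)\\s*,\\s*(-?\\d+)\\s*,\\s*(-?\\d+)\\s*\\)");

    // Collects one signature per macro occurrence, in source order.
    static List<OpSignature> scan(String cppSource) {
        List<OpSignature> found = new ArrayList<>();
        Matcher m = CUSTOM_OP.matcher(cppSource);
        while (m.find()) {
            found.add(new OpSignature(
                    m.group(1),
                    Integer.parseInt(m.group(2)),
                    Integer.parseInt(m.group(3)),
                    Boolean.parseBoolean(m.group(4)),
                    Integer.parseInt(m.group(5)),
                    Integer.parseInt(m.group(6))));
        }
        return found;
    }

    public static void main(String[] args) {
        String source = "CUSTOM_OP_IMPL(my_new_op, 2, 1, false, 0, 0) {\n  // ...\n}\n";
        for (OpSignature op : scan(source)) {
            System.out.println(op.name + ": " + op.numInputs + " in, " + op.numOutputs + " out");
        }
    }
}
```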

#### Phase 2: Kotlin DSL → Java namespace classes (`op-codegen`)

The `codegen/op-codegen` module generates the Java API surface from a Kotlin DSL.

**Descriptor files:** One Kotlin file per namespace in `codegen/op-codegen/src/main/ops/org/nd4j/codegen/ops/`:

```
Math.kt          NeuralNetwork.kt    CNN.kt           RNN.kt
Random.kt        Linalg.kt           Bitwise.kt       Image.kt
SDBaseOps.kt     SDLoss.kt           Signal.kt        Audio.kt
Training.kt
```

**Example entry** (from `Math.kt`):

```kotlin
Op("abs", transformSame) {
    javaOpClass = "Abs"
    Doc(Language.ANY, DocScope.ALL) {
        "Elementwise absolute value operation: out = abs(x)"
    }
    Input(NUMERIC, "x") { description = "Input variable" }
    Output(NUMERIC, "output") { description = "Output variable" }
}
```

**Generator:** `Nd4jNamespaceGenerator.java` uses JavaPoet to emit `.java` source files.

**Entry point:** `CLI.java -dir <repo_root> -namespaces ALL -projects all`

**Output:** Generated Java classes in `nd4j/nd4j-backends/nd4j-api-parent/nd4j-api/src/main/java/`:

| Namespace      | SameDiff class (`sd.math()`, etc.) | ND4J class (`Nd4j.math()`, etc.) |
| -------------- | ---------------------------------- | -------------------------------- |
| Math           | `SDMath`                           | `NDMath`                         |
| Neural Network | `SDNN`                             | `NDNN`                           |
| CNN            | `SDCNN`                            | `NDCNN`                          |
| RNN            | `SDRNN`                            | `NDRNN`                          |
| Random         | `SDRandom`                         | `NDRandom`                       |
| Linear Algebra | `SDLinalg`                         | `NDLinalg`                       |
| Bitwise        | `SDBitwise`                        | `NDBitwise`                      |
| Image          | `SDImage`                          | `NDImage`                        |
| Base Ops       | `SDBaseOps`                        | `NDBase`                         |
| Loss           | `SDLoss`                           | `NDLoss`                         |
| Signal         | `SDSignal`                         | `NDSignal`                       |
| Audio          | `SDAudio`                          | `NDAudio`                        |
| Training       | `SDTraining`                       | `NDTraining`                     |

Users access ops through these namespaces:

```java
SameDiff sd = SameDiff.create();
SDVariable x = sd.var("x", Nd4j.rand(3, 3));  // square, so the determinant below is defined

// sd.math() → SDMath
SDVariable abs = sd.math().abs(x);

// sd.nn() → SDNN
SDVariable relu = sd.nn().relu(x, 0);

// sd.linalg() → SDLinalg
SDVariable det = sd.linalg().det(x);

// sd.audio() → SDAudio
SDVariable mel = sd.audio().melSpectrogram(x, 16000, 512, 256, 80);
```

**Do not edit `SD*.java` or `ND*.java` files directly** — they are generated and will be overwritten. Edit the Kotlin DSL in `codegen/op-codegen/src/main/ops/` instead.

***

### Contributing New Ops

Adding a new native op is a multi-step process that spans C++, codegen, and Java. Here is the full end-to-end flow.

#### Step 1: C++ implementation (libnd4j)

Create the op in `libnd4j/include/ops/declarable/generic/` under the appropriate subdirectory (`nn/`, `transforms/`, `reduce/`, `linalg/`, etc.):

```cpp
// libnd4j/include/ops/declarable/generic/nn/my_new_op.cpp
#include <ops/declarable/CustomOperations.h>

namespace sd {
namespace ops {

CUSTOM_OP_IMPL(my_new_op, 2, 1, false, 0, 0) {
    auto input = INPUT_VARIABLE(0);
    auto weights = INPUT_VARIABLE(1);
    auto output = OUTPUT_VARIABLE(0);

    // Implementation
    // ...

    return sd::Status::OK;
}

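DECLARE_SHAPE_FN(my_new_op) {
    auto inShape = inputShape->at(0);
    auto wShape = inputShape->at(1);
    // Example shape rule (matmul-like): [rows, k] x [k, cols] -> [rows, cols]
    auto outRows = shape::sizeAt(inShape, 0);
    auto outCols = shape::sizeAt(wShape, 1);
    // Propagate the input data type instead of hard-coding FLOAT32
    return SHAPELIST(ConstantShapeHelper::getInstance().createShapeInfo(
        ArrayOptions::dataType(inShape), 'c', {outRows, outCols}));
}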

DECLARE_TYPES(my_new_op) {
    getOpDescriptor()
        ->setAllowedInputTypes({ALL_FLOATS})
        ->setAllowedOutputTypes({ALL_FLOATS});
}

}  // namespace ops
}  // namespace sd
```

The macro arguments to `CUSTOM_OP_IMPL` are: `(name, numInputs, numOutputs, inPlaceable, numTArgs, numIArgs)`.

If the op needs a backward pass for training, also implement `my_new_op_bp`:

```cpp
CUSTOM_OP_IMPL(my_new_op_bp, 3, 2, false, 0, 0) {
    auto input = INPUT_VARIABLE(0);
    auto weights = INPUT_VARIABLE(1);
    auto gradOut = INPUT_VARIABLE(2);  // gradient from upstream
    auto gradInput = OUTPUT_VARIABLE(0);
    auto gradWeights = OUTPUT_VARIABLE(1);

    // Backward pass implementation
    return sd::Status::OK;
}
```

#### Step 2: Platform-specific implementations (optional)

For performance-critical ops, add accelerated implementations using the `PLATFORM_IMPL` macro. These are auto-registered at library load time — no Java-side wiring needed:

```cpp
// libnd4j/include/ops/declarable/platform/mkldnn/my_new_op_mkldnn.cpp
#include <ops/declarable/PlatformHelper.h>

namespace sd {
namespace ops {
namespace platforms {

PLATFORM_IMPL(my_new_op, ENGINE_CPU) {
    // oneDNN-optimized implementation
    return sd::Status::OK;
}

PLATFORM_CHECK(my_new_op, ENGINE_CPU) {
    auto input = INPUT_VARIABLE(0);
    return input->dataType() == DataType::FLOAT32
        && input->rankOf() == 4;  // only handle 4D inputs
}

}  // namespace platforms
}  // namespace ops
}  // namespace sd
```

Available engines for `PLATFORM_IMPL`:

| Engine constant                | Library             | Platform directory     |
| ------------------------------ | ------------------- | ---------------------- |
| `ENGINE_CPU` / `ENGINE_ONEDNN` | Intel oneDNN        | `platform/mkldnn/`     |
| `ENGINE_CUDA`                  | NVIDIA cuDNN        | `platform/cudnn/`      |
| `ENGINE_ARM`                   | ARM Compute Library | `platform/armcompute/` |
| `ENGINE_ACCELERATE`            | Apple Accelerate    | `platform/accelerate/` |
| `ENGINE_MPS`                   | Apple Metal         | `platform/mps/`        |

#### Step 3: Register launch dimensions (CUDA ops)

If the op runs on CUDA, register its launch configuration in `include/system/LaunchDims.h` and `LaunchDims.cu`.

#### Step 4: Regenerate the op IR

Run the libnd4j-gen scanner to pick up the new op's argument signature:

```bash
cd codegen/libnd4j-gen
bash generate.sh /path/to/libnd4j
```

This updates `op-ir.proto` with the new op's descriptor. The IR is used for ONNX/TF import mapping.

#### Step 5: Add to the Kotlin codegen DSL

Add the op to the appropriate Kotlin file in `codegen/op-codegen/src/main/ops/org/nd4j/codegen/ops/`. For a neural network op, add it to `NeuralNetwork.kt`:

```kotlin
Op("my_new_op") {
    javaOpClass = "MyNewOp"
    Doc(Language.ANY, DocScope.ALL) {
        "Description of what my_new_op does."
    }
    Input(NUMERIC, "input") { description = "Input tensor" }
    Input(NUMERIC, "weights") { description = "Weight tensor" }
    Output(NUMERIC, "output") { description = "Output tensor" }
}
```

#### Step 6: Run the code generator

```bash
cd codegen/op-codegen
# Regenerate all namespace classes
java -cp target/classes org.nd4j.codegen.cli.CLI \
  -dir /path/to/deeplearning4j -namespaces ALL -projects all
```

This regenerates `SDNN.java`, `NDNN.java` (or whichever namespace you added the op to) with your new op included. **Do not edit the generated files directly.**

#### Step 7: Test

Add a test in `platform-tests/`:

```java
@Test
void testMyNewOp() {
    SameDiff sd = SameDiff.create();
    SDVariable input = sd.var("input", Nd4j.rand(2, 3));
    SDVariable weights = sd.var("weights", Nd4j.rand(3, 4));
    SDVariable output = sd.nn().myNewOp(input, weights);

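    // Request the op's result by its actual variable name; op outputs get generated names by default
    Map<String, INDArray> result = sd.output(
        Collections.emptyMap(), output.name());
    INDArray out = result.get(output.name());
    assertNotNull(out);
    assertArrayEquals(new long[]{2, 4}, out.shape());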
}
```

For ops with backward passes, also add a gradient check test.
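
The idea behind a gradient check can be sketched without any DL4J dependency: compare the analytic backward pass against a central-difference estimate of the forward pass. The block below is a standalone illustration using a toy op, `f(x) = sum(x_i^2)`; the class and method names are hypothetical and it is not the project's gradient check utility, which you should use for real op tests.

```java
/** Standalone illustration of a central-difference gradient check (not the DL4J test utility). */
final class GradientCheckSketch {

    /** Forward pass of a toy op: f(x) = sum(x_i^2). */
    static double forward(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v * v;
        }
        return sum;
    }

    /** Analytic backward pass of the toy op: df/dx_i = 2 * x_i. */
    static double[] backward(double[] x) {
        double[] grad = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            grad[i] = 2.0 * x[i];
        }
        return grad;
    }

    /** Returns the largest relative error between the analytic and numerical gradients. */
    static double maxRelativeError(double[] x, double eps) {
        double[] analytic = backward(x);
        double maxErr = 0.0;
        for (int i = 0; i < x.length; i++) {
            double original = x[i];
            x[i] = original + eps;
            double plus = forward(x);
            x[i] = original - eps;
            double minus = forward(x);
            x[i] = original; // restore the perturbed element
            double numeric = (plus - minus) / (2.0 * eps);
            double denom = Math.max(Math.abs(analytic[i]) + Math.abs(numeric), 1e-12);
            maxErr = Math.max(maxErr, Math.abs(analytic[i] - numeric) / denom);
        }
        return maxErr;
    }

    public static void main(String[] args) {
        double[] x = {0.5, -1.25, 2.0};
        double err = maxRelativeError(x, 1e-6);
        System.out.println("max relative error = " + err);
        if (err > 1e-5) {
            throw new AssertionError("gradient check failed");
        }
    }
}
```

The tolerance (`1e-5`) and step size (`1e-6`) are illustrative choices for double precision; float ops generally need a larger step and a looser tolerance.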

***

### Contributing Examples

Examples live in a separate repository: [github.com/eclipse/deeplearning4j-examples](https://github.com/eclipse/deeplearning4j-examples).

#### Repository structure

Each sub-project is a self-contained Maven project (no aggregate root POM):

| Module                               | Focus                                             |
| ------------------------------------ | ------------------------------------------------- |
| `dl4j-examples`                      | DL4J neural network examples                      |
| `samediff-examples`                  | SameDiff, DSP, LLM generation, PEFT, RL alignment |
| `nd4j-ndarray-examples`              | ND4J array operations                             |
| `data-pipeline-examples`             | DataVec ETL examples                              |
| `onnx-import-examples`               | ONNX and GGML model import, OmniHub               |
| `tensorflow-keras-import-examples`   | TensorFlow/Keras import                           |
| `dl4j-distributed-training-examples` | Spark distributed training                        |
| `android-examples`                   | Android deployment                                |
| `mvn-project-template`               | Minimal starter template                          |

#### Example conventions

Each example is a standalone runnable Java class with a `public static void main(String[] args)` method. Follow the existing pattern:

```java
package org.deeplearning4j.examples.quickstart.modeling;

// ... imports ...

/**
 * Brief description of what the example demonstrates.
 *
 * Key concepts:
 * - First concept
 * - Second concept
 *
 * @author Your Name
 */
public class MyExample {

    public static void main(String[] args) throws Exception {
        // Example code — runnable as-is
    }
}
```

**Guidelines:**

* **Runnable.** The example must compile and run without modification, external data downloads, or special hardware (unless clearly documented at the top).
* **Self-contained.** All configuration, model building, and data loading happen within the `main` method or private helper methods in the same class.
* **Well-commented.** Explain what each section does and why. Examples are learning tools — clarity beats brevity.
* **No test classes.** Examples run directly via `main()`, not as JUnit tests.
* **Apache 2.0 header.** Include the standard Apache 2.0 license header at the top of every file.

#### Organization

Place your example in the appropriate sub-project and tier:

* `quickstart/` — beginner-friendly, demonstrates one concept clearly
  * `modeling/` — building and training models
  * `features/` — specific DL4J features (early stopping, UI, save/load)
  * `datapipeline/` — loading and transforming data
* `advanced/` — more complex, may combine multiple concepts
  * `modelling/` — attention, seq2seq, object detection, style transfer
  * `features/` — custom layers, transfer learning, advanced configuration

#### Submitting example PRs

1. Fork `deeplearning4j-examples`, create a branch.
2. Add your example in the appropriate module and tier.
3. Verify it compiles: `mvn compile` in the sub-project directory.
4. Verify it runs: `mvn exec:java -Dexec.mainClass="org.deeplearning4j.examples...."`.
5. Open a PR to `eclipse/deeplearning4j-examples:master`.

***

### C++ Conventions (libnd4j)

#### Required macros

Do not use raw CUDA/compiler annotations — use the project macros for cross-platform compatibility:

| Raw (banned)               | Use instead                                  |
| -------------------------- | -------------------------------------------- |
| `__host__`                 | `SD_HOST`                                    |
| `__device__`               | `SD_DEVICE`                                  |
| `__global__`               | `SD_KERNEL`                                  |
| `__host__ __device__`      | `SD_HOST_DEVICE`                             |
| `__forceinline__`          | `SD_INLINE`                                  |
| `#pragma omp parallel for` | `PRAGMA_OMP_PARALLEL_FOR` and related macros |

#### Memory management

libnd4j uses raw pointers with manual `delete`. Do not use `std::unique_ptr` or `std::shared_ptr` — they are not used in the codebase and cause ownership confusion with the existing raw-pointer APIs.

#### Deprecated patterns

Do not use `ews()` / `elementWiseStride` — it is deprecated and returns incorrect results for views and non-contiguous arrays. Use stride checks and shape descriptor utilities instead.
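
To see why a single element-wise stride cannot describe a view, it helps to check the strides directly. The sketch below is language-neutral arithmetic written in Java to match the test example on this page; `isCContiguous` is a hypothetical helper, not a libnd4j function, and it assumes strides are expressed in elements rather than bytes.

```java
/** Illustrates a per-dimension stride check; not part of libnd4j. */
final class StrideCheckSketch {

    /** True if the strides (in elements) describe a dense row-major layout for the shape. */
    static boolean isCContiguous(long[] shape, long[] strides) {
        long expected = 1;
        for (int i = shape.length - 1; i >= 0; i--) {
            // Dimensions of size 1 never advance, so their stride is irrelevant
            if (shape[i] != 1 && strides[i] != expected) {
                return false;
            }
            expected *= shape[i];
        }
        return true;
    }

    public static void main(String[] args) {
        // Dense 2x3 array
        System.out.println(isCContiguous(new long[]{2, 3}, new long[]{3, 1}));
        // First two columns of a dense 4x4 array: element offsets 0, 1, 4, 5, 8, 9, ...
        System.out.println(isCContiguous(new long[]{4, 2}, new long[]{4, 1}));
    }
}
```

The second case is a view whose element offsets are not evenly spaced, so no single stride value can walk it correctly; code that handles it has to consult the full stride array.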

***

### Java Conventions

* **Java 11** source compatibility. Do not use language features introduced after Java 11 (for example records, sealed classes, or switch expressions) in core modules.
* **4-space indent**, no tabs.
* **Lombok** annotations (`@Data`, `@Builder`, `@Slf4j`) — follow the style of surrounding code.
* No wildcard imports (`import org.nd4j.*`).
* **Javadoc** on all public methods and classes.
* Generated code (JavaCPP presets) must never be edited directly — update the preset configuration instead.

***

### Pull Request Workflow

#### 1. Fork and branch

```bash
git clone https://github.com/YOUR_USERNAME/deeplearning4j.git
cd deeplearning4j
git remote add upstream https://github.com/deeplearning4j/deeplearning4j.git
git fetch upstream
git checkout -b my-feature upstream/master
```

#### 2. Make your changes

Keep commits small and focused. Each commit should compile and pass tests independently.

#### 3. Rebase before submitting

```bash
git fetch upstream
git rebase upstream/master
```

#### 4. Push and open PR

```bash
git push origin my-feature
```

Open a pull request to `deeplearning4j/deeplearning4j:master`. Include:

* **What** was changed and **why**
* How to test the change
* Any relevant issue numbers (`Fixes #1234`)

#### 5. CI and review

CI runs automatically. Address reviewer feedback by pushing additional commits — do not force-push a branch under review. A maintainer will merge once the PR is approved and CI passes.

#### PR checklist

* [ ] ECA signed and email matches GitHub account
* [ ] Branch is rebased on current `upstream/master`
* [ ] Code compiles with `mvn clean install -DskipTests`
* [ ] New or changed behavior is covered by tests in `platform-tests/`
* [ ] New public API has Javadoc
* [ ] Numerical gradient checks pass for any new op with a backward pass
* [ ] C++ code uses project macros (`SD_HOST`, `SD_DEVICE`, `PRAGMA_OMP_*`), not raw annotations

***

### CI/CD Build Environment

Understanding the CI pipeline helps when debugging build failures or adding new build targets. All CI configuration lives in `.github/workflows/`.

#### Build matrix

The project builds native artifacts across multiple platforms, each producing classifier-tagged Maven artifacts:

| Platform                  | Workflow                           | Runners                        |
| ------------------------- | ---------------------------------- | ------------------------------ |
| Linux x86\_64 (CPU)       | `build-deploy-linux-x86_64.yml`    | `ubuntu-22.04`                 |
| Linux x86\_64 (CUDA 12.6) | `build-deploy-linux-cuda-12.6.yml` | Self-hosted                    |
| Linux x86\_64 (CUDA 12.9) | `build-deploy-linux-cuda-12.9.yml` | Self-hosted                    |
| Linux ARM64               | `build-deploy-linux-arm64.yml`     | Self-hosted ARM64              |
| macOS ARM64               | `build-deploy-mac-arm64.yml`       | `macos-14`                     |
| Windows x86\_64           | `build-deploy-windows-x86_64.yml`  | `windows-2022`                 |
| Android ARM64             | `build-deploy-android-arm64.yml`   | `ubuntu-22.04` (cross-compile) |
| Android x86\_64           | `build-deploy-android-x86_64.yml`  | `ubuntu-22.04` (cross-compile) |

#### Classifier system

Each platform build produces artifacts whose classifiers encode the platform plus any helper library and SIMD extension:

```
nd4j-native-1.0.0-SNAPSHOT-linux-x86_64.jar          # base
nd4j-native-1.0.0-SNAPSHOT-linux-x86_64-onednn.jar   # oneDNN helper
nd4j-native-1.0.0-SNAPSHOT-linux-x86_64-avx2.jar     # AVX2 extension
nd4j-native-1.0.0-SNAPSHOT-linux-x86_64-onednn-avx2.jar  # both
```

CPU builds use a **matrix** of helper × extension:

| Dimension     | Values                       | What it means                                                                  |
| ------------- | ---------------------------- | ------------------------------------------------------------------------------ |
| **helper**    | `onednn`, `compile`, (empty) | Helper library linked: oneDNN graph API, MLIR/Triton compile stack, or generic |
| **extension** | `avx2`, `avx512`, (empty)    | x86 SIMD extension level targeted                                              |

The matrix produces up to 9 combinations (3 × 3). The `compile` helper variant requires LLVM/MLIR at build time and produces the Triton JIT compilation backend.
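
As a rough sketch of how the 3 × 3 matrix maps to classifier strings, the block below enumerates every helper and extension pair and joins the non-empty parts with `-`, following the naming pattern of the artifacts listed above. The class is hypothetical, and the `-compile` and `-avx512` suffixes are assumptions extrapolated from the `onednn` and `avx2` names; check the workflow files for the combinations that are actually built.

```java
import java.util.ArrayList;
import java.util.List;

/** Enumerates the CPU helper x extension matrix; suffixes other than onednn/avx2 are assumed. */
final class ClassifierMatrixSketch {

    /** Joins the platform with any non-empty helper and extension using '-'. */
    static String classifier(String platform, String helper, String extension) {
        StringBuilder sb = new StringBuilder(platform);
        if (!helper.isEmpty()) {
            sb.append('-').append(helper);
        }
        if (!extension.isEmpty()) {
            sb.append('-').append(extension);
        }
        return sb.toString();
    }

    /** Returns all 3 x 3 combinations for one platform; the empty string means "generic". */
    static List<String> allClassifiers(String platform) {
        String[] helpers = {"", "onednn", "compile"};
        String[] extensions = {"", "avx2", "avx512"};
        List<String> result = new ArrayList<>();
        for (String helper : helpers) {
            for (String extension : extensions) {
                result.add(classifier(platform, helper, extension));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        allClassifiers("linux-x86_64").forEach(System.out::println);
    }
}
```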

CUDA builds use a simpler matrix:

| Dimension  | Values                      | What it means                                       |
| ---------- | --------------------------- | --------------------------------------------------- |
| **helper** | `cudnn`, `compile`, (empty) | cuDNN helper, Triton compile stack, or generic CUDA |

#### CI environment details

The standard CI build environment for Linux x86\_64:

| Component      | Version / Config                                                           |
| -------------- | -------------------------------------------------------------------------- |
| OS             | Ubuntu 22.04                                                               |
| JDK            | Temurin 11 (build), 17 (test)                                              |
| Maven          | 3.9.x                                                                      |
| CMake          | Latest via `apt`                                                           |
| Compiler cache | **sccache** (not ccache — CI uses sccache for its S3 remote cache support) |
| Protobuf       | libprotobuf-dev (from apt)                                                 |
| Debug symbols  | `libdwarf-dev`, `libelf-dev`, `binutils-dev` (for DWARF stack traces)      |
| LLVM/MLIR      | LLVM 18 (only for `compile` helper variant)                                |
| Swap           | 12 GB swap file (native builds are memory-intensive)                       |

**Difference from local development:** CI uses **sccache** instead of ccache because sccache supports remote S3 caching, so the CI cache is shared across runs. Local developers should still use **ccache**, which is simpler and doesn't require S3 configuration.

CUDA CI builds add:

| Component            | Version / Config                                          |
| -------------------- | --------------------------------------------------------- |
| CUDA toolkit         | 12.6 or 12.9 (installed via `Jimver/cuda-toolkit` action) |
| Compute capabilities | `8.6 9.0` (Ampere + Hopper)                               |
| Build timeout        | 720 minutes (12 hours)                                    |
| Runners              | Self-hosted with NVIDIA GPUs                              |

#### Test infrastructure

Tests run via the `run-tests.yml` workflow, which supports 16 test suites:

| Suite              | What it tests                |
| ------------------ | ---------------------------- |
| `nd4j`             | ND4J core array operations   |
| `samediff`         | SameDiff autodiff engine     |
| `java-cpp`         | JavaCPP bindings             |
| `dl4j-core`        | DL4J neural network layers   |
| `datavec`          | Data pipeline (DataVec)      |
| `keras`            | Keras model import           |
| `onnx`             | ONNX import                  |
| `dl4j-spark`       | Distributed training (Spark) |
| `dsp`              | DSP execution engine         |
| `llm`              | LLM/VLM inference stack      |
| `peft`             | PEFT and RL alignment        |
| `ggml`             | GGML/GGUF model import       |
| `omnihub`          | OmniHub model loading        |
| `python4j`         | Python4J bridge              |
| `tokenizers`       | Tokenizer implementations    |
| `model-evaluation` | LLM evaluation benchmarks    |

The workflow accepts parameters:

```yaml
# Run a single suite
test_suite: "samediff"

# Run all 16 suites
test_suite: "all"

# Quick mode — runs 5 core suites in parallel (nd4j, samediff, java-cpp, dl4j-core, datavec)
test_suite: "quick"

# Backend selection
backend_artifact_id: "nd4j-native"
backend_classifier: "linux-x86_64-onednn-avx2"

# JVM config
heap_size: "6g"
```

Test resources (models, test data) are fetched from `dl4j-test-resources` at the start of each run. Test results are uploaded as artifacts (Surefire XML reports) for each suite.
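
The selection semantics of `test_suite` can be summarised in a few lines. The sketch below is only an illustration of how `"all"`, `"quick"`, and a single suite name expand, using the suite names from the table above; the class is hypothetical, and how the workflow handles unknown names is an assumption.

```java
import java.util.List;

/** Illustrates how a test_suite value expands to suite names; not the workflow's own logic. */
final class TestSuiteSelectionSketch {

    static final List<String> ALL = List.of(
        "nd4j", "samediff", "java-cpp", "dl4j-core", "datavec", "keras", "onnx", "dl4j-spark",
        "dsp", "llm", "peft", "ggml", "omnihub", "python4j", "tokenizers", "model-evaluation");

    static final List<String> QUICK = List.of(
        "nd4j", "samediff", "java-cpp", "dl4j-core", "datavec");

    /** Expands "all" and "quick"; any other value must be the name of a single known suite. */
    static List<String> resolve(String testSuite) {
        switch (testSuite) {
            case "all":
                return ALL;
            case "quick":
                return QUICK;
            default:
                if (!ALL.contains(testSuite)) {
                    // Assumed behaviour: reject names that are not in the suite table
                    throw new IllegalArgumentException("Unknown test suite: " + testSuite);
                }
                return List.of(testSuite);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("quick"));
    }
}
```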

#### Snapshot deployment

Successful builds on the `master` branch deploy Maven snapshots to [central.sonatype.com](https://central.sonatype.com) (the OSSRH Sonatype snapshot repository). The deployment uses retry with exponential backoff (up to 3 attempts) to handle transient upload failures.
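
For readers unfamiliar with the pattern, retry with exponential backoff looks roughly like the sketch below. It is a generic illustration, not the deployment script: the class name, the initial delay, and the doubling factor are assumptions, and only the limit of 3 attempts comes from the description above.

```java
import java.util.concurrent.Callable;

/** Generic retry-with-exponential-backoff sketch; delays are illustrative. */
final class RetryWithBackoffSketch {

    /** Runs the action up to maxAttempts times, doubling the wait after each failure. */
    static <T> T retry(Callable<T> action, int maxAttempts, long initialDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2; // exponential backoff
                }
            }
        }
        throw last; // every attempt failed; surface the most recent error
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("transient upload failure");
            }
            return "deployed";
        }, 3, 10L);
        System.out.println(result + " after " + calls[0] + " attempts"); // prints "deployed after 3 attempts"
    }
}
```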

Deployed artifacts include:

* Backend JARs with platform classifiers
* DSP Runtime SDK (native shared libraries)
* SDK JARs (Java, with sources and javadoc)

#### Reproducing CI builds locally

To replicate what CI does for a CPU build with the oneDNN helper and AVX2 extension:

```bash
mvn -Pcpu \
  -Dlibnd4j.buildthreads=$(nproc) \
  -Dlibnd4j.helper=onednn \
  -Dlibnd4j.extension=avx2 \
  -pl libnd4j,:nd4j-cpu-backend-common,:nd4j-native \
  clean install -DskipTests
```

To replicate a CUDA build with cuDNN:

```bash
mvn -Pcuda \
  -Dlibnd4j.chip=cuda \
  -Dlibnd4j.helper=cudnn \
  -Dlibnd4j.buildthreads=$(nproc) \
  -pl libnd4j,:nd4j-cuda-12.9 \
  clean install -DskipTests
```

The key difference: CI uses sccache with remote caching and builds all matrix combinations. Locally you only need to build the one combination you're working on.

***

### Reporting Issues

File bugs and feature requests at [github.com/deeplearning4j/deeplearning4j/issues](https://github.com/deeplearning4j/deeplearning4j/issues).

A useful bug report includes:

* DL4J version (or commit hash if built from source)
* Java version and OS
* Backend (CPU or CUDA, and CUDA toolkit version)
* A minimal reproducible example
* Full stack trace
* Expected vs. actual behavior
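
The Java and OS details can be collected with standard JVM system properties. The snippet below is a small convenience sketch (the class name is hypothetical); it does not report the DL4J version or backend, which you still need to add by hand.

```java
/** Prints the JVM and OS details a bug report asks for, using standard system properties. */
final class EnvironmentReportSketch {

    /** Builds a short multi-line description of the running JVM and operating system. */
    static String describe() {
        return String.join(System.lineSeparator(),
            "Java version: " + System.getProperty("java.version"),
            "Java vendor:  " + System.getProperty("java.vendor"),
            "OS:           " + System.getProperty("os.name") + " " + System.getProperty("os.version"),
            "Architecture: " + System.getProperty("os.arch"));
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```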

For example-specific bugs, use [github.com/eclipse/deeplearning4j-examples/issues](https://github.com/eclipse/deeplearning4j-examples/issues).

***

### Community

| Channel                                                                            | Purpose                       |
| ---------------------------------------------------------------------------------- | ----------------------------- |
| [GitHub Issues](https://github.com/deeplearning4j/deeplearning4j/issues)           | Bug reports, feature requests |
| [GitHub Discussions](https://github.com/deeplearning4j/deeplearning4j/discussions) | Questions, design discussions |


