
# CPU (nd4j-native)

`nd4j-native` is the CPU backend for ND4J. It executes all array operations through **libnd4j**, a C++ compute engine that calls into BLAS routines and uses SIMD vector instructions where available. This page covers Maven setup, AVX optimization classifiers, BLAS configuration, threading controls, and memory tuning.

## Maven Dependencies

### Recommended: `-platform` artifact

The simplest setup uses the `-platform` artifact, which bundles natives for all supported operating systems and CPU architectures in one dependency block:

```xml
<properties>
  <dl4j.version>1.0.0-rewrite</dl4j.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native-platform</artifactId>
    <version>${dl4j.version}</version>
  </dependency>
</dependencies>
```

This is appropriate for most projects. Maven resolves the native JARs for Linux x86\_64, Windows x86\_64, macOS x86\_64, macOS ARM64, Linux ARM64, and Linux ppc64le simultaneously. At runtime JavaCPP extracts only the library matching the current OS and CPU.
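To make the runtime selection concrete, the sketch below approximates how an operating system name and CPU architecture map onto the classifier strings used on this page. It is an illustration only: `PlatformName` and `platformOf` are hypothetical names, and JavaCPP's real detection logic handles more cases than shown here.

```java
import java.util.Locale;

/** Hypothetical helper approximating the OS/arch to classifier mapping. Not JavaCPP's actual code. */
final class PlatformName {

    /** Maps values such as System.getProperty("os.name") and ("os.arch") to e.g. "linux-x86_64". */
    static String platformOf(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);

        // Normalize the operating system name.
        if (os.startsWith("mac os x") || os.startsWith("darwin")) {
            os = "macosx";
        } else if (os.startsWith("windows")) {
            os = "windows";
        } else if (os.startsWith("linux")) {
            os = "linux";
        }

        // Normalize the architecture name.
        if (arch.equals("amd64") || arch.equals("x86_64") || arch.equals("x86-64")) {
            arch = "x86_64";
        } else if (arch.equals("aarch64") || arch.equals("arm64")) {
            arch = "arm64";
        }
        // Other values (for example "ppc64le") pass through unchanged.

        return os + "-" + arch;
    }

    public static void main(String[] args) {
        System.out.println(platformOf(System.getProperty("os.name"), System.getProperty("os.arch")));
    }
}
```

Running `main` on the deployment machine gives a reasonable first guess at which classifier the "current-platform only" setup would need.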

### Minimal: current-platform only

To reduce JAR size for a known deployment target, omit `-platform` and add an explicit classifier:

```xml
<dependencies>
  <!-- API + Java code -->
  <dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>${dl4j.version}</version>
  </dependency>

  <!-- Natives for one platform only -->
  <dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>${dl4j.version}</version>
    <classifier>linux-x86_64</classifier>
  </dependency>
</dependencies>
```

### Using the DL4J BOM

If you are using the broader DL4J stack (DataVec, DL4J training layers, etc.), import the BOM to keep all version numbers consistent:

```xml
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.deeplearning4j</groupId>
      <artifactId>deeplearning4j-parent</artifactId>
      <version>${dl4j.version}</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

<dependencies>
  <dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native-platform</artifactId>
  </dependency>
</dependencies>
```

With the BOM in place, the `<version>` tag on individual dependencies is optional because the BOM pins the versions.

## Platform Classifiers

Available native classifiers for `nd4j-native` in M2.1:

| Classifier            | OS      | Architecture | Notes                             |
| --------------------- | ------- | ------------ | --------------------------------- |
| `linux-x86_64`        | Linux   | x86-64       | Baseline SSE4.1                   |
| `linux-x86_64-avx2`   | Linux   | x86-64       | AVX2 SIMD (Haswell+)              |
| `linux-x86_64-avx512` | Linux   | x86-64       | AVX-512 (Skylake-X+, Ice Lake+)   |
| `linux-arm64`         | Linux   | AArch64      | AWS Graviton, Raspberry Pi 64-bit |
| `linux-ppc64le`       | Linux   | POWER8/9     | IBM Power Systems                 |
| `windows-x86_64`      | Windows | x86-64       | Baseline SSE4.1                   |
| `windows-x86_64-avx2` | Windows | x86-64       | AVX2 SIMD                         |
| `macosx-x86_64`       | macOS   | x86-64       | Intel Mac                         |
| `macosx-arm64`        | macOS   | AArch64      | Apple Silicon (M1/M2/M3)          |

### `-compile` classifiers (1.0.0-rewrite)

The 1.0.0-rewrite release adds `-compile` variants for each platform. These bundle the DSP JIT compilation stack (Triton, MLIR) into the native binary, enabling kernel fusion and JIT-compiled execution. The base classifiers above run standard ops and CUDA graph capture/replay but do not include JIT fusion.

| Classifier                    | OS      | Architecture | Includes               |
| ----------------------------- | ------- | ------------ | ---------------------- |
| `linux-x86_64-compile`        | Linux   | x86-64       | Triton + MLIR + oneDNN |
| `linux-arm64-compile`         | Linux   | AArch64      | MLIR                   |
| `macosx-arm64-compile`        | macOS   | AArch64      | MLIR + MLX             |
| `android-arm64-compile`       | Android | AArch64      | MLIR                   |
| `android-arm64-compile-nnapi` | Android | AArch64      | MLIR + NNAPI           |

**Trade-off:** `-compile` classifiers produce a larger binary with more native dependencies (LLVM/Triton), but enable the full DSP JIT pipeline for maximum performance. The base classifiers are smaller and simpler to deploy. See [Hardware Backends — Classifier Variants](https://github.com/KonduitAI/deeplearning4j-docs/blob/en-1.0.0-rewrite/docs/m2.1/nd4j/backends/hardware-backends/README.md#2-classifier-variants-base-vs-compile) for the complete trade-off guide.

```xml
<!-- CPU with full DSP JIT on Linux x86-64 -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>${dl4j.version}</version>
    <classifier>linux-x86_64-compile</classifier>
</dependency>
```

When using `-platform`, Maven pulls all of these classifiers. When specifying one manually, pick the classifier that matches your deployment target.

## AVX2 and AVX-512 Optimizations

libnd4j is compiled in multiple variants, one per SIMD instruction set. The `-avx2` and `-avx512` classifier variants contain a libnd4j built with those instruction sets enabled. Compared to the baseline SSE4.1 build, they improve throughput for element-wise and reduction operations; the expected range for float ops is roughly 1.5–2× with AVX2 and 2–4× with AVX-512.

### Selecting AVX via Maven classifier

Replace the plain classifier with the AVX-enabled one:

```xml
<!-- AVX2 — works on Intel Haswell (2013+), AMD Ryzen (2017+) -->
<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-native</artifactId>
  <version>${dl4j.version}</version>
  <classifier>linux-x86_64-avx2</classifier>
</dependency>

<!-- AVX-512 — works on Intel Skylake-X, Ice Lake, Cascade Lake and newer -->
<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-native</artifactId>
  <version>${dl4j.version}</version>
  <classifier>linux-x86_64-avx512</classifier>
</dependency>
```

### Selecting AVX via system property (with `-platform`)

When using the `-platform` artifact all classifier variants are on the classpath. JavaCPP picks the correct native at startup, but you can steer it toward the AVX variant with:

```
-Djavacpp.platform.extension=-avx2
```

or

```
-Djavacpp.platform.extension=-avx512
```

Set this property on the JVM command line before the application starts. The value is appended to the detected platform string (e.g., `linux-x86_64`) to form `linux-x86_64-avx2`, and JavaCPP loads that native library instead of the baseline one.

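**Verification:** print the platform JavaCPP detected and the extension that was requested:

```java
import org.bytedeco.javacpp.Loader;

// Platform string detected by JavaCPP (the javacpp.platform property is usually unset)
System.out.println(Loader.getPlatform());
// e.g. linux-x86_64
// Extension requested on the command line; null if the flag was not passed
System.out.println(System.getProperty("javacpp.platform.extension"));
// e.g. -avx2
```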

If the property is set but the classifier is not on the classpath, JavaCPP falls back to the baseline library and logs a warning.
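The selection rule described in this section (append the extension to the platform string, fall back to the baseline if that variant is missing) can be modeled in a few lines. This is a sketch of the described behavior, not JavaCPP's implementation; `ClassifierResolver`, `resolve`, and the `available` set are hypothetical names introduced for illustration.

```java
import java.util.Set;

/** Hypothetical model of the "platform + extension, else baseline" rule. */
final class ClassifierResolver {

    /**
     * @param platform  detected platform, e.g. "linux-x86_64"
     * @param extension value of javacpp.platform.extension, e.g. "-avx2" (may be null)
     * @param available classifiers assumed to be present on the classpath
     * @return the classifier whose native library would be loaded under this model
     */
    static String resolve(String platform, String extension, Set<String> available) {
        String extended = platform + (extension == null ? "" : extension);
        // Fall back to the baseline classifier when the extended one is missing.
        return available.contains(extended) ? extended : platform;
    }

    public static void main(String[] args) {
        Set<String> onClasspath = Set.of("linux-x86_64", "linux-x86_64-avx2");
        System.out.println(resolve("linux-x86_64", "-avx2", onClasspath));
        System.out.println(resolve("linux-x86_64", "-avx512", onClasspath));
    }
}
```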

### Which AVX level to use

| Instruction set   | Minimum CPU                             | Expected speedup over baseline |
| ----------------- | --------------------------------------- | ------------------------------ |
| Baseline (SSE4.1) | Any x86-64 since \~2007                 | —                              |
| AVX2              | Intel Haswell (2013), AMD Ryzen (2017)  | 1.5–2× for float ops           |
| AVX-512           | Intel Skylake-X (2017), Ice Lake (2019) | 2–4× for float ops             |

Run `lscpu | grep -o 'avx[^ ]*'` on Linux to check which instruction sets your CPU supports. Do not specify `-avx512` on a machine that lacks AVX-512 — the process will crash with `SIGILL`.
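The same check can be scripted from Java before choosing an extension. The sketch below reads the `flags` line of `/proc/cpuinfo` (Linux only) and suggests a value for `javacpp.platform.extension`. `AvxLevel` and `extensionFor` are hypothetical names, and treating the `avx512f` flag alone as sufficient for the `-avx512` build is an assumption; the actual build may rely on further AVX-512 subsets.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Stream;

/** Hypothetical helper: suggests an AVX extension from a CPU flag list. */
final class AvxLevel {

    /** Returns "-avx512", "-avx2", or "" (baseline) for a whitespace-separated flag list. */
    static String extensionFor(String cpuFlags) {
        Set<String> flags = new HashSet<>(
                Arrays.asList(cpuFlags.trim().toLowerCase(Locale.ROOT).split("\\s+")));
        // Assumption: the AVX-512 foundation flag is enough to pick the -avx512 build.
        if (flags.contains("avx512f")) {
            return "-avx512";
        }
        if (flags.contains("avx2")) {
            return "-avx2";
        }
        return "";
    }

    public static void main(String[] args) throws IOException {
        Path cpuinfo = Paths.get("/proc/cpuinfo"); // present on Linux only
        if (!Files.isReadable(cpuinfo)) {
            System.out.println("Cannot read /proc/cpuinfo; check CPU features another way.");
            return;
        }
        try (Stream<String> lines = Files.lines(cpuinfo)) {
            String flags = lines.filter(l -> l.startsWith("flags"))
                    .findFirst()
                    .map(l -> l.substring(l.indexOf(':') + 1))
                    .orElse("");
            String ext = extensionFor(flags);
            System.out.println(ext.isEmpty()
                    ? "Use the baseline classifier"
                    : "-Djavacpp.platform.extension=" + ext);
        }
    }
}
```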

## BLAS Libraries

BLAS (Basic Linear Algebra Subprograms) is used for matrix multiplication (`mmul`), dot products, and similar linear-algebra routines across BLAS levels 1 through 3. ND4J ships with **OpenBLAS** bundled inside libnd4j and uses it by default.

### OpenBLAS (bundled, default)

No additional configuration is required. OpenBLAS is statically linked into libnd4j. It auto-detects the number of physical CPU cores and sets its internal thread count accordingly.

### Intel MKL (optional, higher performance)

On Intel hardware, Intel MKL typically delivers 10–40% higher throughput for matrix operations compared to OpenBLAS. MKL is not bundled with ND4J, but it can be made available on `LD_LIBRARY_PATH` and selected with a system property.

Steps to use MKL:

1. Install Intel oneAPI Math Kernel Library (MKL). The free `intel-mkl` package is available via the Intel apt/yum repository or the `mkl` conda package.
2. Add the MKL library path to `LD_LIBRARY_PATH` before launching the JVM:

```bash
export LD_LIBRARY_PATH=/opt/intel/mkl/lib/intel64:$LD_LIBRARY_PATH
java -jar myapp.jar
```

3. Tell ND4J to prefer MKL by setting the BLAS library name:

```
-Dorg.bytedeco.openblas.load=mkl_rt
```

4. Verify MKL is in use at runtime:

```java
String blasLib = NativeOpsHolder.getInstance().getDeviceNativeOps().getBlasLibraryName();
System.out.println(blasLib);  // should contain "mkl"
```

If MKL is not found on `LD_LIBRARY_PATH`, libnd4j silently falls back to the bundled OpenBLAS.
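Because the fallback is silent, it can help to confirm before launch that the MKL runtime is actually reachable. The sketch below scans the directories in a path list for a given file name. `LibraryPathCheck` and `find` are hypothetical names, and `libmkl_rt.so` is the assumed Linux file name for the `mkl_rt` library; this checks only that the file exists, not that libnd4j will load it.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

/** Hypothetical pre-flight check: is a shared library present on a path list? */
final class LibraryPathCheck {

    /** Returns the first directory entry in pathList that contains fileName, if any. */
    static Optional<Path> find(String pathList, String fileName) {
        if (pathList == null || pathList.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathList.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty entries such as a trailing separator
            }
            Path candidate = Paths.get(dir, fileName);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> mkl = find(System.getenv("LD_LIBRARY_PATH"), "libmkl_rt.so");
        System.out.println(mkl.map(p -> "Found " + p)
                .orElse("libmkl_rt.so not found on LD_LIBRARY_PATH"));
    }
}
```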

### Disabling multi-threaded BLAS

Some workloads perform many small matrix multiplications in parallel (e.g., batched inference on a thread pool). In this case it is more efficient to use single-threaded BLAS and let your own thread pool provide parallelism:

```bash
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
```

or equivalently:

```
-Dorg.bytedeco.openblas.num.threads=1
```

## Thread Configuration

### OMP\_NUM\_THREADS

libnd4j uses OpenMP for parallelizing element-wise operations and reductions. By default it uses all available logical CPUs. Set `OMP_NUM_THREADS` to restrict it:

```bash
export OMP_NUM_THREADS=4
java -jar myapp.jar
```

Keeping `OMP_NUM_THREADS` at or below the number of **physical** CPU cores (not hyperthreads) usually yields better sustained throughput.

### Nd4j.setNumThreads()

Thread count can also be set programmatically at any point before or during computation:

```java
// Set to 4 threads for all subsequent ops
Nd4j.setNumThreads(4);

// Read the current setting
int n = Nd4j.getNumThreads();
System.out.println("Using " + n + " threads");
```

The programmatic setting overrides the `OMP_NUM_THREADS` environment variable.

### Thread recommendations by workload

| Workload                             | Recommendation                                          |
| ------------------------------------ | ------------------------------------------------------- |
| Single large matmul or FFT           | Use all physical cores (`OMP_NUM_THREADS` = core count) |
| Many small ops in a Java thread pool | Set `OMP_NUM_THREADS=1`, parallelize at the Java level  |
| Training a neural network (DL4J)     | Default (all cores); DL4J manages its own parallelism   |
| Inference server under load          | Tune empirically; often `OMP_NUM_THREADS=2` or `4`      |
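One way to reason about the rows above is as a core budget: the number of Java worker threads multiplied by the OpenMP threads per operation should not exceed the physical core count. The sketch below encodes that heuristic. `ThreadBudget` and `ompThreadsPerWorker` are hypothetical names, and the estimate of two hardware threads per physical core in `main` is an assumption that does not hold on every CPU; treat the result as a starting point for empirical tuning.

```java
/** Hypothetical heuristic: split physical cores between Java workers and OpenMP threads. */
final class ThreadBudget {

    /** OpenMP threads per worker such that javaWorkers * result <= physicalCores (minimum 1). */
    static int ompThreadsPerWorker(int physicalCores, int javaWorkers) {
        if (physicalCores < 1 || javaWorkers < 1) {
            throw new IllegalArgumentException("core and worker counts must be positive");
        }
        return Math.max(1, physicalCores / javaWorkers);
    }

    public static void main(String[] args) {
        int logical = Runtime.getRuntime().availableProcessors();
        // Assumption: two hardware threads per physical core (SMT enabled).
        int physical = Math.max(1, logical / 2);
        int workers = 4; // size of the application's own thread pool (example value)
        System.out.println("OMP_NUM_THREADS=" + ompThreadsPerWorker(physical, workers));
    }
}
```

With 8 physical cores, a single worker gets all 8 OpenMP threads, while a pool of 8 or more workers gets 1 each, matching the first two rows of the table.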

## Memory Configuration

ND4J uses off-heap memory (native C heap) for array data, managed by JavaCPP. The following JVM system properties control the memory limits.

### Maximum off-heap allocation

```
-Dorg.bytedeco.javacpp.maxbytes=4g
```

This caps the total off-heap memory JavaCPP will allocate before triggering garbage collection to reclaim unused native arrays. Values accept `k`, `m`, `g` suffixes. If the property is unset, JavaCPP uses the JVM's maximum heap size (`Runtime.getRuntime().maxMemory()`, normally the `-Xmx` value) as the limit.

### Maximum physical bytes

```
-Dorg.bytedeco.javacpp.maxphysicalbytes=8g
```

Sets a hard ceiling on the combined Java heap + off-heap usage. If the process exceeds this limit, JavaCPP raises `OutOfMemoryError`. Set this slightly below the machine's available RAM to prevent swapping.

### Typical production invocation

```bash
java \
  -Xmx4g \
  -Dorg.bytedeco.javacpp.maxbytes=8g \
  -Dorg.bytedeco.javacpp.maxphysicalbytes=12g \
  -Djavacpp.platform.extension=-avx2 \
  -Dorg.bytedeco.openblas.num.threads=8 \
  -jar myapp.jar
```

Here the Java heap is capped at 4 GB and the ND4J off-heap pool is allowed up to 8 GB, for a combined maximum of 12 GB of physical memory.
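The arithmetic behind this invocation (heap plus off-heap must fit under the physical limit) is easy to check in a launcher script or a unit test. The sketch below parses size strings with `k`, `m`, `g` suffixes and verifies the budget. `MemoryBudget`, `parseSize`, and `fits` are hypothetical names, and the use of 1024-based multiples is an assumption about how the suffixes are interpreted.

```java
import java.util.Locale;

/** Hypothetical helper: sanity-check heap + off-heap against the physical limit. */
final class MemoryBudget {

    /** Parses values such as "4g", "512m", "64k", or a plain byte count. */
    static long parseSize(String value) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        if (v.isEmpty()) {
            throw new IllegalArgumentException("empty size");
        }
        long multiplier = 1L;
        switch (v.charAt(v.length() - 1)) {
            case 'k': multiplier = 1L << 10; break; // assumption: 1024-based units
            case 'm': multiplier = 1L << 20; break;
            case 'g': multiplier = 1L << 30; break;
            default: break;
        }
        if (multiplier != 1L) {
            v = v.substring(0, v.length() - 1);
        }
        return Long.parseLong(v) * multiplier;
    }

    /** True when -Xmx plus maxbytes does not exceed maxphysicalbytes. */
    static boolean fits(String xmx, String maxBytes, String maxPhysicalBytes) {
        return parseSize(xmx) + parseSize(maxBytes) <= parseSize(maxPhysicalBytes);
    }

    public static void main(String[] args) {
        // Values from the invocation above: 4 GB heap + 8 GB off-heap within 12 GB.
        System.out.println(fits("4g", "8g", "12g"));
    }
}
```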

### Workspace-based memory management

For training loops and repeated inference, enable ND4J workspaces to reuse off-heap buffers without waiting for garbage collection. See [Memory and Workspaces](https://github.com/KonduitAI/deeplearning4j-docs/blob/en-1.0.0-rewrite/docs/m2.1/core-concepts/memory-and-workspaces/README.md) for the full guide.

## Verifying the CPU Backend Is Active

```java
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

// Print the backend class name
System.out.println(Nd4j.getBackend().getClass().getName());
// Expected: org.nd4j.linalg.cpu.nativecpu.CpuBackend

// Confirm that the active environment reports itself as CPU
System.out.println(Nd4j.getEnvironment().isCPU());
// true

// Create a small array as a smoke test
INDArray x = Nd4j.linspace(0, 9, 10);
System.out.println(x);
// [0.0000, 1.0000, 2.0000, 3.0000, 4.0000, 5.0000, 6.0000, 7.0000, 8.0000, 9.0000]
```

If initialization fails, the most common causes are:

* No `nd4j-native` or `nd4j-native-platform` on the classpath. Add the dependency.
* The AVX classifier was specified but the CPU does not support it. Use the baseline classifier or check `lscpu`.
* A conflicting native library (`libopenblas.so`) on `LD_LIBRARY_PATH` that mismatches the bundled version. Remove or reorder `LD_LIBRARY_PATH`.

## Quick Reference

| Goal                                 | Setting                                               |
| ------------------------------------ | ----------------------------------------------------- |
| Add CPU backend (all platforms)      | `org.nd4j:nd4j-native-platform:1.0.0-rewrite`         |
| Add CPU backend (Linux x86\_64 only) | `org.nd4j:nd4j-native:1.0.0-rewrite:linux-x86_64`     |
| Enable AVX2 via system property      | `-Djavacpp.platform.extension=-avx2`                  |
| Enable AVX-512 via system property   | `-Djavacpp.platform.extension=-avx512`                |
| Set OpenMP thread count              | `export OMP_NUM_THREADS=N` or `Nd4j.setNumThreads(N)` |
| Limit off-heap memory                | `-Dorg.bytedeco.javacpp.maxbytes=Ng`                  |
| Limit total physical memory          | `-Dorg.bytedeco.javacpp.maxphysicalbytes=Ng`          |
| Use Intel MKL instead of OpenBLAS    | `-Dorg.bytedeco.openblas.load=mkl_rt`                 |
| Check active backend                 | `Nd4j.getBackend().getClass().getName()`              |

## See Also

* [Backends Overview](https://github.com/KonduitAI/deeplearning4j-docs/blob/en-1.0.0-rewrite/docs/m2.1/nd4j/backends/overview/README.md) — SPI mechanism, backend discovery, classpath rules
* [CUDA Backend (nd4j-cuda)](https://github.com/KonduitAI/deeplearning4j-docs/blob/en-1.0.0-rewrite/docs/m2.1/nd4j/backends/cuda/README.md) — GPU setup, multi-GPU, VRAM management
* [Memory and Workspaces](https://github.com/KonduitAI/deeplearning4j-docs/blob/en-1.0.0-rewrite/docs/m2.1/core-concepts/memory-and-workspaces/README.md) — off-heap memory, workspace scopes

