
How To Guides

Import into your favorite IDE

Prerequisites

Ensure that you clone the deeplearning4j project locally.

Before importing the project, a few things of note no matter what IDE you use:

One submodule (libnd4j) is a c++ project that uses maven to invoke a cmake build. You may wish to edit libnd4j separately in a cmake oriented IDE like VS Code, CLion, or Eclipse c/c++. In order to build a particular nd4j backend, libnd4j should already be compiled. By default, relevant nd4j backends all look for a precompiled libnd4j in the libnd4j directory included within the same project.
Maven profiles for deeplearning4j matter a lot, especially if you want to run tests. Read more on the test profiles in the Testing section. For most code, nd4j-tests-cpu should probably be the main profile you use.
Deeplearning4j uses lombok for code generation. Ensure you install the lombok plugin for your favorite IDE in order to use the project. Please follow the lombok setup instructions for your IDE.

Intellij

Once cloned locally, open intellij. Please follow the guide to import from external maven sources.

Once imported, please give the project time to download associated dependencies. You can verify the status of the project in the bottom right corner.

In order to enable the project to work, the following modifications need to be made.

Shaded modules

Eclipse Deeplearning4j has a set of shaded modules. Shaded modules are artifacts that re namespace a dependency to a different location in order to use it as a set of private dependencies that do not clash with other libraries that may also share the dependency.

Intellij does not handle this very well. In order to work around this, you need to exclude all projects under the nd4j/nd4j-shade folder individually. Right click on each folder. Go to Maven -> Ignore Projects.

Assuming you follow the other steps above (lombok, libnd4j, ...) then you should be able to run any module you want.

Eclipse

When the project import first finishes, a number of maven connector errors will be highlighted. Just click "resolve all later" and then finish. Let eclipse finish downloading sources and javadoc.

As of the latest version of eclipse, build errors may occur.

Contribute

How to contribute to the Eclipse Deeplearning4j source code.

Prerequisites

Before contributing, make sure you know the structure of all of the Eclipse Deeplearning4j libraries. As of early 2018, all libraries live in the Deeplearning4j repository. These include:

DeepLearning4J: Contains all of the code for learning neural networks, both on a single machine and distributed.
ND4J: “N-Dimensional Arrays for Java”. ND4J is the mathematical backend upon which DL4J is built. All of DL4J’s neural networks are built using the operations (matrix multiplications, vector operations, etc) in ND4J. ND4J is how DL4J supports both CPU and GPU training of networks, without any changes to the networks themselves. Without ND4J, there would be no DL4J.
DataVec: DataVec handles the data import and conversion side of the pipeline. If you want to import images, video, audio or simply CSV data into DL4J: you probably want to use DataVec to do this.
RL4J: Reinforcement Learning for Java. This set of libraries contains the ability to do reinforcement learning built on the deeplearning4j library.
Samediff: Built within the nd4j library, this library contains a tensorflow/pytorch like library for building data flow graphs.

We also have an extensive examples repository, deeplearning4j-examples.

Ways to contribute

There are numerous ways to contribute to DeepLearning4J (and related projects), depending on your interests and experience. Here are some ideas:

Add new types of neural network layers (for example: different types of RNNs, locally connected networks, etc)
Add a new training feature
Bug fixes
DL4J examples: Is there an application or network architecture that we don’t have examples for?
Testing performance and identifying bottlenecks or areas to improve
Improve website documentation (or write tutorials, etc)
Improve the JavaDocs

There are a number of different ways to find things to work on. These include:

Looking at the issue trackers
Reviewing our Roadmap
Reviewing recent papers and blog posts on training features, network architectures and applications
Reviewing the website and examples - what seems missing, incomplete, or would simply be useful (or cool) to have?

General guidelines

Before you dive in, there are a few things you need to know. In particular, the tools we use:

Maven: a dependency management and build tool, used for all of our projects. See the Maven documentation for details.
Git: the version control system we use
Project Lombok: Project Lombok is a code generation/annotation tool that is aimed to reduce the amount of ‘boilerplate’ code (i.e., standard repeated code) needed in Java. To work with source, you’ll need to install the Project Lombok plugin for your IDE
VisualVM: A profiling tool, most useful to identify performance issues and bottlenecks.
IntelliJ IDEA: This is our IDE of choice, though you may of course use alternatives such as Eclipse and NetBeans. You may find it easier to use the same IDE as the developers in case you run into any issues. But this is up to you.

Things to keep in mind:

Code should be Java 7 compliant
If you are adding a new method or class: add JavaDocs
You are welcome to add an author tag for significant additions of functionality. This can also help future contributors, in case they need to ask questions of the original author. If multiple authors are present for a class: provide details on who did what (“original implementation”, “added feature x” etc)
Provide informative comments throughout your code. This helps to keep all code maintainable.
Any new functionality should include unit tests (using JUnit) to test your code. This should include edge cases.
If you add a new layer type, you must include numerical gradient checks, as per the existing gradient check unit tests. These are necessary to confirm that the calculated gradients are correct.
If you are adding significant new functionality, consider also updating the relevant section(s) of the website, and providing an example. After all, functionality that nobody knows about (or nobody knows how to use) isn’t that helpful. Adding documentation is definitely encouraged when appropriate, but strictly not required.

Eclipse Contributors

IP/Copyright requirements for Eclipse Foundation Projects

This page explains the steps required to contribute code to the projects in the eclipse/deeplearning4j GitHub repository.

Contributors (anyone who wants to commit code to the repository) need to do two things, before their code can be merged:

Sign the Eclipse Contributor Agreement (once)
Sign commits (each time)

Why Is This Required?

These two requirements must be satisfied for all Eclipse Foundation projects, not just DL4J and ND4J. A full list of Eclipse Foundation projects can be found on the Eclipse Foundation website.

By signing the ECA, you are essentially asserting that the code you are submitting is something that either you wrote, or that you have the right to contribute to the project. This is a necessary legal protection to avoid copyright issues.

By signing your commits, you are asserting that the code in that particular commit is your own.

Signing the Eclipse Contributor Agreement

You only need to sign the Eclipse Contributor Agreement (ECA) once. Here's the process:

Step 1: Sign up for an Eclipse account

This can be done on the Eclipse Foundation accounts website.

Note: You must register using the same email as your GitHub account (the GitHub account you want to submit pull requests from).

Step 2: Sign the ECA

Go to the ECA page on the Eclipse Foundation website and follow the instructions.

Signing Your Commits

Signing a New Commit

There are a few ways to sign commits. Note that you can use any of these options.

Option 1: Use -s When Committing on Command Line

Signing commits here is simple:

Note the use of -s (lower case s) - upper-case S (i.e., -S) is for GPG signing (see below).
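As a minimal sketch of Option 1, the following creates a throwaway repository and makes one signed-off commit. The scratch directory, the placeholder identity and the commit message are assumptions for illustration only; in a real checkout only the git commit -s line is needed.

```shell
# Work in a scratch repository so the example cannot touch a real checkout.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Contributor"        # placeholder identity
git config user.email "contributor@example.com"   # placeholder email

echo "hello" > README.md
git add README.md

# Lower-case -s appends a Signed-off-by trailer built from user.name/user.email.
git commit -q -s -m "My signed commit"

# Show the full commit message, including the trailer.
git log -1 --format=%B
```

The name and email in the trailer come from your git configuration, so the email configured there should match the one registered with your Eclipse account.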

Option 2: Set up Bash Alias (or Windows cmd Alias) for Automated Signing

For example, you could set up an alias in Bash such as alias gcm='git commit -s -m'.

Then committing would be done with gcm "My Commit".

On Windows, one simple way is to create a gcm.bat file that wraps the same git commit -s -m command, and add it to your system path.

You can then commit using the same process as above (i.e., gcm "My Commit").

Option 3: Use GPG Signing

Note that this option can be combined with aliases (above), as in alias gcm='git commit -S -m' - note the upper case -S for GPG signing.

Option 4: Commit using IntelliJ with Auto Signing

Checking If A Commit Is Signed

After performing a commit, you can check in a few different ways. One way is to use git log --show-signature -1 to show the signature for the last commit (use -5 to show the last 5 commits, for example)

The output will look like:

The top commit is unsigned, and the bottom commit is signed (note the presence of the Signed-off-by).

If You Forget to Sign a Commit - Amending the Last Commit

If you forgot to sign the last commit, you can use the following command:
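One way to do this, sketched under the assumption that the commit has not been pushed yet, is git commit --amend combined with -s. The scratch repository and placeholder identity below exist only to make the example self-contained.

```shell
# Scratch repository with one commit that was made without -s.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Contributor"        # placeholder identity
git config user.email "contributor@example.com"   # placeholder email
echo "hello" > README.md
git add README.md
git commit -q -m "Unsigned commit"

# Rewrite the last commit: --no-edit keeps the message, -s adds the sign-off.
git commit -q --amend --no-edit -s

# The message now ends with a Signed-off-by trailer.
git log -1 --format=%B
```

Amending replaces the commit with a new one, so a branch that was already pushed needs a force push afterwards.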

If You Forget to Sign Multiple Commits

Suppose your branch has 3 new commits, all of which are unsigned:

One simple way is to squash and sign these commits. To do this for the last 3 commits, use the following: (note you might want to make a backup first)
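A possible squash-and-sign sequence is sketched below using git reset --soft followed by a signed-off commit. The scratch repository, the backup branch name and the commit messages are illustrative assumptions, not project conventions.

```shell
# Scratch repository: one base commit plus three unsigned commits.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Contributor"        # placeholder identity
git config user.email "contributor@example.com"   # placeholder email
for i in 0 1 2 3; do
  echo "change $i" > "file$i.txt"
  git add "file$i.txt"
  git commit -q -m "Commit $i"
done

# Keep a backup pointer to the original history (hypothetical branch name).
git branch backup-before-squash

# Move the branch back 3 commits but keep their changes staged...
git reset --soft HEAD~3
# ...then record them as a single signed-off commit.
git commit -q -s -m "Squashed and signed"

git log --format=%s
```

The backup branch still points at the three original commits, so it can be used to recover them if the squash goes wrong.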

The result:

You can confirm that the commit is signed using git log -1 --show-signature as shown earlier.

Note that your commits will be squashed once they are merged to master anyway, so the loss of the commit history does not matter.

If you are updating an existing PR, you may need to force push using -f (as in git push X -f).

Developer Docs

Github Actions/Build Infra

Github actions Configuration Overview

Overview of a Github Actions Configuration

Each github actions workflow has 10 parameters for manually invoking builds. Builds are invoked manually because of the different ways a release can break. Being manual also allows us to re-invoke only the parts of a build we need, rather than the whole release pipeline.

Most workflows implement a matrix structure for handling different combinations of builds related to the following:

1. Platform specific optimizations: On windows/linux/mac we allow cpu + optional linking against mkldnn. Each combination is enumerated and run as part of a matrix build on github actions.
2. Cuda, optional cudnn: We also allow optional linking against cudnn for gpu routines.

Input parameters:

buildThreads: This is the number of build threads used for compilation in libnd4j. This is the equivalent of make -j. For specific platforms that use more memory, 1 is the recommended value. On self hosted setups, you may use more threads to make builds run faster.
deployToReleaseStaging: 0 or 1. If 1, this will create a staging repository on oss sonatype. Otherwise, it will deploy to ossrh snapshots. Snapshots is the default.
releaseVersion: This is the intended release version to be converted to from snapshots. The update-versions.sh script is run converting the versions of every module to that specific version intended for release. This is what will get uploaded to a staging repository for release. Otherwise, all intended versions should be SNAPSHOT.
snapshotVersion: The current in development snapshot version
releaseRepoId: If blank, then a new staging repository for a version is created. Otherwise, a staging repository id should be obtained from the ossrh nexus sonatype. This releaseRepoId should be passed to subsequent builds so all of the artifacts associated with a version get propagated to 1 place.
serverId: This should be ossrh 90% of the time. A github profile is also available for use with github actions.
modules: The maven modules to build. This is fairly raw and error prone. The intended usage is with the -pl/--projects flag. Typical usage is to skip a libnd4j compile with something like:
```
-pl !libnd4j
```
This can speed builds up significantly.
libnd4jDownload/libnd4jUrl: In tandem with modules, you can specify a libnd4j zip file distribution that was compiled before for download. The builds will download a libnd4j distribution and use that for linking. This can be handy when recompiling the nd4j-native/nd4j-cuda backends for a specific platform without needing to recompile the whole c++ codebase. A url in a matrix build will be sourced from a hard coded file name from this repo - each file name will be updated to point to a zip file distribution appropriate for an individual matrix build. This was done because 1 url is not going to be suitable for individual matrix builds.
runsOn: This is the operating system upon which to run the build. For linux, this defaults to ubuntu-16.04. For windows, windows-2019. self-hosted can also be specified for faster builds.
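To illustrate the modules input described above, the sketch below assembles a local Maven command line that excludes libnd4j from the reactor. The goals and the -DskipTests flag are assumptions chosen for the example; the workflows may pass different flags. The script only prints the command so that it can be inspected before running it.

```shell
# Value that would be passed as the "modules" input: exclude libnd4j from the build.
# Single quotes stop an interactive shell from treating "!" as history expansion.
MODULES='-pl !libnd4j'

# Assumed goals and flags, for illustration only.
CMD="mvn clean install -DskipTests ${MODULES}"

# Dry run: print the command instead of executing it.
echo "$CMD"

# To execute from the repository root, run:
#   mvn clean install -DskipTests $MODULES
```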

Matrix builds

Many configurations on cpu and cuda require a matrix based build structure to capture the various combinations of optimization and software versions people may want to use. In order to accommodate these workflows, we need to attach variables proxying the values of the manual inputs to the individual matrix workers themselves. These parameters are analogous to the parameters described above. We will not repeat the descriptions here, but the values appear in the form of ${{ github.event.inputs.SOME_VALUE }} where SOME_VALUE is one of the values above.

The configuration to look for is as follows:

          - mvn_ext: ${{ github.event.inputs.mvnFlags }}
            experimental: true
            name: Extra maven flags

          - debug_enabled: ${{ github.event.inputs.debug_enabled }}
            experimental: true
            name: Debug enabled

          - runs_on: ${{ github.event.inputs.runsOn }}
            experimental: true
            name: OS to run on

          - libnd4j_file_download: ${{ github.event.inputs.libnd4jDownload }}
            experimental: true
            name: OS to run on

          - deploy_to_release_staging: ${{ github.event.inputs.deployToReleaseStaging }}
            experimental: true
            name: Whether to deploy to release staging or not

          - release_version: ${{ github.event.inputs.releaseVersion }}
            experimental: true
            name: Release version

          - snapshot_version: ${{ github.event.inputs.snapshotVersion }}
            experimental: true
            name: Snapshot version

          - server_id: ${{ github.event.inputs.serverId }}
            experimental: true
            name: Server id

          - release_repo_id: ${{ github.event.inputs.releaseRepoId }}
            experimental: true
            name: The release repository to run on

          - mvn_flags: ${{ github.event.inputs.mvnFlags }}
            experimental: true
            name: Extra maven flags to use as part of the build

          - build_threads: ${{ github.event.inputs.buildThreads }}
            experimental: true
            name: The number of threads to build libnd4j with

Expected timings

CUDA: Most cuda builds take 4-5 hours. Both windows and linux on GH actions just download the cuda distribution and compile things on their respective platforms.
CPU builds: From scratch libnd4j + cpu builds typically take 1-2 hours max. If a build takes longer than that, something may be wrong with it.

Build error causes

Out of disk: It is very common for a github actions VM to run out of disk. If a build fails with no further logs and all steps terminated, this may be one of the reasons.
Out of memory: Sometimes builds run out of memory. A few common causes include:
- Clang out of memory on android: depending on the number of build threads assigned, it is easy for clang to run out of memory
- Maven javadoc: The maven javadoc plugin for bigger projects can use a ton of ram and crash a job
Network failures: Maven can sometimes (rarely) fail to download certain dependencies in the middle of a job

Environment variables:

MAVEN_GPG_KEY: The maven gpg key secret for a release
CROSS_COMPILER_DIR: For the pi_build.sh script in libnd4j. This contains the root directory for cross compiler invocation. We need this because all cross compilation for various libnd4j builds happens on x86. We cross compile for speed reasons, which also easily allows us to run on github actions.
DEBIAN_FRONTEND: This is to ensure that all debian commands by default don't prompt for yes/no
GITHUB_TOKEN: This is for authentication with github actions
BUILD_USING_MAVEN: This is for pi_build.sh. This toggles (0 or 1) whether to use maven or buildnativeoperations.sh in the libnd4j root directory directly.
NDK_VERSION: Default is r21d. Libnd4j's android build is currently compiled with the android NDK r21.
CURRENT_TARGET: This variable is for pi_build.sh. It tells pi_build.sh which architecture to build for.
PUBLISH_TO: The repo to publish to for releases or snapshots. Valid values are github or ossrh. These are repositories defined in the deeplearning4j root pom.
OPENBLAS_PATH: We compile libnd4j against openblas for several different cpus. Openblas is manually downloaded and linked against. This specifies the path to the download for the libnd4j cmake invocation.
MAVEN_USERNAME: The user name used to log in to the ossrh maven repository
MAVEN_PASSWORD: The password used to log in to the ossrh maven repository
MAVEN_GPG_PASSPHRASE: The gpg password for signing artifacts for uploading to maven central
DEPLOY_TO: Valid values are either ossrh or github.
LIBND4J_BUILD_THREADS: This is the equivalent of make -j. It specifies the number of threads that should be used to compile libnd4j
PERFORM_RELEASE: Whether to perform a release or not (0 or 1)
RELEASE_VERSION: The version to be released to maven central. change-versions.sh will be run to change versions throughout the code base from the snapshot version to the intended release version.
SNAPSHOT_VERSION: The current snapshot version to be changed when performing a release. After a release is conducted, this should generally be the next development version.
RELEASE_REPO_ID: Leave this empty when first creating a release repository in combination with DEPLOY set to 1. Afterwards, note which staging repository id gets created in the ossrh interface when publishing to maven central. Use that id for further builds to ensure that all uploads for 1 version are synchronized to 1 staging repository.
MODULES: Extra maven flags for pi_build.sh if more flags are needed (such as for debugging or only building specific modules)
LIBND4J_URL: Used when building nd4j-native. If you do not want to recompile libnd4j for your particular build, you can instead skip this step and specify a libnd4j zip file download (generally built with the maven assembly plugin)

Javacpp

DL4J and Javacpp

DL4J and Javacpp overview

DL4J heavily depends on JavaCPP for its interop between java and platform optimized c++ libraries. However, due to our usage of JNI, this comes with certain complexities in the build that everyone should be aware of.

The following modules rely on javacpp as part of their build process:

1. nd4j-native
2. nd4j-native-presets
3. nd4j-cuda
4. nd4j-cuda-presets

Together, these libraries comprise our nd4j backends. Leveraging libnd4j, javacpp handles linking each nd4j backend against the libnd4j c++ codebase. This linking is done using a libnd4j home, which contains all of the include files and necessary binary files for specific platforms. By default, nd4j backends and the libnd4j code base are compiled within the same build step. This is the recommended default, but for specific circumstances a libnd4j release is also uploaded to maven central as a zip file and can be used in place of libnd4j compilation. See our GitHub Actions configuration overview (the libnd4jDownload/libnd4jUrl inputs) for more information on this.

Each backend consists of 2 modules

The codebase: This represents the actual nd4j backend logic for specific platforms. Conceptually, this logic will be anything that a developer should need to control such as memory management, environment variables, or other execution logic.
The presets: This is a similar concept in spirit to the javacpp presets. In order to avoid a race condition between the backend and the presets compilation, this is a separate dependency that just exists to handle interop between the libnd4j code base and the java frontend. The above backend then contains the rest of the logic needed for execution of the math operations on specific platforms.

Compilation flow

After a libnd4j build is executed for a specific platform, we need to leverage javacpp to actually link against libnd4j to create a complete nd4j backend. When invoking a maven build, the javacpp maven plugin is used to actually invoke a build. The presets will be compiled first. Generally the presets are just 1 or 2 classes containing a description of how to map the actual nd4j code base to the libnd4j codebase.

Next, the actual backend is compiled with a dependency on the above presets code base. The javacpp plugin will leverage the description from the presets we specify as a dependency and facilitate linking against a LIBND4J_HOME (a folder which contains the platform specific libnd4j binaries and include sources) specified by the user. In the actual plugin declaration on the backend pom.xml we include the target presets class to use for our particular backend.

Note: This still requires the native platform specific tools to be installed since binaries are generated for each platform. Please see our github actions for instructions on specific platforms.

-platform dependencies

Nd4j reuses javacpp's notion of a -platform library. This is a curated set of dependencies most users will use as part of a build. Each backend will have an associated -platform artifact so users don't have to deal with maven classifiers. See the javacpp documentation for how to leverage this artifact.

Caution to users: By default, this means that a large number of dependencies for all platforms will be included. If you do not need dependencies for all platforms, then please read the above documentation to figure out how to build a jar for your specific platform.

Generally, the main thing to know is when you build your application, use:
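JavaCPP's -platform artifacts can be restricted to a single platform with the javacpp.platform system property. The sketch below shows that property on a Maven command line; the platform value and the Maven goals are assumptions chosen for illustration, and the script only prints the command rather than running a build.

```shell
# Platform to keep native binaries for (assumed value; pick the one you deploy to).
PLATFORM="linux-x86_64"

# javacpp.platform limits the -platform dependencies to that one platform.
CMD="mvn clean package -Djavacpp.platform=${PLATFORM}"

# Dry run: print the command so it can be reviewed before executing it.
echo "$CMD"
```

Other commonly used platform names follow the same os-architecture pattern, for example windows-x86_64 or macosx-x86_64.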

Javacpp platform specific profiles

Running javacpp on termux + android/lineageos

In order to bootstrap this environment, a from scratch install of the latest lineageos flashed on an sd card using the raspberry pi is suggested.

Afterwards, install

In order to properly set up the test environment, you need to execute your test from the command line as follows:

A proper execution environment after the above jdk is installed involves manually setting the environment as follows:

This will setup the jdk + maven to ignore ssl errors due to issues with cacerts + termux. This is largely irrelevant for our small testing use case, but not recommended for production environments.

Redist artifacts

Redist artifacts are easy ways of distributing dependencies without installation.

Note that for the presets that are part of nd4j (nd4j-cuda-presets and nd4j-native-presets) only the latest versions support redist artifacts. The presets preload versions only support pre loading (eg: linking against libraries from the javacpp cache) against the latest version. This is because during pre loading, certain version numbers are checked for.

Release

How to conduct a release to Maven Central

Deeplearning4j has several steps to a release. Below is a brief outline with follow on descriptions.

Compile libnd4j for different cpu architectures
Ensure the current javacpp dependencies such as python, mkldnn, cuda, .. are up to date
Run all integration tests on core platforms (windows, mac, linux) with both cpu and gpu
Create a staging repository for testing using github actions running manually on each platform
Update the examples to be compatible with the latest release
Run the deeplearning4j-examples as a litmus test on all platforms (including embedded) to sanity check platform specific numerical bugs using the staging repository
Double check any user related bugs to see if they should block a release
Hit release button
Perform follow up release of -platform projects under same version
Tag release

Compile libnd4j on different cpu architectures

Compiling libnd4j on different cpu architectures ensures there is platform optimized math in c++ for each platform. The single code base is a self contained cmake project that can be run on different platforms. In each github actions workflow there are steps for deploying for each platform.

At the core of compiling libnd4j from source is a maven pom.xml that runs as part of the overall build process. It invokes our build script with various parameters, which are then passed to our overall cmake structure for compilation. This script exists to formalize some of the required parameters for invoking cmake. Any developer is welcome to invoke cmake directly.

Platform compatibility

We currently compile libnd4j on ubuntu 16.04. This means glibc 2.23. For our cuda builds, we use gcc7. Users of older glibc versions may need to compile from source. For our standard release, we try to keep the build platform reasonably old, but we do not support end-of-life linux distributions for public builds.

Platform specific helpers

Each build of libnd4j links against an accelerated backend for blas and convolution operations such as onednn, cudnn, or armcompute. The implementations for each platform can be found in the libnd4j code base.

Ensure the current javacpp dependencies such as python, mkldnn, cuda, .. are up to date

This is a step that just ensures that the dl4j release matches the current state of the dependencies provided by javacpp on maven central. This affects every module including python4j, nd4j-native/cuda, datavec-image, among others. The versions of everything can be found in the top level deeplearning4j pom. The general convention is the library version followed by a - and the version of javacpp that the library version uses.

Of note here is that certain older versions of libraries can use older javacpp versions. It is recommended that the desired version be up to date if possible. Otherwise, if an older version of javacpp is the only version available, this is generally ok.

Run all integration tests on core platforms (windows, mac, linux) with both cpu and gpu

We run all of the major integration tests on the core major platforms where higher end compute is accessible. This is generally a bigger machine. It is expected that some builds can take up to 2 hours depending on the specs of the desired machine.

This step may also involve invoking tests with specific tags if only running a subset of tests is desired. This can be achieved using the surefire plugin -Dgroups flag.
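As a rough sketch of tag-based selection, the command below passes a tag expression to surefire through -Dgroups. The tag name is a hypothetical placeholder, not one of the project's real tag names, and the choice of the nd4j-tests-cpu profile is an assumption. The script prints the command instead of running the tests.

```shell
# Hypothetical tag name, for illustration only; substitute a real tag from the code base.
TAGS="example-tag"

# Backend profile to test against (assumed here to be the cpu profile).
PROFILE="nd4j-tests-cpu"

# surefire runs only the tests whose JUnit 5 tags match the groups expression.
CMD="mvn test -P${PROFILE} -Dgroups=${TAGS}"

# Dry run: print the command so it can be reviewed before executing it.
echo "$CMD"
```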

Update the examples to be compatible with the latest release

To ensure the examples stay compatible with the current release, we also tag the release version to be the latest version found on maven central. This step may also involve adding or removing examples for new or deprecated features respectively.

Ensure different classifiers work

Different supported cuda versions with and without cudnn
Onednn and associated classifiers per platform

Android

Ensure testing happens on the android emulator.

Run the deeplearning4j-examples as a litmus test on all platforms (including embedded)

The examples contain a set of tests which allow us to run mvn clean test on a small number of examples. Instead of picking examples manually, we can run mvn clean test on any platform we need by specifying a version of dl4j to depend on and, usually, a staging repository.

Double check any user related bugs to see if they should block a release

Users will sometimes raise issues right before a release that can be critical. It is at the sole discretion of the maintainers to ask the user to use snapshots or to wait for a follow on version. For certain fixes, we will publish quick bugfix releases. If your team has specific requirements on a release, please contact us on the community forums.

Hit release button

This means after closing a staging repository, hitting the release button initiating a sync of the staging repository with the desired version to maven central. Sync usually takes 2 hours or less.

Ensure a tag exists

After a release happens, a version update to the stable version + a github tag needs to happen. This is achieved in the desktop app as follows:

1. Go to History
2. Right click on the target commit you want to tag
3. Click tag
4. Push the revision
5. Update the version back to snapshot after tagging
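For those who prefer the command line over the desktop app, a roughly equivalent tagging flow can be sketched with plain git. The tag name below is a hypothetical placeholder (the project's tag naming convention is not stated here), and the scratch repository only exists to make the example runnable on its own.

```shell
# Scratch repository standing in for a checkout at the release commit.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Maintainer"        # placeholder identity
git config user.email "maintainer@example.com"   # placeholder email
echo "release" > VERSION
git add VERSION
git commit -q -m "Update to stable version"

# Hypothetical tag name, for illustration only.
TAG="v0.0.0-example"

# Create an annotated tag on the current commit.
git tag -a "$TAG" -m "Release $TAG"

# List tags to confirm it exists.
git tag --list

# In a real checkout the tag would then be pushed, e.g.:
#   git push origin "$TAG"
```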

Testing


Parameters for testing

test.heap.size: The heap size used for maven surefire plugin sub processes
test.offheap.size: The off heap size used for maven surefire sub processes. This is very important for configuration (especially on gpu systems)

Test resources

In order to run the deeplearning4j tests, many pretrained models and other resources are required. Ensure the dl4j test resources are a dependency on your classpath. The test resources live in a big repository that needs to be installed with mvn clean install in order to run the tests properly. You can then add them by passing -Ptestresources to your test execution when running the tests from maven.

Test profiles for enabling nd4j backends

When running deeplearning4j's tests, there are 2 main profiles to be aware of: nd4j-tests-cpu and nd4j-tests-cuda. These each enable running cpu or gpu tests respectively across the whole code base. Please ensure one of these is selected when running tests.

testresources: Used to add the test resources used for nd4j.
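Putting the profiles and parameters above together, a test invocation might look like the sketch below. The heap values are assumptions for illustration (the accepted value format for test.offheap.size is assumed to match test.heap.size), and the script prints the command rather than launching the test suite.

```shell
# Backend profile: nd4j-tests-cpu or nd4j-tests-cuda.
BACKEND_PROFILE="nd4j-tests-cpu"

# Assumed sizes, for illustration only; tune them for your machine.
HEAP_SIZE="4g"
OFFHEAP_SIZE="4g"

# Enable the backend profile and the test resources, and size the surefire sub processes.
CMD="mvn clean test -P${BACKEND_PROFILE},testresources -Dtest.heap.size=${HEAP_SIZE} -Dtest.offheap.size=${OFFHEAP_SIZE}"

# Dry run: print the command so it can be reviewed before executing it.
echo "$CMD"
```

Switching BACKEND_PROFILE to nd4j-tests-cuda selects the gpu tests instead.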

Test categories

Deeplearning4j uses junit 5's tags to categorize tests into different types. All of the tag names used throughout the code base can be found in the nd4j-common-tests module. Nd4j-common-tests is included as a dependency for all tests and has a few reusable utilities used throughout the code base for tests. This makes it a great location to put common utilities we want to use throughout the code base. The tag names are mainly there to categorize tests that can take longer or use more resources, so we can avoid running those dynamically depending on the size of the machine we are running tests on.

GPUs and multi threaded boxes

Note that when running gpu tests on a box with more than 1 gpu, the tests can/will run out of memory if test.heap.size is not at least 4g. Also of note, is when running tests

Build From Source

Instructions to build all DL4J libraries from source.

A reference for building dl4j from source can be found for every platform in our workflows. For maintenance reasons, we would prefer to have a canonical source of up to date build information for users rather than out of date install instructions in this guide. This guide will contain specific long lived tips for how to interpret the workflows and what to consider when building.

For an overview of the GitHub actions workflows see the overview doc

This document will cover the specific components of the build by platform rather than step through what's already in the workflows. If you have suggestions for improving this document, please comment over at the community forums.

Core steps:

Building libnd4j for your specific platform
Linking the nd4j backend you want to compile for against libnd4j via JavaCPP
Compiling the rest of the code in to jar files

Key concepts

Libnd4j is a CMake based c++ project that supports running optimized math code on different architectures. Its sole focus is being a tiny self contained library for running math kernels. It can link against optimized BLAS routines, platform specific CNN libraries such as OneDNN and CuDNN, and contains hundreds of math kernels for implementing neural networks and other math routines.
Maven: Maven is the core build tool for deeplearning4j. Understanding maven is key to building deeplearning4j from source
Maven and CMake: For compiling libnd4j, we invoke a buildnativeoperations.sh wrapper script via maven. buildnativeoperations.sh in turn automatically sets up CMake to then build the c++ project
pi_build.sh: This is our build script for embedded and ARM based platforms. It focuses on cross compilation running on a Linux x86 based platform.
buildnativeoperations.sh: The main build script for libnd4j. It initializes CMake and invokes CMake compilation for the user on whatever platform the user is currently on unless the user specifies an alternative platform. Specifying a different platform is possible for android for example.

Building for x86_64

The main considerations for building on x86_64 are:

Whether to compile for avx2 or avx512
Whether to use OpenBLAS or MKL
Whether to link against OneDNN

From there, the normal platform specific libraries should be installed beforehand. Up to date install instructions can be found in our CPU builds for Windows, Mac and Linux.

Building for ARM

ARM based builds all link against the armcompute library by default and, as mentioned above, use the pi_build.sh script for building libnd4j on specific platforms. Note that pi_build.sh can also be used to compile all of dl4j for a specific project.

pi_build.sh mainly focuses on cross compilation.

In order to properly use the pi_build.sh script, a number of environment variables should be set. Per platform, you can find these environment variables in the final build step under the environment section.

If you would like to compile deeplearning4j on an actual ARM device, please use the normal buildnativeoperations.sh workflow.

Building for CUDA

In order to compile deeplearning4j for a particular CUDA version, you must first invoke change-cuda-versions.sh in the root directory:

./change-cuda-versions.sh $YOUR_CUDA_VERSION

This will ensure that all library versions are set to the appropriate version. Ensure that the CUDA toolkit you need is installed. If you intend on using CuDNN, ensure that is also installed correctly. For installing CUDA, consider using our install scripts as a reference if you intend on doing automated installs.

Jetson nano users: please see this thread for successfully compiling deeplearning4j on Jetson nano.

In short: it relies on CUDA 10.0. The JavaCPP presets for CUDA are also only compiled for arm64 for CUDA 10.0. You can find the supported CUDA versions for CUDA 10.0 here. If you would like something more up to date, please feel free to contact us over at our forums. As of 1.0.0-M1.1 you can also use updated dependencies:

<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-cuda-10.2</artifactId>
  <version>1.0.0-M1.1</version>
</dependency>

Note for windows users

We use msys2 for compiling libnd4j. CUDA requires MSVC to be installed in order to properly compile CUDA kernels. If you want to compile libnd4j for CUDA from source, please ensure you first invoke the vcvars.bat script in a cmd terminal, then launch msys2 manually. For more specifics, please see our Windows CUDA 11 and 11.2 build files.

Benchmark

General guidelines for benchmarking in DL4J and ND4J.

General Benchmarking Guidelines

Guideline 1: Run Warm-Up Iterations Before Benchmarking

A warm-up period is where you run a number of iterations (for example, a few hundred) of your benchmark without timing, before commencing timing for further iterations.

Why is a warm-up required? The first few iterations of any ND4J/DL4J execution may be slower than those that come later, for a number of reasons:

In the initial benchmark iterations, the JVM has not yet had time to perform just-in-time compilation of code. Once JIT has completed, code is likely to execute faster for all subsequent operations.
ND4J and DL4J (and some other libraries) have some degree of lazy initialization: the first operation may trigger some one-off execution code.
DL4J or ND4J (when using workspaces) can take some iterations to learn memory requirements for execution. During this learning phase, performance will be lower than after its completion.

Guideline 2: Run Multiple Iterations of All Benchmarks

Your benchmark isn't the only thing running on your computer (not to mention if you are using cloud hardware, that might have shared resources). And operation runtime is not perfectly deterministic.

For benchmark results to be reliable, it is important to run multiple iterations - and ideally report both mean and standard deviation for the runtime. Without this, it's impossible to compare the performance of operations, as performance differences may simply be due to random variation.
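Guidelines 1 and 2 can be combined in a small timing harness. The following is a minimal plain-Java sketch, not part of DL4J or ND4J: the class name BenchmarkHarness, its methods, and the stand-in workload are all hypothetical. It runs untimed warm-up iterations first, then times each remaining iteration and reports the mean and sample standard deviation.

```java
final class BenchmarkHarness {

    /** Summary of the timed iterations, in nanoseconds. */
    static final class Result {
        final double meanNanos;
        final double stdDevNanos;
        final int iterations;

        Result(double meanNanos, double stdDevNanos, int iterations) {
            this.meanNanos = meanNanos;
            this.stdDevNanos = stdDevNanos;
            this.iterations = iterations;
        }
    }

    /** Runs untimed warm-up iterations, then times each remaining iteration separately. */
    static Result run(Runnable op, int warmupIterations, int timedIterations) {
        for (int i = 0; i < warmupIterations; i++) {
            op.run(); // not timed: lets JIT compilation and lazy initialization finish
        }
        long[] samples = new long[timedIterations];
        for (int i = 0; i < timedIterations; i++) {
            long start = System.nanoTime();
            op.run();
            samples[i] = System.nanoTime() - start;
        }
        return summarize(samples);
    }

    /** Mean and sample standard deviation (n - 1 denominator) of the samples. */
    static Result summarize(long[] samples) {
        double sum = 0;
        for (long s : samples) sum += s;
        double mean = sum / samples.length;
        double squaredDiffs = 0;
        for (long s : samples) squaredDiffs += (s - mean) * (s - mean);
        double stdDev = samples.length > 1
                ? Math.sqrt(squaredDiffs / (samples.length - 1))
                : 0.0;
        return new Result(mean, stdDev, samples.length);
    }

    public static void main(String[] args) {
        double[] data = new double[100_000];
        Result r = run(() -> {
            for (int i = 0; i < data.length; i++) data[i] += 1.0; // stand-in workload
        }, 200, 1_000);
        System.out.printf("mean = %.1f ns, std dev = %.1f ns, n = %d%n",
                r.meanNanos, r.stdDevNanos, r.iterations);
    }
}
```

To benchmark an ND4J or DL4J operation, the stand-in workload would be replaced with the operation of interest. For serious measurements, a dedicated harness such as JMH is usually a better choice than hand-rolled timing loops.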

Guideline 3: Pay Careful Attention to What You Are Benchmarking

This is especially important when comparing frameworks. Before you declare that "performance on operation X is Y" or "A is faster than B", make sure that:

You are benchmarking only the operations of interest.

If your goal is to check the performance of an operation, make sure that only this operation is being timed.

You should carefully check whether you are unintentionally including other things - for example, does it include: JVM initialization time? Library initialization time? Result array allocation time? Garbage collection time? Data loading time?

Ideally, these should be excluded from any timing/performance results you report. If they cannot be excluded, make sure you note this whenever making performance claims.

What native libraries are you using?
For example: what BLAS implementation (MKL, OpenBLAS, etc)? If you are using CUDA, are you using CuDNN? ND4J and DL4J can use these libraries (MKL, CuDNN) when they are available - but are not always available by default. If they are not made available, performance can be lower - sometimes considerably.
This is especially important when comparing results between libraries: for example, if you compared two libraries (one using OpenBLAS, another using MKL) your results may simply reflect the performance differences in the BLAS library being used - and not the performance of the libraries being tested. Similarly, one library with CuDNN and another without CuDNN may simply reflect the performance benefit of using CuDNN.
How are things configured?
For better or worse, DL4J and ND4J allow a lot of configuration. The default values for a lot of this configuration are adequate for most users - but sometimes manual configuration is required for optimal performance. This can be especially true in some benchmarks! Some of these configuration options allow users to trade off higher memory use for better performance, for example. Some configuration options of note: (a) Memory configuration (b) Workspaces and garbage collection (c) CuDNN (d) DL4J Cache Mode (enable using .cacheMode(CacheMode.DEVICE))

If you aren't sure if you are only measuring what you intend to measure when running DL4J or ND4J code, you can use a profiler such as VisualVM or YourKit Profilers.

What versions are you using? When benchmarking, you should use the latest version of whatever libraries you are benchmarking. There's no point identifying and reporting a bottleneck that was fixed 6 months ago. An exception to this would be when you are comparing performance over time between versions. Note also that snapshot versions of DL4J and ND4J are available - these may contain performance improvements (feel free to ask).

Guideline 4: Focus on Real-World Use Cases - And Run a Range of Sizes

Consider, for example, a benchmark that adds two numbers:

double x = 0;
//<start timing>
x += 1.0;
//<end timing>

And something equivalent in ND4J:

INDArray x = Nd4j.create(1);
//<start timing>
x.addi(1.0);
//<end timing>

Of course, the ND4J benchmark above is going to be much slower - method calls are required, input validation is performed, native code has to be called (with context switching overhead), and so on. One must ask the question, however: is this what users will actually be doing with ND4J or an equivalent linear algebra library? It's an extreme example - but the general point is a valid one.

Note also that performance on mathematical operations can be size - and shape - specific. For example, if you are benchmarking the performance on matrix multiplication - the matrix dimensions can matter a lot. In some internal benchmarks, we found that different BLAS implementations (MKL vs OpenBLAS) - and different backends (CPU vs GPU) - can perform very differently with different matrix dimensions. None of the BLAS implementations (OpenBLAS, MKL, CUDA) we have tested internally were uniformly faster than others for all input shapes and sizes.

Therefore - whenever you are running benchmarks, it's important to run those benchmarks with multiple different input shapes/sizes, to get the full performance picture.
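A size sweep along these lines might look like the following plain-Java sketch. It deliberately uses a naive matrix multiply instead of ND4J or a BLAS library, so the absolute numbers say nothing about ND4J. The class SizeSweep, its methods, and the chosen sizes are hypothetical. The point is the structure: the same operation is timed at several sizes, with warm-up, and with input and result arrays allocated outside the timed region.

```java
import java.util.Arrays;

final class SizeSweep {

    /** Naive row-major multiply: c (n x m) = a (n x k) * b (k x m). Overwrites c. */
    static void multiply(double[] a, double[] b, double[] c, int n, int k, int m) {
        Arrays.fill(c, 0.0);
        for (int i = 0; i < n; i++) {
            for (int p = 0; p < k; p++) {
                double aip = a[i * k + p];
                for (int j = 0; j < m; j++) {
                    c[i * m + j] += aip * b[p * m + j];
                }
            }
        }
    }

    /** Mean time in milliseconds for one square multiply of the given size. */
    static double meanMillis(int size, int warmup, int iterations) {
        double[] a = new double[size * size];
        double[] b = new double[size * size];
        double[] c = new double[size * size]; // allocated outside the timed region
        Arrays.fill(a, 1.0);
        Arrays.fill(b, 2.0);
        for (int i = 0; i < warmup; i++) multiply(a, b, c, size, size, size);
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) multiply(a, b, c, size, size, size);
        long elapsed = System.nanoTime() - start;
        // Reading the result keeps the work observable and sanity-checks it.
        if (c[0] != 2.0 * size) throw new IllegalStateException("unexpected result");
        return elapsed / 1e6 / iterations;
    }

    public static void main(String[] args) {
        for (int size : new int[]{16, 64, 256}) {
            System.out.printf("size %4d: %.4f ms per multiply%n",
                    size, meanMillis(size, 5, 20));
        }
    }
}
```

In a real benchmark the sweep would also cover non-square shapes, since performance can depend on shape as well as size.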

Guideline 5: Understand Your Hardware

When comparing different hardware, it's important to be aware of what it excels at. For example, you might find that neural network training performs faster on a CPU with minibatch size 1 than on a GPU - yet larger minibatch sizes show exactly the opposite. Similarly, small layer sizes may not be able to adequately utilize the power of a GPU.

Furthermore, some deep learning distributions may need to be specifically compiled to provide support for hardware features such as AVX2 (note that recent versions of ND4J are packaged with binaries for CPUs that support these features). When running benchmarks, the utilization (or lack thereof) of these features can make a considerable difference to performance.

Guideline 6: Make It Reproducible

When running benchmarks, it's important to make your benchmarks reproducible. Why? Good or bad performance may only occur under certain limited circumstances.

And finally - remember that (a) ND4J and DL4J are in constant development, and (b) benchmarks do sometimes identify performance bottlenecks (after all, ND4J includes literally hundreds of distinct operations). If you identify a performance bottleneck, great - we want to know about it - so we can fix it. Any time a potential bottleneck is identified, we first need to reproduce it - so that we can study it, understand it and ultimately fix it.

Guideline 7: Understand the Limitations of Your Benchmarks

Linear algebra libraries contain hundreds of distinct operations. Neural network libraries contain dozens of layer types. When benchmarking, it's important to understand the limitations of those benchmarks. Benchmarking one type of operation or layer cannot tell you anything about the performance on other types of layers or operations - unless they share code that has been identified to be a performance bottleneck.

Guideline 8: If You Aren't Sure - Ask

The DL4J/ND4J developers are available on Discourse. You can ask questions about benchmarking and performance there: https://community.konduit.ai/c/dl4j

And if you do happen to find a performance issue - let us know!

ND4J Specific Benchmarking

A Note on BLAS and Array Orders

BLAS - or Basic Linear Algebra Subprograms - refers to an interface and set of methods used for linear algebra operations. Some examples include 'gemm' - General Matrix Multiplication - and 'axpy', which implements Y = a*X + Y.
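To make the semantics concrete, here is a plain-Java reference version of axpy. It is only an illustration of what the routine computes (y is updated in place to a*x + y, per the standard BLAS definition); it is not how ND4J or any optimized BLAS library implements it, and the class name AxpyReference is hypothetical.

```java
import java.util.Arrays;

final class AxpyReference {

    /** axpy: y <- a * x + y, computed in place on y. */
    static void axpy(double a, double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        for (int i = 0; i < x.length; i++) {
            y[i] += a * x[i];
        }
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {10.0, 20.0, 30.0};
        axpy(2.0, x, y);
        System.out.println(Arrays.toString(y)); // [12.0, 24.0, 36.0]
    }
}
```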

ND4J can use multiple BLAS implementations - versions up to and including 1.0.0-beta6 have defaulted to OpenBLAS. However, if Intel MKL (free versions are available here) is installed and available, ND4J will link with it for improved performance in many BLAS operations.

Note that ND4J will log the BLAS backend used when it initializes. For example:

14:17:34,169 INFO  ~ Loaded [CpuBackend] backend
14:17:34,672 INFO  ~ Number of threads used for NativeOps: 8
14:17:34,823 INFO  ~ Number of threads used for BLAS: 8
14:17:34,831 INFO  ~ Backend used: [CPU]; OS: [Windows 10]
14:17:34,831 INFO  ~ Cores: [16]; Memory: [7.1GB];
14:17:34,831 INFO  ~ Blas vendor: [OPENBLAS]

Performance can depend on the available BLAS library - in internal tests, we have found that OpenBLAS has been between 30% faster and 8x slower than MKL - depending on the array sizes and array orders.

Regarding array orders, this also matters for performance. ND4J has the possibility of representing arrays in either row major ('c') or column major ('f') order. See this Wikipedia page for more details. Performance in operations such as matrix multiplication - but also more general ND4J operations - depends on the input and result array orders.

For matrix multiplication, this means there are 8 possible combinations of array orders (c/f for each of input 1, input 2 and result arrays). Performance won't be the same for all cases.

Similarly, an operation such as element-wise addition (i.e., z=x+y) will be much faster for some combinations of input orders than others - notably, when x, y and z are all the same order. In short, this is due to memory striding: it's cheaper to read a sequence of memory addresses when those memory addresses are adjacent to each other in memory, as compared to being spread far apart.
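The striding effect can be seen by computing where elements land in the underlying buffer. The sketch below is plain Java and does not use ND4J; the class ArrayOrderDemo and its methods are hypothetical. It prints the buffer offsets visited when walking along the first row of a 3 x 4 matrix in each order.

```java
final class ArrayOrderDemo {

    /** Buffer offset of (row, col) in a rows x cols matrix stored in row-major ('c') order. */
    static int offsetC(int row, int col, int rows, int cols) {
        return row * cols + col;
    }

    /** Buffer offset of (row, col) in a rows x cols matrix stored in column-major ('f') order. */
    static int offsetF(int row, int col, int rows, int cols) {
        return col * rows + row;
    }

    public static void main(String[] args) {
        int rows = 3, cols = 4;
        StringBuilder c = new StringBuilder("c order:");
        StringBuilder f = new StringBuilder("f order:");
        for (int col = 0; col < cols; col++) {
            c.append(' ').append(offsetC(0, col, rows, cols));
            f.append(' ').append(offsetF(0, col, rows, cols));
        }
        System.out.println(c); // c order: 0 1 2 3
        System.out.println(f); // f order: 0 3 6 9
    }
}
```

Walking along a row touches adjacent addresses in 'c' order but jumps by the number of rows in 'f' order. An element-wise operation that reads one array in 'c' order and another in 'f' order therefore cannot read both with unit stride.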

Note that, by default, ND4J expects result arrays (for matrix multiplication) to be defined in column major ('f') order, to be consistent across backends, given that CuBLAS (i.e., NVIDIA's BLAS library for CUDA) requires results to be in f order. As a consequence, some ways of performing matrix multiplication with the result array being in c order will have lower performance than if the same operation was executed with an 'f' order array.

Finally, when it comes to CUDA: array orders/striding can matter even more than when running on CPU. For example, certain combinations of orders can be much faster than others - and input/output dimensions that are even multiples of 32 or 64 typically perform faster (sometimes considerably) than when input/output dimensions are not multiples of 32.

DL4J Specific Benchmarking

Most of what has been said for ND4J also applies to DL4J.

In addition:

If you are using the nd4j-native (CPU) backend, ensure you are using Intel MKL. This is faster than the default of OpenBLAS in most cases.
If you are using CUDA, ensure you are using CuDNN
Check the Workspaces and Memory guides. The defaults are usually good - but sometimes better performance can be obtained with some tweaking. This is especially important if you have a lot of Java objects (such as, Word2Vec vectors) in memory while training.
Watch out for ETL bottlenecks. You can add PerformanceListener to your network training to see if ETL is a bottleneck.
Don't forget that performance is dependent on minibatch sizes. Don't benchmark with minibatch size 1 - use something more realistic.
If you need multi-GPU training or inference support, use ParallelWrapper or ParallelInference.
Don't forget that CuDNN is configurable: you can specify DL4J/CuDNN to prefer performance - at the expense of memory - using .cudnnAlgoMode(ConvolutionLayer.AlgoMode.PREFER_FASTEST) configuration on convolution layers
When using GPUs, multiples of 8 (or 32) for input sizes and layer sizes may perform better.
When using RNNs (and manually creating INDArrays), use 'f' ordered arrays for both features and (RnnOutputLayer) labels. Otherwise, use 'c' ordered arrays. This is for faster memory access.

Common Benchmark Mistakes

Finally, here's a summary list of common benchmark mistakes:

Not using the latest version of ND4J/DL4J (there's no point identifying a bottleneck that was fixed many releases back). Consider trying snapshots to get the latest performance improvements.
Not paying attention to what native libraries (MKL, OpenBLAS, CuDNN etc) are being used
Providing no warm-up period before benchmarking begins
Running only a single (or too few) iterations, or not reporting mean, standard deviation and number of iterations
Not configuring workspaces, garbage collection, etc
Running only one possible case - for example, benchmarking a single set of array dimensions/orders when benchmarking BLAS operations
Running unusually small inputs - for example, minibatch size 1 on a GPU (which might be slower - but isn't realistic!)
Not measuring exactly - and only - what you claim to be measuring (for example, not accounting for array allocation, initialization or garbage collection time)
Not making your benchmarks reproducible (does the benchmark conclusion generalize? are there problems with the benchmark? what can we do to fix it?)
Comparing results across different hardware, not accounting for differences (for example, testing on one machine with AVX2 support, and on another without)
Not asking the devs (via Discourse) - we are happy to provide suggestions and investigate if performance isn't where it should be!

How to Run Deeplearning4j Benchmarks - A Guide

Total training time is always ETL plus computation. That is, both the data pipeline and the matrix manipulations determine how long a neural network takes to train on a dataset.

When programmers familiar with Python try to run benchmarks comparing Deeplearning4j to well-known Python frameworks, they usually end up comparing ETL + computation on DL4J to just computation on the Python framework. That is, they're comparing apples to oranges. We'll explain how to optimize several parameters below.
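One way to see how total time splits between ETL and computation is to time the two separately. The sketch below is plain Java with hypothetical names (TimedIterator, etlNanos) and a stand-in workload. It wraps any iterator and accumulates the time spent inside hasNext() and next(), which is where data loading happens in an iterator-based pipeline.

```java
import java.util.Arrays;
import java.util.Iterator;

final class TimedIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private long etlNanos; // total time spent waiting for data

    TimedIterator(Iterator<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean hasNext() {
        long start = System.nanoTime();
        try {
            return delegate.hasNext();
        } finally {
            etlNanos += System.nanoTime() - start;
        }
    }

    @Override
    public T next() {
        long start = System.nanoTime();
        try {
            return delegate.next();
        } finally {
            etlNanos += System.nanoTime() - start;
        }
    }

    long etlNanos() {
        return etlNanos;
    }

    public static void main(String[] args) {
        TimedIterator<double[]> batches = new TimedIterator<>(
                Arrays.asList(new double[1000], new double[1000]).iterator());
        long computeNanos = 0;
        double sum = 0;
        while (batches.hasNext()) {
            double[] batch = batches.next();
            long start = System.nanoTime();
            for (double v : batch) sum += v * v; // stand-in for training on one batch
            computeNanos += System.nanoTime() - start;
        }
        System.out.printf("ETL: %d ns, compute: %d ns (checksum %.1f)%n",
                batches.etlNanos(), computeNanos, sum);
    }
}
```

Reporting the two numbers side by side makes it clear whether a comparison against another framework is measuring computation alone or ETL plus computation.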

The JVM has knobs to tune, and if you know how to tune them, you can make it a very fast environment for deep learning. There are several things to keep in mind on the JVM. You need to:

Increase the heap space
Get garbage collection right
Make ETL asynchronous
Presave datasets (aka pickling)

Setting Heap Space

Users have to reconfigure their JVMs themselves, including setting the heap space. We can't give it to you preconfigured, but we can show you how to do it. Here are the two most important knobs for heap space.

Xms sets the minimum heap space
Xmx sets the maximum heap space

You can set these in IDEs like IntelliJ and Eclipse, as well as via the CLI like so:

    java -Xms256m -Xmx1024m YourClassNameHere

In IntelliJ, this is a VM parameter, not a program argument. When you hit run in IntelliJ (the green button), that sets up a run-time configuration. IJ starts a Java VM for you with the configurations you specify.

What’s the ideal amount to set Xmx to? That depends on how much RAM is on your computer. In general, allocate as much heap space as you think the JVM will need to get work done. Let’s say you’re on a 16G RAM laptop — allocate 8G of RAM to the JVM. A sound minimum on laptops with less RAM would be 3g, so

    java -Xmx3g

It may seem counterintuitive, but you want the min and max to be the same; i.e. Xms should equal Xmx. If they are unequal, the JVM will progressively allocate more memory as needed until it reaches the max, and that process of gradual allocation slows things down. You want to pre-allocate it at the beginning. So

    java -Xms3g -Xmx3g YourClassNameHere

IntelliJ will automatically specify the Java main class in question.
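To confirm that the heap settings actually reached the JVM, the limits can be printed from inside the program. This is a small sketch using the standard java.lang.Runtime API; the class name HeapCheck is hypothetical. The reported maximum may be slightly lower than the -Xmx value, depending on the garbage collector in use.

```java
final class HeapCheck {

    /** Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes). */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // Upper bound the heap may grow to; governed by -Xmx.
        System.out.println("Max heap:       " + toMegabytes(rt.maxMemory()) + " MB");
        // Heap currently reserved by the JVM; starts near -Xms.
        System.out.println("Committed heap: " + toMegabytes(rt.totalMemory()) + " MB");
        // Unused portion of the committed heap.
        System.out.println("Free heap:      " + toMegabytes(rt.freeMemory()) + " MB");
    }
}
```

Running it as java -Xms3g -Xmx3g HeapCheck should show the maximum and committed values close to each other.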

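Another way to do this is by setting your environment variables. Here, you would alter your hidden .bash_profile file, which adds environment variables to bash. To see those variables, enter env in the command line. To add more heap space for Maven, enter this command in your console. Note that >> appends to the file, whereas a single > would overwrite it. The -XX:MaxPermSize flag seen in older guides only applies to Java 7 and earlier, where PermGen still existed, so it is omitted here.

    echo 'export MAVEN_OPTS="-Xmx512m"' >> ~/.bash_profile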

We need to increase heap space because Deeplearning4j loads data in the background, which means it uses more RAM. By allowing more heap space for the JVM, we can cache more data in memory.

Garbage Collection

A garbage collector is a program which runs on the JVM and gets rid of objects no longer used by a Java application. It is automatic memory management. Creating a new object in Java takes on-heap memory: every object carries a header of at least 8 bytes (typically 12 to 16 bytes on a 64-bit JVM) in addition to its fields. So every new DataSetIterator you create consumes heap memory.

You may need to alter the garbage collection algorithm that Java is using. This can be done via the command line like so:

    java -XX:+UseG1GC

Better garbage collection increases throughput. For a more detailed exploration of the issue, please read this InfoQ article.
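To check which collector the JVM is actually running, and how much time it has spent collecting, the standard java.lang.management API can be queried. The following is a minimal sketch; the class name GcInfo is hypothetical, and the collector names printed depend on the JVM and flags in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class GcInfo {

    /** Names of the garbage collectors active in this JVM. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Count and time are cumulative since JVM start; -1 means "not available".
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Printing these counters before and after a benchmark gives a rough idea of how much garbage collection happened inside the timed region.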

DL4J is tightly linked to the garbage collector. JavaCPP, the bridge between the JVM and C++, adheres to the heap space you set with Xmx and works extensively with off-heap memory. The off-heap memory will not surpass the amount of heap space you specify.

JavaCPP, created by a Skymind engineer, relies on the garbage collector to tell it what has been done. We rely on the Java GC to tell us what to collect; the Java GC points at things, and we know how to de-allocate them with JavaCPP. This applies equally to how we work with GPUs.

The larger the batch size you use, the more RAM you’re taking in memory.

ETL & Asynchronous ETL

In our dl4j-examples repo, we don't make the ETL asynchronous, because the point of examples is to keep them simple. But for real-world problems, you need asynchronous ETL, and we'll show you how to do it with examples.

Data is stored on disk and disk is slow. That’s the default. So you run into bottlenecks when loading data from your hard drive. When optimizing throughput, the slowest component is always the bottleneck. For example, a distributed Spark job using three GPU workers and one CPU worker will have a bottleneck with the CPU. The GPUs have to wait for that CPU to finish.

The Deeplearning4j class DataSetIterator hides the complexity of loading data from disk. The code for using any DataSetIterator is always the same and invoking it looks the same, but the implementations work differently:

one loads from disk
one loads asynchronously
one loads pre-saved from RAM

Here's how the DataSetIterator is uniformly invoked for MNIST:

        while(mnistTest.hasNext()){
                DataSet ds = mnistTest.next();
                INDArray output = model.output(ds.getFeatures(), false);
                eval.eval(ds.getLabels(), output);
        }

You can optimize by using an asynchronous loader in the background. Java can do real multi-threading. It can load data in the background while other threads take care of compute. So you load data into the GPU at the same time that compute is being run. The neural net trains even as you grab new data from memory.

This is the relevant code, in particular the third line:

    MultiDataSetIterator iterator;
    if (prefetchSize > 0 && source.asyncSupported()) {
        iterator = new AsyncMultiDataSetIterator(source, prefetchSize);
    } else iterator = source;

There are actually two types of asynchronous dataset iterators. The AsyncDataSetIterator is what you would use most of the time. It's described in the Javadoc here.

For special cases such as recurrent neural nets applied to time series, or for computation graphs, you would use an AsyncMultiDataSetIterator, described in the Javadoc here.

Notice in the code above that prefetchSize is another parameter to set. Normal batch size might be 1000 examples, but if you set prefetchSize to 3, it would pre-fetch 3,000 instances.
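The idea behind asynchronous prefetching can be sketched in plain Java with a bounded queue. This is not DL4J's AsyncDataSetIterator implementation; the class PrefetchingIterator is hypothetical and only illustrates the technique. A background thread pulls batches from the source and parks up to prefetchSize of them in a queue while the consumer is busy. With batches of 1,000 examples and a prefetchSize of 3, roughly 3 x 1,000 = 3,000 examples are buffered.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class PrefetchingIterator<T> implements Iterator<T> {
    private static final Object END = new Object(); // marks the end of the source
    private final BlockingQueue<Object> queue;
    private Object next; // element taken from the queue but not yet returned

    /** Assumes prefetchSize > 0, non-null elements, and a source that does not throw. */
    PrefetchingIterator(Iterator<T> source, int prefetchSize) {
        BlockingQueue<Object> q = new ArrayBlockingQueue<>(prefetchSize);
        this.queue = q;
        Thread loader = new Thread(() -> {
            try {
                while (source.hasNext()) {
                    q.put(source.next()); // blocks while the queue is full
                }
                q.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "prefetch-loader");
        loader.setDaemon(true);
        loader.start();
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            try {
                next = queue.take(); // blocks until the loader has produced something
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                next = END;
            }
        }
        return next != END;
    }

    @SuppressWarnings("unchecked")
    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        T value = (T) next;
        next = null;
        return value;
    }

    public static void main(String[] args) {
        List<String> batches = Arrays.asList("batch-0", "batch-1", "batch-2", "batch-3");
        Iterator<String> it = new PrefetchingIterator<>(batches.iterator(), 2);
        while (it.hasNext()) {
            System.out.println("training on " + it.next());
        }
    }
}
```

A larger prefetchSize smooths over slow loads at the cost of holding more batches in memory, which is one reason heap size and batch size have to be considered together.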

ETL: Comparing Python frameworks With Deeplearning4j

In Python, programmers are converting their data into pickles, or binary data objects. And if they're working with a smallish toy dataset, they're loading all those pickles into RAM. So they're effectively sidestepping a major task in dealing with larger datasets. At the same time, when benchmarking against DL4J, they're not loading all the data into RAM. So they're effectively comparing DL4J speed for training computations + ETL against only training computation time for Python frameworks.

But Java has robust tools for moving big data, and if compared correctly, is much faster than Python. The Deeplearning4j community has reported up to 3700% increases in speed over Python frameworks, when ETL and computation are optimized.

Deeplearning4j uses DataVec as its ETL and vectorization library. Unlike other deep-learning tools, DataVec does not force a particular format on your dataset. (Caffe forces you to use hdf5, for example.)

We try to be more flexible. That means you can point DL4J at raw photos, and it will load the image, run the transforms and put it into an NDArray to generate a dataset on the fly.

But if your training pipeline is doing that every time, Deeplearning4j will seem about 10x slower than other frameworks, because you’re spending your time creating datasets. Every time you call fit, you're recreating a dataset, over and over again. We allow it to happen for ease of use, but we can show you how to speed things up. There are ways to make it just as fast.

One way is to pre-save the datasets, in a manner similar to the Python frameworks. (Pickles are pre-formatted data.) When you pre-save the dataset, you create a separate class.

Here’s how you pre-save datasets.

A RecordReaderDataSetIterator talks to DataVec and outputs datasets for DL4J.

Here’s how you load a pre-saved dataset.

Line 90 is where you see the asynchronous ETL. In this case, it's wrapping the pre-saved iterator, so you're taking advantage of both methods, with the asynch loading the pre-saved data in the background as the net trains.

MKL and Inference on CPUs

If you are running inference benchmarks on CPUs, make sure you are using Deeplearning4j with Intel's MKL library, which is available via a clickwrap license. Deeplearning4j does not bundle MKL the way Anaconda, which is used by libraries like PyTorch, does.

Beginners

Road map for beginners new to deep learning.

How Do I Start Using Deep Learning?

Where you start depends on what you already know.

The prerequisites for really understanding deep learning are linear algebra, calculus and statistics, as well as programming and some machine learning. The prerequisites for applying it are just learning how to deploy a model.

In the case of Deeplearning4j, you should know Java well and be comfortable with tools like the IntelliJ IDE and the automated build tool Maven.

Below you'll find a list of resources. The sections are roughly organized in the order they will be useful.

Free Machine- and Deep-learning Courses Online

(For those interested in a survey of artificial intelligence.)
(For those interested in image recognition.)

Math

The math involved with deep learning is basically linear algebra, calculus and probability, and if you have studied those at the undergraduate level, you will be able to understand most of the ideas and notation in deep-learning papers. If you haven't studied those in college, never fear. There are many free resources available (and some on this website).

Programming

If you do not know how to program yet, you can start with Java, but you might find other languages easier. Python and Ruby resources can convey the basic ideas in a faster feedback loop. "Learn Python the Hard Way" and "Learn to Program (Ruby)" are two great places to start.

Python

Java

Once you have programming basics down, tackle Java, the world's most widely used programming language. Most large organizations in the world operate on huge Java code bases. (There will always be Java jobs.) The big data stack -- Hadoop, Spark, Kafka, Lucene, Solr, Cassandra, Flink -- has largely been written for Java's compute environment, the JVM.

Deeplearning4j

Other Resources

Benchmark

General guidelines for benchmarking in DL4J and ND4J.

General Benchmarking Guidelines

Guideline 1: Run Warm-Up Iterations Before Benchmarking

A warm-up period is where you run a number of iterations (for example, a few hundred) of your benchmark without timing, before commencing timing for further iterations.

Why is a warm-up required? The first few iterations of any ND4J/DL4J execution may be slower than those that come later, for a number of reasons:

In the initial benchmark iterations, the JVM has not yet had time to perform just-in-time compilation of code. Once JIT has completed, code is likely to execute faster for all subsequent operations
ND4J and DL4J (and, some other libraries) have some degree of lazy initialization: the first operation may trigger some one-off execution code.
DL4J or ND4J (when using workspaces) can take some iterations to learn memory requirements for execution. During this learning phase, performance will be lower than after its completion.

Guideline 2: Run Multiple Iterations of All Benchmarks

Your benchmark isn't the only thing running on your computer (not to mention if you are using cloud hardware, that might have shared resources). And operation runtime is not perfectly deterministic.

Guideline 3: Pay Careful Attention to What You Are Benchmarking

This is especially important when comparing frameworks. Before you declare that "performance on operation X is Y" or "A is faster than B", make sure that:

You are bench-marking only the operations of interest.

If your goal is to check the performance of an operation, make sure that only this operation is being timed.

Ideally, these should be excluded from any timing/performance results you report. If they cannot be excluded, make sure you note this whenever making performance claims.

What native libraries are you using?
For example: what BLAS implementation (MKL, OpenBLAS, etc)? If you are using CUDA, are you using CuDNN? ND4J and DL4J can use these libraries (MKL, CuDNN) when they are available - but are not always available by default. If they are not made available, performance can be lower - sometimes considerably.
This is especially important when comparing results between libraries: for example, if you compared two libraries (one using OpenBLAS, another using MKL) your results may simply reflect the performance differences it the BLAS library being used - and not the performance of the libraries being tested. Similarly, one library with CuDNN and another without CuDNN may simply reflect the performance benefit of using CuDNN.
How are things configured?
For better or worse, DL4J and ND4J allow a lot of configuration. The default values for a lot of this configuration is adequate for most users - but sometimes manual configuration is required for optimal performance. This can be especially true in some benchmarks! Some of these configuration options allow users to trade off higher memory use for better performance, for example. Some configuration options of note: (a) Memory configuration (b) Workspaces and garbage collection (c) CuDNN (d) DL4J Cache Mode (enable using .cacheMode(CacheMode.DEVICE))

If you aren't sure if you are only measuring what you intend to measure when running DL4J or ND4J code, you can use a profiler such as VisualVM or YourKit Profilers.

What versions are you using? When benchmarking, you should use the latest version of whatever libraries you are benchmarking. There's no point identifying and reporting a bottleneck that was fixed 6 months ago. An exception to this would be when you are comparing performance over time between versions. Note also that snapshot versions of DL4J and ND4J are also available - these may contain performance improvements (feel free to ask)

Guideline 4: Focus on Real-World Use Cases - And Run a Range of Sizes

Consider for example a benchmark a benchmark that adds two numbers:

double x = 0;
//<start timing>
x += 1.0;
//<end timing>

And something equivalent in ND4J:

INDArray x = Nd4j.create(1);
//<start timing>
x.addi(1.0);
//<end timing>

Therefore - whenever you are running benchmarks, it's important to run those benchmarks with multiple different input shapes/sizes, to get the full performance picture.

Guideline 5: Understand Your Hardware

Guideline 6: Make It Reproducible

When running benchmarks, it's important to make your benchmarks reproducible. Why? Good or bad performance may only occur under certain limited circumstances.

Guideline 7: Understand the Limitations of Your Benchmarks

Guideline 8: If You Aren't Sure - Ask

The DL4J/ND4J developers are available on discourse. You can ask questions about benchmarking and performance there: https://community.konduit.ai/c/dl4j

And if you do happen to find a performance issue - let us know!

ND4J Specific Benchmarking

A Note on BLAS and Array Orders

Note that ND4J will log the BLAS backend used when it initializes. For example:

14:17:34,169 INFO  ~ Loaded [CpuBackend] backend
14:17:34,672 INFO  ~ Number of threads used for NativeOps: 8
14:17:34,823 INFO  ~ Number of threads used for BLAS: 8
14:17:34,831 INFO  ~ Backend used: [CPU]; OS: [Windows 10]
14:17:34,831 INFO  ~ Cores: [16]; Memory: [7.1GB];
14:17:34,831 INFO  ~ Blas vendor: [OPENBLAS]

For matrix multiplication, this means there are 8 possible combinations of array orders (c/f for each of input 1, input 2 and result arrays). Performance won't be the same for all cases.
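To make the count concrete, the following plain-Java sketch enumerates the 2 x 2 x 2 = 8 order combinations. The class name ArrayOrderCombinations is hypothetical, and the sketch deliberately contains no ND4J calls: a real benchmark would allocate the three arrays with the listed orders and time the multiplication for each case.

```java
import java.util.ArrayList;
import java.util.List;

public class ArrayOrderCombinations {

    // Each entry is {order of input 1, order of input 2, order of result}.
    static List<char[]> combinations() {
        char[] orders = {'c', 'f'};
        List<char[]> result = new ArrayList<>();
        for (char first : orders) {
            for (char second : orders) {
                for (char out : orders) {
                    result.add(new char[]{first, second, out});
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (char[] combo : combinations()) {
            // A real benchmark would create the arrays with these orders here
            // and time the matrix multiplication for this specific case.
            System.out.printf("input1=%c input2=%c result=%c%n", combo[0], combo[1], combo[2]);
        }
    }
}
```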

DL4J Specific Benchmarking

Most of what has been said for ND4J also applies to DL4J.

In addition:

If you are using the nd4j-native (CPU) backend, ensure you are using Intel MKL. This is faster than the default of OpenBLAS in most cases.
If you are using CUDA, ensure you are using CuDNN.
Check the Workspaces and Memory guides. The defaults are usually good - but sometimes better performance can be obtained with some tweaking. This is especially important if you have a lot of Java objects (such as, Word2Vec vectors) in memory while training.
Watch out for ETL bottlenecks. You can add PerformanceListener to your network training to see if ETL is a bottleneck.
Don't forget that performance is dependent on minibatch sizes. Don't benchmark with minibatch size 1 - use something more realistic.
If you need multi-GPU training or inference support, use ParallelWrapper or ParallelInference.
Don't forget that CuDNN is configurable: you can specify DL4J/CuDNN to prefer performance - at the expense of memory - using .cudnnAlgoMode(ConvolutionLayer.AlgoMode.PREFER_FASTEST) configuration on convolution layers.
When using GPUs, multiples of 8 (or 32) for input sizes and layer sizes may perform better.
When using RNNs (and manually creating INDArrays), use 'f' ordered arrays for both features and (RnnOutputLayer) labels. Otherwise, use 'c' ordered arrays. This is for faster memory access.

Common Benchmark Mistakes

Finally, here's a summary list of common benchmark mistakes:

Not using the latest version of ND4J/DL4J (there's no point identifying a bottleneck that was fixed many releases back). Consider trying snapshots to get the latest performance improvements.
Not paying attention to what native libraries (MKL, OpenBLAS, CuDNN etc) are being used
Providing no warm-up period before benchmarking begins
Running only a single (or too few) iterations, or not reporting mean, standard deviation and number of iterations
Not configuring workspaces, garbage collection, etc
Running only one possible case - for example, benchmarking a single set of array dimensions/orders when benchmarking BLAS operations
Running unusually small inputs - for example, minibatch size 1 on a GPU (which might be slower - but isn't realistic!)
Not measuring exactly - and only - what you claim to be measuring (for example, not accounting for array allocation, initialization or garbage collection time)
Not making your benchmarks reproducible (does the benchmark conclusion generalize? are there problems with the benchmark? what can we do to fix it?)
Comparing results across different hardware, not accounting for differences (for example, testing on one machine with AVX2 support, and on another without)
Not asking the devs (via Discourse): we are happy to provide suggestions and investigate if performance isn't where it should be!
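Several of the mistakes above come down to reporting a single number. As a minimal sketch of what reporting mean, standard deviation and number of iterations can look like, here is a small plain-Java helper. The class name TimingStats and its methods are assumptions made for this example and are not part of DL4J or ND4J.

```java
import java.util.Locale;

public class TimingStats {

    // Arithmetic mean of the timing samples.
    static double mean(double[] samples) {
        double sum = 0.0;
        for (double s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    // Sample standard deviation (divides by n - 1); needs at least two samples.
    static double sampleStdDev(double[] samples) {
        double m = mean(samples);
        double squaredDiffs = 0.0;
        for (double s : samples) {
            squaredDiffs += (s - m) * (s - m);
        }
        return Math.sqrt(squaredDiffs / (samples.length - 1));
    }

    // One-line summary with everything a reader needs to judge the result.
    static String report(double[] samplesMs) {
        return String.format(Locale.ROOT, "mean=%.3f ms, std=%.3f ms, n=%d",
                mean(samplesMs), sampleStdDev(samplesMs), samplesMs.length);
    }

    public static void main(String[] args) {
        double[] samplesMs = {1.0, 2.0, 3.0, 4.0}; // made-up timings for illustration
        System.out.println(report(samplesMs));
    }
}
```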

How to Run Deeplearning4j Benchmarks - A Guide

Total training time is always ETL plus computation. That is, both the data pipeline and the matrix manipulations determine how long a neural network takes to train on a dataset.

The JVM has knobs to tune, and if you know how to tune them, you can make it a very fast environment for deep learning. There are several things to keep in mind on the JVM. You need to:

Increase the heap space
Get garbage collection right
Make ETL asynchronous
Presave datasets (aka pickling)

Setting Heap Space

Xms sets the minimum heap space
Xmx sets the maximum heap space

You can set these in IDEs like IntelliJ and Eclipse, as well as via the CLI like so:

    java -Xms256m -Xmx1024m YourClassNameHere

    java -Xmx3g

    java -Xms3g -Xmx3g YourClassNameHere

IntelliJ will automatically specify the Java main class in question.

    echo 'export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=512m"' >> ~/.bash_profile
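If you are unsure whether the heap flags were picked up, one way to check is to print the limits from inside the JVM. The sketch below uses only the standard java.lang.Runtime API; the class name HeapSettingsCheck is an assumption for this example.

```java
public class HeapSettingsCheck {

    // Converts a byte count to whole mebibytes for readable output.
    static long bytesToMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime runtime = Runtime.getRuntime();
        // maxMemory() is the most the JVM will try to use for the heap (governed by -Xmx).
        System.out.println("max heap (MiB):   " + bytesToMiB(runtime.maxMemory()));
        // totalMemory() is the heap currently reserved by the JVM (starts near -Xms).
        System.out.println("total heap (MiB): " + bytesToMiB(runtime.totalMemory()));
        // freeMemory() is the unused part of the currently reserved heap.
        System.out.println("free heap (MiB):  " + bytesToMiB(runtime.freeMemory()));
    }
}
```

Running it with, for example, java -Xms3g -Xmx3g HeapSettingsCheck should report a maximum close to 3072 MiB; the exact figure can differ slightly depending on the JVM and garbage collector.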

Garbage Collection

You may need to alter the garbage collection algorithm that Java is using. This can be done via the command line like so:

    java -XX:+UseG1GC

Better garbage collection increases throughput. For a more detailed exploration of the issue, please read this InfoQ article.
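To confirm which collector the JVM actually selected, and how much time it has spent collecting, you can query the standard management beans. This is a minimal sketch using java.lang.management; the class name GcReport is an assumption for this example.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class GcReport {

    // Names of the garbage collectors active in this JVM.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionCount() and getCollectionTime() are cumulative since JVM start.
            System.out.printf("%s: collections=%d, time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Printing these figures before and after a benchmark run gives a rough idea of how much of the measured time was spent in garbage collection.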

The larger the batch size you use, the more RAM you consume.

ETL & Asynchronous ETL

one loads from disk
one loads asynchronously
one loads pre-saved from RAM

Here's how the DataSetIterator is uniformly invoked for MNIST:

    while (mnistTest.hasNext()) {
        DataSet ds = mnistTest.next();
        INDArray output = model.output(ds.getFeatures(), false);
        eval.eval(ds.getLabels(), output);
    }

This is the relevant code, in particular the third line:

    MultiDataSetIterator iterator;
    if (prefetchSize > 0 && source.asyncSupported()) {
        iterator = new AsyncMultiDataSetIterator(source, prefetchSize);
    } else {
        iterator = source;
    }

There are actually two types of asynchronous dataset iterators. The AsyncDataSetIterator is what you would use most of the time. It's described in the Javadoc here.

For special cases such as recurrent neural nets applied to time series, or for computation graphs, you would use an AsyncMultiDataSetIterator, described in the Javadoc here.

Notice in the code above that prefetchSize is another parameter to set. A typical batch size might be 1,000 examples; if you set prefetchSize to 3, the iterator would pre-fetch 3 batches, or 3,000 examples.
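The idea behind asynchronous prefetching can be sketched with plain Java concurrency primitives. The class below is not DL4J's AsyncDataSetIterator and makes no claim about how it is implemented; PrefetchingIterator is a hypothetical name, and the sketch assumes the source never returns null and never throws. A background thread loads up to prefetchSize items ahead while the consumer works on the current one.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PrefetchingIterator<T> implements Iterator<T> {

    private static final Object END = new Object(); // marks the end of the source
    private final BlockingQueue<Object> queue;
    private Object next; // item taken from the queue but not yet handed out

    public PrefetchingIterator(Iterator<T> source, int prefetchSize) {
        this.queue = new ArrayBlockingQueue<>(prefetchSize);
        Thread loader = new Thread(() -> {
            try {
                while (source.hasNext()) {
                    queue.put(source.next()); // blocks once prefetchSize items are waiting
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "prefetch-loader");
        loader.setDaemon(true);
        loader.start();
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            try {
                next = queue.take(); // waits only if the loader has fallen behind
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                next = END;
            }
        }
        return next != END;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T value = (T) next;
        next = null;
        return value;
    }

    public static void main(String[] args) {
        Iterator<String> source = Arrays.asList("batch-0", "batch-1", "batch-2", "batch-3").iterator();
        Iterator<String> prefetched = new PrefetchingIterator<>(source, 3);
        while (prefetched.hasNext()) {
            System.out.println("training on " + prefetched.next());
        }
    }
}
```

A bounded queue is used so that memory stays limited to roughly prefetchSize batches; a production version would also need to propagate loader failures to the consumer.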

ETL: Comparing Python frameworks With Deeplearning4j

We try to be more flexible. That means you can point DL4J at raw photos, and it will load the image, run the transforms and put it into an NDArray to generate a dataset on the fly.

One way is to pre-save the datasets, in a manner similar to the Python frameworks. (Pickles are pre-formatted data.) When you pre-save the dataset, you create a separate class.

Here’s how you pre-save datasets.

A RecordReaderDataSetIterator talks to DataVec and outputs datasets for DL4J.

Here’s how you load a pre-saved dataset.
