1 of 7

Configuration

Backends

Hardware setup for Eclipse Deeplearning4j, including GPUs and CUDA.

ND4J works atop so-called backends, or linear-algebra libraries, such as Native nd4j-native and nd4j-cuda-10.2 (GPUs), which you can select by pasting the right dependency into your project’s POM.xml file.

ND4J backends for GPUs and CPUs

You can choose GPUs or native CPUs for your backend linear algebra operations by changing the dependencies in ND4J's POM.xml file. Your selection will affect both ND4J and DL4J being used in your application.

If you have CUDA v9.2+ installed and NVIDIA-compatible hardware, then your dependency declaration will look like:

<dependency>
 <groupId>org.nd4j</groupId>
 <artifactId>nd4j-cuda-11.2</artifactId>
 <version>1.0.0-M1.1</version>
</dependency>

As of now, the artifactId for the CUDA versions can be one of nd4j-cuda-11.0,nd4j-cuda-11.2. Generally, the last 2 cuda versions are supported for a given release.

You can also find the available CUDA versions via Maven Central search or in the Release Notes.

Otherwise you will need to use the native implementation of ND4J as a CPU backend:

<dependency>
 <groupId>org.nd4j</groupId>
 <artifactId>nd4j-native</artifactId>
 <version>1.0.0-M1.1</version>
</dependency>

Building for Multiple Operating Systems

If you are developing your project on multiple operating systems/system architectures, you can add -platform to the end of your artifactId which will download binaries for most major systems.

<dependency>
 ...
 <artifactId>nd4j-native-platform</artifactId>
 ...
</dependency>

Bundling multiple Backends

For enabling different backends at runtime, you set the priority with your environment via the environment variable

BACKEND_PRIORITY_CPU=SOME_NUM
BACKEND_PRIORITY_GPU=SOME_NUM

Relative to the priority, it will allow you to dynamically set the backend type.

CuDNN

See our page on CuDNN.

CUDA Installation

Check the NVIDIA guides for instructions on setting up CUDA on the NVIDIA website.

Troubleshooting

Nd4jBackend$NoAvailableBackendException

 org.nd4j.linalg.factory.Nd4jBackend$NoAvailableBackendException: Please ensure that you have an nd4j backend on your classpath. Please see: https://deeplearning4j.konduit.ai/nd4j/backend
    at org.nd4j.linalg.factory.Nd4jBackend.load(Nd4jBackend.java:221)
    at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5091)
    ... 2 more

There are multiple reasons why you might run into this error message.

You haven't configured an ND4J backend at all.
You have a jar file that doesn't contain a backend for your platform.
You have a jar file that doesn't contain service loader files.

You haven't configured any ND4J Backend

Read this page and add a ND4J Backend to your dependencies:

You have a jar file that doesn't contain a backend for your platform.

This happens when you use a non -platform type backend dependency definition. In this case, only the Backend for the system that the jar file was built on will be included.

To solve this issue, use nd4j-native-platform instead of nd4j-native, if you are running on CPU and nd4j-cuda-11.2-platform instead of nd4j-cuda-11.2 when using the GPU backend.

If the jar file only contains the GPU backend, but your system has no CUDA capable (CC >= 3.5) GPU or CUDA isn't installed on the system, the CPU Backend should be used instead.

You have a jar file that doesn't contain service loader files.

ND4J uses the Java ServiceLoader in order to detect which backends are available on the class path. Depending on your uberjar packaging configuration, those files might be stripped away or broken.

To double check that the required files are included, open your uberjar and make sure it contains /META-INF/services/org.nd4j.linalg.factory.Nd4jBackend. Then open the file, and make sure there are entries for all of your configured backends.

If your uberjar does not contain that file, or if not all of the configured backends are listed there, you will have to reconfigure your shade plugin. See ServicesResourceTransformer documentation for how to do that.

Performance Issues

How to Debug Performance Issues

This page is a how-to guide for debugging performance issues encountered when training neural networks with Deeplearning4j. Much of the information also applies to debugging performance issues encountered when using ND4J.

Deeplearning4j and ND4J provide excellent performance in most cases (utilizing optimized c++ code for all numerical operations as well as high performance libraries such as NVIDIA cuDNN and Intel MKL). However, sometimes bottlenecks or misconfiguration issues may limit performance to well below the maximum. This page is intended to be a guide to help users identify the cause of poor performance, and provide steps to fix these issues.

Performance issues may include:

Poor CPU/GPU utilization
Slower than expected training or operation execution

To start, here’s a summary of some possible causes of performance issues:

Wrong ND4J backend is used (for example, CPU backend when GPU backend is expected)
Not using cuDNN when using CUDA GPUs
ETL (data loading) bottlenecks
Garbage collection overheads
Small batch sizes
Multi-threaded use of MultiLayerNetwork/ComputationGraph for inference (not thread safe)
Double precision floating point data type used when single precision should be used
Not using workspaces for memory management (enabled by default)
Poorly configured network
Layer or operation is CPU-only
CPU: Lack of hardware support for modern AVX etc extensions
Other processes using CPU or GPU resources
CPU: Lack of configuration of OMP_NUM_THREADS when using many models/threads simultaneously

Step 1: Check if correct backend is used

ND4J (and by extension, Deeplearning4j) can perform computation on either the CPU or GPU. The device used for computation is determined by your project dependencies - you include nd4j-native-platform to use CPUs for computation or nd4j-cuda-x.x-platform to use GPUs for computation (where x.x is your CUDA version - such as 9.2, 10.0 etc).

It is straightforward to check which backend is used. ND4J will log the backend upon initialization.

For CPU execution, you will expect output that looks something like:

For CUDA execution, you would expect the output to look something like:

Pay attention to the Loaded [X] backend and Backend used: [X] messages to confirm that the correct backend is used. If the incorrect backend is being used, check your program dependencies to ensure tho correct backend has been included.

Step 2: Check for cuDNN

If you are using CPUs only (nd4j-native backend) then you can skip to step 3 as cuDNN only applies when using NVIDIA GPUs (nd4j-cuda-x.x-platform dependency).

cuDNN is NVIDIA’s library for accelerating neural network training on NVIDIA GPUs. Deeplearning4j can make use of cuDNN to accelerate a number of layers - including ConvolutionLayer, SubsamplingLayer, BatchNormalization, Dropout, LocalResponseNormalization and LSTM. When training on GPUs, cuDNN should always be used if possible as it is usually much faster than the built-in layer implementations.

How to determine if CuDNN is used or

Not all DL4J layer types are supported in cuDNN. DL4J layers with cuDNN support include ConvolutionLayer, SubsamplingLayer, BatchNormalization, Dropout, LocalResponseNormalization and LSTM.

To check if cuDNN is being used, the simplest approach is to look at the log output when running inference or training: If cuDNN is NOT available when you are using a layer that supports it, you will see a message such as:

If cuDNN is available and was loaded successfully, no message will be logged.

Alternatively, you can confirm that cuDNN is used by using the following code:

Note that you will need to do at least one forward pass or fit call to initialize the cuDNN layer helper.

If cuDNN is available and was loaded successfully, you will see the following printed:

whereas if cuDNN is not available or could not be loaded successfully (you will get a warning or error logged also):

Step 3: Check for ETL (Data Loading) Bottlenecks

Neural network training requires data to be in memory before training can proceed. If the data is not loaded fast enough, the network will have to wait until data is available. DL4J uses asynchronous prefetch of data to improve performance by default. Under normal circumstances, this asynchronous prefetching means the network should never be waiting around for data (except on the very first iteration) - the next minibatch is loaded in another thread while training is proceeding in the main thread.

However, when data loading takes longer than the iteration time, data can be a bottleneck. For example, if a network takes 100ms to perform fitting on a single minibatch, but data loading takes 200ms, then we have a bottleneck: the network will have to wait 100ms per iteration (200ms loading - 100ms loading in parallel with training) before continuing the next iteration. Conversely, if network fit operation was 100ms and data loading was 50ms, then no data loading bottleck will occur, as the 50ms loading time can be completed asynchronously within one iteration.

How to check for ETL / data loading bottlenecks

The way to identify ETL bottlenecks is simple: add PerformanceListener to your network, and train as normal. For example:

When training, you will see output such as:

The above output shows that there is no ETL bottleneck (i.e., ETL: 0 ms). However, if ETL time is greater than 0 consistently (after the first iteration), an ETL bottleneck is present.

How to identify the cause of an ETL bottleneck

There are a number of possible causes of ETL bottlenecks. These include (but are not limited to):

Slow hard drives
Network latency or throughput issues (when reading from remote or network storage)
Computationally intensive or inefficient ETL (especially for custom ETL pipelines)

Step 4: Check for Garbage Collection Overhead

Even though DL4J/ND4J array memory is off-heap, garbage collection can still cause performance issues.

In summary:

Garbage collection will sometimes (temporarily and briefly) pause/stop application execution (“stop the world”)
These GC pauses slow down program execution
The overall performance impact of GC pauses depends on both the frequency of GC pauses, and the duration of GC pauses
The frequency is controllable (in part) by ND4J, using Nd4j.getMemoryManager().setAutoGcWindow(10000); and Nd4j.getMemoryManager().togglePeriodicGc(false);
Not every GC event is caused by or controlled by the above ND4J configuration.

In our experience, garbage collection time depends strongly on the number of objects in the JVM heap memory. As a rough guide:

Less than 100,000 objects in heap memory: short GC events (usually not a performance problem)
100,000-500,000 objects: GC overhead becomes noticeable, often in the 50-250ms range per full GC event
500,000 or more objects: GC can be a bottleneck if performed frequently. Performance may still be good if GC events are infrequent (for example, every 10 seconds or less).
10 million or more objects: GC is a major bottleneck even if infrequently called, with each full GC takes multiple seconds

How to configure ND4J garbage collection settings

In simple terms, there are two settings of note:

How to determine GC impact using PerformanceListener

NOTE: this feature was added after 1.0.0-beta3 and will be available in future releases To determine the impact of garbage collection using PerformanceListener, you can use the following:

This will report GC activity:

The garbage collection activity is reported for all available garbage collectors - the GC: [PS Scavenge: 2 (1ms)], [PS MarkSweep: 2 (24ms)] means that garbage collection was performed 2 times since the last PerformanceListener reporting, and took 1ms and 24ms total respectively for the two GC algorithms, respectively.

Keep in mind: PerformanceListener reports GC events every N iterations (as configured by the user). Thus, if PerformanceListener is configured to report statistics every 10 iterations, the garbage collection stats would be for the period of time corresponding to the last 10 iterations.

How to determine GC impact using -verbose:gc

When these options are enabled, you will have information reported on each GC event, such as:

This information can be used to determine the frequency, cause (System.gc() calls, allocation failure, etc) and duration of GC events.

How to determine GC impact using a profiler

An alternative approach is to use a profiler to collect garbage collection information.

How to determine number (and type) of JVM heap objects using memory dumps

If you determine that garbage collection is a problem, and suspect that this is due to the number of objects in memory, you can perform a heap dump.

To perform a heap dump:

Step 1: Run your program
Step 2: While running, determine the process ID
- One approach is to use jps:
  - For basic details, run jps on the command line. If jps is not on the system PATH, it can be found (on Windows) at C:\Program Files\Java\jdk<VERSION>\bin\jps.exe
  - For more details on each process, run jps -lv instead
- Alternatively, you can use the top command on Linux or Task Manager (Windows) to find the PID (on Windows, the PID column may not be enabled by default)
Step 3: Create a heap dump using jmap -dump:format=b,file=file_name.hprof 123 where 123 is the process id (PID) to create the heap dump for

After a memory dump has been collected, it can be opened in tools such as YourKit profiler and VisualVM to determine the number, type and size of objects. With this information, you should be able to pinpoint the cause of the large number of objects and make changes to your code to reduce or eliminate the objects that are causing the garbage collection overhead.

Step 5: Check Minibatch Size

Another common cause of performance issues is a poorly chosen minibatch size. A minibatch is a number of examples used together for one step of inference and training. Minibatch sizes of 32 to 128 are commonly used, though smaller or larger are sometimes used.

In summary:

If minibatch size is too small (for example, training or inference with 1 example at a time), poor hardware utilization and lower overall throughput is expected
If minibatch size is too large
- Hardware utilization will usually be good
- Iteration times will slow down
- Memory utilization may be too high (leading to out-of-memory errors)

For inference, avoid using minibatch size of 1, as throughput will suffer. Unless there are strict latency requirements, you should use larger minibatch sizes as this will give you the best hardware utilization and hence throughput, and is especially important for GPUs.

For training, you should never use a minibatch size of 1 as overall performance and hardware utilization will be reduced. Network convergence may also suffer. Start with a minibatch size of 32-128, if memory will allow this to be used.

Step 6: Ensure you are not using a single MultiLayerNetwork/ComputationGraph for inference from multiple threads

MultiLayerNetwork and ComputationGraph are not considered thread-safe, and should not be used from multiple threads. That said, most operations such as fit, output, etc use synchronized blocks. These synchronized methods should avoid hard to understand exceptions (race conditions due to concurrent use), they will limit throughput to a single thread (though, note that native operation parallelism will still be parallelized as normal). In summary, using the one network from multiple threads should be avoided as it is not thread safe and can be a performance bottleneck.

Step 7: Check Data Types

As of 1.0.0-beta3 and earlier, ND4J has a global datatype setting that determines the datatype of all arrays. The default value is 32-bit floating point. The data type can be set using Nd4j.setDataType(DataBuffer.Type.FLOAT); for example.

Performance on CPUs can also be reduced for double precision due to the additional memory batchwidth requirements vs. float precision.

You can check the data type setting using:

Step 8: Check workspace configuration for memory management (enabled by default)

In summary, workspaces are enabled by default for all Deeplearning4j networks, and enabling them improves performance and reduces memory requirements. There are very few reasons to disable workspaces.

You can check that workspaces are enabled for your MultiLayerNetwork using:

or for a ComputationGraph using:

You want to see the output as ENABLED output for both training and inference. To change the workspace configuration, use the setter methods, for example: net.getLayerWiseConfigurations().setTrainingWorkspaceMode(WorkspaceMode.ENABLED);

Step 9: Check for a badly configured network or network with layer bottlenecks

Another possible cause (especially for newer users) is a poorly designed network. A network may be poorly designed if:

It has too many layers. A rough guideline:
- More than about 100 layers for a CNN may be too many
- More than about 10 layers for a RNN/LSTM network may be too many
- More than about 20 feed-forward layers may be too many for a MLP
The input/activations are too large
- For CNNs, inputs in the range of 224x224 (for image classification) to 600x600 (for object detection and segmentation) are used. Large image sizes (such as 500x500) are computationally demanding, and much larger than this should be considered too large in most cases.
The output number of classes is too large
- Classification with more than about 10,000 classes can become a performance bottleneck with standard softmax output layers
The layers are too large
- For CNNs, most layers have kernel sizes in the range 2x2 to 7x7, with channels equal to 32 to 1024 (with larger number of channels appearing later in the network). Much larger than this may cause a performance bottleneck.
- For MLPs, most layers have at most 2048 units/neurons (often much smaller). Much larger than this may be too large.
- For RNNs such as LSTMs, layers are typically in the range of 128 to 512, though the largest RNNs may use around 1024 units per layer.
The network has too many parameters
- This is usually a consequence of the other issues already mentioned - too many layers, too large input, too many output classes
- For comparison, less than 1 million parameters would be considered small, and more than about 100 million parameters would be considered very large.
- You can check the number of parameters using MultiLayerNetwork/ComputationGraph.numParams() or MultiLayerNetwork/ComputationGraph.summary()

Note that these are guidelines only, and some reasonable network may exceed the numbers specified here. Some networks can become very large, such as those commonly used for imagenet classification or object detection. However, in these cases, the network is usually carefully designed to provide a good tradeoff between accuracy and computation time.

If your network architecture is significantly outside of the guidelines specified here, you may want to reconsider the design to improve performance.

Step 10: Check for CPU-only ops (when using GPUs)

If you are using CPUs only (nd4j-native backend), you can skip this step, as it only applies when using the GPU (nd4j-cuda) backend.

As of 1.0.0-beta3, a handful of recently added operations do not yet have GPU implementations. Thus, when these layer are used in a network, they will execute on CPU only, irrespective of the nd4j-backend used. GPU support for these layers will be added in an upcoming release.

The layers without GPU support as of 1.0.0-beta3 include:

Convolution3D
Upsampling1D/2D/3D
Deconvolution2D
LocallyConnected1D/2D
SpaceToBatch
SpaceToDepth

Unfortunately, there is no workaround or fix for now, until these operations have GPU implementations completed.

Step 11: Check CPU support for hardware extensions (AVX etc)

If you are running on a GPU, this section does not apply.

When running on older CPUs or those that lack modern AVX extensions such as AVX2 and AVX512, performance will be reduced compared to running on CPUs with these features. Though there is not much you can do about the lack of such features, it is worth knowing about if you are comparing performance between different CPU models.

In summary, CPU models with AVX2 support will perform better than those without it; similarly, AVX512 is an improvement over AVX2.

Step 12: Check other processes using CPU or GPU resources

Another obvious cause of performance issues is other processes using CPU or GPU resources.

For CPU, it is straightforward to see if other processes are using resources using tools such as top (for Linux) or task managed (for Windows).

For NVIDIA CUDA GPUs, nvidia-smi can be used. nvidia-smi is usually installed with the NVIDIA display drivers, and (when run) shows the overall GPU and memory utilization, as well as the GPU utilization of programs running on the system.

On Linux, this is usually on the system path by default. On Windows, it may be found at C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi

Step 13: Check OMP_NUM_THREADS performing concurrent inference using CPU in multiple threads simultaneously

If you are using GPUs (nd4j-cuda backend), you can skip this section.

One issue to be aware of when running multiple DL4J networks (or ND4J operations generally) concurrently in multiple threads is the OpenMP number of threads setting. In summary, in ND4J we use OpenMP pallelism at the c++ level to increase operation performance. By default, ND4J will use a value equal to the number of physical CPU cores (not logical cores) as this will give optimal performance

This also applies if the CPU resources are shared with other computationally demanding processes.

Debugging Performance Issues with JVM Profiling

Profiling is a process whereby you can trace how long each method in your code takes to execute, to identify and debug performance bottlenecks.

A full guide to profiling is beyond the scope of this page, but the summary is that you can trace how long each method takes to execute (and where it is being called from) using a profiling tool. This information can then be used to identify bottlenecks (and their causes) in your program.

How to Perform Profiling

The YourKit profiling documentation is quite good. To perform profiling with YourKit:

Install and start YourKit Profiler
Collect a snapshot and analyze

Profiling on Spark

When debugging performance issues for Spark training or inference jobs, it can often be useful to perform profiling here also.

One approach that we have used internally is to combine manual profiling settings (-agentpath JVM argument) with spark-submit arguments for YourKit profiler.

To perform profiling in this manner, 5 steps are required:

Download YourKit profiler to a location on each worker (must be the same location on each worker) and (optionally) the driver
[Optional] Copy the profiling configuration onto each worker (must be the same location on each worker)
Create a local output directory for storing the profiling result files on each worker
Launch the Spark job with the appropriate configuration (see example below)
The snapshots will be saved when the Spark job completes (or is cancelled) to the specified directories.

For example, to perform tracing on both the driver and the workers,

The configuration (tracing_settings_path) is optional. A sample tracing settings file is provided below:

CPU

CPU and AVX support in ND4J/Deeplearning4j

What is AVX, and why does it matter?

AVX (Advanced Vector Extensions) is a set of CPU instructions for accelerating numerical computations. See Wikipedia for more details.

Note that AVX only applies to nd4j-native (CPU) backend for x86 devices, not GPUs and not ARM/PPC devices.

Why AVX matters: performance. You want to use the version of ND4J compiled with the highest level of AVX supported by your system.

AVX support for different CPUs - summary:

Most modern x86 CPUs: AVX2 is supported
Some high-end server CPUs: AVX512 may be supported
Old CPUs (pre 2012) and low power x86 (Atom, Celeron): No AVX support (usually)

Note that CPUs supporting later versions of AVX include all earlier versions also. This means it's possible run a generic x86 or AVX2 binary on a system supporting AVX512. However it is not possible to run binaries built for later versions (such as avx512) on a CPU that doesn't have support for those instructions.

In version 1.0.0-beta6 and later you may get a warning as follows, if AVX is not configured optimally:

*********************************** CPU Feature Check Warning ***********************************
Warning: Initializing ND4J with Generic x86 binary on a CPU with AVX/AVX2 support
Using ND4J with AVX/AVX2 will improve performance. See deeplearning4j.org/cpu for more details
Or set environment variable ND4J_IGNORE_AVX=true to suppress this warning
************************************************************************************************

This warning has been removed in more recent versions as it's more confusing to users and out of date.

Configure mkl usage

When using the nd4j-native backend on intel platforms, our openblas bindings give the ability to also use mkl instead. In order to use mkl, set the system property as follows eitehr on launch or before Nd4j is initialized with Nd4j.create():

 System.setProperty("org.bytedeco.openblas.load", "mkl");

Configuring AVX in ND4J/DL4J

As noted earlier, for best performance you should use the version of ND4J that matches your CPU's supported AVX level.

ND4J defaults configuration (when just including the nd4j-native or nd4j-native-platform dependencies without maven classifier configuration) is "generic x86" (no AVX) for nd4j/nd4j-platform dependencies.

To configure AVX2 and AVX512, you need to specify a classifier for the appropriate architecture.

The following binaries (nd4j-native classifiers) are provided for x86 architectures:

Generic x86 (no AVX): linux-x86_64, windows-x86_64, macosx-x86_64
AVX2: linux-x86_64-avx2, windows-x86_64-avx2, macosx-x86_64-avx2
AVX512: linux-x86_64-avx512

As of 1.0.0-M1, the following combinations are also possible with onednn:

Generic x86 (no AVX): linux-x86_64-onednn, windows-x86_64-onednn, macosx-x86_64-onednn
AVX2: linux-x86_64-onednn-avx2, windows-x86_64-onednn-avx2, macosx-x86_64-onednn-avx2
AVX512: linux-x86_64-onednn-avx512

Example: Configuring AVX2 on Windows (Maven pom.xml)

<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>${nd4j.version}</version>
</dependency>

<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>${nd4j.version}</version>
    <classifier>windows-x86_64-avx2</classifier>
</dependency>

Example: Configuring AVX512 on Linux (Maven pom.xml)

<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>${nd4j.version}</version>
</dependency>

<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>${nd4j.version}</version>
    <classifier>linux-x86_64-avx512</classifier>
</dependency>

Example: Configuring AVX512 on Linux with onednn(Maven pom.xml)

<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>${nd4j.version}</version>
</dependency>

<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>${nd4j.version}</version>
    <classifier>linux-x86_64-onednn-avx512</classifier>
</dependency>

Note that you need both nd4j-native dependencies - with and without the classifier.

In the examples above, it is assumed that a Maven property nd4j.version is set to an appropriate ND4J version such as 1.0.0-M1.1

Cudnn

Using the NVIDIA cuDNN library with DL4J.

Using Deeplearning4j with cuDNN

There are 2 ways of using cudnn with deeplearning4j. One is an older way described below that is built in to the various deeplearning4j layers at the java level.

The other is to use the new nd4j cuda bindings that link to cudnn at the c++ level. Both will be described below. The newer way first, followed by the old way.

Cudnn setup

The actual library for cuDNN is not bundled, so be sure to download and install the appropriate package for your platform from NVIDIA:

Note there are multiple combinations of cuDNN and CUDA supported. Deeplearning4j's cuda support is based on . The way to read the versioning is: cuda version - cudnn version - javacpp version. For example, if the cuda version is set to 11.2, you can expect us to support cudnn 8.1.

To install, simply extract the library to a directory found in the system path used by native libraries. The easiest way is to place it alongside other libraries from CUDA in the default directory (/usr/local/cuda/lib64/ on Linux, /usr/local/cuda/lib/ on Mac OS X, and C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\, C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin\, or C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin\ on Windows).

Alternatively, in the case of the most recent supported cuda version, cuDNN comes bundled with the "redist" package of the . , we can add the following dependencies instead of installing CUDA and cuDNN:

The same versioning scheme for redist applies to the cuda bindings that leverage an installed cuda.

Using cuDNN via nd4j

Similar to our avx bindings, nd4j leverages our c++ library libnd4j for running mathematical operations. In order to use cudnn, all you need to do is change the cuda backend dependency from:

or for cuda 11.0:

For jetson nano cuda 10.2:

Note that we are only adding an additional dependency. The reason we use an additional classifier is to pull in an optional dependency on cudnn based routines. The default does not use cudnn, but instead built in standalone routines for various operations implemented in cudnn such as conv2d and lstm.

For users of the -platform dependencies such as nd4j-cuda-11.2-platform, this classifier is still required. The -platform dependencies try to set sane defaults for each platform, but give users the option to include whatever they want. If you need optimizations, please become familiar with this.

Using cudnn via deeplearning4j

Deeplearning4j supports CUDA but can be further accelerated with cuDNN. Most 2D CNN layers (such as ConvolutionLayer, SubsamplingLayer, etc), and also LSTM and BatchNormalization layers support CuDNN.

The only thing we need to do to have DL4J load cuDNN is to add a dependency on deeplearning4j-cuda-11.0, or deeplearning4j-cuda-11.2, for example:

Memory

Setting available Memory/RAM for a DL4J application

Memory Management for ND4J/DL4J: How does it work?

ND4J uses off-heap memory to store NDArrays, to provide better performance while working with NDArrays from native code such as BLAS and CUDA libraries.

"Off-heap" means that the memory is allocated outside of the JVM (Java Virtual Machine) and hence isn't managed by the JVM's garbage collection (GC). On the Java/JVM side, we only hold pointers to the off-heap memory, which can be passed to the underlying C++ code via JNI for use in ND4J operations.

To manage memory allocations, we use two approaches:

JVM Garbage Collector (GC) and WeakReference tracking
MemoryWorkspaces - see Workspaces guide for details

Despite the differences between these two approaches, the idea is the same: once an NDArray is no longer required on the Java side, the off-heap associated with it should be released so that it can be reused later. The difference between the GC and MemoryWorkspaces approaches is in when and how the memory is released.

For JVM/GC memory: whenever an INDArray is collected by the garbage collector, its off-heap memory will be deallocated, assuming it is not used elsewhere.
For MemoryWorkspaces: whenever an INDArray leaves the workspace scope - for example, when a layer finished forward pass/predictions - its memory may be reused without deallocation and reallocation. This results in better performance for cyclical workloads like neural network training and inference.

Configuring Memory Limits

With DL4J/ND4J, there are two types of memory limits to be aware of and configure: The on-heap JVM memory limit, and the off-heap memory limit, where NDArrays live. Both limits are controlled via Java command-line arguments:

-Xms - this defines how much memory JVM heap will use at application start.
-Xmx - this allows you to specify JVM heap memory limit (maximum, at any point). Only allocated up to this amount (at the discretion of the JVM) if required.
-Dorg.bytedeco.javacpp.maxbytes - this allows you to specify the off-heap memory limit. This can also be a percentage, in which case it would apply to maxMemory.
-Dorg.bytedeco.javacpp.maxphysicalbytes - this specifies the maximum bytes for the entire process - usually set to maxbytes plus Xmx plus a bit extra, in case other libraries require some off-heap memory also. Unlike setting maxbytes setting maxphysicalbytes is optional. This can also be a percentage (>100%), in which case it would apply to maxMemory.

Example: Configuring 1GB initial on-heap, 2GB max on-heap, 8GB off-heap, 10GB maximum for process:

-Xms1G -Xmx2G -Dorg.bytedeco.javacpp.maxbytes=8G -Dorg.bytedeco.javacpp.maxphysicalbytes=10G

Gotchas: A few things to watch out for

With GPU systems, the maxbytes and maxphysicalbytes settings currently also effectively defines the memory limit for the GPU, since the off-heap memory is mapped (via NDArrays) to the GPU - read more about this in the GPU-section below.
For many applications, you want less RAM to be used in JVM heap, and more RAM to be used in off-heap, since all NDArrays are stored there. If you allocate too much to the JVM heap, there will not be enough memory left for the off-heap memory.
If you get a "RuntimeException: Can't allocate [HOST] memory: xxx; threadId: yyy", you have run out of off-heap memory. You should most often use a WorkspaceConfiguration to handle your NDArrays allocation, in particular in e.g. training or evaluation/inference loops - if you do not, the NDArrays and their off-heap (and GPU) resources are reclaimed using the JVM GC, which might introduce severe latency and possible out of memory situations.
If you don't specify JVM heap limit, it will use 1/4 of your total system RAM as the limit, by default.
If you don't specify off-heap memory limit, the JVM heap limit (Xmx) will be used by default. i.e. -Xmx8G will mean that 8GB can be used by JVM heap, and an additional 8GB can be used by ND4j in off-heap.
In limited memory environments, it's usually a bad idea to use high -Xmx value together with -Xms option. That is because doing so won't leave enough off-heap memory. Consider a 16GB system in which you set -Xms14G: 14GB of 16GB would be allocated to the JVM, leaving only 2GB for the off-heap memory, the OS and all other programs.

Memory-mapped files

ND4J supports the use of a memory-mapped file instead of RAM when using the nd4j-native backend. On one hand, it's slower then RAM, but on other hand, it allows you to allocate memory chunks in a manner impossible otherwise.

Here's sample code:

WorkspaceConfiguration mmap = WorkspaceConfiguration.builder()
                .initialSize(1000000000)
                .policyLocation(LocationPolicy.MMAP)
                .build();

try (MemoryWorkspace ws = Nd4j.getWorkspaceManager().getAndActivateWorkspace(mmap, "M2")) {
    INDArray x = Nd4j.create(10000);
}

In this case, a 1GB temporary file will be created and mmap'ed, and NDArray x will be created in that space. Obviously, this option is mostly viable for cases when you need NDArrays that can't fit into your RAM.

GPUs

When using GPUs, oftentimes your CPU RAM will be greater than GPU RAM. When GPU RAM is less than CPU RAM, you need to monitor how much RAM is being used off-heap. You can check this based on the JavaCPP options specified above.

We allocate memory on the GPU equivalent to the amount of off-heap memory you specify. We don't use any more of your GPU than that. You are also allowed to specify heap space greater than your GPU (that's not encouraged, but it's possible). If you do so, your GPU will run out of RAM when trying to run jobs.

We also allocate off-heap memory on the CPU RAM as well. This is for efficient communicaton of CPU to GPU, and CPU accessing data from an NDArray without having to fetch data from the GPU each time you call for it.

If JavaCPP or your GPU throw an out-of-memory error (OOM), or even if your compute slows down due to GPU memory being limited, then you may want to either decrease batch size or increase the amount of off-heap memory that JavaCPP is allowed to allocate, if that's possible.

Try to run with an off-heap memory equal to your GPU's RAM. Also, always remember to set up a small JVM heap space using the Xmx option.

Note that if your GPU has < 2g of RAM, it's probably not usable for deep learning. You should consider using your CPU if this is the case. Typical deep-learning workloads should have 4GB of RAM at minimum. Even that is small. 8GB of RAM on a GPU is recommended for deep learning workloads.

It is possible to use HOST-only memory with a CUDA backend. That can be done using workspaces.

Example:

WorkspaceConfiguration basicConfig = WorkspaceConfiguration.builder()
    .policyAllocation(AllocationPolicy.STRICT)
    .policyLearning(LearningPolicy.FIRST_LOOP)
    .policyMirroring(MirroringPolicy.HOST_ONLY) // <--- this option does this trick
    .policySpill(SpillPolicy.EXTERNAL)
    .build();

It's not recommended to use HOST-only arrays directly, since they will dramatically reduce performance. But they might be useful as in-memory cache pairs with the INDArray.unsafeDuplication() method.

Workspaces

Workspaces are an efficient model for memory paging in DL4J.

What are workspaces?

ND4J offers an additional memory-management model: workspaces. That allows you to reuse memory for cyclic workloads without the JVM Garbage Collector for off-heap memory tracking. In other words, at the end of the workspace loop, all INDArrays' memory content is invalidated. Workspaces are integrated into DL4J for training and inference.

The basic idea is simple: You can do what you need within a workspace (or spaces), and if you want to get an INDArray out of it (i.e. to move result out of the workspace), you just call INDArray.detach() and you'll get an independent INDArray copy.

Neural Networks

For DL4J users, workspaces provide better performance out of the box, and are enabled by default from 1.0.0-alpha onwards. Thus for most users, no explicit worspaces configuration is required.

To benefit from worspaces, they need to be enabled. You can configure the workspace mode using:

.trainingWorkspaceMode(WorkspaceMode.SEPARATE) and/or .inferenceWorkspaceMode(WorkspaceMode.SINGLE) in your neural network configuration.

The difference between SEPARATE and SINGLE workspaces is a tradeoff between the performance & memory footprint:

SEPARATE is slightly slower, but uses less memory.
SINGLE is slightly faster, but uses more memory.

That said, it’s fine to use different modes for training & inference (i.e. use SEPARATE for training, and use SINGLE for inference, since inference only involves a feed-forward loop without backpropagation or updaters involved).

With workspaces enabled, all memory used during training will be reusable and tracked without the JVM GC interference. The only exclusion is the output() method that uses workspaces (if enabled) internally for the feed-forward loop. Subsequently, it detaches the resulting INDArray from the workspaces, thus providing you with independent INDArray which will be handled by the JVM GC.

Please note: After the 1.0.0-alpha release, workspaces in DL4J were refactored - SEPARATE/SINGLE modes have been deprecated, and users should use ENABLED instead.

Garbage Collector

If your training process uses workspaces, we recommend that you disable (or reduce the frequency of) periodic GC calls. That can be done like so:

// this will limit frequency of gc calls to 5000 milliseconds
Nd4j.getMemoryManager().setAutoGcWindow(5000)

// OR you could totally disable it
Nd4j.getMemoryManager().togglePeriodicGc(false);

Put that somewhere before your model.fit(...) call.

ParallelWrapper & ParallelInference

For ParallelWrapper, the workspace-mode configuration option was also added. As such, each of the trainer threads will use a separate workspace attached to the designated device.

ParallelWrapper wrapper = new ParallelWrapper.Builder(model)
      // DataSets prefetching options. Buffer size per worker.
      .prefetchBuffer(8)

      // set number of workers equal to number of GPUs.
      .workers(2)

      // rare averaging improves performance but might reduce model accuracy
      .averagingFrequency(5)

      // if set to TRUE, on every averaging model score will be reported
      .reportScoreAfterAveraging(false)

      // 3 options here: NONE, SINGLE, SEPARATE
      .workspaceMode(WorkspaceMode.SINGLE)

      .build();

Iterators

We provide asynchronous prefetch iterators, AsyncDataSetIterator and AsyncMultiDataSetIterator, which are usually used internally.

These iterators optionally use a special, cyclic workspace mode to obtain a smaller memory footprint. The size of the workspace, in this case, will be determined by the memory requirements of the first DataSet coming out of the underlying iterator, whereas the buffer size is defined by the user. The workspace will be adjusted if memory requirements change over time (e.g. if you’re using variable-length time series).

Caution: If you’re using a custom iterator or the RecordReader, please make sure you’re not initializing something huge within the first next() call. Do that in your constructor to avoid undesired workspace growth.

Caution: With AsyncDataSetIterator being used, DataSets are supposed to be used before calling the next() DataSet. You are not supposed to store them, in any way, without the detach() call. Otherwise, the memory used for INDArrays within DataSet will be overwritten within AsyncDataSetIterator eventually.

If for some reason you don’t want your iterator to be wrapped into an asynchronous prefetch (e.g. for debugging purposes), special wrappers are provided: AsyncShieldDataSetIterator and AsyncShieldMultiDataSetIterator. Basically, those are just thin wrappers that prevent prefetch.

Evaluation

Usually, evaluation assumes use of the model.output() method, which essentially returns an INDArray detached from the workspace. In the case of regular evaluations during training, it might be better to use the built-in methods for evaluation. For example:

Evaluation eval = new Evaluation(outputNum);
ROC roceval = new ROC(outputNum);
model.doEvaluation(iteratorTest, eval, roceval);

This piece of code will run a single cycle over iteratorTest, and it will update both (or less/more if required by your needs) IEvaluation implementations without any additional INDArray allocation.

Workspace Destruction

There are also some situations, say, where you're short on RAM, and might want do release all workspaces created out of your control; e.g. during evaluation or training.

That could be done like so: Nd4j.getWorkspaceManager().destroyAllWorkspacesForCurrentThread();

This method will destroy all workspaces that were created within the calling thread. If you've created workspaces in some external threads on your own, you can use the same method in that thread, after the workspaces are no longer needed.

Workspace Exceptions

If workspaces are used incorrectly (such as a bug in a custom layer or data pipeline, for example), you may see an error message such as:

org.nd4j.linalg.exception.ND4JIllegalStateException: Op [set] Y argument uses leaked workspace pointer from workspace [LOOP_EXTERNAL]
For more details, see the ND4J User Guide: nd4j.org/userguide#workspaces-panic

DL4J's LayerWorkspaceMgr

DL4J's Layer API includes the concept of a "layer workspace manager".

The idea with this class is that it allows us to easily and precisely control the location of a given array, given different possible configurations for the workspaces. For example, the activations out of a layer may be placed in one workspace during inference, and another during training; this is for performance reasons. However, with the LayerWorkspaceMgr design, implementers of layers don't need to worry about this.

What does this mean in practice? Usually it's quite simple...

When returning activations (activate(boolean training, LayerWorkspaceMgr workspaceMgr) method), make sure the returned array is defined in ArrayType.ACTIVATIONS (i.e., use LayerWorkspaceMgr.create(ArrayType.ACTIVATIONS, ...) or similar)
When returning activation gradients (backpropGradient(INDArray epsilon, LayerWorkspaceMgr workspaceMgr)), similarly return an array defined in ArrayType.ACTIVATION_GRAD

You can also leverage an array defined in any workspace to the appropriate workspace using, for example, LayerWorkspaceMgr.leverageTo(ArrayType.ACTIVATIONS, myArray)

Note that if you are not implementing a custom layer (and instead just want to perform forward pass for a layer outside of a MultiLayerNetwork/ComputationGraph) you can use LayerWorkspaceMgr.noWorkspaces().

Performance Issues

How to Debug Performance Issues

Performance issues may include:

Poor CPU/GPU utilization
Slower than expected training or operation execution

To start, here’s a summary of some possible causes of performance issues:

Wrong ND4J backend is used (for example, CPU backend when GPU backend is expected)
Not using cuDNN when using CUDA GPUs
ETL (data loading) bottlenecks
Garbage collection overheads
Small batch sizes
Multi-threaded use of MultiLayerNetwork/ComputationGraph for inference (not thread safe)
Double precision floating point data type used when single precision should be used
Not using workspaces for memory management (enabled by default)
Poorly configured network
Layer or operation is CPU-only
CPU: Lack of hardware support for modern AVX etc extensions
Other processes using CPU or GPU resources
CPU: Lack of configuration of OMP_NUM_THREADS when using many models/threads simultaneously

Finally, this page has a short section on

Step 1: Check if correct backend is used

It is straightforward to check which backend is used. ND4J will log the backend upon initialization.

For CPU execution, you will expect output that looks something like:

o.n.l.f.Nd4jBackend - Loaded [CpuBackend] backend
o.n.n.NativeOpsHolder - Number of threads used for NativeOps: 8
o.n.n.Nd4jBlas - Number of threads used for BLAS: 8
o.n.l.a.o.e.DefaultOpExecutioner - Backend used: [CPU]; OS: [Windows 10]
o.n.l.a.o.e.DefaultOpExecutioner - Cores: [16]; Memory: [7.1GB];
o.n.l.a.o.e.DefaultOpExecutioner - Blas vendor: [MKL]

For CUDA execution, you would expect the output to look something like:

13:08:09,042 INFO  ~ Loaded [JCublasBackend] backend
13:08:13,061 INFO  ~ Number of threads used for NativeOps: 32
13:08:14,265 INFO  ~ Number of threads used for BLAS: 0
13:08:14,274 INFO  ~ Backend used: [CUDA]; OS: [Windows 10]
13:08:14,274 INFO  ~ Cores: [16]; Memory: [7.1GB];
13:08:14,274 INFO  ~ Blas vendor: [CUBLAS]
13:08:14,274 INFO  ~ Device Name: [TITAN X (Pascal)]; CC: [6.1]; Total/free memory: [12884901888]

Step 2: Check for cuDNN

If you are using CPUs only (nd4j-native backend) then you can skip to step 3 as cuDNN only applies when using NVIDIA GPUs (nd4j-cuda-x.x-platform dependency).

Instructions for configuring CuDNN can be found . In summary, include the deeplearning4j-cuda-x.x dependency (where x.x is your CUDA version - such as 9.2 or 10.0). The network configuration does not need to change to utilize cuDNN - cuDNN simply needs to be available along with the deeplearning4j-cuda module.

How to determine if CuDNN is used or

Not all DL4J layer types are supported in cuDNN. DL4J layers with cuDNN support include ConvolutionLayer, SubsamplingLayer, BatchNormalization, Dropout, LocalResponseNormalization and LSTM.

o.d.n.l.c.ConvolutionLayer - cuDNN not found: use cuDNN for better GPU performance by including the deeplearning4j-cuda module. For more information, please refer to: https://deeplearning4j.org/cudnn
java.lang.ClassNotFoundException: org.deeplearning4j.nn.layers.convolution.CudnnConvolutionHelper
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)

If cuDNN is available and was loaded successfully, no message will be logged.

Alternatively, you can confirm that cuDNN is used by using the following code:

MultiLayerNetwork net = ...
LayerHelper h = net.getLayer(0).getHelper();    //Index 0: assume layer 0 is a ConvolutionLayer in this example
System.out.println("Layer helper: " + (h == null ? null : h.getClass().getName()));

Note that you will need to do at least one forward pass or fit call to initialize the cuDNN layer helper.

If cuDNN is available and was loaded successfully, you will see the following printed:

Layer helper: org.deeplearning4j.nn.layers.convolution.CudnnConvolutionHelper

whereas if cuDNN is not available or could not be loaded successfully (you will get a warning or error logged also):

Layer helper: null

Step 3: Check for ETL (Data Loading) Bottlenecks

How to check for ETL / data loading bottlenecks

The way to identify ETL bottlenecks is simple: add PerformanceListener to your network, and train as normal. For example:

MultiLayerNetwork net = ...
net.setListeners(new PerformanceListener(1));       //Logs ETL and iteration speed on each iteration

When training, you will see output such as:

.d.o.l.PerformanceListener - ETL: 0 ms; iteration 16; iteration time: 65 ms; samples/sec: 492.308; batches/sec: 15.384;

The above output shows that there is no ETL bottleneck (i.e., ETL: 0 ms). However, if ETL time is greater than 0 consistently (after the first iteration), an ETL bottleneck is present.

How to identify the cause of an ETL bottleneck

There are a number of possible causes of ETL bottlenecks. These include (but are not limited to):

Slow hard drives
Network latency or throughput issues (when reading from remote or network storage)
Computationally intensive or inefficient ETL (especially for custom ETL pipelines)

One useful way to get more information is to perform profiling, as described in the later in this page. For custom ETL pipelines, adding logging for the various stages can help. Finally, another approach to use a process of elimination - for example, measuring the latency and throughput of reading raw files from disk or from remote storage vs. measuring the time to actually process the data from its raw format.

Step 4: Check for Garbage Collection Overhead

Java uses garbage collection for management of on-heap memory (see for example for an explanation). Note that DL4J and ND4J use off-heap memory for storage of all INDArrays (see the for details).

Even though DL4J/ND4J array memory is off-heap, garbage collection can still cause performance issues.

In summary:

Garbage collection will sometimes (temporarily and briefly) pause/stop application execution (“stop the world”)
These GC pauses slow down program execution
The overall performance impact of GC pauses depends on both the frequency of GC pauses, and the duration of GC pauses
The frequency is controllable (in part) by ND4J, using Nd4j.getMemoryManager().setAutoGcWindow(10000); and Nd4j.getMemoryManager().togglePeriodicGc(false);
Not every GC event is caused by or controlled by the above ND4J configuration.

In our experience, garbage collection time depends strongly on the number of objects in the JVM heap memory. As a rough guide:

Less than 100,000 objects in heap memory: short GC events (usually not a performance problem)
100,000-500,000 objects: GC overhead becomes noticeable, often in the 50-250ms range per full GC event
500,000 or more objects: GC can be a bottleneck if performed frequently. Performance may still be good if GC events are infrequent (for example, every 10 seconds or less).
10 million or more objects: GC is a major bottleneck even if infrequently called, with each full GC takes multiple seconds

How to configure ND4J garbage collection settings

In simple terms, there are two settings of note:

Nd4j.getMemoryManager().setAutoGcWindow(10000);             //Set to 10 seconds (10000ms) between System.gc() calls
Nd4j.getMemoryManager().togglePeriodicGc(false);            //Disable periodic GC calls

If you suspect garbage collection overhead is having an impact on performance, try changing these settings. The main downside to reducing the frequency or disabling periodic GC entirely is when you are not using , though workspaces are enabled by default for all neural networks in Deeplearning4j.

Side note: if you are using DL4J for training on Spark, setting these values on the master/driver will not impact the settings on the worker. Instead, see .

How to determine GC impact using PerformanceListener

NOTE: this feature was added after 1.0.0-beta3 and will be available in future releases To determine the impact of garbage collection using PerformanceListener, you can use the following:

int listenerFrequency = 1;
boolean reportScore = true;
boolean reportGC = true;
net.setListeners(new PerformanceListener(listenerFrequency, reportScore, reportGC));

This will report GC activity:

o.d.o.l.PerformanceListener - ETL: 0 ms; iteration 30; iteration time: 17 ms; samples/sec: 588.235; batches/sec: 58.824; score: 0.7229335801186025; GC: [PS Scavenge: 2 (1ms)], [PS MarkSweep: 2 (24ms)];

How to determine GC impact using -verbose:gc

Another useful tool is the -verbose:gc, -XX:+PrintGCDetails -XX:+PrintGCTimeStamps command line options. For more details, see and

These options can be passed to the JVM on launch (when using java -jar or java -cp) or can be added to IDE launch options (for example, in IntelliJ: these should be placed in the “VM Options” field in Run/Debug Configurations - see )

When these options are enabled, you will have information reported on each GC event, such as:

5.938: [GC (System.gc()) [PSYoungGen: 5578K->96K(153088K)] 9499K->4016K(502784K), 0.0006252 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
5.939: [Full GC (System.gc()) [PSYoungGen: 96K->0K(153088K)] [ParOldGen: 3920K->3911K(349696K)] 4016K->3911K(502784K), [Metaspace: 22598K->22598K(1069056K)], 0.0117132 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]

This information can be used to determine the frequency, cause (System.gc() calls, allocation failure, etc) and duration of GC events.

How to determine GC impact using a profiler

An alternative approach is to use a profiler to collect garbage collection information.

For example, can be used to determine both the frequency and duration of garbage collection - see for more details.

, such as VisualVM can also be used to monitor GC activity.

How to determine number (and type) of JVM heap objects using memory dumps

If you determine that garbage collection is a problem, and suspect that this is due to the number of objects in memory, you can perform a heap dump.

To perform a heap dump:

Step 1: Run your program
Step 2: While running, determine the process ID
- One approach is to use jps:
  - For basic details, run jps on the command line. If jps is not on the system PATH, it can be found (on Windows) at C:\Program Files\Java\jdk<VERSION>\bin\jps.exe
  - For more details on each process, run jps -lv instead
- Alternatively, you can use the top command on Linux or Task Manager (Windows) to find the PID (on Windows, the PID column may not be enabled by default)
Step 3: Create a heap dump using jmap -dump:format=b,file=file_name.hprof 123 where 123 is the process id (PID) to create the heap dump for

A number of alternatives for generating heap dumps can be found .

Step 5: Check Minibatch Size

In summary:

If minibatch size is too small (for example, training or inference with 1 example at a time), poor hardware utilization and lower overall throughput is expected
If minibatch size is too large
- Hardware utilization will usually be good
- Iteration times will slow down
- Memory utilization may be too high (leading to out-of-memory errors)

For serving predictions in multi-threaded applications (such as a web server), should be used.

Step 6: Ensure you are not using a single MultiLayerNetwork/ComputationGraph for inference from multiple threads

For inference from multiple threads, you should use one model per thread (as this avoids locks) or for serving predictions in multi-threaded applications (such as a web server), use .

Step 7: Check Data Types

For best performance, this value should be left as its default. If 64-bit floating point precision (double precision) is used instead, performance can be significantly reduced, especially on GPUs - most consumer NVIDIA GPUs have very poor double precision performance (and half precision/FP16). On Tesla series cards, double precision performance is usually much better than for consumer (GeForce) cards, though is still usually half or less of the single precision performance. Wikipedia has a summary of the single and double precision performance of NVIDIA GPUs .

Performance on CPUs can also be reduced for double precision due to the additional memory batchwidth requirements vs. float precision.

You can check the data type setting using:

System.out.println("ND4J Data Type Setting: " + Nd4j.dataType());

Step 8: Check workspace configuration for memory management (enabled by default)

For details on workspaces, see the .

You can check that workspaces are enabled for your MultiLayerNetwork using:

System.out.println("Training workspace config: " + net.getLayerWiseConfigurations().getTrainingWorkspaceMode());
System.out.println("Inference workspace config: " + net.getLayerWiseConfigurations().getInferenceWorkspaceMode());

or for a ComputationGraph using:

System.out.println("Training workspace config: " + cg.getConfiguration().getTrainingWorkspaceMode());
System.out.println("Inference workspace config: " + cg.getConfiguration().getInferenceWorkspaceMode());

Step 9: Check for a badly configured network or network with layer bottlenecks

Another possible cause (especially for newer users) is a poorly designed network. A network may be poorly designed if:

It has too many layers. A rough guideline:
- More than about 100 layers for a CNN may be too many
- More than about 10 layers for a RNN/LSTM network may be too many
- More than about 20 feed-forward layers may be too many for a MLP
The input/activations are too large
- For CNNs, inputs in the range of 224x224 (for image classification) to 600x600 (for object detection and segmentation) are used. Large image sizes (such as 500x500) are computationally demanding, and much larger than this should be considered too large in most cases.
- For RNNs, the sequence length matters. If you are using sequences longer than a few hundred steps, you should use if possible.
The output number of classes is too large
- Classification with more than about 10,000 classes can become a performance bottleneck with standard softmax output layers
The layers are too large
- For CNNs, most layers have kernel sizes in the range 2x2 to 7x7, with channels equal to 32 to 1024 (with larger number of channels appearing later in the network). Much larger than this may cause a performance bottleneck.
- For MLPs, most layers have at most 2048 units/neurons (often much smaller). Much larger than this may be too large.
- For RNNs such as LSTMs, layers are typically in the range of 128 to 512, though the largest RNNs may use around 1024 units per layer.
The network has too many parameters
- This is usually a consequence of the other issues already mentioned - too many layers, too large input, too many output classes
- For comparison, less than 1 million parameters would be considered small, and more than about 100 million parameters would be considered very large.
- You can check the number of parameters using MultiLayerNetwork/ComputationGraph.numParams() or MultiLayerNetwork/ComputationGraph.summary()

If your network architecture is significantly outside of the guidelines specified here, you may want to reconsider the design to improve performance.

Step 10: Check for CPU-only ops (when using GPUs)

If you are using CPUs only (nd4j-native backend), you can skip this step, as it only applies when using the GPU (nd4j-cuda) backend.

The layers without GPU support as of 1.0.0-beta3 include:

Convolution3D
Upsampling1D/2D/3D
Deconvolution2D
LocallyConnected1D/2D
SpaceToBatch
SpaceToDepth

Unfortunately, there is no workaround or fix for now, until these operations have GPU implementations completed.

Step 11: Check CPU support for hardware extensions (AVX etc)

If you are running on a GPU, this section does not apply.

In summary, CPU models with AVX2 support will perform better than those without it; similarly, AVX512 is an improvement over AVX2.

For more details on AVX, see the

Step 12: Check other processes using CPU or GPU resources

Another obvious cause of performance issues is other processes using CPU or GPU resources.

For CPU, it is straightforward to see if other processes are using resources using tools such as top (for Linux) or task managed (for Windows).

On Linux, this is usually on the system path by default. On Windows, it may be found at C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi

Step 13: Check OMP_NUM_THREADS performing concurrent inference using CPU in multiple threads simultaneously

If you are using GPUs (nd4j-cuda backend), you can skip this section.

This also applies if the CPU resources are shared with other computationally demanding processes.

In either case, you may see better overall throughput by reducing the number of OpenMP threads by setting the OMP_NUM_THREADS environment variable - see for details.

One reason for reducing OMP_NUM_THREADS improving overall performance is due to reduced .

Debugging Performance Issues with JVM Profiling

Profiling is a process whereby you can trace how long each method in your code takes to execute, to identify and debug performance bottlenecks.

How to Perform Profiling

Multiple options are available for performing profiling locally. We suggest using either or for profiling.

The YourKit profiling documentation is quite good. To perform profiling with YourKit:

Install and start YourKit Profiler
Start your application with the profiler enabled. For details, see and
- Note that IDE integrations are available - see
Collect a snapshot and analyze

Note that YourKit provides multiple different types of profiling: Sampling, tracing, and call counting. Each type of profiling has different pros and cons, such as accuracy vs. overhead. For more details, see

VisualVM also supports profiling - see the Profiling Applications section of the for more details.

Profiling on Spark

When debugging performance issues for Spark training or inference jobs, it can often be useful to perform profiling here also.

One approach that we have used internally is to combine manual profiling settings (-agentpath JVM argument) with spark-submit arguments for YourKit profiler.

To perform profiling in this manner, 5 steps are required:

Download YourKit profiler to a location on each worker (must be the same location on each worker) and (optionally) the driver
[Optional] Copy the profiling configuration onto each worker (must be the same location on each worker)
Create a local output directory for storing the profiling result files on each worker
Launch the Spark job with the appropriate configuration (see example below)
The snapshots will be saved when the Spark job completes (or is cancelled) to the specified directories.

For example, to perform tracing on both the driver and the workers,

spark-submit
    --conf 'spark.executor.extraJavaOptions=-agentpath:/home/user/YourKit-JavaProfiler-2018.04/bin/linux-x86-64/libyjpagent.so=tracing,port=10001,dir=/home/user/yourkit_snapshots/executor/,tracing_settings_path=/home/user/yourkitconf.txt'
    --conf 'spark.driver.extraJavaOptions=-agentpath:/home/user/YourKit-JavaProfiler-2018.04/bin/linux-x86-64/libyjpagent.so=tracing,port=10001,dir=/home/user/yourkit_snapshots/driver/,tracing_settings_path=/home/user/yourkitconf.txt'
    <other spark submit arguments>

The configuration (tracing_settings_path) is optional. A sample tracing settings file is provided below:

walltime=*
adaptive=true
adaptive_min_method_invocation_count=1000
adaptive_max_average_method_time_ns=100000