
How To Guides

Building on Windows

All of these instructions assume you are on a 64-bit system

libnd4j depends on some Unix utilities for compilation. So in order to compile it you will need to install Msys2.

After you have set up Msys2 by following their instructions, you will have to install some additional development packages. Start the msys2 shell and set up the dev environment with:

pacman -S mingw-w64-x86_64-gcc mingw-w64-x86_64-cmake mingw-w64-x86_64-extra-cmake-modules make pkg-config grep sed gzip tar mingw64/mingw-w64-x86_64-openblas mingw-w64-x86_64-lz4 mingw-w64-x86_64-gdb mingw-w64-x86_64-make mingw-w64-x86_64-ninja

This will install the needed dependencies for use in the msys2 shell. You will have to use the msys2 shell (specifically c:\msys64\mingw64.exe) for the whole compilation process.

You will also need to set up your PATH environment variable to include C:\msys64\mingw64\bin (or wherever you have decided to install msys2). If you have IntelliJ (or another IDE) open, you will have to restart it before this change takes effect for applications started through it. If you don't, you will probably see a "Can't find dependent libraries" error.

For cpu, we recommend openblas. We will be adding instructions for mkl and other cpu implementations later.

Send us a pull request or file an issue if you have something in particular you are looking for.

Building libnd4j

libnd4j and nd4j go hand in hand, and libnd4j is required for two out of the three currently supported backends (nd4j-native and nd4j-cuda). For this reason they should always be rebuilt together.

Additional build arguments

There are a few additional arguments you can pass to the buildnativeoperations.sh script:

 -a // shortcut for -march/-mtune, e.g. -a native
 -b release OR -b debug // selects a release or debug build; release is the default
 -j XX // defines how many threads will be used to build the binaries on your box, e.g. -j 8
 -cc // CUDA-only argument, builds binaries only for the target GPU architecture. Use this for fast builds
 -h cudnn // (EXPERIMENTAL: enable cuDNN support)

Building the CPU Backend

Now clone this repository, and in that directory run the following to build the dll for the cpu backend:

./buildnativeoperations.sh

Building the CUDA Backend

The CUDA Backend has some additional requirements before it can be built:

CUDA SDK
Visual Studio 2015 or 2017 or 2019 (Please note: Visual Studio 2017 is NOT SUPPORTED by CUDA 8.0 and below, Visual Studio 2019 is supported since CUDA 10.2)

In order to build the CUDA backend you will have to setup some more environment variables first, by calling vcvars64.bat. But first, set the system environment variable SET_FULL_PATH to true, so all of the variables that vcvars64.bat sets up, are passed to the mingw shell. Additionally, you need to open the mingw64.ini in your msys64 installation folder and add the command: MSYS2_PATH_TYPE=inherit. Replace YOUR VERSION with the target version. 14.0 is known to work.

Inside a normal cmd.exe command prompt, run "C:\Program Files (x86)\Microsoft Visual Studio *YOUR VERSION*\VC\bin\amd64\vcvars64.bat" (keep the quotes, since the path contains spaces)
Run c:\msys64\mingw64.exe inside that
Change to your libnd4j folder
./buildnativeoperations.sh -c cuda -cc YOUR_DEVICE_ARCH

This builds the CUDA nd4j.dll.

Building nd4j

While still in the libnd4j folder, run:

export LIBND4J_HOME=`pwd`

or

If you want to use Control Panel for that: if you have libnd4j path looking like 'c:\Users\username\libnd4j' set LIBND4J_HOME to '/Users/username/libnd4j'

Now leave the libnd4j directory and clone the repository. Run the following to compile nd4j with support for both the native cpu backend as well as the cuda backend:

mvn clean install -DskipTests -Dmaven.javadoc.skip=true

If you don't want the cuda backend, e.g. because you didn't or can't build it, you can skip it:

mvn clean install -DskipTests -Dmaven.javadoc.skip=true -pl '!org.nd4j:nd4j-cuda-9.0,!org.nd4j:nd4j-cuda-9.0-platform,!org.nd4j:nd4j-tests'

Please notice the single quotes around the last parameter, if you leave them out or use double quotes you will get an error about event not found from your shell. If this doesn't work, make sure you have a current version of maven installed.

Also, if you're going to build DeepLearning4j without CUDA available, you'll have to skip the deeplearning4j-cuda-9.0 (or 8.0) artifact as well:

mvn clean install -DskipTests=true -Dmaven.javadoc.skip=true -pl '!org.deeplearning4j:deeplearning4j-cuda-9.0'

Using the Native Backend

In order to use your new shiny backends you will have to switch your application to use the version of ND4J that you just compiled and to use the native backend.

To do this, change the version of all your ND4J dependencies to the version you have built, e.g. "0.9.2-SNAPSHOT".

CPU Backend

Use the nd4j-native backend like this:

    <dependency>
      <groupId>org.nd4j</groupId>
      <artifactId>nd4j-native</artifactId>
      <version>0.9.2-SNAPSHOT</version>
    </dependency>

CUDA Backend

Exchange nd4j-native for nd4j-cuda-9.0 (or nd4j-cuda-8.0) like this:

    <dependency>
      <groupId>org.nd4j</groupId>
      <artifactId>nd4j-cuda-9.0</artifactId>
      <version>0.9.2-SNAPSHOT</version>
    </dependency>

Troubleshooting

When I start my application, I still see a "Can't find dependent libraries" error

If your application continues to run, then you are just seeing an artefact of one way we try to load the native library, but your application should run just fine.

If your application crashes (and you see that error more than once) then you probably have a problem with your PATH environment variable. Please make sure that you have your msys2 bin directory on the PATH and that you restarted your IDE. Maybe even try to restart the system.

I'm having trouble downloading or updating packages using pacman

There are a number of things that can potentially go wrong. First, try updating pacman using the following commands:

pacman -Syy
pacman -Syu
pacman -S pacman-mirrors

Note that you might need to restart the msys2 shell between/after these steps.

One user has reported issues downloading packages using the default downloader (timeouts and "error: failed retrieving file" messages). If you are experiencing these issues, it may help to switch to using the wget downloader. To do this, install wget using

pacman -S wget

then uncomment (remove the # symbol) the following line in the /etc/pacman.conf configuration file:

XferCommand = /usr/bin/wget --passive-ftp -c -O %o %u

"buildnativeoperations.sh blas cpu" can't find BLAS libraries

First, make sure you have BLAS libraries installed. Typically, this involves building OpenBLAS by downloading OpenBLAS and running the commands 'make', 'make install' in msys2.

Running the buildnativeoperations.sh script in the MinGW-w64 Win64 Shell instead of the standard msys2 shell may resolve this issue.

I'm getting other errors not listed here

Depending on how your build environment and PATH environment variable is set up, you might experience some other issues. Some situations that may be problematic include:

Having older (or multiple) MinGW installs on your PATH (check: type "where c++" or "where gcc" into msys2)
Having older (or multiple) cmake installs on your PATH (check: "where cmake" and "cmake --version")
Having multiple BLAS libraries on your PATH (check: "where libopenblas.dll", "where libblas.dll" and "where liblapack.dll")

I'm getting `jniNativeOps.dll: Can't find dependent libraries` errors

This is usually due to an incorrectly set up PATH (see "I'm getting other errors not listed here"). As the PATH in the msys2 shell is a little bit different than for other applications, you can check that the PATH is really the problem by running the following test program:

public class App {
    public static void main(String[] args){
        // loadLibrary takes the bare library name; the JVM appends ".dll" on Windows
        System.loadLibrary("libopenblas");
    }
}

If this also crashes with the Can't find dependent libraries error, then you have to setup your PATH correctly (see the introduction to this document).

Note: Another possible cause of "...jniNativeOps.dll: Can't find dependent libraries" seems to be having an old or incompatible version of libstdc++-6.dll on your PATH. You want this file to be pulled in from mingw via your PATH environment variable. To check your PATH/environment, run where libstdc++-6.dll and where libgcc_s_seh-1.dll; these should list the msys/mingw directories (and/or list them first, if there are other copies on the PATH).

Finally, using dumpbin (from Visual Studio) can help to show required dependencies for jniNativeOps.dll:

dumpbin /dependents [path to jniNativeOps.dll]

My application crashes on the first usage of ND4J with the CUDA Backend (Windows)

Exception in thread "main" java.lang.RuntimeException: Can't allocate [HOST] memory: 32

If the Exception you are getting looks anything like this, and you see this upon startup:

o.n.j.c.CudaEnvironment - Device [0]: Free: 0 Total memory: 0

Then you are most probably trying to use a mobile GPU (like 970m) and Optimus is trying to ruin the day. First you should try to force the usage of the GPU through normal means, like setting the JVM to run on your GPU via the Nvidia System Panel or by disabling the iGPU in your BIOS. If this still isn't enough, you can try the following workaround, that while not recommended for production, should allow you to still use your GPU.

You will have to add JOGL to your dependencies:

    <dependency>
      <groupId>org.jogamp.gluegen</groupId>
      <artifactId>gluegen-rt-main</artifactId>
      <version>2.3.1</version>
    </dependency>
    <dependency>
      <groupId>org.jogamp.jogl</groupId>
      <artifactId>jogl-all-main</artifactId>
      <version>2.3.1</version>
    </dependency>

And as the very first thing in your main method you will need to add:

        GLProfile.initSingleton();

This should allow ND4J to work correctly (you still have to set that the JVM has to use the GPU in the Nvidia System Panel).

My Display Driver / System crashes when I use the CUDA Backend (Windows)

ND4J is meant to be used with pure compute cards (i.e. the Tesla series). On consumer GPUs that are mainly meant for gaming, this results in usage that can conflict with the card's primary job: displaying your desktop.

Microsoft has added Timeout Detection and Recovery (TDR) to detect malfunctioning drivers and improper usage. It interferes with the compute tasks of ND4J by killing them if they occupy the GPU for longer than a few seconds. This produces the "Display driver stopped responding and has recovered" message, and you see what looks like a driver crash along with a crash of your application. If you try to run it again, TDR may decide that something is messing with the display driver and force a reboot.

If you really want to use your display GPU for compute with ND4J (not recommended), you will have to disable TDR by setting TdrLevel=0 (see https://msdn.microsoft.com/en-us/library/windows/hardware/ff569918%28v=vs.85%29.aspx). If you do this you will have display freezes, which, depending on your workload, can stay quite a long time.

My JVM is crashing with the problematic frame being in `cygwin1.dll`

If you have any cygwin related dlls in the crash log, this means that you have built libnd4j or nd4j with cygwin being on the PATH before Msys2. This results in successful compilation, but crashes the JVM in some use cases.

In order to fix this problem, all you have to do is to remove cygwin from your PATH while building libnd4j and nd4j.

If you want to inspect your path you can do this by running:

    echo $PATH

If you want to set your PATH temporarily, you can do so with:

    export PATH=... # Replace ... with what ever you want to have there

CUDA build is failing with cmake/nmake errors

Some errors such as the following can appear if the visual studio vcvars64.bat file is run before attempting the cuda build.

  The parameter is incorrectRC Pass 1 failed to run.
  NMAKE : fatal error U1077: 'C:\msys64\mingw64\bin\cmake.exe' : return code '0xffffffff'
  NMAKE : fatal error U1077: '"C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64\nmake.exe"' : return code '0x2'

To resolve this, ensure that you haven't run vcvars64/vcvarsall in the msys2 shell before building.

MSI Installer

To build an MSI Installer run: ./buildnativeoperations.sh -p msi

For gpu run: ./buildnativeoperations.sh -p msi -c cuda

BLAS Impls

Openblas: Ensure that you set up $MSYSROOT/opt/OpenBLAS/lib. If you built OpenBLAS in msys2 (make, make install), then you should not need to do anything else.

Note: our informal/unscientific testing suggests that Intel MKL can be about equal with, and up to about 40% faster than OpenBLAS on some matrix multiply (gemm) operations, on some machines. Installing MKL is recommended but not required.

MKL Setup

To build libnd4j with MKL:

Download MKL from https://software.intel.com/en-us/articles/free_mkl and install. Registration is required (free).
Add the \redist\intel64_win\mkl directory to your system PATH environment variable. This will be in a location such as C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2016.3.207\windows\redist\intel64_win\mkl\

Then build libnd4j as before. You may have to be careful about having multiple BLAS implementations on your path. Ideally, have only MKL on the path while building libnd4j.

Note: you may be able to get some additional performance on hyperthreaded processors by setting the system/environment variable MKL_DYNAMIC to have the value 'false'.


Building for raspberry pi or Jetson Nano

Run bash pi_build.sh: this helper script lets you cross-build libnd4j and dl4j with the Arm Compute Library. It downloads the cross compiler and the Arm Compute Library for you.

example: bash pi_build.sh --arch android-arm64 --mvn

To change the version of the Arm Compute Library, modify this line in the script:

    ARMCOMPUTE_TAG=v20.05

Older instructions

Please follow these instructions to build nd4j for Raspberry Pi:

download cross compilation tools for Raspberry PI

 $ apt-get/yum install git cmake
 (You may substitute any path you prefer instead of $HOME/raspberrypi in the following two steps)
 $ mkdir $HOME/raspberrypi
 $ export RPI_HOME=$HOME/raspberrypi
 $ cd $RPI_HOME
 $ git clone https://github.com/raspberrypi/tools.git
 $ export PATH=$PATH:$RPI_HOME/tools/arm-bcm2708/arm-rpi-4.9.3-linux-gnueabihf/bin

download deeplearning4j:

 $ cd $HOME
 $ git clone https://github.com/eclipse/deeplearning4j.git

build libnd4j:

 $ cd deeplearning4j/libnd4j
 $ ./buildnativeoperations.sh -o linux-armhf

build nd4j

 $ export LIBND4J_HOME=<pathTond4JNI>
 $ cd $HOME/deeplearning4j/nd4j
 $ mvn clean install -Djavacpp.platform=linux-armhf -Djavacpp.platform.compiler=$HOME/raspberrypi/tools/arm-bcm2708/arm-rpi-4.9.3-linux-gnueabihf/bin/arm-linux-gnueabihf-g++ -DskipTests  -Dmaven.javadoc.skip=true  -pl '!:nd4j-cuda-9.1,!:nd4j-cuda-9.1-platform,!:nd4j-tests'

Building on ios

LLVM 4.0 was used to build ios-arm, ios-x86, and ios-x86_64.

When building on ios, a static library is assembled. For more on how to build for ios, please see gluon's guide.

How to Add Operations

libnd4j supports several different op designs, and in this guide we'll try to explain how to build your very own operation.

XYZ operations

These operations are split into multiple subtypes, based on element access and result type:

Transform operations: These operations typically take some NDArray in, and change each element independently of the others.
Reduction operations: These operations take some NDArray and dimensions, and return a reduced NDArray (or scalar), e.g. a sum along dimension(s).
Scalar operations: These operations are similar to transforms, but they only do arithmetic, and the second operand is a scalar, e.g. adding a given scalar value to each element of the given NDArray.
Pairwise operations: These operations sit between regular transform operations and scalar operations, e.g. element-wise addition of two NDArrays.
Random operations: Most of these operations relate to random number distributions: Uniform, Gauss, Bernoulli etc.
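To make the reduction case concrete, here is a minimal standalone sketch in plain C++. It does not use libnd4j: the helper name sumAlongDimension, the row-major std::vector buffer and the rows/cols parameters are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not a libnd4j API): sums a row-major rows x cols
// matrix along one dimension and returns the reduced vector.
// dimension == 0 collapses the rows (one sum per column),
// dimension == 1 collapses the columns (one sum per row).
template <typename T>
std::vector<T> sumAlongDimension(const std::vector<T>& x, std::size_t rows,
                                 std::size_t cols, int dimension) {
    if (x.size() != rows * cols)
        throw std::invalid_argument("buffer size does not match shape");
    if (dimension != 0 && dimension != 1)
        throw std::invalid_argument("dimension must be 0 or 1");

    std::vector<T> z(dimension == 0 ? cols : rows, T(0));
    for (std::size_t r = 0; r < rows; r++)
        for (std::size_t c = 0; c < cols; c++)
            z[dimension == 0 ? c : r] += x[r * cols + c];
    return z;
}
```

For the 2 x 3 matrix {1, 2, 3, 4, 5, 6}, reducing along dimension 0 gives {5, 7, 9}, and reducing along dimension 1 gives {6, 15}.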

Despite the differences between these operations, they all use the XZ/XYZ three-operand design, where X and Y are inputs, and Z is the output. Data access in these operations is usually trivial and loop based. For example, the most trivial loop for a scalar transform looks like this:

for (Nd4jLong i = start; i < end; i++) {
    result[i] = OpType::op(x[i], scalar, extraParams);
}

The operation used in this loop is template-driven and compiled statically. There are other loop implementations, depending on the op group or the strides within the NDArrays, but the idea is always the same: each element of the NDArray is accessed within a loop.
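The loop above can be tried outside libnd4j with a small self-contained sketch. ScalarAdd and scalarTransform are hypothetical names, and std::vector stands in for the NDArray buffers; only the OpType::op(x, scalar, extraParams) call shape is taken from the snippet above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical op class; it mirrors the OpType::op(x, scalar, extraParams)
// call shape used in the loop above, but is not taken from libnd4j.
template <typename T>
struct ScalarAdd {
    static T op(T x, T scalar, const T* /*extraParams*/) {
        return x + scalar;
    }
};

// The op is a template parameter, so the call is resolved at compile time
// and can be inlined into the loop body.
template <typename T, typename OpType>
void scalarTransform(const std::vector<T>& x, T scalar, std::vector<T>& z) {
    z.resize(x.size());
    for (std::size_t i = 0; i < x.size(); i++) {
        z[i] = OpType::op(x[i], scalar, nullptr);
    }
}
```

Calling scalarTransform<float, ScalarAdd<float>>(x, 2.0f, z) writes x[i] + 2 into z[i] for every element. Swapping in a different op class changes the arithmetic without touching the loop.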

Now, let's take a look at a typical XYZ op implementation. Here's what the Add operation looks like:

template<typename T>
class Add {
public:
    op_def static T op(T d1, T d2) {
        return d1 + d2;
    }

    // this signature will be used in Scalar loops
    op_def static T op(T d1, T d2, T *params) {
        return d1 + d2;
    }

    // this signature will be used in reductions
    op_def static T op(T d1) {
        return d1;
    }

    // op for MetaOps
    op_def static T op(T d1, T *params) {
        return d1 + params[0];
    }
};

This particular operation is used in different XYZ op groups, but you see the idea: element-wise operation, which is invoked on each element in given NDArray. So, if you want to add new XYZ operation to libnd4j, you should just add operation implementation to file includes/ops/ops.h, and assign it to specific ops group in file includes/loops/legacy_ops.h together with some number unique to this ops group, i.e.: (21, simdOps::Add)
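To illustrate what pairing an op with a number buys you, here is a simplified sketch of number-based dispatch. The real mapping lives in legacy_ops.h; the hand-written switch below is only a stand-in. The number 21 for Add comes from the example above, while Subtract, the number 22 and execPairwise are hypothetical.

```cpp
#include <stdexcept>

template <typename T>
struct Add {
    static T op(T d1, T d2) { return d1 + d2; }
};

// Hypothetical second op, present only so the switch has two cases.
template <typename T>
struct Subtract {
    static T op(T d1, T d2) { return d1 - d2; }
};

// Hypothetical dispatcher: maps an op number, unique within this op group,
// to a statically compiled op implementation.
template <typename T>
T execPairwise(int opNum, T d1, T d2) {
    switch (opNum) {
        case 21: return Add<T>::op(d1, d2);
        case 22: return Subtract<T>::op(d1, d2);  // number chosen for illustration
        default: throw std::invalid_argument("unknown op number");
    }
}
```

The caller only needs the op number at run time, while each case still calls a statically compiled op.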

After libnd4j is recompiled, this op will become available for legacy execution mechanism, NDArray wrappers, and LegacyOp wrappers (those are made to map legacy operations to CustomOps design for Graph).

Custom operations

Custom operations are a newer concept, aimed mostly at SameDiff/Graph needs. For CustomOps we defined a universal signature, with a variable number of input/output NDArrays, and a variable number of floating-point and integer arguments.

However, there are some minor differences between the various CustomOp declarations:

DECLARE_OP(string, int, int, bool): these operations take no fp/int arguments, and output shape equals to input shape.
DECLARE_CONFIGURABLE_OP(string, int, int, bool, int, int): these operations do take fp/int output arguments, and output shape equals to input shape.
DECLARE_REDUCTION_OP(string, int, int, bool, int, int): these operations do take fp/int output arguments, and output shape is calculated as Reduction.
DECLARE_CUSTOM_OP(string, int, int, bool, int, int): these operations return NDArray with custom shape, that usually depends on input and arguments.
DECLARE_BOOLEAN_OP(string, int, bool): these operations take some NDArrays and return scalar, where 0 is False, and other values are treated as True.

Let's take a look at example CustomOp:

CUSTOM_OP_IMPL(tear, 1, -1, false, 0, -1) {
    auto input = INPUT_VARIABLE(0);

    REQUIRE_TRUE(!block.getIArguments()->empty(), 0, "At least 1 dimension should be specified for Tear");

    std::vector<int> dims(*block.getIArguments());

    for (auto &v: dims)
        REQUIRE_TRUE(v >= 0 && v < input->rankOf(), 0, "Tear dimensions should be non-negative values, and lower then input rank. Got %i instead", v);

    auto tads = input->allTensorsAlongDimension(dims);
    for (int e = 0; e < tads->size(); e++) {
        auto outE = OUTPUT_VARIABLE(e);
        outE->assign(tads->at(e));

        this->storeResult(block, e, *outE);
    }

    delete tads;

    return ND4J_STATUS_OK;
}

DECLARE_SHAPE_FN(tear) {
    auto inShape = inputShape->at(0);

    std::vector<int> dims(*block.getIArguments());

    if (dims.size() > 1)
        std::sort(dims.begin(), dims.end());

    shape::TAD tad(inShape, dims.data(), (int) dims.size());
    tad.createTadOnlyShapeInfo();
    Nd4jLong numTads = shape::tadLength(inShape, dims.data(), (int) dims.size());

    auto result = SHAPELIST();
    for (int e = 0; e < numTads; e++) {
        int *newShape;
        COPY_SHAPE(tad.tadOnlyShapeInfo, newShape);
        result->push_back(newShape);
    }

    return result;
}

In the example above, we declare tear CustomOp implementation, and shape function for this op. So, at the moment of op execution, we assume that we will either have output array(s) provided by end-user, or they will be generated with shape function.
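What a shape function like this has to work out is how many output arrays there are and what shape each one has. The sketch below shows that arithmetic in plain C++, assuming the usual meaning of "tensor along dimension": the listed dimensions are kept inside each sub-tensor, and the remaining dimensions are iterated over. TadInfo and describeTads are hypothetical names, not libnd4j APIs.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch.
struct TadInfo {
    std::vector<long long> tadShape;  // shape of each sub-tensor
    long long numTads;                // how many sub-tensors there are
};

// Hypothetical helper. Assumes dims are unique, sorted ascending,
// and all within [0, rank).
inline TadInfo describeTads(const std::vector<long long>& shape,
                            const std::vector<int>& dims) {
    TadInfo info{{}, 1};
    std::size_t next = 0;
    for (std::size_t d = 0; d < shape.size(); d++) {
        if (next < dims.size() && dims[next] == static_cast<int>(d)) {
            info.tadShape.push_back(shape[d]);  // dimension kept inside each TAD
            next++;
        } else {
            info.numTads *= shape[d];           // dimension iterated over
        }
    }
    return info;
}
```

For an input of shape {2, 3, 4} and dimensions {1, 2}, this yields 2 sub-tensors of shape {3, 4}.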

You can also see number of macros used, we'll cover those later as well. Beyond that - op execution logic is fairly simple & linear: Each new op implements protected member function DeclarableOp<T>::validateAndExecute(Block<T>& block), and this method is eventually called either from GraphExecutioner, or via direct call, like DeclarableOp<T>::execute(Block<T>& block).

Important part of op declaration is input/output description for the op. I.e. as shown above: CUSTOM_OP_IMPL(tear, 1, -1, false, 0, -1). This declaration means:

Op name: tear
Op expects at least 1 NDArray as input
Op returns unknown positive number of NDArrays as output
Op can't be run in-place, so under any circumstances original NDArray will stay intact
Op doesn't expect any T (aka floating point) arguments
Op expects unknown positive number of integer arguments. In case of this op it's dimensions to split input NDArray.

Here's another example: DECLARE_CUSTOM_OP(permute, 1, 1, true, 0, -2); This declaration means:

Op name: permute
Op expects at least 1 NDArray as input
Op returns 1 NDArray as output
Op can be run in-place if needed (it means: input == output, and input is modified and returned as output)
Op doesn't expect any T arguments
Op expects unknown number of integer arguments OR no integer arguments at all.

Note on parameters: Negative values (-1,-2) mean very specific things. When op validation is invoked (checking the parameters) either the exact number of parameters in the descriptor must be present for each type or the following:

-1 means at least 1 of the expected parameter will be present
-2 means an unknown number of parameters. Use this in situations where inputs of certain types may be optional. A common use case is when a parameter may be passed in either as an NDArray or as a TARG or IARG (floating point or integer arguments respectively)
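The counting rule in this note can be written down as a small check. This is a sketch that follows the note literally, not libnd4j's actual validation code, which may be more lenient (for example for input arrays). OpArity, arityMatches and argsMatch are hypothetical names.

```cpp
// Hypothetical descriptor holding the numbers from an op declaration,
// e.g. (tear, 1, -1, false, 0, -1).
struct OpArity {
    int numInputs;
    int numOutputs;
    bool allowsInplace;
    int numTArgs;
    int numIArgs;
};

// expected >= 0  : exactly that many values must be present
// expected == -1 : at least one value must be present
// expected == -2 : any number of values, including none
inline bool arityMatches(int expected, int actual) {
    if (actual < 0) return false;
    if (expected == -2) return true;
    if (expected == -1) return actual >= 1;
    return actual == expected;
}

// Checks the counts a caller actually supplied against the descriptor.
inline bool argsMatch(const OpArity& d, int inputs, int tArgs, int iArgs) {
    return arityMatches(d.numInputs, inputs) &&
           arityMatches(d.numTArgs, tArgs) &&
           arityMatches(d.numIArgs, iArgs);
}
```

Under this rule, tear with no integer arguments is rejected, while permute accepts any number of integer arguments, including none.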

c++11 syntactic sugar

In ops you can easily use C++11 features, including lambdas. In some cases the easiest way to build your custom op (or some part of it) might be via NDArray::applyLambda or NDArray::applyPairwiseLambda:

auto lambda = LAMBDA_TT(_x, _y) {
    return (_x + _y) * 2;
};

x.applyPairwiseLambda(&y, lambda);

In this simple example, each element of NDArray x will get values set to x[e] = (x[e] + y[e]) * 2.
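The same idea can be tried without libnd4j. In the sketch below, applyPairwise is a hypothetical free function over std::vector that stands in for NDArray::applyPairwiseLambda, and an ordinary C++11 lambda replaces the LAMBDA_TT macro.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for NDArray::applyPairwiseLambda: overwrites each
// x[e] with lambda(x[e], y[e]).
template <typename T, typename F>
void applyPairwise(std::vector<T>& x, const std::vector<T>& y, F lambda) {
    if (x.size() != y.size())
        throw std::invalid_argument("x and y must have the same length");
    for (std::size_t e = 0; e < x.size(); e++) {
        x[e] = lambda(x[e], y[e]);
    }
}
```

With x = {1, 2, 3}, y = {4, 5, 6} and the lambda [](float a, float b) { return (a + b) * 2; }, x becomes {10, 14, 18}.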

Tests

For tests libnd4j uses the Google Test suite. All tests are located in the tests_cpu/layers_tests folder. Here's a simple way to run them from the command line:

cd tests_cpu
cmake -G "Unix Makefiles"
make -j 4
./layers_tests/runtests

You can also use your IDE (i.e. Jetbrains CLion) to run tests via GUI.

PLEASE NOTE: if you're considering submitting your new op to libnd4j repository via pull request - consider adding tests for it. Ops without tests won't be approved.

Backend-specific operation

GPU/MPI/whatever to be added soon.

Utility macros

We have number of utility macros, suitable for custom ops. Here they are:

INPUT_VARIABLE(int): this macro returns you NDArray at specified input index.
OUTPUT_VARIABLE(int): this macro returns you NDArray at specified output index.
STORE_RESULT(NDArray): this macro stores result to VariableSpace.
STORE_2_RESULTS(NDArray, NDArray): this macro stores results accordingly to VariableSpace.
INT_ARG(int): this macro returns you specific Integer argument passed to the given op.
T_ARG(int): this macro returns you specific T argument passed to the given op.
ALLOCATE(...): this macro checks whether a Workspace is available, and uses either the Workspace or direct memory allocation if no Workspace is available.
RELEASE(...): this macro is made to release memory allocated with ALLOCATE() macro.
REQUIRE_TRUE(...): this macro takes condition, and evaluates it. If evaluation doesn't end up as True - exception is raised, and specified message is printed out.
LAMBDA_T(X) and LAMBDA_TT(X, Y): lambda declaration for NDArray::applyLambda and NDArray::applyPairwiseLambda
COPY_SHAPE(SRC, TGT): this macro allocates memory for TGT pointer and copies shape from SRC pointer
ILAMBDA_T(X) and ILAMBDA_TT(X, Y): lambda declaration for indexed lambdas, index argument is passed in as Nd4jLong (aka long long)
FORCEINLINE: platform-specific definition for functions inlining
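To give a feel for how a validation macro of the REQUIRE_TRUE kind behaves, here is a simplified sketch. It is not the libnd4j definition: SKETCH_REQUIRE_TRUE, the exception type and checkedDimension are assumptions for illustration, and the second argument is kept only to mirror the call shape used in the tear example.

```cpp
#include <cstdio>
#include <stdexcept>

// Hypothetical macro: evaluates the condition; if it is false, prints the
// printf-style message to stderr and throws. The code argument is accepted
// but unused in this sketch.
#define SKETCH_REQUIRE_TRUE(cond, code, ...)                        \
    do {                                                            \
        if (!(cond)) {                                              \
            std::fprintf(stderr, __VA_ARGS__);                      \
            std::fprintf(stderr, "\n");                             \
            throw std::invalid_argument("op validation failed");    \
        }                                                           \
    } while (0)

// Hypothetical usage: validates a dimension argument against the input rank.
inline int checkedDimension(int dim, int rank) {
    SKETCH_REQUIRE_TRUE(dim >= 0 && dim < rank, 0,
                        "Dimension should be in [0, %i). Got %i instead",
                        rank, dim);
    return dim;
}
```

Wrapping the body in do { ... } while (0) lets the macro be used like a normal statement, including inside an unbraced if or for, as the tear example does.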

Explicit template instantiations in helper methods.

We should explicitly instantiate template methods for different data types in libraries. Furthermore, to speed up parallel compilation we need to put those template instantiations in separate source files. Another reason is that some compilers choke when there are too many template instantiations in one translation unit. To ease this cumbersome operation we have CMake helpers and macro helpers. Example: suppose we have the following function:

    template<typename X, typename Z>
    void  argMin_(const NDArray& input, NDArray& output, const std::vector<int>& dimensions);

We should write this to explicitly instantiate it.

BUILD_DOUBLE_TEMPLATE(template void argMin_, (const NDArray& input, NDArray& output, const std::vector<int>& dimensions),
               LIBND4J_TYPES, INDEXING_TYPES);

Here:

LIBND4J_TYPES means we want to use all types in the place of X
INDEXING_TYPES means we will use index types ( int, int64_t) as Z type
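For readers unfamiliar with explicit instantiation, the sketch below shows by hand roughly what such a macro saves you from writing: one instantiation line per combination of types. argMinIndex is a hypothetical, simplified helper over std::vector, and only two value types and two index types are listed, whereas the real type lists are longer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical simplified helper: returns the index of the smallest element
// as index type Z. Returns 0 for an empty input.
template <typename X, typename Z>
Z argMinIndex(const std::vector<X>& input) {
    Z best = 0;
    for (std::size_t i = 1; i < input.size(); i++) {
        if (input[i] < input[static_cast<std::size_t>(best)]) {
            best = static_cast<Z>(i);
        }
    }
    return best;
}

// Explicit instantiations: one line per (X, Z) pair. The compiler emits code
// for exactly these combinations in this translation unit.
template int argMinIndex<float, int>(const std::vector<float>&);
template std::int64_t argMinIndex<float, std::int64_t>(const std::vector<float>&);
template int argMinIndex<double, int>(const std::vector<double>&);
template std::int64_t argMinIndex<double, std::int64_t>(const std::vector<double>&);
```

With many value types and index types the number of lines grows as the product of the two lists, which is why a macro generates them and why splitting them across source files helps compilation.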

To speed up compilation and help the compilers, we can further separate the instantiations into different source files. First, we rename the original template source to use the hpp extension. Second, we add a file with the suffix cpp.in (or cu.in for cuda) that includes that hpp header, and place it in the appropriate compilation units folder. In our case it will be in the ./libnd4j/include/ops/declarable/helpers/cpu/compilation_units folder with the name argmax.cpp.in. Then we decide which type we want to split across different sources. In our case we want to split LIBND4J_TYPES (other ones: INT_TYPE, FLOAT_TYPE, PAIRWISE_TYPE). We tell cmake about this with the following line (adding the _GEN suffix):

#cmakedefine LIBND4J_TYPE_GEN

Then we just add _@FL_TYPE_INDEX@ as suffix in type name and it will split those types for us and generate cpp files inside ${CMAKE_BINARY_DIR}/compilation_units folder.

LIBND4J_TYPE_@FL_TYPE_INDEX@

Here is how the complete cpp.in file will look:

#cmakedefine LIBND4J_TYPE_GEN 
//this header is where our template functions resides
#include <ops/declarable/helpers/cpu/indexReductions.hpp>
namespace sd {
    namespace ops {
        namespace helpers {
            BUILD_DOUBLE_TEMPLATE(template void argMax_, (const NDArray& input, NDArray& output, const std::vector<int>& dimensions), 
               LIBND4J_TYPES_@FL_TYPE_INDEX@, INDEXING_TYPES);
        }
    }
}

How to Setup CLion

Setting up clion for modifying the libnd4j code base

Overview

In order to setup clion, we need to configure the cmake defaults to ensure that tests can be built. Normally this is setup by the libnd4j build script. A general tutorial on how to configure cmake profiles for clion can be found here. When configuring cmake to run, there are generally a few steps to follow.

This will in general include setting up the cmake gtest integration (we use google test for our test suite)

Setup a toolchain. This will depend on your OS. More can be found here. This will cover the compiler, debugger and other needed additional software to enable clion to manage your cmake project.
After configuration, let cmake build/index the files. It takes time to setup the files, auto complete and other expected functionality provided by the IDE. Ensure this is done by keeping an eye on the bottom right of the IDE to ensure all tasks are complete.
After your clion environment is set up, you may use the following cmake configuration for CPU: -DSD_CPU=true -DSD_BUILD_TESTS=true -DSD_X86_BUILD=true -DSD_ALL_OPS=true -DSD_ARCH=x86-64 -DSD_SHARED_LIB=true
The above will configure your IDE to build a shared library for intel cpus as well as configure the gtest setup to run. Please ensure that you read about how to configure cmake profiles (as linked above)
Now you should be able to make modifications backed by the IDE. Note that you can also run tests under tests_cpu/layers_tests
In order to run all tests, you may run the AllTests.cpp entry point. In order to run the tests (or even a specific test) just right click on any test and click the green arrow option that appears in the dropdown similar to "Run AllTests.." or something like that. For more information on this, please see here.
When running tests, note that it may take a while. It will build a whole executable compiling the relevant parts of the code base (or if necessary, the whole code base) in order to run the tests.

Building on Windows

All of these instructions assume you are on a 64-bit system

libnd4j depends on some Unix utilities for compilation. So in order to compile it you will need to install Msys2.

After you have setup Msys2 by following their instructions, you will have to install some additional development packages. Start the msys2 shell and setup the dev environment with:

pacman -S mingw-w64-x86_64-gcc mingw-w64-x86_64-cmake mingw-w64-x86_64-extra-cmake-modules make pkg-config grep sed gzip tar mingw64/mingw-w64-x86_64-openblas mingw-w64-x86_64-lz4 mingw-w64-x86_64-gdb mingw-w64-x86_64-make mingw-w64-x86_64-ninja

This will install the needed dependencies for use in the msys2 shell. You will have to use the msys2 shell (especially c:\msys64\mingw64.exe) for the whole compilation process.

For cpu, we recommend openblas. We will be adding instructions for mkl and other cpu implementations later.

Send us a pull request or file an issue if you have something in particular you are looking for.

Building libnd4j

libnd4j and nd4j go hand in hand, and libnd4j is required for two out of the three currently supported backends (nd4j-native and nd4j-cuda). For this reason they should always be rebuild together.

Additional build arguments

There's few additional arguments for buildnativeoperations.sh script you could use:

 -a // shortcut for -march/-mtune, i.e. -a native
 -b release OR -b debug // enables/desables debug builds. release is considered by default
 -j XX // this argument defines how many threads will be used to binaries on your box. i.e. -j 8
 -cc // CUDA-only argument, builds only binaries for target GPU architecture. use this for fast builds
 -h cudnn // (EXPERIMENTAL: enable cuDNN support)

Building the CPU Backend

Now clone this repository, and in that directory run the following to build the dll for the cpu backend:

./buildnativeoperations.sh

Building the CUDA Backend

The CUDA Backend has some additional requirements before it can be built:

CUDA SDK
Visual Studio 2015 or 2017 or 2019 (Please note: Visual Studio 2017 is NOT SUPPORTED by CUDA 8.0 and below, Visual Studio 2019 is supported since CUDA 10.2)

Inside a normal cmd.exe command prompt, run C:\Program Files (x86)\Microsoft Visual Studio *YOUR VERSION*\VC\bin\amd64\vcvars64.bat
Run c:\msys64\mingw64.exe inside that
Change to your libnd4j folder
./buildnativeoperations.sh -c cuda -сс YOUR_DEVICE_ARCH

This builds the CUDA nd4j.dll.

Building nd4j

While still in the libnd4j folder, run:

export LIBND4J_HOME=`pwd`

or

If you want to use Control Panel for that: if you have libnd4j path looking like 'c:\Users\username\libnd4j' set LIBND4J_HOME to '/Users/username/libnd4j'

Now leave the libnd4j directory and clone the repository. Run the following to compile nd4j with support for both the native cpu backend as well as the cuda backend:

mvn clean install -DskipTests -Dmaven.javadoc.skip=true

If you don't want the cuda backend, e.g. because you didn't or can't build it, you can skip it:

mvn clean install -DskipTests -Dmaven.javadoc.skip=true -pl '!org.nd4j:nd4j-cuda-9.0,!org.nd4j:nd4j-cuda-9.0-platform,!org.nd4j:nd4j-tests'

Also, if you're going to build DeepLearning4j without CUDA available, you'll have to skip the deeplearning4j-cuda-9.0 (or 8.0) artifact as well:

mvn clean install -DskipTests=true -Dmaven.javadoc.skip=true -pl '!org.deeplearning4j:deeplearning4j-cuda-9.0'

Using the Native Backend

In order to use your new shiny backends you will have to switch your application to use the version of ND4J that you just compiled and to use the native backend.

To do this, change the version of all your ND4J dependencies to the version you've built, e.g. "0.9.2-SNAPSHOT".

CPU Backend

Use the nd4j-native backend like this:

    <dependency>
      <groupId>org.nd4j</groupId>
      <artifactId>nd4j-native</artifactId>
      <version>0.9.2-SNAPSHOT</version>
    </dependency>

CUDA Backend

Exchange nd4j-native for nd4j-cuda-9.0 (or nd4j-cuda-8.0) like this:

    <dependency>
      <groupId>org.nd4j</groupId>
      <artifactId>nd4j-cuda-9.0</artifactId>
      <version>0.9.2-SNAPSHOT</version>
    </dependency>

Troubleshooting

When I start my application, I still see a "Can't find dependent libraries" error

If your application continues to run, then you are just seeing an artefact of one way we try to load the native library, but your application should run just fine.

I'm having trouble downloading or updating packages using pacman

There are a number of things that can potentially go wrong. First, try updating pacman using the following commands:

pacman -Syy
pacman -Syu
pacman -S pacman-mirrors

Note that you might need to restart the msys2 shell between/after these steps.

If problems persist, it may help to have pacman download packages through wget. Install it:

pacman -S wget

Then uncomment (remove the # symbol from) the following line in the /etc/pacman.conf configuration file:

XferCommand = /usr/bin/wget --passive-ftp -c -O %o %u

"buildnativeoperations.sh blas cpu" can't find BLAS libraries

First, make sure you have BLAS libraries installed. Typically, this involves downloading OpenBLAS and building it by running the commands 'make' and 'make install' in msys2.

Running the buildnativeoperations.sh script in the MinGW-w64 Win64 Shell instead of the standard msys2 shell may resolve this issue.

I'm getting other errors not listed here

Depending on how your build environment and PATH environment variable are set up, you might experience some other issues. Some situations that may be problematic include:

Having older (or multiple) MinGW installs on your PATH (check: type "where c++" or "where gcc" into msys2)
Having older (or multiple) cmake installs on your PATH (check: "where cmake" and "cmake --version")
Having multiple BLAS libraries on your PATH (check: "where libopenblas.dll", "where libblas.dll" and "where liblapack.dll")

I'm getting `jniNativeOps.dll: Can't find dependent libraries` errors

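A quick way to check whether the JVM can find OpenBLAS at all is a minimal test program:

public class App {
    public static void main(String[] args) {
        // Pass the library name without the .dll extension:
        // System.loadLibrary appends the platform-specific suffix itself.
        System.loadLibrary("libopenblas");
    }
}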

If this also crashes with the Can't find dependent libraries error, then you have to set up your PATH correctly (see the introduction to this document).

Finally, using dumpbin (from Visual Studio) can help to show required dependencies for jniNativeOps.dll:

dumpbin /dependents [path to jniNativeOps.dll]

My application crashes on the first usage of ND4J with the CUDA Backend (Windows)

Exception in thread "main" java.lang.RuntimeException: Can't allocate [HOST] memory: 32

If the Exception you are getting looks anything like this, and you see this upon startup:

o.n.j.c.CudaEnvironment - Device [0]: Free: 0 Total memory: 0

You will have to add JOGL to your dependencies:

    <dependency>
      <groupId>org.jogamp.gluegen</groupId>
      <artifactId>gluegen-rt-main</artifactId>
      <version>2.3.1</version>
    </dependency>
    <dependency>
      <groupId>org.jogamp.jogl</groupId>
      <artifactId>jogl-all-main</artifactId>
      <version>2.3.1</version>
    </dependency>

And as the very first thing in your main method you will need to add:

        GLProfile.initSingleton();

This should allow ND4J to work correctly (you still have to configure the JVM to use the GPU in the NVIDIA Control Panel).

My Display Driver / System crashes when I use the CUDA Backend (Windows)

My JVM is crashing with the problematic frame being in `cygwin1.dll`

In order to fix this problem, all you have to do is to remove cygwin from your PATH while building libnd4j and nd4j.

If you want to inspect your path you can do this by running:

    echo $PATH

If you want to set your PATH temporarily, you can do so with:

    export PATH=... # Replace ... with whatever you want to have there

CUDA build is failing with cmake/nmake errors

Errors such as the following can appear if the Visual Studio vcvars64.bat file is run before attempting the CUDA build.

  The parameter is incorrectRC Pass 1 failed to run.
  NMAKE : fatal error U1077: 'C:\msys64\mingw64\bin\cmake.exe' : return code '0xffffffff'
  NMAKE : fatal error U1077: '"C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64\nmake.exe"' : return code '0x2'

To resolve this, ensure that you haven't run vcvars64/vcvarsall in the msys2 shell before building.

MSI Installer

To build an MSI Installer run: ./buildnativeoperations.sh -p msi

For gpu run: ./buildnativeoperations.sh -p msi -c cuda

BLAS Impls

Openblas: Ensure that you set up $MSYSROOT/opt/OpenBLAS/lib. If you built OpenBLAS in msys2 (make, make install), then you should not need to do anything else.

MKL Setup

To build libnd4j with MKL:

Download MKL from https://software.intel.com/en-us/articles/free_mkl and install. Registration is required (free).
Add the \redist\intel64_win\mkl directory to your system PATH environment variable. This will be in a location such as C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2016.3.207\windows\redist\intel64_win\mkl\

Then build libnd4j as before. You may have to be careful about having multiple BLAS implementations on your path. Ideally, have only MKL on the path while building libnd4j.

Note: you may be able to get some additional performance on hyperthreaded processors by setting the system/environment variable MKL_DYNAMIC to have the value 'false'.


How to Add Operations

libnd4j supports several different op designs. In this guide we'll try to explain how to build your very own operation.

XYZ operations

These operations are split into several subtypes, based on element access and result type:

Transform operations: these typically take an NDArray as input and change each element independently of the others.
Reduction operations: these take an NDArray and dimensions, and return a reduced NDArray (or scalar), e.g. a sum along the given dimension(s).
Scalar operations: these are similar to transforms, but they only perform arithmetic, and the second operand is a scalar, e.g. adding a given scalar value to each element of an NDArray.
Pairwise operations: these sit between regular transform operations and scalar operations, e.g. element-wise addition of two NDArrays.
Random operations: most of these relate to random number distributions: Uniform, Gauss, Bernoulli, etc.
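
To make the reduction case concrete, here is a minimal sketch of a sum along one dimension of a row-major 2D buffer. It uses plain std::vector rather than NDArray, and the name sumAlongDimension is hypothetical; it is not a libnd4j function and says nothing about how libnd4j implements reductions internally.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not a libnd4j API): sums a row-major rows x cols
// buffer along the given dimension.
// dimension == 0 collapses the rows and yields one value per column;
// dimension == 1 collapses the columns and yields one value per row.
std::vector<double> sumAlongDimension(const std::vector<double>& data,
                                      std::size_t rows, std::size_t cols,
                                      int dimension) {
    if (data.size() != rows * cols)
        throw std::invalid_argument("buffer size does not match rows * cols");
    if (dimension != 0 && dimension != 1)
        throw std::invalid_argument("dimension must be 0 or 1");

    std::vector<double> result(dimension == 0 ? cols : rows, 0.0);
    for (std::size_t r = 0; r < rows; r++) {
        for (std::size_t c = 0; c < cols; c++) {
            // Accumulate into the slot that survives the reduction.
            result[dimension == 0 ? c : r] += data[r * cols + c];
        }
    }
    return result;
}
```

For the 2 x 3 buffer {1, 2, 3, 4, 5, 6}, reducing along dimension 0 gives {5, 7, 9} and reducing along dimension 1 gives {6, 15}.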

for (Nd4jLong i = start; i < end; i++) {
    result[i] = OpType::op(x[i], scalar, extraParams);
}

Now, let's take a look at a typical XYZ op implementation. Here's what the Add operation looks like:

template<typename T>
class Add {
public:
    op_def static T op(T d1, T d2) {
        return d1 + d2;
    }

    // this signature will be used in Scalar loops
    op_def static T op(T d1, T d2, T *params) {
        return d1 + d2;
    }

    // this signature will be used in reductions
    op_def static T op(T d1) {
        return d1;
    }

    // op for MetaOps
    op_def static T op(T d1, T *params) {
        return d1 + params[0];
    }
};
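
To show how such an op class plugs into the scalar loop shown earlier, here is a self-contained sketch. The op_def stand-in, the AddSketch class, and the applyScalar driver are all assumptions made for illustration; in libnd4j, op_def is provided by the library and the loops live in the library's own executioners.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for libnd4j's op_def macro, which is assumed here to expand to
// inlining/device qualifiers; this sketch only needs "inline".
#define op_def inline

// Same shape as the scalar signature of the Add op shown above.
template <typename T>
class AddSketch {
public:
    op_def static T op(T d1, T d2, T* params) {
        (void) params;  // Add does not use extra parameters
        return d1 + d2;
    }
};

// Hypothetical driver (not a libnd4j function): applies OpType::op to every
// element of x together with a single scalar operand.
template <typename T, typename OpType>
std::vector<T> applyScalar(const std::vector<T>& x, T scalar,
                           T* extraParams = nullptr) {
    std::vector<T> result(x.size());
    for (std::size_t i = 0; i < x.size(); i++) {
        result[i] = OpType::op(x[i], scalar, extraParams);
    }
    return result;
}
```

Calling applyScalar<float, AddSketch<float>>(x, 10.0f) on x = {1, 2, 3} yields {11, 12, 13}. Because the op is a template parameter, the compiler can inline OpType::op into the loop body, which is the main appeal of this design.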

Custom operations

There are some minor differences between the various CustomOp declarations:

DECLARE_OP(string, int, int, bool): these operations take no fp/int arguments, and the output shape equals the input shape.
DECLARE_CONFIGURABLE_OP(string, int, int, bool, int, int): these operations do take fp/int arguments, and the output shape equals the input shape.
DECLARE_REDUCTION_OP(string, int, int, bool, int, int): these operations do take fp/int arguments, and the output shape is calculated as for a reduction.
DECLARE_CUSTOM_OP(string, int, int, bool, int, int): these operations return an NDArray with a custom shape that usually depends on the input and arguments.
DECLARE_BOOLEAN_OP(string, int, bool): these operations take some NDArrays and return a scalar, where 0 is False and other values are treated as True.

Let's take a look at an example CustomOp:

CUSTOM_OP_IMPL(tear, 1, -1, false, 0, -1) {
    auto input = INPUT_VARIABLE(0);

    REQUIRE_TRUE(!block.getIArguments()->empty(), 0, "At least 1 dimension should be specified for Tear");

    std::vector<int> dims(*block.getIArguments());

    for (auto &v: dims)
        REQUIRE_TRUE(v >= 0 && v < input->rankOf(), 0, "Tear dimensions should be non-negative values, and lower then input rank. Got %i instead", v);

    auto tads = input->allTensorsAlongDimension(dims);
    for (int e = 0; e < tads->size(); e++) {
        auto outE = OUTPUT_VARIABLE(e);
        outE->assign(tads->at(e));

        this->storeResult(block, e, *outE);
    }

    delete tads;

    return ND4J_STATUS_OK;
}

DECLARE_SHAPE_FN(tear) {
    auto inShape = inputShape->at(0);

    std::vector<int> dims(*block.getIArguments());

    if (dims.size() > 1)
        std::sort(dims.begin(), dims.end());

    shape::TAD tad(inShape, dims.data(), (int) dims.size());
    tad.createTadOnlyShapeInfo();
    Nd4jLong numTads = shape::tadLength(inShape, dims.data(), (int) dims.size());

    auto result = SHAPELIST();
    for (int e = 0; e < numTads; e++) {
        int *newShape;
        COPY_SHAPE(tad.tadOnlyShapeInfo, newShape);
        result->push_back(newShape);
    }

    return result;
}

An important part of an op declaration is its input/output description. As shown above: CUSTOM_OP_IMPL(tear, 1, -1, false, 0, -1). This declaration means:

Op name: tear
Op expects at least 1 NDArray as input
Op returns an unknown positive number of NDArrays as output
Op can't be run in-place, so the original NDArray stays intact under all circumstances
Op doesn't expect any T (aka floating point) arguments
Op expects an unknown positive number of integer arguments. For this op, these are the dimensions along which to split the input NDArray.

Here's another example: DECLARE_CUSTOM_OP(permute, 1, 1, true, 0, -2); This declaration means:

Op name: permute
Op expects at least 1 NDArray as input
Op returns 1 NDArray as output
Op can be run in-place if needed (it means: input == output, and input is modified and returned as output)
Op doesn't expect any T arguments
Op expects unknown number of integer arguments OR no integer arguments at all.

-1 means at least 1 of the expected parameters will be present
-2 means an unknown number of parameters. Use this in situations where inputs of certain types may be optional. A common use case is when a parameter may be passed in as an NDArray or as a TARG or IARG (floating point or integer arguments respectively)
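
The count conventions above can be sketched as a small predicate. This is a hypothetical checker, not libnd4j's validation code. Reading a non-negative declared count as "at least this many" follows the "at least 1 NDArray" wording used for tear and permute, but remains an assumption of this sketch.

```cpp
// Hypothetical checker (not libnd4j code) for the count conventions above.
//   declared == -1 : at least one argument must be present
//   declared == -2 : any number of arguments, including none
//   declared >=  0 : read here as "at least this many" (an assumption of
//                    this sketch; the text only defines -1 and -2)
bool argumentCountAccepted(int declared, int actual) {
    if (actual < 0) return false;            // a negative count is never valid
    if (declared == -2) return true;
    if (declared == -1) return actual >= 1;
    if (declared >= 0) return actual >= declared;
    return false;                            // other negative values are undefined here
}
```

With these rules, the integer-argument slot of tear (declared as -1) rejects an empty argument list, while the same slot of permute (declared as -2) accepts it.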

c++11 syntactic sugar

auto lambda = LAMBDA_TT(_x, _y) {
    return (_x + _y) * 2;
};

x.applyPairwiseLambda(&y, lambda);

In this simple example, each element of NDArray x is set to x[e] = (x[e] + y[e]) * 2.
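
For readers without a libnd4j build at hand, the same pattern can be tried with standard containers. The sketch below is an assumption-laden stand-in: applyPairwise and doubleSum are hypothetical names, std::vector replaces NDArray, and a plain C++11 lambda replaces the LAMBDA_TT macro.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for NDArray::applyPairwiseLambda, written against
// std::vector: replaces each x[e] with lambda(x[e], y[e]).
template <typename T, typename Lambda>
void applyPairwise(std::vector<T>& x, const std::vector<T>& y, Lambda lambda) {
    if (x.size() != y.size())
        throw std::invalid_argument("x and y must have the same length");
    for (std::size_t e = 0; e < x.size(); e++) {
        x[e] = lambda(x[e], y[e]);
    }
}

// Plain C++11 lambda playing the role of LAMBDA_TT(_x, _y) from the example.
inline void doubleSum(std::vector<float>& x, const std::vector<float>& y) {
    auto lambda = [](float _x, float _y) { return (_x + _y) * 2; };
    applyPairwise(x, y, lambda);
}
```

With x = {1, 2, 3} and y = {4, 5, 6}, calling doubleSum(x, y) leaves x as {10, 14, 18}.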

Tests

For tests, libnd4j uses the Google Test suite. All tests are located in the tests_cpu/layers_tests folder. Here's a simple way to run them from the command line:

cd tests_cpu
cmake -G "Unix Makefiles"
make -j 4
./layers_tests/runtests

You can also use your IDE (e.g. JetBrains CLion) to run tests via the GUI.

PLEASE NOTE: if you're considering submitting your new op to the libnd4j repository via pull request, consider adding tests for it. Ops without tests won't be approved.

Backend-specific operation

GPU/MPI/whatever to be added soon.

Utility macros

We have a number of utility macros suitable for custom ops. Here they are:

INPUT_VARIABLE(int): returns the NDArray at the specified input index.
OUTPUT_VARIABLE(int): returns the NDArray at the specified output index.
STORE_RESULT(NDArray): stores the result to the VariableSpace.
STORE_2_RESULTS(NDArray, NDArray): stores both results to the VariableSpace.
INT_ARG(int): returns the specified integer argument passed to the given op.
T_ARG(int): returns the specified T argument passed to the given op.
ALLOCATE(...): checks whether a Workspace is available, and allocates from the Workspace if so, or falls back to direct memory allocation otherwise.
RELEASE(...): releases memory allocated with the ALLOCATE() macro.
REQUIRE_TRUE(...): takes a condition and evaluates it. If it does not evaluate to true, an exception is raised and the specified message is printed.
LAMBDA_T(X) and LAMBDA_TT(X, Y): lambda declarations for NDArray::applyLambda and NDArray::applyPairwiseLambda.
COPY_SHAPE(SRC, TGT): allocates memory for the TGT pointer and copies the shape from the SRC pointer.
ILAMBDA_T(X) and ILAMBDA_TT(X, Y): lambda declarations for indexed lambdas; the index argument is passed in as Nd4jLong (aka long long).
FORCEINLINE: platform-specific definition for function inlining.
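
As an illustration of the REQUIRE_TRUE behaviour described above (evaluate a condition, print a message, raise an exception), here is a simplified stand-in. SKETCH_REQUIRE_TRUE and checkDimension are hypothetical names; unlike the real macro as used in the tear example, this sketch takes no error-code argument and makes no claim about libnd4j's actual implementation.

```cpp
#include <cstdio>
#include <stdexcept>

// Hypothetical stand-in, not libnd4j's REQUIRE_TRUE: evaluates the
// condition and, if it is false, prints the formatted message to stderr
// and throws.
#define SKETCH_REQUIRE_TRUE(cond, ...)                                  \
    do {                                                                \
        if (!(cond)) {                                                  \
            std::fprintf(stderr, __VA_ARGS__);                          \
            std::fprintf(stderr, "\n");                                 \
            throw std::invalid_argument("requirement failed: " #cond);  \
        }                                                               \
    } while (0)

// Example validation in the style of the tear op: a dimension must be
// non-negative and lower than the rank.
inline void checkDimension(int dim, int rank) {
    SKETCH_REQUIRE_TRUE(dim >= 0 && dim < rank,
                        "Dimension should be in [0, %i). Got %i instead",
                        rank, dim);
}
```

Here checkDimension(1, 3) returns normally, while checkDimension(3, 3) prints the message and throws std::invalid_argument. Wrapping the body in do { ... } while (0) keeps the macro safe to use as a single statement inside unbraced if/else branches.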

Explicit template instantiations in helper methods.

    template<typename X, typename Z>
    void  argMin_(const NDArray& input, NDArray& output, const std::vector<int>& dimensions);

To instantiate it explicitly, we write:

BUILD_DOUBLE_TEMPLATE(template void argMin_, (const NDArray& input, NDArray& output, const std::vector<int>& dimensions),
               LIBND4J_TYPES, INDEXING_TYPES);

Here:

LIBND4J_TYPES means we want to use all types in place of X
INDEXING_TYPES means we will use the index types (int, int64_t) as the Z type
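
For comparison, this is roughly what explicit instantiation looks like when written by hand in plain C++, without the macro. The argMinSketch helper and the 2 x 2 choice of types are assumptions for illustration only; the actual type lists behind LIBND4J_TYPES and INDEXING_TYPES are defined by libnd4j.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper with the same two-parameter layout as argMin_:
// X is the element type, Z is the index type written to the output.
// Assumes a non-empty input.
template <typename X, typename Z>
void argMinSketch(const std::vector<X>& input, Z& output) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < input.size(); i++) {
        if (input[i] < input[best]) best = i;
    }
    output = static_cast<Z>(best);
}

// Hand-written explicit instantiations for a 2 x 2 grid of (X, Z) pairs.
// Each line forces the compiler to emit code for that exact combination
// in this translation unit.
template void argMinSketch<float, int>(const std::vector<float>&, int&);
template void argMinSketch<float, std::int64_t>(const std::vector<float>&, std::int64_t&);
template void argMinSketch<double, int>(const std::vector<double>&, int&);
template void argMinSketch<double, std::int64_t>(const std::vector<double>&, std::int64_t&);
```

Writing one line per combination grows quickly as type lists get longer, which is the motivation for generating the instantiations with a macro.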

#cmakedefine LIBND4J_TYPE_GEN

Then we just add _@FL_TYPE_INDEX@ as a suffix to the type name. The types are then split for us, and the cpp files are generated inside the ${CMAKE_BINARY_DIR}/compilation_units folder.

LIBND4J_TYPES_@FL_TYPE_INDEX@

Here is what the complete cpp.in file looks like:

#cmakedefine LIBND4J_TYPE_GEN 
//this header is where our template functions resides
#include <ops/declarable/helpers/cpu/indexReductions.hpp>
namespace sd {
    namespace ops {
        namespace helpers {
            BUILD_DOUBLE_TEMPLATE(template void argMax_, (const NDArray& input, NDArray& output, const std::vector<int>& dimensions), 
               LIBND4J_TYPES_@FL_TYPE_INDEX@, INDEXING_TYPES);
        }
    }
}
