libnd4j supports several different op designs, and in this guide we'll try to explain how to build your very own operation.
The first kind, XYZ operations, is split into multiple subtypes, based on element access and result type:
Transform operations: These operations typically take an NDArray and change each element independently of the others.
Reduction operations: These operations take an NDArray and dimensions, and return a reduced NDArray (or scalar), e.g. a sum along the given dimension(s).
Scalar operations: These operations are similar to transforms, but they only perform arithmetic, and the second operand is a scalar, e.g. a given scalar value is added to each element of the given NDArray.
Pairwise operations: These operations sit between regular transform operations and scalar operations, e.g. element-wise addition of two NDArrays.
Random operations: Most of these operations relate to random number distributions: Uniform, Gauss, Bernoulli, etc.
Despite the differences between these operations, they all use the XZ/XYZ three-operand design, where X and Y are inputs and Z is the output. Data access in these operations is usually trivial and loop-based. E.g. the most trivial loop for a scalar transform will look like this:
The operation used in this loop is template-driven and compiled statically. There are other loop implementations, depending on the op group or the strides within the NDArrays, but the idea is always the same: each element of the NDArray is accessed within a loop.
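A minimal sketch of such a loop, written against raw pointers rather than NDArray. The names AddScalar, scalarTransform and scalarTransformCopy are hypothetical and only illustrate the idea of a statically dispatched op; they are not the libnd4j implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical op: adds a scalar to a single element.
template <typename T>
struct AddScalar {
    static T op(T x, T scalar) { return x + scalar; }
};

// Trivial scalar-transform loop: z[i] = Op(x[i], scalar).
// x is the input, z is the output, and Op is selected at compile time.
template <typename T, typename Op>
void scalarTransform(const T* x, T* z, std::size_t length, T scalar) {
    for (std::size_t i = 0; i < length; i++) {
        z[i] = Op::op(x[i], scalar);
    }
}

// Convenience wrapper for this sketch: allocates and returns z.
template <typename T, typename Op>
std::vector<T> scalarTransformCopy(const std::vector<T>& x, T scalar) {
    std::vector<T> z(x.size());
    scalarTransform<T, Op>(x.data(), z.data(), x.size(), scalar);
    return z;
}
```

Because the op is a template parameter, the call to Op::op can be inlined into the loop body instead of going through a function pointer.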
Now, let's take a look at a typical XYZ op implementation. Here's what the Add operation looks like:
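As a simplified illustration, an element-wise Add op can be sketched as a class template with static op methods. This is an assumption-laden sketch placed in a hypothetical namespace sketch; the real definition in ops.h has additional overloads and platform-specific qualifiers.

```cpp
#include <cstddef>

namespace sketch {

// Simplified, hypothetical version of an element-wise Add op.
template <typename T>
class Add {
public:
    // Pairwise form: z = x + y.
    static T op(T x, T y) { return x + y; }

    // Scalar form: the scalar arrives through an extra-parameters pointer.
    static T op(T x, const T* params) { return x + params[0]; }
};

// Pairwise loop that invokes the op on each element: z[i] = Op(x[i], y[i]).
template <typename T, typename Op>
void pairwise(const T* x, const T* y, T* z, std::size_t length) {
    for (std::size_t i = 0; i < length; i++) {
        z[i] = Op::op(x[i], y[i]);
    }
}

}  // namespace sketch
```

The same op class can be plugged into different loops (pairwise, scalar), which is why one implementation can serve several op groups.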
This particular operation is used in different XYZ op groups, but you see the idea: an element-wise operation, which is invoked on each element of the given NDArray. So, if you want to add a new XYZ operation to libnd4j, you just add the operation implementation to the file includes/ops/ops.h, and assign it to a specific ops group in the file includes/loops/legacy_ops.h together with a number unique to this ops group, e.g.: (21, simdOps::Add)
After libnd4j is recompiled, this op will become available to the legacy execution mechanism, NDArray wrappers, and LegacyOp wrappers (those are made to map legacy operations to the CustomOps design for Graph).
Custom operations are a newer concept, added mostly to suit SameDiff/Graph needs. For CustomOps we defined a universal signature, with a variable number of input/output NDArrays and a variable number of floating-point and integer arguments.
However, there are some minor differences between the various CustomOp declarations:
DECLARE_OP(string, int, int, bool): these operations take no fp/int arguments, and the output shape equals the input shape.
DECLARE_CONFIGURABLE_OP(string, int, int, bool, int, int): these operations do take fp/int arguments, and the output shape equals the input shape.
DECLARE_REDUCTION_OP(string, int, int, bool, int, int): these operations do take fp/int arguments, and the output shape is calculated as for a reduction.
DECLARE_CUSTOM_OP(string, int, int, bool, int, int): these operations return an NDArray with a custom shape that usually depends on the input and the arguments.
DECLARE_BOOLEAN_OP(string, int, bool): these operations take some NDArrays and return a scalar, where 0 is False and other values are treated as True.
Let's take a look at an example CustomOp:
In the example above, we declare the tear CustomOp implementation and the shape function for this op. So, at the moment of op execution, we assume that we will either have output array(s) provided by the end user, or they will be generated with the shape function.
You can also see a number of macros used; we'll cover those later as well. Beyond that, the op execution logic is fairly simple and linear: each new op implements the protected member function DeclarableOp<T>::validateAndExecute(Block<T>& block), and this method is eventually called either from GraphExecutioner, or via a direct call, like DeclarableOp<T>::execute(Block<T>& block).
An important part of an op declaration is the input/output description for the op, e.g. as shown above: CUSTOM_OP_IMPL(tear, 1, -1, false, 0, -1). This declaration means:
Op name: tear
Op expects at least 1 NDArray as input
Op returns an unknown positive number of NDArrays as output
Op can't be run in-place, so under any circumstances the original NDArray will stay intact
Op doesn't expect any T (aka floating point) arguments
Op expects an unknown positive number of integer arguments. In the case of this op, these are the dimensions along which to split the input NDArray.
Here's another example: DECLARE_CUSTOM_OP(permute, 1, 1, true, 0, -2);
This declaration means:
Op name: permute
Op expects at least 1 NDArray as input
Op returns 1 NDArray as output
Op can be run in-place if needed (it means: input == output, and input is modified and returned as output)
Op doesn't expect any T arguments
Op expects unknown number of integer arguments OR no integer arguments at all.
Note on parameters: Negative values (-1, -2) mean very specific things. When op validation is invoked (checking the parameters), then for each parameter type either the exact number of parameters given in the descriptor must be present, or one of the following applies:
-1 means at least 1 parameter of the expected type will be present
-2 means an unknown number of parameters, possibly none. Use this in situations where inputs of certain types may be optional. A common use case is when a parameter may be passed in as an NDArray or as a TARG or IARG (floating point or integer arguments respectively)
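The rules in this note can be sketched as a small check. The helper names countMatches, OpCounts and validate are hypothetical and follow the note literally; the actual validation code in libnd4j may be structured differently.

```cpp
// Hypothetical helper, not part of libnd4j: compares an actual argument count
// with the expected count from an op descriptor.
//   expected >= 0  : exactly that many arguments must be present
//   expected == -1 : at least one argument must be present
//   expected == -2 : any number of arguments, including none
inline bool countMatches(int expected, int actual) {
    if (actual < 0) return false;          // a negative actual count is never valid
    if (expected == -2) return true;
    if (expected == -1) return actual >= 1;
    return actual == expected;
}

// Counts for the three kinds of op arguments described in this guide.
struct OpCounts {
    int inputs;  // NDArray inputs
    int tArgs;   // floating-point (T) arguments
    int iArgs;   // integer arguments
};

// An invocation is accepted only if every kind of argument passes the check.
inline bool validate(const OpCounts& expected, const OpCounts& actual) {
    return countMatches(expected.inputs, actual.inputs) &&
           countMatches(expected.tArgs, actual.tArgs) &&
           countMatches(expected.iArgs, actual.iArgs);
}
```

With the descriptors above, a tear-like op (1 input, 0 T arguments, -1 integer arguments) rejects a call without integer arguments, while a permute-like op (-2 integer arguments) accepts it.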
In ops you can easily use C++11 features, including lambdas. In some cases the easiest way to build your custom op (or some part of it) might be via NDArray::applyLambda or NDArray::applyPairwiseLambda:
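A self-contained stand-in for the pairwise-lambda idea, using std::vector instead of NDArray. The helper names applyPairwise and doubleSum are hypothetical and are not the libnd4j API; only the element-wise pattern is the point.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a pairwise lambda application:
// x[e] = f(x[e], y[e]) for every element e.
template <typename T, typename F>
void applyPairwise(std::vector<T>& x, const std::vector<T>& y, F f) {
    if (x.size() != y.size()) {
        throw std::invalid_argument("x and y must have the same length");
    }
    for (std::size_t e = 0; e < x.size(); e++) {
        x[e] = f(x[e], y[e]);
    }
}

// Sets each element of x to (x[e] + y[e]) * 2 using a C++11 lambda.
inline void doubleSum(std::vector<float>& x, const std::vector<float>& y) {
    applyPairwise(x, y, [](float a, float b) { return (a + b) * 2.0f; });
}
```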
In this simple example, each element of NDArray x will be set to x[e] = (x[e] + y[e]) * 2.
For tests, libnd4j uses the Google Test suite. All tests are located in the tests_cpu/layers_tests folder. Here's a simple way to run them from the command line:
You can also use your IDE (e.g. JetBrains CLion) to run tests via the GUI.
PLEASE NOTE: if you're considering submitting your new op to the libnd4j repository via pull request, consider adding tests for it. Ops without tests won't be approved.
GPU/MPI/whatever to be added soon.
We have a number of utility macros suitable for custom ops. Here they are:
INPUT_VARIABLE(int): this macro returns the NDArray at the specified input index.
OUTPUT_VARIABLE(int): this macro returns the NDArray at the specified output index.
STORE_RESULT(NDArray): this macro stores the result to the VariableSpace.
STORE_2_RESULTS(NDArray, NDArray): this macro stores both results to the VariableSpace.
INT_ARG(int): this macro returns the specified integer argument passed to the given op.
T_ARG(int): this macro returns the specified T argument passed to the given op.
ALLOCATE(...): this macro checks if a Workspace is available, and uses either the Workspace or direct memory allocation if no Workspace is available.
RELEASE(...): this macro releases memory allocated with the ALLOCATE() macro.
REQUIRE_TRUE(...): this macro takes a condition and evaluates it. If the condition does not evaluate to True, an exception is raised and the specified message is printed out.
LAMBDA_T(X) and LAMBDA_TT(X, Y): lambda declarations for NDArray::applyLambda and NDArray::applyPairwiseLambda
COPY_SHAPE(SRC, TGT): this macro allocates memory for the TGT pointer and copies the shape from the SRC pointer
ILAMBDA_T(X) and ILAMBDA_TT(X, Y): lambda declarations for indexed lambdas; the index argument is passed in as Nd4jLong (aka long long)
FORCEINLINE: platform-specific definition for function inlining
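To illustrate the check-and-raise behaviour described for REQUIRE_TRUE, here is a minimal sketch of a macro in the same spirit. SKETCH_REQUIRE_TRUE and checkedRank are hypothetical names; the real libnd4j macro has its own signature and error handling.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical macro: evaluates COND and throws with MSG if it is false.
#define SKETCH_REQUIRE_TRUE(COND, MSG)                                     \
    do {                                                                   \
        if (!(COND)) {                                                     \
            throw std::runtime_error(std::string("Op validation failed: ") \
                                     + (MSG));                             \
        }                                                                  \
    } while (0)

// Example use inside a validation step: the rank must be positive.
inline int checkedRank(int rank) {
    SKETCH_REQUIRE_TRUE(rank > 0, "rank must be positive");
    return rank;
}
```

The do/while(0) wrapper lets the macro behave like a single statement, so it can be followed by a semicolon in any context.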
We have to explicitly instantiate template methods for the different data types in the library. Furthermore, to speed up parallel compilation, we need to put those template instantiations in separate source files. Another reason is that some compilers choke when there are many template instantiations in one translation unit. To ease this cumbersome task we have CMake helpers and macro helpers. Example: suppose we have a function like this:
To explicitly instantiate it, we should write this:
Here:
LIBND4J_TYPES means we want to use all types in the place of X
INDEXING_TYPES means we will use the index types (int, int64_t) as the Z type
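Conceptually, the helper macros produce ordinary C++ explicit instantiations, one per type combination. A minimal sketch in plain C++, without the libnd4j macros, is shown here; the function argMaxIndex and the chosen type combinations are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical function template with a data type X and an index type Z.
// In a library, this definition would live in a source file, not a header.
template <typename X, typename Z>
Z argMaxIndex(const X* data, std::size_t length) {
    Z best = 0;
    for (std::size_t i = 1; i < length; i++) {
        if (data[i] > data[static_cast<std::size_t>(best)]) {
            best = static_cast<Z>(i);
        }
    }
    return best;
}

// Explicit instantiations: one line per (X, Z) combination to be compiled
// into the library.
template int argMaxIndex<float, int>(const float*, std::size_t);
template std::int64_t argMaxIndex<float, std::int64_t>(const float*, std::size_t);
template int argMaxIndex<double, int>(const double*, std::size_t);
template std::int64_t argMaxIndex<double, std::int64_t>(const double*, std::size_t);
```

Writing these lines by hand for every data type and index type quickly becomes repetitive, which is the motivation for the type-list macros.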
To speed up the compilation process and to help compilers, we can further split the instantiations into different source files.
Firstly, we rename the original template source to use the hpp extension:
Secondly, we add a file with the suffix cpp.in (or cu.in for CUDA) that includes that hpp header, and place it in the appropriate compilation units folder. In our case it will be in the ./libnd4j/include/ops/declarable/helpers/cpu/compilation_units folder with the name argmax.cpp.in.
Then we decide which type list we want to split across different sources. In our case we want to split LIBND4J_TYPES (the other ones are INT_TYPE, FLOAT_TYPE, PAIRWISE_TYPE). We hint this to CMake by adding the _GEN suffix:
Then we just add _@FL_TYPE_INDEX@ as a suffix to the type name, and it will split those types for us and generate cpp files inside the ${CMAKE_BINARY_DIR}/compilation_units folder.
Here is how the complete cpp.in file will look:
LLVM 4.0 was used to build ios-arm, ios-x86, and ios-x86_64.
When building for iOS, a static library is assembled. For more on how to build for iOS, please see Gluon's guide.
bash pi_build.sh
Using this helper script, one can cross-build libnd4j and dl4j with the Arm Compute Library. It will download the cross compiler and the Arm Compute Library.
| Option | Value | Description |
| --- | --- | --- |
| -a or --arch | arm32 | cross-compiles for pi/linux 32bit |
| -a or --arch | arm64 | cross-compiles for pi/linux 64bit |
| -a or --arch | android-arm | cross-compiles for android 32bit |
| -a or --arch | android-arm64 | cross-compiles for android 64bit |
| -a or --arch | jetson-arm64 | cross-compiles for jetson nano 64bit |
| -m or --mvn | | if provided, builds dl4j using maven |
Example:
bash pi_build.sh --arch android-arm64 --mvn
To change the version of the Arm Compute Library, modify this line in the script
Please follow these instructions to build nd4j for Raspberry Pi:
download cross compilation tools for Raspberry PI
download deeplearning4j:
build libnd4j:
build nd4j
All of these instructions assume you are on a 64-bit system
libnd4j depends on some Unix utilities for compilation. So in order to compile it you will need to install Msys2.
After you have set up Msys2 by following their instructions, you will have to install some additional development packages. Start the msys2 shell and set up the dev environment with:
This will install the needed dependencies for use in the msys2 shell. You will have to use the msys2 shell (specifically c:\msys64\mingw64.exe) for the whole compilation process.
You will also need to set up your PATH environment variable to include C:\msys64\mingw64\bin (or wherever you have decided to install msys2). If you have IntelliJ (or another IDE) open, you will have to restart it before this change takes effect for applications started through it. If you don't, you will probably see a "Can't find dependent libraries" error.
For cpu, we recommend openblas. We will be adding instructions for mkl and other cpu implementations later.
Send us a pull request or file an issue if you have something in particular you are looking for.
libnd4j and nd4j go hand in hand, and libnd4j is required for two out of the three currently supported backends (nd4j-native and nd4j-cuda). For this reason they should always be rebuilt together.
There are a few additional arguments for the buildnativeoperations.sh script that you can use:
Now clone this repository, and in that directory run the following to build the dll for the cpu backend:
The CUDA Backend has some additional requirements before it can be built:
Visual Studio 2015 or 2017 or 2019 (Please note: Visual Studio 2017 is NOT SUPPORTED by CUDA 8.0 and below, Visual Studio 2019 is supported since CUDA 10.2)
In order to build the CUDA backend you will have to set up some more environment variables first, by calling vcvars64.bat. But first, set the system environment variable SET_FULL_PATH to true, so that all of the variables that vcvars64.bat sets up are passed to the mingw shell. Additionally, you need to open mingw64.ini in your msys64 installation folder and add the line: MSYS2_PATH_TYPE=inherit. In the steps that follow, replace YOUR VERSION with the target Visual Studio version. 14.0 is known to work.
Inside a normal cmd.exe command prompt, run C:\Program Files (x86)\Microsoft Visual Studio *YOUR VERSION*\VC\bin\amd64\vcvars64.bat
Run c:\msys64\mingw64.exe inside that
Change to your libnd4j folder
./buildnativeoperations.sh -c cuda -cc YOUR_DEVICE_ARCH
This builds the CUDA nd4j.dll.
While still in the libnd4j folder, run:
Now leave the libnd4j directory and clone the repository. Run the following to compile nd4j with support for both the native cpu backend as well as the cuda backend:
If you don't want the cuda backend, e.g. because you didn't or can't build it, you can skip it:
Please notice the single quotes around the last parameter. If you leave them out or use double quotes, you will get an error about event not found from your shell. If this doesn't work, make sure you have a current version of maven installed.
Also, if you're going to build DeepLearning4j without CUDA available, you'll have to skip the deeplearning4j-cuda-9.0 (or 8.0) artifact as well:
In order to use your new shiny backends you will have to switch your application to use the version of ND4J that you just compiled and to use the native backend.
For this, change the version of all your ND4J dependencies to the version you've built, e.g. "0.9.2-SNAPSHOT".
Use the nd4j-native backend like this:
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-native</artifactId>
        <version>0.9.2-SNAPSHOT</version>
    </dependency>
Exchange nd4j-native for nd4j-cuda-9.0 (or nd4j-cuda-8.0) like this:
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-cuda-9.0</artifactId>
        <version>0.9.2-SNAPSHOT</version>
    </dependency>
If your application continues to run, then you are just seeing an artefact of one way we try to load the native library, but your application should run just fine.
If your application crashes (and you see that error more than once) then you probably have a problem with your PATH environment variable. Please make sure that you have your msys2 bin directory on the PATH and that you restarted your IDE. Maybe even try to restart the system.
There are a number of things that can potentially go wrong. First, try updating pacman using the following commands:
Note that you might need to restart the msys2 shell between/after these steps.
One user has reported issues downloading packages using the default downloader (timeouts and "error: failed retrieving file" messages). If you are experiencing these issues, it may help to switch to using the wget downloader. To do this, install wget using
then uncomment (remove the # symbol) the following line in the /etc/pacman.conf configuration file:
First, make sure you have BLAS libraries installed. Typically, this involves building OpenBLAS by downloading OpenBLAS and running the commands 'make', 'make install' in msys2.
Running the buildnativeoperations.sh script in the MinGW-w64 Win64 Shell instead of the standard msys2 shell may resolve this issue.
Depending on how your build environment and PATH environment variable are set up, you might experience some other issues. Some situations that may be problematic include:
Having older (or multiple) MinGW installs on your PATH (check: type "where c++" or "where gcc" into msys2)
Having older (or multiple) cmake installs on your PATH (check: "where cmake" and "cmake --version")
Having multiple BLAS libraries on your PATH (check: "where libopenblas.dll", "where libblas.dll" and "where liblapack.dll")
jniNativeOps.dll: Can't find dependent libraries errors
This is usually due to an incorrectly set up PATH (see "I'm getting other errors not listed here"). As the PATH in the msys2 shell is a little different than for other applications, you can check that the PATH is really the problem by running the following test program:
If this also crashes with the Can't find dependent libraries error, then you have to set up your PATH correctly (see the introduction to this document).
Note: Another possible cause of "...jniNativeOps.dll: Can't find dependent libraries" seems to be having an old or incompatible version of libstdc++-6.dll on your PATH. You want this file to be pulled in from mingw via your PATH environment variable. To check your PATH/environment, run where libstdc++-6.dll and where libgcc_s_seh-1.dll; these should list the msys/mingw directories (and/or list them first, if there are other copies on the PATH).
Finally, using dumpbin (from Visual Studio) can help to show required dependencies for jniNativeOps.dll:
If the Exception you are getting looks anything like this, and you see this upon startup:
Then you are most probably trying to use a mobile GPU (like 970m) and Optimus is trying to ruin the day. First you should try to force the usage of the GPU through normal means, like setting the JVM to run on your GPU via the Nvidia System Panel or by disabling the iGPU in your BIOS. If this still isn't enough, you can try the following workaround, that while not recommended for production, should allow you to still use your GPU.
You will have to add JOGL to your dependencies:
And as the very first thing in your main method you will need to add:
This should allow ND4J to work correctly (you still have to set that the JVM has to use the GPU in the Nvidia System Panel).
ND4J is meant to be used with pure compute cards (e.g. the Tesla series). On consumer GPUs that are mainly meant for gaming, this results in a usage that can conflict with the card's primary work: displaying your desktop.
Microsoft has added Timeout Detection and Recovery (TDR) to detect malfunctioning drivers and improper usage, which now interferes with the compute tasks of ND4J by killing them if they occupy the GPU for longer than a few seconds. This results in the "Display driver stopped responding and has recovered" message, and in a perceived driver crash along with a crash of your application. If you try to run it again, TDR may decide that something is messing with the display driver and force a reboot.
If you really want to use your display GPU for compute with ND4J (not recommended), you will have to disable TDR by setting TdrLevel=0 (see https://msdn.microsoft.com/en-us/library/windows/hardware/ff569918%28v=vs.85%29.aspx). If you do this you will have display freezes, which, depending on your workload, can stay quite a long time.
cygwin1.dll
If you have any cygwin-related dlls in the crash log, this means that you have built libnd4j or nd4j with cygwin being on the PATH before Msys2. This results in successful compilation, but crashes the JVM in some use cases.
In order to fix this problem, all you have to do is to remove cygwin from your PATH while building libnd4j and nd4j.
If you want to inspect your path you can do this by running:
If you want to set your PATH temporarily, you can do so with:
Some errors such as the following can appear if the visual studio vcvars64.bat file is run before attempting the cuda build.
To resolve this, ensure that you haven't run vcvars64/vcvarsall in the msys2 shell before building.
To build an MSI Installer run: ./buildnativeoperations.sh -p msi
For gpu run: ./buildnativeoperations.sh -p msi -c cuda
Openblas: Ensure that you set up $MSYSROOT/opt/OpenBLAS/lib. If you built OpenBLAS in msys2 (make, make install), then you should not need to do anything else.
Note: our informal/unscientific testing suggests that Intel MKL can be about equal with, and up to about 40% faster than OpenBLAS on some matrix multiply (gemm) operations, on some machines. Installing MKL is recommended but not required.
To build libnd4j with MKL:
Download MKL from https://software.intel.com/en-us/articles/free_mkl and install. Registration is required (free).
Add the \redist\intel64_win\mkl directory to your system PATH environment variable. This will be in a location such as C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2016.3.207\windows\redist\intel64_win\mkl\
Then build libnd4j as before. You may have to be careful about having multiple BLAS implementations on your path. Ideally, have only MKL on the path while building libnd4j.
Note: you may be able to get some additional performance on hyperthreaded processors by setting the system/environment variable MKL_DYNAMIC to have the value 'false'.
Setting up CLion for modifying the libnd4j code base
In order to set up CLion, we need to configure the CMake defaults to ensure that tests can be built. Normally this is set up by the libnd4j build script. A general tutorial on how to configure CMake profiles for CLion can be found here. When configuring CMake to run, there are generally a few steps to follow.
This will in general include setting up the CMake gtest integration (we use Google Test for our test suite).
Set up a toolchain. This will depend on your OS. More can be found here. This will cover the compiler, debugger and other additional software needed to enable CLion to manage your CMake project.
After configuration, let CMake build/index the files. It takes time to set up the files, auto-complete and other expected functionality provided by the IDE. Make sure this is done by watching the bottom right of the IDE until all tasks are complete.
After your CLion environment is set up, you may use the following CMake configuration for CPU: -DSD_CPU=true -DSD_BUILD_TESTS=true -DSD_X86_BUILD=true -DSD_ALL_OPS=true -DSD_ARCH=x86-64 -DSD_SHARED_LIB=true
The above will configure your IDE to build a shared library for Intel CPUs as well as configure the gtest setup to run. Please ensure that you read about how to configure CMake profiles (as linked above).
Now you should be able to make modifications backed by the IDE. Note that you can also run the tests under tests_cpu/layers_tests.
In order to run all tests, you may run the AllTests.cpp entry point. In order to run the tests (or even a specific test), right click on any test and click the green arrow option that appears in the dropdown, labelled something like "Run AllTests..". For more information on this, please see here.
When running tests, note that it may take a while. CLion will build a whole executable, compiling the relevant parts of the code base (or, if necessary, the whole code base) in order to run the tests.