1 of 4

Reference

Understanding graph execution

Basic idea

libnd4j contains Directed Acyclic Graph execution engine, suited for both local and remote execution. However, main goal here is execution of externally originated graphs, serialized into FlatBuffers and provided either via pointer, or file.

This basic example shows execution of graph loaded from file:

auto graph = GraphExecutioner<float>::importFromFlatBuffers("./some_file.fb");
GraphExecutioner<float>::execute(graph);
// ... do something with results ...
delete graph;

FlatBuffers schemas

You can find scheme files here.

At this moment libnd4j repo contains compiled definitions for C++, Python, Java, and JSON, but FlatBuffers can be compiled for PHP, C#, JavaScript, TypeScript and Go as well. Please refer to flatc instructions to do that.

Such bindings allow you to build FlatBuffers files/buffers suitable for remote execution of your graph and obtaining results back. I.e. you can use JavaScript to build graph (or just update variables/placeholders), send them to remote RPC server powered by libnd4j, and get results back.

Graph execution logic

No matter how graph is represented on the front-end, on backend it's rather simple: topologically sorted list of operations executed sequentially if there's shared dependencies, or (optionally) in parallel, if there's no shared dependencies for current graph nodes.

Each node in the graph represents single linear algebra operation applied to input(s) of the node. For example: z = Add(x, y) is operation that takes 2 NDArrays as input, and produes 1 NDArray as output. So, graph is built of such primitive operations, which are executed sequentially.

Memory management within graph

Everything that happens within graph during execution, stays within VariableSpace. It acts as storage for Variables and NDArrays produced during graph execution. On top of that, there's an option to use pre-allocated Workspaces for allocation of NDArrays.

Current graph limitations

There are some limitations. Some of them will be lifted eventually, others won't be. Here's the list:

Graph has single data type. I.e. Graph<float> or Graph<float16> or Graph<double> This limitation will be lifted soon.
On some platforms, like Java, single Variable/Placeholder size is limited to 2GB buffer size. However, on libnd4j side there's no such limitation.
Variable size/dimensionality has limitations: max NDArray rank is limited to 32 at this moment, and any single dimension is limited to MAX_INT size.
Recursion isn't directly supported at this moment.
CUDA isn't supported at this moment. This limitation will be lifted soon.
When used from C++, Graph only supports FeedForward mode. This limitation will be lifted soon.

Minified Graph binaries

There's an option to build minified binaries suited for execution of specific graphs. Idea is quite simple: you feed your existing Graph(s) in FlatBuffers format into special app, which extracts operations used in your Graph(s) and excludes all other operations from target binary.

# building full libnd4j copy AND minfier app
./buildnativeoperations.sh -a native -m 
...
# building libnd4j for 2 specific graphs
./minifier -l -a native -o libnd4j_special ../some_path/some_graph1.fb ../some_path/some_graph2.fb
Option 'l': Build library
Option 'a': Target arch: native
Option 'o': Output file name is libnd4j_special
Total available operations: 423

Retrieving ops from the Graph and collect them...

Collecting out Scopes...
Operations found so far:
rank
range
subtract
transpose
matmul
biasadd
TRANSFORM{15}

Building minified library...

Once minifier finishes - you'll have libnd4j_special.so and libnd4j_special.h files ready, and they'll contain only those operations used in 2 graphs provided at compilation time + basic primitives used to work with Graph. Things like NDArray, GraphExecutioner etc will be included as well.

This library can be used in your application as any other shared libray out there: you'll include headers file and you'll be able to call for things you need.

Documentation

Documentation for individual operations, and basic classes (like NDArray, Graph etc) is available as part of Nd4j javadoc: https://nd4j.org/doc/

Embedded profiling

If you're adding new ops, and want to make sure they run ok on your specific device - you might want to give a shot to embedded Graph profiling helper. Despite being simple - it still provides you with time spent in various parts of Graph.

Environment::getInstance().setProfiling(true);
auto graph = GraphExecutioner::importFromFlatBuffers("./resources/ae_00.fb");

auto profile = GraphProfilingHelper::profile(graph, 1000);
profile->printOut();

delete graph;

1000 iterations laterm you'll get statistics printed out. Statistics basically includes time spent in various parts of code and memory allocation details.

Here's how it'll look like:

Printing out Graph...
8. matmul; Inputs: [{1:0}, {2:0}]; 
9. biasadd; Inputs: [{8:0}, {3:0}]; 
10. TRANSFORM:{15}; Inputs: [{9:0}]; 
11. rank; Inputs: [{2:0}]; 
12. subtract; Inputs: [{11:0}, {4:0}]; 
13. range; Inputs: [{5:0}, {11:0}, {6:0}]; 
14. subtract; Inputs: [{12:0}, {13:0}]; 
15. transpose; Inputs: [{2:0}, {14:0}]; 
16. matmul; Inputs: [{10:0}, {15:0}]; 
17. biasadd; Inputs: [{16:0}, {7:0}]; 
18. TRANSFORM:{15}; Inputs: [{17:0}]; 

Printing out Scopes...
Graph profile: 1000 executions

Memory:
ACT: 0; TMP: 0; OBJ: 0; TTL: 1788;

Time:
Construction time: 2135 ns;
Execution time: 41820 ns;

Per-node reports:
Node: <8:MatMul>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 200;
      Time: PREP: 1160 ns; EXEC: 3167 ns; TTL: 5929 ns;
      PREP: INPUT: 251 ns; SHAPE: 382 ns; ARRAY: 217 ns;
Node: <9:BiasAdd>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 104;
      Time: PREP: 917 ns; EXEC: 3580 ns; TTL: 5957 ns;
      PREP: INPUT: 220 ns; SHAPE: 213 ns; ARRAY: 217 ns;
Node: <10:Tanh>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 104;
      Time: PREP: 756 ns; EXEC: 241 ns; TTL: 1927 ns;
      PREP: INPUT: 140 ns; SHAPE: 195 ns; ARRAY: 205 ns;
Node: <11:transpose/Rank>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 36;
      Time: PREP: 522 ns; EXEC: 119 ns; TTL: 1403 ns;
      PREP: INPUT: 109 ns; SHAPE: 69 ns; ARRAY: 171 ns;
Node: <12:transpose/sub>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 36;
      Time: PREP: 666 ns; EXEC: 185 ns; TTL: 1684 ns;
      PREP: INPUT: 192 ns; SHAPE: 94 ns; ARRAY: 168 ns;
Node: <13:transpose/Range>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 556;
      Time: PREP: 808 ns; EXEC: 647 ns; TTL: 2416 ns;
      PREP: INPUT: 297 ns; SHAPE: 228 ns; ARRAY: 181 ns;
Node: <14:transpose/sub_1>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 56;
      Time: PREP: 721 ns; EXEC: 541 ns; TTL: 2205 ns;
      PREP: INPUT: 23 ns; SHAPE: 92 ns; ARRAY: 165 ns;
Node: <15:transpose>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 96;
      Time: PREP: 3936 ns; EXEC: 602 ns; TTL: 5811 ns;
      PREP: INPUT: 194 ns; SHAPE: 3241 ns; ARRAY: 257 ns;
Node: <16:MatMul_1>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 312;
      Time: PREP: 970 ns; EXEC: 3565 ns; TTL: 6066 ns;
      PREP: INPUT: 203 ns; SHAPE: 320 ns; ARRAY: 193 ns;
Node: <17:BiasAdd_1>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 144;
      Time: PREP: 914 ns; EXEC: 3528 ns; TTL: 5870 ns;
      PREP: INPUT: 231 ns; SHAPE: 191 ns; ARRAY: 223 ns;
Node: <18:output>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 144;
      Time: PREP: 805 ns; EXEC: 285 ns; TTL: 1928 ns;
      PREP: INPUT: 157 ns; SHAPE: 192 ns; ARRAY: 232 ns;

Special timers:
No special timers were set

Roadmap

In short-to-medium term following improvements are expected:

CUDA support for all new ops
Additional data types support: int, long long, q types, bool
Sparse tensors support

Overview of working with libnd4j

Native operations for nd4j. Build using cmake

Prerequisites

GCC 4.9+
CUDA Toolkit Versions 10 or 11
CMake 3.8 (as of Nov 2017, in near future will require 3.9)

Additional build arguments

There's few additional arguments for buildnativeoperations.sh script you could use:

 -a XXXXXXXX// shortcut for -march/-mtune, i.e. -a native
 -b release OR -b debug // enables/desables debug builds. release is considered by default
 -j XX // this argument defines how many threads will be used to binaries on your box. i.e. -j 8 
 -cc XX// CUDA-only argument, builds only binaries for target GPU architecture. use this for fast builds
 --check-vectorization  auto-vectorization report for developers. (Currently, only GCC is supported)

More about AutoVectorization report

You can provide the compute capability for your card on the NVIDIA website here or use auto. Please also check your Cuda Toolkit Release notes for supported and dropped features. Here is the latest CUDA Toolkit Release note. You can find the same information for the older Toolkit versions in the CUDA archives.

-cc and --compute option examples

description

-cc all

builds for common GPUs

-cc auto

tries to detect automatically

-cc Maxwell

GPU microarchitecture codename

-cc 75

compute capability 7.5 without a dot

-cc 7.5

compute capability 7.5 with a dot

-cc "Maxwell 6.0 7.5"

space-separated multiple arguments within quotes (note: numbers only with a dot)

OS Specific Requirements

Android

Download the NDK, extract it somewhere, and execute the following commands, replacing android-xxx with either android-arm or android-x86:

git clone https://github.com/eclipse/deeplearning4j
export ANDROID_NDK=/path/to/android-ndk/
cd deeplearning4j/libnd4j
bash buildnativeoperations.sh -platform android-xxx
cd ../nd4j
mvn clean install -Djavacpp.platform=android-xxx -DskipTests -pl '!:nd4j-cuda-9.0,!:nd4j-cuda-9.0-platform,!:nd4j-tests'

OSX

Run ./setuposx.sh (Please ensure you have brew installed)

See macOSx10 CPU only.md

Linux

Depends on the distro - ask in the earlyadopters channel for specifics on distro

Ubuntu Linux 15.10

wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda-repo-ubuntu1504-7-5-local_7.5-18_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1504-7-5-local_7.5-18_amd64.deb
sudo apt-get update
sudo apt-get install cuda
sudo apt-get install cmake
sudo apt-get install gcc-4.9
sudo apt-get install g++-4.9
sudo apt-get install git
git clone https://github.com/deeplearning4j/libnd4j
cd libnd4j/
export LIBND4J_HOME=~/libnd4j/
sudo rm /usr/bin/gcc
sudo rm /usr/bin/g++
sudo ln -s /usr/bin/gcc-4.9 /usr/bin/gcc
sudo ln -s /usr/bin/g++-4.9 /usr/bin/g++
./buildnativeoperations.sh
./buildnativeoperations.sh -c cuda -сс YOUR_DEVICE_ARCH

Ubuntu Linux 16.04

sudo apt install cmake
sudo apt install nvidia-cuda-dev nvidia-cuda-toolkit nvidia-361
export TRICK_NVCC=YES
./buildnativeoperations.sh
./buildnativeoperations.sh -c cuda -сс YOUR_DEVICE_ARCH

The standard development headers are needed.

CentOS 6

yum install centos-release-scl-rh epel-release
yum install devtoolset-3-toolchain maven30 cmake3 git
scl enable devtoolset-3 maven30 bash
./buildnativeoperations.sh
./buildnativeoperations.sh -c cuda -сс YOUR_DEVICE_ARCH

Windows

See Windows.md

Setup for All OS

Set a LIBND4J_HOME as an environment variable to the libnd4j folder you've obtained from GIT
- Note: this is required for building nd4j as well.

Setup cpu followed by gpu, run the following on the command line:

For standard builds:

 ./buildnativeoperations.sh
 ./buildnativeoperations.sh -c cuda -сс YOUR_DEVICE_ARCH

For Debug builds:

 ./buildnativeoperations.sh blas -b debug
 ./buildnativeoperations.sh blas -c cuda -сс YOUR_DEVICE_ARCH -b debug

For release builds (default):

 ./buildnativeoperations.sh
 ./buildnativeoperations.sh -c cuda -сс YOUR_DEVICE_ARCH

OpenMP support

OpenMP 4.0+ should be used to compile libnd4j. However, this shouldn't be any trouble, since OpenMP 4 was released in 2015 and should be available on all major platforms.

Linking with MKL

We can link with MKL either at build time, or at runtime with binaries initially linked with another BLAS implementation such as OpenBLAS. In either case, simply add the path containing libmkl_rt.so (or mkl_rt.dll on Windows), say /path/to/intel64/lib/, to the LD_LIBRARY_PATH environment variable on Linux (or PATH on Windows), and build or run your Java application as usual. If you get an error message like undefined symbol: omp_get_num_procs, it probably means that libiomp5.so, libiomp5.dylib, or libiomp5md.dll is not present on your system. In that case though, it is still possible to use the GNU version of OpenMP by setting these environment variables on Linux, for example:

export MKL_THREADING_LAYER=GNU
export LD_PRELOAD=/usr/lib64/libgomp.so.1

Troubleshooting MKL

Sometimes the above steps might not be all you need to do. Another additional step might be the need to add:

export LD_LIBRARY_PATH=/opt/intel/lib/intel64/:/opt/intel/mkl/lib/intel64

This ensures that mkl will be found first and liked to.

Packaging

If on Ubuntu (14.04 or above) or CentOS (6 or above), this repository is also set to create packages for your distribution. Let's assume you have built:

for the cpu, your command-line was ./buildnativeoperations.sh ...:

cd blasbuild/cpu
make package

for the gpu, your command-line was ./buildnativeoperations.sh -c cuda ...:

cd blasbuild/cuda
make package

Uploading package to Bintray

The package upload script is in packaging. The upload command for an rpm built for cpu is:

./packages/push_to_bintray.sh myAPIUser myAPIKey deeplearning4j blasbuild/cpu/libnd4j-0.8.0.fc7.3.1611.x86_64.rpm https://github.com/deeplearning4j

The upload command for a deb package built for cuda is:

./packages/push_to_bintray.sh myAPIUser myAPIKey deeplearning4j blasbuild/cuda/libnd4j-0.8.0.fc7.3.1611.x86_64.deb https://github.com/deeplearning4j

Running tests

Tests are written with gtest, run using cmake. Tests are currently under tests_cpu/

There are 2 directories for running tests:

libnd4j_tests: These are older legacy ops tests.
layers_tests: This covers the newer graph operations and ops associated with samediff.

For running the tests, we currently use cmake or CLion to run the tests.

To run tests using CUDA backend it's pretty much similar process:

./buildnativeoperations.h -c cuda -cc -b debug -t -j
./blasbuild/cuda/tests_cpu/layers_tests/runtests (.exe on Windows)

Development

In order to extend and update libnd4j, understanding libnd4j's various cmake flags is the key. Many of them are in buildnativeoperations.sh. The pom.xml is used to integrate and auto configure the project for building with deeplearning4j.

At a minimum, you will want to enable tests. An example default set of flags for running tests and getting cpu builds working is as follows:

-DSD_CPU=true -DBLAS=TRUE -DSD_ARCH=x86-64 -DSD_EXTENSION= -DSD_LIBRARY_NAME=nd4jcpu -DSD_CHECK_VECTORIZATION=OFF  -DSD_SHARED_LIB=ON -DSD_STATIC_LIB=OFF -DSD_BUILD_MINIFIER=false -DSD_ALL_OPS=true -DCMAKE_BUILD_TYPE=Release -DPACKAGING=none  -DSD_BUILD_TESTS=OFF -DCOMPUTE=all -DOPENBLAS_PATH=C:/Users/agibs/.javacpp/cache/openblas-0.3.10-1.5.4-windows-x86_64.jar/org/bytedeco/openblas/windows-x86_64 -DDEV=FALSE -DCMAKE_NEED_RESPONSE=YES -DMKL_MULTI_THREADED=TRUE -DSD_BUILD_TESTS=YES

The way the main build script works, it dynamically generates a set of flags suitable for use for building the projects. Understanding the build script will go a long way in to configuring cmake for your particular IDE.

Helpers Overview (CUDNN, OneDNN,Armcompute)

Requirements Helper

Requirements helper was introduced to replace plain checks for making them output informative messages (Debug and Verbose mode) and also replace macros REQUIRE_TRUE.

it will lazily evaluate values and messages if the type wrapped and hasgetValue and getMsg methods
it is implicit bool. this makes it usable with logical operators and also inside if conditions. Besides it will benefit from shortcircuit nature of those operators.

it has the following check methods

Requirements& expect(const T& expVar,const T1& reqVar, Op comparision, const char *first_half="")
Requirements& expectEq(const T& exp,const T1& req)
Requirements& expectNotEq(const T& exp,const T1& req)
Requirements& expectLess(const T& exp,const T1& req)
Requirements& expectLessEq(const T& exp,const T1& req)
Requirements& expectGreater(T exp, T1 req)
Requirements& expectGreaterEq(const T& exp,const T1& req
Requirements& expectTrue(const T& expVar, const char *msg=)
Requirements& expectFalse(const T& expVar, const char *msg=)

you can either log the success case or throw error on the failure
it can use plain types for checks.
if value has stream operator it will be used to output it's value. for custom types you may need add that by yourself
ostream& operator<<(ostream& os, const CustomUserType& dt)
there is generic template InfoVariable wrapper for types to make it informative. you can use lambda operators with them as well to make it lazily evaluated
we added custom ShapeInfoVariable wrapper for the NDArray and vector<> shapes to make them informative

one can use expect to add its own proper comparision. simple lambda for that will be like this:

[](const decltype(expType)& l, const decltype(reqType)& r){
          //compare and return
          return  ....;
      }

Examples:

firstly, we should enable logging

    sd::Environment::getInstance().setDebug(true);
    sd::Environment::getInstance().setVerbose(true);

simple case

Requirements req1("Requirement Helper Example#1");
int x = 20;
req1.expectLess(x, 22);
req1.expectEq(x, 21); //should fail

Output: Requirement Helper Example#1: {20} expected to be equal to 21

using InfoVariable wrapper

int age = 15;
Requirements req2("Requirement Helper Example#2");
req2.expectGreaterEq(makeInfoVariable(age, "the user's age"), 18);

Output:

Requirement Helper Example#2: the user's age {15} expected to be greater than or equal  18

helper behavior while using many checks in one block

int getAge(){
 std::cout<<"getAge() was called"<<std::endl;
 return 15;
}
....
Requirements req3("Requirement Helper Example#3");
int z = 20;
req3.expectEq(z, 21); 
req3.expectGreaterEq(makeInfoVariable(getAge(), "the user's age"), 18);

Output:

Requirement Helper Example#3:  {20} expected to be equal to  21
getAge() was called

As it is seen the second check did not happen as the previous failed. But still getAge() method was called as its function argument.

using shortcircuit to avoid Requirement call at all if the previous one was failed

Requirements req4("Requirement Helper Example#4");
int zz = 20;
req4.expectEq(zz, 21) &&  //shortcicuit And
req4.expectGreaterEq(makeInfoVariable(getAge(), "the user's age"), 18);

Output:

Requirement Helper Example#4:  {20} expected to be equal to  21

using lambdas with InfoVariable. it will make it lazily evaluated

Requirements req5("Requirement Helper Example#5"); 
req5.expectEq( 21, 
         makeInfoVariable(21, []{
                std::cout<<"lambda call#1"<<std::endl;
                return "twenty one";
            }));
req5.expectEq(makeInfoVariable([]{ return 20;}, []{return "twenty";}), 
            makeInfoVariable(21, []{
                std::cout<<"lambda call#2"<<std::endl;
                return "twenty one";
            }));
req5.expectGreaterEq(makeInfoVariable([]{
                     std::cout<<"lambda call#3" <<std::endl;
                     return 15;
                     }, 
                     []{ return "the user's age";}), 
                  makeInfoVariable([]{return 18;}, []{return "the allowed age";})
   );

Output:

```

lambda call#2

Requirement Helper Example#5: twenty {20} expected to be equal to twenty one 21

6. use bool nature and also log the success case
```cpp
Requirements req6("Requirement Helper Example#6");
NDArray * arr= nullptr;
arr !=nullptr && req6.expectEq(arr->rankOf(), 3) ;
req6.logTheSuccess();

Output:

Requirement Helper Example#6: meets the requirements

custom comparision lambda and also another usage of the custom wrapper written by us ShapeInfoVariable. Note: we will use std::vector<int>. this wrapper can be used with NDArray as well.

Requirements req7("Requirement Helper Example#7");
req7.expect(makeShapeInfoVariable(std::vector<int>{2,3,4,5}, SHAPE_MSG_INPUT0), makeShapeInfoVariable(std::vector<int>{2,3,4,7}, SHAPE_MSG_INPUT1),
                 [](const std::vector<int>& l, const std::vector<int>& r){
                     if(l.size()!=r.size()) return false;
                     for (int i = 0; i < l.size(); i++) {
                         if (l.at(i) != r.at(i)) {
                             return false;
                         }
                     }
                     return true;
                 }
              , EXPECTED_EQ_MSG);
}

Output:

Requirement Helper Example#7: the Shape of the Input NDArray#0 {[2, 3, 4, 5]} expected to be equal to the Shape of the Input NDArray#1 [2, 3, 4, 7]

throw error when there is failure

Requirements req8("Requirement Helper Example#8");
req8.expectEq(6,6) &&
req8.expectIn(6, {1,2,3,7,8,9});
req8.throws();

Output:

terminate called after throwing an instance of 'std::invalid_argument'
what():  Op validation failed
...
Requirement Helper Example#8: {6} expected to be one of these {[1, 2, 3, 7, 8, 9, ]}

Here is live example:

Note: some classes were mocked there and do not represent the exact implementations in libnd4j. https://godbolt.org/z/sq98vchs5

Helpers Overview (CUDNN, OneDNN,Armcompute)

Requirements Helper

Requirements helper was introduced to replace plain checks for making them output informative messages (Debug and Verbose mode) and also replace macros REQUIRE_TRUE.

it will lazily evaluate values and messages if the type wrapped and hasgetValue and getMsg methods
it is implicit bool. this makes it usable with logical operators and also inside if conditions. Besides it will benefit from shortcircuit nature of those operators.

it has the following check methods

Requirements& expect(const T& expVar,const T1& reqVar, Op comparision, const char *first_half="")
Requirements& expectEq(const T& exp,const T1& req)
Requirements& expectNotEq(const T& exp,const T1& req)
Requirements& expectLess(const T& exp,const T1& req)
Requirements& expectLessEq(const T& exp,const T1& req)
Requirements& expectGreater(T exp, T1 req)
Requirements& expectGreaterEq(const T& exp,const T1& req
Requirements& expectTrue(const T& expVar, const char *msg=)
Requirements& expectFalse(const T& expVar, const char *msg=)

you can either log the success case or throw error on the failure
it can use plain types for checks.
if value has stream operator it will be used to output it's value. for custom types you may need add that by yourself
ostream& operator<<(ostream& os, const CustomUserType& dt)
there is generic template InfoVariable wrapper for types to make it informative. you can use lambda operators with them as well to make it lazily evaluated
we added custom ShapeInfoVariable wrapper for the NDArray and vector<> shapes to make them informative

one can use expect to add its own proper comparision. simple lambda for that will be like this:

[](const decltype(expType)& l, const decltype(reqType)& r){
          //compare and return
          return  ....;
      }

Examples:

firstly, we should enable logging

    sd::Environment::getInstance().setDebug(true);
    sd::Environment::getInstance().setVerbose(true);

simple case

Requirements req1("Requirement Helper Example#1");
int x = 20;
req1.expectLess(x, 22);
req1.expectEq(x, 21); //should fail

Output: Requirement Helper Example#1: {20} expected to be equal to 21

using InfoVariable wrapper

int age = 15;
Requirements req2("Requirement Helper Example#2");
req2.expectGreaterEq(makeInfoVariable(age, "the user's age"), 18);

Output:

Requirement Helper Example#2: the user's age {15} expected to be greater than or equal  18

helper behavior while using many checks in one block

int getAge(){
 std::cout<<"getAge() was called"<<std::endl;
 return 15;
}
....
Requirements req3("Requirement Helper Example#3");
int z = 20;
req3.expectEq(z, 21); 
req3.expectGreaterEq(makeInfoVariable(getAge(), "the user's age"), 18);

Output:

Requirement Helper Example#3:  {20} expected to be equal to  21
getAge() was called

As it is seen the second check did not happen as the previous failed. But still getAge() method was called as its function argument.

using shortcircuit to avoid Requirement call at all if the previous one was failed

Requirements req4("Requirement Helper Example#4");
int zz = 20;
req4.expectEq(zz, 21) &&  //shortcicuit And
req4.expectGreaterEq(makeInfoVariable(getAge(), "the user's age"), 18);

Output:

Requirement Helper Example#4:  {20} expected to be equal to  21

using lambdas with InfoVariable. it will make it lazily evaluated

Requirements req5("Requirement Helper Example#5"); 
req5.expectEq( 21, 
         makeInfoVariable(21, []{
                std::cout<<"lambda call#1"<<std::endl;
                return "twenty one";
            }));
req5.expectEq(makeInfoVariable([]{ return 20;}, []{return "twenty";}), 
            makeInfoVariable(21, []{
                std::cout<<"lambda call#2"<<std::endl;
                return "twenty one";
            }));
req5.expectGreaterEq(makeInfoVariable([]{
                     std::cout<<"lambda call#3" <<std::endl;
                     return 15;
                     }, 
                     []{ return "the user's age";}), 
                  makeInfoVariable([]{return 18;}, []{return "the allowed age";})
   );

Output:

```

lambda call#2

Requirement Helper Example#5: twenty {20} expected to be equal to twenty one 21

6. use bool nature and also log the success case
```cpp
Requirements req6("Requirement Helper Example#6");
NDArray * arr= nullptr;
arr !=nullptr && req6.expectEq(arr->rankOf(), 3) ;
req6.logTheSuccess();

Output:

Requirement Helper Example#6: meets the requirements

custom comparision lambda and also another usage of the custom wrapper written by us ShapeInfoVariable. Note: we will use std::vector<int>. this wrapper can be used with NDArray as well.

Requirements req7("Requirement Helper Example#7");
req7.expect(makeShapeInfoVariable(std::vector<int>{2,3,4,5}, SHAPE_MSG_INPUT0), makeShapeInfoVariable(std::vector<int>{2,3,4,7}, SHAPE_MSG_INPUT1),
                 [](const std::vector<int>& l, const std::vector<int>& r){
                     if(l.size()!=r.size()) return false;
                     for (int i = 0; i < l.size(); i++) {
                         if (l.at(i) != r.at(i)) {
                             return false;
                         }
                     }
                     return true;
                 }
              , EXPECTED_EQ_MSG);
}

Output:

Requirement Helper Example#7: the Shape of the Input NDArray#0 {[2, 3, 4, 5]} expected to be equal to the Shape of the Input NDArray#1 [2, 3, 4, 7]

throw error when there is failure

Requirements req8("Requirement Helper Example#8");
req8.expectEq(6,6) &&
req8.expectIn(6, {1,2,3,7,8,9});
req8.throws();

Output:

terminate called after throwing an instance of 'std::invalid_argument'
what():  Op validation failed
...
Requirement Helper Example#8: {6} expected to be one of these {[1, 2, 3, 7, 8, 9, ]}

Here is live example:

Note: some classes were mocked there and do not represent the exact implementations in libnd4j. https://godbolt.org/z/sq98vchs5

Overview of working with libnd4j

Native operations for nd4j. Build using cmake

Prerequisites

GCC 4.9+
CUDA Toolkit Versions 10 or 11
CMake 3.8 (as of Nov 2017, in near future will require 3.9)

Additional build arguments

There's few additional arguments for buildnativeoperations.sh script you could use:

 -a XXXXXXXX// shortcut for -march/-mtune, i.e. -a native
 -b release OR -b debug // enables/desables debug builds. release is considered by default
 -j XX // this argument defines how many threads will be used to binaries on your box. i.e. -j 8 
 -cc XX// CUDA-only argument, builds only binaries for target GPU architecture. use this for fast builds
 --check-vectorization  auto-vectorization report for developers. (Currently, only GCC is supported)

More about AutoVectorization report

-cc and --compute option examples

description

-cc all

builds for common GPUs

-cc auto

tries to detect automatically

-cc Maxwell

GPU microarchitecture codename

-cc 75

compute capability 7.5 without a dot

-cc 7.5

compute capability 7.5 with a dot

-cc "Maxwell 6.0 7.5"

space-separated multiple arguments within quotes (note: numbers only with a dot)

OS Specific Requirements

Android

Download the NDK, extract it somewhere, and execute the following commands, replacing android-xxx with either android-arm or android-x86:

git clone https://github.com/eclipse/deeplearning4j
export ANDROID_NDK=/path/to/android-ndk/
cd deeplearning4j/libnd4j
bash buildnativeoperations.sh -platform android-xxx
cd ../nd4j
mvn clean install -Djavacpp.platform=android-xxx -DskipTests -pl '!:nd4j-cuda-9.0,!:nd4j-cuda-9.0-platform,!:nd4j-tests'

OSX

Run ./setuposx.sh (Please ensure you have brew installed)

See macOSx10 CPU only.md

Linux

Depends on the distro - ask in the earlyadopters channel for specifics on distro

Ubuntu Linux 15.10

wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda-repo-ubuntu1504-7-5-local_7.5-18_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1504-7-5-local_7.5-18_amd64.deb
sudo apt-get update
sudo apt-get install cuda
sudo apt-get install cmake
sudo apt-get install gcc-4.9
sudo apt-get install g++-4.9
sudo apt-get install git
git clone https://github.com/deeplearning4j/libnd4j
cd libnd4j/
export LIBND4J_HOME=~/libnd4j/
sudo rm /usr/bin/gcc
sudo rm /usr/bin/g++
sudo ln -s /usr/bin/gcc-4.9 /usr/bin/gcc
sudo ln -s /usr/bin/g++-4.9 /usr/bin/g++
./buildnativeoperations.sh
./buildnativeoperations.sh -c cuda -сс YOUR_DEVICE_ARCH

Ubuntu Linux 16.04

sudo apt install cmake
sudo apt install nvidia-cuda-dev nvidia-cuda-toolkit nvidia-361
export TRICK_NVCC=YES
./buildnativeoperations.sh
./buildnativeoperations.sh -c cuda -сс YOUR_DEVICE_ARCH

The standard development headers are needed.

CentOS 6

yum install centos-release-scl-rh epel-release
yum install devtoolset-3-toolchain maven30 cmake3 git
scl enable devtoolset-3 maven30 bash
./buildnativeoperations.sh
./buildnativeoperations.sh -c cuda -сс YOUR_DEVICE_ARCH

Windows

See Windows.md

Setup for All OS

Set a LIBND4J_HOME as an environment variable to the libnd4j folder you've obtained from GIT
- Note: this is required for building nd4j as well.

Setup cpu followed by gpu, run the following on the command line:

For standard builds:

 ./buildnativeoperations.sh
 ./buildnativeoperations.sh -c cuda -сс YOUR_DEVICE_ARCH

For Debug builds:

 ./buildnativeoperations.sh blas -b debug
 ./buildnativeoperations.sh blas -c cuda -сс YOUR_DEVICE_ARCH -b debug

For release builds (default):

 ./buildnativeoperations.sh
 ./buildnativeoperations.sh -c cuda -сс YOUR_DEVICE_ARCH

OpenMP support

OpenMP 4.0+ should be used to compile libnd4j. However, this shouldn't be any trouble, since OpenMP 4 was released in 2015 and should be available on all major platforms.

Linking with MKL

export MKL_THREADING_LAYER=GNU
export LD_PRELOAD=/usr/lib64/libgomp.so.1

Troubleshooting MKL

Sometimes the above steps might not be all you need to do. Another additional step might be the need to add:

export LD_LIBRARY_PATH=/opt/intel/lib/intel64/:/opt/intel/mkl/lib/intel64

This ensures that mkl will be found first and liked to.

Packaging

If on Ubuntu (14.04 or above) or CentOS (6 or above), this repository is also set to create packages for your distribution. Let's assume you have built:

for the cpu, your command-line was ./buildnativeoperations.sh ...:

cd blasbuild/cpu
make package

for the gpu, your command-line was ./buildnativeoperations.sh -c cuda ...:

cd blasbuild/cuda
make package

Uploading package to Bintray

The package upload script is in packaging. The upload command for an rpm built for cpu is:

./packages/push_to_bintray.sh myAPIUser myAPIKey deeplearning4j blasbuild/cpu/libnd4j-0.8.0.fc7.3.1611.x86_64.rpm https://github.com/deeplearning4j

The upload command for a deb package built for cuda is:

./packages/push_to_bintray.sh myAPIUser myAPIKey deeplearning4j blasbuild/cuda/libnd4j-0.8.0.fc7.3.1611.x86_64.deb https://github.com/deeplearning4j

Running tests

Tests are written with gtest, run using cmake. Tests are currently under tests_cpu/

There are 2 directories for running tests:

libnd4j_tests: These are older legacy ops tests.
layers_tests: This covers the newer graph operations and ops associated with samediff.

For running the tests, we currently use cmake or CLion to run the tests.

To run tests using CUDA backend it's pretty much similar process:

./buildnativeoperations.h -c cuda -cc -b debug -t -j
./blasbuild/cuda/tests_cpu/layers_tests/runtests (.exe on Windows)

Development

At a minimum, you will want to enable tests. An example default set of flags for running tests and getting cpu builds working is as follows:

-DSD_CPU=true -DBLAS=TRUE -DSD_ARCH=x86-64 -DSD_EXTENSION= -DSD_LIBRARY_NAME=nd4jcpu -DSD_CHECK_VECTORIZATION=OFF  -DSD_SHARED_LIB=ON -DSD_STATIC_LIB=OFF -DSD_BUILD_MINIFIER=false -DSD_ALL_OPS=true -DCMAKE_BUILD_TYPE=Release -DPACKAGING=none  -DSD_BUILD_TESTS=OFF -DCOMPUTE=all -DOPENBLAS_PATH=C:/Users/agibs/.javacpp/cache/openblas-0.3.10-1.5.4-windows-x86_64.jar/org/bytedeco/openblas/windows-x86_64 -DDEV=FALSE -DCMAKE_NEED_RESPONSE=YES -DMKL_MULTI_THREADED=TRUE -DSD_BUILD_TESTS=YES

Understanding graph execution

Basic idea

This basic example shows execution of graph loaded from file:

auto graph = GraphExecutioner<float>::importFromFlatBuffers("./some_file.fb");
GraphExecutioner<float>::execute(graph);
// ... do something with results ...
delete graph;

FlatBuffers schemas

You can find scheme files here.

Graph execution logic

Memory management within graph

Current graph limitations

There are some limitations. Some of them will be lifted eventually, others won't be. Here's the list:

Graph has single data type. I.e. Graph<float> or Graph<float16> or Graph<double> This limitation will be lifted soon.
On some platforms, like Java, single Variable/Placeholder size is limited to 2GB buffer size. However, on libnd4j side there's no such limitation.
Variable size/dimensionality has limitations: max NDArray rank is limited to 32 at this moment, and any single dimension is limited to MAX_INT size.
Recursion isn't directly supported at this moment.
CUDA isn't supported at this moment. This limitation will be lifted soon.
When used from C++, Graph only supports FeedForward mode. This limitation will be lifted soon.

Minified Graph binaries

# building full libnd4j copy AND minfier app
./buildnativeoperations.sh -a native -m 
...
# building libnd4j for 2 specific graphs
./minifier -l -a native -o libnd4j_special ../some_path/some_graph1.fb ../some_path/some_graph2.fb
Option 'l': Build library
Option 'a': Target arch: native
Option 'o': Output file name is libnd4j_special
Total available operations: 423

Retrieving ops from the Graph and collect them...

Collecting out Scopes...
Operations found so far:
rank
range
subtract
transpose
matmul
biasadd
TRANSFORM{15}

Building minified library...

This library can be used in your application as any other shared libray out there: you'll include headers file and you'll be able to call for things you need.

Documentation

Documentation for individual operations, and basic classes (like NDArray, Graph etc) is available as part of Nd4j javadoc: https://nd4j.org/doc/

Embedded profiling

Environment::getInstance().setProfiling(true);
auto graph = GraphExecutioner::importFromFlatBuffers("./resources/ae_00.fb");

auto profile = GraphProfilingHelper::profile(graph, 1000);
profile->printOut();

delete graph;

1000 iterations laterm you'll get statistics printed out. Statistics basically includes time spent in various parts of code and memory allocation details.

Here's how it'll look like:

Printing out Graph...
8. matmul; Inputs: [{1:0}, {2:0}]; 
9. biasadd; Inputs: [{8:0}, {3:0}]; 
10. TRANSFORM:{15}; Inputs: [{9:0}]; 
11. rank; Inputs: [{2:0}]; 
12. subtract; Inputs: [{11:0}, {4:0}]; 
13. range; Inputs: [{5:0}, {11:0}, {6:0}]; 
14. subtract; Inputs: [{12:0}, {13:0}]; 
15. transpose; Inputs: [{2:0}, {14:0}]; 
16. matmul; Inputs: [{10:0}, {15:0}]; 
17. biasadd; Inputs: [{16:0}, {7:0}]; 
18. TRANSFORM:{15}; Inputs: [{17:0}]; 

Printing out Scopes...
Graph profile: 1000 executions

Memory:
ACT: 0; TMP: 0; OBJ: 0; TTL: 1788;

Time:
Construction time: 2135 ns;
Execution time: 41820 ns;

Per-node reports:
Node: <8:MatMul>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 200;
      Time: PREP: 1160 ns; EXEC: 3167 ns; TTL: 5929 ns;
      PREP: INPUT: 251 ns; SHAPE: 382 ns; ARRAY: 217 ns;
Node: <9:BiasAdd>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 104;
      Time: PREP: 917 ns; EXEC: 3580 ns; TTL: 5957 ns;
      PREP: INPUT: 220 ns; SHAPE: 213 ns; ARRAY: 217 ns;
Node: <10:Tanh>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 104;
      Time: PREP: 756 ns; EXEC: 241 ns; TTL: 1927 ns;
      PREP: INPUT: 140 ns; SHAPE: 195 ns; ARRAY: 205 ns;
Node: <11:transpose/Rank>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 36;
      Time: PREP: 522 ns; EXEC: 119 ns; TTL: 1403 ns;
      PREP: INPUT: 109 ns; SHAPE: 69 ns; ARRAY: 171 ns;
Node: <12:transpose/sub>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 36;
      Time: PREP: 666 ns; EXEC: 185 ns; TTL: 1684 ns;
      PREP: INPUT: 192 ns; SHAPE: 94 ns; ARRAY: 168 ns;
Node: <13:transpose/Range>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 556;
      Time: PREP: 808 ns; EXEC: 647 ns; TTL: 2416 ns;
      PREP: INPUT: 297 ns; SHAPE: 228 ns; ARRAY: 181 ns;
Node: <14:transpose/sub_1>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 56;
      Time: PREP: 721 ns; EXEC: 541 ns; TTL: 2205 ns;
      PREP: INPUT: 23 ns; SHAPE: 92 ns; ARRAY: 165 ns;
Node: <15:transpose>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 96;
      Time: PREP: 3936 ns; EXEC: 602 ns; TTL: 5811 ns;
      PREP: INPUT: 194 ns; SHAPE: 3241 ns; ARRAY: 257 ns;
Node: <16:MatMul_1>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 312;
      Time: PREP: 970 ns; EXEC: 3565 ns; TTL: 6066 ns;
      PREP: INPUT: 203 ns; SHAPE: 320 ns; ARRAY: 193 ns;
Node: <17:BiasAdd_1>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 144;
      Time: PREP: 914 ns; EXEC: 3528 ns; TTL: 5870 ns;
      PREP: INPUT: 231 ns; SHAPE: 191 ns; ARRAY: 223 ns;
Node: <18:output>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 144;
      Time: PREP: 805 ns; EXEC: 285 ns; TTL: 1928 ns;
      PREP: INPUT: 157 ns; SHAPE: 192 ns; ARRAY: 232 ns;

Special timers:
No special timers were set

Roadmap

In short-to-medium term following improvements are expected:

CUDA support for all new ops
Additional data types support: int, long long, q types, bool
Sparse tensors support