libnd4j contains a Directed Acyclic Graph execution engine, suited for both local and remote execution. However, the main goal here is the execution of externally originated graphs, serialized into FlatBuffers and provided either via a pointer or a file.
This basic example shows the execution of a graph loaded from a file:
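A minimal sketch of that flow (the file name is a placeholder, and the exact `GraphExecutioner` signatures may differ between libnd4j versions):

```c++
// import a graph serialized into FlatBuffers and execute it
auto graph = GraphExecutioner<float>::importFromFlatBuffers("./some_graph.fb");
graph->printOut(); // optionally print out the graph structure

Nd4jStatus status = GraphExecutioner<float>::execute(graph);
// execution results stay within the graph's VariableSpace
delete graph;
```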
You can find the schema files here.
At the moment, the libnd4j repo contains compiled definitions for C++, Python, Java, and JSON, but FlatBuffers can be compiled for PHP, C#, JavaScript, TypeScript, and Go as well. Please refer to the flatc instructions to do that.
Such bindings allow you to build FlatBuffers files/buffers suitable for remote execution of your graph and for obtaining the results back. For example, you can use JavaScript to build a graph (or just update its variables/placeholders), send it to a remote RPC server powered by libnd4j, and get the results back.
No matter how the graph is represented on the front end, on the back end it's rather simple: a topologically sorted list of operations, executed sequentially if there are shared dependencies, or (optionally) in parallel if the current graph nodes have no shared dependencies.
Each node in the graph represents a single linear algebra operation applied to the input(s) of the node. For example, z = Add(x, y) is an operation that takes 2 NDArrays as input and produces 1 NDArray as output. So, a graph is built of such primitive operations, which are executed sequentially.
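As a rough sketch of that idea (the `NDArray` constructor and `DeclarableOp` execution signatures below follow older libnd4j tests and may differ in your version):

```c++
// z = Add(x, y): an op taking two NDArrays as input and producing one as output
NDArray<float> x('c', {2, 2}, {1, 2, 3, 4});
NDArray<float> y('c', {2, 2}, {4, 3, 2, 1});
NDArray<float> z('c', {2, 2});

nd4j::ops::add<float> op;
Nd4jStatus status = op.execute({&x, &y}, {&z}, {}, {});
// z now holds {5, 5, 5, 5}
```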
Everything that happens within the graph during execution stays within the VariableSpace. It acts as storage for Variables and NDArrays produced during graph execution. On top of that, there's an option to use pre-allocated Workspaces for the allocation of NDArrays.
There are some limitations. Some of them will be lifted eventually, others won't be. Here's the list:
Graph has a single data type: Graph<float>, Graph<float16>, or Graph<double>. This limitation will be lifted soon.
On some platforms, like Java, the size of a single Variable/Placeholder is limited to a 2GB buffer. However, there is no such limitation on the libnd4j side.
Variable size/dimensionality has limitations: the maximum NDArray rank is limited to 32 at this moment, and any single dimension is limited to MAX_INT size.
Recursion isn't directly supported at this moment.
CUDA isn't supported at this moment. This limitation will be lifted soon.
When used from C++, Graph only supports FeedForward mode. This limitation will be lifted soon.
There's an option to build minified binaries suited for the execution of specific graphs. The idea is quite simple: you feed your existing Graph(s) in FlatBuffers format into a special app, which extracts the operations used in your Graph(s) and excludes all other operations from the target binary.
Once minifier finishes, you'll have libnd4j_special.so and libnd4j_special.h files ready, and they'll contain only the operations used in the two graphs provided at compilation time, plus the basic primitives used to work with Graph. Things like NDArray, GraphExecutioner, etc. will be included as well.
This library can be used in your application like any other shared library out there: you include the header file, and you'll be able to call the things you need.
Documentation for individual operations and basic classes (like NDArray, Graph, etc.) is available as part of the Nd4j javadoc: https://nd4j.org/doc/
If you're adding new ops and want to make sure they run well on your specific device, you might want to give the embedded Graph profiling helper a shot. Despite being simple, it still shows you the time spent in various parts of the Graph.
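A sketch of how the profiling helper can be wired up (the names below follow libnd4j's own tests; verify them against your version):

```c++
// enable profiling before executing anything
Environment::getInstance()->setProfiling(true);

auto graph = GraphExecutioner<float>::importFromFlatBuffers("./some_graph.fb");

// run the graph repeatedly and collect per-node timing statistics
auto profile = GraphProfilingHelper<float>::profile(graph, 1000);
profile->printOut();

delete graph;
```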
1000 iterations later, you'll get statistics printed out. The statistics include the time spent in various parts of the code and memory allocation details.
Here's what it will look like:
In the short to medium term, the following improvements are expected:
CUDA support for all new ops
Additional data types support: int, long long, q types, bool
Sparse tensors support
Native operations for nd4j. Built using cmake.
GCC 4.9+
CUDA Toolkit Versions 10 or 11
CMake 3.8 (as of Nov 2017; 3.9 will be required in the near future)
There are a few additional arguments for the buildnativeoperations.sh script that you could use:
More about AutoVectorization report
You can look up the compute capability for your card on the NVIDIA website here, or use auto. Please also check your CUDA Toolkit release notes for supported and dropped features. Here is the latest CUDA Toolkit release note. You can find the same information for older Toolkit versions in the CUDA archives.
Download the NDK, extract it somewhere, and execute the following commands, replacing `android-xxx` with either `android-arm` or `android-x86`:
Run ./setuposx.sh (Please ensure you have brew installed)
This depends on the distro; ask in the earlyadopters channel for distro specifics.
The standard development headers are needed.
See Windows.md
Set LIBND4J_HOME as an environment variable pointing to the libnd4j folder you've obtained from Git.
Note: this is required for building nd4j as well.
To set up the CPU backend followed by the GPU backend, run the build script on the command line. Standard, Debug, and release builds (release is the default) are covered in the sketch below:
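A sketch, using only the script flags documented elsewhere on this page (`-c`, `-cc`, `-b`); the canonical invocations may differ slightly:

```shell
# standard / release build (the default), CPU backend:
./buildnativeoperations.sh

# the same for the CUDA backend (see the -cc table below for accepted values):
./buildnativeoperations.sh -c cuda -cc auto

# Debug build, CPU backend:
./buildnativeoperations.sh -b debug
```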
OpenMP 4.0+ should be used to compile libnd4j. This shouldn't be any trouble, since OpenMP 4.0 was released in 2013 and should be available on all major platforms.
We can link with MKL either at build time, or at runtime with binaries initially linked with another BLAS implementation such as OpenBLAS. In either case, simply add the path containing `libmkl_rt.so` (or `mkl_rt.dll` on Windows), say `/path/to/intel64/lib/`, to the `LD_LIBRARY_PATH` environment variable on Linux (or `PATH` on Windows), and build or run your Java application as usual. If you get an error message like `undefined symbol: omp_get_num_procs`, it probably means that `libiomp5.so`, `libiomp5.dylib`, or `libiomp5md.dll` is not present on your system. In that case, though, it is still possible to use the GNU version of OpenMP by setting these environment variables on Linux, for example:
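A sketch (the `libgomp` path below is an assumption; adjust it to wherever the GNU OpenMP runtime lives on your system):

```shell
export MKL_THREADING_LAYER=GNU
export LD_PRELOAD=/usr/lib64/libgomp.so.1  # path to the GNU OpenMP runtime
```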
Sometimes the above steps might not be all you need to do. An additional step might be the need to add:
This ensures that MKL will be found first and linked to.
If on Ubuntu (14.04 or above) or CentOS (6 or above), this repository is also set up to create packages for your distribution. Let's assume you have built:
for the cpu, if your command line was `./buildnativeoperations.sh ...`:
for the gpu, if your command line was `./buildnativeoperations.sh -c cuda ...`:
The package upload script is in packaging. The upload command for an rpm built for cpu is:
The upload command for a deb package built for cuda is:
Tests are written with gtest and run using cmake. Tests are currently under tests_cpu/.
There are 2 directories for running tests:
libnd4j_tests: These are older legacy ops tests.
layers_tests: This covers the newer graph operations and ops associated with samediff.
We currently use cmake or CLion to run the tests.
Running tests with the CUDA backend is a pretty similar process:
./buildnativeoperations.sh -c cuda -cc -b debug -t -j
./blasbuild/cuda/tests_cpu/layers_tests/runtests (.exe on Windows)
In order to extend and update libnd4j, understanding libnd4j's various cmake flags is key. Many of them are set in buildnativeoperations.sh. The pom.xml is used to integrate and auto-configure the project for building with deeplearning4j.
At a minimum, you will want to enable tests. An example default set of flags for running tests and getting cpu builds working is as follows:
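A sketch of such a flag set (the option names `SD_CPU`, `SD_ALL_OPS`, and `SD_BUILD_TESTS` are assumptions here; verify them against the project's CMakeLists.txt):

```shell
# hypothetical example; confirm the option names in CMakeLists.txt
cmake -DSD_CPU=true -DSD_ALL_OPS=true -DSD_BUILD_TESTS=ON ..
```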
The main build script dynamically generates a set of flags suitable for building the project. Understanding the build script will go a long way toward configuring cmake for your particular IDE.
The Requirements helper was introduced to replace plain checks, making them output informative messages (in Debug and Verbose mode), and also to replace the REQUIRE_TRUE macros.
it will lazily evaluate values and messages if the wrapped type has `getValue` and `getMsg` methods
it is implicitly convertible to bool; this makes it usable with logical operators and also inside if conditions, and it benefits from the short-circuit nature of those operators
it has a set of check methods
you can either log the success case or throw an error on failure
it can use plain types for checks
if a value has a stream operator, it will be used to output its value; for custom types you may need to add that yourself (a full example follows this list):
`ostream& operator<<(ostream& os, const CustomUserType& dt)`
there is a generic template wrapper, `InfoVariable`, to make types informative; you can use lambdas with it as well, to make evaluation lazy
we added a custom `ShapeInfoVariable` wrapper for `NDArray` and `vector<>` shapes, to make them informative
one can use `expect` to add your own proper comparison; a simple lambda for that will look like this:
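A generic sketch (the exact parameter types depend on the values being compared):

```c++
// custom comparison for use with expect: returns true when the check should pass
auto lessThan = [](const auto& value, const auto& bound) {
    return value < bound;
};
```

And for the stream-operator point above, an example for a hypothetical custom type:

```c++
#include <ostream>

struct CustomUserType {
    int id;
};

// lets the helper print CustomUserType values in its messages
std::ostream& operator<<(std::ostream& os, const CustomUserType& dt) {
    return os << "CustomUserType#" << dt.id;
}
```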
First, we should enable logging:
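A sketch, assuming the usual Environment toggles (the namespace and accessor may differ between versions):

```c++
// debug + verbose mode enables the helper's informative messages
sd::Environment::getInstance().setDebug(true);
sd::Environment::getInstance().setVerbose(true);
```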
A simple case:
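A hypothetical reconstruction matching the output below (the `Requirements` type name and `expectEq` method are assumptions):

```c++
Requirements req("Requirement Helper Example#1");
req.expectEq(20, 21); // fails: 20 is not equal to the expected 21
```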
Output: Requirement Helper Example#1: {20} expected to be equal to 21
Using the `InfoVariable` wrapper:
Output:
Helper behavior when using many checks in one block:
Output:
As can be seen, the second check did not happen because the previous one failed. However, `getAge()` was still called, as it was an argument of the call.
Using short-circuiting to avoid the Requirement call entirely if the previous one failed:
Output:
Using lambdas with `InfoVariable`; this will make them lazily evaluated:
Output:
```
lambda call#2
Requirement Helper Example#5: twenty {20} expected to be equal to twenty one 21
```
A custom comparison lambda, and also another usage of our custom `ShapeInfoVariable` wrapper. Note: we will use `std::vector<int>` here; this wrapper can be used with `NDArray` as well.
Output:
Throwing an error when there is a failure:
Output:
Note: some classes were mocked in these examples and do not represent the exact implementations in libnd4j.
| `-cc` / `--compute` option example | description |
| --- | --- |
| `-cc all` | builds for common GPUs |
| `-cc auto` | tries to detect the GPU automatically |
| `-cc Maxwell` | GPU microarchitecture codename |
| `-cc 75` | compute capability 7.5, without a dot |
| `-cc 7.5` | compute capability 7.5, with a dot |
| `-cc "Maxwell 6.0 7.5"` | multiple space-separated arguments within quotes (note: numbers only with a dot) |