Understanding graph execution

Basic idea

libnd4j contains a Directed Acyclic Graph execution engine suited for both local and remote execution. However, the main goal here is the execution of externally originated graphs, serialized into FlatBuffers and provided either via a pointer or a file.

This basic example shows execution of a graph loaded from a file:

auto graph = GraphExecutioner<float>::importFromFlatBuffers("./some_file.fb");
GraphExecutioner<float>::execute(graph);
// ... do something with results ...
delete graph;
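
As mentioned above, a graph can also be provided via a pointer to an in-memory FlatBuffers buffer. Here is a rough sketch of that variant; the importFromFlatPointer name and the Nd4jPointer cast are assumptions and may differ between libnd4j versions:

// Hypothetical sketch: 'buffer' is assumed to already hold a serialized FlatBuffers graph in memory.
// The method name importFromFlatPointer is an assumption, not a confirmed API.
auto graph = GraphExecutioner<float>::importFromFlatPointer(reinterpret_cast<Nd4jPointer>(buffer));
GraphExecutioner<float>::execute(graph);
// ... do something with results ...
delete graph;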

FlatBuffers schemas

The FlatBuffers schema files can be found in the libnd4j repository.

At this moment the libnd4j repo contains compiled definitions for C++, Python, Java, and JSON, but FlatBuffers can be compiled for PHP, C#, JavaScript, TypeScript, and Go as well. Please refer to the flatc instructions to do that.

Such bindings allow you to build FlatBuffers files/buffers suitable for remote execution of your graph and for obtaining results back. For example, you can use JavaScript to build a graph (or just update variables/placeholders), send it to a remote RPC server powered by libnd4j, and get the results back.

Graph execution logic

No matter how the graph is represented on the front end, on the back end it is rather simple: a topologically sorted list of operations, executed sequentially when the current graph nodes share dependencies, or (optionally) in parallel when they have no shared dependencies.

Each node in the graph represents a single linear algebra operation applied to the input(s) of that node. For example, z = Add(x, y) is an operation that takes 2 NDArrays as input and produces 1 NDArray as output. So, a graph is built of such primitive operations, which are executed sequentially.
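
To make the primitive concrete, here is a minimal sketch of what such an Add node computes, expressed directly against the NDArray API outside of a graph (assuming the NDArrayFactory helpers available in recent libnd4j versions):

// Two 2x2 input arrays, row-major ('c') order
auto x = NDArrayFactory::create<float>('c', {2, 2}, {1.f, 2.f, 3.f, 4.f});
auto y = NDArrayFactory::create<float>('c', {2, 2}, {10.f, 20.f, 30.f, 40.f});
// Element-wise addition: the same computation a z = Add(x, y) node performs
auto z = x + y;
z.printIndexedBuffer("z");   // 11, 22, 33, 44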

Memory management within graph

Everything that happens within the graph during execution stays within the VariableSpace. It acts as storage for the Variables and NDArrays produced during graph execution. On top of that, there is an option to use pre-allocated Workspaces for the allocation of NDArrays.
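
As a brief illustration (method names are assumptions based on the libnd4j graph API and may differ between versions), results left in the VariableSpace after execution can be looked up by the same {nodeId:outputIndex} pairs that appear in the profiling output further below:

// Assumed API sketch: fetch the NDArray produced by node 18, output 0
auto variableSpace = graph->getVariableSpace();
if (variableSpace->hasVariable(18, 0)) {
    auto result = variableSpace->getVariable(18, 0)->getNDArray();
    result->printIndexedBuffer("graph output");
}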

Current graph limitations

There are some limitations. Some of them will be lifted eventually, others won't be. Here's the list:

  • Graph has a single data type, i.e. Graph<float>, Graph<float16>, or Graph<double>. This limitation will be lifted soon.

  • On some platforms, like Java, a single Variable/Placeholder is limited to a 2 GB buffer size. However, there is no such limitation on the libnd4j side.

  • Variable size/dimensionality has limitations: the maximum NDArray rank is limited to 32 at this moment, and any single dimension is limited to MAX_INT size.

  • Recursion isn't directly supported at this moment.

  • CUDA isn't supported at this moment. This limitation will be lifted soon.

  • When used from C++, Graph only supports FeedForward mode. This limitation will be lifted soon.

Minified Graph binaries

There's an option to build minified binaries suited for the execution of specific graphs. The idea is quite simple: you feed your existing graph(s) in FlatBuffers format into a special app, which extracts the operations used in your graph(s) and excludes all other operations from the target binary.

# building full libnd4j copy AND minifier app
./buildnativeoperations.sh -a native -m 
...
# building libnd4j for 2 specific graphs
./minifier -l -a native -o libnd4j_special ../some_path/some_graph1.fb ../some_path/some_graph2.fb
Option 'l': Build library
Option 'a': Target arch: native
Option 'o': Output file name is libnd4j_special
Total available operations: 423

Retrieving ops from the Graph and collect them...

Collecting out Scopes...
Operations found so far:
rank
range
subtract
transpose
matmul
biasadd
TRANSFORM{15}

Building minified library...

Once the minifier finishes, you'll have libnd4j_special.so and libnd4j_special.h files ready, and they'll contain only those operations used in the 2 graphs provided at compilation time, plus the basic primitives used to work with Graph. Things like NDArray, GraphExecutioner etc. will be included as well.

This library can be used in your application like any other shared library out there: you include the header file and call whatever you need.
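
As a rough illustration (the file names come from the minifier example above; the rest is an assumed sketch rather than a confirmed workflow), an application linked against libnd4j_special.so would use the same API as the full library, restricted to the ops baked in at minification time:

// Sketch: include the header generated by the minifier and run one of the graphs
// the library was minified for. Ops outside that set are not available.
#include "libnd4j_special.h"

int main() {
    auto graph = GraphExecutioner<float>::importFromFlatBuffers("../some_path/some_graph1.fb");
    GraphExecutioner<float>::execute(graph);
    delete graph;
    return 0;
}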

Documentation

Documentation for individual operations and basic classes (like NDArray, Graph, etc.) is available as part of the Nd4j Javadoc: https://nd4j.org/doc/

Embedded profiling

If you're adding new ops and want to make sure they run OK on your specific device, you might want to give the embedded graph profiling helper a shot. Despite being simple, it still shows you the time spent in various parts of the Graph.

Environment::getInstance().setProfiling(true);   // enable graph profiling globally
auto graph = GraphExecutioner::importFromFlatBuffers("./resources/ae_00.fb");

auto profile = GraphProfilingHelper::profile(graph, 1000);   // run 1000 profiled iterations
profile->printOut();

delete graph;

After 1000 iterations you'll get statistics printed out. The statistics basically include the time spent in various parts of the code and memory allocation details.

Here's what it will look like:

Printing out Graph...
8. matmul; Inputs: [{1:0}, {2:0}]; 
9. biasadd; Inputs: [{8:0}, {3:0}]; 
10. TRANSFORM:{15}; Inputs: [{9:0}]; 
11. rank; Inputs: [{2:0}]; 
12. subtract; Inputs: [{11:0}, {4:0}]; 
13. range; Inputs: [{5:0}, {11:0}, {6:0}]; 
14. subtract; Inputs: [{12:0}, {13:0}]; 
15. transpose; Inputs: [{2:0}, {14:0}]; 
16. matmul; Inputs: [{10:0}, {15:0}]; 
17. biasadd; Inputs: [{16:0}, {7:0}]; 
18. TRANSFORM:{15}; Inputs: [{17:0}]; 

Printing out Scopes...
Graph profile: 1000 executions

Memory:
ACT: 0; TMP: 0; OBJ: 0; TTL: 1788;

Time:
Construction time: 2135 ns;
Execution time: 41820 ns;

Per-node reports:
Node: <8:MatMul>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 200;
      Time: PREP: 1160 ns; EXEC: 3167 ns; TTL: 5929 ns;
      PREP: INPUT: 251 ns; SHAPE: 382 ns; ARRAY: 217 ns;
Node: <9:BiasAdd>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 104;
      Time: PREP: 917 ns; EXEC: 3580 ns; TTL: 5957 ns;
      PREP: INPUT: 220 ns; SHAPE: 213 ns; ARRAY: 217 ns;
Node: <10:Tanh>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 104;
      Time: PREP: 756 ns; EXEC: 241 ns; TTL: 1927 ns;
      PREP: INPUT: 140 ns; SHAPE: 195 ns; ARRAY: 205 ns;
Node: <11:transpose/Rank>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 36;
      Time: PREP: 522 ns; EXEC: 119 ns; TTL: 1403 ns;
      PREP: INPUT: 109 ns; SHAPE: 69 ns; ARRAY: 171 ns;
Node: <12:transpose/sub>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 36;
      Time: PREP: 666 ns; EXEC: 185 ns; TTL: 1684 ns;
      PREP: INPUT: 192 ns; SHAPE: 94 ns; ARRAY: 168 ns;
Node: <13:transpose/Range>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 556;
      Time: PREP: 808 ns; EXEC: 647 ns; TTL: 2416 ns;
      PREP: INPUT: 297 ns; SHAPE: 228 ns; ARRAY: 181 ns;
Node: <14:transpose/sub_1>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 56;
      Time: PREP: 721 ns; EXEC: 541 ns; TTL: 2205 ns;
      PREP: INPUT: 23 ns; SHAPE: 92 ns; ARRAY: 165 ns;
Node: <15:transpose>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 96;
      Time: PREP: 3936 ns; EXEC: 602 ns; TTL: 5811 ns;
      PREP: INPUT: 194 ns; SHAPE: 3241 ns; ARRAY: 257 ns;
Node: <16:MatMul_1>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 312;
      Time: PREP: 970 ns; EXEC: 3565 ns; TTL: 6066 ns;
      PREP: INPUT: 203 ns; SHAPE: 320 ns; ARRAY: 193 ns;
Node: <17:BiasAdd_1>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 144;
      Time: PREP: 914 ns; EXEC: 3528 ns; TTL: 5870 ns;
      PREP: INPUT: 231 ns; SHAPE: 191 ns; ARRAY: 223 ns;
Node: <18:output>
      Memory: ACT: 0; TMP: 0; OBJ: 0; TTL: 144;
      Time: PREP: 805 ns; EXEC: 285 ns; TTL: 1928 ns;
      PREP: INPUT: 157 ns; SHAPE: 192 ns; ARRAY: 232 ns;

Special timers:
No special timers were set

Roadmap

In the short to medium term, the following improvements are expected:

  • CUDA support for all new ops

  • Support for additional data types: int, long long, q types, bool

  • Sparse tensor support
