How to Add Operations

There's multiple different Ops designs supported in libND4j, and in this guide we'll try to explain how to build your very own operation.

XYZ operations

This kind of operations is actually split into multiple subtypes, based on element-access and result type:

  • Transform operations: These operations typically take some NDArray in, and change each element independent of others.

  • Reduction operations: These operations take some NDArray and dimensions, and return reduced NDArray (or scalar) back. I.e. sum along dimension(s).

  • Scalar operations: These operations are similar to transforms, but they only do arithmetic operations, and second operand is scalar. I.e. each element in given NDArray will add given scalar value.

  • Pairwise operations: These operations are between regular transform opeartions and scalar operations. I.e. element-wise addition of two NDArrays.

  • Random operations: Most of these operations related to random numbers distributions: Uniform, Gauss, Bernoulli etc.

Despite differences between these operations, they are all using XZ/XYZ three-operand design, where X and Y are inputs, and Z is output. Data access in these operations is usually trivial, and loop based. I.e. most trivial loop for scalar transform will look like this:

for (Nd4jLong i = start; i < end; i++) {
    result[i] = OpType::op(x[i], scalar, extraParams);

Operation used in this loop will be template-driven, and compiled statically. There are another loops implementation, depending on op group or strides within NDArrays, but idea will be the same all the time: each element of the NDArray will be accessed within loop.

Now, let's take a look into typical XYZ op implementation. Here's how Add operation will look like:

template<typename T>
class Add {
    op_def static T op(T d1, T d2) {
        return d1 + d2;

    // this signature will be used in Scalar loops
    op_def static T op(T d1, T d2, T *params) {
        return d1 + d2;

    // this signature will be used in reductions
    op_def static T op(T d1) {
        return d1;

    // op for MetaOps
    op_def static T op(T d1, T *params) {
        return d1 + params[0];

This particular operation is used in different XYZ op groups, but you see the idea: element-wise operation, which is invoked on each element in given NDArray. So, if you want to add new XYZ operation to libnd4j, you should just add operation implementation to file includes/ops/ops.h, and assign it to specific ops group in file includes/loops/legacy_ops.h together with some number unique to this ops group, i.e.: (21, simdOps::Add)

After libnd4j is recompiled, this op will become available for legacy execution mechanism, NDArray wrappers, and LegacyOp wrappers (those are made to map legacy operations to CustomOps design for Graph).

Custom operations

Custom operations is a new concept, added recently and mostly suits SameDiff/Graph needs. For CustomOps we defined universal signature, with variable number of input/output NDArrays, and variable number of floating-point and integer arguments.

However, there are some minor difference between various CustomOp declarations:

  • DECLARE_OP(string, int, int, bool): these operations take no fp/int arguments, and output shape equals to input shape.

  • DECLARE_CONFIGURABLE_OP(string, int, int, bool, int, int): these operations do take fp/int output arguments, and output shape equals to input shape.

  • DECLARE_REDUCTION_OP(string, int, int, bool, int, int): these operations do take fp/int output arguments, and output shape is calculated as Reduction.

  • DECLARE_CUSTOM_OP(string, int, int, bool, int, int): these operations return NDArray with custom shape, that usually depends on input and arguments.

  • DECLARE_BOOLEAN_OP(string, int, bool): these operations take some NDArrays and return scalar, where 0 is False, and other values are treated as True.

Let's take a look at example CustomOp:

CUSTOM_OP_IMPL(tear, 1, -1, false, 0, -1) {
    auto input = INPUT_VARIABLE(0);

    REQUIRE_TRUE(!block.getIArguments()->empty(), 0, "At least 1 dimension should be specified for Tear");

    std::vector<int> dims(*block.getIArguments());

    for (auto &v: dims)
        REQUIRE_TRUE(v >= 0 && v < input->rankOf(), 0, "Tear dimensions should be non-negative values, and lower then input rank. Got %i instead", v);

    auto tads = input->allTensorsAlongDimension(dims);
    for (int e = 0; e < tads->size(); e++) {
        auto outE = OUTPUT_VARIABLE(e);

        this->storeResult(block, e, *outE);

    delete tads;

    return ND4J_STATUS_OK;

    auto inShape = inputShape->at(0);

    std::vector<int> dims(*block.getIArguments());

    if (dims.size() > 1)
        std::sort(dims.begin(), dims.end());

    shape::TAD tad(inShape,, (int) dims.size());
    Nd4jLong numTads = shape::tadLength(inShape,, (int) dims.size());

    auto result = SHAPELIST();
    for (int e = 0; e < numTads; e++) {
        int *newShape;
        COPY_SHAPE(tad.tadOnlyShapeInfo, newShape);

    return result;

In the example above, we declare tear CustomOp implementation, and shape function for this op. So, at the moment of op execution, we assume that we will either have output array(s) provided by end-user, or they will be generated with shape function.

You can also see number of macros used, we'll cover those later as well. Beyond that - op execution logic is fairly simple & linear: Each new op implements protected member function DeclarableOp<T>::validateAndExecute(Block<T>& block), and this method is eventually called either from GraphExecutioner, or via direct call, like DeclarableOp<T>::execute(Block<T>& block).

Important part of op declaration is input/output description for the op. I.e. as shown above: CUSTOM_OP_IMPL(tear, 1, -1, false, 0, -1). This declaration means:

  • Op name: tear

  • Op expects at least 1 NDArray as input

  • Op returns unknown positive number of NDArrays as output

  • Op can't be run in-place, so under any circumstances original NDArray will stay intact

  • Op doesn't expect any T (aka floating point) arguments

  • Op expects unknown positive number of integer arguments. In case of this op it's dimensions to split input NDArray.

Here's another example: DECLARE_CUSTOM_OP(permute, 1, 1, true, 0, -2); This declaration means:

  • Op name: permute

  • Op expects at least 1 NDArray as input

  • Op returns 1 NDArray as output

  • Op can be run in-place if needed (it means: input == output, and input is modified and returned as output)

  • Op doesn't expect any T arguments

  • Op expects unknown number of integer arguments OR no integer arguments at all.

Note on parameters: Negative values (-1,-2) mean very specific things. When op validation is invoked (checking the parameters) either the exact number of parameters in the descriptor must be present for each type or the following:

  • -1 means at least 1 of the expected parameter will be present

  • -2 means an unknown number of parameters. Use this in situations where inputs of certain types maybe optional. A common use case is when a parameter maybe passed in as an ndrray or as a TARG or IARG (floating point or integer arguments respectively)

c++11 syntactic sugar

In ops you can easily use c++11 features, including lambdas. In some cases it might be easiest way to build your custom op (or some part of it) via NDArray::applyLambda or NDArray::applyPairwiseLambda:

auto lambda = LAMBDA_TT(_x, _y) {
    return (_x + _y) * 2;

x.applyPairwiseLambda(&y, lambda);

In this simple example, each element of NDArray x will get values set to x[e] = (x[e] + y[e]) * 2.


For tests libnd4j uses Google Tests suit. All tests are located at tests_cpu/layers_tests folder. Here's simple way to run those from command line:

cd tests_cpu
cmake -G "Unix Makefiles"
make -j 4

You can also use your IDE (i.e. Jetbrains CLion) to run tests via GUI.

PLEASE NOTE: if you're considering submitting your new op to libnd4j repository via pull request - consider adding tests for it. Ops without tests won't be approved.

Backend-specific operation

GPU/MPI/whatever to be added soon.

Utility macros

We have number of utility macros, suitable for custom ops. Here they are:

  • INPUT_VARIABLE(int): this macro returns you NDArray at specified input index.

  • OUTPUT_VARIABLE(int): this macro returns you NDArray at specified output index.

  • STORE_RESULT(NDArray): this macro stores result to VariableSpace.

  • STORE_2_RESULTS(NDArray, NDArray): this macro stores results accordingly to VariableSpace.

  • INT_ARG(int): this macro returns you specific Integer argument passed to the given op.

  • T_ARG(int): this macro returns you specific T argument passed to the given op.

  • ALLOCATE(...): this macro check if Workspace is available, and either uses Workspace or direct memory allocation if Workspace isn't available.

  • RELEASE(...): this macro is made to release memory allocated with ALLOCATE() macro.

  • REQUIRE_TRUE(...): this macro takes condition, and evaluates it. If evaluation doesn't end up as True - exception is raised, and specified message is printed out.

  • LAMBDA_T(X) and LAMBDA_TT(X, Y): lambda declaration for NDArray::applyLambda and NDArray::applyPairwiseLambda

  • COPY_SHAPE(SRC, TGT): this macro allocates memory for TGT pointer and copies shape from SRC pointer

  • ILAMBDA_T(X) and ILAMBDA_TT(X, Y): lambda declaration for indexed lambdas, index argument is passed in as Nd4jLong (aka long long)

  • FORCEINLINE: platform-specific definition for functions inlining

Explicit template instantiations in helper methods.

We should explicitly instantiate template methods for different data types in libraries. Furethemore, to speed up parallel compilation we need to add those template instantiations in separate source files. Besides, another reason is that: some compilers are choked when these template instantiations are many in one translation unit. To ease this cumbersome operation we have Cmake helper and macros helpers. Example: Suppose we have such function:

    template<typename X, typename Z>
    void  argMin_(const NDArray& input, NDArray& output, const std::vector<int>& dimensions);

We should write this to explicitly instantiate it.

BUILD_DOUBLE_TEMPLATE(template void argMin_, (const NDArray& input, NDArray& output, const std::vector<int>& dimensions),


  • LIBND4J_TYPES means we want to use all types in the place of X

  • INDEXING_TYPES means we will use index types ( int, int64_t) as Z type

But to speed up compilation process and also helping compilers we can further separate it into different source files. Firstly we rename the original template source with hpp extension: Secondly we add file with the suffix (or for cuda) that will include that hpp header and place it in the apropriate compilation units folder. in our case it will be in ./libnd4j/include/ops/declarable/helpers/cpu/compilation_units folder with the name . Later we decide which type we want to separate into different sources. In our case we want to split LIBND4J_TYPES (other ones: INT_TYPE , FLOAT_TYPE, PAIRWISE_TYPE ). We hint cmake that case with this (adding _GEN suffix):

#cmakedefine LIBND4J_TYPE_GEN 

Then we just add _@FL_TYPE_INDEX@ as suffix in type name and it will split those types for us and generate cpp files inside ${CMAKE_BINARY_DIR}/compilation_units folder.


Here how the complete file will look like:

#cmakedefine LIBND4J_TYPE_GEN 
//this header is where our template functions resides
#include <ops/declarable/helpers/cpu/indexReductions.hpp>
namespace sd {
    namespace ops {
        namespace helpers {
            BUILD_DOUBLE_TEMPLATE(template void argMax_, (const NDArray& input, NDArray& output, const std::vector<int>& dimensions), 

Last updated