Concatenates a ReLU which selects only the positive part of the activation with a ReLU which selects only the negative part of the activation. Note that as a result this non-linearity doubles the depth of the activations.
x (NUMERIC) - Input variable
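For intuition, a minimal sketch of the behaviour on a plain 1D array (not the library implementation, which operates on the channel dimension of NDArrays):

```java
// Sketch only: concatenate ReLU(x) with ReLU(-x), doubling the length
// (the real op doubles the channel depth of the activations).
static double[] crelu(double[] x) {
    double[] out = new double[2 * x.length];
    for (int i = 0; i < x.length; i++) {
        out[i] = Math.max(0.0, x[i]);              // positive part
        out[x.length + i] = Math.max(0.0, -x[i]);  // negative part, sign-flipped through ReLU
    }
    return out;
}
```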
Neural network batch normalization operation.
For details, see https://arxiv.org/abs/1502.03167
input (NUMERIC) - Input variable.
mean (NUMERIC) - Mean value. For 1d axis, this should match input.size(axis)
variance (NUMERIC) - Variance value. For 1d axis, this should match input.size(axis)
gamma (NUMERIC) - Gamma value. For 1d axis, this should match input.size(axis)
beta (NUMERIC) - Beta value. For 1d axis, this should match input.size(axis)
epsilon - Epsilon constant for numerical stability (to avoid division by 0)
axis - For 2d CNN activations: 1 for NCHW format activations, or 3 for NHWC format activations.
For 3d CNN activations: 1 for NCDHW format, 4 for NDHWC
For 1d/RNN activations: 1 for NCW format, 2 for NWC (Size: AtLeast(min=1))
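The per-element computation is out = gamma * (x - mean) / sqrt(variance + epsilon) + beta, applied along the chosen axis. A minimal sketch for the 1d case, assuming mean/variance/gamma/beta all have length input.size(axis):

```java
// Sketch of the batch norm formula for a single feature vector (1d case):
// out = gamma * (x - mean) / sqrt(variance + epsilon) + beta
static double[] batchNorm1d(double[] x, double[] mean, double[] variance,
                            double[] gamma, double[] beta, double epsilon) {
    double[] out = new double[x.length];
    for (int i = 0; i < x.length; i++) {
        double normalized = (x[i] - mean[i]) / Math.sqrt(variance[i] + epsilon);
        out[i] = gamma[i] * normalized + beta[i];
    }
    return out;
}
```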
Bias addition operation: a special case of addition, typically used with CNN 4D activations and a 1D bias vector
input (NUMERIC) - 4d input variable
bias (NUMERIC) - 1d bias
nchw - The format - nchw=true means [minibatch, channels, height, width] (NCHW) format; nchw=false means [minibatch, height, width, channels] (NHWC) format.
Unused for 2d inputs
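As an illustration of the broadcast, a sketch for the nchw=true case, where bias[c] is added to every element of channel c (plain nested arrays rather than NDArrays):

```java
// Sketch only: adds a 1d bias per channel to a [minibatch][channels][height][width]
// activation array, in place.
static double[][][][] biasAddNchw(double[][][][] in, double[] bias) {
    for (int n = 0; n < in.length; n++)
        for (int c = 0; c < in[n].length; c++)
            for (int h = 0; h < in[n][c].length; h++)
                for (int w = 0; w < in[n][c][h].length; w++)
                    in[n][c][h][w] += bias[c];
    return in;
}
```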
This operation performs dot product attention on the given timeseries input with the given queries
out = sum(similarity(k_i, q) * v_i)
similarity(k, q) = softmax(k * q) where k * q is the dot product of k and q
Optionally with normalization step:
similarity(k, q) = softmax(k * q / sqrt(size(q)))
See also "Attention is all you need" (https://arxiv.org/abs/1706.03762, p. 4, eq. 1)
Note: This supports multiple queries at once. If only one query is available, the queries array still has to
be rank 3, but can have queryCount = 1.
Note: keys and values are usually the same array. If you want to use the same array for both, simply pass it
for both.
Note: Queries, keys and values must either be all rank 3 or all rank 4 arrays. Mixing them doesn't work. The
output rank will depend on the input rank.
queries (NUMERIC) - input 3D array "queries" of shape [batchSize, featureKeys, queryCount]
or 4D array of shape [batchSize, numHeads, featureKeys, queryCount]
keys (NUMERIC) - input 3D array "keys" of shape [batchSize, featureKeys, timesteps]
or 4D array of shape [batchSize, numHeads, featureKeys, timesteps]
values (NUMERIC) - input 3D array "values" of shape [batchSize, featureValues, timesteps]
or 4D array of shape [batchSize, numHeads, featureValues, timesteps]
mask (NUMERIC) - OPTIONAL; array of shape [batchSize, timesteps] that defines which values should be skipped
scaled - Normalization: false -> do not apply normalization, true -> apply normalization
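A minimal sketch of the computation for a single query vector against one batch element and one head, without masking (keys[t] and values[t] are the key/value vectors at timestep t):

```java
// Sketch of dot product attention for one query:
// out = sum_t softmax(k_t . q [/ sqrt(size(q))])_t * v_t
static double[] dotProductAttention(double[] q, double[][] keys, double[][] values, boolean scaled) {
    int timesteps = keys.length;
    double[] scores = new double[timesteps];
    double max = Double.NEGATIVE_INFINITY;
    for (int t = 0; t < timesteps; t++) {
        double dot = 0;
        for (int f = 0; f < q.length; f++) dot += keys[t][f] * q[f];
        scores[t] = scaled ? dot / Math.sqrt(q.length) : dot;   // optional normalization
        max = Math.max(max, scores[t]);
    }
    double sum = 0;                                             // softmax over timesteps
    for (int t = 0; t < timesteps; t++) { scores[t] = Math.exp(scores[t] - max); sum += scores[t]; }
    double[] out = new double[values[0].length];
    for (int t = 0; t < timesteps; t++)
        for (int f = 0; f < out.length; f++) out[f] += (scores[t] / sum) * values[t][f];
    return out;
}
```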
Dropout operation
input (NUMERIC) - Input array
inputRetainProbability - Probability of retaining an input (set to 0 with probability 1-p)
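A sketch of the element-wise behaviour at training time (whether the retained values are additionally rescaled by 1/p is not stated here and is left out of the sketch):

```java
// Sketch only: keep each element with probability inputRetainProbability,
// otherwise set it to 0.
static double[] dropout(double[] in, double inputRetainProbability, java.util.Random rng) {
    double[] out = new double[in.length];
    for (int i = 0; i < in.length; i++)
        out[i] = rng.nextDouble() < inputRetainProbability ? in[i] : 0.0;
    return out;
}
```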
Element-wise exponential linear unit (ELU) function:
out = x if x > 0
out = a * (exp(x) - 1) if x <= 0
with constant a = 1.0
See: https://arxiv.org/abs/1511.07289
x (NUMERIC) - Input variable
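A sketch of the formula above with a = 1.0:

```java
// ELU: identity for positive inputs, a * (exp(x) - 1) for non-positive inputs
static double elu(double x) {
    final double a = 1.0;
    return x > 0 ? x : a * (Math.exp(x) - 1);
}
```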
GELU activation function - Gaussian Error Linear Units
For more details, see Gaussian Error Linear Units (GELUs) - https://arxiv.org/abs/1606.08415
This method uses the sigmoid approximation
x (NUMERIC) - Input variable
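One common form of the sigmoid approximation is x * sigmoid(1.702 * x); the exact constant used internally is an assumption in this sketch:

```java
// Sketch of the sigmoid approximation to GELU (the constant 1.702 is assumed)
static double geluSigmoidApprox(double x) {
    double sigmoid = 1.0 / (1.0 + Math.exp(-1.702 * x));
    return x * sigmoid;
}
```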
Element-wise hard sigmoid function:
out[i] = 0 if in[i] <= -2.5
out[i] = 0.2*in[i]+0.5 if -2.5 < in[i] < 2.5
out[i] = 1 if in[i] >= 2.5
x (NUMERIC) - Input variable
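A sketch of the piecewise formula:

```java
// Hard sigmoid: clamp to [0, 1] with a linear ramp of slope 0.2 in between
static double hardSigmoid(double x) {
    if (x <= -2.5) return 0.0;
    if (x >= 2.5) return 1.0;
    return 0.2 * x + 0.5;
}
```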
Element-wise hard tanh function:
out[i] = -1 if in[i] <= -1
out[i] = in[i] if -1 < in[i] < 1
out[i] = 1 if in[i] >= 1
x (NUMERIC) - Input variable
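A sketch of the piecewise formula:

```java
// Hard tanh: identity on (-1, 1), clamped to -1 / 1 outside
static double hardTanh(double x) {
    if (x <= -1.0) return -1.0;
    if (x >= 1.0) return 1.0;
    return x;
}
```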
Derivative (dOut/dIn) of the element-wise hard Tanh function - hardTanh(INDArray)
x (NUMERIC) - Input variable
Apply Layer Normalization
y = gain * standardize(x) + bias
input (NUMERIC) - Input variable
gain (NUMERIC) - Gain
bias (NUMERIC) - Bias
channelsFirst - Unused for 2D input. True for NCHW (minibatch, channels, height, width) data, false for NHWC data
dimensions - Dimensions to perform layer norm over - dimension=1 for 2d/MLP data, dimension=1,2,3 for CNNs (Size: AtLeast(min=1))
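A minimal sketch for the 2d/MLP case (standardize over dimension 1, i.e. the features of each row); the epsilon used for numerical stability is an assumption of this sketch:

```java
// Layer norm sketch: y = gain * (x - mean) / sqrt(var + eps) + bias,
// with mean/var computed per row over the feature dimension.
static double[][] layerNorm(double[][] x, double[] gain, double[] bias, double eps) {
    double[][] out = new double[x.length][];
    for (int n = 0; n < x.length; n++) {
        double mean = 0, var = 0;
        for (double v : x[n]) mean += v;
        mean /= x[n].length;
        for (double v : x[n]) var += (v - mean) * (v - mean);
        var /= x[n].length;
        out[n] = new double[x[n].length];
        for (int f = 0; f < x[n].length; f++)
            out[n][f] = gain[f] * (x[n][f] - mean) / Math.sqrt(var + eps) + bias[f];
    }
    return out;
}
```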
Element-wise leaky ReLU function:
out = x if x >= 0.0
out = alpha * x if x < 0.0
Alpha value is most commonly set to 0.01
x (NUMERIC) - Input variable
alpha - Scaling factor (slope) for negative inputs - commonly 0.01
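A sketch of the formula:

```java
// Leaky ReLU: identity for non-negative inputs, alpha * x otherwise
static double leakyRelu(double x, double alpha) {
    return x >= 0.0 ? x : alpha * x;
}
```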
Leaky ReLU derivative: dOut/dIn given input.
x (NUMERIC) - Input variable
alpha - Scaling factor (slope) for negative inputs - commonly 0.01
Linear layer operation: out = mmul(in,w) + bias
Note that the bias array is optional
input (NUMERIC) - Input data
weights (NUMERIC) - Weights variable, shape [nIn, nOut]
bias (NUMERIC) - Optional bias variable (may be null)
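A sketch of the computation on plain 2d arrays, with input [batch, nIn], weights [nIn, nOut] and an optional bias of length nOut:

```java
// Linear layer sketch: out = in (matrix-multiply) weights + bias
static double[][] linear(double[][] in, double[][] weights, double[] bias) {
    int batch = in.length, nIn = weights.length, nOut = weights[0].length;
    double[][] out = new double[batch][nOut];
    for (int b = 0; b < batch; b++)
        for (int o = 0; o < nOut; o++) {
            double sum = bias == null ? 0.0 : bias[o];   // bias may be null (optional)
            for (int i = 0; i < nIn; i++) sum += in[b][i] * weights[i][o];
            out[b][o] = sum;
        }
    return out;
}
```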
Element-wise log sigmoid function: out[i] = log(sigmoid(in[i]))
x (NUMERIC) - Input variable
Log softmax activation
x (NUMERIC) - Input
Log softmax activation
x (NUMERIC) - Input
dimension - Dimension along which to apply log softmax
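A sketch of a numerically stable log softmax over a single vector (the dimension parameter selects which axis of the full array this is applied along):

```java
// log softmax: x[i] - log(sum_j exp(x[j])), computed with the max subtracted
// for numerical stability
static double[] logSoftmax(double[] x) {
    double max = Double.NEGATIVE_INFINITY;
    for (double v : x) max = Math.max(max, v);
    double sum = 0;
    for (double v : x) sum += Math.exp(v - max);
    double logSum = max + Math.log(sum);
    double[] out = new double[x.length];
    for (int i = 0; i < x.length; i++) out[i] = x[i] - logSum;
    return out;
}
```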
This performs multi-headed dot product attention on the given timeseries input
out = concat(head_1, head_2, ..., head_n) * Wo
head_i = dot_product_attention(Wq_i * q, Wk_i * k, Wv_i * v)
Optionally with normalization when calculating the attention for each head.
See also "Attention is all you need" (https://arxiv.org/abs/1706.03762, pp. 4,5, "3.2.2 Multi-Head Attention")
This makes use of dot_product_attention OP support for rank 4 inputs.
see dotProductAttention(INDArray, INDArray, INDArray, INDArray, boolean, boolean)
queries (NUMERIC) - input 3D array "queries" of shape [batchSize, featureKeys, queryCount]
keys (NUMERIC) - input 3D array "keys" of shape [batchSize, featureKeys, timesteps]
values (NUMERIC) - input 3D array "values" of shape [batchSize, featureValues, timesteps]
Wq (NUMERIC) - input query projection weights of shape [numHeads, projectedKeys, featureKeys]
Wk (NUMERIC) - input key projection weights of shape [numHeads, projectedKeys, featureKeys]
Wv (NUMERIC) - input value projection weights of shape [numHeads, projectedValues, featureValues]
Wo (NUMERIC) - output projection weights of shape [numHeads * projectedValues, outSize]
mask (NUMERIC) - OPTIONAL; array of shape [batchSize, timesteps] that defines which values should be skipped
scaled - Normalization: false -> do not apply normalization, true -> apply normalization
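A sketch of the head-wise computation for a single batch element and a single query, reusing the dotProductAttention sketch shown earlier on this page; masking is omitted and the array layouts are simplified:

```java
// Multi-head attention sketch: project q/k/v per head, attend, concatenate
// the heads, then apply the output projection Wo.
// Wq/Wk/Wv: [numHeads][projected][feature]; Wo: [numHeads * projectedValues][outSize]
static double[] multiHeadAttention(double[] q, double[][] keys, double[][] values,
                                   double[][][] Wq, double[][][] Wk, double[][][] Wv,
                                   double[][] Wo, boolean scaled) {
    int numHeads = Wq.length;
    int projectedValues = Wv[0].length;
    double[] concat = new double[numHeads * projectedValues];
    for (int h = 0; h < numHeads; h++) {
        double[] qh = matVec(Wq[h], q);                 // project query for this head
        double[][] kh = new double[keys.length][];
        double[][] vh = new double[values.length][];
        for (int t = 0; t < keys.length; t++) {
            kh[t] = matVec(Wk[h], keys[t]);             // project keys
            vh[t] = matVec(Wv[h], values[t]);           // project values
        }
        double[] head = dotProductAttention(qh, kh, vh, scaled);
        System.arraycopy(head, 0, concat, h * projectedValues, projectedValues);
    }
    double[] out = new double[Wo[0].length];            // out = concat * Wo
    for (int o = 0; o < out.length; o++)
        for (int i = 0; i < concat.length; i++) out[o] += concat[i] * Wo[i][o];
    return out;
}

static double[] matVec(double[][] m, double[] v) {      // m: [rows][cols], v: [cols]
    double[] out = new double[m.length];
    for (int r = 0; r < m.length; r++)
        for (int c = 0; c < v.length; c++) out[r] += m[r][c] * v[c];
    return out;
}
```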
Padding operation
input (NUMERIC) - Input tensor
padding (NUMERIC) - Padding value
PadMode - Padding format - default = CONSTANT
constant - Padding constant
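For intuition, a sketch of CONSTANT-mode padding on a 1D array (the real op takes a padding array giving before/after amounts per dimension):

```java
// Pad a 1D array with `constant` on both sides
static double[] padConstant(double[] in, int padBefore, int padAfter, double constant) {
    double[] out = new double[padBefore + in.length + padAfter];
    java.util.Arrays.fill(out, constant);
    System.arraycopy(in, 0, out, padBefore, in.length);
    return out;
}
```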
GELU activation function - Gaussian Error Linear Units
For more details, see Gaussian Error Linear Units (GELUs) - https://arxiv.org/abs/1606.08415
This method uses the precise method
x (NUMERIC) - Input variable
PReLU (Parameterized Rectified Linear Unit) operation. Like LeakyReLU with a learnable alpha:
out[i] = in[i] if in[i] >= 0
out[i] = in[i] * alpha[i] otherwise
sharedAxes allows you to share learnable parameters along axes.
For example, if the input has shape [batchSize, channels, height, width]
and you want each channel to have its own cutoff, use sharedAxes = [2, 3] and an
alpha with shape [channels].
input (NUMERIC) - Input data
alpha (NUMERIC) - The cutoff variable. Note that the leading (0th) dimension, whether or not it is a batch dimension, should not be part of alpha.
sharedAxes - Which axes to share cutoff parameters along. (Size: AtLeast(min=1))
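A sketch of the element-wise formula for a simplified [batchSize, channels] input with a per-channel alpha of shape [channels]; sharedAxes only changes how alpha is broadcast, not the formula itself:

```java
// PReLU sketch: identity for non-negative inputs, in * alpha[c] otherwise
static double[][] prelu(double[][] in, double[] alpha) {
    double[][] out = new double[in.length][in[0].length];
    for (int n = 0; n < in.length; n++)
        for (int c = 0; c < in[n].length; c++)
            out[n][c] = in[n][c] >= 0 ? in[n][c] : in[n][c] * alpha[c];
    return out;
}
```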
Element-wise rectified linear function with specified cutoff:
out[i] = in[i] if in[i] >= cutoff
out[i] = 0 otherwise
x (NUMERIC) - Input
cutoff - Cutoff value for ReLU operation - x > cutoff ? x : 0. Usually 0
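A sketch of the formula:

```java
// ReLU with explicit cutoff (cutoff is usually 0)
static double relu(double x, double cutoff) {
    return x >= cutoff ? x : 0.0;
}
```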
Element-wise "rectified linear 6" function with specified cutoff:
out[i] = min(max(in[i], cutoff), 6)
x (NUMERIC) - Input
cutoff - Cutoff value for ReLU operation. Usually 0
ReLU (Rectified Linear Unit) layer operation: out = relu(mmul(in,w) + bias)
Note that the bias array is optional
input (NUMERIC) - Input data
weights (NUMERIC) - Weights variable
bias (NUMERIC) - Optional bias variable (may be null)
Element-wise SELU function - Scaled Exponential Linear Unit: see Self-Normalizing Neural Networks
out[i] = scale * in[i] if in[i] > 0, or scale * alpha * (exp(in[i]) - 1) if in[i] <= 0
Uses default scale and alpha values.
x (NUMERIC) - Input variable
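A sketch using the default scale and alpha constants from the Self-Normalizing Neural Networks paper (the library's exact defaults are assumed to match):

```java
// SELU: scale * x for positive inputs, scale * alpha * (exp(x) - 1) otherwise
static double selu(double x) {
    final double alpha = 1.6732632423543772;   // assumed default
    final double scale = 1.0507009873554805;   // assumed default
    return x > 0 ? scale * x : scale * alpha * Math.expm1(x);
}
```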
Element-wise sigmoid function: out[i] = 1.0/(1+exp(-in[i]))
x (NUMERIC) - Input variable
Element-wise sigmoid function derivative: dL/dIn given input and dL/dOut
x (NUMERIC) - Input Variable
wrt (NUMERIC) - Gradient at the output - dL/dOut. Must have same shape as the input
Softmax activation, along the specified dimension
x (NUMERIC) - Input
dimension - Dimension along which to apply softmax - default = -1
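A sketch of a numerically stable softmax over a single vector (the dimension parameter selects the axis this is applied along; -1 means the last dimension):

```java
// softmax: exp(x[i] - max) / sum_j exp(x[j] - max)
static double[] softmax(double[] x) {
    double max = Double.NEGATIVE_INFINITY;
    for (double v : x) max = Math.max(max, v);
    double sum = 0;
    double[] out = new double[x.length];
    for (int i = 0; i < x.length; i++) { out[i] = Math.exp(x[i] - max); sum += out[i]; }
    for (int i = 0; i < x.length; i++) out[i] /= sum;
    return out;
}
```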