AlexNet
DL4J's AlexNet model is an interpretation of the original paper, ImageNet Classification with Deep Convolutional Neural Networks, and the referenced imagenetExample code. References: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf https://github.com/BVLC/caffe/blob/master/models/bvlc_alexnet/train_val.prototxt
The model is built in DL4J based on available functionality; notes indicate where gaps remain pending enhancements.
Bias initialization in the paper is 1 in certain layers, but 0.1 in the imagenetExample code. The weight distribution uses a standard deviation of 0.1 for all layers in the paper, but 0.005 for the dense layers in the imagenetExample code.
Darknet19
Reference: https://arxiv.org/pdf/1612.08242.pdf ImageNet weights for this model are available and have been converted from https://pjreddie.com/darknet/imagenet/ using https://github.com/allanzelener/YAD2K .
There are 2 pretrained models, one for 224x224 images and one fine-tuned for 448x448 images. Call setInputShape() with either {3, 224, 224} or {3, 448, 448} before initialization. The channels of the input images need to be in RGB order (not BGR), with values normalized within [0, 1]. The output labels are as per https://github.com/pjreddie/darknet/blob/master/data/imagenet.shortnames.list .
A variant of the original FaceNet model that relies on embeddings and triplet loss. Reference: https://arxiv.org/abs/1503.03832 Also based on the OpenFace implementation: http://reports-archive.adm.cs.cmu.edu/anon/2016/CMU-CS-16-118.pdf
LeNet was an early convolutional network that achieved promising results on handwritten digit recognition (MNIST).
MNIST weights for this model are available and have been converted from https://github.com/f00-/mnist-lenet-keras.
Implementation of NASNet-A in Deeplearning4j. NASNet refers to Neural Architecture Search Network, a family of models that were designed automatically by learning the model architectures directly on the dataset of interest.
This implementation uses 1056 penultimate filters and an input shape of (3, 224, 224). You can change this.
Paper: https://arxiv.org/abs/1707.07012 ImageNet weights for this model are available and have been converted from https://keras.io/applications/.
Residual networks for deep learning.
Paper: https://arxiv.org/abs/1512.03385 ImageNet weights for this model are available and have been converted from https://keras.io/applications/.
A simple convolutional network for generic image classification. Reference: https://github.com/oarriaga/face_classification/
SqueezeNet
An implementation of SqueezeNet, which claims accuracy similar to AlexNet with a fraction of the parameters.
Paper: https://arxiv.org/abs/1602.07360 ImageNet weights for this model are available and have been converted from https://github.com/rcmalli/keras-squeezenet/.
LSTM designed for text generation. Can be trained on a corpus of text. For this model, numClasses is the number of distinct characters (the size of the character vocabulary).
Architecture follows this implementation: https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py
Walt Whitman weights are available for generating text from his works, adapted from https://github.com/craigomac/InfiniteMonkeys.
Tiny YOLO
Reference: https://arxiv.org/pdf/1612.08242.pdf
ImageNet+VOC weights for this model are available and have been converted from https://pjreddie.com/darknet/yolo using https://github.com/allanzelener/YAD2K and the following code.
```java
String filename = "tiny-yolo-voc.h5";
ComputationGraph graph = KerasModelImport.importKerasModelAndWeights(filename, false);
INDArray priors = Nd4j.create(priorBoxes);   // priorBoxes, seed, iterations and workspaceMode are defined earlier in the conversion script

FineTuneConfiguration fineTuneConf = new FineTuneConfiguration.Builder()
        .seed(seed)
        .iterations(iterations)
        .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
        .gradientNormalization(GradientNormalization.RenormalizeL2PerLayer)
        .gradientNormalizationThreshold(1.0)
        .updater(new Adam.Builder().learningRate(1e-3).build())
        .l2(0.00001)
        .activation(Activation.IDENTITY)
        .trainingWorkspaceMode(workspaceMode)
        .inferenceWorkspaceMode(workspaceMode)
        .build();

ComputationGraph model = new TransferLearning.GraphBuilder(graph)
        .fineTuneConfiguration(fineTuneConf)
        .addLayer("outputs", new Yolo2OutputLayer.Builder()
                .boundingBoxPriors(priors)
                .build(), "conv2d_9")
        .setOutputs("outputs")
        .build();

System.out.println(model.summary(InputType.convolutional(416, 416, 3)));

ModelSerializer.writeModel(model, "tiny-yolo-voc_dl4j_inference.v1.zip", false);
```
The channels of the 416x416 input images need to be in RGB order (not BGR), with values normalized within [0, 1].
U-Net
An implementation of U-Net in Deeplearning4j, a deep learning network for image segmentation. U-Net is a convolutional network architecture for fast and precise segmentation of images. Up to now it has outperformed the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopy stacks.
Paper: https://arxiv.org/abs/1505.04597 Weights are available for image segmentation, trained on a synthetic dataset.
VGG-16, from Very Deep Convolutional Networks for Large-Scale Image Recognition https://arxiv.org/abs/1409.1556
Deep Face Recognition http://www.robots.ox.ac.uk/~vgg/publications/2015/Parkhi15/parkhi15.pdf
ImageNet weights for this model are available and have been converted from https://github.com/fchollet/keras/tree/1.1.2/keras/applications. CIFAR-10 weights for this model are available and have been converted using “approach 2” from https://github.com/rajatvikramsingh/cifar10-vgg16. VGGFace weights for this model are available and have been converted from https://github.com/rcmalli/keras-vggface.
VGG-19, from Very Deep Convolutional Networks for Large-Scale Image Recognition https://arxiv.org/abs/1409.1556 ImageNet weights for this model are available and have been converted from https://github.com/fchollet/keras/tree/1.1.2/keras/applications.
Xception
An implementation of Xception in Deeplearning4j. A novel deep convolutional neural network architecture inspired by Inception, where Inception modules have been replaced with depthwise separable convolutions.
Paper: https://arxiv.org/abs/1610.02357 ImageNet weights for this model are available and have been converted from https://keras.io/applications/.
YOLOv2
Reference: https://arxiv.org/pdf/1612.08242.pdf
ImageNet+COCO weights for this model are available and have been converted from https://pjreddie.com/darknet/yolo using https://github.com/allanzelener/YAD2K and the following code.
The channels of the 608x608 input images need to be in RGB order (not BGR), with values normalized within [0, 1].
pretrainedUrl
Default prior boxes for the model
How to build complex networks with DL4J computation graph.
This page describes how to build more complicated networks, using DL4J's Computation Graph functionality.
DL4J has two types of networks comprised of multiple layers:
The MultiLayerNetwork, which is essentially a stack of neural network layers (with a single input layer and single output layer), and
The ComputationGraph, which allows for greater freedom in network architectures
Specifically, the ComputationGraph allows for networks to be built with the following features:
Multiple network input arrays
Multiple network outputs (including mixed classification/regression architectures)
Layers connected to other layers using a directed acyclic graph connection structure (instead of just a stack of layers)
As a general rule, when building networks with a single input layer, a single output layer, and an input->a->b->c->output type connection structure: MultiLayerNetwork is usually the preferred network. However, everything that MultiLayerNetwork can do, ComputationGraph can do as well - though the configuration may be a little more complicated.
Examples of some architectures that can be built using ComputationGraph include:
Multi-task learning architectures
Recurrent neural networks with skip connections
GoogLeNet, a complex type of convolutional neural network for image classification
The basic idea is that in the ComputationGraph, the core building block is the GraphVertex, instead of layers. Layers (or, more accurately, the LayerVertex objects) are just one type of vertex in the graph. Other types of vertices include:
Input Vertices
Element-wise operation vertices
Merge vertices
Subset vertices
Preprocessor vertices
These types of graph vertices are described briefly below.
LayerVertex: Layer vertices (graph vertices with neural network layers) are added using the .addLayer(String,Layer,String...)
method. The first argument is the label for the layer, and the last arguments are the inputs to that layer. If you need to manually add an InputPreProcessor (usually this is unnecessary - see next section) you can use the .addLayer(String,Layer,InputPreProcessor,String...)
method.
InputVertex: Input vertices are specified by the addInputs(String...)
method in your configuration. The strings used as inputs can be arbitrary - they are user-defined labels, and can be referenced later in the configuration. The number of strings provided defines the number of inputs; the order of the inputs also defines the order of the corresponding INDArrays in the fit methods (or the DataSet/MultiDataSet objects).
ElementWiseVertex: Element-wise operation vertices perform an element-wise operation, such as addition or subtraction, on the activations from one or more other vertices. Thus, the activations used as input for the ElementWiseVertex must all be the same size, and the output size of the element-wise vertex is the same as the input sizes.
MergeVertex: The MergeVertex concatenates/merges the input activations. For example, if a MergeVertex has 2 inputs of size 5 and 10 respectively, then the output size will be 5+10=15 activations. For convolutional network activations, inputs are merged along the channel (depth) dimension: so if the activations from one layer have 4 channels and the other has 5 channels (both with width x height activations), then the output will have (4+5) x width x height activations.
SubsetVertex: The subset vertex allows you to get only part of the activations out of another vertex. For example, to get the first 5 activations out of another vertex with label "layer1", you can use .addVertex("subset1", new SubsetVertex(0,4), "layer1")
: this means that the 0th through 4th (inclusive) activations out of the "layer1" vertex will be used as output from the subset vertex.
PreProcessorVertex: Occasionally, you might want to use the functionality of an InputPreProcessor without that preprocessor being associated with a layer. The PreProcessorVertex allows you to do this.
Finally, it is also possible to define custom graph vertices by implementing both a configuration and implementation class for your custom GraphVertex.
Suppose we wish to build the following recurrent neural network architecture:
For the sake of this example, let's assume our input data is of size 5. Our configuration would be as follows:
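A minimal sketch of such a configuration; the updater, loss function and layer sizes are illustrative choices rather than prescribed values:

```java
ComputationGraphConfiguration conf = new NeuralNetConfiguration.Builder()
        .updater(new Sgd(0.01))
        .graphBuilder()
        .addInputs("input")                        // any label can be used for the input
        .addLayer("L1", new LSTM.Builder().nIn(5).nOut(5).activation(Activation.TANH).build(), "input")
        .addLayer("L2", new RnnOutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                .activation(Activation.SOFTMAX)
                .nIn(5 + 5).nOut(5).build(), "input", "L1")   // skip connection: L2 sees both "input" and "L1"
        .setOutputs("L2")                          // the network outputs and their order must be specified
        .build();

ComputationGraph net = new ComputationGraph(conf);
net.init();
```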
Note that in the .addLayer(...) methods, the first string ("L1", "L2") is the name of that layer, and the strings at the end (["input"], ["input","L1"]) are the inputs to that layer.
Consider the following architecture:
Here, the merge vertex takes the activations out of layers L1 and L2, and merges (concatenates) them: thus if layers L1 and L2 both have 4 output activations (.nOut(4)) then the output size of the merge vertex is 4+4=8 activations.
To build the above network, we use the following configuration:
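A sketch of such a configuration follows; the input size of 3 and the output layer settings are illustrative assumptions:

```java
ComputationGraphConfiguration conf = new NeuralNetConfiguration.Builder()
        .updater(new Sgd(0.01))
        .graphBuilder()
        .addInputs("input")
        .addLayer("L1", new DenseLayer.Builder().nIn(3).nOut(4).build(), "input")
        .addLayer("L2", new DenseLayer.Builder().nIn(3).nOut(4).build(), "input")
        .addVertex("merge", new MergeVertex(), "L1", "L2")    // concatenates L1 and L2: 4 + 4 = 8 activations
        .addLayer("out", new OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                .activation(Activation.SOFTMAX)
                .nIn(4 + 4).nOut(3).build(), "merge")
        .setOutputs("out")
        .build();
```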
In multi-task learning, a neural network is used to make multiple independent predictions. Consider for example a simple network used for both classification and regression simultaneously. In this case, we have two output layers, "out1" for classification, and "out2" for regression.
In this case, the network configuration is:
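A sketch of such a configuration, assuming an input of size 3, a 3-class classification output and a 2-dimensional regression output:

```java
ComputationGraphConfiguration conf = new NeuralNetConfiguration.Builder()
        .updater(new Sgd(0.01))
        .graphBuilder()
        .addInputs("input")
        .addLayer("L1", new DenseLayer.Builder().nIn(3).nOut(4).activation(Activation.RELU).build(), "input")
        .addLayer("out1", new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                .activation(Activation.SOFTMAX)
                .nIn(4).nOut(3).build(), "L1")   // classification output
        .addLayer("out2", new OutputLayer.Builder(LossFunctions.LossFunction.MSE)
                .activation(Activation.IDENTITY)
                .nIn(4).nOut(2).build(), "L1")   // regression output
        .setOutputs("out1", "out2")
        .build();
```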
One feature of the ComputationGraphConfiguration is that you can specify the types of input to the network, using the .setInputTypes(InputType...)
method in the configuration.
The setInputType method has two effects:
It will automatically add any InputPreProcessors as required. InputPreProcessors are necessary to handle the interaction between, for example, fully connected (dense) and convolutional layers, or between recurrent and fully connected layers.
It will automatically calculate the number of inputs (.nIn(x) config) to a layer. Thus, if you are using the setInputTypes(InputType...)
functionality, it is not necessary to manually specify the .nIn(x) options in your configuration. This can simplify building some architectures (such as convolutional networks with fully connected layers). If the .nIn(x) is specified for a layer, the network will not override this when using the InputType functionality.
For example, if your network has 2 inputs, one being a convolutional input and the other being a feed-forward input, you would use .setInputTypes(InputType.convolutional(height, width, depth), InputType.feedForward(feedForwardInputSize)).
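As a sketch of the single-input case, the configuration below relies on setInputTypes to add the CNN-to-dense preprocessor and to fill in the .nIn(x) values automatically; the 28x28 grayscale input and layer sizes are illustrative:

```java
ComputationGraphConfiguration conf = new NeuralNetConfiguration.Builder()
        .updater(new Sgd(0.01))
        .graphBuilder()
        .addInputs("input")
        .setInputTypes(InputType.convolutional(28, 28, 1))   // height, width, channels
        .addLayer("cnn", new ConvolutionLayer.Builder(5, 5)
                .nOut(16).activation(Activation.RELU).build(), "input")   // no .nIn() needed
        .addLayer("dense", new DenseLayer.Builder().nOut(64).activation(Activation.RELU).build(), "cnn")
        .addLayer("out", new OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                .activation(Activation.SOFTMAX).nOut(10).build(), "dense")
        .setOutputs("out")
        .build();
```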
There are two types of data that can be used with the ComputationGraph.
The DataSet class was originally designed for use with the MultiLayerNetwork; however, it can also be used with ComputationGraph - but only if that computation graph has a single input and output array. For computation graph architectures with more than one input array, or more than one output array, DataSet and DataSetIterator cannot be used (instead, use MultiDataSet/MultiDataSetIterator).
A DataSet object is basically a pair of INDArrays that hold your training data. In the case of RNNs, it may also include masking arrays (see this for more details). A DataSetIterator is essentially an iterator over DataSet objects.
MultiDataSet is the multiple-input and/or multiple-output version of DataSet. It may also include multiple mask arrays (for each input/output array) in the case of recurrent neural networks. As a general rule, you should use DataSet/DataSetIterator unless you are dealing with multiple inputs and/or multiple outputs.
There are currently two ways to use a MultiDataSetIterator:
By implementing the MultiDataSetIterator interface directly
By using the RecordReaderMultiDataSetIterator in conjunction with DataVec record readers
The RecordReaderMultiDataSetIterator provides a number of options for loading data. In particular, the RecordReaderMultiDataSetIterator provides the following functionality:
Multiple DataVec RecordReaders may be used simultaneously
The record readers need not be the same modality: for example, you can use an image record reader with a CSV record reader
It is possible to use a subset of the columns in a RecordReader for different purposes - for example, the first 10 columns in a CSV could be your input, and the last 5 could be your output
It is possible to convert single columns from a class index to a one-hot representation
Some basic examples on how to use the RecordReaderMultiDataSetIterator follow. You might also find these unit tests to be useful.
Suppose we have a CSV file with 5 columns, and we want to use the first 3 as our input, and the last 2 columns as our output (for regression). We can build a MultiDataSetIterator to do this as follows:
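A sketch of that setup follows; the file path and minibatch size are illustrative:

```java
int numLinesToSkip = 0;
RecordReader recordReader = new CSVRecordReader(numLinesToSkip);   // default delimiter is a comma
recordReader.initialize(new FileSplit(new File("/path/to/myData.csv")));   // path is illustrative

int batchSize = 4;
MultiDataSetIterator iterator = new RecordReaderMultiDataSetIterator.Builder(batchSize)
        .addReader("myReader", recordReader)
        .addInput("myReader", 0, 2)    // columns 0 to 2 (inclusive) as input
        .addOutput("myReader", 3, 4)   // columns 3 to 4 (inclusive) as output (regression targets)
        .build();
```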
Suppose we have two separate CSV files, one for our inputs, and one for our outputs. Further suppose we are building a multi-task learning architecture, whereby we have two outputs - one for classification and one for regression. For this example, let's assume the data is as follows:
Input file: myInput.csv, and we want to use all columns as input (without modification)
Output file: myOutput.csv.
Network output 1 - regression: columns 0 to 3
Network output 2 - classification: column 4 is the class index for classification, with 3 classes. Thus column 4 contains integer values [0,1,2] only, and we want to convert these indexes to a one-hot representation for classification.
In this case, we can build our iterator as follows:
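A sketch along those lines, again with illustrative minibatch size (myInput.csv and myOutput.csv are the files described above):

```java
int numLinesToSkip = 0;
RecordReader featuresReader = new CSVRecordReader(numLinesToSkip);
featuresReader.initialize(new FileSplit(new File("myInput.csv")));
RecordReader labelsReader = new CSVRecordReader(numLinesToSkip);
labelsReader.initialize(new FileSplit(new File("myOutput.csv")));

int batchSize = 4;
int numClasses = 3;
MultiDataSetIterator iterator = new RecordReaderMultiDataSetIterator.Builder(batchSize)
        .addReader("csvInput", featuresReader)
        .addReader("csvLabels", labelsReader)
        .addInput("csvInput")                          // all columns of myInput.csv as input
        .addOutput("csvLabels", 0, 3)                  // regression targets: columns 0 to 3
        .addOutputOneHot("csvLabels", 4, numClasses)   // column 4 converted to a one-hot representation
        .build();
```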
Non-linear activation functions for neural network layers.
At a simple level, activation functions help decide whether a neuron should be activated. This helps determine whether the information that the neuron is receiving is relevant for the input. The activation function is a non-linear transformation that happens over an input signal, and the transformed output is sent to the next neuron.
The recommended method to use activations is to add an activation layer in your neural network, and configure your desired activation:
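A minimal sketch of that pattern (layer sizes and the chosen activation are illustrative): the dense layer is left with an identity activation, and a separate ActivationLayer applies the non-linearity.

```java
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .list()
        .layer(0, new DenseLayer.Builder()
                .nIn(784).nOut(128)
                .activation(Activation.IDENTITY)    // non-linearity applied by the next layer
                .build())
        .layer(1, new ActivationLayer.Builder()
                .activation(Activation.LEAKYRELU)
                .build())
        .layer(2, new OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                .activation(Activation.SOFTMAX)
                .nIn(128).nOut(10)
                .build())
        .build();
```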
Rectified tanh
Essentially max(0, tanh(x))
Underlying implementation is in native code
f(x) = alpha * (exp(x) - 1.0) for x < 0; f(x) = x for x >= 0
alpha defaults to 1, if not specified
f(x) = max(0, x)
Rational tanh approximation, from https://arxiv.org/pdf/1508.01292v3
f(x) = 1.7159 * tanh(2x/3), where tanh is approximated as tanh(y) ~ sgn(y) * (1 - 1/(1 + |y| + y^2 + 1.41645*y^4))
Underlying implementation is in native code
Thresholded RELU
f(x) = x for x > theta, f(x) = 0 otherwise. theta defaults to 1.0
f(x) = min(max(input, cutoff), 6)
f(x) = 1 / (1 + exp(-x))
GELU activation function - Gaussian Error Linear Units
Parametrized Rectified Linear Unit (PReLU)
f(x) = alpha x for x < 0, f(x) = x for x >= 0
alpha has the same shape as x and is a learned parameter.
f(x) = x
f_i(x) = x_i / (1 + |x_i|)
f(x) = min(1, max(0, 0.2x + 0.5))
f_i(x) = exp(x_i - shift) / sum_j exp(x_j - shift) where shift = max_i(x_i)
f(x) = x^3
f(x) = max(0,x) + alpha min(0, x)
alpha is drawn from uniform(l, u) during training and is set to (l + u)/2 at test time; l and u default to 1/8 and 1/3 respectively
Empirical Evaluation of Rectified Activations in Convolutional Network
f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
https://arxiv.org/pdf/1706.02515.pdf
Leaky ReLU: f(x) = max(0, x) + alpha * min(0, x); alpha defaults to 0.01
f(x) = x * sigmoid(x)
f(x) = log(1+e^x)
Data iteration tools for loading into neural networks.
A dataset iterator allows for easy loading of data into neural networks and helps organize batching, conversion, and masking. The iterators included in Eclipse Deeplearning4j help with either user-provided data, or automatic loading of common benchmarking datasets such as MNIST and IRIS.
For most use cases, initializing an iterator and passing it to the fit() method of a MultiLayerNetwork or ComputationGraph is all you need to begin a training task:
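For example, a minimal sketch using the built-in MNIST iterator (minibatch size, seed and epoch count are illustrative; model is assumed to be an initialized MultiLayerNetwork or ComputationGraph):

```java
DataSetIterator mnistTrain = new MnistDataSetIterator(128, true, 12345);   // batch size 128, training set
model.fit(mnistTrain, 5);   // train for 5 epochs
```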
Many other methods also accept iterators for tasks such as evaluation:
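For instance, evaluation over the MNIST test set could look like the following sketch:

```java
DataSetIterator mnistTest = new MnistDataSetIterator(128, false, 12345);   // test set
Evaluation eval = model.evaluate(mnistTest);
System.out.println(eval.stats());
```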
MNIST data set iterator - 60000 training digits, 10000 test digits, 10 classes. Digits have 28x28 pixels and 1 channel (grayscale). For further details, see http://yann.lecun.com/exdb/mnist/
UCI synthetic control chart time series dataset. This dataset is useful for classification of univariate time series with six categories: Normal, Cyclic, Increasing trend, Decreasing trend, Upward shift, Downward shift
Details: https://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series Data: https://archive.ics.uci.edu/ml/machine-learning-databases/synthetic_control-mld/synthetic_control.data Image: https://archive.ics.uci.edu/ml/machine-learning-databases/synthetic_control-mld/data.jpeg
UciSequenceDataSetIterator
Create an iterator for the training set, with the specified minibatch size. Randomized with RNG seed 123
param batchSize Minibatch size
CifarDataSetIterator is an iterator for CIFAR-10 dataset - 10 classes, with 32x32 images with 3 channels (RGB)
This fetcher uses a cached version of the CIFAR dataset which is converted to PNG images, see: https://pjreddie.com/projects/cifar-10-dataset-mirror/.
Cifar10DataSetIterator
Create an iterator for the training set, with random iteration order (RNG seed fixed to 123)
param batchSize Minibatch size for the iterator
IrisDataSetIterator: An iterator for the well-known Iris dataset. 4 features, 3 label classes https://archive.ics.uci.edu/ml/datasets/Iris
IrisDataSetIterator
next
IrisDataSetIterator handles traversing through the Iris Data Set.
param batch Batch size
param numExamples Total number of examples
LFW iterator - Labeled Faces in the Wild dataset. See http://vis-www.cs.umass.edu/lfw/ 13233 images total, with 5749 classes.
LFWDataSetIterator
Create LFW data specific iterator
param batchSize the batch size of the examples
param numExamples the overall number of examples
param imgDim an array of height, width and channels
param numLabels the overall number of labels (classes)
param useSubset use a subset of the LFWDataSet
param labelGenerator path label generator to use
param train true if use train value
param splitTrainTest the percentage to split data for train and remainder goes to test
param imageTransform how to transform the image
param rng random number generator, to lock in batch shuffling
Tiny ImageNet is a subset of the ImageNet database. TinyImageNet is the default course challenge for CS231n at Stanford University.
Tiny ImageNet has 200 classes, each consisting of 500 training images. Images are 64x64 pixels, RGB.
See: http://cs231n.stanford.edu/ and https://tiny-imagenet.herokuapp.com/
TinyImageNetDataSetIterator
Create an iterator for the training set, with random iteration order (RNG seed fixed to 123)
param batchSize Minibatch size for the iterator
EMNIST DataSetIterator
COMPLETE: Also known as 'ByClass' split. 814,255 examples total (train + test), 62 classes
MERGE: Also known as 'ByMerge' split. 814,255 examples total. 47 unbalanced classes. Combines lower and upper case characters (that are difficult to distinguish) into one class for each letter (instead of 2), for letters C, I, J, K, L, M, O, P, S, U, V, W, X, Y and Z
BALANCED: 131,600 examples total. 47 classes (equal number of examples in each class)
LETTERS: 145,600 examples total. 26 balanced classes
DIGITS: 280,000 examples total. 10 balanced classes
See: https://www.nist.gov/itl/iad/image-group/emnist-dataset and https://arxiv.org/abs/1702.05373
EmnistDataSetIterator
EMNIST dataset has multiple different subsets. See the EmnistDataSetIterator Javadoc for details.
numExamplesTrain
Create an EMNIST iterator with randomly shuffled data based on a specified RNG seed
param dataSet Dataset (subset) to return
param batchSize Batch size
param train If true: use training set. If false: use test set
param seed Random number generator seed
numExamplesTest
Get the number of test examples for the specified subset
param dataSet Subset to get
return Number of examples for the specified subset
numLabels
Get the number of labels for the specified subset
param dataSet Subset to get
return Number of labels for the specified subset
isBalanced
Get the labels as a character array
return Labels
Record reader dataset iterator: takes a DataVec RecordReader as its source and handles conversion to DataSet objects, as well as producing minibatches from individual records.
RecordReaderDataSetIterator
Constructor for classification, where: (a) the label index is assumed to be the very last Writable/column, and (b) the number of classes is inferred from RecordReader.getLabels() Note that if RecordReader.getLabels() returns null, no output labels will be produced
param recordReader Record reader to use as the source of data
param batchSize Minibatch size, for each call of .next()
setCollectMetaData
Main constructor for classification. This will convert the input class index (at position labelIndex, with integer values 0 to numPossibleLabels-1 inclusive) to the appropriate one-hot output/labels representation.
param recordReader RecordReader: provides the source of the data
param batchSize Batch size (number of examples) for the output DataSet objects
param labelIndex Index of the label Writable (usually an IntWritable), as obtained by recordReader.next()
param numPossibleLabels Number of classes (possible labels) for classification
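As a sketch, a classification iterator over a CSV file might be constructed like this (the file path, label column and class count are illustrative):

```java
RecordReader recordReader = new CSVRecordReader();   // default settings: no lines skipped, comma delimiter
recordReader.initialize(new FileSplit(new File("iris.csv")));   // path is illustrative

int labelIndex = 4;          // class index stored in column 4
int numPossibleLabels = 3;   // values 0, 1, 2
int batchSize = 50;
DataSetIterator iterator = new RecordReaderDataSetIterator(recordReader, batchSize, labelIndex, numPossibleLabels);
```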
loadFromMetaData
Load a single example to a DataSet, using the provided RecordMetaData. Note that it is more efficient to load multiple instances at once, using loadFromMetaData(List)
param recordMetaData RecordMetaData to load from. Should have been produced by the given record reader
return DataSet with the specified example
throws IOException If an error occurs during loading of the data
loadFromMetaData
Load multiple examples to a DataSet, using the provided RecordMetaData instances.
param list List of RecordMetaData instances to load from. Should have been produced by the record reader provided to the RecordReaderDataSetIterator constructor
return DataSet with the specified examples
throws IOException If an error occurs during loading of the data
writableConverter
Builder class for RecordReaderDataSetIterator
maxNumBatches
Optional argument, usually not used. If set, can be used to limit the maximum number of minibatches that will be returned (between resets). If not set, will always return as many minibatches as there is data available.
param maxNumBatches Maximum number of minibatches per epoch / reset
regression
Use this for single output regression (i.e., 1 output/regression target)
param labelIndex Column index that contains the regression target (indexes start at 0)
regression
Use this for multiple output regression (1 or more output/regression targets). Note that all regression targets must be contiguous (i.e., positions x to y, without gaps)
param labelIndexFrom Column index of the first regression target (indexes start at 0)
param labelIndexTo Column index of the last regression target (inclusive)
classification
Use this for classification
param labelIndex Index of the column that contains the label. The column (indexes start from 0) should contain integer values in the range 0 to numClasses-1
param numClasses Number of label classes (i.e., number of categories/classes in the dataset)
preProcessor
Optional arg. Allows the preprocessor to be set
param preProcessor Preprocessor to use
collectMetaData
When set to true: metadata for the current examples will be present in the returned DataSet. Disabled by default.
param collectMetaData Whether metadata should be collected or not
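As a sketch, the builder can combine these options; the regression column range and batch limit below are illustrative, and recordReader is assumed to be an initialized RecordReader:

```java
DataSetIterator regressionIter = new RecordReaderDataSetIterator.Builder(recordReader, 32)
        .regression(1, 3)          // regression targets in columns 1 to 3 (inclusive)
        .maxNumBatches(100)        // at most 100 minibatches per reset
        .collectMetaData(true)     // include RecordMetaData in the returned DataSets
        .build();
```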
The idea: generate multiple inputs and multiple outputs from one or more Sequence/RecordReaders. Inputs and outputs may be obtained from subsets of the RecordReader and SequenceRecordReader columns (for example, some inputs and outputs as different columns in the same record/sequence); it is also possible to mix different types of data (for example, using both RecordReaders and SequenceRecordReaders in the same RecordReaderMultiDataSetIterator).
RecordReaderMultiDataSetIterator
When dealing with time series data of different lengths, how should we align the input/labels time series? For equal lengths: use EQUAL_LENGTH. For sequence classification: use ALIGN_END.
loadFromMetaData
Load a single example to a DataSet, using the provided RecordMetaData. Note that it is more efficient to load multiple instances at once, using loadFromMetaData(List)
param recordMetaData RecordMetaData to load from. Should have been produced by the given record reader
return DataSet with the specified example
throws IOException If an error occurs during loading of the data
loadFromMetaData
Load multiple sequence examples to a DataSet, using the provided RecordMetaData instances.
param list List of RecordMetaData instances to load from. Should have been produced by the record reader provided to the SequenceRecordReaderDataSetIterator constructor
return DataSet with the specified examples
throws IOException If an error occurs during loading of the data
Sequence record reader data set iterator. Given a record reader (and optionally another record reader for the labels) generate time series (sequence) data sets. Supports padding for one-to-many and many-to-one type data loading (i.e., with a different number of input vs. label time steps).
SequenceRecordReaderDataSetIterator
Constructor where features and labels come from different RecordReaders (for example, different files), and labels are for classification.
param featuresReader SequenceRecordReader for the features
param labels Labels: assume a single value per time step, where values are integers in the range 0 to numPossibleLabels-1
param miniBatchSize Minibatch size for each call of next()
param numPossibleLabels Number of classes for the labels
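A sketch of that constructor in use; the file-naming pattern, sequence count and class count are illustrative:

```java
SequenceRecordReader featureReader = new CSVSequenceRecordReader(0);   // no header lines, comma delimiter
featureReader.initialize(new NumberedFileInputSplit("features_%d.csv", 0, 99));   // features_0.csv ... features_99.csv
SequenceRecordReader labelReader = new CSVSequenceRecordReader(0);
labelReader.initialize(new NumberedFileInputSplit("labels_%d.csv", 0, 99));

int miniBatchSize = 32;
int numPossibleLabels = 6;
DataSetIterator iter = new SequenceRecordReaderDataSetIterator(
        featureReader, labelReader, miniBatchSize, numPossibleLabels);
```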
hasNext
Constructor where features and labels come from different RecordReaders (for example, different files)
loadFromMetaData
Load a single sequence example to a DataSet, using the provided RecordMetaData. Note that it is more efficient to load multiple instances at once, using loadFromMetaData(List)
param recordMetaData RecordMetaData to load from. Should have been produced by the given record reader
return DataSet with the specified example
throws IOException If an error occurs during loading of the data
loadFromMetaData
Load multiple sequence examples to a DataSet, using the provided RecordMetaData instances.
param list List of RecordMetaData instances to load from. Should have been produced by the record reader provided to the SequenceRecordReaderDataSetIterator constructor
return DataSet with the specified examples
throws IOException If an error occurs during loading of the data
Async prefetching iterator wrapper for MultiDataSetIterator implementations. This will asynchronously prefetch the specified number of minibatches from the underlying iterator. Also has the option (enabled by default for most constructors) to use a cyclical workspace to avoid creating INDArrays with off-heap memory that needs to be cleaned up by the JVM garbage collector.
Note that appropriate DL4J fit methods automatically utilize this iterator, so users don’t need to manually wrap their iterators when fitting a network
next
We want to ensure that the background thread has the same thread->device affinity as the master thread
setPreProcessor
Set the preprocessor to be applied to each MultiDataSet, before each MultiDataSet is returned.
param preProcessor MultiDataSetPreProcessor. May be null.
resetSupported
Is resetting supported by this DataSetIterator? Many DataSetIterators do support resetting, but some don’t
return true if reset method is supported; false otherwise
asyncSupported
Does this DataSetIterator support asynchronous prefetching of multiple DataSet objects? Most DataSetIterators do, but in some cases it may not make sense to wrap this iterator in an iterator that does asynchronous prefetching. For example, it would not make sense to use asynchronous prefetching for the following types of iterators: (a) Iterators that store their full contents in memory already (b) Iterators that re-use features/labels arrays (as future next() calls will overwrite past contents) (c) Iterators that already implement some level of asynchronous prefetching (d) Iterators that may return different data depending on when the next() method is called
return true if asynchronous prefetching from this iterator is OK; false if asynchronous prefetching should not be used with this iterator
reset
Resets the iterator back to the beginning
shutdown
We want to ensure that the background thread has the same thread->device affinity as the master thread
hasNext
Returns true if the iteration has more elements. (In other words, returns true if next() would return an element rather than throwing an exception.)
return true if the iteration has more elements
next
Returns the next element in the iteration.
return the next element in the iteration
remove
Removes from the underlying collection the last element returned by this iterator (optional operation). This method can be called only once per call to next(). The behavior of an iterator is unspecified if the underlying collection is modified while the iteration is in progress in any way other than by calling this method.
throws UnsupportedOperationException if the remove operation is not supported by this iterator
throws IllegalStateException if the next method has not yet been called, or the remove method has already been called after the last call to the next method
implSpec The default implementation throws an instance of UnsupportedOperationException and performs no other action.
Combines and splits DataSets from an underlying iterator as required to get the specified batch size.
Typically used in Spark training, but may be used elsewhere. NOTE: reset method is not supported here.
Async prefetching iterator wrapper for DataSetIterator implementations. This will asynchronously prefetch the specified number of minibatches from the underlying iterator. Also has the option (enabled by default for most constructors) to use a cyclical workspace to avoid creating INDArrays with off-heap memory that needs to be cleaned up by the JVM garbage collector.
Note that appropriate DL4J fit methods automatically utilize this iterator, so users don’t need to manually wrap their iterators when fitting a network
AsyncDataSetIterator
Create an Async iterator with the default queue size of 8
param baseIterator Underlying iterator to wrap and fetch asynchronously from
next
Create an Async iterator with the default queue size of 8
param iterator Underlying iterator to wrap and fetch asynchronously from
param queue Queue size - the number of minibatches to prefetch and hold in the queue
inputColumns
Input columns for the dataset
return
totalOutcomes
The number of labels for the dataset
return
resetSupported
Is resetting supported by this DataSetIterator? Many DataSetIterators do support resetting, but some don’t
return true if reset method is supported; false otherwise
asyncSupported
Does this DataSetIterator support asynchronous prefetching of multiple DataSet objects? Most DataSetIterators do, but in some cases it may not make sense to wrap this iterator in an iterator that does asynchronous prefetching. For example, it would not make sense to use asynchronous prefetching for the following types of iterators: (a) Iterators that store their full contents in memory already (b) Iterators that re-use features/labels arrays (as future next() calls will overwrite past contents) (c) Iterators that already implement some level of asynchronous prefetching (d) Iterators that may return different data depending on when the next() method is called
return true if asynchronous prefetching from this iterator is OK; false if asynchronous prefetching should not be used with this iterator
reset
Resets the iterator back to the beginning
shutdown
We want to ensure that the background thread has the same thread->device affinity as the master thread
batch
Batch size
return
setPreProcessor
Set a pre processor
param preProcessor a pre processor to set
getPreProcessor
Returns preprocessors, if defined
return
hasNext
Get dataset iterator record reader labels
next
Returns the next element in the iteration.
return the next element in the iteration
remove
Removes from the underlying collection the last element returned by this iterator (optional operation). This method can be called only once per call to next(). The behavior of an iterator is unspecified if the underlying collection is modified while the iteration is in progress in any way other than by calling this method.
throws UnsupportedOperationException if the remove operation is not supported by this iterator
throws IllegalStateException if the next method has not yet been called, or the remove method has already been called after the last call to the next method
implSpec The default implementation throws an instance of UnsupportedOperationException and performs no other action.
First value in pair is the features vector, second value in pair is the labels. Supports generating 2d features/labels only
DoublesDataSetIterator
param iterable Iterable to source data from
param batchSize Batch size for generated DataSet objects
Combines and splits DataSets from an underlying iterator as required to get a specified batch size.
Typically used in Spark training, but may be used elsewhere. NOTE: reset method is not supported here.
A wrapper for a dataset to sample from. This will randomly sample from the given dataset.
SamplingDataSetIterator
First value in pair is the features vector, second value in pair is the labels.
INDArrayDataSetIterator
param iterable Iterable to source data from
param batchSize Batch size for generated DataSet objects
This iterator detaches/migrates DataSets coming out of the backing DataSetIterator, thus providing "safe" DataSets. This is typically used for debugging and testing purposes, and should not generally be used by users
WorkspacesShieldDataSetIterator
param iterator The underlying iterator to detach values from
This iterator virtually splits the given MultiDataSetIterator into train and test parts. For example, if you have 100000 examples and your batch size is 32, you have 3125 total batches. With a split ratio of 0.7 that gives you 2187 training batches and 938 test batches.
PLEASE NOTE: You can't use the test iterator twice in a row; the train iterator should be used before each use of the test iterator. PLEASE NOTE: You can't use this iterator if the underlying iterator uses randomization/shuffling between epochs.
param baseIterator
param totalBatches - total number of batches in underlying iterator. this value will be used to determine number of test/train batches
param ratio - this value will be used as the splitter; it should be in the range 0.0 < X < 1.0. I.e. if the value 0.7 is provided, then 70% of total examples will be used for training, and 30% of total examples will be used for testing
getTrainIterator
This method returns train iterator instance
return
next
This method returns test iterator instance
return
This wrapper takes your existing DataSetIterator implementation and prevents asynchronous prefetching. This is mainly used for debugging purposes, or for an iterator that isn't safe to asynchronously prefetch from
AsyncShieldDataSetIterator
param iterator Iterator to wrap, to disable asynchronous prefetching for
next
Like the standard next method but allows a customizable number of examples returned
param num the number of examples
return the next data set
inputColumns
Input columns for the dataset
return
totalOutcomes
The number of labels for the dataset
return
resetSupported
Is resetting supported by this DataSetIterator? Many DataSetIterators do support resetting, but some don’t
return true if reset method is supported; false otherwise
asyncSupported
Does this DataSetIterator support asynchronous prefetching of multiple DataSet objects?
PLEASE NOTE: This iterator ALWAYS returns FALSE
return true if asynchronous prefetching from this iterator is OK; false if asynchronous prefetching should not be used with this iterator
reset
Resets the iterator back to the beginning
batch
Batch size
return
setPreProcessor
Set a pre processor
param preProcessor a pre processor to set
getPreProcessor
Returns preprocessors, if defined
return
hasNext
Get dataset iterator record reader labels
next
Returns the next element in the iteration.
return the next element in the iteration
remove
Removes from the underlying collection the last element returned by this iterator (optional operation). This method can be called only once per call to next(). The behavior of an iterator is unspecified if the underlying collection is modified while the iteration is in progress in any way other than by calling this method.
throws UnsupportedOperationException if the remove operation is not supported by this iterator
throws IllegalStateException if the next method has not yet been called, or the remove method has already been called after the last call to the next method
implSpec The default implementation throws an instance of UnsupportedOperationException and performs no other action.
This class provides a baseline implementation of the BlockDataSetIterator interface.
The baseline implementation includes control over the data fetcher and some basic getters for metadata
This wrapper takes your existing MultiDataSetIterator implementation and prevents asynchronous prefetch
next
Fetch the next ‘num’ examples. Similar to the next method, but returns a specified number of examples
param num Number of examples to fetch
setPreProcessor
Set the preprocessor to be applied to each MultiDataSet, before each MultiDataSet is returned.
param preProcessor MultiDataSetPreProcessor. May be null.
resetSupported
Is resetting supported by this DataSetIterator? Many DataSetIterators do support resetting, but some don’t
return true if reset method is supported; false otherwise
asyncSupported
Does this DataSetIterator support asynchronous prefetching of multiple DataSet objects?
PLEASE NOTE: This iterator ALWAYS returns FALSE
return true if asynchronous prefetching from this iterator is OK; false if asynchronous prefetching should not be used with this iterator
reset
Resets the iterator back to the beginning
hasNext
Returns true if the iteration has more elements. (In other words, returns true if next() would return an element rather than throwing an exception.)
return true if the iteration has more elements
next
Returns the next element in the iteration.
return the next element in the iteration
remove
Removes from the underlying collection the last element returned by this iterator (optional operation). This method can be called only once per call to next(). The behavior of an iterator is unspecified if the underlying collection is modified while the iteration is in progress in any way other than by calling this method.
throws UnsupportedOperationException if the remove operation is not supported by this iterator
throws IllegalStateException if the next method has not yet been called, or the remove method has already been called after the last call to the next method
implSpec The default implementation throws an instance of UnsupportedOperationException and performs no other action.
RandomMultiDataSetIterator: Generates random values (or zeros, ones, integers, etc) according to some distribution. Note: This is typically used for testing, debugging and benchmarking purposes.
RandomMultiDataSetIterator
param numMiniBatches Number of minibatches per epoch
param features Each triple in the list specifies the shape, array order and type of values for the features arrays
param labels Each triple in the list specifies the shape, array order and type of values for the labels arrays
addFeatures
param numMiniBatches Number of minibatches per epoch
addFeatures
Add a new features array to the iterator
param shape Shape of the features
param order Order (‘c’ or ‘f’) for the array
param values Values to fill the array with
addLabels
Add a new labels array to the iterator
param shape Shape of the labels
param values Values to fill the array with
addLabels
Add a new labels array to the iterator
param shape Shape of the labels
param order Order (‘c’ or ‘f’) for the array
param values Values to fill the array with
generate
Generate a random array with the specified shape
param shape Shape of the array
param values Values to fill the array with
return Random array of specified shape + contents
generate
Generate a random array with the specified shape and order
param shape Shape of the array
param order Order of array (‘c’ or ‘f’)
param values Values to fill the array with
return Random array of specified shape + contents
Builds an iterator that terminates once the number of minibatches returned with .next() is equal to a specified number. Note that a call to .next(num) is counted as a call to return a minibatch regardless of the value of num. This essentially restricts the data to this specified number of minibatches.
EarlyTerminationMultiDataSetIterator
Constructor takes the iterator to wrap and the number of minibatches after which the call to hasNext() will return false
param underlyingIterator, iterator to wrap
param terminationPoint, minibatches after which hasNext() will return false
ExistingDataSetIterator
Note that when using this constructor, resetting is not supported
param iterator Iterator to wrap
next
Note that when using this constructor, resetting is not supported
param iterator Iterator to wrap
param labels String labels. May be null.
This class provides a baseline implementation of the BlockMultiDataSetIterator interface
Builds an iterator that terminates once the number of minibatches returned with .next() is equal to a specified number. Note that a call to .next(num) is counted as a call to return a minibatch regardless of the value of num. This essentially restricts the data to this specified number of minibatches.
EarlyTerminationDataSetIterator
Constructor takes the iterator to wrap and the number of minibatches after which the call to hasNext() will return false
param underlyingIterator, iterator to wrap
param terminationPoint, minibatches after which hasNext() will return false
Wraps a data set iterator setting the first (feature matrix) as the labels.
next
Like the standard next method but allows a customizable number of examples returned
param num the number of examples
return the next data set
inputColumns
Input columns for the dataset
return
totalOutcomes
The number of labels for the dataset
return
reset
Resets the iterator back to the beginning
batch
Batch size
return
hasNext
Returns true if the iteration has more elements. (In other words, returns true if next() would return an element rather than throwing an exception.)
return true if the iteration has more elements
next
Returns the next element in the iteration.
return the next element in the iteration
remove
Removes from the underlying collection the last element returned by this iterator (optional operation). This method can be called only once per call to next(). The behavior of an iterator is unspecified if the underlying collection is modified while the iteration is in progress in any way other than by calling this method.
throws UnsupportedOperationException if the remove operation is not supported by this iterator
throws IllegalStateException if the next method has not yet been called, or the remove method has already been called after the last call to the next method
This iterator virtually splits the given DataSetIterator into train and test parts. For example, if you have 100000 examples and your batch size is 32, you have 3125 total batches. With a split ratio of 0.7 that gives you 2187 training batches and 938 test batches.
PLEASE NOTE: You can't use the test iterator twice in a row; the train iterator should be used before each use of the test iterator. PLEASE NOTE: You can't use this iterator if the underlying iterator uses randomization/shuffling between epochs.
DataSetIteratorSplitter
The only constructor
param baseIterator - iterator to be wrapped and split
param totalBatches - total batches in baseIterator
param ratio - train/test split ratio
getTrainIterator
This method returns train iterator instance
return
next
This method returns test iterator instance
return
This dataset iterator combines multiple DataSetIterators into 1 MultiDataSetIterator. Values from each iterator are joined on a per-example basis - i.e., the values from each DataSet are combined as different feature arrays for a multi-input neural network. Labels can come from one of the underlying DataSetIterators only (if 'outcome' is >= 0) or from all iterators (if outcome is < 0)
JointMultiDataSetIterator
param iterators Underlying iterators to wrap
next
param outcome Index to get the label from. If < 0, labels from all iterators will be used to create the final MultiDataSet
param iterators Underlying iterators to wrap
setPreProcessor
Set the preprocessor to be applied to each MultiDataSet, before each MultiDataSet is returned.
param preProcessor MultiDataSetPreProcessor. May be null.
getPreProcessor
Get the MultiDataSetPreProcessor, if one has previously been set. Returns null if no preprocessor has been set
return Preprocessor
resetSupported
Is resetting supported by this DataSetIterator? Many DataSetIterators do support resetting, but some don’t
return true if reset method is supported; false otherwise
asyncSupported
Does this MultiDataSetIterator support asynchronous prefetching of multiple MultiDataSet objects? Most MultiDataSetIterators do, but in some cases it may not make sense to wrap this iterator in an iterator that does asynchronous prefetching. For example, it would not make sense to use asynchronous prefetching for the following types of iterators: (a) Iterators that store their full contents in memory already (b) Iterators that re-use features/labels arrays (as future next() calls will overwrite past contents) (c) Iterators that already implement some level of asynchronous prefetching (d) Iterators that may return different data depending on when the next() method is called
return true if asynchronous prefetching from this iterator is OK; false if asynchronous prefetching should not be used with this iterator
reset
Resets the iterator back to the beginning
hasNext
Returns true if the iteration has more elements. (In other words, returns true if next() would return an element rather than throwing an exception.)
return true if the iteration has more elements
next
Returns the next element in the iteration.
return the next element in the iteration
remove
PLEASE NOTE: This method is NOT implemented
throws UnsupportedOperationException if the remove operation is not supported by this iterator
throws IllegalStateException if the next method has not yet been called, or the remove method has already been called after the last call to the next method
implSpec The default implementation throws an instance of UnsupportedOperationException and performs no other action.
First value in pair is the features vector, second value in pair is the labels. Supports generating 2d features/labels only
FloatsDataSetIterator
param iterable Iterable to source data from
param batchSize Batch size for generated DataSet objects
Simple iterator working with a list of files. File-to-DataSet conversion is handled via the provided FileCallback implementation
FileSplitDataSetIterator
param files List of files to iterate over
param callback Callback for loading the files
A dataset iterator for doing multiple passes over a dataset
Use MultiLayerNetwork/ComputationGraph.fit(DataSetIterator, int numEpochs) instead
next
Like the standard next method but allows a customizable number of examples returned
param num the number of examples
return the next data set
inputColumns
Input columns for the dataset
return
totalOutcomes
The number of labels for the dataset
return
reset
Resets the iterator back to the beginning
batch
Batch size
return
hasNext
Returns true if the iteration has more elements. (In other words, returns true if next() would return an element rather than throwing an exception.)
return true if the iteration has more elements
remove
Removes from the underlying collection the last element returned by this iterator (optional operation). This method can be called only once per call to next(). The behavior of an iterator is unspecified if the underlying collection is modified while the iteration is in progress in any way other than by calling this method.
throws UnsupportedOperationException if the remove operation is not supported by this iterator
throws IllegalStateException if the next method has not yet been called, or the remove method has already been called after the last call to the next method
This class is simple wrapper that takes single-input MultiDataSets and converts them to DataSets on the fly
PLEASE NOTE: This only works if number of features/labels/masks is 1
MultiDataSetWrapperIterator
param iterator Underlying iterator to wrap
RandomDataSetIterator: Generates random values (or zeros, ones, integers, etc) according to some distribution. Note: This is typically used for testing, debugging and benchmarking purposes.
RandomDataSetIterator
param numMiniBatches Number of minibatches per epoch
param featuresShape Features shape
param labelsShape Labels shape
param featureValues Type of values for the features
param labelValues Type of values for the labels
Iterator that adapts a DataSetIterator to a MultiDataSetIterator
Prebuilt model architectures and weights for out-of-the-box application.
Deeplearning4j has a native model zoo that can be accessed and instantiated directly from DL4J. The model zoo also includes pretrained weights for different datasets that are downloaded automatically and checked for integrity using a checksum mechanism.
If you want to use the model zoo, you will need to add it as a dependency (the deeplearning4j-zoo artifact) to your Maven POM or Gradle build.
Once you've successfully added the zoo dependency to your project, you can start to import and use models. Each model extends the ZooModel
abstract class and uses the InstantiableModel
interface. These classes provide methods that help you initialize either an empty, fresh network or a pretrained network.
You can instantly instantiate a model from the zoo using the .init()
method. For example, if you want to instantiate a fresh, untrained network of AlexNet you can use the following code:
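A sketch of that, assuming the builder-style construction used in recent DL4J versions; the class count and seed are illustrative values:

```java
ZooModel zooModel = AlexNet.builder()
        .numClasses(10)    // number of classes in your data (illustrative)
        .seed(123)
        .build();
Model alexNet = zooModel.init();   // fresh, untrained network
```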
If you want to tune parameters or change the optimization algorithm, you can obtain a reference to the underlying network configuration:
Some models have pretrained weights available, and a small number of models are pretrained across different datasets. PretrainedType
is an enumerator that outlines different weight types, which includes IMAGENET
, MNIST
, CIFAR10
, and VGGFACE
.
For example, you can initialize a VGG-16 model with ImageNet weights like so:
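A sketch of this, assuming the builder-style construction; initPretrained returns a Model, which for VGG-16 is a ComputationGraph:

```java
ZooModel zooModel = VGG16.builder().build();
ComputationGraph vgg16 = (ComputationGraph) zooModel.initPretrained(PretrainedType.IMAGENET);
```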
And initialize another VGG16 model with weights trained on VGGFace:
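And, under the same assumptions, for the VGGFace weights:

```java
ZooModel zooModel = VGG16.builder().build();
ComputationGraph vggFace = (ComputationGraph) zooModel.initPretrained(PretrainedType.VGGFACE);
```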
If you're not sure whether a model contains pretrained weights, you can use the .pretrainedAvailable()
method which returns a boolean. Simply pass a PretrainedType
enum to this method, which returns true if weights are available.
Note that for convolutional models, input shape information follows the NCHW convention. So if a model's input shape default is new int[]{3, 224, 224}
, this means the model has 3 channels and height/width of 224.
The model zoo comes with well-known image recognition configurations in the deep learning community. The zoo also includes an LSTM for text generation, and a simple CNN for general image recognition.
You can find a complete list of models using this deeplearning4j-zoo Github link.
This includes ImageNet models such as VGG-16, ResNet-50, AlexNet, Inception-ResNet-v1, LeNet, and more.
The zoo comes with a couple additional features if you're looking to use the models for different use cases.
Aside from passing certain configuration information to the constructor of a zoo model, you can also change its input shape using .setInputShape(). NOTE: this applies to fresh configurations only, and will not affect pretrained models:
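A sketch, using Darknet19 (which accepts either 224x224 or 448x448 inputs, as noted above); the shape follows the {channels, height, width} convention:

```java
ZooModel zooModel = Darknet19.builder().build();
zooModel.setInputShape(new int[][]{{3, 448, 448}});   // channels, height, width
```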
Pretrained models are perfect for transfer learning! You can read more about transfer learning using DL4J here.
Initialization methods often have an additional parameter named workspaceMode
. For the majority of users you will not need to use this; however, if you have a large machine that has "beefy" specifications, you can pass WorkspaceMode.SINGLE
for models such as VGG-19 that have many millions of parameters. To learn more about workspaces, please see this section.
Also known as CNN.
1D convolution layer. Expects input activations of shape [minibatch,channels,sequenceLength]
2D convolution layer
3D convolution layer configuration
hasBias
An optional dataFormat: “NDHWC” or “NCDHW”. Defaults to “NCDHW”. The data format of the input and output data. For “NCDHW” (also known as ‘channels first’ format), the data storage order is: [batchSize, inputChannels, inputDepth, inputHeight, inputWidth]. For “NDHWC” (‘channels last’ format), the data is stored in the order of: [batchSize, inputDepth, inputHeight, inputWidth, inputChannels].
kernelSize
The data format for input and output activations. NCDHW: activations (in/out) should have shape [minibatch, channels, depth, height, width] NDHWC: activations (in/out) should have shape [minibatch, depth, height, width, channels]
stride
Set stride size for 3D convolutions in (depth, height, width) order
param stride stride size
return 3D convolution layer builder
padding
Set padding size for 3D convolutions in (depth, height, width) order
param padding padding size
return 3D convolution layer builder
dilation
Set dilation size for 3D convolutions in (depth, height, width) order
param dilation dilation size
return 3D convolution layer builder
dataFormat
The data format for input and output activations. NCDHW: activations (in/out) should have shape [minibatch, channels, depth, height, width] NDHWC: activations (in/out) should have shape [minibatch, depth, height, width, channels]
param dataFormat Data format to use for activations
setKernelSize
Set kernel size for 3D convolutions in (depth, height, width) order
param kernelSize kernel size
setStride
Set stride size for 3D convolutions in (depth, height, width) order
param stride stride size
setPadding
Set padding size for 3D convolutions in (depth, height, width) order
param padding padding size
setDilation
Set dilation size for 3D convolutions in (depth, height, width) order
param dilation dilation size
2D deconvolution layer configuration
Deconvolutions are also known as transpose convolutions or fractionally strided convolutions. In essence, deconvolutions swap forward and backward pass with regular 2D convolutions.
See the paper by Matt Zeiler for details: http://www.matthewzeiler.com/wp-content/uploads/2017/07/cvpr2010.pdf
For an intuitive guide to convolution arithmetic and shapes, see: https://arxiv.org/abs/1603.07285v1
hasBias
Deconvolution2D layer. nIn in the input layer is the number of channels; nOut is the number of filters to be used in the net (in other words, the output channels). The builder specifies the filter/kernel size, the stride and the padding. The pooling layer takes the kernel size.
convolutionMode
Set the convolution mode for the Convolution layer. See {- link ConvolutionMode} for more details
param convolutionMode Convolution mode for layer
kernelSize
Size of the convolution rows/columns
param kernelSize the height and width of the kernel
Cropping layer for convolutional (1d) neural networks. Allows cropping to be done separately for top/bottom
getOutputType
param cropTopBottom Amount of cropping to apply to both the top and the bottom of the input activations
setCropping
Cropping amount for top/bottom (in that order). Must be length 1 or 2 array.
build
param cropping Cropping amount for top/bottom (in that order). Must be length 1 or 2 array.
Cropping layer for convolutional (2d) neural networks. Allows cropping to be done separately for top/bottom/left/right
getOutputType
param cropTopBottom Amount of cropping to apply to both the top and the bottom of the input activations
param cropLeftRight Amount of cropping to apply to both the left and the right of the input activations
setCropping
Cropping amount for top/bottom/left/right (in that order). A length 4 array.
build
param cropping Cropping amount for top/bottom/left/right (in that order). Must be length 4 array.
Cropping layer for convolutional (3d) neural networks. Allows cropping to be done separately for upper and lower bounds of depth, height and width dimensions.
getOutputType
param cropDepth Amount of cropping to apply to both depth boundaries of the input activations
param cropHeight Amount of cropping to apply to both height boundaries of the input activations
param cropWidth Amount of cropping to apply to both width boundaries of the input activations
setCropping
Cropping amount, a length 6 array, i.e. crop left depth, crop right depth, crop left height, crop right height, crop left width, crop right width
build
param cropping Cropping amount, must be length 3 or 6 array, i.e. either crop depth, crop height, crop width or crop left depth, crop right depth, crop left height, crop right height, crop left width, crop right width
Autoencoders are neural networks for unsupervised learning. Eclipse Deeplearning4j supports certain autoencoder layers such as variational autoencoders.
RBMs are no longer supported as of version 0.9.x. They are no longer best-in-class for most machine learning problems.
Autoencoder layer. Adds noise to the input and learns a reconstruction function.
corruptionLevel
Level of corruption - 0.0 (none) to 1.0 (all values corrupted)
sparsity
Autoencoder sparsity parameter
param sparsity Sparsity
Variational Autoencoder layer
See: Kingma & Welling, 2013: Auto-Encoding Variational Bayes - https://arxiv.org/abs/1312.6114
This implementation allows multiple encoder and decoder layers, the number and sizes of which can be set independently.
A note on scores during pretraining: This implementation minimizes the negative of the variational lower bound objective as described in Kingma & Welling; the mathematics in that paper is based on maximization of the variational lower bound instead. Thus, scores reported during pretraining in DL4J are the negative of the variational lower bound equation in the paper. The backpropagation and learning procedure is otherwise as described there.
encoderLayerSizes
Size of the encoder layers, in units. Each encoder layer is functionally equivalent to a {- link org.deeplearning4j.nn.conf.layers.DenseLayer}. Typically the number and size of the decoder layers (set via {- link #decoderLayerSizes(int…)} is similar to the encoder layers.
setEncoderLayerSizes
Size of the encoder layers, in units. Each encoder layer is functionally equivalent to a {- link org.deeplearning4j.nn.conf.layers.DenseLayer}. Typically the number and size of the decoder layers (set via {- link #decoderLayerSizes(int…)} is similar to the encoder layers.
param encoderLayerSizes Size of each encoder layer in the variational autoencoder
decoderLayerSizes
Size of the decoder layers, in units. Each decoder layer is functionally equivalent to a {- link org.deeplearning4j.nn.conf.layers.DenseLayer}. Typically the number and size of the decoder layers is similar to the encoder layers (set via {- link #encoderLayerSizes(int…)}.
param decoderLayerSizes Size of each decoder layer in the variational autoencoder
setDecoderLayerSizes
Size of the decoder layers, in units. Each decoder layer is functionally equivalent to a {- link org.deeplearning4j.nn.conf.layers.DenseLayer}. Typically the number and size of the decoder layers is similar to the encoder layers (set via {- link #encoderLayerSizes(int…)}.
param decoderLayerSizes Size of each decoder layer in the variational autoencoder
reconstructionDistribution
The reconstruction distribution for the data given the hidden state - i.e., P(data|Z). This should be selected carefully based on the type of data being modelled. For example:
{- link GaussianReconstructionDistribution} + {identity or tanh} for real-valued (Gaussian) data
{- link BernoulliReconstructionDistribution} + sigmoid for binary-valued (0 or 1) data
param distribution Reconstruction distribution
lossFunction
Configure the VAE to use the specified loss function for the reconstruction, instead of a ReconstructionDistribution. Note that this is NOT following the standard VAE design (as per Kingma & Welling), which assumes a probabilistic output - i.e., some p(x|z). It is however a valid network configuration, allowing for optimization of more traditional objectives such as mean squared error. Note: clearly, setting the loss function here will override any previously set reconstruction distribution
param outputActivationFn Activation function for the output/reconstruction
param lossFunction Loss function to use
lossFunction
Configure the VAE to use the specified loss function for the reconstruction, instead of a ReconstructionDistribution. Note that this is NOT following the standard VAE design (as per Kingma & Welling), which assumes a probabilistic output - i.e., some p(x|z). It is however a valid network configuration, allowing for optimization of more traditional objectives such as mean squared error. Note: clearly, setting the loss function here will override any previously set reconstruction distribution
param outputActivationFn Activation function for the output/reconstruction
param lossFunction Loss function to use
lossFunction
Configure the VAE to use the specified loss function for the reconstruction, instead of a ReconstructionDistribution. Note that this is NOT following the standard VAE design (as per Kingma & Welling), which assumes a probabilistic output - i.e., some p(x|z). It is however a valid network configuration, allowing for optimization of more traditional objectives such as mean squared error. Note: clearly, setting the loss function here will override any previously set reconstruction distribution
param outputActivationFn Activation function for the output/reconstruction
param lossFunction Loss function to use
pzxActivationFn
Activation function for the input to P(z|data). Care should be taken with this, as some activation functions (relu, etc) are not suitable due to being bounded in range [0,infinity).
param activationFunction Activation function for p(z| x)
pzxActivationFunction
Activation function for the input to P(z|data). Care should be taken with this, as some activation functions (relu, etc) are not suitable due to being bounded in range [0,infinity).
param activation Activation function for p(z | x)
nOut
Set the size of the VAE state Z. This is the output size during standard forward pass, and the size of the distribution P(Z|data) during pretraining.
param nOut Size of P(Z | data) and output size
numSamples
Set the number of samples per data point (from VAE state Z) used when doing pretraining. Default value: 1.
This is parameter L from Kingma and Welling: “In our experiments we found that the number of samples L per datapoint can be set to 1 as long as the minibatch size M was large enough, e.g. M = 100.”
param numSamples Number of samples per data point for pretraining
Saving and loading of neural networks.
MultiLayerNetwork and ComputationGraph both have save and load methods.
You can save/load a MultiLayerNetwork using:
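A sketch, assuming a recent DL4J release where `MultiLayerNetwork` exposes `save`/`load` directly (the file path is hypothetical):

```java
import java.io.File;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;

File locationToSave = new File("my-multilayer-network.zip");   // hypothetical path

// Save (true = also save the updater state, needed to resume training later)
net.save(locationToSave, true);

// Load
MultiLayerNetwork restored = MultiLayerNetwork.load(locationToSave, true);
```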
Similarly, you can save/load a ComputationGraph using:
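Under the same assumption, for a `ComputationGraph`:

```java
import java.io.File;
import org.deeplearning4j.nn.graph.ComputationGraph;

File locationToSave = new File("my-computation-graph.zip");    // hypothetical path

graph.save(locationToSave, true);                              // true = save the updater state
ComputationGraph restoredGraph = ComputationGraph.load(locationToSave, true);
```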
Internally, these methods use the ModelSerializer
class, which handles loading and saving models. There are two ways of saving models shown in the linked examples: the first saves a normal MultiLayerNetwork, the second saves a ComputationGraph.
Here is an example of saving a computation graph using the ModelSerializer
class, as well as an example of using ModelSerializer to save a neural net built using a MultiLayerConfiguration.
If your model uses probabilities (i.e. DropOut/DropConnect), it may make sense to save the RNG seed separately and apply it after the model is restored, i.e.:
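A sketch of that idea (the seed value and file path are arbitrary placeholders):

```java
import java.io.File;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.util.ModelSerializer;
import org.nd4j.linalg.factory.Nd4j;

File modelFile = new File("my-model.zip");   // hypothetical path

// Fix the RNG seed before restoring, so stochastic layers (DropOut/DropConnect) behave identically
Nd4j.getRandom().setSeed(12345);
MultiLayerNetwork restored = ModelSerializer.restoreMultiLayerNetwork(modelFile);
```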
This will guarantee equal results between sessions/JVMs.
Utility class suited to save/restore neural net models
writeModel
Write a model to a file
param model the model to write
param file the file to write to
param saveUpdater whether to save the updater or not
throws IOException
writeModel
Write a model to a file
param model the model to write
param file the file to write to
param saveUpdater whether to save the updater or not
param dataNormalization the normalizer to save (optional)
throws IOException
writeModel
Write a model to a file path
param model the model to write
param path the path to write to
param saveUpdater whether to save the updater or not
throws IOException
writeModel
Write a model to an output stream
param model the model to save
param stream the output stream to write to
param saveUpdater whether to save the updater for the model or not
throws IOException
writeModel
Write a model to an output stream
param model the model to save
param stream the output stream to write to
param saveUpdater whether to save the updater for the model or not
param dataNormalization the normalizer to save (may be null)
throws IOException
restoreMultiLayerNetwork
Load a multi layer network from a file
param file the file to load from
return the loaded multi layer network
throws IOException
restoreMultiLayerNetwork
Load a multi layer network from a file
param file the file to load from
return the loaded multi layer network
throws IOException
restoreMultiLayerNetwork
Load a MultiLayerNetwork from an input stream. Note: the input stream is read fully and closed by this method. Consequently, the input stream cannot be re-used.
param is the inputstream to load from
return the loaded multi layer network
throws IOException
see #restoreMultiLayerNetworkAndNormalizer(InputStream, boolean)
restoreMultiLayerNetwork
Restore a multi layer network from an input stream Note: the input stream is read fully and closed by this method. Consequently, the input stream cannot be re-used.
param is the input stream to restore from
return the loaded multi layer network
throws IOException
see #restoreMultiLayerNetworkAndNormalizer(InputStream, boolean)
restoreMultiLayerNetwork
Load a MultiLayerNetwork model from a file
param path path to the model file to load the network from
return the loaded multi layer network
throws IOException
restoreMultiLayerNetwork
Load a MultiLayerNetwork model from a file
param path path to the model file to load the network from
return the loaded multi layer network
throws IOException
restoreComputationGraph
Restore a MultiLayerNetwork and Normalizer (if present - null if not) from the InputStream. Note: the input stream is read fully and closed by this method. Consequently, the input stream cannot be re-used.
param is Input stream to read from
param loadUpdater Whether to load the updater from the model or not
return Model and normalizer, if present
throws IOException If an error occurs when reading from the stream
restoreComputationGraph
Load a computation graph from a file
param path path to the model file, to get the computation graph from
return the loaded computation graph
throws IOException
restoreComputationGraph
Load a computation graph from an InputStream
param is the inputstream to get the computation graph from
return the loaded computation graph
throws IOException
restoreComputationGraph
Load a computation graph from an InputStream
param is the inputstream to get the computation graph from
return the loaded computation graph
throws IOException
restoreComputationGraph
Load a computation graph from a file
param file the file to get the computation graph from
return the loaded computation graph
throws IOException
restoreComputationGraph
Restore a ComputationGraph and Normalizer (if present - null if not) from the InputStream. Note: the input stream is read fully and closed by this method. Consequently, the input stream cannot be re-used.
param is Input stream to read from
param loadUpdater Whether to load the updater from the model or not
return Model and normalizer, if present
throws IOException If an error occurs when reading from the stream
taskByModel
param model
return
addNormalizerToModel
This method appends normalizer to a given persisted model.
PLEASE NOTE: File should be model file saved earlier with ModelSerializer
param f
param normalizer
addObjectToFile
Add an object to the (already existing) model file using Java Object Serialization. Objects can be restored using {- link #getObjectFromFile(File, String)}
param f File to add the object to
param key Key to store the object under
param o Object to store using Java object serialization
Simple and sequential network configuration.
The MultiLayerNetwork
class is the simplest network configuration API available in Eclipse Deeplearning4j. This class is useful for beginners or users who do not need a complex and branched network graph.
You will not want to use MultiLayerNetwork
configuration if you are creating complex loss functions, using graph vertices, or doing advanced training such as a triplet network. This includes popular complex networks such as InceptionV4.
The example below shows how to build a simple linear classifier using DenseLayer
(a basic multilayer perceptron layer).
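A minimal sketch; the layer sizes, seed and learning rate are illustrative:

```java
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.nn.weights.WeightInit;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Sgd;
import org.nd4j.linalg.lossfunctions.LossFunctions;

int numInputs = 4, numHidden = 10, numClasses = 3;   // example sizes

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .seed(123)
        .updater(new Sgd(0.1))
        .weightInit(WeightInit.XAVIER)
        .list()
        .layer(new DenseLayer.Builder().nIn(numInputs).nOut(numHidden)
                .activation(Activation.RELU).build())
        .layer(new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                .nIn(numHidden).nOut(numClasses)
                .activation(Activation.SOFTMAX).build())
        .build();

MultiLayerNetwork net = new MultiLayerNetwork(conf);
net.init();
```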
You can also create convolutional configurations:
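For example, a small LeNet-style sketch (reusing the imports from the previous snippet; kernel sizes and layer widths are illustrative):

```java
import org.deeplearning4j.nn.conf.inputs.InputType;
import org.deeplearning4j.nn.conf.layers.ConvolutionLayer;
import org.deeplearning4j.nn.conf.layers.SubsamplingLayer;

MultiLayerConfiguration convConf = new NeuralNetConfiguration.Builder()
        .seed(123)
        .updater(new Sgd(0.01))
        .list()
        .layer(new ConvolutionLayer.Builder(5, 5)
                .nIn(1).nOut(20).stride(1, 1)
                .activation(Activation.RELU).build())
        .layer(new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
                .kernelSize(2, 2).stride(2, 2).build())
        .layer(new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                .nOut(10).activation(Activation.SOFTMAX).build())
        .setInputType(InputType.convolutionalFlat(28, 28, 1))   // lets DL4J infer nIn for later layers
        .build();
```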
Special algorithms for gradient descent.
The main difference among the updaters is how they treat the learning rate. Stochastic Gradient Descent, the most common learning algorithm in deep learning, relies on Theta
(the weights in hidden layers) and alpha
(the learning rate). Different updaters help optimize the learning rate until the neural network converges on its most performant state.
To use the updaters, pass a new updater instance to the updater()
method when building the configuration for a ComputationGraph
or MultiLayerNetwork
.
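A sketch of how an updater is plugged into a configuration builder; the learning rate and momentum values are illustrative:

```java
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.nd4j.linalg.learning.config.Adam;

// Pass an updater instance while building the configuration, e.g. Adam with a 1e-3 learning rate.
// Swapping in new Nesterovs(0.01, 0.9) or new RmsProp(0.001) works the same way.
NeuralNetConfiguration.Builder builder = new NeuralNetConfiguration.Builder()
        .updater(new Adam(1e-3));
```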
The Nadam updater.
applyUpdater
Calculate the update based on the given gradient
param gradient the gradient to get the update for
param iteration
return the gradient
Nesterov’s momentum. Keeps track of the gradient from the previous iteration and uses it when computing the current update.
applyUpdater
Get the nesterov update
param gradient the gradient to get the update for
param iteration
return
RMS Prop updates:
Vectorized Learning Rate used per Connection Weight
applyUpdater
Gets feature specific learning rates Adagrad keeps a history of gradients being passed in. Note that each gradient passed in becomes adapted over time, hence the opName adagrad
param gradient the gradient to get learning rates for
param iteration
applyUpdater
Calculate the update based on the given gradient
param gradient the gradient to get the update for
param iteration
return the gradient
NoOp updater: gradient updater that makes no changes to the gradient
applyUpdater
Calculate the update based on the given gradient
param gradient the gradient to get the update for
param iteration
return the gradient
Ada delta updater. A more robust AdaGrad that keeps track of a moving window average of the gradient rather than the ever-decaying learning rates of AdaGrad
applyUpdater
Get the updated gradient for the given gradient and also update the state of ada delta.
param gradient the gradient to get the updated gradient for
param iteration
return the updated gradient
SGD updater applies a learning rate only
Gradient modifications: Calculates an update and tracks related information for gradient changes over time for handling updates.
Neural word embeddings for NLP in DL4J.
Contents
Word2vec is a two-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep nets can understand.
Word2vec's applications extend beyond parsing sentences in the wild. It can be applied just as well to other discrete, symbolic data (genes, code, playlists, social graphs) in which patterns may be discerned.
Why? Because words are simply discrete states like the other data mentioned above, and we are simply looking for the transitional probabilities between those states: the likelihood that they will co-occur. So gene2vec, like2vec and follower2vec are all possible. With that in mind, the tutorial below will help you understand how to create neural embeddings for any group of discrete and co-occurring states.
The purpose and usefulness of Word2vec is to group the vectors of similar words together in vectorspace. That is, it detects similarities mathematically. Word2vec creates vectors that are distributed numerical representations of word features, features such as the context of individual words. It does so without human intervention.
Given enough data, usage and contexts, Word2vec can make highly accurate guesses about a word’s meaning based on past appearances. Those guesses can be used to establish a word's association with other words (e.g. "man" is to "boy" what "woman" is to "girl"), or cluster documents and classify them by topic. Those clusters can form the basis of search, sentiment analysis and recommendations in such diverse fields as scientific research, legal discovery, e-commerce and customer relationship management.
The output of the Word2vec neural net is a vocabulary in which each item has a vector attached to it, which can be fed into a deep-learning net or simply queried to detect relationships between words.
Here's a list of words associated with "Sweden" using Word2vec, in order of proximity:
The nations of Scandinavia and several wealthy, northern European, Germanic countries are among the top nine.
The vectors we use to represent words are called neural word embeddings, and representations are strange. One thing describes another, even though those two things are radically different. As Elvis Costello said: "Writing about music is like dancing about architecture." Word2vec "vectorizes" about words, and by doing so it makes natural language computer-readable -- we can start to perform powerful mathematical operations on words to detect their similarities.
So a neural word embedding represents a word with numbers. It's a simple, yet unlikely, translation.
It does so in one of two ways, either using context to predict a target word (a method known as continuous bag of words, or CBOW), or using a word to predict a target context, which is called skip-gram. We use the latter method because it produces more accurate results on large datasets.
When the feature vector assigned to a word cannot be used to accurately predict that word's context, the components of the vector are adjusted. Each word's context in the corpus is the teacher sending error signals back to adjust the feature vector. The vectors of words judged similar by their context are nudged closer together by adjusting the numbers in the vector.
Just as Van Gogh's painting of sunflowers is a two-dimensional mixture of oil on canvas that represents vegetable matter in a three-dimensional space in Paris in the late 1880s, so 500 numbers arranged in a vector can represent a word or group of words.
Those numbers locate each word as a point in 500-dimensional vectorspace. Spaces of more than three dimensions are difficult to visualize. (Geoff Hinton, teaching people to imagine 13-dimensional space, suggests that students first picture 3-dimensional space and then say to themselves: "Thirteen, thirteen, thirteen." :)
A well trained set of word vectors will place similar words close to each other in that space. The words oak, elm and birch might cluster in one corner, while war, conflict and strife huddle together in another.
Similar things and ideas are shown to be "close". Their relative meanings have been translated to measurable distances. Qualities become quantities, and algorithms can do their work. But similarity is just the basis of many associations that Word2vec can learn. For example, it can gauge relations between words of one language, and map them to another.
These vectors are the basis of a more comprehensive geometry of words. As shown in the graph, capital cities such as Rome, Paris, Berlin and Beijing cluster near each other, and they will each have similar distances in vectorspace to their countries; i.e. Rome - Italy = Beijing - China. If you only knew that Rome was the capital of Italy, and were wondering about the capital of China, then the equation Rome - Italy + China would return Beijing. No kidding.
Let's look at some other associations Word2vec can produce.
Instead of the pluses, minus and equals signs, we'll give you the results in the notation of logical analogies, where :
means "is to" and ::
means "as"; e.g. "Rome is to Italy as Beijing is to China" = Rome:Italy::Beijing:China
. In the last spot, rather than supplying the "answer", we'll give you the list of words that a Word2vec model proposes, when given the first three elements:
Geopolitics: Iraq - Violence = Jordan
Distinction: Human - Animal = Ethics
President - Power = Prime Minister
Library - Books = Hall
Analogy: Stock Market ≈ Thermometer
By building a sense of one word's proximity to other similar words, which do not necessarily contain the same letters, we have moved beyond hard tokens to a smoother and more general sense of meaning.
Here are Deeplearning4j's natural-language processing components:
SentenceIterator/DocumentIterator: Used to iterate over a dataset. A SentenceIterator returns strings and a DocumentIterator works with inputstreams.
Tokenizer/TokenizerFactory: Used in tokenizing the text. In NLP terms, a sentence is represented as a series of tokens. A TokenizerFactory creates an instance of a tokenizer for a "sentence."
VocabCache: Used for tracking metadata including word counts, document occurrences, the set of tokens (not vocab in this case, but rather tokens that have occurred), vocab (the features included in both bag of words as well as the word vector lookup table)
Inverted Index: Stores metadata about where words occurred. Can be used for understanding the dataset. A Lucene index with the Lucene implementation[1] is automatically created.
Now create and name a new class in Java. After that, you'll take the raw sentences in your .txt file, traverse them with your iterator, and subject them to some sort of preprocessing, such as converting all words to lowercase.
If you want to load a text file besides the sentences provided in our example, you'd do this:
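A sketch; the file path below is a hypothetical placeholder for your own .txt file:

```java
import java.io.File;
import org.deeplearning4j.text.sentenceiterator.LineSentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentencePreProcessor;

// Point the iterator at your own file and lowercase each sentence as it is read
SentenceIterator iter = new LineSentenceIterator(new File("/absolute/path/to/your/file.txt"));
iter.setPreProcessor(new SentencePreProcessor() {
    @Override
    public String preProcess(String sentence) {
        return sentence.toLowerCase();
    }
});
```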
That is, get rid of the ClassPathResource
and feed the absolute path of your .txt
file into the LineSentenceIterator
.
In bash, you can find the absolute file path of any directory by typing pwd
in your command line from within that same directory. To that path, you'll add the file name and voila.
Word2vec needs to be fed words rather than whole sentences, so the next step is to tokenize the data. To tokenize a text is to break it up into its atomic units, creating a new token each time you hit a white space, for example.
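For example, using DL4J's default whitespace tokenizer with a common token preprocessor:

```java
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

// Split sentences on whitespace; strip punctuation and normalize case per token
TokenizerFactory t = new DefaultTokenizerFactory();
t.setTokenPreProcessor(new CommonPreprocessor());
```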
That should give you one word per line.
Now that the data is ready, you can configure the Word2vec neural net and feed in the tokens.
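A sketch of the configuration, reusing the `iter` and `t` objects created above; the hyperparameter values shown are illustrative defaults:

```java
import org.deeplearning4j.models.word2vec.Word2Vec;

Word2Vec vec = new Word2Vec.Builder()
        .minWordFrequency(5)
        .layerSize(100)
        .seed(42)
        .windowSize(5)
        .learningRate(0.025)
        .minLearningRate(0.001)
        .batchSize(1000)
        .useAdaGrad(false)
        .iterate(iter)            // the SentenceIterator created above
        .tokenizerFactory(t)      // the TokenizerFactory created above
        .build();

vec.fit();
```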
This configuration accepts a number of hyperparameters. A few require some explanation:
batchSize is the number of words you process at a time.
minWordFrequency is the minimum number of times a word must appear in the corpus. Here, if it appears less than 5 times, it is not learned. Words must appear in multiple contexts to learn useful features about them. In very large corpora, it's reasonable to raise the minimum.
useAdaGrad - Adagrad creates a different gradient for each feature. Here we are not concerned with that.
layerSize specifies the number of features in the word vector. This is equal to the number of dimensions in the featurespace. Words represented by 500 features become points in a 500-dimensional space.
learningRate is the step size for each update of the coefficients, as words are repositioned in the feature space.
minLearningRate is the floor on the learning rate. Learning rate decays as the number of words you train on decreases. If learning rate shrinks too much, the net's learning is no longer efficient. This keeps the coefficients moving.
iterate tells the net what batch of the dataset it's training on.
tokenizer feeds it the words from the current batch.
vec.fit() tells the configured net to begin training.
The next step is to evaluate the quality of your feature vectors.
The line vec.similarity("word1","word2")
will return the cosine similarity of the two words you enter. The closer it is to 1, the more similar the net perceives those words to be (see the Sweden-Norway example above). For example:
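```java
double cosSim = vec.similarity("day", "night");
System.out.println(cosSim);   // closer to 1.0 means the net considers the words more similar
```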
With vec.wordsNearest("word1", numWordsNearest)
, the words printed to the screen allow you to eyeball whether the net has clustered semantically similar words. You can set the number of nearest words you want with the second parameter of wordsNearest. For example:
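```java
import java.util.Collection;

Collection<String> lst = vec.wordsNearest("day", 10);   // the 10 nearest words to "day"
System.out.println(lst);
```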
You'll want to save the model. The normal way to save models in Deeplearning4j is via the serialization utils (Java serialization is akin to Python pickling, converting an object into a series of bytes).
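A sketch using the text-format writer (method names vary slightly between DL4J versions):

```java
import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;

// Write the learned vectors to a plain-text file, one word per line
WordVectorSerializer.writeWordVectors(vec, "pathToSaveModel.txt");
```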
This will save the vectors to a file called pathToSaveModel.txt
that will appear in the root of the directory where Word2vec is trained. The output in the file should have one word per line, followed by a series of numbers that together are its vector representation.
To keep working with the vectors, simply call methods on vec
like this:
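```java
import java.util.Arrays;
import java.util.Collection;

// positive words, negative words, number of nearest words to return
Collection<String> kingList = vec.wordsNearest(
        Arrays.asList("king", "woman"), Arrays.asList("queen"), 10);
```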
The classic example of Word2vec's arithmetic of words is "king - queen = man - woman" and its logical extension "king - queen + woman = man".
The example above will output the 10 nearest words to the vector king - queen + woman
, which should include man
. The first parameter for wordsNearest has to include the "positive" words king
and woman
, which have a + sign associated with them; the second parameter includes the "negative" word queen
, which is associated with the minus sign (positive and negative here have no emotional connotation); the third is the length of the list of nearest words you would like to see. Remember to add this to the top of the file: import java.util.Arrays;
.
Any number of combinations is possible, but they will only return sensible results if the words you query occurred with enough frequency in the corpus. Obviously, the ability to return similar words (or documents) is at the foundation of both search and recommendation engines.
You can reload the vectors into memory like this:
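A sketch; depending on your DL4J version, `loadTxtVectors(...)` may be the appropriate call instead:

```java
import java.io.File;
import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.word2vec.Word2Vec;

Word2Vec word2Vec = WordVectorSerializer.readWord2VecModel(new File("pathToSaveModel.txt"));
```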
You can then use Word2vec as a lookup table:
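```java
double[] wordVector = word2Vec.getWordVector("myword");   // all zeros if "myword" is out of vocabulary
```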
If the word isn't in the vocabulary, Word2vec returns zeros.
Remember to add import java.io.File;
to your imported packages.
Words are read into the vector one at a time, and scanned back and forth within a certain range. Those ranges are n-grams, and an n-gram is a contiguous sequence of n items from a given linguistic sequence; it is the nth version of unigram, bigram, trigram, four-gram or five-gram. A skip-gram simply drops items from the n-gram.
The skip-gram representation popularized by Mikolov and used in the DL4J implementation has proven to be more accurate than other models, such as continuous bag of words, due to the more generalizable contexts generated.
This n-gram is then fed into a neural network to learn the significance of a given word vector; i.e. significance is defined as its usefulness as an indicator of certain larger meanings, or labels.
Q: I get a lot of stack traces like this
A: Look inside the directory where you started your Word2vec application. This can, for example, be an IntelliJ project home directory or the directory where you typed Java at the command line. It should have some directories that look like:
You can shut down your Word2vec application and try to delete them.
Q: Not all of the words from my raw text data are appearing in my Word2vec object…
A: Try to raise the layer size via .layerSize() on your Word2Vec object, like so:
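A sketch only; the other builder options are the same as in the configuration example above:

```java
Word2Vec vec = new Word2Vec.Builder()
        .layerSize(300)   // raise the vector dimensionality (the example above used 100)
        // ... same iterator/tokenizer options as before ...
        .build();
```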
Q: How do I load my data? Why does training take forever?
A: If all of your sentences have been loaded as one sentence, Word2vec training could take a very long time. That's because Word2vec is a sentence-level algorithm, so sentence boundaries are very important: co-occurrence statistics are gathered sentence by sentence. (For GloVe, sentence boundaries don't matter, because it looks at corpus-wide co-occurrence.) For many corpora, average sentence length is six words. That means that with a window size of 5 you have, say, 30 (random number here) rounds of skip-gram calculations. If you forget to specify your sentence boundaries, you may load a "sentence" that's 10,000 words long. In that case, Word2vec would attempt a full skip-gram cycle for the whole 10,000-word "sentence". In DL4J's implementation, a line is assumed to be a sentence. You need to plug in your own SentenceIterator and Tokenizer. By asking you to specify how your sentences end, DL4J remains language-agnostic. UimaSentenceIterator is one way to do that; it uses OpenNLP for sentence boundary detection.
Q: Why is there such a difference in performance when feeding whole documents as one "sentence" vs splitting into Sentences?
A: If the average sentence contains 6 words and the window size is 5, no word ever reaches the theoretical maximum of 10 skip-gram rounds: the sentence simply isn't long enough to fill the full window around any word. Roughly 5 skip-gram rounds per word is the most you can get in such a sentence.
But if your "sentence" is 1,000,000 words long, you'll have 10 skip-gram rounds for every word in it, excluding the first five and the last five. So you'll have to spend far more time building the model, and the co-occurrence statistics will be skewed by the absence of sentence boundaries.
Q: How does Word2Vec Use Memory?
A: The major memory consumer in Word2vec is the weights matrix. The math is simple: NumberOfWords x NumberOfDimensions x 2 x DataTypeSize.
So, if you build a Word2vec model for 100k words using floats and 100 dimensions, your memory footprint will be 100k x 100 x 2 x 4 (float size) = 80MB of RAM just for the matrix, plus some space for strings, variables, threads etc.
If you load a pre-built model, it uses roughly half as much RAM as during build time, so about 40MB of RAM.
The most popular model used so far is the Google News model. It has 3M words and a vector size of 300, which gives 3.6GB just to load the model. On top of that you have to add 3M strings, which do not have a constant size in Java. So a loaded model usually takes around 4-6GB, depending on the JVM version/vendor, GC state and the phase of the moon.
Q: I did everything you said and the results still don't look right.
A: Make sure you're not running into normalization issues. Some tasks, like wordsNearest(), use normalized weights by default, and others require non-normalized weights. Pay attention to this difference.
Word2Vec is especially useful in preparing text-based data for information retrieval and QA systems, which DL4J implements with deep autoencoders.
Marketers might seek to establish relationships among products to build a recommendation engine. Investigators might analyze a social graph to surface members of a single group, or other relations they might have to location or financial sponsorship.
Loading and saving GloVe models to word2vec can be done like so:
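A sketch for loading GloVe vectors stored in the standard text format; the file name below is hypothetical:

```java
import java.io.File;
import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.embeddings.wordvectors.WordVectors;

WordVectors gloveVectors = WordVectorSerializer.loadTxtVectors(new File("glove.6B.50d.txt"));
```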
Support for updating weights after model serialization/deserialization has been added. That is, you can update the model state with, say, 200GB of new text by calling loadFullModel
, adding TokenizerFactory
and SentenceIterator
to it, and calling fit()
on the restored model.
An option to use multiple data sources for vocab construction has been added.
Epochs and Iterations can be specified separately, although they are both typically "1".
Word2Vec.Builder has this option: hugeModelExpected
. If set to true
, the vocab will be periodically truncated during the build.
While minWordFrequency
is useful for ignoring rare words in the corpus, any number of specific words can also be excluded for customization.
Two new WordVectorSerializer methods have been introduced: writeFullModel
and loadFullModel
. These save and load a full model state.
A decent workstation should be able to handle a vocab with a few million words. Deeplearning4j's Word2vec implementation can model a few terabytes of data on a single machine. Roughly, the math is: vectorSize * 4 * 3 * vocab.size()
.
Adding hooks and listeners on DL4J models.
Listeners allow users to "hook" into certain events in Eclipse Deeplearning4j. This allows you to collect or print information useful for tasks like training. For example, a ScoreIterationListener
allows you to print training scores from the output layer of a neural network.
To add one or more listeners to a MultiLayerNetwork
or ComputationGraph
, use the addListeners
method:
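A sketch, assuming an existing configuration `conf` (older releases expose the equivalent setListeners call instead):

```java
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.optimize.listeners.ScoreIterationListener;

MultiLayerNetwork net = new MultiLayerNetwork(conf);
net.init();
net.addListeners(new ScoreIterationListener(10));   // print the score every 10 iterations
```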
This TrainingListener implementation provides a simple way to evaluate a model during training. It can be launched every Xth iteration/epoch, depending on the frequency and InvocationType constructor arguments
EvaluativeListener
This callback will be invoked after evaluation finished
iterationDone
param iterator Iterator to provide data for evaluation
param frequency Frequency (in number of iterations/epochs according to the invocation type) to perform evaluation
param type Type of value for ‘frequency’ - iteration end, epoch end, etc
Score iteration listener. Reports the score (value of the loss function) of the network during training every N iterations
ScoreIterationListener
param printIterations frequency with which to print scores (i.e., every printIterations parameter updates)
A group of listeners
CollectScoresIterationListener simply stores the model scores internally (along with the iteration) every 1 or N iterations (this is configurable). These scores can then be obtained or exported.
CollectScoresIterationListener
Constructor for collecting scores with default saving frequency of 1
iterationDone
Constructor for collecting scores with the specified frequency.
param frequency Frequency with which to collect/save scores
exportScores
Export the scores in tab-delimited (one per line) UTF-8 format.
exportScores
Export the scores in delimited (one per line) UTF-8 format with the specified delimiter
param outputStream Stream to write to
param delimiter Delimiter to use
exportScores
Export the scores to the specified file in delimited (one per line) UTF-8 format, tab delimited
param file File to write to
exportScores
Export the scores to the specified file in delimited (one per line) UTF-8 format, using the specified delimiter
param file File to write to
param delimiter Delimiter to use for writing scores
CheckpointListener: the goal of this listener is to periodically save a copy of the model during training. Model saving may be done:
Every N epochs
Every N iterations
Every T time units (every 15 minutes, for example)
Or some combination of the three.
Example 1: Saving a checkpoint every 2 epochs, keeping all model files
Example 2: Saving a checkpoint every 1000 iterations, but keeping only the last 3 models (all older model files will be automatically deleted)
Example 3: Saving a checkpoint every 15 minutes, keeping the most recent 3 and otherwise every 4th checkpoint file:
Note that you can mix these: for example, you can save every epoch and every 15 minutes (independent of the last save time), or save every epoch and every 15 minutes since the last model save. Note that in this last case, the sinceLast parameter is true, which means the 15-minute counter is reset any time a model is saved.
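A sketch roughly matching example 2 above; the builder method names and the save directory are assumptions that may vary by DL4J version, and `net` is an already-initialized network:

```java
import java.io.File;
import org.deeplearning4j.optimize.listeners.CheckpointListener;

// Save every 1000 iterations, keeping only the last 3 checkpoints on disk
CheckpointListener checkpoints = new CheckpointListener.Builder(new File("/save/dir"))
        .saveEveryNIterations(1000)
        .keepLast(3)
        .build();
net.addListeners(checkpoints);
```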
CheckpointListener
List all available checkpoints. A checkpoint is ‘available’ if the file can be loaded. Any checkpoint files that have been automatically deleted (given the configuration) will not be returned here.
return List of checkpoint files that can be loaded
This TrainingListener implementation provides a way to “sleep” during specific Neural Network training phases. Suitable for debugging/testing purposes only.
PLEASE NOTE: All timers treat time values as milliseconds. PLEASE NOTE: Do not use it in a production environment.
onEpochStart
In this mode, a parkNanos() call will be used to make the process truly idle
A simple listener that collects scores to a list every N iterations. Can also optionally log the score.
Simple IterationListener that tracks time spent on training per iteration.
PerformanceListener
This method defines, if iteration number should be reported together with other data
param reportIteration
return
An iteration listener that provides details on parameters and gradients at each iteration during training. It attempts to provide much of the same information as the UI histogram iteration listener, but in a text-based format (for example, when learning on a system accessed via SSH), and is intended to aid network tuning and debugging. This iteration listener calculates the mean, min, max, and mean absolute value of each type of parameter and gradient in the network at each iteration.
Time Iteration Listener. This listener logs (at INFO level) the remaining time in minutes and the estimated end date of the process. Remaining time is estimated from the training time so far and the total number of iterations specified by the user.
TimeIterationListener
Constructor
param iterationCount The global number of iteration for training (all epochs)
Computation graph nodes for advanced configuration.
In Eclipse Deeplearning4j a vertex is a type of layer that acts as a node in a ComputationGraph
. It can accept multiple inputs, provide multiple outputs, and can help construct popular networks such as InceptionV4.
L2NormalizeVertex performs L2 normalization on a single input.
L2Vertex calculates the L2 least squares error of two inputs.
For example, in Triplet Embedding you can input an anchor and a pos/neg class and use two parallel L2 vertices to calculate two real numbers which can be fed into a LossLayer to calculate TripletLoss.
A custom layer for removing the first column and row from an input. This is meant to allow importation of Caffe’s GoogLeNet.
Adds the ability to reshape and flatten the tensor in the computation graph. This is the equivalent to the next layer. ReshapeVertex also ensures the shape is valid for the backward pass.
A ScaleVertex is used to scale the size of activations of a single layer For example, ResNet activations can be scaled in repeating blocks to keep variance under control.
A ShiftVertex is used to shift the activations of a single layer. One could use it to add a bias or as part of some other calculation. For example, Highway Layers need them in two places. First, it is often useful to give the gate weights a large negative bias (of course, for this we could just initialize the biases that way). But the layer also needs to compute (1 - sigmoid(W·x + b)) ⊙ x + sigmoid(W·x + b) ⊙ activation(W2·x + b2), where ⊙ is the Hadamard (element-wise) product. So, here, we could have
a DenseLayer that does the sigmoid
a ScaleVertex(-1) and
a ShiftVertex(1) to accomplish that.
StackVertex allows for stacking of inputs so that they may be forwarded through a network. This is useful for cases such as Triplet Embedding, where shared parameters are not supported by the network.
This vertex will automatically stack all available inputs.
UnstackVertex allows for unstacking of inputs so that they may be forwarded through a network. This is useful for cases such as Triplet Embedding, where embeddings can be separated and run through subsequent layers.
Works similarly to SubsetVertex, except on dimension 0 of the input. stackSize is explicitly defined by the user to properly calculate each step.
ReverseTimeSeriesVertex is used in recurrent neural networks to reverse the order of a time series. As a result, the last time step is moved to the beginning of the time series and the first time step is moved to the end. This allows recurrent layers to process time series backwards.
Masks: The input might be masked (to allow for varying time series lengths in one minibatch). In this case the present input (mask array = 1) will be reversed in place and the padding (mask array = 0) will be left untouched at the same place. For a time series of length n, this would normally mean that the first n time steps are reversed and the following padding is left untouched, but more complex masks are supported (e.g. [1, 0, 1, 0, …]).
setBackpropGradientsViewArray
Gets the current mask array from the provided input
return The mask or null, if no input was provided
Recurrent Neural Network (RNN) implementations in DL4J.
This document outlines the specific training features for recurrent neural networks and the practicalities of how to use them in DeepLearning4J. It is not an introduction to recurrent neural networks; it assumes some familiarity with both their use and their terminology.
DL4J currently supports the following types of recurrent neural network:
RNN ("vanilla" RNN)
LSTM (Long Short-Term Memory)
Java documentation for each is available.
Consider for the moment a standard feed-forward network (a multi-layer perceptron or 'DenseLayer' in DL4J). These networks expect input and output data that is two-dimensional: that is, data with "shape" [numExamples,inputSize]. This means that the data into a feed-forward network has ‘numExamples’ rows/examples, where each row consists of ‘inputSize’ columns. A single example would have shape [1,inputSize], though in practice we generally use multiple examples for computational and optimization efficiency. Similarly, output data for a standard feed-forward network is also two dimensional, with shape [numExamples,outputSize].
Conversely, data for RNNs are time series. Thus, they have 3 dimensions: one additional dimension for time. Input data thus has shape [numExamples,inputSize,timeSeriesLength], and output data has shape [numExamples,outputSize,timeSeriesLength]. This means that the data in our INDArray is laid out such that the value at position (i,j,k) is the jth value at the kth time step of the ith example in the minibatch. This data layout is shown below.
When importing time series data using the class CSVSequenceRecordReader, each line in the data files represents one time step, with the earliest observation in the first row (or the first row after the header, if present) and the most recent observation in the last row of the csv. Each feature time series is a separate column of the csv file. For example, if you have five features in a time series, each with 120 observations, and a training & test set of size 53, then there will be 106 csv files (53 input, 53 labels). The 53 input csv files will each have five columns and 120 rows. The label csv files will have one column (the label) and one row.
RnnOutputLayer is a type of layer used as the final layer with many recurrent neural network systems (for both regression and classification tasks). RnnOutputLayer handles things like score calculation, and error calculation (of prediction vs. actual) given a loss function etc. Functionally, it is very similar to the 'standard' OutputLayer class (which is used with feed-forward networks); however it both outputs (and expects as labels/targets) 3d time series data sets.
Configuration for the RnnOutputLayer follows the same design as other layers: for example, to set the third layer in a MultiLayerNetwork to a RnnOutputLayer for classification:
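A sketch of such a configuration; the layer widths and the LSTM layers preceding the output layer are illustrative:

```java
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.LSTM;
import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.lossfunctions.LossFunctions;

int nIn = 5;     // number of input features per time step (illustrative)
int nOut = 3;    // number of output classes (illustrative)

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .list()
        .layer(0, new LSTM.Builder().nIn(nIn).nOut(100).activation(Activation.TANH).build())
        .layer(1, new LSTM.Builder().nIn(100).nOut(100).activation(Activation.TANH).build())
        .layer(2, new RnnOutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                .activation(Activation.SOFTMAX)
                .nIn(100).nOut(nOut).build())
        .build();
```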
Use of RnnOutputLayer in practice can be seen in the examples, linked at the end of this document.
Training neural networks (including RNNs) can be quite computationally demanding. For recurrent neural networks, this is especially the case when we are dealing with long sequences - i.e., training data with many time steps.
Truncated backpropagation through time (BPTT) was developed in order to reduce the computational complexity of each parameter update in a recurrent neural network. In summary, it allows us to train networks faster (by performing more frequent parameter updates), for a given amount of computational power. It is recommended to use truncated BPTT when your input sequences are long (typically, more than a few hundred time steps).
Consider what happens when training a recurrent neural network with a time series of length 12 time steps. Here, we need to do a forward pass of 12 steps, calculate the error (based on predicted vs. actual), and do a backward pass of 12 time steps:
For 12 time steps, in the image above, this is not a problem. Consider, however, that instead the input time series was 10,000 or more time steps. In this case, standard backpropagation through time would require 10,000 time steps for each of the forward and backward passes for each and every parameter update. This is of course very computationally demanding.
In practice, truncated BPTT splits the forward and backward passes into a set of smaller forward/backward pass operations. The specific length of these forward/backward pass segments is a parameter set by the user. For example, if we use truncated BPTT of length 4 time steps, learning looks like the following:
Note that the overall complexity for truncated BPTT and standard BPTT is approximately the same - both do the same number of time steps during the forward/backward pass. Using this method, however, we get 3 parameter updates instead of one for approximately the same amount of effort. However, the cost is not exactly the same: there is a small amount of overhead per parameter update.
The downside of truncated BPTT is that the length of the dependencies learned in truncated BPTT can be shorter than in full BPTT. This is easy to see: consider the images above, with a TBPTT length of 4. Suppose that at time step 10, the network needs to store some information from time step 0 in order to make an accurate prediction. In standard BPTT, this is ok: the gradients can flow backwards all the way along the unrolled network, from time 10 to time 0. In truncated BPTT, this is problematic: the gradients from time step 10 simply don't flow back far enough to cause the required parameter updates that would store the required information. This tradeoff is usually worth it, and (as long as the truncated BPTT lengths are set appropriately), truncated BPTT works well in practice.
Using truncated BPTT in DL4J is quite simple: just add the following to your network configuration (at the end, before the final .build()):
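A sketch of the relevant builder fragment (not a complete configuration on its own):

```java
// At the end of the configuration, just before the final .build():
.backpropType(BackpropType.TruncatedBPTT)
.tBPTTLength(100)
```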
The above code snippet will cause any network training (i.e., calls to MultiLayerNetwork.fit() methods) to use truncated BPTT with segments of length 100 steps.
Some things of note:
By default (if a backprop type is not manually specified), DL4J will use BackpropType.Standard (i.e., full BPTT).
The tBPTTLength configuration parameter sets the length of the truncated BPTT passes. Typically, this is somewhere on the order of 50 to 200 time steps, though this depends on the application and data.
The truncated BPTT length is typically a fraction of the total time series length (e.g., 200 vs. sequence length 1000). Variable length time series in the same minibatch are OK when using TBPTT (for example, a minibatch with two sequences - one of length 100 and another of length 1000 - with a TBPTT length of 200 will work correctly)
DL4J supports a number of related training features for RNNs, based on the idea of padding and masking. Padding and masking allow us to support training situations such as one-to-many and many-to-one, as well as variable length time series (in the same mini-batch).
Suppose we want to train a recurrent neural network with inputs or outputs that don't occur at every time step. Examples of this (for a single example) are shown in the image below. DL4J supports training networks for all of these situations:
Without masking and padding, we are restricted to the many-to-many case (above, left): that is, (a) All examples are of the same length, and (b) Examples have both inputs and outputs at all time steps.
The idea behind padding is simple. Consider two time series of lengths 50 and 100 time steps, in the same mini-batch. The training data is a rectangular array; thus, we pad (i.e., add zeros to) the shorter time series (for both input and output), such that the input and output are both the same length (in this example: 100 time steps).
Of course, if this was all we did, it would cause problems during training. Thus, in addition to padding, we use a masking mechanism. The idea behind masking is simple: we have two additional arrays that record whether an input or output is actually present for a given time step and example, or whether the input/output is just padding.
Recall that with RNNs, our minibatch data has 3 dimensions, with shape [miniBatchSize,inputSize,timeSeriesLength] and [miniBatchSize,outputSize,timeSeriesLength] for the input and output respectively. The mask arrays are then 2 dimensional, with shape [miniBatchSize,timeSeriesLength] for both the input and output, with values of 0 ('absent') or 1 ('present') for each time step and example. The masking arrays for the input and output are stored in separate arrays.
For a single example, the input and output masking arrays are shown below:
For the “Masking not required” cases, we could equivalently use a masking array of all 1s, which will give the same result as not having a mask array at all. Also note that it is possible to use zero, one or two masking arrays when learning RNNs - for example, the many-to-one case could have a masking array for the output only.
In practice: these padding arrays are generally created during the data import stage (for example, by the SequenceRecordReaderDatasetIterator – discussed later), and are contained within the DataSet object. If a DataSet contains masking arrays, the MultiLayerNetwork fit will automatically use them during training. If they are absent, no masking functionality is used.
Mask arrays are also important when doing scoring and evaluation (i.e., when evaluating the accuracy of an RNN classifier). Consider for example the many-to-one case: there is only a single output for each example, and any evaluation should take this into account.
The (output) mask arrays can be used during evaluation by passing them to the following method:
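A sketch; here `labels`, `predicted` and `outputMask` are the arrays described just below:

```java
import org.deeplearning4j.eval.Evaluation;

Evaluation evaluation = new Evaluation();
evaluation.evalTimeSeries(labels, predicted, outputMask);
System.out.println(evaluation.stats());
```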
where labels are the actual output (3d time series), predicted is the network predictions (3d time series, same shape as labels), and outputMask is the 2d mask array for the output. Note that the input mask array is not required for evaluation.
Score calculation will also make use of the mask arrays, via the MultiLayerNetwork.score(DataSet) method. Again, if the DataSet contains an output masking array, it will automatically be used when calculating the score (loss function - mean squared error, negative log likelihood etc) for the network.
Sequence classification is one common use of masking. The idea is that although we have a sequence (time series) as input, we only want to provide a single label for the entire sequence (rather than one label at each time step in the sequence).
However, RNNs by design output sequences of the same length as the input sequence. For sequence classification, masking allows us to train the network with this single label at the final time step - we essentially tell the network that there isn't actually any label data anywhere except for the last time step.
Now, suppose we've trained our network, and want to get the last time step for predictions, from the time series output array. How do we do that?
To get the last time step, there are two cases to be aware of. First, when we have a single example, we don't actually need to use the mask arrays: we can just get the last time step in the output array:
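A sketch, assuming a trained network `net` and a features array `timeSeriesFeatures` for a single example:

```java
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.indexing.NDArrayIndex;

INDArray timeSeriesOutput = net.output(timeSeriesFeatures);     // shape [1, nOut, timeSeriesLength]
int timeSeriesLength = (int) timeSeriesOutput.size(2);          // size of the time dimension
INDArray lastTimeStepProbabilities = timeSeriesOutput.get(
        NDArrayIndex.point(0), NDArrayIndex.all(), NDArrayIndex.point(timeSeriesLength - 1));
```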
Assuming classification (same process for regression, however) the last line above gives us probabilities at the last time step - i.e., the class probabilities for our sequence classification.
The slightly more complex case is when we have multiple examples in the one minibatch (features array), where the lengths of each example differ. (If all are the same length: we can use the same process as above).
In this 'variable length' case, we need to get the last time step for each example separately. If we have the time series lengths for each example from our data pipeline, it becomes straightforward: we just iterate over examples, replacing the timeSeriesLength
in the above code with the length of that example.
If we don't have the lengths of the time series directly, we need to extract them from the mask array.
If we have a labels mask array (which is a one-hot vector, like [0,0,0,1,0] for each time series):
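```java
// Index of the single 1 in each row of the labels mask = last (and only) labelled time step
INDArray lastTimeStepIndices = Nd4j.argMax(labelsMaskArray, 1);
```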
Alternatively, if we have only the features mask, one quick and dirty approach is to use this:
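```java
int longestTimeSeries = (int) featuresMaskArray.size(1);
INDArray linspace = Nd4j.linspace(1, longestTimeSeries, longestTimeSeries);
INDArray temp = featuresMaskArray.mulRowVector(linspace);       // e.g. [1,1,1,1,0] -> [1,2,3,4,0]
INDArray lastTimeStepIndices = Nd4j.argMax(temp, 1);            // position of the last non-zero entry
```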
To understand what is happening here, note that originally we have a features mask like [1,1,1,1,0], from which we want to get the last non-zero element. So we map [1,1,1,1,0] -> [1,2,3,4,0], and then get the largest element (which is the last time step).
In either case, we can then do the following:
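A sketch, reusing `timeSeriesOutput` and `lastTimeStepIndices` from the snippets above:

```java
int numExamples = (int) featuresMaskArray.size(0);
for (int i = 0; i < numExamples; i++) {
    int lastIdx = lastTimeStepIndices.getInt(i);
    // Class probabilities for example i at its own final time step:
    INDArray thisExampleProbabilities = timeSeriesOutput.get(
            NDArrayIndex.point(i), NDArrayIndex.all(), NDArrayIndex.point(lastIdx));
}
```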
RNN layers in DL4J can be combined with other layer types. For example, it is possible to combine DenseLayer and LSTM layers in the same network; or combine Convolutional (CNN) layers and LSTM layers for video.
For example, to manually add a preprocessor between layers 1 and 2, add the following to your network configuration: .inputPreProcessor(2, new RnnToFeedForwardPreProcessor())
.
As with other types of neural networks, predictions can be generated for RNNs using the MultiLayerNetwork.output()
and MultiLayerNetwork.feedForward()
methods. These methods can be useful in many circumstances; however, they have the limitation that we can only generate predictions for time series, starting from scratch each and every time.
Consider for example the case where we want to generate predictions in a real-time system, where these predictions are based on a very large amount of history. In this case, it is impractical to use the output/feedForward methods, as they conduct the full forward pass over the entire data history each time they are called. If we wish to make a prediction for a single time step, at every time step, these methods can be both (a) very costly, and (b) wasteful, as they do the same calculations over and over.
For these situations, MultiLayerNetwork provides four methods of note:
rnnTimeStep(INDArray)
rnnClearPreviousState()
rnnGetPreviousState(int layer)
rnnSetPreviousState(int layer, Map<String,INDArray> state)
The rnnTimeStep() method is designed to allow forward pass (predictions) to be conducted efficiently, one or more steps at a time. Unlike the output/feedForward methods, the rnnTimeStep method keeps track of the internal state of the RNN layers when it is called. It is important to note that output for the rnnTimeStep and the output/feedForward methods should be identical (for each time step), whether we make these predictions all at once (output/feedForward) or whether these predictions are generated one or more steps at a time (rnnTimeStep). Thus, the only difference should be the computational cost.
In summary, the MultiLayerNetwork.rnnTimeStep() method does two things:
Generate output/predictions (forward pass), using the previous stored state (if any)
Update the stored state, storing the activations for the last time step (ready to be used next time rnnTimeStep is called)
For example, suppose we want to use a RNN to predict the weather, one hour in advance (based on the weather at say the previous 100 hours as input). If we were to use the output method, at each hour we would need to feed in the full 100 hours of data to predict the weather for hour 101. Then to predict the weather for hour 102, we would need to feed in the full 100 (or 101) hours of data; and so on for hours 103+.
Alternatively, we could use the rnnTimeStep method. Of course, if we want to use the full 100 hours of history before we make our first prediction, we still need to do the full forward pass:
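A sketch of this pattern (variable names such as `hours1To100` and `hour101Data` are hypothetical; the arrays come from your data pipeline):

```java
net.rnnClearPreviousState();                      // start from a blank state
// hours1To100: shape [1, nIn, 100] - the full history
INDArray predictionHour101 = net.rnnTimeStep(hours1To100);

// Later: to predict hour 102, feed in only the single new time step.
// hour101Data: shape [1, nIn] (or [1, nIn, 1]) - the stored state is used automatically
INDArray predictionHour102 = net.rnnTimeStep(hour101Data);
```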
The first time we call rnnTimeStep, the only practical difference between the two approaches is that the activations/state of the last time step are stored - this is shown in orange. However, the next time we use the rnnTimeStep method, this stored state will be used to make the next predictions:
There are a number of important differences here:
In the second image (second call of rnnTimeStep) the input data consists of a single time step, instead of the full history of data
The forward pass is thus a single time step (as compared to the hundreds – or more)
After the rnnTimeStep method returns, the internal state will automatically be updated. Thus, predictions for time 103 could be made in the same way as for time 102. And so on.
However, if you want to start making predictions for a new (entirely separate) time series: it is necessary (and important) to manually clear the stored state, using the MultiLayerNetwork.rnnClearPreviousState()
method. This will reset the internal state of all recurrent layers in the network.
If you need to store or set the internal state of the RNN for use in predictions, you can use the rnnGetPreviousState and rnnSetPreviousState methods, for each layer individually. This can be useful for example during serialization (network saving/loading), as the internal network state from the rnnTimeStep method is not saved by default, and must be saved and loaded separately. Note that these get/set state methods return and accept a map, keyed by the type of activation. For example, in the LSTM model, it is necessary to store both the output activations, and the memory cell state.
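A minimal sketch of saving and restoring the state of a single layer (using layer index 0 here is an assumption):

```java
// The map is keyed by activation type; for LSTM it contains both the
// output activations and the memory cell state
Map<String, INDArray> layer0State = net.rnnGetPreviousState(0);
// ... serialize the map alongside the network, then after loading:
net.rnnSetPreviousState(0, layer0State);
```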
Some other points of note:
We can use the rnnTimeStep method for multiple independent examples/predictions simultaneously. In the weather example above, we might for example want to make predictions for multiple locations using the same neural network. This works in the same way as training and the forward pass / output methods: multiple rows (dimension 0 in the input data) are used for multiple examples.
If no history/stored state is set (i.e., initially, or after a call to rnnClearPreviousState), a default initialization (zeros) is used. This is the same approach as during training.
The rnnTimeStep can be used for an arbitrary number of time steps simultaneously – not just one time step. However, it is important to note:
For a single time step prediction: the data is 2 dimensional, with shape [numExamples,nIn]; in this case, the output is also 2 dimensional, with shape [numExamples,nOut]
For multiple time step predictions: the data is 3 dimensional, with shape [numExamples,nIn,numTimeSteps]; the output will have shape [numExamples,nOut,numTimeSteps]. Again, the final time step activations are stored as before.
It is not possible to change the number of examples between calls of rnnTimeStep (in other words, if the first use of rnnTimeStep is for say 3 examples, all subsequent calls must be with 3 examples). After resetting the internal state (using rnnClearPreviousState()), any number of examples can be used for the next call of rnnTimeStep.
The rnnTimeStep method makes no changes to the parameters; it is intended for use only after training of the network has been completed.
The rnnTimeStep method works with networks containing single and stacked/multiple RNN layers, as well as with networks that combine other layer types (such as Convolutional or Dense layers).
The RnnOutputLayer layer type does not have any internal state, as it does not have any recurrent connections.
Data import for RNNs is complicated by the fact that we have multiple different types of data we could want to use for RNNs: one-to-many, many-to-one, variable length time series, etc. This section will describe the currently implemented data import mechanisms for DL4J.
The methods described here utilize the SequenceRecordReaderDataSetIterator class, in conjunction with the CSVSequenceRecordReader class from DataVec. This approach currently allows you to load delimited (tab, comma, etc) data from files, where each time series is in a separate file. This method also supports:
Variable length time series input
One-to-many and many-to-one data loading (where input and labels are in different files)
Label conversion from an index to a one-hot representation for classification (i.e., '2' to [0,0,1,0])
Skipping a fixed/specified number of rows at the start of the data files (i.e., comment or header rows)
Note that in all cases, each line in the data files represents one time step.
Suppose we have 10 time series in our training data, represented by 20 files: 10 files for the input of each time series, and 10 files for the output/labels. For now, assume these 20 files all contain the same number of time steps (i.e., same number of rows).
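A sketch of the reader setup for this case (the file names and paths are assumptions, matching the naming used below):

```java
// Skip 1 header row; fields are comma delimited
SequenceRecordReader featureReader = new CSVSequenceRecordReader(1, ",");
SequenceRecordReader labelReader = new CSVSequenceRecordReader(1, ",");
featureReader.initialize(new NumberedFileInputSplit("/path/to/myInput_%d.csv", 0, 9));
labelReader.initialize(new NumberedFileInputSplit("/path/to/myLabels_%d.csv", 0, 9));
```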
This particular constructor takes the number of lines to skip (1 row skipped here), and the delimiter (comma character used here).
In this particular approach, the "%d" is replaced by the corresponding number, and the numbers 0 to 9 (both inclusive) are used.
Finally, we can create our SequenceRecordReaderDataSetIterator:
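A sketch, reusing the two readers created above (the class count is an arbitrary assumption):

```java
int miniBatchSize = 5;
int numPossibleLabels = 2;     // number of classes - assumption for this sketch
boolean regression = false;
DataSetIterator iter = new SequenceRecordReaderDataSetIterator(featureReader, labelReader,
        miniBatchSize, numPossibleLabels, regression);
```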
This DataSetIterator can then be passed to MultiLayerNetwork.fit() to train the network.
The miniBatchSize argument specifies the number of examples (time series) in each minibatch. For example, with 10 time series in total, a miniBatchSize of 5 would give us two minibatches (DataSet objects), each containing 5 time series.
Note that:
For classification problems: numPossibleLabels is the number of classes in your data set. Use regression = false.
Labels data: one value per line, as a class index
Label data will be converted to a one-hot representation automatically
For regression problems: numPossibleLabels is not used (set it to anything) and use regression = true.
The number of values in the input and labels can be anything (unlike classification, regression can have an arbitrary number of outputs)
No processing of the labels is done when regression = true
Following on from the last example, suppose that instead of separate files for our input data and labels, we have both in the same file. However, each time series is still in a separate file.
As of DL4J 0.4-rc3.8, this approach has the restriction of a single column for the output (either a class index, or a single real-valued regression output)
In this case, we create and initialize a single reader. Again, we are skipping one header row, and specifying the format as comma delimited, and assuming our data files are named "myData_0.csv", ..., "myData_9.csv":
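A sketch for the classification case (the paths, class count and label column are assumptions):

```java
SequenceRecordReader reader = new CSVSequenceRecordReader(1, ",");
reader.initialize(new NumberedFileInputSplit("/path/to/myData_%d.csv", 0, 9));

int miniBatchSize = 5;
int numPossibleLabels = 2;     // number of classes - assumption
int labelIndex = 4;            // labels in the fifth column - assumption
DataSetIterator iterClassification = new SequenceRecordReaderDataSetIterator(reader,
        miniBatchSize, numPossibleLabels, labelIndex, false);
```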
The miniBatchSize and numPossibleLabels arguments are the same as in the previous example. Here, labelIndex specifies which column the labels are in. For example, if the labels are in the fifth column, use labelIndex = 4 (i.e., columns are indexed 0 to numColumns-1).
For regression on a single output value, we use:
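A sketch, reusing the reader from above:

```java
// numPossibleLabels is ignored when regression == true
DataSetIterator iterRegression = new SequenceRecordReaderDataSetIterator(reader,
        miniBatchSize, -1, labelIndex, true);
```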
Again, the numPossibleLabels argument is not used for regression.
Following on from the previous two examples, suppose that for each example individually, the input and labels are of the same length, but these lengths differ between time series.
We can use the same approach (CSVSequenceRecordReader and SequenceRecordReaderDataSetIterator), though with a different constructor:
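A sketch, reusing the feature and label readers from example 1:

```java
DataSetIterator variableLengthIter = new SequenceRecordReaderDataSetIterator(featureReader, labelReader,
        miniBatchSize, numPossibleLabels, false,
        SequenceRecordReaderDataSetIterator.AlignmentMode.ALIGN_END);
```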
The arguments here are the same as in the previous example, with the exception of the AlignmentMode.ALIGN_END addition. This alignment mode input tells the SequenceRecordReaderDataSetIterator to expect two things:
That the time series may be of different lengths
To align the input and labels - for each example individually - such that their last values occur at the same time step.
Note that if the features and labels are always of the same length (as is the assumption in example 3), then the two alignment modes (AlignmentMode.ALIGN_END and AlignmentMode.ALIGN_START) will give identical outputs. The alignment mode option is explained in the next section.
Also note that variable-length time series always start at time zero in the data arrays: padding, if required, will be added after the time series has ended.
Unlike examples 1 and 2 above, the DataSet objects produced by the above variableLengthIter instance will also include features and labels mask arrays, as described earlier in this document.
We can also use the AlignmentMode functionality in example 3 to implement a many-to-one RNN sequence classifier. Here, let us assume:
Input and labels are in separate delimited files
The labels files contain a single row (time step) (either a class index for classification, or one or more numbers for regression)
The input lengths may (optionally) differ between examples
In fact, the same approach as in example 3 can do this:
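A sketch (identical to the example 3 constructor; with ALIGN_END, the single label time step is aligned with the last input time step):

```java
DataSetIterator manyToOneIter = new SequenceRecordReaderDataSetIterator(featureReader, labelReader,
        miniBatchSize, numPossibleLabels, false,
        SequenceRecordReaderDataSetIterator.AlignmentMode.ALIGN_END);
```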
Alignment modes are relatively straightforward. They specify whether to pad the start or the end of the shorter time series. The diagram below shows how this works, along with the masking arrays (as discussed earlier in this document):
The one-to-many case (similar to the last case above, but with only one input) is done by using AlignmentMode.ALIGN_START.
Note that in the case of training data that contains time series of different lengths, the labels and inputs will be aligned for each example individually, and then the shorter time series will be padded as required:
Recurrent Neural Network Loss Layer. Handles calculation of gradients etc. for various objective (loss) functions. Unlike RnnOutputLayer, RnnLossLayer has no parameters - there is no time distributed dense component here. Consequently, the output activations size is equal to the input size. Input and output activations are the same as for other RNN layers: 3 dimensions with shape [miniBatchSize,nIn,timeSeriesLength] and [miniBatchSize,nOut,timeSeriesLength] respectively. Note that RnnLossLayer also has the option to configure an activation function.
setNIn
param lossFunction Loss function for the loss layer
Recurrent Neural Network Output Layer. Unlike RnnLossLayer, this layer has parameters: it contains a time distributed fully connected (dense) component, so nIn and nOut may differ. It expects 3d input of shape [minibatch,nIn,sequenceLength] and labels of shape [minibatch,nOut,sequenceLength]. It also supports mask arrays. Note that RnnOutputLayer can also be used for 1D CNN layers, which also have [minibatch,nOut,sequenceLength] activations/labels shape.
build
param lossFunction Loss function for the output layer
Bidirectional is a “wrapper” layer: it wraps any uni-directional RNN layer to make it bidirectional. Note that multiple modes are supported - these specify how the activations from the forward and backward copies of the wrapped RNN layer (each with its own parameters) should be combined.
getNOut
This Mode enumeration defines how the activations for the forward and backward networks should be combined. ADD: out = forward + backward (elementwise addition). MUL: out = forward * backward (elementwise multiplication). AVERAGE: out = 0.5 * (forward + backward). CONCAT: concatenate the activations. Here ‘forward’ is the activations of the forward RNN, and ‘backward’ is the activations of the backward RNN. In all cases except CONCAT, the output activations size is the same as that of the standard RNN being wrapped by this layer. In the CONCAT case, the output activations size (dimension 1) is 2x larger than the standard RNN’s activations array.
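For example, a minimal sketch of wrapping an LSTM (layer sizes are arbitrary):

```java
// With CONCAT, the effective output size is 2 * nOut (here 2 * 64 = 128)
Bidirectional biLstm = new Bidirectional(Bidirectional.Mode.CONCAT,
        new LSTM.Builder().nIn(100).nOut(64).build());
```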
getUpdaterByParam
Get the updater for the given parameter. Typically the same updater will be used for all parameters, but this is not necessarily the case
param paramName Parameter name
return IUpdater for the parameter
LastTimeStep is a “wrapper” layer: it wraps any RNN (or CNN1D) layer, and extracts out the last time step during forward pass, and returns it as a row vector (per example). That is, for 3d (time series) input (with shape [minibatch, layerSize, timeSeriesLength]), we take the last time step and return it as a 2d array with shape [minibatch, layerSize]. Note that the last time step operation takes into account any mask arrays, if present: thus, variable length time series (in the same minibatch) are handled as expected here.
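For example, a minimal sketch (the layer sizes are arbitrary):

```java
// Wraps an LSTM so that only the last (non-masked) time step is returned per example,
// as a 2d array with shape [minibatch, 100]
LastTimeStep lastStep = new LastTimeStep(new LSTM.Builder().nIn(100).nOut(100).build());
```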
SimpleRnn: a standard “vanilla” RNN layer, where the output activations are out_t = activationFn(in_t * inWeight + out_(t-1) * recurrentWeights + bias).
Note that other architectures (LSTM, etc.) are usually much more effective, especially for longer time series; however, SimpleRnn is very fast to compute, and hence may be considered where the temporal dependencies in the dataset are only a few time steps long.
Supported neural network layers.
Each layer in a neural network configuration represents a set of hidden units. When layers are stacked together, they represent a deep neural network.
All layers available in Eclipse Deeplearning4j can be used either in a MultiLayerNetwork
or ComputationGraph
. When configuring a neural network, you pass the layer configuration and the network will instantiate the layer for you.
If you are configuring complex networks such as InceptionV4, you will need to use the ComputationGraph
API and join different branches together using vertices. Check the vertices for more information.
Activation layer is a simple layer that applies the specified activation function to the input activations
clone
param activation Activation function for the layer
activation
Activation function for the layer
activation
param activationFunction Activation function for the layer
activation
param activation Activation function for the layer
Dense layer: a standard fully connected feed forward layer
hasBias
If true (default): include bias parameters in the model. False: no bias.
hasLayerNorm
If true (default = false): enable layer normalization on this layer
Dropout layer. This layer simply applies dropout at training time, and passes activations through unmodified at test time.
build
Create a dropout layer with standard Dropout, with the specified probability of retaining the input activation. See Dropout for the full details
param dropout Activation retain probability.
Embedding layer: feed-forward layer that expects single integers per example as input (class numbers, in range 0 to numClasses-1), instead of the equivalent one-hot representation. Mathematically, EmbeddingLayer is equivalent to using a DenseLayer with a one-hot representation for the input; however, it can be much more efficient with a large number of classes (as a dense layer + one-hot input does a matrix multiply with all but one value being zero). Note: can only be used as the first layer for a network. Note 2: For a given example index i, the output is activationFunction(weights.getRow(i) + bias), hence the weight rows can be considered a vector/embedding for each example. Note also that the embedding layer has an activation function (set to IDENTITY to disable) and optional bias (which is disabled by default)
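A minimal configuration sketch (the sizes are arbitrary):

```java
EmbeddingLayer embedding = new EmbeddingLayer.Builder()
        .nIn(10000)                       // number of classes / vocabulary size
        .nOut(128)                        // embedding (vector) size
        .activation(Activation.IDENTITY)  // disable the activation function
        .build();
```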
hasBias
If true: include bias parameters in the layer. False (default): no bias.
weightInit
Initialize the embedding layer using the specified EmbeddingInitializer - such as a Word2Vec instance
param embeddingInitializer Source of the embedding layer weights
weightInit
Initialize the embedding layer using values from the specified array. Note that the array should have shape [vocabSize, vectorSize]. After copying values from the array to initialize the network parameters, the input array will be discarded (so that, if necessary, it can be garbage collected)
param vectors Vectors to initialize the embedding layer with
Embedding layer for sequences: feed-forward layer that expects fixed-length number (inputLength) of integers/indices per example as input, ranged from 0 to numClasses - 1. This input thus has shape [numExamples, inputLength] or shape [numExamples, 1, inputLength]. The output of this layer is 3D (sequence/time series), namely of shape [numExamples, nOut, inputLength]. Note: can only be used as the first layer for a network Note 2: For a given example index i, the output is activationFunction(weights.getRow(i) + bias), hence the weight rows can be considered a vector/embedding of each index. Note also that embedding layer has an activation function (set to IDENTITY to disable) and optional bias (which is disabled by default)
hasBias
If true: include bias parameters in the layer. False (default): no bias.
inputLength
Set input sequence length for this embedding layer.
param inputLength input sequence length
return Builder
inferInputLength
Set input sequence inference mode for embedding layer.
param inferInputLength whether to infer input length
return Builder
weightInit
Initialize the embedding layer using the specified EmbeddingInitializer - such as a Word2Vec instance
param embeddingInitializer Source of the embedding layer weights
weightInit
Initialize the embedding layer using values from the specified array. Note that the array should have shape [vocabSize, vectorSize]. After copying values from the array to initialize the network parameters, the input array will be discarded (so that, if necessary, it can be garbage collected)
param vectors Vectors to initialize the embedding layer with
Global pooling layer - used to do pooling over time for RNNs, and 2d pooling for CNNs. Supports the following pooling types: MAX, AVG, SUM, PNORM.
Global pooling layer can also handle mask arrays when dealing with variable length inputs. Mask arrays are assumed to be 2d, and are fed forward through the network during training or post-training forward pass:
Time series: mask arrays are shape [miniBatchSize, maxTimeSeriesLength] and contain values 0 or 1 only
CNNs: masks have shape [miniBatchSize, height] or [miniBatchSize, width]. Important: the current implementation assumes that for CNNs + variable length (masking), the input shape is [miniBatchSize, channels, height, 1] or [miniBatchSize, channels, 1, width] respectively. This is the case with global pooling in architectures like CNN for sentence classification.
Behaviour with default settings:
3d (time series) input with shape [miniBatchSize, vectorSize, timeSeriesLength] -> 2d output [miniBatchSize, vectorSize]
4d (CNN) input with shape [miniBatchSize, channels, height, width] -> 2d output [miniBatchSize, channels]
5d (CNN3D) input with shape [miniBatchSize, channels, depth, height, width] -> 2d output [miniBatchSize, channels]
Alternatively, by setting collapseDimensions = false in the configuration, it is possible to retain the reduced dimensions as 1s: this gives
[miniBatchSize, vectorSize, 1] for RNN output,
[miniBatchSize, channels, 1, 1] for CNN output, and
[miniBatchSize, channels, 1, 1, 1] for CNN3D output.
poolingDimensions
Pooling dimensions to pool over for the global pooling operation
poolingType
param poolingType Pooling type for global pooling
collapseDimensions
Whether to collapse dimensions when pooling or not. Usually you do want to do this. Default: true. If true:
3d (time series) input with shape [miniBatchSize, vectorSize, timeSeriesLength] -> 2d output [miniBatchSize, vectorSize]
4d (CNN) input with shape [miniBatchSize, channels, height, width] -> 2d output [miniBatchSize, channels]
5d (CNN3D) input with shape [miniBatchSize, channels, depth, height, width] -> 2d output [miniBatchSize, channels]
If false:
3d (time series) input with shape [miniBatchSize, vectorSize, timeSeriesLength] -> 3d output [miniBatchSize, vectorSize, 1]
4d (CNN) input with shape [miniBatchSize, channels, height, width] -> 4d output [miniBatchSize, channels, 1, 1]
5d (CNN3D) input with shape [miniBatchSize, channels, depth, height, width] -> 5d output [miniBatchSize, channels, 1, 1, 1]
param collapseDimensions Whether to collapse the dimensions or not
pnorm
P-norm constant. Only used if using PoolingType.PNORM for the pooling type
param pnorm P-norm constant
k
LRN scaling constant k. Default: 2
n
Number of adjacent kernel maps to use when doing LRN. default: 5
param n Number of adjacent kernel maps
alpha
LRN scaling constant alpha. Default: 1e-4
param alpha Scaling constant
beta
Scaling constant beta. Default: 0.75
param beta Scaling constant
cudnnAllowFallback
When using CuDNN and an error is encountered, should fallback to the non-CuDNN implementation be allowed? If set to false, an exception from CuDNN will be propagated back to the user. If true, the built-in (non-CuDNN) implementation for BatchNormalization will be used instead
param allowFallback Whether fallback to non-CuDNN implementation should be used
SameDiff version of a 1D locally connected layer.
nIn
Number of inputs to the layer (input size)
nOut
param nOut Number of outputs (output size)
activation
param activation Activation function for the layer
kernelSize
param k Kernel size for the layer
stride
param s Stride for the layer
padding
param p Padding for the layer. Not used if ConvolutionMode.Same is set
convolutionMode
param cm Convolution mode for the layer. See ConvolutionMode for details
dilation
param d Dilation for the layer
hasBias
param hasBias If true (default is false) the layer will have a bias
setInputSize
Set input filter size for this locally connected 1D layer
param inputSize height of the input filters
return Builder
SameDiff version of a 2D locally connected layer.
setKernel
param kernel Kernel size for the layer. Must be 2 values (height/width)
setStride
param stride Stride for the layer. Must be 2 values (height/width)
setPadding
param padding Padding for the layer. Not used if ConvolutionMode.Same is set. Must be 2 values (height/width)
setDilation
param dilation Dilation for the layer. Must be 2 values (height/width)
nIn
param nIn Number of inputs to the layer (input size)
nOut
param nOut Number of outputs (output size)
activation
param activation Activation function for the layer
kernelSize
param k Kernel size for the layer. Must be 2 values (height/width)
stride
param s Stride for the layer. Must be 2 values (height/width)
padding
param p Padding for the layer. Not used if ConvolutionMode.Same is set. Must be 2 values (height/width)
convolutionMode
param cm Convolution mode for the layer. See ConvolutionMode for details
dilation
param d Dilation for the layer. Must be 2 values (height/width)
hasBias
param hasBias If true (default is false) the layer will have a bias
setInputSize
Set input filter size (h,w) for this locally connected 2D layer
param inputSize pair of height and width of the input filters to this layer
return Builder
LossLayer is a flexible output layer that applies a loss function to an input, without MLP logic. LossLayer does not have any parameters. Consequently, setting nIn/nOut isn’t supported - the output size is the same as the input activations size.
nIn
param lossFunction Loss function for the loss layer
Output layer used for training via backpropagation based on labels and a specified loss function. Can be configured for both classification and regression. Note that OutputLayer has parameters - it contains a fully-connected layer (effectively contains a DenseLayer) internally. This allows the output size to be different to the layer input size.
build
param lossFunction Loss function for the output layer
Supports the following pooling types: MAX, AVG, SUM, PNORM, NONE
Supports the following pooling types: MAX, AVG, SUM, PNORM, NONE
Expects input of shape [minibatch, nIn, sequenceLength]. This layer accepts RNN InputTypes instead of CNN InputTypes.
Supports the following pooling types: MAX, AVG, SUM, PNORM
setKernelSize
Kernel size
param kernelSize kernel size
setStride
Stride
param stride stride value
setPadding
Padding
param padding padding value
Upsampling 1D layer: repeats each time step of the input size times along the sequence dimension. For input of shape [minibatch, channels, sequenceLength], the output has shape [minibatch, channels, size * sequenceLength].
size
Upsampling size
param size upsampling size in single spatial dimension of this 1D layer
size
Upsampling size int array with a single element. Array must be length 1
param size upsampling size in single spatial dimension of this 1D layer
Upsampling 2D layer: repeats each value (or rather, set of depth values) in the height and width dimensions by size[0] and size[1] times respectively. For input of shape [minibatch, channels, height, width], the output has shape [minibatch, channels, size[0] * height, size[1] * width].
size
Upsampling size int, used for both height and width
param size upsampling size in height and width dimensions
size
Upsampling size array
param size upsampling size in height and width dimensions
Upsampling 3D layer: repeats each value (all channel values for each x/y/z location) by size[0], size[1] and size[2] times respectively. For input of shape [minibatch, channels, depth, height, width], the output has shape [minibatch, channels, size[0] * depth, size[1] * height, size[2] * width].
size
Upsampling size as int, so same upsampling size is used for depth, width and height
param size upsampling size in height, width and depth dimensions
size
Upsampling size as int, so same upsampling size is used for depth, width and height
param size upsampling size in height, width and depth dimensions
Zero padding 1D layer for convolutional neural networks. Allows padding to be done separately for left and right.
setPadding
Padding value for left and right. Must be length 2 array
build
param padding Padding for both the left and right
Zero padding 3D layer for convolutional neural networks. Allows padding to be done separately for “left” and “right” in all three spatial dimensions.
setPadding
[padLeftD, padRightD, padLeftH, padRightH, padLeftW, padRightW]
build
param padding Padding for both the left and right in all three spatial dimensions
Zero padding layer for convolutional neural networks (2D CNNs). Allows padding to be done separately for top/bottom/left/right
setPadding
Padding value for top, bottom, left, and right. Must be length 4 array
build
param padHeight Padding for both the top and bottom
param padWidth Padding for both the left and right
Element-wise multiplication layer: out = activationFn(input .* w + b), where:
w is a learnable weight vector of length nOut
“.*” is element-wise multiplication
b is a bias vector
Note that the input and output sizes of the element-wise layer are the same for this layer
getMemoryReport
This is a report of the estimated memory consumption for the given layer
param inputType Input type to the layer. Memory consumption is often a function of the input type
return Memory report for the layer
RepeatVector layer configuration.
RepeatVector takes a mini-batch of vectors of shape (mb, length) and a repeat factor n and outputs a 3D tensor of shape (mb, n, length) in which the input vector is repeated n times.
getRepetitionFactor
Get the repetition factor for the RepeatVector layer
setRepetitionFactor
Set repetition factor for RepeatVector layer
param n repetition factor for the RepeatVector layer
repetitionFactor
Set repetition factor for RepeatVector layer
param n repetition factor for the RepeatVector layer
Note: Input activations to the Yolo2OutputLayer should have shape: [minibatch, b*(5+c), H, W], where: b = number of bounding boxes (determined by config - see papers for details) c = number of classes H = output/label height W = output/label width
Important: In practice, this means that the last convolutional layer before your Yolo2OutputLayer should have output depth of b*(5+c). Thus if you change the number of bounding boxes, or change the number of object classes, the number of channels (nOut of the last convolution layer) needs to also change. Label format: [minibatch, 4+C, H, W] Order for labels depth: [x1,y1,x2,y2,(class labels)] x1 = box top left position y1 = as above, y axis x2 = box bottom right position y2 = as above, y axis Note: labels are represented as a multiple of grid size - for a 13x13 grid, (0,0) is top left, (13,13) is bottom right Note also that mask arrays are not required - this implementation infers the presence or absence of objects in each grid cell from the class labels (which should be 1-hot if an object is present, or all 0s otherwise).
lambdaCoord
Loss function coefficient for position and size/scale components of the loss function. Default (as per paper): 5
lambdaNoObj
Loss function coefficient for the “no object confidence” components of the loss function. Default (as per paper): 0.5
param lambdaNoObj Lambda value for no-object (confidence) component of the loss function
lossPositionScale
Loss function for position/scale component of the loss function
param lossPositionScale Loss function for position/scale
lossClassPredictions
Loss function for the class predictions - defaults to L2 loss (i.e., sum of squared errors, as per the paper), however LossMCXENT could also be used (which is more common for classification).
param lossClassPredictions Loss function for the class prediction error component of the YOLO loss function
boundingBoxPriors
Bounding box prior dimensions [width, height]. For N bounding boxes, the input has shape [rows, columns] = [N, 2]. Note that dimensions should be specified as a fraction of grid size. For example, for a network with a 13x13 output, a value of 1.0 would correspond to one grid cell; a value of 13 would correspond to the entire image.
param boundingBoxes Bounding box prior dimensions (width, height)
MaskLayer applies the mask array to the forward pass activations, and to the backward pass gradients, passing through this layer. It can be used with 2d (feed-forward), 3d (time series) or 4d (CNN) activations.
Wrapper which masks timesteps with activation equal to the specified masking value (0.0 default). Assumes that the input shape is [batch_size, input_size, timesteps].
The AdaMax updater, a variant of Adam.
The Adam updater.
The AMSGrad updater. Reference: On the Convergence of Adam and Beyond.
Measuring cosine similarity, no similarity is expressed as a 90 degree angle, while total similarity of 1 is a 0 degree angle, complete overlap; i.e. Sweden equals Sweden, while Norway has a cosine distance of 0.760124 from Sweden, the highest of any other country.
Word2vec is similar to an autoencoder, encoding each word in a vector, but rather than training against the input words through reconstruction, word2vec trains words against other words that neighbor them in the input corpus.
This model was trained on the Google News vocab, which you can import and play with. Contemplate, for a moment, that the Word2vec algorithm has never been taught a single rule of English syntax. It knows nothing about the world, and is unassociated with any rules-based symbolic logic or knowledge graph. And yet it learns more, in a flexible and automated fashion, than most knowledge graphs will learn after years of human labor. It comes to the Google News documents as a blank slate, and by the end of training, it can compute complex analogies that mean something to humans.
You can also query a Word2vec model for other associations. Not everything has to be two analogies that mirror each other.
While Word2vec refers to a family of related algorithms, this implementation uses .
Create a new project in IntelliJ using Maven. If you don't know how to do that, see our Quickstart guide. Then specify these properties and dependencies in the POM.xml file in your project's root directory (check for the most recent versions and use those).
The corpus we use to test the accuracy of our trained nets is hosted on S3. Users whose current hardware takes a long time to train on large corpora can simply download it to explore a Word2vec model without the prelude.
If you trained a model with the original C implementation or with Gensim, the line below will import the model.
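A sketch (the file path is an assumption; WordVectorSerializer is part of the deeplearning4j-nlp module):

```java
Word2Vec vec = WordVectorSerializer.readWord2VecModel(
        new File("/path/to/GoogleNews-vectors-negative300.bin.gz"));
```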
With large models, you may run into trouble with your heap space. The Google model may take as much as 10G of RAM, and the JVM only launches with 256 MB of RAM by default, so you have to adjust your heap space. You can do that either with a bash_profile file (see our troubleshooting guide), or through IntelliJ itself:
Please note: the code below may be outdated. For updated examples, please see our examples repository.
Now that you have a basic idea of how to set up Word2vec, here's an example of how it can be used with DL4J's API:
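A condensed sketch of that example (the file path and hyperparameters are illustrative; the classes come from the deeplearning4j-nlp module):

```java
// Iterate over the corpus one line at a time, tokenizing and lower-casing each token
SentenceIterator sentences = new BasicLineIterator(new File("/path/to/raw_text.txt"));
TokenizerFactory tokenizer = new DefaultTokenizerFactory();
tokenizer.setTokenPreProcessor(new CommonPreprocessor());

Word2Vec vec = new Word2Vec.Builder()
        .minWordFrequency(5)      // ignore rare words
        .layerSize(100)           // dimensionality of the word vectors
        .windowSize(5)
        .iterate(sentences)
        .tokenizerFactory(tokenizer)
        .build();
vec.fit();

// Query the trained model
Collection<String> nearest = vec.wordsNearest("day", 10);
```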
After following the setup instructions, you can open this example in IntelliJ and hit run to see it work. If you query the Word2vec model with a word that isn't contained in the training corpus, it will return null.
Google Scholar keeps a running tally of the papers citing Word2vec.
Kenny Helsens, a data scientist based in Belgium, applied Deeplearning4j's implementation of Word2vec to the NCBI's Online Mendelian Inheritance In Man (OMIM) database. He then looked for the words most similar to alk, a known oncogene of non-small cell lung carcinoma, and Word2vec returned: "nonsmall, carcinomas, carcinoma, mapdkd." From there, he established analogies between other cancer phenotypes and their genotypes. This is just one example of the associations Word2vec can learn on a large corpus. The potential for discovering new aspects of important diseases has only just begun, and outside of medicine, the opportunities are equally diverse.
Andreas Klintberg trained Deeplearning4j's implementation of Word2vec on Swedish, and wrote a walkthrough of the process.
Word2vec was introduced by a team of researchers at Google led by Tomas Mikolov. Google released the code under an Apache 2.0 license. In 2014, Mikolov left Google for Facebook, and in May 2015 Google was granted a patent for the method, which does not abrogate the Apache license under which the code has been released.
While words in all languages may be converted into vectors with Word2vec, and those vectors learned with Deeplearning4j, NLP preprocessing can be very language specific, and requires tools beyond our libraries. The Stanford Natural Language Processing Group has a number of Java-based tools for tokenization, part-of-speech tagging and named-entity recognition for languages such as Mandarin Chinese, Arabic, French, German and Spanish. For Japanese, NLP tools such as Kuromoji are useful. Other foreign-language resources are available as well.
Deeplearning4j has a class called SequenceVectors, which is one level of abstraction above word vectors, and which allows you to extract features from any sequence, including social media profiles, transactions, proteins, etc. If data can be described as a sequence, it can be learned via skip-gram and hierarchical softmax with the AbstractVectors class. This is compatible with the DeepWalk algorithm, also implemented in Deeplearning4j.
word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method; Yoav Goldberg and Omer Levy
Of course, the DenseLayer and Convolutional layers do not handle time series data - they expect a different type of input. To deal with this, we need to use the layer preprocessor functionality: for example, the CnnToRnnPreProcessor and FeedForwardToRnnPreProcessor classes. See the InputPreProcessor implementations for all preprocessors. Fortunately, in most situations, the DL4J configuration system will automatically add these preprocessors as required. However, the preprocessors can also be added manually (overriding the automatic addition of preprocessors, for each layer).
(In addition to the examples below, you might find the existing DL4J examples to be of some use.)
To use this approach, we first create two CSVSequenceRecordReader objects, one for the input and one for the labels:
Second, we need to initialize these two readers, by telling them where to get the data from. We do this with an InputSplit object. Suppose that our time series are numbered, with file names "myInput_0.csv", "myInput_1.csv", ..., "myLabels_0.csv", etc. One approach is to use the NumberedFileInputSplit:
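A sketch of that initialization (the base paths are assumptions):

```java
featureReader.initialize(new NumberedFileInputSplit("/path/to/myInput_%d.csv", 0, 9));
labelReader.initialize(new NumberedFileInputSplit("/path/to/myLabels_%d.csv", 0, 9));
```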
LSTM recurrent neural network layer without peephole connections. Supports CuDNN acceleration - see the CuDNN documentation for details.
Local response normalization layer. See section 3.3 of the AlexNet paper (ImageNet Classification with Deep Convolutional Neural Networks).
Output (loss) layer for the YOLOv2 object detection model, based on the papers YOLO9000: Better, Faster, Stronger (Redmon & Farhadi, 2016) and You Only Look Once: Unified, Real-Time Object Detection (Redmon et al., 2016). This loss function implementation is based on the YOLOv2 version of the paper. However, note that it doesn’t currently support simultaneous training on both detection and classification datasets as described in the YOLO9000 paper.