Highlights - 1.0.0-beta4 Release

Main highlight: full multi-datatype support for ND4J and DL4J. In past releases, all N-Dimensional arrays in ND4J were limited to a single datatype (float or double), set globally. Now, arrays of all datatypes may be used simultaneously. The following datatypes are supported:
    DOUBLE: double precision floating point, 64-bit (8 byte)
    FLOAT: single precision floating point, 32-bit (4 byte)
    HALF: half precision floating point, 16-bit (2 byte), "FP16"
    LONG: long signed integer, 64-bit (8 byte)
    INT: signed integer, 32-bit (4 byte)
    SHORT: signed short integer, 16-bit (2 byte)
    UBYTE: unsigned byte, 8-bit (1 byte), 0 to 255
    BYTE: signed byte, 8-bit (1 byte), -128 to 127
    BOOL: boolean type (0/1, true/false). Uses ubyte storage for easier op parallelization
    UTF8: String array type, UTF8 format
ND4J Behaviour changes of note:
    When creating an INDArray from a Java primitive array, the INDArray datatype will be determined by the primitive array type (unless a datatype is specified)
      For example: Nd4j.createFromArray(double[]) -> DOUBLE datatype INDArray
      Similarly, Nd4j.scalar(1), Nd4j.scalar(1L), Nd4j.scalar(1.0) and Nd4j.scalar(1.0f) will produce INT, LONG, DOUBLE and FLOAT type scalar INDArrays respectively
    Some operations require matched datatypes for operands
      For example, if x and y are different datatypes, a cast may be required: x.add(y.castTo(x.dataType()))
    Some operations have datatype restrictions: for example, sum on a UTF8 array is not supported, nor is variance on a BOOL array. For some operations on boolean arrays (such as sum), casting to an integer or floating point type first may make sense.
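The datatype rules above can be sketched as follows. This is a minimal illustration, assuming nd4j-api 1.0.0-beta4 plus a backend (such as nd4j-native) on the classpath; the class name DataTypeInferenceSketch and its helper methods are hypothetical and not part of ND4J.

```java
import org.nd4j.linalg.api.buffer.DataType;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class DataTypeInferenceSketch {

    // Datatypes inferred for int, long, double and float scalars, in that order.
    static String inferredScalarTypes() {
        return Nd4j.scalar(1).dataType() + "," + Nd4j.scalar(1L).dataType() + ","
                + Nd4j.scalar(1.0).dataType() + "," + Nd4j.scalar(1.0f).dataType();
    }

    // Adds two arrays of possibly different datatypes by casting y to x's datatype first.
    static INDArray addWithCast(INDArray x, INDArray y) {
        return x.add(y.castTo(x.dataType()));
    }

    public static void main(String[] args) {
        // The datatype follows the primitive array type unless one is specified.
        INDArray x = Nd4j.createFromArray(new double[]{1.0, 2.0, 3.0});
        INDArray y = Nd4j.createFromArray(new float[]{0.5f, 0.5f, 0.5f});

        System.out.println(inferredScalarTypes());
        System.out.println(x.dataType() + " + " + y.dataType()
                + " -> " + addWithCast(x, y).dataType());
    }
}
```

Casting the second operand to the datatype of the first keeps the result in the first operand's type; whether that is the right direction for the cast depends on the precision the calling code needs.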
DL4J Behaviour changes of note:
    MultiLayerNetwork/ComputationGraph no longer depend in any way on the ND4J global datatype.
      The datatype of a network (DataType for its parameters and activations) can be set during construction using NeuralNetConfiguration.Builder().dataType(DataType)
      Networks can be converted from one type to another (double to float, float to half, etc.) using the MultiLayerNetwork/ComputationGraph.convertDataType(DataType) method
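A minimal sketch of setting and converting a network's datatype, assuming deeplearning4j-core 1.0.0-beta4 and an ND4J backend on the classpath. The layer sizes and the class name NetworkDataTypeSketch are arbitrary choices made for illustration.

```java
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.api.buffer.DataType;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class NetworkDataTypeSketch {

    // Builds a tiny two-layer network whose parameters and activations use the given datatype.
    static MultiLayerNetwork build(DataType dataType) {
        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                .dataType(dataType)
                .list()
                .layer(0, new DenseLayer.Builder().nIn(4).nOut(3)
                        .activation(Activation.TANH).build())
                .layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                        .nIn(3).nOut(2).activation(Activation.SOFTMAX).build())
                .build();
        MultiLayerNetwork net = new MultiLayerNetwork(conf);
        net.init();
        return net;
    }

    public static void main(String[] args) {
        MultiLayerNetwork floatNet = build(DataType.FLOAT);
        // convertDataType returns a network of the requested type.
        MultiLayerNetwork doubleNet = floatNet.convertDataType(DataType.DOUBLE);

        System.out.println("original:  " + floatNet.params().dataType());
        System.out.println("converted: " + doubleNet.params().dataType());
    }
}
```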
Main new methods:
    Nd4j.create(), zeros(), ones(), linspace(), etc methods with DataType argument
    INDArray.castTo(DataType) method - to convert INDArrays from one datatype to another
    New Nd4j.createFromArray(...) methods for
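The creation and conversion methods listed above might be used as in the following sketch. It assumes nd4j-api 1.0.0-beta4 with a backend on the classpath and the zeros/ones overloads that take a DataType followed by the shape; the class name TypedCreationSketch is hypothetical.

```java
import org.nd4j.linalg.api.buffer.DataType;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class TypedCreationSketch {

    // Datatypes of a FLOAT zeros array, an INT ones array and the INT array cast to DOUBLE.
    static String types() {
        INDArray zeros = Nd4j.zeros(DataType.FLOAT, 2, 3);
        INDArray ones = Nd4j.ones(DataType.INT, 2, 3);
        INDArray asDouble = ones.castTo(DataType.DOUBLE);
        return zeros.dataType() + "," + ones.dataType() + "," + asDouble.dataType();
    }

    public static void main(String[] args) {
        System.out.println(types());
        // createFromArray picks the datatype from the Java primitive array type.
        INDArray fromInts = Nd4j.createFromArray(new int[]{1, 2, 3});
        System.out.println(fromInts.dataType());
    }
}
```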
ND4J/DL4J: CUDA 10.1 support added, CUDA 9.0 support dropped
CUDA versions supported in 1.0.0-beta4: CUDA 9.2, 10.0, 10.1.
ND4J: Mac/OSX CUDA support dropped
Mac (OSX) CUDA binaries are no longer provided. Linux (x86_64, ppc64le) and Windows (x86_64) CUDA support remains. OSX CPU support (x86_64) is still available.
DL4J/ND4J: MKL-DNN Support Added
DL4J (and ND4J conv2d etc ops) now support MKL-DNN by default when running on CPU/native backend. MKL-DNN support is implemented for the following layer types:
    ConvolutionLayer and Convolution1DLayer (and Conv2D/Conv2DDerivative ND4J ops)
    SubsamplingLayer and Subsampling1DLayer (and MaxPooling2D/AvgPooling2D/Pooling2DDerivative ND4J ops)
    BatchNormalization layer (and BatchNorm ND4J op)
    LocalResponseNormalization layer (and LocalResponseNormalization ND4J op)
    Convolution3D layer (and Conv3D/Conv3DDerivative ND4J ops)
MKL-DNN support for other layer types (such as LSTM) will be added in a future release.
MKL-DNN can be disabled globally (ND4J and DL4J) using Nd4jCpu.Environment.getInstance().setUseMKLDNN(false);
MKL-DNN can be disabled globally for specific ops by setting the ND4J_MKL_FALLBACK environment variable to a comma-separated list of the operations for which MKL-DNN should be disabled. For example: ND4J_MKL_FALLBACK=conv2d,conv2d_bp
ND4J: Improved Performance due to Memory Management Changes
Prior releases of ND4J used periodic garbage collection (GC) to release memory that was not allocated in a memory workspace. (Note that DL4J uses workspaces for almost all operations by default, hence periodic GC could frequently be disabled when training DL4J networks.) However, the reliance on garbage collection resulted in a performance overhead that scaled with the number of objects in the JVM heap.
In 1.0.0-beta4, the periodic garbage collection is disabled by default; instead, GC will be called only when it is required to reclaim memory from arrays that are allocated outside of workspaces.
To re-enable periodic GC (as per the default in beta3) and set the GC frequency to every 5 seconds (5000ms) you can use:
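A minimal sketch of such a configuration, assuming the ND4J MemoryManager methods togglePeriodicGc(boolean) and setAutoGcWindow(int) accessed via Nd4j.getMemoryManager(); the wrapping class PeriodicGcSketch is only there to make the fragment compilable.

```java
import org.nd4j.linalg.factory.Nd4j;

public class PeriodicGcSketch {
    public static void main(String[] args) {
        // Re-enable periodic garbage collection calls (the default in beta3).
        Nd4j.getMemoryManager().togglePeriodicGc(true);
        // Set the interval between periodic GC calls, in milliseconds.
        Nd4j.getMemoryManager().setAutoGcWindow(5000);
    }
}
```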
ND4J: Improved Rank 0/1 Array Support
In prior versions of ND4J, scalars and vectors would sometimes be rank 2 instead of rank 0/1 when getting rows/columns, getting sub-arrays using INDArray.get(NDArrayIndex...) or when creating arrays from Java arrays/scalars. Now, behaviour should be more consistent for these rank 0/1 cases. Note: to maintain the old behaviour for getRow and getColumn (i.e., return a rank 2 array with shape [1,x] and [x,1] respectively), the getRow(long,boolean) and getColumn(long,boolean) methods can be used.
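The difference between the two getRow variants can be sketched as below. This assumes nd4j-api 1.0.0-beta4 with a backend on the classpath, and that passing true as the boolean argument requests the old rank 2 result; the class name RowRankSketch is hypothetical.

```java
import java.util.Arrays;

import org.nd4j.linalg.api.buffer.DataType;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class RowRankSketch {

    // Rank of the first row of a 3x4 matrix, with or without the old rank 2 behaviour.
    static int rowRank(boolean keepRank2) {
        INDArray m = Nd4j.zeros(DataType.FLOAT, 3, 4);
        return keepRank2 ? m.getRow(0, true).rank() : m.getRow(0).rank();
    }

    public static void main(String[] args) {
        INDArray m = Nd4j.zeros(DataType.FLOAT, 3, 4);
        System.out.println("getRow(0) shape:       " + Arrays.toString(m.getRow(0).shape()));
        System.out.println("getRow(0, true) shape: " + Arrays.toString(m.getRow(0, true).shape()));
    }
}
```

Code that relied on rows being [1,x] matrices (for example, before a matrix multiply) is the most likely place to need the two-argument form.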
DL4J: Attention layers added


Deeplearning4J: Features and Enhancements

    Added MKL-DNN support for Conv/Pool/BatchNorm/LRN layers. MKL-DNN will be used automatically when using nd4j-native backend. (Link, Link)
    L1/L2 regularization now made into a class; weight decay added, with better control as to when/how it is applied. See this page for more details on the difference between L2 and weight decay. In general, weight decay should be preferred to L2 regularization. (Link, Link)
    The parameter/activation datatypes for new networks can be set using the dataType(DataType) method on NeuralNetConfiguration.Builder (Link)
    MultiLayerNetwork/ComputationGraph can be converted between (floating point) datatypes FP16/32/64 for the parameters and activations using the MultiLayerNetwork/ComputationGraph.convertDataType(DataType) methods (Link, Link)
    EmbeddingLayer and EmbeddingSequenceLayer builders now have .weightInit(INDArray) and .weightInit(Word2Vec) methods for initializing parameters from pretrained word vectors (Link)
    PerformanceListener can now be configured to report garbage collection information (number/duration) (Link)
    Evaluation class will now check for NaNs in the predicted output and throw an exception instead of treating argMax(NaNs) as having value 0 (Link)
    Added ModelAdapter for ParallelInference for convenience and for use cases such as YOLO (allows improved performance by avoiding detached (out-of-workspace) arrays) (Link)
    Added GELU Activation function (Link)
    Added BertIterator (a MultiDataSetIterator for BERT training - supervised and unsupervised) (Link)
    Added validation to MultiLayerNetwork/ComputationGraph that throws an exception when attempting to perform Regression evaluation on a classifier, or vice-versa (Link, Link)
    Added ComputationGraph.output(List<String> layers, boolean train, INDArray[] features, INDArray[] featureMasks) method to get the activations for a specific set of layers/vertices only (without redundant calculations) (Link)
    Weight initialization for networks is now implemented as classes (not just enumerations) and hence is now extensible via IWeightInit interface (Link); i.e., custom weight initializations are now supported (Link, Link)
    Added Capsule Network layers (no GPU acceleration until next release) - CapsuleLayer, CapsuleStrengthLayer and PrimaryCapsules (Link)
    Added Cifar10DataSetIterator to replace CifarDataSetIterator (Link, Link)
    Keras import: Importing models from InputStream is now supported (Link, Link)
    Layer/NeuralNetConfiguration builders now have getter/setter methods also, for better Kotlin support (Link)
    Most JavaScript dependencies and fonts for UI have been migrated to WebJars (Link)
    CheckpointListener now has static availableCheckpoints(File), loadCheckpointMLN(File, int) and loadLastCheckpointMLN(File) etc methods (Link)
    MultiLayerNetwork/ComputationGraph now validate and throw an exception in certain incompatible RNN configurations, like truncated backpropagation through time combined with LastTimeStepLayer/Vertex (Link)
    Added BERT WordPiece tokenizers (Link)
    Deeplearning4j UI now has multi-user/multi-session support - use UIServer.getInstance(boolean multiSession, Function<String,StatsStorage>) to start UI in multi-session mode (Link)
    Layer/NeuralNetworkConfiguration builder method validation standardized and improved (Link)
    WordVectorSerializer now supports reading and exporting text format vectors via WordVectorSerializer.writeLookupTable and readLookupTable (Link)
    Updated to JavaCPP, JavaCPP presets, and JavaCV version 1.5 (Link)
    Added EvaluationBinary false alarm rate calculation (Link)
    ComputationGraph GraphBuilder now has an appendLayer method that can be used to add layers connected to the last added layer/vertex (Link)
    Added Wasserstein loss function (Link)
    Keras import: Improved errors/exceptions for lambda layer import (Link)
    Apache Lucene/Solr upgraded from 7.5.0 to 7.7.1 (Link)
    KMeans clustering strategy is now configurable (Link)

Deeplearning4J: Bug Fixes and Optimizations

    DL4J Spark training: fix for shared clusters (multiple simultaneous training jobs) - Aeron stream ID now generated randomly (Link)
    cuDNN helpers will no longer attempt to fall back on built-in layer implementations if an out-of-memory exception is thrown (Link)
    Batch normalization global variance reparameterized to avoid underflow and zero/negative variance in some cases during distributed training (Link)
    Fixed a bug where dropout instances were incorrectly shared between layers when using transfer learning with dropout (Link, Link)
    Fixed issue where tensorAlongDimension could result in an incorrect array order for edge cases and hence exceptions in LSTMs (Link)
    Fixed an edge case issue with ComputationGraph.getParam(String) where the layer name contains underscores (Link)
    Fixed an edge case with ParallelInference on CUDA where (very rarely) input array operations (such as normalization) may not be fully completed before transferring an array between threads (Link, Link)
    Fixed an edge case with KFoldIterator when the total number of examples is not a multiple of the batch size (Link, Link)
    Fixed an issue where DL4J UI could throw a NoClassDefFoundError on Java 9/10/11 (Link, Link)
    Keras import: added aliases for weight initialization (Link)
    Fixed issue where dropout instances would not be correctly cloned when network configuration was cloned (Link)
    Fixed workspace issue with ElementwiseVertex with single input (Link)
    Fixed issue with UI where detaching StatsStorage could attempt to remove storage twice, resulting in an exception (Link)
    Fixed issue where LossMultiLabel would generate NaNs when all labels in minibatch are the same class. Now 0 gradient is returned instead. (Link, Link)
    Fixed an issue where DepthwiseConv2D weight could be wrong shape on restoring network from saved format (Link)
    Fixed issue where BaseDatasetIterator.next() would not apply the preprocessor, if one was set (Link)
    Improved default configuration for CenterLossOutputLayer (Link)
    Fixed an issue for UNet non-pretrained configuration (Link)
    Fixed an issue where Word2Vec VocabConstructor could deadlock under some circumstances (Link)
    SkipGram and CBOW (used in Word2Vec) were made native operations for better performance (Link)
    Fixed an issue where references to detached StatsListener instances would be maintained, potentially leading to memory issues when using InMemoryStatsListener (Link)
    Optimization: Workspaces were added to SequenceVectors and Word2Vec (Link)
    Improved validation for RecordReaderDataSetIterator (Link)
    Improved handling of unknown words in WordVectors implementation (Link)
    Yolo2OutputLayer: Added validation for incorrect labels shape. (Link)
    LastTimeStepLayer will now throw an exception when the input mask is all 0s (no data - no last time step) (Link)
    Fixed an issue where MultiLayerNetwork/ComputationGraph.setLearningRate method could lead to invalid updater state in some rare cases (Link)
    Fixed an issue where Conv1D layer would calculate output length incorrectly in MultiLayerNetwork.summary() (Link)
    Async iterators are now used in EarlyStoppingTrainer to improve data loading performance (Link)
    EmbeddingLayer and EmbeddingSequenceLayer performance has been improved on CUDA (Link)
    Removed outdated/legacy scala tools repository (Link, Link)
    Fixed issues in L2NormalizeVertex equals/hashcode methods (Link)
    Fixed Workspace issue in ConvolutionalListener (Link)
    Fixed EvaluationBinary falsePositiveRate calculation (Link)
    Added validation and useful exception for MultiLayerNetwork.output(DataSetIterator) methods (Link)
    Fixed minor issue where ComputationGraph.summary() would throw a NullPointerException if init() had not already been called (Link)
    Fixed a ComputationGraph issue where an input into a single layer/vertex repeated multiple times could fail during training (Link)
    Improved performance for KMeans implementation (Link)
    Fixed an issue with rnnGetPreviousState for RNNs in 'wrapper' layers such as FrozenLayer (Link)
    Keras import: Fixed an issue with order of words when importing some Keras tokenizers (Link)
    Keras import: fixed issue with possible UnsupportedOperationException in KerasTokenizer class (Link)
    Keras import: fixed an import issue with models combining embeddings, reshape and convolution layers (Link)
    Keras import: fixed an import issue with input type inference for some RNN models (Link)
    Fixed some padding issues in LocallyConnected1D/2D layers (Link)

ND4J and SameDiff

ND4J/SameDiff: Features and Enhancements

    Removed reliance on periodic garbage collection calls for handling memory management of out-of-workspace (detached) INDArrays (Link)
    Added INDArray.close() method to allow users to manually release off-heap memory immediately (Link)
    SameDiff: Added TensorFlowImportValidator tool to determine if a TensorFlow graph can likely be imported into SameDiff. Reports the operations used and whether they are supported in SameDiff (Link)
    Added Nd4j.createFromNpzFile method to load Numpy npz files (Link)
    Added support for importing BERT models into SameDiff (Link, Link)
    Added SameDiff GraphTransformUtil for performing transfer learning and other graph modifications (Link, Link, Link)
    Evaluation, RegressionEvaluation etc now support 4d (CNN segmentation) data formats; also added Evaluation.setAxis(int) method to support other data formats such as channels-last/NHWC for CNNs and NWC for CNN1D/RNNs. Defaults to axis 1 (which matches DL4J CNN and RNN data formats) (Link, Link)
    Added a basic ("technology preview") version of the SameDiff UI. Should be considered early WIP with breaking API changes expected in future releases. Supports plotting of SameDiff graphs as well as various metrics (line charts, histograms, etc)
      Currently embedded in the DL4J UI - call UIServer.getInstance() then go to localhost:9000/samediff to access.
      For more details, see 1, 2, 3
    Added DotProductAttention and MultiHeadDotProductAttention operations (Link)
    Added Nd4j.exec(Op) and Nd4j.exec(CustomOp) convenience methods (Link)
    SameDiff TensorFlow Import
      Import of TF Assertions added (Link)
      Support/fixes for control dependencies (Link)
      Support/fixes for TensorArray and related ops (Link, Link, Link)
    nd4j-common - tar/tar.gz support added; Zip file listing and single file extraction added (Link, Link)
    SameDiff: reduction operations now support "dynamic" (non-constant) inputs for axis argument (Link)
    ROCBinary now has .getROC(int outputNum) method (Link)
    SameDiff: L1/L2 regularization added (Link, Link)
    SameDiff: Added SDVariable.convertToVariable() and convertToConstant() - to change SDVariable type (Link)
    Added checks and useful exceptions for reductions on empty arrays (Link)
    SameDiff "op creator" methods (SameDiff.tanh(), SameDiff.conv2d(...) etc) have been moved to subclasses - access creators via SameDiff.math()/random()/nn()/cnn()/rnn()/loss() methods or SameDiff.math/random/nn/cnn/rnn/loss fields (Link)
    SameDiff TensorFlow import: import can now be overridden for cases such as user-defined functions (Link, Link)
    Libnd4j (c++) benchmarking framework added (Link)
    Added OpExecutioner.inspectArray(INDArray) method to get summary statistics for analysis/debugging purposes (Link)
    Added INDArray.reshape(char order, boolean enforceView, long... newShape) to reshape array whilst throwing an exception (instead of returning a copy) if the reshape cannot be performed (Link, Link)
    Added SDVariable method overloads (plus, minus, times, etc) for Kotlin (Link)
    Added SDVariable convenience methods for dot, reshape, permute (Link)
    Added SameDiff SDIndex.point(long, boolean keepDim) method (to keep point indices in output array as size 1 axis) (Link)
    Added SameDiff ProtoBufToFlatBufConversion command line tool for doing TensorFlow frozen model (protobuf) to SameDiff FlatBuffers conversion (Link)
    Improved DataType validation for SameDiff operations (Link)

ND4J/SameDiff: API Changes (Transition Guide): 1.0.0-beta3 to 1.0.0-beta4

    ND4J datatypes - significant changes, see highlights at top of this section
    nd4j-base64 module (deprecated in beta3) has been removed. Nd4jBase64 class has been moved to nd4j-api (Link)
    When specifying arguments for op execution along dimension (for example, reductions) the reduction axes are now specified in the operation constructor - not separately in the OpExecutioner call. (Link)
    Removed old Java loop-based BooleanIndexing methods. Equivalent native ops should be used instead. (Link)
    Removed Nd4j.ENFORCE_NUMERICAL_STABILITY, Nd4j.copyOnOps, etc (Link)
    SameDiff "op creator" methods (SameDiff.tanh(), SameDiff.conv2d(...) etc) have been moved to subclasses - access creators via SameDiff.math()/random()/nn()/cnn()/rnn()/loss() methods or SameDiff.math/random/nn/cnn/rnn/loss fields (Link)
    Nd4j.emptyLike(INDArray) has been removed. Use Nd4j.like(INDArray) instead (Link)
    org.nd4jutil.StringUtils removed; suggest using Apache commons lang3 StringUtils instead (Link)
    ND4J Jackson RowVector(De)Serializer has been deprecated due to datatype changes; NDArrayText(De)Serializer should be used instead (Link, Link)
    nd4j-instrumentation module has been removed due to lack of use/maintenance (Link)

ND4J/SameDiff: Bug Fixes and Optimizations

    Fixed bug with InvertMatrix.invert() with [1,1] shape matrices (Link)
    Fixed edge case bug for Updater instances with length 1 state arrays (Link)
    Fixed edge case with FileDocumentIterator with empty documents (Link)
    SameDiff: Numerous fixes and enhancements
      1, 2, 3, 4
      Improved functionality for losses (Link, Link, Link, Link)
      Improved errors for missing/misspelled placeholders (Link)
      Fixed edge cases in loops (Link, Link)
    Fixed issue with Nd4j.vstack on 1d arrays returning 1d output, not 2d stacked output (Link)
    Conv2D op can infer kernel size from input arrays directly when required (Link, Link)
    Fixed an issue with Numpy format export - Nd4j.toNpyByteArray(INDArray) (Link)
    Fixes for SameDiff when it is used within an external workspace (Link)
    Fixed an issue where empty NDArrays would be reported as having scalar shape information, length 1 (Link)
    Optimization: libnd4j (c++) indexing for ops will use uint for faster offset calculations when required and possible (Link)
    Optimization: libnd4j loops performance improved for faster execution of some operations (Link, Link, Link)
    Local response normalization op optimized (Link, Link)
    Fixed an issue with INDArray.repeat on some view arrays (Link)
    Improved performance for execution of some operations on view arrays (Link)
    Improved performance on broadcast operations (Link, Link, Link)
    Improved performance for non-EWS reduction along dimension operations (Link)
    Improved performance for IndexReduce operations (Link) and small reductions (Link)
    Improved performance of one_hot operation (Link), tanh operation (Link)
    Improved performance for transform operations (Link)
    Optimization: empty arrays are created only once and cached (as they are immutable) (Link)
    Improved performance on operations using tensor along dimension for parallelization (Link, Link)
    Improved performance on "reduce 3" reduction operations (Link)
    Improved handling of CUDA contexts in heavily multi-threaded environments (Link)
    Fixed an issue where Evaluation.reset() would incorrectly clear the String class labels (Link)
    SameDiff: Improved gradient calculation performance/efficiency; "gradients" are now no longer defined for non-floating-point variables, and variables that aren't required to calculate loss or parameter gradients (Link)
    Behaviour of IEvaluation instances now no longer depends on the global (default) datatype setting (Link)
    INDArray.get(point(x), y) or .get(y, point(x)) now returns rank 1 arrays when performed on rank 2 arrays (Link)
    Removed reliance on Guava for SameDiff, fixing potential issue for Java 11/12 and when earlier versions of Guava are on the classpath (Link, Link)
    ND4J indexing (INDArray.get) implementation rewritten for better performance and reliability (Link)
    Fixes for local response normalization backprop op (Link)

ND4J: Known Issues

    Most CustomOperation operations (such as those used in SameDiff) are CPU only until next release. GPU support was not completed in time for 1.0.0-beta4 release.
    Some users with Intel Skylake CPUs have reported deadlocks on MKL-DNN convolution 2d backprop operations (DL4J ConvolutionLayer backprop, ND4J "conv2d_bp" operation) when OMP_NUM_THREADS is set to 8 or higher. Investigations suggest this is likely an issue with MKL-DNN, not DL4J/ND4J. See Issue 7637. Workaround: Disable MKL-DNN for conv2d_bp operation via ND4J_MKL_FALLBACK (see earlier) or disable MKL-DNN globally, for Skylake CPUs.


DataVec: Features and Enhancements

    Added PythonTransform (arbitrary Python code execution for preprocessing) (Link, Link)
    Added FirstDigit (Benford's law) transform (Link, Link)
    StringToTimeTransform now supports setting Locale (Link, Link)
    Added StreamInputSplit for creating local data pipelines where data is stored remotely on storage such as HDFS or S3 (Link, Link)
    LineRecordReader (and subtypes) now have the option to define the character set (Link)
    Added TokenizerBagOfWordsTermSequenceIndexTransform (TFIDF transform), GazeteerTransform (binary vector for word present) and MultiNlpTransform transforms; added BagOfWordsTransform interface (Link)

DataVec: Optimizations and Bug Fixes

    Fixed issue with ImageLoader.scalingIfNeeded (Link)


Arbiter: Enhancements

    Arbiter now supports genetic algorithm search (Link)

Arbiter: Fixes

    Fixed an issue where early stopping used in Arbiter would result in a serialization exception (Link)