Analysis

Gather statistics on datasets.

Analysis of data

Sometimes datasets are too large or too abstract in their format to manually analyze and estimate statistics on certain columns or patterns. DataVec comes with helper utilities for performing a data analysis and computing maximums, means, minimums, and other useful metrics.

Using Spark for analysis

If you have loaded your data into Apache Spark, DataVec has a special AnalyzeSpark class which can generate histograms, collect statistics, and return information about the quality of the data. Assuming you have already loaded your data into a Spark RDD, pass the JavaRDD and Schema to the class.

If you are using DataVec in Scala and your data was loaded into a regular RDD class, you can convert it by calling .toJavaRDD() which returns a JavaRDD. If you need to convert it back, call rdd().

The code below demonstrates some of the many analyses available for a 2D dataset in Spark, using the RDD javaRdd and the schema mySchema:

Note that if you have sequence data, there are special methods for that as well:

Analyzing locally

The AnalyzeLocal class works very similarly to its Spark counterpart and has a similar API. Instead of passing an RDD, it accepts a RecordReader which allows it to iterate over the dataset.

Utilities

AnalyzeLocal

Analyse the specified data - returns a DataAnalysis object with summary information about each column

analyze

Analyse the specified data - returns a DataAnalysis object with summary information about each column

  • param schema Schema for data

  • param rr Data to analyze

  • return DataAnalysis for data

analyzeQualitySequence

Analyze the data quality of sequence data - provides a report on missing values, values that don’t comply with schema, etc

  • param schema Schema for data

  • param data Data to analyze

  • return DataQualityAnalysis object

analyzeQuality

Analyze the data quality of data - provides a report on missing values, values that don’t comply with schema, etc

  • param schema Schema for data

  • param data Data to analyze

  • return DataQualityAnalysis object

AnalyzeSpark

AnalyzeSpark: static methods for analyzing and sampling data held in Spark RDDs.

analyzeSequence

  • param schema Schema for data

  • param data Sequence data to analyze

  • param maxHistogramBuckets Maximum number of histogram buckets to use

  • return SequenceDataAnalysis for the data

analyze

Analyse the specified data - returns a DataAnalysis object with summary information about each column

  • param schema Schema for data

  • param data Data to analyze

  • return DataAnalysis for data

analyzeQualitySequence

Analyze the data quality of sequence data - provides a report on missing values, values that don’t comply with schema, etc

  • param schema Schema for data

  • param data Data to analyze

  • return DataQualityAnalysis object

analyzeQuality

Analyze the data quality of data - provides a report on missing values, values that don’t comply with schema, etc

  • param schema Schema for data

  • param data Data to analyze

  • return DataQualityAnalysis object

min

Get the minimum value for the specified column

  • param allData All data

  • param columnName Name of the column to get the minimum value for

  • param schema Schema of the data

  • return Minimum value for the column

max

Get the maximum value for the specified column

  • param allData All data

  • param columnName Name of the column to get the maximum value for

  • param schema Schema of the data

  • return Maximum value for the column

import org.datavec.spark.transform.AnalyzeSpark;
import org.datavec.api.writable.Writable;
import org.datavec.api.transform.analysis.*;

int maxHistogramBuckets = 10;
DataAnalysis analysis = AnalyzeSpark.analyze(mySchema, javaRdd, maxHistogramBuckets);

DataQualityAnalysis qualityAnalysis = AnalyzeSpark.analyzeQuality(mySchema, javaRdd);

Writable max = AnalyzeSpark.max(javaRdd, "myColumn", mySchema);

int numSamples = 5;
List<Writable> sample = AnalyzeSpark.sampleFromColumn(numSamples, "myColumn", mySchema, javaRdd);

// Sequence data has its own analysis methods:
SequenceDataAnalysis seqAnalysis = AnalyzeSpark.analyzeSequence(mySchema, sequenceRdd);
List<Writable> uniqueSequence = AnalyzeSpark.getUniqueSequence("myColumn", seqSchema, sequenceRdd);

// Local (non-Spark) analysis using a RecordReader:
import org.datavec.local.transforms.AnalyzeLocal;

int maxHistogramBuckets = 10;
DataAnalysis analysis = AnalyzeLocal.analyze(mySchema, csvRecordReader, maxHistogramBuckets);
public static DataAnalysis analyze(Schema schema, RecordReader rr, int maxHistogramBuckets)
public static DataQualityAnalysis analyzeQualitySequence(Schema schema, SequenceRecordReader data)
public static DataQualityAnalysis analyzeQuality(final Schema schema, final RecordReader data)
public static SequenceDataAnalysis analyzeSequence(Schema schema, JavaRDD<List<List<Writable>>> data,
                    int maxHistogramBuckets)
public static DataAnalysis analyze(Schema schema, JavaRDD<List<Writable>> data)
public static DataQualityAnalysis analyzeQualitySequence(Schema schema, JavaRDD<List<List<Writable>>> data)
public static DataQualityAnalysis analyzeQuality(final Schema schema, final JavaRDD<List<Writable>> data)
public static Writable min(JavaRDD<List<Writable>> allData, String columnName, Schema schema)
public static Writable max(JavaRDD<List<Writable>> allData, String columnName, Schema schema)

Reference

Serialization

Data wrangling and mapping from one schema to another.

Serializing transforms

DataVec comes with the ability to serialize transforms, which allows them to be more portable when they're needed for production environments. A TransformProcess is serialized to a human-readable format such as JSON and can be saved as a file.

Serialization

The code below shows how you can serialize the transform process tp.

Deserialization

When you want to reinstantiate the transform process, call the static from<format> method.

Available serializers

JsonSerializer

Serializer used for converting objects (Transforms, Conditions, etc) to JSON format

YamlSerializer

Serializer used for converting objects (Transforms, Conditions, etc) to YAML format

String serializedTransformString = tp.toJson();
TransformProcess tp = TransformProcess.fromJson(serializedTransformString);
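
If you want YAML instead, the calls should look like the sketch below; this assumes the YAML counterparts toYaml() and fromYaml(), which mirror the JSON methods above:

String serializedTransformYaml = tp.toYaml();
TransformProcess restored = TransformProcess.fromYaml(serializedTransformYaml);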

Records

How to use data records in DataVec.

What is a record?

In the DataVec world a Record represents a single entry in a dataset. DataVec differentiates between record types to make data manipulation easier with its built-in APIs: standard (2D) records and sequence records are distinct types.

Using records

Most of the time you do not need to interact with the record classes directly, unless you are manually iterating records for the purpose of forwarding through a neural network.
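
As a minimal sketch of what that manual iteration looks like, the snippet below reads 2D records one at a time with a CSVRecordReader (the file name is illustrative):

import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.writable.Writable;

CSVRecordReader reader = new CSVRecordReader();
reader.initialize(new FileSplit(new File("dataset.csv")));

while (reader.hasNext()) {
    List<Writable> record = reader.next(); // one 2D record: one Writable per column
}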

Types of records

Record

A standard implementation of the Record interface

SequenceRecord

A standard implementation of the SequenceRecord interface.

Reductions

Available reductions

GeographicMidpointReduction

Given a set of latitude/longitude coordinates in text format (for example "lat,long"; the delimiter is configurable), determine the geographic midpoint. See “geographic midpoint” at: http://www.geomidpoint.com/methods.html For the implementation algorithm, see: http://www.geomidpoint.com/calculation.html

transform

  • param delim Delimiter for the coordinates in text format. For example, if format is “lat,long” use “,”

StringReducer


A StringReducer is used to take a set of examples and reduce them. The idea: suppose you have a large number of columns, and you want to combine/reduce the values in each column. StringReducer allows you to specify different reductions for different columns: min, max, sum, mean, etc.

Uses are: (1) Reducing examples by a key (2) Reduction operations in time series (windowing ops, etc)

transform

Get the output schema, given the input schema

outputColumnName

Create a StringReducer builder, and set the default column reduction operation. For any columns that aren’t specified explicitly, they will use the default reduction operation. If a column does have a reduction operation explicitly specified, then it will override the default specified here.

  • param defaultOp Default reduction operation to perform

appendColumns

Reduce the specified columns by appending the values

prependColumns

Reduce the specified columns by prepending the values

mergeColumns

Reduce the specified columns by merging the values

replaceColumn

Reduce the specified columns by replacing the values

customReduction

Reduce the specified column using a custom column reduction functionality.

  • param column Column to execute the custom reduction functionality on

  • param columnReduction Column reduction to execute on that column

setIgnoreInvalid

When doing the reduction: set the specified columns to ignore any invalid values. Invalid: defined as being not valid according to the ColumnMetaData: {- link ColumnMetaData#isValid(Writable)}. For numerical columns, this typically means being unable to parse the Writable. For example, Writable.toLong() failing for a Long column. If the column has any restrictions (min/max values, regex for Strings etc) these will also be taken into account.

  • param columns Columns to set ‘ignore invalid’ for

public Schema transform(Schema inputSchema)
public Schema transform(Schema schema)
public Builder outputColumnName(String outputColumnName)
public Builder appendColumns(String... columns)
public Builder prependColumns(String... columns)
public Builder mergeColumns(String... columns)
public Builder replaceColumn(String... columns)
public Builder customReduction(String column, ColumnReduction columnReduction)
public Builder setIgnoreInvalid(String... columns)

Visualization

Utilities

HtmlAnalysis


createHtmlAnalysisString

Render a data analysis object as an HTML file. This will produce a summary table, along with charts for numerical columns. The contents of the HTML file are returned as a String, which should be written to a .html file.

  • param analysis Data analysis object to render

  • see #createHtmlAnalysisFile(DataAnalysis, File)

createHtmlAnalysisFile

Render a data analysis object as an HTML file. This will produce a summary table, along with charts for numerical columns

  • param dataAnalysis Data analysis object to render

  • param output Output file (should have extension .html)
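
A rough usage sketch, using the analysis object produced by AnalyzeLocal or AnalyzeSpark above (the import path and the output file name are assumptions for illustration):

import org.datavec.api.transform.ui.HtmlAnalysis;

String html = HtmlAnalysis.createHtmlAnalysisString(analysis);             // HTML returned as a String
HtmlAnalysis.createHtmlAnalysisFile(analysis, new File("analysis.html"));  // written directly to a file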

HtmlSequencePlotting

A simple utility for plotting DataVec sequence data to HTML files. Each file contains only one sequence. Each column is plotted separately; only numerical and categorical columns are plotted.

createHtmlSequencePlots

Create a HTML file with plots for the given sequence.

  • param title Title of the page

  • param schema Schema for the data

  • param sequence Sequence to plot

  • return HTML file as a string

createHtmlSequencePlotFile

Create a HTML file with plots for the given sequence and write it to a file.

  • param title Title of the page

  • param schema Schema for the data

  • param sequence Sequence to plot

public static String createHtmlAnalysisString(DataAnalysis analysis) throws Exception
public static void createHtmlAnalysisFile(DataAnalysis dataAnalysis, File output) throws Exception
public static String createHtmlSequencePlots(String title, Schema schema, List<List<Writable>> sequence)
                    throws Exception
public static void createHtmlSequencePlotFile(String title, Schema schema, List<List<Writable>> sequence,
                    File output) throws Exception

Operations

Implementations for advanced transformation.

Usage

Operations, such as a Function, help execute transforms and load data into DataVec. The concept of operations is low-level, meaning that most of the time you will not need to worry about them.

Loading data into Spark

If you're using Apache Spark, functions will iterate over the dataset and load it into a Spark RDD and convert the raw data format into a Writable.
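
For example, the sketch below (which mirrors the reference listing later on this page) uses a CSVRecordReader wrapped in a StringToWritablesFunction to parse each line of a CSV file; the file name is illustrative:

import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.writable.Writable;
import org.datavec.spark.transform.misc.StringToWritablesFunction;

SparkConf conf = new SparkConf();
JavaSparkContext sc = new JavaSparkContext(conf);

CSVRecordReader rr = new CSVRecordReader();
JavaRDD<List<Writable>> customerInfo = sc.textFile("CustomerInfo.csv")
        .map(new StringToWritablesFunction(rr)); // each CSV line becomes a List<Writable>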

The above code loads a CSV file into a JavaRDD of records (JavaRDD<List<Writable>>). Once your RDD is loaded, you can transform it, perform joins and use reducers to wrangle the data any way you want.

Available ops

AggregableCheckingOp

Created by huitseeker on 5/8/17.

AggregableMultiOp

It is used to execute many reduction operations in parallel on the same column, datavec#238

Created by huitseeker on 5/8/17.

ByteWritableOp

Wraps an aggregation so that it accepts Writable values; it is expected to work only if the Writable supports a conversion to Byte.

Created by huitseeker on 5/14/17.

DispatchOp

Created by huitseeker on 5/14/17.

DispatchWithConditionOp

Like DispatchOp, but checks a list of conditions before dispatching the appropriate column of this element to its operation.

Created by huitseeker on 5/14/17.

DoubleWritableOp

Wraps an aggregation so that it accepts Writable values; it is expected to work only if the Writable supports a conversion to Double.

Created by huitseeker on 5/14/17.

FloatWritableOp

Wraps an aggregation so that it accepts Writable values; it is expected to work only if the Writable supports a conversion to Float.

Created by huitseeker on 5/14/17.

IntWritableOp

Wraps an aggregation so that it accepts Writable values; it is expected to work only if the Writable supports a conversion to Integer.

Created by huitseeker on 5/14/17.

LongWritableOp

Wraps an aggregation so that it accepts Writable values; it is expected to work only if the Writable supports a conversion to Long.

Created by huitseeker on 5/14/17.

StringWritableOp

Wraps an aggregation so that it accepts Writable values; it is expected to work only if the Writable supports a conversion to TextWritable.

Created by huitseeker on 5/14/17.

CalculateSortedRank

CalculateSortedRank: calculate the rank of each example, after sorting the examples. For example, we might have some numerical “score” column, and we want to know the rank (sort order) of each example according to that column. The rank of each example (after sorting) will be added in a new Long column. Indexing is done from 0; examples will have values 0 to dataSetSize - 1.

Currently, CalculateSortedRank can only be applied on standard (i.e., non-sequence) data. Furthermore, the current implementation can only sort on one column
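
A minimal sketch of adding a rank column within a TransformProcess might look like the following; the column names are illustrative, and DoubleWritableComparator is assumed to be a suitable comparator for a numerical score column:

import org.datavec.api.transform.TransformProcess;
import org.datavec.api.writable.comparator.DoubleWritableComparator;

TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
    .calculateSortedRank("rank", "score", new DoubleWritableComparator()) // adds a "rank" column, sorted by "score"
    .build();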

transform

  • param newColumnName Name of the new column (will contain the rank for each example)

  • param sortOnColumn Name of the column to sort on

  • param comparator Comparator used to sort examples

outputColumnName

The output column name after the operation has been applied

  • return the output column name

columnName

The output column names This will often be the same as the input

  • return the output column names

Filters

Selection of data using conditions.

Using filters

Filters are a part of transforms and give you a DSL for keeping or removing parts of your dataset. Filters can be one-liners for single conditions or include complex boolean logic.

You can also write your own filters by implementing the Filter interface, though more often you will want to create a custom condition instead. A minimal example of applying a filter is sketched below.
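
For instance, the following one-liner (which mirrors the reference listing later on this page; the schema, column name and country codes are illustrative) keeps only the examples whose MerchantCountryCode is in a given set, by removing everything that is not:

TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
    .filter(new ConditionFilter(
            new CategoricalColumnCondition("MerchantCountryCode", ConditionOp.NotInSet,
                    new HashSet<>(Arrays.asList("USA", "CAN")))))
    .build();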

Conditions

BooleanCondition

BooleanCondition: used for creating compound conditions, such as AND(ConditionA, ConditionB, …) As a BooleanCondition is a condition, these can be chained together, like NOT(OR(AND(…),AND(…)))

Available filters

ConditionFilter


If the condition is satisfied (returns true): remove the example or sequence. If the condition is not satisfied (returns false): keep the example or sequence.

removeExample

  • param writables Example

  • return true if example should be removed, false to keep

removeSequence

  • param sequence sequence example

  • return true if example should be removed, false to keep

transform

Get the output schema for this transformation, given an input schema

  • param inputSchema

outputColumnName

The output column name after the operation has been applied

  • return the output column name

columnName

The output column names This will often be the same as the input

  • return the output column names

Filter


Filter: a method of removing examples (or sequences) according to some condition

FilterInvalidValues


FilterInvalidValues: a filter operation that removes any examples (or sequences) that contain invalid values in any of a specified set of columns. Invalid values are determined with respect to the schema

transform

  • param columnsToFilterIfInvalid Columns to check for invalid values

removeExample

  • param writables Example

  • return true if example should be removed, false to keep

removeSequence

  • param sequence sequence example

  • return true if example should be removed, false to keep

outputColumnName

The output column name after the operation has been applied

  • return the output column name

columnName

The output column names This will often be the same as the input

  • return the output column names

InvalidNumColumns


Remove records that do not have the expected number of columns.

removeExample

  • param writables Example

  • return true if example should be removed, false to keep

removeSequence

  • param sequence sequence example

  • return true if example should be removed, false to keep

removeExample

  • param writables Example

  • return true if example should be removed, false to keep

removeSequence

  • param sequence sequence example

  • return true if example should be removed, false to keep

transform

Get the output schema for this transformation, given an input schema

  • param inputSchema

outputColumnName

The output column name after the operation has been applied

  • return the output column name

columnName

The output column names This will often be the same as the input

  • return the output column names

outputColumnName

The output column name after the operation has been applied

  • return the output column name

columnName

The output column names This will often be the same as the input

  • return the output column names

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

conditionSequence

Condition on arbitrary input

  • param sequence the sequence to do a condition on

  • return true if the condition for the sequence is met false otherwise

transform

Get the output schema for this transformation, given an input schema

  • param inputSchema

AND

And of all the given conditions

  • param conditions the conditions to and

  • return a joint and of all these conditions

OR

Or of all the given conditions

  • param conditions the conditions to or

  • return a joint or of all these conditions

NOT

Not of the given condition

  • param condition the condition to negate

  • return the negation of the given condition

XOR

Exclusive or (XOR) of the two given conditions

  • param first the first condition

  • param second the second condition for xor

  • return the xor of these 2 conditions

SequenceConditionMode


For certain single-column conditions: how should we apply these to sequences? And: the condition applies to the sequence only if it applies to ALL time steps. Or: the condition applies to the sequence if it applies to ANY time step. NoSequenceMode: the condition cannot be applied to sequences at all (error condition)

BooleanColumnCondition


Created by agibsonccc on 11/26/16.

columnCondition

Returns whether the given element meets the condition set by this operation

  • param writable the element to test

  • return true if the condition is met false otherwise

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

CategoricalColumnCondition


columnCondition

Constructor for conditions equal or not equal. Uses default sequence condition mode, {- link BaseColumnCondition#DEFAULT_SEQUENCE_CONDITION_MODE}

  • param columnName Column to check for the condition

  • param op Operation (== or != only)

  • param value Value to use in the condition

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

DoubleColumnCondition


columnCondition

Constructor for operations such as less than, equal to, greater than, etc. Uses default sequence condition mode, {- link BaseColumnCondition#DEFAULT_SEQUENCE_CONDITION_MODE}

  • param columnName Column to check for the condition

  • param op Operation (<, >=, !=, etc)

  • param value Value to use in the condition

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

InfiniteColumnCondition


A column condition that simply checks whether a floating point value is infinite

columnCondition

  • param columnName Column check for the condition

IntegerColumnCondition


columnCondition

Constructor for operations such as less than, equal to, greater than, etc. Uses default sequence condition mode, {- link BaseColumnCondition#DEFAULT_SEQUENCE_CONDITION_MODE}

  • param columnName Column to check for the condition

  • param op Operation (<, >=, !=, etc)

  • param value Value to use in the condition

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

InvalidValueColumnCondition


A Condition that applies to a single column. Whenever the specified value is invalid according to the schema, the condition applies.

For example, if a Writable contains String values in an Integer column (and these cannot be parsed to an integer), then the condition would return true, as these values are invalid according to the schema.

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

LongColumnCondition


columnCondition

Constructor for operations such as less than, equal to, greater than, etc. Uses default sequence condition mode, {- link BaseColumnCondition#DEFAULT_SEQUENCE_CONDITION_MODE}

  • param columnName Column to check for the condition

  • param op Operation (<, >=, !=, etc)

  • param value Value to use in the condition

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

NaNColumnCondition


A column condition that simply checks whether a floating point value is NaN

columnCondition

  • param columnName Name of the column to check the condition for

NullWritableColumnCondition


Condition that applies to the values in any column. Specifically, condition is true if the Writable value is a NullWritable, and false for any other value

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

StringColumnCondition


columnCondition

Constructor for conditions equal or not equal Uses default sequence condition mode, {- link BaseColumnCondition#DEFAULT_SEQUENCE_CONDITION_MODE}

  • param columnName Column to check for the condition

  • param op Operation (== or != only)

  • param value Value to use in the condition

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

TimeColumnCondition


Condition that applies to the values of a Time column

columnCondition

Constructor for operations such as less than, equal to, greater than, etc. Uses default sequence condition mode, {- link BaseColumnCondition#DEFAULT_SEQUENCE_CONDITION_MODE}

  • param columnName Column to check for the condition

  • param op Operation (<, >=, !=, etc)

  • param value Time value (in epoch millisecond format) to use in the condition

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

TrivialColumnCondition


Created by huitseeker on 5/17/17.

SequenceLengthCondition


A condition on sequence lengths

StringRegexColumnCondition


Condition that applies to the values in a String column, using a provided regex. The condition returns true if the String matches the regex, or false otherwise. Note: uses Writable.toString(), hence it can potentially be applied to non-String columns

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

import org.datavec.api.writable.Writable;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.spark.transform.misc.StringToWritablesFunction;

SparkConf conf = new SparkConf();
JavaSparkContext sc = new JavaSparkContext(conf);

String customerInfoPath = new ClassPathResource("CustomerInfo.csv").getFile().getPath();
CSVRecordReader rr = new CSVRecordReader(); // record reader used to parse each CSV line
JavaRDD<List<Writable>> customerInfo = sc.textFile(customerInfoPath).map(new StringToWritablesFunction(rr));
public Schema transform(Schema inputSchema)
public String outputColumnName()
public String columnName()
TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
    .filter(new ConditionFilter(new CategoricalColumnCondition("MerchantCountryCode", ConditionOp.NotInSet, new HashSet<>(Arrays.asList("USA","CAN")))))
    .build();
public boolean removeExample(Object writables)
public boolean removeSequence(Object sequence)
public Schema transform(Schema inputSchema)
public String outputColumnName()
public String columnName()
public Schema transform(Schema inputSchema)
public boolean removeExample(Object writables)
public boolean removeSequence(Object sequence)
public String outputColumnName()
public String columnName()
public boolean removeExample(Object writables)
public boolean removeSequence(Object sequence)
public boolean removeExample(List<Writable> writables)
public boolean removeSequence(List<List<Writable>> sequence)
public Schema transform(Schema inputSchema)
public String outputColumnName()
public String columnName()
public String outputColumnName()
public String columnName()
public boolean condition(Object input)
public boolean conditionSequence(Object sequence)
public Schema transform(Schema inputSchema)
public static Condition AND(Condition... conditions)
public static Condition OR(Condition... conditions)
public static Condition NOT(Condition condition)
public static Condition XOR(Condition first, Condition second)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean columnCondition(Writable writable)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean condition(Object input)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean condition(Object input)

Executors

Execute ETL and vectorization in a local instance.

Local or remote execution?

Because datasets are commonly large by nature, you can decide on an execution mechanism that best suits your needs. For example, if you are vectorizing a large training dataset, you can process it in a distributed Spark cluster. However, if you need to do real-time inference, DataVec also provides a local executor that doesn't require any additional setup.

Executing a transform process

Once you've created your TransformProcess using your Schema, and you've either loaded your dataset into an Apache Spark JavaRDD or have a RecordReader that loads your dataset, you can execute a transform.

Locally this looks like:

When using Spark this looks like:

Available executors

LocalTransformExecutor

Local transform executor

isTryCatch

Execute the specified TransformProcess with the given input data Note: this method can only be used if the TransformProcess returns non-sequence data. For TransformProcesses that return a sequence, use {- link #executeToSequence(List, TransformProcess)}

  • param inputWritables Input data to process

  • param transformProcess TransformProcess to execute

  • return Processed data

SparkTransformExecutor

Execute a DataVec transform process on Spark RDDs.

isTryCatch

  • deprecated Use static methods instead of instance methods on SparkTransformExecutor

import org.datavec.local.transforms.LocalTransformExecutor;

List<List<Writable>> transformed = LocalTransformExecutor.execute(recordReader, transformProcess);

List<List<List<Writable>>> transformedSeq = LocalTransformExecutor.executeToSequence(sequenceReader, transformProcess);

List<List<Writable>> joined = LocalTransformExecutor.executeJoin(join, leftReader, rightReader);

import org.datavec.spark.transform.SparkTransformExecutor;

JavaRDD<List<Writable>> transformed = SparkTransformExecutor.execute(inputRdd, transformProcess);

JavaRDD<List<List<Writable>>> transformedSeq = SparkTransformExecutor.executeToSequence(inputSequenceRdd, transformProcess);

JavaRDD<List<Writable>> joined = SparkTransformExecutor.executeJoin(join, leftRdd, rightRdd);
public static boolean isTryCatch()
public static boolean isTryCatch()

Normalization

Why normalize?

Neural networks work best when the data they’re fed is normalized, constrained to a range between -1 and 1. There are several reasons for that. One is that nets are trained using gradient descent, and their activation functions usually have an active range somewhere between -1 and 1. Even when using an activation function that doesn’t saturate quickly, it is still good practice to constrain your values to this range to improve performance.
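
As a minimal sketch, a typical workflow fits a normalizer on the training data and then attaches it to the iterator as a pre-processor (trainIter is an assumed DataSetIterator variable):

import org.nd4j.linalg.dataset.api.preprocessor.DataNormalization;
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerStandardize;

DataNormalization normalizer = new NormalizerStandardize();
normalizer.fit(trainIter);              // collect mean/standard deviation statistics from the training data
trainIter.setPreProcessor(normalizer);  // normalize each DataSet on the fly as it is fetched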

Available preprocessors

NormalizerMinMaxScaler

Pre processor for DataSets that normalizes feature values (and optionally label values) to lie between a minimum and maximum value (by default between 0 and 1)

NormalizerMinMaxScaler

Preprocessor can take a range as minRange and maxRange

  • param minRange

  • param maxRange

load

Load the given min and max

  • param statistics the statistics to load

  • throws IOException

save

Save the current min and max

  • param files the statistics to save

  • throws IOException

  • deprecated use {- link NormalizerSerializer instead}

Normalizer

Base interface for all normalizers

ImageFlatteningDataSetPreProcessor

A DataSetPreProcessor used to flatten a 4d CNN features array to a flattened 2d format (for use in networks such as a DenseLayer/multi-layer perceptron)

MinMaxStrategy

Normalization strategy that uses statistics of the upper and lower bounds of the population (min/max scaling)

MinMaxStrategy

  • param minRange the target range lower bound

  • param maxRange the target range upper bound

preProcess

Normalize a data array

  • param array the data to normalize

  • param stats statistics of the data population

revert

Denormalize a data array

  • param array the data to denormalize

  • param stats statistics of the data population

ImagePreProcessingScaler

Created by susaneraly on 6/23/16. A preprocessor specifically for images that applies min-max scaling. Can take a range, so pixel values can be scaled from 0->255 to minRange->maxRange (default minRange = 0 and maxRange = 1). If pixel values are not 8 bits, you can specify the number of bits as the third argument in the constructor. For values that are already floating point, specify the number of bits as 1.
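
A rough usage sketch, assuming an image DataSetIterator called imageIter with 8-bit pixel values:

import org.nd4j.linalg.dataset.api.preprocessor.DataNormalization;
import org.nd4j.linalg.dataset.api.preprocessor.ImagePreProcessingScaler;

DataNormalization scaler = new ImagePreProcessingScaler(0, 1); // scale pixel values from 0-255 to 0-1
scaler.fit(imageIter);
imageIter.setPreProcessor(scaler);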

ImagePreProcessingScaler

Preprocessor can take a range as minRange and maxRange

  • param a, default = 0

  • param b, default = 1

  • param maxBits in the image, default = 8

fit

Fit a dataset (only compute based on the statistics from this dataset)

  • param dataSet the dataset to compute on

fit

Iterates over a dataset accumulating statistics for normalization

  • param iterator the iterator to use for collecting statistics.

transform

Transform the data

  • param toPreProcess the dataset to transform

CompositeMultiDataSetPreProcessor

A simple Composite MultiDataSetPreProcessor - allows you to apply multiple MultiDataSetPreProcessors sequentially on the one MultiDataSet, in the order they are passed to the constructor

CompositeMultiDataSetPreProcessor

  • param preProcessors Preprocessors to apply. They will be applied in this order

MultiNormalizerMinMaxScaler

Pre processor for MultiDataSet that normalizes feature values (and optionally label values) to lie between a minimum and maximum value (by default between 0 and 1)

MultiNormalizerMinMaxScaler

Preprocessor can take a range as minRange and maxRange

  • param minRange the target range lower bound

  • param maxRange the target range upper bound

MultiDataNormalization

An interface for multi dataset normalizers. Data normalizers compute some sort of statistics over a MultiDataSet and scale the data in some way.

ImageMultiPreProcessingScaler

A preprocessor specifically for images that applies min-max scaling to one or more of the feature arrays in a MultiDataSet. Can take a range, so pixel values can be scaled from 0->255 to minRange->maxRange (default minRange = 0 and maxRange = 1). If pixel values are not 8 bits, you can specify the number of bits as the third argument in the constructor. For values that are already floating point, specify the number of bits as 1.

ImageMultiPreProcessingScaler

Preprocessor can take a range as minRange and maxRange

  • param a, default = 0

  • param b, default = 1

  • param maxBits in the image, default = 8

  • param featureIndices Indices of feature arrays to process. If only one feature array is present, this should always be 0

NormalizerStandardize

Pre processor for DataSet that normalizes feature values (and optionally label values) to have 0 mean and a standard deviation of 1. Created by susaneraly and Ede Meijer.

load

Load the means and standard deviations from the file system

  • param files the files to load from. Needs 4 files if normalizing labels, otherwise 2.

save

  • param files the files to save to. Needs 4 files if normalizing labels, otherwise 2.

  • deprecated use {- link NormalizerSerializer} instead

Save the current means and standard deviations to the file system

StandardizeStrategy

Normalization strategy that uses the means and standard deviations of the population (standardization)

preProcess

Normalize a data array

  • param array the data to normalize

  • param stats statistics of the data population

revert

Denormalize a data array

  • param array the data to denormalize

  • param stats statistics of the data population

NormalizerStrategy

Interface for strategies that can normalize and denormalize data arrays based on statistics of the population

MultiNormalizerHybrid

Pre processor for MultiDataSet that can be configured to use different normalization strategies for different inputs and outputs, or none at all. Can be used for example when one input should be normalized, but a different one should be untouched because it’s the input for an embedding layer. Alternatively, one might want to mix standardization and min-max scaling for different inputs and outputs.

By default, no normalization is applied. There are methods to configure the desired normalization strategy for inputs and outputs either globally or on an individual input/output level. Specific input/output strategies will override global ones.
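
Based on the methods listed below, configuring and fitting a hybrid normalizer might look roughly like this (the input index and the multiDataSetIterator variable are illustrative):

MultiNormalizerHybrid normalizer = new MultiNormalizerHybrid()
        .standardizeAllInputs()        // standardize every input by default...
        .minMaxScaleInput(2, 0, 1)     // ...but min-max scale input 2 to the range [0, 1]
        .standardizeAllOutputs();

normalizer.fit(multiDataSetIterator); // collect the required statistics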

standardizeAllInputs

Apply standardization to all inputs, except the ones individually configured

  • return the normalizer

minMaxScaleAllInputs

Apply min-max scaling to all inputs, except the ones individually configured

  • return the normalizer

minMaxScaleAllInputs

Apply min-max scaling to all inputs, except the ones individually configured

  • param rangeFrom lower bound of the target range

  • param rangeTo upper bound of the target range

  • return the normalizer

standardizeInput

Apply standardization to a specific input, overriding the global input strategy if any

  • param input the index of the input

  • return the normalizer

minMaxScaleInput

Apply min-max scaling to a specific input, overriding the global input strategy if any

  • param input the index of the input

  • return the normalizer

minMaxScaleInput

Apply min-max scaling to a specific input, overriding the global input strategy if any

  • param input the index of the input

  • param rangeFrom lower bound of the target range

  • param rangeTo upper bound of the target range

  • return the normalizer

standardizeAllOutputs

Apply standardization to all outputs, except the ones individually configured

  • return the normalizer

minMaxScaleAllOutputs

Apply min-max scaling to all outputs, except the ones individually configured

  • return the normalizer

minMaxScaleAllOutputs

Apply min-max scaling to all outputs, except the ones individually configured

  • param rangeFrom lower bound of the target range

  • param rangeTo upper bound of the target range

  • return the normalizer

standardizeOutput

Apply standardization to a specific output, overriding the global output strategy if any

  • param output the index of the output

  • return the normalizer

minMaxScaleOutput

Apply min-max scaling to a specific output, overriding the global output strategy if any

  • param output the index of the output

  • return the normalizer

minMaxScaleOutput

Apply min-max scaling to a specific output, overriding the global output strategy if any

  • param output the index of the output

  • param rangeFrom lower bound of the target range

  • param rangeTo upper bound of the target range

  • return the normalizer

getInputStats

Get normalization statistics for a given input.

  • param input the index of the input

  • return implementation of NormalizerStats corresponding to the normalization strategy selected

getOutputStats

Get normalization statistics for a given output.

  • param output the index of the output

  • return implementation of NormalizerStats corresponding to the normalization strategy selected

fit

Get the map of normalization statistics per input

  • return map of input indices pointing to NormalizerStats instances

fit

Iterates over a dataset accumulating statistics for normalization

  • param iterator the iterator to use for collecting statistics

transform

Transform the dataset

  • param data the dataset to pre process

revert

Undo (revert) the normalization applied by this DataNormalization instance (arrays are modified in-place)

  • param data MultiDataSet to revert the normalization on

revertFeatures

Undo (revert) the normalization applied by this DataNormalization instance to the entire inputs array

  • param features The normalized array of inputs

revertFeatures

Undo (revert) the normalization applied by this DataNormalization instance to the entire inputs array

  • param features The normalized array of inputs

  • param maskArrays Optional mask arrays belonging to the inputs

revertFeatures

Undo (revert) the normalization applied by this DataNormalization instance to the features of a particular input

  • param features The normalized array of inputs

  • param maskArrays Optional mask arrays belonging to the inputs

  • param input the index of the input to revert normalization on

revertLabels

Undo (revert) the normalization applied by this DataNormalization instance to the entire outputs array

  • param labels The normalized array of outputs

revertLabels

Undo (revert) the normalization applied by this DataNormalization instance to the entire outputs array

  • param labels The normalized array of outputs

  • param maskArrays Optional mask arrays belonging to the outputs

revertLabels

Undo (revert) the normalization applied by this DataNormalization instance to the labels of a particular output

  • param labels The normalized array of outputs

  • param maskArrays Optional mask arrays belonging to the outputs

  • param output the index of the output to revert normalization on

CompositeDataSetPreProcessor

A simple Composite DataSetPreProcessor - allows you to apply multiple DataSetPreProcessors sequentially on the one DataSet, in the order they are passed to the constructor

CompositeDataSetPreProcessor

  • param preProcessors Preprocessors to apply. They will be applied in this order

MultiNormalizerStandardize

Pre processor for MultiDataSet that normalizes feature values (and optionally label values) to have 0 mean and a standard deviation of 1

load

Load means and standard deviations from the file system

  • param featureFiles source files for features, requires 2 files per input, alternating mean and stddev files

  • param labelFiles source files for labels, requires 2 files per output, alternating mean and stddev files

save

  • param featureFiles target files for features, requires 2 files per input, alternating mean and stddev files

  • param labelFiles target files for labels, requires 2 files per output, alternating mean and stddev files

  • deprecated use {- link MultiStandardizeSerializerStrategy} instead

Save the current means and standard deviations to the file system

VGG16ImagePreProcessor

This is a preprocessor specifically for VGG16. It subtracts the mean RGB value, computed on the training set, from each pixel, as reported in https://arxiv.org/pdf/1409.1556.pdf

fit

Fit a dataset (only compute based on the statistics from this dataset)

  • param dataSet the dataset to compute on

fit

Iterates over a dataset accumulating statistics for normalization

  • param iterator the iterator to use for collecting statistics.

transform

Transform the data

  • param toPreProcess the dataset to transform

DataNormalization

An interface for data normalizers. Data normalizers compute some sort of statistics over a dataset and scale the data in some way.

public NormalizerMinMaxScaler(double minRange, double maxRange)
public void load(File... statistics) throws IOException
public void save(File... files) throws IOException
public MinMaxStrategy(double minRange, double maxRange)
public void preProcess(INDArray array, INDArray maskArray, MinMaxStats stats)
public void revert(INDArray array, INDArray maskArray, MinMaxStats stats)
public ImagePreProcessingScaler(double a, double b, int maxBits)
public void fit(DataSet dataSet)
public void fit(DataSetIterator iterator)
public void transform(DataSet toPreProcess)
public CompositeMultiDataSetPreProcessor(MultiDataSetPreProcessor... preProcessors)
public MultiNormalizerMinMaxScaler(double minRange, double maxRange)
public ImageMultiPreProcessingScaler(double a, double b, int maxBits, int[] featureIndices)
public void load(File... files) throws IOException
public void save(File... files) throws IOException
public void preProcess(INDArray array, INDArray maskArray, DistributionStats stats)
public void revert(INDArray array, INDArray maskArray, DistributionStats stats)
public MultiNormalizerHybrid standardizeAllInputs()
public MultiNormalizerHybrid minMaxScaleAllInputs()
public MultiNormalizerHybrid minMaxScaleAllInputs(double rangeFrom, double rangeTo)
public MultiNormalizerHybrid standardizeInput(int input)
public MultiNormalizerHybrid minMaxScaleInput(int input)
public MultiNormalizerHybrid minMaxScaleInput(int input, double rangeFrom, double rangeTo)
public MultiNormalizerHybrid standardizeAllOutputs()
public MultiNormalizerHybrid minMaxScaleAllOutputs()
public MultiNormalizerHybrid minMaxScaleAllOutputs(double rangeFrom, double rangeTo)
public MultiNormalizerHybrid standardizeOutput(int output)
public MultiNormalizerHybrid minMaxScaleOutput(int output)
public MultiNormalizerHybrid minMaxScaleOutput(int output, double rangeFrom, double rangeTo)
public NormalizerStats getInputStats(int input)
public NormalizerStats getOutputStats(int output)
public void fit(@NonNull MultiDataSet dataSet)
public void fit(@NonNull MultiDataSetIterator iterator)
public void transform(@NonNull MultiDataSet data)
public void revert(@NonNull MultiDataSet data)
public void revertFeatures(@NonNull INDArray[] features)
public void revertFeatures(@NonNull INDArray[] features, INDArray[] maskArrays)
public void revertFeatures(@NonNull INDArray[] features, INDArray[] maskArrays, int input)
public void revertLabels(@NonNull INDArray[] labels)
public void revertLabels(@NonNull INDArray[] labels, INDArray[] maskArrays)
public void revertLabels(@NonNull INDArray[] labels, INDArray[] maskArrays, int output)
public CompositeDataSetPreProcessor(DataSetPreProcessor... preProcessors)
public void load(@NonNull List<File> featureFiles, @NonNull List<File> labelFiles) throws IOException
public void save(@NonNull List<File> featureFiles, @NonNull List<File> labelFiles) throws IOException
public void fit(DataSet dataSet)
public void fit(DataSetIterator iterator)
public void transform(DataSet toPreProcess)

Schemas

Schemas for datasets and transformation.

Why use schemas?

The unfortunate reality is that data is dirty. When trying to vectorize a dataset for deep learning, it is quite rare to find files that have zero errors. A schema is important for maintaining the meaning of the data before using it for something like training a neural network.

Using schemas

Schemas are primarily used for programming transformations. Before you can properly execute a TransformProcess you will need to pass the schema of the data being transformed.

An example of a schema for merchant records may look like:
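
A sketch of such a schema, built with the Schema.Builder methods documented below (all column names and categorical states here are illustrative):

import org.datavec.api.transform.schema.Schema;

Schema inputDataSchema = new Schema.Builder()
    .addColumnString("DateTimeString")
    .addColumnsString("CustomerID", "MerchantID")
    .addColumnInteger("NumItemsInTransaction")
    .addColumnCategorical("MerchantCountryCode", Arrays.asList("USA", "CAN", "FR", "MX"))
    .addColumnDouble("TransactionAmountUSD", 0.0, null, false, false) // non-negative, no NaN, no Infinite values
    .build();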

Joining schemas

If you have two different datasets that you want to merge together, DataVec provides a Join class with different join strategies such as Inner or RightOuter.

Once you've defined your join and you've loaded the data into DataVec, you must use an Executor to complete the join.
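
Defining the join itself might look roughly like the sketch below, using the Join.Builder (the join column and schema variables are illustrative); the joined data is then produced by an executor, for example SparkTransformExecutor.executeJoin or LocalTransformExecutor.executeJoin:

import org.datavec.api.transform.join.Join;

Join join = new Join.Builder(Join.JoinType.Inner)
        .setJoinColumns("CustomerID")                    // join where the CustomerID values match
        .setSchemas(customerInfoSchema, purchasesSchema)
        .build();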

Classes and utilities

DataVec comes with a few Schema classes and helper utilities for 2D and sequence types of data.

Join

Join class: used to specify a join (like an SQL join)

setSchemas

Type of join. Inner: return examples where the join column values occur in both datasets. LeftOuter: return all examples from the left data, whether there is a matching right value or not (if not, the right values will have NullWritable instead). RightOuter: return all examples from the right data, whether there is a matching left value or not (if not, the left values will have NullWritable instead). FullOuter: return all examples from both left and right, whether there is a matching value from the other side or not (if not, the other values will have NullWritable instead).

setKeyColumns

  • deprecated Use {- link #setJoinColumns(String…)}

setKeyColumnsLeft

  • deprecated Use {- link #setJoinColumnsLeft(String…)}

setKeyColumnsRight

  • deprecated Use {- link #setJoinColumnsRight(String…)}

setJoinColumnsLeft

Specify the names of the columns to join on, for the left data. The idea: join examples where firstDataValues(joinColumnNamesLeft[i]) == secondDataValues(joinColumnNamesRight[i]) for all i

  • param joinColumnNames Names of the columns to join on (for the left data)

setJoinColumnsRight

Specify the names of the columns to join on, for the right data. The idea: join examples where firstDataValues(joinColumnNamesLeft[i]) == secondDataValues(joinColumnNamesRight[i]) for all i

  • param joinColumnNames Names of the columns to join on (for the right data)

InferredSchema

If passed a CSV file that contains a header and a single row of sample data, it will return a Schema.

Only Double, Integer, Long, and String types are supported. If no number type can be inferred, the field type will become the default type. Note that if your column is actually categorical but is represented as a number, you will need to do additional transformation. Also, if your sample field is blank/null, it will also become the default type.
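
A minimal sketch of inferring a schema from a CSV file; the path and the import location are illustrative, and the constructor and build() call are assumed to follow the behaviour described above:

import org.datavec.api.transform.schema.InferredSchema;

Schema inferredSchema = new InferredSchema("path/to/data.csv").build();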

Schema

A Schema defines the layout of tabular data. Specifically, it contains names for each column, as well as details of types (Integer, String, Long, Double, etc). Type information for each column may optionally include restrictions on the allowable values for each column.

sameTypes

Create a schema based on the given metadata

  • param columnMetaData the metadata to create the schema from

newSchema

Compute the difference in {- link ColumnMetaData} between this schema and the passed in schema. This is useful during the {- link org.datavec.api.transform.TransformProcess} to identify what a process will do to a given {- link Schema}.

  • param schema the schema to compute the difference for

  • return the metadata that is different (in order) between this schema and the other schema

numColumns

Returns the number of columns or fields for this schema

  • return the number of columns or fields for this schema

getName

Returns the name of a given column at the specified index

  • param column the index of the column to get the name for

  • return the name of the column at the specified index

getType

Returns the {- link ColumnType} for the column at the specified index

  • param column the index of the column to get the type for

  • return the type of the column at the specified index

getType

Returns the {- link ColumnType} for the column with the specified name

  • param columnName the name of the column to get the type for

  • return the type of the column with the specified name

getMetaData

Returns the {- link ColumnMetaData} at the specified column index

  • param column the index to get the metadata for

  • return the metadata at the specified index

getMetaData

Retrieve the metadata for the given column name

  • param column the name of the column to get metadata for

  • return the metadata for the given column name

getIndexOfColumn

Return a copy of the list of column names

  • return a copy of the list of column names for this schema

hasColumn

Return the indices of the columns, given their names

  • param columnNames Name of the columns to get indices for

  • return Column indexes

toJson

Serialize this schema to json

  • return a json representation of this schema

toYaml

Serialize this schema to yaml

  • return the yaml representation of this schema

fromJson

Create a schema from a given json string

  • param json the json to create the schema from

  • return the created schema based on the json

fromYaml

Create a schema from the given yaml string

  • param yaml the yaml to create the schema from

  • return the created schema based on the yaml

addColumnFloat

Add a Float column with no restrictions on the allowable values, except for no NaN/infinite values allowed

  • param name Name of the column

addColumnFloat

Add a Float column with the specified restrictions (and no NaN/Infinite values allowed)

  • param name Name of the column

  • param minAllowedValue Minimum allowed value (inclusive). If null: no restriction

  • param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction

  • return

addColumnFloat

Add a Float column with the specified restrictions

  • param name Name of the column

  • param minAllowedValue Minimum allowed value (inclusive). If null: no restriction

  • param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction

  • param allowNaN If false: don’t allow NaN values. If true: allow.

addColumnsFloat

Add multiple Float columns with no restrictions on the allowable values of the columns (other than no NaN/Infinite)

  • param columnNames Names of the columns to add

addColumnsFloat

A convenience method for adding multiple Float columns. For example, to add columns “myFloatCol_0”, “myFloatCol_1”, “myFloatCol_2”, use {- code addColumnsFloat(“myFloatCol_%d”,0,2)}

  • param pattern Pattern to use (via String.format). “%d” is replaced with column numbers

  • param minIdxInclusive Minimum column index to use (inclusive)

  • param maxIdxInclusive Maximum column index to use (inclusive)

addColumnsFloat

A convenience method for adding multiple Float columns, with additional restrictions that apply to all columns For example, to add columns “myFloatCol_0”, “myFloatCol_1”, “myFloatCol_2”, use {- code addColumnsFloat(“myFloatCol_%d”,0,2,null,null,false,false)}

  • param pattern Pattern to use (via String.format). “%d” is replaced with column numbers

  • param minIdxInclusive Minimum column index to use (inclusive)

  • param maxIdxInclusive Maximum column index to use (inclusive)

  • param minAllowedValue Minimum allowed value (inclusive). If null: no restriction

addColumnDouble

Add a Double column with no restrictions on the allowable values, except for no NaN/infinite values allowed

  • param name Name of the column

addColumnDouble

Add a Double column with the specified restrictions (and no NaN/Infinite values allowed)

  • param name Name of the column

  • param minAllowedValue Minimum allowed value (inclusive). If null: no restriction

  • param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction

  • return

addColumnDouble

Add a Double column with the specified restrictions

  • param name Name of the column

  • param minAllowedValue Minimum allowed value (inclusive). If null: no restriction

  • param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction

  • param allowNaN If false: don’t allow NaN values. If true: allow.

addColumnsDouble

Add multiple Double columns with no restrictions on the allowable values of the columns (other than no NaN/Infinite)

  • param columnNames Names of the columns to add

addColumnsDouble

A convenience method for adding multiple Double columns. For example, to add columns “myDoubleCol_0”, “myDoubleCol_1”, “myDoubleCol_2”, use {- code addColumnsDouble(“myDoubleCol_%d”,0,2)}

  • param pattern Pattern to use (via String.format). “%d” is replaced with column numbers

  • param minIdxInclusive Minimum column index to use (inclusive)

  • param maxIdxInclusive Maximum column index to use (inclusive)

addColumnsDouble

A convenience method for adding multiple Double columns, with additional restrictions that apply to all columns For example, to add columns “myDoubleCol_0”, “myDoubleCol_1”, “myDoubleCol_2”, use {- code addColumnsDouble(“myDoubleCol_%d”,0,2,null,null,false,false)}

  • param pattern Pattern to use (via String.format). “%d” is replaced with column numbers

  • param minIdxInclusive Minimum column index to use (inclusive)

  • param maxIdxInclusive Maximum column index to use (inclusive)

  • param minAllowedValue Minimum allowed value (inclusive). If null: no restriction

addColumnInteger

Add an Integer column with no restrictions on the allowable values

  • param name Name of the column

addColumnInteger

Add an Integer column with the specified min/max allowable values

  • param name Name of the column

  • param minAllowedValue Minimum allowed value (inclusive). If null: no restriction

  • param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction

addColumnsInteger

Add multiple Integer columns with no restrictions on the min/max allowable values

  • param names Names of the integer columns to add

addColumnsInteger

A convenience method for adding multiple Integer columns. For example, to add columns “myIntegerCol_0”, “myIntegerCol_1”, “myIntegerCol_2”, use {- code addColumnsInteger(“myIntegerCol_%d”,0,2)}

  • param pattern Pattern to use (via String.format). “%d” is replaced with column numbers

  • param minIdxInclusive Minimum column index to use (inclusive)

  • param maxIdxInclusive Maximum column index to use (inclusive)

addColumnsInteger

A convenience method for adding multiple Integer columns. For example, to add columns “myIntegerCol_0”, “myIntegerCol_1”, “myIntegerCol_2”, use {- code addColumnsInteger(“myIntegerCol_%d”,0,2)}

  • param pattern Pattern to use (via String.format). “%d” is replaced with column numbers

  • param minIdxInclusive Minimum column index to use (inclusive)

  • param maxIdxInclusive Maximum column index to use (inclusive)

  • param minAllowedValue Minimum allowed value (inclusive). If null: no restriction

addColumnCategorical

Add a Categorical column, with the specified state names

  • param name Name of the column

  • param stateNames Names of the allowable states for this categorical column

addColumnCategorical

Add a Categorical column, with the specified state names

  • param name Name of the column

  • param stateNames Names of the allowable states for this categorical column

addColumnLong

Add a Long column, with no restrictions on the min/max values

  • param name Name of the column

addColumnLong

Add a Long column with the specified min/max allowable values

  • param name Name of the column

  • param minAllowedValue Minimum allowed value (inclusive). If null: no restriction

  • param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction

addColumnsLong

Add multiple Long columns, with no restrictions on the allowable values

  • param names Names of the Long columns to add

addColumnsLong

A convenience method for adding multiple Long columns. For example, to add columns “myLongCol_0”, “myLongCol_1”, “myLongCol_2”, use addColumnsLong("myLongCol_%d", 0, 2)

  • param pattern Pattern to use (via String.format). “%d” is replaced with column numbers

  • param minIdxInclusive Minimum column index to use (inclusive)

  • param maxIdxInclusive Maximum column index to use (inclusive)

addColumnsLong

A convenience method for adding multiple Long columns. For example, to add columns “myLongCol_0”, “myLongCol_1”, “myLongCol_2”, use addColumnsLong("myLongCol_%d", 0, 2)

  • param pattern Pattern to use (via String.format). “%d” is replaced with column numbers

  • param minIdxInclusive Minimum column index to use (inclusive)

  • param maxIdxInclusive Maximum column index to use (inclusive)

  • param minAllowedValue Minimum allowed value (inclusive). If null: no restriction

  • param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction

addColumn

Add a column

  • param metaData metadata for this column

addColumnString

Add a String column with no restrictions on the allowable values.

  • param name Name of the column

addColumnsString

Add multiple String columns with no restrictions on the allowable values

  • param columnNames Names of the String columns to add

addColumnString

Add a String column with the specified restrictions

  • param name Name of the column

  • param regex Regex that the String must match in order to be considered valid. If null: no regex restriction

  • param minAllowableLength Minimum allowable length for the String to be considered valid

  • param maxAllowableLength Maximum allowable length for the String to be considered valid

addColumnsString

A convenience method for adding multiple numbered String columns. For example, to add columns “myStringCol_0”, “myStringCol_1”, “myStringCol_2”, use addColumnsString("myStringCol_%d", 0, 2)

  • param pattern Pattern to use (via String.format). “%d” is replaced with column numbers

  • param minIdxInclusive Minimum column index to use (inclusive)

  • param maxIdxInclusive Maximum column index to use (inclusive)

addColumnsString

A convenience method for adding multiple numbered String columns. For example, to add columns “myStringCol_0”, “myStringCol_1”, “myStringCol_2”, use addColumnsString("myStringCol_%d", 0, 2)

  • param pattern Pattern to use (via String.format). “%d” is replaced with column numbers

  • param minIdxInclusive Minimum column index to use (inclusive)

  • param maxIdxInclusive Maximum column index to use (inclusive)

  • param regex Regex that the String must match in order to be considered valid. If null: no regex restriction

  • param minAllowedLength Minimum allowed length of strings (inclusive). If null: no restriction

  • param maxAllowedLength Maximum allowed length of strings (inclusive). If null: no restriction

addColumnTime

Add a Time column with no restrictions on the min/max allowable times. NOTE: Time columns are represented by LONG (epoch millisecond) values. For time values in human-readable formats, use String columns + StringToTimeTransform

  • param columnName Name of the column

  • param timeZone Time zone of the time column

addColumnTime

Add a Time column with no restrictions on the min/max allowable times. NOTE: Time columns are represented by LONG (epoch millisecond) values. For time values in human-readable formats, use String columns + StringToTimeTransform

  • param columnName Name of the column

  • param timeZone Time zone of the time column

addColumnTime

Add a Time column with the specified restrictions. NOTE: Time columns are represented by LONG (epoch millisecond) values. For time values in human-readable formats, use String columns + StringToTimeTransform

  • param columnName Name of the column

  • param timeZone Time zone of the time column

  • param minValidValue Minimum allowable time (in milliseconds). May be null.

  • param maxValidValue Maximum allowable time (in milliseconds). May be null.

addColumnNDArray

Add an NDArray column

  • param columnName Name of the column

  • param shape shape of the NDArray column. Use -1 in entries to specify as “variable length” in that dimension

build

Create the Schema

inferMultiple

Infers a schema based on the record. The column names are based on indexing.

  • param record the record to infer from

  • return the inferred schema

infer

Infers a schema based on the record. The column names are based on indexing.

  • param record the record to infer from

  • return the inferred schema
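
For reference, a minimal sketch of inferring a schema from a single record (the Writable values here are purely illustrative):

    List<Writable> record = Arrays.<Writable>asList(
            new Text("some text"), new IntWritable(7), new DoubleWritable(3.14));

    // Column types are taken from the Writable types; column names are generated from the indexes
    Schema inferred = Schema.infer(record);
    System.out.println(inferred.numColumns());   // prints 3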

SequenceSchema

inferSequenceMulti

Infers a sequence schema based on the record

  • param record the record to infer the schema based on

  • return the inferred sequence schema

inferSequence

Infers a sequence schema based on the record

  • param record the record to infer the schema based on

  • return the inferred sequence schema

    Schema inputDataSchema = new Schema.Builder()
        .addColumnsString("DateTimeString", "CustomerID", "MerchantID")
        .addColumnInteger("NumItemsInTransaction")
        .addColumnCategorical("MerchantCountryCode", Arrays.asList("USA","CAN","FR","MX"))
        .addColumnDouble("TransactionAmountUSD",0.0,null,false,false)   //$0.0 or more, no maximum limit, no NaN and no Infinite values
        .addColumnCategorical("FraudLabel", Arrays.asList("Fraud","Legit"))
        .build();
    Schema customerInfoSchema = new Schema.Builder()
        .addColumnLong("customerID")
        .addColumnString("customerName")
        .addColumnCategorical("customerCountry", Arrays.asList("USA","France","Japan","UK"))
        .build();
    
    Schema customerPurchasesSchema = new Schema.Builder()
        .addColumnLong("customerID")
        .addColumnTime("purchaseTimestamp", DateTimeZone.UTC)
        .addColumnLong("productID")
        .addColumnInteger("purchaseQty")
        .addColumnDouble("unitPriceUSD")
        .build();
    
    Join join = new Join.Builder(Join.JoinType.Inner)
        .setJoinColumns("customerID")
        .setSchemas(customerInfoSchema, customerPurchasesSchema)
        .build();
    public Builder setSchemas(Schema left, Schema right)
    public Builder setKeyColumns(String... keyColumnNames)
    public Builder setKeyColumnsLeft(String... keyColumnNames)
    public Builder setKeyColumnsRight(String... keyColumnNames)
    public Builder setJoinColumnsLeft(String... joinColumnNames)
    public Builder setJoinColumnsRight(String... joinColumnNames)
    public boolean sameTypes(Schema schema)
    public Schema newSchema(List<ColumnMetaData> columnMetaData)
    public int numColumns()
    public String getName(int column)
    public ColumnType getType(int column)
    public ColumnType getType(String columnName)
    public ColumnMetaData getMetaData(int column)
    public ColumnMetaData getMetaData(String column)
    public int getIndexOfColumn(String columnName)
    public boolean hasColumn(String columnName)
    public String toJson()
    public String toYaml()
    public static Schema fromJson(String json)
    public static Schema fromYaml(String yaml)
    public Builder addColumnFloat(String name)
    public Builder addColumnFloat(String name, Float minAllowedValue, Float maxAllowedValue)
    public Builder addColumnFloat(String name, Float minAllowedValue, Float maxAllowedValue, boolean allowNaN,
                                           boolean allowInfinite)
    public Builder addColumnsFloat(String... columnNames)
    public Builder addColumnsFloat(String pattern, int minIdxInclusive, int maxIdxInclusive)
    public Builder addColumnsFloat(String pattern, int minIdxInclusive, int maxIdxInclusive,
                                            Float minAllowedValue, Float maxAllowedValue, boolean allowNaN, boolean allowInfinite)
    public Builder addColumnDouble(String name)
    public Builder addColumnDouble(String name, Double minAllowedValue, Double maxAllowedValue)
    public Builder addColumnDouble(String name, Double minAllowedValue, Double maxAllowedValue, boolean allowNaN,
                            boolean allowInfinite)
    public Builder addColumnsDouble(String... columnNames)
    public Builder addColumnsDouble(String pattern, int minIdxInclusive, int maxIdxInclusive)
    public Builder addColumnsDouble(String pattern, int minIdxInclusive, int maxIdxInclusive,
                            Double minAllowedValue, Double maxAllowedValue, boolean allowNaN, boolean allowInfinite)
    public Builder addColumnInteger(String name)
    public Builder addColumnInteger(String name, Integer minAllowedValue, Integer maxAllowedValue)
    public Builder addColumnsInteger(String... names)
    public Builder addColumnsInteger(String pattern, int minIdxInclusive, int maxIdxInclusive)
    public Builder addColumnsInteger(String pattern, int minIdxInclusive, int maxIdxInclusive,
                            Integer minAllowedValue, Integer maxAllowedValue)
    public Builder addColumnCategorical(String name, String... stateNames)
    public Builder addColumnCategorical(String name, List<String> stateNames)
    public Builder addColumnLong(String name)
    public Builder addColumnLong(String name, Long minAllowedValue, Long maxAllowedValue)
    public Builder addColumnsLong(String... names)
    public Builder addColumnsLong(String pattern, int minIdxInclusive, int maxIdxInclusive)
    public Builder addColumnsLong(String pattern, int minIdxInclusive, int maxIdxInclusive, Long minAllowedValue,
                            Long maxAllowedValue)
    public Builder addColumn(ColumnMetaData metaData)
    public Builder addColumnString(String name)
    public Builder addColumnsString(String... columnNames)
    public Builder addColumnString(String name, String regex, Integer minAllowableLength,
                            Integer maxAllowableLength)
    public Builder addColumnsString(String pattern, int minIdxInclusive, int maxIdxInclusive)
    public Builder addColumnsString(String pattern, int minIdxInclusive, int maxIdxInclusive, String regex,
                            Integer minAllowedLength, Integer maxAllowedLength)
    public Builder addColumnTime(String columnName, TimeZone timeZone)
    public Builder addColumnTime(String columnName, DateTimeZone timeZone)
    public Builder addColumnTime(String columnName, DateTimeZone timeZone, Long minValidValue, Long maxValidValue)
    public Builder addColumnNDArray(String columnName, long[] shape)
    public Schema build()
    public static Schema inferMultiple(List<List<Writable>> record)
    public static Schema infer(List<Writable> record)
    public static SequenceSchema inferSequenceMulti(List<List<List<Writable>>> record)
    public static SequenceSchema inferSequence(List<List<Writable>> record)

    Readers

    Read individual records from different formats.

    Why readers?

    Readers iterate over records in a stored dataset and load the data into DataVec. Beyond reading individual entries, readers are useful when you want to, for example, train a text generator on an entire corpus, or programmatically compose two entries into a new record. Reader implementations are also useful for complex file types or distributed storage mechanisms.

    Readers return Writable classes that describe each column in a Record. These classes are used to convert each record to a tensor/ND-Array format.

    Usage

    Each reader implementation extends BaseRecordReader and provides a simple API for selecting the next record in a dataset, acting similarly to iterators.

    Useful methods include:

    • next: Return a batch of Writable.

    • nextRecord: Return a single Record, optionally with RecordMetaData.
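
    As a minimal sketch (the file path is hypothetical), a simple line-based reader can be initialized from a FileSplit and iterated like any other reader:

    LineRecordReader reader = new LineRecordReader();
    reader.initialize(new FileSplit(new File("data/lines.txt")));   // hypothetical path

    while (reader.hasNext()) {
        List<Writable> record = reader.next();   // one line of the file -> one record of Writable values
        System.out.println(record);
    }
    reader.close();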

    Listeners

    You can hook a custom RecordListener to a record reader for debugging or visualization purposes. Pass your custom listener to the reader (for example via setListeners) immediately after initializing it.
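
    A sketch using the built-in LogRecordListener (any RecordListener implementation can be supplied; the file path is hypothetical):

    RecordReader reader = new CSVRecordReader();
    reader.initialize(new FileSplit(new File("data/records.csv")));   // hypothetical path
    reader.setListeners(new LogRecordListener());                     // logs every record as it is read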

    Types of readers

    ComposableRecordReader

    A RecordReader that composes multiple underlying record readers into a single pipeline: each individual record is the concatenation of the records returned by the underlying readers. hasNext() is true only when all of the underlying readers have a next record, and next() concatenates (via addAll on the collection) their records and returns them as one record.

    initialize

    ConcatenatingRecordReader

    Combine multiple readers into a single reader. Records are read sequentially - thus if the first reader has 100 records, and the second reader has 200 records, ConcatenatingRecordReader will have 300 records.

    FileRecordReader

    File reader/writer

    getCurrentLabel

    Return the current label. The index of the current file’s parent directory in the label list

    • return The index of the current file’s parent directory

    LineRecordReader

    Reads files line by line

    CollectionRecordReader

    Collection record reader. Mainly used for testing.

    CollectionSequenceRecordReader

    Collection record reader for sequences. Mainly used for testing.

    initialize

    • param records Collection of sequences. For example, List<List<List>> where the inner two lists are a sequence, and the outer list/collection is a list of sequences

    ListStringRecordReader

    Iterates through a list of strings, returning a record for each entry.

    initialize

    Called once at initialization.

    • param split the split that defines the range of records to read

    • throws IOException

    • throws InterruptedException

    initialize

    Called once at initialization.

    • param conf a configuration for initialization

    • param split the split that defines the range of records to read

    • throws IOException

    • throws InterruptedException

    hasNext

    Whether there are any more records available to read

    • return true if another record is available, false otherwise

    reset

    Reset the reader so that the records can be iterated over again from the start

    nextRecord

    Load the record from the given DataInputStream. Unlike next(), the internal state of the RecordReader is not modified. Implementations of this method should not close the DataInputStream.

    • param uri

    • param dataInputStream

    • throws IOException if error occurs during reading from the input stream

    close

    Closes this stream and releases any system resources associated with it. If the stream is already closed then invoking this method has no effect.

    As noted in AutoCloseable#close(), cases where the close may fail require careful attention. It is strongly advised to relinquish the underlying resources and to internally mark the Closeable as closed, prior to throwing the IOException.

    • throws IOException if an I/O error occurs

    setConf

    Set the configuration to be used by this object.

    • param conf

    getConf

    Return the configuration used by this object.

    CSVRecordReader

    Simple csv record reader.

    initialize

    Skip first n lines

    • param skipNumLines the number of lines to skip
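
    For instance, a sketch that skips a header row and reads comma-separated values (constructor overloads can differ slightly between DataVec versions, and the file path is hypothetical):

    int numLinesToSkip = 1;    // skip the header row
    char delimiter = ',';

    CSVRecordReader csvReader = new CSVRecordReader(numLinesToSkip, delimiter);
    csvReader.initialize(new FileSplit(new File("data/iris.csv")));   // hypothetical path

    while (csvReader.hasNext()) {
        List<Writable> row = csvReader.next();   // one CSV row -> one list of Writable values
    }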

    CSVRegexRecordReader

    A CSVRecordReader that can split each column into additional columns using regexes.

    CSVSequenceRecordReader

    CSV Sequence Record Reader. This reader is intended to read sequences of data in CSV format, where each sequence is defined in its own file (and there are multiple files). Each line in the file represents one time step.

    CSVVariableSlidingWindowRecordReader

    A sliding window of variable size across an entire CSV.

    In practice, the sliding window size starts at 1, then linearly increases to maxLinesPerSequence, then linearly decreases back to 1.

    initialize

    No-arg constructor with the default number of lines per sequence (10)

    LibSvmRecordReader

    Record reader for libsvm format, which is closely related to SVMLight format. Similar to scikit-learn we use a single reader for both formats, so this class is a subclass of SVMLightRecordReader.

    Further details on the format can be found at http://svmlight.joachims.org/, http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html and http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html

    MatlabRecordReader

    Matlab record reader

    SVMLightRecordReader

    Record reader for SVMLight format, which can generally be described as

    LABEL INDEX:VALUE INDEX:VALUE …

    SVMLight format is well-suited to sparse data (e.g., bag-of-words) because it omits all features with value zero.

    We support an “extended” version that allows for multiple targets (or labels) separated by a comma, as follows:

    LABEL1,LABEL2,… INDEX:VALUE INDEX:VALUE …

    This can be used to represent either multitask problems or multilabel problems with sparse binary labels (controlled via the “MULTILABEL” configuration option).

    Like scikit-learn, we support both zero-based and one-based indexing.

    Further details on the format can be found at http://svmlight.joachims.org/, http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html and http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html

    initialize

    Must be called before attempting to read records.

    • param conf DataVec configuration

    • param split FileSplit

    • throws IOException

    • throws InterruptedException

    setConf

    Set configuration.

    • param conf DataVec configuration

    • throws IOException

    • throws InterruptedException

    hasNext

    Helper function to help detect lines that are commented out. May read ahead and cache a line.

    • return

    nextRecord

    Return next record as list of Writables.

    • return

    RegexLineRecordReader

    RegexLineRecordReader: Read a file, one line at a time, and split it into fields using a regex. To load an entire file as a single sequence (split line by line with a regex), use RegexSequenceRecordReader instead.

    Example: Data in format “2016-01-01 23:59:59.001 1 DEBUG First entry message!” using regex String “(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) (\d+) ([A-Z]+) (.*)” would be split into 4 Text writables: [“2016-01-01 23:59:59.001”, “1”, “DEBUG”, “First entry message!”]

    RegexSequenceRecordReader

    RegexSequenceRecordReader: Read an entire file (as a sequence), one line at a time and split each line into fields using a regex.

    Example: Data in format “2016-01-01 23:59:59.001 1 DEBUG First entry message!” using regex String “(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) (\d+) ([A-Z]+) (.*)” would be split into 4 Text writables: [“2016-01-01 23:59:59.001”, “1”, “DEBUG”, “First entry message!”]

    Lines that don’t match the provided regex can either result in an exception (FailOnInvalid), be skipped silently (SkipInvalid), or be skipped with a warning logged (SkipInvalidWithWarning).

    TransformProcessRecordReader

    Wraps another record reader, allowing each record to have a transform process applied before being returned.

    initialize

    Called once at initialization.

    • param split the split that defines the range of records to read

    • throws IOException

    • throws InterruptedException

    initialize

    Called once at initialization.

    • param conf a configuration for initialization

    • param split the split that defines the range of records to read

    • throws IOException

    • throws InterruptedException

    hasNext

    Whether there are any more records available to read

    • return true if another record is available, false otherwise

    reset

    Reset the reader so that the records can be iterated over again from the start

    nextRecord

    Load the record from the given DataInputStream. Unlike next(), the internal state of the RecordReader is not modified. Implementations of this method should not close the DataInputStream.

    • param uri

    • param dataInputStream

    • throws IOException if error occurs during reading from the input stream

    loadFromMetaData

    Load a single record from the given RecordMetaData instance. Note that for data that isn’t splittable (i.e., text data that needs to be scanned/split), it is more efficient to load multiple records at once using loadFromMetaData(List)

    • param recordMetaData Metadata for the record that we want to load from

    • return Single record for the given RecordMetaData instance

    • throws IOException If I/O error occurs during loading

    loadFromMetaData

    Load multiple records from the given list of RecordMetaData instances

    • param recordMetaDatas Metadata for the records that we want to load from

    • return Multiple records for the given RecordMetaData instances

    • throws IOException If I/O error occurs during loading

    setListeners

    Set the record listeners for this record reader.

    • param listeners

    close

    Closes this stream and releases any system resources associated with it. If the stream is already closed then invoking this method has no effect.

    As noted in AutoCloseable#close(), cases where the close may fail require careful attention. It is strongly advised to relinquish the underlying resources and to internally mark the Closeable as closed, prior to throwing the IOException.

    • throws IOException if an I/O error occurs

    setConf

    Set the configuration to be used by this object.

    • param conf

    getConf

    Return the configuration used by this object.

    TransformProcessSequenceRecordReader

    Wraps another sequence record reader, allowing each sequence to be transformed before being returned.

    setConf

    Set the configuration to be used by this object.

    • param conf

    getConf

    Return the configuration used by this object.

    batchesSupported

    Whether this record reader supports reading records in batches

    • return true if batch reading is supported

    nextSequence

    Load a sequence record from the given DataInputStream. Unlike next(), the internal state of the RecordReader is not modified. Implementations of this method should not close the DataInputStream.

    • param uri

    • param dataInputStream

    • throws IOException if error occurs during reading from the input stream

    loadSequenceFromMetaData

    Load a single sequence record from the given RecordMetaData instance. Note that for data that isn’t splittable (i.e., text data that needs to be scanned/split), it is more efficient to load multiple records at once using loadSequenceFromMetaData(List)

    • param recordMetaData Metadata for the sequence record that we want to load from

    • return Single sequence record for the given RecordMetaData instance

    • throws IOException If I/O error occurs during loading

    loadSequenceFromMetaData

    Load multiple sequence records from the given list of RecordMetaData instances

    • param recordMetaDatas Metadata for the records that we want to load from

    • return Multiple sequence record for the given RecordMetaData instances

    • throws IOException If I/O error occurs during loading

    initialize

    Called once at initialization.

    • param conf a configuration for initialization

    • param split the split that defines the range of records to read

    • throws IOException

    • throws InterruptedException

    hasNext

    Whether there are any more records available to read

    • return true if another record is available, false otherwise

    reset

    Reset the reader so that the records can be iterated over again from the start

    nextRecord

    Load the record from the given DataInputStream. Unlike next(), the internal state of the RecordReader is not modified. Implementations of this method should not close the DataInputStream.

    • param uri

    • param dataInputStream

    • throws IOException if error occurs during reading from the input stream

    loadFromMetaData

    Load a single record from the given RecordMetaData instance. Note that for data that isn’t splittable (i.e., text data that needs to be scanned/split), it is more efficient to load multiple records at once using loadFromMetaData(List)

    • param recordMetaData Metadata for the record that we want to load from

    • return Single record for the given RecordMetaData instance

    • throws IOException If I/O error occurs during loading

    loadFromMetaData

    Load multiple records from the given list of RecordMetaData instances

    • param recordMetaDatas Metadata for the records that we want to load from

    • return Multiple records for the given RecordMetaData instances

    • throws IOException If I/O error occurs during loading

    setListeners

    Set the record listeners for this record reader.

    • param listeners

    close

    Closes this stream and releases any system resources associated with it. If the stream is already closed then invoking this method has no effect.

    As noted in AutoCloseable#close(), cases where the close may fail require careful attention. It is strongly advised to relinquish the underlying resources and to internally mark the Closeable as closed, prior to throwing the IOException.

    • throws IOException if an I/O error occurs

    NativeAudioRecordReader

    Native audio file loader using FFmpeg.

    WavFileRecordReader

    Wav file loader

    ImageRecordReader

    Image record reader. Reads a local file system and parses images of a given height and width. All images are rescaled and converted to the given height, width, and number of channels.

    Also appends the label if specified (one-of-k encoding, based on the directory structure where each subdirectory of the root is an indexed label)
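
    A sketch, assuming the datavec-data-image module is on the classpath and images are stored one subdirectory per label under a (hypothetical) parentDir:

    int height = 64, width = 64, channels = 3;
    ParentPathLabelGenerator labelMaker = new ParentPathLabelGenerator();   // label = name of the parent directory

    ImageRecordReader imageReader = new ImageRecordReader(height, width, channels, labelMaker);
    imageReader.initialize(new FileSplit(new File("parentDir")));           // hypothetical image root directory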

    TfidfRecordReader

    TFIDF record reader (wraps a tfidf vectorizer for delivering labels and conforming to the record reader interface)

    • reset: Reset the underlying iterator.

    • hasNext: Iterator method to determine if another record is available.

    public void initialize(InputSplit split) throws IOException, InterruptedException
    public int getCurrentLabel()
    public void initialize(InputSplit split) throws IOException, InterruptedException
    public void initialize(InputSplit split) throws IOException, InterruptedException
    public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedException
    public boolean hasNext()
    public void reset()
    public Record nextRecord()
    public void close() throws IOException
    public void setConf(Configuration conf)
    public Configuration getConf()
    public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedException
    public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedException
    public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedException
    public void setConf(Configuration conf)
    public boolean hasNext()
    public Record nextRecord()
    public void initialize(InputSplit split) throws IOException, InterruptedException
    public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedException
    public boolean hasNext()
    public void reset()
    public Record nextRecord()
    public Record loadFromMetaData(RecordMetaData recordMetaData) throws IOException
    public void setListeners(RecordListener... listeners)
    public void setListeners(Collection<RecordListener> listeners)
    public void close() throws IOException
    public void setConf(Configuration conf)
    public Configuration getConf()
    public void setConf(Configuration conf)
    public Configuration getConf()
    public boolean batchesSupported()
    public SequenceRecord nextSequence()
    public SequenceRecord loadSequenceFromMetaData(RecordMetaData recordMetaData) throws IOException
    public void initialize(InputSplit split) throws IOException, InterruptedException
    public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedException
    public boolean hasNext()
    public void reset()
    public Record nextRecord()
    public Record loadFromMetaData(RecordMetaData recordMetaData) throws IOException
    public void setListeners(RecordListener... listeners)
    public void setListeners(Collection<RecordListener> listeners)
    public void close() throws IOException

    Transforms

    Data wrangling and mapping from one schema to another.

    Data wrangling

    One of the key tools in DataVec is transformations. DataVec helps the user map a dataset from one schema to another, and provides a list of operations to convert types, format data, and convert a 2D dataset to sequence data.

    Building a transform process

    A transform process requires a Schema to successfully transform data. Both the schema and transform process classes come with helper Builder classes that are useful for organizing code and avoiding complex constructors.

    When combined, they look like the sample code below. Note how inputDataSchema is passed into the Builder constructor; the transform process will not compile without it.
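
    A minimal sketch, reusing the inputDataSchema shown earlier on this page (the operations chosen here are purely illustrative):

    TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
        .removeColumns("CustomerID", "MerchantID")               // drop columns not needed downstream
        .categoricalToInteger("MerchantCountryCode")             // replace category values with integer indexes
        .conditionalReplaceValueTransform("TransactionAmountUSD",
            new DoubleWritable(0.0),                             // replacement value...
            new DoubleColumnCondition("TransactionAmountUSD",    // ...used when the amount is negative
                ConditionOp.LessThan, 0.0))
        .build();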

    Executing a transformation

    Different "backends" for executors are available. Using the tp transform process above, here's how you can execute it locally using plain DataVec.
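
    A sketch, assuming the raw data has already been parsed into a List<List<Writable>> called inputData, and that the datavec-local module (which provides LocalTransformExecutor) is on the classpath; for Spark, SparkTransformExecutor in datavec-spark plays the same role:

    List<List<Writable>> transformed = LocalTransformExecutor.execute(inputData, tp);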

    Debugging

    Each operation in a transform process represents a "step" in schema changes. Sometimes, the resulting transformation is not the intended result. You can debug this by printing each step in the transform tp with the following:
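
    A sketch using the getActionList and getSchemaAfterStep methods documented below to print the schema produced by each step of tp:

    int numSteps = tp.getActionList().size();
    for (int i = 0; i < numSteps; i++) {
        System.out.println("--- Schema after step " + i + " ---");
        System.out.println(tp.getSchemaAfterStep(i));
    }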

    Available transformations and conversions

    TransformProcess

    A TransformProcess defines an ordered list of transformations to be executed on some data

    getActionList

    Get the action list that this transform process will execute

    • return

    getSchemaAfterStep

    Return the schema after executing all steps up to and including the specified step. Steps are indexed from 0: so getSchemaAfterStep(0) is after one transform has been executed.

    • param step Index of the step

    • return Schema of the data, after that (and all prior) steps have been executed

    execute

    Execute the full sequence of transformations for a single example. May return null if the example is filtered. NOTE: Some TransformProcess operations cannot be applied to examples individually; most notably, ConvertToSequence and ConvertFromSequence operations require the full data set to be processed at once

    • param input

    • return

    toYaml

    Convert the TransformProcess to a YAML string

    • return TransformProcess, as YAML

    fromJson

    Deserialize a JSON String (created by toJson()) to a TransformProcess

    • return TransformProcess, from JSON

    fromYaml

    Deserialize a YAML String (created by toYaml()) to a TransformProcess

    • return TransformProcess, from YAML

    inferCategories

    Infer the categories for the given record reader for a particular column Note that each “column index” is a column in the context of: List record = ...; record.get(columnIndex);

    Note that anything passed in as a column will be automatically converted to a string for categorical purposes.

    The expected input is strings or numbers (which have sensible toString() representations)

    Note that the returned categories will be sorted alphabetically

    • param recordReader the record reader to iterate through

    • param columnIndex the column index to get categories for

    • return

    filter

    Add a filter operation to be executed after the previously-added operations have been executed

    • param filter Filter operation to execute

    filter

    Add a filter operation, based on the specified condition.

    If the condition is satisfied (returns true): remove the example or sequence. If the condition is not satisfied (returns false): keep the example or sequence

    • param condition Condition to filter on

    removeColumns

    Remove all of the specified columns, by name

    • param columnNames Names of the columns to remove

    removeColumns

    Remove all of the specified columns, by name

    • param columnNames Names of the columns to remove

    removeAllColumnsExceptFor

    Remove all columns, except for those that are specified here

    • param columnNames Names of the columns to keep

    removeAllColumnsExceptFor

    Remove all columns, except for those that are specified here

    • param columnNames Names of the columns to keep

    renameColumn

    Rename a single column

    • param oldName Original column name

    • param newName New column name

    renameColumns

    Rename multiple columns

    • param oldNames List of original column names

    • param newNames List of new column names

    reorderColumns

    Reorder the columns using a partial or complete new ordering. If only some of the column names are specified for the new order, the remaining columns will be placed at the end, according to their current relative ordering

    • param newOrder Names of the columns, in the order they will appear in the output

    duplicateColumn

    Duplicate a single column

    • param column Name of the column to duplicate

    • param newName Name of the new (duplicate) column

    duplicateColumns

    Duplicate a set of columns

    • param columnNames Names of the columns to duplicate

    • param newNames Names of the new (duplicated) columns

    integerMathOp

    Perform a mathematical operation (add, subtract, scalar max etc) on the specified integer column, with a scalar

    • param column The integer column to perform the operation on

    • param mathOp The mathematical operation

    • param scalar The scalar value to use in the mathematical operation

    integerColumnsMathOp

    Calculate and add a new integer column by performing a mathematical operation on a number of existing columns. New column is added to the end.

    • param newColumnName Name of the new/derived column

    • param mathOp Mathematical operation to execute on the columns

    • param columnNames Names of the columns to use in the mathematical operation

    longMathOp

    Perform a mathematical operation (add, subtract, scalar max etc) on the specified long column, with a scalar

    • param columnName The long column to perform the operation on

    • param mathOp The mathematical operation

    • param scalar The scalar value to use in the mathematical operation

    longColumnsMathOp

    Calculate and add a new long column by performing a mathematical operation on a number of existing columns. New column is added to the end.

    • param newColumnName Name of the new/derived column

    • param mathOp Mathematical operation to execute on the columns

    • param columnNames Names of the columns to use in the mathematical operation

    floatMathOp

    Perform a mathematical operation (add, subtract, scalar max etc) on the specified float column, with a scalar

    • param columnName The float column to perform the operation on

    • param mathOp The mathematical operation

    • param scalar The scalar value to use in the mathematical operation

    floatColumnsMathOp

    Calculate and add a new float column by performing a mathematical operation on a number of existing columns. New column is added to the end.

    • param newColumnName Name of the new/derived column

    • param mathOp Mathematical operation to execute on the columns

    • param columnNames Names of the columns to use in the mathematical operation

    floatMathFunction

    Perform a mathematical operation (such as sin(x), ceil(x), exp(x) etc) on a column

    • param columnName Column name to operate on

    • param mathFunction MathFunction to apply to the column

    doubleMathOp

    Perform a mathematical operation (add, subtract, scalar max etc) on the specified double column, with a scalar

    • param columnName The double column to perform the operation on

    • param mathOp The mathematical operation

    • param scalar The scalar value to use in the mathematical operation

    doubleColumnsMathOp

    Calculate and add a new double column by performing a mathematical operation on a number of existing columns. New column is added to the end.

    • param newColumnName Name of the new/derived column

    • param mathOp Mathematical operation to execute on the columns

    • param columnNames Names of the columns to use in the mathematical operation

    doubleMathFunction

    Perform a mathematical operation (such as sin(x), ceil(x), exp(x) etc) on a column

    • param columnName Column name to operate on

    • param mathFunction MathFunction to apply to the column

    timeMathOp

    Perform a mathematical operation (add, subtract, scalar min/max only) on the specified time column

    • param columnName The time column to perform the operation on

    • param mathOp The mathematical operation

    • param timeQuantity The quantity used in the mathematical op

    • param timeUnit The unit that timeQuantity is specified in

    categoricalToOneHot

    Convert the specified column(s) from a categorical representation to a one-hot representation. This involves creating multiple new columns for each input column (one per category).

    • param columnNames Names of the categorical column(s) to convert to a one-hot representation

    categoricalToInteger

    Convert the specified column(s) from a categorical representation to an integer representation. This will replace the specified categorical column(s) with an integer representation, where each integer has a value from 0 to numCategories-1.

    • param columnNames Name of the categorical column(s) to convert to an integer representation

    integerToCategorical

    Convert the specified column from an integer representation (assume values 0 to numCategories-1) to a categorical representation, given the specified state names

    • param columnName Name of the column to convert

    • param categoryStateNames Names of the states for the categorical column

    integerToCategorical

    Convert the specified column from an integer representation to a categorical representation, given the specified mapping between integer indexes and state names

    • param columnName Name of the column to convert

    • param categoryIndexNameMap Names of the states for the categorical column

    integerToOneHot

    Convert an integer column to a set of one-hot columns, based on the value in the integer column

    • param columnName Name of the integer column

    • param minValue Minimum value possible for the integer column (inclusive)

    • param maxValue Maximum value possible for the integer column (inclusive)

    addConstantColumn

    Add a new column, where all values in the column are identical and as specified.

    • param newColumnName Name of the new column

    • param newColumnType Type of the new column

    • param fixedValue Value in the new column for all records

    addConstantDoubleColumn

    Add a new double column, where the value for that column (for all records) is identical

    • param newColumnName Name of the new column

    • param value Value in the new column for all records

    addConstantIntegerColumn

    Add a new integer column, where the value for that column (for all records) is identical

    • param newColumnName Name of the new column

    • param value Value of the new column for all records

    addConstantLongColumn

    Add a new long column, where the value for that column (for all records) is identical

    • param newColumnName Name of the new column

    • param value Value in the new column for all records

    convertToString

    Convert the specified column to a string.

    • param inputColumn the input column to convert

    • return builder pattern

    convertToDouble

    Convert the specified column to a double.

    • param inputColumn the input column to convert

    • return builder pattern

    convertToInteger

    Convert the specified column to an integer.

    • param inputColumn the input column to convert

    • return builder pattern

    normalize

    Normalize the specified column with a given type of normalization

    • param column Column to normalize

    • param type Type of normalization to apply

    • param da DataAnalysis object

    convertToSequence

    Convert a set of independent records/examples into a sequence, according to some key. Within each sequence, values are ordered using the provided SequenceComparator

    • param keyColumn Column to use as a key (values with the same key will be combined into sequences)

    • param comparator A SequenceComparator to order the values within each sequence (for example, by time or String order)

    convertToSequence

    Convert a set of independent records/examples into a sequence; each example is simply treated as a sequence of length 1, without any join/group operations. Note that more commonly, joining/grouping is required; use convertToSequence(List, SequenceComparator) for this functionality

    convertToSequence

    Convert a set of independent records/examples into a sequence, where each sequence is grouped according to one or more key values (i.e., the values in one or more columns). Within each sequence, values are ordered using the provided SequenceComparator

    • param keyColumns Column to use as a key (values with the same key will be combined into sequences)

    • param comparator A SequenceComparator to order the values within each sequence (for example, by time or String order)
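
    A sketch of grouping independent records into sequences, assuming a schema with a “customerID” key column and a numeric “time” column (NumericalColumnComparator orders the steps within each sequence by that column):

    TransformProcess tp = new TransformProcess.Builder(schema)
        .convertToSequence("customerID", new NumericalColumnComparator("time"))
        .build();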

    convertFromSequence

    Convert a sequence to a set of individual values (by treating each value in each sequence as a separate example)

    splitSequence

    Split sequences into 1 or more other sequences. Used for example to split large sequences into a set of smaller sequences

    • param split SequenceSplit that defines how splits will occur

    trimSequence

    SequenceTrimTransform removes the first or last N values in a sequence. Note that the resulting sequence may be of length 0, if the input sequence length is less than or equal to N.

    • param numStepsToTrim Number of time steps to trim from the sequence

    • param trimFromStart If true: Trim values from the start of the sequence. If false: trim values from the end.

    offsetSequence

    Perform a sequence offset operation on the specified columns. Note that this also truncates sequences by the specified offset amount by default. Use transform(new SequenceOffsetTransform(…)) to change this. See SequenceOffsetTransform for details on exactly what this operation does and how.

    • param columnsToOffset Columns to offset

    • param offsetAmount Amount to offset the specified columns by (positive offset: ‘columnsToOffset’ are moved to later time steps)

    • param operationType Whether the offset should be done in-place or by adding a new column

    reduce

    Reduce (i.e., aggregate/combine) a set of examples (typically by key). Note: In the current implementation, reduction operations can be performed only on standard (i.e., non-sequence) data

    • param reducer Reducer to use

    reduceSequence

    Reduce (i.e., aggregate/combine) a set of sequence examples - for each sequence individually. Note: This method results in non-sequence data. If you would instead prefer sequences of length 1 after the reduction, use transform(new ReduceSequenceTransform(reducer)).

    • param reducer Reducer to use to reduce each window

    reduceSequenceByWindow

    Reduce (i.e., aggregate/combine) a set of sequence examples - for each sequence individually - using a window function. For example, take all records/examples in each 24-hour period (i.e., using a window function), and convert them into a single value (using the reducer). In this example, the output is a sequence, with a time period of 24 hours.

    • param reducer Reducer to use to reduce each window

    • param windowFunction Window function to apply on each sequence individually

    sequenceMovingWindowReduce

    SequenceMovingWindowReduceTransform: Adds a new column, where the value is derived by (a) using a window of the last N values in a single column, and (b) applying a reduction op on that window to calculate a new value. For example, this transformer can be used to implement a simple moving average of the last N values, or to determine the minimum or maximum value in the last N time steps.

    For example, for a simple moving average over a window of length 20: new SequenceMovingWindowReduceTransform("myCol", 20, ReduceOp.Mean)

    • param columnName Column name to perform windowing on

    • param lookback Look back period for windowing

    • param op Reduction operation to perform on each window

    calculateSortedRank

    CalculateSortedRank: calculate the rank of each example, after sorting the examples. For example, we might have some numerical “score” column, and we want to know the rank (sort order) of each example according to that column. The rank of each example (after sorting) will be added in a new Long column. Indexing is done from 0; examples will have values 0 to dataSetSize-1.

    Currently, CalculateSortedRank can only be applied on standard (i.e., non-sequence) data. Furthermore, the current implementation can only sort on one column.

    • param newColumnName Name of the new column (will contain the rank for each example)

    • param sortOnColumn Column to sort on

    • param comparator Comparator used to sort examples

    calculateSortedRank

    CalculateSortedRank: calculate the rank of each example, after sorting the examples. For example, we might have some numerical “score” column, and we want to know the rank (sort order) of each example according to that column. The rank of each example (after sorting) will be added in a new Long column. Indexing is done from 0; examples will have values 0 to dataSetSize-1.

    Currently, CalculateSortedRank can only be applied on standard (i.e., non-sequence) data. Furthermore, the current implementation can only sort on one column.

    • param newColumnName Name of the new column (will contain the rank for each example)

    • param sortOnColumn Column to sort on

    • param comparator Comparator used to sort examples

    • param ascending If true: sort ascending. False: descending

    stringToCategorical

    Convert the specified String column to a categorical column. The state names must be provided.

    • param columnName Name of the String column to convert to categorical

    • param stateNames State names of the category

    stringRemoveWhitespaceTransform

    Remove all whitespace characters from the values in the specified String column

    • param columnName Name of the column to remove whitespace from

    stringMapTransform

    Replace one or more String values in the specified column with new values.

    Keys in the map are the original values; the values in the map are their replacements. If a String appears in the data but does not appear in the provided map (as a key), that String value will not be modified.

    • param columnName Name of the column in which to do replacement

    • param mapping Map of oldValues -> newValues

    stringToTimeTransform

    Convert a String column (containing a date/time String) to a time column (by parsing the date/time String)

    • param column String column containing the date/time Strings

    • param format Format of the strings. The time format is specified as per Joda-Time’s DateTimeFormat

    • param dateTimeZone Timezone of the column
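
    For example, a sketch converting the “DateTimeString” column from the schema shown earlier on this page into an epoch-millisecond time column (the format pattern is illustrative):

    TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
        .stringToTimeTransform("DateTimeString", "YYYY-MM-dd HH:mm:ss", DateTimeZone.UTC)
        .build();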

    stringToTimeTransform

    Convert a String column (containing a date/time String) to a time column (by parsing the date/time String)

    • param column String column containing the date/time Strings

    • param format Format of the strings. The time format is specified as per Joda-Time’s DateTimeFormat

    • param dateTimeZone Timezone of the column

    • param locale Locale of the column

    appendStringColumnTransform

    Append a String to a specified column

    • param column Column to append the value to

    • param toAppend String to append to the end of each writable

    conditionalReplaceValueTransform

    Replace the values in a specified column with a specified new value, if some condition holds. If the condition does not hold, the original values are not modified.

    • param column Column to operate on

    • param newValue Value to use as replacement, if condition is satisfied

    • param condition Condition that must be satisfied for replacement

    conditionalReplaceValueTransformWithDefault

    Replace the values in a specified column with a specified “yes” value, if some condition holds. Replace it with a “no” value, otherwise.

    • param column Column to operate on

    • param yesVal Value to use as replacement, if condition is satisfied

    • param noVal Value to use as replacement, if condition is not satisfied

    • param condition Condition that must be satisfied for replacement

    conditionalCopyValueTransform

    Replace the value in a specified column with a new value taken from another column, if a condition is satisfied/true. Note that the condition can be any generic condition, including on other column(s), different to the column that will be modified if the condition is satisfied/true.

    • param columnToReplace Name of the column in which values will be replaced (if condition is satisfied)

    • param sourceColumn Name of the column from which the new values will be taken

    • param condition Condition to use

    replaceStringTransform

    Replace one or more String values in the specified column that match regular expressions.

    Keys in the map are the regular expressions; the Values in the map are their String replacements. For example:
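
    An illustrative sketch (the column name and patterns are hypothetical): replace any run of whitespace with a single underscore, and strip a trailing “%” sign:

    Map<String, String> mapping = new HashMap<>();
    mapping.put("\\s+", "_");    // any run of whitespace -> a single underscore
    mapping.put("%$", "");       // remove a trailing percent sign

    TransformProcess tp = new TransformProcess.Builder(schema)
        .replaceStringTransform("myStringColumn", mapping)
        .build();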

    • param columnName Name of the column in which to do replacement

    • param mapping Map of old values or regular expression to new values

    ndArrayScalarOpTransform

    Element-wise NDArray math operation (add, subtract, etc) on an NDArray column

    • param columnName Name of the NDArray column to perform the operation on

    • param op Operation to perform

    • param value Value for the operation

    ndArrayColumnsMathOpTransform

    Perform an element wise mathematical operation (such as add, subtract, multiply) on NDArray columns. The existing columns are unchanged, a new NDArray column is added

    • param newColumnName Name of the new NDArray column

    • param mathOp Operation to perform

    • param columnNames Name of the columns used as input to the operation

    ndArrayMathFunctionTransform

    Apply an element wise mathematical function (sin, tanh, abs etc) to an NDArray column. This operation is performed in place.

    • param columnName Name of the column to perform the operation on

    • param mathFunction Mathematical function to apply

    ndArrayDistanceTransform

    Calculate a distance (cosine similarity, Euclidean, Manhattan) on two equal-sized NDArray columns. This operation adds a new Double column (with the specified name) with the result.

    • param newColumnName Name of the new column (result) to add

    • param distance Distance to apply

    • param firstCol first column to use in the distance calculation

    • param secondCol second column to use in the distance calculation

    firstDigitTransform

    FirstDigitTransform converts a column to a categorical column, with values being the first digit of the number. For example, “3.1415” becomes “3” and “2.0” becomes “2”. Negative numbers ignore the sign: “-7.123” becomes “7”. Note that two FirstDigitTransform.Mode values are supported, which determine how non-numerical entries should be handled: EXCEPTION_ON_INVALID: output has 10 category values (“0”, …, “9”), and any non-numerical values result in an exception. INCLUDE_OTHER_CATEGORY: output has 11 category values (“0”, …, “9”, “Other”), and all non-numerical values are mapped to “Other”.

    FirstDigitTransform is useful (combined with CategoricalToOneHotTransform and Reductions) to implement Benford’s law analysis.

    • param inputColumn Input column name

    • param outputColumn Output column name. If same as input, input column is replaced

    firstDigitTransform

    FirstDigitTransform converts a column to a categorical column, with values being the first digit of the number. For example, “3.1415” becomes “3” and “2.0” becomes “2”. Negative numbers ignore the sign: “-7.123” becomes “7”. Note that two FirstDigitTransform.Mode values are supported, which determine how non-numerical entries should be handled: EXCEPTION_ON_INVALID: output has 10 category values (“0”, …, “9”), and any non-numerical values result in an exception. INCLUDE_OTHER_CATEGORY: output has 11 category values (“0”, …, “9”, “Other”), and all non-numerical values are mapped to “Other”.

    FirstDigitTransform is useful (combined with CategoricalToOneHotTransform and Reductions) to implement Benford’s law analysis.

    • param inputColumn Input column name

    • param outputColumn Output column name. If same as input, input column is replaced

    • param mode See FirstDigitTransform.Mode

    build

    Create the TransformProcess object

    CategoricalToIntegerTransform

    Convert a categorical column to an integer column, where each integer represents one of the categorical states.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input

    • return the output column names

    CategoricalToOneHotTransform

    Convert a categorical column to a set of one-hot columns, one per categorical state.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input

    • return the output column names

    IntegerToCategoricalTransform

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    PivotTransform

    Pivot transform operates on two columns:

    • a categorical column that operates as a key, and

    • another column that contains a value

    Essentially, the Pivot transform takes key/value pairs and breaks them out into separate columns.

    For example, with schema [col0, key, value, col3] and values with key in {a,b,c}, the output schema is [col0, key[a], key[b], key[c], col3], and input (col0Val, b, x, col3Val) gets mapped to (col0Val, 0, x, 0, col3Val).

    When expanding columns, a default value is used - for example 0 for numerical columns.

    transform

    • param keyColumnName Key column to expand

    • param valueColumnName Name of the column that contains the value

    StringToCategoricalTransform

    Convert a String column to a categorical column

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    AddConstantColumnTransform

    Add a new column, where the values in that column for all records are identical (according to the specified value)

    DuplicateColumnsTransform

    Duplicate one or more columns. The duplicated columns are placed immediately after the original columns

    transform

    • param columnsToDuplicate List of columns to duplicate

    • param newColumnNames List of names for the new (duplicate) columns

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    RemoveAllColumnsExceptForTransform

    Transform that removes all columns except for those that are explicitly specified as ones to keep. To specify only the columns to remove, use RemoveColumnsTransform instead.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    RemoveColumnsTransform

    Remove the specified columns from the data. To specify only the columns to keep, use RemoveAllColumnsExceptForTransform instead.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    RenameColumnsTransform

    Rename one or more columns

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    ReorderColumnsTransform

    Rearrange the order of the columns. Note: A partial list of columns can be used here. Any columns that are not explicitly mentioned will be placed after those that are in the output, without changing their relative order.

    transform

    • param newOrder A partial or complete order of the columns in the output

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    ConditionalCopyValueTransform

    Replace the value in a specified column with a new value taken from another column, if a condition is satisfied/true. Note that the condition can be any generic condition, including conditions on other column(s) different from the column that will be modified if the condition is satisfied/true.

    Note: For sequences, this transform uses the convention that each step in the sequence is passed to the condition, and replaced (or not) separately (i.e., Condition.condition(List) is used on each time step individually).

    transform

    • param columnToReplace Name of the column in which to replace the old value

    • param sourceColumn Name of the column to get the new value from

    • param condition Condition

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    ConditionalReplaceValueTransform

    Replace the value in a specified column with a new value, if a condition is satisfied/true. Note that the condition can be any generic condition, including conditions on other column(s) different from the column that will be modified if the condition is satisfied/true.

    Note: For sequences, this transform uses the convention that each step in the sequence is passed to the condition, and replaced (or not) separately (i.e., Condition.condition(List) is used on each time step individually).

    transform

    • param columnToReplace Name of the column in which to replace the old value with ‘newValue’, if the condition holds

    • param newValue New value to use

    • param condition Condition

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    ConditionalReplaceValueTransformWithDefault

    Replace the value in a specified column with a ‘yes’ value, if a condition is satisfied/true; replace the value of this same column with a ‘no’ value otherwise. Note that the condition can be any generic condition, including conditions on other column(s) different from the column that will be modified if the condition is satisfied/true.

    Note: For sequences, this transform uses the convention that each step in the sequence is passed to the condition, and replaced (or not) separately (i.e., Condition.condition(List) is used on each time step individually).

    ConvertToDouble

    Convert any value to a Double.

    map

    • param column Name of the column to convert to a Double column

    DoubleColumnsMathOpTransform

    Add a new double column, calculated from one or more other columns. A new column (with the specified name) is added as the final column of the output. No other columns are modified. For example, if newColumnName==”newCol”, mathOp==Add, and columns=={“col1”,”col2”}, then the output column with name “newCol” has value col1+col2.
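
    A minimal sketch using the corresponding TransformProcess builder method (the schema and column names are hypothetical):

    import org.datavec.api.transform.MathOp;
    import org.datavec.api.transform.TransformProcess;
    import org.datavec.api.transform.schema.Schema;

    Schema schema = new Schema.Builder()
        .addColumnDouble("col1")                          //hypothetical input columns
        .addColumnDouble("col2")
        .build();

    TransformProcess tp = new TransformProcess.Builder(schema)
        //Appends a new double column "newCol" with value col1 + col2
        .doubleColumnsMathOp("newCol", MathOp.Add, "col1", "col2")
        .build();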

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    DoubleMathFunctionTransform

    A simple transform to do common mathematical operations, such as sin(x), ceil(x), etc.

    DoubleMathOpTransform

    Double mathematical operation. This is an in-place operation of the double column value and a double scalar.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    Log2Normalizer

    Normalize by taking scale * log2((in - columnMin)/(mean - columnMin) + 1). Maps values in the range (columnMin to infinity) to (0 to infinity). Most suitable for values with a geometric/negative exponential type distribution.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    MinMaxNormalizer

    Normalizer to map (min to max) -> (newMin to newMax) linearly.

    Mathematically: out = (x - min) * (newMax - newMin) / (max - min) + newMin
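
    Min-max normalization is typically applied through the TransformProcess builder’s normalize method, using the statistics from a previously computed DataAnalysis. A minimal sketch, assuming such an analysis is available (the Normalize enum’s package path is assumed):

    import org.datavec.api.transform.TransformProcess;
    import org.datavec.api.transform.analysis.DataAnalysis;
    import org.datavec.api.transform.schema.Schema;
    import org.datavec.api.transform.transform.normalize.Normalize;   //package path assumed

    Schema schema = new Schema.Builder()
        .addColumnDouble("feature")                       //hypothetical column
        .build();

    DataAnalysis dataAnalysis = null;                     //placeholder: obtain via AnalyzeLocal or AnalyzeSpark

    TransformProcess tp = new TransformProcess.Builder(schema)
        //Maps "feature" from its observed (min, max) range to the new range linearly
        .normalize("feature", Normalize.MinMax, dataAnalysis)
        .build();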

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    StandardizeNormalizer

    Normalize using (x-mean)/stdev. Also known as a standard score, standardization etc.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    SubtractMeanNormalizer

    Normalize by subtracting the mean

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    ConvertToInteger

    Convert any value to an Integer.

    map

    • param column Name of the column to convert to an integer

    IntegerColumnsMathOpTransform

    Add a new integer column, calculated from one or more other columns. A new column (with the specified name) is added as the final column of the output. No other columns are modified. For example, if newColumnName==”newCol”, mathOp==MathOp.Add, and columns=={“col1”,”col2”}, then the output column with name “newCol” has value col1+col2. NOTE: Division here uses integer division; use DoubleColumnsMathOpTransform if a decimal output value is required.

    toString

    • param newColumnName Name of the new column (output column)

    • param mathOp Mathematical operation. Only Add/Subtract/Multiply/Divide/Modulus are allowed here

    • param columns Columns to use in the mathematical operation

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    IntegerMathOpTransform

    Integer mathematical operation. This is an in-place operation of the integer column value and an integer scalar.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    IntegerToOneHotTransform

    Convert an integer column to a set of one-hot columns.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    ReplaceEmptyIntegerWithValueTransform

    Replace an empty/missing integer with a certain value.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    ReplaceInvalidWithIntegerTransform

    Replace an invalid (non-integer) value in a column with a specified integer

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    LongColumnsMathOpTransform

    Add a new long column, calculated from one or more other columns. A new column (with the specified name) is added as the final column of the output. No other columns are modified. For example, if newColumnName==”newCol”, mathOp==MathOp.Add, and columns=={“col1”,”col2”}, then the output column with name “newCol” has value col1+col2. NOTE: Division here uses long (integer) division; use DoubleColumnsMathOpTransform if a decimal output value is required.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    LongMathOpTransform

    Long mathematical operation. This is an in-place operation of the long column value and a long scalar.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    TextToCharacterIndexTransform

    Convert each text value in a sequence to a longer sequence of integer indices. For example, “abc” would be converted to [1, 2, 3]. Values in other columns will be duplicated.

    TextToTermIndexSequenceTransform

    Convert each text value in a sequence to a longer sequence of integer indices. For example, “zero one two” would be converted to [0, 1, 2]. Values in other columns will be duplicated.

    SequenceDifferenceTransform

    SequenceDifferenceTransform: for an input sequence, calculate the difference on one column. For each time t, calculate someColumn(t) - someColumn(t-s), where s >= 1 is the ‘lookback’ period.

    Note: at t=0 (i.e., the first step in a sequence; or more generally, for all times t < s), there is no previous value to subtract. Two modes are available for handling these time steps:

    1. Default: output = someColumn(t) - someColumn(max(t-s, 0))

    2. SpecifiedValue: output = someColumn(t) - someColumn(t-s) if t-s >= 0, or a custom Writable object (for example, a DoubleWritable(0) or NullWritable).

    Note: this is an in-place operation: i.e., the values in each column are modified. If the original values are required, first duplicate the column and apply the difference operation in-place on the copy.

    outputColumnName

    Create a SequenceDifferenceTransform with default lookback of 1, and using FirstStepMode.Default. Output column name is the same as the input column name.

    • param columnName Name of the column to perform the operation on.

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    SequenceMovingWindowReduceTransform

    SequenceMovingWindowReduceTransform adds a new column, where the value is derived by (a) taking a window of the last N values in a single column, and (b) applying a reduction op on the window to calculate a new value. For example, this transform can be used to implement a simple moving average of the last N values, or to determine the minimum or maximum value in the last N time steps.
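
    A minimal sketch of a 5-step moving average using the corresponding builder method (the schema and column name are hypothetical):

    import org.datavec.api.transform.ReduceOp;
    import org.datavec.api.transform.TransformProcess;
    import org.datavec.api.transform.schema.Schema;

    Schema schema = new Schema.Builder()
        .addColumnDouble("reading")                       //hypothetical column
        .build();

    TransformProcess tp = new TransformProcess.Builder(schema)
        .convertToSequence()                              //treat the records as a sequence
        //Adds a new column holding the mean of the last 5 values of "reading"
        .sequenceMovingWindowReduce("reading", 5, ReduceOp.Mean)
        .build();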

    defaultOutputColumnName

    Enumeration to specify how edge cases are handled. For example, for a lookback period of 20, how should the first 19 output values be calculated? Default: perform the reduction as normal, with as many values as are available. SpecifiedValue: use the given/specified value instead of the actual output value; for example, you could assign values of 0 or NullWritable to positions 0 through 18 of the output.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    SequenceOffsetTransform

    Sequence offset transform takes a sequence and shifts the values in one or more columns by a specified number of time steps. It has 2 modes of operation (OperationType enum) with respect to the columns it operates on: InPlace (operations are performed in-place, modifying the values in the specified columns) and NewColumn (operations produce new columns, with the original (source) columns remaining unmodified).

    Additionally, there are 2 modes for handling values outside the original sequence (EdgeHandling enum): TrimSequence (the entire sequence is trimmed at the start or end by the specified number of steps) and SpecifiedValue (any values outside of the original sequence are given a specified value).

    Note 1: When specifying offsets, they are interpreted as follows: positive offsets move the values in the specified columns to a later time. Earlier time steps are either trimmed or given the specified value; the last values in these columns will be truncated/removed.

    Note 2: Care must be taken when using TrimSequence: for example, if we chain multiple sequence offset transforms on the one dataset, we may end up trimming much more than we want. In this case, it may be better to use SpecifiedValue, at the end.
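
    A minimal sketch of offsetting a single column into a new column via the builder’s offsetSequence method (the SequenceOffsetTransform import path and the column name are assumptions):

    import java.util.Collections;

    import org.datavec.api.transform.TransformProcess;
    import org.datavec.api.transform.schema.Schema;
    import org.datavec.api.transform.sequence.SequenceOffsetTransform;   //package path assumed

    Schema schema = new Schema.Builder()
        .addColumnDouble("sensorValue")                   //hypothetical column
        .build();

    TransformProcess tp = new TransformProcess.Builder(schema)
        .convertToSequence()
        //Shift "sensorValue" one step later in time, writing the result to a new column
        .offsetSequence(Collections.singletonList("sensorValue"), 1,
                SequenceOffsetTransform.OperationType.NewColumn)
        .build();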

    AppendStringColumnTransform

    Append a String to the values in a single column

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    ChangeCaseStringTransform

    Change the case (e.g., to all lower case) of a String column.

    ConcatenateStringColumns

    Concatenate the values of one or more String columns into a new String column. The constituent String columns are retained, so the user must remove them manually if desired.

    TODO: use new String Reduce functionality in DataVec?

    transform

    • param columnsToConcatenate Names of the String columns to concatenate

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    ConvertToString

    Convert any value to a string.

    map

    Transform the writable into a string

    • param writable the writable to transform

    • return the string form of this writable

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    MapAllStringsExceptListTransform

    This method maps all String values, except those in the specified list, to a single String value.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    RemoveWhiteSpaceTransform

    String transform that removes all whitespace characters

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    ReplaceEmptyStringTransform

    Replace empty String values with the specified String

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    ReplaceStringTransform

    Replaces String values that match regular expressions (see the usage sketch after the parameter list below).

    map

    Constructs a new ReplaceStringTransform using the specified map of regular expressions to replacement values.

    • param columnName Name of the column

    • param map Key: regular expression; Value: replacement value
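
    A minimal sketch using the builder’s replaceStringTransform method; the column name is hypothetical, and the map entries mirror the example table shown later on this page (note that backslashes must be escaped in Java string literals):

    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.datavec.api.transform.TransformProcess;
    import org.datavec.api.transform.schema.Schema;

    Schema schema = new Schema.Builder()
        .addColumnString("text")                          //hypothetical column
        .build();

    Map<String, String> replacements = new LinkedHashMap<>();
    replacements.put("_", "");                            //"Data_Vec" -> "DataVec"
    replacements.put("^\\s+|\\s+$", "");                  //"  4.25 "  -> "4.25" (strip leading/trailing whitespace)

    TransformProcess tp = new TransformProcess.Builder(schema)
        .replaceStringTransform("text", replacements)
        .build();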

    StringListToCategoricalSetTransform

    Convert a delimited String to a list of binary categorical columns. Suppose the possible String values were {“a”,”b”,”c”,”d”} and the String column value to be converted contained the String “a,c”; then the 4 output columns would have values [“true”,”false”,”true”,”false”] (see the usage sketch after the parameter list below).

    transform

    • param columnName The name of the column to convert

    • param newColumnNames The names of the new columns to create

    • param categoryTokens The possible tokens that may be present. Note this list must have the same length and order as the newColumnNames list

    • param delimiter The delimiter for the Strings to convert
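
    A minimal sketch matching the example above; the import path and the column/category names are assumptions:

    import java.util.Arrays;

    import org.datavec.api.transform.TransformProcess;
    import org.datavec.api.transform.schema.Schema;
    import org.datavec.api.transform.transform.string.StringListToCategoricalSetTransform;   //package path assumed

    Schema schema = new Schema.Builder()
        .addColumnString("tags")                          //hypothetical column holding values such as "a,c"
        .build();

    TransformProcess tp = new TransformProcess.Builder(schema)
        //"a,c" -> [true, false, true, false] across the four new columns
        .transform(new StringListToCategoricalSetTransform(
                "tags",
                Arrays.asList("hasA", "hasB", "hasC", "hasD"),
                Arrays.asList("a", "b", "c", "d"),
                ","))
        .build();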

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    StringListToCountsNDArrayTransform

    Converts a String column into a bag-of-words (BOW) representation: an NDArray of “counts.” Note that the original column is removed in the process.

    transform

    • param columnName The name of the column to convert

    • param vocabulary The possible tokens that may be present.

    • param delimiter The delimiter for the Strings to convert

    • param ignoreUnknown Whether to ignore unknown tokens

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    StringListToIndicesNDArrayTransform

    Converts a String column into a sparse bag-of-words (BOW) represented as an NDArray of indices. Appropriate for embeddings, or as efficient storage before being expanded into a dense array.

    StringMapTransform

    A simple String -> String map function.

    Keys in the map are the original values; the values in the map are their replacements. If a String appears in the data but does not appear in the provided map (as a key), that String value will not be modified.

    map

    • param columnName Name of the column

    • param map Key: From. Value: To

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    DeriveColumnsFromTimeTransform

    Create a number of new columns by deriving their values from a Time column. Can be used for example to create new columns with the year, month, day, hour, minute, second etc.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    toString

    The output column name after the operation has been applied

    • return the output column name

    StringToTimeTransform

    Convert a String column to a time column by parsing the date/time String using Joda-Time.

    Time format is specified as per the Joda-Time DateTimeFormat documentation: http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html

    getNewColumnMetaData

    Instantiate this without a time format specified. If this constructor is used, the transform will be allowed to handle several common formats, as defined in the static formats array.

    • param columnName Name of the String column

    • param timeZone Timezone for time parsing

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    TimeMathOpTransform

    Transform math op on a time column

    Note: only the following MathOps are supported: Add, Subtract, ScalarMin, ScalarMax For ScalarMin/Max, the TimeUnit must be milliseconds - i.e., value must be in epoch millisecond format
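
    A minimal sketch using the builder’s timeMathOp method to shift a time column forward by one hour (the column name is hypothetical):

    import java.util.concurrent.TimeUnit;

    import org.datavec.api.transform.MathOp;
    import org.datavec.api.transform.TransformProcess;
    import org.datavec.api.transform.schema.Schema;
    import org.joda.time.DateTimeZone;

    Schema schema = new Schema.Builder()
        .addColumnTime("timestamp", DateTimeZone.UTC)     //hypothetical time column (epoch milliseconds)
        .build();

    TransformProcess tp = new TransformProcess.Builder(schema)
        //Adds 1 hour to every value in "timestamp"
        .timeMathOp("timestamp", MathOp.Add, 1, TimeUnit.HOURS)
        .build();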

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    Example replacements for ReplaceStringTransform (Original / Regex / Replacement / Result):

    • Original “Data_Vec”, regex “_”, replacement “” → result “DataVec”

    • Original “B1C2T3”, regex “\d”, replacement “one” → result “BoneConeTone”

    • Original “'  4.25 '”, regex “^\s+|\s+$”, replacement “” → result “'4.25'”


    import java.util.Arrays;
    import java.util.HashSet;

    import org.datavec.api.transform.TransformProcess;
    import org.datavec.api.transform.condition.ConditionOp;
    import org.datavec.api.transform.condition.column.CategoricalColumnCondition;
    import org.datavec.api.transform.condition.column.DoubleColumnCondition;
    import org.datavec.api.transform.filter.ConditionFilter;
    import org.datavec.api.transform.transform.time.DeriveColumnsFromTimeTransform;
    import org.datavec.api.writable.DoubleWritable;
    import org.joda.time.DateTimeFieldType;
    import org.joda.time.DateTimeZone;

    //inputDataSchema is the Schema describing the raw input data (defined elsewhere)
    TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
        .removeColumns("CustomerID","MerchantID")
        .filter(new ConditionFilter(new CategoricalColumnCondition("MerchantCountryCode", ConditionOp.NotInSet, new HashSet<>(Arrays.asList("USA","CAN")))))
        .conditionalReplaceValueTransform(
            "TransactionAmountUSD",     //Column to operate on
            new DoubleWritable(0.0),    //New value to use, when the condition is satisfied
            new DoubleColumnCondition("TransactionAmountUSD",ConditionOp.LessThan, 0.0)) //Condition: amount < 0.0
        .stringToTimeTransform("DateTimeString","YYYY-MM-DD HH:mm:ss.SSS", DateTimeZone.UTC)
        .renameColumn("DateTimeString", "DateTime")
        .transform(new DeriveColumnsFromTimeTransform.Builder("DateTime").addIntegerDerivedColumn("HourOfDay", DateTimeFieldType.hourOfDay()).build())
        .removeColumns("DateTime")
        .build();
    import java.util.List;

    import org.datavec.api.writable.Writable;
    import org.datavec.local.transforms.LocalTransformExecutor;

    //originalData is a List<List<Writable>> of parsed input records (defined elsewhere)
    List<List<Writable>> processedData = LocalTransformExecutor.execute(originalData, tp);

    //Print the schema after each step of the TransformProcess:
    int numActions = tp.getActionList().size();

    for (int i = 0; i < numActions; i++) {
        System.out.println("\n\n==================================================");
        System.out.println("-- Schema after step " + i + " (" + tp.getActionList().get(i) + ") --");

        System.out.println(tp.getSchemaAfterStep(i));
    }
    public Schema getFinalSchema()
    public Schema getSchemaAfterStep(int step)
    public String toJson()
    public String toYaml()
    public static TransformProcess fromJson(String json)
    public static TransformProcess fromYaml(String yaml)
    public Builder transform(Transform transform)
    public Builder filter(Filter filter)
    public Builder filter(Condition condition)
    public Builder removeColumns(String... columnNames)
    public Builder removeColumns(Collection<String> columnNames)
    public Builder removeAllColumnsExceptFor(String... columnNames)
    public Builder removeAllColumnsExceptFor(Collection<String> columnNames)
    public Builder renameColumn(String oldName, String newName)
    public Builder renameColumns(List<String> oldNames, List<String> newNames)
    public Builder reorderColumns(String... newOrder)
    public Builder duplicateColumn(String column, String newName)
    public Builder duplicateColumns(List<String> columnNames, List<String> newNames)
    public Builder integerMathOp(String column, MathOp mathOp, int scalar)
    public Builder integerColumnsMathOp(String newColumnName, MathOp mathOp, String... columnNames)
    public Builder longMathOp(String columnName, MathOp mathOp, long scalar)
    public Builder longColumnsMathOp(String newColumnName, MathOp mathOp, String... columnNames)
    public Builder floatMathOp(String columnName, MathOp mathOp, float scalar)
    public Builder floatColumnsMathOp(String newColumnName, MathOp mathOp, String... columnNames)
    public Builder floatMathFunction(String columnName, MathFunction mathFunction)
    public Builder doubleMathOp(String columnName, MathOp mathOp, double scalar)
    public Builder doubleColumnsMathOp(String newColumnName, MathOp mathOp, String... columnNames)
    public Builder doubleMathFunction(String columnName, MathFunction mathFunction)
    public Builder timeMathOp(String columnName, MathOp mathOp, long timeQuantity, TimeUnit timeUnit)
    public Builder categoricalToOneHot(String... columnNames)
    public Builder categoricalToInteger(String... columnNames)
    public Builder integerToCategorical(String columnName, List<String> categoryStateNames)
    public Builder integerToCategorical(String columnName, Map<Integer, String> categoryIndexNameMap)
    public Builder integerToOneHot(String columnName, int minValue, int maxValue)
    public Builder addConstantColumn(String newColumnName, ColumnType newColumnType, Writable fixedValue)
    public Builder addConstantDoubleColumn(String newColumnName, double value)
    public Builder addConstantIntegerColumn(String newColumnName, int value)
    public Builder addConstantLongColumn(String newColumnName, long value)
    public Builder convertToString(String inputColumn)
    public Builder convertToDouble(String inputColumn)
    public Builder convertToInteger(String inputColumn)
    public Builder normalize(String column, Normalize type, DataAnalysis da)
    public Builder convertToSequence(String keyColumn, SequenceComparator comparator)
    public Builder convertToSequence()
    public Builder convertToSequence(List<String> keyColumns, SequenceComparator comparator)
    public Builder convertFromSequence()
    public Builder splitSequence(SequenceSplit split)
    public Builder trimSequence(int numStepsToTrim, boolean trimFromStart)
    public Builder offsetSequence(List<String> columnsToOffset, int offsetAmount, SequenceOffsetTransform.OperationType operationType)
    public Builder reduce(IAssociativeReducer reducer)
    public Builder reduceSequence(IAssociativeReducer reducer)
    public Builder reduceSequenceByWindow(IAssociativeReducer reducer, WindowFunction windowFunction)
    public Builder sequenceMovingWindowReduce(String columnName, int lookback, ReduceOp op)
    public Builder calculateSortedRank(String newColumnName, String sortOnColumn, WritableComparator comparator)
    public Builder calculateSortedRank(String newColumnName, String sortOnColumn, WritableComparator comparator, boolean ascending)
    public Builder stringToCategorical(String columnName, List<String> stateNames)
    public Builder stringRemoveWhitespaceTransform(String columnName)
    public Builder stringMapTransform(String columnName, Map<String, String> mapping)
    public Builder stringToTimeTransform(String column, String format, DateTimeZone dateTimeZone)
    public Builder stringToTimeTransform(String column, String format, DateTimeZone dateTimeZone, Locale locale)
    public Builder appendStringColumnTransform(String column, String toAppend)
    public Builder conditionalReplaceValueTransform(String column, Writable newValue, Condition condition)
    public Builder conditionalReplaceValueTransformWithDefault(String column, Writable yesVal, Writable noVal, Condition condition)
    public Builder conditionalCopyValueTransform(String columnToReplace, String sourceColumn, Condition condition)
    public Builder replaceStringTransform(String columnName, Map<String, String> mapping)
    public Builder ndArrayScalarOpTransform(String columnName, MathOp op, double value)
    public Builder ndArrayColumnsMathOpTransform(String newColumnName, MathOp mathOp, String... columnNames)
    public Builder ndArrayMathFunctionTransform(String columnName, MathFunction mathFunction)
    public Builder ndArrayDistanceTransform(String newColumnName, Distance distance, String firstCol, String secondCol)
    public Builder firstDigitTransform(String inputColumn, String outputColumn)
    public Builder firstDigitTransform(String inputColumn, String outputColumn, FirstDigitTransform.Mode mode)
    public TransformProcess build()
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public Schema transform(Schema inputSchema)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public Schema transform(Schema inputSchema)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Schema transform(Schema inputSchema)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Schema transform(Schema inputSchema)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Schema transform(Schema inputSchema)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public DoubleWritable map(Writable writable)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public Object map(Object input)
    public Object map(Object input)
    public Object map(Object input)
    public Object map(Object input)
    public IntWritable map(Writable writable)
    public String toString()
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public Object map(Object input)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Object map(Object input)
    public Object map(Object input)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public Object map(Object input)
    public String outputColumnName()
    public String columnName()
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public static String defaultOutputColumnName(String originalName, int lookback, ReduceOp op)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Object map(Object input)
    public Schema transform(Schema inputSchema)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Text map(Writable writable)
    public Object map(Object input)
    public Object map(Object input)
    public Object map(Object input)
    public Object map(Object input)
    public Text map(final Writable writable)
    public Schema transform(Schema inputSchema)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Schema transform(Schema inputSchema)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Text map(Writable writable)
    public Object map(Object input)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String toString()
    public ColumnMetaData getNewColumnMetaData(String newName, ColumnMetaData oldColumnType)
    public Object map(Object input)
    public Object map(Object input)