Analysis

Gather statistics on datasets.

Analysis of data

Sometimes datasets are too large or too abstract in their format to manually analyze and estimate statistics on certain columns or patterns. DataVec comes with helper utilities for performing a data analysis and computing maximums, means, minimums, and other useful metrics.

Using Spark for analysis

If you have loaded your data into Apache Spark, DataVec has a special AnalyzeSpark class which can generate histograms, collect statistics, and return information about the quality of the data. Assuming you have already loaded your data into a Spark RDD, pass the JavaRDD and Schema to the class.

If you are using DataVec in Scala and your data was loaded into a regular RDD class, you can convert it by calling .toJavaRDD() which returns a JavaRDD. If you need to convert it back, call rdd().

The code below demonstrates some of the many analyses available for a 2D dataset in Spark, using the RDD javaRdd and the schema mySchema:

Note that if you have sequence data, there are special methods for that as well:

Analyzing locally

The AnalyzeLocal class works very similarly to its Spark counterpart and has a similar API. Instead of passing an RDD, it accepts a RecordReader which allows it to iterate over the dataset.

Utilities

AnalyzeLocal

Analyse the specified data - returns a DataAnalysis object with summary information about each column

analyze

Analyse the specified data - returns a DataAnalysis object with summary information about each column

  • param schema Schema for data

  • param rr Data to analyze

  • return DataAnalysis for data

analyzeQualitySequence

Analyze the data quality of sequence data - provides a report on missing values, values that don’t comply with schema, etc

  • param schema Schema for data

  • param data Data to analyze

  • return DataQualityAnalysis object

analyzeQuality

Analyze the data quality of data - provides a report on missing values, values that don’t comply with schema, etc

  • param schema Schema for data

  • param data Data to analyze

  • return DataQualityAnalysis object

AnalyzeSpark

AnalyzeSpark: static methods for analyzing and sampling data held in Spark RDDs.

analyzeSequence

  • param schema Schema for data

  • param data Sequence data to analyze

  • param maxHistogramBuckets Maximum number of histogram buckets to use

  • return SequenceDataAnalysis for the data

analyze

Analyse the specified data - returns a DataAnalysis object with summary information about each column

  • param schema Schema for data

  • param data Data to analyze

  • return DataAnalysis for data

analyzeQualitySequence

Analyze the data quality of sequence data - provides a report on missing values, values that don’t comply with schema, etc

  • param schema Schema for data

  • param data Data to analyze

  • return DataQualityAnalysis object

analyzeQuality

Analyze the data quality of data - provides a report on missing values, values that don’t comply with schema, etc

  • param schema Schema for data

  • param data Data to analyze

  • return DataQualityAnalysis object

min

Get the minimum value for the specified column

  • param allData All data

  • param columnName Name of the column to get the minimum value for

  • param schema Schema of the data

  • return Minimum value for the column

max

Get the maximum value for the specified column

  • param allData All data

  • param columnName Name of the column to get the maximum value for

  • param schema Schema of the data

  • return Maximum value for the column

import org.datavec.spark.transform.AnalyzeSpark;
import org.datavec.api.writable.Writable;
import org.datavec.api.transform.analysis.*;

int maxHistogramBuckets = 10;
DataAnalysis analysis = AnalyzeSpark.analyze(mySchema, javaRdd, maxHistogramBuckets);

DataQualityAnalysis qualityAnalysis = AnalyzeSpark.analyzeQuality(mySchema, javaRdd);

Writable max = AnalyzeSpark.max(javaRdd, "myColumn", mySchema);

int numSamples = 5;
List<Writable> sample = AnalyzeSpark.sampleFromColumn(numSamples, "myColumn", mySchema, javaRdd);

// Sequence data has its own analysis methods:
SequenceDataAnalysis seqAnalysis = AnalyzeSpark.analyzeSequence(mySchema, sequenceRdd);
List<Writable> uniqueSequence = AnalyzeSpark.getUniqueSequence("myColumn", seqSchema, sequenceRdd);

// Local (non-Spark) analysis using a RecordReader:
import org.datavec.local.transforms.AnalyzeLocal;

int maxHistogramBuckets = 10;
DataAnalysis analysis = AnalyzeLocal.analyze(mySchema, csvRecordReader, maxHistogramBuckets);
public static DataAnalysis analyze(Schema schema, RecordReader rr, int maxHistogramBuckets)
public static DataQualityAnalysis analyzeQualitySequence(Schema schema, SequenceRecordReader data)
public static DataQualityAnalysis analyzeQuality(final Schema schema, final RecordReader data)
public static SequenceDataAnalysis analyzeSequence(Schema schema, JavaRDD<List<List<Writable>>> data,
                    int maxHistogramBuckets)
public static DataAnalysis analyze(Schema schema, JavaRDD<List<Writable>> data)
public static DataQualityAnalysis analyzeQualitySequence(Schema schema, JavaRDD<List<List<Writable>>> data)
public static DataQualityAnalysis analyzeQuality(final Schema schema, final JavaRDD<List<Writable>> data)
public static Writable min(JavaRDD<List<Writable>> allData, String columnName, Schema schema)
public static Writable max(JavaRDD<List<Writable>> allData, String columnName, Schema schema)

Reference

Serialization

Data wrangling and mapping from one schema to another.

Serializing transforms

DataVec comes with the ability to serialize transforms, which allows them to be more portable when they're needed for production environments. A TransformProcess is serialized to a human-readable format such as JSON and can be saved as a file.

Serialization

The code below shows how you can serialize the transform process tp.

Deserialization

When you want to reinstantiate the transform process, call the static from<format> method.

Available serializers

JsonSerializer

Serializer used for converting objects (Transforms, Conditions, etc) to JSON format

YamlSerializer

Serializer used for converting objects (Transforms, Conditions, etc) to YAML format

String serializedTransformString = tp.toJson();
TransformProcess tp = TransformProcess.fromJson(serializedTransformString);
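
If you want YAML instead, the calls should look like the sketch below; this assumes the YAML counterparts toYaml() and fromYaml(), which mirror the JSON methods above:

String serializedTransformYaml = tp.toYaml();
TransformProcess restored = TransformProcess.fromYaml(serializedTransformYaml);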

Records

How to use data records in DataVec.

What is a record?

In the DataVec world a Record represents a single entry in a dataset. DataVec differentiates between record types to make data manipulation easier with its built-in APIs: standard (2D) records and sequence records are distinct types.

Using records

Most of the time you do not need to interact with the record classes directly, unless you are manually iterating records for the purpose of forwarding through a neural network.
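
As a minimal sketch of what that manual iteration looks like, the snippet below reads 2D records one at a time with a CSVRecordReader (the file name is illustrative):

import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.writable.Writable;

CSVRecordReader reader = new CSVRecordReader();
reader.initialize(new FileSplit(new File("dataset.csv")));

while (reader.hasNext()) {
    List<Writable> record = reader.next(); // one 2D record: one Writable per column
}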

Types of records

Record

A standard implementation of the Record interface

SequenceRecord

A standard implementation of the SequenceRecord interface.

Reductions

Available reductions

GeographicMidpointReduction

Given a set of latitude/longitude coordinates in text format (for example "lat,long"; the delimiter is configurable), determine the geographic midpoint. See “geographic midpoint” at: http://www.geomidpoint.com/methods.html For the implementation algorithm, see: http://www.geomidpoint.com/calculation.html

transform

  • param delim Delimiter for the coordinates in text format. For example, if format is “lat,long” use “,”

StringReducer


A StringReducer is used to take a set of examples and reduce them. The idea: suppose you have a large number of columns, and you want to combine/reduce the values in each column. StringReducer allows you to specify different reductions for different columns: min, max, sum, mean, etc.

Uses are: (1) Reducing examples by a key (2) Reduction operations in time series (windowing ops, etc)

transform

Get the output schema, given the input schema

outputColumnName

Create a StringReducer builder, and set the default column reduction operation. For any columns that aren’t specified explicitly, they will use the default reduction operation. If a column does have a reduction operation explicitly specified, then it will override the default specified here.

  • param defaultOp Default reduction operation to perform

appendColumns

Reduce the specified columns by appending the values

prependColumns

Reduce the specified columns by prepending the values

mergeColumns

Reduce the specified columns by merging the values

replaceColumn

Reduce the specified columns by replacing the values

customReduction

Reduce the specified column using a custom column reduction functionality.

  • param column Column to execute the custom reduction functionality on

  • param columnReduction Column reduction to execute on that column

setIgnoreInvalid

When doing the reduction: set the specified columns to ignore any invalid values. Invalid: defined as being not valid according to the ColumnMetaData: {- link ColumnMetaData#isValid(Writable)}. For numerical columns, this typically means being unable to parse the Writable. For example, Writable.toLong() failing for a Long column. If the column has any restrictions (min/max values, regex for Strings etc) these will also be taken into account.

  • param columns Columns to set ‘ignore invalid’ for

public Schema transform(Schema inputSchema)
public Schema transform(Schema schema)
public Builder outputColumnName(String outputColumnName)
public Builder appendColumns(String... columns)
public Builder prependColumns(String... columns)
public Builder mergeColumns(String... columns)
public Builder replaceColumn(String... columns)
public Builder customReduction(String column, ColumnReduction columnReduction)
public Builder setIgnoreInvalid(String... columns)

Visualization

Utilities

HtmlAnalysis


createHtmlAnalysisString

Render a data analysis object as an HTML file. This will produce a summary table, along with charts for numerical columns. The contents of the HTML file are returned as a String, which should be written to a .html file.

  • param analysis Data analysis object to render

  • see #createHtmlAnalysisFile(DataAnalysis, File)

createHtmlAnalysisFile

Render a data analysis object as an HTML file. This will produce a summary table, along with charts for numerical columns

  • param dataAnalysis Data analysis object to render

  • param output Output file (should have extension .html)
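
A rough usage sketch, using the analysis object produced by AnalyzeLocal or AnalyzeSpark above (the import path and the output file name are assumptions for illustration):

import org.datavec.api.transform.ui.HtmlAnalysis;

String html = HtmlAnalysis.createHtmlAnalysisString(analysis);             // HTML returned as a String
HtmlAnalysis.createHtmlAnalysisFile(analysis, new File("analysis.html"));  // written directly to a file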

HtmlSequencePlotting

A simple utility for plotting DataVec sequence data to HTML files. Each file contains only one sequence. Each column is plotted separately; only numerical and categorical columns are plotted.

createHtmlSequencePlots

Create a HTML file with plots for the given sequence.

  • param title Title of the page

  • param schema Schema for the data

  • param sequence Sequence to plot

  • return HTML file as a string

createHtmlSequencePlotFile

Create a HTML file with plots for the given sequence and write it to a file.

  • param title Title of the page

  • param schema Schema for the data

  • param sequence Sequence to plot

public static String createHtmlAnalysisString(DataAnalysis analysis) throws Exception
public static void createHtmlAnalysisFile(DataAnalysis dataAnalysis, File output) throws Exception
public static String createHtmlSequencePlots(String title, Schema schema, List<List<Writable>> sequence)
                    throws Exception
public static void createHtmlSequencePlotFile(String title, Schema schema, List<List<Writable>> sequence,
                    File output) throws Exception

Operations

Implementations for advanced transformation.

Usage

Operations, such as a Function, help execute transforms and load data into DataVec. The concept of operations is low-level, meaning that most of the time you will not need to worry about them.

Loading data into Spark

If you're using Apache Spark, functions will iterate over the dataset and load it into a Spark RDD and convert the raw data format into a Writable.
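
For example, the sketch below (which mirrors the reference listing later on this page) uses a CSVRecordReader wrapped in a StringToWritablesFunction to parse each line of a CSV file; the file name is illustrative:

import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.writable.Writable;
import org.datavec.spark.transform.misc.StringToWritablesFunction;

SparkConf conf = new SparkConf();
JavaSparkContext sc = new JavaSparkContext(conf);

CSVRecordReader rr = new CSVRecordReader();
JavaRDD<List<Writable>> customerInfo = sc.textFile("CustomerInfo.csv")
        .map(new StringToWritablesFunction(rr)); // each CSV line becomes a List<Writable>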

The above code loads a CSV file into a JavaRDD of records (JavaRDD<List<Writable>>). Once your RDD is loaded, you can transform it, perform joins and use reducers to wrangle the data any way you want.

Available ops

AggregableCheckingOp

Created by huitseeker on 5/8/17.

AggregableMultiOp

It is used to execute many reduction operations in parallel on the same column, datavec#238

Created by huitseeker on 5/8/17.

ByteWritableOp

Wraps an aggregation so that it accepts Writable values; it is expected to work only if the Writable supports a conversion to Byte.

Created by huitseeker on 5/14/17.

DispatchOp

Created by huitseeker on 5/14/17.

DispatchWithConditionOp

Like DispatchOp, but checks a list of conditions before dispatching the appropriate column of this element to its operation.

Created by huitseeker on 5/14/17.

DoubleWritableOp

Wraps an aggregation so that it accepts Writable values; it is expected to work only if the Writable supports a conversion to Double.

Created by huitseeker on 5/14/17.

FloatWritableOp

Wraps an aggregation so that it accepts Writable values; it is expected to work only if the Writable supports a conversion to Float.

Created by huitseeker on 5/14/17.

IntWritableOp

Wraps an aggregation so that it accepts Writable values; it is expected to work only if the Writable supports a conversion to Integer.

Created by huitseeker on 5/14/17.

LongWritableOp

Wraps an aggregation so that it accepts Writable values; it is expected to work only if the Writable supports a conversion to Long.

Created by huitseeker on 5/14/17.

StringWritableOp

Wraps an aggregation so that it accepts Writable values; it is expected to work only if the Writable supports a conversion to TextWritable.

Created by huitseeker on 5/14/17.

CalculateSortedRank

CalculateSortedRank: calculate the rank of each example, after sorting the examples. For example, we might have some numerical “score” column, and we want to know the rank (sort order) of each example according to that column. The rank of each example (after sorting) will be added in a new Long column. Indexing is done from 0; examples will have values 0 to dataSetSize - 1.

Currently, CalculateSortedRank can only be applied on standard (i.e., non-sequence) data. Furthermore, the current implementation can only sort on one column
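
A minimal sketch of adding a rank column within a TransformProcess might look like the following; the column names are illustrative, and DoubleWritableComparator is assumed to be a suitable comparator for a numerical score column:

import org.datavec.api.transform.TransformProcess;
import org.datavec.api.writable.comparator.DoubleWritableComparator;

TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
    .calculateSortedRank("rank", "score", new DoubleWritableComparator()) // adds a "rank" column, sorted by "score"
    .build();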

transform

  • param newColumnName Name of the new column (will contain the rank for each example)

  • param sortOnColumn Name of the column to sort on

  • param comparator Comparator used to sort examples

outputColumnName

The output column name after the operation has been applied

  • return the output column name

columnName

The output column names This will often be the same as the input

  • return the output column names

Filters

Selection of data using conditions.

Using filters

Filters are a part of transforms and give you a DSL for keeping or removing parts of your dataset. Filters can be one-liners for single conditions or include complex boolean logic.

You can also write your own filters by implementing the Filter interface, though more often you will want to create a custom condition instead. A minimal example of applying a filter is sketched below.
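
For instance, the following one-liner (which mirrors the reference listing later on this page; the schema, column name and country codes are illustrative) keeps only the examples whose MerchantCountryCode is in a given set, by removing everything that is not:

TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
    .filter(new ConditionFilter(
            new CategoricalColumnCondition("MerchantCountryCode", ConditionOp.NotInSet,
                    new HashSet<>(Arrays.asList("USA", "CAN")))))
    .build();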

Conditions

BooleanCondition

BooleanCondition: used for creating compound conditions, such as AND(ConditionA, ConditionB, …) As a BooleanCondition is a condition, these can be chained together, like NOT(OR(AND(…),AND(…)))

Available filters

ConditionFilter


If the condition is satisfied (returns true): remove the example or sequence. If the condition is not satisfied (returns false): keep the example or sequence.

removeExample

  • param writables Example

  • return true if example should be removed, false to keep

removeSequence

  • param sequence sequence example

  • return true if example should be removed, false to keep

transform

Get the output schema for this transformation, given an input schema

  • param inputSchema

outputColumnName

The output column name after the operation has been applied

  • return the output column name

columnName

The output column names This will often be the same as the input

  • return the output column names

Filter


Filter: a method of removing examples (or sequences) according to some condition

FilterInvalidValues


FilterInvalidValues: a filter operation that removes any examples (or sequences) that contain invalid values in any of a specified set of columns. Invalid values are determined with respect to the schema

transform

  • param columnsToFilterIfInvalid Columns to check for invalid values

removeExample

  • param writables Example

  • return true if example should be removed, false to keep

removeSequence

  • param sequence sequence example

  • return true if example should be removed, false to keep

outputColumnName

The output column name after the operation has been applied

  • return the output column name

columnName

The output column names This will often be the same as the input

  • return the output column names

InvalidNumColumns


Remove records that do not have the expected number of columns.

removeExample

  • param writables Example

  • return true if example should be removed, false to keep

removeSequence

  • param sequence sequence example

  • return true if example should be removed, false to keep

removeExample

  • param writables Example

  • return true if example should be removed, false to keep

removeSequence

  • param sequence sequence example

  • return true if example should be removed, false to keep

transform

Get the output schema for this transformation, given an input schema

  • param inputSchema

outputColumnName

The output column name after the operation has been applied

  • return the output column name

columnName

The output column names This will often be the same as the input

  • return the output column names

outputColumnName

The output column name after the operation has been applied

  • return the output column name

columnName

The output column names This will often be the same as the input

  • return the output column names

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

conditionSequence

Condition on arbitrary input

  • param sequence the sequence to do a condition on

  • return true if the condition for the sequence is met false otherwise

transform

Get the output schema for this transformation, given an input schema

  • param inputSchema

AND

And of all the given conditions

  • param conditions the conditions to and

  • return a joint and of all these conditions

OR

Or of all the given conditions

  • param conditions the conditions to or

  • return a joint or of all these conditions

NOT

Not of the given condition

  • param condition the condition to negate

  • return the negation of the given condition

XOR

Exclusive or (XOR) of the two given conditions

  • param first the first condition

  • param second the second condition for xor

  • return the xor of these 2 conditions

SequenceConditionMode


For certain single-column conditions: how should we apply these to sequences? And: the condition applies to the sequence only if it applies to ALL time steps. Or: the condition applies to the sequence if it applies to ANY time step. NoSequenceMode: the condition cannot be applied to sequences at all (error condition)

BooleanColumnCondition


Created by agibsonccc on 11/26/16.

columnCondition

Returns whether the given element meets the condition set by this operation

  • param writable the element to test

  • return true if the condition is met false otherwise

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

CategoricalColumnCondition


columnCondition

Constructor for conditions equal or not equal. Uses default sequence condition mode, {- link BaseColumnCondition#DEFAULT_SEQUENCE_CONDITION_MODE}

  • param columnName Column to check for the condition

  • param op Operation (== or != only)

  • param value Value to use in the condition

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

DoubleColumnCondition


columnCondition

Constructor for operations such as less than, equal to, greater than, etc. Uses default sequence condition mode, {- link BaseColumnCondition#DEFAULT_SEQUENCE_CONDITION_MODE}

  • param columnName Column to check for the condition

  • param op Operation (<, >=, !=, etc)

  • param value Value to use in the condition

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

InfiniteColumnCondition


A column condition that simply checks whether a floating point value is infinite

columnCondition

  • param columnName Column check for the condition

IntegerColumnCondition


columnCondition

Constructor for operations such as less than, equal to, greater than, etc. Uses default sequence condition mode, {- link BaseColumnCondition#DEFAULT_SEQUENCE_CONDITION_MODE}

  • param columnName Column to check for the condition

  • param op Operation (<, >=, !=, etc)

  • param value Value to use in the condition

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

InvalidValueColumnCondition


A Condition that applies to a single column. Whenever the specified value is invalid according to the schema, the condition applies.

For example, if a Writable contains String values in an Integer column (and these cannot be parsed to an integer), then the condition would return true, as these values are invalid according to the schema.

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

LongColumnCondition


columnCondition

Constructor for operations such as less than, equal to, greater than, etc. Uses default sequence condition mode, {- link BaseColumnCondition#DEFAULT_SEQUENCE_CONDITION_MODE}

  • param columnName Column to check for the condition

  • param op Operation (<, >=, !=, etc)

  • param value Value to use in the condition

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

NaNColumnCondition


A column condition that simply checks whether a floating point value is NaN

columnCondition

  • param columnName Name of the column to check the condition for

NullWritableColumnCondition


Condition that applies to the values in any column. Specifically, condition is true if the Writable value is a NullWritable, and false for any other value

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

StringColumnCondition


columnCondition

Constructor for conditions equal or not equal Uses default sequence condition mode, {- link BaseColumnCondition#DEFAULT_SEQUENCE_CONDITION_MODE}

  • param columnName Column to check for the condition

  • param op Operation (== or != only)

  • param value Value to use in the condition

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

TimeColumnCondition


Condition that applies to the values of a Time column

columnCondition

Constructor for operations such as less than, equal to, greater than, etc. Uses default sequence condition mode, {- link BaseColumnCondition#DEFAULT_SEQUENCE_CONDITION_MODE}

  • param columnName Column to check for the condition

  • param op Operation (<, >=, !=, etc)

  • param value Time value (in epoch millisecond format) to use in the condition

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

TrivialColumnCondition


Created by huitseeker on 5/17/17.

SequenceLengthCondition


A condition on sequence lengths

StringRegexColumnCondition


Condition that applies to the values in a String column, using a provided regex. The condition returns true if the String matches the regex, or false otherwise. Note: uses Writable.toString(), hence it can potentially be applied to non-String columns

condition

Condition on arbitrary input

  • param input the input to return the condition for

  • return true if the condition is met false otherwise

import org.datavec.api.writable.Writable;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.spark.transform.misc.StringToWritablesFunction;

SparkConf conf = new SparkConf();
JavaSparkContext sc = new JavaSparkContext(conf);

String customerInfoPath = new ClassPathResource("CustomerInfo.csv").getFile().getPath();
CSVRecordReader rr = new CSVRecordReader(); // record reader used to parse each CSV line
JavaRDD<List<Writable>> customerInfo = sc.textFile(customerInfoPath).map(new StringToWritablesFunction(rr));
public Schema transform(Schema inputSchema)
public String outputColumnName()
public String columnName()
TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
    .filter(new ConditionFilter(new CategoricalColumnCondition("MerchantCountryCode", ConditionOp.NotInSet, new HashSet<>(Arrays.asList("USA","CAN")))))
    .build();
public boolean removeExample(Object writables)
public boolean removeSequence(Object sequence)
public Schema transform(Schema inputSchema)
public String outputColumnName()
public String columnName()
public Schema transform(Schema inputSchema)
public boolean removeExample(Object writables)
public boolean removeSequence(Object sequence)
public String outputColumnName()
public String columnName()
public boolean removeExample(Object writables)
public boolean removeSequence(Object sequence)
public boolean removeExample(List<Writable> writables)
public boolean removeSequence(List<List<Writable>> sequence)
public Schema transform(Schema inputSchema)
public String outputColumnName()
public String columnName()
public String outputColumnName()
public String columnName()
public boolean condition(Object input)
public boolean conditionSequence(Object sequence)
public Schema transform(Schema inputSchema)
public static Condition AND(Condition... conditions)
public static Condition OR(Condition... conditions)
public static Condition NOT(Condition condition)
public static Condition XOR(Condition first, Condition second)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean columnCondition(Writable writable)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean condition(Object input)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean condition(Object input)

Executors

Execute ETL and vectorization in a local instance.

Local or remote execution?

Because datasets are commonly large by nature, you can decide on an execution mechanism that best suits your needs. For example, if you are vectorizing a large training dataset, you can process it in a distributed Spark cluster. However, if you need to do real-time inference, DataVec also provides a local executor that doesn't require any additional setup.

Executing a transform process

Once you've created your TransformProcess using your Schema, and you've either loaded your dataset into an Apache Spark JavaRDD or have a RecordReader that loads your dataset, you can execute a transform.

Locally this looks like:

When using Spark this looks like:

Available executors

LocalTransformExecutor

Local transform executor

isTryCatch

Execute the specified TransformProcess with the given input data Note: this method can only be used if the TransformProcess returns non-sequence data. For TransformProcesses that return a sequence, use {- link #executeToSequence(List, TransformProcess)}

  • param inputWritables Input data to process

  • param transformProcess TransformProcess to execute

  • return Processed data

SparkTransformExecutor

Execute a DataVec transform process on Spark RDDs.

isTryCatch

  • deprecated Use static methods instead of instance methods on SparkTransformExecutor

import org.datavec.local.transforms.LocalTransformExecutor;

List<List<Writable>> transformed = LocalTransformExecutor.execute(recordReader, transformProcess);

List<List<List<Writable>>> transformedSeq = LocalTransformExecutor.executeToSequence(sequenceReader, transformProcess);

List<List<Writable>> joined = LocalTransformExecutor.executeJoin(join, leftReader, rightReader);

import org.datavec.spark.transform.SparkTransformExecutor;

JavaRDD<List<Writable>> transformed = SparkTransformExecutor.execute(inputRdd, transformProcess);

JavaRDD<List<List<Writable>>> transformedSeq = SparkTransformExecutor.executeToSequence(inputSequenceRdd, transformProcess);

JavaRDD<List<Writable>> joined = SparkTransformExecutor.executeJoin(join, leftRdd, rightRdd);
public static boolean isTryCatch()
public static boolean isTryCatch()

Normalization

Why normalize?

Neural networks work best when the data they’re fed is normalized, constrained to a range between -1 and 1. There are several reasons for that. One is that nets are trained using gradient descent, and their activation functions usually have an active range somewhere between -1 and 1. Even when using an activation function that doesn’t saturate quickly, it is still good practice to constrain your values to this range to improve performance.
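
As a minimal sketch, a typical workflow fits a normalizer on the training data and then attaches it to the iterator as a pre-processor (trainIter is an assumed DataSetIterator variable):

import org.nd4j.linalg.dataset.api.preprocessor.DataNormalization;
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerStandardize;

DataNormalization normalizer = new NormalizerStandardize();
normalizer.fit(trainIter);              // collect mean/standard deviation statistics from the training data
trainIter.setPreProcessor(normalizer);  // normalize each DataSet on the fly as it is fetched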

Available preprocessors

NormalizerMinMaxScaler

Pre processor for DataSets that normalizes feature values (and optionally label values) to lie between a minimum and maximum value (by default between 0 and 1)

NormalizerMinMaxScaler

Preprocessor can take a range as minRange and maxRange

  • param minRange

  • param maxRange

load

Load the given min and max

  • param statistics the statistics to load

  • throws IOException

save

Save the current min and max

  • param files the statistics to save

  • throws IOException

  • deprecated use {- link NormalizerSerializer instead}

Normalizer

Base interface for all normalizers

ImageFlatteningDataSetPreProcessor

A DataSetPreProcessor used to flatten a 4d CNN features array to a flattened 2d format (for use in networks such as a DenseLayer/multi-layer perceptron)

MinMaxStrategy

Normalization strategy that uses statistics of the upper and lower bounds of the population (min/max scaling)

MinMaxStrategy

  • param minRange the target range lower bound

  • param maxRange the target range upper bound

preProcess

Normalize a data array

  • param array the data to normalize

  • param stats statistics of the data population

revert

Denormalize a data array

  • param array the data to denormalize

  • param stats statistics of the data population

ImagePreProcessingScaler

Created by susaneraly on 6/23/16. A preprocessor specifically for images that applies min-max scaling. Can take a range, so pixel values can be scaled from 0->255 to minRange->maxRange (default minRange = 0 and maxRange = 1). If pixel values are not 8 bits, you can specify the number of bits as the third argument in the constructor. For values that are already floating point, specify the number of bits as 1.
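
A rough usage sketch, assuming an image DataSetIterator called imageIter with 8-bit pixel values:

import org.nd4j.linalg.dataset.api.preprocessor.DataNormalization;
import org.nd4j.linalg.dataset.api.preprocessor.ImagePreProcessingScaler;

DataNormalization scaler = new ImagePreProcessingScaler(0, 1); // scale pixel values from 0-255 to 0-1
scaler.fit(imageIter);
imageIter.setPreProcessor(scaler);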

ImagePreProcessingScaler

Preprocessor can take a range as minRange and maxRange

  • param a, default = 0

  • param b, default = 1

  • param maxBits in the image, default = 8

fit

Fit a dataset (only compute based on the statistics from this dataset)

  • param dataSet the dataset to compute on

fit

Iterates over a dataset accumulating statistics for normalization

  • param iterator the iterator to use for collecting statistics.

transform

Transform the data

  • param toPreProcess the dataset to transform

CompositeMultiDataSetPreProcessor

A simple Composite MultiDataSetPreProcessor - allows you to apply multiple MultiDataSetPreProcessors sequentially on the one MultiDataSet, in the order they are passed to the constructor

CompositeMultiDataSetPreProcessor

  • param preProcessors Preprocessors to apply. They will be applied in this order

MultiNormalizerMinMaxScaler

Pre processor for MultiDataSet that normalizes feature values (and optionally label values) to lie between a minimum and maximum value (by default between 0 and 1)

MultiNormalizerMinMaxScaler

Preprocessor can take a range as minRange and maxRange

  • param minRange the target range lower bound

  • param maxRange the target range upper bound

MultiDataNormalization

An interface for multi dataset normalizers. Data normalizers compute some sort of statistics over a MultiDataSet and scale the data in some way.

ImageMultiPreProcessingScaler

A preprocessor specifically for images that applies min-max scaling to one or more of the feature arrays in a MultiDataSet. Can take a range, so pixel values can be scaled from 0->255 to minRange->maxRange (default minRange = 0 and maxRange = 1). If pixel values are not 8 bits, you can specify the number of bits as the third argument in the constructor. For values that are already floating point, specify the number of bits as 1.

ImageMultiPreProcessingScaler

Preprocessor can take a range as minRange and maxRange

  • param a, default = 0

  • param b, default = 1

  • param maxBits in the image, default = 8

  • param featureIndices Indices of feature arrays to process. If only one feature array is present, this should always be 0

NormalizerStandardize

Pre processor for DataSet that normalizes feature values (and optionally label values) to have 0 mean and a standard deviation of 1. Created by susaneraly and Ede Meijer.

load

Load the means and standard deviations from the file system

  • param files the files to load from. Needs 4 files if normalizing labels, otherwise 2.

save

  • param files the files to save to. Needs 4 files if normalizing labels, otherwise 2.

  • deprecated use {- link NormalizerSerializer} instead

Save the current means and standard deviations to the file system

StandardizeStrategy

Normalization strategy that uses the means and standard deviations of the population (standardization)

preProcess

Normalize a data array

  • param array the data to normalize

  • param stats statistics of the data population

revert

Denormalize a data array

  • param array the data to denormalize

  • param stats statistics of the data population

NormalizerStrategy

Interface for strategies that can normalize and denormalize data arrays based on statistics of the population

MultiNormalizerHybrid

Pre processor for MultiDataSet that can be configured to use different normalization strategies for different inputs and outputs, or none at all. Can be used for example when one input should be normalized, but a different one should be untouched because it’s the input for an embedding layer. Alternatively, one might want to mix standardization and min-max scaling for different inputs and outputs.

By default, no normalization is applied. There are methods to configure the desired normalization strategy for inputs and outputs either globally or on an individual input/output level. Specific input/output strategies will override global ones.
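
Based on the methods listed below, configuring and fitting a hybrid normalizer might look roughly like this (the input index and the multiDataSetIterator variable are illustrative):

MultiNormalizerHybrid normalizer = new MultiNormalizerHybrid()
        .standardizeAllInputs()        // standardize every input by default...
        .minMaxScaleInput(2, 0, 1)     // ...but min-max scale input 2 to the range [0, 1]
        .standardizeAllOutputs();

normalizer.fit(multiDataSetIterator); // collect the required statistics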

standardizeAllInputs

Apply standardization to all inputs, except the ones individually configured

  • return the normalizer

minMaxScaleAllInputs

Apply min-max scaling to all inputs, except the ones individually configured

  • return the normalizer

minMaxScaleAllInputs

Apply min-max scaling to all inputs, except the ones individually configured

  • param rangeFrom lower bound of the target range

  • param rangeTo upper bound of the target range

  • return the normalizer

standardizeInput

Apply standardization to a specific input, overriding the global input strategy if any

  • param input the index of the input

  • return the normalizer

minMaxScaleInput

Apply min-max scaling to a specific input, overriding the global input strategy if any

  • param input the index of the input

  • return the normalizer

minMaxScaleInput

Apply min-max scaling to a specific input, overriding the global input strategy if any

  • param input the index of the input

  • param rangeFrom lower bound of the target range

  • param rangeTo upper bound of the target range

  • return the normalizer

standardizeAllOutputs

Apply standardization to all outputs, except the ones individually configured

  • return the normalizer

minMaxScaleAllOutputs

Apply min-max scaling to all outputs, except the ones individually configured

  • return the normalizer

minMaxScaleAllOutputs

Apply min-max scaling to all outputs, except the ones individually configured

  • param rangeFrom lower bound of the target range

  • param rangeTo upper bound of the target range

  • return the normalizer

standardizeOutput

Apply standardization to a specific output, overriding the global output strategy if any

  • param output the index of the output

  • return the normalizer

minMaxScaleOutput

Apply min-max scaling to a specific output, overriding the global output strategy if any

  • param output the index of the output

  • return the normalizer

minMaxScaleOutput

Apply min-max scaling to a specific output, overriding the global output strategy if any

  • param output the index of the output

  • param rangeFrom lower bound of the target range

  • param rangeTo upper bound of the target range

  • return the normalizer

getInputStats

Get normalization statistics for a given input.

  • param input the index of the input

  • return implementation of NormalizerStats corresponding to the normalization strategy selected

getOutputStats

Get normalization statistics for a given output.

  • param output the index of the output

  • return implementation of NormalizerStats corresponding to the normalization strategy selected

fit

Get the map of normalization statistics per input

  • return map of input indices pointing to NormalizerStats instances

fit

Iterates over a dataset accumulating statistics for normalization

  • param iterator the iterator to use for collecting statistics

transform

Transform the dataset

  • param data the dataset to pre process

revert

Undo (revert) the normalization applied by this DataNormalization instance (arrays are modified in-place)

  • param data MultiDataSet to revert the normalization on

revertFeatures

Undo (revert) the normalization applied by this DataNormalization instance to the entire inputs array

  • param features The normalized array of inputs

revertFeatures

Undo (revert) the normalization applied by this DataNormalization instance to the entire inputs array

  • param features The normalized array of inputs

  • param maskArrays Optional mask arrays belonging to the inputs

revertFeatures

Undo (revert) the normalization applied by this DataNormalization instance to the features of a particular input

  • param features The normalized array of inputs

  • param maskArrays Optional mask arrays belonging to the inputs

  • param input the index of the input to revert normalization on

revertLabels

Undo (revert) the normalization applied by this DataNormalization instance to the entire outputs array

  • param labels The normalized array of outputs

revertLabels

Undo (revert) the normalization applied by this DataNormalization instance to the entire outputs array

  • param labels The normalized array of outputs

  • param maskArrays Optional mask arrays belonging to the outputs

revertLabels

Undo (revert) the normalization applied by this DataNormalization instance to the labels of a particular output

  • param labels The normalized array of outputs

  • param maskArrays Optional mask arrays belonging to the outputs

  • param output the index of the output to revert normalization on

CompositeDataSetPreProcessor

A simple Composite DataSetPreProcessor - allows you to apply multiple DataSetPreProcessors sequentially on the one DataSet, in the order they are passed to the constructor

CompositeDataSetPreProcessor

  • param preProcessors Preprocessors to apply. They will be applied in this order

MultiNormalizerStandardize

Pre processor for MultiDataSet that normalizes feature values (and optionally label values) to have 0 mean and a standard deviation of 1

load

Load means and standard deviations from the file system

  • param featureFiles source files for features, requires 2 files per input, alternating mean and stddev files

  • param labelFiles source files for labels, requires 2 files per output, alternating mean and stddev files

save

  • param featureFiles target files for features, requires 2 files per input, alternating mean and stddev files

  • param labelFiles target files for labels, requires 2 files per output, alternating mean and stddev files

  • deprecated use {- link MultiStandardizeSerializerStrategy} instead

Save the current means and standard deviations to the file system

VGG16ImagePreProcessor

This is a preprocessor specifically for VGG16. It subtracts the mean RGB value, computed on the training set, from each pixel, as reported in https://arxiv.org/pdf/1409.1556.pdf

fit

Fit a dataset (only compute based on the statistics from this dataset)

  • param dataSet the dataset to compute on

fit

Iterates over a dataset accumulating statistics for normalization

  • param iterator the iterator to use for collecting statistics.

transform

Transform the data

  • param toPreProcess the dataset to transform

DataNormalization

An interface for data normalizers. Data normalizers compute some sort of statistics over a dataset and scale the data in some way.

public NormalizerMinMaxScaler(double minRange, double maxRange)
public void load(File... statistics) throws IOException
public void save(File... files) throws IOException
public MinMaxStrategy(double minRange, double maxRange)
public void preProcess(INDArray array, INDArray maskArray, MinMaxStats stats)
public void revert(INDArray array, INDArray maskArray, MinMaxStats stats)
public ImagePreProcessingScaler(double a, double b, int maxBits)
public void fit(DataSet dataSet)
public void fit(DataSetIterator iterator)
public void transform(DataSet toPreProcess)
public CompositeMultiDataSetPreProcessor(MultiDataSetPreProcessor... preProcessors)
public MultiNormalizerMinMaxScaler(double minRange, double maxRange)
public ImageMultiPreProcessingScaler(double a, double b, int maxBits, int[] featureIndices)
public void load(File... files) throws IOException
public void save(File... files) throws IOException
public void preProcess(INDArray array, INDArray maskArray, DistributionStats stats)
public void revert(INDArray array, INDArray maskArray, DistributionStats stats)
public MultiNormalizerHybrid standardizeAllInputs()
public MultiNormalizerHybrid minMaxScaleAllInputs()
public MultiNormalizerHybrid minMaxScaleAllInputs(double rangeFrom, double rangeTo)
public MultiNormalizerHybrid standardizeInput(int input)
public MultiNormalizerHybrid minMaxScaleInput(int input)
public MultiNormalizerHybrid minMaxScaleInput(int input, double rangeFrom, double rangeTo)
public MultiNormalizerHybrid standardizeAllOutputs()
public MultiNormalizerHybrid minMaxScaleAllOutputs()
public MultiNormalizerHybrid minMaxScaleAllOutputs(double rangeFrom, double rangeTo)
public MultiNormalizerHybrid standardizeOutput(int output)
public MultiNormalizerHybrid minMaxScaleOutput(int output)
public MultiNormalizerHybrid minMaxScaleOutput(int output, double rangeFrom, double rangeTo)
public NormalizerStats getInputStats(int input)
public NormalizerStats getOutputStats(int output)
public void fit(@NonNull MultiDataSet dataSet)
public void fit(@NonNull MultiDataSetIterator iterator)
public void transform(@NonNull MultiDataSet data)
public void revert(@NonNull MultiDataSet data)
public void revertFeatures(@NonNull INDArray[] features)
public void revertFeatures(@NonNull INDArray[] features, INDArray[] maskArrays)
public void revertFeatures(@NonNull INDArray[] features, INDArray[] maskArrays, int input)
public void revertLabels(@NonNull INDArray[] labels)
public void revertLabels(@NonNull INDArray[] labels, INDArray[] maskArrays)
public void revertLabels(@NonNull INDArray[] labels, INDArray[] maskArrays, int output)
public CompositeDataSetPreProcessor(DataSetPreProcessor... preProcessors)
public void load(@NonNull List<File> featureFiles, @NonNull List<File> labelFiles) throws IOException
public void save(@NonNull List<File> featureFiles, @NonNull List<File> labelFiles) throws IOException
public void fit(DataSet dataSet)
public void fit(DataSetIterator iterator)
public void transform(DataSet toPreProcess)

Schemas

Schemas for datasets and transformation.

Why use schemas?

The unfortunate reality is that data is dirty. When trying to vectorize a dataset for deep learning, it is quite rare to find files that have zero errors. A schema is important for maintaining the meaning of the data before using it for something like training a neural network.

Using schemas

Schemas are primarily used for programming transformations. Before you can properly execute a TransformProcess you will need to pass the schema of the data being transformed.

An example of a schema for merchant records may look like:
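
A sketch of such a schema, built with the Schema.Builder methods documented below (all column names and categorical states here are illustrative):

import org.datavec.api.transform.schema.Schema;

Schema inputDataSchema = new Schema.Builder()
    .addColumnString("DateTimeString")
    .addColumnsString("CustomerID", "MerchantID")
    .addColumnInteger("NumItemsInTransaction")
    .addColumnCategorical("MerchantCountryCode", Arrays.asList("USA", "CAN", "FR", "MX"))
    .addColumnDouble("TransactionAmountUSD", 0.0, null, false, false) // non-negative, no NaN, no Infinite values
    .build();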

Joining schemas

If you have two different datasets that you want to merge together, DataVec provides a Join class with different join strategies such as Inner or RightOuter.

Once you've defined your join and you've loaded the data into DataVec, you must use an Executor to complete the join.
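
Defining the join itself might look roughly like the sketch below, using the Join.Builder (the join column and schema variables are illustrative); the joined data is then produced by an executor, for example SparkTransformExecutor.executeJoin or LocalTransformExecutor.executeJoin:

import org.datavec.api.transform.join.Join;

Join join = new Join.Builder(Join.JoinType.Inner)
        .setJoinColumns("CustomerID")                    // join where the CustomerID values match
        .setSchemas(customerInfoSchema, purchasesSchema)
        .build();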

Classes and utilities

DataVec comes with a few Schema classes and helper utilities for 2D and sequence types of data.

Join

Join class: used to specify a join (like an SQL join)

setSchemas

Type of join. Inner: return examples where the join column values occur in both datasets. LeftOuter: return all examples from the left data, whether there is a matching right value or not (if not, the right values will have NullWritable instead). RightOuter: return all examples from the right data, whether there is a matching left value or not (if not, the left values will have NullWritable instead). FullOuter: return all examples from both left and right, whether there is a matching value from the other side or not (if not, the other values will have NullWritable instead).

setKeyColumns

  • deprecated Use {- link #setJoinColumns(String…)}

setKeyColumnsLeft

  • deprecated Use {- link #setJoinColumnsLeft(String…)}

setKeyColumnsRight

  • deprecated Use {- link #setJoinColumnsRight(String…)}

setJoinColumnsLeft

Specify the names of the columns to join on, for the left data. The idea: join examples where firstDataValues(joinColumnNamesLeft[i]) == secondDataValues(joinColumnNamesRight[i]) for all i

  • param joinColumnNames Names of the columns to join on (for the left data)

setJoinColumnsRight

Specify the names of the columns to join on, for the right data. The idea: join examples where firstDataValues(joinColumnNamesLeft[i]) == secondDataValues(joinColumnNamesRight[i]) for all i

  • param joinColumnNames Names of the columns to join on (for the right data)

InferredSchema

If passed a CSV file that contains a header and a single row of sample data, it will return a Schema.

Only Double, Integer, Long, and String types are supported. If no number type can be inferred, the field type will become the default type. Note that if your column is actually categorical but is represented as a number, you will need to do additional transformation. Also, if your sample field is blank/null, it will also become the default type.
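
A minimal sketch of inferring a schema from a CSV file; the path and the import location are illustrative, and the constructor and build() call are assumed to follow the behaviour described above:

import org.datavec.api.transform.schema.InferredSchema;

Schema inferredSchema = new InferredSchema("path/to/data.csv").build();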

Schema

A Schema defines the layout of tabular data. Specifically, it contains names for each column, as well as details of types (Integer, String, Long, Double, etc). Type information for each column may optionally include restrictions on the allowable values for each column.

sameTypes

Create a schema based on the given metadata

  • param columnMetaData the metadata to create the schema from

newSchema

Compute the difference in {- link ColumnMetaData} between this schema and the passed in schema. This is useful during the {- link org.datavec.api.transform.TransformProcess} to identify what a process will do to a given {- link Schema}.

  • param schema the schema to compute the difference for

  • return the metadata that is different (in order) between this schema and the other schema

numColumns

Returns the number of columns or fields for this schema

  • return the number of columns or fields for this schema

getName

Returns the name of a given column at the specified index

  • param column the index of the column to get the name for

  • return the name of the column at the specified index

getType

Returns the {- link ColumnType} for the column at the specified index

  • param column the index of the column to get the type for

  • return the type of the column at the specified index

getType

Returns the {- link ColumnType} for the column with the specified name

  • param columnName the name of the column to get the type for

  • return the type of the column with the specified name

getMetaData

Returns the {- link ColumnMetaData} at the specified column index

  • param column the index to get the metadata for

  • return the metadata at the specified index

getMetaData

Retrieve the metadata for the given column name

  • param column the name of the column to get metadata for

  • return the metadata for the given column name

getIndexOfColumn

Return a copy of the list of column names

  • return a copy of the list of column names for this schema

hasColumn

Return the indices of the columns, given their names

  • param columnNames Name of the columns to get indices for

  • return Column indexes

toJson

Serialize this schema to json

  • return a json representation of this schema

toYaml

Serialize this schema to yaml

  • return the yaml representation of this schema

fromJson

Create a schema from a given json string

  • param json the json to create the schema from

  • return the created schema based on the json

fromYaml

Create a schema from the given yaml string

  • param yaml the yaml to create the schema from

  • return the created schema based on the yaml

addColumnFloat

Add a Float column with no restrictions on the allowable values, except for no NaN/infinite values allowed

  • param name Name of the column

addColumnFloat

Add a Float column with the specified restrictions (and no NaN/Infinite values allowed)

  • param name Name of the column

  • param minAllowedValue Minimum allowed value (inclusive). If null: no restriction

  • param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction

  • return

addColumnFloat

Add a Float column with the specified restrictions

  • param name Name of the column

  • param minAllowedValue Minimum allowed value (inclusive). If null: no restriction

  • param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction

  • param allowNaN If false: don’t allow NaN values. If true: allow.

addColumnsFloat

Add multiple Float columns with no restrictions on the allowable values of the columns (other than no NaN/Infinite)

  • param columnNames Names of the columns to add

addColumnsFloat

A convenience method for adding multiple Float columns. For example, to add columns “myFloatCol_0”, “myFloatCol_1”, “myFloatCol_2”, use {- code addColumnsFloat(“myFloatCol_%d”,0,2)}

  • param pattern Pattern to use (via String.format). “%d” is replaced with column numbers

  • param minIdxInclusive Minimum column index to use (inclusive)

  • param maxIdxInclusive Maximum column index to use (inclusive)

addColumnsFloat

A convenience method for adding multiple Float columns, with additional restrictions that apply to all columns For example, to add columns “myFloatCol_0”, “myFloatCol_1”, “myFloatCol_2”, use {- code addColumnsFloat(“myFloatCol_%d”,0,2,null,null,false,false)}

  • param pattern Pattern to use (via String.format). “%d” is replaced with column numbers

  • param minIdxInclusive Minimum column index to use (inclusive)

  • param maxIdxInclusive Maximum column index to use (inclusive)

  • param minAllowedValue Minimum allowed value (inclusive). If null: no restriction

addColumnDouble

Add a Double column with no restrictions on the allowable values, except for no NaN/infinite values allowed

  • param name Name of the column

addColumnDouble

Add a Double column with the specified restrictions (and no NaN/Infinite values allowed)

  • param name Name of the column

  • param minAllowedValue Minimum allowed value (inclusive). If null: no restriction

  • param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction

  • return

addColumnDouble

Add a Double column with the specified restrictions

  • param name Name of the column

  • param minAllowedValue Minimum allowed value (inclusive). If null: no restriction

  • param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction

  • param allowNaN If false: don’t allow NaN values. If true: allow.

addColumnsDouble

Add multiple Double columns with no restrictions on the allowable values of the columns (other than no NaN/Infinite)

  • param columnNames Names of the columns to add

addColumnsDouble

A convenience method for adding multiple Double columns. For example, to add columns “myDoubleCol_0”, “myDoubleCol_1”, “myDoubleCol_2”, use {- code addColumnsDouble(“myDoubleCol_%d”,0,2)}

  • param pattern Pattern to use (via String.format). “%d” is replaced with column numbers

  • param minIdxInclusive Minimum column index to use (inclusive)

  • param maxIdxInclusive Maximum column index to use (inclusive)

addColumnsDouble

A convenience method for adding multiple Double columns, with additional restrictions that apply to all columns For example, to add columns “myDoubleCol_0”, “myDoubleCol_1”, “myDoubleCol_2”, use {- code addColumnsDouble(“myDoubleCol_%d”,0,2,null,null,false,false)}

  • param pattern Pattern to use (via String.format). “%d” is replaced with column numbers

  • param minIdxInclusive Minimum column index to use (inclusive)

  • param maxIdxInclusive Maximum column index to use (inclusive)

  • param minAllowedValue Minimum allowed value (inclusive). If null: no restriction

addColumnInteger

Add an Integer column with no restrictions on the allowable values

  • param name Name of the column

addColumnInteger

Add an Integer column with the specified min/max allowable values

  • param name Name of the column

  • param minAllowedValue Minimum allowed value (inclusive). If null: no restriction

  • param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction

addColumnsInteger

Add multiple Integer columns with no restrictions on the min/max allowable values

  • param names Names of the integer columns to add

addColumnsInteger

A convenience method for adding multiple Integer columns. For example, to add columns “myIntegerCol_0”, “myIntegerCol_1”, “myIntegerCol_2”, use {- code addColumnsInteger(“myIntegerCol_%d”,0,2)}

  • param pattern Pattern to use (via String.format). “%d” is replaced with column numbers

  • param minIdxInclusive Minimum column index to use (inclusive)

  • param maxIdxInclusive Maximum column index to use (inclusive)

addColumnsInteger

A convenience method for adding multiple Integer columns. For example, to add columns “myIntegerCol_0”, “myIntegerCol_1”, “myIntegerCol_2”, use {- code addColumnsInteger(“myIntegerCol_%d”,0,2)}

  • param pattern Pattern to use (via String.format). “%d” is replaced with column numbers

  • param minIdxInclusive Minimum column index to use (inclusive)

  • param maxIdxInclusive Maximum column index to use (inclusive)

  • param minAllowedValue Minimum allowed value (inclusive). If null: no restriction

addColumnCategorical

Add a Categorical column, with the specified state names

  • param name Name of the column

  • param stateNames Names of the allowable states for this categorical column

addColumnCategorical

Add a Categorical column, with the specified state names

  • param name Name of the column

  • param stateNames Names of the allowable states for this categorical column

addColumnLong

Add a Long column, with no restrictions on the min/max values

  • param name Name of the column

addColumnLong

Add a Long column with the specified min/max allowable values

  • param name Name of the column

  • param minAllowedValue Minimum allowed value (inclusive). If null: no restriction

  • param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction

addColumnsLong

Add multiple Long columns, with no restrictions on the allowable values

  • param names Names of the Long columns to add

addColumnsLong

A convenience method for adding multiple Long columns. For example, to add columns “myLongCol_0”, “myLongCol_1”, “myLongCol_2”, use addColumnsLong("myLongCol_%d", 0, 2)

  • param pattern Pattern to use (via String.format). “%d” is replaced with column numbers

  • param minIdxInclusive Minimum column index to use (inclusive)

  • param maxIdxInclusive Maximum column index to use (inclusive)

addColumnsLong

A convenience method for adding multiple Long columns. For example, to add columns “myLongCol_0”, “myLongCol_1”, “myLongCol_2”, use addColumnsLong("myLongCol_%d", 0, 2)

  • param pattern Pattern to use (via String.format). “%d” is replaced with column numbers

  • param minIdxInclusive Minimum column index to use (inclusive)

  • param maxIdxInclusive Maximum column index to use (inclusive)

  • param minAllowedValue Minimum allowed value (inclusive). If null: no restriction

  • param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction

addColumn

Add a column

  • param metaData metadata for this column

addColumnString

Add a String column with no restrictions on the allowable values.

  • param name Name of the column

addColumnsString

Add multiple String columns with no restrictions on the allowable values

  • param columnNames Names of the String columns to add

addColumnString

Add a String column with the specified restrictions

  • param name Name of the column

  • param regex Regex that the String must match in order to be considered valid. If null: no regex restriction

  • param minAllowableLength Minimum allowable length for the String to be considered valid

  • param maxAllowableLength Maximum allowable length for the String to be considered valid

addColumnsString

A convenience method for adding multiple numbered String columns. For example, to add columns “myStringCol_0”, “myStringCol_1”, “myStringCol_2”, use addColumnsString("myStringCol_%d", 0, 2)

  • param pattern Pattern to use (via String.format). “%d” is replaced with column numbers

  • param minIdxInclusive Minimum column index to use (inclusive)

  • param maxIdxInclusive Maximum column index to use (inclusive)

addColumnsString

A convenience method for adding multiple numbered String columns. For example, to add columns “myStringCol_0”, “myStringCol_1”, “myStringCol_2”, use addColumnsString("myStringCol_%d", 0, 2)

  • param pattern Pattern to use (via String.format). “%d” is replaced with column numbers

  • param minIdxInclusive Minimum column index to use (inclusive)

  • param maxIdxInclusive Maximum column index to use (inclusive)

  • param regex Regex that the String must match in order to be considered valid. If null: no regex restriction

  • param minAllowedLength Minimum allowed length of strings (inclusive). If null: no restriction

  • param maxAllowedLength Maximum allowed length of strings (inclusive). If null: no restriction

addColumnTime

Add a Time column with no restrictions on the min/max allowable times. NOTE: Time columns are represented by LONG (epoch millisecond) values. For time values in human-readable formats, use String columns + StringToTimeTransform

  • param columnName Name of the column

  • param timeZone Time zone of the time column

addColumnTime

Add a Time column with no restrictions on the min/max allowable times. NOTE: Time columns are represented by LONG (epoch millisecond) values. For time values in human-readable formats, use String columns + StringToTimeTransform

  • param columnName Name of the column

  • param timeZone Time zone of the time column

addColumnTime

Add a Time column with the specified restrictions. NOTE: Time columns are represented by LONG (epoch millisecond) values. For time values in human-readable formats, use String columns + StringToTimeTransform

  • param columnName Name of the column

  • param timeZone Time zone of the time column

  • param minValidValue Minimum allowable time (in milliseconds). May be null.

  • param maxValidValue Maximum allowable time (in milliseconds). May be null.

addColumnNDArray

Add an NDArray column

  • param columnName Name of the column

  • param shape shape of the NDArray column. Use -1 in entries to specify as “variable length” in that dimension

build

Create the Schema

inferMultiple

Infers a schema based on the record. The column names are based on indexing.

  • param record the record to infer from

  • return the inferred schema

infer

Infers a schema based on the record. The column names are based on indexing.

  • param record the record to infer from

  • return the inferred schema
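
For reference, a minimal sketch of inferring a schema from a single record (the Writable values here are purely illustrative):

    List<Writable> record = Arrays.<Writable>asList(
            new Text("some text"), new IntWritable(7), new DoubleWritable(3.14));

    // Column types are taken from the Writable types; column names are generated from the indexes
    Schema inferred = Schema.infer(record);
    System.out.println(inferred.numColumns());   // prints 3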

SequenceSchema

inferSequenceMulti

Infers a sequence schema based on the record

  • param record the record to infer the schema based on

  • return the inferred sequence schema

inferSequence

Infers a sequence schema based on the record

  • param record the record to infer the schema based on

  • return the inferred sequence schema

    Schema inputDataSchema = new Schema.Builder()
        .addColumnsString("DateTimeString", "CustomerID", "MerchantID")
        .addColumnInteger("NumItemsInTransaction")
        .addColumnCategorical("MerchantCountryCode", Arrays.asList("USA","CAN","FR","MX"))
        .addColumnDouble("TransactionAmountUSD",0.0,null,false,false)   //$0.0 or more, no maximum limit, no NaN and no Infinite values
        .addColumnCategorical("FraudLabel", Arrays.asList("Fraud","Legit"))
        .build();
    Schema customerInfoSchema = new Schema.Builder()
        .addColumnLong("customerID")
        .addColumnString("customerName")
        .addColumnCategorical("customerCountry", Arrays.asList("USA","France","Japan","UK"))
        .build();
    
    Schema customerPurchasesSchema = new Schema.Builder()
        .addColumnLong("customerID")
        .addColumnTime("purchaseTimestamp", DateTimeZone.UTC)
        .addColumnLong("productID")
        .addColumnInteger("purchaseQty")
        .addColumnDouble("unitPriceUSD")
        .build();
    
    Join join = new Join.Builder(Join.JoinType.Inner)
        .setJoinColumns("customerID")
        .setSchemas(customerInfoSchema, customerPurchasesSchema)
        .build();
    public Builder setSchemas(Schema left, Schema right)
    public Builder setKeyColumns(String... keyColumnNames)
    public Builder setKeyColumnsLeft(String... keyColumnNames)
    public Builder setKeyColumnsRight(String... keyColumnNames)
    public Builder setJoinColumnsLeft(String... joinColumnNames)
    public Builder setJoinColumnsRight(String... joinColumnNames)
    public boolean sameTypes(Schema schema)
    public Schema newSchema(List<ColumnMetaData> columnMetaData)
    public int numColumns()
    public String getName(int column)
    public ColumnType getType(int column)
    public ColumnType getType(String columnName)
    public ColumnMetaData getMetaData(int column)
    public ColumnMetaData getMetaData(String column)
    public int getIndexOfColumn(String columnName)
    public boolean hasColumn(String columnName)
    public String toJson()
    public String toYaml()
    public static Schema fromJson(String json)
    public static Schema fromYaml(String yaml)
    public Builder addColumnFloat(String name)
    public Builder addColumnFloat(String name, Float minAllowedValue, Float maxAllowedValue)
    public Builder addColumnFloat(String name, Float minAllowedValue, Float maxAllowedValue, boolean allowNaN,
                                           boolean allowInfinite)
    public Builder addColumnsFloat(String... columnNames)
    public Builder addColumnsFloat(String pattern, int minIdxInclusive, int maxIdxInclusive)
    public Builder addColumnsFloat(String pattern, int minIdxInclusive, int maxIdxInclusive,
                                            Float minAllowedValue, Float maxAllowedValue, boolean allowNaN, boolean allowInfinite)
    public Builder addColumnDouble(String name)
    public Builder addColumnDouble(String name, Double minAllowedValue, Double maxAllowedValue)
    public Builder addColumnDouble(String name, Double minAllowedValue, Double maxAllowedValue, boolean allowNaN,
                            boolean allowInfinite)
    public Builder addColumnsDouble(String... columnNames)
    public Builder addColumnsDouble(String pattern, int minIdxInclusive, int maxIdxInclusive)
    public Builder addColumnsDouble(String pattern, int minIdxInclusive, int maxIdxInclusive,
                            Double minAllowedValue, Double maxAllowedValue, boolean allowNaN, boolean allowInfinite)
    public Builder addColumnInteger(String name)
    public Builder addColumnInteger(String name, Integer minAllowedValue, Integer maxAllowedValue)
    public Builder addColumnsInteger(String... names)
    public Builder addColumnsInteger(String pattern, int minIdxInclusive, int maxIdxInclusive)
    public Builder addColumnsInteger(String pattern, int minIdxInclusive, int maxIdxInclusive,
                            Integer minAllowedValue, Integer maxAllowedValue)
    public Builder addColumnCategorical(String name, String... stateNames)
    public Builder addColumnCategorical(String name, List<String> stateNames)
    public Builder addColumnLong(String name)
    public Builder addColumnLong(String name, Long minAllowedValue, Long maxAllowedValue)
    public Builder addColumnsLong(String... names)
    public Builder addColumnsLong(String pattern, int minIdxInclusive, int maxIdxInclusive)
    public Builder addColumnsLong(String pattern, int minIdxInclusive, int maxIdxInclusive, Long minAllowedValue,
                            Long maxAllowedValue)
    public Builder addColumn(ColumnMetaData metaData)
    public Builder addColumnString(String name)
    public Builder addColumnsString(String... columnNames)
    public Builder addColumnString(String name, String regex, Integer minAllowableLength,
                            Integer maxAllowableLength)
    public Builder addColumnsString(String pattern, int minIdxInclusive, int maxIdxInclusive)
    public Builder addColumnsString(String pattern, int minIdxInclusive, int maxIdxInclusive, String regex,
                            Integer minAllowedLength, Integer maxAllowedLength)
    public Builder addColumnTime(String columnName, TimeZone timeZone)
    public Builder addColumnTime(String columnName, DateTimeZone timeZone)
    public Builder addColumnTime(String columnName, DateTimeZone timeZone, Long minValidValue, Long maxValidValue)
    public Builder addColumnNDArray(String columnName, long[] shape)
    public Schema build()
    public static Schema inferMultiple(List<List<Writable>> record)
    public static Schema infer(List<Writable> record)
    public static SequenceSchema inferSequenceMulti(List<List<List<Writable>>> record)
    public static SequenceSchema inferSequence(List<List<Writable>> record)

    Readers

    Read individual records from different formats.

    Why readers?

    Readers iterate over records in a stored dataset and load the data into DataVec. Beyond reading individual entries, readers are useful when you want to, for example, train a text generator on an entire corpus, or programmatically compose two entries into a new record. Reader implementations are also useful for complex file types or distributed storage mechanisms.

    Readers return Writable classes that describe each column in a Record. These classes are used to convert each record to a tensor/ND-Array format.

    Usage

    Each reader implementation extends BaseRecordReader and provides a simple API for selecting the next record in a dataset, acting similarly to iterators.

    Useful methods include:

    • next: Return a batch of Writable.

    • nextRecord: Return a single Record, optionally with RecordMetaData.
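
    As a minimal sketch (the file path is hypothetical), a simple line-based reader can be initialized from a FileSplit and iterated like any other reader:

    LineRecordReader reader = new LineRecordReader();
    reader.initialize(new FileSplit(new File("data/lines.txt")));   // hypothetical path

    while (reader.hasNext()) {
        List<Writable> record = reader.next();   // one line of the file -> one record of Writable values
        System.out.println(record);
    }
    reader.close();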

    Listeners

    You can hook a custom RecordListener to a record reader for debugging or visualization purposes. Pass your custom listener to the reader (for example via setListeners) immediately after initializing it.
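
    A sketch using the built-in LogRecordListener (any RecordListener implementation can be supplied; the file path is hypothetical):

    RecordReader reader = new CSVRecordReader();
    reader.initialize(new FileSplit(new File("data/records.csv")));   // hypothetical path
    reader.setListeners(new LogRecordListener());                     // logs every record as it is read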

    Types of readers

    ComposableRecordReader

    A RecordReader that composes multiple underlying record readers into a single pipeline: each individual record is the concatenation of the records returned by the underlying readers. hasNext() is true only when all of the underlying readers have a next record, and next() concatenates (via addAll on the collection) their records and returns them as one record.

    initialize

    ConcatenatingRecordReader

    Combine multiple readers into a single reader. Records are read sequentially - thus if the first reader has 100 records, and the second reader has 200 records, ConcatenatingRecordReader will have 300 records.

    FileRecordReader

    File reader/writer

    getCurrentLabel

    Return the current label. The index of the current file’s parent directory in the label list

    • return The index of the current file’s parent directory

    LineRecordReader

    Reads files line by line

    CollectionRecordReader

    Collection record reader. Mainly used for testing.

    CollectionSequenceRecordReader

    Collection record reader for sequences. Mainly used for testing.

    initialize

    • param records Collection of sequences. For example, List<List<List>> where the inner two lists are a sequence, and the outer list/collection is a list of sequences

    ListStringRecordReader

    Iterates through a list of strings, returning a record for each entry.

    initialize

    Called once at initialization.

    • param split the split that defines the range of records to read

    • throws IOException

    • throws InterruptedException

    initialize

    Called once at initialization.

    • param conf a configuration for initialization

    • param split the split that defines the range of records to read

    • throws IOException

    • throws InterruptedException

    hasNext

    Whether there are any more records available to read

    • return true if another record is available, false otherwise

    reset

    Reset the reader so that the records can be iterated over again from the start

    nextRecord

    Load the record from the given DataInputStream. Unlike next(), the internal state of the RecordReader is not modified. Implementations of this method should not close the DataInputStream.

    • param uri

    • param dataInputStream

    • throws IOException if error occurs during reading from the input stream

    close

    Closes this stream and releases any system resources associated with it. If the stream is already closed then invoking this method has no effect.

    As noted in AutoCloseable#close(), cases where the close may fail require careful attention. It is strongly advised to relinquish the underlying resources and to internally mark the Closeable as closed, prior to throwing the IOException.

    • throws IOException if an I/O error occurs

    setConf

    Set the configuration to be used by this object.

    • param conf

    getConf

    Return the configuration used by this object.

    CSVRecordReader

    Simple csv record reader.

    initialize

    Skip first n lines

    • param skipNumLines the number of lines to skip
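
    For instance, a sketch that skips a header row and reads comma-separated values (constructor overloads can differ slightly between DataVec versions, and the file path is hypothetical):

    int numLinesToSkip = 1;    // skip the header row
    char delimiter = ',';

    CSVRecordReader csvReader = new CSVRecordReader(numLinesToSkip, delimiter);
    csvReader.initialize(new FileSplit(new File("data/iris.csv")));   // hypothetical path

    while (csvReader.hasNext()) {
        List<Writable> row = csvReader.next();   // one CSV row -> one list of Writable values
    }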

    CSVRegexRecordReader

    A CSVRecordReader that can split each column into additional columns using regexes.

    CSVSequenceRecordReader

    CSV Sequence Record Reader. This reader is intended to read sequences of data in CSV format, where each sequence is defined in its own file (and there are multiple files). Each line in the file represents one time step.

    CSVVariableSlidingWindowRecordReader

    A sliding window of variable size across an entire CSV.

    In practice, the sliding window size starts at 1, then linearly increases to maxLinesPerSequence, then linearly decreases back to 1.

    initialize

    No-arg constructor with the default number of lines per sequence (10)

    LibSvmRecordReader

    Record reader for libsvm format, which is closely related to SVMLight format. Similar to scikit-learn we use a single reader for both formats, so this class is a subclass of SVMLightRecordReader.

    Further details on the format can be found at http://svmlight.joachims.org/, http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html and http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html

    MatlabRecordReader

    Matlab record reader

    SVMLightRecordReader

    Record reader for SVMLight format, which can generally be described as

    LABEL INDEX:VALUE INDEX:VALUE …

    SVMLight format is well-suited to sparse data (e.g., bag-of-words) because it omits all features with value zero.

    We support an “extended” version that allows for multiple targets (or labels) separated by a comma, as follows:

    LABEL1,LABEL2,… INDEX:VALUE INDEX:VALUE …

    This can be used to represent either multitask problems or multilabel problems with sparse binary labels (controlled via the “MULTILABEL” configuration option).

    Like scikit-learn, we support both zero-based and one-based indexing.

    Further details on the format can be found at http://svmlight.joachims.org/, http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html and http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html

    initialize

    Must be called before attempting to read records.

    • param conf DataVec configuration

    • param split FileSplit

    • throws IOException

    • throws InterruptedException

    setConf

    Set configuration.

    • param conf DataVec configuration

    • throws IOException

    • throws InterruptedException

    hasNext

    Helper function to help detect lines that are commented out. May read ahead and cache a line.

    • return

    nextRecord

    Return next record as list of Writables.

    • return

    RegexLineRecordReader

    RegexLineRecordReader: Read a file, one line at a time, and split it into fields using a regex. To load an entire file as a single sequence (split line by line with a regex), use RegexSequenceRecordReader instead.

    Example: Data in format “2016-01-01 23:59:59.001 1 DEBUG First entry message!” using regex String “(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) (\d+) ([A-Z]+) (.*)” would be split into 4 Text writables: [“2016-01-01 23:59:59.001”, “1”, “DEBUG”, “First entry message!”]

    RegexSequenceRecordReader

    RegexSequenceRecordReader: Read an entire file (as a sequence), one line at a time and split each line into fields using a regex.

    Example: Data in format “2016-01-01 23:59:59.001 1 DEBUG First entry message!” using regex String “(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) (\d+) ([A-Z]+) (.*)” would be split into 4 Text writables: [“2016-01-01 23:59:59.001”, “1”, “DEBUG”, “First entry message!”]

    Lines that don’t match the provided regex can either result in an exception (FailOnInvalid), be skipped silently (SkipInvalid), or be skipped with a warning logged (SkipInvalidWithWarning).

    TransformProcessRecordReader

    Wraps another record reader, allowing each record to have a transform process applied before being returned.

    initialize

    Called once at initialization.

    • param split the split that defines the range of records to read

    • throws IOException

    • throws InterruptedException

    initialize

    Called once at initialization.

    • param conf a configuration for initialization

    • param split the split that defines the range of records to read

    • throws IOException

    • throws InterruptedException

    hasNext

    Whether there are any more records available to read

    • return true if another record is available, false otherwise

    reset

    Reset the reader so that the records can be iterated over again from the start

    nextRecord

    Load the record from the given DataInputStream. Unlike next(), the internal state of the RecordReader is not modified. Implementations of this method should not close the DataInputStream.

    • param uri

    • param dataInputStream

    • throws IOException if error occurs during reading from the input stream

    loadFromMetaData

    Load a single record from the given RecordMetaData instance. Note that for data that isn’t splittable (i.e., text data that needs to be scanned/split), it is more efficient to load multiple records at once using loadFromMetaData(List)

    • param recordMetaData Metadata for the record that we want to load from

    • return Single record for the given RecordMetaData instance

    • throws IOException If I/O error occurs during loading

    loadFromMetaData

    Load multiple records from the given list of RecordMetaData instances

    • param recordMetaDatas Metadata for the records that we want to load from

    • return Multiple records for the given RecordMetaData instances

    • throws IOException If I/O error occurs during loading

    setListeners

    Set the record listeners for this record reader.

    • param listeners

    close

    Closes this stream and releases any system resources associated with it. If the stream is already closed then invoking this method has no effect.

    As noted in AutoCloseable#close(), cases where the close may fail require careful attention. It is strongly advised to relinquish the underlying resources and to internally mark the Closeable as closed, prior to throwing the IOException.

    • throws IOException if an I/O error occurs

    setConf

    Set the configuration to be used by this object.

    • param conf

    getConf

    Return the configuration used by this object.

    TransformProcessSequenceRecordReader

    Wraps another sequence record reader, allowing each sequence to be transformed before being returned.

    setConf

    Set the configuration to be used by this object.

    • param conf

    getConf

    Return the configuration used by this object.

    batchesSupported

    Whether this record reader supports reading records in batches

    • return true if batch reading is supported

    nextSequence

    Load a sequence record from the given DataInputStream. Unlike next(), the internal state of the RecordReader is not modified. Implementations of this method should not close the DataInputStream.

    • param uri

    • param dataInputStream

    • throws IOException if error occurs during reading from the input stream

    loadSequenceFromMetaData

    Load a single sequence record from the given RecordMetaData instance. Note that for data that isn’t splittable (i.e., text data that needs to be scanned/split), it is more efficient to load multiple records at once using loadSequenceFromMetaData(List)

    • param recordMetaData Metadata for the sequence record that we want to load from

    • return Single sequence record for the given RecordMetaData instance

    • throws IOException If I/O error occurs during loading

    loadSequenceFromMetaData

    Load multiple sequence records from the given list of RecordMetaData instances

    • param recordMetaDatas Metadata for the records that we want to load from

    • return Multiple sequence record for the given RecordMetaData instances

    • throws IOException If I/O error occurs during loading

    initialize

    Called once at initialization.

    • param conf a configuration for initialization

    • param split the split that defines the range of records to read

    • throws IOException

    • throws InterruptedException

    hasNext

    Whether there are any more records available to read

    • return true if another record is available, false otherwise

    reset

    Reset the reader so that the records can be iterated over again from the start

    nextRecord

    Load the record from the given DataInputStream. Unlike next(), the internal state of the RecordReader is not modified. Implementations of this method should not close the DataInputStream.

    • param uri

    • param dataInputStream

    • throws IOException if error occurs during reading from the input stream

    loadFromMetaData

    Load a single record from the given RecordMetaData instance. Note that for data that isn’t splittable (i.e., text data that needs to be scanned/split), it is more efficient to load multiple records at once using loadFromMetaData(List)

    • param recordMetaData Metadata for the record that we want to load from

    • return Single record for the given RecordMetaData instance

    • throws IOException If I/O error occurs during loading

    loadFromMetaData

    Load multiple records from the given list of RecordMetaData instances

    • param recordMetaDatas Metadata for the records that we want to load from

    • return Multiple records for the given RecordMetaData instances

    • throws IOException If I/O error occurs during loading

    setListeners

    Set the record listeners for this record reader.

    • param listeners

    close

    Closes this stream and releases any system resources associated with it. If the stream is already closed then invoking this method has no effect.

    As noted in AutoCloseable#close(), cases where the close may fail require careful attention. It is strongly advised to relinquish the underlying resources and to internally mark the Closeable as closed, prior to throwing the IOException.

    • throws IOException if an I/O error occurs

    NativeAudioRecordReader

    Native audio file loader using FFmpeg.

    WavFileRecordReader

    Wav file loader

    ImageRecordReader

    Image record reader. Reads a local file system and parses images of a given height and width. All images are rescaled and converted to the given height, width, and number of channels.

    Also appends the label if specified (one-of-k encoding, based on the directory structure where each subdirectory of the root is an indexed label)
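
    A sketch, assuming the datavec-data-image module is on the classpath and images are stored one subdirectory per label under a (hypothetical) parentDir:

    int height = 64, width = 64, channels = 3;
    ParentPathLabelGenerator labelMaker = new ParentPathLabelGenerator();   // label = name of the parent directory

    ImageRecordReader imageReader = new ImageRecordReader(height, width, channels, labelMaker);
    imageReader.initialize(new FileSplit(new File("parentDir")));           // hypothetical image root directory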

    TfidfRecordReader

    TFIDF record reader (wraps a tfidf vectorizer for delivering labels and conforming to the record reader interface)

    • reset: Reset the underlying iterator.

    • hasNext: Iterator method to determine if another record is available.

    public void initialize(InputSplit split) throws IOException, InterruptedException
    public int getCurrentLabel()
    public void initialize(InputSplit split) throws IOException, InterruptedException
    public void initialize(InputSplit split) throws IOException, InterruptedException
    public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedException
    public boolean hasNext()
    public void reset()
    public Record nextRecord()
    public void close() throws IOException
    public void setConf(Configuration conf)
    public Configuration getConf()
    public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedException
    public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedException
    public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedException
    public void setConf(Configuration conf)
    public boolean hasNext()
    public Record nextRecord()
    public void initialize(InputSplit split) throws IOException, InterruptedException
    public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedException
    public boolean hasNext()
    public void reset()
    public Record nextRecord()
    public Record loadFromMetaData(RecordMetaData recordMetaData) throws IOException
    public void setListeners(RecordListener... listeners)
    public void setListeners(Collection<RecordListener> listeners)
    public void close() throws IOException
    public void setConf(Configuration conf)
    public Configuration getConf()
    public void setConf(Configuration conf)
    public Configuration getConf()
    public boolean batchesSupported()
    public SequenceRecord nextSequence()
    public SequenceRecord loadSequenceFromMetaData(RecordMetaData recordMetaData) throws IOException
    public void initialize(InputSplit split) throws IOException, InterruptedException
    public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedException
    public boolean hasNext()
    public void reset()
    public Record nextRecord()
    public Record loadFromMetaData(RecordMetaData recordMetaData) throws IOException
    public void setListeners(RecordListener... listeners)
    public void setListeners(Collection<RecordListener> listeners)
    public void close() throws IOException

    Transforms

    Data wrangling and mapping from one schema to another.

    Data wrangling

    One of the key tools in DataVec is transformations. DataVec helps the user map a dataset from one schema to another, and provides a list of operations to convert types, format data, and convert a 2D dataset to sequence data.

    Building a transform process

    A transform process requires a Schema to successfully transform data. Both the schema and transform process classes come with helper Builder classes that are useful for organizing code and avoiding complex constructors.

    When combined, they look like the sample code below. Note how inputDataSchema is passed into the Builder constructor; the transform process will not compile without it.
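
    A minimal sketch, reusing the inputDataSchema shown earlier on this page (the operations chosen here are purely illustrative):

    TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
        .removeColumns("CustomerID", "MerchantID")               // drop columns not needed downstream
        .categoricalToInteger("MerchantCountryCode")             // replace category values with integer indexes
        .conditionalReplaceValueTransform("TransactionAmountUSD",
            new DoubleWritable(0.0),                             // replacement value...
            new DoubleColumnCondition("TransactionAmountUSD",    // ...used when the amount is negative
                ConditionOp.LessThan, 0.0))
        .build();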

    Executing a transformation

    Different "backends" for executors are available. Using the tp transform process above, here's how you can execute it locally using plain DataVec.
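
    A sketch, assuming the raw data has already been parsed into a List<List<Writable>> called inputData, and that the datavec-local module (which provides LocalTransformExecutor) is on the classpath; for Spark, SparkTransformExecutor in datavec-spark plays the same role:

    List<List<Writable>> transformed = LocalTransformExecutor.execute(inputData, tp);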

    Debugging

    Each operation in a transform process represents a "step" in schema changes. Sometimes, the resulting transformation is not the intended result. You can debug this by printing each step in the transform tp with the following:
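
    A sketch using the getActionList and getSchemaAfterStep methods documented below to print the schema produced by each step of tp:

    int numSteps = tp.getActionList().size();
    for (int i = 0; i < numSteps; i++) {
        System.out.println("--- Schema after step " + i + " ---");
        System.out.println(tp.getSchemaAfterStep(i));
    }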

    Available transformations and conversions

    TransformProcess

    A TransformProcess defines an ordered list of transformations to be executed on some data

    getActionList

    Get the action list that this transform process will execute

    • return

    getSchemaAfterStep

    Return the schema after executing all steps up to and including the specified step. Steps are indexed from 0: so getSchemaAfterStep(0) is after one transform has been executed.

    • param step Index of the step

    • return Schema of the data, after that (and all prior) steps have been executed

    execute

    Execute the full sequence of transformations for a single example. May return null if the example is filtered. NOTE: Some TransformProcess operations cannot be applied to examples individually; most notably, ConvertToSequence and ConvertFromSequence operations require the full data set to be processed at once

    • param input

    • return

    toYaml

    Convert the TransformProcess to a YAML string

    • return TransformProcess, as YAML

    fromJson

    Deserialize a JSON String (created by toJson()) to a TransformProcess

    • return TransformProcess, from JSON

    fromYaml

    Deserialize a YAML String (created by toYaml()) to a TransformProcess

    • return TransformProcess, from YAML

    inferCategories

    Infer the categories for the given record reader for a particular column Note that each “column index” is a column in the context of: List record = ...; record.get(columnIndex);

    Note that anything passed in as a column will be automatically converted to a string for categorical purposes.

    The expected input is strings or numbers (which have sensible toString() representations)

    Note that the returned categories will be sorted alphabetically

    • param recordReader the record reader to iterate through

    • param columnIndex the column index to get categories for

    • return

    filter

    Add a filter operation to be executed after the previously-added operations have been executed

    • param filter Filter operation to execute

    filter

    Add a filter operation, based on the specified condition.

    If the condition is satisfied (returns true): remove the example or sequence. If the condition is not satisfied (returns false): keep the example or sequence

    • param condition Condition to filter on

    removeColumns

    Remove all of the specified columns, by name

    • param columnNames Names of the columns to remove

    removeColumns

    Remove all of the specified columns, by name

    • param columnNames Names of the columns to remove

    removeAllColumnsExceptFor

    Remove all columns, except for those that are specified here

    • param columnNames Names of the columns to keep

    removeAllColumnsExceptFor

    Remove all columns, except for those that are specified here

    • param columnNames Names of the columns to keep

    renameColumn

    Rename a single column

    • param oldName Original column name

    • param newName New column name

    renameColumns

    Rename multiple columns

    • param oldNames List of original column names

    • param newNames List of new column names

    reorderColumns

    Reorder the columns using a partial or complete new ordering. If only some of the column names are specified for the new order, the remaining columns will be placed at the end, according to their current relative ordering

    • param newOrder Names of the columns, in the order they will appear in the output

    duplicateColumn

    Duplicate a single column

    • param column Name of the column to duplicate

    • param newName Name of the new (duplicate) column

    duplicateColumns

    Duplicate a set of columns

    • param columnNames Names of the columns to duplicate

    • param newNames Names of the new (duplicated) columns

    integerMathOp

    Perform a mathematical operation (add, subtract, scalar max etc) on the specified integer column, with a scalar

    • param column The integer column to perform the operation on

    • param mathOp The mathematical operation

    • param scalar The scalar value to use in the mathematical operation

    integerColumnsMathOp

    Calculate and add a new integer column by performing a mathematical operation on a number of existing columns. New column is added to the end.

    • param newColumnName Name of the new/derived column

    • param mathOp Mathematical operation to execute on the columns

    • param columnNames Names of the columns to use in the mathematical operation

    longMathOp

    Perform a mathematical operation (add, subtract, scalar max etc) on the specified long column, with a scalar

    • param columnName The long column to perform the operation on

    • param mathOp The mathematical operation

    • param scalar The scalar value to use in the mathematical operation

    longColumnsMathOp

    Calculate and add a new long column by performing a mathematical operation on a number of existing columns. New column is added to the end.

    • param newColumnName Name of the new/derived column

    • param mathOp Mathematical operation to execute on the columns

    • param columnNames Names of the columns to use in the mathematical operation

    floatMathOp

    Perform a mathematical operation (add, subtract, scalar max etc) on the specified float column, with a scalar

    • param columnName The float column to perform the operation on

    • param mathOp The mathematical operation

    • param scalar The scalar value to use in the mathematical operation

    floatColumnsMathOp

    Calculate and add a new float column by performing a mathematical operation on a number of existing columns. New column is added to the end.

    • param newColumnName Name of the new/derived column

    • param mathOp Mathematical operation to execute on the columns

    • param columnNames Names of the columns to use in the mathematical operation

    floatMathFunction

    Perform a mathematical operation (such as sin(x), ceil(x), exp(x) etc) on a column

    • param columnName Column name to operate on

    • param mathFunction MathFunction to apply to the column

    doubleMathOp

    Perform a mathematical operation (add, subtract, scalar max etc) on the specified double column, with a scalar

    • param columnName The double column to perform the operation on

    • param mathOp The mathematical operation

    • param scalar The scalar value to use in the mathematical operation

    doubleColumnsMathOp

    Calculate and add a new double column by performing a mathematical operation on a number of existing columns. New column is added to the end.

    • param newColumnName Name of the new/derived column

    • param mathOp Mathematical operation to execute on the columns

    • param columnNames Names of the columns to use in the mathematical operation

    doubleMathFunction

    Perform a mathematical operation (such as sin(x), ceil(x), exp(x) etc) on a column

    • param columnName Column name to operate on

    • param mathFunction MathFunction to apply to the column

    timeMathOp

    Perform a mathematical operation (add, subtract, scalar min/max only) on the specified time column

    • param columnName The time column to perform the operation on

    • param mathOp The mathematical operation

    • param timeQuantity The quantity used in the mathematical op

    • param timeUnit The unit that timeQuantity is specified in

    categoricalToOneHot

    Convert the specified column(s) from a categorical representation to a one-hot representation. This involves creating multiple new columns for each input column (one per category).

    • param columnNames Names of the categorical column(s) to convert to a one-hot representation

    categoricalToInteger

    Convert the specified column(s) from a categorical representation to an integer representation. This will replace the specified categorical column(s) with an integer representation, where each integer has a value from 0 to numCategories-1.

    • param columnNames Name of the categorical column(s) to convert to an integer representation

    integerToCategorical

    Convert the specified column from an integer representation (assume values 0 to numCategories-1) to a categorical representation, given the specified state names

    • param columnName Name of the column to convert

    • param categoryStateNames Names of the states for the categorical column

    integerToCategorical

    Convert the specified column from an integer representation to a categorical representation, given the specified mapping between integer indexes and state names

    • param columnName Name of the column to convert

    • param categoryIndexNameMap Names of the states for the categorical column

    integerToOneHot

    Convert an integer column to a set of one-hot columns, based on the value in the integer column

    • param columnName Name of the integer column

    • param minValue Minimum value possible for the integer column (inclusive)

    • param maxValue Maximum value possible for the integer column (inclusive)

    addConstantColumn

    Add a new column, where all values in the column are identical and as specified.

    • param newColumnName Name of the new column

    • param newColumnType Type of the new column

    • param fixedValue Value in the new column for all records

    addConstantDoubleColumn

    Add a new double column, where the value for that column (for all records) is identical

    • param newColumnName Name of the new column

    • param value Value in the new column for all records

    addConstantIntegerColumn

    Add a new integer column, where the value for that column (for all records) is identical

    • param newColumnName Name of the new column

    • param value Value of the new column for all records

    addConstantLongColumn

    Add a new long column, where the value for that column (for all records) is identical

    • param newColumnName Name of the new column

    • param value Value in the new column for all records

    convertToString

    Convert the specified column to a string.

    • param inputColumn the input column to convert

    • return builder pattern

    convertToDouble

    Convert the specified column to a double.

    • param inputColumn the input column to convert

    • return builder pattern

    convertToInteger

    Convert the specified column to an integer.

    • param inputColumn the input column to convert

    • return builder pattern

    normalize

    Normalize the specified column with a given type of normalization

    • param column Column to normalize

    • param type Type of normalization to apply

    • param da DataAnalysis object

    convertToSequence

    Convert a set of independent records/examples into a sequence, according to some key. Within each sequence, values are ordered using the provided SequenceComparator

    • param keyColumn Column to use as a key (values with the same key will be combined into sequences)

    • param comparator A SequenceComparator to order the values within each sequence (for example, by time or String order)

    convertToSequence

    Convert a set of independent records/examples into a sequence; each example is simply treated as a sequence of length 1, without any join/group operations. Note that more commonly, joining/grouping is required; use convertToSequence(List, SequenceComparator) for this functionality

    convertToSequence

    Convert a set of independent records/examples into a sequence, where each sequence is grouped according to one or more key values (i.e., the values in one or more columns). Within each sequence, values are ordered using the provided SequenceComparator

    • param keyColumns Column to use as a key (values with the same key will be combined into sequences)

    • param comparator A SequenceComparator to order the values within each sequence (for example, by time or String order)
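
    A sketch of grouping independent records into sequences, assuming a schema with a “customerID” key column and a numeric “time” column (NumericalColumnComparator orders the steps within each sequence by that column):

    TransformProcess tp = new TransformProcess.Builder(schema)
        .convertToSequence("customerID", new NumericalColumnComparator("time"))
        .build();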

    convertFromSequence

    Convert a sequence to a set of individual values (by treating each value in each sequence as a separate example)

    splitSequence

    Split sequences into 1 or more other sequences. Used for example to split large sequences into a set of smaller sequences

    • param split SequenceSplit that defines how splits will occur

    trimSequence

    SequenceTrimTransform removes the first or last N values in a sequence. Note that the resulting sequence may be of length 0, if the input sequence length is less than or equal to N.

    • param numStepsToTrim Number of time steps to trim from the sequence

    • param trimFromStart If true: Trim values from the start of the sequence. If false: trim values from the end.

    offsetSequence

    Perform a sequence offset operation on the specified columns. Note that this also truncates sequences by the specified offset amount by default. Use transform(new SequenceOffsetTransform(…)) to change this. See SequenceOffsetTransform for details on exactly what this operation does and how.

    • param columnsToOffset Columns to offset

    • param offsetAmount Amount to offset the specified columns by (positive offset: ‘columnsToOffset’ are moved to later time steps)

    • param operationType Whether the offset should be done in-place or by adding a new column

    reduce

    Reduce (i.e., aggregate/combine) a set of examples (typically by key). Note: In the current implementation, reduction operations can be performed only on standard (i.e., non-sequence) data

    • param reducer Reducer to use

    reduceSequence

    Reduce (i.e., aggregate/combine) a set of sequence examples - for each sequence individually. Note: This method results in non-sequence data. If you would instead prefer sequences of length 1 after the reduction, use transform(new ReduceSequenceTransform(reducer)).

    • param reducer Reducer to use to reduce each window

    reduceSequenceByWindow

    Reduce (i.e., aggregate/combine) a set of sequence examples - for each sequence individually - using a window function. For example, take all records/examples in each 24-hour period (i.e., using a window function), and convert them into a single value (using the reducer). In this example, the output is a sequence, with a time period of 24 hours.

    • param reducer Reducer to use to reduce each window

    • param windowFunction Window function to apply on each sequence individually

    sequenceMovingWindowReduce

    SequenceMovingWindowReduceTransform: Adds a new column, where the value is derived by (a) using a window of the last N values in a single column, and (b) applying a reduction op on that window to calculate a new value. For example, this transformer can be used to implement a simple moving average of the last N values, or to determine the minimum or maximum value in the last N time steps.

    For example, for a simple moving average over a window of length 20: new SequenceMovingWindowReduceTransform("myCol", 20, ReduceOp.Mean)

    • param columnName Column name to perform windowing on

    • param lookback Look back period for windowing

    • param op Reduction operation to perform on each window

    calculateSortedRank

    CalculateSortedRank: calculate the rank of each example, after sorting the examples. For example, we might have some numerical “score” column, and we want to know the rank (sort order) of each example according to that column. The rank of each example (after sorting) will be added in a new Long column. Indexing is done from 0; examples will have values 0 to dataSetSize-1.

    Currently, CalculateSortedRank can only be applied on standard (i.e., non-sequence) data. Furthermore, the current implementation can only sort on one column.

    • param newColumnName Name of the new column (will contain the rank for each example)

    • param sortOnColumn Column to sort on

    • param comparator Comparator used to sort examples

    calculateSortedRank

    CalculateSortedRank: calculate the rank of each example, after sorting the examples. For example, we might have some numerical “score” column, and we want to know the rank (sort order) of each example according to that column. The rank of each example (after sorting) will be added in a new Long column. Indexing is done from 0; examples will have values 0 to dataSetSize-1.

    Currently, CalculateSortedRank can only be applied on standard (i.e., non-sequence) data. Furthermore, the current implementation can only sort on one column.

    • param newColumnName Name of the new column (will contain the rank for each example)

    • param sortOnColumn Column to sort on

    • param comparator Comparator used to sort examples

    • param ascending If true: sort ascending. False: descending

    stringToCategorical

    Convert the specified String column to a categorical column. The state names must be provided.

    • param columnName Name of the String column to convert to categorical

    • param stateNames State names of the category

    stringRemoveWhitespaceTransform

    Remove all whitespace characters from the values in the specified String column

    • param columnName Name of the column to remove whitespace from

    stringMapTransform

    Replace one or more String values in the specified column with new values.

    Keys in the map are the original values; the values in the map are their replacements. If a String appears in the data but does not appear in the provided map (as a key), that String value will not be modified.

    • param columnName Name of the column in which to do replacement

    • param mapping Map of oldValues -> newValues

    stringToTimeTransform

    Convert a String column (containing a date/time String) to a time column (by parsing the date/time String)

    • param column String column containing the date/time Strings

    • param format Format of the strings. The time format is specified as per Joda-Time’s DateTimeFormat

    • param dateTimeZone Timezone of the column
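
    For example, a sketch converting the “DateTimeString” column from the schema shown earlier on this page into an epoch-millisecond time column (the format pattern is illustrative):

    TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
        .stringToTimeTransform("DateTimeString", "YYYY-MM-dd HH:mm:ss", DateTimeZone.UTC)
        .build();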

    stringToTimeTransform

    Convert a String column (containing a date/time String) to a time column (by parsing the date/time String)

    • param column String column containing the date/time Strings

    • param format Format of the strings. The time format is specified as per Joda-Time’s DateTimeFormat

    • param dateTimeZone Timezone of the column

    • param locale Locale of the column

    appendStringColumnTransform

    Append a String to a specified column

    • param column Column to append the value to

    • param toAppend String to append to the end of each writable

    conditionalReplaceValueTransform

    Replace the values in a specified column with a specified new value, if some condition holds. If the condition does not hold, the original values are not modified.

    • param column Column to operate on

    • param newValue Value to use as replacement, if condition is satisfied

    • param condition Condition that must be satisfied for replacement

    conditionalReplaceValueTransformWithDefault

    Replace the values in a specified column with a specified “yes” value, if some condition holds. Replace it with a “no” value, otherwise.

    • param column Column to operate on

    • param yesVal Value to use as replacement, if condition is satisfied

    • param noVal Value to use as replacement, if condition is not satisfied

    • param condition Condition that must be satisfied for replacement

    conditionalCopyValueTransform

    Replace the value in a specified column with a new value taken from another column, if a condition is satisfied/true. Note that the condition can be any generic condition, including on other column(s), different to the column that will be modified if the condition is satisfied/true.

    • param columnToReplace Name of the column in which values will be replaced (if condition is satisfied)

    • param sourceColumn Name of the column from which the new values will be taken

    • param condition Condition to use

    replaceStringTransform

    Replace one or more String values in the specified column that match regular expressions.

    Keys in the map are the regular expressions; the Values in the map are their String replacements. For example:
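
    An illustrative sketch (the column name and patterns are hypothetical): replace any run of whitespace with a single underscore, and strip a trailing “%” sign:

    Map<String, String> mapping = new HashMap<>();
    mapping.put("\\s+", "_");    // any run of whitespace -> a single underscore
    mapping.put("%$", "");       // remove a trailing percent sign

    TransformProcess tp = new TransformProcess.Builder(schema)
        .replaceStringTransform("myStringColumn", mapping)
        .build();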

    • param columnName Name of the column in which to do replacement

    • param mapping Map of old values or regular expression to new values

    ndArrayScalarOpTransform

    Element-wise NDArray math operation (add, subtract, etc) on an NDArray column

    • param columnName Name of the NDArray column to perform the operation on

    • param op Operation to perform

    • param value Value for the operation

    ndArrayColumnsMathOpTransform

    Perform an element wise mathematical operation (such as add, subtract, multiply) on NDArray columns. The existing columns are unchanged, a new NDArray column is added

    • param newColumnName Name of the new NDArray column

    • param mathOp Operation to perform

    • param columnNames Name of the columns used as input to the operation

    ndArrayMathFunctionTransform

    Apply an element wise mathematical function (sin, tanh, abs etc) to an NDArray column. This operation is performed in place.

    • param columnName Name of the column to perform the operation on

    • param mathFunction Mathematical function to apply

    ndArrayDistanceTransform

    Calculate a distance (cosine similarity, Euclidean, Manhattan) on two equal-sized NDArray columns. This operation adds a new Double column (with the specified name) with the result.

    • param newColumnName Name of the new column (result) to add

    • param distance Distance to apply

    • param firstCol first column to use in the distance calculation

    • param secondCol second column to use in the distance calculation

    firstDigitTransform

    FirstDigitTransform converts a column to a categorical column, with values being the first digit of the number. For example, “3.1415” becomes “3” and “2.0” becomes “2”. Negative numbers ignore the sign: “-7.123” becomes “7”. Note that two FirstDigitTransform.Mode values are supported, which determine how non-numerical entries should be handled: EXCEPTION_ON_INVALID: output has 10 category values (“0”, …, “9”), and any non-numerical values result in an exception. INCLUDE_OTHER_CATEGORY: output has 11 category values (“0”, …, “9”, “Other”), and all non-numerical values are mapped to “Other”.

    FirstDigitTransform is useful (combined with CategoricalToOneHotTransform and Reductions) to implement Benford’s law analysis.

    • param inputColumn Input column name

    • param outputColumn Output column name. If same as input, input column is replaced

    firstDigitTransform

    FirstDigitTransform converts a column to a categorical column, with values being the first digit of the number. For example, “3.1415” becomes “3” and “2.0” becomes “2”. Negative numbers ignore the sign: “-7.123” becomes “7”. Note that two FirstDigitTransform.Mode values are supported, which determine how non-numerical entries should be handled: EXCEPTION_ON_INVALID: output has 10 category values (“0”, …, “9”), and any non-numerical values result in an exception. INCLUDE_OTHER_CATEGORY: output has 11 category values (“0”, …, “9”, “Other”), and all non-numerical values are mapped to “Other”.

    FirstDigitTransform is useful (combined with CategoricalToOneHotTransform and Reductions) to implement Benford’s law analysis.

    • param inputColumn Input column name

    • param outputColumn Output column name. If same as input, input column is replaced

    • param mode See FirstDigitTransform.Mode

    build

    Create the TransformProcess object

    CategoricalToIntegerTransform

    Convert a categorical column to an integer column, where each integer represents one of the categorical states.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input

    • return the output column names

    CategoricalToOneHotTransform

    Convert a categorical column to a set of one-hot columns, one per categorical state.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input

    • return the output column names

    IntegerToCategoricalTransform

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    PivotTransform

    Pivot transform operates on two columns:

    • a categorical column that operates as a key, and

    • another column that contains a value

    Essentially, the Pivot transform takes key/value pairs and breaks them out into separate columns.

    For example, with schema [col0, key, value, col3] and values with key in {a,b,c}, the output schema is [col0, key[a], key[b], key[c], col3], and input (col0Val, b, x, col3Val) gets mapped to (col0Val, 0, x, 0, col3Val).

    When expanding columns, a default value is used - for example 0 for numerical columns.

    transform

    • param keyColumnName Key column to expand

    • param valueColumnName Name of the column that contains the value

    StringToCategoricalTransform

    Convert a String column to a categorical column

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    AddConstantColumnTransform

    Add a new column, where the values in that column for all records are identical (according to the specified value)

    DuplicateColumnsTransform

    Duplicate one or more columns. The duplicated columns are placed immediately after the original columns

    transform

    • param columnsToDuplicate List of columns to duplicate

    • param newColumnNames List of names for the new (duplicate) columns

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    RemoveAllColumnsExceptForTransform

    Transform that removes all columns except for those that are explicitly specified as ones to keep. To specify only the columns to remove, use RemoveColumnsTransform instead.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    RemoveColumnsTransform

    Remove the specified columns from the data. To specify only the columns to keep, use RemoveAllColumnsExceptForTransform instead.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    RenameColumnsTransform

    Rename one or more columns

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    ReorderColumnsTransform

    Rearrange the order of the columns. Note: A partial list of columns can be used here. Any columns that are not explicitly mentioned will be placed after those that are in the output, without changing their relative order.

    transform

    • param newOrder A partial or complete order of the columns in the output

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    ConditionalCopyValueTransform

    Replace the value in a specified column with a new value taken from another column, if a condition is satisfied/true. Note that the condition can be any generic condition, including conditions on other column(s) different from the column that will be modified if the condition is satisfied/true.

    Note: For sequences, this transform uses the convention that each step in the sequence is passed to the condition, and replaced (or not) separately (i.e., Condition.condition(List) is used on each time step individually).

    transform

    • param columnToReplace Name of the column in which to replace the old value

    • param sourceColumn Name of the column to get the new value from

    • param condition Condition

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    ConditionalReplaceValueTransform

    Replace the value in a specified column with a new value, if a condition is satisfied/true. Note that the condition can be any generic condition, including conditions on other column(s) different from the column that will be modified if the condition is satisfied/true.

    Note: For sequences, this transform uses the convention that each step in the sequence is passed to the condition, and replaced (or not) separately (i.e., Condition.condition(List) is used on each time step individually).

    transform

    • param columnToReplace Name of the column in which to replace the old value with ‘newValue’, if the condition holds

    • param newValue New value to use

    • param condition Condition

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    ConditionalReplaceValueTransformWithDefault

    Replace the value in a specified column with a ‘yes’ value, if a condition is satisfied/true; replace the value of this same column with a ‘no’ value otherwise. Note that the condition can be any generic condition, including conditions on other column(s) different from the column that will be modified if the condition is satisfied/true.

    Note: For sequences, this transform uses the convention that each step in the sequence is passed to the condition, and replaced (or not) separately (i.e., Condition.condition(List) is used on each time step individually).

    ConvertToDouble

    Convert any value to a Double.

    map

    • param column Name of the column to convert to a Double column

    DoubleColumnsMathOpTransform

    Add a new double column, calculated from one or more other columns. A new column (with the specified name) is added as the final column of the output. No other columns are modified. For example, if newColumnName==”newCol”, mathOp==Add, and columns=={“col1”,”col2”}, then the output column with name “newCol” has value col1+col2.
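
    A minimal sketch using the corresponding TransformProcess builder method (the schema and column names are hypothetical):

    import org.datavec.api.transform.MathOp;
    import org.datavec.api.transform.TransformProcess;
    import org.datavec.api.transform.schema.Schema;

    Schema schema = new Schema.Builder()
        .addColumnDouble("col1")                          //hypothetical input columns
        .addColumnDouble("col2")
        .build();

    TransformProcess tp = new TransformProcess.Builder(schema)
        //Appends a new double column "newCol" with value col1 + col2
        .doubleColumnsMathOp("newCol", MathOp.Add, "col1", "col2")
        .build();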

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    DoubleMathFunctionTransform

    A simple transform to do common mathematical operations, such as sin(x), ceil(x), etc.

    DoubleMathOpTransform

    Double mathematical operation. This is an in-place operation of the double column value and a double scalar.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    Log2Normalizer

    Normalize by taking scale * log2((in - columnMin)/(mean - columnMin) + 1). Maps values in the range (columnMin to infinity) to (0 to infinity). Most suitable for values with a geometric/negative exponential type distribution.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    MinMaxNormalizer

    Normalizer to map (min to max) -> (newMin to newMax) linearly.

    Mathematically: out = (x - min) * (newMax - newMin) / (max - min) + newMin
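
    Min-max normalization is typically applied through the TransformProcess builder’s normalize method, using the statistics from a previously computed DataAnalysis. A minimal sketch, assuming such an analysis is available (the Normalize enum’s package path is assumed):

    import org.datavec.api.transform.TransformProcess;
    import org.datavec.api.transform.analysis.DataAnalysis;
    import org.datavec.api.transform.schema.Schema;
    import org.datavec.api.transform.transform.normalize.Normalize;   //package path assumed

    Schema schema = new Schema.Builder()
        .addColumnDouble("feature")                       //hypothetical column
        .build();

    DataAnalysis dataAnalysis = null;                     //placeholder: obtain via AnalyzeLocal or AnalyzeSpark

    TransformProcess tp = new TransformProcess.Builder(schema)
        //Maps "feature" from its observed (min, max) range to the new range linearly
        .normalize("feature", Normalize.MinMax, dataAnalysis)
        .build();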

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    StandardizeNormalizer

    Normalize using (x-mean)/stdev. Also known as a standard score, standardization etc.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    SubtractMeanNormalizer

    Normalize by subtracting the mean

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    ConvertToInteger

    Convert any value to an Integer.

    map

    • param column Name of the column to convert to an integer

    IntegerColumnsMathOpTransform

    Add a new integer column, calculated from one or more other columns. A new column (with the specified name) is added as the final column of the output. No other columns are modified. For example, if newColumnName==”newCol”, mathOp==MathOp.Add, and columns=={“col1”,”col2”}, then the output column with name “newCol” has value col1+col2. NOTE: Division here uses integer division; use DoubleColumnsMathOpTransform if a decimal output value is required.

    toString

    • param newColumnName Name of the new column (output column)

    • param mathOp Mathematical operation. Only Add/Subtract/Multiply/Divide/Modulus are allowed here

    • param columns Columns to use in the mathematical operation

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    IntegerMathOpTransform

    Integer mathematical operation. This is an in-place operation of the integer column value and an integer scalar.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    IntegerToOneHotTransform

    Convert an integer column to a set of one-hot columns.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    ReplaceEmptyIntegerWithValueTransform

    Replace an empty/missing integer with a certain value.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    ReplaceInvalidWithIntegerTransform

    Replace an invalid (non-integer) value in a column with a specified integer

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    LongColumnsMathOpTransform

    Add a new long column, calculated from one or more other columns. A new column (with the specified name) is added as the final column of the output. No other columns are modified. For example, if newColumnName==”newCol”, mathOp==MathOp.Add, and columns=={“col1”,”col2”}, then the output column with name “newCol” has value col1+col2. NOTE: Division here uses long (integer) division; use DoubleColumnsMathOpTransform if a decimal output value is required.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    LongMathOpTransform

    Long mathematical operation. This is an in-place operation of the long column value and a long scalar.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    TextToCharacterIndexTransform

    Convert each text value in a sequence to a longer sequence of integer indices. For example, “abc” would be converted to [1, 2, 3]. Values in other columns will be duplicated.

    TextToTermIndexSequenceTransform

    Convert each text value in a sequence to a longer sequence of integer indices. For example, “zero one two” would be converted to [0, 1, 2]. Values in other columns will be duplicated.

    SequenceDifferenceTransform

    SequenceDifferenceTransform: for an input sequence, calculate the difference on one column. For each time t, calculate someColumn(t) - someColumn(t-s), where s >= 1 is the ‘lookback’ period.

    Note: at t=0 (i.e., the first step in a sequence; or more generally, for all times t < s), there is no previous value to subtract. Two modes are available for handling these time steps:

    1. Default: output = someColumn(t) - someColumn(max(t-s, 0))

    2. SpecifiedValue: output = someColumn(t) - someColumn(t-s) if t-s >= 0, or a custom Writable object (for example, a DoubleWritable(0) or NullWritable).

    Note: this is an in-place operation: i.e., the values in each column are modified. If the original values are required, first duplicate the column and apply the difference operation in-place on the copy.

    outputColumnName

    Create a SequenceDifferenceTransform with default lookback of 1, and using FirstStepMode.Default. Output column name is the same as the input column name.

    • param columnName Name of the column to perform the operation on.

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    SequenceMovingWindowReduceTransform

    SequenceMovingWindowReduceTransform adds a new column, where the value is derived by (a) taking a window of the last N values in a single column, and (b) applying a reduction op on the window to calculate a new value. For example, this transform can be used to implement a simple moving average of the last N values, or to determine the minimum or maximum value in the last N time steps.
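
    A minimal sketch of a 5-step moving average using the corresponding builder method (the schema and column name are hypothetical):

    import org.datavec.api.transform.ReduceOp;
    import org.datavec.api.transform.TransformProcess;
    import org.datavec.api.transform.schema.Schema;

    Schema schema = new Schema.Builder()
        .addColumnDouble("reading")                       //hypothetical column
        .build();

    TransformProcess tp = new TransformProcess.Builder(schema)
        .convertToSequence()                              //treat the records as a sequence
        //Adds a new column holding the mean of the last 5 values of "reading"
        .sequenceMovingWindowReduce("reading", 5, ReduceOp.Mean)
        .build();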

    defaultOutputColumnName

    Enumeration to specify how edge cases are handled. For example, for a lookback period of 20, how should the first 19 output values be calculated? Default: perform the reduction as normal, with as many values as are available. SpecifiedValue: use the given/specified value instead of the actual output value; for example, you could assign values of 0 or NullWritable to positions 0 through 18 of the output.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    SequenceOffsetTransform

    Sequence offset transform takes a sequence and shifts the values in one or more columns by a specified number of time steps. It has 2 modes of operation (OperationType enum) with respect to the columns it operates on: InPlace (operations are performed in-place, modifying the values in the specified columns) and NewColumn (operations produce new columns, with the original (source) columns remaining unmodified).

    Additionally, there are 2 modes for handling values outside the original sequence (EdgeHandling enum): TrimSequence (the entire sequence is trimmed at the start or end by the specified number of steps) and SpecifiedValue (any values outside of the original sequence are given a specified value).

    Note 1: When specifying offsets, they are interpreted as follows: positive offsets move the values in the specified columns to a later time. Earlier time steps are either trimmed or given the specified value; the last values in these columns will be truncated/removed.

    Note 2: Care must be taken when using TrimSequence: for example, if we chain multiple sequence offset transforms on the one dataset, we may end up trimming much more than we want. In this case, it may be better to use SpecifiedValue, at the end.
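
    A minimal sketch of offsetting a single column into a new column via the builder’s offsetSequence method (the SequenceOffsetTransform import path and the column name are assumptions):

    import java.util.Collections;

    import org.datavec.api.transform.TransformProcess;
    import org.datavec.api.transform.schema.Schema;
    import org.datavec.api.transform.sequence.SequenceOffsetTransform;   //package path assumed

    Schema schema = new Schema.Builder()
        .addColumnDouble("sensorValue")                   //hypothetical column
        .build();

    TransformProcess tp = new TransformProcess.Builder(schema)
        .convertToSequence()
        //Shift "sensorValue" one step later in time, writing the result to a new column
        .offsetSequence(Collections.singletonList("sensorValue"), 1,
                SequenceOffsetTransform.OperationType.NewColumn)
        .build();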

    AppendStringColumnTransform

    Append a String to the values in a single column

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    ChangeCaseStringTransform

    Change the case (e.g., to all lower case) of a String column.

    ConcatenateStringColumns

    Concatenate the values of one or more String columns into a new String column. The constituent String columns are retained, so the user must remove them manually if desired.

    TODO: use new String Reduce functionality in DataVec?

    transform

    • param columnsToConcatenate Names of the String columns to concatenate

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    ConvertToString

    Convert any value to a string.

    map

    Transform the writable into a string

    • param writable the writable to transform

    • return the string form of this writable

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    MapAllStringsExceptListTransform

    This method maps all String values, except those in the specified list, to a single String value.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    RemoveWhiteSpaceTransform

    String transform that removes all whitespace characters

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    ReplaceEmptyStringTransform

    Replace empty String values with the specified String

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    ReplaceStringTransform

    Replaces String values that match regular expressions (see the usage sketch after the parameter list below).

    map

    Constructs a new ReplaceStringTransform using the specified map of regular expressions to replacement values.

    • param columnName Name of the column

    • param map Key: regular expression; Value: replacement value
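
    A minimal sketch using the builder’s replaceStringTransform method; the column name is hypothetical, and the map entries mirror the example table shown later on this page (note that backslashes must be escaped in Java string literals):

    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.datavec.api.transform.TransformProcess;
    import org.datavec.api.transform.schema.Schema;

    Schema schema = new Schema.Builder()
        .addColumnString("text")                          //hypothetical column
        .build();

    Map<String, String> replacements = new LinkedHashMap<>();
    replacements.put("_", "");                            //"Data_Vec" -> "DataVec"
    replacements.put("^\\s+|\\s+$", "");                  //"  4.25 "  -> "4.25" (strip leading/trailing whitespace)

    TransformProcess tp = new TransformProcess.Builder(schema)
        .replaceStringTransform("text", replacements)
        .build();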

    StringListToCategoricalSetTransform

    Convert a delimited String to a list of binary categorical columns. Suppose the possible String values were {“a”,”b”,”c”,”d”} and the String column value to be converted contained the String “a,c”; then the 4 output columns would have values [“true”,”false”,”true”,”false”] (see the usage sketch after the parameter list below).

    transform

    • param columnName The name of the column to convert

    • param newColumnNames The names of the new columns to create

    • param categoryTokens The possible tokens that may be present. Note this list must have the same length and order as the newColumnNames list

    • param delimiter The delimiter for the Strings to convert
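
    A minimal sketch matching the example above; the import path and the column/category names are assumptions:

    import java.util.Arrays;

    import org.datavec.api.transform.TransformProcess;
    import org.datavec.api.transform.schema.Schema;
    import org.datavec.api.transform.transform.string.StringListToCategoricalSetTransform;   //package path assumed

    Schema schema = new Schema.Builder()
        .addColumnString("tags")                          //hypothetical column holding values such as "a,c"
        .build();

    TransformProcess tp = new TransformProcess.Builder(schema)
        //"a,c" -> [true, false, true, false] across the four new columns
        .transform(new StringListToCategoricalSetTransform(
                "tags",
                Arrays.asList("hasA", "hasB", "hasC", "hasD"),
                Arrays.asList("a", "b", "c", "d"),
                ","))
        .build();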

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    StringListToCountsNDArrayTransform

    Converts a String column into a bag-of-words (BOW) representation: an NDArray of “counts.” Note that the original column is removed in the process.

    transform

    • param columnName The name of the column to convert

    • param vocabulary The possible tokens that may be present.

    • param delimiter The delimiter for the Strings to convert

    • param ignoreUnknown Whether to ignore unknown tokens

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    outputColumnName

    The output column name after the operation has been applied

    • return the output column name

    columnName

    The output column names. This will often be the same as the input column names

    • return the output column names

    StringListToIndicesNDArrayTransform

    Converts a String column into a sparse bag-of-words (BOW) represented as an NDArray of indices. Appropriate for embeddings, or as efficient storage before being expanded into a dense array.

    StringMapTransform

    A simple String -> String map function.

    Keys in the map are the original values; the values in the map are their replacements. If a String appears in the data but does not appear in the provided map (as a key), that String value will not be modified.

    map

    • param columnName Name of the column

    • param map Key: From. Value: To

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    DeriveColumnsFromTimeTransform

    Create a number of new columns by deriving their values from a Time column. Can be used for example to create new columns with the year, month, day, hour, minute, second etc.

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    mapSequence

    Transform a sequence

    • param sequence

    toString

    The output column name after the operation has been applied

    • return the output column name

    StringToTimeTransform

    Convert a String column to a time column by parsing the date/time String using Joda-Time.

    Time format is specified as per the Joda-Time DateTimeFormat documentation: http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html

    getNewColumnMetaData

    Instantiate this without a time format specified. If this constructor is used, the transform will be allowed to handle several common formats, as defined in the static formats array.

    • param columnName Name of the String column

    • param timeZone Timezone for time parsing

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    TimeMathOpTransform

    Transform math op on a time column

    Note: only the following MathOps are supported: Add, Subtract, ScalarMin, ScalarMax For ScalarMin/Max, the TimeUnit must be milliseconds - i.e., value must be in epoch millisecond format
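
    A minimal sketch using the builder’s timeMathOp method to shift a time column forward by one hour (the column name is hypothetical):

    import java.util.concurrent.TimeUnit;

    import org.datavec.api.transform.MathOp;
    import org.datavec.api.transform.TransformProcess;
    import org.datavec.api.transform.schema.Schema;
    import org.joda.time.DateTimeZone;

    Schema schema = new Schema.Builder()
        .addColumnTime("timestamp", DateTimeZone.UTC)     //hypothetical time column (epoch milliseconds)
        .build();

    TransformProcess tp = new TransformProcess.Builder(schema)
        //Adds 1 hour to every value in "timestamp"
        .timeMathOp("timestamp", MathOp.Add, 1, TimeUnit.HOURS)
        .build();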

    map

    Transform an object into another object

    • param input the record to transform

    • return the transformed writable

    Example replacements for ReplaceStringTransform (Original / Regex / Replacement / Result):

    • Original “Data_Vec”, regex “_”, replacement “” → result “DataVec”

    • Original “B1C2T3”, regex “\d”, replacement “one” → result “BoneConeTone”

    • Original “'  4.25 '”, regex “^\s+|\s+$”, replacement “” → result “'4.25'”


    import java.util.Arrays;
    import java.util.HashSet;

    import org.datavec.api.transform.TransformProcess;
    import org.datavec.api.transform.condition.ConditionOp;
    import org.datavec.api.transform.condition.column.CategoricalColumnCondition;
    import org.datavec.api.transform.condition.column.DoubleColumnCondition;
    import org.datavec.api.transform.filter.ConditionFilter;
    import org.datavec.api.transform.transform.time.DeriveColumnsFromTimeTransform;
    import org.datavec.api.writable.DoubleWritable;
    import org.joda.time.DateTimeFieldType;
    import org.joda.time.DateTimeZone;

    //inputDataSchema is the Schema describing the raw input data (defined elsewhere)
    TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
        .removeColumns("CustomerID","MerchantID")
        .filter(new ConditionFilter(new CategoricalColumnCondition("MerchantCountryCode", ConditionOp.NotInSet, new HashSet<>(Arrays.asList("USA","CAN")))))
        .conditionalReplaceValueTransform(
            "TransactionAmountUSD",     //Column to operate on
            new DoubleWritable(0.0),    //New value to use, when the condition is satisfied
            new DoubleColumnCondition("TransactionAmountUSD",ConditionOp.LessThan, 0.0)) //Condition: amount < 0.0
        .stringToTimeTransform("DateTimeString","YYYY-MM-DD HH:mm:ss.SSS", DateTimeZone.UTC)
        .renameColumn("DateTimeString", "DateTime")
        .transform(new DeriveColumnsFromTimeTransform.Builder("DateTime").addIntegerDerivedColumn("HourOfDay", DateTimeFieldType.hourOfDay()).build())
        .removeColumns("DateTime")
        .build();
    import java.util.List;

    import org.datavec.api.writable.Writable;
    import org.datavec.local.transforms.LocalTransformExecutor;

    //originalData is a List<List<Writable>> of parsed input records (defined elsewhere)
    List<List<Writable>> processedData = LocalTransformExecutor.execute(originalData, tp);

    //Print the schema after each step of the TransformProcess:
    int numActions = tp.getActionList().size();

    for (int i = 0; i < numActions; i++) {
        System.out.println("\n\n==================================================");
        System.out.println("-- Schema after step " + i + " (" + tp.getActionList().get(i) + ") --");

        System.out.println(tp.getSchemaAfterStep(i));
    }
    public Schema getFinalSchema()
    public Schema getSchemaAfterStep(int step)
    public String toJson()
    public String toYaml()
    public static TransformProcess fromJson(String json)
    public static TransformProcess fromYaml(String yaml)
    public Builder transform(Transform transform)
    public Builder filter(Filter filter)
    public Builder filter(Condition condition)
    public Builder removeColumns(String... columnNames)
    public Builder removeColumns(Collection<String> columnNames)
    public Builder removeAllColumnsExceptFor(String... columnNames)
    public Builder removeAllColumnsExceptFor(Collection<String> columnNames)
    public Builder renameColumn(String oldName, String newName)
    public Builder renameColumns(List<String> oldNames, List<String> newNames)
    public Builder reorderColumns(String... newOrder)
    public Builder duplicateColumn(String column, String newName)
    public Builder duplicateColumns(List<String> columnNames, List<String> newNames)
    public Builder integerMathOp(String column, MathOp mathOp, int scalar)
    public Builder integerColumnsMathOp(String newColumnName, MathOp mathOp, String... columnNames)
    public Builder longMathOp(String columnName, MathOp mathOp, long scalar)
    public Builder longColumnsMathOp(String newColumnName, MathOp mathOp, String... columnNames)
    public Builder floatMathOp(String columnName, MathOp mathOp, float scalar)
    public Builder floatColumnsMathOp(String newColumnName, MathOp mathOp, String... columnNames)
    public Builder floatMathFunction(String columnName, MathFunction mathFunction)
    public Builder doubleMathOp(String columnName, MathOp mathOp, double scalar)
    public Builder doubleColumnsMathOp(String newColumnName, MathOp mathOp, String... columnNames)
    public Builder doubleMathFunction(String columnName, MathFunction mathFunction)
    public Builder timeMathOp(String columnName, MathOp mathOp, long timeQuantity, TimeUnit timeUnit)
    public Builder categoricalToOneHot(String... columnNames)
    public Builder categoricalToInteger(String... columnNames)
    public Builder integerToCategorical(String columnName, List<String> categoryStateNames)
    public Builder integerToCategorical(String columnName, Map<Integer, String> categoryIndexNameMap)
    public Builder integerToOneHot(String columnName, int minValue, int maxValue)
    public Builder addConstantColumn(String newColumnName, ColumnType newColumnType, Writable fixedValue)
    public Builder addConstantDoubleColumn(String newColumnName, double value)
    public Builder addConstantIntegerColumn(String newColumnName, int value)
    public Builder addConstantLongColumn(String newColumnName, long value)
    public Builder convertToString(String inputColumn)
    public Builder convertToDouble(String inputColumn)
    public Builder convertToInteger(String inputColumn)
    public Builder normalize(String column, Normalize type, DataAnalysis da)
    public Builder convertToSequence(String keyColumn, SequenceComparator comparator)
    public Builder convertToSequence()
    public Builder convertToSequence(List<String> keyColumns, SequenceComparator comparator)
    public Builder convertFromSequence()
    public Builder splitSequence(SequenceSplit split)
    public Builder trimSequence(int numStepsToTrim, boolean trimFromStart)
    public Builder offsetSequence(List<String> columnsToOffset, int offsetAmount, SequenceOffsetTransform.OperationType operationType)
    public Builder reduce(IAssociativeReducer reducer)
    public Builder reduceSequence(IAssociativeReducer reducer)
    public Builder reduceSequenceByWindow(IAssociativeReducer reducer, WindowFunction windowFunction)
    public Builder sequenceMovingWindowReduce(String columnName, int lookback, ReduceOp op)
    public Builder calculateSortedRank(String newColumnName, String sortOnColumn, WritableComparator comparator)
    public Builder calculateSortedRank(String newColumnName, String sortOnColumn, WritableComparator comparator, boolean ascending)
    public Builder stringToCategorical(String columnName, List<String> stateNames)
    public Builder stringRemoveWhitespaceTransform(String columnName)
    public Builder stringMapTransform(String columnName, Map<String, String> mapping)
    public Builder stringToTimeTransform(String column, String format, DateTimeZone dateTimeZone)
    public Builder stringToTimeTransform(String column, String format, DateTimeZone dateTimeZone, Locale locale)
    public Builder appendStringColumnTransform(String column, String toAppend)
    public Builder conditionalReplaceValueTransform(String column, Writable newValue, Condition condition)
    public Builder conditionalReplaceValueTransformWithDefault(String column, Writable yesVal, Writable noVal, Condition condition)
    public Builder conditionalCopyValueTransform(String columnToReplace, String sourceColumn, Condition condition)
    public Builder replaceStringTransform(String columnName, Map<String, String> mapping)
    public Builder ndArrayScalarOpTransform(String columnName, MathOp op, double value)
    public Builder ndArrayColumnsMathOpTransform(String newColumnName, MathOp mathOp, String... columnNames)
    public Builder ndArrayMathFunctionTransform(String columnName, MathFunction mathFunction)
    public Builder ndArrayDistanceTransform(String newColumnName, Distance distance, String firstCol, String secondCol)
    public Builder firstDigitTransform(String inputColumn, String outputColumn)
    public Builder firstDigitTransform(String inputColumn, String outputColumn, FirstDigitTransform.Mode mode)
    public TransformProcess build()
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public Schema transform(Schema inputSchema)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public Schema transform(Schema inputSchema)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Schema transform(Schema inputSchema)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Schema transform(Schema inputSchema)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Schema transform(Schema inputSchema)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public DoubleWritable map(Writable writable)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public Object map(Object input)
    public Object map(Object input)
    public Object map(Object input)
    public Object map(Object input)
    public IntWritable map(Writable writable)
    public String toString()
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public Object map(Object input)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Object map(Object input)
    public Object map(Object input)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public Object map(Object input)
    public String outputColumnName()
    public String columnName()
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public static String defaultOutputColumnName(String originalName, int lookback, ReduceOp op)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Object map(Object input)
    public Schema transform(Schema inputSchema)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Text map(Writable writable)
    public Object map(Object input)
    public Object map(Object input)
    public Object map(Object input)
    public Object map(Object input)
    public Text map(final Writable writable)
    public Schema transform(Schema inputSchema)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Schema transform(Schema inputSchema)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String outputColumnName()
    public String columnName()
    public Text map(Writable writable)
    public Object map(Object input)
    public Object map(Object input)
    public Object mapSequence(Object sequence)
    public String toString()
    public ColumnMetaData getNewColumnMetaData(String newName, ColumnMetaData oldColumnType)
    public Object map(Object input)
    public Object map(Object input)