Execute ETL and vectorization in a local instance.
Because datasets are often large, you can choose the execution mechanism that best suits your needs. For example, if you are vectorizing a large training dataset, you can process it in a distributed Spark cluster. However, if you need to do real-time inference, DataVec also provides a local executor that doesn't require any additional setup.
Once you've created your TransformProcess using your Schema, and you've either loaded your dataset into an Apache Spark JavaRDD or have a RecordReader that loads your dataset, you can execute a transform.
Locally this looks like:

```java
import org.datavec.local.transforms.LocalTransformExecutor;

List<List<Writable>> transformed = LocalTransformExecutor.execute(recordReader, transformProcess);
List<List<List<Writable>>> transformedSeq = LocalTransformExecutor.executeToSequence(sequenceReader, transformProcess);
List<List<Writable>> joined = LocalTransformExecutor.executeJoin(join, leftReader, rightReader);
```

When using Spark this looks like:
```java
import org.datavec.spark.transform.SparkTransformExecutor;

JavaRDD<List<Writable>> transformed = SparkTransformExecutor.execute(inputRdd, transformProcess);
JavaRDD<List<List<Writable>>> transformedSeq = SparkTransformExecutor.executeToSequence(inputSequenceRdd, transformProcess);
JavaRDD<List<Writable>> joined = SparkTransformExecutor.executeJoin(join, leftRdd, rightRdd);
```

Local transform executor
isTryCatch
public static boolean isTryCatch()

execute
Execute the specified TransformProcess with the given input data. Note: this method can only be used if the TransformProcess returns non-sequence data. For TransformProcesses that return a sequence, use executeToSequence(List, TransformProcess) instead.

param inputWritables Input data to process
param transformProcess TransformProcess to execute
return Processed data
Execute a DataVec transform process on Spark RDDs.
isTryCatch
public static boolean isTryCatch()

Deprecated: use static methods instead of instance methods on SparkTransformExecutor.
Gather statistics on datasets.
Sometimes datasets are too large or too opaque in their format to manually analyze and estimate statistics on certain columns or patterns. DataVec comes with helper utilities for performing data analysis and computing maximums, means, minimums, and other useful metrics.
If you have loaded your data into Apache Spark, DataVec has a special AnalyzeSpark class which can generate histograms, collect statistics, and return information about the quality of the data. Assuming you have already loaded your data into a Spark RDD, pass the JavaRDD and Schema to the class.
If you are using DataVec in Scala and your data was loaded into a regular RDD class, you can convert it by calling .toJavaRDD() which returns a JavaRDD. If you need to convert it back, call rdd().
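For example, a minimal sketch of that round trip (assuming an existing Scala RDD<List<Writable>> named scalaRdd):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.rdd.RDD;
import org.datavec.api.writable.Writable;

import java.util.List;

// Convert a Scala RDD to a JavaRDD for use with DataVec's Spark utilities
JavaRDD<List<Writable>> javaRdd = scalaRdd.toJavaRDD(); // scalaRdd: assumed RDD<List<Writable>>
// ...and convert it back to a Scala RDD if needed
RDD<List<Writable>> scalaAgain = javaRdd.rdd();
```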
The code below demonstrates some of the many analyses available for a 2D dataset in Spark, using the RDD javaRdd and the schema mySchema:
```java
import org.datavec.spark.transform.AnalyzeSpark;
import org.datavec.api.writable.Writable;
import org.datavec.api.transform.analysis.*;

int maxHistogramBuckets = 10;
DataAnalysis analysis = AnalyzeSpark.analyze(mySchema, javaRdd, maxHistogramBuckets);

DataQualityAnalysis qualityAnalysis = AnalyzeSpark.analyzeQuality(mySchema, javaRdd);

Writable max = AnalyzeSpark.max(javaRdd, "myColumn", mySchema);

int numSamples = 5;
List<Writable> sample = AnalyzeSpark.sampleFromColumn(numSamples, "myColumn", mySchema, javaRdd);
```

Note that if you have sequence data, there are special methods for that as well:
```java
SequenceDataAnalysis seqAnalysis = AnalyzeSpark.analyzeSequence(mySchema, sequenceRdd);
List<Writable> uniqueSequence = AnalyzeSpark.getUniqueSequence("myColumn", seqSchema, sequenceRdd);
```

The AnalyzeLocal class works very similarly to its Spark counterpart and has a similar API. Instead of passing an RDD, it accepts a RecordReader, which allows it to iterate over the dataset.
```java
import org.datavec.local.transforms.AnalyzeLocal;

int maxHistogramBuckets = 10;
DataAnalysis analysis = AnalyzeLocal.analyze(mySchema, csvRecordReader, maxHistogramBuckets);
```

analyze
public static DataAnalysis analyze(Schema schema, RecordReader rr, int maxHistogramBuckets)

Analyse the specified data - returns a DataAnalysis object with summary information about each column.

param schema Schema for data
param rr Data to analyze
return DataAnalysis for data
analyzeQualitySequence
public static DataQualityAnalysis analyzeQualitySequence(Schema schema, SequenceRecordReader data)

Analyze the data quality of sequence data - provides a report on missing values, values that don't comply with the schema, etc.
param schema Schema for data
param data Data to analyze
return DataQualityAnalysis object
analyzeQuality
public static DataQualityAnalysis analyzeQuality(final Schema schema, final RecordReader data)

Analyze the data quality of data - provides a report on missing values, values that don't comply with the schema, etc.
param schema Schema for data
param data Data to analyze
return DataQualityAnalysis object
AnalyzeSpark: static methods for analyzing data stored in Spark RDDs.
analyzeSequence
public static SequenceDataAnalysis analyzeSequence(Schema schema, JavaRDD<List<List<Writable>>> data, int maxHistogramBuckets)

param schema Schema for data
param data Data to analyze
param maxHistogramBuckets Maximum number of histogram buckets
return SequenceDataAnalysis for the data
analyze
public static DataAnalysis analyze(Schema schema, JavaRDD<List<Writable>> data)

Analyse the specified data - returns a DataAnalysis object with summary information about each column.
param schema Schema for data
param data Data to analyze
return DataAnalysis for data
analyzeQualitySequence
public static DataQualityAnalysis analyzeQualitySequence(Schema schema, JavaRDD<List<List<Writable>>> data)

Analyze the data quality of sequence data - provides a report on missing values, values that don't comply with the schema, etc.

sampleFromColumn
public static List<Writable> sampleFromColumn(int count, String columnName, Schema schema, JavaRDD<List<Writable>> data)

Randomly sample values from a single column.
param count Number of values to sample
param columnName Name of the column to sample from
param schema Schema
param data Data to sample from
return A list of random samples
analyzeQuality
public static DataQualityAnalysis analyzeQuality(final Schema schema, final JavaRDD<List<Writable>> data)

Analyze the data quality of data - provides a report on missing values, values that don't comply with the schema, etc.
param schema Schema for data
param data Data to analyze
return DataQualityAnalysis object
min
public static Writable min(JavaRDD<List<Writable>> allData, String columnName, Schema schema)

Get the minimum value for the specified column.

sampleInvalidFromColumn
public static List<Writable> sampleInvalidFromColumn(int numToSample, String columnName, Schema schema, JavaRDD<List<Writable>> data)

Randomly sample a set of invalid values from a specified column. Values are considered invalid according to the Schema / ColumnMetaData.

param numToSample Maximum number of invalid values to sample
param columnName Name of the column from which to sample invalid values
param schema Data schema
param data Data
return List of invalid examples
max
public static Writable max(JavaRDD<List<Writable>> allData, String columnName, Schema schema)

Get the maximum value for the specified column.

param allData All data
param columnName Name of the column to get the maximum value for
param schema Schema of the data
return Maximum value for the column
How to use data records in DataVec.
In the DataVec world, a Record represents a single entry in a dataset. DataVec distinguishes between types of records to make data manipulation easier with built-in APIs; in particular, sequence records are treated differently from standard 2D records.
Most of the time you do not need to interact with the record classes directly, unless you are manually iterating over records, for example to feed them through a neural network.
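For instance, here is a minimal sketch of manual iteration with a RecordReader (the file name data.csv is hypothetical):

```java
import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.writable.Writable;

import java.io.File;
import java.util.List;

RecordReader rr = new CSVRecordReader();
rr.initialize(new FileSplit(new File("data.csv"))); // may throw IOException / InterruptedException
while (rr.hasNext()) {
    List<Writable> record = rr.next(); // one 2D record: one Writable per column
}
```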
A standard implementation of the Record interface.
A standard implementation of the SequenceRecord interface.
Selection of data using conditions.
Filters are part of the transform API and give you a DSL for keeping only the parts of your dataset that you want. Filters can be one-liners for single conditions or can include complex boolean logic.
You can also write your own filters by implementing the Filter interface, though more often you will want to create a custom condition instead.
If the condition is satisfied (returns true): remove the example or sequence. If the condition is not satisfied (returns false): keep the example or sequence.
removeExample
param writables Example
return true if example should be removed, false to keep
removeSequence
param sequence sequence example
return true if example should be removed, false to keep
transform
Get the output schema for this transformation, given an input schema
param inputSchema
outputColumnName
The output column name after the operation has been applied
return the output column name
columnName
The output column names. This will often be the same as the input.
return the output column names
Filter: a method of removing examples (or sequences) according to some condition
FilterInvalidValues: a filter operation that removes any examples (or sequences) if the examples/sequences contain invalid values in any of a specified set of columns. Invalid values are determined with respect to the schema.
transform
param columnsToFilterIfInvalid Columns to check for invalid values
removeExample
param writables Example
return true if example should be removed, false to keep
removeSequence
param sequence sequence example
return true if example should be removed, false to keep
outputColumnName
The output column name after the operation has been applied
return the output column name
columnName
The output column names. This will often be the same as the input.
return the output column names
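As a sketch of how FilterInvalidValues might be used in practice (the schema and the column names "age" and "income" are hypothetical):

```java
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.filter.FilterInvalidValues;

// Remove any example with an invalid value (per the schema's metadata) in "age" or "income"
TransformProcess tp = new TransformProcess.Builder(schema)
    .filter(new FilterInvalidValues("age", "income"))
    .build();
```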
Remove invalid records, i.e., records that do not have the expected number of columns.
removeExample
param writables Example
return true if example should be removed, false to keep
removeSequence
param sequence sequence example
return true if example should be removed, false to keep
removeExample
param writables Example
return true if example should be removed, false to keep
removeSequence
param sequence sequence example
return true if example should be removed, false to keep
transform
Get the output schema for this transformation, given an input schema
param inputSchema
outputColumnName
The output column name after the operation has been applied
return the output column name
columnName
The output column names. This will often be the same as the input.
return the output column names
```java
TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
    .filter(new ConditionFilter(
        new CategoricalColumnCondition("MerchantCountryCode", ConditionOp.NotInSet,
            new HashSet<>(Arrays.asList("USA", "CAN")))))
    .build();
```

public boolean removeExample(Object writables)
public boolean removeSequence(Object sequence)
public Schema transform(Schema inputSchema)
public String outputColumnName()
public String columnName()
public Schema transform(Schema inputSchema)
public boolean removeExample(Object writables)
public boolean removeSequence(Object sequence)
public String outputColumnName()
public String columnName()
public boolean removeExample(Object writables)
public boolean removeSequence(Object sequence)
public boolean removeExample(List<Writable> writables)
public boolean removeSequence(List<List<Writable>> sequence)
public Schema transform(Schema inputSchema)
public String outputColumnName()
public String columnName()

Given a set of coordinates in text format (the delimiter is configurable), determine the geographic midpoint. See "geographic midpoint" at: http://www.geomidpoint.com/methods.html For the implementation algorithm, see: http://www.geomidpoint.com/calculation.html

transform
public Schema transform(Schema inputSchema)

param delim Delimiter for the coordinates in text format. For example, if the format is "lat,long", use ","
A StringReducer is used to take a set of examples and reduce them. The idea: suppose you have a large number of columns, and you want to combine/reduce the values in each column. StringReducer allows you to specify different reductions for different columns: append, prepend, merge, replace, etc.
Uses are: (1) reducing examples by a key; (2) reduction operations in time series (windowing ops, etc.).
transform
public Schema transform(Schema schema)

Get the output schema, given the input schema.
outputColumnName
public Builder outputColumnName(String outputColumnName)

Set the name of the output column.

Builder
Create a StringReducer builder, and set the default column reduction operation. For any columns that aren't specified explicitly, they will use the default reduction operation. If a column does have a reduction operation explicitly specified, then it will override the default specified here.

param defaultOp Default reduction operation to perform
appendColumns
public Builder appendColumns(String... columns)

Reduce the specified columns using the append operation.

prependColumns
public Builder prependColumns(String... columns)

Reduce the specified columns using the prepend operation.

mergeColumns
public Builder mergeColumns(String... columns)

Reduce the specified columns using the merge operation.

replaceColumn
public Builder replaceColumn(String... columns)

Reduce the specified columns using the replace operation.
customReduction
public Builder customReduction(String column, ColumnReduction columnReduction)

Reduce the specified column using custom column reduction functionality.
param column Column to execute the custom reduction functionality on
param columnReduction Column reduction to execute on that column
setIgnoreInvalid
public Builder setIgnoreInvalid(String... columns)

When doing the reduction: set the specified columns to ignore any invalid values. Invalid: defined as not being valid according to the ColumnMetaData: ColumnMetaData.isValid(Writable). For numerical columns, this typically means being unable to parse the Writable. For example, Writable.toLong() failing for a Long column. If the column has any restrictions (min/max values, regex for Strings, etc.), these will also be taken into account.
param columns Columns to set ‘ignore invalid’ for
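Putting the builder methods above together, a hypothetical usage sketch might look like the following (the StringReduceOp constant, the column names, and the build() call are assumptions, not taken from the documentation above):

```java
// Merge two String columns into a single output column, ignoring invalid values
StringReducer reducer = new StringReducer.Builder(StringReduceOp.MERGE) // assumed default-op constant
    .outputColumnName("fullName")            // hypothetical output column
    .mergeColumns("firstName", "lastName")   // hypothetical input columns
    .setIgnoreInvalid("firstName")
    .build();                                // assumed terminal call
```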
createHtmlAnalysisString
Render a data analysis object as an HTML file. This will produce a summary table, along with charts for numerical columns. The contents of the HTML file are returned as a String, which should be written to a .html file.
param analysis Data analysis object to render
see #createHtmlAnalysisFile(DataAnalysis, File)
createHtmlAnalysisFile
Render a data analysis object as an HTML file. This will produce a summary table, along with charts for numerical columns.
param dataAnalysis Data analysis object to render
param output Output file (should have extension .html)
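As a usage sketch, assuming the DataAnalysis object named analysis produced by AnalyzeSpark.analyze or AnalyzeLocal.analyze earlier:

```java
import org.datavec.api.transform.ui.HtmlAnalysis;
import java.io.File;

// Write the rendered analysis to an HTML file for inspection in a browser
HtmlAnalysis.createHtmlAnalysisFile(analysis, new File("analysis.html"));
```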
A simple utility for plotting DataVec sequence data to HTML files. Each file contains only one sequence. Each column is plotted separately; only numerical and categorical columns are plotted.
createHtmlSequencePlots
Create an HTML file with plots for the given sequence.
param title Title of the page
param schema Schema for the data
param sequence Sequence to plot
return HTML file as a string
createHtmlSequencePlotFile
Create an HTML file with plots for the given sequence and write it to a file.
param title Title of the page
param schema Schema for the data
param sequence Sequence to plot
Data wrangling and mapping from one schema to another.
DataVec comes with the ability to serialize transforms, which allows them to be more portable when they're needed for production environments. A TransformProcess is serialized to a human-readable format such as JSON and can be saved as a file.
The code below shows how you can serialize the transform process tp.
When you want to reinstantiate the transform process, call the static from<format> method.
Serializer used for converting objects (Transforms, Conditions, etc) to JSON format
Serializer used for converting objects (Transforms, Conditions, etc) to YAML format
public static String createHtmlAnalysisString(DataAnalysis analysis) throws Exception
public static void createHtmlAnalysisFile(DataAnalysis dataAnalysis, File output) throws Exception
public static String createHtmlSequencePlots(String title, Schema schema, List<List<Writable>> sequence) throws Exception
public static void createHtmlSequencePlotFile(String title, Schema schema, List<List<Writable>> sequence, File output) throws Exception

```java
String serializedTransformString = tp.toJson();
TransformProcess deserialized = TransformProcess.fromJson(serializedTransformString);
```

Implementations for advanced transformation.
Operations, such as a Function, help execute transforms and load data into DataVec. The concept of operations is low-level, meaning that most of the time you will not need to worry about them.
If you're using Apache Spark, functions will iterate over the dataset and load it into a Spark RDD and convert the raw data format into a Writable.
```java
import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.writable.Writable;
import org.datavec.spark.transform.misc.StringToWritablesFunction;

SparkConf conf = new SparkConf();
JavaSparkContext sc = new JavaSparkContext(conf);

RecordReader rr = new CSVRecordReader();
String customerInfoPath = new ClassPathResource("CustomerInfo.csv").getFile().getPath();
JavaRDD<List<Writable>> customerInfo = sc.textFile(customerInfoPath).map(new StringToWritablesFunction(rr));
```

The above code loads a CSV file into a 2D Java RDD. Once your RDD is loaded, you can transform it, perform joins, and use reducers to wrangle the data any way you want.
These operations are low-level building blocks:

One aggregating operation is used to execute many reduction operations in parallel on the same column (see datavec#238), dispatching the appropriate column of each element to its operation.

A family of conversion operations converts the value of a column to a Byte, Double, Float, Integer, Long, or Text value before dispatching the appropriate column of the element to its operation.
CalculateSortedRank: calculate the rank of each example, after sorting the examples. For example, we might have some numerical "score" column, and we want to know the rank (sort order) of each example according to that column. The rank of each example (after sorting) will be added in a new Long column. Indexing is done from 0; examples will have values 0 to dataSetSize - 1.
Currently, CalculateSortedRank can only be applied on standard (i.e., non-sequence) data. Furthermore, the current implementation can only sort on one column.
transform
public Schema transform(Schema inputSchema)

param newColumnName Name of the new column (will contain the rank for each example)
param sortOnColumn Name of the column to sort on
param comparator Comparator used to sort examples
outputColumnName
public String outputColumnName()

The output column name after the operation has been applied.
return the output column name
columnName
public String columnName()

The output column names. This will often be the same as the input.
return the output column names
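A sketch of how CalculateSortedRank is typically applied through a TransformProcess (the schema and the "score" column are hypothetical):

```java
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.writable.comparator.DoubleWritableComparator;

// Add a "rank" column containing each example's sort position by "score"
TransformProcess tp = new TransformProcess.Builder(schema)
    .calculateSortedRank("rank", "score", new DoubleWritableComparator())
    .build();
```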
BooleanCondition: used for creating compound conditions, such as AND(ConditionA, ConditionB, …) As a BooleanCondition is a condition, these can be chained together, like NOT(OR(AND(…),AND(…)))
The output column name after the operation has been applied
return the output column name
The output column names. This will often be the same as the input.
return the output column names
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
conditionSequence
Condition on arbitrary input
param sequence the sequence to do a condition on
return true if the condition for the sequence is met false otherwise
transform
Get the output schema for this transformation, given an input schema
param inputSchema
AND of all the given conditions
param conditions the conditions to AND together
return the AND of all these conditions

OR of all the given conditions
param conditions the conditions to OR together
return the OR of all these conditions

NOT of the given condition
param condition the condition to negate
return the NOT of the given condition

XOR of the given conditions
param first the first condition
param second the second condition for the XOR
return the XOR of these 2 conditions
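For example, a sketch combining two of the single-column conditions documented below (the column names reuse the merchant schema example elsewhere in this document):

```java
import java.util.Arrays;
import java.util.HashSet;
import org.datavec.api.transform.condition.BooleanCondition;
import org.datavec.api.transform.condition.Condition;
import org.datavec.api.transform.condition.ConditionOp;
import org.datavec.api.transform.condition.column.CategoricalColumnCondition;
import org.datavec.api.transform.condition.column.DoubleColumnCondition;

Condition inNorthAmerica = new CategoricalColumnCondition("MerchantCountryCode",
        ConditionOp.InSet, new HashSet<>(Arrays.asList("USA", "CAN")));
Condition largeAmount = new DoubleColumnCondition("TransactionAmountUSD",
        ConditionOp.GreaterThan, 1000.0);
// True only when both conditions hold
Condition both = BooleanCondition.AND(inNorthAmerica, largeAmount);
```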
For certain single-column conditions: how should we apply these to sequences? And: the condition applies to the sequence only if it applies to ALL time steps. Or: the condition applies to the sequence if it applies to ANY time step. NoSequenceMode: the condition cannot be applied to sequences at all (an error condition).
columnCondition
Returns whether the given element meets the condition set by this operation
param writable the element to test
return true if the condition is met false otherwise
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
columnCondition
Constructor for conditions equal or not equal. Uses the default sequence condition mode, BaseColumnCondition.DEFAULT_SEQUENCE_CONDITION_MODE.
param columnName Column to check for the condition
param op Operation (== or != only)
param value Value to use in the condition
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
columnCondition
Constructor for operations such as less than, equal to, greater than, etc. Uses the default sequence condition mode, BaseColumnCondition.DEFAULT_SEQUENCE_CONDITION_MODE.
param columnName Column to check for the condition
param op Operation (<, >=, !=, etc)
param value Value to use in the condition
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
A column condition that simply checks whether a floating point value is infinite
columnCondition
param columnName Column to check for the condition
columnCondition
Constructor for operations such as less than, equal to, greater than, etc. Uses the default sequence condition mode, BaseColumnCondition.DEFAULT_SEQUENCE_CONDITION_MODE.
param columnName Column to check for the condition
param op Operation (<, >=, !=, etc)
param value Value to use in the condition
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
A Condition that applies to a single column. Whenever the specified value is invalid according to the schema, the condition applies.
For example, if a Writable contains String values in an Integer column (and these cannot be parsed to an integer), then the condition would return true, as these values are invalid according to the schema.
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
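As a sketch, such a condition is often paired with a conditional replacement transform (the column name and replacement value are hypothetical):

```java
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.condition.column.InvalidValueColumnCondition;
import org.datavec.api.writable.IntWritable;

// Replace any value in "age" that is invalid per the schema with 0
TransformProcess tp = new TransformProcess.Builder(schema)
    .conditionalReplaceValueTransform("age", new IntWritable(0),
            new InvalidValueColumnCondition("age"))
    .build();
```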
columnCondition
Constructor for operations such as less than, equal to, greater than, etc. Uses the default sequence condition mode, BaseColumnCondition.DEFAULT_SEQUENCE_CONDITION_MODE.
param columnName Column to check for the condition
param op Operation (<, >=, !=, etc)
param value Value to use in the condition
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
A column condition that simply checks whether a floating point value is NaN
columnCondition
param columnName Name of the column to check the condition for
Condition that applies to the values in any column. Specifically, condition is true if the Writable value is a NullWritable, and false for any other value
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
columnCondition
Constructor for conditions equal or not equal. Uses the default sequence condition mode, BaseColumnCondition.DEFAULT_SEQUENCE_CONDITION_MODE.
param columnName Column to check for the condition
param op Operation (== or != only)
param value Value to use in the condition
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
Condition that applies to the values in a Time column.
columnCondition
Constructor for operations such as less than, equal to, greater than, etc. Uses the default sequence condition mode, BaseColumnCondition.DEFAULT_SEQUENCE_CONDITION_MODE.
param columnName Column to check for the condition
param op Operation (<, >=, !=, etc)
param value Time value (in epoch millisecond format) to use in the condition
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
A condition on sequence lengths
Condition that applies to the values in a String column, using a provided regex. The condition returns true if the String matches the regex, and false otherwise. Note: uses Writable.toString(), hence it can potentially be applied to non-String columns.
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
public String outputColumnName()
public String columnName()
public boolean condition(Object input)
public boolean conditionSequence(Object sequence)
public Schema transform(Schema inputSchema)
public static Condition AND(Condition... conditions)
public static Condition OR(Condition... conditions)
public static Condition NOT(Condition condition)
public static Condition XOR(Condition first, Condition second)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean columnCondition(Writable writable)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean condition(Object input)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean columnCondition(Writable writable)
public boolean condition(Object input)
public boolean condition(Object input)

Neural networks work best when the data they're fed is normalized, constrained to a range between -1 and 1. There are several reasons for that. One is that networks are trained using gradient descent, and their activation functions usually have an active range somewhere between -1 and 1. Even when using an activation function that doesn't saturate quickly, it is still good practice to constrain your values to this range to improve performance.
Preprocessor for DataSets that normalizes feature values (and optionally label values) to lie between a minimum and maximum value (by default between 0 and 1).

NormalizerMinMaxScaler
public NormalizerMinMaxScaler(double minRange, double maxRange)

Preprocessor can take a range, given as minRange and maxRange.

param minRange lower bound of the target range
param maxRange upper bound of the target range

load
public void load(File... statistics) throws IOException

Load the given min and max.

param statistics the statistics to load
throws IOException

save
public void save(File... files) throws IOException

Save the current min and max.

param files the statistics to save
throws IOException
Deprecated: use NormalizerSerializer instead.
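A minimal fit-and-apply sketch (trainIter is an assumed DataSetIterator over the training data):

```java
import org.nd4j.linalg.dataset.api.preprocessor.DataNormalization;
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerMinMaxScaler;

DataNormalization scaler = new NormalizerMinMaxScaler(0, 1); // target range [0, 1]
scaler.fit(trainIter);              // collect min/max statistics from the training data
trainIter.setPreProcessor(scaler);  // normalize each batch as it is fetched
```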
Base interface for all normalizers
A DataSetPreProcessor used to flatten a 4d CNN features array to a 2d format (for use in networks such as a DenseLayer / multi-layer perceptron).
A normalization strategy based on statistics of the upper and lower bounds of the population.

MinMaxStrategy
public MinMaxStrategy(double minRange, double maxRange)

param minRange the target range lower bound
param maxRange the target range upper bound

preProcess
public void preProcess(INDArray array, INDArray maskArray, MinMaxStats stats)

Normalize a data array.

param array the data to normalize
param stats statistics of the data population

revert
public void revert(INDArray array, INDArray maskArray, MinMaxStats stats)

Denormalize a data array.

param array the data to denormalize
param stats statistics of the data population
A preprocessor specifically for images that applies min-max scaling. It can take a range, so pixel values can be scaled from 0->255 to minRange->maxRange (default minRange = 0 and maxRange = 1). If pixel values are not 8 bits, you can specify the number of bits as the third argument in the constructor. For values that are already floating point, specify the number of bits as 1.

ImagePreProcessingScaler
public ImagePreProcessingScaler(double a, double b, int maxBits)

Preprocessor can take a range, given as minRange and maxRange.

param a lower bound of the range, default = 0
param b upper bound of the range, default = 1
param maxBits number of bits per pixel in the image, default = 8

fit
public void fit(DataSet dataSet)

Fit a dataset (only compute based on the statistics from this dataset).

param dataSet the dataset to compute on

fit
public void fit(DataSetIterator iterator)

Iterates over a dataset accumulating statistics for normalization.

param iterator the iterator to use for collecting statistics

transform
public void transform(DataSet toPreProcess)

Transform the data.

param toPreProcess the dataset to transform
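A sketch for 8-bit image data (trainIter again assumed to be a DataSetIterator):

```java
import org.nd4j.linalg.dataset.api.preprocessor.DataNormalization;
import org.nd4j.linalg.dataset.api.preprocessor.ImagePreProcessingScaler;

DataNormalization imageScaler = new ImagePreProcessingScaler(0, 1); // scale pixels from [0,255] to [0,1]
imageScaler.fit(trainIter);
trainIter.setPreProcessor(imageScaler);
```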
A simple composite MultiDataSetPreProcessor - allows you to apply multiple MultiDataSetPreProcessors sequentially on the same MultiDataSet, in the order they are passed to the constructor.

CompositeMultiDataSetPreProcessor
public CompositeMultiDataSetPreProcessor(MultiDataSetPreProcessor... preProcessors)

param preProcessors Preprocessors to apply. They will be applied in this order.
Preprocessor for MultiDataSets that normalizes feature values (and optionally label values) to lie between a minimum and maximum value (by default between 0 and 1).

MultiNormalizerMinMaxScaler
public MultiNormalizerMinMaxScaler(double minRange, double maxRange)

Preprocessor can take a range, given as minRange and maxRange.

param minRange the target range lower bound
param maxRange the target range upper bound
An interface for multi dataset normalizers. Data normalizers compute some sort of statistics over a MultiDataSet and scale the data in some way.
A preprocessor specifically for images that applies min-max scaling to one or more of the feature arrays in a MultiDataSet. It can take a range, so pixel values can be scaled from 0->255 to minRange->maxRange (default minRange = 0 and maxRange = 1). If pixel values are not 8 bits, you can specify the number of bits as the third argument in the constructor. For values that are already floating point, specify the number of bits as 1.

ImageMultiPreProcessingScaler
public ImageMultiPreProcessingScaler(double a, double b, int maxBits, int[] featureIndices)

Preprocessor can take a range, given as minRange and maxRange.

param a lower bound of the range, default = 0
param b upper bound of the range, default = 1
param maxBits number of bits per pixel in the image, default = 8
param featureIndices Indices of feature arrays to process. If only one feature array is present, this should always be 0.
Preprocessor for DataSets that normalizes feature values (and optionally label values) to have a mean of 0 and a standard deviation of 1.

load
public void load(File... files) throws IOException

Load the means and standard deviations from the file system.

param files the files to load from. Needs 4 files if normalizing labels, otherwise 2.

save
public void save(File... files) throws IOException

Save the current means and standard deviations to the file system.

param files the files to save to. Needs 4 files if normalizing labels, otherwise 2.
Deprecated: use NormalizerSerializer instead.

Statistics of the means and standard deviations of the population.
preProcess
public void preProcess(INDArray array, INDArray maskArray, DistributionStats stats)

Normalize a data array.

param array the data to normalize
param stats statistics of the data population

revert
public void revert(INDArray array, INDArray maskArray, DistributionStats stats)

Denormalize a data array.

param array the data to denormalize
param stats statistics of the data population
Interface for strategies that can normalize and denormalize data arrays based on statistics of the population
Preprocessor for MultiDataSets that can be configured to use different normalization strategies for different inputs and outputs, or none at all. This can be used, for example, when one input should be normalized but a different one should be left untouched because it's the input for an embedding layer. Alternatively, one might want to mix standardization and min-max scaling for different inputs and outputs.
By default, no normalization is applied. There are methods to configure the desired normalization strategy for inputs and outputs either globally or on an individual input/output level. Specific input/output strategies will override global ones.
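For example, a minimal configuration sketch (the MultiDataSetIterator is assumed):

```java
import org.nd4j.linalg.dataset.api.preprocessor.MultiNormalizerHybrid;

// Standardize input 0, min-max scale all outputs to [0, 1]; other inputs stay untouched
MultiNormalizerHybrid normalizer = new MultiNormalizerHybrid()
    .standardizeInput(0)
    .minMaxScaleAllOutputs(0, 1);
normalizer.fit(multiDataSetIterator); // assumed MultiDataSetIterator over the training data
```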
standardizeAllInputs
public MultiNormalizerHybrid standardizeAllInputs()

Apply standardization to all inputs, except the ones individually configured.

return the normalizer

minMaxScaleAllInputs
public MultiNormalizerHybrid minMaxScaleAllInputs()

Apply min-max scaling to all inputs, except the ones individually configured.

return the normalizer

minMaxScaleAllInputs
public MultiNormalizerHybrid minMaxScaleAllInputs(double rangeFrom, double rangeTo)

Apply min-max scaling to all inputs, except the ones individually configured.

param rangeFrom lower bound of the target range
param rangeTo upper bound of the target range
return the normalizer

standardizeInput
public MultiNormalizerHybrid standardizeInput(int input)

Apply standardization to a specific input, overriding the global input strategy if any.

param input the index of the input
return the normalizer

minMaxScaleInput
public MultiNormalizerHybrid minMaxScaleInput(int input)

Apply min-max scaling to a specific input, overriding the global input strategy if any.

param input the index of the input
return the normalizer

minMaxScaleInput
public MultiNormalizerHybrid minMaxScaleInput(int input, double rangeFrom, double rangeTo)

Apply min-max scaling to a specific input, overriding the global input strategy if any.

param input the index of the input
param rangeFrom lower bound of the target range
param rangeTo upper bound of the target range
return the normalizer
standardizeAllOutputs
public MultiNormalizerHybrid standardizeAllOutputs()

Apply standardization to all outputs, except the ones individually configured.

return the normalizer

minMaxScaleAllOutputs
public MultiNormalizerHybrid minMaxScaleAllOutputs()

Apply min-max scaling to all outputs, except the ones individually configured.

return the normalizer

minMaxScaleAllOutputs
public MultiNormalizerHybrid minMaxScaleAllOutputs(double rangeFrom, double rangeTo)

Apply min-max scaling to all outputs, except the ones individually configured.

param rangeFrom lower bound of the target range
param rangeTo upper bound of the target range
return the normalizer

standardizeOutput
public MultiNormalizerHybrid standardizeOutput(int output)

Apply standardization to a specific output, overriding the global output strategy if any.

param output the index of the output
return the normalizer

minMaxScaleOutput
public MultiNormalizerHybrid minMaxScaleOutput(int output)

Apply min-max scaling to a specific output, overriding the global output strategy if any.

param output the index of the output
return the normalizer

minMaxScaleOutput
public MultiNormalizerHybrid minMaxScaleOutput(int output, double rangeFrom, double rangeTo)

Apply min-max scaling to a specific output, overriding the global output strategy if any.

param output the index of the output
param rangeFrom lower bound of the target range
param rangeTo upper bound of the target range
return the normalizer
getInputStats
public NormalizerStats getInputStats(int input)

Get normalization statistics for a given input.

param input the index of the input
return implementation of NormalizerStats corresponding to the normalization strategy selected

getOutputStats
public NormalizerStats getOutputStats(int output)

Get normalization statistics for a given output.

param output the index of the output
return implementation of NormalizerStats corresponding to the normalization strategy selected

Get the map of normalization statistics per input.
return map of input indices pointing to NormalizerStats instances

fit
public void fit(@NonNull MultiDataSet dataSet)

Fit a MultiDataSet (only compute based on the statistics from this dataset).

fit
public void fit(@NonNull MultiDataSetIterator iterator)

Iterates over a dataset accumulating statistics for normalization.

param iterator the iterator to use for collecting statistics

transform
public void transform(@NonNull MultiDataSet data)

Transform the dataset.

param data the dataset to pre process

revert
public void revert(@NonNull MultiDataSet data)

Undo (revert) the normalization applied by this normalizer instance (arrays are modified in-place).

param data MultiDataSet to revert the normalization on

revertFeatures
public void revertFeatures(@NonNull INDArray[] features)

Undo (revert) the normalization applied by this normalizer instance to the entire inputs array.

param features The normalized array of inputs

revertFeatures
public void revertFeatures(@NonNull INDArray[] features, INDArray[] maskArrays)

Undo (revert) the normalization applied by this normalizer instance to the entire inputs array.

param features The normalized array of inputs
param maskArrays Optional mask arrays belonging to the inputs

revertFeatures
public void revertFeatures(@NonNull INDArray[] features, INDArray[] maskArrays, int input)

Undo (revert) the normalization applied by this normalizer instance to the features of a particular input.

param features The normalized array of inputs
param maskArrays Optional mask arrays belonging to the inputs
param input the index of the input to revert normalization on

revertLabels
public void revertLabels(@NonNull INDArray[] labels)

Undo (revert) the normalization applied by this normalizer instance to the entire outputs array.

param labels The normalized array of outputs

revertLabels
public void revertLabels(@NonNull INDArray[] labels, INDArray[] maskArrays)

Undo (revert) the normalization applied by this normalizer instance to the entire outputs array.

param labels The normalized array of outputs
param maskArrays Optional mask arrays belonging to the outputs

revertLabels
public void revertLabels(@NonNull INDArray[] labels, INDArray[] maskArrays, int output)

Undo (revert) the normalization applied by this normalizer instance to the labels of a particular output.

param labels The normalized array of outputs
param maskArrays Optional mask arrays belonging to the outputs
param output the index of the output to revert normalization on
A simple composite DataSetPreProcessor - allows you to apply multiple DataSetPreProcessors sequentially on the same DataSet, in the order they are passed to the constructor.

CompositeDataSetPreProcessor
public CompositeDataSetPreProcessor(DataSetPreProcessor... preProcessors)

param preProcessors Preprocessors to apply. They will be applied in this order.
Preprocessor for MultiDataSets that normalizes feature values (and optionally label values) to have a mean of 0 and a standard deviation of 1.

load
public void load(@NonNull List<File> featureFiles, @NonNull List<File> labelFiles) throws IOException

Load means and standard deviations from the file system.

param featureFiles source files for features, requires 2 files per input, alternating mean and stddev files
param labelFiles source files for labels, requires 2 files per output, alternating mean and stddev files

save
public void save(@NonNull List<File> featureFiles, @NonNull List<File> labelFiles) throws IOException

Save the current means and standard deviations to the file system.

param featureFiles target files for features, requires 2 files per input, alternating mean and stddev files
param labelFiles target files for labels, requires 2 files per output, alternating mean and stddev files
Deprecated: use MultiStandardizeSerializerStrategy instead.
This is a preprocessor specifically for VGG16. It subtracts the mean RGB value, computed on the training set, from each pixel as reported in: https://arxiv.org/pdf/1409.1556.pdf
fit
public void fit(DataSet dataSet)

Fit a dataset (only compute based on the statistics from this dataset).

param dataSet the dataset to compute on

fit
public void fit(DataSetIterator iterator)

Iterates over a dataset accumulating statistics for normalization.

param iterator the iterator to use for collecting statistics

transform
public void transform(DataSet toPreProcess)

Transform the data.

param toPreProcess the dataset to transform
An interface for data normalizers. Data normalizers compute some sort of statistics over a dataset and scale the data in some way.
Schemas for datasets and transformation.
The unfortunate reality is that data is dirty. When trying to vectorize a dataset for deep learning, it is quite rare to find files that have zero errors. A schema is important for maintaining the meaning of the data before using it for something like training a neural network.
Schemas are primarily used for programming transformations. Before you can properly execute a TransformProcess, you will need to pass it the schema of the data being transformed.
An example of a schema for merchant records may look like:
```java
Schema inputDataSchema = new Schema.Builder()
    .addColumnsString("DateTimeString", "CustomerID", "MerchantID")
    .addColumnInteger("NumItemsInTransaction")
    .addColumnCategorical("MerchantCountryCode", Arrays.asList("USA", "CAN", "FR", "MX"))
    .addColumnDouble("TransactionAmountUSD", 0.0, null, false, false) // $0.0 or more, no maximum limit, no NaN and no Infinite values
    .addColumnCategorical("FraudLabel", Arrays.asList("Fraud", "Legit"))
    .build();
```

If you have two different datasets that you want to merge together, DataVec provides a Join class with different join strategies, such as Inner or RightOuter.
```java
Schema customerInfoSchema = new Schema.Builder()
    .addColumnLong("customerID")
    .addColumnString("customerName")
    .addColumnCategorical("customerCountry", Arrays.asList("USA", "France", "Japan", "UK"))
    .build();

Schema customerPurchasesSchema = new Schema.Builder()
    .addColumnLong("customerID")
    .addColumnTime("purchaseTimestamp", DateTimeZone.UTC)
    .addColumnLong("productID")
    .addColumnInteger("purchaseQty")
    .addColumnDouble("unitPriceUSD")
    .build();

Join join = new Join.Builder(Join.JoinType.Inner)
    .setJoinColumns("customerID")
    .setSchemas(customerInfoSchema, customerPurchasesSchema)
    .build();
```

Once you've defined your join and you've loaded the data into DataVec, you must use an Executor to complete the join.
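For example, with Spark (assuming the two datasets were loaded into RDDs named customerInfo and customerPurchases, as in the operations example earlier):

```java
import org.apache.spark.api.java.JavaRDD;
import org.datavec.api.writable.Writable;
import org.datavec.spark.transform.SparkTransformExecutor;

import java.util.List;

JavaRDD<List<Writable>> joined =
        SparkTransformExecutor.executeJoin(join, customerInfo, customerPurchases);
```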
DataVec comes with a few Schema classes and helper utilities for 2D and sequence types of data.
Join class: used to specify a join (like an SQL join)
setSchemas
public Builder setSchemas(Schema left, Schema right)

Type of join:
Inner: Return examples where the join column values occur in both.
LeftOuter: Return all examples from the left data, whether there is a matching right value or not. (If not, right values will have NullWritable instead.)
RightOuter: Return all examples from the right data, whether there is a matching left value or not. (If not, left values will have NullWritable instead.)
FullOuter: Return all examples from both left/right, whether there is a matching value from the other side or not. (If not, other values will have NullWritable instead.)

setKeyColumns
public Builder setKeyColumns(String... keyColumnNames)
Deprecated: use setJoinColumns(String...).

setKeyColumnsLeft
public Builder setKeyColumnsLeft(String... keyColumnNames)
Deprecated: use setJoinColumnsLeft(String...).

setKeyColumnsRight
public Builder setKeyColumnsRight(String... keyColumnNames)
Deprecated: use setJoinColumnsRight(String...).

setJoinColumnsLeft
public Builder setJoinColumnsLeft(String... joinColumnNames)

Specify the names of the columns to join on, for the left data. The idea: join examples where firstDataValues(joinColumnNamesLeft[i]) == secondDataValues(joinColumnNamesRight[i]) for all i.

param joinColumnNames Names of the columns to join on (for the left data)

setJoinColumnsRight
public Builder setJoinColumnsRight(String... joinColumnNames)

Specify the names of the columns to join on, for the right data. The idea: join examples where firstDataValues(joinColumnNamesLeft[i]) == secondDataValues(joinColumnNamesRight[i]) for all i.

param joinColumnNames Names of the columns to join on (for the right data)
If passed a CSV file that contains a header and a single row of sample data, it will return a Schema.
Only Double, Integer, Long, and String types are supported. If no number type can be inferred, the field type will become the default type. Note that if your column is actually categorical but is represented as a number, you will need to do additional transformation. Also, if your sample field is blank/null, it will also become the default type.
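A minimal sketch (pathToCsv is an assumed path to such a file, and the class/method names follow the inference utility described above):

```java
import org.datavec.api.transform.schema.InferredSchema;
import org.datavec.api.transform.schema.Schema;

// Infer a Schema from the CSV header plus one sample row
Schema inferred = new InferredSchema(pathToCsv).build();
```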
A Schema defines the layout of tabular data. Specifically, it contains names for each column, as well as details of types (Integer, String, Long, Double, etc.). Type information for each column may optionally include restrictions on the allowable values for each column.
sameTypes
public boolean sameTypes(Schema schema)

Determine whether this schema has the same column types as another schema.

newSchema
public Schema newSchema(List<ColumnMetaData> columnMetaData)

Create a schema based on the given metadata.

param columnMetaData the metadata to create the schema from

Compute the difference in ColumnMetaData between this schema and the passed-in schema. This is useful during the org.datavec.api.transform.TransformProcess to identify what a process will do to a given Schema.

param schema the schema to compute the difference for
return the metadata that is different (in order) between this schema and the other schema
numColumns
public int numColumns()

Returns the number of columns or fields for this schema.

return the number of columns or fields for this schema

getName
public String getName(int column)

Returns the name of the column at the specified index.

param column the index of the column to get the name for
return the name of the column at the specified index

getType
public ColumnType getType(int column)

Returns the ColumnType for the column at the specified index.

param column the index of the column to get the type for
return the type of the column at the specified index

getType
public ColumnType getType(String columnName)

Returns the ColumnType for the column with the specified name.

param columnName the name of the column to get the type for
return the type of the column with the given name

getMetaData
public ColumnMetaData getMetaData(int column)

Returns the ColumnMetaData at the specified column index.

param column the index to get the metadata for
return the metadata at the specified index

getMetaData
public ColumnMetaData getMetaData(String column)

Retrieve the metadata for the given column name.

param column the name of the column to get metadata for
return the metadata for the given column name

getIndexOfColumn
public int getIndexOfColumn(String columnName)

Return the index of the column with the given name.

hasColumn
public boolean hasColumn(String columnName)

Determine whether the schema has a column with the given name.

Related methods return a copy of the list of column names for this schema, and the indices of multiple columns given their names.
toJson
public String toJson()

Serialize this schema to JSON.

return a JSON representation of this schema

toYaml
public String toYaml()

Serialize this schema to YAML.

return the YAML representation of this schema

fromJson
public static Schema fromJson(String json)

Create a schema from a given JSON string.

param json the JSON to create the schema from
return the created schema based on the JSON

fromYaml
public static Schema fromYaml(String yaml)

Create a schema from the given YAML string.

param yaml the YAML to create the schema from
return the created schema based on the YAML
addColumnFloat
public Builder addColumnFloat(String name)

Add a Float column with no restrictions on the allowable values, except for no NaN/infinite values allowed.
param name Name of the column
addColumnFloat
public Builder addColumnFloat(String name, Float minAllowedValue, Float maxAllowedValue)

Add a Float column with the specified restrictions (and no NaN/Infinite values allowed).
param name Name of the column
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
addColumnFloat
public Builder addColumnFloat(String name, Float minAllowedValue, Float maxAllowedValue, boolean allowNaN, boolean allowInfinite)

Add a Float column with the specified restrictions.
param name Name of the column
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
param allowNaN If false: don’t allow NaN values. If true: allow.
param allowInfinite If false: don’t allow infinite values. If true: allow
addColumnsFloat
public Builder addColumnsFloat(String... columnNames)

Add multiple Float columns with no restrictions on the allowable values of the columns (other than no NaN/Infinite).
param columnNames Names of the columns to add
addColumnsFloat
public Builder addColumnsFloat(String pattern, int minIdxInclusive, int maxIdxInclusive)

A convenience method for adding multiple Float columns. For example, to add columns "myFloatCol_0", "myFloatCol_1", "myFloatCol_2", use addColumnsFloat("myFloatCol_%d", 0, 2).
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
addColumnsFloat
public Builder addColumnsFloat(String pattern, int minIdxInclusive, int maxIdxInclusive, Float minAllowedValue, Float maxAllowedValue, boolean allowNaN, boolean allowInfinite)

A convenience method for adding multiple Float columns, with additional restrictions that apply to all columns. For example, to add columns "myFloatCol_0", "myFloatCol_1", "myFloatCol_2", use addColumnsFloat("myFloatCol_%d", 0, 2, null, null, false, false).
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
param allowNaN If false: don’t allow NaN values. If true: allow.
param allowInfinite If false: don’t allow infinite values. If true: allow
addColumnDouble
public Builder addColumnDouble(String name)

Add a Double column with no restrictions on the allowable values, except for no NaN/infinite values allowed.
param name Name of the column
addColumnDouble
public Builder addColumnDouble(String name, Double minAllowedValue, Double maxAllowedValue)

Add a Double column with the specified restrictions (and no NaN/Infinite values allowed).
param name Name of the column
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
addColumnDouble
public Builder addColumnDouble(String name, Double minAllowedValue, Double maxAllowedValue, boolean allowNaN, boolean allowInfinite)

Add a Double column with the specified restrictions.
param name Name of the column
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
param allowNaN If false: don’t allow NaN values. If true: allow.
param allowInfinite If false: don’t allow infinite values. If true: allow
addColumnsDouble
public Builder addColumnsDouble(String... columnNames)

Add multiple Double columns with no restrictions on the allowable values of the columns (other than no NaN/Infinite).
param columnNames Names of the columns to add
addColumnsDouble
public Builder addColumnsDouble(String pattern, int minIdxInclusive, int maxIdxInclusive)

A convenience method for adding multiple Double columns. For example, to add columns "myDoubleCol_0", "myDoubleCol_1", "myDoubleCol_2", use addColumnsDouble("myDoubleCol_%d", 0, 2).
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
addColumnsDouble
public Builder addColumnsDouble(String pattern, int minIdxInclusive, int maxIdxInclusive, Double minAllowedValue, Double maxAllowedValue, boolean allowNaN, boolean allowInfinite)

A convenience method for adding multiple Double columns, with additional restrictions that apply to all columns. For example, to add columns "myDoubleCol_0", "myDoubleCol_1", "myDoubleCol_2", use addColumnsDouble("myDoubleCol_%d", 0, 2, null, null, false, false).
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
param allowNaN If false: don’t allow NaN values. If true: allow.
param allowInfinite If false: don’t allow infinite values. If true: allow
addColumnInteger
public Builder addColumnInteger(String name)

Add an Integer column with no restrictions on the allowable values.
param name Name of the column
addColumnInteger
public Builder addColumnInteger(String name, Integer minAllowedValue, Integer maxAllowedValue)

Add an Integer column with the specified min/max allowable values.
param name Name of the column
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
addColumnsInteger
public Builder addColumnsInteger(String... names)

Add multiple Integer columns with no restrictions on the min/max allowable values.
param names Names of the integer columns to add
addColumnsInteger
public Builder addColumnsInteger(String pattern, int minIdxInclusive, int maxIdxInclusive)

A convenience method for adding multiple Integer columns. For example, to add columns "myIntegerCol_0", "myIntegerCol_1", "myIntegerCol_2", use addColumnsInteger("myIntegerCol_%d", 0, 2).
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
addColumnsInteger
public Builder addColumnsInteger(String pattern, int minIdxInclusive, int maxIdxInclusive, Integer minAllowedValue, Integer maxAllowedValue)

A convenience method for adding multiple Integer columns with the specified min/max allowable values. For example, to add columns "myIntegerCol_0", "myIntegerCol_1", "myIntegerCol_2", use addColumnsInteger("myIntegerCol_%d", 0, 2).
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
addColumnCategorical
public Builder addColumnCategorical(String name, String... stateNames)

Add a Categorical column, with the specified state names.
param name Name of the column
param stateNames Names of the allowable states for this categorical column
addColumnCategorical
public Builder addColumnCategorical(String name, List<String> stateNames)

Add a Categorical column, with the specified state names.
param name Name of the column
param stateNames Names of the allowable states for this categorical column
addColumnLong
public Builder addColumnLong(String name)

Add a Long column, with no restrictions on the min/max values.
param name Name of the column
addColumnLong
public Builder addColumnLong(String name, Long minAllowedValue, Long maxAllowedValue)

Add a Long column with the specified min/max allowable values.
param name Name of the column
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
addColumnsLong
public Builder addColumnsLong(String... names)

Add multiple Long columns, with no restrictions on the allowable values.
param names Names of the Long columns to add
addColumnsLong
public Builder addColumnsLong(String pattern, int minIdxInclusive, int maxIdxInclusive)

A convenience method for adding multiple Long columns. For example, to add columns "myLongCol_0", "myLongCol_1", "myLongCol_2", use addColumnsLong("myLongCol_%d", 0, 2).
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
addColumnsLong
public Builder addColumnsLong(String pattern, int minIdxInclusive, int maxIdxInclusive, Long minAllowedValue,
Long maxAllowedValue)A convenience method for adding multiple Long columns. For example, to add columns “myLongCol_0”, “myLongCol_1”, “myLongCol_2”, use {@code addColumnsLong("myLongCol_%d",0,2)}
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
addColumn
public Builder addColumn(ColumnMetaData metaData)Add a column
param metaData metadata for this column
addColumnString
public Builder addColumnString(String name)Add a String column with no restrictions on the allowable values.
param name Name of the column
addColumnsString
public Builder addColumnsString(String... columnNames)Add multiple String columns with no restrictions on the allowable values
param columnNames Names of the String columns to add
addColumnString
public Builder addColumnString(String name, String regex, Integer minAllowableLength,
Integer maxAllowableLength)Add a String column with the specified restrictions
param name Name of the column
param regex Regex that the String must match in order to be considered valid. If null: no regex restriction
param minAllowableLength Minimum allowable length for the String to be considered valid
param maxAllowableLength Maximum allowable length for the String to be considered valid
addColumnsString
public Builder addColumnsString(String pattern, int minIdxInclusive, int maxIdxInclusive)A convenience method for adding multiple numbered String columns. For example, to add columns “myStringCol_0”, “myStringCol_1”, “myStringCol_2”, use {@code addColumnsString("myStringCol_%d",0,2)}
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
addColumnsString
public Builder addColumnsString(String pattern, int minIdxInclusive, int maxIdxInclusive, String regex,
Integer minAllowedLength, Integer maxAllowedLength)A convenience method for adding multiple numbered String columns. For example, to add columns “myStringCol_0”, “myStringCol_1”, “myStringCol_2”, use {@code addColumnsString("myStringCol_%d",0,2)}
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
param regex Regex that the String must match in order to be considered valid. If null: no regex restriction
param minAllowedLength Minimum allowed length of strings (inclusive). If null: no restriction
param maxAllowedLength Maximum allowed length of strings (inclusive). If null: no restriction
addColumnTime
public Builder addColumnTime(String columnName, TimeZone timeZone)Add a Time column with no restrictions on the min/max allowable times NOTE: Time columns are represented by LONG (epoch millisecond) values. For time values in human-readable formats, use String columns + StringToTimeTransform
param columnName Name of the column
param timeZone Time zone of the time column
addColumnTime
public Builder addColumnTime(String columnName, DateTimeZone timeZone)Add a Time column with no restrictions on the min/max allowable times NOTE: Time columns are represented by LONG (epoch millisecond) values. For time values in human-readable formats, use String columns + StringToTimeTransform
param columnName Name of the column
param timeZone Time zone of the time column
addColumnTime
public Builder addColumnTime(String columnName, DateTimeZone timeZone, Long minValidValue, Long maxValidValue)Add a Time column with the specified restrictions NOTE: Time columns are represented by LONG (epoch millisecond) values. For time values in human-readable formats, use String columns + StringToTimeTransform
param columnName Name of the column
param timeZone Time zone of the time column
param minValidValue Minimum allowable time (in milliseconds). May be null.
param maxValidValue Maximum allowable time (in milliseconds). May be null.
addColumnNDArray
public Builder addColumnNDArray(String columnName, long[] shape)Add an NDArray column
param columnName Name of the column
param shape shape of the NDArray column. Use -1 in entries to specify as “variable length” in that dimension
build
public Schema build()Create the Schema
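For reference, here is a minimal sketch of putting these builder methods together (the column names and allowed values below are illustrative, not part of the API):
import org.datavec.api.transform.schema.Schema;
import java.util.Arrays;

Schema inputDataSchema = new Schema.Builder()
    .addColumnString("DateTimeString")
    .addColumnsString("CustomerID", "MerchantID")
    .addColumnInteger("NumItemsInTransaction")
    .addColumnCategorical("MerchantCountryCode", Arrays.asList("USA", "CAN", "FR", "MX"))
    //Non-negative double; NaN and infinite values disallowed
    .addColumnDouble("TransactionAmountUSD", 0.0, null, false, false)
    .build();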
inferMultiple
public static Schema inferMultiple(List<List<Writable>> record)Infers a schema based on the record. The column names are based on indexing.
param record the record to infer from
return the inferred schema
infer
public static Schema infer(List<Writable> record)Infers a schema based on the record. The column names are based on indexing.
param record the record to infer from
return the inferred schema
inferSequenceMulti
public static SequenceSchema inferSequenceMulti(List<List<List<Writable>>> record)Infers a sequence schema based on the record
param record the record to infer the schema based on
return the inferred sequence schema
inferSequence
public static SequenceSchema inferSequence(List<List<Writable>> record)Infers a sequence schema based on the record
param record the record to infer the schema based on
return the inferred sequence schema
Read individual records from different formats.
Readers iterate over the records of a dataset in storage and load the data into DataVec. Their usefulness goes beyond reading individual entries: for example, you might want to train a text generator on an entire corpus, or programmatically compose two entries into a new record. Reader implementations are especially useful for complex file types or distributed storage mechanisms.
Readers return Writable classes that describe each column in a Record. These classes are used to convert each record to a tensor/NDArray format.
Each reader implementation extends BaseRecordReader and provides a simple API for selecting the next record in a dataset, acting similarly to iterators.
Useful methods include:
next: Return the next record as a list of Writable values.
nextRecord: Return a single Record, optionally with RecordMetaData.
reset: Reset the underlying iterator.
hasNext: Iterator method to determine if another record is available.
You can hook a custom RecordListener to a record reader for debugging or visualization purposes. Pass your custom listener to the addListener base method immediately after initializing your class.
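A minimal sketch of the basic read loop, with a logging listener attached (the CSV path here is illustrative):
import org.datavec.api.records.listener.impl.LogRecordListener;
import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.writable.Writable;
import java.io.File;
import java.util.List;

RecordReader recordReader = new CSVRecordReader();
recordReader.initialize(new FileSplit(new File("path/to/data.csv")));
recordReader.setListeners(new LogRecordListener()); //Log each record as it is read
while (recordReader.hasNext()) {
    List<Writable> record = recordReader.next();
    //Process the record here
}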
A RecordReader that wraps one RecordReader per pipeline, where each individual record is the concatenation of the records produced by the wrapped readers. It iterates over all of the underlying readers together: hasNext is the logical AND of the wrapped readers' hasNext, and next concatenates (via addAll) each reader's next record into a single record.
initialize
public void initialize(InputSplit split) throws IOException, InterruptedExceptionCombine multiple readers into a single reader. Records are read sequentially - thus if the first reader has 100 records, and the second reader has 200 records, ConcatenatingRecordReader will have 300 records.
File reader/writer
getCurrentLabel
public int getCurrentLabel()Return the current label. The index of the current file’s parent directory in the label list
return The index of the current file’s parent directory
Reads files line by line
Collection record reader. Mainly used for testing.
Collection record reader for sequences. Mainly used for testing.
initialize
public void initialize(InputSplit split) throws IOException, InterruptedExceptionparam records Collection of sequences. For example, List<List<List<Writable>>> where the two innermost lists form a sequence, and the outer list/collection is a list of sequences
Iterates through a list of strings, returning one record per string.
initialize
public void initialize(InputSplit split) throws IOException, InterruptedExceptionCalled once at initialization.
param split the split that defines the range of records to read
throws IOException
throws InterruptedException
initialize
public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedExceptionCalled once at initialization.
param conf a configuration for initialization
param split the split that defines the range of records to read
throws IOException
throws InterruptedException
hasNext
public boolean hasNext()Whether there are any more records available
return true if another record is available, false otherwise
reset
public void reset()Reset the record reader iterator to the start of the dataset
nextRecord
public Record nextRecord()Similar to {@link #next()}, but returns a Record object that may also include metadata, such as the source of the record
return The next record
record
public List<Writable> record(URI uri, DataInputStream dataInputStream) throws IOExceptionLoad a record from the given DataInputStream. Unlike {@link #next()} the internal state of the RecordReader is not modified. Implementations of this method should not close the DataInputStream
param uri Source URI of the record
param dataInputStream Input stream to read the record from
throws IOException if an error occurs during reading from the input stream
close
public void close() throws IOExceptionCloses this stream and releases any system resources associated with it. If the stream is already closed then invoking this method has no effect.
As noted in {@link AutoCloseable#close()}, cases where the close may fail require careful attention. It is strongly advised to relinquish the underlying resources and to internally mark the {@code Closeable} as closed, prior to throwing the {@code IOException}.
throws IOException if an I/O error occurs
setConf
public void setConf(Configuration conf)Set the configuration to be used by this object.
param conf
getConf
public Configuration getConf()Return the configuration used by this object.
Simple CSV record reader.
CSVRecordReader
public CSVRecordReader(int skipNumLines)Create a CSV record reader that skips the first n lines
param skipNumLines the number of lines to skip
A CSVRecordReader that can split each column into additional columns using regexes.
CSV Sequence Record Reader. This reader is intended to read sequences of data in CSV format, where each sequence is defined in its own file (and there are multiple files). Each line in the file represents one time step.
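A minimal sketch, assuming a directory of CSV files where each file contains one sequence (the path is illustrative):
import org.datavec.api.records.reader.SequenceRecordReader;
import org.datavec.api.records.reader.impl.csv.CSVSequenceRecordReader;
import org.datavec.api.split.FileSplit;
import java.io.File;

//Skip no header lines; fields separated by ","
SequenceRecordReader seqReader = new CSVSequenceRecordReader(0, ",");
seqReader.initialize(new FileSplit(new File("path/to/sequences/")));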
A sliding window of variable size across an entire CSV.
In practice, the sliding window size starts at 1, linearly increases to maxLinesPerSequence, then linearly decreases back to 1.
CSVVariableSlidingWindowRecordReader
public CSVVariableSlidingWindowRecordReader()No-arg constructor with the default number of lines per sequence (10)
Record reader for libsvm format, which is closely related to SVMLight format. Similar to scikit-learn we use a single reader for both formats, so this class is a subclass of SVMLightRecordReader.
Further details on the format can be found at
Matlab record reader
Record reader for SVMLight format, which can generally be described as
LABEL INDEX:VALUE INDEX:VALUE …
SVMLight format is well-suited to sparse data (e.g., bag-of-words) because it omits all features with value zero.
We support an “extended” version that allows for multiple targets (or labels) separated by a comma, as follows:
LABEL1,LABEL2,… INDEX:VALUE INDEX:VALUE …
This can be used to represent either multitask problems or multilabel problems with sparse binary labels (controlled via the “MULTILABEL” configuration option).
Like scikit-learn, we support both zero-based and one-based indexing.
Further details on the format can be found at
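A minimal sketch of configuring and initializing this reader; this assumes the reader's NUM_FEATURES configuration key (used to declare the feature-vector size), and the path and size here are illustrative:
import org.datavec.api.conf.Configuration;
import org.datavec.api.records.reader.impl.misc.SVMLightRecordReader;
import org.datavec.api.split.FileSplit;
import java.io.File;

SVMLightRecordReader reader = new SVMLightRecordReader();
Configuration conf = new Configuration();
conf.setInt(SVMLightRecordReader.NUM_FEATURES, 784); //Number of features per record
reader.initialize(conf, new FileSplit(new File("path/to/data.svmlight")));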
initialize
public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedExceptionMust be called before attempting to read records.
param conf DataVec configuration
param split FileSplit
throws IOException
throws InterruptedException
setConf
public void setConf(Configuration conf)Set configuration.
param conf DataVec configuration
throws IOException
throws InterruptedException
hasNext
public boolean hasNext()Helper function to help detect lines that are commented out. May read ahead and cache a line.
return
nextRecord
public Record nextRecord()Return next record as list of Writables.
return
RegexLineRecordReader: Read a file, one line at a time, and split it into fields using a regex. To load an entire file as a single sequence instead, use RegexSequenceRecordReader.
Example: Data in format “2016-01-01 23:59:59.001 1 DEBUG First entry message!” using regex String “(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) (\d+) ([A-Z]+) (.*)” would be split into 4 Text writables: [“2016-01-01 23:59:59.001”, “1”, “DEBUG”, “First entry message!”]
RegexSequenceRecordReader: Read an entire file (as a sequence), one line at a time and split each line into fields using a regex.
Example: Data in format “2016-01-01 23:59:59.001 1 DEBUG First entry message!” using regex String “(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) (\d+) ([A-Z]+) (.*)” would be split into 4 Text writables: [“2016-01-01 23:59:59.001”, “1”, “DEBUG”, “First entry message!”]
Lines that don’t match the provided regex can result in an exception (FailOnInvalid), be skipped silently (SkipInvalid), or be skipped with a logged warning (SkipInvalidWithWarning).
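A minimal sketch using the log-line regex from the example above (the file path is illustrative):
import org.datavec.api.records.reader.impl.regex.RegexLineRecordReader;
import org.datavec.api.split.FileSplit;
import java.io.File;

String regex = "(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3}) (\\d+) ([A-Z]+) (.*)";
RegexLineRecordReader reader = new RegexLineRecordReader(regex, 0); //0: skip no lines
reader.initialize(new FileSplit(new File("path/to/log.txt")));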
A record reader that wraps another record reader, allowing each record to have a transform process applied before being returned.
initialize
public void initialize(InputSplit split) throws IOException, InterruptedExceptionCalled once at initialization.
param split the split that defines the range of records to read
throws IOException
throws InterruptedException
initialize
public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedExceptionCalled once at initialization.
param conf a configuration for initialization
param split the split that defines the range of records to read
throws IOException
throws InterruptedException
hasNext
public boolean hasNext()Whether there are any more records available
return true if another record is available, false otherwise
reset
public void reset()Reset the record reader iterator to the start of the dataset
nextRecord
public Record nextRecord()Similar to {@link #next()}, but returns a Record object that may also include metadata, such as the source of the record
return The next record
record
public List<Writable> record(URI uri, DataInputStream dataInputStream) throws IOExceptionLoad a record from the given DataInputStream. Unlike {@link #next()} the internal state of the RecordReader is not modified. Implementations of this method should not close the DataInputStream
param uri Source URI of the record
param dataInputStream Input stream to read the record from
throws IOException if an error occurs during reading from the input stream
loadFromMetaData
public Record loadFromMetaData(RecordMetaData recordMetaData) throws IOExceptionLoad a single record from the given {@link RecordMetaData} instance. Note that for data that isn’t splittable (i.e., text data that needs to be scanned/split), it is more efficient to load multiple records at once using {@link #loadFromMetaData(List)}
param recordMetaData Metadata for the record that we want to load from
return Single record for the given RecordMetaData instance
throws IOException If I/O error occurs during loading
loadFromMetaData
public List<Record> loadFromMetaData(List<RecordMetaData> recordMetaDatas) throws IOExceptionLoad multiple records from the given list of {@link RecordMetaData} instances
param recordMetaDatas Metadata for the records that we want to load
return Multiple records for the given RecordMetaData instances
throws IOException If I/O error occurs during loading
setListeners
public void setListeners(RecordListener... listeners)Set the record listeners for this record reader.
param listeners Listeners to set
setListeners
public void setListeners(Collection<RecordListener> listeners)Set the record listeners for this record reader.
param listeners
close
public void close() throws IOExceptionCloses this stream and releases any system resources associated with it. If the stream is already closed then invoking this method has no effect.
As noted in {@link AutoCloseable#close()}, cases where the close may fail require careful attention. It is strongly advised to relinquish the underlying resources and to internally mark the {@code Closeable} as closed, prior to throwing the {@code IOException}.
throws IOException if an I/O error occurs
setConf
public void setConf(Configuration conf)Set the configuration to be used by this object.
param conf
getConf
public Configuration getConf()Return the configuration used by this object.
A sequence record reader that wraps another sequence record reader, allowing each sequence to be transformed before being returned.
setConf
public void setConf(Configuration conf)Set the configuration to be used by this object.
param conf
getConf
public Configuration getConf()Return the configuration used by this object.
batchesSupported
public boolean batchesSupported()Whether the record reader supports batch loading of records
return true if batch loading is supported
sequenceRecord
public List<List<Writable>> sequenceRecord()Returns a sequence record.
return a sequence of records
nextSequence
public SequenceRecord nextSequence()Similar to {@link #sequenceRecord()}, but returns a SequenceRecord object that may also include metadata, such as the source of the sequence
return The next sequence record
sequenceRecord
public List<List<Writable>> sequenceRecord(URI uri, DataInputStream dataInputStream) throws IOExceptionLoad a sequence record from the given DataInputStream. Unlike {@link #next()} the internal state of the RecordReader is not modified. Implementations of this method should not close the DataInputStream
param uri Source URI of the sequence record
param dataInputStream Input stream to read the sequence record from
throws IOException if an error occurs during reading from the input stream
loadSequenceFromMetaData
public SequenceRecord loadSequenceFromMetaData(RecordMetaData recordMetaData) throws IOExceptionLoad a single sequence record from the given {@link RecordMetaData} instance. Note that for data that isn’t splittable (i.e., text data that needs to be scanned/split), it is more efficient to load multiple records at once using {@link #loadSequenceFromMetaData(List)}
param recordMetaData Metadata for the sequence record that we want to load from
return Single sequence record for the given RecordMetaData instance
throws IOException If I/O error occurs during loading
loadSequenceFromMetaData
public List<SequenceRecord> loadSequenceFromMetaData(List<RecordMetaData> recordMetaDatas) throws IOExceptionLoad multiple sequence records from the given list of {@link RecordMetaData} instances
param recordMetaDatas Metadata for the records that we want to load
return Multiple sequence records for the given RecordMetaData instances
throws IOException If I/O error occurs during loading
initialize
public void initialize(InputSplit split) throws IOException, InterruptedExceptionCalled once at initialization.
param split the split that defines the range of records to read
throws IOException
throws InterruptedException
initialize
public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedExceptionCalled once at initialization.
param conf a configuration for initialization
param split the split that defines the range of records to read
throws IOException
throws InterruptedException
hasNext
public boolean hasNext()Whether there are any more records available
return true if another record is available, false otherwise
reset
public void reset()Reset the record reader iterator to the start of the dataset
nextRecord
public Record nextRecord()Similar to {@link #next()}, but returns a Record object that may also include metadata, such as the source of the record
return The next record
record
public List<Writable> record(URI uri, DataInputStream dataInputStream) throws IOExceptionLoad a record from the given DataInputStream. Unlike {@link #next()} the internal state of the RecordReader is not modified. Implementations of this method should not close the DataInputStream
param uri Source URI of the record
param dataInputStream Input stream to read the record from
throws IOException if an error occurs during reading from the input stream
loadFromMetaData
public Record loadFromMetaData(RecordMetaData recordMetaData) throws IOExceptionLoad a single record from the given {@link RecordMetaData} instance. Note that for data that isn’t splittable (i.e., text data that needs to be scanned/split), it is more efficient to load multiple records at once using {@link #loadFromMetaData(List)}
param recordMetaData Metadata for the record that we want to load from
return Single record for the given RecordMetaData instance
throws IOException If I/O error occurs during loading
loadFromMetaData
public List<Record> loadFromMetaData(List<RecordMetaData> recordMetaDatas) throws IOExceptionLoad multiple records from the given list of {@link RecordMetaData} instances
param recordMetaDatas Metadata for the records that we want to load
return Multiple records for the given RecordMetaData instances
throws IOException If I/O error occurs during loading
setListeners
public void setListeners(RecordListener... listeners)Set the record listeners for this record reader.
param listeners Listeners to set
setListeners
public void setListeners(Collection<RecordListener> listeners)Set the record listeners for this record reader.
param listeners
close
public void close() throws IOExceptionCloses this stream and releases any system resources associated with it. If the stream is already closed then invoking this method has no effect.
As noted in {@link AutoCloseable#close()}, cases where the close may fail require careful attention. It is strongly advised to relinquish the underlying resources and to internally mark the {@code Closeable} as closed, prior to throwing the {@code IOException}.
throws IOException if an I/O error occurs
Native audio file loader using FFmpeg.
Wav file loader
Image record reader. Reads images from the local file system and parses them to a given height and width. All images are rescaled and converted to the given height, width, and number of channels.
Also appends the label if specified (one-of-k encoding based on the directory structure, where each subdirectory of the root is an indexed label)
TFIDF record reader (wraps a tfidf vectorizer for delivering labels and conforming to the record reader interface)
Data wrangling and mapping from one schema to another.
One of the key tools in DataVec is transformations. DataVec helps the user map a dataset from one schema to another, and provides a list of operations to convert types, format data, and convert a 2D dataset to sequence data.
A transform process requires a Schema to successfully transform data. Both the schema and transform process classes come with a helper Builder class, which is useful for organizing code and avoiding complex constructors.
Combined, they look like the sample code below. Note how inputDataSchema is passed into the Builder constructor; a transform process cannot be constructed without it.
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.condition.ConditionOp;
import org.datavec.api.transform.condition.column.CategoricalColumnCondition;
import org.datavec.api.transform.condition.column.DoubleColumnCondition;
import org.datavec.api.transform.filter.ConditionFilter;
import org.datavec.api.transform.transform.time.DeriveColumnsFromTimeTransform;
import org.datavec.api.writable.DoubleWritable;
import org.joda.time.DateTimeFieldType;
import org.joda.time.DateTimeZone;
import java.util.Arrays;
import java.util.HashSet;
TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
.removeColumns("CustomerID","MerchantID")
.filter(new ConditionFilter(new CategoricalColumnCondition("MerchantCountryCode", ConditionOp.NotInSet, new HashSet<>(Arrays.asList("USA","CAN")))))
.conditionalReplaceValueTransform(
"TransactionAmountUSD", //Column to operate on
new DoubleWritable(0.0), //New value to use, when the condition is satisfied
new DoubleColumnCondition("TransactionAmountUSD",ConditionOp.LessThan, 0.0)) //Condition: amount < 0.0
.stringToTimeTransform("DateTimeString","yyyy-MM-dd HH:mm:ss.SSS", DateTimeZone.UTC) //Joda-Time pattern (yyyy-MM-dd, not YYYY-MM-DD)
.renameColumn("DateTimeString", "DateTime")
.transform(new DeriveColumnsFromTimeTransform.Builder("DateTime").addIntegerDerivedColumn("HourOfDay", DateTimeFieldType.hourOfDay()).build())
.removeColumns("DateTime")
.build();Different "backends" for executors are available. Using the tp transform process above, here's how you can execute it locally using plain DataVec.
import org.datavec.local.transforms.LocalTransformExecutor;
List<List<Writable>> processedData = LocalTransformExecutor.execute(originalData, tp);Each operation in a transform process represents a "step" in schema changes. Sometimes, the resulting transformation is not the intended result. You can debug this by printing each step in the transform tp with the following:
//Now, print the schema after each time step:
int numActions = tp.getActionList().size();
for(int i=0; i<numActions; i++ ){
System.out.println("\n\n==================================================");
System.out.println("-- Schema after step " + i + " (" + tp.getActionList().get(i) + ") --");
System.out.println(tp.getSchemaAfterStep(i));
}A TransformProcess defines an ordered list of transformations to be executed on some data
getActionList
public List<DataAction> getActionList()Get the action list that this transform process will execute
return The list of actions (transformations) in this process
getFinalSchema
public Schema getFinalSchema()Get the Schema of the data, after executing all of the steps in this transform process
return Final schema of the output data
getSchemaAfterStep
public Schema getSchemaAfterStep(int step)Return the schema after executing all steps up to and including the specified step. Steps are indexed from 0: so getSchemaAfterStep(0) is after one transform has been executed.
param step Index of the step
return Schema of the data, after that (and all prior) steps have been executed
execute
public List<Writable> execute(List<Writable> input)Execute the full sequence of transformations for a single example. May return null if the example is filtered. NOTE: Some TransformProcess operations cannot be done on examples individually. Most notably, ConvertToSequence and ConvertFromSequence operations require the full data set to be processed at once
param input Input example
return Transformed example, or null if it was filtered out
toJson
public String toJson()Convert the TransformProcess to a JSON string
return TransformProcess, as JSON
toYaml
public String toYaml()Convert the TransformProcess to a YAML string
return TransformProcess, as YAML
fromJson
public static TransformProcess fromJson(String json)Deserialize a JSON String (created by {@link #toJson()}) to a TransformProcess
return TransformProcess, from JSON
fromYaml
public static TransformProcess fromYaml(String yaml)Deserialize a YAML String (created by {@link #toYaml()}) to a TransformProcess
return TransformProcess, from YAML
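For example, a transform process can be round-tripped through either format:
String json = tp.toJson();
TransformProcess fromJson = TransformProcess.fromJson(json);

String yaml = tp.toYaml();
TransformProcess fromYaml = TransformProcess.fromYaml(yaml);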
transform
public Builder transform(Transform transform)Add a transform operation to be executed after the previously-added operations have been executed
param transform Transform to execute
inferCategories
public static List<String> inferCategories(RecordReader recordReader, int columnIndex)Infer the categories for the given record reader for a particular column. Note that each “column index” is a column in the context of: List<Writable> record = ...; record.get(columnIndex);
Note that anything passed in as a column will be automatically converted to a string for categorical purposes.
The expected input is strings or numbers (which have sensible toString() representations)
Note that the returned categories will be sorted alphabetically
param recordReader the record reader to iterate through
param columnIndex the column index to get categories for
return Categories for the column, sorted alphabetically
filter
public Builder filter(Filter filter)Add a filter operation to be executed after the previously-added operations have been executed
param filter Filter operation to execute
filter
public Builder filter(Condition condition)Add a filter operation, based on the specified condition.
If the condition is satisfied (returns true): remove the example or sequence. If the condition is not satisfied (returns false): keep the example or sequence
param condition Condition to filter on
removeColumns
public Builder removeColumns(String... columnNames)Remove all of the specified columns, by name
param columnNames Names of the columns to remove
removeColumns
public Builder removeColumns(Collection<String> columnNames)Remove all of the specified columns, by name
param columnNames Names of the columns to remove
removeAllColumnsExceptFor
public Builder removeAllColumnsExceptFor(String... columnNames)Remove all columns, except for those that are specified here
param columnNames Names of the columns to keep
removeAllColumnsExceptFor
public Builder removeAllColumnsExceptFor(Collection<String> columnNames)Remove all columns, except for those that are specified here
param columnNames Names of the columns to keep
renameColumn
public Builder renameColumn(String oldName, String newName)Rename a single column
param oldName Original column name
param newName New column name
renameColumns
public Builder renameColumns(List<String> oldNames, List<String> newNames)Rename multiple columns
param oldNames List of original column names
param newNames List of new column names
reorderColumns
public Builder reorderColumns(String... newOrder)Reorder the columns using a partial or complete new ordering. If only some of the column names are specified for the new order, the remaining columns will be placed at the end, according to their current relative ordering
param newOrder Names of the columns, in the order they will appear in the output
duplicateColumn
public Builder duplicateColumn(String column, String newName)Duplicate a single column
param column Name of the column to duplicate
param newName Name of the new (duplicate) column
duplicateColumns
public Builder duplicateColumns(List<String> columnNames, List<String> newNames)Duplicate a set of columns
param columnNames Names of the columns to duplicate
param newNames Names of the new (duplicated) columns
integerMathOp
public Builder integerMathOp(String column, MathOp mathOp, int scalar)Perform a mathematical operation (add, subtract, scalar max etc) on the specified integer column, with a scalar
param column The integer column to perform the operation on
param mathOp The mathematical operation
param scalar The scalar value to use in the mathematical operation
integerColumnsMathOp
public Builder integerColumnsMathOp(String newColumnName, MathOp mathOp, String... columnNames)Calculate and add a new integer column by performing a mathematical operation on a number of existing columns. New column is added to the end.
param newColumnName Name of the new/derived column
param mathOp Mathematical operation to execute on the columns
param columnNames Names of the columns to use in the mathematical operation
longMathOp
public Builder longMathOp(String columnName, MathOp mathOp, long scalar)Perform a mathematical operation (add, subtract, scalar max etc) on the specified long column, with a scalar
param columnName The long column to perform the operation on
param mathOp The mathematical operation
param scalar The scalar value to use in the mathematical operation
longColumnsMathOp
public Builder longColumnsMathOp(String newColumnName, MathOp mathOp, String... columnNames)Calculate and add a new long column by performing a mathematical operation on a number of existing columns. New column is added to the end.
param newColumnName Name of the new/derived column
param mathOp Mathematical operation to execute on the columns
param columnNames Names of the columns to use in the mathematical operation
floatMathOp
public Builder floatMathOp(String columnName, MathOp mathOp, float scalar)Perform a mathematical operation (add, subtract, scalar max etc) on the specified float column, with a scalar
param columnName The float column to perform the operation on
param mathOp The mathematical operation
param scalar The scalar value to use in the mathematical operation
floatColumnsMathOp
public Builder floatColumnsMathOp(String newColumnName, MathOp mathOp, String... columnNames)Calculate and add a new float column by performing a mathematical operation on a number of existing columns. New column is added to the end.
param newColumnName Name of the new/derived column
param mathOp Mathematical operation to execute on the columns
param columnNames Names of the columns to use in the mathematical operation
floatMathFunction
public Builder floatMathFunction(String columnName, MathFunction mathFunction)Perform a mathematical operation (such as sin(x), ceil(x), exp(x) etc) on a column
param columnName Column name to operate on
param mathFunction MathFunction to apply to the column
doubleMathOp
public Builder doubleMathOp(String columnName, MathOp mathOp, double scalar)Perform a mathematical operation (add, subtract, scalar max etc) on the specified double column, with a scalar
param columnName The double column to perform the operation on
param mathOp The mathematical operation
param scalar The scalar value to use in the mathematical operation
doubleColumnsMathOp
public Builder doubleColumnsMathOp(String newColumnName, MathOp mathOp, String... columnNames)Calculate and add a new double column by performing a mathematical operation on a number of existing columns. New column is added to the end.
param newColumnName Name of the new/derived column
param mathOp Mathematical operation to execute on the columns
param columnNames Names of the columns to use in the mathematical operation
doubleMathFunction
public Builder doubleMathFunction(String columnName, MathFunction mathFunction)Perform a mathematical operation (such as sin(x), ceil(x), exp(x) etc) on a column
param columnName Column name to operate on
param mathFunction MathFunction to apply to the column
timeMathOp
public Builder timeMathOp(String columnName, MathOp mathOp, long timeQuantity, TimeUnit timeUnit)Perform a mathematical operation (add, subtract, scalar min/max only) on the specified time column
param columnName The integer column to perform the operation on
param mathOp The mathematical operation
param timeQuantity The quantity used in the mathematical op
param timeUnit The unit that timeQuantity is specified in
categoricalToOneHot
public Builder categoricalToOneHot(String... columnNames)Convert the specified column(s) from a categorical representation to a one-hot representation. Each categorical column is replaced by multiple new columns, one per category state.
param columnNames Names of the categorical column(s) to convert to a one-hot representation
categoricalToInteger
public Builder categoricalToInteger(String... columnNames)Convert the specified column(s) from a categorical representation to an integer representation. This will replace the specified categorical column(s) with an integer representation, where each integer has a value from 0 to numCategories-1.
param columnNames Name of the categorical column(s) to convert to an integer representation
integerToCategorical
public Builder integerToCategorical(String columnName, List<String> categoryStateNames)Convert the specified column from an integer representation (assumed to take values 0 to numCategories-1) to a categorical representation, given the specified state names
param columnName Name of the column to convert
param categoryStateNames Names of the states for the categorical column
integerToCategorical
public Builder integerToCategorical(String columnName, Map<Integer, String> categoryIndexNameMap)Convert the specified column from an integer representation to a categorical representation, given the specified mapping between integer indexes and state names
param columnName Name of the column to convert
param categoryIndexNameMap Names of the states for the categorical column
integerToOneHot
public Builder integerToOneHot(String columnName, int minValue, int maxValue)Convert an integer column to a set of one-hot columns, based on the value in the integer column
param columnName Name of the integer column
param minValue Minimum value possible for the integer column (inclusive)
param maxValue Maximum value possible for the integer column (inclusive)
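As an illustrative sketch, these categorical conversions compose naturally in a builder (the schema, column name, and state names below are hypothetical):
import java.util.Arrays;

TransformProcess tp = new TransformProcess.Builder(schema)
    .integerToCategorical("species", Arrays.asList("setosa", "versicolor", "virginica")) //0-2 -> states
    .categoricalToOneHot("species") //states -> three one-hot columns
    .build();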
addConstantColumn
public Builder addConstantColumn(String newColumnName, ColumnType newColumnType, Writable fixedValue)Add a new column, where all values in the column are identical and as specified.
param newColumnName Name of the new column
param newColumnType Type of the new column
param fixedValue Value in the new column for all records
addConstantDoubleColumn
public Builder addConstantDoubleColumn(String newColumnName, double value)Add a new double column, where the value for that column (for all records) is identical
param newColumnName Name of the new column
param value Value in the new column for all records
addConstantIntegerColumn
public Builder addConstantIntegerColumn(String newColumnName, int value)Add a new integer column, where the value for that column (for all records) is identical
param newColumnName Name of the new column
param value Value of the new column for all records
addConstantLongColumn
public Builder addConstantLongColumn(String newColumnName, long value)Add a new long column, where the value for that column (for all records) is identical
param newColumnName Name of the new column
param value Value in the new column for all records
convertToString
public Builder convertToString(String inputColumn)Convert the specified column to a string.
param inputColumn the input column to convert
return builder pattern
convertToDouble
public Builder convertToDouble(String inputColumn)Convert the specified column to a double.
param inputColumn the input column to convert
return builder pattern
convertToInteger
public Builder convertToInteger(String inputColumn)Convert the specified column to an integer.
param inputColumn the input column to convert
return builder pattern
normalize
public Builder normalize(String column, Normalize type, DataAnalysis da)Normalize the specified column with a given type of normalization
param column Column to normalize
param type Type of normalization to apply
param da DataAnalysis object
convertToSequence
public Builder convertToSequence(String keyColumn, SequenceComparator comparator)Convert a set of independent records/examples into a sequence, according to some key. Within each sequence, values are ordered using the provided {@link SequenceComparator}
param keyColumn Column to use as a key (values with the same key will be combined into sequences)
param comparator A SequenceComparator to order the values within each sequence (for example, by time or String order)
convertToSequence
public Builder convertToSequence()Convert a set of independent records/examples into a sequence; each example is simply treated as a sequence of length 1, without any join/group operations. Note that more commonly, joining/grouping is required; use {@link #convertToSequence(List, SequenceComparator)} for this functionality
convertToSequence
public Builder convertToSequence(List<String> keyColumns, SequenceComparator comparator)Convert a set of independent records/examples into a sequence, where each sequence is grouped according to one or more key values (i.e., the values in one or more columns). Within each sequence, values are ordered using the provided {@link SequenceComparator}
param keyColumns Columns to use as a key (values with the same key will be combined into sequences)
param comparator A SequenceComparator to order the values within each sequence (for example, by time or String order)
convertFromSequence
public Builder convertFromSequence()Convert a sequence to a set of individual values (by treating each value in each sequence as a separate example)
splitSequence
public Builder splitSequence(SequenceSplit split)Split sequences into 1 or more other sequences. Used for example to split large sequences into a set of smaller sequences
param split SequenceSplit that defines how splits will occur
trimSequence
public Builder trimSequence(int numStepsToTrim, boolean trimFromStart)SequenceTrimTranform removes the first or last N values in a sequence. Note that the resulting sequence may be of length 0, if the input sequence length is less than or equal to N.
param numStepsToTrim Number of time steps to trim from the sequence
param trimFromStart If true: Trim values from the start of the sequence. If false: trim values from the end.
offsetSequence
public Builder offsetSequence(List<String> columnsToOffset, int offsetAmount,
SequenceOffsetTransform.OperationType operationType)Perform a sequence offset operation on the specified columns. Note that this also truncates sequences by the specified offset amount by default. Use {@code transform(new SequenceOffsetTransform(…))} to change this. See {@link SequenceOffsetTransform} for details on exactly what this operation does and how.
param columnsToOffset Columns to offset
param offsetAmount Amount to offset the specified columns by (positive offset: ‘columnsToOffset’ are moved to later time steps)
param operationType Whether the offset should be done in-place or by adding a new column
reduce
public Builder reduce(IAssociativeReducer reducer)Reduce (i.e., aggregate/combine) a set of examples (typically by key). Note: In the current implementation, reduction operations can be performed only on standard (i.e., non-sequence) data
param reducer Reducer to use
reduceSequence
public Builder reduceSequence(IAssociativeReducer reducer)Reduce (i.e., aggregate/combine) a set of sequence examples - for each sequence individually. Note: This method results in non-sequence data. If you would instead prefer sequences of length 1 after the reduction, use {@code transform(new ReduceSequenceTransform(reducer))}.
param reducer Reducer to use to reduce each window
reduceSequenceByWindow
public Builder reduceSequenceByWindow(IAssociativeReducer reducer, WindowFunction windowFunction)Reduce (i.e., aggregate/combine) a set of sequence examples - for each sequence individually - using a window function. For example, take all records/examples in each 24-hour period (i.e., using the window function), and convert them into a single value (using the reducer). In this example, the output is a sequence, with a time period of 24 hours.
param reducer Reducer to use to reduce each window
param windowFunction Window function to apply to each sequence individually
sequenceMovingWindowReduce
public Builder sequenceMovingWindowReduce(String columnName, int lookback, ReduceOp op)SequenceMovingWindowReduceTransform: Adds a new column, where the value is derived by (a) using a window of the last N values in a single column, and (b) applying a reduction op on the window to calculate a new value. For example, this transform can be used to implement a simple moving average of the last N values, or determine the minimum or maximum value in the last N time steps.
For example, for a simple moving average, length 20: {@code new SequenceMovingWindowReduceTransform("myCol", 20, ReduceOp.Mean)}
param columnName Column name to perform windowing on
param lookback Look back period for windowing
param op Reduction operation to perform on each window
calculateSortedRank
public Builder calculateSortedRank(String newColumnName, String sortOnColumn, WritableComparator comparator)CalculateSortedRank: calculate the rank of each example, after sorting the examples. For example, we might have some numerical “score” column, and we want to know the rank (sort order) of each example according to that column. The rank of each example (after sorting) will be added in a new Long column. Indexing is done from 0; examples will have values 0 to dataSetSize-1.
Currently, CalculateSortedRank can only be applied on standard (i.e., non-sequence) data Furthermore, the current implementation can only sort on one column
param newColumnName Name of the new column (will contain the rank for each example)
param sortOnColumn Column to sort on
param comparator Comparator used to sort examples
calculateSortedRank
public Builder calculateSortedRank(String newColumnName, String sortOnColumn, WritableComparator comparator,
boolean ascending)CalculateSortedRank: calculate the rank of each example, after sorting the examples. For example, we might have some numerical “score” column, and we want to know the rank (sort order) of each example according to that column. The rank of each example (after sorting) will be added in a new Long column. Indexing is done from 0; examples will have values 0 to dataSetSize-1.
Currently, CalculateSortedRank can only be applied on standard (i.e., non-sequence) data Furthermore, the current implementation can only sort on one column
param newColumnName Name of the new column (will contain the rank for each example)
param sortOnColumn Column to sort on
param comparator Comparator used to sort examples
param ascending If true: sort ascending. False: descending
stringToCategorical
public Builder stringToCategorical(String columnName, List<String> stateNames)Convert the specified String column to a categorical column. The state names must be provided.
param columnName Name of the String column to convert to categorical
param stateNames State names of the category
stringRemoveWhitespaceTransform
public Builder stringRemoveWhitespaceTransform(String columnName)Remove all whitespace characters from the values in the specified String column
param columnName Name of the column to remove whitespace from
stringMapTransform
public Builder stringMapTransform(String columnName, Map<String, String> mapping)Replace one or more String values in the specified column with new values.
Keys in the map are the original values; the values in the map are their replacements. If a String appears in the data but does not appear in the provided map (as a key), that String value will not be modified.
param columnName Name of the column in which to do replacement
param mapping Map of oldValues -> newValues
stringToTimeTransform
public Builder stringToTimeTransform(String column, String format, DateTimeZone dateTimeZone)Convert a String column (containing a date/time String) to a time column (by parsing the date/time String)
param column String column containing the date/time Strings
param format Format of the strings. Time format is specified as per http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html
param dateTimeZone Timezone of the column
stringToTimeTransform
public Builder stringToTimeTransform(String column, String format, DateTimeZone dateTimeZone, Locale locale)Convert a String column (containing a date/time String) to a time column (by parsing the date/time String)
param column String column containing the date/time Strings
param format Format of the strings. Time format is specified as per http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html
param dateTimeZone Timezone of the column
param locale Locale of the column
appendStringColumnTransform
public Builder appendStringColumnTransform(String column, String toAppend)Append a String to a specified column
param column Column to append the value to
param toAppend String to append to the end of each writable
conditionalReplaceValueTransform
public Builder conditionalReplaceValueTransform(String column, Writable newValue, Condition condition)Replace the values in a specified column with a specified new value, if some condition holds. If the condition does not hold, the original values are not modified.
param column Column to operate on
param newValue Value to use as replacement, if condition is satisfied
param condition Condition that must be satisfied for replacement
conditionalReplaceValueTransformWithDefault
public Builder conditionalReplaceValueTransformWithDefault(String column, Writable yesVal, Writable noVal, Condition condition)Replace the values in a specified column with a specified “yes” value, if some condition holds. Replace it with a “no” value, otherwise.
param column Column to operate on
param yesVal Value to use as replacement, if condition is satisfied
param noVal Value to use as replacement, if condition is not satisfied
param condition Condition that must be satisfied for replacement
conditionalCopyValueTransform
public Builder conditionalCopyValueTransform(String columnToReplace, String sourceColumn, Condition condition)Replace the value in a specified column with a new value taken from another column, if a condition is satisfied/true. Note that the condition can be any generic condition, including on other column(s), different to the column that will be modified if the condition is satisfied/true.
param columnToReplace Name of the column in which values will be replaced (if condition is satisfied)
param sourceColumn Name of the column from which the new values will be taken
param condition Condition to use
replaceStringTransform
public Builder replaceStringTransform(String columnName, Map<String, String> mapping)Replace one or more String values in the specified column that match regular expressions.
Keys in the map are the regular expressions; the Values in the map are their String replacements. For example:
| Original | Regex | Replacement | Result |
| --- | --- | --- | --- |
| Data_Vec | _ | (empty string) | DataVec |
| B1C2T3 | \d | one | BoneConeTone |
| ' 4.25 ' | ^\s+\|\s+$ | (empty string) | '4.25' |
param columnName Name of the column in which to do replacement
param mapping Map of old values or regular expression to new values
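A small sketch combining the two replacements from the table above, on a hypothetical column “myCol” (the schema variable is also hypothetical):
import java.util.HashMap;
import java.util.Map;

Map<String, String> replacements = new HashMap<>();
replacements.put("_", "");            //Remove underscores
replacements.put("^\\s+|\\s+$", "");  //Trim leading/trailing whitespace

TransformProcess tp = new TransformProcess.Builder(schema)
    .replaceStringTransform("myCol", replacements)
    .build();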
ndArrayScalarOpTransform
public Builder ndArrayScalarOpTransform(String columnName, MathOp op, double value)Element-wise NDArray math operation (add, subtract, etc) on an NDArray column
param columnName Name of the NDArray column to perform the operation on
param op Operation to perform
param value Value for the operation
ndArrayColumnsMathOpTransform
public Builder ndArrayColumnsMathOpTransform(String newColumnName, MathOp mathOp, String... columnNames)Perform an element-wise mathematical operation (such as add, subtract, multiply) on NDArray columns. The existing columns are unchanged; a new NDArray column is added
param newColumnName Name of the new NDArray column
param mathOp Operation to perform
param columnNames Name of the columns used as input to the operation
ndArrayMathFunctionTransform
public Builder ndArrayMathFunctionTransform(String columnName, MathFunction mathFunction)Apply an element-wise mathematical function (sin, tanh, abs etc) to an NDArray column. This operation is performed in place.
param columnName Name of the column to perform the operation on
param mathFunction Mathematical function to apply
ndArrayDistanceTransform
public Builder ndArrayDistanceTransform(String newColumnName, Distance distance, String firstCol,
String secondCol)Calculate a distance (cosine similarity, Euclidean, Manhattan) on two equal-sized NDArray columns. This operation adds a new Double column (with the specified name) with the result.
param newColumnName Name of the new column (result) to add
param distance Distance to apply
param firstCol first column to use in the distance calculation
param secondCol second column to use in the distance calculation
firstDigitTransform
public Builder firstDigitTransform(String inputColumn, String outputColumn)FirstDigitTransform converts a column to a categorical column, with values being the first digit of the number. For example, “3.1415” becomes “3” and “2.0” becomes “2”. Negative numbers ignore the sign: “-7.123” becomes “7”. Note that two {@link FirstDigitTransform.Mode}s are supported, which determine how non-numerical entries should be handled: EXCEPTION_ON_INVALID: output has 10 category values (“0”, …, “9”), and any non-numerical values result in an exception. INCLUDE_OTHER_CATEGORY: output has 11 category values (“0”, …, “9”, “Other”); all non-numerical values are mapped to “Other”
FirstDigitTransform is useful (combined with {@link CategoricalToOneHotTransform} and reductions) to implement Benford’s law.
param inputColumn Input column name
param outputColumn Output column name. If same as input, input column is replaced
firstDigitTransform
public Builder firstDigitTransform(String inputColumn, String outputColumn, FirstDigitTransform.Mode mode)FirstDigitTransform converts a column to a categorical column, with values being the first digit of the number. For example, “3.1415” becomes “3” and “2.0” becomes “2”. Negative numbers ignore the sign: “-7.123” becomes “7”. Note that two {@link FirstDigitTransform.Mode}s are supported, which determine how non-numerical entries should be handled: EXCEPTION_ON_INVALID: output has 10 category values (“0”, …, “9”), and any non-numerical values result in an exception. INCLUDE_OTHER_CATEGORY: output has 11 category values (“0”, …, “9”, “Other”); all non-numerical values are mapped to “Other”
FirstDigitTransform is useful (combined with {@link CategoricalToOneHotTransform} and reductions) to implement Benford’s law.
param inputColumn Input column name
param outputColumn Output column name. If same as input, input column is replaced
param mode See {- link FirstDigitTransform.Mode}
build
public TransformProcess build()Create the TransformProcess object
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence
outputColumnName
public String outputColumnName()The output column name after the operation has been applied
return the output column name
columnName
public String columnName()The output column names. This will often be the same as the input
return the output column names
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence
outputColumnName
public String outputColumnName()The output column name after the operation has been applied
return the output column name
columnName
public String columnName()The output column names. This will often be the same as the input
return the output column names
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence
Pivot transform operates on two columns: a categorical column that acts as a key, and another column that contains a value. Essentially, the Pivot transform takes key-value pairs and breaks them out into separate columns.
For example, with schema [col0, key, value, col3] and key values in {a,b,c}, the output schema is [col0, key[a], key[b], key[c], col3], and the input (col0Val, b, x, col3Val) is mapped to (col0Val, 0, x, 0, col3Val).
When expanding columns, a default value is used, for example 0 for numerical columns.
transform
public Schema transform(Schema inputSchema)param keyColumnName Key column to expand
param valueColumnName Name of the column that contains the value
Convert a String column to a categorical column
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence
Add a new column, where the values in that column for all records are identical (according to the specified value)
Duplicate one or more columns. The duplicated columns are placed immediately after the original columns
transform
public Schema transform(Schema inputSchema)param columnsToDuplicate List of columns to duplicate
param newColumnNames List of names for the new (duplicate) columns
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence
outputColumnName
public String outputColumnName()The output column name after the operation has been applied
return the output column name
columnName
public String columnName()The output column names. This will often be the same as the input
return the output column names
Transform that removes all columns except for those that are explicitly specified as ones to keep. To specify the columns to remove instead, use RemoveColumnsTransform.
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence
outputColumnName
public String outputColumnName()The output column name after the operation has been applied
return the output column name
columnName
public String columnName()The output column names. This will often be the same as the input
return the output column names
Remove the specified columns from the data. To specify only the columns to keep, use RemoveAllColumnsExceptForTransform.
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence
outputColumnName
public String outputColumnName()The output column name after the operation has been applied
return the output column name
columnName
public String columnName()The output column names. This will often be the same as the input
return the output column names
Rename one or more columns
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence
outputColumnName
public String outputColumnName()The output column name after the operation has been applied
return the output column name
columnName
public String columnName()The output column names. This will often be the same as the input
return the output column names
Rearrange the order of the columns. Note: A partial list of columns can be used here. Any columns that are not explicitly mentioned will be placed after those that are in the output, without changing their relative order.
transform
public Schema transform(Schema inputSchema)param newOrder A partial or complete order of the columns in the output
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence
outputColumnName
public String outputColumnName()The output column name after the operation has been applied
return the output column name
columnName
public String columnName()The output column names. This will often be the same as the input
return the output column names
Replace the value in a specified column with a new value taken from another column, if a condition is satisfied/true. Note that the condition can be any generic condition, including on other column(s), different to the column that will be modified if the condition is satisfied/true.
Note: For sequences, this transform uses the convention that each step in the sequence is passed to the condition, and replaced (or not) separately (i.e., Condition.condition(List) is used on each time step individually)
transform
public Schema transform(Schema inputSchema)param columnToReplace Name of the column in which to replace the old value
param sourceColumn Name of the column to get the new value from
param condition Condition
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence
outputColumnName
public String outputColumnName()The output column name after the operation has been applied
return the output column name
columnName
public String columnName()The output column names This will often be the same as the input
return the output column names
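A minimal sketch (hypothetical column names, schema assumed) of copying a value from another column when a condition holds:

```java
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.condition.ConditionOp;
import org.datavec.api.transform.condition.column.DoubleColumnCondition;

// If "price" is negative, replace it with the value from the "listPrice" column
TransformProcess tp = new TransformProcess.Builder(schema)
    .conditionalCopyValueTransform("price", "listPrice",
        new DoubleColumnCondition("price", ConditionOp.LessThan, 0.0))
    .build();
```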
Replace the value in a specified column with a new value, if a condition is satisfied/true. Note that the condition can be any generic condition, including on other column(s), different to the column that will be modified if the condition is satisfied/true.
Note: For sequences, this transform uses the convention that each step in the sequence is passed to the condition, and replaced (or not) separately (i.e., Condition.condition(List) is used on each time step individually)
transform
public Schema transform(Schema inputSchema)param columnToReplace Name of the column in which to replace the old value with ‘newValue’, if the condition holds
param newValue New value to use
param condition Condition
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence the sequence to transform
outputColumnName
public String outputColumnName()The output column name after the operation has been applied
return the output column name
columnName
public String columnName()The output column names. This will often be the same as the input column names
return the output column names
Replace the value in a specified column with a ‘yes’ value, if a condition is satisfied/true. Replace the value of this same column with a ‘no’ value otherwise. Note that the condition can be any generic condition, including on other column(s), different to the column that will be modified if the condition is satisfied/true.
Note: For sequences, this transform uses the convention that each step in the sequence is passed to the condition, and replaced (or not) separately (i.e., Condition.condition(List) is used on each time step individually)
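A minimal sketch of conditional replacement with a fixed new value (hypothetical column name, schema assumed; the yes/no default variant above follows the same builder pattern):

```java
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.condition.ConditionOp;
import org.datavec.api.transform.condition.column.DoubleColumnCondition;
import org.datavec.api.writable.DoubleWritable;

// Replace any negative "amount" value with 0.0
TransformProcess tp = new TransformProcess.Builder(schema)
    .conditionalReplaceValueTransform("amount", new DoubleWritable(0.0),
        new DoubleColumnCondition("amount", ConditionOp.LessThan, 0.0))
    .build();
```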
Convert any value to a Double.
map
public DoubleWritable map(Writable writable)param column Name of the column to convert to a Double column
Add a new double column, calculated from one or more other columns. A new column (with the specified name) is added as the final column of the output. No other columns are modified. For example, if newColumnName==”newCol”, mathOp==Add, and columns=={“col1”,”col2”}, then the output column with name “newCol” has value col1+col2.
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence the sequence to transform
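For instance, a hedged sketch (hypothetical column names, schema assumed) of deriving a new double column from two existing ones:

```java
import org.datavec.api.transform.MathOp;
import org.datavec.api.transform.TransformProcess;

// New column "total" = price + tax; "price" and "tax" are unchanged
TransformProcess tp = new TransformProcess.Builder(schema)
    .doubleColumnsMathOp("total", MathOp.Add, "price", "tax")
    .build();
```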
A simple transform to do common mathematical operations, such as sin(x), ceil(x), etc.
Double mathematical operation. This is an in-place operation of the double column value and a double scalar.
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence the sequence to transform
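A brief sketch of both in-place double operations (hypothetical column names, schema assumed):

```java
import org.datavec.api.transform.MathFunction;
import org.datavec.api.transform.MathOp;
import org.datavec.api.transform.TransformProcess;

TransformProcess tp = new TransformProcess.Builder(schema)
    .doubleMathFunction("angle", MathFunction.SIN) // in-place: angle = sin(angle)
    .doubleMathOp("price", MathOp.Multiply, 1.1)   // in-place: price = price * 1.1
    .build();
```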
Normalize by taking scale * log2((in-columnMin)/(mean-columnMin) + 1). Maps values in range (columnMin to infinity) to (0 to infinity). Most suitable for values with a geometric/negative exponential type distribution.
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
Normalizer to map (min to max) -> (newMin to newMax) linearly.
Mathematically: (newMax-newMin)/(max-min) * (x-min) + newMin
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
Normalize using (x-mean)/stdev. Also known as a standard score, standardization, etc.
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
Normalize by subtracting the mean
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
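The normalizers above are typically applied via the builder's normalize step, driven by a previously computed DataAnalysis. A hedged sketch (column name hypothetical; the Normalize enum's package location may differ across DataVec versions):

```java
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.analysis.DataAnalysis;
import org.datavec.api.transform.transform.normalize.Normalize; // package may vary by version
import org.datavec.spark.transform.AnalyzeSpark;

// Statistics (mean, stdev, min, max) come from an analysis pass over the data
DataAnalysis analysis = AnalyzeSpark.analyze(schema, javaRdd);

TransformProcess tp = new TransformProcess.Builder(schema)
    .normalize("price", Normalize.Standardize, analysis)
    .build();
```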
Convert any value to an Integer.
map
public IntWritable map(Writable writable)param column Name of the column to convert to an integer
Add a new integer column, calculated from one or more other columns. A new column (with the specified name) is added as the final column of the output. No other columns are modified. For example, if newColumnName==”newCol”, mathOp==MathOp.Add, and columns=={”col1”,”col2”}, then the output column with name ”newCol” has value col1+col2. Note: Division here uses integer division; use DoubleColumnsMathOpTransform if a decimal output value is required.
toString
public String toString()param newColumnName Name of the new column (output column)
param mathOp Mathematical operation. Only Add/Subtract/Multiply/Divide/Modulus is allowed here
param columns Columns to use in the mathematical operation
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence the sequence to transform
Integer mathematical operation. This is an in-place operation of the integer column value and an integer scalar.
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
Convert an integer column to a set of one-hot columns.
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence the sequence to transform
outputColumnName
public String outputColumnName()The output column name after the operation has been applied
return the output column name
columnName
public String columnName()The output column names. This will often be the same as the input column names
return the output column names
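For example, a minimal sketch (hypothetical column name, schema assumed) of one-hot encoding an integer column with a known value range:

```java
import org.datavec.api.transform.TransformProcess;

// Expand the integer column "rating" (values 0..4) into 5 binary columns
TransformProcess tp = new TransformProcess.Builder(schema)
    .integerToOneHot("rating", 0, 4)
    .build();
```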
Replace an empty/missing integer with a certain value.
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
Replace an invalid (non-integer) value in a column with a specified integer
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
Add a new long column, calculated from one or more other columns. A new column (with the specified name) is added as the final column of the output. No other columns are modified. For example, if newColumnName==”newCol”, mathOp==MathOp.Add, and columns=={”col1”,”col2”}, then the output column with name ”newCol” has value col1+col2. Note: Division here uses long (integer) division; use DoubleColumnsMathOpTransform if a decimal output value is required.
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence the sequence to transform
Long mathematical operation. This is an in-place operation of the long column value and a long scalar.
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
Convert each text value in a sequence to a longer sequence of integer indices. For example, “abc” would be converted to [1, 2, 3]. Values in other columns will be duplicated.
Convert each text value in a sequence to a longer sequence of integer indices. For example, “zero one two” would be converted to [0, 1, 2]. Values in other columns will be duplicated.
SequenceDifferenceTransform: for an input sequence, calculate the difference on one column. For each time t, calculate someColumn(t) - someColumn(t-s), where s >= 1 is the ‘lookback’ period.
Note: at t=0 (i.e., the first step in a sequence; or more generally, for all times t < s), there is no previous value. Two options are available for handling these time steps:
Default: output = someColumn(t) - someColumn(max(t-s, 0))
SpecifiedValue: output = someColumn(t) - someColumn(t-s) if t-s >= 0, or a custom Writable object (for example, a DoubleWritable(0) or NullWritable) otherwise.
Note: this is an in-place operation: i.e., the values in each column are modified. If the original values are needed, make a copy of the column first and apply the difference operation in-place on the copy.
outputColumnName
public String outputColumnName()Create a SequenceDifferenceTransform with default lookback of 1, and using FirstStepMode.Default. Output column name is the same as the input column name.
param columnName Name of the column to perform the operation on.
columnName
public String columnName()The output column names This will often be the same as the input
return the output column names
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence the sequence to transform
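A hedged sketch (hypothetical column name, schema assumed) using the default constructor described above; the transform's package location may differ across DataVec versions:

```java
import org.datavec.api.transform.TransformProcess;
// Package location may differ across DataVec versions
import org.datavec.api.transform.transform.sequence.difference.SequenceDifferenceTransform;

// In-place first difference: sensor(t) - sensor(t-1), lookback of 1
TransformProcess tp = new TransformProcess.Builder(schema)
    .transform(new SequenceDifferenceTransform("sensor"))
    .build();
```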
SequenceMovingWindowReduceTransform adds a new column, where the value is derived by (a) taking a window of the last N values in a single column, and (b) applying a reduction op on the window to calculate a new value. For example, this transform can be used to implement a simple moving average of the last N values, or to determine the minimum or maximum values in the last N time steps.
defaultOutputColumnName
public static String defaultOutputColumnName(String originalName, int lookback, ReduceOp op)Enumeration to specify how the first few time steps are handled: for example, for a lookback period of 20, how should the first 19 output values be calculated? Default: perform the reduction as normal, with as many values as are available. SpecifiedValue: use the given/specified value instead of the actual output value; for example, you could assign values of 0 or NullWritable to positions 0 through 18 of the output.
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence the sequence to transform
outputColumnName
public String outputColumnName()The output column name after the operation has been applied
return the output column name
columnName
public String columnName()The output column names. This will often be the same as the input column names
return the output column names
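A hedged sketch of a 5-step moving average (hypothetical column name, schema assumed); the constructor signature and package shown here are assumptions and may differ across DataVec versions:

```java
import org.datavec.api.transform.ReduceOp;
import org.datavec.api.transform.TransformProcess;
// Package location is an assumption and may differ across DataVec versions
import org.datavec.api.transform.transform.sequence.SequenceMovingWindowReduceTransform;

// New column: mean of the last 5 values of "sensor" (assumed constructor: column, lookback, op)
TransformProcess tp = new TransformProcess.Builder(schema)
    .transform(new SequenceMovingWindowReduceTransform("sensor", 5, ReduceOp.Mean))
    .build();
```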
Sequence offset transform: takes a sequence and shifts the values in one or more columns by a specified number of time steps. It has 2 modes of operation (OperationType enum) with respect to the columns it operates on: InPlace: operations may be performed in-place, modifying the values in the specified columns. NewColumn: operations may produce new columns, with the original (source) columns remaining unmodified.
Additionally, there are 2 modes for handling values outside the original sequence (EdgeHandling enum): TrimSequence: the entire sequence is trimmed (at the start or end) by the specified number of steps. SpecifiedValue: any values outside of the original sequence are given a specified value.
Note 1: Offsets are specified as follows. Positive offsets move the values in the specified columns to a later time: earlier time steps are either trimmed or given the specified value, and the last values in these columns are truncated/removed.
Note 2: Care must be taken when using TrimSequence: for example, if we chain multiple sequence offset transforms on one dataset, we may end up trimming much more than we want. In this case, it may be better to use SpecifiedValue instead.
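A hedged sketch of offsetting a column via the builder (hypothetical column name, schema assumed; the enum's package location is an assumption):

```java
import java.util.Arrays;
import org.datavec.api.transform.TransformProcess;
// Package location is an assumption and may differ across DataVec versions
import org.datavec.api.transform.sequence.SequenceOffsetTransform;

// Shift "sensor" one time step later, writing the result to a new column
TransformProcess tp = new TransformProcess.Builder(schema)
    .offsetSequence(Arrays.asList("sensor"), 1,
        SequenceOffsetTransform.OperationType.NewColumn)
    .build();
```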
Append a String to the values in a single column
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
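For instance (hypothetical column name and suffix, schema assumed):

```java
import org.datavec.api.transform.TransformProcess;

// Append a constant suffix to every value in the "id" column
TransformProcess tp = new TransformProcess.Builder(schema)
    .appendStringColumnTransform("id", "_processed")
    .build();
```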
Change the case (e.g., to all lower case) of a String column.
Concatenate the values of one or more String columns into a new String column. The constituent String columns are retained, so the user must remove them manually, if desired.
transform
public Schema transform(Schema inputSchema)param columnsToConcatenate The names of the columns to concatenate
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence the sequence to transform
outputColumnName
public String outputColumnName()The output column name after the operation has been applied
return the output column name
columnName
public String columnName()The output column names. This will often be the same as the input column names
return the output column names
Convert any value to a string.
map
public Text map(Writable writable)Transform the writable into a string
param writable the writable to transform
return the string form of this writable
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
This method maps all String values, except those in the specified list, to a single String value
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
String transform that removes all whitespace characters
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
Replace empty String values with the specified String
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
Replaces String values that match regular expressions.
map
public Text map(final Writable writable)Constructs a new ReplaceStringTransform using the specified column name and map
param columnName Name of the column
param map Key: regular expression; Value: replacement value
Convert a delimited String to a list of binary categorical columns. Suppose the possible String values were {“a”,”b”,”c”,”d”} and the String column value to be converted contained the String “a,c”, then the 4 output columns would have values [“true”,”false”,”true”,”false”]
transform
public Schema transform(Schema inputSchema)param columnName The name of the column to convert
param newColumnNames The names of the new columns to create
param categoryTokens The possible tokens that may be present. Note this list must have the same length and order as the newColumnNames list
param delimiter The delimiter for the Strings to convert
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence the sequence to transform
outputColumnName
public String outputColumnName()The output column name after the operation has been applied
return the output column name
columnName
public String columnName()The output column names. This will often be the same as the input column names
return the output column names
Converts a String column into a bag-of-words (BOW) represented as an NDArray of “counts.” Note that the original column is removed in the process.
transform
public Schema transform(Schema inputSchema)param columnName The name of the column to convert
param vocabulary The possible tokens that may be present.
param delimiter The delimiter for the Strings to convert
param ignoreUnknown Whether to ignore unknown tokens
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence the sequence to transform
outputColumnName
public String outputColumnName()The output column name after the operation has been applied
return the output column name
columnName
public String columnName()The output column names. This will often be the same as the input column names
return the output column names
Converts a String column into a sparse bag-of-words (BOW) represented as an NDArray of indices. Appropriate for embeddings or as efficient storage before being expanded into a dense array.
A simple String -> String map function.
Keys in the map are the original values; the values in the map are their replacements. If a String appears in the data but does not appear in the provided map (as a key), that String value will not be modified.
map
public Text map(Writable writable)param columnName Name of the column
param map Key: From. Value: To
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
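A minimal sketch (hypothetical column name and mapping, schema assumed):

```java
import java.util.HashMap;
import java.util.Map;
import org.datavec.api.transform.TransformProcess;

Map<String, String> mapping = new HashMap<>();
mapping.put("NYC", "New York");
mapping.put("SF", "San Francisco");

// Values that are not keys in the map pass through unmodified
TransformProcess tp = new TransformProcess.Builder(schema)
    .stringMapTransform("city", mapping)
    .build();
```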
Create a number of new columns by deriving their values from a Time column. Can be used for example to create new columns with the year, month, day, hour, minute, second etc.
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
public Object mapSequence(Object sequence)Transform a sequence
param sequence the sequence to transform
toString
public String toString()The output column name after the operation has been applied
return the output column name
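For example, deriving an hour-of-day column from a time column (hypothetical column names, schema assumed):

```java
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.transform.time.DeriveColumnsFromTimeTransform;
import org.joda.time.DateTimeFieldType;

// Add an integer "hourOfDay" column derived from the time column "timestamp"
TransformProcess tp = new TransformProcess.Builder(schema)
    .transform(new DeriveColumnsFromTimeTransform.Builder("timestamp")
        .addIntegerDerivedColumn("hourOfDay", DateTimeFieldType.hourOfDay())
        .build())
    .build();
```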
Convert a String column to a time column by parsing the date/time String using Joda-Time.
Time format is specified as per http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html
getNewColumnMetaData
public ColumnMetaData getNewColumnMetaData(String newName, ColumnMetaData oldColumnType)Instantiate this without a time format specified. If this constructor is used, the transform will attempt to handle several common date/time formats, as defined in the static formats array.
param columnName Name of the String column
param timeZone Timezone for time parsing
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
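A minimal sketch (hypothetical column name and format, schema assumed):

```java
import org.datavec.api.transform.TransformProcess;
import org.joda.time.DateTimeZone;

// Parse the String column "dateStr" into a time column, using a Joda-Time pattern
TransformProcess tp = new TransformProcess.Builder(schema)
    .stringToTimeTransform("dateStr", "YYYY-MM-dd HH:mm:ss", DateTimeZone.UTC)
    .build();
```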
Transform math op on a time column
Note: only the following MathOps are supported: Add, Subtract, ScalarMin, ScalarMax For ScalarMin/Max, the TimeUnit must be milliseconds - i.e., value must be in epoch millisecond format
map
public Object map(Object input)Transform an object into another object
param input the record to transform
return the transformed writable
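For instance (hypothetical column name, schema assumed), shifting a time column forward by 12 hours:

```java
import java.util.concurrent.TimeUnit;
import org.datavec.api.transform.MathOp;
import org.datavec.api.transform.TransformProcess;

// Add 12 hours to every value of "timestamp" (epoch millisecond format)
TransformProcess tp = new TransformProcess.Builder(schema)
    .timeMathOp("timestamp", MathOp.Add, 12, TimeUnit.HOURS)
    .build();
```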