Gather statistics on datasets.
Sometimes datasets are too large, or too abstract in their format, to manually analyze and estimate statistics on certain columns or patterns. DataVec comes with helper utilities for performing data analysis and computing maximums, means, minimums, and other useful metrics.
If you have loaded your data into Apache Spark, DataVec has a special AnalyzeSpark class which can generate histograms, collect statistics, and return information about the quality of the data. Assuming you have already loaded your data into a Spark RDD, pass the JavaRDD and Schema to the class.
If you are using DataVec in Scala and your data was loaded into a regular RDD class, you can convert it by calling .toJavaRDD(), which returns a JavaRDD. If you need to convert it back, call rdd().
The code below demonstrates some of the many analyses possible for a 2D dataset in Spark, using the RDD javaRdd and the schema mySchema:
Note that if you have sequence data, there are special methods for that as well:
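As a hedged sketch, a data quality analysis for sequence data might look like the following, assuming the sequences are held in a JavaRDD<List<List<Writable>>> and that analyzeQualitySequence (documented below) is available:

```java
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.datavec.api.transform.analysis.DataQualityAnalysis;
import org.datavec.api.transform.schema.Schema;
import org.datavec.api.writable.Writable;
import org.datavec.spark.transform.AnalyzeSpark;

public class SequenceQualityExample {
    public static void check(Schema mySchema, JavaRDD<List<List<Writable>>> sequenceRdd) {
        // Report of missing values and schema violations, per column
        DataQualityAnalysis dqa = AnalyzeSpark.analyzeQualitySequence(mySchema, sequenceRdd);
        System.out.println(dqa);
    }
}
```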
The AnalyzeLocal class works very similarly to its Spark counterpart and has a similar API. Instead of an RDD, it accepts a RecordReader, which allows it to iterate over the dataset.
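A minimal local sketch, assuming a CSV file and the AnalyzeLocal class from the datavec-local module (the file path and column names are hypothetical):

```java
import java.io.File;

import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.transform.analysis.DataAnalysis;
import org.datavec.api.transform.schema.Schema;
import org.datavec.local.transforms.AnalyzeLocal;

public class LocalAnalysisExample {
    public static void main(String[] args) throws Exception {
        Schema mySchema = new Schema.Builder()
                .addColumnDouble("sepalLength") // hypothetical columns
                .addColumnDouble("sepalWidth")
                .build();

        RecordReader rr = new CSVRecordReader();
        rr.initialize(new FileSplit(new File("iris.csv"))); // hypothetical file path

        DataAnalysis analysis = AnalyzeLocal.analyze(mySchema, rr);
        System.out.println(analysis);
    }
}
```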
Analyse the specified data - returns a DataAnalysis object with summary information about each column
Analyse the specified data - returns a DataAnalysis object with summary information about each column
param schema Schema for data
param rr Data to analyze
return DataAnalysis for data
Analyze the data quality of sequence data - provides a report on missing values, values that don’t comply with schema, etc
param schema Schema for data
param data Data to analyze
return DataQualityAnalysis object
Analyze the data quality of data - provides a report on missing values, values that don’t comply with schema, etc
param schema Schema for data
param data Data to analyze
return DataQualityAnalysis object
AnalyzeSpark: static methods for analyzing and sampling data stored in Spark RDDs
param schema Schema for the data
param data Data to analyze
param maxHistogramBuckets Maximum number of histogram buckets for each column
return DataAnalysis for the data
Analyse the specified data - returns a DataAnalysis object with summary information about each column
param schema Schema for data
param data Data to analyze
return DataAnalysis for data
Randomly sample values from a single column
param count Number of values to sample
param columnName Name of the column to sample from
param schema Schema
param data Data to sample from
return A list of random samples
Analyze the data quality of data - provides a report on missing values, values that don’t comply with schema, etc
param schema Schema for data
param data Data to analyze
return DataQualityAnalysis object
Randomly sample a set of invalid values from a specified column. Values are considered invalid according to the Schema / ColumnMetaData
param numToSample Maximum number of invalid values to sample
param columnName Name of the column from which to sample invalid values
param schema Data schema
param data Data
return List of invalid examples
Get the maximum value for the specified column
param allData All data
param columnName Name of the column to get the maximum value for
param schema Schema of the data
return Maximum value for the column
Selection of data using conditions.
Filters are a part of transforms and provide a DSL for keeping only the parts of your dataset that you want. Filters can be one-liners for single conditions or can include complex boolean logic.
You can also write your own filters by implementing the Filter
interface, though more often you will want to create a custom condition instead.
If the condition is satisfied (returns true), the example or sequence is removed; if it is not satisfied (returns false), the example or sequence is kept.
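For example, a minimal sketch of a one-liner condition filter plus an invalid-value filter inside a transform process; the schema variable and the "amount"/"date" column names are hypothetical:

```java
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.condition.ConditionOp;
import org.datavec.api.transform.condition.column.DoubleColumnCondition;
import org.datavec.api.transform.filter.ConditionFilter;
import org.datavec.api.transform.filter.FilterInvalidValues;

// 'schema' is assumed to be a Schema with (hypothetical) "amount" and "date" columns
TransformProcess tp = new TransformProcess.Builder(schema)
        // One-liner: remove any example whose "amount" value is negative
        .filter(new ConditionFilter(new DoubleColumnCondition("amount", ConditionOp.LessThan, 0.0)))
        // Remove any example with invalid values in the listed columns
        .filter(new FilterInvalidValues("amount", "date"))
        .build();
```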
removeExample
param writables Example
return true if example should be removed, false to keep
removeSequence
param sequence sequence example
return true if example should be removed, false to keep
transform
Get the output schema for this transformation, given an input schema
param inputSchema
outputColumnName
The output column name after the operation has been applied
return the output column name
columnName
The output column names. This will often be the same as the input
return the output column names
Filter: a method of removing examples (or sequences) according to some condition
FilterInvalidValues: a filter operation that removes any examples (or sequences) if the examples/sequences contains invalid values in any of a specified set of columns. Invalid values are determined with respect to the schema
transform
param columnsToFilterIfInvalid Columns to check for invalid values
removeExample
param writables Example
return true if example should be removed, false to keep
removeSequence
param sequence sequence example
return true if example should be removed, false to keep
outputColumnName
The output column name after the operation has been applied
return the output column name
columnName
The output column names. This will often be the same as the input
return the output column names
Remove invalid records of a certain size (i.e., records that do not contain the expected number of columns).
removeExample
param writables Example
return true if example should be removed, false to keep
removeSequence
param sequence sequence example
return true if example should be removed, false to keep
removeExample
param writables Example
return true if example should be removed, false to keep
removeSequence
param sequence sequence example
return true if example should be removed, false to keep
transform
Get the output schema for this transformation, given an input schema
param inputSchema
outputColumnName
The output column name after the operation has been applied
return the output column name
columnName
The output column names. This will often be the same as the input
return the output column names
How to use data records in DataVec.
In the DataVec world, a Record represents a single entry in a dataset. DataVec differentiates between types of records to make data manipulation easier with built-in APIs; sequence records and 2D records, for example, are distinct types.
Most of the time you do not need to interact with the record classes directly, unless you are manually iterating records for the purpose of forwarding through a neural network.
A standard implementation of the Record interface
A standard implementation of the SequenceRecord interface.
Read individual records from different formats.
Readers iterate over records from a dataset in storage and load the data into DataVec. Readers are useful beyond reading individual entries: what if you wanted to train a text generator on a corpus, or programmatically compose two entries together to form a new record? Reader implementations are also useful for complex file types or distributed storage mechanisms.
Readers return Writable classes that describe each column in a Record. These classes are used to convert each record to a tensor/ND-Array format.
Each reader implementation extends BaseRecordReader and provides a simple API for selecting the next record in a dataset, acting similarly to an iterator.
Useful methods include:
next: Return a batch of Writable.
nextRecord: Return a single Record, optionally with RecordMetaData.
reset: Reset the underlying iterator.
hasNext: Iterator method to determine if another record is available.
You can hook a custom RecordListener to a record reader for debugging or visualization purposes. Pass your custom listener to the addListener base method immediately after initializing your class.
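A minimal sketch of this flow, using a CSVRecordReader and the LogRecordListener that ships with DataVec (registered here via setListeners, which the reader API documents below); the file path is hypothetical:

```java
import java.io.File;
import java.util.List;

import org.datavec.api.records.listener.impl.LogRecordListener;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.writable.Writable;

CSVRecordReader reader = new CSVRecordReader();
reader.initialize(new FileSplit(new File("iris.csv"))); // hypothetical file
reader.setListeners(new LogRecordListener());           // logs each record as it is read

while (reader.hasNext()) {
    List<Writable> record = reader.next();
    // hand 'record' to a transform process or vectorizer here
}
reader.close();
```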
A RecordReader for composing pipelines: it takes multiple record readers, iterates over them together, and concatenates their output, so that each individual record is the concatenation of one record from each underlying reader. hasNext is the logical AND of all the underlying readers' hasNext, and next concatenates (via addAll) each reader's collection to return one record.
initialize
Combine multiple readers into a single reader. Records are read sequentially - thus if the first reader has 100 records, and the second reader has 200 records, ConcatenatingRecordReader will have 300 records.
File reader/writer
getCurrentLabel
Return the current label: the index of the current file’s parent directory in the label list
return The index of the current file’s parent directory
Reads files line by line
Collection record reader. Mainly used for testing.
Collection record reader for sequences. Mainly used for testing.
initialize
param records Collection of sequences. For example, List<List<List<Writable>>>, where the inner two lists form a sequence and the outer list/collection is a list of sequences
Iterates through a list of strings and returns a record for each one.
initialize
Called once at initialization.
param split the split that defines the range of records to read
throws IOException
throws InterruptedException
initialize
Called once at initialization.
param conf a configuration for initialization
param split the split that defines the range of records to read
throws IOException
throws InterruptedException
hasNext
Get the next record
return The next record, as a list of Writables
reset
Reset the underlying iterator.
getLabels
return List of label strings
nextRecord
Load the record from the given DataInputStream. Unlike next(), the internal state of the RecordReader is not modified. Implementations of this method should not close the DataInputStream.
param uri
param dataInputStream
throws IOException if an error occurs during reading from the input stream
close
Closes this stream and releases any system resources associated with it. If the stream is already closed then invoking this method has no effect.
As noted in AutoCloseable.close(), cases where the close may fail require careful attention. It is strongly advised to relinquish the underlying resources and to internally mark the Closeable as closed prior to throwing the IOException.
throws IOException if an I/O error occurs
setConf
Set the configuration to be used by this object.
param conf
getConf
Return the configuration used by this object.
Simple CSV record reader.
initialize
Skip first n lines
param skipNumLines the number of lines to skip
A CSVRecordReader that can split each column into additional columns using regexes.
CSV sequence record reader. This reader is intended to read sequences of data in CSV format, where each sequence is defined in its own file (and there are multiple files). Each line in the file represents one time step.
A sliding window of variable size across an entire CSV.
In practice, the sliding window size starts at 1, increases linearly to maxLinesPerSequence, then decreases linearly back to 1.
initialize
No-arg constructor with the default number of lines per sequence (10)
Record reader for libsvm format, which is closely related to SVMLight format. Similar to scikit-learn we use a single reader for both formats, so this class is a subclass of SVMLightRecordReader.
Further details on the format can be found at
Matlab record reader
Record reader for SVMLight format, which can generally be described as
LABEL INDEX:VALUE INDEX:VALUE …
SVMLight format is well-suited to sparse data (e.g., bag-of-words) because it omits all features with value zero.
We support an “extended” version that allows for multiple targets (or labels) separated by a comma, as follows:
LABEL1,LABEL2,… INDEX:VALUE INDEX:VALUE …
This can be used to represent either multitask problems or multilabel problems with sparse binary labels (controlled via the “MULTILABEL” configuration option).
Like scikit-learn, we support both zero-based and one-based indexing.
Further details on the format can be found at
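A hedged initialization sketch; the NUM_FEATURES configuration key and the file path are assumptions based on the configuration options described above:

```java
import java.io.File;

import org.datavec.api.conf.Configuration;
import org.datavec.api.records.reader.impl.misc.SVMLightRecordReader;
import org.datavec.api.split.FileSplit;

SVMLightRecordReader rr = new SVMLightRecordReader();
Configuration conf = new Configuration();
conf.setInt(SVMLightRecordReader.NUM_FEATURES, 784); // assumed key: total number of feature indexes
rr.initialize(conf, new FileSplit(new File("features.svmlight"))); // hypothetical file
```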
initialize
Must be called before attempting to read records.
param conf DataVec configuration
param split FileSplit
throws IOException
throws InterruptedException
setConf
Set configuration.
param conf DataVec configuration
throws IOException
throws InterruptedException
hasNext
Helper function to help detect lines that are commented out. May read ahead and cache a line.
return
nextRecord
Return next record as list of Writables.
return
RegexLineRecordReader: Read a file, one line at a time, and split each line into fields using a regex. To read an entire file as one sequence record instead, use RegexSequenceRecordReader (below).
Example: Data in format “2016-01-01 23:59:59.001 1 DEBUG First entry message!” using regex String “(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) (\d+) ([A-Z]+) (.*)” would be split into 4 Text writables: [“2016-01-01 23:59:59.001”, “1”, “DEBUG”, “First entry message!”]
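A minimal sketch of setting up this reader with the regex from the example above (the log file path is hypothetical):

```java
import java.io.File;

import org.datavec.api.records.reader.impl.regex.RegexLineRecordReader;
import org.datavec.api.split.FileSplit;

// One capture group per output Writable; regex from the example above
String regex = "(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3}) (\\d+) ([A-Z]+) (.*)";
RegexLineRecordReader rr = new RegexLineRecordReader(regex, 0); // 0 = skip no lines
rr.initialize(new FileSplit(new File("app.log")));              // hypothetical log file
```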
RegexSequenceRecordReader: Read an entire file (as a sequence), one line at a time and split each line into fields using a regex.
Example: Data in format “2016-01-01 23:59:59.001 1 DEBUG First entry message!” using regex String “(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) (\d+) ([A-Z]+) (.*)” would be split into 4 Text writables: [“2016-01-01 23:59:59.001”, “1”, “DEBUG”, “First entry message!”]
Lines that don’t match the provided regex can result in an exception (FailOnInvalid), can be skipped silently (SkipInvalid), or can be skipped with a logged warning (SkipInvalidWithWarning).
A record reader that allows each record to have a transform process applied before being returned.
initialize
Called once at initialization.
param split the split that defines the range of records to read
throws IOException
throws InterruptedException
initialize
Called once at initialization.
param conf a configuration for initialization
param split the split that defines the range of records to read
throws IOException
throws InterruptedException
hasNext
Get the next record
return
reset
Reset the underlying iterator.
getLabels
return List of label strings
nextRecord
Load the record from the given DataInputStream. Unlike next(), the internal state of the RecordReader is not modified. Implementations of this method should not close the DataInputStream.
param uri
param dataInputStream
throws IOException if an error occurs during reading from the input stream
loadFromMetaData
Load a single record from the given RecordMetaData instance. Note that for data that isn’t splittable (i.e., text data that needs to be scanned/split), it is more efficient to load multiple records at once using loadFromMetaData(List)
param recordMetaData Metadata for the record that we want to load from
return Single record for the given RecordMetaData instance
throws IOException If I/O error occurs during loading
loadFromMetaData
Load multiple records from the given list of RecordMetaData instances
param recordMetaDatas Metadata for the records that we want to load from
return Multiple records for the given RecordMetaData instances
throws IOException If I/O error occurs during loading
setListeners
Set the record listeners for this record reader.
param listeners
close
Closes this stream and releases any system resources associated with it. If the stream is already closed then invoking this method has no effect.
As noted in AutoCloseable.close(), cases where the close may fail require careful attention. It is strongly advised to relinquish the underlying resources and to internally mark the Closeable as closed prior to throwing the IOException.
throws IOException if an I/O error occurs
setConf
Set the configuration to be used by this object.
param conf
getConf
Return the configuration used by this object.
A sequence record reader that allows each sequence to be transformed before being returned.
setConf
Set the configuration to be used by this object.
param conf
getConf
Return the configuration used by this object.
sequenceRecord
Returns a sequence record.
return a sequence of records
nextSequence
Load a sequence record from the given DataInputStream. Unlike next(), the internal state of the RecordReader is not modified. Implementations of this method should not close the DataInputStream.
param uri
param dataInputStream
throws IOException if an error occurs during reading from the input stream
loadSequenceFromMetaData
Load a single sequence record from the given RecordMetaData instance. Note that for data that isn’t splittable (i.e., text data that needs to be scanned/split), it is more efficient to load multiple records at once using loadSequenceFromMetaData(List)
param recordMetaData Metadata for the sequence record that we want to load from
return Single sequence record for the given RecordMetaData instance
throws IOException If I/O error occurs during loading
loadSequenceFromMetaData
Load multiple sequence records from the given list of RecordMetaData instances
param recordMetaDatas Metadata for the records that we want to load from
return Multiple sequence record for the given RecordMetaData instances
throws IOException If I/O error occurs during loading
initialize
Called once at initialization.
param conf a configuration for initialization
param split the split that defines the range of records to read
throws IOException
throws InterruptedException
hasNext
Get the next record
return
reset
Reset the underlying iterator.
getLabels
return List of label strings
nextRecord
Load the record from the given DataInputStream. Unlike next(), the internal state of the RecordReader is not modified. Implementations of this method should not close the DataInputStream.
param uri
param dataInputStream
throws IOException if an error occurs during reading from the input stream
loadFromMetaData
Load a single record from the given RecordMetaData instance. Note that for data that isn’t splittable (i.e., text data that needs to be scanned/split), it is more efficient to load multiple records at once using loadFromMetaData(List)
param recordMetaData Metadata for the record that we want to load from
return Single record for the given RecordMetaData instance
throws IOException If I/O error occurs during loading
loadFromMetaData
Load multiple records from the given list of RecordMetaData instances
param recordMetaDatas Metadata for the records that we want to load from
return Multiple records for the given RecordMetaData instances
throws IOException If I/O error occurs during loading
setListeners
Set the record listeners for this record reader.
param listeners
close
Closes this stream and releases any system resources associated with it. If the stream is already closed then invoking this method has no effect.
As noted in AutoCloseable.close(), cases where the close may fail require careful attention. It is strongly advised to relinquish the underlying resources and to internally mark the Closeable as closed prior to throwing the IOException.
throws IOException if an I/O error occurs
Native audio file loader using FFmpeg.
Wav file loader
Image record reader. Reads a local file system and parses images of a given height and width. All images are rescaled and converted to the given height, width, and number of channels.
Also appends the label if specified (one-of-k encoding based on the directory structure, where each subdirectory of the root is an indexed label)
TFIDF record reader (wraps a tfidf vectorizer for delivering labels and conforming to the record reader interface)
BooleanCondition: used for creating compound conditions, such as AND(ConditionA, ConditionB, …). As a BooleanCondition is itself a condition, these can be chained together, like NOT(OR(AND(…), AND(…))).
The output column name after the operation has been applied
return the output column name
The output column names. This will often be the same as the input
return the output column names
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
conditionSequence
Condition on arbitrary input
param sequence the sequence to do a condition on
return true if the condition for the sequence is met false otherwise
transform
Get the output schema for this transformation, given an input schema
param inputSchema
And of all the given conditions
param conditions the conditions to and together
return a joint and of all these conditions
Or of all the given conditions
param conditions the conditions to or together
return a joint or of all these conditions
Not of the given condition
param condition the condition to negate
return the negation of the given condition
Xor of the two given conditions
param first the first condition
param second the second condition for the xor
return the xor of these 2 conditions
For certain single-column conditions: how should we apply these to sequences? And: the condition applies to the sequence only if it applies to ALL time steps. Or: the condition applies to the sequence if it applies to ANY time step. NoSequenceMode: the condition cannot be applied to sequences at all (error condition).
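For example, a minimal sketch of a compound condition built from the static helpers documented below; the "state" and "amount" column names are hypothetical:

```java
import org.datavec.api.transform.condition.BooleanCondition;
import org.datavec.api.transform.condition.Condition;
import org.datavec.api.transform.condition.ConditionOp;
import org.datavec.api.transform.condition.column.CategoricalColumnCondition;
import org.datavec.api.transform.condition.column.DoubleColumnCondition;

// True if state == "CA" OR amount > 100.0 (hypothetical column names)
Condition condition = BooleanCondition.OR(
        new CategoricalColumnCondition("state", ConditionOp.Equal, "CA"),
        new DoubleColumnCondition("amount", ConditionOp.GreaterThan, 100.0));
```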
columnCondition
Returns whether the given element meets the condition set by this operation
param writable the element to test
return true if the condition is met false otherwise
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
columnCondition
Constructor for conditions equal or not equal. Uses the default sequence condition mode, BaseColumnCondition#DEFAULT_SEQUENCE_CONDITION_MODE
param columnName Column to check for the condition
param op Operation (== or != only)
param value Value to use in the condition
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
columnCondition
Constructor for operations such as less than, equal to, greater than, etc. Uses the default sequence condition mode, BaseColumnCondition#DEFAULT_SEQUENCE_CONDITION_MODE
param columnName Column to check for the condition
param op Operation (<, >=, !=, etc)
param value Value to use in the condition
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
A column condition that simply checks whether a floating point value is infinite
columnCondition
param columnName Column to check for the condition
columnCondition
Constructor for operations such as less than, equal to, greater than, etc. Uses the default sequence condition mode, BaseColumnCondition#DEFAULT_SEQUENCE_CONDITION_MODE
param columnName Column to check for the condition
param op Operation (<, >=, !=, etc)
param value Value to use in the condition
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
A Condition that applies to a single column. Whenever the specified value is invalid according to the schema, the condition applies.
For example, if a Writable contains String values in an Integer column (and these cannot be parsed to an integer), then the condition would return true, as these values are invalid according to the schema.
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
columnCondition
Constructor for operations such as less than, equal to, greater than, etc. Uses the default sequence condition mode, BaseColumnCondition#DEFAULT_SEQUENCE_CONDITION_MODE
param columnName Column to check for the condition
param op Operation (<, >=, !=, etc)
param value Value to use in the condition
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
A column condition that simply checks whether a floating point value is NaN
columnCondition
param columnName Name of the column to check the condition for
Condition that applies to the values in any column. Specifically, condition is true if the Writable value is a NullWritable, and false for any other value
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
columnCondition
Constructor for conditions equal or not equal. Uses the default sequence condition mode, BaseColumnCondition#DEFAULT_SEQUENCE_CONDITION_MODE
param columnName Column to check for the condition
param op Operation (== or != only)
param value Value to use in the condition
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
Condition that applies to the values in a Time column
columnCondition
Constructor for operations such as less than, equal to, greater than, etc. Uses the default sequence condition mode, BaseColumnCondition#DEFAULT_SEQUENCE_CONDITION_MODE
param columnName Column to check for the condition
param op Operation (<, >=, !=, etc)
param value Time value (in epoch millisecond format) to use in the condition
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
A condition on sequence lengths
Condition that applies to the values in a String column, using a provided regex. The condition returns true if the String matches the regex, and false otherwise. Note: uses Writable.toString(), hence it can potentially be applied to non-String columns.
condition
Condition on arbitrary input
param input the input to return the condition for
return true if the condition is met false otherwise
Data wrangling and mapping from one schema to another.
Transformations are one of the key tools in DataVec. DataVec helps the user map a dataset from one schema to another, and provides a list of operations to convert types, format data, and convert a 2D dataset to sequence data.
A transform process requires a Schema to successfully transform data. Both the schema and transform process classes come with a helper Builder class, which is useful for organizing code and avoiding complex constructors.
When combined, they look like the sample code below. Note how inputDataSchema is passed into the Builder constructor; your transform process will fail to compile without it.
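A minimal sketch; the column names and the two operations chosen here are illustrative:

```java
import org.datavec.api.transform.MathOp;
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.schema.Schema;

Schema inputDataSchema = new Schema.Builder()
        .addColumnString("firstName") // hypothetical columns
        .addColumnString("lastName")
        .addColumnDouble("amount")
        .build();

TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
        .removeColumns("firstName", "lastName")          // drop the name columns
        .doubleMathOp("amount", MathOp.Multiply, 1.1)    // scale the amount column
        .build();
```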
Different "backends" for executors are available. Using the tp
transform process above, here's how you can execute it locally using plain DataVec.
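A hedged sketch using the LocalTransformExecutor from the datavec-local module; originalData is assumed to have been loaded already (for example via a RecordReader):

```java
import java.util.List;

import org.datavec.api.writable.Writable;
import org.datavec.local.transforms.LocalTransformExecutor;

// 'originalData' is assumed to have been loaded elsewhere, e.g. via a RecordReader
List<List<Writable>> transformed = LocalTransformExecutor.execute(originalData, tp);
```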
Each operation in a transform process represents a "step" in the schema changes. Sometimes the resulting transformation is not the intended result. You can debug this by printing each step in the transform process tp with the following:
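A hedged sketch; getSchemaAfterStep is documented below, while getActionList() is assumed here as the source of the step count:

```java
// getActionList() is assumed to return the ordered list of transform steps
int numSteps = tp.getActionList().size();
for (int i = 0; i < numSteps; i++) {
    System.out.println("Schema after step " + i + ":");
    System.out.println(tp.getSchemaAfterStep(i)); // documented below
}
```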
A TransformProcess defines an ordered list of transformations to be executed on some data
getActionList
Get the action list that this transform process will execute
return List of actions in this transform process
getFinalSchema
Get the schema of the data after all transform steps have been executed
return Schema of the final (transformed) data
getSchemaAfterStep
Return the schema after executing all steps up to and including the specified step. Steps are indexed from 0: so getSchemaAfterStep(0) is after one transform has been executed.
param step Index of the step
return Schema of the data, after that (and all prior) steps have been executed
execute
Execute the full sequence of transformations for a single example. May return null if the example is filtered. NOTE: Some TransformProcess operations cannot be done on examples individually. Most notably, ConvertToSequence and ConvertFromSequence operations require the full data set to be processed at once.
param input Input example
return Transformed example, or null if it was filtered out
toJson
Convert the TransformProcess to a JSON string
return TransformProcess, as JSON
toYaml
Convert the TransformProcess to a YAML string
return TransformProcess, as YAML
fromJson
Deserialize a JSON String (created by toJson()) to a TransformProcess
return TransformProcess, from JSON
fromYaml
Deserialize a YAML String (created by toYaml()) to a TransformProcess
return TransformProcess, from YAML
inferCategories
Infer the categories for the given record reader for a particular column. Note that each “column index” is a column in the context of: List<Writable> record = ...; record.get(columnIndex);
Note that anything passed in as a column will be automatically converted to a string for categorical purposes.
The expected input is strings or numbers (which have sensible toString() representations).
Note that the returned categories will be sorted alphabetically.
param recordReader the record reader to iterate through
param columnIndex the column index to get categories for
return List of categories, sorted alphabetically
filter
Add a filter operation to be executed after the previously-added operations have been executed
param filter Filter operation to execute
filter
Add a filter operation, based on the specified condition.
If the condition is satisfied (returns true), the example or sequence is removed; if it is not satisfied (returns false), the example or sequence is kept.
param condition Condition to filter on
removeColumns
Remove all of the specified columns, by name
param columnNames Names of the columns to remove
removeColumns
Remove all of the specified columns, by name
param columnNames Names of the columns to remove
removeAllColumnsExceptFor
Remove all columns, except for those that are specified here
param columnNames Names of the columns to keep
removeAllColumnsExceptFor
Remove all columns, except for those that are specified here
param columnNames Names of the columns to keep
renameColumn
Rename a single column
param oldName Original column name
param newName New column name
renameColumns
Rename multiple columns
param oldNames List of original column names
param newNames List of new column names
reorderColumns
Reorder the columns using a partial or complete new ordering. If only some of the column names are specified for the new order, the remaining columns will be placed at the end, according to their current relative ordering
param newOrder Names of the columns, in the order they will appear in the output
duplicateColumn
Duplicate a single column
param column Name of the column to duplicate
param newName Name of the new (duplicate) column
duplicateColumns
Duplicate a set of columns
param columnNames Names of the columns to duplicate
param newNames Names of the new (duplicated) columns
integerMathOp
Perform a mathematical operation (add, subtract, scalar max etc) on the specified integer column, with a scalar
param column The integer column to perform the operation on
param mathOp The mathematical operation
param scalar The scalar value to use in the mathematical operation
integerColumnsMathOp
Calculate and add a new integer column by performing a mathematical operation on a number of existing columns. New column is added to the end.
param newColumnName Name of the new/derived column
param mathOp Mathematical operation to execute on the columns
param columnNames Names of the columns to use in the mathematical operation
longMathOp
Perform a mathematical operation (add, subtract, scalar max etc) on the specified long column, with a scalar
param columnName The long column to perform the operation on
param mathOp The mathematical operation
param scalar The scalar value to use in the mathematical operation
longColumnsMathOp
Calculate and add a new long column by performing a mathematical operation on a number of existing columns. New column is added to the end.
param newColumnName Name of the new/derived column
param mathOp Mathematical operation to execute on the columns
param columnNames Names of the columns to use in the mathematical operation
floatMathOp
Perform a mathematical operation (add, subtract, scalar max etc) on the specified float column, with a scalar
param columnName The float column to perform the operation on
param mathOp The mathematical operation
param scalar The scalar value to use in the mathematical operation
floatColumnsMathOp
Calculate and add a new float column by performing a mathematical operation on a number of existing columns. New column is added to the end.
param newColumnName Name of the new/derived column
param mathOp Mathematical operation to execute on the columns
param columnNames Names of the columns to use in the mathematical operation
floatMathFunction
Perform a mathematical operation (such as sin(x), ceil(x), exp(x) etc) on a column
param columnName Column name to operate on
param mathFunction MathFunction to apply to the column
doubleMathOp
Perform a mathematical operation (add, subtract, scalar max etc) on the specified double column, with a scalar
param columnName The double column to perform the operation on
param mathOp The mathematical operation
param scalar The scalar value to use in the mathematical operation
doubleColumnsMathOp
Calculate and add a new double column by performing a mathematical operation on a number of existing columns. New column is added to the end.
param newColumnName Name of the new/derived column
param mathOp Mathematical operation to execute on the columns
param columnNames Names of the columns to use in the mathematical operation
doubleMathFunction
Perform a mathematical operation (such as sin(x), ceil(x), exp(x) etc) on a column
param columnName Column name to operate on
param mathFunction MathFunction to apply to the column
timeMathOp
Perform a mathematical operation (add, subtract, scalar min/max only) on the specified time column
param columnName The time column to perform the operation on
param mathOp The mathematical operation
param timeQuantity The quantity used in the mathematical op
param timeUnit The unit that timeQuantity is specified in
categoricalToOneHot
Convert the specified column(s) from a categorical representation to a one-hot representation. This involves the creation of multiple new columns for each specified column (one per category).
param columnNames Names of the categorical column(s) to convert to a one-hot representation
categoricalToInteger
Convert the specified column(s) from a categorical representation to an integer representation. This will replace the specified categorical column(s) with an integer representation, where each integer has a value from 0 to numCategories-1.
param columnNames Name of the categorical column(s) to convert to an integer representation
integerToCategorical
Convert the specified column from an integer representation (assume values 0 to numCategories-1) to a categorical representation, given the specified state names
param columnName Name of the column to convert
param categoryStateNames Names of the states for the categorical column
integerToCategorical
Convert the specified column from an integer representation to a categorical representation, given the specified mapping between integer indexes and state names
param columnName Name of the column to convert
param categoryIndexNameMap Names of the states for the categorical column
integerToOneHot
Convert an integer column to a set of one-hot columns, based on the value in the integer column
param columnName Name of the integer column
param minValue Minimum value possible for the integer column (inclusive)
param maxValue Maximum value possible for the integer column (inclusive)
addConstantColumn
Add a new column, where all values in the column are identical and as specified.
param newColumnName Name of the new column
param newColumnType Type of the new column
param fixedValue Value in the new column for all records
addConstantDoubleColumn
Add a new double column, where the value for that column (for all records) is identical
param newColumnName Name of the new column
param value Value in the new column for all records
addConstantIntegerColumn
Add a new integer column, where the value for that column (for all records) is identical
param newColumnName Name of the new column
param value Value of the new column for all records
addConstantLongColumn
Add a new long column, where the value for that column (for all records) is identical
param newColumnName Name of the new column
param value Value in the new column for all records
convertToString
Convert the specified column to a string.
param inputColumn the input column to convert
return builder pattern
convertToDouble
Convert the specified column to a double.
param inputColumn the input column to convert
return builder pattern
convertToInteger
Convert the specified column to an integer.
param inputColumn the input column to convert
return builder pattern
normalize
Normalize the specified column with a given type of normalization
param column Column to normalize
param type Type of normalization to apply
param da DataAnalysis object
convertToSequence
Convert a set of independent records/examples into a sequence, according to some key. Within each sequence, values are ordered using the provided SequenceComparator
param keyColumn Column to use as a key (values with the same key will be combined into sequences)
param comparator A SequenceComparator to order the values within each sequence (for example, by time or String order)
convertToSequence
Convert a set of independent records/examples into a sequence; each example is simply treated as a sequence of length 1, without any join/group operations. Note that more commonly, joining/grouping is required; use convertToSequence(List, SequenceComparator) for this functionality
convertToSequence
Convert a set of independent records/examples into a sequence, where each sequence is grouped according to one or more key values (i.e., the values in one or more columns). Within each sequence, values are ordered using the provided SequenceComparator
param keyColumns Columns to use as a key (values with the same key will be combined into sequences)
param comparator A SequenceComparator to order the values within each sequence (for example, by time or String order)
convertFromSequence
Convert a sequence to a set of individual values (by treating each value in each sequence as a separate example)
splitSequence
Split sequences into 1 or more other sequences. Used for example to split large sequences into a set of smaller sequences
param split SequenceSplit that defines how splits will occur
trimSequence
SequenceTrimTransform removes the first or last N values in a sequence. Note that the resulting sequence may be of length 0 if the input sequence length is less than or equal to N.
param numStepsToTrim Number of time steps to trim from the sequence
param trimFromStart If true: Trim values from the start of the sequence. If false: trim values from the end.
offsetSequence
Perform a sequence offset operation on the specified columns. Note that this also truncates sequences by the specified offset amount by default. Use transform(new SequenceOffsetTransform(…)) to change this. See SequenceOffsetTransform for details on exactly what this operation does and how.
param columnsToOffset Columns to offset
param offsetAmount Amount to offset the specified columns by (positive offset: ‘columnsToOffset’ are moved to later time steps)
param operationType Whether the offset should be done in-place or by adding a new column
reduce
Reduce (i.e., aggregate/combine) a set of examples (typically by key). Note: In the current implementation, reduction operations can be performed only on standard (i.e., non-sequence) data
param reducer Reducer to use
reduceSequence
Reduce (i.e., aggregate/combine) a set of sequence examples - for each sequence individually. Note: This method results in non-sequence data. If you would instead prefer sequences of length 1 after the reduction, use transform(new ReduceSequenceTransform(reducer)).
param reducer Reducer to use to reduce each window
reduceSequenceByWindow
Reduce (i.e., aggregate/combine) a set of sequence examples - for each sequence individually - using a window function. For example, take all records/examples in each 24-hour period (i.e., using a window function), and convert them into a single value (using the reducer). In this example, the output is a sequence, with a time period of 24 hours.
param reducer Reducer to use to reduce each window
param windowFunction Window function to apply to each sequence individually
sequenceMovingWindowReduce
SequenceMovingWindowReduceTransform: Adds a new column, where the value is derived by (a) using a window of the last N values in a single column, and (b) applying a reduction op on the window to calculate a new value. For example, this transformer can be used to implement a simple moving average of the last N values, or to determine the minimum or maximum values in the last N time steps.
For example, for a simple moving average, length 20: new SequenceMovingWindowReduceTransform(“myCol”, 20, ReduceOp.Mean)
param columnName Column name to perform windowing on
param lookback Look back period for windowing
param op Reduction operation to perform on each window
calculateSortedRank
CalculateSortedRank: calculate the rank of each example, after sorting the examples. For example, we might have some numerical “score” column, and we want to know the rank (sort order) of each example according to that column. The rank of each example (after sorting) will be added in a new Long column. Indexing is done from 0; examples will have values 0 to dataSetSize-1.
Currently, CalculateSortedRank can only be applied to standard (i.e., non-sequence) data. Furthermore, the current implementation can only sort on one column.
param newColumnName Name of the new column (will contain the rank for each example)
param sortOnColumn Column to sort on
param comparator Comparator used to sort examples
calculateSortedRank
CalculateSortedRank: calculate the rank of each example, after sorting the examples. For example, we might have some numerical “score” column, and we want to know the rank (sort order) of each example according to that column. The rank of each example (after sorting) will be added in a new Long column. Indexing is done from 0; examples will have values 0 to dataSetSize-1.
Currently, CalculateSortedRank can only be applied to standard (i.e., non-sequence) data. Furthermore, the current implementation can only sort on one column.
param newColumnName Name of the new column (will contain the rank for each example)
param sortOnColumn Column to sort on
param comparator Comparator used to sort examples
param ascending If true: sort ascending. False: descending
stringToCategorical
Convert the specified String column to a categorical column. The state names must be provided.
param columnName Name of the String column to convert to categorical
param stateNames State names of the category
stringRemoveWhitespaceTransform
Remove all whitespace characters from the values in the specified String column
param columnName Name of the column to remove whitespace from
stringMapTransform
Replace one or more String values in the specified column with new values.
Keys in the map are the original values; the values in the map are their replacements. If a String appears in the data but does not appear in the provided map (as a key), that String value will not be modified.
param columnName Name of the column in which to do replacement
param mapping Map of oldValues -> newValues
stringToTimeTransform
Convert a String column (containing a date/time String) to a time column (by parsing the date/time String)
param column String column containing the date/time Strings
param format Format of the strings. Time format is specified as per http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html
param dateTimeZone Timezone of the column
stringToTimeTransform
Convert a String column (containing a date/time String) to a time column (by parsing the date/time String)
param column String column containing the date/time Strings
param format Format of the strings. Time format is specified as per http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html
param dateTimeZone Timezone of the column
param locale Locale of the column
appendStringColumnTransform
Append a String to a specified column
param column Column to append the value to
param toAppend String to append to the end of each writable
conditionalReplaceValueTransform
Replace the values in a specified column with a specified new value, if some condition holds. If the condition does not hold, the original values are not modified.
param column Column to operate on
param newValue Value to use as replacement, if condition is satisfied
param condition Condition that must be satisfied for replacement
conditionalReplaceValueTransformWithDefault
Replace the values in a specified column with a specified “yes” value, if some condition holds. Replace it with a “no” value, otherwise.
param column Column to operate on
param yesVal Value to use as replacement, if condition is satisfied
param noVal Value to use as replacement, if condition is not satisfied
param condition Condition that must be satisfied for replacement
conditionalCopyValueTransform
Replace the value in a specified column with a new value taken from another column, if a condition is satisfied/true. Note that the condition can be any generic condition, including on other column(s), different to the column that will be modified if the condition is satisfied/true.
param columnToReplace Name of the column in which values will be replaced (if condition is satisfied)
param sourceColumn Name of the column from which the new values will be taken
param condition Condition to use
replaceStringTransform
Replace one or more String values in the specified column that match regular expressions.
Keys in the map are the regular expressions; the values in the map are their String replacements. For example:

| Original | Regex | Replacement | Result |
| --- | --- | --- | --- |
| Data_Vec | _ | (empty) | DataVec |
| B1C2T3 | \d | one | BoneConeTone |
| ' 4.25 ' | ^\s+\|\s+$ | (empty) | '4.25' |
param columnName Name of the column in which to do replacement
param mapping Map of old values or regular expression to new values
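For example, a minimal sketch combining the two behaviors from the table above; the schema variable and column name are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

import org.datavec.api.transform.TransformProcess;

Map<String, String> replacements = new HashMap<>();
replacements.put("_", "");           // remove underscores
replacements.put("^\\s+|\\s+$", ""); // trim leading/trailing whitespace

TransformProcess tp = new TransformProcess.Builder(schema) // 'schema' assumed defined
        .replaceStringTransform("myColumn", replacements)  // hypothetical column name
        .build();
```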
ndArrayScalarOpTransform
Element-wise NDArray math operation (add, subtract, etc) on an NDArray column
param columnName Name of the NDArray column to perform the operation on
param op Operation to perform
param value Value for the operation
ndArrayColumnsMathOpTransform
Perform an element-wise mathematical operation (such as add, subtract, multiply) on NDArray columns. The existing columns are unchanged; a new NDArray column is added
param newColumnName Name of the new NDArray column
param mathOp Operation to perform
param columnNames Name of the columns used as input to the operation
ndArrayMathFunctionTransform
Apply an element-wise mathematical function (sin, tanh, abs etc) to an NDArray column. This operation is performed in place.
param columnName Name of the column to perform the operation on
param mathFunction Mathematical function to apply
ndArrayDistanceTransform
Calculate a distance (cosine similarity, Euclidean, Manhattan) on two equal-sized NDArray columns. This operation adds a new Double column (with the specified name) with the result.
param newColumnName Name of the new column (result) to add
param distance Distance to apply
param firstCol first column to use in the distance calculation
param secondCol second column to use in the distance calculation
firstDigitTransform
FirstDigitTransform converts a column to a categorical column, with values being the first digit of the number. For example, “3.1415” becomes “3” and “2.0” becomes “2”. Negative numbers ignore the sign: “-7.123” becomes “7”. Note that two FirstDigitTransform.Mode settings are supported, which determine how non-numerical entries are handled: EXCEPTION_ON_INVALID: output has 10 category values (“0”, …, “9”), and any non-numerical values result in an exception. INCLUDE_OTHER_CATEGORY: output has 11 category values (“0”, …, “9”, “Other”); all non-numerical values are mapped to “Other”.
FirstDigitTransform is useful (combined with CategoricalToOneHotTransform and Reductions) to implement Benford’s law.
param inputColumn Input column name
param outputColumn Output column name. If same as input, input column is replaced
firstDigitTransform
FirstDigitTransform converts a column to a categorical column, with values being the first digit of the number. For example, “3.1415” becomes “3” and “2.0” becomes “2”. Negative numbers ignore the sign: “-7.123” becomes “7”. Note that two FirstDigitTransform.Mode settings are supported, which determine how non-numerical entries are handled: EXCEPTION_ON_INVALID: output has 10 category values (“0”, …, “9”), and any non-numerical values result in an exception. INCLUDE_OTHER_CATEGORY: output has 11 category values (“0”, …, “9”, “Other”); all non-numerical values are mapped to “Other”.
FirstDigitTransform is useful (combined with CategoricalToOneHotTransform and Reductions) to implement Benford’s law.
param inputColumn Input column name
param outputColumn Output column name. If same as input, input column is replaced
param mode See FirstDigitTransform.Mode
build
Create the TransformProcess object
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
outputColumnName
The output column name after the operation has been applied
return the output column name
columnName
The output column names. This will often be the same as the input
return the output column names
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
outputColumnName
The output column name after the operation has been applied
return the output column name
columnName
The output column names. This will often be the same as the input
return the output column names
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
The Pivot transform operates on two columns: a categorical column that operates as a key, and another column that contains a value. Essentially, the Pivot transform takes key-value pairs and breaks them out into separate columns.
For example, with schema [col0, key, value, col3] and values with key in {a,b,c}: the output schema is [col0, key[a], key[b], key[c], col3], and the input (col0Val, b, x, col3Val) gets mapped to (col0Val, 0, x, 0, col3Val).
When expanding columns, a default value is used - for example, 0 for numerical columns.
transform
param keyColumnName Key column to expand
param valueColumnName Name of the column that contains the value
Convert a String column to a categorical column
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
Add a new column, where the values in that column for all records are identical (according to the specified value)
Duplicate one or more columns. The duplicated columns are placed immediately after the original columns
transform
param columnsToDuplicate List of columns to duplicate
param newColumnNames List of names for the new (duplicate) columns
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
outputColumnName
The output column name after the operation has been applied
return the output column name
columnName
The output column names. This will often be the same as the input
return the output column names
Transform that removes all columns except for those that are explicitly specified as ones to keep. To specify the columns to remove instead, use RemoveColumnsTransform.
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
outputColumnName
The output column name after the operation has been applied
return the output column name
columnName
The output column names. This will often be the same as the input
return the output column names
Remove the specified columns from the data. To specify the columns to keep instead, use RemoveAllColumnsExceptForTransform.
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
outputColumnName
The output column name after the operation has been applied
return the output column name
columnName
The output column names. This will often be the same as the input
return the output column names
Rename one or more columns
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
outputColumnName
The output column name after the operation has been applied
return the output column name
columnName
The output column names. This will often be the same as the input
return the output column names
Rearrange the order of the columns. Note: A partial list of columns can be used here. Any columns that are not explicitly mentioned will be placed after those that are in the output, without changing their relative order.
transform
param newOrder A partial or complete order of the columns in the output
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
outputColumnName
The output column name after the operation has been applied
return the output column name
columnName
The output column names. This will often be the same as the input
return the output column names
Replace the value in a specified column with a new value taken from another column, if a condition is satisfied/true. Note that the condition can be any generic condition, including on other column(s), different to the column that will be modified if the condition is satisfied/true.
Note: For sequences, this transform uses the convention that each step in the sequence is passed to the condition, and replaced (or not) separately (i.e., Condition.condition(List) is used on each time step individually)
transform
param columnToReplace Name of the column in which to replace the old value
param sourceColumn Name of the column to get the new value from
param condition Condition
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
outputColumnName
The output column name after the operation has been applied
return the output column name
columnName
The output column names. This will often be the same as the input
return the output column names
Replace the value in a specified column with a new value, if a condition is satisfied/true. Note that the condition can be any generic condition, including on other column(s), different to the column that will be modified if the condition is satisfied/true.
Note: For sequences, this transform uses the convention that each step in the sequence is passed to the condition, and replaced (or not) separately (i.e., Condition.condition(List) is used on each time step individually)
transform
param columnToReplace Name of the column in which to replace the old value with ‘newValue’, if the condition holds
param newValue New value to use
param condition Condition
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
outputColumnName
The output column name after the operation has been applied
return the output column name
columnName
The output column names. This will often be the same as the input
return the output column names
Replace the value in a specified column with a ‘yes’ value, if a condition is satisfied/true. Replace the value of this same column with a ‘no’ value otherwise. Note that the condition can be any generic condition, including on other column(s), different to the column that will be modified if the condition is satisfied/true.
Note: For sequences, this transform uses the convention that each step in the sequence is passed to the condition, and replaced (or not) separately (i.e., Condition.condition(List) is used on each time step individually)
Convert any value to a Double
map
param column Name of the column to convert to a Double column
Add a new double column, calculated from one or more other columns. A new column (with the specified name) is added as the final column of the output. No other columns are modified. For example, if newColumnName==”newCol”, mathOp==Add, and columns=={“col1”,”col2”}, then the output column with name “newCol” has value col1+col2.
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
A simple transform to do common mathematical operations, such as sin(x), ceil(x), etc.
Double mathematical operation. This is an in-place operation of the double column value and a double scalar.
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
Normalize by taking scale * log2((in-columnMin)/(mean-columnMin) + 1). Maps values in the range (columnMin to infinity) to (0 to infinity). Most suitable for values with a geometric/negative exponential type distribution.
map
Transform an object into another object
param input the record to transform
return the transformed writable
Normalizer to map (min to max) -> (newMin to newMax) linearly.
Mathematically: newMin + (x - min) * (newMax - newMin) / (max - min). For example, mapping x = 5 from (0, 10) to (0, 1) gives 0 + (5 - 0) * (1 - 0) / (10 - 0) = 0.5.
map
Transform an object into another object
param input the record to transform
return the transformed writable
Normalize using (x-mean)/stdev. Also known as a standard score, standardization etc.
map
Transform an object into another object
param input the record to transform
return the transformed writable
Normalize by subtracting the mean
map
Transform an object into another object
param input the record to transform
return the transformed writable
Convert any value to an Integer.
map
param column Name of the column to convert to an integer
Add a new integer column, calculated from one or more other columns. A new column (with the specified name) is added as the final column of the output. No other columns are modified. For example, if newColumnName==”newCol”, mathOp==MathOp.Add, and columns=={“col1”,”col2”}, then the output column with name “newCol” has value col1+col2. NOTE: Division here uses integer division (integer output). Use DoubleColumnsMathOpTransform if a decimal output value is required.
toString
param newColumnName Name of the new column (output column)
param mathOp Mathematical operation. Only Add/Subtract/Multiply/Divide/Modulus is allowed here
param columns Columns to use in the mathematical operation
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
Integer mathematical operation. This is an in-place operation of the integer column value and an integer scalar.
map
Transform an object into another object
param input the record to transform
return the transformed writable
Convert an integer column to a set of one-hot columns.
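A minimal sketch, assuming a hypothetical integer column "rating" with values in the range 1 to 5:

```java
// Replace the "rating" column with 5 one-hot columns, one per integer value in [1, 5]
TransformProcess tp = new TransformProcess.Builder(schema)
    .integerToOneHot("rating", 1, 5)
    .build();
```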
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
outputColumnName
The output column name after the operation has been applied
return the output column name
columnName
The output column names. This will often be the same as the input column names
return the output column names
Replace an empty/missing integer with a certain value.
map
Transform an object into another object
param input the record to transform
return the transformed writable
Replace an invalid (non-integer) value in a column with a specified integer
map
Transform an object into another object
param input the record to transform
return the transformed writable
Add a new long column, calculated from one or more other columns. A new column (with the specified name) is added as the final column of the output. No other columns are modified. For example, if newColumnName==”newCol”, mathOp==MathOp.Add, and columns=={“col1”,”col2”}, then the output column with name “newCol” has value col1+col2. NOTE: Division here uses long division (long output). Use DoubleColumnsMathOpTransform if a decimal output value is required.
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
Long mathematical operation. This is an in-place operation of the long column value and a long scalar.
map
Transform an object into another object
param input the record to transform
return the transformed writable
Convert each text value in a sequence to a longer sequence of integer indices. For example, “abc” would be converted to [1, 2, 3]. Values in other columns will be duplicated.
Convert each text value in a sequence to a longer sequence of integer indices. For example, “zero one two” would be converted to [0, 1, 2]. Values in other columns will be duplicated.
SequenceDifferenceTransform: for an input sequence, calculate the difference on one column. For each time t, calculate someColumn(t) - someColumn(t-s), where s >= 1 is the ‘lookback’ period.
Note: at t=0 (i.e., the first step in a sequence; or more generally, for all times t < s), there is no previous value to subtract. There are two options for handling these time steps:
Default: output = someColumn(t) - someColumn(max(t-s, 0))
SpecifiedValue: output = someColumn(t) - someColumn(t-s) if t-s >= 0, or a custom Writable object (for example, a DoubleWritable(0) or NullWritable).
Note: this is an in-place operation: i.e., the values in each column are modified. If the original values are required, copy the column first and apply the difference operation in-place on the copy.
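A minimal sketch with a hypothetical sequence column "sensorValue" (the exact package of `SequenceDifferenceTransform` may vary slightly between DataVec versions):

```java
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.sequence.difference.SequenceDifferenceTransform;

// Default lookback of 1 and FirstStepMode.Default; output column name matches the input
TransformProcess tp = new TransformProcess.Builder(sequenceSchema)
    .transform(new SequenceDifferenceTransform("sensorValue"))
    .build();
```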
outputColumnName
Create a SequenceDifferenceTransform with default lookback of 1, and using FirstStepMode.Default. Output column name is the same as the input column name.
param columnName Name of the column to perform the operation on.
columnName
The output column names. This will often be the same as the input column names
return the output column names
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
SequenceMovingWindowReduceTransform: adds a new column, where the value is derived by (a) using a window of the last N values in a single column, and (b) applying a reduction op on the window to calculate a new value. For example, this transform can be used to implement a simple moving average of the last N values, or to determine the minimum or maximum values in the last N time steps.
defaultOutputColumnName
Enumeration to specify how edge cases are handled. For example, for a look back period of 20, how should the first 19 output values be calculated? Default: perform the same reduction as normal, with as many values as are available. SpecifiedValue: use the given/specified value instead of the actual output value. For example, you could assign values of 0 or NullWritable to positions 0 through 18 of the output.
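A hedged sketch of a 5-step moving average, assuming a constructor of the form (columnName, lookback, op) and a hypothetical "price" column:

```java
import org.datavec.api.transform.ReduceOp;

// New column containing the mean of the last 5 "price" values at each time step
TransformProcess tp = new TransformProcess.Builder(sequenceSchema)
    .transform(new SequenceMovingWindowReduceTransform("price", 5, ReduceOp.Mean))
    .build();
```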
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
outputColumnName
The output column name after the operation has been applied
return the output column name
columnName
The output column names. This will often be the same as the input column names
return the output column names
Sequence offset transform takes a sequence and shifts the values in one or more columns by a specified number of time steps. It has 2 modes of operation (OperationType enum) with respect to the columns it operates on: InPlace: operations may be performed in-place, modifying the values in the specified columns. NewColumn: operations may produce new columns, with the original (source) columns remaining unmodified.
Additionally, there are 2 modes for handling values outside the original sequence (EdgeHandling enum): TrimSequence: the entire sequence is trimmed (start or end) by a specified number of steps. SpecifiedValue: any values outside of the original sequence are given a specified value.
Note 1: When specifying offsets, positive offsets move the values in the specified columns to a later time; earlier time steps are either trimmed or given specified values, and the last values in these columns will be truncated/removed.
Note 2: Care must be taken when using TrimSequence: for example, if we chain multiple sequence offset transforms on the one dataset, we may end up trimming much more than we want. In this case, it may be better to use SpecifiedValue, at the end.
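A hedged sketch via the builder's offsetSequence method, assuming it accepts (columns, offset amount, operation type); the column name is hypothetical:

```java
import java.util.Arrays;

// Shift "sensorValue" one time step later, writing the result to a new column
TransformProcess tp = new TransformProcess.Builder(sequenceSchema)
    .offsetSequence(Arrays.asList("sensorValue"), 1,
        SequenceOffsetTransform.OperationType.NewColumn)
    .build();
```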
Append a String to the values in a single column
map
Transform an object into another object
param input the record to transform
return the transformed writable
Change the case (e.g., to all lower case) of a String column.
Concatenate the values of one or more String columns into a new String column. The constituent String columns are retained, so the user must remove them manually if desired.
TODO: use new String Reduce functionality in DataVec?
transform
param columnsToConcatenate A partial or complete order of the columns in the output
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
outputColumnName
The output column name after the operation has been applied
return the output column name
columnName
The output column names. This will often be the same as the input column names
return the output column names
Convert any value to a string.
map
Transform the writable into a string
param writable the writable to transform
return the string form of this writable
map
Transform an object into another object
param input the record to transform
return the transformed writable
This method maps all String values, except those in the specified list, to a single String value
map
Transform an object into another object
param input the record to transform
return the transformed writable
String transform that removes all whitespace characters
map
Transform an object into another object
param input the record to transform
return the transformed writable
Replace empty String values with the specified String
map
Transform an object into another object
param input the record to transform
return the transformed writable
Replaces String values that match regular expressions.
map
Constructs a new ReplaceStringTransform using the specified map
param columnName Name of the column
param map Key: regular expression; Value: replacement value
Convert a delimited String to a list of binary categorical columns. Suppose the possible String values were {“a”,”b”,”c”,”d”} and the String column value to be converted contained the String “a,c”, then the 4 output columns would have values [“true”,”false”,”true”,”false”]
transform
param columnName The name of the column to convert
param newColumnNames The names of the new columns to create
param categoryTokens The possible tokens that may be present. Note this list must have the same length and order as the newColumnNames list
param delimiter The delimiter for the Strings to convert
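A minimal sketch for the {“a”,”b”,”c”,”d”} example above, with hypothetical column names:

```java
import java.util.Arrays;
import org.datavec.api.transform.transform.string.StringListToCategoricalSetTransform;

// "a,c" in column "tags" becomes [true, false, true, false] across four new columns
TransformProcess tp = new TransformProcess.Builder(schema)
    .transform(new StringListToCategoricalSetTransform(
        "tags",
        Arrays.asList("hasA", "hasB", "hasC", "hasD"), // new column names
        Arrays.asList("a", "b", "c", "d"),             // category tokens, same order
        ","))                                          // delimiter
    .build();
```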
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
outputColumnName
The output column name after the operation has been applied
return the output column name
columnName
The output column names. This will often be the same as the input column names
return the output column names
Converts a String column into a bag-of-words (BOW) represented as an NDArray of “counts.” Note that the original column is removed in the process.
transform
param columnName The name of the column to convert
param vocabulary The possible tokens that may be present.
param delimiter The delimiter for the Strings to convert
param ignoreUnknown Whether to ignore unknown tokens
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
outputColumnName
The output column name after the operation has been applied
return the output column name
columnName
The output column names. This will often be the same as the input column names
return the output column names
Converts a String column into a sparse bag-of-words (BOW) represented as an NDArray of indices. Appropriate for embeddings or as efficient storage before being expanded into a dense array.
A simple String -> String map function.
Keys in the map are the original values; the values in the map are their replacements. If a String appears in the data but does not appear in the provided map (as a key), that String value will not be modified.
map
param columnName Name of the column
param map Key: From. Value: To
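A minimal sketch with a hypothetical "city" column:

```java
import java.util.HashMap;
import java.util.Map;
import org.datavec.api.transform.transform.string.StringMapTransform;

Map<String, String> replacements = new HashMap<>();
replacements.put("NYC", "New York City"); // from -> to

// Values not present in the map are left unmodified
TransformProcess tp = new TransformProcess.Builder(schema)
    .transform(new StringMapTransform("city", replacements))
    .build();
```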
map
Transform an object into another object
param input the record to transform
return the transformed writable
Create a number of new columns by deriving their values from a Time column. Can be used for example to create new columns with the year, month, day, hour, minute, second etc.
map
Transform an object into another object
param input the record to transform
return the transformed writable
mapSequence
Transform a sequence
param sequence
toString
The output column name after the operation has been applied
return the output column name
Convert a String column to a time column by parsing the date/time String using Joda-Time.
Time format is specified as per http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html
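A minimal sketch, assuming a hypothetical String column "dateTimeString" holding values like "2016-01-01 17:00:00":

```java
import org.joda.time.DateTimeZone;

// Parse the String column into a (long epoch millisecond) time column
TransformProcess tp = new TransformProcess.Builder(schema)
    .stringToTimeTransform("dateTimeString", "YYYY-MM-dd HH:mm:ss", DateTimeZone.UTC)
    .build();
```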
getNewColumnMetaData
Instantiate this without a time format specified. If this constructor is used, the transform will attempt to handle several common time formats, as defined in the static formats array.
param columnName Name of the String column
param timeZone Timezone for time parsing
map
Transform an object into another object
param input the record to transform
return the transformed writable
Transform math op on a time column
Note: only the following MathOps are supported: Add, Subtract, ScalarMin, ScalarMax For ScalarMin/Max, the TimeUnit must be milliseconds - i.e., value must be in epoch millisecond format
map
Transform an object into another object
param input the record to transform
return the transformed writable
Schemas for datasets and transformation.
The unfortunate reality is that data is dirty. When trying to vectorize a dataset for deep learning, it is quite rare to find files that have zero errors. A schema is important for maintaining the meaning of the data before using it for something like training a neural network.
Schemas are primarily used for programming transformations. Before you can properly execute a TransformProcess
you will need to pass the schema of the data being transformed.
An example of a schema for merchant records may look like:
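A hedged sketch with hypothetical column names:

```java
import org.datavec.api.transform.schema.Schema;
import org.joda.time.DateTimeZone;

Schema schema = new Schema.Builder()
    .addColumnString("merchantID")
    .addColumnCategorical("merchantCountry", "USA", "CAN", "FR", "MX")
    .addColumnDouble("transactionAmountUSD", 0.0, null, false, false) // min 0, no max, no NaN/Infinite
    .addColumnTime("dateTime", DateTimeZone.UTC)
    .build();
```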
If you have two different datasets that you want to merge together, DataVec provides a Join
class with different join strategies such as Inner
or RightOuter
.
Once you've defined your join and you've loaded the data into DataVec, you must use an Executor
to complete the join.
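A hedged sketch of an inner join on Spark, assuming both datasets share a hypothetical "merchantID" column and have already been loaded as RDDs of parsed records:

```java
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.datavec.api.transform.join.Join;
import org.datavec.api.writable.Writable;
import org.datavec.spark.transform.SparkTransformExecutor;

Join join = new Join.Builder(Join.JoinType.Inner)
    .setJoinColumns("merchantID")        // join key present in both schemas
    .setSchemas(leftSchema, rightSchema)
    .build();
JavaRDD<List<Writable>> joined = SparkTransformExecutor.executeJoin(join, leftRdd, rightRdd);
```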
DataVec comes with a few Schema
classes and helper utilities for 2D and sequence types of data.
Join class: used to specify a join (like an SQL join)
setSchemas
Type of join. Inner: return examples where the join column values occur in both datasets. LeftOuter: return all examples from the left data, whether there is a matching right value or not (if not, right values will have NullWritable instead). RightOuter: return all examples from the right data, whether there is a matching left value or not (if not, left values will have NullWritable instead). FullOuter: return all examples from both left/right, whether there is a matching value from the other side or not (if not, the other values will have NullWritable instead).
setKeyColumns
deprecated Use {- link #setJoinColumns(String…)}
setKeyColumnsLeft
deprecated Use {- link #setJoinColumnsLeft(String…)}
setKeyColumnsRight
deprecated Use {- link #setJoinColumnsRight(String…)}
setJoinColumnsLeft
Specify the names of the columns to join on, for the left data. The idea: join examples where firstDataValues(joinColumnNamesLeft[i]) == secondDataValues(joinColumnNamesRight[i]) for all i
param joinColumnNames Names of the columns to join on (for left data)
setJoinColumnsRight
Specify the names of the columns to join on, for the right data) The idea: join examples where firstDataValues(joinColumNamesLeft[i]) == secondDataValues(joinColumnNamesRight[i]) for all i
param joinColumnNames Names of the columns to join on (for right data)
If passed a CSV file that contains a header and a single row of sample data, it will return a Schema.
Only Double, Integer, Long, and String types are supported. If no number type can be inferred, the field type will become the default type. Note that if your column is actually categorical but is represented as a number, you will need to do additional transformation. Also, if your sample field is blank/null, it will also become the default type.
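A minimal sketch (the exact API may vary by DataVec version; `build()` may declare an IOException that needs handling):

```java
import org.datavec.api.transform.schema.InferredSchema;
import org.datavec.api.transform.schema.Schema;

// Infer column names and types from the header plus one sample row
Schema inferred = new InferredSchema("data/sample.csv").build();
```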
A Schema defines the layout of tabular data. Specifically, it contains names for each column, as well as details of types (Integer, String, Long, Double, etc). Type information for each column may optionally include restrictions on the allowable values for each column.
sameTypes
Create a schema based on the given metadata
param columnMetaData the metadata to create the schema from
newSchema
Compute the difference in {- link ColumnMetaData} between this schema and the passed in schema. This is useful during the {- link org.datavec.api.transform.TransformProcess} to identify what a process will do to a given {- link Schema}.
param schema the schema to compute the difference for
return the metadata that is different (in order) between this schema and the other schema
numColumns
Returns the number of columns or fields for this schema
return the number of columns or fields for this schema
getName
Returns the name of a given column at the specified index
param column the index of the column to get the name for
return the name of the column at the specified index
getType
Returns the {- link ColumnType} for the column at the specified index
param column the index of the column to get the type for
return the type of the column at the specified index
getType
Returns the {- link ColumnType} for the column at the specified index
param columnName the name of the column to get the type for
return the type of the column at the specified index
getMetaData
Returns the {- link ColumnMetaData} at the specified column index
param column the index to get the metadata for
return the metadata at the specified index
getMetaData
Retrieve the metadata for the given column name
param column the name of the column to get metadata for
return the metadata for the given column name
getIndexOfColumn
Return a copy of the list of column names
return a copy of the list of column names for this schema
hasColumn
Return the indices of the columns, given their names
param columnNames Name of the columns to get indices for
return Column indexes
toJson
Serialize this schema to json
return a json representation of this schema
toYaml
Serialize this schema to yaml
return the yaml representation of this schema
fromJson
Create a schema from a given json string
param json the json to create the schema from
return the created schema based on the json
fromYaml
Create a schema from the given yaml string
param yaml the yaml to create the schema from
return the created schema based on the yaml
addColumnFloat
Add a Float column with no restrictions on the allowable values, except for no NaN/infinite values allowed
param name Name of the column
addColumnFloat
Add a Float column with the specified restrictions (and no NaN/Infinite values allowed)
param name Name of the column
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
return
addColumnFloat
Add a Float column with the specified restrictions
param name Name of the column
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
param allowNaN If false: don’t allow NaN values. If true: allow.
param allowInfinite If false: don’t allow infinite values. If true: allow
addColumnsFloat
Add multiple Float columns with no restrictions on the allowable values of the columns (other than no NaN/Infinite)
param columnNames Names of the columns to add
addColumnsFloat
A convenience method for adding multiple Float columns. For example, to add columns “myFloatCol_0”, “myFloatCol_1”, “myFloatCol_2”, use {- code addColumnsFloat(“myFloatCol_%d”,0,2)}
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
addColumnsFloat
A convenience method for adding multiple Float columns, with additional restrictions that apply to all columns For example, to add columns “myFloatCol_0”, “myFloatCol_1”, “myFloatCol_2”, use {- code addColumnsFloat(“myFloatCol_%d”,0,2,null,null,false,false)}
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
param allowNaN If false: don’t allow NaN values. If true: allow.
param allowInfinite If false: don’t allow infinite values. If true: allow
addColumnDouble
Add a Double column with no restrictions on the allowable values, except for no NaN/infinite values allowed
param name Name of the column
addColumnDouble
Add a Double column with the specified restrictions (and no NaN/Infinite values allowed)
param name Name of the column
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
return
addColumnDouble
Add a Double column with the specified restrictions
param name Name of the column
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
param allowNaN If false: don’t allow NaN values. If true: allow.
param allowInfinite If false: don’t allow infinite values. If true: allow
addColumnsDouble
Add multiple Double columns with no restrictions on the allowable values of the columns (other than no NaN/Infinite)
param columnNames Names of the columns to add
addColumnsDouble
A convenience method for adding multiple Double columns. For example, to add columns “myDoubleCol_0”, “myDoubleCol_1”, “myDoubleCol_2”, use {- code addColumnsDouble(“myDoubleCol_%d”,0,2)}
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
addColumnsDouble
A convenience method for adding multiple Double columns, with additional restrictions that apply to all columns For example, to add columns “myDoubleCol_0”, “myDoubleCol_1”, “myDoubleCol_2”, use {- code addColumnsDouble(“myDoubleCol_%d”,0,2,null,null,false,false)}
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
param allowNaN If false: don’t allow NaN values. If true: allow.
param allowInfinite If false: don’t allow infinite values. If true: allow
addColumnInteger
Add an Integer column with no restrictions on the allowable values
param name Name of the column
addColumnInteger
Add an Integer column with the specified min/max allowable values
param name Name of the column
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
addColumnsInteger
Add multiple Integer columns with no restrictions on the min/max allowable values
param names Names of the integer columns to add
addColumnsInteger
A convenience method for adding multiple Integer columns. For example, to add columns “myIntegerCol_0”, “myIntegerCol_1”, “myIntegerCol_2”, use {- code addColumnsInteger(“myIntegerCol_%d”,0,2)}
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
addColumnsInteger
A convenience method for adding multiple Integer columns. For example, to add columns “myIntegerCol_0”, “myIntegerCol_1”, “myIntegerCol_2”, use {- code addColumnsInteger(“myIntegerCol_%d”,0,2)}
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
addColumnCategorical
Add a Categorical column, with the specified state names
param name Name of the column
param stateNames Names of the allowable states for this categorical column
addColumnCategorical
Add a Categorical column, with the specified state names
param name Name of the column
param stateNames Names of the allowable states for this categorical column
addColumnLong
Add a Long column, with no restrictions on the min/max values
param name Name of the column
addColumnLong
Add a Long column with the specified min/max allowable values
param name Name of the column
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
addColumnsLong
Add multiple Long columns, with no restrictions on the allowable values
param names Names of the Long columns to add
addColumnsLong
A convenience method for adding multiple Long columns. For example, to add columns “myLongCol_0”, “myLongCol_1”, “myLongCol_2”, use {- code addColumnsLong(“myLongCol_%d”,0,2)}
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
addColumnsLong
A convenience method for adding multiple Long columns. For example, to add columns “myLongCol_0”, “myLongCol_1”, “myLongCol_2”, use {- code addColumnsLong(“myLongCol_%d”,0,2)}
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
addColumn
Add a column
param metaData metadata for this column
addColumnString
Add a String column with no restrictions on the allowable values.
param name Name of the column
addColumnsString
Add multiple String columns with no restrictions on the allowable values
param columnNames Names of the String columns to add
addColumnString
Add a String column with the specified restrictions
param name Name of the column
param regex Regex that the String must match in order to be considered valid. If null: no regex restriction
param minAllowableLength Minimum allowable length for the String to be considered valid
param maxAllowableLength Maximum allowable length for the String to be considered valid
addColumnsString
A convenience method for adding multiple numbered String columns. For example, to add columns “myStringCol_0”, “myStringCol_1”, “myStringCol_2”, use {- code addColumnsString(“myStringCol_%d”,0,2)}
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
addColumnsString
A convenience method for adding multiple numbered String columns. For example, to add columns “myStringCol_0”, “myStringCol_1”, “myStringCol_2”, use {- code addColumnsString(“myStringCol_%d”,0,2)}
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
param regex Regex that the String must match in order to be considered valid. If null: no regex restriction
param minAllowedLength Minimum allowed length of strings (inclusive). If null: no restriction
param maxAllowedLength Maximum allowed length of strings (inclusive). If null: no restriction
addColumnTime
Add a Time column with no restrictions on the min/max allowable times NOTE: Time columns are represented by LONG (epoch millisecond) values. For time values in human-readable formats, use String columns + StringToTimeTransform
param columnName Name of the column
param timeZone Time zone of the time column
addColumnTime
Add a Time column with no restrictions on the min/max allowable times NOTE: Time columns are represented by LONG (epoch millisecond) values. For time values in human-readable formats, use String columns + StringToTimeTransform
param columnName Name of the column
param timeZone Time zone of the time column
addColumnTime
Add a Time column with the specified restrictions NOTE: Time columns are represented by LONG (epoch millisecond) values. For time values in human-readable formats, use String columns + StringToTimeTransform
param columnName Name of the column
param timeZone Time zone of the time column
param minValidValue Minimum allowable time (in milliseconds). May be null.
param maxValidValue Maximum allowable time (in milliseconds). May be null.
addColumnNDArray
Add a NDArray column
param columnName Name of the column
param shape shape of the NDArray column. Use -1 in entries to specify as “variable length” in that dimension
build
Create the Schema
inferMultiple
Infers a schema based on the record. The column names are based on indexing.
param record the record to infer from
return the inferred schema
infer
Infers a schema based on the record. The column names are based on indexing.
param record the record to infer from
return the inferred schema
inferSequenceMulti
Infers a sequence schema based on the record
param record the record to infer the schema based on
return the inferred sequence schema
inferSequence
Infers a sequence schema based on the record
param record the record to infer the schema based on
return the inferred sequence schema
Execute ETL and vectorization in a local instance.
Because datasets are commonly large by nature, you can decide on an execution mechanism that best suits your needs. For example, if you are vectorizing a large training dataset, you can process it in a distributed Spark cluster. However, if you need to do real-time inference, DataVec also provides a local executor that doesn't require any additional setup.
Once you've created your TransformProcess
using your Schema
, and you've either loaded your dataset into an Apache Spark JavaRDD
or have a RecordReader
that loads your dataset, you can execute a transform.
Locally this looks like:
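A minimal sketch, assuming `records` is a `List<List<Writable>>` already loaded via a RecordReader and `tp` is your TransformProcess:

```java
import java.util.List;
import org.datavec.api.writable.Writable;
import org.datavec.local.transforms.LocalTransformExecutor;

List<List<Writable>> processed = LocalTransformExecutor.execute(records, tp);
```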
When using Spark this looks like:
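A minimal sketch, assuming `inputRdd` is a `JavaRDD<List<Writable>>`:

```java
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.datavec.api.writable.Writable;
import org.datavec.spark.transform.SparkTransformExecutor;

JavaRDD<List<Writable>> processed = SparkTransformExecutor.execute(inputRdd, tp);
```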
Local transform executor
isTryCatch
Execute the specified TransformProcess with the given input data. Note: this method can only be used if the TransformProcess returns non-sequence data. For TransformProcesses that return a sequence, use {- link #executeToSequence(List, TransformProcess)}
param inputWritables Input data to process
param transformProcess TransformProcess to execute
return Processed data
Execute a DataVec transform process on Spark RDDs.
isTryCatch
deprecated Use static methods instead of instance methods on SparkTransformExecutor
Data wrangling and mapping from one schema to another.
DataVec comes with the ability to serialize transforms, which allows them to be more portable when they're needed for production environments. A TransformProcess
is serialized to a human-readable format such as JSON and can be saved as a file.
The code below shows how you can serialize the transform process tp
.
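A minimal sketch:

```java
// Serialize to human-readable formats; write the strings to files as needed
String asJson = tp.toJson();
String asYaml = tp.toYaml();
```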
When you want to reinstantiate the transform process, call the static from<format>
method.
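For example:

```java
import org.datavec.api.transform.TransformProcess;

TransformProcess fromJson = TransformProcess.fromJson(asJson);
TransformProcess fromYaml = TransformProcess.fromYaml(asYaml);
```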
Serializer used for converting objects (Transforms, Conditions, etc) to JSON format
Serializer used for converting objects (Transforms, Conditions, etc) to YAML format
Implementations for advanced transformation.
Operations, such as a Function
, help execute transforms and load data into DataVec. The concept of operations is low-level, meaning that most of the time you will not need to worry about them.
If you're using Apache Spark, functions will iterate over the dataset and load it into a Spark RDD
and convert the raw data format into a Writable
.
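A hedged sketch of such a function in use, assuming an existing `JavaSparkContext` named `sc` and a hypothetical CSV path:

```java
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.writable.Writable;
import org.datavec.spark.transform.misc.StringToWritablesFunction;

JavaRDD<String> lines = sc.textFile("data/example.csv");
// Parse each line of raw text into a List<Writable> using a CSV record reader
JavaRDD<List<Writable>> parsed = lines.map(new StringToWritablesFunction(new CSVRecordReader()));
```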
The above code loads a CSV file into a 2D Java RDD. Once your RDD is loaded, you can transform it, perform joins, and use reducers to wrangle the data any way you want.
Utility for executing many reduction operations in parallel on the same column (see datavec#238).
Wraps a reduction operation so that it supports a conversion to Byte.
Dispatches the appropriate column of each element to its corresponding operation.
Wraps a reduction operation so that it supports a conversion to Double.
Wraps a reduction operation so that it supports a conversion to Float.
Wraps a reduction operation so that it supports a conversion to Integer.
Wraps a reduction operation so that it supports a conversion to Long.
Wraps a reduction operation so that it supports a conversion to TextWritable.
CalculateSortedRank: calculate the rank of each example, after sorting the examples. For example, we might have some numerical “score” column, and we want to know the rank (sort order) of each example according to that column. The rank of each example (after sorting) will be added in a new Long column. Indexing is done from 0; examples will have values 0 to dataSetSize - 1.
Currently, CalculateSortedRank can only be applied on standard (i.e., non-sequence) data. Furthermore, the current implementation can only sort on one column
transform
param newColumnName Name of the new column (will contain the rank for each example)
param sortOnColumn Name of the column to sort on
param comparator Comparator used to sort examples
outputColumnName
The output column name after the operation has been applied
return the output column name
columnName
The output column names. This will often be the same as the input column names
return the output column names
createHtmlAnalysisString
Render a data analysis object as an HTML file. This will produce a summary table, along with charts for numerical columns. The contents of the HTML file are returned as a String, which should be written to a .html file.
param analysis Data analysis object to render
see #createHtmlAnalysisFile(DataAnalysis, File)
createHtmlAnalysisFile
Render a data analysis object as an HTML file. This will produce a summary table, along with charts for numerical columns
param dataAnalysis Data analysis object to render
param output Output file (should have extension .html)
A simple utility for plotting DataVec sequence data to HTML files. Each file contains only one sequence. Each column is plotted separately; only numerical and categorical columns are plotted.
createHtmlSequencePlots
Create an HTML file with plots for the given sequence.
param title Title of the page
param schema Schema for the data
param sequence Sequence to plot
return HTML file as a string
createHtmlSequencePlotFile
Create an HTML file with plots for the given sequence and write it to a file.
param title Title of the page
param schema Schema for the data
param sequence Sequence to plot
Given a set of coordinates in text format (the delimiter is configurable), determine the geographic midpoint. See “geographic midpoint” at: http://www.geomidpoint.com/methods.html For the implementation algorithm, see: http://www.geomidpoint.com/calculation.html
transform
param delim Delimiter for the coordinates in text format. For example, if format is “lat,long” use “,”
A StringReducer is used to take a set of examples and reduce them. The idea: suppose you have a large number of columns, and you want to combine/reduce the values in each column. StringReducer allows you to specify different reductions for different columns: append, prepend, merge, replace, etc.
Uses are: (1) Reducing examples by a key (2) Reduction operations in time series (windowing ops, etc)
transform
Get the output schema, given the input schema
outputColumnName
Create a StringReducer builder, and set the default column reduction operation. For any columns that aren’t specified explicitly, they will use the default reduction operation. If a column does have a reduction operation explicitly specified, then it will override the default specified here.
param defaultOp Default reduction operation to perform
appendColumns
Reduce the specified columns by appending the values
prependColumns
Reduce the specified columns by prepending the values
mergeColumns
Reduce the specified columns by merging the values
replaceColumn
Reduce the specified columns by replacing the values
customReduction
Reduce the specified column using a custom column reduction functionality.
param column Column to execute the custom reduction functionality on
param columnReduction Column reduction to execute on that column
setIgnoreInvalid
When doing the reduction: set the specified columns to ignore any invalid values. Invalid: defined as being not valid according to the ColumnMetaData: {- link ColumnMetaData#isValid(Writable)}. For numerical columns, this typically means being unable to parse the Writable. For example, Writable.toLong() failing for a Long column. If the column has any restrictions (min/max values, regex for Strings etc) these will also be taken into account.
param columns Columns to set ‘ignore invalid’ for
Neural networks work best when the data they’re fed is normalized, i.e., constrained to a range between -1 and 1. There are several reasons for that. One is that nets are trained using gradient descent, and their activation functions usually have an active range somewhere between -1 and 1. Even when using an activation function that doesn’t saturate quickly, it is still good practice to constrain your values to this range to improve performance.
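A minimal sketch of min-max scaling in practice, assuming an existing `DataSetIterator` named `trainIterator`:

```java
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerMinMaxScaler;

// Collect min/max statistics from the training data, then normalize batches on the fly
NormalizerMinMaxScaler scaler = new NormalizerMinMaxScaler(-1, 1);
scaler.fit(trainIterator);
trainIterator.setPreProcessor(scaler);
```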
Pre processor for DataSets that normalizes feature values (and optionally label values) to lie between a minimum and maximum value (by default between 0 and 1)
NormalizerMinMaxScaler
Preprocessor can take a range as minRange and maxRange
param minRange
param maxRange
load
Load the given min and max
param statistics the statistics to load
throws IOException
save
Save the current min and max
param files the statistics to save
throws IOException
deprecated use {- link NormalizerSerializer} instead
Base interface for all normalizers
A DataSetPreProcessor used to flatten a 4d CNN features array to a flattened 2d format (for use in networks such as a DenseLayer/multi-layer perceptron)
Statistics of the upper and lower bounds of the population
MinMaxStrategy
param minRange the target range lower bound
param maxRange the target range upper bound
preProcess
Normalize a data array
param array the data to normalize
param stats statistics of the data population
revert
Denormalize a data array
param array the data to denormalize
param stats statistics of the data population
A preprocessor specifically for images that applies min-max scaling. Can take a range, so pixel values can be scaled from 0->255 to minRange->maxRange (default minRange = 0 and maxRange = 1). If pixel values are not 8 bits, you can specify the number of bits as the third argument in the constructor. For values that are already floating point, specify the number of bits as 1.
ImagePreProcessingScaler
Preprocessor can take a range as minRange and maxRange
param a, default = 0
param b, default = 1
param maxBits in the image, default = 8
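A minimal sketch, assuming an existing `DataSetIterator` named `imageIterator`:

```java
import org.nd4j.linalg.dataset.api.preprocessor.ImagePreProcessingScaler;

// Scale 8-bit pixel values from [0, 255] into [0, 1]; no fit step is required
ImagePreProcessingScaler scaler = new ImagePreProcessingScaler(0, 1);
imageIterator.setPreProcessor(scaler);
```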
fit
Fit a dataset (compute the statistics based on this dataset only)
param dataSet the dataset to compute on
fit
Iterates over a dataset accumulating statistics for normalization
param iterator the iterator to use for collecting statistics.
transform
Transform the data
param toPreProcess the dataset to transform
A simple Composite MultiDataSetPreProcessor - allows you to apply multiple MultiDataSetPreProcessors sequentially on the one MultiDataSet, in the order they are passed to the constructor
CompositeMultiDataSetPreProcessor
param preProcessors Preprocessors to apply. They will be applied in this order
Pre processor for MultiDataSet that normalizes feature values (and optionally label values) to lie between a minimum and maximum value (by default between 0 and 1)
MultiNormalizerMinMaxScaler
Preprocessor can take a range as minRange and maxRange
param minRange the target range lower bound
param maxRange the target range upper bound
An interface for multi dataset normalizers. Data normalizers compute some sort of statistics over a MultiDataSet and scale the data in some way.
A preprocessor specifically for images that applies min-max scaling to one or more of the feature arrays in a MultiDataSet. Can take a range, so pixel values can be scaled from 0->255 to minRange->maxRange (default minRange = 0 and maxRange = 1). If pixel values are not 8 bits, you can specify the number of bits as the third argument in the constructor. For values that are already floating point, specify the number of bits as 1.
ImageMultiPreProcessingScaler
Preprocessor can take a range as minRange and maxRange
param a, default = 0
param b, default = 1
param maxBits in the image, default = 8
param featureIndices Indices of feature arrays to process. If only one feature array is present, this should always be 0
Pre processor for DataSet that normalizes feature values (and optionally label values) to have 0 mean and a standard deviation of 1
load
Load the means and standard deviations from the file system
param files the files to load from. Needs 4 files if normalizing labels, otherwise 2.
save
Save the current means and standard deviations to the file system
param files the files to save to. Needs 4 files if normalizing labels, otherwise 2.
deprecated use {- link NormalizerSerializer} instead
Statistics of the means and standard deviations of the population
preProcess
Normalize a data array
param array the data to normalize
param stats statistics of the data population
revert
Denormalize a data array
param array the data to denormalize
param stats statistics of the data population
Interface for strategies that can normalize and denormalize data arrays based on statistics of the population
Pre processor for MultiDataSet that can be configured to use different normalization strategies for different inputs and outputs, or none at all. Can be used for example when one input should be normalized, but a different one should be untouched because it’s the input for an embedding layer. Alternatively, one might want to mix standardization and min-max scaling for different inputs and outputs.
By default, no normalization is applied. There are methods to configure the desired normalization strategy for inputs and outputs either globally or on an individual input/output level. Specific input/output strategies will override global ones.
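A hedged sketch, assuming an existing `MultiDataSetIterator` named `multiIterator` (method names per the API below):

```java
import org.nd4j.linalg.dataset.api.preprocessor.MultiNormalizerHybrid;

// Standardize all inputs except input 1, which is min-max scaled to [-1, 1];
// outputs are left untouched (the default)
MultiNormalizerHybrid normalizer = new MultiNormalizerHybrid()
    .standardizeAllInputs()
    .minMaxScaleInput(1, -1, 1);
normalizer.fit(multiIterator);
```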
MultiNormalizerHybrid
Apply standardization to all inputs, except the ones individually configured
return the normalizer
minMaxScaleAllInputs
Apply min-max scaling to all inputs, except the ones individually configured
return the normalizer
minMaxScaleAllInputs
Apply min-max scaling to all inputs, except the ones individually configured
param rangeFrom lower bound of the target range
param rangeTo upper bound of the target range
return the normalizer
standardizeInput
Apply standardization to a specific input, overriding the global input strategy if any
param input the index of the input
return the normalizer
minMaxScaleInput
Apply min-max scaling to a specific input, overriding the global input strategy if any
param input the index of the input
return the normalizer
minMaxScaleInput
Apply min-max scaling to a specific input, overriding the global input strategy if any
param input the index of the input
param rangeFrom lower bound of the target range
param rangeTo upper bound of the target range
return the normalizer
standardizeAllOutputs
Apply standardization to all outputs, except the ones individually configured
return the normalizer
minMaxScaleAllOutputs
Apply min-max scaling to all outputs, except the ones individually configured
return the normalizer
minMaxScaleAllOutputs
Apply min-max scaling to all outputs, except the ones individually configured
param rangeFrom lower bound of the target range
param rangeTo upper bound of the target range
return the normalizer
standardizeOutput
Apply standardization to a specific output, overriding the global output strategy if any
param output the index of the output
return the normalizer
minMaxScaleOutput
Apply min-max scaling to a specific output, overriding the global output strategy if any
param output the index of the output
return the normalizer
minMaxScaleOutput
Apply min-max scaling to a specific output, overriding the global output strategy if any
param output the index of the output
param rangeFrom lower bound of the target range
param rangeTo upper bound of the target range
return the normalizer
getInputStats
Get normalization statistics for a given input.
param input the index of the input
return implementation of NormalizerStats corresponding to the normalization strategy selected
getOutputStats
Get normalization statistics for a given output.
param output the index of the output
return implementation of NormalizerStats corresponding to the normalization strategy selected
fit
Get the map of normalization statistics per input
return map of input indices pointing to NormalizerStats instances
fit
Iterates over a dataset accumulating statistics for normalization
param iterator the iterator to use for collecting statistics
transform
Transform the dataset
param data the dataset to pre process
revert
Undo (revert) the normalization applied by this DataNormalization instance (arrays are modified in-place)
param data MultiDataSet to revert the normalization on
revertFeatures
Undo (revert) the normalization applied by this DataNormalization instance to the entire inputs array
param features The normalized array of inputs
revertFeatures
Undo (revert) the normalization applied by this DataNormalization instance to the entire inputs array
param features The normalized array of inputs
param maskArrays Optional mask arrays belonging to the inputs
revertFeatures
Undo (revert) the normalization applied by this DataNormalization instance to the features of a particular input
param features The normalized array of inputs
param maskArrays Optional mask arrays belonging to the inputs
param input the index of the input to revert normalization on
revertLabels
Undo (revert) the normalization applied by this DataNormalization instance to the entire outputs array
param labels The normalized array of outputs
revertLabels
Undo (revert) the normalization applied by this DataNormalization instance to the entire outputs array
param labels The normalized array of outputs
param maskArrays Optional mask arrays belonging to the outputs
revertLabels
Undo (revert) the normalization applied by this DataNormalization instance to the labels of a particular output
param labels The normalized array of outputs
param maskArrays Optional mask arrays belonging to the outputs
param output the index of the output to revert normalization on
A simple Composite DataSetPreProcessor - allows you to apply multiple DataSetPreProcessors sequentially on the one DataSet, in the order they are passed to the constructor
CompositeDataSetPreProcessor
param preProcessors Preprocessors to apply. They will be applied in this order
Pre processor for MultiDataSet that normalizes feature values (and optionally label values) to have 0 mean and a standard deviation of 1
load
Load means and standard deviations from the file system
param featureFiles source files for features, requires 2 files per input, alternating mean and stddev files
param labelFiles source files for labels, requires 2 files per output, alternating mean and stddev files
save
Save the current means and standard deviations to the file system
param featureFiles target files for features, requires 2 files per input, alternating mean and stddev files
param labelFiles target files for labels, requires 2 files per output, alternating mean and stddev files
deprecated use {- link MultiStandardizeSerializerStrategy} instead
This is a preprocessor specifically for VGG16. It subtracts the mean RGB value, computed on the training set, from each pixel as reported in: https://arxiv.org/pdf/1409.1556.pdf
fit
Fit a dataset (compute the statistics based on this dataset only)
param dataSet the dataset to compute on
fit
Iterates over a dataset accumulating statistics for normalization
param iterator the iterator to use for collecting statistics.
transform
Transform the data
param toPreProcess the dataset to transform
An interface for data normalizers. Data normalizers compute some sort of statistics over a dataset and scale the data in some way.