Schemas for datasets and transformation.
The unfortunate reality is that data is dirty. When vectorizing a dataset for deep learning, it is rare to find files that have zero errors. A schema is important for maintaining the meaning of the data before using it for something like training a neural network.
Schemas are primarily used for programming transformations. Before you can properly execute a TransformProcess, you will need to pass the schema of the data being transformed.
An example of a schema for merchant records may look like:
If you have two different datasets that you want to merge, DataVec provides a Join class with different join strategies such as Inner or RightOuter.
Once you've defined your join and loaded the data into DataVec, you must use an Executor to complete the join.
DataVec comes with a few Schema classes and helper utilities for 2D and sequence types of data.
Join class: used to specify a join (like an SQL join)
setSchemas
Type of join:
Inner: Return examples where the join column values occur in both the left and right data.
LeftOuter: Return all examples from the left data, whether there is a matching right value or not. (If not: right values will have NullWritable instead.)
RightOuter: Return all examples from the right data, whether there is a matching left value or not. (If not: left values will have NullWritable instead.)
FullOuter: Return all examples from both left and right, whether there is a matching value from the other side or not. (If not: other values will have NullWritable instead.)
setKeyColumns
Deprecated. Use setJoinColumns(String...) instead.
setKeyColumnsLeft
Deprecated. Use setJoinColumnsLeft(String...) instead.
setKeyColumnsRight
Deprecated. Use setJoinColumnsRight(String...) instead.
setJoinColumnsLeft
Specify the names of the columns to join on, for the left data. The idea: join examples where firstDataValues(joinColumnNamesLeft[i]) == secondDataValues(joinColumnNamesRight[i]) for all i.
param joinColumnNames Names of the columns to join on (for left data)
setJoinColumnsRight
Specify the names of the columns to join on, for the right data. The idea: join examples where firstDataValues(joinColumnNamesLeft[i]) == secondDataValues(joinColumnNamesRight[i]) for all i.
param joinColumnNames Names of the columns to join on (for right data)
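The four join types described above can be illustrated with a small plain-Java sketch. This is not DataVec's implementation and uses no DataVec classes: the class `JoinSketch`, its `JoinType` enum, and the `join` method are hypothetical names chosen for illustration. It assumes one join column, one value per side, and at most one row per key, and it uses a Java `null` where DataVec would place a NullWritable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of join-type semantics (hypothetical, not DataVec code). */
final class JoinSketch {
    enum JoinType { INNER, LEFT_OUTER, RIGHT_OUTER, FULL_OUTER }

    /**
     * Joins two tables keyed by the join column value.
     * Each output row is [key, leftValue, rightValue]; null marks a missing side.
     */
    static List<List<String>> join(JoinType type,
                                   Map<String, String> left,
                                   Map<String, String> right) {
        boolean keepUnmatchedLeft = type == JoinType.LEFT_OUTER || type == JoinType.FULL_OUTER;
        boolean keepUnmatchedRight = type == JoinType.RIGHT_OUTER || type == JoinType.FULL_OUTER;

        List<List<String>> out = new ArrayList<>();
        // Pass 1: every left row that has a match, plus unmatched left rows if requested.
        for (Map.Entry<String, String> l : left.entrySet()) {
            boolean matched = right.containsKey(l.getKey());
            if (matched || keepUnmatchedLeft) {
                out.add(Arrays.asList(l.getKey(), l.getValue(), right.get(l.getKey())));
            }
        }
        // Pass 2: right rows with no left match, if requested.
        if (keepUnmatchedRight) {
            for (Map.Entry<String, String> r : right.entrySet()) {
                if (!left.containsKey(r.getKey())) {
                    out.add(Arrays.asList(r.getKey(), null, r.getValue()));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> customers = new LinkedHashMap<>();
        customers.put("c1", "Alice");
        customers.put("c2", "Bob");
        Map<String, String> countries = new LinkedHashMap<>();
        countries.put("c2", "USA");
        countries.put("c3", "CAN");
        for (JoinType t : JoinType.values()) {
            System.out.println(t + " -> " + join(t, customers, countries));
        }
    }
}
```

With the sample data in `main`, only key `c2` appears on both sides, so the inner join keeps one row, while each outer variant additionally keeps the unmatched rows from the side or sides it names.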
The schema-inference helper, when passed a CSV file that contains a header and a single row of sample data, returns a Schema.
Only Double, Integer, Long, and String types are supported. If no number type can be inferred, the field type will become the default type. Note that if your column is actually categorical but is represented as a number, you will need to do additional transformation. Also, if your sample field is blank/null, it will also become the default type.
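The inference rules above can be sketched in plain Java. This is an illustration only, not DataVec's code: the class `TypeInferenceSketch`, the `FieldType` enum, and the `infer` method are hypothetical, and the order in which number types are tried (Integer, then Long, then Double) is an assumption. The default type is passed in as a parameter because the text above does not say what it is.

```java
/** Plain-Java illustration of inferring a field type from one sample value (hypothetical). */
final class TypeInferenceSketch {
    enum FieldType { INTEGER, LONG, DOUBLE, STRING }

    static FieldType infer(String sample, FieldType defaultType) {
        // Blank or null samples carry no type information: fall back to the default.
        if (sample == null || sample.trim().isEmpty()) {
            return defaultType;
        }
        String s = sample.trim();
        // Try the narrowest number type first (assumed order).
        try {
            Integer.parseInt(s);
            return FieldType.INTEGER;
        } catch (NumberFormatException notAnInt) {
            // fall through to the next candidate type
        }
        try {
            Long.parseLong(s);
            return FieldType.LONG;
        } catch (NumberFormatException notALong) {
            // fall through to the next candidate type
        }
        try {
            Double.parseDouble(s);
            return FieldType.DOUBLE;
        } catch (NumberFormatException notADouble) {
            // no number type fits: use the default type
        }
        return defaultType;
    }

    public static void main(String[] args) {
        String[] sampleRow = {"42", "3000000000", "3.14", "USA", ""};
        for (String field : sampleRow) {
            System.out.println("'" + field + "' -> " + infer(field, FieldType.STRING));
        }
    }
}
```

Note that a categorical code stored as a number, such as `"3"`, is inferred as a number type here, which is why the additional transformation mentioned above is needed for such columns.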
A Schema defines the layout of tabular data. Specifically, it contains names for each column, as well as details of types (Integer, String, Long, Double, etc). Type information for each column may optionally include restrictions on the allowable values for each column.
sameTypes
Create a schema based on the given metadata
param columnMetaData the metadata to create the schema from
newSchema
Compute the difference in ColumnMetaData between this schema and the passed-in schema. This is useful during the org.datavec.api.transform.TransformProcess to identify what a process will do to a given Schema.
param schema the schema to compute the difference for
return the metadata that is different (in order) between this schema and the other schema
numColumns
Returns the number of columns or fields for this schema
return the number of columns or fields for this schema
getName
Returns the name of a given column at the specified index
param column the index of the column to get the name for
return the name of the column at the specified index
getType
Returns the ColumnType for the column at the specified index
param column the index of the column to get the type for
return the type of the column at the specified index
getType
Returns the ColumnType for the column with the specified name
param columnName the name of the column to get the type for
return the type of the column with the specified name
getMetaData
Returns the ColumnMetaData at the specified column index
param column the index to get the metadata for
return the metadata at the specified index
getMetaData
Retrieve the metadata for the given column name
param column the name of the column to get metadata for
return the metadata for the given column name
getColumnNames
Return a copy of the list of column names
return a copy of the list of column names for this schema
getIndexOfColumns
Return the indices of the columns, given their names
param columnNames Names of the columns to get indices for
return Column indexes
toJson
Serialize this schema to json
return a json representation of this schema
toYaml
Serialize this schema to yaml
return the yaml representation of this schema
fromJson
Create a schema from a given json string
param json the json to create the schema from
return the created schema based on the json
fromYaml
Create a schema from the given yaml string
param yaml the yaml to create the schema from
return the created schema based on the yaml
addColumnFloat
Add a Float column with no restrictions on the allowable values, except for no NaN/infinite values allowed
param name Name of the column
addColumnFloat
Add a Float column with the specified restrictions (and no NaN/Infinite values allowed)
param name Name of the column
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
addColumnFloat
Add a Float column with the specified restrictions
param name Name of the column
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
param allowNaN If false: don’t allow NaN values. If true: allow.
param allowInfinite If false: don’t allow infinite values. If true: allow
addColumnsFloat
Add multiple Float columns with no restrictions on the allowable values of the columns (other than no NaN/Infinite)
param columnNames Names of the columns to add
addColumnsFloat
A convenience method for adding multiple Float columns. For example, to add columns "myFloatCol_0", "myFloatCol_1", "myFloatCol_2", use addColumnsFloat("myFloatCol_%d",0,2)
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
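The parameter notes above say the pattern is applied via String.format over an inclusive index range. A minimal plain-Java sketch of that expansion follows. The class `ColumnPatternSketch` and its `expand` method are hypothetical names for illustration and are not part of DataVec.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of expanding a numbered column-name pattern (hypothetical). */
final class ColumnPatternSketch {
    /** Returns one column name per index in [minIdxInclusive, maxIdxInclusive]. */
    static List<String> expand(String pattern, int minIdxInclusive, int maxIdxInclusive) {
        List<String> names = new ArrayList<>();
        for (int i = minIdxInclusive; i <= maxIdxInclusive; i++) {
            // "%d" in the pattern is replaced with the column number
            names.add(String.format(pattern, i));
        }
        return names;
    }

    public static void main(String[] args) {
        // prints [myFloatCol_0, myFloatCol_1, myFloatCol_2]
        System.out.println(expand("myFloatCol_%d", 0, 2));
    }
}
```

Because both bounds are inclusive, a range of 0 to 2 yields three columns, not two.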
addColumnsFloat
A convenience method for adding multiple Float columns, with additional restrictions that apply to all columns. For example, to add columns "myFloatCol_0", "myFloatCol_1", "myFloatCol_2", use addColumnsFloat("myFloatCol_%d",0,2,null,null,false,false)
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
param allowNaN If false: don’t allow NaN values. If true: allow.
param allowInfinite If false: don’t allow infinite values. If true: allow
addColumnDouble
Add a Double column with no restrictions on the allowable values, except for no NaN/infinite values allowed
param name Name of the column
addColumnDouble
Add a Double column with the specified restrictions (and no NaN/Infinite values allowed)
param name Name of the column
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
addColumnDouble
Add a Double column with the specified restrictions
param name Name of the column
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
param allowNaN If false: don’t allow NaN values. If true: allow.
param allowInfinite If false: don’t allow infinite values. If true: allow
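How the four restrictions combine can be sketched as a validity check in plain Java. This is one reading of the parameter descriptions above, not DataVec's implementation: the class `DoubleRestrictionSketch` and the method `isValid` are hypothetical, and it is an assumption that an allowed infinite value is still subject to the min/max bounds.

```java
/** Plain-Java illustration of Double column restrictions (hypothetical, not DataVec code). */
final class DoubleRestrictionSketch {
    static boolean isValid(double value,
                           Double minAllowedValue,
                           Double maxAllowedValue,
                           boolean allowNaN,
                           boolean allowInfinite) {
        // NaN cannot be compared against bounds, so it is decided by the flag alone.
        if (Double.isNaN(value)) {
            return allowNaN;
        }
        if (Double.isInfinite(value) && !allowInfinite) {
            return false;
        }
        // Bounds are inclusive; a null bound means "no restriction".
        if (minAllowedValue != null && value < minAllowedValue) {
            return false;
        }
        if (maxAllowedValue != null && value > maxAllowedValue) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // A non-negative amount with no upper limit, no NaN, no infinite values.
        System.out.println(isValid(12.5, 0.0, null, false, false));
        System.out.println(isValid(-0.01, 0.0, null, false, false));
        System.out.println(isValid(Double.NaN, 0.0, null, false, false));
    }
}
```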
addColumnsDouble
Add multiple Double columns with no restrictions on the allowable values of the columns (other than no NaN/Infinite)
param columnNames Names of the columns to add
addColumnsDouble
A convenience method for adding multiple Double columns. For example, to add columns "myDoubleCol_0", "myDoubleCol_1", "myDoubleCol_2", use addColumnsDouble("myDoubleCol_%d",0,2)
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
addColumnsDouble
A convenience method for adding multiple Double columns, with additional restrictions that apply to all columns. For example, to add columns "myDoubleCol_0", "myDoubleCol_1", "myDoubleCol_2", use addColumnsDouble("myDoubleCol_%d",0,2,null,null,false,false)
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
param allowNaN If false: don’t allow NaN values. If true: allow.
param allowInfinite If false: don’t allow infinite values. If true: allow
addColumnInteger
Add an Integer column with no restrictions on the allowable values
param name Name of the column
addColumnInteger
Add an Integer column with the specified min/max allowable values
param name Name of the column
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
addColumnsInteger
Add multiple Integer columns with no restrictions on the min/max allowable values
param names Names of the integer columns to add
addColumnsInteger
A convenience method for adding multiple Integer columns. For example, to add columns "myIntegerCol_0", "myIntegerCol_1", "myIntegerCol_2", use addColumnsInteger("myIntegerCol_%d",0,2)
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
addColumnsInteger
A convenience method for adding multiple Integer columns, with min/max restrictions that apply to all columns. For example, to add unrestricted columns "myIntegerCol_0", "myIntegerCol_1", "myIntegerCol_2", use addColumnsInteger("myIntegerCol_%d",0,2,null,null)
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
addColumnCategorical
Add a Categorical column, with the specified state names
param name Name of the column
param stateNames Names of the allowable states for this categorical column
addColumnCategorical
Add a Categorical column, with the specified state names
param name Name of the column
param stateNames Names of the allowable states for this categorical column
addColumnLong
Add a Long column, with no restrictions on the min/max values
param name Name of the column
addColumnLong
Add a Long column with the specified min/max allowable values
param name Name of the column
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
addColumnsLong
Add multiple Long columns, with no restrictions on the allowable values
param names Names of the Long columns to add
addColumnsLong
A convenience method for adding multiple Long columns. For example, to add columns "myLongCol_0", "myLongCol_1", "myLongCol_2", use addColumnsLong("myLongCol_%d",0,2)
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
addColumnsLong
A convenience method for adding multiple Long columns, with min/max restrictions that apply to all columns. For example, to add unrestricted columns "myLongCol_0", "myLongCol_1", "myLongCol_2", use addColumnsLong("myLongCol_%d",0,2,null,null)
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
param minAllowedValue Minimum allowed value (inclusive). If null: no restriction
param maxAllowedValue Maximum allowed value (inclusive). If null: no restriction
addColumn
Add a column
param metaData metadata for this column
addColumnString
Add a String column with no restrictions on the allowable values.
param name Name of the column
addColumnsString
Add multiple String columns with no restrictions on the allowable values
param columnNames Names of the String columns to add
addColumnString
Add a String column with the specified restrictions
param name Name of the column
param regex Regex that the String must match in order to be considered valid. If null: no regex restriction
param minAllowableLength Minimum allowable length for the String to be considered valid
param maxAllowableLength Maximum allowable length for the String to be considered valid
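The String restrictions above can be sketched as a validity check in plain Java. This is an illustration, not DataVec's code: the class `StringRestrictionSketch` and the method `isValid` are hypothetical. It assumes the regex must match the whole string (as `String.matches` does), that length bounds are inclusive, and that a null bound means no restriction.

```java
/** Plain-Java illustration of String column restrictions (hypothetical, not DataVec code). */
final class StringRestrictionSketch {
    static boolean isValid(String value,
                           String regex,
                           Integer minAllowableLength,
                           Integer maxAllowableLength) {
        if (value == null) {
            return false;
        }
        if (minAllowableLength != null && value.length() < minAllowableLength) {
            return false;
        }
        if (maxAllowableLength != null && value.length() > maxAllowableLength) {
            return false;
        }
        // A null regex means no pattern restriction; otherwise the whole string must match.
        return regex == null || value.matches(regex);
    }

    public static void main(String[] args) {
        // Hypothetical rule: IDs look like "M" followed by digits, 2 to 10 characters long.
        System.out.println(isValid("M1234", "M\\d+", 2, 10));
        System.out.println(isValid("X1234", "M\\d+", 2, 10));
    }
}
```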
addColumnsString
A convenience method for adding multiple numbered String columns. For example, to add columns "myStringCol_0", "myStringCol_1", "myStringCol_2", use addColumnsString("myStringCol_%d",0,2)
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
addColumnsString
A convenience method for adding multiple numbered String columns, with restrictions that apply to all columns. For example, to add unrestricted columns "myStringCol_0", "myStringCol_1", "myStringCol_2", use addColumnsString("myStringCol_%d",0,2,null,null,null)
param pattern Pattern to use (via String.format). “%d” is replaced with column numbers
param minIdxInclusive Minimum column index to use (inclusive)
param maxIdxInclusive Maximum column index to use (inclusive)
param regex Regex that the String must match in order to be considered valid. If null: no regex restriction
param minAllowedLength Minimum allowed length of strings (inclusive). If null: no restriction
param maxAllowedLength Maximum allowed length of strings (inclusive). If null: no restriction
addColumnTime
Add a Time column with no restrictions on the min/max allowable times. NOTE: Time columns are represented by LONG (epoch millisecond) values. For time values in human-readable formats, use String columns + StringToTimeTransform
param columnName Name of the column
param timeZone Time zone of the time column
addColumnTime
Add a Time column with no restrictions on the min/max allowable times. NOTE: Time columns are represented by LONG (epoch millisecond) values. For time values in human-readable formats, use String columns + StringToTimeTransform
param columnName Name of the column
param timeZone Time zone of the time column
addColumnTime
Add a Time column with the specified restrictions. NOTE: Time columns are represented by LONG (epoch millisecond) values. For time values in human-readable formats, use String columns + StringToTimeTransform
param columnName Name of the column
param timeZone Time zone of the time column
param minValidValue Minimum allowable time (in milliseconds). May be null.
param maxValidValue Maximum allowable time (in milliseconds). May be null.
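Since Time columns hold epoch-millisecond LONG values, it can help to see what a human-readable timestamp becomes once a time zone is applied. The plain-Java sketch below uses only `java.time` from the JDK. It is not the StringToTimeTransform mentioned above and does not use DataVec; the class `EpochMillisSketch` and the method `toEpochMillis` are hypothetical names for illustration.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

/** Plain-Java illustration of turning a timestamp string into epoch milliseconds (hypothetical). */
final class EpochMillisSketch {
    static long toEpochMillis(String text, String pattern, String zoneId) {
        // Parse the wall-clock time, then interpret it in the column's time zone.
        LocalDateTime local = LocalDateTime.parse(text, DateTimeFormatter.ofPattern(pattern));
        return local.atZone(ZoneId.of(zoneId)).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        String pattern = "yyyy-MM-dd HH:mm:ss";
        // prints 1451606400000
        System.out.println(toEpochMillis("2016-01-01 00:00:00", pattern, "UTC"));
        // prints 1451624400000 (the same wall-clock time is five hours later in UTC)
        System.out.println(toEpochMillis("2016-01-01 00:00:00", pattern, "America/New_York"));
    }
}
```

The same string maps to different LONG values in different zones, which is why the time zone is part of the column definition.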
addColumnNDArray
Add an NDArray column
param columnName Name of the column
param shape shape of the NDArray column. Use -1 in entries to specify as “variable length” in that dimension
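The role of -1 as a "variable length" marker can be sketched as a shape-compatibility check in plain Java. This is an illustration of the idea only, not DataVec's validation code: the class `ShapeSketch` and the method `matches` are hypothetical, and it is an assumption that the number of dimensions must always agree.

```java
/** Plain-Java illustration of matching an array shape against a declared shape (hypothetical). */
final class ShapeSketch {
    static boolean matches(long[] declaredShape, long[] actualShape) {
        // Assumed: rank must agree even when some dimensions are variable.
        if (declaredShape.length != actualShape.length) {
            return false;
        }
        for (int i = 0; i < declaredShape.length; i++) {
            // -1 accepts any size in that dimension; other entries must match exactly.
            if (declaredShape[i] != -1 && declaredShape[i] != actualShape[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long[] declared = {-1, 10}; // any number of rows, exactly 10 columns
        System.out.println(matches(declared, new long[]{32, 10}));
        System.out.println(matches(declared, new long[]{32, 8}));
    }
}
```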
build
Create the Schema
inferMultiple
Infers a schema based on the record. The column names are based on indexing.
param record the record to infer from
return the inferred schema
infer
Infers a schema based on the record. The column names are based on indexing.
param record the record to infer from
return the inferred schema
inferSequenceMulti
Infers a sequence schema based on the record
param record the record to infer the schema based on
return the inferred sequence schema
inferSequence
Infers a sequence schema based on the record
param record the record to infer the schema based on
return the inferred sequence schema