Gather statistics on datasets.
Sometimes datasets are too large or too abstract in their format to manually analyze and estimate statistics on certain columns or patterns. DataVec comes with helper utilities for performing data analysis and computing maximums, means, minimums, and other useful metrics.
If you have loaded your data into Apache Spark, DataVec has a special AnalyzeSpark class which can generate histograms, collect statistics, and return information about the quality of the data. Assuming you have already loaded your data into a Spark RDD, pass the JavaRDD and Schema to the class.
If you are using DataVec in Scala and your data was loaded into a regular RDD class, you can convert it by calling .toJavaRDD(), which returns a JavaRDD. If you need to convert it back, call rdd().
The code below demonstrates some of the many analyses available for a 2D dataset in Spark, using the RDD javaRdd and the schema mySchema:
Note that if you have sequence data, there are special methods for that as well:
The AnalyzeLocal class works very similarly to its Spark counterpart and has a similar API. Instead of passing an RDD, it accepts a RecordReader, which allows it to iterate over the dataset.
Analyse the specified data - returns a DataAnalysis object with summary information about each column
param schema Schema for data
param rr Data to analyze
return DataAnalysis for data
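As a rough illustration of what "summary information about each column" can mean, the plain-Java sketch below makes a single pass over an iterator of rows and tracks the count, minimum, maximum, and mean of one numeric column. It is not DataVec code and uses neither RecordReader nor DataAnalysis; the class ColumnSummarySketch and its methods are hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, not part of DataVec: a one-pass summary of one numeric column.
class ColumnSummarySketch {
    long count = 0;
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;
    double sum = 0.0;

    // Consume rows one at a time, the way a reader-style iterator would supply them.
    static ColumnSummarySketch summarize(Iterator<List<Double>> rows, int columnIndex) {
        ColumnSummarySketch s = new ColumnSummarySketch();
        while (rows.hasNext()) {
            double v = rows.next().get(columnIndex);
            s.count++;
            s.min = Math.min(s.min, v);
            s.max = Math.max(s.max, v);
            s.sum += v;
        }
        return s;
    }

    // Mean of the values seen so far; NaN when no rows were read.
    double mean() {
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        List<List<Double>> rows = Arrays.asList(
                Arrays.asList(1.0, 10.0),
                Arrays.asList(2.0, 30.0),
                Arrays.asList(3.0, 20.0));
        ColumnSummarySketch s = summarize(rows.iterator(), 1);
        System.out.println("count=" + s.count + " min=" + s.min
                + " max=" + s.max + " mean=" + s.mean());
    }
}
```

Because the summary is updated row by row, the whole dataset never has to be held in memory, which is the main appeal of an iterator-based analysis for local data.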
Analyze the data quality of sequence data - provides a report on missing values, values that don’t comply with schema, etc
param schema Schema for data
param data Data to analyze
return DataQualityAnalysis object
Analyze the data quality of data - provides a report on missing values, values that don’t comply with schema, etc
param schema Schema for data
param data Data to analyze
return DataQualityAnalysis object
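To make the idea of a quality report concrete, here is a minimal plain-Java sketch that tallies missing, invalid, and valid entries for a single column that is assumed to hold integers. It does not use DataVec, Schema, or DataQualityAnalysis; the class IntegerColumnQualitySketch and the rule "blank means missing, non-integer means invalid" are assumptions made for this example only.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not DataVec's DataQualityAnalysis: tallies one column
// that an (assumed) schema declares to be an integer column.
class IntegerColumnQualitySketch {
    int valid = 0;
    int missing = 0;
    int invalid = 0;

    static IntegerColumnQualitySketch check(List<String> columnValues) {
        IntegerColumnQualitySketch q = new IntegerColumnQualitySketch();
        for (String raw : columnValues) {
            if (raw == null || raw.trim().isEmpty()) {
                q.missing++;   // nothing recorded for this row
                continue;
            }
            try {
                Integer.parseInt(raw.trim());
                q.valid++;
            } catch (NumberFormatException e) {
                q.invalid++;   // present, but not an integer
            }
        }
        return q;
    }

    public static void main(String[] args) {
        IntegerColumnQualitySketch q =
                check(Arrays.asList("42", "", "7", "n/a", null, "3.5"));
        System.out.println("valid=" + q.valid + " missing=" + q.missing
                + " invalid=" + q.invalid);
    }
}
```

Keeping "missing" and "invalid" as separate counters mirrors the distinction the description draws between missing values and values that do not comply with the schema.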
AnalyzeSpark: static methods for analyzing data in Spark RDDs
param schema
param data
param maxHistogramBuckets
return
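The parameter name maxHistogramBuckets suggests an upper bound on the number of buckets in each column histogram. As a general illustration of equal-width bucketing (not necessarily how DataVec builds its histograms), the sketch below splits the observed range of a column into a fixed number of bins and counts the values in each. HistogramSketch and bucketCounts are hypothetical names.

```java
// Hypothetical sketch of equal-width histogram bucketing; not DataVec's implementation.
class HistogramSketch {
    // Count how many values fall into each of `buckets` equal-width bins spanning [min, max].
    static long[] bucketCounts(double[] values, int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        long[] counts = new long[buckets];
        if (values.length == 0) {
            return counts;
        }

        double min = values[0];
        double max = values[0];
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double width = (max - min) / buckets;
        for (double v : values) {
            // If all values are equal the width is 0, so everything goes in the first bin.
            int idx = width == 0 ? 0 : (int) ((v - min) / width);
            // The maximum value would land at index `buckets`; clamp it into the last bin.
            counts[Math.min(idx, buckets - 1)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] values = {0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
        System.out.println(java.util.Arrays.toString(bucketCounts(values, 4)));
    }
}
```

A smaller bucket count gives a coarser but cheaper picture of the distribution; a larger one shows more detail at the cost of more counters per column.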
Analyse the specified data - returns a DataAnalysis object with summary information about each column
param schema Schema for data
param data Data to analyze
return DataAnalysis for data
Randomly sample values from a single column
param count Number of values to sample
param columnName Name of the column to sample from
param schema Schema
param data Data to sample from
return A list of random samples
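The sampling described above can be pictured with a small in-memory sketch: collect one column, shuffle it, and keep the first count entries. This is plain Java rather than DataVec, which operates on a Spark RDD; the class ColumnSamplerSketch, the positional columnIndex, the seed argument, and the choice to sample without replacement are all assumptions of this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical in-memory sketch; the DataVec method works on a Spark RDD instead.
class ColumnSamplerSketch {
    // Return up to `count` values drawn without replacement from the given column.
    static List<String> sampleFromColumn(int count, int columnIndex,
                                         List<List<String>> rows, long seed) {
        List<String> column = new ArrayList<>();
        for (List<String> row : rows) {
            column.add(row.get(columnIndex));
        }
        // Seeded so that repeated runs give the same sample.
        Collections.shuffle(column, new Random(seed));
        return new ArrayList<>(column.subList(0, Math.min(count, column.size())));
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("a", "1"),
                Arrays.asList("b", "2"),
                Arrays.asList("c", "3"),
                Arrays.asList("d", "4"));
        System.out.println(sampleFromColumn(2, 0, rows, 123L));
    }
}
```

Looking at a handful of sampled values is often the quickest way to check that a column really contains what its schema entry claims.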
Analyze the data quality of data - provides a report on missing values, values that don’t comply with schema, etc
param schema Schema for data
param data Data to analyze
return DataQualityAnalysis object
Randomly sample a set of invalid values from a specified column. Values are considered invalid according to the Schema / ColumnMetaData
param numToSample Maximum number of invalid values to sample
param columnName Name of the column from which to sample invalid values
param schema Data schema
param data Data
return List of invalid examples
Get the maximum value for the specified column
param allData All data
param columnName Name of the column to get the maximum value for
param schema Schema of the data
return Maximum value for the column