Sometimes datasets are too large or too abstract in their format to manually analyze and estimate statistics on certain columns or patterns. DataVec comes with some helper utilities for performing a data analysis, and maximums, means, minimums, and other useful metrics.
Using Spark for analysis
If you have loaded your data into Apache Spark, DataVec has a special AnalyzeSpark class which can generate histograms, collect statistics, and return information about the quality of the data. Assuming you have already loaded your data into a Spark RDD, pass the JavaRDD and Schema to the class.
If you are using DataVec in Scala and your data was loaded into a regular RDD class, you can convert it by calling .toJavaRDD() which returns a JavaRDD. If you need to convert it back, call rdd().
The code below demonstrates some of many analyses for a 2D dataset in Spark analysis using the RDD javaRdd and the schema mySchema:
Note that if you have sequence data, there are special methods for that as well:
Analyzing locally
The AnalyzeLocal class works very similarly to its Spark counterpart and has a similar API. Instead of passing an RDD, it accepts a RecordReader which allows it to iterate over the dataset.