AnalyzeSpark class, which can generate histograms, collect statistics, and return information about the quality of the data. Assuming you have already loaded your data into a Spark RDD, pass the RDD and its Schema to the class.
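To make the three kinds of output concrete (statistics, a histogram, and data quality counts), here is a minimal library-free sketch for a single numeric column. It is not the DataVec API: the class `ColumnSummarySketch`, its `summarize` method, and the `Summary` holder are hypothetical names chosen for illustration, and the equal-width bucketing is an assumption about how a histogram might be built.

```java
import java.util.Arrays;
import java.util.List;

/** Library-free sketch of a per-column numeric summary. Hypothetical; not the DataVec API. */
public class ColumnSummarySketch {

    /** Basic statistics, equal-width histogram counts, and a tally of unusable cells. */
    public static final class Summary {
        public final long validCount;
        public final long invalidCount;
        public final double min;
        public final double max;
        public final double mean;
        public final long[] histogram;

        Summary(long validCount, long invalidCount, double min, double max,
                double mean, long[] histogram) {
            this.validCount = validCount;
            this.invalidCount = invalidCount;
            this.min = min;
            this.max = max;
            this.mean = mean;
            this.histogram = histogram;
        }
    }

    /** Summarizes one column given as raw text cells, using the requested number of buckets. */
    public static Summary summarize(List<String> cells, int buckets) {
        if (buckets < 1) {
            throw new IllegalArgumentException("buckets must be >= 1");
        }
        double[] values = new double[cells.size()];
        int n = 0;
        long invalid = 0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0.0;

        // Pass 1: parse each cell; missing, non-numeric, or non-finite cells count as invalid.
        for (String cell : cells) {
            Double parsed = parseOrNull(cell);
            if (parsed == null) {
                invalid++;
                continue;
            }
            double v = parsed;
            values[n++] = v;
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += v;
        }

        long[] histogram = new long[buckets];
        if (n == 0) {
            return new Summary(0, invalid, Double.NaN, Double.NaN, Double.NaN, histogram);
        }

        // Pass 2: place each valid value in an equal-width bucket over [min, max].
        double width = (max - min) / buckets;
        for (int i = 0; i < n; i++) {
            int idx = (width == 0.0) ? 0 : (int) ((values[i] - min) / width);
            histogram[Math.min(idx, buckets - 1)]++; // clamp so max lands in the last bucket
        }
        return new Summary(n, invalid, min, max, sum / n, histogram);
    }

    /** Returns the parsed value, or null when the cell cannot be used as a finite number. */
    private static Double parseOrNull(String cell) {
        if (cell == null) {
            return null;
        }
        try {
            double v = Double.parseDouble(cell.trim());
            if (Double.isNaN(v) || Double.isInfinite(v)) {
                return null;
            }
            return v;
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Summary s = summarize(Arrays.asList("1", "2", "3", "oops", "10"), 3);
        System.out.println("valid=" + s.validCount + " invalid=" + s.invalidCount
                + " min=" + s.min + " max=" + s.max + " mean=" + s.mean
                + " histogram=" + Arrays.toString(s.histogram));
    }
}
```

The sketch keeps the quality count separate from the statistics, so a malformed cell is reported rather than silently skewing the mean. A distributed analysis would compute the same quantities per partition and merge them.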
RDD class, you can convert it by calling .toJavaRDD(), which returns a JavaRDD. If you need to convert it back, call
javaRdd and the schema
The AnalyzeLocal class works much like its Spark counterpart and has a similar API. Instead of an RDD, it accepts a RecordReader, which allows it to iterate over the dataset.
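The reason an iterator-style reader is enough for local analysis is that the usual column statistics can be accumulated in a single pass, one record at a time. The following is a rough sketch of that idea in plain Java, not the DataVec implementation: `StreamingStatsSketch` and `RunningStats` are hypothetical names, a plain `Iterator<double[]>` stands in for the reader, and Welford's online update is used here as one reasonable way to track mean and standard deviation.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Single-pass column statistics over an iterator of records. Hypothetical; not the DataVec API. */
public class StreamingStatsSketch {

    /** Running count, min, max, mean, and variance for one column (Welford's online update). */
    public static final class RunningStats {
        private long count;
        private double mean;
        private double m2; // running sum of squared deviations from the mean
        private double min = Double.POSITIVE_INFINITY;
        private double max = Double.NEGATIVE_INFINITY;

        public void add(double v) {
            count++;
            double delta = v - mean;
            mean += delta / count;
            m2 += delta * (v - mean);
            min = Math.min(min, v);
            max = Math.max(max, v);
        }

        public long getCount() { return count; }
        public double getMean() { return mean; }
        public double getMin() { return min; }
        public double getMax() { return max; }

        /** Sample standard deviation; defined as 0 when fewer than two values were seen. */
        public double getSampleStdev() {
            return count > 1 ? Math.sqrt(m2 / (count - 1)) : 0.0;
        }
    }

    /** Consumes the iterator once and returns one RunningStats per column. */
    public static RunningStats[] analyze(Iterator<double[]> records, int numColumns) {
        RunningStats[] stats = new RunningStats[numColumns];
        for (int i = 0; i < numColumns; i++) {
            stats[i] = new RunningStats();
        }
        while (records.hasNext()) {
            double[] record = records.next();
            for (int i = 0; i < numColumns; i++) {
                stats[i].add(record[i]);
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        List<double[]> data = Arrays.asList(
                new double[] {1, 10},
                new double[] {2, 20},
                new double[] {3, 30});
        RunningStats[] stats = analyze(data.iterator(), 2);
        for (int i = 0; i < stats.length; i++) {
            System.out.println("column " + i + ": mean=" + stats[i].getMean()
                    + " stdev=" + stats[i].getSampleStdev()
                    + " min=" + stats[i].getMin() + " max=" + stats[i].getMax());
        }
    }
}
```

Because nothing but the running totals is held in memory, this style of analysis works on datasets that do not fit in memory, as long as the reader can stream them.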