Analysis

Gather statistics on datasets.

Analysis of data

Sometimes datasets are too large, or too abstract in their format, to manually analyze and estimate statistics on certain columns or patterns. DataVec comes with helper utilities for performing data analysis and computing maximums, means, minimums, and other useful metrics.

Using Spark for analysis

If you have loaded your data into Apache Spark, DataVec has a special AnalyzeSpark class which can generate histograms, collect statistics, and return information about the quality of the data. Assuming you have already loaded your data into a Spark RDD, pass the JavaRDD and Schema to the class.

If you are using DataVec in Scala and your data was loaded into a regular RDD class, you can convert it by calling .toJavaRDD(), which returns a JavaRDD. If you need to convert it back, call rdd().
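The examples below also assume you have a Schema (mySchema) describing each column of your data. A minimal sketch of building one with Schema.Builder follows; the column names and types are illustrative, so adjust them to your data:

import org.datavec.api.transform.schema.Schema;

// Hypothetical schema; replace the column names and types with your own
Schema mySchema = new Schema.Builder()
        .addColumnInteger("myColumn")
        .addColumnDouble("myDoubleColumn")
        .addColumnString("myStringColumn")
        .build();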

The code below demonstrates some of the many analyses available for a 2D dataset in Spark, using the JavaRDD javaRdd and the schema mySchema:

import org.datavec.spark.transform.AnalyzeSpark;
import org.datavec.api.writable.Writable;
import org.datavec.api.transform.analysis.*;

import java.util.List;

int maxHistogramBuckets = 10;

// Summary statistics (min/max/mean, histograms, etc.) for each column
DataAnalysis analysis = AnalyzeSpark.analyze(mySchema, javaRdd, maxHistogramBuckets);

// Data quality report: missing values, values that don't comply with the schema, etc.
DataQualityAnalysis qualityAnalysis = AnalyzeSpark.analyzeQuality(mySchema, javaRdd);

// Maximum value of a single column
Writable max = AnalyzeSpark.max(javaRdd, "myColumn", mySchema);

// Randomly sample values from a single column
int numSamples = 5;
List<Writable> sample = AnalyzeSpark.sampleFromColumn(numSamples, "myColumn", mySchema, javaRdd);
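The returned analysis objects have human-readable toString() output, so the quickest way to inspect the results is to print them. DataVec also includes an HtmlAnalysis utility that renders a DataAnalysis to a standalone HTML file; a minimal sketch, where the output file name is illustrative:

import org.datavec.api.transform.ui.HtmlAnalysis;
import java.io.File;

// Print per-column summaries to the console
System.out.println(analysis);
System.out.println(qualityAnalysis);

// Render the analysis, including histograms, to an HTML report
HtmlAnalysis.createHtmlAnalysisFile(analysis, new File("analysis.html"));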

Note that if you have sequence data, there are special methods for that as well:

// Sequence analyses use the sequence schema (seqSchema) and a JavaRDD of sequences (sequenceRdd)
SequenceDataAnalysis seqAnalysis = AnalyzeSpark.analyzeSequence(seqSchema, sequenceRdd);

List<Writable> uniqueSequence = AnalyzeSpark.getUniqueSequence("myColumn", seqSchema, sequenceRdd);

Analyzing locally

The AnalyzeLocal class works very similarly to its Spark counterpart and has a similar API. Instead of taking an RDD, it accepts a RecordReader, which allows it to iterate over the dataset.

import org.datavec.local.transforms.AnalyzeLocal;

int maxHistogramBuckets = 10;
DataAnalysis analysis = AnalyzeLocal.analyze(mySchema, csvRecordReader, maxHistogramBuckets);
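Here, csvRecordReader can be any initialized RecordReader. For example, reading a local CSV file might look like the following sketch (the file name is illustrative, and initialize throws checked exceptions you will need to handle):

import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import java.io.File;

// Skip 0 header lines and use ',' as the delimiter
RecordReader csvRecordReader = new CSVRecordReader(0, ',');
csvRecordReader.initialize(new FileSplit(new File("myData.csv")));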

Utilities

AnalyzeLocal

Analyze the specified data - returns a DataAnalysis object with summary information about each column

analyze

public static DataAnalysis analyze(Schema schema, RecordReader rr, int maxHistogramBuckets)

Analyze the specified data - returns a DataAnalysis object with summary information about each column

  • param schema Schema for data

  • param rr Data to analyze

  • param maxHistogramBuckets Maximum number of histogram buckets

  • return DataAnalysis for data

analyzeQualitySequence

public static DataQualityAnalysis analyzeQualitySequence(Schema schema, SequenceRecordReader data)

Analyze the data quality of sequence data - provides a report on missing values, values that don’t comply with schema, etc

  • param schema Schema for data

  • param data Data to analyze

  • return DataQualityAnalysis object

analyzeQuality

public static DataQualityAnalysis analyzeQuality(final Schema schema, final RecordReader data)

Analyze the data quality of data - provides a report on missing values, values that don’t comply with schema, etc

  • param schema Schema for data

  • param data Data to analyze

  • return DataQualityAnalysis object
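For example, a quality check with the record reader from the earlier example, printing the per-column report, might look like this minimal sketch:

// Reports missing values and schema violations for each column
DataQualityAnalysis quality = AnalyzeLocal.analyzeQuality(mySchema, csvRecordReader);
System.out.println(quality);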

AnalyzeSpark

AnalyzeSpark: static methods for analyzing data stored in a Spark RDD

analyzeSequence

public static SequenceDataAnalysis analyzeSequence(Schema schema, JavaRDD<List<List<Writable>>> data, int maxHistogramBuckets)

Analyze the specified sequence data - returns a SequenceDataAnalysis object with summary information about each column

  • param schema Schema for the sequence data

  • param data Sequence data to analyze

  • param maxHistogramBuckets Maximum number of histogram buckets

  • return SequenceDataAnalysis for the data

analyze

public static DataAnalysis analyze(Schema schema, JavaRDD<List<Writable>> data)

Analyze the specified data - returns a DataAnalysis object with summary information about each column

  • param schema Schema for data

  • param data Data to analyze

  • return DataAnalysis for data

analyzeQualitySequence

public static DataQualityAnalysis analyzeQualitySequence(Schema schema, JavaRDD<List<List<Writable>>> data)

Analyze the data quality of sequence data - provides a report on missing values, values that don’t comply with schema, etc

  • param schema Schema for data

  • param data Data to analyze

  • return DataQualityAnalysis object

analyzeQuality

public static DataQualityAnalysis analyzeQuality(final Schema schema, final JavaRDD<List<Writable>> data)

Analyze the data quality of data - provides a report on missing values, values that don’t comply with schema, etc

  • param schema Schema for data

  • param data Data to analyze

  • return DataQualityAnalysis object

min

public static Writable min(JavaRDD<List<Writable>> allData, String columnName, Schema schema)

Get the minimum value for the specified column

  • param allData All data

  • param columnName Name of the column to get the minimum value for

  • param schema Schema of the data

  • return Minimum value for the column

max

public static Writable max(JavaRDD<List<Writable>> allData, String columnName, Schema schema)

Get the maximum value for the specified column

  • param allData All data

  • param columnName Name of the column to get the maximum value for

  • param schema Schema of the data

  • return Maximum value for the column
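Both min and max return a Writable, so convert the result to a primitive when you need a number. A minimal sketch for a numerical column, reusing javaRdd and mySchema from the earlier examples:

Writable minValue = AnalyzeSpark.min(javaRdd, "myColumn", mySchema);
Writable maxValue = AnalyzeSpark.max(javaRdd, "myColumn", mySchema);

// Writable exposes numeric conversions such as toDouble()
double range = maxValue.toDouble() - minValue.toDouble();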
