Because datasets are often large, DataVec lets you choose the execution mechanism that best suits your needs. For example, if you are vectorizing a large training dataset, you can process it on a distributed Spark cluster. However, if you need to do real-time inference, DataVec also provides a local executor that doesn't require any additional setup.
Once you've created your TransformProcess using your Schema, and you've either loaded your dataset into an Apache Spark JavaRDD or have a RecordReader that loads your dataset, you can execute a transform.
Locally this looks like:
import org.datavec.local.transforms.LocalTransformExecutor;

// Transform that returns non-sequence data
List<List<Writable>> transformed = LocalTransformExecutor.execute(recordReader, transformProcess);
// Transform that returns sequence data
List<List<List<Writable>>> transformedSeq = LocalTransformExecutor.executeToSequence(sequenceReader, transformProcess);
// Join two datasets
List<List<Writable>> joined = LocalTransformExecutor.executeJoin(join, leftReader, rightReader);
When using Spark this looks like:
import org.datavec.spark.transform.SparkTransformExecutor;

// Transform that returns non-sequence data
JavaRDD<List<Writable>> transformed = SparkTransformExecutor.execute(inputRdd, transformProcess);
// Transform that returns sequence data
JavaRDD<List<List<Writable>>> transformedSeq = SparkTransformExecutor.executeToSequence(inputSequenceRdd, transformProcess);
// Join two datasets
JavaRDD<List<Writable>> joined = SparkTransformExecutor.executeJoin(join, leftRdd, rightRdd);
Local transform executor
isTryCatch
public static boolean isTryCatch()
Execute the specified TransformProcess with the given input data. Note: this method can only be used if the TransformProcess returns non-sequence data. For TransformProcesses that return a sequence, use executeToSequence(List, TransformProcess).
param inputWritables Input data to process
param transformProcess TransformProcess to execute
return Processed data
Execute a DataVec transform process on Spark RDDs.
isTryCatch
public static boolean isTryCatch()
Deprecated: use static methods instead of instance methods on SparkTransformExecutor.