Execute ETL and vectorization in a local instance.
Because datasets are often large, you can choose the execution mechanism that best suits your needs. For example, if you are vectorizing a large training dataset, you can process it on a distributed Spark cluster. However, if you need to do real-time inference, DataVec also provides a local executor that doesn't require any additional setup.
Once you've created your TransformProcess using your Schema, and you've either loaded your dataset into an Apache Spark JavaRDD or have a RecordReader that loads your dataset, you can execute a transform.
Locally, this looks like:
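Below is a minimal sketch of local execution. The schema columns, the removeColumns step, and the hand-built records are illustrative placeholders, not part of the original text; the call that matters is LocalTransformExecutor.execute, which processes a List<List<Writable>> without requiring a Spark context.

```java
import java.util.Arrays;
import java.util.List;

import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.schema.Schema;
import org.datavec.api.writable.IntWritable;
import org.datavec.api.writable.Text;
import org.datavec.api.writable.Writable;
import org.datavec.local.transforms.LocalTransformExecutor;

public class LocalTransformExample {
    public static void main(String[] args) {
        // Schema describing the raw records (illustrative columns)
        Schema schema = new Schema.Builder()
                .addColumnString("name")
                .addColumnInteger("age")
                .build();

        // TransformProcess built from the schema; here it simply drops a column
        TransformProcess tp = new TransformProcess.Builder(schema)
                .removeColumns("name")
                .build();

        // Input data: one List<Writable> per record (e.g. collected from a RecordReader)
        List<List<Writable>> input = Arrays.asList(
                Arrays.<Writable>asList(new Text("alice"), new IntWritable(42)),
                Arrays.<Writable>asList(new Text("bob"), new IntWritable(23)));

        // Execute the transform locally -- no Spark setup required
        List<List<Writable>> transformed = LocalTransformExecutor.execute(input, tp);
        System.out.println(transformed);
    }
}
```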
When using Spark, this looks like:
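A comparable sketch for Spark follows; the local[*] master and the example records are illustrative assumptions, not prescribed setup. SparkTransformExecutor.execute operates on a JavaRDD<List<Writable>> rather than a plain list.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.schema.Schema;
import org.datavec.api.writable.IntWritable;
import org.datavec.api.writable.Text;
import org.datavec.api.writable.Writable;
import org.datavec.spark.transform.SparkTransformExecutor;

public class SparkTransformExample {
    public static void main(String[] args) {
        // A local master is used here purely for illustration
        JavaSparkContext sc = new JavaSparkContext("local[*]", "datavec-transform");

        Schema schema = new Schema.Builder()
                .addColumnString("name")
                .addColumnInteger("age")
                .build();

        TransformProcess tp = new TransformProcess.Builder(schema)
                .removeColumns("name")
                .build();

        // Parallelize raw records into a JavaRDD<List<Writable>>
        JavaRDD<List<Writable>> inputRdd = sc.parallelize(Arrays.asList(
                Arrays.<Writable>asList(new Text("alice"), new IntWritable(42)),
                Arrays.<Writable>asList(new Text("bob"), new IntWritable(23))));

        // Execute the transform on the cluster
        JavaRDD<List<Writable>> transformed = SparkTransformExecutor.execute(inputRdd, tp);
        System.out.println(transformed.collect());

        sc.stop();
    }
}
```

The same TransformProcess definition can be handed to either executor, so the pipeline itself doesn't change when you move from local testing to a cluster.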
Local transform executor
execute
Execute the specified TransformProcess with the given input data. Note: this method can only be used if the TransformProcess returns non-sequence data. For TransformProcesses that return a sequence, use executeToSequence(List, TransformProcess).
param inputWritables: input data to process
param transformProcess: the TransformProcess to execute
return: the processed data
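For the sequence case referenced above, a hedged sketch follows. It assumes the convertToSequence(keyColumn, comparator) builder step and the NumericalColumnComparator helper to group and order records, and that executeToSequence is the overload mentioned above, taking the same List<List<Writable>> input and returning one list of records per generated sequence.

```java
import java.util.Arrays;
import java.util.List;

import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.schema.Schema;
import org.datavec.api.transform.sequence.comparator.NumericalColumnComparator;
import org.datavec.api.writable.IntWritable;
import org.datavec.api.writable.Text;
import org.datavec.api.writable.Writable;
import org.datavec.local.transforms.LocalTransformExecutor;

public class LocalSequenceTransformExample {
    public static void main(String[] args) {
        // Illustrative schema: a key column plus an ordering column
        Schema schema = new Schema.Builder()
                .addColumnString("sensorId")
                .addColumnInteger("timestep")
                .build();

        // Group records into per-sensor sequences, ordered by "timestep"
        // (convertToSequence / NumericalColumnComparator usage is an assumption)
        TransformProcess tp = new TransformProcess.Builder(schema)
                .convertToSequence("sensorId", new NumericalColumnComparator("timestep"))
                .build();

        List<List<Writable>> input = Arrays.asList(
                Arrays.<Writable>asList(new Text("s1"), new IntWritable(0)),
                Arrays.<Writable>asList(new Text("s1"), new IntWritable(1)),
                Arrays.<Writable>asList(new Text("s2"), new IntWritable(0)));

        // executeToSequence returns one List<List<Writable>> per generated sequence
        List<List<List<Writable>>> sequences =
                LocalTransformExecutor.executeToSequence(input, tp);
        System.out.println(sequences.size() + " sequences");
    }
}
```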
Spark transform executor
Executes a DataVec transform process on Spark RDDs.
deprecated: Use static methods instead of instance methods on SparkTransformExecutor.