Executors
Execute ETL and vectorization in a local instance.
Local or remote execution?
Datasets are often large, so DataVec lets you choose the execution mechanism that best suits your needs. For example, if you are vectorizing a large training dataset, you can process it on a distributed Spark cluster. If you need real-time inference instead, DataVec also provides a local executor that requires no additional setup.
Executing a transform process
Once you've created your TransformProcess using your Schema, and you've either loaded your dataset into an Apache Spark JavaRDD or have a RecordReader that loads your dataset, you can execute a transform.
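Locally, you pass the input data and the TransformProcess to the execute method of LocalTransformExecutor, which returns the processed data.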
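To make the idea concrete without any DataVec dependency, here is a minimal sketch of what local execution amounts to: every step of a transform is applied, in order, to every record held in memory, and the processed records are returned. Everything in it is an assumption made for illustration: the class LocalExecutionSketch, its execute method, and the removeColumn and upperCase steps are hypothetical stand-ins, and plain Java Objects take the place of DataVec's Writable values. It is not the library's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Dependency-free model of local execution. None of these names are DataVec API:
// a record is a List<Object>, and each step is a hypothetical record-to-record function.
class LocalExecutionSketch {

    // Applies every step, in order, to every record and returns the processed records.
    static List<List<Object>> execute(List<List<Object>> inputRecords,
                                      List<UnaryOperator<List<Object>>> steps) {
        List<List<Object>> processed = new ArrayList<>();
        for (List<Object> record : inputRecords) {
            List<Object> current = record;
            for (UnaryOperator<List<Object>> step : steps) {
                current = step.apply(current);
            }
            processed.add(current);
        }
        return processed;
    }

    // Hypothetical step: drop the column at the given index (works on a copy).
    static UnaryOperator<List<Object>> removeColumn(int index) {
        return record -> {
            List<Object> copy = new ArrayList<>(record);
            copy.remove(index);
            return copy;
        };
    }

    // Hypothetical step: upper-case the value in the given column (works on a copy).
    static UnaryOperator<List<Object>> upperCase(int index) {
        return record -> {
            List<Object> copy = new ArrayList<>(record);
            copy.set(index, copy.get(index).toString().toUpperCase(Locale.ROOT));
            return copy;
        };
    }

    public static void main(String[] args) {
        // Columns: name, age, notes
        List<List<Object>> input = List.of(
                List.<Object>of("alice", 30, "n/a"),
                List.<Object>of("bob", 41, "n/a"));
        List<UnaryOperator<List<Object>>> steps = List.of(removeColumn(2), upperCase(0));

        // prints [[ALICE, 30], [BOB, 41]]
        System.out.println(execute(input, steps));
    }
}
```

With DataVec itself, the record list and the step list correspond to your input data and your TransformProcess.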
When using Spark, you call the static methods of SparkTransformExecutor instead, passing your JavaRDD and the TransformProcess.
Available executors
LocalTransformExecutor
Local transform executor
isTryCatch
execute
Executes the specified TransformProcess with the given input data. Note: this method can only be used if the TransformProcess returns non-sequence data. For TransformProcesses that return a sequence, use executeToSequence(List, TransformProcess).
Parameter inputWritables: input data to process
Parameter transformProcess: TransformProcess to execute
Returns: the processed data
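The distinction between non-sequence and sequence data comes down to one extra level of nesting. The sketch below shows only that shape difference; the class RecordShapes and its sample values are hypothetical, and plain Java Objects stand in for DataVec's Writable values.

```java
import java.util.List;

// Illustrative only: plain Objects stand in for DataVec Writable values.
class RecordShapes {

    // Non-sequence data: a list of records, each record a list of column values.
    static List<List<Object>> nonSequenceExample() {
        return List.of(
                List.<Object>of("sensor-a", 0.5),
                List.<Object>of("sensor-b", 0.7));
    }

    // Sequence data: a list of sequences, each sequence a list of time steps,
    // each time step a list of column values (one extra level of nesting).
    static List<List<List<Object>>> sequenceExample() {
        return List.of(
                List.of(List.<Object>of(0, 0.5), List.<Object>of(1, 0.6)),
                List.of(List.<Object>of(0, 0.7)));
    }

    public static void main(String[] args) {
        System.out.println("records: " + nonSequenceExample().size());
        System.out.println("sequences: " + sequenceExample().size()
                + ", time steps in first sequence: " + sequenceExample().get(0).size());
    }
}
```

A TransformProcess whose output has the second shape is the case that calls for executeToSequence rather than execute.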
SparkTransformExecutor
Executes a DataVec transform process on Spark RDDs.
isTryCatch
Deprecated: use the static methods of SparkTransformExecutor instead of instance methods.