`JavaRDD<MultiDataSet>`; (c) a directory of serialized `DataSet`/`MultiDataSet` (minibatch) objects on network storage such as HDFS, S3, or Azure blob storage; or (d) a directory of minibatches in some other format
`path` is the directory where you saved the files.
`JavaRDD<DataSet>` for export, training, or evaluation on Spark.
`DataVecDataSetFunction` is very similar to the
`RecordReaderDataSetIterator` that is often used for single-machine training.
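As a sketch of how this conversion might look (the label index and class count below are illustrative values, not from the original text):

```java
import org.apache.spark.api.java.JavaRDD;
import org.datavec.api.writable.Writable;
import org.deeplearning4j.spark.datavec.DataVecDataSetFunction;
import org.nd4j.linalg.dataset.DataSet;

import java.util.List;

// Assuming rddWritable was produced by applying a DataVec record reader to raw input
JavaRDD<List<Writable>> rddWritable = ...;   // e.g., from a CSV record reader
int labelIndex = 5;                          // column holding the label (hypothetical)
int numClasses = 10;                         // number of label classes (hypothetical)

// Map each record to a single-example DataSet; 'false' means classification, not regression
JavaRDD<DataSet> rddDataSet = rddWritable.map(
        new DataVecDataSetFunction(labelIndex, numClasses, false));
```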
`RDD<List<Writable>>` (for 'standard' data) or
`RDD<List<List<Writable>>>` (for sequence data).
`List<List<Writable>>`), you can use `SparkSourceDummySeqReader` instead.
`RDD<MultiDataSet>` directly in each training job.
`RDD<DataSet>` is quite straightforward:
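A minimal sketch of the export step; the minibatch size and export path here are hypothetical:

```java
import org.apache.spark.api.java.JavaRDD;
import org.deeplearning4j.spark.data.BatchAndExportDataSetsFunction;
import org.nd4j.linalg.dataset.DataSet;

JavaRDD<DataSet> rddDataSet = ...;               // RDD of single-example DataSets
int minibatchSize = 32;                          // illustrative value
String exportPath = "hdfs:///your/export/dir";   // hypothetical path

// Batches the examples into minibatches, writes each one to exportPath,
// and returns the paths of the saved files
JavaRDD<String> savedPaths = rddDataSet.mapPartitionsWithIndex(
        new BatchAndExportDataSetsFunction(minibatchSize, exportPath), true);
```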
`RDD<MultiDataSet>` can be done in the same way using
`BatchAndExportMultiDataSetsFunction` instead, which takes the same arguments.
`SparkComputationGraph.fitMultiDataSet(String path)` if we saved an `RDD<MultiDataSet>`.
`MultiDataSetIterator` used for single-machine training. There are many different ways to create one, which is outside the scope of this guide.
`SequenceRecordReader` (including a custom record reader) to a format usable on Spark. `MapFileRecordWriter` and `MapFileSequenceRecordWriter` require the following dependencies:
`SequenceRecordReader` combined with a
`MapFileSequenceRecordWriter` is virtually the same.
`MapFileSequenceRecordWriter` both support splitting, i.e., creating multiple smaller map files instead of one single (potentially multi-GB) map file. Splitting is recommended when saving data in this manner for use with Spark.
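An illustrative sketch of writing records with splitting enabled; the output directory and split size are hypothetical, and the exact constructor overloads may differ between DataVec versions:

```java
import org.datavec.api.records.reader.RecordReader;
import org.datavec.hadoop.records.writer.mapfile.MapFileRecordWriter;

import java.io.File;

File outputDir = new File("/your/output/dir");   // hypothetical path
int mapFileSplitSize = 100_000;                  // records per split; illustrative value

// Start a new (smaller) map file every mapFileSplitSize records,
// rather than writing one huge map file
MapFileRecordWriter writer = new MapFileRecordWriter(outputDir, mapFileSplitSize);
RecordReader reader = ...;                       // any RecordReader, including a custom one
while (reader.hasNext()) {
    writer.write(reader.next());
}
writer.close();
```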
`RDD<DataSet>` for image classification, starting from images stored either locally or on a network file system such as HDFS.
`minibatchSize` remote file reads.
`PatternPathLabelGenerator` instead. Let's say images are in the format "cat_img1234.jpg", "dog_2309.png", etc. We can use the following process:
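A sketch of generating labels from such filenames, assuming `PatternPathLabelGenerator`'s split-on-delimiter behavior; the delimiter and token index here are chosen to match the example filenames above:

```java
import org.datavec.api.io.labels.PathLabelGenerator;
import org.datavec.api.io.labels.PatternPathLabelGenerator;

// For filenames like "cat_img1234.jpg" or "dog_2309.png": split the filename
// on "_" and take the token at index 0 ("cat", "dog") as the class label
String delimiter = "_";
int labelPosition = 0;
PathLabelGenerator labelMaker = new PatternPathLabelGenerator(delimiter, labelPosition);
```

The resulting `labelMaker` can then be passed to an image record reader so that each image's label is derived from its filename.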