Data for Spark training can be supplied in one of the following formats:
(a) an `RDD<DataSet>`/`JavaRDD<DataSet>`
(b) an `RDD<MultiDataSet>`/`JavaRDD<MultiDataSet>`
(c) a directory of serialized DataSet/MultiDataSet (minibatch) objects on network storage such as HDFS, S3 or Azure blob storage
(d) a directory of minibatches in some other format

For option (c), training is performed using `SparkDl4jMultiLayer.fit(String path)` or `SparkComputationGraph.fit(String path)`, where `path` is the directory where you saved the files.
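As a minimal sketch of option (c) — assuming a `JavaSparkContext sc`, a network configuration `conf`, and a `TrainingMaster tm` have already been set up as for any Spark training job; the directory path is illustrative:

```java
// Non-runnable sketch: sc, conf and tm are assumed to be configured already.
SparkDl4jMultiLayer sparkNet = new SparkDl4jMultiLayer(sc, conf, tm);

// 'path' points at the directory of serialized minibatch (DataSet) files
// previously exported to network storage.
String path = "hdfs:///data/exported/train/";
sparkNet.fit(path);
```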
A `JavaRDD<DataSet>` can be prepared for export, training or evaluation on Spark as follows. `DataVecDataSetFunction` is very similar to the `RecordReaderDataSetIterator` that is often used for single machine training: it converts an `RDD<List<Writable>>` (for 'standard' data) or an `RDD<List<List<Writable>>>` (for sequence data) into an `RDD<DataSet>`.

The same approach converts an `RDD<List<Writable>>` or `RDD<List<List<Writable>>>` to an `RDD<MultiDataSet>`; for sequence data (i.e., `List<List<Writable>>`) you can use `SparkSourceDummySeqReader` instead of the non-sequence dummy reader.
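A sketch of the `DataSet` conversion, assuming a CSV-style record where one column holds a class label. The column index and class count are illustrative; the constructor arguments follow `DataVecDataSetFunction(labelIndex, numPossibleLabels, regression)`, and the source RDD is elided:

```java
// Non-runnable sketch: 'parsed' would come from a DataVec parsing step.
int labelIndex = 4;   // column containing the class label (illustrative)
int numClasses = 3;   // number of label classes (illustrative)
JavaRDD<List<Writable>> parsed = ...;
JavaRDD<DataSet> rddDataSet =
        parsed.map(new DataVecDataSetFunction(labelIndex, numClasses, false));
```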
Rather than fitting from an `RDD<DataSet>` or `RDD<MultiDataSet>` directly in each training job, it is better to preprocess the data once, save it to network storage, and reuse it.

Saving an `RDD<DataSet>` is quite straightforward using `BatchAndExportDataSetsFunction`; saving an `RDD<MultiDataSet>` can be done in the same way using `BatchAndExportMultiDataSetsFunction` instead, which takes the same arguments. Training on the saved data then uses `SparkDl4jMultiLayer.fit(String path)`, or `SparkComputationGraph.fitMultiDataSet(String path)` if we saved an `RDD<MultiDataSet>` instead. To fit from an explicit set of saved files rather than a directory, `SparkDl4jMultiLayer.fitPaths(JavaRDD<String>)` can be used.
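A sketch of the export step (the export path and minibatch size are illustrative; `BatchAndExportDataSetsFunction` groups the examples into minibatches and writes one serialized `DataSet` file per minibatch):

```java
// Non-runnable sketch: rddDataSet would come from your preprocessing pipeline.
JavaRDD<DataSet> rddDataSet = ...;
int minibatchSize = 32;                           // examples per saved file
String exportPath = "hdfs:///data/exported/train/";
JavaRDD<String> savedPaths = rddDataSet.mapPartitionsWithIndex(
        new BatchAndExportDataSetsFunction(minibatchSize, exportPath), true);
savedPaths.count();   // force evaluation so the (lazy) export actually runs
```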
To prepare such data on a single machine, this guide assumes you have a `DataSetIterator` or `MultiDataSetIterator` as used for single-machine training. There are many different ways to create one, which is outside of the scope of this guide. Data can also be saved by converting a `RecordReader`
or `SequenceRecordReader` (including a custom record reader) to a format usable on Spark, via `MapFileRecordWriter` and `MapFileSequenceRecordWriter`; these writers require additional dependencies (the DataVec Hadoop module) to be added to your project. The process for a `SequenceRecordReader` combined with a `MapFileSequenceRecordWriter` is virtually the same as for the non-sequence case.

`MapFileRecordWriter` and `MapFileSequenceRecordWriter` both support splitting - i.e., creating multiple smaller map files instead of creating one single (potentially multi-GB) map file. Using splitting is recommended when saving data in this manner for use with Spark.
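The arithmetic behind splitting can be sketched in plain Java (the helper names and the records-per-file policy here are illustrative, not DataVec API):

```java
public class SplitSketch {
    // Number of map files needed when each holds at most recordsPerFile records.
    static int numFiles(long totalRecords, long recordsPerFile) {
        return (int) ((totalRecords + recordsPerFile - 1) / recordsPerFile);
    }

    // Which split file a given record lands in, filling files in order.
    static int fileIndexFor(long recordIndex, long recordsPerFile) {
        return (int) (recordIndex / recordsPerFile);
    }

    public static void main(String[] args) {
        // 1,000,000 records at 100,000 per file -> 10 smaller map files
        System.out.println(numFiles(1_000_000, 100_000));   // prints 10
        System.out.println(fileIndexFor(250_000, 100_000)); // prints 2
    }
}
```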
This example shows how to create an `RDD<DataSet>` for training an image classifier, starting from images stored either locally, or on a network file system such as HDFS. Source images are first grouped into batches of `minibatchSize` files, which avoids the overhead of a large number of small remote file reads; the utility methods for this live in `org.deeplearning4j.spark.util.SparkDataUtils`.

If the labels are encoded in the image filenames rather than in the parent directory names, you can use `PatternPathLabelGenerator` instead. Let's say images are in the format "cat_img1234.jpg", "dog_2309.png" etc. We can use the following process:
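The labelling rule above can be illustrated with a plain-Java extraction matching that naming scheme (`PatternPathLabelGenerator` handles this inside DataVec; the split-on-first-underscore rule below is an assumption based on the example filenames):

```java
public class FilenameLabelSketch {
    // Label = everything before the first '_' in the file name,
    // e.g. "cat_img1234.jpg" -> "cat".
    static String labelFor(String fileName) {
        int idx = fileName.indexOf('_');
        if (idx < 0) {
            throw new IllegalArgumentException("No '_' delimiter in: " + fileName);
        }
        return fileName.substring(0, idx);
    }

    public static void main(String[] args) {
        System.out.println(labelFor("cat_img1234.jpg")); // prints cat
        System.out.println(labelFor("dog_2309.png"));    // prints dog
    }
}
```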