# Record Readers

A `RecordReader` is the entry point for data into DataVec. It reads raw bytes from an `InputSplit` and converts them into `List<Writable>` records — one list per data example, where each element corresponds to a column in your `Schema`.

## The RecordReader Interface

Every reader implements `RecordReader` and provides:

| Method                                             | Description                                             |
| -------------------------------------------------- | ------------------------------------------------------- |
| `initialize(InputSplit split)`                     | Set up the reader against a data source                 |
| `initialize(Configuration conf, InputSplit split)` | Set up with additional configuration                    |
| `hasNext()`                                        | True if another record is available                     |
| `next()`                                           | Return the next record as `List<Writable>`              |
| `nextRecord()`                                     | Return the next `Record` with optional `RecordMetaData` |
| `reset()`                                          | Restart iteration from the beginning                    |
| `close()`                                          | Release resources                                       |

After calling `initialize`, either read records yourself with a `hasNext` / `next` loop, or pass the reader to a `DataSetIterator` implementation such as `RecordReaderDataSetIterator`.

## InputSplit

An `InputSplit` tells the reader where to find data. The main implementations:

### FileSplit

Points to a directory or single file. By default, all files recursively under the directory are included.

```java
// All files under a directory
InputSplit allFiles = new FileSplit(new File("/data/train/"));

// Only CSV files, shuffled with a fixed seed
InputSplit csvOnly = new FileSplit(
    new File("/data/train/"),
    new String[]{"csv"},
    new Random(42)
);

// A single file
InputSplit singleFile = new FileSplit(new File("/data/train.csv"));
```

### NumberedFileInputSplit

For files whose names follow a numbered pattern, given as a `printf`-style format string. Both the start and end index are inclusive:

```java
// matches seq_0000.csv through seq_9999.csv
InputSplit split = new NumberedFileInputSplit("/data/seq_%04d.csv", 0, 9999);
```
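To see which paths a pattern like the one above covers, it can help to expand it by hand. The following is a minimal plain-JDK sketch, not DataVec code: the class `NumberedPaths` and its `expand` method are hypothetical names, and it assumes the pattern is applied with `String.format` once per index.

```java
import java.util.ArrayList;
import java.util.List;

class NumberedPaths {
    // Expand a printf-style pattern for every index from minIdx to maxIdx, inclusive.
    static List<String> expand(String pattern, int minIdx, int maxIdx) {
        List<String> paths = new ArrayList<>();
        for (int i = minIdx; i <= maxIdx; i++) {
            paths.add(String.format(pattern, i));
        }
        return paths;
    }

    public static void main(String[] args) {
        // Prints one path per line for indices 0, 1 and 2
        for (String path : expand("/data/seq_%04d.csv", 0, 2)) {
            System.out.println(path);
        }
    }
}
```

Note that `%04d` zero-pads to four digits, while a plain `%d` (as in `features_%d.csv`) does not pad at all, so the pattern has to match how the files were actually named.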

### CollectionInputSplit

For an explicit list of URIs:

```java
List<URI> uris = Arrays.asList(
    new URI("file:///data/a.csv"),
    new URI("file:///data/b.csv")
);
InputSplit split = new CollectionInputSplit(uris);
```

### InputStreamInputSplit

For streaming data from any `InputStream`:

```java
InputStream is = getClass().getResourceAsStream("/data.csv");
InputSplit split = new InputStreamInputSplit(is);
```

## CSV Readers

### CSVRecordReader

The most commonly used reader. Reads a CSV (or TSV, or any delimiter-separated) file line by line, producing one `List<Writable>` per line.

```java
// Default: comma delimiter, no header skip
RecordReader plain = new CSVRecordReader();
plain.initialize(new FileSplit(new File("data.csv")));

// Skip 1 header line, comma delimiter
RecordReader withHeader = new CSVRecordReader(1, ',');
withHeader.initialize(new FileSplit(new File("data.csv")));

// Tab delimiter (call initialize as above before reading)
RecordReader tabSeparated = new CSVRecordReader(0, '\t');

All values are returned as `Text` writables, that is, as strings. They are converted to numbers later, either by a `TransformProcess` or when a `DataSetIterator` builds its feature and label arrays.

When some fields are quoted, for example because they contain the delimiter, pass the quote character as the third argument:

```java
// Handle quoted fields with embedded commas
RecordReader rr = new CSVRecordReader(1, ',', '"');
```
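To make clear what quote handling changes, here is a small plain-JDK sketch of quote-aware splitting. It is an illustration only, not DataVec's parser: `QuotedCsvSketch` and `split` are hypothetical names, and treating a doubled quote as an escaped quote is an assumption borrowed from common CSV conventions.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedCsvSketch {
    // Split one line on `delimiter`, keeping delimiters that appear between quotes.
    // Inside a quoted field, a doubled quote is read as one literal quote character.
    static List<String> split(String line, char delimiter, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == quote) {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == quote) {
                    current.append(quote); // escaped quote inside a quoted field
                    i++;                   // skip the second quote of the pair
                } else {
                    inQuotes = !inQuotes;  // opening or closing quote
                }
            } else if (c == delimiter && !inQuotes) {
                fields.add(current.toString()); // field boundary
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field has no trailing delimiter
        return fields;
    }

    public static void main(String[] args) {
        // The embedded comma stays inside the second field
        System.out.println(split("1,\"Smith, John\",42", ',', '"'));
    }
}
```

Without quote handling, the same line would split into four fields instead of three, which shifts every later column by one.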

### CSVSequenceRecordReader

Reads multiple files, treating each file as one sequence. Each line in a file is one time step; each value in a line is one feature at that time step.

This reader implements `SequenceRecordReader`, so use it with `SequenceRecordReaderDataSetIterator`.

```java
// One CSV file per sequence, one line per time step
SequenceRecordReader features = new CSVSequenceRecordReader(1, ',');
features.initialize(new NumberedFileInputSplit("/data/features_%d.csv", 0, 999));

SequenceRecordReader labels = new CSVSequenceRecordReader(1, ',');
labels.initialize(new NumberedFileInputSplit("/data/labels_%d.csv", 0, 999));

DataSetIterator iter = new SequenceRecordReaderDataSetIterator(
    features, labels, batchSize, numClasses,
    false,   // not regression
    SequenceRecordReaderDataSetIterator.AlignmentMode.ALIGN_END
);
```

### CSVRegexRecordReader

Splits columns using regex patterns rather than a simple delimiter. Useful for CSV files with inconsistent spacing or mixed delimiters.

```java
RecordReader rr = new CSVRegexRecordReader(0, ',');
```

### CSVVariableSlidingWindowRecordReader

Reads an entire CSV and produces subsequences using a variable sliding window. The window starts at size 1, grows to `maxLinesPerSequence`, then shrinks back. Useful for training on all possible subsequences of a dataset.
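The grow-then-shrink behaviour is easier to see on a tiny input. The sketch below is one plausible reading of the description above, written against plain lists rather than DataVec types: `VariableWindowSketch` and `windows` are hypothetical names, and the order of lines inside each window in the real reader may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class VariableWindowSketch {
    // Slide a window over `lines`: it grows from 1 line up to the maximum size,
    // moves forward at that size, then shrinks back to 1 line at the end.
    static <T> List<List<T>> windows(List<T> lines, int maxLinesPerSequence) {
        List<List<T>> out = new ArrayList<>();
        int n = lines.size();
        int size = Math.min(maxLinesPerSequence, n); // window cannot exceed the data
        // `end` is the exclusive end of the window; it runs past n so the window can shrink
        for (int end = 1; end < n + size; end++) {
            int from = Math.max(0, end - size);
            int to = Math.min(n, end);
            if (from < to) {
                out.add(new ArrayList<>(lines.subList(from, to)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Four lines with a maximum window of three produce six subsequences
        for (List<String> w : windows(Arrays.asList("a", "b", "c", "d"), 3)) {
            System.out.println(w);
        }
    }
}
```

Under this reading, an input of `n` lines and a maximum window of `w` (with `w <= n`) yields `n + w - 1` subsequences.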

## Text Readers

### LineRecordReader

Reads a file line by line. Each line becomes a single-element record containing a `Text` writable. No parsing is done — you receive the raw line. Useful when you want to apply your own parsing in a `TransformProcess` or custom transform.

```java
RecordReader rr = new LineRecordReader();
rr.initialize(new FileSplit(new File("/data/corpus.txt")));

while (rr.hasNext()) {
    List<Writable> line = rr.next();  // single-element list
    String text = line.get(0).toString();
}
```

### RegexLineRecordReader

Reads a file line by line and splits each line into fields using a regex with capture groups. Each capture group becomes one `Text` writable in the record.

```java
// Parse log lines: "2024-01-15 14:32:01.123 42 WARN Message text here"
String regex = "(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3}) (\\d+) ([A-Z]+) (.+)";
int skipLines = 0;

RecordReader rr = new RegexLineRecordReader(regex, skipLines);
rr.initialize(new FileSplit(new File("/var/log/app.log")));

// Each record: ["2024-01-15 14:32:01.123", "42", "WARN", "Message text here"]
```

By default, a line that does not match the regex causes an exception.
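Before pointing a regex reader at a large log file, it is worth checking the pattern against a sample line using only `java.util.regex`. The sketch below does that for the log format used in this section. `RegexFieldsSketch` and `fields` are hypothetical names, and requiring the whole line to match (via `Matcher.matches`) is an assumption about how the reader applies the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFieldsSketch {
    // Same pattern as the log example: timestamp, numeric id, level, message
    static final Pattern LOG = Pattern.compile(
        "(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3}) (\\d+) ([A-Z]+) (.+)");

    // Return one string per capture group, or fail if the whole line does not match.
    static List<String> fields(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Line does not match pattern: " + line);
        }
        List<String> out = new ArrayList<>();
        for (int g = 1; g <= m.groupCount(); g++) { // group 0 is the whole match
            out.add(m.group(g));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields(LOG, "2024-01-15 14:32:01.123 42 WARN Message text here"));
    }
}
```

A check like this also shows quickly which lines in a sample would be rejected, for example blank lines or stack-trace continuation lines.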

### RegexSequenceRecordReader

Like `RegexLineRecordReader`, but reads an entire file as a sequence, with one time step per line. Supports three invalid-line handling modes:

* `FailOnInvalid` — throw an exception (default)
* `SkipInvalid` — silently skip non-matching lines
* `SkipInvalidWithWarning` — skip but log a warning

```java
// regex and skipLines as defined in the RegexLineRecordReader example
SequenceRecordReader rr = new RegexSequenceRecordReader(regex, skipLines,
    RegexSequenceRecordReader.LineErrorHandling.SkipInvalidWithWarning);
```

### ListStringRecordReader

Reads from an in-memory `List<List<String>>`. Each inner list becomes one record, and each string in it becomes one `Text` writable (one column). Useful for testing or when you have already loaded text into memory.

```java
List<List<String>> data = Arrays.asList(
    Arrays.asList("cat"),
    Arrays.asList("dog"),
    Arrays.asList("bird")
);
InputSplit split = new ListStringSplit(data);
RecordReader rr = new ListStringRecordReader();
rr.initialize(split);
```

## JSON / XML / YAML Readers

### JacksonRecordReader

Reads JSON, XML, or YAML files using Jackson. Each file (or each element in an array) becomes one record. You specify a `FieldSelection` to pull out the fields you need.

```java
import org.datavec.api.records.reader.impl.jackson.JacksonRecordReader;
import org.datavec.api.records.reader.impl.jackson.FieldSelection;
import com.fasterxml.jackson.databind.ObjectMapper;

FieldSelection fields = new FieldSelection.Builder()
    .addField("userId")
    .addField("amount")
    .addField("category")
    .build();

RecordReader rr = new JacksonRecordReader(
    fields,
    new ObjectMapper(),      // ObjectMapper for JSON
    false,                   // not append label
    -1,                      // label index (not used here)
    new FileSplit(new File("/data/events/"))
);
```

For XML, replace `new ObjectMapper()` with `new XmlMapper()` from the Jackson XML module.

## Image Reader

### ImageRecordReader

Reads a directory of images, where each subdirectory is treated as a class label (one-of-K labeling). All images are resized to the specified height, width, and channel count.

```java
import java.io.File;

import org.datavec.api.io.labels.ParentPathLabelGenerator;
import org.datavec.api.split.FileSplit;
import org.datavec.image.recordreader.ImageRecordReader;

int height = 224;
int width = 224;
int channels = 3;  // RGB; use 1 for grayscale

// Construct a label generator from directory names
ParentPathLabelGenerator labelMaker = new ParentPathLabelGenerator();

ImageRecordReader rr = new ImageRecordReader(height, width, channels, labelMaker);
rr.initialize(new FileSplit(new File("/data/images/train/")));
```

Expected directory structure:

```
/data/images/train/
    cat/
        img001.jpg
        img002.jpg
    dog/
        img003.jpg
        img004.jpg
```

With this structure, images in `cat/` get label index 0 and images in `dog/` get label index 1 (alphabetical ordering).
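The mapping from directory names to label indices, and from there to a one-of-K vector, can be sketched with plain JDK collections. This is an illustration of the alphabetical rule described above, not the reader's internal code. `LabelIndexSketch`, `labelIndex`, and `oneHot` are hypothetical names, and the sort used here is Java's default case-sensitive string ordering.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

class LabelIndexSketch {
    // Sort the class directory names and number them from 0 in that order.
    static Map<String, Integer> labelIndex(Collection<String> dirNames) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String name : new TreeSet<>(dirNames)) { // TreeSet sorts and de-duplicates
            index.put(name, index.size());
        }
        return index;
    }

    // One-of-K encoding: a vector of zeros with a single 1 at the label's index.
    static int[] oneHot(int labelIdx, int numLabels) {
        int[] v = new int[numLabels];
        v[labelIdx] = 1;
        return v;
    }

    public static void main(String[] args) {
        Map<String, Integer> index = labelIndex(Arrays.asList("dog", "cat"));
        System.out.println(index);
        System.out.println(Arrays.toString(oneHot(index.get("dog"), index.size())));
    }
}
```

Because indices depend on the sorted set of directory names, adding or renaming a class directory can shift the indices of existing classes, so train and test directories should contain the same class names.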

For image augmentation and transforms, see [Image Data](/en-1.0.0-rewrite/datavec/image.md).

## File Reader

### FileRecordReader

Reads individual files, returning the file path as a `Text` writable and the label derived from the parent directory name. Most commonly used as a base class rather than directly.

```java
RecordReader rr = new FileRecordReader();
rr.initialize(new FileSplit(new File("/data/files/")));

int currentLabel = ((FileRecordReader) rr).getCurrentLabel();
```

## Sparse Format Readers

### LibSvmRecordReader and SVMLightRecordReader

These readers parse sparse feature formats widely used in linear model and kernel method communities. The format encodes each example as:

```
LABEL index1:value1 index2:value2 ...
```

Zero-valued features are omitted. `LibSvmRecordReader` is a subclass of `SVMLightRecordReader` with minor format differences.

```java
// Configure for specific number of features
Configuration conf = new Configuration();
conf.set(SVMLightRecordReader.NUM_FEATURES, "10000");
conf.setBoolean(SVMLightRecordReader.ZERO_BASED_INDEXING, false);

RecordReader rr = new LibSvmRecordReader();
rr.initialize(conf, new FileSplit(new File("data.svm")));
```
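To illustrate how a sparse line in this format relates to a dense feature vector, and what the zero-based indexing flag changes, here is a minimal plain-JDK sketch. It is not the DataVec parser: `SvmLightSketch`, `label`, and `parseFeatures` are hypothetical names, and the sketch ignores format extras such as comments and `qid` tokens.

```java
import java.util.Arrays;

class SvmLightSketch {
    // The first whitespace-separated token is the label.
    static String label(String line) {
        return line.trim().split("\\s+")[0];
    }

    // Expand "index:value" pairs into a dense vector; omitted features stay 0.0.
    static double[] parseFeatures(String line, int numFeatures, boolean zeroBased) {
        String[] tokens = line.trim().split("\\s+");
        double[] dense = new double[numFeatures];
        for (int t = 1; t < tokens.length; t++) { // tokens[0] is the label
            String[] pair = tokens[t].split(":");
            int idx = Integer.parseInt(pair[0]);
            if (!zeroBased) {
                idx -= 1; // one-based file indices map to zero-based array slots
            }
            dense[idx] = Double.parseDouble(pair[1]);
        }
        return dense;
    }

    public static void main(String[] args) {
        String line = "1 2:0.5 4:3.0";
        System.out.println(label(line));
        System.out.println(Arrays.toString(parseFeatures(line, 5, false)));
    }
}
```

With one-based indexing, feature `2` lands in array slot 1. Reading the same file with the wrong indexing setting shifts every feature by one position, so the flag should match how the file was written.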

## Collection Readers

### CollectionRecordReader

Wraps an in-memory `List<List<Writable>>` as a reader. Primarily used in unit tests.

```java
List<List<Writable>> data = new ArrayList<>();
data.add(Arrays.asList(new IntWritable(1), new DoubleWritable(3.14)));
data.add(Arrays.asList(new IntWritable(2), new DoubleWritable(2.72)));

RecordReader rr = new CollectionRecordReader(data);
```

### CollectionSequenceRecordReader

Like `CollectionRecordReader` but for sequence data: wraps `List<List<List<Writable>>>`.

## Combining Readers

### ConcatenatingRecordReader

Chains multiple readers sequentially. When the first reader is exhausted, reading continues with the second, and so on. Useful for combining training files across multiple directories.

```java
RecordReader r1 = new CSVRecordReader(1, ',');
r1.initialize(new FileSplit(new File("/data/train_2022/")));

RecordReader r2 = new CSVRecordReader(1, ',');
r2.initialize(new FileSplit(new File("/data/train_2023/")));

RecordReader combined = new ConcatenatingRecordReader(r1, r2);
```
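The chaining behaviour described above is the same idea as concatenating iterators. The following plain-JDK sketch shows that idea in isolation. `ChainedIterator` is a hypothetical class written for illustration and says nothing about how `ConcatenatingRecordReader` is implemented internally.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class ChainedIterator<T> implements Iterator<T> {
    private final Iterator<Iterator<T>> sources;
    private Iterator<T> current = Collections.emptyIterator();

    ChainedIterator(List<Iterator<T>> iterators) {
        this.sources = iterators.iterator();
    }

    @Override
    public boolean hasNext() {
        // Move on to the next source whenever the current one is exhausted
        while (!current.hasNext() && sources.hasNext()) {
            current = sources.next();
        }
        return current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }

    public static void main(String[] args) {
        List<Integer> first = Arrays.asList(1, 2);
        List<Integer> second = Arrays.asList(3);
        Iterator<Integer> all =
            new ChainedIterator<Integer>(Arrays.asList(first.iterator(), second.iterator()));
        while (all.hasNext()) {
            System.out.println(all.next());
        }
    }
}
```

One consequence of this design is that records arrive grouped by source, so shuffling has to happen elsewhere if the sources should be mixed.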

### TransformProcessRecordReader

Wraps another reader and applies a `TransformProcess` to every record before returning it. Useful when you want to inline transformation without a separate executor step.

```java
RecordReader base = new CSVRecordReader(1, ',');
base.initialize(new FileSplit(new File("data.csv")));

// `schema` is the Schema describing the columns of data.csv, defined elsewhere
TransformProcess tp = new TransformProcess.Builder(schema)
    .removeColumns("id")
    .categoricalToOneHot("color")
    .build();

RecordReader transformed = new TransformProcessRecordReader(base, tp);
```

For sequence readers, use `TransformProcessSequenceRecordReader` instead.

## Adding Listeners

You can attach a `RecordListener` to any reader for debugging or monitoring:

```java
rr.addListener(new LogRecordListener());  // logs every record to SLF4J
```

Custom listeners implement the `RecordListener` interface:

```java
rr.addListener(new RecordListener() {
    @Override
    public void recordRead(RecordReader reader, Object record) {
        System.out.println("Read: " + record);
    }
});
```

## Choosing the Right Reader

| Your data                        | Use                                                                                                      |
| -------------------------------- | -------------------------------------------------------------------------------------------------------- |
| CSV or TSV files                 | `CSVRecordReader`                                                                                        |
| One sequence per CSV file        | `CSVSequenceRecordReader`                                                                                |
| JSON / XML / YAML files          | `JacksonRecordReader`                                                                                    |
| Log files with structured format | `RegexLineRecordReader` (single record per line) or `RegexSequenceRecordReader` (whole file as sequence) |
| Labeled image directories        | `ImageRecordReader`                                                                                      |
| Sparse feature vectors           | `LibSvmRecordReader` / `SVMLightRecordReader`                                                            |
| In-memory data (testing)         | `CollectionRecordReader`                                                                                 |
| Multiple files to concatenate    | `ConcatenatingRecordReader`                                                                              |
| Any reader + inline transforms   | `TransformProcessRecordReader`                                                                           |
