# Schema

A `Schema` is the cornerstone of every DataVec transform pipeline. It defines the structure of your tabular data: how many columns there are, what each column is named, and what type of values it holds. The `TransformProcess` uses the schema to validate each operation when the pipeline is built, so mismatches between your data and your transformation logic surface before any records are processed.

## Why Schemas Matter

Data is rarely clean. Columns have unexpected types, values fall outside expected ranges, and columns that look numeric turn out to hold categorical codes. By declaring a schema explicitly, you give DataVec:

1. A reference for type-checking each transform operation
2. A source of truth for valid value ranges, which filters can use to remove bad records
3. A serializable representation of your data layout that can travel with your model to production

## Building a Schema

Use `Schema.Builder` to construct a schema. The builder provides a fluent API where each `addColumn*` call appends a column definition.

```java
import org.datavec.api.transform.schema.Schema;
import org.joda.time.DateTimeZone;
import java.util.Arrays;

Schema schema = new Schema.Builder()
    .addColumnString("customerID")
    .addColumnString("email")
    .addColumnInteger("age")
    .addColumnDouble("accountBalanceUSD", 0.0, null, false, false)
    .addColumnCategorical("country", Arrays.asList("USA", "CAN", "GBR", "DEU", "FRA"))
    .addColumnTime("registrationDate", DateTimeZone.UTC)
    .addColumnCategorical("tier", Arrays.asList("bronze", "silver", "gold", "platinum"))
    .build();
```

The schema above declares seven columns in order. Order matters: column index 0 is `customerID`, index 1 is `email`, and so on. Any transform that references a column by index uses this ordering.

## Column Types

### Integer

A 32-bit signed integer. Optionally constrained to a min/max range.

```java
.addColumnInteger("age")
.addColumnInteger("quantity", 0, 10000)   // min 0, max 10000
```

Add multiple integer columns with a pattern:

```java
// Adds columns "feature_0", "feature_1", ..., "feature_9"
.addColumnsInteger("feature_%d", 0, 9)
```
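The pattern form can be pictured as a loop over an inclusive index range. The following is a plain-Java sketch of that expansion, assuming the pattern is filled in with `String.format`-style substitution; the class `ColumnNamePattern` is hypothetical and is not part of DataVec.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not DataVec code.
class ColumnNamePattern {
    // Expands a printf-style pattern over an inclusive index range.
    static List<String> expand(String pattern, int minIdxInclusive, int maxIdxInclusive) {
        List<String> names = new ArrayList<>();
        for (int i = minIdxInclusive; i <= maxIdxInclusive; i++) {
            names.add(String.format(Locale.ROOT, pattern, i));
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints the ten names feature_0 through feature_9
        System.out.println(expand("feature_%d", 0, 9));
    }
}
```

Both bounds are inclusive, so `0, 9` yields ten columns rather than nine.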

### Long

A 64-bit signed integer. Useful for IDs, timestamps stored as epoch milliseconds, or any value that overflows int.

```java
.addColumnLong("transactionID")
.addColumnLong("eventTimestampMs", 0L, null)   // must be non-negative
```

### Double

A 64-bit floating-point number. By default, NaN and infinite values are rejected. You can allow them explicitly.

```java
.addColumnDouble("price")
.addColumnDouble("ratio", 0.0, 1.0)                          // bounded
.addColumnDouble("logLoss", null, null, false, false)        // unbounded, no NaN, no Inf
.addColumnDouble("rawScore", null, null, true, false)        // allow NaN, no Inf
```
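To make the four options concrete, here is a plain-Java sketch of the kind of check those arguments describe. It assumes the bounds are inclusive and that `null` means "no bound"; the class `DoubleColumnCheck` is hypothetical and only mirrors the argument order used above, not DataVec's internal validation code.

```java
// Hypothetical helper for illustration; not DataVec code.
class DoubleColumnCheck {
    // Returns true if the value satisfies the declared constraints.
    // Assumption: min and max are inclusive, and null means unbounded.
    static boolean isValid(double value, Double min, Double max,
                           boolean allowNaN, boolean allowInfinite) {
        if (Double.isNaN(value)) {
            return allowNaN;
        }
        if (Double.isInfinite(value)) {
            return allowInfinite;
        }
        if (min != null && value < min) {
            return false;
        }
        if (max != null && value > max) {
            return false;
        }
        return true;
    }
}
```

With the `ratio` constraints (`0.0, 1.0`), a value of `1.5` would fail this check, while `rawScore` (NaN allowed, infinity not) would accept `Double.NaN` and reject `Double.POSITIVE_INFINITY`.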

### Float

A 32-bit floating-point number with the same options as Double.

```java
.addColumnFloat("embedding_0")
.addColumnFloat("embedding_1", null, null, false, false)

// Convenient multi-column add: "emb_0", "emb_1", ..., "emb_127"
.addColumnsFloat("emb_%d", 0, 127)
```

### String

Arbitrary text. Optionally constrained by a regex, minimum length, and maximum length.

```java
.addColumnString("description")
.addColumnString("zipCode", "\\d{5}", 5, 5)      // exactly 5 digits
.addColumnString("username", null, 3, 32)         // length 3 to 32
```
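The regex and length constraints can be read as three independent checks. The sketch below shows one way to combine them in plain Java, assuming the regex must match the whole value and that `null` disables a constraint; `StringColumnCheck` is a hypothetical name, not a DataVec class.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not DataVec code.
class StringColumnCheck {
    // Assumption: the regex must match the entire value; null disables a constraint.
    static boolean isValid(String value, String regex, Integer minLength, Integer maxLength) {
        if (value == null) {
            return false;
        }
        if (minLength != null && value.length() < minLength) {
            return false;
        }
        if (maxLength != null && value.length() > maxLength) {
            return false;
        }
        return regex == null || Pattern.matches(regex, value);
    }
}
```

Under these assumptions, `"90210"` passes the `zipCode` constraints, while `"9021"` fails on length and `"9021a"` fails on the regex.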

Add multiple string columns at once:

```java
.addColumnsString("field_0", "field_1", "field_2")
// or
.addColumnsString("field_%d", 0, 2)
```

### Categorical

A string column with a fixed, known set of allowed values (the "state names"). Categorical columns can be converted to one-hot encoding or integer encoding via `TransformProcess`.

```java
.addColumnCategorical("color", Arrays.asList("red", "green", "blue"))
.addColumnCategorical("priority", "low", "medium", "high")   // varargs form
```

Every value in the column must match one of the declared state names; values outside this set are considered invalid by schema validation.
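As a rough picture of the two encodings mentioned above, the following plain-Java sketch maps a value to its position in the declared state names and to a one-hot vector. It assumes the integer index follows declaration order; `CategoricalEncoding` is a hypothetical class and does not reproduce the `TransformProcess` implementation.

```java
import java.util.List;

// Hypothetical helper for illustration; not DataVec code.
class CategoricalEncoding {
    // Integer encoding: position of the value in the declared state names.
    static int toInteger(List<String> stateNames, String value) {
        int idx = stateNames.indexOf(value);
        if (idx < 0) {
            throw new IllegalArgumentException("Unknown state: " + value);
        }
        return idx;
    }

    // One-hot encoding: a vector with a single 1 at the value's position.
    static int[] toOneHot(List<String> stateNames, String value) {
        int[] out = new int[stateNames.size()];
        out[toInteger(stateNames, value)] = 1;
        return out;
    }
}
```

A value outside the declared set has no position, which is why the sketch rejects it instead of guessing an index.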

### Time

A time column is stored internally as epoch milliseconds (a `Long`), but carries timezone information and optional min/max bounds. For data where timestamps arrive as human-readable strings, use a `String` column plus `stringToTimeTransform` in your `TransformProcess`.

```java
.addColumnTime("eventTime", DateTimeZone.UTC)
.addColumnTime("eventTime", DateTimeZone.UTC, 0L, null)   // alternative form: minimum of 0 ms, i.e. at or after the epoch
```
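To show what "epoch milliseconds plus a timezone" means in practice, here is a small sketch that converts a timestamp string to epoch milliseconds. It uses the JDK's `java.time` API rather than the Joda-Time classes used by DataVec, purely so the example has no dependencies; the class and method names are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration; uses java.time, not Joda-Time or DataVec.
class EpochMillisExample {
    // Interprets the text as a local date-time in the given zone
    // and returns the corresponding epoch milliseconds.
    static long toEpochMillis(String text, String pattern, ZoneId zone) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern(pattern);
        return LocalDateTime.parse(text, fmt).atZone(zone).toInstant().toEpochMilli();
    }
}
```

The same text maps to different epoch values in different zones: midnight on 2020-01-01 in a UTC+1 zone is one hour (3,600,000 ms) earlier than midnight UTC.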

### NDArray

An embedded multidimensional array. Use -1 for variable-length dimensions.

```java
.addColumnNDArray("imageVector", new long[]{1, 28, 28})
.addColumnNDArray("embedding", new long[]{-1})    // variable length
```
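One way to read a declared shape containing -1 is as a per-dimension wildcard. The sketch below checks an actual shape against a declared one under that assumption; it is a plain-Java illustration with a hypothetical class name and may not match how DataVec validates NDArray columns internally.

```java
// Hypothetical helper for illustration; not DataVec code.
class ShapeCheck {
    // Assumption: -1 in the declared shape accepts any size in that dimension,
    // and the number of dimensions must match exactly.
    static boolean matches(long[] declared, long[] actual) {
        if (declared.length != actual.length) {
            return false;
        }
        for (int i = 0; i < declared.length; i++) {
            if (declared[i] != -1 && declared[i] != actual[i]) {
                return false;
            }
        }
        return true;
    }
}
```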

### Boolean

A true/false column.

```java
.addColumnBoolean("isActive")
```

### Bytes

A raw byte array. Primarily used for binary blob columns.

```java
.addColumnBytes("rawPayload")
```

## Inspecting a Schema

```java
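import org.datavec.api.transform.ColumnType;
import org.datavec.api.transform.metadata.ColumnMetaData;
import org.datavec.api.transform.schema.Schema;

Schema schema = ...;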

int n = schema.numColumns();                         // number of columns
String name = schema.getName(0);                     // name of column 0
ColumnType type = schema.getType("customerID");      // type by name
ColumnMetaData meta = schema.getMetaData("age");     // full metadata
int idx = schema.getIndexOfColumn("email");          // index by name
boolean exists = schema.hasColumn("missingCol");     // existence check
```

## Serializing a Schema

Schemas serialize to JSON or YAML, which allows you to save and reload them alongside your trained model.

```java
// Serialize
String json = schema.toJson();
String yaml = schema.toYaml();

// Deserialize
Schema reloaded = Schema.fromJson(json);
Schema reloadedYaml = Schema.fromYaml(yaml);
```

Storing the schema alongside your `TransformProcess` (also JSON-serializable) means you can reconstruct the full preprocessing pipeline at inference time without re-specifying it in code.

## SequenceSchema

`SequenceSchema` extends `Schema` for time series and sequence data. It has the same column definitions but is used with sequence-aware readers (like `CSVSequenceRecordReader`) and sequence transforms.

```java
import org.datavec.api.transform.schema.SequenceSchema;

SequenceSchema seqSchema = new SequenceSchema.Builder()
    .addColumnLong("timestampMs")
    .addColumnDouble("sensorValue")
    .addColumnCategorical("event", Arrays.asList("none", "alert", "critical"))
    .build();
```

`SequenceSchema` is passed to `TransformProcess.Builder` just like a regular schema:

```java
import org.datavec.api.transform.TransformProcess;

TransformProcess tp = new TransformProcess.Builder(seqSchema)
    .categoricalToInteger("event")
    .build();
```

## Schema Inference

If you have a sample record and want DataVec to guess a schema rather than declaring one manually, use the static inference methods on `Schema`:

```java
import org.datavec.api.writable.Writable;
import java.util.ArrayList;
import java.util.List;

// `reader` is assumed to be an already-initialized RecordReader

// Infer from a single record (List<Writable>)
List<Writable> sampleRecord = reader.next();
Schema inferred = Schema.infer(sampleRecord);

// Infer from multiple records (List<List<Writable>>)
List<List<Writable>> samples = new ArrayList<>();
for (int i = 0; i < 100 && reader.hasNext(); i++) {
    samples.add(reader.next());
}
Schema inferredFromSamples = Schema.inferMultiple(samples);
```

Inferred schemas use column names `0`, `1`, `2`, ... and can only detect `Integer`, `Long`, `Double`, and `String` types. Any column that cannot be parsed as a number becomes `String`. Always review and refine an inferred schema before using it in production. In particular, a column that is really categorical but encoded as small integers will be inferred as numeric.
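The limitation is easier to see with a toy version of parse-based type guessing. The sketch below tries progressively wider numeric types and falls back to `String`; it is a plain-Java illustration with a hypothetical class name, not the inference code DataVec actually runs.

```java
// Hypothetical helper for illustration; not DataVec's inference logic.
class ColumnTypeGuess {
    // Tries Integer, then Long, then Double; anything else is treated as String.
    static String guess(String raw) {
        if (raw == null) {
            return "String";
        }
        try {
            Integer.parseInt(raw);
            return "Integer";
        } catch (NumberFormatException ignored) {
            // not an int, fall through
        }
        try {
            Long.parseLong(raw);
            return "Long";
        } catch (NumberFormatException ignored) {
            // not a long, fall through
        }
        try {
            Double.parseDouble(raw);
            return "Double";
        } catch (NumberFormatException ignored) {
            // not a number at all
        }
        return "String";
    }
}
```

A tier code stored as `"2"` is guessed as `Integer` here, even if it really stands for a category such as `gold`. Parsing alone cannot tell the two apart, which is why a manual review is needed.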

For CSV files that have a header row and at least one data row, use `InferredSchema`:

```java
import org.datavec.api.transform.schema.InferredSchema;

// Reads header line for column names, first data line to infer types
Schema schema = new InferredSchema("/path/to/data.csv").build();
```

## Joining Two Schemas

When you have two datasets with a common key column, use `Join` to combine them:

```java
import org.datavec.api.transform.join.Join;

Schema customerSchema = new Schema.Builder()
    .addColumnLong("customerID")
    .addColumnString("customerName")
    .addColumnCategorical("country", Arrays.asList("USA", "GBR", "DEU"))
    .build();

Schema purchaseSchema = new Schema.Builder()
    .addColumnLong("customerID")
    .addColumnTime("purchaseTime", DateTimeZone.UTC)
    .addColumnDouble("amountUSD")
    .build();

Join join = new Join.Builder(Join.JoinType.Inner)
    .setJoinColumns("customerID")
    .setSchemas(customerSchema, purchaseSchema)
    .build();
```

Join types:

* `Inner` — only records with matching keys in both datasets
* `LeftOuter` — all records from the left dataset; right side fills `NullWritable` for unmatched rows
* `RightOuter` — all records from the right dataset; left side fills `NullWritable` for unmatched rows
* `FullOuter` — all records from both datasets; unmatched sides fill `NullWritable`

After building the `Join`, execute it with `LocalTransformExecutor.executeJoin` or `SparkTransformExecutor.executeJoin`. See [Executors](/en-1.0.0-rewrite/datavec/executors.md).
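The difference between the four join types comes down to which unmatched rows are kept. The following plain-Java sketch illustrates that with two maps keyed by a unique ID, using the string `null` as a stand-in for `NullWritable`. It assumes one row per key on each side, which real joins do not require, and `JoinSketch` is a hypothetical class rather than the executor's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not DataVec code.
class JoinSketch {
    // Inner:      keepUnmatchedLeft = false, keepUnmatchedRight = false
    // LeftOuter:  keepUnmatchedLeft = true,  keepUnmatchedRight = false
    // RightOuter: keepUnmatchedLeft = false, keepUnmatchedRight = true
    // FullOuter:  keepUnmatchedLeft = true,  keepUnmatchedRight = true
    static List<String> join(Map<Long, String> left, Map<Long, String> right,
                             boolean keepUnmatchedLeft, boolean keepUnmatchedRight) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<Long, String> l : left.entrySet()) {
            boolean matched = right.containsKey(l.getKey());
            if (matched || keepUnmatchedLeft) {
                String rightValue = matched ? right.get(l.getKey()) : "null";
                rows.add(l.getKey() + "," + l.getValue() + "," + rightValue);
            }
        }
        if (keepUnmatchedRight) {
            for (Map.Entry<Long, String> r : right.entrySet()) {
                if (!left.containsKey(r.getKey())) {
                    rows.add(r.getKey() + ",null," + r.getValue());
                }
            }
        }
        return rows;
    }
}
```

With customers `{1, 2}` on the left and purchases for `{2, 3}` on the right, the sketch returns one row for an inner join, two for either outer variant, and three for a full outer join.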

## Common Mistakes

**Column order mismatch**: The schema column order must exactly match the order that your `RecordReader` produces values. If your CSV has columns in a different order than your schema, transforms will silently operate on the wrong columns.
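A cheap guard against this is to compare the CSV header with the expected column names before running any transforms. The sketch below does that in plain Java; the class name is hypothetical, the comma splitting is deliberately naive (no quoted fields), and the expected list is assumed to be supplied by the caller, for example from the schema's column names.

```java
import java.util.List;

// Hypothetical helper for illustration; not DataVec code.
class HeaderOrderCheck {
    // Returns the index of the first position where the header and the
    // expected names disagree, or -1 if they match exactly.
    static int firstMismatch(List<String> expected, String headerLine) {
        String[] actual = headerLine.split(",", -1);
        int n = Math.max(expected.size(), actual.length);
        for (int i = 0; i < n; i++) {
            if (i >= expected.size() || i >= actual.length
                    || !expected.get(i).equals(actual[i].trim())) {
                return i;
            }
        }
        return -1;
    }
}
```

Failing fast on a non-negative result turns a silent wrong-column bug into an explicit error at load time.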

**Declaring categorical columns for numeric data**: If a column already contains the integers 0, 1, 2, 3 and you want to use it as a label, declare it as `Integer`, not `Categorical`. Apply the `categoricalToInteger` transform only when the raw column actually contains strings.

**Forgetting timezone on Time columns**: Time columns always require a timezone. Use `DateTimeZone.UTC` as the default unless your data is explicitly in a local timezone.

