
# Transforms

A `TransformProcess` is an ordered list of operations applied sequentially to every record in a dataset. Each operation sees the record as it was left by the previous operation, allowing you to chain complex multi-step pipelines from simple building blocks.

The `TransformProcess.Builder` API is fluent and validates each operation against the current schema state at build time. If you reference a column that does not exist, apply a numeric operation to a string column, or produce a result that breaks the schema, you get an exception when calling `.build()` rather than later, while records are being processed.
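To make the build-time check concrete, here is a toy sketch in plain Java of the same idea: a builder that tracks which column names exist after each chained step and fails in `build()` if any step referenced a missing column. The class `SchemaCheckSketch` and its methods are hypothetical and exist only for illustration; this is not how DataVec implements validation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy illustration only: not DataVec code.
final class SchemaCheckSketch {
    // Column names present after the steps added so far
    private final Set<String> columns;
    // Problems found while steps were being chained
    private final List<String> errors = new ArrayList<>();

    SchemaCheckSketch(String... initialColumns) {
        this.columns = new LinkedHashSet<>(Arrays.asList(initialColumns));
    }

    SchemaCheckSketch removeColumns(String... names) {
        for (String name : names) {
            if (!columns.remove(name)) {
                errors.add("removeColumns: no column named '" + name + "'");
            }
        }
        return this;
    }

    SchemaCheckSketch renameColumn(String from, String to) {
        if (columns.remove(from)) {
            columns.add(to);
        } else {
            errors.add("renameColumn: no column named '" + from + "'");
        }
        return this;
    }

    // Fails here, before any record is processed, if a step was invalid
    List<String> build() {
        if (!errors.isEmpty()) {
            throw new IllegalStateException(String.join("; ", errors));
        }
        return new ArrayList<>(columns);
    }

    public static void main(String[] args) {
        System.out.println(new SchemaCheckSketch("id", "price")
                .removeColumns("id")
                .renameColumn("price", "amount")
                .build()); // prints [amount]
    }
}
```

Each step sees the column set left by the previous one, so renaming `id` after it has been removed would be reported at `build()`.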

## Building a TransformProcess

```java
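import java.util.Arrays;
import java.util.HashSet;

import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.condition.ConditionOp;
import org.datavec.api.transform.condition.column.*;
import org.datavec.api.transform.filter.ConditionFilter;
import org.datavec.api.transform.transform.time.DeriveColumnsFromTimeTransform;
import org.datavec.api.writable.DoubleWritable;
import org.joda.time.DateTimeFieldType;
import org.joda.time.DateTimeZone;

// inputDataSchema is a Schema describing the raw input columns, defined elsewhere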

TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
    // Remove columns that aren't features
    .removeColumns("CustomerID", "MerchantID")
    // Filter: keep only USA and CAN transactions
    .filter(new ConditionFilter(
        new CategoricalColumnCondition("MerchantCountryCode",
            ConditionOp.NotInSet, new HashSet<>(Arrays.asList("USA","CAN")))
    ))
    // Replace negative amounts with 0
    .conditionalReplaceValueTransform(
        "TransactionAmountUSD",
        new DoubleWritable(0.0),
        new DoubleColumnCondition("TransactionAmountUSD", ConditionOp.LessThan, 0.0)
    )
    // Parse a date string into epoch milliseconds
    // (Joda-Time pattern: dd = day of month; uppercase DD would mean day of year)
    .stringToTimeTransform("DateTimeString", "yyyy-MM-dd HH:mm:ss.SSS", DateTimeZone.UTC)
    .renameColumn("DateTimeString", "DateTime")
    // Extract hour-of-day as a new integer column
    .transform(new DeriveColumnsFromTimeTransform.Builder("DateTime")
        .addIntegerDerivedColumn("HourOfDay", DateTimeFieldType.hourOfDay())
        .build())
    .removeColumns("DateTime")
    .build();
```

## Column Management

### Removing Columns

```java
// Remove specific columns by name
.removeColumns("id", "timestamp", "raw_text")

// Remove all columns except the ones you want to keep
.removeAllColumnsExceptFor("feature1", "feature2", "label")
```

### Renaming Columns

```java
.renameColumn("OldName", "NewName")

// Rename multiple at once
.renameColumns(
    Arrays.asList("col_a", "col_b"),
    Arrays.asList("alpha", "beta")
)
```

### Reordering Columns

```java
// Named columns go first in the given order; remaining columns follow
.reorderColumns("label", "feature1", "feature2")
```

### Duplicating Columns

```java
.duplicateColumn("income", "income_backup")
```

### Adding Constant Columns

Useful when downstream systems require a fixed column that is the same for all records.

```java
.addConstantDoubleColumn("bias", 1.0)
.addConstantIntegerColumn("version", 3)
.addConstantLongColumn("batchID", 12345L)
.addConstantColumn("flag", ColumnType.String, new Text("active"))
```

## Type Conversions

### String to Numeric

```java
.convertToDouble("priceStr")      // "19.99" -> 19.99 (Double column)
.convertToInteger("countStr")     // "42" -> 42 (Integer column)
.convertToString("userId")        // any column -> String
```

### Categorical Encoding

**One-Hot Encoding**: Replaces a categorical column with N binary columns (one per state).

```java
// "color" with states ["red","green","blue"] becomes three Integer columns:
// "color[red]", "color[green]", "color[blue]"
.categoricalToOneHot("color")
.categoricalToOneHot("color", "size")   // multiple columns
```

**Integer Encoding**: Replaces the categorical column with a single integer (0 to numStates-1).

```java
.categoricalToInteger("priority")   // "low"->0, "medium"->1, "high"->2
```
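The difference between the two encodings can be sketched in plain Java. The class and method names below are hypothetical and this is not DataVec code; it only illustrates what each encoding produces for a single value, assuming the state list order defines the integer index.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of the two encodings; not DataVec code.
final class CategoricalEncodingSketch {
    // Integer encoding: position of the value in the ordered state list
    static int toIndex(List<String> states, String value) {
        int idx = states.indexOf(value);
        if (idx < 0) {
            throw new IllegalArgumentException("Unknown state: " + value);
        }
        return idx;
    }

    // One-hot encoding: one 0/1 slot per state, with a single 1 at the value's index
    static int[] toOneHot(List<String> states, String value) {
        int[] out = new int[states.size()];
        out[toIndex(states, value)] = 1;
        return out;
    }

    public static void main(String[] args) {
        List<String> priority = Arrays.asList("low", "medium", "high");
        System.out.println(toIndex(priority, "medium"));                   // prints 1
        System.out.println(Arrays.toString(toOneHot(priority, "medium"))); // prints [0, 1, 0]
    }
}
```

Integer encoding implies an ordering between states, while one-hot encoding does not, which is usually the deciding factor between the two.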

**Integer to Categorical**: The reverse — assign state names to integer values.

```java
.integerToCategorical("labelIdx", Arrays.asList("cat", "dog", "bird"))
// or with explicit mapping
.integerToCategorical("labelIdx", Map.of(0, "cat", 1, "dog", 2, "bird"))
```

**Integer to One-Hot**: Directly expand an integer column to one-hot, given known min/max.

```java
.integerToOneHot("classId", 0, 9)   // integers 0-9 -> 10 binary columns
```

**String to Categorical**: Attach known state names to an existing string column.

```java
.stringToCategorical("tier", Arrays.asList("bronze", "silver", "gold"))
```

## Mathematical Operations

### Scalar Operations on a Single Column

```java
// MathOp values: Add, Subtract, Multiply, Divide, Modulus, ReverseSubtract, ReverseDivide, ScalarMin, ScalarMax
.doubleMathOp("price", MathOp.Multiply, 1.1)      // 10% markup
.integerMathOp("count", MathOp.Add, 1)             // increment
.longMathOp("timestamp", MathOp.Subtract, 3600000L) // subtract 1 hour
.floatMathOp("weight", MathOp.Divide, 1000.0f)     // grams to kg
```

### Math Functions on a Column

```java
// MathFunction constants are upper case: ABS, CEIL, FLOOR, LOG, LOG10, EXP, SIN, COS, SQRT, ...
.doubleMathFunction("logReturn", MathFunction.LOG)
.floatMathFunction("angle", MathFunction.SIN)
```

### Derived Column from Multiple Columns

```java
// Adds "totalRevenue" = q1Rev + q2Rev + q3Rev + q4Rev
.doubleColumnsMathOp("totalRevenue", MathOp.Add, "q1Rev", "q2Rev", "q3Rev", "q4Rev")
.integerColumnsMathOp("totalCount", MathOp.Add, "clicks", "impressions")
```

### Time Arithmetic

```java
// Shift a time column by a quantity
.timeMathOp("eventTime", MathOp.Add, 30, TimeUnit.MINUTES)
```

## String Operations

### Remove Whitespace

```java
.stringRemoveWhitespaceTransform("zipCode")
```

### Append to String

```java
.appendStringColumnTransform("urlPath", "?ref=datavec")
```

### Map Replacements

Replace specific string values with new values:

```java
Map<String, String> replacements = new HashMap<>();
replacements.put("n/a", "");
replacements.put("N/A", "");
replacements.put("none", "");

.stringMapTransform("description", replacements)
```

## Time Transforms

### Parse String to Time

```java
// Converts "2024-01-15 14:32:01.000" -> epoch milliseconds as Long
// Joda-Time patterns are case-sensitive: yyyy = year, MM = month, dd = day of month
.stringToTimeTransform("dateStr", "yyyy-MM-dd HH:mm:ss.SSS", DateTimeZone.UTC)

// With locale
.stringToTimeTransform("dateStr", "MMM dd, yyyy", DateTimeZone.UTC, Locale.ENGLISH)
```
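Pattern letters are case-sensitive: lowercase `dd` is day of month, while uppercase `DD` is day of year, so the two only agree for dates in January. The sketch below shows the string-to-epoch conversion using `java.time` from the JDK as a stand-in, since it needs no extra dependency. DataVec itself takes Joda-Time types, and the class and method names here are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Uses java.time from the JDK as a stand-in; DataVec itself takes Joda-Time types.
final class ParseTimeSketch {
    static long toEpochMillisUtc(String text, String pattern) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern(pattern);
        // Interpret the wall-clock value as UTC, then convert to epoch milliseconds
        return LocalDateTime.parse(text, fmt).toInstant(ZoneOffset.UTC).toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(
            toEpochMillisUtc("2024-01-15 14:32:01.000", "yyyy-MM-dd HH:mm:ss.SSS"));
    }
}
```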

### Derive Integer Columns from Time

Extract date/time components from a Time column as new Integer columns:

```java
.transform(new DeriveColumnsFromTimeTransform.Builder("eventTime")
    .addIntegerDerivedColumn("year", DateTimeFieldType.year())
    .addIntegerDerivedColumn("month", DateTimeFieldType.monthOfYear())
    .addIntegerDerivedColumn("dayOfWeek", DateTimeFieldType.dayOfWeek())
    .addIntegerDerivedColumn("hourOfDay", DateTimeFieldType.hourOfDay())
    .build())
```

## Conditional Transforms

### Replace a Value When Condition Is True

```java
// If amount < 0, replace with 0.0
.conditionalReplaceValueTransform(
    "amount",
    new DoubleWritable(0.0),
    new DoubleColumnCondition("amount", ConditionOp.LessThan, 0.0)
)
```

### Replace with One of Two Values Based on Condition

```java
// If isActive is true, use 1; otherwise use 0
.conditionalReplaceValueTransformWithDefault(
    "activeFlag",
    new IntWritable(1),    // yesVal
    new IntWritable(0),    // noVal
    new BooleanColumnCondition("isActive", ConditionOp.Equal, true)
)
```

### Copy a Value from Another Column Conditionally

```java
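// When `condition` holds for a record, copy the value of "override" into "price".
// `condition` is any Condition instance built beforehand (for example, one that
// tests whether "override" holds a usable value).
.conditionalCopyValueTransform("price", "override", condition)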
```

## Filtering

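Filters are applied inline during `TransformProcess` execution. A record is removed when the condition evaluates to true, so write the condition to describe the records you want to drop, not the ones you want to keep.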

```java
// Remove records where country is not in the allowed set
.filter(new ConditionFilter(
    new CategoricalColumnCondition("country",
        ConditionOp.NotInSet, new HashSet<>(Arrays.asList("USA","CAN","GBR")))
))

// Shorthand: pass condition directly
.filter(new DoubleColumnCondition("price", ConditionOp.LessThan, 0.0))
```
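Because the condition describes what to remove, a "not in set" condition keeps only the members of the set. A minimal plain-Java sketch of that semantics follows; the names are hypothetical and this is not DataVec code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Plain-Java illustration of "remove when the condition is true"; not DataVec code.
final class FilterSketch {
    // Drops every record for which removeIf returns true; the rest pass through unchanged
    static <T> List<T> filter(List<T> records, Predicate<T> removeIf) {
        return records.stream()
                .filter(removeIf.negate())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> allowed = new HashSet<>(Arrays.asList("USA", "CAN", "GBR"));
        List<String> countries = Arrays.asList("USA", "FRA", "GBR", "MEX");
        // "Not in the allowed set" is the removal condition, so only allowed values remain
        System.out.println(filter(countries, c -> !allowed.contains(c))); // prints [USA, GBR]
    }
}
```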

## Normalization

DataVec includes a built-in `normalize` operation that applies standard statistical normalization inline in the transform, given a pre-computed `DataAnalysis` object. The type parameter controls which normalization strategy to apply.

```java
DataAnalysis da = AnalyzeLocal.analyze(schema, reader, 100);

.normalize("income", Normalize.Standardize, da)    // zero mean, unit variance
.normalize("age", Normalize.MinMax, da)             // scale to [0,1]
.normalize("score", Normalize.Log2Mean, da)         // log2 normalization
```
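For reference, the textbook formulas behind standardization and min-max scaling look like this in plain Java. This is a sketch of the math only: the class and method names are hypothetical, the statistics are passed in by hand instead of coming from a `DataAnalysis`, and edge cases such as zero variance are not handled.

```java
// Textbook formulas only; not DataVec code.
final class NormalizeSketch {
    // Standardize: subtract the mean, divide by the standard deviation
    static double standardize(double x, double mean, double stdev) {
        return (x - mean) / stdev;
    }

    // Min-max: map [min, max] linearly onto [0, 1]
    static double minMax(double x, double min, double max) {
        return (x - min) / (max - min);
    }

    public static void main(String[] args) {
        System.out.println(standardize(12.0, 10.0, 2.0)); // prints 1.0
        System.out.println(minMax(5.0, 0.0, 10.0));       // prints 0.5
    }
}
```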

For full control over normalizer fitting and serialization, use `NormalizerStandardize` or `NormalizerMinMaxScaler` as a pre-processor on the `DataSetIterator` instead. See [Normalization](/en-1.0.0-rewrite/datavec/normalization.md).

## Sorting and Ranking

### Calculate Sorted Rank

Adds a new Long column containing the rank (0 to N-1) of each record when sorted by a given column:

```java
.calculateSortedRank(
    "scoreRank",          // new column name
    "score",              // sort column
    new DoubleWritableComparator(),  // comparator
    false                 // false = descending (highest score = rank 0)
)
```
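The rank assignment can be illustrated in plain Java: sort the record indices by value, then give each record its position in that order. The names below are hypothetical and this is not DataVec code; how DataVec breaks ties is not covered here.

```java
import java.util.Arrays;
import java.util.Comparator;

// Plain-Java illustration of ranks 0..N-1; not DataVec code.
final class SortedRankSketch {
    static long[] sortedRank(double[] values, boolean ascending) {
        // order[r] = index of the record that ends up at position r after sorting
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Comparator<Integer> byValue = Comparator.comparingDouble(i -> values[i]);
        Arrays.sort(order, ascending ? byValue : byValue.reversed());

        // Invert the ordering: each record gets the position it was sorted into
        long[] rank = new long[values.length];
        for (int r = 0; r < order.length; r++) {
            rank[order[r]] = r;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Descending: the highest score (9.0) gets rank 0
        System.out.println(Arrays.toString(
            sortedRank(new double[] {3.5, 9.0, 1.0}, false))); // prints [1, 0, 2]
    }
}
```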

## Sequence Transforms

### Convert Records to Sequences

Group flat records into sequences by a key column:

```java
// Group by "userID", order within each group by "timestamp"
.convertToSequence("userID", new NumericalColumnComparator("timestamp", true))

// Group by multiple keys
.convertToSequence(
    Arrays.asList("userID", "sessionID"),
    new NumericalColumnComparator("timestamp", true)
)
```
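Conceptually, this conversion is a group-by on the key followed by a sort within each group. The plain-Java sketch below shows that idea with a hypothetical `Event` type; it is not DataVec code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of "group by key, order within group"; not DataVec code.
final class ToSequenceSketch {
    static final class Event {
        final String userId;
        final long timestamp;

        Event(String userId, long timestamp) {
            this.userId = userId;
            this.timestamp = timestamp;
        }
    }

    static Map<String, List<Event>> toSequences(List<Event> records) {
        // Group flat records by key, keeping keys in first-seen order
        Map<String, List<Event>> byUser = new LinkedHashMap<>();
        for (Event e : records) {
            byUser.computeIfAbsent(e.userId, k -> new ArrayList<>()).add(e);
        }
        // Order each group by timestamp, ascending
        for (List<Event> seq : byUser.values()) {
            seq.sort(Comparator.comparingLong(e -> e.timestamp));
        }
        return byUser;
    }

    public static void main(String[] args) {
        Map<String, List<Event>> seqs = toSequences(Arrays.asList(
            new Event("a", 3L), new Event("b", 1L), new Event("a", 1L)));
        System.out.println(seqs.keySet()); // prints [a, b]
    }
}
```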

### Convert Sequences Back to Records

Flatten sequences into individual records:

```java
.convertFromSequence()
```

### Split Sequences

Break large sequences into smaller ones:

```java
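// Split into sub-sequences of at most 50 time steps
// (false = do not force equal lengths; the last piece may be shorter)
.splitSequence(new SplitMaxLengthSequence(50, false))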

// Split at specific values in a column
.splitSequence(new ConditionSequenceSplit(
    new IntegerColumnCondition("eventType", ConditionOp.Equal, 0)
))
```
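Fixed-length splitting, as in the first call above, amounts to cutting a sequence into consecutive chunks. The plain-Java sketch below assumes the last chunk is simply shorter when the length does not divide evenly; the names are hypothetical and this is not DataVec code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of fixed-length splitting; not DataVec code.
final class SplitSequenceSketch {
    static <T> List<List<T>> splitEvery(List<T> sequence, int maxSteps) {
        if (maxSteps <= 0) {
            throw new IllegalArgumentException("maxSteps must be positive");
        }
        List<List<T>> pieces = new ArrayList<>();
        for (int start = 0; start < sequence.size(); start += maxSteps) {
            int end = Math.min(start + maxSteps, sequence.size());
            // Copy the chunk so each piece is independent of the original list
            pieces.add(new ArrayList<>(sequence.subList(start, end)));
        }
        return pieces;
    }

    public static void main(String[] args) {
        System.out.println(splitEvery(Arrays.asList(1, 2, 3, 4, 5), 2)); // prints [[1, 2], [3, 4], [5]]
    }
}
```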

### Trim Sequences

Remove time steps from the start or end:

```java
.trimSequence(5, true)    // remove first 5 time steps
.trimSequence(5, false)   // remove last 5 time steps
```

### Offset Sequence Columns

Shift column values forward or backward in time (for creating prediction targets):

```java
// Shift "target" column forward 1 step (use current features to predict next target)
.offsetSequence(Arrays.asList("target"), 1,
    SequenceOffsetTransform.OperationType.InPlace)
```

### Moving Window Reduction

Compute a rolling statistic over the last N time steps:

```java
// Adds a new column holding the mean of the last 5 values of "price"
// (the new column's name is generated by the transform)
.sequenceMovingWindowReduce("price", 5, ReduceOp.Mean)
```
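A rolling mean can be sketched in plain Java as follows. The names are hypothetical and this is not DataVec code. In particular, the handling of the first few steps, where fewer than N values exist, is an assumption made here (use whatever values are available); check the transform's edge-case options for the actual behavior.

```java
import java.util.Arrays;

// Plain-Java illustration of a rolling mean; not DataVec code.
final class MovingMeanSketch {
    static double[] movingMean(double[] x, int lookback) {
        double[] out = new double[x.length];
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += x[i];
            if (i >= lookback) {
                sum -= x[i - lookback]; // drop the value that left the window
            }
            // Assumption: early steps average over however many values exist so far
            int windowSize = Math.min(i + 1, lookback);
            out[i] = sum / windowSize;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
            movingMean(new double[] {1, 2, 3, 4, 5}, 2))); // prints [1.0, 1.5, 2.5, 3.5, 4.5]
    }
}
```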

### Reduce Sequences

Aggregate a whole sequence into a single record:

```java
Reducer reducer = new Reducer.Builder(ReduceOp.Mean)
    .meanColumns("feature1", "feature2")
    .maxColumn("maxSignal")   // note: the builder method is the singular maxColumn
    .countColumns("numEvents")
    .build();

.reduceSequence(reducer)
```

## Applying Custom Transforms

If none of the built-in transforms meet your needs, implement the `Transform` interface directly and add it via `.transform(yourCustomTransform)`.

```java
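public class MyTransform implements Transform {
    @Override
    public Schema transform(Schema inputSchema) {
        // Return the schema as it looks after this transform.
        // Returning the input unchanged is only valid if no columns are added, removed, or retyped.
        return inputSchema;
    }

    @Override
    public List<Writable> map(List<Writable> writables) {
        // Apply your custom logic and return the new record
        // (placeholder: passes the record through unchanged)
        return writables;
    }

    // ... other interface methods
}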

// Register in builder
.transform(new MyTransform())
```

## Executing the Transform

After building a `TransformProcess`, execute it with `LocalTransformExecutor` or `SparkTransformExecutor`:

```java
import org.datavec.local.transforms.LocalTransformExecutor;

List<List<Writable>> processed = LocalTransformExecutor.execute(originalData, tp);
```

See [Executors](/en-1.0.0-rewrite/datavec/executors.md) for full details.

## Debugging Transforms

Print the schema state after each step to diagnose unexpected results:

```java
int numSteps = tp.getActionList().size();
for (int i = 0; i < numSteps; i++) {
    System.out.println("After step " + i + " (" + tp.getActionList().get(i) + "):");
    System.out.println(tp.getSchemaAfterStep(i));
}

System.out.println("Final schema:");
System.out.println(tp.getFinalSchema());
```

This is particularly helpful when a later transform fails because the schema no longer matches expectations after earlier steps.

