# Filters

Filters remove records from a dataset during `TransformProcess` execution. Each filter is evaluated against every record that reaches it; if the filter's `removeExample` method returns `true`, that record is dropped from the output.

Filters are the primary mechanism for data cleaning. Real-world datasets commonly contain records with missing values, out-of-range numbers, wrong column counts, or constraint violations. Rather than letting these records corrupt your training data, filters remove them early in the pipeline.

## Using Filters in a TransformProcess

Filters are applied inline as steps in a `TransformProcess`:

```java
TransformProcess tp = new TransformProcess.Builder(schema)
    // First filter: remove records from countries outside our scope
    .filter(new ConditionFilter(
        new CategoricalColumnCondition("country",
            ConditionOp.NotInSet,
            new HashSet<>(Arrays.asList("USA", "CAN", "GBR")))
    ))
    // Second filter: remove records with invalid values in numeric columns
    .filter(new FilterInvalidValues("age", "income"))
    .build();
```

Filters are applied in order. A record that survives the first filter is evaluated by the second, and so on. Multiple filters compound: a record is kept only if it passes every filter.
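
The ordering rule can be modelled in plain Java without DataVec. The sketch below is an illustration only, not DataVec's implementation: `FilterChainSketch` and `apply` are hypothetical names, records are simplified to lists of strings, and each filter is a `Predicate` that returns `true` to remove a record, mirroring `removeExample`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical plain-Java model of sequential filtering; not DataVec code.
class FilterChainSketch {

    // Each predicate returns true when the record should be REMOVED.
    static List<List<String>> apply(List<List<String>> records,
                                    List<Predicate<List<String>>> filters) {
        List<List<String>> kept = new ArrayList<>();
        for (List<String> record : records) {
            boolean remove = false;
            for (Predicate<List<String>> filter : filters) {
                if (filter.test(record)) {
                    remove = true;   // the first matching filter drops the record
                    break;           // later filters never see it
                }
            }
            if (!remove) {
                kept.add(record);    // kept only if no filter asked for removal
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> allowed = Arrays.asList("USA", "CAN", "GBR");
        // Filter 1: remove records whose country (index 0) is out of scope
        Predicate<List<String>> outOfScope = r -> !allowed.contains(r.get(0));
        // Filter 2: remove records whose age (index 1) is not an integer
        Predicate<List<String>> badAge = r -> !r.get(1).matches("-?\\d+");

        List<List<String>> records = Arrays.asList(
            Arrays.asList("USA", "34"),
            Arrays.asList("FRA", "41"),
            Arrays.asList("CAN", "n/a"));

        System.out.println(apply(records, Arrays.asList(outOfScope, badAge)));
    }
}
```

In this model only the first record survives: the second is dropped by the first filter and never reaches the second, and the third passes the first filter but is dropped by the second.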

## ConditionFilter

`ConditionFilter` wraps any `Condition` into a filter. The semantics are:

* If the condition returns **true**: the record is **removed**
* If the condition returns **false**: the record is **kept**

```java
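import org.datavec.api.transform.condition.BooleanCondition;
import org.datavec.api.transform.condition.ConditionOp;
import org.datavec.api.transform.condition.column.CategoricalColumnCondition;
import org.datavec.api.transform.condition.column.DoubleColumnCondition;
import org.datavec.api.transform.condition.column.InfiniteColumnCondition;
import org.datavec.api.transform.condition.column.NaNColumnCondition;
import org.datavec.api.transform.filter.ConditionFilter;
import org.datavec.api.transform.filter.Filter;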

// Remove records where "price" is negative
Filter negativePriceFilter = new ConditionFilter(
    new DoubleColumnCondition("price", ConditionOp.LessThan, 0.0)
);

// Remove records where "status" is "deleted"
Filter deletedFilter = new ConditionFilter(
    new CategoricalColumnCondition("status", ConditionOp.Equal, "deleted")
);

// Remove records where score is NaN OR Infinite
Filter badScoreFilter = new ConditionFilter(
    BooleanCondition.OR(
        new NaNColumnCondition("score"),
        new InfiniteColumnCondition("score")
    )
);
```

Any `Condition`, including compound `AND`, `OR`, and `NOT` combinations, can be wrapped in a `ConditionFilter`. See [Conditions](/en-1.0.0-rewrite/datavec/conditions.md) for the full condition API.

As a shorthand, `TransformProcess.Builder.filter(Condition condition)` automatically wraps the condition in a `ConditionFilter`:

```java
// Equivalent to .filter(new ConditionFilter(new DoubleColumnCondition(...)))
.filter(new DoubleColumnCondition("price", ConditionOp.LessThan, 0.0))
```
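
The "remove when true" rule is the inverse of `java.util.stream.Stream.filter`, which keeps elements when its predicate is true. The plain-Java sketch below makes the difference explicit; `RemoveSemanticsSketch` and `removeWhere` are hypothetical names used only for illustration and are not part of DataVec.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.DoublePredicate;
import java.util.stream.Collectors;

// Plain-Java illustration of "remove when true"; names here are hypothetical.
class RemoveSemanticsSketch {

    // Mirrors the ConditionFilter rule: values for which the condition is true are dropped.
    static List<Double> removeWhere(List<Double> values, DoublePredicate condition) {
        return values.stream()
                .filter(v -> !condition.test(v))   // Stream.filter KEEPS on true, so negate
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Double> prices = Arrays.asList(19.99, -1.0, 0.0, 5.5);
        // The condition "price < 0" marks values for removal
        System.out.println(removeWhere(prices, p -> p < 0.0));
    }
}
```

If the goal is to keep only the records that match a condition, the condition has to be inverted before it is wrapped, for example with a `NOT` combination.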

## FilterInvalidValues

`FilterInvalidValues` removes any record that contains values that are invalid according to the column's declared type and constraints in the `Schema`.

"Invalid" means:

* A value that cannot be parsed as the declared type (e.g., the string `"abc"` in a Double column)
* A numeric value outside the declared min/max range
* A categorical value not in the declared state list
* A string that fails the declared regex or length constraints

```java
import org.datavec.api.transform.filter.Filter;
import org.datavec.api.transform.filter.FilterInvalidValues;

// Remove records with invalid values in "age" or "income"
Filter invalidFilter = new FilterInvalidValues("age", "income");

// Remove records with invalid values in ANY column
Filter allColumnsFilter = new FilterInvalidValues();
```

When no column names are given, the filter checks all columns in the schema. This is useful as a broad sanity check at the start of a pipeline before more targeted transforms.
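
To make the four kinds of "invalid" concrete, here is a plain-Java sketch of equivalent checks. It is a model under stated assumptions, not DataVec's validation code: `ValiditySketch` and its methods are hypothetical, raw values are treated as strings, and NaN is treated as out of range, which may differ from how a given schema is configured.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helpers that model the four kinds of invalid value; not DataVec's code.
class ValiditySketch {

    // Valid if parseable as a double AND inside [min, max]
    static boolean validDouble(String raw, double min, double max) {
        try {
            double v = Double.parseDouble(raw);
            return v >= min && v <= max;   // NaN fails both comparisons (assumption: NaN is invalid)
        } catch (NumberFormatException e) {
            return false;                  // e.g. "abc" in a Double column
        }
    }

    // Valid if the value is a member of the declared state list
    static boolean validCategory(String raw, Set<String> states) {
        return states.contains(raw);
    }

    // Valid if the value respects the length bounds and matches the regex
    static boolean validString(String raw, Pattern regex, int minLen, int maxLen) {
        return raw.length() >= minLen && raw.length() <= maxLen
                && regex.matcher(raw).matches();
    }

    public static void main(String[] args) {
        Set<String> states = new HashSet<>(Arrays.asList("USA", "CAN", "GBR"));
        System.out.println(validDouble("abc", 0, 120));      // unparseable
        System.out.println(validDouble("150", 0, 120));      // out of range
        System.out.println(validCategory("FRA", states));    // not a declared state
        System.out.println(validString("AB-12", Pattern.compile("[A-Z]{2}-\\d{2}"), 5, 5));
    }
}
```

A record would be removed as soon as any checked column fails its own test.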

## InvalidNumColumns

`InvalidNumColumns` removes records that do not have the expected number of columns. This is useful for CSV files where some rows are corrupted or have stray extra delimiters.

```java
import org.datavec.api.transform.filter.Filter;
import org.datavec.api.transform.filter.InvalidNumColumns;

// Schema has 5 columns; remove records with != 5 values
Filter wrongWidth = new InvalidNumColumns(schema);
```

This filter compares the actual number of `Writable` values in each record against `schema.numColumns()`. Records with too few or too many columns are removed.
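
The same idea can be sketched on raw CSV lines in plain Java. This is a simplified model rather than DataVec's implementation: `ColumnCountSketch` and `keepWellFormed` are hypothetical names, and the naive comma split assumes fields contain no quoted or escaped delimiters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of a column-count check; the class and method names are hypothetical.
class ColumnCountSketch {

    // Keep only the lines that split into exactly expectedColumns fields.
    static List<String> keepWellFormed(List<String> csvLines, int expectedColumns) {
        List<String> kept = new ArrayList<>();
        for (String line : csvLines) {
            // A limit of -1 preserves trailing empty fields, so "a,b," counts as 3
            int actual = line.split(",", -1).length;
            if (actual == expectedColumns) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "1,USA,34",      // 3 fields: kept
            "2,CAN",         // too few fields: removed
            "3,GBR,41,,",    // stray delimiters give 5 fields: removed
            "4,USA,");       // 3 fields, the last one empty: kept
        System.out.println(keepWellFormed(lines, 3));
    }
}
```

Note that a row with the right number of fields can still hold an empty or unparseable value; the width check only guarantees shape, which is why value-level checks are a separate step.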

## Implementing a Custom Filter

If the built-in filters do not cover your use case, implement the `Filter` interface:

```java
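import java.util.List;

import org.datavec.api.transform.filter.Filter;
import org.datavec.api.transform.schema.Schema;
import org.datavec.api.writable.Writable;

// Note: depending on the DataVec version, Filter may declare additional methods
// (for example typed overloads or schema accessors) that must also be implemented.
public class MyCustomFilter implements Filter {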
    @Override
    public boolean removeExample(Object writables) {
        List<Writable> record = (List<Writable>) writables;
        // Return true to REMOVE the record, false to keep it
        double value = record.get(2).toDouble();
        return Double.isNaN(value) || value < 0;
    }

    @Override
    public boolean removeSequence(Object sequence) {
        // Called for sequence data
        List<List<Writable>> seq = (List<List<Writable>>) sequence;
        return seq.isEmpty();
    }

    @Override
    public Schema transform(Schema inputSchema) {
        return inputSchema; // filters don't change the schema
    }
}
```

Add the custom filter to a `TransformProcess`:

```java
TransformProcess tp = new TransformProcess.Builder(schema)
    .filter(new MyCustomFilter())
    .build();
```

## Filter vs. Conditional Replace

Filters and conditional transforms address the same class of problems (records with bad values) but with different strategies:

| Strategy                                | When to use                                                                |
| --------------------------------------- | -------------------------------------------------------------------------- |
| **Filter** (remove the record)          | When bad records are unrecoverable and you have enough data to spare       |
| **Conditional replace** (fix the value) | When you can substitute a sensible default (e.g., 0.0 for negative prices) |

In practice, use filtering for structural problems (wrong column count, completely missing values) and conditional replace for recoverable data quality issues (out-of-range but fixable values).
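
The trade-off between the two strategies can be seen on a small list of prices. The sketch below uses plain Java streams rather than DataVec transforms; `FilterVsReplaceSketch`, `filterNegative`, and `replaceNegative` are hypothetical names chosen for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java comparison of the two strategies; not DataVec code.
class FilterVsReplaceSketch {

    // Strategy 1: drop the bad values entirely (the output may be shorter)
    static List<Double> filterNegative(List<Double> prices) {
        return prices.stream()
                .filter(p -> p >= 0.0)
                .collect(Collectors.toList());
    }

    // Strategy 2: keep every entry but substitute a fallback for bad values
    static List<Double> replaceNegative(List<Double> prices, double fallback) {
        return prices.stream()
                .map(p -> p < 0.0 ? fallback : p)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Double> prices = Arrays.asList(19.99, -1.0, 5.5);
        System.out.println(filterNegative(prices));        // two values remain
        System.out.println(replaceNegative(prices, 0.0));  // three values remain
    }
}
```

Filtering shrinks the dataset but never introduces made-up values; replacing preserves the record count at the cost of injecting the chosen default.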

## Order Matters

Because filters in a `TransformProcess` are applied in sequence, earlier filters can simplify later ones:

```java
TransformProcess tp = new TransformProcess.Builder(schema)
    // Step 1: Remove structurally broken records first
    .filter(new InvalidNumColumns(schema))
    // Step 2: Now safe to check values knowing all records have the right shape
    .filter(new FilterInvalidValues("age", "income"))
    // Step 3: Business logic filtering
    .filter(new ConditionFilter(
        new DoubleColumnCondition("income", ConditionOp.GreaterThan, 1_000_000.0)
    ))
    .build();
```

Placing cheap structural checks (like `InvalidNumColumns`) before expensive value-level checks reduces unnecessary work on malformed records.

