> For the complete documentation index, see [llms.txt](https://deeplearning4j.konduit.ai/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://deeplearning4j.konduit.ai/en-1.0.0-rewrite/deeplearning4j/overview-4.md).

# LLM & VLM Stack

Deeplearning4j 1.0.0-rewrite ships a full large language model (LLM) and vision-language model (VLM) application stack built on top of SameDiff. The stack is organized into six new Maven modules that together cover every layer of inference: tokenization, generation, KV cache management, speculative decoding, continuous batching, evaluation, benchmarking, model editing, audio transcription, and a web frontend for ND4J graphs.

This page gives a complete reference for all six modules, with API details and working Java code examples for the most important classes.

***

## 1. Overview and Module Map

The LLM stack sits above the existing SameDiff execution engine. SameDiff handles op dispatch and graph execution; the new modules provide the inference-specific infrastructure that production LLM serving requires.

```
Your Application
       │
       ▼
samediff-llm        ← generation pipeline, KV cache, speculative decoding,
                       continuous batching, tokenizers, evaluation, benchmarking,
                       model editing
samediff-vlm        ← vision-language model support (image+text)
samediff-audio      ← Whisper ASR support
nd4j-tokenizers     ← Rust-backed HuggingFace / SentencePiece / CLIP tokenizers
nd4j-torchscript    ← TorchScript / PyTorch model import
nd4j-web            ← TypeScript / FlatBuffers frontend for ND4J graphs
       │
       ▼
SameDiff (ND4J)     ← op execution, graph optimization, DSP plan lifecycle
       │
       ▼
libnd4j (C++)       ← CPU/CUDA kernels, BLAS, cuDNN
```

The six modules are independent of each other except that `samediff-vlm` and `samediff-audio` depend on `samediff-llm`, and all of them depend on `nd4j-tokenizers`.

***

## 2. Maven Dependencies

Add only the modules you need. All modules share the same version string.

```xml
<properties>
    <dl4j.version>1.0.0-rewrite</dl4j.version>
</properties>

<!-- Core LLM generation pipeline, KV cache, speculative decoding,
     continuous batching, evaluation, benchmarking, model editing -->
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-llm</artifactId>
    <version>${dl4j.version}</version>
</dependency>

<!-- Vision-language models (requires samediff-llm) -->
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-vlm</artifactId>
    <version>${dl4j.version}</version>
</dependency>

<!-- Whisper ASR (requires samediff-llm) -->
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-audio</artifactId>
    <version>${dl4j.version}</version>
</dependency>

<!-- Rust-backed tokenizers: HuggingFace, SentencePiece, CLIP -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-tokenizers</artifactId>
    <version>${dl4j.version}</version>
</dependency>

<!-- TorchScript / PyTorch model import -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-torchscript</artifactId>
    <version>${dl4j.version}</version>
</dependency>

<!-- TypeScript / FlatBuffers frontend for ND4J graphs -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-web</artifactId>
    <version>${dl4j.version}</version>
</dependency>
```

***

## 3. Generation Pipeline

The generation pipeline is the unified entry point for all text generation tasks. It handles model I/O auto-discovery, embedding extraction, tokenization, decode loop construction, and configuration-driven optimization.

### Core Classes

| Class                                            | Role                                                      |
| ------------------------------------------------ | --------------------------------------------------------- |
| `GenerationPipeline`                             | Top-level entry point; owns the decode loop and lifecycle |
| `GenerationPipelineConfig`                       | Builder-style configuration for the pipeline              |
| `DecodeOptions`                                  | Per-call generation parameters (temperature, top-k, etc.) |
| `GenerationResult`                               | Output: token IDs, decoded text, timing data              |
| `TextGenerator`                                  | Higher-level API with streaming callback support          |
| `Sampler` / `GreedySampler` / `CompositeSampler` | Sampling strategy hierarchy                               |
| `SamplingConfig` / `SamplerUtils`                | Temperature / top-k / top-p config and utilities          |
| `DecoderInputBuilder` / `DecoderUtils`           | Tensor construction for each decode step                  |
| `DecodeStepDiagnostics`                          | Per-step diagnostics: token IDs, logit stats, timing      |

### Building a Pipeline

`GenerationPipelineConfig` uses a fluent builder. All fields are optional; the pipeline performs auto-discovery for any field not set.

```java
import org.deeplearning4j.llm.generation.GenerationPipeline;
import org.deeplearning4j.llm.generation.GenerationPipelineConfig;
import org.deeplearning4j.llm.generation.DecodeOptions;
import org.deeplearning4j.llm.generation.GenerationResult;
import org.deeplearning4j.llm.generation.SamplingConfig;
import org.deeplearning4j.llm.kv.KvCacheStrategy;
import org.nd4j.autodiff.samediff.SameDiff;

// Load or import your model
SameDiff model = SameDiff.load(new File("llama-3.1-8b.fb"), true);

// Configure the pipeline
GenerationPipelineConfig config = GenerationPipelineConfig.builder()
    .decoder(model)                           // SameDiff graph containing the decoder
    .tokenizer("tokenizer.json")              // path or classpath resource
    .maxTokens(2048)                          // maximum context length
    .samplingConfig(SamplingConfig.builder()
        .temperature(0.7f)
        .topK(50)
        .topP(0.9f)
        .repetitionPenalty(1.1f)
        .build())
    .kvCacheStrategy(KvCacheStrategy.PAGED)   // see KV Cache section
    .build();

GenerationPipeline pipeline = new GenerationPipeline(config);
```

### Running Generation

Pass per-call options via `DecodeOptions`. Settings in `DecodeOptions` override the pipeline-level `SamplingConfig` for that call only.

```java
DecodeOptions opts = DecodeOptions.builder()
    .temperature(0.8f)
    .topK(40)
    .maxNewTokens(512)
    .build();

GenerationResult result = pipeline.generate("Explain quantum entanglement.", opts);

System.out.println(result.getText());         // decoded string
System.out.println(result.getTokenIds());     // List<Integer> of generated token IDs
System.out.printf("%.1f tok/s%n",
    result.getTokensPerSecond());             // throughput from timing data
```

### Streaming with TextGenerator

`TextGenerator` wraps `GenerationPipeline` and adds token-by-token streaming via a callback.

```java
import org.deeplearning4j.llm.generation.TextGenerator;

TextGenerator generator = new TextGenerator(pipeline);

generator.generate("Write a haiku about neural networks.", opts,
    token -> System.out.print(token));        // callback receives each decoded token

System.out.println();  // newline after streaming completes
```

### Sampling Strategies

`GreedySampler` always picks the highest-probability token. `CompositeSampler` chains a sequence of sampling transforms — temperature scaling, then top-k filtering, then top-p nucleus filtering — before the final argmax or categorical sample.

```java
import org.deeplearning4j.llm.generation.sampling.GreedySampler;
import org.deeplearning4j.llm.generation.sampling.CompositeSampler;
import org.deeplearning4j.llm.generation.sampling.SamplerUtils;

// Greedy decoding — deterministic, fast
Sampler greedy = new GreedySampler();

// Nucleus sampling: temperature → top-k → top-p → sample
Sampler nucleus = CompositeSampler.builder()
    .temperature(0.9f)
    .topK(100)
    .topP(0.95f)
    .build();
```

### Per-Step Diagnostics

Enable `DecodeStepDiagnostics` to capture detailed information about each decode step. This is useful for debugging generation quality issues.

```java
import org.deeplearning4j.llm.generation.DecodeStepDiagnostics;

pipeline.enableDiagnostics(true);

GenerationResult result = pipeline.generate("Hello, world!", opts);

for (DecodeStepDiagnostics step : result.getDiagnostics()) {
    System.out.printf("Step %d: token=%d logit_max=%.3f logit_entropy=%.3f time_ms=%d%n",
        step.getStep(),
        step.getSelectedTokenId(),
        step.getMaxLogit(),
        step.getLogitEntropy(),
        step.getStepTimeMs());
}
```

***

## 4. KV Cache Management

The key-value (KV) cache stores the attention keys and values computed during the prefill and previous decode steps. Good cache management is the single largest lever for improving LLM serving throughput. The LLM stack provides a comprehensive hierarchy of cache implementations.

### Cache Strategy Overview

| Strategy         | Class                       | When to Use                              |
| ---------------- | --------------------------- | ---------------------------------------- |
| Paged            | `PagedKVCache`              | Default; best memory utilization         |
| Paged + eviction | `EvictablePagedKVCache`     | Long conversations; evict old pages      |
| Per-layer policy | `PerLayerPagedKVCache`      | Different eviction per transformer layer |
| Quantized        | `QuantizedPagedKVCache`     | Memory-constrained GPUs; INT8/FP16 pages |
| MLA              | `MLAKVCache`                | DeepSeek Multi-head Latent Attention     |
| Beam search      | `BeamKVCacheManager`        | Beam decoding with K beams               |
| Speculative      | `SpeculativeKVCacheManager` | Speculative decoding draft/verify        |
| Tiered           | `TieredKVCacheManager`      | GPU → host DRAM → disk tiering           |
| Unified          | `UnifiedKvCacheManager`     | Single manager across all strategies     |

### PagedKVCache

`PagedKVCache` partitions the cache into fixed-size pages. Sequences are allocated pages on demand; eviction is O(1) — just reclaim a page. This is the vLLM-style approach and is the default strategy.

```java
import org.deeplearning4j.llm.kv.PagedKVCache;

PagedKVCache cache = PagedKVCache.builder()
    .pageSize(16)           // tokens per page
    .maxPages(1024)         // total pages (controls max memory)
    .numLayers(32)          // transformer layer count
    .numHeads(8)            // KV heads (use GQA count, not full head count)
    .headDim(128)           // per-head dimension
    .dtype(DataType.FLOAT16)
    .build();
```

### Eviction Policies

`EvictablePagedKVCache` adds eviction support. Three built-in eviction policies are provided:

| Policy       | Class                        | Description                                                                            |
| ------------ | ---------------------------- | -------------------------------------------------------------------------------------- |
| LRU          | default                      | Evict least recently used page                                                         |
| H2O          | `H2OEvictionPolicy`          | Heavy Hitter Oracle: evict low-importance tokens based on accumulated attention scores |
| StreamingLLM | `StreamingLLMEvictionPolicy` | Preserve attention sink tokens + recent sliding window                                 |

```java
import org.deeplearning4j.llm.kv.EvictablePagedKVCache;
import org.deeplearning4j.llm.kv.eviction.H2OEvictionPolicy;
import org.deeplearning4j.llm.kv.eviction.StreamingLLMEvictionPolicy;
import org.deeplearning4j.llm.kv.eviction.AttentionSinkDetector;

// H2O eviction — works well for long document summarization
EvictablePagedKVCache h2oCache = EvictablePagedKVCache.builder()
    .pageSize(16)
    .maxPages(512)
    .numLayers(32)
    .numHeads(8)
    .headDim(128)
    .evictionPolicy(new H2OEvictionPolicy())
    .build();

// StreamingLLM — preserves attention sinks, keeps recent window
AttentionSinkDetector sinkDetector = new AttentionSinkDetector(numSinkTokens: 4);

EvictablePagedKVCache streamingCache = EvictablePagedKVCache.builder()
    .pageSize(16)
    .maxPages(512)
    .numLayers(32)
    .numHeads(8)
    .headDim(128)
    .evictionPolicy(new StreamingLLMEvictionPolicy(sinkDetector, windowSize: 256))
    .build();
```

### Per-Layer Eviction

`PerLayerPagedKVCache` assigns a different `PerLayerKVPolicy` to each transformer layer. This is useful because attention patterns differ significantly between early and late layers.

```java
import org.deeplearning4j.llm.kv.PerLayerPagedKVCache;
import org.deeplearning4j.llm.kv.PerLayerKVPolicy;
import org.deeplearning4j.llm.kv.eviction.StreamingLLMEvictionPolicy;
import org.deeplearning4j.llm.kv.eviction.H2OEvictionPolicy;

List<PerLayerKVPolicy> policies = new ArrayList<>();
for (int layer = 0; layer < 32; layer++) {
    if (layer < 4) {
        // Early layers: protect attention sinks with StreamingLLM
        policies.add(PerLayerKVPolicy.of(new StreamingLLMEvictionPolicy(sinkDetector, 512)));
    } else {
        // Later layers: H2O importance-based eviction
        policies.add(PerLayerKVPolicy.of(new H2OEvictionPolicy()));
    }
}

PerLayerPagedKVCache perLayerCache = PerLayerPagedKVCache.builder()
    .pageSize(16)
    .maxPages(512)
    .perLayerPolicies(policies)
    .numHeads(8)
    .headDim(128)
    .build();
```

### Quantized KV Cache

`QuantizedPagedKVCache` stores pages in INT8 or FP16 and dequantizes on read. This roughly halves or quarters the memory footprint of the cache with minimal accuracy impact on most models.

```java
import org.deeplearning4j.llm.kv.QuantizedPagedKVCache;
import org.deeplearning4j.llm.kv.QuantizationMode;

QuantizedPagedKVCache quantCache = QuantizedPagedKVCache.builder()
    .pageSize(16)
    .maxPages(2048)          // 2x more pages for same memory vs FP32
    .numLayers(32)
    .numHeads(8)
    .headDim(128)
    .quantizationMode(QuantizationMode.INT8)  // or FP16
    .build();
```

### KV Cache Offloading

For very long contexts, the cache can be offloaded from GPU VRAM to host DRAM or disk.

```java
import org.deeplearning4j.llm.kv.offload.KVCacheHostOffloader;
import org.deeplearning4j.llm.kv.offload.KVCacheDiskOffloader;

// Offload evicted pages to host DRAM (PCIe transfer on demand)
KVCacheHostOffloader hostOffloader = KVCacheHostOffloader.builder()
    .maxHostBytes(8L * 1024 * 1024 * 1024)   // 8 GB host RAM
    .asyncTransfer(true)
    .build();

// Offload evicted pages to disk (NVMe SSD recommended)
KVCacheDiskOffloader diskOffloader = KVCacheDiskOffloader.builder()
    .storagePath(Path.of("/tmp/kvcache"))
    .maxDiskBytes(64L * 1024 * 1024 * 1024)  // 64 GB
    .build();
```

Use `TieredKVCacheManager` to combine GPU, host, and disk tiers automatically:

```java
import org.deeplearning4j.llm.kv.TieredKVCacheManager;

TieredKVCacheManager tieredManager = TieredKVCacheManager.builder()
    .gpuCache(cache)
    .hostOffloader(hostOffloader)
    .diskOffloader(diskOffloader)
    .build();
```

### Prefix Sharing

`KVCachePrefixTree` and `RadixPrefixCache` enable sharing KV cache pages across requests that share a common prompt prefix (e.g., a system prompt). Matching prefixes are detected and their cached pages are reused rather than recomputed.

```java
import org.deeplearning4j.llm.kv.prefix.RadixPrefixCache;

RadixPrefixCache prefixCache = RadixPrefixCache.builder()
    .pageSize(16)
    .maxEntries(10000)
    .build();

// The pipeline will check prefixCache before computing prefill
pipeline.setPrefixCache(prefixCache);
```

### KV Cache Checkpointing

Save and restore cache state to disk, enabling pause-and-resume of long generation sessions.

```java
import org.deeplearning4j.llm.kv.checkpoint.KVCacheCheckpointManager;

KVCacheCheckpointManager checkpointMgr = new KVCacheCheckpointManager(cache);

// Save
checkpointMgr.checkpoint(Path.of("/tmp/kvcache_checkpoint.bin"));

// Restore
checkpointMgr.restore(Path.of("/tmp/kvcache_checkpoint.bin"));
```

***

## 5. Speculative Decoding

Speculative decoding uses a fast draft model (or an n-gram heuristic) to propose multiple tokens ahead, then verifies them in a single forward pass of the full target model. Accepted tokens come for free; only rejected tokens require additional passes. On hardware where the target model is memory-bandwidth-bound, speculative decoding commonly delivers 2-3x throughput improvement.

### Speculator Implementations

| Class                  | Draft Source                  | Notes                       |
| ---------------------- | ----------------------------- | --------------------------- |
| `NgramSpeculator`      | N-gram from generated context | No secondary model required |
| `DraftModelSpeculator` | Smaller SameDiff model        | Highest acceptance rate     |

### NgramSpeculator

Uses an n-gram index built from the tokens already generated in the current sequence. No additional model weights are required.

```java
import org.deeplearning4j.llm.speculative.NgramSpeculator;
import org.deeplearning4j.llm.speculative.SpeculativeDecodeLoop;

NgramSpeculator speculator = NgramSpeculator.builder()
    .ngramOrder(4)          // use 4-gram drafts
    .draftLength(5)         // propose up to 5 tokens per step
    .build();

SpeculativeDecodeLoop loop = SpeculativeDecodeLoop.builder()
    .targetModel(model)
    .speculator(speculator)
    .kvCache(cache)
    .build();

GenerationResult result = loop.generate("Once upon a time", opts);
System.out.printf("Accepted %.1f%% of draft tokens%n",
    result.getSpeculativeAcceptanceRate() * 100);
```

### DraftModelSpeculator

Uses a smaller, faster model to generate draft tokens. The draft model should share the same vocabulary as the target model.

```java
import org.deeplearning4j.llm.speculative.DraftModelSpeculator;

SameDiff draftModel = SameDiff.load(new File("llama-3.1-1b.fb"), true);

DraftModelSpeculator speculator = DraftModelSpeculator.builder()
    .draftModel(draftModel)
    .draftLength(7)            // draft up to 7 tokens
    .draftKvCache(draftCache)  // separate smaller cache for the draft model
    .build();

SpeculativeDecodeLoop loop = SpeculativeDecodeLoop.builder()
    .targetModel(model)
    .speculator(speculator)
    .kvCache(cache)
    .verifier(new TreeAttentionVerifier())   // parallel tree-based verification
    .build();
```

### Tree Attention Verification

`TreeAttentionVerifier` organizes draft tokens into a tree structure and verifies all candidates in parallel with a single batched forward pass of the target model. This maximizes GPU utilization during the verification step.

The tree verifier is selected automatically when `draftLength > 1` and is the recommended choice for `DraftModelSpeculator`. It requires no additional configuration beyond being set on the `SpeculativeDecodeLoop`.

### Throughput and Auto-Disable

For structured or repetitive outputs (code, lists, repeated phrases), n-gram speculation typically achieves **2-5x throughput improvement** over greedy decode because the n-gram index captures recurring patterns with high acceptance rates.

A probe mechanism monitors acceptance rates automatically. If the target model cannot handle multi-token input (for example, some encoder-decoder models like SmolDocling that use cached cross-attention), the probe detects the failure, disables speculation for a cooldown period, then re-enables it to retry. This makes `SpeculativeDecodeLoop` safe to use without knowing in advance whether a given model supports speculative execution:

```java
SpeculativeDecodeLoop loop = SpeculativeDecodeLoop.builder()
    .targetModel(model)
    .speculator(new NgramSpeculator.builder()
        .ngramOrder(4)
        .draftLength(5)
        .build())
    .kvCache(cache)
    // Probe mechanism is enabled by default; no extra configuration required.
    // The loop logs a warning and falls back to greedy on unsupported models.
    .build();
```

***

## 6. Continuous Batching

Continuous batching (sometimes called in-flight batching) keeps the GPU fully saturated by interleaving prefill and decode steps across multiple requests. Unlike static batching, where a batch waits until all sequences in it complete, continuous batching allows new requests to be admitted and completed sequences to exit at any decode step.

### Architecture

```
Incoming requests
       │
       ▼
ContinuousBatchScheduler
  │  assigns requests to batch slots
  │  manages per-request BatchGenerationState
  │
  ├──► ChunkedPrefillEngine    ← breaks long prompts into chunks
  │       processed in decode steps alongside ongoing sequences
  │
  └──► Decode step (full batch)
          │
          ▼
       BatchCompactor
          removes completed sequences, compacts active slots
```

### ContinuousBatchScheduler

```java
import org.deeplearning4j.llm.batch.ContinuousBatchScheduler;
import org.deeplearning4j.llm.batch.ChunkedPrefillEngine;

ChunkedPrefillEngine prefillEngine = ChunkedPrefillEngine.builder()
    .chunkSize(512)           // tokens of prefill to process per step
    .build();

ContinuousBatchScheduler scheduler = ContinuousBatchScheduler.builder()
    .maxBatchSize(32)         // max concurrent sequences
    .maxSeqLen(4096)
    .model(model)
    .kvCache(cache)
    .prefillEngine(prefillEngine)
    .build();

scheduler.start();

// Submit requests (thread-safe; can be called from multiple threads)
CompletableFuture<GenerationResult> future1 =
    scheduler.submit("Summarize: " + longDocument, opts);
CompletableFuture<GenerationResult> future2 =
    scheduler.submit("Translate to French: Hello world", opts);

GenerationResult r1 = future1.get();
GenerationResult r2 = future2.get();

scheduler.shutdown();
```

### ChunkedPrefillEngine

`ChunkedPrefillEngine` solves the O(n²) memory problem of processing long prompts in a single pass. It splits the prompt into fixed-size windows (`chunkSize` tokens) and processes each chunk sequentially, accumulating KV cache entries across chunks. The decode phase begins only after all chunks complete.

This allows arbitrarily long prompts to be processed within a fixed GPU memory budget while keeping decode latency uniform across requests:

```java
import org.deeplearning4j.llm.batch.ChunkedPrefillEngine;

ChunkedPrefillEngine prefillEngine = ChunkedPrefillEngine.builder()
    .chunkSize(512)    // tokens of prefill to process per scheduler step
    .build();

// Attach to the scheduler; long-prompt requests are chunked automatically
ContinuousBatchScheduler scheduler = ContinuousBatchScheduler.builder()
    .prefillEngine(prefillEngine)
    // ... other config
    .build();
```

Chunk size is a latency-memory trade-off: smaller chunks use less memory per step but add more prefill steps before the first token is produced. 512 tokens is a practical starting point for most hardware.

### BatchGenerationState

`BatchGenerationState` tracks per-sequence state within the batch: current token position, KV cache page assignments, sampling state, and completion status. It is managed automatically by `ContinuousBatchScheduler` and is not normally accessed directly.

### BatchCompactor

`BatchCompactor` runs at the end of each decode step to remove completed sequences and compact the batch tensor so that the GPU kernel always operates on a dense, full-occupancy batch. It is attached to the scheduler automatically.

***

## 7. Tokenizers

The `nd4j-tokenizers` module provides tokenizers backed by Rust-native implementations for correctness and performance. All tokenizers implement the `Tokenizer` interface.

### Tokenizer Interface

```java
import org.nd4j.tokenizers.Tokenizer;
import org.nd4j.tokenizers.Encoding;

public interface Tokenizer {
    Encoding encode(String text);
    String decode(List<Integer> ids);
    Map<String, Integer> specialTokens();
    int vocabSize();
    void close();
}
```

`Encoding` holds the token IDs, attention mask, and (optionally) token type IDs.

### HuggingFaceTokenizer

Loads any tokenizer in the standard `tokenizer.json` format as exported by Hugging Face `transformers`. Supports BPE, WordPiece, and Unigram models.

```java
import org.nd4j.tokenizers.HuggingFaceTokenizer;

HuggingFaceTokenizer tokenizer =
    HuggingFaceTokenizer.fromFile(Path.of("path/to/tokenizer.json"));

Encoding enc = tokenizer.encode("The quick brown fox");
System.out.println(enc.getIds());           // [791, 4996, 14198, 39935]

String decoded = tokenizer.decode(enc.getIds());
System.out.println(decoded);                // "The quick brown fox"

tokenizer.close();
```

### SentencePieceTokenizer

Loads SentencePiece BPE models (`.model` files), used by LLaMA, Gemma, Mistral, and other models that do not use the HuggingFace format.

```java
import org.nd4j.tokenizers.SentencePieceTokenizer;

SentencePieceTokenizer tokenizer =
    SentencePieceTokenizer.fromFile(Path.of("tokenizer.model"));

Encoding enc = tokenizer.encode("Hello, SentencePiece!");
tokenizer.close();
```

### CLIPTokenizer

A specialized tokenizer for CLIP-family vision-language models, following the byte-pair encoding used by the original OpenAI CLIP implementation.

```java
import org.nd4j.tokenizers.CLIPTokenizer;

CLIPTokenizer tokenizer =
    CLIPTokenizer.fromFiles(
        Path.of("vocab.json"),
        Path.of("merges.txt"));

// Encode a text prompt for CLIP image-text alignment
Encoding enc = tokenizer.encode("a photo of a cat");
tokenizer.close();
```

### Chat Templates

`ChatTemplate` renders structured chat conversations into the prompt format expected by an instruction-tuned model. It implements a Jinja2-subset template engine compatible with the `chat_template` field in HuggingFace `tokenizer_config.json`.

```java
import org.nd4j.tokenizers.ChatTemplate;
import org.nd4j.tokenizers.ChatMessage;

ChatTemplate template = ChatTemplate.fromTokenizerConfig(
    Path.of("tokenizer_config.json"));

List<ChatMessage> messages = List.of(
    ChatMessage.system("You are a helpful assistant."),
    ChatMessage.user("What is the capital of France?"),
    ChatMessage.assistant("The capital of France is Paris."),
    ChatMessage.user("What is its population?"));

String prompt = template.apply(messages, addGenerationPrompt: true);
System.out.println(prompt);
// Produces the model-specific formatted prompt string
```

### TokenizerFactory

`TokenizerFactory` auto-detects the tokenizer type from the files present in a directory and instantiates the correct implementation.

```java
import org.nd4j.tokenizers.TokenizerFactory;

// Auto-detect from a directory containing tokenizer.json or tokenizer.model
Tokenizer tokenizer = TokenizerFactory.fromDirectory(Path.of("model-dir/"));
```

***

## 8. Evaluation Framework

The evaluation framework provides automated benchmarking of LLM capabilities across standard academic benchmarks and custom datasets.

### Core Evaluation Classes

| Class                        | Role                                                               |
| ---------------------------- | ------------------------------------------------------------------ |
| `EvalRunner`                 | Orchestrates evaluation runs; parallelizes across dataset examples |
| `EvalConfig`                 | Dataset, benchmark, metric, and generation options                 |
| `EvalResult`                 | Aggregated result: per-benchmark scores, timing, sample results    |
| `SampleResult`               | Per-example output, prediction, and score                          |
| `PerplexityEvaluator`        | Computes log-perplexity over a reference corpus                    |
| `GenerationQualityValidator` | Validates generation coherence (length, repetition, entropy)       |
| `AnswerExtractor`            | Extracts structured answers from free-form generated text          |

### Running a Standard Benchmark

```java
import org.deeplearning4j.llm.eval.EvalRunner;
import org.deeplearning4j.llm.eval.EvalConfig;
import org.deeplearning4j.llm.eval.EvalResult;
import org.deeplearning4j.llm.eval.benchmarks.MMLUBenchmark;
import org.deeplearning4j.llm.eval.datasets.HuggingFaceDataset;

HuggingFaceDataset dataset = HuggingFaceDataset.load("cais/mmlu", split: "test");

EvalConfig config = EvalConfig.builder()
    .benchmark(new MMLUBenchmark())
    .dataset(dataset)
    .pipeline(pipeline)
    .numShots(5)             // 5-shot evaluation
    .numWorkers(4)           // parallel evaluation workers
    .build();

EvalRunner runner = new EvalRunner(config);
EvalResult result = runner.run();

System.out.printf("MMLU accuracy: %.2f%%%n", result.getScore() * 100);
result.getPerSubjectScores().forEach((subject, score) ->
    System.out.printf("  %s: %.2f%%%n", subject, score * 100));
```

### Available Benchmarks

| Benchmark  | Class                 | Measures                                               |
| ---------- | --------------------- | ------------------------------------------------------ |
| MMLU       | `MMLUBenchmark`       | Massive Multitask Language Understanding (57 subjects) |
| ARC        | `ArcBenchmark`        | AI2 Reasoning Challenge (grade-school science)         |
| GSM8K      | `Gsm8kBenchmark`      | Grade school math word problems                        |
| HellaSwag  | `HellaSwagBenchmark`  | Commonsense reasoning / sentence completion            |
| TruthfulQA | `TruthfulQABenchmark` | Truthfulness and calibration                           |
| WinoGrande | `WinograndeBenchmark` | Pronoun coreference resolution                         |

### Metrics

| Metric           | Class                    | Description                                             |
| ---------------- | ------------------------ | ------------------------------------------------------- |
| Exact Match      | `ExactMatch`             | Binary: prediction equals gold label                    |
| F1               | `F1`                     | Token-level F1 between prediction and gold              |
| BLEU             | `BLEU`                   | N-gram precision (translation quality)                  |
| ROUGE            | `ROUGE`                  | Recall-oriented n-gram overlap (summarization)          |
| ANLS             | `ANLS`                   | Average Normalized Levenshtein Similarity (document QA) |
| VQA Accuracy     | `VqaAccuracy`            | Soft accuracy for visual question answering             |
| Relaxed Accuracy | `RelaxedAccuracy`        | Case/punctuation-insensitive exact match                |
| Multiple Choice  | `MultipleChoiceAccuracy` | Accuracy over A/B/C/D choices                           |

```java
import org.deeplearning4j.llm.eval.metrics.ROUGE;
import org.deeplearning4j.llm.eval.metrics.RougeVariant;

ROUGE rouge = new ROUGE(RougeVariant.ROUGE_L);
double score = rouge.compute(prediction, reference);
```

### Dataset Sources

| Class                | Loads From                                             |
| -------------------- | ------------------------------------------------------ |
| `HuggingFaceDataset` | HuggingFace Hub (requires network)                     |
| `JsonlDataset`       | Local JSONL file                                       |
| `CsvDataset`         | Local CSV file                                         |
| `CustomDataset`      | In-memory list of `(input, label)` pairs               |
| `DatasetCache`       | Wraps any dataset; caches to disk to avoid re-download |

```java
import org.deeplearning4j.llm.eval.datasets.JsonlDataset;
import org.deeplearning4j.llm.eval.datasets.DatasetCache;

JsonlDataset raw = JsonlDataset.builder()
    .path(Path.of("gsm8k_test.jsonl"))
    .inputField("question")
    .labelField("answer")
    .build();

// Cache to avoid re-reading the file on each evaluation run
DatasetCache cached = DatasetCache.wrap(raw, Path.of(".cache/gsm8k"));
```

### Perplexity

```java
import org.deeplearning4j.llm.eval.PerplexityEvaluator;

PerplexityEvaluator ppl = new PerplexityEvaluator(pipeline);
double perplexity = ppl.evaluate(Path.of("wikitext-103-test.txt"));
System.out.printf("Perplexity: %.2f%n", perplexity);
```

### Running All Standard Benchmarks

`EvalRunner` orchestrates evaluation runs and parallelizes across dataset examples using multiple worker threads. The example below runs all six built-in benchmarks back-to-back against the same pipeline:

```java
import org.deeplearning4j.llm.eval.EvalRunner;
import org.deeplearning4j.llm.eval.EvalConfig;
import org.deeplearning4j.llm.eval.EvalResult;
import org.deeplearning4j.llm.eval.benchmarks.MMLUBenchmark;
import org.deeplearning4j.llm.eval.benchmarks.ArcBenchmark;
import org.deeplearning4j.llm.eval.benchmarks.Gsm8kBenchmark;
import org.deeplearning4j.llm.eval.benchmarks.HellaSwagBenchmark;
import org.deeplearning4j.llm.eval.benchmarks.TruthfulQABenchmark;
import org.deeplearning4j.llm.eval.benchmarks.WinograndeBenchmark;
import org.deeplearning4j.llm.eval.datasets.HuggingFaceDataset;
import org.deeplearning4j.llm.eval.datasets.DatasetCache;

record BenchmarkSpec(String name, Object benchmark, String hfPath, String split) {}

List<BenchmarkSpec> specs = List.of(
    new BenchmarkSpec("MMLU",        new MMLUBenchmark(),       "cais/mmlu",       "test"),
    new BenchmarkSpec("ARC",         new ArcBenchmark(),        "allenai/arc",     "test"),
    new BenchmarkSpec("GSM8K",       new Gsm8kBenchmark(),      "gsm8k",           "test"),
    new BenchmarkSpec("HellaSwag",   new HellaSwagBenchmark(),  "hellaswag",       "validation"),
    new BenchmarkSpec("TruthfulQA",  new TruthfulQABenchmark(), "truthful_qa",     "validation"),
    new BenchmarkSpec("WinoGrande",  new WinograndeBenchmark(), "winogrande",      "validation")
);

for (BenchmarkSpec spec : specs) {
    HuggingFaceDataset dataset = HuggingFaceDataset.load(spec.hfPath(), split: spec.split());
    DatasetCache cached = DatasetCache.wrap(dataset, Path.of(".cache/" + spec.name().toLowerCase()));

    EvalConfig config = EvalConfig.builder()
        .benchmark(spec.benchmark())
        .dataset(cached)
        .pipeline(pipeline)
        .numShots(0)          // 0-shot by default; set higher for few-shot
        .numWorkers(4)
        .build();

    EvalResult result = new EvalRunner(config).run();

    System.out.printf("%-12s  %.2f%%%n", spec.name(), result.getScore() * 100);
}
```

Expected output (scores vary by model):

```
MMLU          65.3%
ARC           80.1%
GSM8K         72.4%
HellaSwag     83.7%
TruthfulQA    48.9%
WinoGrande    74.2%
```

***

## 9. Model Editing / Abliteration

The model editing module provides tools for modifying model behavior by directly editing weight matrices. The primary use case implemented is *abliteration*: removing a model's refusal directions to understand or modify how refusal behavior is encoded in the model's weights. This is useful for research into model internals and for running ablations on safety-trained models in controlled research environments.

**Important:** These tools modify model weights irreversibly. Always work on a copy. Abliterated models should be used only within the bounds of your organization's AI safety policies.

### Abliteration Workflow

Abliteration works by:

1. Collecting activations for harmful and harmless prompt pairs (contrastive pairs).
2. Computing the mean activation difference between the two sets — the "refusal direction".
3. Orthogonalizing all weight matrices in the model against the refusal direction using Gram-Schmidt.

This removes the direction from the model's weight space so the model cannot activate along it, effectively removing the refusal behavior.

```java
import org.deeplearning4j.llm.edit.AbliterationWorkflow;
import org.deeplearning4j.llm.edit.AbliterationConfig;
import org.deeplearning4j.llm.edit.AbliterationResult;
import org.deeplearning4j.llm.edit.DefaultPromptSets;

AbliterationConfig config = AbliterationConfig.builder()
    .model(model)
    .tokenizer(tokenizer)
    .harmfulPrompts(DefaultPromptSets.HARMFUL_PROMPTS)        // built-in set
    .harmlessPrompts(DefaultPromptSets.HARMLESS_PROMPTS)      // built-in set
    .targetLayers(List.of(15, 16, 17, 18))                    // layers to edit
    .numActivationSamples(64)                                 // prompts per direction
    .build();

AbliterationWorkflow workflow = new AbliterationWorkflow(config);
AbliterationResult result = workflow.run();

System.out.printf("Edited %d weight matrices%n",
    result.getNumEditedMatrices());

// Save the modified model
SameDiff editedModel = result.getEditedModel();
editedModel.save(new File("model-abliterated.fb"), true);
```

### RefusalDirectionFinder

Used internally by `AbliterationWorkflow`, but can also be used standalone to analyze where refusal behavior is most strongly encoded across layers.

```java
import org.deeplearning4j.llm.edit.RefusalDirectionFinder;
import org.deeplearning4j.llm.edit.RefusalDirection;

RefusalDirectionFinder finder = new RefusalDirectionFinder(model, tokenizer);

List<RefusalDirection> directions = finder.find(
    harmfulPrompts, harmlessPrompts, layers: List.of(0, 8, 16, 24, 31));

for (RefusalDirection dir : directions) {
    System.out.printf("Layer %d: direction norm=%.4f%n",
        dir.getLayer(), dir.getDirection().norm2Number().floatValue());
}
```

### WeightOrthogonalizer

Applies the Gram-Schmidt orthogonalization to remove a direction from a weight matrix. Used by `AbliterationWorkflow` but also available directly.

```java
import org.deeplearning4j.llm.edit.WeightOrthogonalizer;

INDArray weightMatrix = model.getVariable("decoder/layer.16/mlp/down_proj/W").getArr();
INDArray direction   = refusalDirection.getDirection();

INDArray edited = WeightOrthogonalizer.orthogonalize(weightMatrix, direction);
```

***

## 10. Benchmarking

The benchmark framework measures LLM inference throughput under controlled conditions. It distinguishes between three throughput regimes that capture different aspects of serving performance.

### Throughput Metrics

| Metric             | Description                                                            |
| ------------------ | ---------------------------------------------------------------------- |
| `lateSteady tok/s` | Tokens per second after full JIT warmup and cache warmup               |
| `steady tok/s`     | Tokens per second during the steady decode phase (most representative) |
| `decode tok/s`     | Tokens per second for the decode phase only (excludes prefill)         |

### BenchmarkConfig Presets

`BenchmarkConfig` ships four presets that control how the SameDiff graph is executed during the benchmark run.

| Preset       | Constant                       | Description                                                                |
| ------------ | ------------------------------ | -------------------------------------------------------------------------- |
| Optimal      | `BenchmarkConfig.OPTIMAL`      | Lets the system select the best execution mode automatically               |
| Slot-by-slot | `BenchmarkConfig.SLOT_BY_SLOT` | Executes one op at a time; useful for per-op profiling                     |
| Triton       | `BenchmarkConfig.TRITON`       | Routes eligible ops through Triton kernels (requires `tritonEnabled=true`) |
| CUDA Graphs  | `BenchmarkConfig.CUDA_GRAPHS`  | Captures and replays CUDA graphs; lowest decode latency on GPU             |

### Running a Benchmark

```java
import org.deeplearning4j.llm.benchmark.BenchmarkRunner;
import org.deeplearning4j.llm.benchmark.BenchmarkConfig;
import org.deeplearning4j.llm.benchmark.BenchmarkResult;

BenchmarkConfig config = BenchmarkConfig.OPTIMAL
    .withFp16PreCast(true)          // cast weights to FP16 before benchmarking
    .withGraphOptimizer(true)       // enable SameDiff graph fusion passes
    .withTritonEnabled(false);      // set true to enable Triton kernel routing

BenchmarkRunner runner = BenchmarkRunner.builder()
    .pipeline(pipeline)
    .config(config)
    .prompt("The quick brown fox jumps over the lazy dog.")
    .warmupIterations(50)
    .benchmarkIterations(200)
    .build();

BenchmarkResult result = runner.run();

System.out.printf("steady tok/s:     %.1f%n", result.getSteadyThroughput());
System.out.printf("decode tok/s:     %.1f%n", result.getDecodeThroughput());
System.out.printf("lateSteady tok/s: %.1f%n", result.getLateSteadyThroughput());
System.out.printf("mean decode ms:   %.2f%n", result.getMeanDecodeMs());
System.out.printf("p99 decode ms:    %.2f%n", result.getP99DecodeMs());
```

### BenchmarkConfigApplier

`BenchmarkConfigApplier` is the only legitimate caller of `setGraphExecutionMode` on a `SameDiff` instance. If you need to apply a `BenchmarkConfig` to an existing pipeline outside of `BenchmarkRunner`, use it rather than calling `SameDiff` execution mode methods directly.

```java
import org.deeplearning4j.llm.benchmark.BenchmarkConfigApplier;

BenchmarkConfigApplier.apply(model, BenchmarkConfig.CUDA_GRAPHS);
```

### Decode Step Validation

The benchmark framework ships a suite of validation utilities for verifying that optimization changes do not alter numerical outputs.

```java
import org.deeplearning4j.llm.benchmark.DecodeValidationFramework;
import org.deeplearning4j.llm.benchmark.MultiLevelComparator;

DecodeValidationFramework validator = new DecodeValidationFramework(
    referenceModel,
    optimizedModel,
    new MultiLevelComparator(atol: 1e-3f, rtol: 1e-3f));

boolean pass = validator.validate("Test prompt for numerical equivalence.");
System.out.println("Validation: " + (pass ? "PASS" : "FAIL"));
```

***

## 11. VLM, Audio, and Other Modules

### samediff-vlm: Vision-Language Models

`samediff-vlm` extends the generation pipeline with image conditioning. The module handles image preprocessing (resize, normalize, patch extraction), image encoding via a vision encoder SameDiff graph, cross-attention injection into the language model, and the combined text-image generation loop.

```java
import org.deeplearning4j.vlm.VlmPipeline;
import org.deeplearning4j.vlm.VlmPipelineConfig;
import org.deeplearning4j.vlm.VlmGenerationResult;

SameDiff visionEncoder = SameDiff.load(new File("clip-vit-large.fb"), true);
SameDiff languageModel  = SameDiff.load(new File("llava-1.6-mistral-7b.fb"), true);

VlmPipelineConfig config = VlmPipelineConfig.builder()
    .visionEncoder(visionEncoder)
    .languageModel(languageModel)
    .tokenizer(TokenizerFactory.fromDirectory(Path.of("llava-tokenizer/")))
    .imageSize(336)                    // model-specific image resolution
    .build();

VlmPipeline vlm = new VlmPipeline(config);

BufferedImage image = ImageIO.read(new File("photo.jpg"));
String prompt = "Describe what is happening in this image in detail.";

VlmGenerationResult result = vlm.generate(image, prompt, DecodeOptions.defaults());
System.out.println(result.getText());
```

The `CLIPTokenizer` in `nd4j-tokenizers` is used by `samediff-vlm` to tokenize text for CLIP-family vision encoders. Text embeddings and image patch embeddings are concatenated in the language model's embedding space before the decode loop begins.

### samediff-audio: Whisper ASR

`samediff-audio` provides a complete Whisper automatic speech recognition pipeline, including mel spectrogram extraction, audio chunking for long audio, beam search decoding, and optional language detection.

```java
import org.deeplearning4j.audio.WhisperPipeline;
import org.deeplearning4j.audio.WhisperConfig;
import org.deeplearning4j.audio.TranscriptionResult;

SameDiff whisperModel = SameDiff.load(new File("whisper-large-v3.fb"), true);

WhisperConfig config = WhisperConfig.builder()
    .model(whisperModel)
    .language("en")           // or null for auto-detect
    .task(WhisperTask.TRANSCRIBE)
    .beamSize(5)
    .chunkLengthSeconds(30)   // Whisper processes 30-second chunks
    .build();

WhisperPipeline whisper = new WhisperPipeline(config);

// Input: 16 kHz mono PCM as INDArray
INDArray audio = loadAudio("interview.wav");

TranscriptionResult result = whisper.transcribe(audio);
System.out.println(result.getText());

// With timestamps
result.getSegments().forEach(seg ->
    System.out.printf("[%.2f → %.2f] %s%n",
        seg.getStart(), seg.getEnd(), seg.getText()));
```

#### WhisperArchitecture and GGUF Loading

Whisper models can be loaded directly from GGUF files (whisper.cpp format) using the `WhisperArchitecture` handler in `nd4j-ggml`. The `WhisperArchitecture` class implements `ModelArchitecture` and is detected automatically from the GGUF metadata key `general.architecture = "whisper"`. It builds a complete encoder-decoder SameDiff graph from the GGML weight tensors.

```java
import org.eclipse.deeplearning4j.audio.whisper.WhisperModel;
import org.eclipse.deeplearning4j.audio.whisper.WhisperConfig;
import org.eclipse.deeplearning4j.audio.whisper.WhisperDecoderResult;

// Load from GGUF (whisper.cpp format) — architecture auto-detected
WhisperModel model = WhisperModel.fromGgml(new File("ggml-large-v3.bin"));

// Or load from ONNX export (HuggingFace Optimum format)
WhisperModel model = WhisperModel.fromOnnx(new File("whisper-large-v3/"));
// Expects: encoder_model.onnx, decoder_model.onnx, tokenizer.json

// Transcribe an audio file (auto-resamples to 16 kHz if needed)
WhisperDecoderResult result = model.transcribe(new File("interview.wav"));
System.out.println(result.getText());

// With language and timestamp options
WhisperDecoderResult result = model.transcribe(
    new File("interview.wav"),
    "en",           // language code, or null for auto-detect
    "transcribe",   // task: "transcribe" or "translate"
    true);          // include timestamps

result.getSegments().forEach(seg ->
    System.out.printf("[%.2f -> %.2f] %s%n",
        seg.getStart(), seg.getEnd(), seg.getText()));

model.close();
```

#### Mel Filterbank Parameters

The mel spectrogram is extracted by a native C++ op (`whisper_mel_spectrogram`) that runs STFT, mel filterbank, and Whisper-specific log normalization in a single kernel. The fixed parameters for all standard Whisper variants are:

| Parameter     | Value                                                      | Description                                                |
| ------------- | ---------------------------------------------------------- | ---------------------------------------------------------- |
| `sampleRate`  | 16000 Hz                                                   | Required input sample rate                                 |
| `N_FFT`       | 400                                                        | FFT window size (\~25 ms at 16 kHz)                        |
| `hopLength`   | 160                                                        | Hop between frames (\~10 ms at 16 kHz)                     |
| `numMelBins`  | 80 (tiny/base/small/medium/large-v2), 128 (large-v3/turbo) | Mel filter count                                           |
| `chunkLength` | 30 seconds                                                 | Audio is padded or trimmed to this length                  |
| `numFrames`   | 3000                                                       | Frames per chunk: `(sampleRate * chunkLength) / hopLength` |

Log normalization applies `log10(max(mel, 1e-10))`, clamps values to `(max - 8.0)`, then scales with `(x + 4.0) / 4.0`.

`WhisperConfig` provides named presets for each model size:

```java
import org.eclipse.deeplearning4j.audio.whisper.WhisperConfig;

WhisperConfig cfg = WhisperConfig.largeV3();
// numMelBins=128, hiddenSize=1280, numAttentionHeads=20, 32 encoder + 32 decoder layers

WhisperConfig cfg = WhisperConfig.turbo();
// numMelBins=128, hiddenSize=1280, numAttentionHeads=20, 32 encoder + 4 decoder layers

WhisperConfig cfg = WhisperConfig.base();
// numMelBins=80, hiddenSize=512, numAttentionHeads=8, 6 encoder + 6 decoder layers
```

To extract mel features manually (e.g., for pre-processing pipelines):

```java
import org.eclipse.deeplearning4j.audio.feature.WhisperMelSpectrogram;

WhisperMelSpectrogram mel = new WhisperMelSpectrogram(WhisperConfig.largeV3());

// From an INDArray of raw samples at 16 kHz
INDArray melFeatures = mel.extractFeatures(audioSamples);
// Output shape: [1, 128, 3000]

// Directly from a WAV file (resamples automatically)
INDArray melFeatures = mel.extractFeaturesFromFile(new File("audio.wav"));
```

#### Beam-Search Decoder

The Whisper decode loop is driven by `GenerationPipeline` in encoder-decoder mode. Greedy decoding is the default. To use beam search, configure the sampling to select the top-beam paths:

```java
import org.eclipse.deeplearning4j.audio.whisper.WhisperModel;
import org.eclipse.deeplearning4j.audio.whisper.WhisperConfig;
import org.eclipse.deeplearning4j.llm.generation.KvCacheStrategy;
import org.eclipse.deeplearning4j.llm.generation.SamplingConfig;

WhisperModel model = WhisperModel.builder()
    .encoder(encoder)
    .decoder(decoder)
    .config(WhisperConfig.largeV3())
    .kvCacheStrategy(KvCacheStrategy.STATIC)
    .samplingConfig(SamplingConfig.greedy())  // greedy (default) or beam
    .maxTokens(448)
    .build();

WhisperDecoderResult result = model.transcribe(new File("audio.wav"));
```

The encoder output (shape `[1, seqLen, hiddenSize]`) is computed once and then injected into every decoder cross-attention step via `ModelIOConfig.encoderDecoder(true)`. Special tokens (`SOT`, language token, task token) form the decoder prompt; generation stops on `EOT`.

### nd4j-torchscript: PyTorch Model Import

`nd4j-torchscript` imports TorchScript (`.pt`) files exported from PyTorch into native SameDiff graphs. This allows any PyTorch model that can be `torch.jit.traced` or `torch.jit.scripted` to be run without any Python dependency at inference time.

```java
import org.nd4j.torchscript.TorchScriptImporter;

// Export from PyTorch:
// traced = torch.jit.trace(model, example_input)
// traced.save("model.pt")

SameDiff sd = TorchScriptImporter.importModel(Path.of("model.pt"));

// Run inference
Map<String, INDArray> inputs = Map.of("input", inputTensor);
Map<String, INDArray> outputs = sd.outputAll(inputs);
```

Supported op coverage includes all ops commonly used in transformer architectures: matrix multiply, layer norm, softmax, attention, RoPE, SiLU/GELU activations, and element-wise operations. Unsupported ops will raise `TorchScriptImportException` with the op name.

### nd4j-web: Browser Frontend for ND4J Graphs

`nd4j-web` provides a TypeScript/FlatBuffers-based web frontend for visualizing and executing ND4J computation graphs in a browser. Graphs are serialized to FlatBuffers format and served over a lightweight HTTP endpoint. This is primarily useful for debugging graph structure and for building web-based tooling around ND4J models.

```java
import org.nd4j.web.Nd4jWebServer;

Nd4jWebServer server = Nd4jWebServer.builder()
    .port(8080)
    .graph(model)
    .build();

server.start();
System.out.println("ND4J graph viewer at http://localhost:8080");
```

Navigate to `http://localhost:8080` to see the graph structure, inspect variable shapes, and trigger execution from the browser.

***

## 12. OCR Operations

The `samediff-vlm` module ships a native document OCR subsystem built on top of the VLM inference pipeline. It replaces external OCR libraries (Tesseract, EasyOCR, cloud APIs) with GPU-accelerated model-based recognition that runs end-to-end inside SameDiff.

### Architecture

```
AbstractOCREngine
       │
       └── DeepSeekOCREngine
               │
               ├── Vision Encoder  (SameDiff/ONNX)   image → feature tensor
               └── Text Decoder    (SameDiff/ONNX)   features → text + bounding boxes
```

### Core Classes

| Class               | Role                                                                                             |
| ------------------- | ------------------------------------------------------------------------------------------------ |
| `AbstractOCREngine` | Abstract base; defines the `recognize(File)` / `recognize(BufferedImage)` contract               |
| `DeepSeekOCREngine` | Concrete implementation backed by a vision encoder + text decoder                                |
| `OCRResult`         | Output: list of `TextRegion` objects plus full concatenated text and overall confidence          |
| `OCRConfig`         | Image preprocessing parameters: `imageSize` (default 1024), `imageMean`, `imageStd`, `maxTokens` |
| `TextRegion`        | Per-region data: bounding box `[x, y, width, height]`, text, confidence, detected language       |

### Loading and Running OCR

```java
import org.eclipse.deeplearning4j.vlm.input.ocr.DeepSeekOCREngine;
import org.eclipse.deeplearning4j.vlm.input.ocr.OCRConfig;
import org.eclipse.deeplearning4j.vlm.input.ocr.OCRResult;
import org.eclipse.deeplearning4j.vlm.input.ocr.TextRegion;

// Load model from directory containing vision_encoder.onnx and text_decoder.onnx
DeepSeekOCREngine engine = DeepSeekOCREngine.create(new File("deepseek-ocr/"));
engine.initialize();

// Recognize from a file
OCRResult result = engine.recognize(new File("document.png"));
System.out.println(result.getFullText());
System.out.printf("Confidence: %.2f%n", result.getConfidence());

// Per-region breakdown with bounding boxes
for (TextRegion region : result.getRegions()) {
    System.out.printf("[%s] (%.2f) bbox=%s%n",
        region.getText(), region.getConfidence(), region.getBbox());
}

engine.close();
```

### Custom Configuration

```java
import org.eclipse.deeplearning4j.vlm.input.ocr.OCRConfig;

OCRConfig config = OCRConfig.builder()
    .imageSize(1024)                          // resize to 1024x1024 (default)
    .imageMean(new float[]{0.485f, 0.456f, 0.406f})   // ImageNet mean
    .imageStd(new float[]{0.229f, 0.224f, 0.225f})    // ImageNet std
    .maxTokens(1024)                          // max decoder tokens per page
    .build();

DeepSeekOCREngine engine = DeepSeekOCREngine.create(new File("deepseek-ocr/"), config);
engine.initialize();
```

### Preprocessing Pipeline

The OCR engine reuses the `VLMImagePreprocessor` infrastructure:

1. **Resize**: scale input to `config.imageSize x config.imageSize`
2. **Normalize**: apply ImageNet mean/std: `[0.485, 0.456, 0.406]` / `[0.229, 0.224, 0.225]`
3. **Tile**: for high-resolution documents, split into overlapping tiles processed in parallel
4. **Tensor**: convert to `[1, 3, H, W]` float tensor

### Multi-Language Support

Language detection and switching happens inside the model — no per-language configuration is needed. The `DeepSeekOCREngine` supports 12+ scripts out of the box:

```java
List<String> langs = engine.getSupportedLanguages();
// ["en", "zh", "ja", "ko", "ar", "hi", "ru", "de", "fr", "es", "pt", "it"]
```

A single model handles all supported scripts. Detected per-region language is available on each `TextRegion.getLanguage()`.

### Implementing a Custom OCR Engine

Extend `AbstractOCREngine` to integrate a different backend:

```java
import org.eclipse.deeplearning4j.vlm.input.ocr.AbstractOCREngine;
import org.eclipse.deeplearning4j.vlm.input.ocr.OCRResult;

public class MyOCREngine extends AbstractOCREngine {

    @Override
    public void initialize() throws Exception {
        // load your models here
    }

    @Override
    public OCRResult recognize(File imageFile) throws Exception {
        return recognize(ImageIO.read(imageFile));
    }

    @Override
    public OCRResult recognize(BufferedImage image) throws Exception {
        // preprocess, run inference, return OCRResult
    }

    @Override
    public List<String> getSupportedLanguages() {
        return List.of("en");
    }

    @Override
    public void close() {
        // release resources
    }
}
```

***

## 13. SDX Serving Protocol (REST + gRPC)

The SDX serving layer exposes any `.sdz` or `.sdnb` model as a network service with a dual-protocol contract: a REST endpoint for binary NPZ payloads and a gRPC endpoint for strongly-typed tensor streaming. Both transports share the same execution core so there is no behavioral drift between them.

### REST: `POST /v1/models/{model_id}:run-npz`

The primary REST endpoint for production inference. The request body is an NPZ archive containing the input arrays; the response body is an NPZ archive containing the output arrays.

**Request**

```
POST /v1/models/my-llm-8b:run-npz
Content-Type: application/octet-stream

<NPZ binary body — input tensors keyed by variable name>

X-SDX-Input-Order:  ["input_ids", "attention_mask"]
X-SDX-Output-Specs: [{"name":"logits","dtype":1,"shape":[1,512,32000]}]
```

**Response**

```
HTTP/1.1 200 OK
Content-Type: application/octet-stream
X-SDX-Execution-Report: {"backend":"CUDA","device":0,"elapsed_ms":12.4}

<NPZ binary body — output tensors keyed by variable name>
```

**Custom Headers**

| Header                   | Direction | Description                                                                                                                    |
| ------------------------ | --------- | ------------------------------------------------------------------------------------------------------------------------------ |
| `X-SDX-Input-Order`      | Request   | JSON array of input tensor names, controlling the order they are mapped to the model's placeholders                            |
| `X-SDX-Output-Specs`     | Request   | JSON array of `{"name", "dtype", "shape"}` objects; required because the C ABI (`sdxRun`) needs caller-provided output buffers |
| `X-SDX-Execution-Report` | Response  | JSON object with backend, device ID, and wall-clock elapsed time for the execution                                             |

A JSON/base64 compatibility endpoint is also available for smaller or debugging payloads:

```
POST /v1/models/{model_id}:run
Content-Type: application/json

{"inputs": {"input_ids": {"dtype": "INT64", "shape": [1, 16], "data_b64": "..."}}}
```

### gRPC Protocol

The primary typed binary protocol. The proto contract is defined in `sdx_serving.proto`.

**Proto contract**

```protobuf
// libnd4j/include/dsp/runtime/bindings/python/sdx_serving.proto

message Tensor {
  bytes           data  = 1;   // raw little-endian binary
  repeated int64  shape = 2;
  int32           dtype = 3;   // SDX dtype code
}

message TensorSpec {
  string          name  = 1;
  repeated int64  shape = 2;
  int32           dtype = 3;
}

message RunRequest {
  string                   model_id     = 1;
  map<string, Tensor>      inputs       = 2;
  repeated TensorSpec      output_specs = 3;   // required: server allocates outputs
  repeated string          input_order  = 4;
}

message RunResponse {
  map<string, Tensor>      outputs      = 1;
  string                   exec_report  = 2;   // JSON execution metadata
}

service SdxServing {
  rpc Run (RunRequest) returns (RunResponse);
}
```

**Java gRPC client example**

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

ManagedChannel channel = ManagedChannelBuilder
    .forAddress("inference-host", 50051)
    .maxInboundMessageSize(256 * 1024 * 1024)  // raise beyond 4 MiB default for large tensors
    .usePlaintext()
    .build();

SdxServingGrpc.SdxServingBlockingStub stub = SdxServingGrpc.newBlockingStub(channel);

RunRequest request = RunRequest.newBuilder()
    .setModelId("my-llm-8b")
    .putInputs("input_ids", tensorFromArray(inputIds))
    .putInputs("attention_mask", tensorFromArray(mask))
    .addOutputSpecs(TensorSpec.newBuilder()
        .setName("logits")
        .addShape(1).addShape(512).addShape(32000)
        .setDtype(DType.FLOAT32_VALUE)
        .build())
    .build();

RunResponse response = stub.run(request);
Tensor logits = response.getOutputsOrThrow("logits");
```

### NPZ Payload Format

The NPZ format (NumPy archive) stores each tensor as a separate `.npy` file within a ZIP container. The key in the archive matches the tensor name expected by the model.

```python
# Python client example
import numpy as np
import requests
import io

# Build request
buf = io.BytesIO()
np.savez(buf,
    input_ids=np.array([[1, 2, 3, 4]], dtype=np.int64),
    attention_mask=np.ones((1, 4), dtype=np.int64))
buf.seek(0)

resp = requests.post(
    "http://inference-host:8080/v1/models/my-llm-8b:run-npz",
    data=buf.read(),
    headers={
        "Content-Type": "application/octet-stream",
        "X-SDX-Input-Order": '["input_ids","attention_mask"]',
        "X-SDX-Output-Specs": '[{"name":"logits","dtype":1,"shape":[1,4,32000]}]'
    })

outputs = np.load(io.BytesIO(resp.content))
logits = outputs["logits"]   # shape [1, 4, 32000]
```

### Execution Lifecycle

Both transports use the same server-side execution sequence:

1. Load model into the runtime registry (`sdx_sdk_runner.py`)
2. Create a context for the request
3. Decode input tensors via the shared codec (`sdx_tensor_transport.py`)
4. Call `sdxRun(...)` on the C runtime — caller-provided output buffers must be allocated from `X-SDX-Output-Specs` / `output_specs`
5. Encode output tensors and return
6. Context released; model stays loaded for subsequent requests

***

## 14. VLM Multi-GPU Inference Pipeline

The `samediff-vlm` module includes a dedicated multi-GPU pipeline for Vision-Language Models (VLMs) such as SmolDocling. VLMs combine a vision encoder (processes images) with a language decoder (generates text), and these two components have very different memory profiles. The multi-GPU pipeline assigns them to separate GPUs to maximize available memory for each.

### Architecture Overview

```
VLMPipelineExecutor
       │
       ├── MultiPartModelLoader
       │     ├── vision_encoder.sdz  → encoder GPU (e.g. RTX 3070 Ti, 8 GB)
       │     ├── embed_tokens.sdz    → decoder GPU (e.g. RTX 4090, 24 GB)
       │     └── decoder.sdz         → decoder GPU
       │
       ├── ImageTiler
       │     └── splits pages into tiles; parallel encoding
       │
       └── VLMImagePreprocessor
             └── resize / normalize / patch extraction per tile
```

**GPU assignment:**

* **Decoder GPU** (largest available, selected by `selectBestGpu()`): decoder model constants, token embedding, and the autoregressive KV-cache growth loop.
* **Encoder GPU** (next-best): vision encoder model constants and per-tile encoding. Released after all pages are encoded.

### Maven Dependency

```xml
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-vlm</artifactId>
    <version>${dl4j.version}</version>
</dependency>
```

### MultiPartModelLoader

VLMs are stored as separate `.sdz` files — one per sub-model. `MultiPartModelLoader` loads them and assigns each to the correct device:

```java
import org.deeplearning4j.vlm.MultiPartModelLoader;
import org.deeplearning4j.vlm.VisionLanguageModel;

// Expects vision_encoder.sdz, embed_tokens.sdz, decoder.sdz in modelDirectory
VisionLanguageModel vlm = MultiPartModelLoader.load(new File("smol-docling/"));
// The loader calls selectBestGpu() to assign the decoder and picks the
// next-best GPU for the encoder automatically.
```

You can also control device assignment explicitly:

```java
VisionLanguageModel vlm = MultiPartModelLoader.builder()
    .modelDirectory(new File("smol-docling/"))
    .encoderDeviceId(1)     // RTX 3070 Ti (8 GB)
    .decoderDeviceId(0)     // RTX 4090 (24 GB)
    .build()
    .load();
```

### VLMPipelineExecutor — End-to-End Usage

`VLMPipelineExecutor` is the single entry point for VLM inference. It coordinates image preprocessing, tile encoding, cross-device transfers, and the autoregressive decode loop:

```java
import org.deeplearning4j.vlm.VLMPipelineExecutor;
import org.deeplearning4j.vlm.VLMPipelineConfig;
import org.deeplearning4j.vlm.VlmGenerationResult;
import org.nd4j.tokenizers.TokenizerFactory;

VLMPipelineConfig config = VLMPipelineConfig.builder()
    .model(vlm)
    .tokenizer(TokenizerFactory.fromDirectory(new File("smol-docling/")))
    .maxTokens(2048)
    .build();

VLMPipelineExecutor executor = new VLMPipelineExecutor(config);

BufferedImage page = ImageIO.read(new File("document-page-1.png"));
VlmGenerationResult result = executor.generate(page, "Describe the layout of this page.");

System.out.println(result.getText());
executor.close();
```

### ImageTiler — Multi-Page Documents

`ImageTiler` splits high-resolution or multi-page inputs into fixed-size tiles. For document-understanding tasks each page is processed as a separate tile, and encoding is pipelined so that page N+1 preprocessing (CPU-bound) overlaps with page N encoding (GPU-bound):

```java
import org.deeplearning4j.vlm.ImageTiler;
import org.deeplearning4j.vlm.TilerConfig;

ImageTiler tiler = ImageTiler.builder()
    .tileWidth(560)         // pixels per tile (model-specific)
    .tileHeight(560)
    .overlapPixels(56)      // overlap between adjacent tiles
    .build();

// Split a large document scan into tiles
List<BufferedImage> tiles = tiler.tile(documentImage);

// Or pass a multi-page PDF list directly to VLMPipelineExecutor
List<BufferedImage> pages = loadPdfPages(new File("report.pdf"));
VlmGenerationResult result = executor.generateFromPages(pages, "Extract all invoice totals.");
```

### Encoder-GPU / Decoder-GPU Device Affinity

A single-thread executor pins all encoder work to the encoder device. This prevents CUDA context switching and isolates each GPU's memory pools and streams:

```java
// This is done internally by VLMPipelineExecutor; shown here for reference.
ExecutorService encoderExecutor = Executors.newSingleThreadExecutor(r -> {
    Thread t = new Thread(() -> {
        DeviceMemoryManager.switchDevice(encoderDeviceId);
        r.run();
    });
    t.setDaemon(true);
    return t;
});
```

Cross-device transfers (encoder output → decoder input) use `CudaAffinityManager.replicateToDevice`. On GPU pairs that support NVLink, the transfer is direct (device-to-device). On non-P2P pairs the transfer is staged through host memory (D2H + H2D).

### Deferred Vision-Encoder Release

After all pages are encoded, the vision encoder model is freed. This recovers 5–8 GB of GPU memory on the encoder device (or shared device on single-GPU systems) before the decode loop begins:

```java
// Called automatically by VLMPipelineExecutor after encoding all pages.
// To trigger manually in a custom pipeline:
vlm.freeVisionEncoder();
// → SameDiff graph closed, constant arrays freed.
// GPU memory freed: 5–8 GB now available for decoder KV-cache growth.
```

On single-GPU setups the encoder and decoder share one device. The encoder must complete and be released before the decoder's KV cache can grow freely. This serializes encoding and decoding but is handled transparently by `VLMPipelineExecutor`.

### Decode Loop Integration

The decode loop uses `DynamicShapePlan` to handle the growing KV cache across thousands of steps. The pipeline follows this sequence per token:

1. Embed the current token ID through the `embed_tokens` model on the decoder GPU.
2. On the first step, concatenate vision features (transferred from the encoder GPU) with the token embeddings.
3. Execute the decoder with `DynamicShapePlan` (handles shape changes as the KV cache grows).
4. Select the next token by argmax on the output logits.
5. Stop if the end-of-sequence token is produced.
6. Reuse intermediate arrays across steps (one persistent array per slot — no per-step allocate/free overhead).

### Configuration Reference

| Option                     | Type      | Default       | Description                                            |
| -------------------------- | --------- | ------------- | ------------------------------------------------------ |
| `encoderDeviceId`          | `int`     | auto          | GPU device ID for the vision encoder                   |
| `decoderDeviceId`          | `int`     | auto          | GPU device ID for the language decoder                 |
| `tileWidth` / `tileHeight` | `int`     | model default | Tile size in pixels for `ImageTiler`                   |
| `overlapPixels`            | `int`     | 0             | Tile overlap to avoid edge artifacts                   |
| `maxTokens`                | `int`     | 2048          | Maximum generated tokens per page                      |
| `freeEncoderAfterEncoding` | `boolean` | `true`        | Release encoder GPU memory after all pages are encoded |
| `pipelineParallelism`      | `boolean` | `true`        | Overlap page N+1 preprocessing with page N encoding    |

### Performance Notes

* SmolDocling on RTX 4090 (24 GB) + RTX 3070 Ti (8 GB): approximately 87–92 tok/s steady-state decode with CUDA graph replay and Triton fusion.
* Vision encoder: approximately 150 ms per page (1962 DSP ops per frame on native executor).
* After encoder release: approximately 5.3 GB baseline GPU usage (model constants) with approximately 1 MB/step memory growth in the decode loop.
* For single-GPU systems, the pipeline falls back to serial encode-then-decode automatically. Multi-GPU provides the pipeline-parallelism advantage only when two or more GPUs are available.

***

## Next Steps

* **Getting Started:** See the [Quickstart](/en-1.0.0-rewrite/deeplearning4j/quickstart.md) for setting up the Maven project and running your first model.
* **SameDiff Graph Execution:** Review the [SameDiff Execution documentation](/en-1.0.0-rewrite/nd4j/overview-2/execution.md) to understand how `GenerationPipeline` integrates with the DSP plan lifecycle.
* **OmniHub Model Zoo:** Use [OmniHub](/en-1.0.0-rewrite/omnihub/usage.md) to download pre-converted LLM weights in the SameDiff FlatBuffers format without manual conversion.
* **Performance Tuning:** See [GPU/CPU Configuration](/en-1.0.0-rewrite/configuration/gpu-cpu.md) and [Memory and Workspaces](/en-1.0.0-rewrite/core-concepts/memory-and-workspaces.md) for hardware-specific tuning guidance that applies to LLM inference.
* **CUDA Graphs:** The `BenchmarkConfig.CUDA_GRAPHS` preset delivers the lowest decode latency on NVIDIA GPUs; see the [CUDA backend documentation](/en-1.0.0-rewrite/nd4j/overview-1/cuda.md) for prerequisites.


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://deeplearning4j.konduit.ai/en-1.0.0-rewrite/deeplearning4j/overview-4.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
