# LLM & VLM Stack

Deeplearning4j 1.0.0-rewrite ships a full large language model (LLM) and vision-language model (VLM) application stack built on top of SameDiff. The stack is organized into six new Maven modules that together cover every layer of inference: tokenization, generation, KV cache management, speculative decoding, continuous batching, evaluation, benchmarking, model editing, audio transcription, and a web frontend for ND4J graphs.

This page gives a complete reference for all six modules, with API details and working Java code examples for the most important classes.

***

## 1. Overview and Module Map

The LLM stack sits above the existing SameDiff execution engine. SameDiff handles op dispatch and graph execution; the new modules provide the inference-specific infrastructure that production LLM serving requires.

```
Your Application
       │
       ▼
samediff-llm       ← generation pipeline, KV cache, speculative decoding,
                     continuous batching, tokenizers, evaluation,
                     benchmarking, model editing
samediff-vlm       ← vision-language model support (image+text)
samediff-audio     ← Whisper ASR support
nd4j-tokenizers    ← Rust-backed HuggingFace / SentencePiece / CLIP tokenizers
nd4j-torchscript   ← TorchScript / PyTorch model import
nd4j-web           ← TypeScript / FlatBuffers frontend for ND4J graphs
       │
       ▼
SameDiff (ND4J)    ← op execution, graph optimization, DSP plan lifecycle
       │
       ▼
libnd4j (C++)      ← CPU/CUDA kernels, BLAS, cuDNN
```

The six modules are independent of each other except that `samediff-vlm` and `samediff-audio` depend on `samediff-llm`, and all of them depend on `nd4j-tokenizers`.

***

## 2. Maven Dependencies

Add only the modules you need. All modules share the same version string.
```xml
<properties>
  <dl4j.version>1.0.0-rewrite</dl4j.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-llm</artifactId>
    <version>${dl4j.version}</version>
  </dependency>
  <dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-vlm</artifactId>
    <version>${dl4j.version}</version>
  </dependency>
  <dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-audio</artifactId>
    <version>${dl4j.version}</version>
  </dependency>
  <dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-tokenizers</artifactId>
    <version>${dl4j.version}</version>
  </dependency>
  <dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-torchscript</artifactId>
    <version>${dl4j.version}</version>
  </dependency>
  <dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-web</artifactId>
    <version>${dl4j.version}</version>
  </dependency>
</dependencies>
```

***

## 3. Generation Pipeline

The generation pipeline is the unified entry point for all text generation tasks. It handles model I/O auto-discovery, embedding extraction, tokenization, decode loop construction, and configuration-driven optimization.

### Core Classes

| Class | Role |
| ------------------------------------------------ | --------------------------------------------------------- |
| `GenerationPipeline` | Top-level entry point; owns the decode loop and lifecycle |
| `GenerationPipelineConfig` | Builder-style configuration for the pipeline |
| `DecodeOptions` | Per-call generation parameters (temperature, top-k, etc.) |
| `GenerationResult` | Output: token IDs, decoded text, timing data |
| `TextGenerator` | Higher-level API with streaming callback support |
| `Sampler` / `GreedySampler` / `CompositeSampler` | Sampling strategy hierarchy |
| `SamplingConfig` / `SamplerUtils` | Temperature / top-k / top-p config and utilities |
| `DecoderInputBuilder` / `DecoderUtils` | Tensor construction for each decode step |
| `DecodeStepDiagnostics` | Per-step diagnostics: token IDs, logit stats, timing |

### Building a Pipeline

`GenerationPipelineConfig` uses a fluent builder. All fields are optional; the pipeline performs auto-discovery for any field not set.
```java
import java.io.File;

import org.deeplearning4j.llm.generation.GenerationPipeline;
import org.deeplearning4j.llm.generation.GenerationPipelineConfig;
import org.deeplearning4j.llm.generation.DecodeOptions;
import org.deeplearning4j.llm.generation.GenerationResult;
import org.deeplearning4j.llm.generation.SamplingConfig;
import org.deeplearning4j.llm.kv.KvCacheStrategy;
import org.nd4j.autodiff.samediff.SameDiff;

// Load or import your model
SameDiff model = SameDiff.load(new File("llama-3.1-8b.fb"), true);

// Configure the pipeline
GenerationPipelineConfig config = GenerationPipelineConfig.builder()
    .decoder(model)                          // SameDiff graph containing the decoder
    .tokenizer("tokenizer.json")             // path or classpath resource
    .maxTokens(2048)                         // maximum context length
    .samplingConfig(SamplingConfig.builder()
        .temperature(0.7f)
        .topK(50)
        .topP(0.9f)
        .repetitionPenalty(1.1f)
        .build())
    .kvCacheStrategy(KvCacheStrategy.PAGED)  // see KV Cache section
    .build();

GenerationPipeline pipeline = new GenerationPipeline(config);
```

### Running Generation

Pass per-call options via `DecodeOptions`. Settings in `DecodeOptions` override the pipeline-level `SamplingConfig` for that call only.

```java
DecodeOptions opts = DecodeOptions.builder()
    .temperature(0.8f)
    .topK(40)
    .maxNewTokens(512)
    .build();

GenerationResult result = pipeline.generate("Explain quantum entanglement.", opts);

System.out.println(result.getText());                            // decoded string
System.out.println(result.getTokenIds());                        // list of generated token IDs
System.out.printf("%.1f tok/s%n", result.getTokensPerSecond());  // throughput from timing data
```

### Streaming with TextGenerator

`TextGenerator` wraps `GenerationPipeline` and adds token-by-token streaming via a callback.
```java
import org.deeplearning4j.llm.generation.TextGenerator;

TextGenerator generator = new TextGenerator(pipeline);

generator.generate("Write a haiku about neural networks.", opts,
    token -> System.out.print(token));  // callback receives each decoded token

System.out.println();  // newline after streaming completes
```

### Sampling Strategies

`GreedySampler` always picks the highest-probability token. `CompositeSampler` chains a sequence of sampling transforms (temperature scaling, then top-k filtering, then top-p nucleus filtering) before the final argmax or categorical sample.

```java
import org.deeplearning4j.llm.generation.sampling.GreedySampler;
import org.deeplearning4j.llm.generation.sampling.CompositeSampler;
import org.deeplearning4j.llm.generation.sampling.SamplerUtils;

// Greedy decoding: deterministic, fast
Sampler greedy = new GreedySampler();

// Nucleus sampling: temperature → top-k → top-p → sample
Sampler nucleus = CompositeSampler.builder()
    .temperature(0.9f)
    .topK(100)
    .topP(0.95f)
    .build();
```

### Per-Step Diagnostics

Enable `DecodeStepDiagnostics` to capture detailed information about each decode step. This is useful for debugging generation quality issues.

```java
import org.deeplearning4j.llm.generation.DecodeStepDiagnostics;

pipeline.enableDiagnostics(true);
GenerationResult result = pipeline.generate("Hello, world!", opts);

for (DecodeStepDiagnostics step : result.getDiagnostics()) {
    System.out.printf("Step %d: token=%d logit_max=%.3f logit_entropy=%.3f time_ms=%d%n",
        step.getStep(),
        step.getSelectedTokenId(),
        step.getMaxLogit(),
        step.getLogitEntropy(),
        step.getStepTimeMs());
}
```

***

## 4. KV Cache Management

The key-value (KV) cache stores the attention keys and values computed during the prefill and previous decode steps, so that each new decode step only has to compute attention inputs for the newest token. Because the cache grows with every token of every active sequence, how it is allocated, evicted, and compressed is one of the largest levers for improving LLM serving throughput. The LLM stack provides a hierarchy of cache implementations covering these concerns.
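To make the memory pressure concrete, the cache footprint follows directly from the tensor shapes: every cached token stores one key vector and one value vector per layer and per KV head. The sketch below works through that arithmetic for a hypothetical model with 32 layers, 8 KV heads, a head dimension of 128, and FP16 storage (2 bytes per element), split into 16-token pages. `KvCacheSizing` and its methods are illustrative helpers written for this page, not part of the DL4J API.

```java
// Illustrative helper, not part of the DL4J API: estimates KV cache memory from model shape.
final class KvCacheSizing {

    /** Bytes needed to cache one token: a key and a value vector per layer and per KV head. */
    static long bytesPerToken(int numLayers, int numKvHeads, int headDim, int bytesPerElement) {
        return 2L * numLayers * numKvHeads * headDim * bytesPerElement; // 2 = key + value
    }

    /** Bytes needed for a pool of fixed-size pages, each holding pageSize tokens. */
    static long bytesForPages(long bytesPerToken, int pageSize, int maxPages) {
        return bytesPerToken * pageSize * maxPages;
    }

    public static void main(String[] args) {
        long perToken = bytesPerToken(32, 8, 128, 2);    // assumed model shape, FP16
        long pool = bytesForPages(perToken, 16, 1024);   // 1024 pages of 16 tokens
        System.out.printf("per token: %d KiB%n", perToken / 1024);          // per token: 128 KiB
        System.out.printf("page pool: %d MiB%n", pool / (1024 * 1024));     // page pool: 2048 MiB
    }
}
```

Under these assumptions a 16,384-token pool costs 2 GiB before any model weights are counted, which is why the paging, eviction, quantization, and offloading strategies described in this section matter.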
### Cache Strategy Overview

| Strategy | Class | When to Use |
| ---------------- | --------------------------- | ---------------------------------------- |
| Paged | `PagedKVCache` | Default; best memory utilization |
| Paged + eviction | `EvictablePagedKVCache` | Long conversations; evict old pages |
| Per-layer policy | `PerLayerPagedKVCache` | Different eviction per transformer layer |
| Quantized | `QuantizedPagedKVCache` | Memory-constrained GPUs; INT8/FP16 pages |
| MLA | `MLAKVCache` | DeepSeek Multi-head Latent Attention |
| Beam search | `BeamKVCacheManager` | Beam decoding with K beams |
| Speculative | `SpeculativeKVCacheManager` | Speculative decoding draft/verify |
| Tiered | `TieredKVCacheManager` | GPU → host DRAM → disk tiering |
| Unified | `UnifiedKvCacheManager` | Single manager across all strategies |

### PagedKVCache

`PagedKVCache` partitions the cache into fixed-size pages. Sequences are allocated pages on demand, and eviction is O(1): a page is simply reclaimed. This is the vLLM-style approach and is the default strategy.

```java
import org.deeplearning4j.llm.kv.PagedKVCache;
import org.nd4j.linalg.api.buffer.DataType;

PagedKVCache cache = PagedKVCache.builder()
    .pageSize(16)    // tokens per page
    .maxPages(1024)  // total pages (controls max memory)
    .numLayers(32)   // transformer layer count
    .numHeads(8)     // KV heads (use GQA count, not full head count)
    .headDim(128)    // per-head dimension
    .dtype(DataType.FLOAT16)
    .build();
```

### Eviction Policies

`EvictablePagedKVCache` adds eviction support.
Three built-in eviction policies are provided:

| Policy | Class | Description |
| ------------ | ---------------------------- | -------------------------------------------------------------------------------------- |
| LRU | default | Evict least recently used page |
| H2O | `H2OEvictionPolicy` | Heavy Hitter Oracle: evict low-importance tokens based on accumulated attention scores |
| StreamingLLM | `StreamingLLMEvictionPolicy` | Preserve attention sink tokens + recent sliding window |

```java
import org.deeplearning4j.llm.kv.EvictablePagedKVCache;
import org.deeplearning4j.llm.kv.eviction.H2OEvictionPolicy;
import org.deeplearning4j.llm.kv.eviction.StreamingLLMEvictionPolicy;
import org.deeplearning4j.llm.kv.eviction.AttentionSinkDetector;

// H2O eviction: works well for long document summarization
EvictablePagedKVCache h2oCache = EvictablePagedKVCache.builder()
    .pageSize(16)
    .maxPages(512)
    .numLayers(32)
    .numHeads(8)
    .headDim(128)
    .evictionPolicy(new H2OEvictionPolicy())
    .build();

// StreamingLLM: preserves attention sinks, keeps recent window
AttentionSinkDetector sinkDetector = new AttentionSinkDetector(4);  // numSinkTokens = 4

EvictablePagedKVCache streamingCache = EvictablePagedKVCache.builder()
    .pageSize(16)
    .maxPages(512)
    .numLayers(32)
    .numHeads(8)
    .headDim(128)
    .evictionPolicy(new StreamingLLMEvictionPolicy(sinkDetector, 256))  // windowSize = 256
    .build();
```

### Per-Layer Eviction

`PerLayerPagedKVCache` assigns a different `PerLayerKVPolicy` to each transformer layer. This is useful because attention patterns differ significantly between early and late layers.
```java
import java.util.ArrayList;
import java.util.List;

import org.deeplearning4j.llm.kv.PerLayerPagedKVCache;
import org.deeplearning4j.llm.kv.PerLayerKVPolicy;
import org.deeplearning4j.llm.kv.eviction.StreamingLLMEvictionPolicy;
import org.deeplearning4j.llm.kv.eviction.H2OEvictionPolicy;

// sinkDetector is the AttentionSinkDetector created in the previous example
List<PerLayerKVPolicy> policies = new ArrayList<>();
for (int layer = 0; layer < 32; layer++) {
    if (layer < 4) {
        // Early layers: protect attention sinks with StreamingLLM
        policies.add(PerLayerKVPolicy.of(new StreamingLLMEvictionPolicy(sinkDetector, 512)));
    } else {
        // Later layers: H2O importance-based eviction
        policies.add(PerLayerKVPolicy.of(new H2OEvictionPolicy()));
    }
}

PerLayerPagedKVCache perLayerCache = PerLayerPagedKVCache.builder()
    .pageSize(16)
    .maxPages(512)
    .perLayerPolicies(policies)
    .numHeads(8)
    .headDim(128)
    .build();
```

### Quantized KV Cache

`QuantizedPagedKVCache` stores pages in INT8 or FP16 and dequantizes on read. Relative to an FP32 cache, this roughly halves (FP16) or quarters (INT8) the memory footprint, typically with minimal accuracy impact on most models.

```java
import org.deeplearning4j.llm.kv.QuantizedPagedKVCache;
import org.deeplearning4j.llm.kv.QuantizationMode;

QuantizedPagedKVCache quantCache = QuantizedPagedKVCache.builder()
    .pageSize(16)
    .maxPages(2048)  // INT8 fits 2x the pages of an FP16 cache in the same memory (4x vs FP32)
    .numLayers(32)
    .numHeads(8)
    .headDim(128)
    .quantizationMode(QuantizationMode.INT8)  // or FP16
    .build();
```

### KV Cache Offloading

For very long contexts, the cache can be offloaded from GPU VRAM to host DRAM or disk.
```java
import java.nio.file.Path;

import org.deeplearning4j.llm.kv.offload.KVCacheHostOffloader;
import org.deeplearning4j.llm.kv.offload.KVCacheDiskOffloader;

// Offload evicted pages to host DRAM (PCIe transfer on demand)
KVCacheHostOffloader hostOffloader = KVCacheHostOffloader.builder()
    .maxHostBytes(8L * 1024 * 1024 * 1024)   // 8 GB host RAM
    .asyncTransfer(true)
    .build();

// Offload evicted pages to disk (NVMe SSD recommended)
KVCacheDiskOffloader diskOffloader = KVCacheDiskOffloader.builder()
    .storagePath(Path.of("/tmp/kvcache"))
    .maxDiskBytes(64L * 1024 * 1024 * 1024)  // 64 GB
    .build();
```

Use `TieredKVCacheManager` to combine GPU, host, and disk tiers automatically:

```java
import org.deeplearning4j.llm.kv.TieredKVCacheManager;

TieredKVCacheManager tieredManager = TieredKVCacheManager.builder()
    .gpuCache(cache)
    .hostOffloader(hostOffloader)
    .diskOffloader(diskOffloader)
    .build();
```

### Prefix Sharing

`KVCachePrefixTree` and `RadixPrefixCache` enable sharing KV cache pages across requests that share a common prompt prefix (e.g., a system prompt). Matching prefixes are detected and their cached pages are reused rather than recomputed.

```java
import org.deeplearning4j.llm.kv.prefix.RadixPrefixCache;

RadixPrefixCache prefixCache = RadixPrefixCache.builder()
    .pageSize(16)
    .maxEntries(10000)
    .build();

// The pipeline will check prefixCache before computing prefill
pipeline.setPrefixCache(prefixCache);
```

### KV Cache Checkpointing

Save and restore cache state to disk, enabling pause-and-resume of long generation sessions.

```java
import java.nio.file.Path;

import org.deeplearning4j.llm.kv.checkpoint.KVCacheCheckpointManager;

KVCacheCheckpointManager checkpointMgr = new KVCacheCheckpointManager(cache);

// Save
checkpointMgr.checkpoint(Path.of("/tmp/kvcache_checkpoint.bin"));

// Restore
checkpointMgr.restore(Path.of("/tmp/kvcache_checkpoint.bin"));
```

***

## 5. Speculative Decoding

Speculative decoding uses a fast draft model (or an n-gram heuristic) to propose multiple tokens ahead, then verifies them in a single forward pass of the full target model. Accepted draft tokens cost no extra target-model passes; generation resumes from the first rejected position. On hardware where the target model is memory-bandwidth-bound, speculative decoding can deliver roughly 2-3x higher throughput, depending on the draft acceptance rate.

### Speculator Implementations

| Class | Draft Source | Notes |
| ---------------------- | ----------------------------- | --------------------------- |
| `NgramSpeculator` | N-gram from generated context | No secondary model required |
| `DraftModelSpeculator` | Smaller SameDiff model | Highest acceptance rate |

### NgramSpeculator

Uses an n-gram index built from the tokens already generated in the current sequence. No additional model weights are required.

```java
import org.deeplearning4j.llm.speculative.NgramSpeculator;
import org.deeplearning4j.llm.speculative.SpeculativeDecodeLoop;

NgramSpeculator speculator = NgramSpeculator.builder()
    .ngramOrder(4)   // use 4-gram drafts
    .draftLength(5)  // propose up to 5 tokens per step
    .build();

SpeculativeDecodeLoop loop = SpeculativeDecodeLoop.builder()
    .targetModel(model)
    .speculator(speculator)
    .kvCache(cache)
    .build();

GenerationResult result = loop.generate("Once upon a time", opts);
System.out.printf("Accepted %.1f%% of draft tokens%n",
    result.getSpeculativeAcceptanceRate() * 100);
```

### DraftModelSpeculator

Uses a smaller, faster model to generate draft tokens. The draft model should share the same vocabulary as the target model.
```java
import java.io.File;

import org.deeplearning4j.llm.speculative.DraftModelSpeculator;

SameDiff draftModel = SameDiff.load(new File("llama-3.1-1b.fb"), true);

// draftCache is a separate, smaller KV cache built for the draft model (construction not shown)
DraftModelSpeculator speculator = DraftModelSpeculator.builder()
    .draftModel(draftModel)
    .draftLength(7)            // draft up to 7 tokens
    .draftKvCache(draftCache)  // separate smaller cache for the draft model
    .build();

SpeculativeDecodeLoop loop = SpeculativeDecodeLoop.builder()
    .targetModel(model)
    .speculator(speculator)
    .kvCache(cache)
    .verifier(new TreeAttentionVerifier())  // parallel tree-based verification
    .build();
```

### Tree Attention Verification

`TreeAttentionVerifier` organizes draft tokens into a tree structure and verifies all candidates in parallel with a single batched forward pass of the target model. This maximizes GPU utilization during the verification step.

The tree verifier is selected automatically when `draftLength > 1` and is the recommended choice for `DraftModelSpeculator`. It requires no additional configuration beyond being set on the `SpeculativeDecodeLoop`.

### Throughput and Auto-Disable

For structured or repetitive outputs (code, lists, repeated phrases), n-gram speculation can achieve **2-5x throughput improvement** over greedy decode because the n-gram index captures recurring patterns with high acceptance rates. Gains are smaller on free-form text, where fewer drafts are accepted.

A probe mechanism monitors acceptance rates automatically. If the target model cannot handle multi-token input (for example, some encoder-decoder models like SmolDocling that use cached cross-attention), the probe detects the failure, disables speculation for a cooldown period, then re-enables it to retry.
This makes `SpeculativeDecodeLoop` safe to use without knowing in advance whether a given model supports speculative execution:

```java
SpeculativeDecodeLoop loop = SpeculativeDecodeLoop.builder()
    .targetModel(model)
    .speculator(NgramSpeculator.builder()
        .ngramOrder(4)
        .draftLength(5)
        .build())
    .kvCache(cache)
    // Probe mechanism is enabled by default; no extra configuration required.
    // The loop logs a warning and falls back to greedy on unsupported models.
    .build();
```

***

## 6. Continuous Batching

Continuous batching (sometimes called in-flight batching) keeps the GPU fully saturated by interleaving prefill and decode steps across multiple requests. Unlike static batching, where a batch waits until all sequences in it complete, continuous batching allows new requests to be admitted and completed sequences to exit at any decode step.

### Architecture

```
Incoming requests
       │
       ▼
ContinuousBatchScheduler
  │  assigns requests to batch slots
  │  manages per-request BatchGenerationState
  │
  ├──► ChunkedPrefillEngine   ← breaks long prompts into chunks
  │                             processed in decode steps alongside ongoing sequences
  │
  └──► Decode step (full batch)
           │
           ▼
       BatchCompactor
         removes completed sequences, compacts active slots
```

### ContinuousBatchScheduler

```java
import java.util.concurrent.CompletableFuture;

import org.deeplearning4j.llm.batch.ContinuousBatchScheduler;
import org.deeplearning4j.llm.batch.ChunkedPrefillEngine;

ChunkedPrefillEngine prefillEngine = ChunkedPrefillEngine.builder()
    .chunkSize(512)  // tokens of prefill to process per step
    .build();

ContinuousBatchScheduler scheduler = ContinuousBatchScheduler.builder()
    .maxBatchSize(32)  // max concurrent sequences
    .maxSeqLen(4096)
    .model(model)
    .kvCache(cache)
    .prefillEngine(prefillEngine)
    .build();

scheduler.start();

// Submit requests (thread-safe; can be called from multiple threads)
CompletableFuture<GenerationResult> future1 = scheduler.submit("Summarize: " + longDocument, opts);
CompletableFuture<GenerationResult> future2 = scheduler.submit("Translate to French: Hello world", opts);

// get() blocks until the request completes and throws checked exceptions on failure
GenerationResult r1 = future1.get();
GenerationResult r2 = future2.get();

scheduler.shutdown();
```

### ChunkedPrefillEngine

`ChunkedPrefillEngine` avoids the O(n²) attention memory cost of processing a long prompt in a single pass. It splits the prompt into fixed-size windows (`chunkSize` tokens) and processes each chunk sequentially, accumulating KV cache entries across chunks. Decoding for a request begins only after all of its chunks complete. This allows arbitrarily long prompts to be processed within a fixed GPU memory budget while keeping decode latency uniform across requests:

```java
import org.deeplearning4j.llm.batch.ChunkedPrefillEngine;

ChunkedPrefillEngine prefillEngine = ChunkedPrefillEngine.builder()
    .chunkSize(512)  // tokens of prefill to process per scheduler step
    .build();

// Attach to the scheduler; long-prompt requests are chunked automatically
ContinuousBatchScheduler scheduler = ContinuousBatchScheduler.builder()
    .prefillEngine(prefillEngine)
    // ... other config
    .build();
```

Chunk size is a latency-memory trade-off: smaller chunks use less memory per step but add more prefill steps before the first token is produced. 512 tokens is a practical starting point for most hardware.

### BatchGenerationState

`BatchGenerationState` tracks per-sequence state within the batch: current token position, KV cache page assignments, sampling state, and completion status. It is managed automatically by `ContinuousBatchScheduler` and is not normally accessed directly.

### BatchCompactor

`BatchCompactor` runs at the end of each decode step to remove completed sequences and compact the batch tensor so that the GPU kernel always operates on a dense, full-occupancy batch. It is attached to the scheduler automatically.

***

## 7. Tokenizers

The `nd4j-tokenizers` module provides tokenizers backed by Rust-native implementations for correctness and performance. All tokenizers implement the `Tokenizer` interface.
### Tokenizer Interface

```java
import java.util.List;
import java.util.Map;

import org.nd4j.tokenizers.Tokenizer;
import org.nd4j.tokenizers.Encoding;

public interface Tokenizer {
    Encoding encode(String text);
    String decode(List ids);
    Map specialTokens();
    int vocabSize();
    void close();
}
```

`Encoding` holds the token IDs, attention mask, and (optionally) token type IDs.

### HuggingFaceTokenizer

Loads any tokenizer in the standard `tokenizer.json` format as exported by Hugging Face `transformers`. Supports BPE, WordPiece, and Unigram models.

```java
import java.nio.file.Path;

import org.nd4j.tokenizers.Encoding;
import org.nd4j.tokenizers.HuggingFaceTokenizer;

HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.fromFile(Path.of("path/to/tokenizer.json"));

Encoding enc = tokenizer.encode("The quick brown fox");
System.out.println(enc.getIds());  // [791, 4996, 14198, 39935]

String decoded = tokenizer.decode(enc.getIds());
System.out.println(decoded);       // "The quick brown fox"

tokenizer.close();
```

### SentencePieceTokenizer

Loads SentencePiece BPE models (`.model` files), used by LLaMA, Gemma, Mistral, and other models that do not use the HuggingFace format.

```java
import java.nio.file.Path;

import org.nd4j.tokenizers.Encoding;
import org.nd4j.tokenizers.SentencePieceTokenizer;

SentencePieceTokenizer tokenizer = SentencePieceTokenizer.fromFile(Path.of("tokenizer.model"));

Encoding enc = tokenizer.encode("Hello, SentencePiece!");

tokenizer.close();
```

### CLIPTokenizer

A specialized tokenizer for CLIP-family vision-language models, following the byte-pair encoding used by the original OpenAI CLIP implementation.

```java
import java.nio.file.Path;

import org.nd4j.tokenizers.CLIPTokenizer;
import org.nd4j.tokenizers.Encoding;

CLIPTokenizer tokenizer = CLIPTokenizer.fromFiles(
    Path.of("vocab.json"),
    Path.of("merges.txt"));

// Encode a text prompt for CLIP image-text alignment
Encoding enc = tokenizer.encode("a photo of a cat");

tokenizer.close();
```

### Chat Templates

`ChatTemplate` renders structured chat conversations into the prompt format expected by an instruction-tuned model.
It implements a Jinja2-subset template engine compatible with the `chat_template` field in HuggingFace `tokenizer_config.json`.

```java
import java.nio.file.Path;
import java.util.List;

import org.nd4j.tokenizers.ChatTemplate;
import org.nd4j.tokenizers.ChatMessage;

ChatTemplate template = ChatTemplate.fromTokenizerConfig(
    Path.of("tokenizer_config.json"));

List<ChatMessage> messages = List.of(
    ChatMessage.system("You are a helpful assistant."),
    ChatMessage.user("What is the capital of France?"),
    ChatMessage.assistant("The capital of France is Paris."),
    ChatMessage.user("What is its population?"));

String prompt = template.apply(messages, true);  // second argument: addGenerationPrompt
System.out.println(prompt);
// Produces the model-specific formatted prompt string
```

### TokenizerFactory

`TokenizerFactory` auto-detects the tokenizer type from the files present in a directory and instantiates the correct implementation.

```java
import java.nio.file.Path;

import org.nd4j.tokenizers.Tokenizer;
import org.nd4j.tokenizers.TokenizerFactory;

// Auto-detect from a directory containing tokenizer.json or tokenizer.model
Tokenizer tokenizer = TokenizerFactory.fromDirectory(Path.of("model-dir/"));
```

***

## 8. Evaluation Framework

The evaluation framework provides automated benchmarking of LLM capabilities across standard academic benchmarks and custom datasets.
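Most scoring in an evaluation run reduces to comparing a generated string against a reference. As one concrete illustration, token-level F1 (one of the metrics listed later in this section) can be sketched in a few lines. This is a minimal sketch that assumes whitespace tokenization and case-insensitive matching; `TokenF1Sketch` is a hypothetical class written for this page, and the built-in `F1` metric may normalize text differently.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch, not the DL4J F1 class: whitespace-tokenized, case-insensitive token F1.
final class TokenF1Sketch {

    static double f1(String prediction, String reference) {
        String[] pred = tokenize(prediction);
        String[] gold = tokenize(reference);
        if (pred.length == 0 || gold.length == 0) {
            // Two empty strings match perfectly; one empty side scores zero.
            return pred.length == gold.length ? 1.0 : 0.0;
        }

        // Count reference tokens so each one can be matched at most once.
        Map<String, Integer> goldCounts = new HashMap<>();
        for (String t : gold) {
            goldCounts.merge(t, 1, Integer::sum);
        }

        int overlap = 0;
        for (String t : pred) {
            Integer remaining = goldCounts.get(t);
            if (remaining != null && remaining > 0) {
                overlap++;
                goldCounts.put(t, remaining - 1);
            }
        }
        if (overlap == 0) {
            return 0.0;
        }

        double precision = (double) overlap / pred.length;
        double recall = (double) overlap / gold.length;
        return 2 * precision * recall / (precision + recall);
    }

    private static String[] tokenize(String s) {
        String trimmed = s.trim().toLowerCase();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}
```

For example, scoring the prediction "the cat sat" against the reference "the cat" gives precision 2/3 and recall 1, so F1 is 0.8.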
### Core Evaluation Classes

| Class | Role |
| ---------------------------- | ------------------------------------------------------------------ |
| `EvalRunner` | Orchestrates evaluation runs; parallelizes across dataset examples |
| `EvalConfig` | Dataset, benchmark, metric, and generation options |
| `EvalResult` | Aggregated result: per-benchmark scores, timing, sample results |
| `SampleResult` | Per-example output, prediction, and score |
| `PerplexityEvaluator` | Computes log-perplexity over a reference corpus |
| `GenerationQualityValidator` | Validates generation coherence (length, repetition, entropy) |
| `AnswerExtractor` | Extracts structured answers from free-form generated text |

### Running a Standard Benchmark

```java
import org.deeplearning4j.llm.eval.EvalRunner;
import org.deeplearning4j.llm.eval.EvalConfig;
import org.deeplearning4j.llm.eval.EvalResult;
import org.deeplearning4j.llm.eval.benchmarks.MMLUBenchmark;
import org.deeplearning4j.llm.eval.datasets.HuggingFaceDataset;

HuggingFaceDataset dataset = HuggingFaceDataset.load("cais/mmlu", "test");  // second argument: split

EvalConfig config = EvalConfig.builder()
    .benchmark(new MMLUBenchmark())
    .dataset(dataset)
    .pipeline(pipeline)
    .numShots(5)    // 5-shot evaluation
    .numWorkers(4)  // parallel evaluation workers
    .build();

EvalRunner runner = new EvalRunner(config);
EvalResult result = runner.run();

System.out.printf("MMLU accuracy: %.2f%%%n", result.getScore() * 100);
result.getPerSubjectScores().forEach((subject, score) ->
    System.out.printf("  %s: %.2f%%%n", subject, score * 100));
```

### Available Benchmarks

| Benchmark | Class | Measures |
| ---------- | --------------------- | ------------------------------------------------------ |
| MMLU | `MMLUBenchmark` | Massive Multitask Language Understanding (57 subjects) |
| ARC | `ArcBenchmark` | AI2 Reasoning Challenge (grade-school science) |
| GSM8K | `Gsm8kBenchmark` | Grade school math word problems |
| HellaSwag | `HellaSwagBenchmark` | Commonsense reasoning / sentence completion |
| TruthfulQA | `TruthfulQABenchmark` | Truthfulness and calibration |
| WinoGrande | `WinograndeBenchmark` | Pronoun coreference resolution |

### Metrics

| Metric | Class | Description |
| ---------------- | ------------------------ | ------------------------------------------------------- |
| Exact Match | `ExactMatch` | Binary: prediction equals gold label |
| F1 | `F1` | Token-level F1 between prediction and gold |
| BLEU | `BLEU` | N-gram precision (translation quality) |
| ROUGE | `ROUGE` | Recall-oriented n-gram overlap (summarization) |
| ANLS | `ANLS` | Average Normalized Levenshtein Similarity (document QA) |
| VQA Accuracy | `VqaAccuracy` | Soft accuracy for visual question answering |
| Relaxed Accuracy | `RelaxedAccuracy` | Case/punctuation-insensitive exact match |
| Multiple Choice | `MultipleChoiceAccuracy` | Accuracy over A/B/C/D choices |

```java
import org.deeplearning4j.llm.eval.metrics.ROUGE;
import org.deeplearning4j.llm.eval.metrics.RougeVariant;

// prediction and reference are the generated and gold strings to compare
ROUGE rouge = new ROUGE(RougeVariant.ROUGE_L);
double score = rouge.compute(prediction, reference);
```

### Dataset Sources

| Class | Loads From |
| -------------------- | ------------------------------------------------------ |
| `HuggingFaceDataset` | HuggingFace Hub (requires network) |
| `JsonlDataset` | Local JSONL file |
| `CsvDataset` | Local CSV file |
| `CustomDataset` | In-memory list of `(input, label)` pairs |
| `DatasetCache` | Wraps any dataset; caches to disk to avoid re-download |

```java
import java.nio.file.Path;

import org.deeplearning4j.llm.eval.datasets.JsonlDataset;
import org.deeplearning4j.llm.eval.datasets.DatasetCache;

JsonlDataset raw = JsonlDataset.builder()
    .path(Path.of("gsm8k_test.jsonl"))
    .inputField("question")
    .labelField("answer")
    .build();

// Cache to avoid re-reading the file on each evaluation run
DatasetCache cached = DatasetCache.wrap(raw, Path.of(".cache/gsm8k"));
```

### Perplexity

```java
import java.nio.file.Path;

import org.deeplearning4j.llm.eval.PerplexityEvaluator;

PerplexityEvaluator ppl = new PerplexityEvaluator(pipeline);
double perplexity = ppl.evaluate(Path.of("wikitext-103-test.txt"));
System.out.printf("Perplexity: %.2f%n", perplexity);
```

### Running All Standard Benchmarks

`EvalRunner` orchestrates evaluation runs and parallelizes across dataset examples using multiple worker threads. The example below runs all six built-in benchmarks back-to-back against the same pipeline:

```java
import java.nio.file.Path;
import java.util.List;

import org.deeplearning4j.llm.eval.EvalRunner;
import org.deeplearning4j.llm.eval.EvalConfig;
import org.deeplearning4j.llm.eval.EvalResult;
import org.deeplearning4j.llm.eval.benchmarks.MMLUBenchmark;
import org.deeplearning4j.llm.eval.benchmarks.ArcBenchmark;
import org.deeplearning4j.llm.eval.benchmarks.Gsm8kBenchmark;
import org.deeplearning4j.llm.eval.benchmarks.HellaSwagBenchmark;
import org.deeplearning4j.llm.eval.benchmarks.TruthfulQABenchmark;
import org.deeplearning4j.llm.eval.benchmarks.WinograndeBenchmark;
import org.deeplearning4j.llm.eval.datasets.HuggingFaceDataset;
import org.deeplearning4j.llm.eval.datasets.DatasetCache;

record BenchmarkSpec(String name, Object benchmark, String hfPath, String split) {}

List<BenchmarkSpec> specs = List.of(
    new BenchmarkSpec("MMLU",       new MMLUBenchmark(),       "cais/mmlu",   "test"),
    new BenchmarkSpec("ARC",        new ArcBenchmark(),        "allenai/arc", "test"),
    new BenchmarkSpec("GSM8K",      new Gsm8kBenchmark(),      "gsm8k",       "test"),
    new BenchmarkSpec("HellaSwag",  new HellaSwagBenchmark(),  "hellaswag",   "validation"),
    new BenchmarkSpec("TruthfulQA", new TruthfulQABenchmark(), "truthful_qa", "validation"),
    new BenchmarkSpec("WinoGrande", new WinograndeBenchmark(), "winogrande",  "validation")
);

for (BenchmarkSpec spec : specs) {
    HuggingFaceDataset dataset = HuggingFaceDataset.load(spec.hfPath(), spec.split());
    DatasetCache cached = DatasetCache.wrap(dataset, Path.of(".cache/" + spec.name().toLowerCase()));

    EvalConfig config = EvalConfig.builder()
        .benchmark(spec.benchmark())
        .dataset(cached)
        .pipeline(pipeline)
        .numShots(0)    // 0-shot by default; set higher for few-shot
        .numWorkers(4)
        .build();

    EvalResult result = new EvalRunner(config).run();
    System.out.printf("%-12s %.2f%%%n", spec.name(), result.getScore() * 100);
}
```

Illustrative output (scores vary by model):

```
MMLU         65.30%
ARC          80.10%
GSM8K        72.40%
HellaSwag    83.70%
TruthfulQA   48.90%
WinoGrande   74.20%
```

***

## 9. Model Editing / Abliteration

The model editing module provides tools for modifying model behavior by directly editing weight matrices. The primary use case implemented is *abliteration*: removing a model's refusal directions to understand or modify how refusal behavior is encoded in the model's weights. This is useful for research into model internals and for running ablations on safety-trained models in controlled research environments.

**Important:** These tools modify model weights irreversibly. Always work on a copy. Abliterated models should be used only within the bounds of your organization's AI safety policies.

### Abliteration Workflow

Abliteration works by:

1. Collecting activations for harmful and harmless prompt pairs (contrastive pairs).
2. Computing the mean activation difference between the two sets, which is taken as the "refusal direction".
3. Orthogonalizing all weight matrices in the model against the refusal direction using Gram-Schmidt.

This removes the direction from the model's weight space so the model cannot activate along it, effectively removing the refusal behavior.
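For a single direction, the Gram-Schmidt step in item 3 reduces to subtracting a projection: with unit vector u = r / ‖r‖, the edited matrix is W' = W − u (uᵀ W), so every column of W' is orthogonal to u. The sketch below shows that arithmetic on plain arrays. It assumes the direction has one entry per row of the matrix (that is, it lives in the matrix's output space); `DirectionProjectionSketch` is a hypothetical class written for this page and is not the library's `WeightOrthogonalizer`, whose layout conventions may differ.

```java
// Illustrative sketch, not the DL4J WeightOrthogonalizer: removes one direction from a matrix.
final class DirectionProjectionSketch {

    /**
     * Returns W' = W - u (u^T W), where u = direction / ||direction||.
     * Assumes direction has one entry per row of w.
     */
    static double[][] removeDirection(double[][] w, double[] direction) {
        double norm = 0.0;
        for (double d : direction) {
            norm += d * d;
        }
        norm = Math.sqrt(norm);
        if (norm == 0.0) {
            throw new IllegalArgumentException("direction must be non-zero");
        }

        int rows = w.length;
        int cols = w[0].length;
        double[][] out = new double[rows][cols];
        for (int c = 0; c < cols; c++) {
            // Component of column c along the unit direction u.
            double dot = 0.0;
            for (int r = 0; r < rows; r++) {
                dot += (direction[r] / norm) * w[r][c];
            }
            // Subtract that component so the column becomes orthogonal to u.
            for (int r = 0; r < rows; r++) {
                out[r][c] = w[r][c] - dot * (direction[r] / norm);
            }
        }
        return out;
    }
}
```

The input matrix is left untouched and a new array is returned, which matches the advice above to always work on a copy.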
```java
import java.io.File;
import java.util.List;

import org.deeplearning4j.llm.edit.AbliterationWorkflow;
import org.deeplearning4j.llm.edit.AbliterationConfig;
import org.deeplearning4j.llm.edit.AbliterationResult;
import org.deeplearning4j.llm.edit.DefaultPromptSets;

AbliterationConfig config = AbliterationConfig.builder()
    .model(model)
    .tokenizer(tokenizer)
    .harmfulPrompts(DefaultPromptSets.HARMFUL_PROMPTS)    // built-in set
    .harmlessPrompts(DefaultPromptSets.HARMLESS_PROMPTS)  // built-in set
    .targetLayers(List.of(15, 16, 17, 18))                // layers to edit
    .numActivationSamples(64)                             // prompts per direction
    .build();

AbliterationWorkflow workflow = new AbliterationWorkflow(config);
AbliterationResult result = workflow.run();

System.out.printf("Edited %d weight matrices%n", result.getNumEditedMatrices());

// Save the modified model
SameDiff editedModel = result.getEditedModel();
editedModel.save(new File("model-abliterated.fb"), true);
```

### RefusalDirectionFinder

Used internally by `AbliterationWorkflow`, but can also be used standalone to analyze where refusal behavior is most strongly encoded across layers.

```java
import java.util.List;

import org.deeplearning4j.llm.edit.RefusalDirectionFinder;
import org.deeplearning4j.llm.edit.RefusalDirection;

RefusalDirectionFinder finder = new RefusalDirectionFinder(model, tokenizer);

// harmfulPrompts and harmlessPrompts are the contrastive prompt sets to compare
List<RefusalDirection> directions = finder.find(
    harmfulPrompts,
    harmlessPrompts,
    List.of(0, 8, 16, 24, 31));  // layers to analyze

for (RefusalDirection dir : directions) {
    System.out.printf("Layer %d: direction norm=%.4f%n",
        dir.getLayer(), dir.getDirection().norm2Number().floatValue());
}
```

### WeightOrthogonalizer

Applies the Gram-Schmidt orthogonalization to remove a direction from a weight matrix. Used by `AbliterationWorkflow` but also available directly.
```java
import org.deeplearning4j.llm.edit.WeightOrthogonalizer;
import org.nd4j.linalg.api.ndarray.INDArray;

INDArray weightMatrix = model.getVariable("decoder/layer.16/mlp/down_proj/W").getArr();
INDArray direction = refusalDirection.getDirection();

INDArray edited = WeightOrthogonalizer.orthogonalize(weightMatrix, direction);
```

***

## 10. Benchmarking

The benchmark framework measures LLM inference throughput under controlled conditions. It distinguishes between three throughput regimes that capture different aspects of serving performance.

### Throughput Metrics

| Metric             | Description                                                            |
| ------------------ | ---------------------------------------------------------------------- |
| `lateSteady tok/s` | Tokens per second after full JIT warmup and cache warmup               |
| `steady tok/s`     | Tokens per second during the steady decode phase (most representative) |
| `decode tok/s`     | Tokens per second for the decode phase only (excludes prefill)         |

### BenchmarkConfig Presets

`BenchmarkConfig` ships four presets that control how the SameDiff graph is executed during the benchmark run.

| Preset       | Constant                       | Description                                                                |
| ------------ | ------------------------------ | -------------------------------------------------------------------------- |
| Optimal      | `BenchmarkConfig.OPTIMAL`      | Lets the system select the best execution mode automatically               |
| Slot-by-slot | `BenchmarkConfig.SLOT_BY_SLOT` | Executes one op at a time; useful for per-op profiling                     |
| Triton       | `BenchmarkConfig.TRITON`       | Routes eligible ops through Triton kernels (requires `tritonEnabled=true`) |
| CUDA Graphs  | `BenchmarkConfig.CUDA_GRAPHS`  | Captures and replays CUDA graphs; lowest decode latency on GPU             |

### Running a Benchmark

```java
import org.deeplearning4j.llm.benchmark.BenchmarkRunner;
import org.deeplearning4j.llm.benchmark.BenchmarkConfig;
import org.deeplearning4j.llm.benchmark.BenchmarkResult;

BenchmarkConfig config = BenchmarkConfig.OPTIMAL
        .withFp16PreCast(true)       // cast weights to FP16 before benchmarking
        .withGraphOptimizer(true)    // enable SameDiff graph fusion passes
        .withTritonEnabled(false);   // set true to enable Triton kernel routing

BenchmarkRunner runner = BenchmarkRunner.builder()
        .pipeline(pipeline)
        .config(config)
        .prompt("The quick brown fox jumps over the lazy dog.")
        .warmupIterations(50)
        .benchmarkIterations(200)
        .build();

BenchmarkResult result = runner.run();

System.out.printf("steady tok/s:     %.1f%n", result.getSteadyThroughput());
System.out.printf("decode tok/s:     %.1f%n", result.getDecodeThroughput());
System.out.printf("lateSteady tok/s: %.1f%n", result.getLateSteadyThroughput());
System.out.printf("mean decode ms:   %.2f%n", result.getMeanDecodeMs());
System.out.printf("p99 decode ms:    %.2f%n", result.getP99DecodeMs());
```

### BenchmarkConfigApplier

`BenchmarkConfigApplier` is the only legitimate caller of `setGraphExecutionMode` on a `SameDiff` instance. If you need to apply a `BenchmarkConfig` to an existing pipeline outside of `BenchmarkRunner`, use it rather than calling `SameDiff` execution mode methods directly.
```java
import org.deeplearning4j.llm.benchmark.BenchmarkConfigApplier;

BenchmarkConfigApplier.apply(model, BenchmarkConfig.CUDA_GRAPHS);
```

### Decode Step Validation

The benchmark framework ships a suite of validation utilities for verifying that optimization changes do not alter numerical outputs.

```java
import org.deeplearning4j.llm.benchmark.DecodeValidationFramework;
import org.deeplearning4j.llm.benchmark.MultiLevelComparator;

// Java has no named arguments; the two tolerances are passed positionally (atol, then rtol)
DecodeValidationFramework validator = new DecodeValidationFramework(
        referenceModel,
        optimizedModel,
        new MultiLevelComparator(1e-3f, 1e-3f));

boolean pass = validator.validate("Test prompt for numerical equivalence.");
System.out.println("Validation: " + (pass ? "PASS" : "FAIL"));
```

***

## 11. VLM, Audio, and Other Modules

### samediff-vlm: Vision-Language Models

`samediff-vlm` extends the generation pipeline with image conditioning. The module handles image preprocessing (resize, normalize, patch extraction), image encoding via a vision encoder SameDiff graph, cross-attention injection into the language model, and the combined text-image generation loop.
```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.nio.file.Path;
import javax.imageio.ImageIO;

import org.deeplearning4j.vlm.VlmPipeline;
import org.deeplearning4j.vlm.VlmPipelineConfig;
import org.deeplearning4j.vlm.VlmGenerationResult;
import org.nd4j.autodiff.samediff.SameDiff;

SameDiff visionEncoder = SameDiff.load(new File("clip-vit-large.fb"), true);
SameDiff languageModel = SameDiff.load(new File("llava-1.6-mistral-7b.fb"), true);

VlmPipelineConfig config = VlmPipelineConfig.builder()
        .visionEncoder(visionEncoder)
        .languageModel(languageModel)
        .tokenizer(TokenizerFactory.fromDirectory(Path.of("llava-tokenizer/")))
        .imageSize(336)   // model-specific image resolution
        .build();

VlmPipeline vlm = new VlmPipeline(config);

BufferedImage image = ImageIO.read(new File("photo.jpg"));
String prompt = "Describe what is happening in this image in detail.";

VlmGenerationResult result = vlm.generate(image, prompt, DecodeOptions.defaults());
System.out.println(result.getText());
```

The `CLIPTokenizer` in `nd4j-tokenizers` is used by `samediff-vlm` to tokenize text for CLIP-family vision encoders. Text embeddings and image patch embeddings are concatenated in the language model's embedding space before the decode loop begins.

### samediff-audio: Whisper ASR

`samediff-audio` provides a complete Whisper automatic speech recognition pipeline, including mel spectrogram extraction, audio chunking for long audio, beam search decoding, and optional language detection.
```java
import java.io.File;

import org.deeplearning4j.audio.WhisperPipeline;
import org.deeplearning4j.audio.WhisperConfig;
import org.deeplearning4j.audio.TranscriptionResult;

SameDiff whisperModel = SameDiff.load(new File("whisper-large-v3.fb"), true);

WhisperConfig config = WhisperConfig.builder()
        .model(whisperModel)
        .language("en")             // or null for auto-detect
        .task(WhisperTask.TRANSCRIBE)
        .beamSize(5)
        .chunkLengthSeconds(30)     // Whisper processes 30-second chunks
        .build();

WhisperPipeline whisper = new WhisperPipeline(config);

// Input: 16 kHz mono PCM as INDArray
INDArray audio = loadAudio("interview.wav");
TranscriptionResult result = whisper.transcribe(audio);

System.out.println(result.getText());

// With timestamps
result.getSegments().forEach(seg ->
        System.out.printf("[%.2f → %.2f] %s%n", seg.getStart(), seg.getEnd(), seg.getText()));
```

#### WhisperArchitecture and GGUF Loading

Whisper models can be loaded directly from GGUF files (whisper.cpp format) using the `WhisperArchitecture` handler in `nd4j-ggml`. The `WhisperArchitecture` class implements `ModelArchitecture` and is detected automatically from the GGUF metadata key `general.architecture = "whisper"`. It builds a complete encoder-decoder SameDiff graph from the GGML weight tensors.
```java
import java.io.File;

import org.eclipse.deeplearning4j.audio.whisper.WhisperModel;
import org.eclipse.deeplearning4j.audio.whisper.WhisperConfig;
import org.eclipse.deeplearning4j.audio.whisper.WhisperDecoderResult;

// Load from GGUF (whisper.cpp format); the architecture is auto-detected
WhisperModel model = WhisperModel.fromGgml(new File("ggml-large-v3.bin"));

// Alternatively, load from an ONNX export (HuggingFace Optimum format).
// Expects: encoder_model.onnx, decoder_model.onnx, tokenizer.json
// WhisperModel model = WhisperModel.fromOnnx(new File("whisper-large-v3/"));

// Transcribe an audio file (auto-resamples to 16 kHz if needed)
WhisperDecoderResult result = model.transcribe(new File("interview.wav"));
System.out.println(result.getText());

// With language and timestamp options
WhisperDecoderResult timed = model.transcribe(
        new File("interview.wav"),
        "en",           // language code, or null for auto-detect
        "transcribe",   // task: "transcribe" or "translate"
        true);          // include timestamps

timed.getSegments().forEach(seg ->
        System.out.printf("[%.2f -> %.2f] %s%n", seg.getStart(), seg.getEnd(), seg.getText()));

model.close();
```

#### Mel Filterbank Parameters

The mel spectrogram is extracted by a native C++ op (`whisper_mel_spectrogram`) that runs STFT, mel filterbank, and Whisper-specific log normalization in a single kernel.
The fixed parameters for all standard Whisper variants are:

| Parameter     | Value                                                      | Description                                                |
| ------------- | ---------------------------------------------------------- | ---------------------------------------------------------- |
| `sampleRate`  | 16000 Hz                                                   | Required input sample rate                                 |
| `N_FFT`       | 400                                                        | FFT window size (\~25 ms at 16 kHz)                        |
| `hopLength`   | 160                                                        | Hop between frames (\~10 ms at 16 kHz)                     |
| `numMelBins`  | 80 (tiny/base/small/medium/large-v2), 128 (large-v3/turbo) | Mel filter count                                           |
| `chunkLength` | 30 seconds                                                 | Audio is padded or trimmed to this length                  |
| `numFrames`   | 3000                                                       | Frames per chunk: `(sampleRate * chunkLength) / hopLength` |

Log normalization applies `log10(max(mel, 1e-10))`, clamps values from below at `(max - 8.0)`, where `max` is the largest log-mel value in the chunk, then scales with `(x + 4.0) / 4.0`.

`WhisperConfig` provides named presets for each model size:

```java
import org.eclipse.deeplearning4j.audio.whisper.WhisperConfig;

WhisperConfig largeV3 = WhisperConfig.largeV3();
// numMelBins=128, hiddenSize=1280, numAttentionHeads=20, 32 encoder + 32 decoder layers

WhisperConfig turbo = WhisperConfig.turbo();
// numMelBins=128, hiddenSize=1280, numAttentionHeads=20, 32 encoder + 4 decoder layers

WhisperConfig base = WhisperConfig.base();
// numMelBins=80, hiddenSize=512, numAttentionHeads=8, 6 encoder + 6 decoder layers
```

To extract mel features manually (e.g., for pre-processing pipelines):

```java
import java.io.File;

import org.eclipse.deeplearning4j.audio.feature.WhisperMelSpectrogram;

WhisperMelSpectrogram mel = new WhisperMelSpectrogram(WhisperConfig.largeV3());

// From an INDArray of raw samples at 16 kHz
INDArray melFeatures = mel.extractFeatures(audioSamples);
// Output shape: [1, 128, 3000]

// Directly from a WAV file (resamples automatically)
INDArray melFromFile = mel.extractFeaturesFromFile(new File("audio.wav"));
```

#### Beam-Search Decoder

The Whisper decode loop is driven by `GenerationPipeline` in encoder-decoder mode. Greedy decoding is the default.
The sampling strategy is set on the model builder through `samplingConfig(...)`. The example below selects greedy decoding explicitly; a beam configuration would be supplied through the same call:

```java
import java.io.File;

import org.eclipse.deeplearning4j.audio.whisper.WhisperModel;
import org.eclipse.deeplearning4j.audio.whisper.WhisperConfig;
import org.eclipse.deeplearning4j.audio.whisper.WhisperDecoderResult;
import org.eclipse.deeplearning4j.llm.generation.KvCacheStrategy;
import org.eclipse.deeplearning4j.llm.generation.SamplingConfig;

WhisperModel model = WhisperModel.builder()
        .encoder(encoder)
        .decoder(decoder)
        .config(WhisperConfig.largeV3())
        .kvCacheStrategy(KvCacheStrategy.STATIC)
        .samplingConfig(SamplingConfig.greedy())   // greedy (default) or beam
        .maxTokens(448)
        .build();

WhisperDecoderResult result = model.transcribe(new File("audio.wav"));
```

The encoder output (shape `[1, seqLen, hiddenSize]`) is computed once and then injected into every decoder cross-attention step via `ModelIOConfig.encoderDecoder(true)`. Special tokens (`SOT`, language token, task token) form the decoder prompt; generation stops on `EOT`.

### nd4j-torchscript: PyTorch Model Import

`nd4j-torchscript` imports TorchScript (`.pt`) files exported from PyTorch into native SameDiff graphs. This allows any PyTorch model that can be exported with `torch.jit.trace` or `torch.jit.script` to be run without any Python dependency at inference time.

```java
import java.nio.file.Path;
import java.util.Map;

import org.nd4j.autodiff.samediff.SameDiff;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.torchscript.TorchScriptImporter;

// Export from PyTorch:
//   traced = torch.jit.trace(model, example_input)
//   traced.save("model.pt")

SameDiff sd = TorchScriptImporter.importModel(Path.of("model.pt"));

// Run inference
Map<String, INDArray> inputs = Map.of("input", inputTensor);
Map<String, INDArray> outputs = sd.outputAll(inputs);
```

Supported op coverage includes the ops commonly used in transformer architectures: matrix multiply, layer norm, softmax, attention, RoPE, SiLU/GELU activations, and element-wise operations. Unsupported ops raise `TorchScriptImportException` with the op name.

### nd4j-web: Browser Frontend for ND4J Graphs

`nd4j-web` provides a TypeScript/FlatBuffers-based web frontend for visualizing and executing ND4J computation graphs in a browser.
Graphs are serialized to FlatBuffers format and served over a lightweight HTTP endpoint. This is primarily useful for debugging graph structure and for building web-based tooling around ND4J models.

```java
import org.nd4j.web.Nd4jWebServer;

Nd4jWebServer server = Nd4jWebServer.builder()
        .port(8080)
        .graph(model)
        .build();

server.start();
System.out.println("ND4J graph viewer at http://localhost:8080");
```

Navigate to `http://localhost:8080` to see the graph structure, inspect variable shapes, and trigger execution from the browser.

***

## 12. OCR Operations

The `samediff-vlm` module ships a native document OCR subsystem built on top of the VLM inference pipeline. It replaces external OCR libraries (Tesseract, EasyOCR, cloud APIs) with GPU-accelerated model-based recognition that runs end-to-end inside SameDiff.

### Architecture

```
AbstractOCREngine
    │
    └── DeepSeekOCREngine
            │
            ├── Vision Encoder (SameDiff/ONNX)   image → feature tensor
            └── Text Decoder (SameDiff/ONNX)     features → text + bounding boxes
```

### Core Classes

| Class               | Role                                                                                             |
| ------------------- | ------------------------------------------------------------------------------------------------ |
| `AbstractOCREngine` | Abstract base; defines the `recognize(File)` / `recognize(BufferedImage)` contract               |
| `DeepSeekOCREngine` | Concrete implementation backed by a vision encoder + text decoder                                |
| `OCRResult`         | Output: list of `TextRegion` objects plus full concatenated text and overall confidence          |
| `OCRConfig`         | Image preprocessing parameters: `imageSize` (default 1024), `imageMean`, `imageStd`, `maxTokens` |
| `TextRegion`        | Per-region data: bounding box `[x, y, width, height]`, text, confidence, detected language       |

### Loading and Running OCR

```java
import java.io.File;

import org.eclipse.deeplearning4j.vlm.input.ocr.DeepSeekOCREngine;
import org.eclipse.deeplearning4j.vlm.input.ocr.OCRConfig;
import org.eclipse.deeplearning4j.vlm.input.ocr.OCRResult;
import org.eclipse.deeplearning4j.vlm.input.ocr.TextRegion;

// Load model from directory containing vision_encoder.onnx and text_decoder.onnx
DeepSeekOCREngine engine = DeepSeekOCREngine.create(new File("deepseek-ocr/"));
engine.initialize();

// Recognize from a file
OCRResult result = engine.recognize(new File("document.png"));
System.out.println(result.getFullText());
System.out.printf("Confidence: %.2f%n", result.getConfidence());

// Per-region breakdown with bounding boxes
for (TextRegion region : result.getRegions()) {
    System.out.printf("[%s] (%.2f) bbox=%s%n",
            region.getText(), region.getConfidence(), region.getBbox());
}

engine.close();
```

### Custom Configuration

```java
import java.io.File;

import org.eclipse.deeplearning4j.vlm.input.ocr.OCRConfig;

OCRConfig config = OCRConfig.builder()
        .imageSize(1024)                                   // resize to 1024x1024 (default)
        .imageMean(new float[]{0.485f, 0.456f, 0.406f})    // ImageNet mean
        .imageStd(new float[]{0.229f, 0.224f, 0.225f})     // ImageNet std
        .maxTokens(1024)                                   // max decoder tokens per page
        .build();

DeepSeekOCREngine engine = DeepSeekOCREngine.create(new File("deepseek-ocr/"), config);
engine.initialize();
```

### Preprocessing Pipeline

The OCR engine reuses the `VLMImagePreprocessor` infrastructure:

1. **Resize**: scale input to `config.imageSize x config.imageSize`
2. **Normalize**: apply ImageNet mean/std: `[0.485, 0.456, 0.406]` / `[0.229, 0.224, 0.225]`
3. **Tile**: for high-resolution documents, split into overlapping tiles processed in parallel
4. **Tensor**: convert to `[1, 3, H, W]` float tensor

### Multi-Language Support

Language detection and switching happen inside the model, so no per-language configuration is needed. The `DeepSeekOCREngine` supports 12+ languages out of the box:

```java
List<String> langs = engine.getSupportedLanguages();
// ["en", "zh", "ja", "ko", "ar", "hi", "ru", "de", "fr", "es", "pt", "it"]
```

A single model handles all supported languages and their scripts. The detected per-region language is available from `TextRegion.getLanguage()`.
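The resize, normalize, and tensor steps of the preprocessing pipeline described earlier in this section can be sketched with the JDK alone. This is a rough illustration rather than the behavior of `VLMImagePreprocessor`: `OcrPreprocessSketch` and `toTensor` are hypothetical names, bilinear scaling is an assumed choice, and tiling is left out.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

/** Hypothetical sketch: resize, ImageNet-normalize, and lay out an image as [1, 3, H, W]. */
final class OcrPreprocessSketch {

    static final float[] MEAN = {0.485f, 0.456f, 0.406f};
    static final float[] STD = {0.229f, 0.224f, 0.225f};

    static float[][][][] toTensor(BufferedImage src, int size) {
        // Resize to size x size (assumption: bilinear interpolation, aspect ratio not preserved)
        BufferedImage scaled = new BufferedImage(size, size, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = scaled.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(src, 0, 0, size, size, null);
        g.dispose();

        // Normalize each channel and store in channel-first order
        float[][][][] tensor = new float[1][3][size][size];
        for (int y = 0; y < size; y++) {
            for (int x = 0; x < size; x++) {
                int rgb = scaled.getRGB(x, y);
                int[] channels = {(rgb >> 16) & 0xFF, (rgb >> 8) & 0xFF, rgb & 0xFF};
                for (int c = 0; c < 3; c++) {
                    tensor[0][c][y][x] = (channels[c] / 255f - MEAN[c]) / STD[c];
                }
            }
        }
        return tensor;
    }

    public static void main(String[] args) {
        BufferedImage white = new BufferedImage(8, 8, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 8; y++) {
            for (int x = 0; x < 8; x++) white.setRGB(x, y, 0xFFFFFF);
        }
        float[][][][] t = toTensor(white, 4);
        System.out.printf("shape=[%d, %d, %d, %d], red[0][0]=%.4f%n",
                t.length, t[0].length, t[0][0].length, t[0][0][0].length, t[0][0][0][0]);
    }
}
```

In a real pipeline the nested array would be wrapped in an `INDArray` before being handed to the vision encoder; the nested-array form is used here only to keep the sketch dependency-free.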
### Implementing a Custom OCR Engine

Extend `AbstractOCREngine` to integrate a different backend:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.util.List;
import javax.imageio.ImageIO;

import org.eclipse.deeplearning4j.vlm.input.ocr.AbstractOCREngine;
import org.eclipse.deeplearning4j.vlm.input.ocr.OCRResult;

public class MyOCREngine extends AbstractOCREngine {

    @Override
    public void initialize() throws Exception {
        // load your models here
    }

    @Override
    public OCRResult recognize(File imageFile) throws Exception {
        return recognize(ImageIO.read(imageFile));
    }

    @Override
    public OCRResult recognize(BufferedImage image) throws Exception {
        // preprocess, run inference, return OCRResult
        throw new UnsupportedOperationException("not implemented yet");
    }

    @Override
    public List<String> getSupportedLanguages() {
        return List.of("en");
    }

    @Override
    public void close() {
        // release resources
    }
}
```

***

## 13. SDX Serving Protocol (REST + gRPC)

The SDX serving layer exposes any `.sdz` or `.sdnb` model as a network service with a dual-protocol contract: a REST endpoint for binary NPZ payloads and a gRPC endpoint for strongly-typed tensor streaming. Both transports share the same execution core so there is no behavioral drift between them.

### REST: `POST /v1/models/{model_id}:run-npz`

The primary REST endpoint for production inference. The request body is an NPZ archive containing the input arrays; the response body is an NPZ archive containing the output arrays.
**Request**

```
POST /v1/models/my-llm-8b:run-npz
Content-Type: application/octet-stream
X-SDX-Input-Order: ["input_ids", "attention_mask"]
X-SDX-Output-Specs: [{"name":"logits","dtype":1,"shape":[1,512,32000]}]
```

**Response**

```
HTTP/1.1 200 OK
Content-Type: application/octet-stream
X-SDX-Execution-Report: {"backend":"CUDA","device":0,"elapsed_ms":12.4}
```

**Custom Headers**

| Header                   | Direction | Description                                                                                                                    |
| ------------------------ | --------- | ------------------------------------------------------------------------------------------------------------------------------ |
| `X-SDX-Input-Order`      | Request   | JSON array of input tensor names, controlling the order they are mapped to the model's placeholders                            |
| `X-SDX-Output-Specs`     | Request   | JSON array of `{"name", "dtype", "shape"}` objects; required because the C ABI (`sdxRun`) needs caller-provided output buffers |
| `X-SDX-Execution-Report` | Response  | JSON object with backend, device ID, and wall-clock elapsed time for the execution                                             |

A JSON/base64 compatibility endpoint is also available for smaller or debugging payloads:

```
POST /v1/models/{model_id}:run
Content-Type: application/json

{"inputs": {"input_ids": {"dtype": "INT64", "shape": [1, 16], "data_b64": "..."}}}
```

### gRPC Protocol

The primary typed binary protocol. The proto contract is defined in `sdx_serving.proto`.
**Proto contract**

```protobuf
// libnd4j/include/dsp/runtime/bindings/python/sdx_serving.proto

message Tensor {
  bytes data = 1;             // raw little-endian binary
  repeated int64 shape = 2;
  int32 dtype = 3;            // SDX dtype code
}

message TensorSpec {
  string name = 1;
  repeated int64 shape = 2;
  int32 dtype = 3;
}

message RunRequest {
  string model_id = 1;
  map<string, Tensor> inputs = 2;
  repeated TensorSpec output_specs = 3;   // required: the server allocates output buffers from these
  repeated string input_order = 4;
}

message RunResponse {
  map<string, Tensor> outputs = 1;
  string exec_report = 2;     // JSON execution metadata
}

service SdxServing {
  rpc Run (RunRequest) returns (RunResponse);
}
```

**Java gRPC client example**

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

ManagedChannel channel = ManagedChannelBuilder
        .forAddress("inference-host", 50051)
        .maxInboundMessageSize(256 * 1024 * 1024)   // raise beyond 4 MiB default for large tensors
        .usePlaintext()
        .build();

SdxServingGrpc.SdxServingBlockingStub stub = SdxServingGrpc.newBlockingStub(channel);

// tensorFromArray is a user-supplied helper (not shown) that packs an array into a Tensor message
RunRequest request = RunRequest.newBuilder()
        .setModelId("my-llm-8b")
        .putInputs("input_ids", tensorFromArray(inputIds))
        .putInputs("attention_mask", tensorFromArray(mask))
        .addOutputSpecs(TensorSpec.newBuilder()
                .setName("logits")
                .addShape(1).addShape(512).addShape(32000)
                .setDtype(DType.FLOAT32_VALUE)
                .build())
        .build();

RunResponse response = stub.run(request);
Tensor logits = response.getOutputsOrThrow("logits");

channel.shutdown();
```

### NPZ Payload Format

The NPZ format (NumPy archive) stores each tensor as a separate `.npy` file within a ZIP container. The key in the archive matches the tensor name expected by the model.
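Because an NPZ archive is an ordinary ZIP file of `.npy` entries, a JVM client can build a request body without NumPy. The sketch below is a minimal illustration and not part of any SDX client API: `NpzSketch` and its methods are hypothetical names, and it handles only row-major `int64` arrays using the NPY version 1.0 header layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper that writes int64 tensors into an in-memory NPZ archive. */
final class NpzSketch {

    /** Serializes a row-major int64 array as an NPY 1.0 file. */
    static byte[] npyInt64(long[] data, int... shape) {
        StringBuilder dims = new StringBuilder();
        for (int i = 0; i < shape.length; i++) {
            if (i > 0) dims.append(", ");
            dims.append(shape[i]);
        }
        if (shape.length == 1) dims.append(',');   // one-element tuples need a trailing comma

        StringBuilder header = new StringBuilder(
                "{'descr': '<i8', 'fortran_order': False, 'shape': (").append(dims).append("), }");
        // Pad so the 10-byte preamble plus the newline-terminated header is a multiple of 64 bytes
        while ((10 + header.length() + 1) % 64 != 0) header.append(' ');
        header.append('\n');
        byte[] headerBytes = header.toString().getBytes(StandardCharsets.US_ASCII);

        ByteBuffer buf = ByteBuffer
                .allocate(10 + headerBytes.length + data.length * Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        buf.put((byte) 0x93).put("NUMPY".getBytes(StandardCharsets.US_ASCII));   // magic string
        buf.put((byte) 1).put((byte) 0);                                         // format version 1.0
        buf.putShort((short) headerBytes.length);                                // header length
        buf.put(headerBytes);
        for (long v : data) buf.putLong(v);
        return buf.array();
    }

    /** Packs named .npy payloads into a ZIP container; each entry is stored as "<name>.npy". */
    static byte[] npz(Map<String, byte[]> npyByName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Map.Entry<String, byte[]> e : npyByName.entrySet()) {
                zip.putNextEntry(new ZipEntry(e.getKey() + ".npy"));
                zip.write(e.getValue());
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }

    /** Lists entry names, which is handy for checking what a client is about to send. */
    static List<String> entryNames(byte[] archive) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return names;
    }

    public static void main(String[] args) {
        Map<String, byte[]> tensors = new LinkedHashMap<>();
        tensors.put("input_ids", npyInt64(new long[] {1, 2, 3, 4}, 1, 4));
        tensors.put("attention_mask", npyInt64(new long[] {1, 1, 1, 1}, 1, 4));
        System.out.println(entryNames(npz(tensors)));
    }
}
```

The resulting byte array would be sent as the `application/octet-stream` request body; other dtypes need a different `descr` string and element encoding.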
```python
# Python client example
import numpy as np
import requests
import io

# Build request
buf = io.BytesIO()
np.savez(buf,
         input_ids=np.array([[1, 2, 3, 4]], dtype=np.int64),
         attention_mask=np.ones((1, 4), dtype=np.int64))
buf.seek(0)

resp = requests.post(
    "http://inference-host:8080/v1/models/my-llm-8b:run-npz",
    data=buf.read(),
    headers={
        "Content-Type": "application/octet-stream",
        "X-SDX-Input-Order": '["input_ids","attention_mask"]',
        "X-SDX-Output-Specs": '[{"name":"logits","dtype":1,"shape":[1,4,32000]}]'
    })

outputs = np.load(io.BytesIO(resp.content))
logits = outputs["logits"]  # shape [1, 4, 32000]
```

### Execution Lifecycle

Both transports use the same server-side execution sequence:

1. Load model into the runtime registry (`sdx_sdk_runner.py`)
2. Create a context for the request
3. Decode input tensors via the shared codec (`sdx_tensor_transport.py`)
4. Call `sdxRun(...)` on the C runtime; the caller-provided output buffers must be allocated from `X-SDX-Output-Specs` / `output_specs`
5. Encode output tensors and return
6. Context released; model stays loaded for subsequent requests

***

## 14. VLM Multi-GPU Inference Pipeline

The `samediff-vlm` module includes a dedicated multi-GPU pipeline for Vision-Language Models (VLMs) such as SmolDocling. VLMs combine a vision encoder (processes images) with a language decoder (generates text), and these two components have very different memory profiles. The multi-GPU pipeline assigns them to separate GPUs to maximize available memory for each.

### Architecture Overview

```
VLMPipelineExecutor
    │
    ├── MultiPartModelLoader
    │       ├── vision_encoder.sdz   → encoder GPU (e.g. RTX 3070 Ti, 8 GB)
    │       ├── embed_tokens.sdz     → decoder GPU (e.g. RTX 4090, 24 GB)
    │       └── decoder.sdz          → decoder GPU
    │
    ├── ImageTiler
    │       └── splits pages into tiles; parallel encoding
    │
    └── VLMImagePreprocessor
            └── resize / normalize / patch extraction per tile
```

**GPU assignment:**

* **Decoder GPU** (largest available, selected by `selectBestGpu()`): decoder model constants, token embedding, and the autoregressive KV-cache growth loop.
* **Encoder GPU** (next-best): vision encoder model constants and per-tile encoding. Released after all pages are encoded.

### Maven Dependency

```xml
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>samediff-vlm</artifactId>
    <version>${dl4j.version}</version>
</dependency>
```

### MultiPartModelLoader

VLMs are stored as separate `.sdz` files, one per sub-model. `MultiPartModelLoader` loads them and assigns each to the correct device:

```java
import java.io.File;

import org.deeplearning4j.vlm.MultiPartModelLoader;
import org.deeplearning4j.vlm.VisionLanguageModel;

// Expects vision_encoder.sdz, embed_tokens.sdz, decoder.sdz in modelDirectory
VisionLanguageModel vlm = MultiPartModelLoader.load(new File("smol-docling/"));

// The loader calls selectBestGpu() to assign the decoder and picks the
// next-best GPU for the encoder automatically.
```

You can also control device assignment explicitly:

```java
VisionLanguageModel vlm = MultiPartModelLoader.builder()
        .modelDirectory(new File("smol-docling/"))
        .encoderDeviceId(1)   // RTX 3070 Ti (8 GB)
        .decoderDeviceId(0)   // RTX 4090 (24 GB)
        .build()
        .load();
```

### VLMPipelineExecutor: End-to-End Usage

`VLMPipelineExecutor` is the single entry point for VLM inference.
It coordinates image preprocessing, tile encoding, cross-device transfers, and the autoregressive decode loop:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

import org.deeplearning4j.vlm.VLMPipelineExecutor;
import org.deeplearning4j.vlm.VLMPipelineConfig;
import org.deeplearning4j.vlm.VlmGenerationResult;
import org.nd4j.tokenizers.TokenizerFactory;

VLMPipelineConfig config = VLMPipelineConfig.builder()
        .model(vlm)
        .tokenizer(TokenizerFactory.fromDirectory(new File("smol-docling/")))
        .maxTokens(2048)
        .build();

VLMPipelineExecutor executor = new VLMPipelineExecutor(config);

BufferedImage page = ImageIO.read(new File("document-page-1.png"));
VlmGenerationResult result = executor.generate(page, "Describe the layout of this page.");

System.out.println(result.getText());
executor.close();
```

### ImageTiler: Multi-Page Documents

`ImageTiler` splits high-resolution or multi-page inputs into fixed-size tiles. For document-understanding tasks each page is processed as a separate tile, and encoding is pipelined so that page N+1 preprocessing (CPU-bound) overlaps with page N encoding (GPU-bound):

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.util.List;

import org.deeplearning4j.vlm.ImageTiler;
import org.deeplearning4j.vlm.TilerConfig;

ImageTiler tiler = ImageTiler.builder()
        .tileWidth(560)       // pixels per tile (model-specific)
        .tileHeight(560)
        .overlapPixels(56)    // overlap between adjacent tiles
        .build();

// Split a large document scan into tiles
List<BufferedImage> tiles = tiler.tile(documentImage);

// Or pass the rendered pages of a PDF directly to VLMPipelineExecutor
// (loadPdfPages is a user-supplied helper, not shown)
List<BufferedImage> pages = loadPdfPages(new File("report.pdf"));
VlmGenerationResult result = executor.generateFromPages(pages, "Extract all invoice totals.");
```

### Encoder-GPU / Decoder-GPU Device Affinity

A single-thread executor pins all encoder work to the encoder device. This prevents CUDA context switching and isolates each GPU's memory pools and streams:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// This is done internally by VLMPipelineExecutor; shown here for reference.
ExecutorService encoderExecutor = Executors.newSingleThreadExecutor(r -> {
    Thread t = new Thread(() -> {
        // Bind this thread to the encoder GPU before running any submitted task
        DeviceMemoryManager.switchDevice(encoderDeviceId);
        r.run();
    });
    t.setDaemon(true);
    return t;
});
```

Cross-device transfers (encoder output → decoder input) use `CudaAffinityManager.replicateToDevice`. On GPU pairs that support peer-to-peer (P2P) access, for example over NVLink, the transfer is direct (device-to-device). On non-P2P pairs the transfer is staged through host memory (D2H + H2D).

### Deferred Vision-Encoder Release

After all pages are encoded, the vision encoder model is freed. This recovers 5–8 GB of GPU memory on the encoder device (or shared device on single-GPU systems) before the decode loop begins:

```java
// Called automatically by VLMPipelineExecutor after encoding all pages.
// To trigger manually in a custom pipeline:
vlm.freeVisionEncoder();
// → SameDiff graph closed, constant arrays freed.
// GPU memory freed: 5–8 GB now available for decoder KV-cache growth.
```

On single-GPU setups the encoder and decoder share one device. The encoder must complete and be released before the decoder's KV cache can grow freely. This serializes encoding and decoding but is handled transparently by `VLMPipelineExecutor`.

### Decode Loop Integration

The decode loop uses `DynamicShapePlan` to handle the growing KV cache across thousands of steps. The pipeline follows this sequence per token:

1. Embed the current token ID through the `embed_tokens` model on the decoder GPU.
2. On the first step, concatenate vision features (transferred from the encoder GPU) with the token embeddings.
3. Execute the decoder with `DynamicShapePlan` (handles shape changes as the KV cache grows).
4. Select the next token by argmax on the output logits.
5. Stop if the end-of-sequence token is produced.
6. Reuse intermediate arrays across steps (one persistent array per slot, so there is no per-step allocate/free overhead).
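The control flow of the per-token sequence above (forward pass, argmax, end-of-sequence check, buffer reuse) can be sketched independently of SameDiff. Everything here is a hypothetical stand-in: `GreedyDecodeSketch` and `StepFn` are illustrative names, and the toy model replaces the real embed-and-decode call, so vision-feature concatenation and the KV cache are not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a greedy decode loop with a reused logits buffer. */
final class GreedyDecodeSketch {

    /** Stand-in for "embed the token and run the decoder": fills logits for the next position. */
    interface StepFn {
        void step(int tokenId, int position, float[] logits);
    }

    /** Index of the largest logit (ties resolve to the lowest index). */
    static int argmax(float[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) best = i;
        }
        return best;
    }

    static List<Integer> decode(StepFn model, int firstToken, int eosId, int vocabSize, int maxTokens) {
        float[] logits = new float[vocabSize];   // allocated once and reused on every step
        List<Integer> generated = new ArrayList<>();
        int token = firstToken;
        for (int pos = 0; pos < maxTokens; pos++) {
            model.step(token, pos, logits);      // embed + decoder forward pass
            token = argmax(logits);              // greedy selection
            if (token == eosId) break;           // stop on end-of-sequence
            generated.add(token);
        }
        return generated;
    }

    public static void main(String[] args) {
        // Toy model: always predicts (current token + 1) modulo the vocabulary size
        StepFn toy = (tokenId, position, logits) -> {
            Arrays.fill(logits, 0f);
            logits[(tokenId + 1) % logits.length] = 1f;
        };
        System.out.println(decode(toy, 0, 4, 8, 16));
    }
}
```

This sketch drops the end-of-sequence token from the returned list and caps generation at `maxTokens`; whether the real pipeline includes the final token in its result is not stated in this page.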
### Configuration Reference

| Option                     | Type      | Default       | Description                                            |
| -------------------------- | --------- | ------------- | ------------------------------------------------------ |
| `encoderDeviceId`          | `int`     | auto          | GPU device ID for the vision encoder                   |
| `decoderDeviceId`          | `int`     | auto          | GPU device ID for the language decoder                 |
| `tileWidth` / `tileHeight` | `int`     | model default | Tile size in pixels for `ImageTiler`                   |
| `overlapPixels`            | `int`     | 0             | Tile overlap to avoid edge artifacts                   |
| `maxTokens`                | `int`     | 2048          | Maximum generated tokens per page                      |
| `freeEncoderAfterEncoding` | `boolean` | `true`        | Release encoder GPU memory after all pages are encoded |
| `pipelineParallelism`      | `boolean` | `true`        | Overlap page N+1 preprocessing with page N encoding    |

### Performance Notes

* SmolDocling on RTX 4090 (24 GB) + RTX 3070 Ti (8 GB): approximately 87–92 tok/s steady-state decode with CUDA graph replay and Triton fusion.
* Vision encoder: approximately 150 ms per page (1962 DSP ops per frame on native executor).
* After encoder release: approximately 5.3 GB baseline GPU usage (model constants) with approximately 1 MB/step memory growth in the decode loop.
* For single-GPU systems, the pipeline falls back to serial encode-then-decode automatically. Multi-GPU provides the pipeline-parallelism advantage only when two or more GPUs are available.

***

## Next Steps

* **Getting Started:** See the [Quickstart](/en-1.0.0-rewrite/deeplearning4j/quickstart.md) for setting up the Maven project and running your first model.
* **SameDiff Graph Execution:** Review the [SameDiff Execution documentation](/en-1.0.0-rewrite/nd4j/overview-2/execution.md) to understand how `GenerationPipeline` integrates with the DSP plan lifecycle.
* **OmniHub Model Zoo:** Use [OmniHub](/en-1.0.0-rewrite/omnihub/usage.md) to download pre-converted LLM weights in the SameDiff FlatBuffers format without manual conversion.
* **Performance Tuning:** See [GPU/CPU Configuration](/en-1.0.0-rewrite/configuration/gpu-cpu.md) and [Memory and Workspaces](/en-1.0.0-rewrite/core-concepts/memory-and-workspaces.md) for hardware-specific tuning guidance that applies to LLM inference.
* **CUDA Graphs:** The `BenchmarkConfig.CUDA_GRAPHS` preset delivers the lowest decode latency on NVIDIA GPUs; see the [CUDA backend documentation](/en-1.0.0-rewrite/nd4j/overview-1/cuda.md) for prerequisites.