# GGML/GGUF Import

### GGML/GGUF Model Import

Eclipse Deeplearning4j 1.0.0-rewrite introduces native support for loading GGUF and GGML model files directly into SameDiff. This enables the JVM ecosystem to consume the enormous library of community-quantized models distributed through Hugging Face and other repositories — LLaMA, Gemma, Mistral, Phi, Qwen, Whisper, and many others — without any Python tooling or intermediate conversion step.

The implementation is split across the `nd4j-ggml` Maven module (87 files) and three pipeline SPI modules (34 files) that provide a pluggable format layer on top of SameDiff.

***

### When to Use GGML Import

| Scenario                                            | Recommended approach                                         |
| --------------------------------------------------- | ------------------------------------------------------------ |
| Run a community-quantized LLM (.gguf) on the JVM    | `GGMLModelImport.importModel(File)`                          |
| Inspect metadata and tensor layout before loading   | `GGMLModelImport.inspectModel(File)`                         |
| Convert to DL4J native format for repeated use      | `GGMLModelImport.convertToSDZ(src, dst)`                     |
| Export a SameDiff model back to GGUF                | `GGMLModelExport.exportModel(SameDiff, File, ExportOptions)` |
| Load split multimodal GGUF bundles (e.g., Qwen3-VL) | `MultimodalGGUFLoader`                                       |

***

### Maven Setup

The core import capability lives in `nd4j-ggml`. Add it alongside the ND4J backend for your platform.

```xml
<!-- GGML/GGUF model import -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-ggml</artifactId>
    <version>${dl4j.version}</version>
</dependency>

<!-- ND4J CPU backend (choose one) -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native-platform</artifactId>
    <version>${dl4j.version}</version>
</dependency>

<!-- ND4J CUDA backend (alternative) -->
<!--
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-cuda-12.3-platform</artifactId>
    <version>${dl4j.version}</version>
</dependency>
-->
```

For pipeline integration (format-agnostic loading across GGUF, SafeTensors, and ONNX), add the relevant SPI modules:

```xml
<!-- Pipeline core (shared SPI interfaces) -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>samediff-pipeline-core</artifactId>
    <version>${dl4j.version}</version>
</dependency>

<!-- GGUF pipeline adapter -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>samediff-pipeline-ggml</artifactId>
    <version>${dl4j.version}</version>
</dependency>

<!-- SafeTensors pipeline adapter (optional) -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>samediff-pipeline-safetensors</artifactId>
    <version>${dl4j.version}</version>
</dependency>

<!-- ONNX pipeline adapter (optional) -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>samediff-pipeline-onnx</artifactId>
    <version>${dl4j.version}</version>
</dependency>
```

Replace `${dl4j.version}` with your project version, for example `1.0.0-rewrite`.

***

### Quick Start

#### Import a GGUF model into SameDiff

```java
import org.nd4j.ggml.GGMLModelImport;
import org.nd4j.autodiff.samediff.SameDiff;

import java.io.File;

File ggufFile = new File("llama-3-8b-instruct.Q4_K_M.gguf");

// Reads the GGUF file, dequantizes all tensors, and maps them to SameDiff variables
SameDiff model = GGMLModelImport.importModel(ggufFile);

System.out.println("Variables loaded: " + model.variables().size());
```

#### Inspect a model without loading all weights

```java
import org.nd4j.ggml.GGMLModelImport;
import org.nd4j.ggml.format.GGMLMetadata;

import java.io.File;

File ggufFile = new File("model.gguf");
GGMLMetadata metadata = GGMLModelImport.inspectModel(ggufFile);

System.out.println("Architecture : " + metadata.getArchitecture());
System.out.println("Tensor count : " + metadata.getTensorCount());
System.out.println("Context length: " + metadata.getContextLength());
System.out.println("Quant type   : " + metadata.getQuantizationType());
```

#### Convert to DL4J native format (SDZ)

```java
import org.nd4j.ggml.GGMLModelImport;

import java.io.File;

File src = new File("llama-3-8b-instruct.Q4_K_M.gguf");
File dst = new File("llama-3-8b-instruct.sdz");

// One-time conversion; subsequent loads from the .sdz are faster
GGMLModelImport.convertToSDZ(src, dst);
```

#### Run a forward pass

```java
import org.nd4j.autodiff.samediff.SameDiff;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.ggml.GGMLModelImport;

import java.io.File;
import java.util.Map;

SameDiff model = GGMLModelImport.importModel(new File("model.gguf"));

// Token IDs for a sample prompt (shape: [batch, seq_len])
INDArray inputIds = Nd4j.createFromArray(new long[][]{{1, 15043, 29892, 3186}});

Map<String, INDArray> outputs = model.output(
        Map.of("input_ids", inputIds),
        "logits"
);

INDArray logits = outputs.get("logits");
System.out.println("Logits shape: " + java.util.Arrays.toString(logits.shape()));
```

***

### GGUF Format Support

#### File Format Versions

`GGUFReader` supports all three released versions of the GGUF binary format:

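| Version | Notes                                              |
| ------- | -------------------------------------------------- |
| GGUF v1 | Original release format; uint32 counts and lengths |
| GGUF v2 | Widens counts and lengths from uint32 to uint64    |
| GGUF v3 | Adds big-endian support                            |

All versions share the same outer structure (the field widths below are those of v2 and v3; v1 used uint32 for the counts):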

1. **Magic bytes** — `0x46554747` (`GGUF` in ASCII, little-endian)
2. **Version** — uint32
3. **Tensor count** — uint64
4. **Metadata KV count** — uint64
5. **Metadata KV pairs** — typed key-value entries (strings, scalars, arrays)
6. **Tensor descriptors** — name, shape, quantization type, offset
7. **Tensor data** — raw quantized bytes, padded to alignment boundary

`GGMLFormatDetector` reads the first four bytes of any file and selects either `GGUFReader` (magic `0x46554747`) or the legacy `GGMLReader` (older magic `0x67676d6c` / `0x67676d66`). You never need to choose the reader manually; `GGMLModelImport` calls the detector automatically.
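
The magic check and header layout can be illustrated with plain JDK classes. This is a minimal sketch, not the `GGMLFormatDetector` or `GGUFReader` source: the class and method names are hypothetical, and it assumes the v2/v3 layout with uint64 count fields listed above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class GgufHeaderSketch {
    // The bytes 'G','G','U','F' read as a little-endian uint32.
    static final int GGUF_MAGIC = 0x46554747;

    /** Returns true when the first four bytes spell "GGUF". */
    static boolean isGguf(byte[] firstBytes) {
        if (firstBytes.length < 4) return false;
        int magic = ByteBuffer.wrap(firstBytes, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        return magic == GGUF_MAGIC;
    }

    /** Parses {version, tensorCount, kvCount} from a 24-byte v2/v3 header. */
    static long[] parseHeader(byte[] header) {
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt() != GGUF_MAGIC) {
            throw new IllegalArgumentException("not a GGUF file");
        }
        long version = Integer.toUnsignedLong(buf.getInt());
        long tensorCount = buf.getLong();
        long kvCount = buf.getLong();
        return new long[] {version, tensorCount, kvCount};
    }

    /** Builds a synthetic header so the sketch runs without a model file. */
    static byte[] sampleHeader(int version, long tensors, long kvPairs) {
        return ByteBuffer.allocate(24).order(ByteOrder.LITTLE_ENDIAN)
                .put("GGUF".getBytes(StandardCharsets.US_ASCII))
                .putInt(version).putLong(tensors).putLong(kvPairs)
                .array();
    }

    public static void main(String[] args) {
        long[] h = parseHeader(sampleHeader(3, 291, 24));
        System.out.println("version=" + h[0] + " tensors=" + h[1] + " kv=" + h[2]);
    }
}
```

In application code there is no need to do this by hand; `GGMLModelImport` performs the detection for you.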

#### Legacy GGML Format

For pre-GGUF models (GGML format v1–v3), `GGMLReader` and `GGMLWriter` provide compatible reading and writing. These files lack the structured metadata KV section; architecture detection falls back to heuristics based on tensor name patterns.

***

### Supported Architectures

Architecture detection is handled by `ArchitectureRegistry`, which uses `ServiceLoader` auto-discovery and a priority ordering. Each handler implements the `ModelArchitecture` interface:

```java
public interface ModelArchitecture {
    boolean isCompatible(GGMLMetadata metadata);
    SameDiff buildSameDiff(GGUFReader reader, ConversionOptions options);
}
```

The registry iterates handlers in priority order and delegates to the first compatible one. `GenericArchitecture` is always last and accepts any model as a fallback.
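
The "first compatible handler wins, generic fallback last" rule can be sketched without any DL4J classes. Everything here is hypothetical: `Handler` stands in for `ModelArchitecture`, compatibility is reduced to a test on the architecture string, and the sketch assumes that a larger priority number is tried first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

final class RegistrySketch {
    /** Hypothetical stand-in for a handler: name, priority, compatibility test. */
    static final class Handler {
        final String name;
        final int priority;
        final Predicate<String> compatible;

        Handler(String name, int priority, Predicate<String> compatible) {
            this.name = name;
            this.priority = priority;
            this.compatible = compatible;
        }
    }

    private final List<Handler> handlers = new ArrayList<>();

    void register(Handler handler) {
        handlers.add(handler);
        // Assumption: higher priority is tried first, so the fallback sorts last.
        handlers.sort(Comparator.comparingInt((Handler h) -> h.priority).reversed());
    }

    /** Returns the name of the first handler that accepts the architecture. */
    String findCompatible(String architecture) {
        for (Handler h : handlers) {
            if (h.compatible.test(architecture)) return h.name;
        }
        throw new IllegalStateException("no fallback handler registered");
    }

    static RegistrySketch sample() {
        RegistrySketch registry = new RegistrySketch();
        registry.register(new Handler("generic", Integer.MIN_VALUE, a -> true));
        registry.register(new Handler("llama", 100, "llama"::equals));
        registry.register(new Handler("mistral", 100, "mistral"::equals));
        return registry;
    }

    public static void main(String[] args) {
        System.out.println(sample().findCompatible("llama"));
        System.out.println(sample().findCompatible("some-new-arch"));
    }
}
```

Because the catch-all handler is registered with the lowest priority, registration order does not matter: a specific handler is always consulted before the fallback.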

#### Architecture Handler Reference

| Architecture class     | Model families                         | Notes                                                 |
| ---------------------- | -------------------------------------- | ----------------------------------------------------- |
| `LLaMAArchitecture`    | LLaMA 1, LLaMA 2, LLaMA 3              | Standard dense transformer; RoPE positional encoding  |
| `Llama4Architecture`   | LLaMA 4                                | Interleaved mixture-of-experts (MoE) layers           |
| `GemmaArchitecture`    | Gemma 2, Gemma 3                       | Google's open models; grouped-query attention         |
| `MistralArchitecture`  | Mistral 7B, Mixtral                    | Sliding-window attention; optional MoE                |
| `PhiArchitecture`      | Phi-3, Phi-3.5                         | Microsoft; RoPE + fused QKV projection                |
| `GLMArchitecture`      | ChatGLM, CodeGLM                       | Zhipu AI; bidirectional prefix attention              |
| `GraniteArchitecture`  | IBM Granite                            | IBM Research code/language models                     |
| `LFM2Architecture`     | Liquid LFM-2                           | State-space model (SSM) hybrid                        |
| `NemotronArchitecture` | NVIDIA Nemotron                        | NVIDIA instruction-tuned models                       |
| `OLMoArchitecture`     | OLMo                                   | Allen AI; no bias in attention                        |
| `OpenELMArchitecture`  | OpenELM                                | Apple; layer-wise head count variation                |
| `Qwen3VLArchitecture`  | Qwen3-VL                               | Alibaba multimodal; loads split GGUF shards           |
| `SmolVLM2Architecture` | SmolVLM2                               | Hugging Face compact vision-language model            |
| `MiniCPMVArchitecture` | MiniCPM-V                              | ModelBest multimodal                                  |
| `WhisperArchitecture`  | Whisper (tiny/base/small/medium/large) | OpenAI ASR; encoder-decoder                           |
| `GenericArchitecture`  | Any                                    | Fallback; maps tensors by name without graph rewiring |

`LayerTensorDiscovery` resolves GGUF tensor naming conventions (e.g., `blk.0.attn_q.weight`, `blk.0.ffn_gate.weight`) to the canonical SameDiff variable names used by each architecture.

***

### Quantization Formats

#### Quantization Type Reference

`GGMLQuantType` enumerates every supported dtype with its bits-per-weight value. `GGMLDataType` provides the corresponding GGML integer dtype codes used in the binary format.

**Standard quantization types**

| Type     | Bits/weight | Block size | Notes                          | When to use                                  |
| -------- | ----------- | ---------- | ------------------------------ | -------------------------------------------- |
| `F32`    | 32.0        | —          | Full precision float32         | Accuracy-critical research; very large GPU   |
| `F16`    | 16.0        | —          | Half precision float16         | Good balance; standard GPU inference         |
| `Q8_0`   | 8.5         | 32         | 8-bit, scale only (no offset)  | Near-lossless; reference quality             |
| `Q8_K`   | 8.5         | 256        | 8-bit, K-quant block           | Used as intermediate for K-quant dequant     |
| `Q6_K`   | 6.5625      | 256        | 6-bit K-quant                  | Excellent quality; fits larger models in RAM |
| `Q5_K_M` | 5.6875      | 256        | 5-bit K-quant, mixed precision | Recommended for quality-size balance         |
| `Q5_K_S` | 5.5         | 256        | 5-bit K-quant, small           | Slightly smaller than Q5\_K\_M               |
| `Q5_1`   | 6.0         | 32         | 5-bit with non-zero min        | Legacy; prefer Q5\_K\_M                      |
| `Q5_0`   | 5.5         | 32         | 5-bit, zero min                | Legacy; prefer Q5\_K\_S                      |
| `Q4_K_M` | 4.8         | 256        | 4-bit K-quant, mixed precision | Most popular community choice                |
| `Q4_K_S` | 4.375       | 256        | 4-bit K-quant, small           | Compact; slightly lower quality              |
| `Q4_1`   | 5.0         | 32         | 4-bit with non-zero min        | Legacy; prefer Q4\_K\_M                      |
| `Q4_0`   | 4.5         | 32         | 4-bit, zero min                | Smallest legacy (non-K) quant                |
| `Q3_K_L` | 3.4375      | 256        | 3-bit K-quant, large           | Very compressed; some quality loss           |
| `Q3_K_M` | 3.28125     | 256        | 3-bit K-quant, medium          | Aggressive compression                       |
| `Q3_K_S` | 3.0         | 256        | 3-bit K-quant, small           | Extreme compression                          |
| `Q2_K`   | 2.5625      | 256        | 2-bit K-quant                  | Maximum compression; significant degradation |
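
For a single block format, bits per weight follows directly from the block layout: `8 * bytesPerBlock / weightsPerBlock`. The sketch below applies that formula to two layouts from ggml (Q8\_0 stores 32 int8 values plus an FP16 scale, 34 bytes in total; Q6\_K uses 210 bytes per 256 weights) and then estimates the tensor payload size for a given weight count. The class name is illustrative, and the estimate ignores metadata and alignment padding.

```java
final class BitsPerWeightSketch {
    /** Bits per weight implied by a block layout. */
    static double bitsPerWeight(int bytesPerBlock, int weightsPerBlock) {
        return 8.0 * bytesPerBlock / weightsPerBlock;
    }

    /** Approximate size of the tensor data for a given number of weights. */
    static long estimateBytes(long weightCount, double bitsPerWeight) {
        return Math.round(weightCount * bitsPerWeight / 8.0);
    }

    public static void main(String[] args) {
        double q8 = bitsPerWeight(34, 32);    // Q8_0: 2-byte scale + 32 int8 values
        double q6k = bitsPerWeight(210, 256); // Q6_K super-block
        System.out.println("Q8_0 bpw = " + q8);   // 8.5
        System.out.println("Q6_K bpw = " + q6k);  // 6.5625
        // Illustrative 8-billion-weight model stored entirely as Q8_0
        System.out.println("bytes    = " + estimateBytes(8_000_000_000L, q8));
    }
}
```

The `_M`, `_S`, and `_L` rows are file-level mixes of several block formats, so their bits-per-weight figures are averages rather than the value of a single block layout.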

**I-quant types (importance-matrix quantization)**

I-quants are typically produced with an importance matrix (imatrix) computed from a calibration dataset, which steers precision toward the weights that matter most. At the same bits per weight they generally outperform the standard quants.

| Type      | Bits/weight | Notes                              |
| --------- | ----------- | ---------------------------------- |
| `IQ4_XS`  | 4.25        | 4-bit imatrix; best 4-bit quality  |
| `IQ4_NL`  | 4.5         | 4-bit non-linear imatrix           |
| `IQ3_XXS` | 3.0625      | 3-bit ultra-small imatrix          |
| `IQ3_S`   | 3.4375      | 3-bit imatrix, standard            |
| `IQ2_XXS` | 2.0625      | 2-bit ultra-small imatrix          |
| `IQ2_XS`  | 2.3125      | 2-bit imatrix, extra small         |
| `IQ2_S`   | 2.5         | 2-bit imatrix, standard            |
| `IQ1_S`   | 1.5625      | 1-bit imatrix; extreme compression |
| `IQ1_M`   | 1.75        | 1-bit imatrix, mixed               |

**Ternary types**

| Type    | Bits/weight | Notes                                         |
| ------- | ----------- | --------------------------------------------- |
| `TQ1_0` | \~1.69      | Ternary; five values packed per byte          |
| `TQ2_0` | \~2.06      | Ternary; two bits per value, simpler to decode |

#### Dequantization at Import Time

During `GGMLModelImport.importModel()`, each tensor is dequantized to `float32` (or `float16` depending on `ConversionOptions`) before being stored as a SameDiff variable. A dedicated dequantizer class handles each format:

* Standard: `Q4_0Dequantizer`, `Q4_1Dequantizer`, `Q5_0Dequantizer`, `Q5_1Dequantizer`, `Q8_0Dequantizer`, `Q8_KDequantizer`, `Q4_KDequantizer`, `Q5_KDequantizer`, `Q6_KDequantizer`, `Q2_KDequantizer`, `Q3_KDequantizer`
* I-quant: `IQ1_MDequantizer`, `IQ1_SDequantizer`, `IQ2_SDequantizer`, `IQ2_XSDequantizer`, `IQ2_XXSDequantizer`, `IQ3_SDequantizer`, `IQ3_XXSDequantizer`, `IQ4_NLDequantizer`, `IQ4_XSDequantizer`
* Ternary: `TQ1_0Dequantizer`, `TQ2_0Dequantizer`

`GGMLToSameDiffConverter` coordinates this process: it iterates the tensor descriptors from `GGUFReader`, dispatches to the appropriate dequantizer, and creates the resulting `SDVariable` in the target `SameDiff` graph.

***

### Adaptive Quantization

When exporting a SameDiff model back to GGUF, you rarely want uniform quantization across all layers. The adaptive quantization subsystem assigns per-layer quantization types to meet a target model size budget while preserving quality in the most sensitive weight matrices.

#### How it works

`AdaptiveLayerQuantizer` uses a two-pass algorithm:

1. **Analysis pass** — `DynamicQuantizationAnalyzer` inspects each weight tensor's value distribution (min, max, kurtosis, outlier ratio) and computes a recommended precision level.
2. **Budget pass** — `AdaptiveLayerQuantizer` maps the recommendations to concrete `GGMLQuantType` values, then adjusts to meet the total size budget. Attention projection matrices (`attn_q`, `attn_k`, `attn_v`, `attn_output`) are assigned a higher-quality quantization type than feed-forward matrices, since attention weights are more sensitive to precision loss.
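
The budget pass can be pictured as a greedy loop. The sketch below is an illustration under stated assumptions, not the `AdaptiveLayerQuantizer` algorithm: the precision ladder, the `.attn_` name test, and the rule that attention tensors are never lowered are all hypothetical choices made to keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BudgetPassSketch {
    // Candidate precisions in bits per weight, highest first (illustrative).
    static final double[] LADDER = {6.5625, 5.5, 4.5};

    static boolean isAttention(String tensorName) {
        return tensorName.contains(".attn_");
    }

    /** Total size in bytes of all tensors under the given plan. */
    static long sizeBytes(Map<String, Long> elementCounts, Map<String, Double> plan) {
        double bits = 0;
        for (Map.Entry<String, Long> e : elementCounts.entrySet()) {
            bits += e.getValue() * plan.get(e.getKey());
        }
        return (long) Math.ceil(bits / 8.0);
    }

    /** Starts everything at the top rung, then lowers non-attention tensors. */
    static Map<String, Double> plan(Map<String, Long> elementCounts, long budgetBytes) {
        Map<String, Double> plan = new LinkedHashMap<>();
        for (String name : elementCounts.keySet()) plan.put(name, LADDER[0]);
        for (int rung = 1; rung < LADDER.length; rung++) {
            for (String name : elementCounts.keySet()) {
                if (sizeBytes(elementCounts, plan) <= budgetBytes) return plan;
                if (!isAttention(name)) plan.put(name, LADDER[rung]);
            }
        }
        return plan; // best effort: may still exceed the budget
    }

    static Map<String, Long> sampleTensors() {
        Map<String, Long> tensors = new LinkedHashMap<>();
        tensors.put("blk.0.attn_q.weight", 1000L);
        tensors.put("blk.0.ffn_gate.weight", 3000L);
        return tensors;
    }

    public static void main(String[] args) {
        Map<String, Long> tensors = sampleTensors();
        Map<String, Double> plan = plan(tensors, 2600);
        System.out.println(plan + " -> " + sizeBytes(tensors, plan) + " bytes");
    }
}
```

With the sample tensors and a 2600-byte budget, the feed-forward tensor is stepped down twice while the attention tensor keeps the top rung.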

#### Quantizer classes (for export)

The following quantizer classes are available for re-quantizing `float32` tensors during export:

| Class           | Output type |
| --------------- | ----------- |
| `Q4_0Quantizer` | Q4\_0       |
| `Q4_1Quantizer` | Q4\_1       |
| `Q4_KQuantizer` | Q4\_K       |
| `Q5_0Quantizer` | Q5\_0       |
| `Q5_1Quantizer` | Q5\_1       |
| `Q5_KQuantizer` | Q5\_K       |
| `Q6_KQuantizer` | Q6\_K       |
| `Q8_0Quantizer` | Q8\_0       |

#### Configuring adaptive quantization

```java
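import org.nd4j.autodiff.samediff.SameDiff;
import org.nd4j.ggml.export.GGMLModelExport;
import org.nd4j.ggml.export.ExportOptions;
import org.nd4j.ggml.quantization.GGMLQuantType;

import java.io.File;

SameDiff sameDiff = /* an imported or fine-tuned model */;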

ExportOptions options = ExportOptions.builder()
        // Target model size in bytes (e.g., 4 GB)
        .targetSizeBytes(4L * 1024 * 1024 * 1024)
        // Default quant type for non-attention layers
        .defaultQuantType(GGMLQuantType.Q4_K_M)
        // Higher quality for attention projections
        .attentionQuantType(GGMLQuantType.Q6_K)
        // Enable per-layer dynamic analysis
        .adaptiveQuantization(true)
        .build();

GGMLModelExport.exportModel(sameDiff, new File("output.gguf"), options);
```

***

### Round-Trip Export

`GGMLModelExport` writes a SameDiff graph back to GGUF format, enabling workflows that:

* Fine-tune a model inside SameDiff, then re-export for use with llama.cpp or other GGUF-native runtimes.
* Quantize a float32 SameDiff model to GGUF for distribution.
* Convert between quantization levels (e.g., Q8\_0 -> Q4\_K\_M) without leaving the JVM.

Export counterpart architecture classes (e.g., `LLaMAExportArchitecture`) handle the reverse tensor name mapping from SameDiff variable names back to the GGUF `blk.N.*` naming convention. `ExportArchitectureRegistry` mirrors `ArchitectureRegistry` and uses the same `ServiceLoader` discovery mechanism.

#### Basic export

```java
import org.nd4j.ggml.export.GGMLModelExport;
import org.nd4j.ggml.export.ExportOptions;
import org.nd4j.ggml.quantization.GGMLQuantType;
import org.nd4j.autodiff.samediff.SameDiff;

import java.io.File;

SameDiff model = /* load or fine-tune your model */;

ExportOptions options = ExportOptions.builder()
        .defaultQuantType(GGMLQuantType.Q4_K_M)
        .build();

GGMLModelExport.exportModel(model, new File("finetuned-llama.gguf"), options);
```

`GGUFWriter` handles alignment padding so that the output file is compatible with any standard GGUF reader.
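
Alignment padding is simple arithmetic: each tensor's data offset is rounded up to the next multiple of the alignment. The sketch below shows the calculation only; it is not `GGUFWriter` code, and the names are illustrative. GGUF uses an alignment of 32 bytes unless the `general.alignment` metadata key says otherwise.

```java
final class AlignmentSketch {
    // GGUF default when the general.alignment key is absent.
    static final long DEFAULT_ALIGNMENT = 32;

    /** Rounds offset up to the next multiple of alignment. */
    static long alignUp(long offset, long alignment) {
        return (offset + alignment - 1) / alignment * alignment;
    }

    /** Number of zero bytes to write before the next tensor. */
    static long paddingFor(long offset, long alignment) {
        return alignUp(offset, alignment) - offset;
    }

    public static void main(String[] args) {
        // A 144-byte tensor starting at offset 0 is followed by 16 padding bytes.
        System.out.println(alignUp(144, DEFAULT_ALIGNMENT));    // 160
        System.out.println(paddingFor(144, DEFAULT_ALIGNMENT)); // 16
    }
}
```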

***

### Pipeline SPI Modules

The pipeline SPI provides a format-agnostic loading interface. When multiple format adapters are on the classpath, code written against `samediff-pipeline-core` works with any supported format without change.

#### Module overview

| Maven artifact                  | Purpose                                                               |
| ------------------------------- | --------------------------------------------------------------------- |
| `samediff-pipeline-core`        | Shared SPI interfaces (`ModelPipelineLoader`, `PipelineFormat`, etc.) |
| `samediff-pipeline-ggml`        | Adapts `nd4j-ggml` behind the SPI; auto-registers via `ServiceLoader` |
| `samediff-pipeline-safetensors` | Loads Hugging Face SafeTensors (`.safetensors`) files                 |
| `samediff-pipeline-onnx`        | Loads ONNX models through the SameDiff ONNX importer                  |

#### Using the pipeline API

```java
import org.nd4j.pipeline.ModelPipelineLoader;
import org.nd4j.pipeline.PipelineFormat;
import org.nd4j.autodiff.samediff.SameDiff;

import java.io.File;

// The loader auto-detects the format from the file magic bytes / extension.
// If both samediff-pipeline-ggml and samediff-pipeline-safetensors are on
// the classpath, this single call works for .gguf or .safetensors files.
SameDiff model = ModelPipelineLoader.load(new File("model.gguf"));
```

Explicit format selection:

```java
SameDiff model = ModelPipelineLoader.load(new File("model.gguf"), PipelineFormat.GGUF);
SameDiff model2 = ModelPipelineLoader.load(new File("model.safetensors"), PipelineFormat.SAFETENSORS);
SameDiff model3 = ModelPipelineLoader.load(new File("model.onnx"), PipelineFormat.ONNX);
```

***

### Multimodal Model Support

Several vision-language and audio models are distributed as multiple GGUF shards: a language model shard and one or more projection or vision encoder shards. `MultimodalGGUFLoader` assembles these into a unified `SameDiff` graph.

#### Supported multimodal families

| Model family | Architecture handler   | Shards                    |
| ------------ | ---------------------- | ------------------------- |
| Qwen3-VL     | `Qwen3VLArchitecture`  | Language + vision encoder |
| SmolVLM2     | `SmolVLM2Architecture` | Language + vision encoder |
| MiniCPM-V    | `MiniCPMVArchitecture` | Language + vision encoder |

#### Loading a multimodal GGUF

```java
import org.nd4j.ggml.multimodal.MultimodalGGUFLoader;
import org.nd4j.autodiff.samediff.SameDiff;

import java.io.File;
import java.util.List;

// Provide the language model shard first, then vision encoder shard(s)
List<File> shards = List.of(
        new File("qwen3-vl-7b-language.gguf"),
        new File("qwen3-vl-7b-vision.gguf")
);

SameDiff model = MultimodalGGUFLoader.load(shards);
```

`MultimodalGGUFLoader` reads the metadata from each shard to identify its role, delegates to the matching architecture handler (for example, `Qwen3VLArchitecture`), and merges the resulting SameDiff sub-graphs into a shared variable namespace.

***

### API Reference

#### `GGMLModelImport`

Primary entry point for all import operations.

| Method         | Signature                                                           | Description                                                                        |
| -------------- | ------------------------------------------------------------------- | ---------------------------------------------------------------------------------- |
| `importModel`  | `static SameDiff importModel(File file)`                            | Reads a GGUF or GGML file, dequantizes all tensors, and returns a populated SameDiff graph |
| `importModel`  | `static SameDiff importModel(File file, ConversionOptions options)` | Import with custom conversion options (target dtype, layer filter, etc.)           |
| `inspectModel` | `static GGMLMetadata inspectModel(File file)`                       | Reads header and metadata only; does not load tensor data                          |
| `convertToSDZ` | `static void convertToSDZ(File src, File dst)`                      | Converts GGUF to DL4J native SDZ format for fast subsequent loads                  |

#### `GGMLModelExport`

Round-trip export from SameDiff back to GGUF.

| Method        | Signature                                                               | Description                                                       |
| ------------- | ----------------------------------------------------------------------- | ----------------------------------------------------------------- |
| `exportModel` | `static void exportModel(SameDiff sd, File dst, ExportOptions options)` | Writes SameDiff graph to GGUF with the given quantization options |

#### `GGUFReader`

Low-level reader for the GGUF binary format.

| Method                             | Description                                                               |
| ---------------------------------- | ------------------------------------------------------------------------- |
| `readHeader()`                     | Parses magic, version, tensor count, KV count                             |
| `readMetadata()`                   | Returns the full metadata KV map                                          |
| `readTensorDescriptors()`          | Returns list of `TensorDescriptor` (name, shape, quant type, data offset) |
| `readTensorData(TensorDescriptor)` | Returns raw quantized bytes for a single tensor                           |

#### `GGMLMetadata`

Structured view of GGUF metadata KV entries.

| Method                  | Description                                                             |
| ----------------------- | ----------------------------------------------------------------------- |
| `getArchitecture()`     | Returns the `general.architecture` string (e.g., `"llama"`, `"gemma2"`) |
| `getTensorCount()`      | Total number of tensors in the file                                     |
| `getContextLength()`    | `{arch}.context_length` KV value (e.g., `llama.context_length`)         |
| `getQuantizationType()` | Most common quant type across all tensors                               |
| `get(String key)`       | Returns raw KV value by key                                             |

#### `ArchitectureRegistry`

| Method                                      | Description                                                    |
| ------------------------------------------- | -------------------------------------------------------------- |
| `findCompatible(GGMLMetadata)`              | Returns first compatible `ModelArchitecture` in priority order |
| `register(ModelArchitecture, int priority)` | Registers a custom architecture handler                        |
| `listAll()`                                 | Returns all registered handlers                                |

#### `GGMLQuantType`

Enum of quantization types with bits-per-weight metadata.

```java
GGMLQuantType qt = GGMLQuantType.Q4_K_M;
double bpw = qt.getBitsPerWeight();  // 4.8
int blockSize = qt.getBlockSize();   // 256
String name = qt.name();             // "Q4_K_M"
```

#### `AdaptiveLayerQuantizer`

```java
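import org.nd4j.autodiff.samediff.SameDiff;
import org.nd4j.ggml.quantization.AdaptiveLayerQuantizer;
import org.nd4j.ggml.quantization.GGMLQuantType;
import org.nd4j.ggml.quantization.QuantizationBudget;

import java.util.Map;

SameDiff sameDiff = /* an imported or fine-tuned model */;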

QuantizationBudget budget = QuantizationBudget.ofBytes(4L * 1024 * 1024 * 1024);
AdaptiveLayerQuantizer quantizer = new AdaptiveLayerQuantizer(budget);

// Returns a map of SameDiff variable name -> assigned GGMLQuantType
Map<String, GGMLQuantType> plan = quantizer.buildQuantizationPlan(sameDiff);
```

#### `DynamicQuantizationAnalyzer`

```java
import org.nd4j.ggml.quantization.DynamicQuantizationAnalyzer;
import org.nd4j.ggml.quantization.QuantizationRecommendation;
import org.nd4j.linalg.api.ndarray.INDArray;

DynamicQuantizationAnalyzer analyzer = new DynamicQuantizationAnalyzer();
INDArray weights = /* your weight tensor */;

QuantizationRecommendation rec = analyzer.analyze(weights);
System.out.println("Recommended type: " + rec.getRecommendedType());
System.out.println("Outlier ratio:    " + rec.getOutlierRatio());
```

#### `MultimodalGGUFLoader`

```java
// shards: List<File>, language model shard first; options: ConversionOptions
SameDiff model = MultimodalGGUFLoader.load(shards);
SameDiff modelWithOptions = MultimodalGGUFLoader.load(shards, options);
```

***

### Architecture Auto-Detection (ADR 0054)

When `GGMLModelImport.importModel()` is called, the importer first reads the GGUF metadata header and passes it to `ArchitectureRegistry.findCompatible()`. The registry resolves the correct handler without any user input.

#### How detection works

Every GGUF file written by a conforming tool sets the `general.architecture` key in its metadata. The value is a short ASCII string such as `"llama"`, `"mistral"`, `"bert"`, or `"gpt2"`. The detection sequence is:

1. Read `general.architecture` from the GGUF KV metadata.
2. Look up the string directly in the registry's name/variant map.
3. If found, call `isCompatible(metadata)` on the candidate. Some handlers check secondary fields (e.g., a `general.architecture` of `"llama"` with a `num_experts` field triggers the MoE handler for Llama 4).
4. If no direct match, iterate all registered handlers and call `isCompatible()` on each.
5. If still unresolved, fall back to `GenericArchitecture`, which maps every tensor by its raw GGUF name.

The handler that wins provides two things: an `ArchitectureConfig` (derived from metadata fields like `llama.embedding_length`, `llama.block_count`, `llama.attention.head_count_kv`) and a `buildSameDiff()` implementation that constructs the SameDiff computational graph.

#### Inspecting the detected architecture before loading

```java
import org.nd4j.ggml.GGMLModelImport;
import org.nd4j.ggml.format.GGMLMetadata;

import java.io.File;

GGMLMetadata meta = GGMLModelImport.inspectModel(new File("model.gguf"));

// "general.architecture" from the KV metadata
System.out.println("Architecture : " + meta.getArchitecture());      // e.g. "llama"
System.out.println("Model name   : " + meta.get("general.name"));    // e.g. "Meta-Llama-3.1-8B"

// Architecture-specific hyperparameters
System.out.println("Hidden size  : " + meta.get("llama.embedding_length"));  // 4096
System.out.println("Layer count  : " + meta.get("llama.block_count"));       // 32
System.out.println("KV heads     : " + meta.get("llama.attention.head_count_kv")); // 8
```

#### Architecture-specific features extracted from metadata

| Metadata key                              | Description                     | Used by                  |
| ----------------------------------------- | ------------------------------- | ------------------------ |
| `general.architecture`                    | Primary architecture identifier | All handlers             |
| `{arch}.embedding_length`                 | Hidden dimension size           | LLaMA, Mistral, Gemma, … |
| `{arch}.block_count`                      | Number of transformer layers    | All transformer handlers |
| `{arch}.attention.head_count`             | Number of query heads           | All transformer handlers |
| `{arch}.attention.head_count_kv`          | Number of KV heads (GQA)        | LLaMA 3, Mistral, Gemma  |
| `{arch}.context_length`                   | Maximum sequence length         | All handlers             |
| `{arch}.rope.freq_base`                   | RoPE frequency base             | LLaMA, Mistral, Phi      |
| `{arch}.attention.layer_norm_rms_epsilon` | RMSNorm epsilon                 | LLaMA, Mistral           |
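
To show how these keys combine, the sketch below derives two attention quantities from a plain metadata map: the per-head dimension and the number of query heads that share one KV head. It is not the `ArchitectureConfig` implementation; the class and method names are hypothetical, the input values are illustrative, and it assumes the common convention `head_dim = embedding_length / head_count`.

```java
import java.util.HashMap;
import java.util.Map;

final class ArchConfigSketch {
    /** Reads a required numeric key of the form "{arch}.{suffix}". */
    static long require(Map<String, Object> kv, String arch, String suffix) {
        Object value = kv.get(arch + "." + suffix);
        if (!(value instanceof Number)) {
            throw new IllegalArgumentException("missing key: " + arch + "." + suffix);
        }
        return ((Number) value).longValue();
    }

    /** Returns {headDim, queryHeadsPerKvHead}. */
    static long[] attentionGeometry(Map<String, Object> kv) {
        String arch = (String) kv.get("general.architecture");
        long hidden = require(kv, arch, "embedding_length");
        long heads = require(kv, arch, "attention.head_count");
        // Without head_count_kv, every query head has its own KV head.
        Object kvHeadsRaw = kv.get(arch + ".attention.head_count_kv");
        long kvHeads = kvHeadsRaw instanceof Number ? ((Number) kvHeadsRaw).longValue() : heads;
        return new long[] {hidden / heads, heads / kvHeads};
    }

    /** Illustrative metadata for a grouped-query-attention model. */
    static Map<String, Object> sampleMetadata() {
        Map<String, Object> kv = new HashMap<>();
        kv.put("general.architecture", "llama");
        kv.put("llama.embedding_length", 4096);
        kv.put("llama.attention.head_count", 32);
        kv.put("llama.attention.head_count_kv", 8);
        return kv;
    }

    public static void main(String[] args) {
        long[] g = attentionGeometry(sampleMetadata());
        System.out.println("head dim = " + g[0] + ", query heads per KV head = " + g[1]);
    }
}
```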

#### GGUF tensor naming to SameDiff variable mapping

Each architecture handler declares a `getTensorNamePatterns()` map that `LayerTensorDiscovery` uses to translate GGUF block-indexed names (e.g., `blk.0.attn_q.weight`) to canonical SameDiff names (e.g., `model.layers.0.self_attn.q_proj.weight`). The LLaMA handler's mapping is representative:

| GGUF tensor name         | SameDiff variable name                   |
| ------------------------ | ---------------------------------------- |
| `token_embd.weight`      | `model.embed_tokens.weight`              |
| `blk.0.attn_q.weight`    | `model.layers.0.self_attn.q_proj.weight` |
| `blk.15.ffn_gate.weight` | `model.layers.15.mlp.gate_proj.weight`   |
| `output_norm.weight`     | `model.norm.weight`                      |
| `output.weight`          | `lm_head.weight`                         |
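
The translation in the table amounts to a lookup for global tensors plus a pattern match on the `blk.N.` prefix. The sketch below covers only the rows shown above; it is not the `LayerTensorDiscovery` source, the class name is hypothetical, and passing unknown names through unchanged is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TensorNameSketch {
    private static final Pattern BLOCK = Pattern.compile("^blk\\.(\\d+)\\.(.+)$");
    private static final Map<String, String> GLOBAL = new LinkedHashMap<>();
    private static final Map<String, String> PER_LAYER = new LinkedHashMap<>();

    static {
        GLOBAL.put("token_embd.weight", "model.embed_tokens.weight");
        GLOBAL.put("output_norm.weight", "model.norm.weight");
        GLOBAL.put("output.weight", "lm_head.weight");
        PER_LAYER.put("attn_q.weight", "self_attn.q_proj.weight");
        PER_LAYER.put("ffn_gate.weight", "mlp.gate_proj.weight");
    }

    /** Maps a GGUF tensor name to its SameDiff variable name. */
    static String toSameDiffName(String ggufName) {
        String global = GLOBAL.get(ggufName);
        if (global != null) return global;
        Matcher m = BLOCK.matcher(ggufName);
        if (m.matches()) {
            String mapped = PER_LAYER.get(m.group(2));
            if (mapped != null) return "model.layers." + m.group(1) + "." + mapped;
        }
        return ggufName; // assumption: unknown names pass through unchanged
    }

    public static void main(String[] args) {
        System.out.println(toSameDiffName("blk.15.ffn_gate.weight"));
        System.out.println(toSameDiffName("output.weight"));
    }
}
```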

***

### Quantization Handling During Import (ADR 0053)

#### The dequantization decision

ND4J does not support native quantized tensor operations, so the importer dequantizes each tensor to floating-point before creating the corresponding `SDVariable`. The target precision is controlled by `ConversionOptions`:

```java
import org.nd4j.autodiff.samediff.SameDiff;
import org.nd4j.ggml.GGMLModelImport;
import org.nd4j.ggml.format.ConversionOptions;
import org.nd4j.linalg.api.buffer.DataType;

import java.io.File;

// Default: dequantize to float32
SameDiff model = GGMLModelImport.importModel(new File("model.gguf"));

// Dequantize to float16 to halve memory usage
ConversionOptions opts = ConversionOptions.builder()
        .targetDtype(DataType.HALF)
        .build();
SameDiff model16 = GGMLModelImport.importModel(new File("model.gguf"), opts);
```

The available `QuantizationMode` values (set via `ConversionOptions`) are:

| Mode                     | Output dtype | When to use                           |
| ------------------------ | ------------ | ------------------------------------- |
| `DEQUANTIZE_TO_FLOAT32`  | FLOAT32      | Maximum accuracy; fine-tuning         |
| `DEQUANTIZE_TO_FLOAT16`  | FLOAT16      | Halved memory; good for inference     |
| `DEQUANTIZE_TO_BFLOAT16` | BFLOAT16     | Hardware-specific (e.g., Ampere GPUs) |
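
Because every tensor is expanded to floating point, the in-memory footprint depends on the target dtype rather than on the size of the `.gguf` file. A rough estimate is the weight count multiplied by the bytes per element (4 for FLOAT32, 2 for FLOAT16 and BFLOAT16). The sketch below uses an illustrative 8-billion-weight model and hypothetical names; it ignores activations and any runtime overhead.

```java
final class DequantMemorySketch {
    /** Bytes needed to hold the dequantized weights. */
    static long dequantizedBytes(long weightCount, int bytesPerElement) {
        return Math.multiplyExact(weightCount, (long) bytesPerElement);
    }

    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long weights = 8_000_000_000L; // illustrative model size
        System.out.printf("FLOAT32: %.1f GiB%n", toGiB(dequantizedBytes(weights, 4)));
        System.out.printf("FLOAT16: %.1f GiB%n", toGiB(dequantizedBytes(weights, 2)));
    }
}
```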

#### How dequantization works inside the importer

`GGMLToSameDiffConverter` iterates the tensor descriptors from `GGUFReader` and for each one:

1. Reads the raw quantized bytes with `GGUFReader.readTensorData(TensorDescriptor)`.
2. Looks up the matching `Dequantizer` in `DequantizerFactory` by `GGMLDataType`.
3. Calls `dequantizer.dequantizeToArray(bytes, shape, targetDtype)`, which decodes the block structure and reconstructs floating-point values.
4. Wraps the result in an `INDArray` and stores it as an `SDVariable` constant.

#### Block-level dequantization mechanics

Each quantization format packs values into fixed-size blocks with a shared scale (and sometimes a minimum). Two representative examples:

**Q4\_0** (legacy, 18 bytes per 32 values):

* 2 bytes: FP16 scale
* 16 bytes: 32 four-bit unsigned values (centered at 8, so value = `(nibble - 8) * scale`)

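**Q4\_K** (K-quant, 144 bytes per 256 values):

* 4 bytes: super-block FP16 scale (`d`) and FP16 min (`dmin`), 2 bytes each
* 12 bytes: 8 sub-block 6-bit scales and 8 sub-block 6-bit minimums, packed together (16 × 6 bits = 96 bits)
* 128 bytes: 256 four-bit values (value = `nibble * sub_scale - sub_min`, where `sub_scale = d * scale_6bit` and `sub_min = dmin * min_6bit`)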
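
To make the Q4\_0 layout concrete, here is a minimal plain-Java decoder for a single block. It is a sketch and not the importer's `Dequantizer` code: the class and method names are hypothetical, and the nibble order (low nibbles fill elements 0 to 15, high nibbles fill elements 16 to 31) is assumed to follow the llama.cpp reference layout.

```java
/** Decodes one Q4_0 block (18 bytes -> 32 floats). All names are illustrative. */
class Q40BlockSketch {
    static final int BLOCK_BYTES = 18;
    static final int BLOCK_VALUES = 32;

    /** Converts IEEE 754 half-precision bits to a float. */
    static float halfToFloat(int h) {
        int sign = (h >>> 15) & 0x1;
        int exp = (h >>> 10) & 0x1F;
        int mant = h & 0x3FF;
        float value;
        if (exp == 0) {
            value = mant * (1.0f / (1 << 24));            // zero or subnormal
        } else if (exp == 31) {
            value = mant == 0 ? Float.POSITIVE_INFINITY : Float.NaN;
        } else {
            // Re-bias the exponent (15 -> 127) and widen the mantissa (10 -> 23 bits)
            value = Float.intBitsToFloat(((exp + 112) << 23) | (mant << 13));
        }
        return sign == 0 ? value : -value;
    }

    static float[] dequantize(byte[] block) {
        if (block.length != BLOCK_BYTES) {
            throw new IllegalArgumentException("Q4_0 block must be 18 bytes");
        }
        // Bytes 0-1: little-endian FP16 scale
        float scale = halfToFloat((block[0] & 0xFF) | ((block[1] & 0xFF) << 8));
        float[] out = new float[BLOCK_VALUES];
        for (int j = 0; j < 16; j++) {
            int packed = block[2 + j] & 0xFF;
            out[j] = ((packed & 0x0F) - 8) * scale;       // low nibble
            out[j + 16] = ((packed >>> 4) - 8) * scale;   // high nibble
        }
        return out;
    }
}
```

A tensor with `n` values is decoded by applying this to each of its `n / 32` consecutive 18-byte blocks.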

#### Lazy vs eager dequantization

By default, `importModel` dequantizes all tensors **eagerly** — every tensor is decoded and loaded into memory before the method returns. For very large models, you can work at the reader level to dequantize one tensor at a time:

```java
import org.nd4j.ggml.format.GGUFReader;
import org.nd4j.ggml.format.TensorDescriptor;
import org.nd4j.linalg.api.ndarray.INDArray;

import java.io.File;

try (GGUFReader reader = new GGUFReader(new File("model.gguf"))) {
    reader.readHeader();
    reader.readMetadata();

    for (TensorDescriptor td : reader.readTensorDescriptors()) {
        // Dequantize one tensor at a time — only this tensor is in memory
        INDArray decoded = reader.readAndDequantize(td);
        // process decoded ...
    }
}
```

#### Typical reconstruction error by format

| Format | Mean absolute error | Max error |
| ------ | ------------------- | --------- |
| Q8\_0  | \~0.001             | \~0.01    |
| Q6\_K  | \~0.002             | \~0.02    |
| Q5\_K  | \~0.003             | \~0.03    |
| Q4\_K  | \~0.005             | \~0.05    |
| Q4\_0  | \~0.008             | \~0.08    |
| Q3\_K  | \~0.010             | \~0.10    |
| Q2\_K  | \~0.020             | \~0.20    |

These values are approximate and depend on the weight distribution of each tensor. They are in line with the llama.cpp reference implementation, are generally acceptable for inference, and the lost accuracy can be partly recovered by fine-tuning.
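
The two error columns can be measured with a simple round trip. The sketch below is an assumption-laden illustration, not library code: it uses a simplified 8-bit block quantizer (one shared scale per block, scale kept in float32 rather than FP16) and hypothetical class and method names, then computes the mean and maximum absolute error between the original and decoded values.

```java
/** Measures reconstruction error for a simplified 8-bit block quantizer (illustrative). */
class QuantErrorSketch {
    /** Quantizes one block with a shared scale, then decodes it again. */
    static float[] roundTripQ8(float[] block) {
        float maxAbs = 0f;
        for (float v : block) maxAbs = Math.max(maxAbs, Math.abs(v));
        float scale = maxAbs / 127f;                      // map [-maxAbs, maxAbs] to [-127, 127]
        float[] out = new float[block.length];
        for (int i = 0; i < block.length; i++) {
            int q = scale == 0f ? 0 : Math.round(block[i] / scale);
            out[i] = q * scale;
        }
        return out;
    }

    static double meanAbsError(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
        return sum / a.length;
    }

    static double maxAbsError(float[] a, float[] b) {
        double max = 0;
        for (int i = 0; i < a.length; i++) max = Math.max(max, Math.abs(a[i] - b[i]));
        return max;
    }

    public static void main(String[] args) {
        float[] x = new float[32];
        for (int i = 0; i < x.length; i++) x[i] = (float) Math.sin(i);
        float[] y = roundTripQ8(x);
        System.out.printf("MAE=%.5f max=%.5f%n", meanAbsError(x, y), maxAbsError(x, y));
    }
}
```

With this scheme the maximum error per value is bounded by half the block scale, which is why formats with fewer bits per value show larger errors.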

***

### GGUF Export Depth

The export path mirrors the import path with a dedicated set of classes:

* **`GGMLModelExport`** — public entry point; calls `ExportArchitectureRegistry` and `GGUFWriter`.
* **`SameDiffToGGMLConverter`** — coordinates the export: iterates SameDiff variables, dispatches to per-layer quantizers, writes tensor descriptors and data blocks.
* **`GGUFWriter`** — low-level binary writer; handles alignment padding (default 32 bytes) so output files are compatible with llama.cpp and other GGUF readers.
* **`ExportArchitectureRegistry`** — counterpart to `ArchitectureRegistry`; holds `ExportArchitecture` handlers (e.g., `LLaMAExportArchitecture`) that reverse-map SameDiff variable names back to the GGUF `blk.N.*` naming convention.
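
The alignment padding mentioned for `GGUFWriter` is ordinary round-up arithmetic. A minimal sketch follows, assuming the 32-byte default; the class and method names are hypothetical and do not reflect the writer's actual internals.

```java
/** Round-up arithmetic for aligned tensor data offsets (illustrative names). */
class GgufAlignmentSketch {
    static final int DEFAULT_ALIGNMENT = 32;

    /** Returns the smallest multiple of alignment that is >= offset. */
    static long alignUp(long offset, int alignment) {
        long remainder = offset % alignment;
        return remainder == 0 ? offset : offset + (alignment - remainder);
    }

    /** Number of zero bytes to write before the next aligned position. */
    static int paddingFor(long offset, int alignment) {
        return (int) (alignUp(offset, alignment) - offset);
    }

    public static void main(String[] args) {
        long end = 1_000_018L; // hypothetical end offset of the previous tensor
        System.out.println("pad " + paddingFor(end, DEFAULT_ALIGNMENT)
                + " bytes, next tensor at " + alignUp(end, DEFAULT_ALIGNMENT));
    }
}
```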

#### Round-trip workflow

```
GGUF file
    │ GGMLModelImport.importModel()
    ▼
SameDiff graph (float32 / float16 weights)
    │ [optional fine-tuning or modification]
    │
    │ GGMLModelExport.exportModel()
    │   └─ SameDiffToGGMLConverter
    │       ├─ ExportArchitectureRegistry.findHandler()
    │       ├─ per-variable quantization (Q4_K_M, Q6_K, …)
    │       └─ GGUFWriter.write()
    ▼
GGUF file (loadable by llama.cpp, Ollama, etc.)
```

The exported file includes re-generated GGUF metadata KV pairs derived from the SameDiff graph's variable shapes and the `ExportOptions` you supply.

***

### Adaptive Quantization: `AdaptiveLayerQuantizer`

When you export a SameDiff model to GGUF, uniform quantization across all layers is rarely optimal. `AdaptiveLayerQuantizer` implements a budget-aware assignment algorithm:

#### Budget-aware Q2\_K-to-F32 walk

Starting from the most aggressive compression (`Q2_K`), the quantizer walks up the precision ladder one step at a time — Q2\_K → Q3\_K\_S → Q3\_K\_M → Q4\_K\_S → Q4\_K\_M → Q5\_K\_S → Q5\_K\_M → Q6\_K → Q8\_0 → F16 → F32 — assigning higher precision to layers that have the highest sensitivity score until the cumulative model size would exceed the target budget.

Sensitivity is computed by `DynamicQuantizationAnalyzer`, which examines each weight tensor's value distribution (min, max, kurtosis, outlier ratio).
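
The distribution statistics named above can be computed in a couple of passes over a weight array. The sketch below is illustrative only: `TensorStatsSketch` is a hypothetical class, kurtosis is taken as the fourth central moment divided by the squared variance, and an outlier is assumed to be any value more than three standard deviations from the mean. How `DynamicQuantizationAnalyzer` combines these numbers into a score is not described here, so the sketch stops at the raw statistics.

```java
/** Per-tensor distribution statistics (illustrative, not the analyzer's actual code). */
class TensorStatsSketch {
    final double min, max, kurtosis, outlierRatio;

    TensorStatsSketch(double min, double max, double kurtosis, double outlierRatio) {
        this.min = min;
        this.max = max;
        this.kurtosis = kurtosis;
        this.outlierRatio = outlierRatio;
    }

    static TensorStatsSketch of(float[] w) {
        if (w.length == 0) throw new IllegalArgumentException("empty tensor");
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY, sum = 0;
        for (float v : w) {
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += v;
        }
        double mean = sum / w.length;

        // Second and fourth central moments
        double m2 = 0, m4 = 0;
        for (float v : w) {
            double d = v - mean;
            m2 += d * d;
            m4 += d * d * d * d;
        }
        m2 /= w.length;
        m4 /= w.length;

        // Assumed outlier rule: more than 3 standard deviations from the mean
        double std = Math.sqrt(m2);
        int outliers = 0;
        for (float v : w) {
            if (Math.abs(v - mean) > 3 * std) outliers++;
        }
        double kurtosis = m2 == 0 ? 0 : m4 / (m2 * m2);
        return new TensorStatsSketch(min, max, kurtosis, (double) outliers / w.length);
    }
}
```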

#### Protected layers

Two categories of layers are always assigned higher precision, regardless of budget:

* **Embedding matrices** (`token_embd.weight`, `output.weight`): these span the entire vocabulary and errors compound across all token positions.
* **First and last transformer blocks**: errors in the initial and final blocks have the largest impact on output quality.

All other layers compete in the budget walk.
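
One way to picture the budget walk is as a greedy loop: start every layer at the lowest rung, then repeatedly try to move layers up one rung in order of sensitivity, keeping an upgrade only if the total size still fits. The sketch below is an assumption about the general shape of such an algorithm, not the code in `AdaptiveLayerQuantizer`. It uses a shortened ladder with approximate bits-per-weight figures, hypothetical class names, and leaves out the protected-layer rule.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Greedy budget-aware precision assignment (illustrative sketch). */
class BudgetWalkSketch {
    // Shortened ladder; bits-per-weight values are approximate assumptions
    static final String[] LADDER = {"Q2_K", "Q4_K", "Q8_0", "F16"};
    static final double[] BITS = {2.625, 4.5, 8.5, 16.0};

    static final class Layer {
        final String name;
        final long params;
        final double sensitivity;
        int level = 0;                                    // index into LADDER

        Layer(String name, long params, double sensitivity) {
            this.name = name;
            this.params = params;
            this.sensitivity = sensitivity;
        }
    }

    static long sizeBytes(List<Layer> layers) {
        double bits = 0;
        for (Layer l : layers) bits += l.params * BITS[l.level];
        return (long) Math.ceil(bits / 8.0);
    }

    static Map<String, String> plan(List<Layer> layers, long budgetBytes) {
        List<Layer> order = new ArrayList<>(layers);
        order.sort(Comparator.comparingDouble((Layer l) -> l.sensitivity).reversed());

        boolean upgraded = true;
        while (upgraded) {
            upgraded = false;
            for (Layer l : order) {                       // most sensitive first
                if (l.level + 1 >= LADDER.length) continue;
                l.level++;                                // tentative one-step upgrade
                if (sizeBytes(layers) <= budgetBytes) {
                    upgraded = true;
                } else {
                    l.level--;                            // does not fit, roll back
                }
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Layer l : layers) result.put(l.name, LADDER[l.level]);
        return result;
    }
}
```

The loop ends once a full pass produces no upgrade, so the result never exceeds the budget unless the all-`Q2_K` baseline already does.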

#### Building a quantization plan without exporting

```java
import org.nd4j.ggml.quantization.AdaptiveLayerQuantizer;
import org.nd4j.ggml.quantization.QuantizationBudget;
import org.nd4j.ggml.quantization.GGMLQuantType;
import org.nd4j.autodiff.samediff.SameDiff;

import java.util.Map;

SameDiff model = /* your loaded or fine-tuned model */;

// Target 4 GB total size
QuantizationBudget budget = QuantizationBudget.ofBytes(4L * 1024 * 1024 * 1024);
AdaptiveLayerQuantizer quantizer = new AdaptiveLayerQuantizer(budget);

// Returns a map from variable name to assigned quant type
Map<String, GGMLQuantType> plan = quantizer.buildQuantizationPlan(model);

plan.forEach((name, type) ->
    System.out.printf("%-60s  %s%n", name, type));
```

#### Exporting with adaptive quantization

```java
import org.nd4j.ggml.export.GGMLModelExport;
import org.nd4j.ggml.export.ExportOptions;
import org.nd4j.ggml.quantization.GGMLQuantType;

import java.io.File;

ExportOptions opts = ExportOptions.builder()
        .targetSizeBytes(4L * 1024 * 1024 * 1024)   // 4 GB budget
        .defaultQuantType(GGMLQuantType.Q4_K_M)     // floor for most layers
        .attentionQuantType(GGMLQuantType.Q6_K)     // higher quality for attn projections
        .adaptiveQuantization(true)                 // enable budget-aware walk
        .build();

GGMLModelExport.exportModel(model, new File("output-adaptive.gguf"), opts);
```

***

### Unified Model Loading: AutoModel and the Pipeline SPI

The `samediff-pipeline-core` module provides a format-agnostic loading API modeled on HuggingFace's `from_pretrained` pattern. `AutoModel` is the single entry point; the actual loading is delegated to whichever `PipelineLoader` implementation handles the detected format.

#### AutoModel.fromPretrained

`AutoModel.fromPretrained` accepts a path to either a single model file or a directory containing a model with a manifest. It detects the format from the file extension and magic bytes and dispatches to the registered loader.

```java
import org.eclipse.deeplearning4j.pipeline.AutoModel;
import org.nd4j.autodiff.samediff.SameDiff;

// Single GGUF file
SameDiff llama = AutoModel.fromPretrained("/models/llama-3-8b.Q4_K_M.gguf");

// Single SafeTensors file
SameDiff bert = AutoModel.fromPretrained("/models/bert-base/model.safetensors");

// Directory with manifest (selects loader from manifest format field)
SameDiff model = AutoModel.fromPretrained("/models/my-model/");
```

#### Supported format values (ModelFormat enum)

| Enum constant | File extension(s)     | Description                  |
| ------------- | --------------------- | ---------------------------- |
| `GGUF`        | `.gguf`, `.ggml`      | GGML Universal Format        |
| `SAFETENSORS` | `.safetensors`        | HuggingFace SafeTensors      |
| `ONNX`        | `.onnx`               | Open Neural Network Exchange |
| `SDZ`         | `.sdz`                | DL4J native SameDiff ZIP     |
| `PYTORCH`     | `.pt`, `.pth`, `.bin` | PyTorch pickle/TorchScript   |

`ModelFormat.fromFilename(String)` and `ModelFormat.fromExtension(String)` resolve the enum from a file name or bare extension string.
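
Extension-based resolution of this kind can be sketched with a small enum. The code below is a stand-in written for illustration: `ModelFormatSketch` is hypothetical, mirrors only the extensions in the table above, and returns `null` for unknown input, which may differ from how the real `ModelFormat` behaves.

```java
import java.util.Locale;

/** Stand-in enum showing extension-based format lookup (not the real ModelFormat). */
enum ModelFormatSketch {
    GGUF("gguf", "ggml"),
    SAFETENSORS("safetensors"),
    ONNX("onnx"),
    SDZ("sdz"),
    PYTORCH("pt", "pth", "bin");

    private final String[] extensions;

    ModelFormatSketch(String... extensions) {
        this.extensions = extensions;
    }

    /** Accepts "gguf" or ".gguf", case-insensitively; returns null when unknown. */
    static ModelFormatSketch fromExtension(String ext) {
        String e = ext.toLowerCase(Locale.ROOT);
        if (e.startsWith(".")) e = e.substring(1);
        for (ModelFormatSketch format : values()) {
            for (String known : format.extensions) {
                if (known.equals(e)) return format;
            }
        }
        return null;
    }

    /** Uses the text after the last dot, so "llama.Q4_K_M.gguf" resolves to GGUF. */
    static ModelFormatSketch fromFilename(String name) {
        int dot = name.lastIndexOf('.');
        return dot < 0 ? null : fromExtension(name.substring(dot + 1));
    }
}
```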

#### Customizing load behavior

```java
import org.eclipse.deeplearning4j.pipeline.AutoModel;
import org.eclipse.deeplearning4j.pipeline.PipelineLoader;
import org.nd4j.autodiff.samediff.SameDiff;

import java.io.File;

PipelineLoader.LoadConfig config = PipelineLoader.LoadConfig.builder()
        .dataType("float16")          // dequantize to float16
        .device("cpu")                // target device hint
        .dequantize(true)             // dequantize quantized weights
        .cacheConvertedModel(true)    // cache converted .sdz alongside source
        .cacheDirectory(new File("/tmp/model-cache"))
        .build();

SameDiff model = AutoModel.fromPretrained("/models/llama-3.Q4_K_M.gguf", config);
```

When `cacheConvertedModel` is `true`, `AutoModel` saves the resulting SameDiff graph as a `.sdz` file in the configured cache directory after the first conversion. Subsequent calls load from that cache, skipping dequantization entirely.

#### PipelineLoader SPI

Each format adapter implements `PipelineLoader` and registers itself via Java `ServiceLoader`. To add a custom format:

1. Implement `PipelineLoader`.
2. Add a `META-INF/services/org.eclipse.deeplearning4j.pipeline.PipelineLoader` file listing your class.
3. Put the jar on the classpath. `PipelineLoaderRegistry` discovers it automatically.

```java
import org.eclipse.deeplearning4j.pipeline.PipelineLoader;
import org.eclipse.deeplearning4j.pipeline.ModelFormat;
import org.eclipse.deeplearning4j.pipeline.ModelManifest;
import org.nd4j.autodiff.samediff.SameDiff;

import java.io.File;
import java.io.IOException;
import java.util.Map;

public class MyCustomLoader implements PipelineLoader {

    @Override
    public ModelFormat getFormat() {
        // Return an existing ModelFormat constant or extend if needed
        return ModelFormat.UNKNOWN;
    }

    @Override
    public boolean supports(ModelFormat format) {
        return format == ModelFormat.UNKNOWN; // match your format
    }

    @Override
    public SameDiff loadModel(ModelManifest manifest, LoadConfig config) throws IOException {
        // load from manifest.getPrimaryWeightFile()
        return loadModel(manifest.getPrimaryWeightFile(), config);
    }

    @Override
    public SameDiff loadModel(File file, LoadConfig config) throws IOException {
        SameDiff sd = SameDiff.create();
        // populate sd from file ...
        return sd;
    }

    @Override
    public Map<String, SameDiff> loadPipeline(ModelManifest manifest, LoadConfig config) throws IOException {
        return Map.of("model", loadModel(manifest, config));
    }
}
```

#### Loading a multi-component pipeline

For multimodal models that expose several sub-graphs (language model + vision encoder), use `pipelineFromPretrained`:

```java
import org.eclipse.deeplearning4j.pipeline.AutoModel;
import org.eclipse.deeplearning4j.pipeline.Pipeline;
import org.nd4j.autodiff.samediff.SameDiff;

Pipeline pipeline = AutoModel.pipelineFromPretrained("/models/qwen3-vl/");

SameDiff languageModel = pipeline.getComponents().get("language_model");
SameDiff visionEncoder = pipeline.getComponents().get("vision_encoder");
```

***

### SafeTensors Import

HuggingFace SafeTensors (`.safetensors`) files can be loaded through the unified `AutoModel` API or directly using `SafeTensorsReader`. Add the pipeline adapter to your Maven dependencies:

```xml
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>samediff-pipeline-safetensors</artifactId>
    <version>${dl4j.version}</version>
</dependency>
```

#### Via AutoModel (recommended)

```java
SameDiff model = AutoModel.fromPretrained("/models/bert-base-uncased/model.safetensors");
```

`AutoModel` detects the `.safetensors` extension, resolves `SafeTensorsPipelineLoader`, and returns a populated `SameDiff` graph.

#### Via SafeTensorsReader directly

`SafeTensorsReader` gives access to the raw tensor map without constructing a SameDiff graph, which is useful for inspection or custom assembly:

```java
import org.eclipse.deeplearning4j.safetensors.SafeTensorsReader;
import org.nd4j.linalg.api.ndarray.INDArray;

import java.io.File;
import java.util.Map;

File file = new File("/models/model.safetensors");

// Inspect: see all tensor names and shapes without loading data
try (SafeTensorsReader reader = SafeTensorsReader.open(file)) {
    System.out.println("Tensor count: " + reader.getTensorCount());
    reader.getTensorNames().forEach(name -> {
        var info = reader.getTensorInfo(name);
        System.out.printf("  %-50s  %s  %s%n",
                name, info.getSafeTensorsDtype(), java.util.Arrays.toString(info.getShape()));
    });
}

// Load all tensors into memory
Map<String, INDArray> tensors = SafeTensorsReader.loadFile(file);

// Load multiple shards (e.g., model-00001-of-00002.safetensors, …)
Map<String, INDArray> sharded = SafeTensorsReader.loadFiles(java.util.List.of(
        new File("/models/model-00001-of-00002.safetensors"),
        new File("/models/model-00002-of-00002.safetensors")
));
```

`SafeTensorsReader` reads the 8-byte little-endian header length, parses the JSON tensor index, then seeks directly to each tensor's data region using a `RandomAccessFile` + `FileChannel` — no intermediate copy of the full file into RAM.
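
The header layout described above is simple enough to read by hand. The following sketch uses only the JDK and is not the `SafeTensorsReader` implementation: the class and method names are hypothetical, it opens the channel with `FileChannel.open` instead of going through `RandomAccessFile`, and it writes a tiny one-tensor file purely so the example has something to read.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Reads the JSON header of a .safetensors file (illustrative sketch). */
class SafeTensorsHeaderSketch {
    static String readHeaderJson(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // First 8 bytes: little-endian length of the JSON header
            ByteBuffer len = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
            readFully(ch, len, 0);
            long headerLen = len.getLong(0);
            if (headerLen <= 0 || headerLen > ch.size() - 8 || headerLen > Integer.MAX_VALUE) {
                throw new IOException("Implausible header length: " + headerLen);
            }
            ByteBuffer json = ByteBuffer.allocate((int) headerLen);
            readFully(ch, json, 8);
            return new String(json.array(), StandardCharsets.UTF_8);
        }
    }

    private static void readFully(FileChannel ch, ByteBuffer buf, long pos) throws IOException {
        while (buf.hasRemaining()) {
            if (ch.read(buf, pos + buf.position()) < 0) {
                throw new IOException("Unexpected end of file");
            }
        }
    }

    /** Writes a minimal file holding one F32 tensor "w" of shape [1]. */
    static Path writeTinyFile() throws IOException {
        byte[] json = "{\"w\":{\"dtype\":\"F32\",\"shape\":[1],\"data_offsets\":[0,4]}}"
                .getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(8 + json.length + 4).order(ByteOrder.LITTLE_ENDIAN);
        buf.putLong(json.length).put(json).putFloat(1.0f);
        Path p = Files.createTempFile("tiny", ".safetensors");
        Files.write(p, buf.array());
        return p;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readHeaderJson(writeTinyFile()));
    }
}
```

Each tensor's `data_offsets` are relative to the first byte after the header, so a tensor's absolute position is `8 + headerLen + data_offsets[0]`.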

#### Supported SafeTensors dtypes

| SafeTensors dtype | ND4J DataType |
| ----------------- | ------------- |
| F32               | FLOAT         |
| F16               | HALF          |
| BF16              | BFLOAT16      |
| F64               | DOUBLE        |
| I64               | LONG          |
| I32               | INT           |
| I16               | SHORT         |
| I8                | BYTE          |
| U8                | UBYTE         |
| BOOL              | BOOL          |

***

### TorchScript Import

PyTorch `.pt` ZIP archives and `.safetensors` weight files can be imported via the `nd4j-torchscript` module. Add it to your Maven dependencies:

```xml
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-torchscript</artifactId>
    <version>${dl4j.version}</version>
</dependency>
```

#### Importing a .pt file

TorchScript `.pt` files are ZIP archives containing a `data.pkl` (Python pickle) and numbered tensor data files. `PickleParser` reads the pickle stream to reconstruct the model graph, and `TorchScriptReader` maps each tensor to its data file.

```java
import org.nd4j.autodiff.samediff.SameDiff;
import org.nd4j.linalg.api.buffer.DataType;
import org.nd4j.torchscript.TorchScriptModelImport;
import org.nd4j.torchscript.convert.ConversionOptions;

// Auto-detects format from extension
SameDiff resnet = TorchScriptModelImport.importModel("resnet50.pt");

// With options
ConversionOptions opts = ConversionOptions.builder()
        .targetDataType(DataType.FLOAT)
        .forTraining(false)
        .build();

SameDiff efficientnet = TorchScriptModelImport.importModel("efficientnet_b0.pt", opts);
```
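
Because a `.pt` file is a ZIP archive, its contents can be listed with the JDK alone before any import is attempted. The sketch below is independent of `TorchScriptReader`: the class name is hypothetical, and the demo archive it writes contains placeholder bytes rather than a real pickle or real tensor data.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

/** Lists the entries of a .pt ZIP archive (illustrative sketch). */
class PtArchiveListingSketch {
    static List<String> listEntries(Path pt) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(pt.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    /** Writes a stand-in archive with the same shape as a .pt file. */
    static Path writeDemoArchive() throws IOException {
        Path p = Files.createTempFile("demo", ".pt");
        try (OutputStream os = Files.newOutputStream(p);
             ZipOutputStream zos = new ZipOutputStream(os)) {
            zos.putNextEntry(new ZipEntry("demo/data.pkl"));
            zos.write(new byte[4]);                       // placeholder, not a real pickle
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("demo/data/0"));
            zos.write(new byte[16]);                      // placeholder tensor bytes
            zos.closeEntry();
        }
        return p;
    }

    public static void main(String[] args) throws IOException {
        listEntries(writeDemoArchive()).forEach(System.out::println);
    }
}
```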

#### Inspecting a file before import

```java
import org.nd4j.torchscript.format.TorchScriptMetadata;

TorchScriptMetadata meta = TorchScriptModelImport.inspectModel("model.pt");

System.out.println("Format              : " + meta.getFormat());
System.out.println("Detected arch       : " + meta.getArchitecture());
System.out.println("Total parameters    : " + meta.getTotalParameters());
for (var tensor : meta.getTensors()) {
    System.out.printf("  %-40s  %s%n",
            tensor.getName(), java.util.Arrays.toString(tensor.getShape()));
}
```

#### Supported architectures (TorchScript)

Architecture detection in `TorchScriptToSameDiffConverter` is pattern-based — it looks for characteristic weight names in the tensor map:

| Architecture | Detection signals                                                                   |
| ------------ | ----------------------------------------------------------------------------------- |
| ResNet       | `layer1.0.conv1.weight`, `layer1.0.downsample.0`, `fc.weight`                       |
| VGG          | `features.0.weight`, `classifier.0.weight`                                          |
| EfficientNet | `_conv_stem.weight`, `_blocks.0._expand_conv.weight`, `_blocks.0._se_reduce.weight` |
| Generic CNN  | Fallback for unrecognized patterns                                                  |

#### Convert a .pt model to SDZ for faster repeated loading

```java
TorchScriptModelImport.convertToSDZ("resnet50.pt", "resnet50.sdz");

// Subsequent loads skip the conversion entirely
SameDiff model = SameDiff.load(new File("resnet50.sdz"), true);
```

#### PyTorch-to-ND4J weight layout transformations

PyTorch and ND4J use different memory layouts for convolution and linear weights. `TorchScriptToSameDiffConverter` applies these automatically:

| Layer type     | PyTorch shape       | ND4J shape          | Transformation        |
| -------------- | ------------------- | ------------------- | --------------------- |
| Conv2D weights | `[out, in, kH, kW]` | `[kH, kW, in, out]` | `permute(2, 3, 1, 0)` |
| Linear weights | `[out, in]`         | `[in, out]`         | `transpose()`         |
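
The two transformations in the table are pure index shuffles. The sketch below shows them on plain Java arrays so the index mapping is explicit; it is illustrative only (hypothetical class name, nested arrays instead of `INDArray`) and says nothing about how the converter performs the permutation internally.

```java
/** Index-level view of the PyTorch-to-ND4J weight layout changes (illustrative). */
class ConvWeightLayoutSketch {
    /** [out][in][kH][kW] -> [kH][kW][in][out], equivalent to permute(2, 3, 1, 0). */
    static float[][][][] toNd4jConvLayout(float[][][][] w) {
        int out = w.length, in = w[0].length, kH = w[0][0].length, kW = w[0][0][0].length;
        float[][][][] r = new float[kH][kW][in][out];
        for (int o = 0; o < out; o++)
            for (int i = 0; i < in; i++)
                for (int h = 0; h < kH; h++)
                    for (int x = 0; x < kW; x++)
                        r[h][x][i][o] = w[o][i][h][x];
        return r;
    }

    /** [out][in] -> [in][out]. */
    static float[][] transposeLinear(float[][] w) {
        float[][] r = new float[w[0].length][w.length];
        for (int o = 0; o < w.length; o++)
            for (int i = 0; i < w[0].length; i++)
                r[i][o] = w[o][i];
        return r;
    }
}
```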

***

### Format Detection and Custom Architectures

#### Adding a custom architecture

Implement `ModelArchitecture` and register it either programmatically or via `ServiceLoader`:

```java
public class MyCustomArchitecture implements ModelArchitecture {

    @Override
    public boolean isCompatible(GGMLMetadata metadata) {
        return "my-arch".equals(metadata.getArchitecture());
    }

    @Override
    public SameDiff buildSameDiff(GGUFReader reader, ConversionOptions options) {
        SameDiff sd = SameDiff.create();
        // map tensors from reader to SameDiff variables
        for (TensorDescriptor td : reader.readTensorDescriptors()) {
            INDArray data = reader.readAndDequantize(td);
            sd.var(td.getName(), data);
        }
        return sd;
    }
}
```

Register programmatically (highest priority wins):

```java
ArchitectureRegistry.getInstance().register(new MyCustomArchitecture(), 100);
```

Or register via `ServiceLoader` by adding a file:

```
src/main/resources/META-INF/services/org.nd4j.ggml.architecture.ModelArchitecture
```

containing the fully qualified class name of your implementation. The registry picks it up automatically at startup.

***

### Troubleshooting

**`UnsupportedFormatException: Not a GGUF file`**: the file does not start with the four ASCII magic bytes `GGUF` (`47 47 55 46`, which is `0x46554747` when read as a little-endian 32-bit integer). Verify the file is a valid GGUF and was not corrupted during download. Use `GGMLFormatDetector.detect(file)` to check before importing.
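
If you want to check the magic yourself, for example in a download script, a JDK-only test is enough. This is a sketch that is separate from `GGMLFormatDetector`; the class and method names are hypothetical, and `InputStream.readNBytes` requires Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Path;

/** Checks whether a file begins with the GGUF magic (illustrative sketch). */
class GgufMagicSketch {
    static final int GGUF_MAGIC_LE = 0x46554747;          // "GGUF" read as little-endian int

    static boolean hasGgufMagic(Path file) throws IOException {
        byte[] head = new byte[4];
        try (InputStream in = Files.newInputStream(file)) {
            if (in.readNBytes(head, 0, 4) < 4) {
                return false;                             // file shorter than 4 bytes
            }
        }
        int magic = ByteBuffer.wrap(head).order(ByteOrder.LITTLE_ENDIAN).getInt();
        return magic == GGUF_MAGIC_LE;
    }
}
```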

**`UnknownArchitectureException`**: none of the registered architecture handlers returned `true` from `isCompatible()`. The model's `general.architecture` metadata key may be absent or set to an unrecognized value. `GenericArchitecture` should handle this as a fallback; if it does not, inspect the metadata with `GGMLModelImport.inspectModel()` and register a custom handler.

**Out of memory during dequantization**: dequantizing to float32 expands each tensor significantly. A 4-bit quantized 8B model is \~4 GB on disk but expands to \~32 GB in float32 (8 billion parameters × 4 bytes). Use `ConversionOptions.builder().targetDtype(DataType.HALF)` to halve the footprint to \~16 GB, or process tensors one at a time using `GGUFReader` directly.
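
A quick estimate of weight memory before importing can save a failed run. The helper below is a back-of-the-envelope sketch with a hypothetical class name. It counts weight storage only (no activations, KV cache, or JVM overhead) and assumes 4.5 bits per weight for a Q4\_0-style format, which follows from 18 bytes per 32 values.

```java
/** Rough weight-memory estimate by parameter count and bits per weight (illustrative). */
class DequantMemorySketch {
    /** Size in decimal gigabytes (1 GB = 1e9 bytes). */
    static double gigabytes(long parameterCount, double bitsPerWeight) {
        return parameterCount * bitsPerWeight / 8.0 / 1e9;
    }

    public static void main(String[] args) {
        long params = 8_000_000_000L;                     // an 8B-parameter model
        System.out.printf("4-bit (4.5 bpw): %.1f GB%n", gigabytes(params, 4.5));
        System.out.printf("float16:         %.1f GB%n", gigabytes(params, 16.0));
        System.out.printf("float32:         %.1f GB%n", gigabytes(params, 32.0));
    }
}
```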

**Slow first load**: `importModel` dequantizes every tensor synchronously. For repeated use, call `convertToSDZ` once to produce a native `.sdz` file, which loads approximately 3-5x faster on subsequent runs.

**Split shard ordering**: `MultimodalGGUFLoader` expects shards in a specific order (language model first). Pass the files in the order listed in the model card. Incorrect ordering results in mismatched tensor namespaces.

**`ServiceLoader` not finding pipeline adapters**: ensure the `samediff-pipeline-ggml` jar is on the runtime classpath, not only on the compile-time classpath (for example, not declared with `provided` scope). The SPI registration in `META-INF/services/` must be present in the deployed artifact.

***

### Further Reading

* [Model Import Overview](https://github.com/KonduitAI/deeplearning4j-docs/blob/en-1.0.0-rewrite/docs/m2.1/model-import/overview/README.md) — comparison of all import paths in DL4J
* [SameDiff TF/ONNX Import](https://github.com/KonduitAI/deeplearning4j-docs/blob/en-1.0.0-rewrite/docs/m2.1/model-import/samediff-import/overview/README.md) — SameDiff-native import for TF and ONNX
* [ONNX Runtime](https://github.com/KonduitAI/deeplearning4j-docs/blob/en-1.0.0-rewrite/docs/m2.1/model-import/onnx-runtime/overview/README.md) — direct ONNX inference without graph conversion
* [nd4j-ggml source code](https://github.com/eclipse/deeplearning4j/tree/master/nd4j/nd4j-backends/nd4j-api-parent/nd4j-api/src/main/java/org/nd4j/ggml)
* [GGUF format specification](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
* [DL4J examples repository](https://github.com/eclipse/deeplearning4j-examples)

