> For the complete documentation index, see [llms.txt](https://deeplearning4j.konduit.ai/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://deeplearning4j.konduit.ai/en-1.0.0-rewrite/model-import/overview-3.md).

# GGML/GGUF Import

### GGML/GGUF Model Import

Eclipse Deeplearning4j 1.0.0-rewrite introduces native support for loading GGUF and GGML model files directly into SameDiff. This enables the JVM ecosystem to consume the enormous library of community-quantized models distributed through Hugging Face and other repositories — LLaMA, Gemma, Mistral, Phi, Qwen, Whisper, and many others — without any Python tooling or intermediate conversion step.

The implementation is split across the `nd4j-ggml` Maven module (87 files) and three pipeline SPI modules (34 files) that provide a pluggable format layer on top of SameDiff.

***

### When to Use GGML Import

| Scenario                                            | Recommended approach                                         |
| --------------------------------------------------- | ------------------------------------------------------------ |
| Run a community quantized LLM (.gguf) on the JVM    | `GGMLModelImport.importModel(File)`                          |
| Inspect metadata and tensor layout before loading   | `GGMLModelImport.inspectModel(File)`                         |
| Convert to DL4J native format for repeated use      | `GGMLModelImport.convertToSDZ(src, dst)`                     |
| Export a SameDiff model back to GGUF                | `GGMLModelExport.exportModel(SameDiff, File, ExportOptions)` |
| Load split multimodal GGUF bundles (e.g., Qwen3-VL) | `MultimodalGGUFLoader`                                       |

***

### Maven Setup

The core import capability lives in `nd4j-ggml`. Add it alongside the ND4J backend for your platform.

```xml
<!-- GGML/GGUF model import -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-ggml</artifactId>
    <version>${dl4j.version}</version>
</dependency>

<!-- ND4J CPU backend (choose one) -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native-platform</artifactId>
    <version>${dl4j.version}</version>
</dependency>

<!-- ND4J CUDA backend (alternative) -->
<!--
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-cuda-12.3-platform</artifactId>
    <version>${dl4j.version}</version>
</dependency>
-->
```

For pipeline integration (format-agnostic loading across GGUF, SafeTensors, and ONNX), add the relevant SPI modules:

```xml
<!-- Pipeline core (shared SPI interfaces) -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>samediff-pipeline-core</artifactId>
    <version>${dl4j.version}</version>
</dependency>

<!-- GGUF pipeline adapter -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>samediff-pipeline-ggml</artifactId>
    <version>${dl4j.version}</version>
</dependency>

<!-- SafeTensors pipeline adapter (optional) -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>samediff-pipeline-safetensors</artifactId>
    <version>${dl4j.version}</version>
</dependency>

<!-- ONNX pipeline adapter (optional) -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>samediff-pipeline-onnx</artifactId>
    <version>${dl4j.version}</version>
</dependency>
```

Replace `${dl4j.version}` with your project version, for example `1.0.0-rewrite`.

***

### Quick Start

#### Import a GGUF model into SameDiff

```java
import org.nd4j.ggml.GGMLModelImport;
import org.nd4j.autodiff.samediff.SameDiff;

import java.io.File;

File ggufFile = new File("llama-3-8b-instruct.Q4_K_M.gguf");

// Reads the GGUF file, dequantizes all tensors, and maps them to SameDiff variables
SameDiff model = GGMLModelImport.importModel(ggufFile);

System.out.println("Variables loaded: " + model.variables().size());
```

#### Inspect a model without loading all weights

```java
import org.nd4j.ggml.GGMLModelImport;
import org.nd4j.ggml.format.GGMLMetadata;

import java.io.File;

File ggufFile = new File("model.gguf");
GGMLMetadata metadata = GGMLModelImport.inspectModel(ggufFile);

System.out.println("Architecture : " + metadata.getArchitecture());
System.out.println("Tensor count : " + metadata.getTensorCount());
System.out.println("Context length: " + metadata.getContextLength());
System.out.println("Quant type   : " + metadata.getQuantizationType());
```

#### Convert to DL4J native format (SDZ)

```java
import org.nd4j.ggml.GGMLModelImport;

import java.io.File;

File src = new File("llama-3-8b-instruct.Q4_K_M.gguf");
File dst = new File("llama-3-8b-instruct.sdz");

// One-time conversion; subsequent loads from the .sdz are faster
GGMLModelImport.convertToSDZ(src, dst);
```

#### Run a forward pass

```java
import org.nd4j.autodiff.samediff.SameDiff;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.ggml.GGMLModelImport;

import java.io.File;
import java.util.Map;

SameDiff model = GGMLModelImport.importModel(new File("model.gguf"));

// Token IDs for a sample prompt (shape: [batch, seq_len])
INDArray inputIds = Nd4j.createFromArray(new long[][]{{1, 15043, 29892, 3186}});

Map<String, INDArray> outputs = model.output(
        Map.of("input_ids", inputIds),
        "logits"
);

INDArray logits = outputs.get("logits");
System.out.println("Logits shape: " + java.util.Arrays.toString(logits.shape()));
```

***

### GGUF Format Support

#### File Format Versions

`GGUFReader` supports all three released versions of the GGUF binary format:

| Version | Notes                                              |
| ------- | -------------------------------------------------- |
| GGUF v1 | Original release format; 32-bit counts and lengths |
| GGUF v2 | Widens counts and lengths from uint32 to uint64    |
| GGUF v3 | Adds support for big-endian files                  |

All versions share the same outer structure (the 64-bit counts shown here apply to v2 and v3; v1 stores them as uint32):

1. **Magic bytes** — `0x46554747` (`GGUF` in ASCII, little-endian)
2. **Version** — uint32
3. **Tensor count** — uint64
4. **Metadata KV count** — uint64
5. **Metadata KV pairs** — typed key-value entries (strings, scalars, arrays)
6. **Tensor descriptors** — name, shape, quantization type, offset
7. **Tensor data** — raw quantized bytes, padded to alignment boundary
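
As a rough illustration of the fixed-size fields at the start of the file (items 1 to 4), the sketch below parses them from a byte array using only JDK classes. `GgufHeaderSketch` and its `Header` holder are hypothetical names made up for this page and are not part of `nd4j-ggml`; the sketch assumes the 64-bit counts listed above and stops before the metadata KV pairs.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class GgufHeaderSketch {
    /** Hypothetical holder for the four fixed-size header fields. */
    static final class Header {
        final int version;
        final long tensorCount;
        final long kvCount;

        Header(int version, long tensorCount, long kvCount) {
            this.version = version;
            this.tensorCount = tensorCount;
            this.kvCount = kvCount;
        }
    }

    static Header parse(byte[] bytes) {
        // GGUF stores multi-byte integers little-endian
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt() != 0x46554747) {
            throw new IllegalArgumentException("Missing GGUF magic");
        }
        int version = buf.getInt();       // uint32
        long tensorCount = buf.getLong(); // uint64 (assumed v2/v3 layout)
        long kvCount = buf.getLong();     // uint64 (assumed v2/v3 layout)
        return new Header(version, tensorCount, kvCount);
    }

    public static void main(String[] args) {
        // Build a synthetic 24-byte header instead of reading a real file
        ByteBuffer demo = ByteBuffer.allocate(24).order(ByteOrder.LITTLE_ENDIAN);
        demo.put(new byte[]{'G', 'G', 'U', 'F'}).putInt(3).putLong(291L).putLong(24L);
        Header h = parse(demo.array());
        System.out.println("version=" + h.version + " tensors=" + h.tensorCount + " kv=" + h.kvCount);
    }
}
```

In application code the same information is available through `GGMLModelImport.inspectModel(File)`; the sketch only shows what sits in the first 24 bytes.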

`GGMLFormatDetector` reads the first four bytes of any file and selects either `GGUFReader` (magic `0x46554747`) or the legacy `GGMLReader` (older magic `0x67676d6c` / `0x67676d66`). You never need to choose the reader manually; `GGMLModelImport` calls the detector automatically.
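
The magic-byte check can be pictured with the following minimal sketch. It is not the `GGMLFormatDetector` source; `MagicDetectSketch` and its `Format` enum are hypothetical, and the only facts taken from this page are the three magic values. The first four bytes are read as a little-endian 32-bit integer and compared against them.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Path;

final class MagicDetectSketch {
    enum Format { GGUF, LEGACY_GGML, UNKNOWN }

    static Format detect(byte[] firstFour) {
        if (firstFour.length < 4) {
            return Format.UNKNOWN;
        }
        int magic = ByteBuffer.wrap(firstFour, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        switch (magic) {
            case 0x46554747:          // bytes 'G','G','U','F' on disk
                return Format.GGUF;
            case 0x67676d6c:          // legacy magics listed above
            case 0x67676d66:
                return Format.LEGACY_GGML;
            default:
                return Format.UNKNOWN;
        }
    }

    static Format detect(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return detect(in.readNBytes(4));
        }
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.println(arg + " -> " + detect(Path.of(arg)));
        }
    }
}
```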

#### Legacy GGML Format

For pre-GGUF models (GGML format v1–v3), `GGMLReader` and `GGMLWriter` provide compatible reading and writing. These files lack the structured metadata KV section; architecture detection falls back to heuristics based on tensor name patterns.

***

### Supported Architectures

Architecture detection is handled by `ArchitectureRegistry`, which uses `ServiceLoader` auto-discovery and a priority ordering. Each handler implements the `ModelArchitecture` interface:

```java
public interface ModelArchitecture {
    boolean isCompatible(GGMLMetadata metadata);
    SameDiff buildSameDiff(GGUFReader reader, ConversionOptions options);
}
```

The registry iterates handlers in priority order and delegates to the first compatible one. `GenericArchitecture` is always last and accepts any model as a fallback.
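
The "first compatible handler wins, generic handler last" behaviour can be sketched without any DL4J types. Everything here is illustrative: `RegistrySketch` and `Handler` are hypothetical, metadata is reduced to a plain `Map`, and the convention that a larger priority number is tried earlier is an assumption, not something this page specifies.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class RegistrySketch {
    /** Hypothetical stand-in for an architecture handler. */
    static final class Handler {
        final String name;
        final int priority;
        final Predicate<Map<String, Object>> compatible;

        Handler(String name, int priority, Predicate<Map<String, Object>> compatible) {
            this.name = name;
            this.priority = priority;
            this.compatible = compatible;
        }
    }

    private final List<Handler> handlers = new ArrayList<>();

    void register(Handler handler) {
        handlers.add(handler);
        // Assumption: a larger priority value means "try earlier"
        handlers.sort(Comparator.comparingInt((Handler h) -> h.priority).reversed());
    }

    String findCompatible(Map<String, Object> metadata) {
        for (Handler h : handlers) {
            if (h.compatible.test(metadata)) {
                return h.name;
            }
        }
        throw new IllegalStateException("No handler registered");
    }

    static RegistrySketch withDefaults() {
        RegistrySketch registry = new RegistrySketch();
        // The fallback accepts everything and carries the lowest possible priority
        registry.register(new Handler("generic", Integer.MIN_VALUE, m -> true));
        registry.register(new Handler("llama", 100,
                m -> "llama".equals(m.get("general.architecture"))));
        return registry;
    }

    public static void main(String[] args) {
        RegistrySketch registry = withDefaults();
        System.out.println(registry.findCompatible(Map.of("general.architecture", "llama")));
        System.out.println(registry.findCompatible(Map.of("general.architecture", "other")));
    }
}
```

Keeping the fallback at the lowest priority means registration order does not matter: a specific handler added later is still consulted before the generic one.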

#### Architecture Handler Reference

| Architecture class     | Model families                         | Notes                                                 |
| ---------------------- | -------------------------------------- | ----------------------------------------------------- |
| `LLaMAArchitecture`    | LLaMA 1, LLaMA 2, LLaMA 3              | Standard dense transformer; RoPE positional encoding  |
| `Llama4Architecture`   | LLaMA 4                                | Interleaved mixture-of-experts (MoE) layers           |
| `GemmaArchitecture`    | Gemma 2, Gemma 3                       | Google's open models; grouped-query attention         |
| `MistralArchitecture`  | Mistral 7B, Mixtral                    | Sliding-window attention; optional MoE                |
| `PhiArchitecture`      | Phi-3, Phi-3.5                         | Microsoft; RoPE + QKV fused projection                |
| `GLMArchitecture`      | ChatGLM, CodeGLM                       | Zhipu AI; bidirectional prefix attention              |
| `GraniteArchitecture`  | IBM Granite                            | IBM Research code/language models                     |
| `LFM2Architecture`     | Liquid LFM-2                           | State-space model (SSM) hybrid                        |
| `NemotronArchitecture` | NVIDIA Nemotron                        | NVIDIA instruction-tuned models                       |
| `OLMoArchitecture`     | OLMo                                   | Allen AI; no bias in attention                        |
| `OpenELMArchitecture`  | OpenELM                                | Apple; layer-wise head count variation                |
| `Qwen3VLArchitecture`  | Qwen3-VL                               | Alibaba multimodal; loads split GGUF shards           |
| `SmolVLM2Architecture` | SmolVLM2                               | Hugging Face compact vision-language model            |
| `MiniCPMVArchitecture` | MiniCPM-V                              | ModelBest multimodal                                  |
| `WhisperArchitecture`  | Whisper (tiny/base/small/medium/large) | OpenAI ASR; encoder-decoder                           |
| `GenericArchitecture`  | Any                                    | Fallback; maps tensors by name without graph rewiring |

`LayerTensorDiscovery` resolves GGUF tensor naming conventions (e.g., `blk.0.attn_q.weight`, `blk.0.ffn_gate.weight`) to the canonical SameDiff variable names used by each architecture.

***

### Quantization Formats

#### Quantization Type Reference

`GGMLQuantType` enumerates every supported dtype with its bits-per-weight value. `GGMLDataType` provides the corresponding GGML integer dtype codes used in the binary format.

**Standard quantization types**

| Type     | Bits/weight | Block size | Notes                          | When to use                                  |
| -------- | ----------- | ---------- | ------------------------------ | -------------------------------------------- |
| `F32`    | 32.0        | —          | Full precision float32         | Accuracy-critical research; very large GPU   |
| `F16`    | 16.0        | —          | Half precision float16         | Good balance; standard GPU inference         |
| `Q8_0`   | 8.5         | 32         | 8-bit, scale only, no offset   | Near-lossless; reference quality             |
| `Q8_K`   | 8.5         | 256        | 8-bit, K-quant block           | Used as intermediate for K-quant dequant     |
| `Q6_K`   | 6.5625      | 256        | 6-bit K-quant                  | Excellent quality; fits larger models in RAM |
| `Q5_K_M` | 5.6875      | 256        | 5-bit K-quant, mixed precision | Recommended for quality-size balance         |
| `Q5_K_S` | 5.5         | 256        | 5-bit K-quant, small           | Slightly smaller than Q5\_K\_M               |
| `Q5_1`   | 6.0         | 32         | 5-bit with non-zero min        | Legacy; prefer Q5\_K\_M                      |
| `Q5_0`   | 5.5         | 32         | 5-bit, zero min                | Legacy; prefer Q5\_K\_S                      |
| `Q4_K_M` | 4.8         | 256        | 4-bit K-quant, mixed precision | Most popular community choice                |
| `Q4_K_S` | 4.375       | 256        | 4-bit K-quant, small           | Compact; slightly lower quality              |
| `Q4_1`   | 5.0         | 32         | 4-bit with non-zero min        | Legacy; prefer Q4\_K\_M                      |
| `Q4_0`   | 4.5         | 32         | 4-bit, zero min                | Smallest legacy (non-K) quant                |
| `Q3_K_L` | 3.4375      | 256        | 3-bit K-quant, large           | Very compressed; some quality loss           |
| `Q3_K_M` | 3.28125     | 256        | 3-bit K-quant, medium          | Aggressive compression                       |
| `Q3_K_S` | 3.0         | 256        | 3-bit K-quant, small           | Extreme compression                          |
| `Q2_K`   | 2.5625      | 256        | 2-bit K-quant                  | Maximum compression; significant degradation |
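
Bits-per-weight figures for block formats follow from the block layout: bytes per block times 8, divided by values per block. The sketch below shows that arithmetic and a rough size estimate. The class name is hypothetical, and the block byte counts (34 bytes for `Q8_0`, 210 bytes for `Q6_K`) are taken from the ggml block layouts as an assumption, since this page does not list them.

```java
final class BitsPerWeightSketch {
    /** Bits per weight for a block format: block bytes * 8 / values per block. */
    static double bitsPerWeight(int blockBytes, int valuesPerBlock) {
        return blockBytes * 8.0 / valuesPerBlock;
    }

    /** Rough tensor-data size for a parameter count; ignores metadata and padding. */
    static long estimateBytes(long parameterCount, double bitsPerWeight) {
        return (long) Math.ceil(parameterCount * bitsPerWeight / 8.0);
    }

    public static void main(String[] args) {
        System.out.println(bitsPerWeight(34, 32));   // Q8_0: prints 8.5
        System.out.println(bitsPerWeight(210, 256)); // Q6_K: prints 6.5625
        // An 8-billion-parameter model stored entirely as Q6_K (about 6.56 GB)
        System.out.println(estimateBytes(8_000_000_000L, 6.5625));
    }
}
```

Real files mix several tensor types, so an estimate from a single bits-per-weight value is only a first approximation.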

**I-quant types (importance-matrix quantization)**

I-quants are usually produced with an importance matrix (imatrix) computed from a calibration dataset, which steers precision toward the weights that matter most. The lowest-bit variants require an imatrix at quantization time. At the same bits per weight, I-quants generally outperform the standard quants.

| Type      | Bits/weight | Notes                              |
| --------- | ----------- | ---------------------------------- |
| `IQ4_XS`  | 4.25        | 4-bit imatrix; best 4-bit quality  |
| `IQ4_NL`  | 4.5         | 4-bit non-linear imatrix           |
| `IQ3_XXS` | 3.0625      | 3-bit ultra-small imatrix          |
| `IQ3_S`   | 3.4375      | 3-bit imatrix, standard            |
| `IQ2_XXS` | 2.0625      | 2-bit ultra-small imatrix          |
| `IQ2_XS`  | 2.3125      | 2-bit imatrix, extra small         |
| `IQ2_S`   | 2.5         | 2-bit imatrix, standard            |
| `IQ1_S`   | 1.5625      | 1-bit imatrix; extreme compression |
| `IQ1_M`   | 1.75        | 1-bit imatrix, mixed               |

**Ternary types**

| Type    | Bits/weight | Notes                                         |
| ------- | ----------- | --------------------------------------------- |
| `TQ1_0` | \~1.69      | Ternary; packs up to five values per byte     |
| `TQ2_0` | \~2.06      | Ternary; 2 bits per value, simpler to decode  |

#### Dequantization at Import Time

During `GGMLModelImport.importModel()`, each tensor is dequantized to `float32` (or `float16` depending on `ConversionOptions`) before being stored as a SameDiff variable. A dedicated dequantizer class handles each format:

* Standard: `Q4_0Dequantizer`, `Q4_1Dequantizer`, `Q5_0Dequantizer`, `Q5_1Dequantizer`, `Q8_0Dequantizer`, `Q8_KDequantizer`, `Q4_KDequantizer`, `Q5_KDequantizer`, `Q6_KDequantizer`, `Q2_KDequantizer`, `Q3_KDequantizer`
* I-quant: `IQ1_MDequantizer`, `IQ1_SDequantizer`, `IQ2_SDequantizer`, `IQ2_XSDequantizer`, `IQ2_XXSDequantizer`, `IQ3_SDequantizer`, `IQ3_XXSDequantizer`, `IQ4_NLDequantizer`, `IQ4_XSDequantizer`
* Ternary: `TQ1_0Dequantizer`, `TQ2_0Dequantizer`

`GGMLToSameDiffConverter` coordinates this process: it iterates the tensor descriptors from `GGUFReader`, dispatches to the appropriate dequantizer, and creates the resulting `SDVariable` in the target `SameDiff` graph.

***

### Adaptive Quantization

When exporting a SameDiff model back to GGUF, you rarely want uniform quantization across all layers. The adaptive quantization subsystem assigns per-layer quantization types to meet a target model size budget while preserving quality in the most sensitive weight matrices.

#### How it works

`AdaptiveLayerQuantizer` uses a two-pass algorithm:

1. **Analysis pass** — `DynamicQuantizationAnalyzer` inspects each weight tensor's value distribution (min, max, kurtosis, outlier ratio) and computes a recommended precision level.
2. **Budget pass** — `AdaptiveLayerQuantizer` maps the recommendations to concrete `GGMLQuantType` values, then adjusts to meet the total size budget. Attention projection matrices (`attn_q`, `attn_k`, `attn_v`, `attn_output`) are assigned a higher-quality quantization type than feed-forward matrices, since attention weights are more sensitive to precision loss.
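
To make the budget pass concrete, here is a deliberately simplified sketch. It is not the `AdaptiveLayerQuantizer` algorithm: the class, the four-rung quality ladder, and the rule "lower feed-forward tensors first and keep attention at most one rung behind" are all assumptions chosen for illustration, and the analysis pass is left out entirely. The bits-per-weight values come from the quantization table on this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BudgetPassSketch {
    // Hypothetical quality ladder, best first
    static final String[] TYPES = {"Q8_0", "Q6_K", "Q5_K_S", "Q3_K_L"};
    static final double[] BPW = {8.5, 6.5625, 5.5, 3.4375};

    static boolean isAttention(String name) {
        return name.contains("attn_q") || name.contains("attn_k")
                || name.contains("attn_v") || name.contains("attn_output");
    }

    static double totalBytes(Map<String, Long> elementCounts, int attnRung, int ffnRung) {
        double bytes = 0;
        for (Map.Entry<String, Long> e : elementCounts.entrySet()) {
            int rung = isAttention(e.getKey()) ? attnRung : ffnRung;
            bytes += e.getValue() * BPW[rung] / 8.0;
        }
        return bytes;
    }

    /** Maps tensor name to quant type name so the total fits the budget if possible. */
    static Map<String, String> plan(Map<String, Long> elementCounts, long budgetBytes) {
        int last = TYPES.length - 1;
        int attn = 0;
        int ffn = 0;
        while (totalBytes(elementCounts, attn, ffn) > budgetBytes) {
            if (ffn < last) {
                ffn++;                       // compress feed-forward tensors first
                if (ffn - attn > 1) {
                    attn++;                  // attention trails by at most one rung
                }
            } else if (attn < last) {
                attn++;
            } else {
                break;                       // budget cannot be met with this ladder
            }
        }
        Map<String, String> plan = new LinkedHashMap<>();
        for (String name : elementCounts.keySet()) {
            plan.put(name, TYPES[isAttention(name) ? attn : ffn]);
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("blk.0.attn_q.weight", 1000L);
        counts.put("blk.0.ffn_up.weight", 3000L);
        System.out.println(plan(counts, 3600L));
    }
}
```

A production quantizer would decide per tensor rather than per group and would use the analysis-pass recommendations as the starting point.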

#### Quantizer classes (for export)

The following quantizer classes are available for re-quantizing `float32` tensors during export:

| Class           | Output type |
| --------------- | ----------- |
| `Q4_0Quantizer` | Q4\_0       |
| `Q4_1Quantizer` | Q4\_1       |
| `Q4_KQuantizer` | Q4\_K       |
| `Q5_0Quantizer` | Q5\_0       |
| `Q5_1Quantizer` | Q5\_1       |
| `Q5_KQuantizer` | Q5\_K       |
| `Q6_KQuantizer` | Q6\_K       |
| `Q8_0Quantizer` | Q8\_0       |

#### Configuring adaptive quantization

```java
import org.nd4j.ggml.export.GGMLModelExport;
import org.nd4j.ggml.export.ExportOptions;
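import org.nd4j.ggml.quantization.GGMLQuantType;
import org.nd4j.autodiff.samediff.SameDiff;

import java.io.File;

SameDiff sameDiff = /* the model to export */;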

ExportOptions options = ExportOptions.builder()
        // Target model size in bytes (e.g., 4 GB)
        .targetSizeBytes(4L * 1024 * 1024 * 1024)
        // Default quant type for non-attention layers
        .defaultQuantType(GGMLQuantType.Q4_K_M)
        // Higher quality for attention projections
        .attentionQuantType(GGMLQuantType.Q6_K)
        // Enable per-layer dynamic analysis
        .adaptiveQuantization(true)
        .build();

GGMLModelExport.exportModel(sameDiff, new File("output.gguf"), options);
```

***

### Round-Trip Export

`GGMLModelExport` writes a SameDiff graph back to GGUF format, enabling workflows that:

* Fine-tune a model inside SameDiff, then re-export for use with llama.cpp or other GGUF-native runtimes.
* Quantize a float32 SameDiff model to GGUF for distribution.
* Convert between quantization levels (e.g., Q8\_0 -> Q4\_K\_M) without leaving the JVM.

Export counterpart architecture classes (e.g., `LLaMAExportArchitecture`) handle the reverse tensor name mapping from SameDiff variable names back to the GGUF `blk.N.*` naming convention. `ExportArchitectureRegistry` mirrors `ArchitectureRegistry` and uses the same `ServiceLoader` discovery mechanism.

#### Basic export

```java
import org.nd4j.ggml.export.GGMLModelExport;
import org.nd4j.ggml.export.ExportOptions;
import org.nd4j.ggml.quantization.GGMLQuantType;
import org.nd4j.autodiff.samediff.SameDiff;

import java.io.File;

SameDiff model = /* load or fine-tune your model */;

ExportOptions options = ExportOptions.builder()
        .defaultQuantType(GGMLQuantType.Q4_K_M)
        .build();

GGMLModelExport.exportModel(model, new File("finetuned-llama.gguf"), options);
```

`GGUFWriter` handles alignment padding so that the output file is compatible with any standard GGUF reader.
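
Alignment padding is plain modular arithmetic: each tensor's data starts at the next multiple of the alignment. The sketch below is hypothetical helper code, not the `GGUFWriter` source. It assumes a 32-byte alignment, which is the GGUF default when the file does not override it through `general.alignment`.

```java
final class AlignmentSketch {
    /** Number of zero bytes to write so that offset + padding is a multiple of alignment. */
    static long paddingFor(long offset, long alignment) {
        return (alignment - (offset % alignment)) % alignment;
    }

    /** Rounds an offset up to the next alignment boundary. */
    static long alignUp(long offset, long alignment) {
        return offset + paddingFor(offset, alignment);
    }

    public static void main(String[] args) {
        long alignment = 32; // assumed default
        System.out.println(paddingFor(100, alignment)); // prints 28
        System.out.println(alignUp(100, alignment));    // prints 128
        System.out.println(paddingFor(64, alignment));  // prints 0
    }
}
```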

***

### Pipeline SPI Modules

The pipeline SPI provides a format-agnostic loading interface. When multiple format adapters are on the classpath, code written against `samediff-pipeline-core` works with any supported format without change.

#### Module overview

| Maven artifact                  | Purpose                                                               |
| ------------------------------- | --------------------------------------------------------------------- |
| `samediff-pipeline-core`        | Shared SPI interfaces (`ModelPipelineLoader`, `PipelineFormat`, etc.) |
| `samediff-pipeline-ggml`        | Adapts `nd4j-ggml` behind the SPI; auto-registers via `ServiceLoader` |
| `samediff-pipeline-safetensors` | Loads Hugging Face SafeTensors (`.safetensors`) files                 |
| `samediff-pipeline-onnx`        | Loads ONNX models through the SameDiff ONNX importer                  |

#### Using the pipeline API

```java
import org.nd4j.pipeline.ModelPipelineLoader;
import org.nd4j.pipeline.PipelineFormat;
import org.nd4j.autodiff.samediff.SameDiff;

import java.io.File;

// The loader auto-detects the format from the file magic bytes / extension.
// If both samediff-pipeline-ggml and samediff-pipeline-safetensors are on
// the classpath, this single call works for .gguf or .safetensors files.
SameDiff model = ModelPipelineLoader.load(new File("model.gguf"));
```

Explicit format selection:

```java
SameDiff model = ModelPipelineLoader.load(new File("model.gguf"), PipelineFormat.GGUF);
SameDiff model2 = ModelPipelineLoader.load(new File("model.safetensors"), PipelineFormat.SAFETENSORS);
SameDiff model3 = ModelPipelineLoader.load(new File("model.onnx"), PipelineFormat.ONNX);
```

***

### Multimodal Model Support

Several vision-language and audio models are distributed as multiple GGUF shards: a language model shard and one or more projection or vision encoder shards. `MultimodalGGUFLoader` assembles these into a unified `SameDiff` graph.

#### Supported multimodal families

| Model family | Architecture handler   | Shards                    |
| ------------ | ---------------------- | ------------------------- |
| Qwen3-VL     | `Qwen3VLArchitecture`  | Language + vision encoder |
| SmolVLM2     | `SmolVLM2Architecture` | Language + vision encoder |
| MiniCPM-V    | `MiniCPMVArchitecture` | Language + vision encoder |

#### Loading a multimodal GGUF

```java
import org.nd4j.ggml.multimodal.MultimodalGGUFLoader;
import org.nd4j.autodiff.samediff.SameDiff;

import java.io.File;
import java.util.List;

// Provide the language model shard first, then vision encoder shard(s)
List<File> shards = List.of(
        new File("qwen3-vl-7b-language.gguf"),
        new File("qwen3-vl-7b-vision.gguf")
);

SameDiff model = MultimodalGGUFLoader.load(shards);
```

`MultimodalGGUFLoader` reads the metadata from each shard to identify its role, delegates to the matching architecture handler (for example, `Qwen3VLArchitecture`), and merges the resulting SameDiff sub-graphs into a single graph with a shared variable namespace.

***

### API Reference

#### `GGMLModelImport`

Primary entry point for all import operations.

| Method         | Signature                                                           | Description                                                                        |
| -------------- | ------------------------------------------------------------------- | ---------------------------------------------------------------------------------- |
| `importModel`  | `static SameDiff importModel(File file)`                            | Reads GGUF or GGML file, dequantizes all tensors, returns populated SameDiff graph |
| `importModel`  | `static SameDiff importModel(File file, ConversionOptions options)` | Import with custom conversion options (target dtype, layer filter, etc.)           |
| `inspectModel` | `static GGMLMetadata inspectModel(File file)`                       | Reads header and metadata only; does not load tensor data                          |
| `convertToSDZ` | `static void convertToSDZ(File src, File dst)`                      | Converts GGUF to DL4J native SDZ format for fast subsequent loads                  |

#### `GGMLModelExport`

Round-trip export from SameDiff back to GGUF.

| Method        | Signature                                                               | Description                                                       |
| ------------- | ----------------------------------------------------------------------- | ----------------------------------------------------------------- |
| `exportModel` | `static void exportModel(SameDiff sd, File dst, ExportOptions options)` | Writes SameDiff graph to GGUF with the given quantization options |

#### `GGUFReader`

Low-level reader for the GGUF binary format.

| Method                             | Description                                                               |
| ---------------------------------- | ------------------------------------------------------------------------- |
| `readHeader()`                     | Parses magic, version, tensor count, KV count                             |
| `readMetadata()`                   | Returns the full metadata KV map                                          |
| `readTensorDescriptors()`          | Returns list of `TensorDescriptor` (name, shape, quant type, data offset) |
| `readTensorData(TensorDescriptor)` | Returns raw quantized bytes for a single tensor                           |

#### `GGMLMetadata`

Structured view of GGUF metadata KV entries.

| Method                  | Description                                                             |
| ----------------------- | ----------------------------------------------------------------------- |
| `getArchitecture()`     | Returns the `general.architecture` string (e.g., `"llama"`, `"gemma2"`) |
| `getTensorCount()`      | Total number of tensors in the file                                     |
| `getContextLength()`    | `{arch}.context_length` KV value (e.g., `llama.context_length`)         |
| `getQuantizationType()` | Most common quant type across all tensors                               |
| `get(String key)`       | Returns raw KV value by key                                             |

#### `ArchitectureRegistry`

| Method                                      | Description                                                    |
| ------------------------------------------- | -------------------------------------------------------------- |
| `findCompatible(GGMLMetadata)`              | Returns first compatible `ModelArchitecture` in priority order |
| `register(ModelArchitecture, int priority)` | Registers a custom architecture handler                        |
| `listAll()`                                 | Returns all registered handlers                                |

#### `GGMLQuantType`

Enum of quantization types with bits-per-weight metadata.

```java
GGMLQuantType qt = GGMLQuantType.Q4_K_M;
double bpw = qt.getBitsPerWeight();  // 4.8
int blockSize = qt.getBlockSize();   // 256
String name = qt.name();             // "Q4_K_M"
```

#### `AdaptiveLayerQuantizer`

```java
import org.nd4j.ggml.quantization.AdaptiveLayerQuantizer;
import org.nd4j.ggml.quantization.QuantizationBudget;
import org.nd4j.ggml.quantization.GGMLQuantType;

import java.util.Map;

QuantizationBudget budget = QuantizationBudget.ofBytes(4L * 1024 * 1024 * 1024);
AdaptiveLayerQuantizer quantizer = new AdaptiveLayerQuantizer(budget);

// Returns a map of SameDiff variable name -> assigned GGMLQuantType
Map<String, GGMLQuantType> plan = quantizer.buildQuantizationPlan(sameDiff);
```

#### `DynamicQuantizationAnalyzer`

```java
import org.nd4j.ggml.quantization.DynamicQuantizationAnalyzer;
import org.nd4j.ggml.quantization.QuantizationRecommendation;
import org.nd4j.linalg.api.ndarray.INDArray;

DynamicQuantizationAnalyzer analyzer = new DynamicQuantizationAnalyzer();
INDArray weights = /* your weight tensor */;

QuantizationRecommendation rec = analyzer.analyze(weights);
System.out.println("Recommended type: " + rec.getRecommendedType());
System.out.println("Outlier ratio:    " + rec.getOutlierRatio());
```

#### `MultimodalGGUFLoader`

```java
// Method signatures
static SameDiff load(List<File> shards)
static SameDiff load(List<File> shards, ConversionOptions options)
```

***

### Architecture Auto-Detection (ADR 0054)

When `GGMLModelImport.importModel()` is called, the importer first reads the GGUF metadata header and passes it to `ArchitectureRegistry.findCompatible()`. The registry resolves the correct handler without any user input.

#### How detection works

Every GGUF file written by a conforming tool sets the `general.architecture` key in its metadata. The value is a short ASCII string such as `"llama"`, `"mistral"`, `"bert"`, or `"gpt2"`. The detection sequence is:

1. Read `general.architecture` from the GGUF KV metadata.
2. Look up the string directly in the registry's name/variant map.
3. If found, call `isCompatible(metadata)` on the candidate. Some handlers check secondary fields: for example, a `general.architecture` of `"llama"` with a `num_experts` field selects the MoE handler for LLaMA 4.
4. If no direct match, iterate all registered handlers and call `isCompatible()` on each.
5. If still unresolved, fall back to `GenericArchitecture`, which maps every tensor by its raw GGUF name.

The selected handler provides two things: an `ArchitectureConfig` (derived from metadata fields like `llama.embedding_length`, `llama.block_count`, `llama.attention.head_count_kv`) and a `buildSameDiff()` implementation that constructs the SameDiff computational graph.

#### Inspecting the detected architecture before loading

```java
import org.nd4j.ggml.GGMLModelImport;
import org.nd4j.ggml.format.GGMLMetadata;

import java.io.File;

GGMLMetadata meta = GGMLModelImport.inspectModel(new File("model.gguf"));

// "general.architecture" from the KV metadata
System.out.println("Architecture : " + meta.getArchitecture());      // e.g. "llama"
System.out.println("Model name   : " + meta.get("general.name"));    // e.g. "Meta-Llama-3.1-8B"

// Architecture-specific hyperparameters
System.out.println("Hidden size  : " + meta.get("llama.embedding_length"));  // 4096
System.out.println("Layer count  : " + meta.get("llama.block_count"));       // 32
System.out.println("KV heads     : " + meta.get("llama.attention.head_count_kv")); // 8
```

#### Architecture-specific features extracted from metadata

| Metadata key                              | Description                     | Used by                  |
| ----------------------------------------- | ------------------------------- | ------------------------ |
| `general.architecture`                    | Primary architecture identifier | All handlers             |
| `{arch}.embedding_length`                 | Hidden dimension size           | LLaMA, Mistral, Gemma, … |
| `{arch}.block_count`                      | Number of transformer layers    | All transformer handlers |
| `{arch}.attention.head_count`             | Number of query heads           | All transformer handlers |
| `{arch}.attention.head_count_kv`          | Number of KV heads (GQA)        | LLaMA 3, Mistral, Gemma  |
| `{arch}.context_length`                   | Maximum sequence length         | All handlers             |
| `{arch}.rope.freq_base`                   | RoPE frequency base             | LLaMA, Mistral, Phi      |
| `{arch}.attention.layer_norm_rms_epsilon` | RMSNorm epsilon                 | LLaMA, Mistral           |
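
The `{arch}` placeholder in these keys is filled in with the value of `general.architecture`. The sketch below shows that lookup on a plain `Map` and derives two common quantities from it. `ArchConfigSketch` is hypothetical and does not reproduce `ArchitectureConfig`; the hidden size and KV head count are the illustrative values used earlier on this page, and the query head count of 32 is an assumed example value.

```java
import java.util.Map;

final class ArchConfigSketch {
    /** Reads "{arch}.{suffix}" where {arch} is the general.architecture value. */
    static long readLong(Map<String, Object> kv, String suffix) {
        String arch = (String) kv.get("general.architecture");
        Object value = kv.get(arch + "." + suffix);
        if (value == null) {
            throw new IllegalArgumentException("Missing metadata key: " + arch + "." + suffix);
        }
        return ((Number) value).longValue();
    }

    public static void main(String[] args) {
        Map<String, Object> kv = Map.of(
                "general.architecture", "llama",
                "llama.embedding_length", 4096,
                "llama.attention.head_count", 32,
                "llama.attention.head_count_kv", 8);

        long hidden = readLong(kv, "embedding_length");
        long heads = readLong(kv, "attention.head_count");
        long kvHeads = readLong(kv, "attention.head_count_kv");

        System.out.println("Head dimension     : " + hidden / heads);  // 128
        System.out.println("Query heads per KV : " + heads / kvHeads); // 4
    }
}
```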

#### GGUF tensor naming to SameDiff variable mapping

Each architecture handler declares a `getTensorNamePatterns()` map that `LayerTensorDiscovery` uses to translate GGUF block-indexed names (e.g., `blk.0.attn_q.weight`) to canonical SameDiff names (e.g., `model.layers.0.self_attn.q_proj.weight`). The LLaMA handler's mapping is representative:

| GGUF tensor name         | SameDiff variable name                   |
| ------------------------ | ---------------------------------------- |
| `token_embd.weight`      | `model.embed_tokens.weight`              |
| `blk.0.attn_q.weight`    | `model.layers.0.self_attn.q_proj.weight` |
| `blk.15.ffn_gate.weight` | `model.layers.15.mlp.gate_proj.weight`   |
| `output_norm.weight`     | `model.norm.weight`                      |
| `output.weight`          | `lm_head.weight`                         |
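
A mapping of this shape can be implemented with one lookup table for global tensors and one regular expression for block-indexed tensors. The sketch below covers only the five rows in the table. It is not the `LayerTensorDiscovery` implementation, and passing unknown names through unchanged is an assumption.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TensorNameMapSketch {
    // Matches "blk.<index>.<rest>"
    private static final Pattern BLOCK = Pattern.compile("^blk\\.(\\d+)\\.(.+)$");

    private static final Map<String, String> GLOBAL = Map.of(
            "token_embd.weight", "model.embed_tokens.weight",
            "output_norm.weight", "model.norm.weight",
            "output.weight", "lm_head.weight");

    private static final Map<String, String> PER_BLOCK = Map.of(
            "attn_q.weight", "self_attn.q_proj.weight",
            "ffn_gate.weight", "mlp.gate_proj.weight");

    static String toSameDiffName(String ggufName) {
        String global = GLOBAL.get(ggufName);
        if (global != null) {
            return global;
        }
        Matcher m = BLOCK.matcher(ggufName);
        if (m.matches()) {
            String suffix = PER_BLOCK.get(m.group(2));
            if (suffix != null) {
                return "model.layers." + m.group(1) + "." + suffix;
            }
        }
        return ggufName; // assumption: unmapped names are kept as-is
    }

    public static void main(String[] args) {
        System.out.println(toSameDiffName("blk.15.ffn_gate.weight"));
        System.out.println(toSameDiffName("output.weight"));
    }
}
```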

***

### Quantization Handling During Import (ADR 0053)

#### The dequantization decision

ND4J does not support native quantized tensor operations, so the importer dequantizes each tensor to floating-point before creating the corresponding `SDVariable`. The target precision is controlled by `ConversionOptions`:

```java
import org.nd4j.ggml.GGMLModelImport;
import org.nd4j.ggml.format.ConversionOptions;
import org.nd4j.linalg.api.buffer.DataType;
import org.nd4j.autodiff.samediff.SameDiff;

import java.io.File;

// Default: dequantize to float32
SameDiff model = GGMLModelImport.importModel(new File("model.gguf"));

// Dequantize to float16 to halve memory usage
ConversionOptions opts = ConversionOptions.builder()
        .targetDtype(DataType.HALF)
        .build();
SameDiff model16 = GGMLModelImport.importModel(new File("model.gguf"), opts);
```

The available `QuantizationMode` values (set via `ConversionOptions`) are:

| Mode                     | Output dtype | When to use                           |
| ------------------------ | ------------ | ------------------------------------- |
| `DEQUANTIZE_TO_FLOAT32`  | FLOAT32      | Maximum accuracy; fine-tuning         |
| `DEQUANTIZE_TO_FLOAT16`  | FLOAT16      | Halved memory; good for inference     |
| `DEQUANTIZE_TO_BFLOAT16` | BFLOAT16     | Hardware-specific (e.g., Ampere GPUs) |

#### How dequantization works inside the importer

`GGMLToSameDiffConverter` iterates the tensor descriptors from `GGUFReader` and for each one:

1. Reads the raw quantized bytes with `GGUFReader.readTensorData(TensorDescriptor)`.
2. Looks up the matching `Dequantizer` in `DequantizerFactory` by `GGMLDataType`.
3. Calls `dequantizer.dequantizeToArray(bytes, shape, targetDtype)`, which decodes the block structure and reconstructs floating-point values.
4. Wraps the result in an `INDArray` and stores it as an `SDVariable` constant.

#### Block-level dequantization mechanics

Each quantization format packs values into fixed-size blocks with a shared scale (and sometimes a minimum). Two representative examples:

**Q4\_0** (legacy, 18 bytes per 32 values):

* 2 bytes: FP16 scale
* 16 bytes: 32 four-bit unsigned values (centered at 8, so value = `(nibble - 8) * scale`)

**Q4\_K** (K-quant, 144 bytes per 256 values):

* 4 bytes: super-block FP16 scale `d` and FP16 min `dmin`
* 12 bytes: 8 sub-block 6-bit scales and 8 sub-block 6-bit minimums, packed together (16 × 6 bits = 96 bits)
* 128 bytes: 256 four-bit values (value = `nibble * sub_scale - sub_min`, where `sub_scale = d * scale_6bit` and `sub_min = dmin * min_6bit`)
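The Q4\_0 layout is simple enough to decode by hand. The following is a minimal, dependency-free sketch of decoding one block, not the `Dequantizer` shipped in `nd4j-ggml`; `Q4_0Block` is a hypothetical class. It assumes the ggml reference packing, in which the low nibble of byte `j` holds value `j` and the high nibble holds value `j + 16`.

```java
final class Q4_0Block {

    static final int VALUES_PER_BLOCK = 32;
    static final int BYTES_PER_BLOCK = 18;

    /** Converts an IEEE 754 half-precision bit pattern to a float. */
    static float halfToFloat(int bits) {
        int sign = (bits >>> 15) & 0x1;
        int exp = (bits >>> 10) & 0x1F;
        int mant = bits & 0x3FF;
        float magnitude;
        if (exp == 0) {
            magnitude = mant * 0x1p-24f;                    // zero or subnormal
        } else if (exp == 31) {
            magnitude = (mant == 0) ? Float.POSITIVE_INFINITY : Float.NaN;
        } else {
            magnitude = Math.scalb(1f + mant / 1024f, exp - 15);
        }
        return sign == 0 ? magnitude : -magnitude;
    }

    /** Decodes one 18-byte Q4_0 block into 32 floats. */
    static float[] dequantize(byte[] block) {
        if (block.length != BYTES_PER_BLOCK) {
            throw new IllegalArgumentException("expected 18 bytes, got " + block.length);
        }
        // Bytes 0-1: little-endian FP16 scale shared by the whole block.
        int scaleBits = (block[0] & 0xFF) | ((block[1] & 0xFF) << 8);
        float scale = halfToFloat(scaleBits);

        float[] out = new float[VALUES_PER_BLOCK];
        for (int j = 0; j < 16; j++) {
            int packed = block[2 + j] & 0xFF;
            out[j] = ((packed & 0x0F) - 8) * scale;        // low nibble: first half
            out[j + 16] = ((packed >>> 4) - 8) * scale;    // high nibble: second half
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] block = new byte[BYTES_PER_BLOCK];
        block[1] = 0x38;                                   // FP16 0x3800 = 0.5
        java.util.Arrays.fill(block, 2, 18, (byte) 0x88);  // every nibble is 8, so every value is 0
        System.out.println(java.util.Arrays.toString(dequantize(block)));
    }
}
```

A tensor is decoded by applying the same routine to each consecutive 18-byte block.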

#### Lazy vs eager dequantization

By default, `importModel` dequantizes all tensors **eagerly** — every tensor is decoded and loaded into memory before the method returns. For very large models, you can work at the reader level to dequantize one tensor at a time:

```java
import org.nd4j.ggml.format.GGUFReader;
import org.nd4j.ggml.format.TensorDescriptor;
import org.nd4j.linalg.api.ndarray.INDArray;

import java.io.File;

try (GGUFReader reader = new GGUFReader(new File("model.gguf"))) {
    reader.readHeader();
    reader.readMetadata();

    for (TensorDescriptor td : reader.readTensorDescriptors()) {
        // Dequantize one tensor at a time — only this tensor is in memory
        INDArray decoded = reader.readAndDequantize(td);
        // process decoded ...
    }
}
```

#### Typical reconstruction error by format

| Format | Mean absolute error | Max error |
| ------ | ------------------- | --------- |
| Q8\_0  | \~0.001             | \~0.01    |
| Q6\_K  | \~0.002             | \~0.02    |
| Q5\_K  | \~0.003             | \~0.03    |
| Q4\_K  | \~0.005             | \~0.05    |
| Q4\_0  | \~0.008             | \~0.08    |
| Q3\_K  | \~0.010             | \~0.10    |
| Q2\_K  | \~0.020             | \~0.20    |

These approximate figures are consistent with the llama.cpp reference implementation. Actual errors depend on the weight distribution of each tensor. They are small enough for inference, and fine-tuning after import can compensate for them.
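The two error columns can be measured by quantizing a block, decoding it again, and comparing the result with the original values. The sketch below does this for a simplified Q8\_0-style scheme (one shared scale per block, signed 8-bit values). `RoundTripError` is a hypothetical class, and unlike real Q8\_0 the scale is kept in float32 rather than rounded to FP16, so it slightly understates the real error.

```java
final class RoundTripError {

    /** Q8_0-style round trip: scale = absMax / 127, values rounded to integers in [-127, 127]. */
    static float[] quantizeAndDecode(float[] block) {
        float absMax = 0f;
        for (float v : block) {
            absMax = Math.max(absMax, Math.abs(v));
        }
        float scale = absMax / 127f;
        float[] decoded = new float[block.length];
        for (int i = 0; i < block.length; i++) {
            int q = (scale == 0f) ? 0 : Math.round(block[i] / scale);
            decoded[i] = q * scale;
        }
        return decoded;
    }

    /** Mean of |a[i] - b[i]| over all elements. */
    static double meanAbsError(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum / a.length;
    }

    /** Largest |a[i] - b[i]| over all elements. */
    static double maxAbsError(float[] a, float[] b) {
        double max = 0;
        for (int i = 0; i < a.length; i++) {
            max = Math.max(max, Math.abs(a[i] - b[i]));
        }
        return max;
    }

    public static void main(String[] args) {
        float[] weights = new float[32];
        for (int i = 0; i < weights.length; i++) {
            weights[i] = (float) Math.sin(i);   // synthetic weights in [-1, 1]
        }
        float[] decoded = quantizeAndDecode(weights);
        System.out.println("MAE = " + meanAbsError(weights, decoded));
        System.out.println("max = " + maxAbsError(weights, decoded));
    }
}
```

For this scheme the maximum error is bounded by half the scale, which is why formats with fewer bits per value show proportionally larger errors.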

***

### GGUF Export Depth

The export path mirrors the import path with a dedicated set of classes:

* **`GGMLModelExport`** — public entry point; calls `ExportArchitectureRegistry` and `GGUFWriter`.
* **`SameDiffToGGMLConverter`** — coordinates the export: iterates SameDiff variables, dispatches to per-layer quantizers, writes tensor descriptors and data blocks.
* **`GGUFWriter`** — low-level binary writer; handles alignment padding (default 32 bytes) so output files are compatible with llama.cpp and other GGUF readers.
* **`ExportArchitectureRegistry`** — counterpart to `ArchitectureRegistry`; holds `ExportArchitecture` handlers (e.g., `LLaMAExportArchitecture`) that reverse-map SameDiff variable names back to the GGUF `blk.N.*` naming convention.
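The alignment padding mentioned for `GGUFWriter` is ordinary round-up-to-multiple arithmetic. A minimal sketch follows; `GgufAlignment` is a hypothetical helper, not part of the `nd4j-ggml` API.

```java
final class GgufAlignment {

    static final int DEFAULT_ALIGNMENT = 32;

    /** Number of zero bytes to write so that offset becomes a multiple of alignment. */
    static long padding(long offset, int alignment) {
        return (alignment - (offset % alignment)) % alignment;
    }

    /** Smallest multiple of alignment that is greater than or equal to offset. */
    static long alignUp(long offset, int alignment) {
        return offset + padding(offset, alignment);
    }

    public static void main(String[] args) {
        long endOfPreviousTensor = 1000;
        long nextTensorStart = alignUp(endOfPreviousTensor, DEFAULT_ALIGNMENT);
        System.out.println("pad " + padding(endOfPreviousTensor, DEFAULT_ALIGNMENT)
                + " bytes, next tensor starts at " + nextTensorStart);
    }
}
```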

#### Round-trip workflow

```
GGUF file
    │ GGMLModelImport.importModel()
    ▼
SameDiff graph (float32 / float16 weights)
    │ [optional fine-tuning or modification]
    │
    │ GGMLModelExport.exportModel()
    │   └─ SameDiffToGGMLConverter
    │       ├─ ExportArchitectureRegistry.findHandler()
    │       ├─ per-variable quantization (Q4_K_M, Q6_K, …)
    │       └─ GGUFWriter.write()
    ▼
GGUF file (loadable by llama.cpp, Ollama, etc.)
```

The exported file includes re-generated GGUF metadata KV pairs derived from the SameDiff graph's variable shapes and the `ExportOptions` you supply.

***

### Adaptive Quantization: `AdaptiveLayerQuantizer`

When you export a SameDiff model to GGUF, uniform quantization across all layers is rarely optimal. `AdaptiveLayerQuantizer` implements a budget-aware assignment algorithm:

#### Budget-aware Q2\_K-to-F32 walk

Starting from the most aggressive compression (`Q2_K`), the quantizer walks up the precision ladder one step at a time: Q2\_K → Q3\_K\_S → Q3\_K\_M → Q4\_K\_S → Q4\_K\_M → Q5\_K\_S → Q5\_K\_M → Q6\_K → Q8\_0 → F16 → F32. At each step it promotes the layers with the highest sensitivity scores first, and it stops when the next promotion would push the cumulative model size over the target budget.

Sensitivity is computed by `DynamicQuantizationAnalyzer`, which examines each weight tensor's value distribution (min, max, kurtosis, outlier ratio).
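One way to read this walk is as a greedy loop. The sketch below is an illustration under stated assumptions, not the `AdaptiveLayerQuantizer` source: `BudgetWalkSketch`, `Layer`, and `Quant` are hypothetical types, the ladder is shortened to six steps, sensitivity scores are supplied by the caller, and the walk stops at the first promotion that does not fit.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BudgetWalkSketch {

    /** Shortened ladder; bits per weight = block bytes * 8 / values per block. */
    enum Quant {
        Q2_K(2.625), Q4_K(4.5), Q6_K(6.5625), Q8_0(8.5), F16(16.0), F32(32.0);

        final double bitsPerWeight;

        Quant(double bitsPerWeight) {
            this.bitsPerWeight = bitsPerWeight;
        }

        long bytesFor(long elements) {
            return (long) Math.ceil(elements * bitsPerWeight / 8.0);
        }
    }

    /** Hypothetical layer description: name, element count, sensitivity score. */
    static final class Layer {
        final String name;
        final long elements;
        final double sensitivity;

        Layer(String name, long elements, double sensitivity) {
            this.name = name;
            this.elements = elements;
            this.sensitivity = sensitivity;
        }
    }

    static Map<String, Quant> plan(List<Layer> layers, long budgetBytes) {
        // Start with every layer at the most aggressive step.
        Map<String, Quant> plan = new LinkedHashMap<>();
        long total = 0;
        for (Layer l : layers) {
            plan.put(l.name, Quant.Q2_K);
            total += Quant.Q2_K.bytesFor(l.elements);
        }

        // Most sensitive layers are promoted first at every step.
        List<Layer> bySensitivity = new ArrayList<>(layers);
        bySensitivity.sort(Comparator.comparingDouble((Layer l) -> l.sensitivity).reversed());

        Quant[] ladder = Quant.values();
        for (int step = 1; step < ladder.length; step++) {
            for (Layer l : bySensitivity) {
                long delta = ladder[step].bytesFor(l.elements)
                        - ladder[step - 1].bytesFor(l.elements);
                if (total + delta > budgetBytes) {
                    return plan;                 // next promotion would exceed the budget
                }
                plan.put(l.name, ladder[step]);
                total += delta;
            }
        }
        return plan;                             // budget allowed F32 everywhere
    }

    public static void main(String[] args) {
        List<Layer> layers = List.of(
                new Layer("blk.0.attn_q.weight", 256, 0.9),
                new Layer("blk.0.ffn_gate.weight", 256, 0.1));
        System.out.println(plan(layers, 228));
    }
}
```

If the budget is smaller than the all-`Q2_K` size, this sketch still returns the all-`Q2_K` plan because there is no lower step to fall back to.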

#### Protected layers

Two categories of layers are always assigned higher precision, regardless of budget:

* **Embedding and output matrices** (`token_embd.weight`, `output.weight`): these span the entire vocabulary, so quantization errors in them affect every token position.
* **First and last transformer blocks**: errors in the initial and final blocks have the largest impact on output quality.

All other layers compete in the budget walk.

#### Building a quantization plan without exporting

```java
import org.nd4j.ggml.quantization.AdaptiveLayerQuantizer;
import org.nd4j.ggml.quantization.QuantizationBudget;
import org.nd4j.ggml.quantization.GGMLQuantType;
import org.nd4j.autodiff.samediff.SameDiff;

import java.util.Map;

SameDiff model = /* your loaded or fine-tuned model */;

// Target 4 GB total size
QuantizationBudget budget = QuantizationBudget.ofBytes(4L * 1024 * 1024 * 1024);
AdaptiveLayerQuantizer quantizer = new AdaptiveLayerQuantizer(budget);

// Returns a map from variable name to assigned quant type
Map<String, GGMLQuantType> plan = quantizer.buildQuantizationPlan(model);

plan.forEach((name, type) ->
    System.out.printf("%-60s  %s%n", name, type));
```

#### Exporting with adaptive quantization

```java
import org.nd4j.ggml.export.GGMLModelExport;
import org.nd4j.ggml.export.ExportOptions;
import org.nd4j.ggml.quantization.GGMLQuantType;

import java.io.File;

ExportOptions opts = ExportOptions.builder()
        .targetSizeBytes(4L * 1024 * 1024 * 1024)   // 4 GB budget
        .defaultQuantType(GGMLQuantType.Q4_K_M)       // floor for most layers
        .attentionQuantType(GGMLQuantType.Q6_K)        // higher quality for attn projections
        .adaptiveQuantization(true)                    // enable budget-aware walk
        .build();

GGMLModelExport.exportModel(model, new File("output-adaptive.gguf"), opts);
```

***

### Unified Model Loading: AutoModel and the Pipeline SPI

The `samediff-pipeline-core` module provides a format-agnostic loading API modeled on HuggingFace's `from_pretrained` pattern. `AutoModel` is the single entry point; the actual loading is delegated to whichever `PipelineLoader` implementation handles the detected format.

#### AutoModel.fromPretrained

`AutoModel.fromPretrained` accepts a path to either a single model file or a directory containing a model with a manifest. It detects the format from the file extension and magic bytes and dispatches to the registered loader.

```java
import org.eclipse.deeplearning4j.pipeline.AutoModel;
import org.nd4j.autodiff.samediff.SameDiff;

// Single GGUF file
SameDiff llama = AutoModel.fromPretrained("/models/llama-3-8b.Q4_K_M.gguf");

// Single SafeTensors file
SameDiff bert = AutoModel.fromPretrained("/models/bert-base/model.safetensors");

// Directory with manifest (selects loader from manifest format field)
SameDiff model = AutoModel.fromPretrained("/models/my-model/");
```

#### Supported format values (ModelFormat enum)

| Enum constant | File extension(s)     | Description                  |
| ------------- | --------------------- | ---------------------------- |
| `GGUF`        | `.gguf`, `.ggml`      | GGML Universal Format        |
| `SAFETENSORS` | `.safetensors`        | HuggingFace SafeTensors      |
| `ONNX`        | `.onnx`               | Open Neural Network Exchange |
| `SDZ`         | `.sdz`                | DL4J native SameDiff ZIP     |
| `PYTORCH`     | `.pt`, `.pth`, `.bin` | PyTorch pickle/TorchScript   |

`ModelFormat.fromFilename(String)` and `ModelFormat.fromExtension(String)` resolve the enum from a file name or bare extension string.
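Detection by extension and magic bytes can be sketched with the standard library alone. `FormatSniffer` and its `Format` enum below are hypothetical stand-ins for `ModelFormat`. Checking the GGUF magic before the extension is an assumption about precedence, and only the GGUF magic is tested here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

final class FormatSniffer {

    enum Format { GGUF, SAFETENSORS, ONNX, SDZ, PYTORCH, UNKNOWN }

    /** Maps a bare extension (no dot) to a format, using the table above. */
    static Format fromExtension(String ext) {
        switch (ext.toLowerCase(Locale.ROOT)) {
            case "gguf":
            case "ggml":
                return Format.GGUF;
            case "safetensors":
                return Format.SAFETENSORS;
            case "onnx":
                return Format.ONNX;
            case "sdz":
                return Format.SDZ;
            case "pt":
            case "pth":
            case "bin":
                return Format.PYTORCH;
            default:
                return Format.UNKNOWN;
        }
    }

    /** True when the first four bytes are the ASCII characters "GGUF". */
    static boolean isGgufMagic(byte[] head) {
        return head.length >= 4
                && head[0] == 'G' && head[1] == 'G' && head[2] == 'U' && head[3] == 'F';
    }

    /** Magic bytes win over the extension; a missing extension yields UNKNOWN. */
    static Format detect(Path file) throws IOException {
        if (Files.isRegularFile(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                if (isGgufMagic(in.readNBytes(4))) {
                    return Format.GGUF;
                }
            }
        }
        Path fileName = file.getFileName();
        if (fileName == null) {
            return Format.UNKNOWN;
        }
        String name = fileName.toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? Format.UNKNOWN : fromExtension(name.substring(dot + 1));
    }

    public static void main(String[] args) throws IOException {
        System.out.println(detect(Path.of(args.length > 0 ? args[0] : "model.gguf")));
    }
}
```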

#### Customizing load behavior

```java
import org.eclipse.deeplearning4j.pipeline.AutoModel;
import org.eclipse.deeplearning4j.pipeline.PipelineLoader;
import org.nd4j.autodiff.samediff.SameDiff;

import java.io.File;

PipelineLoader.LoadConfig config = PipelineLoader.LoadConfig.builder()
        .dataType("float16")          // dequantize to float16
        .device("cpu")                // target device hint
        .dequantize(true)             // dequantize quantized weights
        .cacheConvertedModel(true)    // cache converted .sdz alongside source
        .cacheDirectory(new File("/tmp/model-cache"))
        .build();

SameDiff model = AutoModel.fromPretrained("/models/llama-3.Q4_K_M.gguf", config);
```

When `cacheConvertedModel` is `true`, `AutoModel` saves the resulting SameDiff graph as a `.sdz` file after the first conversion: in `cacheDirectory` when one is configured, otherwise alongside the source file. Subsequent calls load from that cache and skip dequantization entirely.

#### PipelineLoader SPI

Each format adapter implements `PipelineLoader` and registers itself via Java `ServiceLoader`. To add a custom format:

1. Implement `PipelineLoader`.
2. Add a `META-INF/services/org.eclipse.deeplearning4j.pipeline.PipelineLoader` file listing your class.
3. Put the jar on the classpath. `PipelineLoaderRegistry` discovers it automatically.

```java
import org.eclipse.deeplearning4j.pipeline.PipelineLoader;
import org.eclipse.deeplearning4j.pipeline.ModelFormat;
import org.eclipse.deeplearning4j.pipeline.ModelManifest;
import org.nd4j.autodiff.samediff.SameDiff;

import java.io.File;
import java.io.IOException;
import java.util.Map;

public class MyCustomLoader implements PipelineLoader {

    @Override
    public ModelFormat getFormat() {
        // Return an existing ModelFormat constant or extend if needed
        return ModelFormat.UNKNOWN;
    }

    @Override
    public boolean supports(ModelFormat format) {
        return format == ModelFormat.UNKNOWN; // match your format
    }

    @Override
    public SameDiff loadModel(ModelManifest manifest, LoadConfig config) throws IOException {
        // load from manifest.getPrimaryWeightFile()
        return loadModel(manifest.getPrimaryWeightFile(), config);
    }

    @Override
    public SameDiff loadModel(File file, LoadConfig config) throws IOException {
        SameDiff sd = SameDiff.create();
        // populate sd from file ...
        return sd;
    }

    @Override
    public Map<String, SameDiff> loadPipeline(ModelManifest manifest, LoadConfig config) throws IOException {
        return Map.of("model", loadModel(manifest, config));
    }
}
```
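For context, discovery through `ServiceLoader` generally looks like the sketch below. This is not the source of `PipelineLoaderRegistry`: `LoaderRegistrySketch` and its cut-down `Loader` interface are hypothetical stand-ins, and returning the first matching loader is an assumption about how conflicts are resolved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.ServiceLoader;

final class LoaderRegistrySketch {

    /** Hypothetical, cut-down stand-in for the PipelineLoader interface. */
    interface Loader {
        boolean supports(String format);
    }

    private final List<Loader> loaders = new ArrayList<>();

    LoaderRegistrySketch() {
        // ServiceLoader scans META-INF/services/<interface name> in every jar
        // on the classpath and instantiates each class listed there.
        for (Loader loader : ServiceLoader.load(Loader.class)) {
            loaders.add(loader);
        }
    }

    /** Manual registration, useful in tests or when no service file is available. */
    void register(Loader loader) {
        loaders.add(loader);
    }

    /** Returns the first registered loader that claims the given format. */
    Optional<Loader> find(String format) {
        return loaders.stream().filter(l -> l.supports(format)).findFirst();
    }

    public static void main(String[] args) {
        LoaderRegistrySketch registry = new LoaderRegistrySketch();
        registry.register(format -> format.equals("my-format"));
        System.out.println(registry.find("my-format").isPresent());
    }
}
```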

#### Loading a multi-component pipeline

For multimodal models that expose several sub-graphs (language model + vision encoder), use `pipelineFromPretrained`:

```java
import org.eclipse.deeplearning4j.pipeline.AutoModel;
import org.eclipse.deeplearning4j.pipeline.Pipeline;
import org.nd4j.autodiff.samediff.SameDiff;

Pipeline pipeline = AutoModel.pipelineFromPretrained("/models/qwen3-vl/");

SameDiff languageModel = pipeline.getComponents().get("language_model");
SameDiff visionEncoder = pipeline.getComponents().get("vision_encoder");
```

***

### SafeTensors Import

HuggingFace SafeTensors (`.safetensors`) files can be loaded through the unified `AutoModel` API or directly using `SafeTensorsReader`. Add the pipeline adapter to your Maven dependencies:

```xml
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>samediff-pipeline-safetensors</artifactId>
    <version>${dl4j.version}</version>
</dependency>
```

#### Via AutoModel (recommended)

```java
SameDiff model = AutoModel.fromPretrained("/models/bert-base-uncased/model.safetensors");
```

`AutoModel` detects the `.safetensors` extension, resolves `SafeTensorsPipelineLoader`, and returns a populated `SameDiff` graph.

#### Via SafeTensorsReader directly

`SafeTensorsReader` gives access to the raw tensor map without constructing a SameDiff graph, which is useful for inspection or custom assembly:

```java
import org.eclipse.deeplearning4j.safetensors.SafeTensorsReader;
import org.nd4j.linalg.api.ndarray.INDArray;

import java.io.File;
import java.util.Map;

File file = new File("/models/model.safetensors");

// Inspect: see all tensor names and shapes without loading data
try (SafeTensorsReader reader = SafeTensorsReader.open(file)) {
    System.out.println("Tensor count: " + reader.getTensorCount());
    reader.getTensorNames().forEach(name -> {
        var info = reader.getTensorInfo(name);
        System.out.printf("  %-50s  %s  %s%n",
                name, info.getSafeTensorsDtype(), java.util.Arrays.toString(info.getShape()));
    });
}

// Load all tensors into memory
Map<String, INDArray> tensors = SafeTensorsReader.loadFile(file);

// Load multiple shards (e.g., model-00001-of-00002.safetensors, …)
Map<String, INDArray> sharded = SafeTensorsReader.loadFiles(java.util.List.of(
        new File("/models/model-00001-of-00002.safetensors"),
        new File("/models/model-00002-of-00002.safetensors")
));
```

`SafeTensorsReader` reads the 8-byte little-endian header length, parses the JSON tensor index, then seeks directly to each tensor's data region using a `RandomAccessFile` + `FileChannel` — no intermediate copy of the full file into RAM.
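The header layout described above can be parsed with a few lines of standard-library code. The sketch below works on a `ByteBuffer` (which could come from `FileChannel.map`) and stops at extracting the JSON text. `SafeTensorsHeader` is a hypothetical class, not part of `SafeTensorsReader`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class SafeTensorsHeader {

    /** Returns the JSON tensor index stored after the 8-byte length prefix. */
    static String readJsonIndex(ByteBuffer file) {
        ByteBuffer buf = file.duplicate();          // leave the caller's position untouched
        buf.order(ByteOrder.LITTLE_ENDIAN);
        buf.position(0);
        long headerLen = buf.getLong();             // 8-byte little-endian header length
        if (headerLen < 0 || headerLen > buf.remaining()) {
            throw new IllegalArgumentException("bad header length: " + headerLen);
        }
        byte[] json = new byte[(int) headerLen];
        buf.get(json);
        return new String(json, StandardCharsets.UTF_8);
    }

    /** Byte offset at which tensor data begins; data_offsets are relative to this. */
    static long dataStart(ByteBuffer file) {
        ByteBuffer buf = file.duplicate();
        buf.order(ByteOrder.LITTLE_ENDIAN);
        return 8 + buf.getLong(0);
    }

    public static void main(String[] args) {
        String json = "{\"w\":{\"dtype\":\"F32\",\"shape\":[1],\"data_offsets\":[0,4]}}";
        byte[] jsonBytes = json.getBytes(StandardCharsets.UTF_8);
        ByteBuffer file = ByteBuffer.allocate(8 + jsonBytes.length + 4);
        file.order(ByteOrder.LITTLE_ENDIAN);
        file.putLong(jsonBytes.length).put(jsonBytes).putFloat(1.5f);
        file.flip();
        System.out.println(readJsonIndex(file));
        System.out.println("data starts at byte " + dataStart(file));
    }
}
```

Each tensor's `data_offsets` pair in the JSON index is relative to the offset returned by `dataStart`, which is what makes seeking to a single tensor possible.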

#### Supported SafeTensors dtypes

| SafeTensors dtype | ND4J DataType |
| ----------------- | ------------- |
| F32               | FLOAT         |
| F16               | HALF          |
| BF16              | BFLOAT16      |
| F64               | DOUBLE        |
| I64               | LONG          |
| I32               | INT           |
| I16               | SHORT         |
| I8                | BYTE          |
| U8                | UBYTE         |
| BOOL              | BOOL          |

***

### TorchScript Import

PyTorch `.pt` ZIP archives and `.safetensors` weight files can be imported via the `nd4j-torchscript` module. Add it to your Maven dependencies:

```xml
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-torchscript</artifactId>
    <version>${dl4j.version}</version>
</dependency>
```

#### Importing a .pt file

TorchScript `.pt` files are ZIP archives containing a `data.pkl` (Python pickle) and numbered tensor data files. `PickleParser` reads the pickle stream to reconstruct the model graph, and `TorchScriptReader` maps each tensor to its data file.

```java
import org.nd4j.autodiff.samediff.SameDiff;
import org.nd4j.linalg.api.buffer.DataType;
import org.nd4j.torchscript.TorchScriptModelImport;
import org.nd4j.torchscript.convert.ConversionOptions;

// Auto-detects format from extension
SameDiff resnet = TorchScriptModelImport.importModel("resnet50.pt");

// With options: import as float32, inference only
ConversionOptions opts = ConversionOptions.builder()
        .targetDataType(DataType.FLOAT)
        .forTraining(false)
        .build();

SameDiff efficientnet = TorchScriptModelImport.importModel("efficientnet_b0.pt", opts);

#### Inspecting a file before import

```java
import org.nd4j.torchscript.TorchScriptModelImport;
import org.nd4j.torchscript.format.TorchScriptMetadata;

TorchScriptMetadata meta = TorchScriptModelImport.inspectModel("model.pt");

System.out.println("Format              : " + meta.getFormat());
System.out.println("Detected arch       : " + meta.getArchitecture());
System.out.println("Total parameters    : " + meta.getTotalParameters());
for (var tensor : meta.getTensors()) {
    System.out.printf("  %-40s  %s%n",
            tensor.getName(), java.util.Arrays.toString(tensor.getShape()));
}
```

#### Supported architectures (TorchScript)

Architecture detection in `TorchScriptToSameDiffConverter` is pattern-based — it looks for characteristic weight names in the tensor map:

| Architecture | Detection signals                                                                   |
| ------------ | ----------------------------------------------------------------------------------- |
| ResNet       | `layer1.0.conv1.weight`, `layer1.0.downsample.0`, `fc.weight`                       |
| VGG          | `features.0.weight`, `classifier.0.weight`                                          |
| EfficientNet | `_conv_stem.weight`, `_blocks.0._expand_conv.weight`, `_blocks.0._se_reduce.weight` |
| Generic CNN  | Fallback for unrecognized patterns                                                  |
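Pattern-based detection of this kind amounts to checking a set of tensor names for a few distinctive keys. The sketch below is an illustration only: `ArchitectureSniffer` is a hypothetical class, it uses one signal per architecture from the table, and the rule that the first match wins is an assumption, not the converter's documented behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class ArchitectureSniffer {

    // One distinctive weight name per architecture, checked in insertion order.
    private static final Map<String, String> SIGNALS = new LinkedHashMap<>();

    static {
        SIGNALS.put("_conv_stem.weight", "EfficientNet");
        SIGNALS.put("layer1.0.conv1.weight", "ResNet");
        SIGNALS.put("features.0.weight", "VGG");
    }

    /** Returns the first architecture whose signal is present, else the generic fallback. */
    static String detect(Set<String> tensorNames) {
        for (Map.Entry<String, String> signal : SIGNALS.entrySet()) {
            if (tensorNames.contains(signal.getKey())) {
                return signal.getValue();
            }
        }
        return "Generic CNN";
    }

    public static void main(String[] args) {
        System.out.println(detect(Set.of("layer1.0.conv1.weight", "fc.weight")));
    }
}
```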

#### Convert a .pt model to SDZ for faster repeated loading

```java
TorchScriptModelImport.convertToSDZ("resnet50.pt", "resnet50.sdz");

// Subsequent loads skip the conversion entirely
SameDiff model = SameDiff.load(new File("resnet50.sdz"), true);
```

#### PyTorch-to-ND4J weight layout transformations

PyTorch and ND4J use different memory layouts for convolution and linear weights. `TorchScriptToSameDiffConverter` applies these automatically:

| Layer type     | PyTorch shape       | ND4J shape          | Transformation        |
| -------------- | ------------------- | ------------------- | --------------------- |
| Conv2D weights | `[out, in, kH, kW]` | `[kH, kW, in, out]` | `permute(2, 3, 1, 0)` |
| Linear weights | `[out, in]`         | `[in, out]`         | `transpose()`         |
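To make the two transformations concrete, the sketch below applies them to flat row-major arrays without any ND4J dependency. `ConvWeightLayout` is a hypothetical helper; the converter itself would operate on `INDArray` instances.

```java
import java.util.Arrays;

final class ConvWeightLayout {

    /** Reorders a flat row-major [out, in, kH, kW] buffer into row-major [kH, kW, in, out]. */
    static float[] oihwToHwio(float[] src, int out, int in, int kH, int kW) {
        if (src.length != out * in * kH * kW) {
            throw new IllegalArgumentException("shape does not match buffer length");
        }
        float[] dst = new float[src.length];
        for (int o = 0; o < out; o++) {
            for (int i = 0; i < in; i++) {
                for (int h = 0; h < kH; h++) {
                    for (int w = 0; w < kW; w++) {
                        int srcIdx = ((o * in + i) * kH + h) * kW + w;   // [o][i][h][w]
                        int dstIdx = ((h * kW + w) * in + i) * out + o;  // [h][w][i][o]
                        dst[dstIdx] = src[srcIdx];
                    }
                }
            }
        }
        return dst;
    }

    /** Transposes a flat row-major [out, in] matrix into row-major [in, out]. */
    static float[] transpose(float[] src, int out, int in) {
        if (src.length != out * in) {
            throw new IllegalArgumentException("shape does not match buffer length");
        }
        float[] dst = new float[src.length];
        for (int o = 0; o < out; o++) {
            for (int i = 0; i < in; i++) {
                dst[i * out + o] = src[o * in + i];
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        float[] linear = {1, 2, 3, 4, 5, 6};                 // shape [2, 3]
        System.out.println(Arrays.toString(transpose(linear, 2, 3)));
    }
}
```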

***

### Format Detection and Custom Architectures

#### Adding a custom architecture

Implement `ModelArchitecture` and register it either programmatically or via `ServiceLoader`:

```java
public class MyCustomArchitecture implements ModelArchitecture {

    @Override
    public boolean isCompatible(GGMLMetadata metadata) {
        return "my-arch".equals(metadata.getArchitecture());
    }

    @Override
    public SameDiff buildSameDiff(GGUFReader reader, ConversionOptions options) {
        SameDiff sd = SameDiff.create();
        // map tensors from reader to SameDiff variables
        for (TensorDescriptor td : reader.readTensorDescriptors()) {
            INDArray data = reader.readAndDequantize(td);
            sd.var(td.getName(), data);
        }
        return sd;
    }
}
```

Register programmatically (highest priority wins):

```java
ArchitectureRegistry.getInstance().register(new MyCustomArchitecture(), 100);
```

Or register via `ServiceLoader` by adding a file:

```
src/main/resources/META-INF/services/org.nd4j.ggml.architecture.ModelArchitecture
```

containing the fully qualified class name of your implementation. The registry picks it up automatically at startup.

***

### Troubleshooting

**`UnsupportedFormatException: Not a GGUF file`**: the file does not start with the GGUF magic, which is the four ASCII bytes `GGUF` (`0x46554747` when read as a little-endian 32-bit integer). Verify that the file is a valid GGUF and was not corrupted or truncated during download. Use `GGMLFormatDetector.detect(file)` to check before importing.

**`UnknownArchitectureException`**: none of the registered architecture handlers returned `true` from `isCompatible()`. The model's `general.architecture` metadata key may be absent or set to an unrecognized value. `GenericArchitecture` should handle this as a fallback; if it does not, inspect the metadata with `GGMLModelImport.inspectModel()` and register a custom handler.

**Out of memory during dequantization**: dequantizing to float32 expands each tensor significantly. A 4-bit quantized 8B model is \~4 GB on disk but expands to \~32 GB in float32 (8 billion weights × 4 bytes). Use `ConversionOptions.builder().targetDtype(DataType.HALF)` to halve that to \~16 GB, or process tensors one at a time using `GGUFReader` directly.

**Slow first load**: `importModel` dequantizes every tensor synchronously. For repeated use, call `convertToSDZ` once to produce a native `.sdz` file, which loads approximately 3-5x faster on subsequent runs.

**Split shard ordering**: `MultimodalGGUFLoader` expects shards in a specific order (language model first). Pass the files in the order listed in the model card. Incorrect ordering results in mismatched tensor namespaces.

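**`ServiceLoader` not finding pipeline adapters**: ensure the `samediff-pipeline-ggml` jar is on the runtime classpath, not only available at compile time (for example, it must not be declared with Maven `provided` scope). The SPI registration in `META-INF/services/` must be present in the deployed artifact; when building a shaded or fat jar, make sure the service files from all adapters are merged rather than overwritten.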

***

### Further Reading

* [Model Import Overview](https://github.com/KonduitAI/deeplearning4j-docs/blob/en-1.0.0-rewrite/docs/m2.1/model-import/overview/README.md) — comparison of all import paths in DL4J
* [SameDiff TF/ONNX Import](https://github.com/KonduitAI/deeplearning4j-docs/blob/en-1.0.0-rewrite/docs/m2.1/model-import/samediff-import/overview/README.md) — SameDiff-native import for TF and ONNX
* [ONNX Runtime](https://github.com/KonduitAI/deeplearning4j-docs/blob/en-1.0.0-rewrite/docs/m2.1/model-import/onnx-runtime/overview/README.md) — direct ONNX inference without graph conversion
* [nd4j-ggml source code](https://github.com/eclipse/deeplearning4j/tree/master/nd4j/nd4j-backends/nd4j-api-parent/nd4j-api/src/main/java/org/nd4j/ggml)
* [GGUF format specification](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
* [DL4J examples repository](https://github.com/eclipse/deeplearning4j-examples)