# Neural Network Fundamentals

This page covers the building blocks used to construct neural networks in the DL4J ecosystem. These components (layers, activations, loss functions, weight initialization, and regularization) are shared across `MultiLayerNetwork`, `ComputationGraph`, and `SameDiff`.

## Layers

A layer is the fundamental unit of a neural network. Each layer takes an `INDArray` as input, applies a transformation (often a linear operation followed by a non-linear activation), and produces an `INDArray` as output. Trainable layers have parameters (weights and biases) that are also `INDArray`s, updated during training.

In DL4J, layers are configured using builder objects in `org.deeplearning4j.nn.conf.layers`:

```java
new DenseLayer.Builder()
    .nIn(784)                        // number of inputs
    .nOut(256)                       // number of outputs
    .activation(Activation.RELU)     // activation function
    .weightInit(WeightInit.XAVIER)   // weight initialization
    .build()
```
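To make concrete what a dense layer computes, here is a minimal plain-Java sketch of the forward pass `relu(W x + b)` for a single input vector. It is an illustration only: the class name `DenseForwardSketch` and the `[nOut][nIn]` weight layout are assumptions made for this example, and DL4J itself performs the computation with `INDArray` operations rather than Java arrays.

```java
// Plain-Java sketch of a dense layer's forward pass: out = relu(W * x + b).
// Hypothetical helper for illustration; not part of DL4J.
final class DenseForwardSketch {

    /** Computes relu(W x + b) for one input vector; weights has shape [nOut][nIn]. */
    static double[] forward(double[][] weights, double[] bias, double[] input) {
        double[] out = new double[weights.length];
        for (int j = 0; j < weights.length; j++) {
            double sum = bias[j];
            for (int i = 0; i < input.length; i++) {
                sum += weights[j][i] * input[i]; // linear part
            }
            out[j] = Math.max(0.0, sum);         // non-linear part (ReLU)
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] w = {{1.0, -2.0}, {0.5, 0.5}};
        double[] b = {0.0, 1.0};
        double[] y = forward(w, b, new double[] {1.0, 2.0});
        System.out.println(java.util.Arrays.toString(y)); // prints [0.0, 2.5]
    }
}
```

The first unit's pre-activation is negative, so ReLU clamps it to zero; the second passes through unchanged.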

### Layer Types Overview

**Feed-Forward Layers:**

| Layer                   | Class                    | Use Case                               |
| ----------------------- | ------------------------ | -------------------------------------- |
| Dense (fully connected) | `DenseLayer`             | General-purpose hidden layer           |
| Output                  | `OutputLayer`            | Final layer with loss function         |
| Loss                    | `LossLayer`              | Applies loss without parameters        |
| Dropout                 | `DropoutLayer`           | Regularization via random zeroing      |
| Activation              | `ActivationLayer`        | Applies activation without parameters  |
| Embedding               | `EmbeddingLayer`         | Lookup table for discrete inputs (NLP) |
| Embedding Sequence      | `EmbeddingSequenceLayer` | Embedding for sequence inputs          |

**Convolutional Layers:**

| Layer                 | Class                                                          | Use Case                             |
| --------------------- | -------------------------------------------------------------- | ------------------------------------ |
| 1D Convolution        | `Convolution1DLayer`                                           | Temporal/sequence convolution        |
| 2D Convolution        | `ConvolutionLayer`                                             | Image feature extraction             |
| 3D Convolution        | `Convolution3D`                                                | Video / volumetric data              |
| 2D Deconvolution      | `Deconvolution2D`                                              | Upsampling / transpose convolution   |
| 3D Deconvolution      | `Deconvolution3D`                                              | Volumetric upsampling                |
| Depthwise Conv2D      | `DepthwiseConvolution2D`                                       | Efficient mobile convolutions        |
| Separable Conv2D      | `SeparableConvolution2D`                                       | Depthwise + pointwise                |
| Subsampling (Pooling) | `SubsamplingLayer`                                             | Max/average pooling (2D)             |
| 1D Subsampling        | `Subsampling1DLayer`                                           | Pooling for 1D data                  |
| 3D Subsampling        | `Subsampling3DLayer`                                           | Pooling for 3D data                  |
| Global Pooling        | `GlobalPoolingLayer`                                           | Pool across entire spatial dimension |
| Upsampling 1D/2D/3D   | `Upsampling1D`, `Upsampling2D`, `Upsampling3D`                 | Nearest-neighbor upsampling          |
| Zero Padding          | `ZeroPaddingLayer`, `ZeroPadding1DLayer`, `ZeroPadding3DLayer` | Padding input                        |
| Cropping              | `Cropping1D`, `Cropping2D`, `Cropping3D`                       | Crop spatial dimensions              |
| Batch Normalization   | `BatchNormalization`                                           | Normalize activations per mini-batch |
| Local Response Norm   | `LocalResponseNormalization`                                   | Cross-channel normalization          |
| Locally Connected     | `LocallyConnected1D`, `LocallyConnected2D`                     | Unshared-weight convolution          |
| Space to Batch        | `SpaceToBatchLayer`                                            | Spatial rearrangement                |
| Space to Depth        | `SpaceToDepthLayer`                                            | Trade spatial for depth              |

**Recurrent Layers:**

| Layer                  | Class                       | Use Case                                        |
| ---------------------- | --------------------------- | ----------------------------------------------- |
| LSTM                   | `LSTM`                      | Long short-term memory (default RNN choice)     |
| GravesLSTM             | `GravesLSTM`                | Original Graves LSTM variant                    |
| Simple RNN             | `SimpleRnn`                 | Basic recurrent layer                           |
| Bidirectional          | `Bidirectional`             | Wrapper — runs any RNN in both directions       |
| Last Time Step         | `LastTimeStep`              | Extracts final time step output                 |
| Time Distributed       | `TimeDistributed`           | Applies a layer independently to each time step |
| RNN Output             | `RnnOutputLayer`            | Output layer for sequence-to-sequence           |
| RNN Loss               | `RnnLossLayer`              | Loss layer for sequence outputs                 |
| Self-Attention         | `SelfAttentionLayer`        | Multi-head self-attention                       |
| Learned Self-Attention | `LearnedSelfAttentionLayer` | Attention with learned queries                  |
| Recurrent Attention    | `RecurrentAttentionLayer`   | Attention over RNN outputs                      |

**Generative / Specialized:**

| Layer                   | Class                    | Use Case                                     |
| ----------------------- | ------------------------ | -------------------------------------------- |
| Autoencoder             | `AutoEncoder`            | Unsupervised feature learning                |
| Variational Autoencoder | `VariationalAutoencoder` | Generative model with latent space           |
| Capsule Layer           | `CapsuleLayer`           | Capsule networks (dynamic routing)           |
| Primary Capsules        | `PrimaryCapsules`        | Initial capsule layer                        |
| YOLO2 Output            | `Yolo2OutputLayer`       | Object detection output                      |
| CNN Loss                | `CnnLossLayer`           | Per-pixel loss for segmentation              |
| Center Loss Output      | `CenterLossOutputLayer`  | Face recognition / metric learning           |
| OCNN Output             | `OCNNOutputLayer`        | One-class neural network (anomaly detection) |

**SameDiff-Backed Custom Layers:**

| Layer           | Class                 | Use Case                          |
| --------------- | --------------------- | --------------------------------- |
| SameDiff Layer  | `SameDiffLayer`       | Custom layer with autodiff        |
| SameDiff Output | `SameDiffOutputLayer` | Custom output layer with autodiff |
| SameDiff Lambda | `SameDiffLambdaLayer` | Stateless transform via SameDiff  |
| SameDiff Vertex | `SameDiffVertex`      | Custom graph vertex with autodiff |

## Activation Functions

Activation functions introduce non-linearity. Without them, stacking layers would be equivalent to a single linear transformation.
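The collapse of stacked linear layers can be checked numerically. The sketch below, in plain Java with hypothetical helper names, applies two weight matrices in sequence, compares the result with a single layer using their product, and then shows that inserting a ReLU between them breaks the equivalence.

```java
// Two linear layers without an activation equal one linear layer (W2 * W1).
// Hypothetical helper class for illustration; not part of DL4J.
final class LinearCollapseSketch {

    /** Matrix-vector product m * v. */
    static double[] matVec(double[][] m, double[] v) {
        double[] out = new double[m.length];
        for (int r = 0; r < m.length; r++) {
            for (int c = 0; c < v.length; c++) {
                out[r] += m[r][c] * v[c];
            }
        }
        return out;
    }

    /** Matrix-matrix product a * b. */
    static double[][] matMul(double[][] a, double[][] b) {
        double[][] out = new double[a.length][b[0].length];
        for (int r = 0; r < a.length; r++) {
            for (int c = 0; c < b[0].length; c++) {
                for (int k = 0; k < b.length; k++) {
                    out[r][c] += a[r][k] * b[k][c];
                }
            }
        }
        return out;
    }

    /** Element-wise ReLU. */
    static double[] relu(double[] v) {
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = Math.max(0.0, v[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] w1 = {{1, -1}, {-1, 1}};
        double[][] w2 = {{1, 1}};
        double[] x = {1, 0};

        double stacked = matVec(w2, matVec(w1, x))[0];        // two linear layers
        double collapsed = matVec(matMul(w2, w1), x)[0];      // one equivalent layer
        double withRelu = matVec(w2, relu(matVec(w1, x)))[0]; // ReLU in between

        System.out.println(stacked + " " + collapsed + " " + withRelu); // prints 0.0 0.0 1.0
    }
}
```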

The `Activation` enum is at `org.nd4j.linalg.activations.Activation`. Use it in layer builders:

```java
.activation(Activation.RELU)
```

### Available Activations in M2.1

| Activation       | Enum Value        | Typical Use                         |
| ---------------- | ----------------- | ----------------------------------- |
| ReLU             | `RELU`            | Default for hidden layers           |
| Sigmoid          | `SIGMOID`         | Binary classification output        |
| Softmax          | `SOFTMAX`         | Multi-class classification output   |
| Tanh             | `TANH`            | Hidden layers (alternative to ReLU) |
| Identity         | `IDENTITY`        | Regression output (linear)          |
| Leaky ReLU       | `LEAKYRELU`       | Avoids dying ReLU problem           |
| ELU              | `ELU`             | Smooth alternative to ReLU          |
| SELU             | `SELU`            | Self-normalizing networks           |
| GELU             | `GELU`            | Transformer networks                |
| Mish             | `MISH`            | Smooth, self-regularized            |
| Swish            | `SWISH`           | Smooth alternative to ReLU          |
| ReLU6            | `RELU6`           | Capped ReLU for mobile nets         |
| Hard Sigmoid     | `HARDSIGMOID`     | Fast approximation of sigmoid       |
| Hard Tanh        | `HARDTANH`        | Fast approximation of tanh          |
| Softplus         | `SOFTPLUS`        | Smooth ReLU approximation           |
| Softsign         | `SOFTSIGN`        | Alternative to tanh                 |
| RReLU            | `RRELU`           | Randomized leaky ReLU               |
| Thresholded ReLU | `THRESHOLDEDRELU` | ReLU with custom threshold          |
| Cube             | `CUBE`            | x^3 activation                      |
| Rational Tanh    | `RATIONALTANH`    | Fast tanh approximation             |
| Rectified Tanh   | `RECTIFIEDTANH`   | max(0, tanh(x))                     |
| PReLU            | Via `PReLULayer`  | Learned leaky ReLU slope            |

For custom activations, implement the `IActivation` interface at `org.nd4j.linalg.activations.IActivation`.
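For orientation, a few of the activations in the table can be written as one-line scalar functions. This is a plain-Java sketch of the textbook formulas, not DL4J's implementation; the class name is hypothetical, and the `alpha` slope for leaky ReLU is passed in explicitly rather than assuming a library default.

```java
// Scalar definitions of a few activations from the table above.
// Hypothetical helper class for illustration; not part of DL4J.
final class ActivationSketch {

    /** ReLU: max(0, x). */
    static double relu(double x) {
        return Math.max(0.0, x);
    }

    /** Leaky ReLU: negative inputs are scaled by a small slope alpha instead of zeroed. */
    static double leakyRelu(double x, double alpha) {
        return x >= 0 ? x : alpha * x;
    }

    /** Hard tanh: clamps the input to [-1, 1]. */
    static double hardTanh(double x) {
        return Math.max(-1.0, Math.min(1.0, x));
    }

    /** Cube: x^3. */
    static double cube(double x) {
        return x * x * x;
    }

    /** Rectified tanh: max(0, tanh(x)). */
    static double rectifiedTanh(double x) {
        return Math.max(0.0, Math.tanh(x));
    }

    public static void main(String[] args) {
        for (double x : new double[] {-2.0, 0.5, 3.0}) {
            System.out.printf("x=%.1f relu=%.3f leaky=%.3f hardTanh=%.3f cube=%.3f rectTanh=%.3f%n",
                    x, relu(x), leakyRelu(x, 0.01), hardTanh(x), cube(x), rectifiedTanh(x));
        }
    }
}
```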

### Common Pairings

* **Hidden layers (feed-forward, CNN):** `RELU` with `WeightInit.RELU`
* **Hidden layers (RNN):** `TANH` with `WeightInit.XAVIER`
* **Multi-class output:** `SOFTMAX` with `LossMCXENT`
* **Binary output:** `SIGMOID` with `LossBinaryXENT`
* **Regression output:** `IDENTITY` with `LossMSE`

## Loss Functions

Loss functions measure how far the network's predictions are from the true labels. In M2.1, loss functions are instances of `ILossFunction` (at `org.nd4j.linalg.lossfunctions.ILossFunction`).

Use them in output layers:

```java
// Using ILossFunction instance (preferred in M2.1)
new OutputLayer.Builder(new LossMCXENT())
    .activation(Activation.SOFTMAX)
    .nIn(256).nOut(10)
    .build()

// Using the convenience enum (still works)
new OutputLayer.Builder(LossFunctions.LossFunction.MSE)
    .activation(Activation.IDENTITY)
    .nIn(256).nOut(1)
    .build()
```
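To show what the two configurations above measure, the following plain-Java sketch computes softmax followed by multi-class cross entropy for one example, and mean squared error for a regression output. The class and method names are hypothetical, and the code follows the standard textbook definitions; DL4J's `LossMCXENT` and `LossMSE` additionally handle mini-batches, masks, and weights.

```java
// Softmax + cross entropy and MSE for a single example.
// Hypothetical helper class for illustration; not part of DL4J.
final class LossSketch {

    /** Numerically stable softmax: subtracts the max logit before exponentiating. */
    static double[] softmax(double[] logits) {
        double max = Double.NEGATIVE_INFINITY;
        for (double z : logits) {
            max = Math.max(max, z);
        }
        double sum = 0.0;
        double[] out = new double[logits.length];
        for (int i = 0; i < logits.length; i++) {
            out[i] = Math.exp(logits[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }

    /** Cross entropy: -sum(label_i * log(prob_i)), skipping zero labels. */
    static double crossEntropy(double[] probs, double[] oneHotLabel) {
        double loss = 0.0;
        for (int i = 0; i < probs.length; i++) {
            if (oneHotLabel[i] != 0.0) {
                loss -= oneHotLabel[i] * Math.log(probs[i]);
            }
        }
        return loss;
    }

    /** Mean squared error, averaged over the output values. */
    static double mse(double[] predicted, double[] label) {
        double sum = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - label[i];
            sum += diff * diff;
        }
        return sum / predicted.length;
    }

    public static void main(String[] args) {
        double[] probs = softmax(new double[] {2.0, 1.0, 0.1});
        System.out.println("cross entropy: " + crossEntropy(probs, new double[] {1, 0, 0}));
        System.out.println("mse: " + mse(new double[] {1.0, 2.0}, new double[] {1.0, 4.0}));
    }
}
```

A confident, correct prediction drives the cross entropy toward zero, while a uniform prediction over `k` classes gives `ln(k)`.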

### Available Loss Functions

| Loss                           | Class                       | Use Case                                  |
| ------------------------------ | --------------------------- | ----------------------------------------- |
| Multi-class cross entropy      | `LossMCXENT`                | Multi-class classification (with softmax) |
| Binary cross entropy           | `LossBinaryXENT`            | Binary / multi-label classification       |
| Negative log likelihood        | `LossNegativeLogLikelihood` | Equivalent to MCXENT for softmax outputs  |
| Sparse MCXENT                  | `LossSparseMCXENT`          | MCXENT with integer labels (not one-hot)  |
| Mean Squared Error             | `LossMSE`                   | Regression                                |
| Mean Absolute Error            | `LossMAE`                   | Regression (robust to outliers)           |
| Mean Squared Log Error         | `LossMSLE`                  | Regression on log scale                   |
| Mean Absolute Percentage Error | `LossMAPE`                  | Percentage-based regression               |
| L1 Loss                        | `LossL1`                    | Sparse regression                         |
| L2 Loss                        | `LossL2`                    | Standard regression                       |
| Hinge Loss                     | `LossHinge`                 | SVM-style classification                  |
| Squared Hinge                  | `LossSquaredHinge`          | Smooth hinge loss                         |
| Poisson                        | `LossPoisson`               | Count data regression                     |
| KL Divergence                  | `LossKLD`                   | Distribution matching                     |
| Cosine Proximity               | `LossCosineProximity`       | Similarity learning                       |
| Wasserstein                    | `LossWasserstein`           | GAN training                              |
| F-Measure                      | `LossFMeasure`              | Optimize F1 score directly                |
| Multi-Label                    | `LossMultiLabel`            | Multi-label classification                |
| Mixture Density                | `LossMixtureDensity`        | Mixture density networks                  |

## Weight Initialization

Weight initialization determines the starting values for a layer's parameters. Poor initialization can lead to vanishing or exploding gradients.

The `WeightInit` enum used by DL4J layer builders is at `org.deeplearning4j.nn.weights.WeightInit`:

```java
.weightInit(WeightInit.XAVIER)
```

### Available Initializers

| Initializer                | Enum Value                   | When to Use                          |
| -------------------------- | ---------------------------- | ------------------------------------ |
| Xavier (Glorot)            | `XAVIER`                     | Default for sigmoid/tanh activations |
| Xavier Uniform             | `XAVIER_UNIFORM`             | Uniform variant of Xavier            |
| ReLU (He)                  | `RELU`                       | ReLU and variants                    |
| ReLU Uniform               | `RELU_UNIFORM`               | Uniform variant for ReLU             |
| LeCun Normal               | `LECUN_NORMAL`               | SELU activations                     |
| LeCun Uniform              | `LECUN_UNIFORM`              | Uniform variant for SELU             |
| Variance Scaling (Fan In)  | `VAR_SCALING_NORMAL_FAN_IN`  | General purpose                      |
| Variance Scaling (Fan Out) | `VAR_SCALING_NORMAL_FAN_OUT` | General purpose                      |
| Variance Scaling (Fan Avg) | `VAR_SCALING_NORMAL_FAN_AVG` | General purpose                      |
| Normal                     | `NORMAL`                     | Gaussian, mean 0, std 1/sqrt(fanIn)  |
| Uniform                    | `UNIFORM`                    | U(-a, a) with a = 1/sqrt(fanIn)      |
| Zero                       | `ZERO`                       | All zeros (use for biases)           |
| Ones                       | `ONES`                       | All ones                             |
| Identity                   | `IDENTITY`                   | Identity matrix (square layers only) |
| Constant                   | `CONSTANT`                   | User-specified constant value        |
| Supplied                   | `SUPPLIED`                   | User-provided INDArray               |

**Rules of thumb:**

* `XAVIER` for tanh/sigmoid networks
* `RELU` for ReLU/Leaky ReLU/ELU networks
* `LECUN_NORMAL` for SELU networks
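The rules of thumb follow from how each scheme scales the weight variance with layer size. The sketch below uses the commonly cited Gaussian variants: variance `2 / (fanIn + fanOut)` for Xavier, `2 / fanIn` for He (ReLU), and `1 / fanIn` for LeCun. It is plain Java with hypothetical names, intended to show the scaling only; consult the `WeightInit` Javadoc for the exact distribution each enum value uses.

```java
import java.util.Random;

// Standard deviations for common Gaussian initializers, plus a sampler.
// Hypothetical helper class for illustration; not part of DL4J.
final class WeightInitSketch {

    /** Xavier/Glorot normal: variance 2 / (fanIn + fanOut). */
    static double xavierStd(int fanIn, int fanOut) {
        return Math.sqrt(2.0 / (fanIn + fanOut));
    }

    /** He normal (suited to ReLU): variance 2 / fanIn. */
    static double heStd(int fanIn) {
        return Math.sqrt(2.0 / fanIn);
    }

    /** LeCun normal (suited to SELU): variance 1 / fanIn. */
    static double lecunStd(int fanIn) {
        return Math.sqrt(1.0 / fanIn);
    }

    /** Samples a [fanIn][fanOut] weight matrix from N(0, std^2) with a fixed seed. */
    static double[][] sample(int fanIn, int fanOut, double std, long seed) {
        Random rng = new Random(seed);
        double[][] w = new double[fanIn][fanOut];
        for (int i = 0; i < fanIn; i++) {
            for (int j = 0; j < fanOut; j++) {
                w[i][j] = rng.nextGaussian() * std;
            }
        }
        return w;
    }

    public static void main(String[] args) {
        int fanIn = 784;
        int fanOut = 256;
        System.out.println("Xavier std: " + xavierStd(fanIn, fanOut));
        System.out.println("He std:     " + heStd(fanIn));
        System.out.println("LeCun std:  " + lecunStd(fanIn));
    }
}
```

He initialization uses a larger variance than Xavier for the same layer because ReLU zeroes roughly half of the pre-activations, which would otherwise shrink the signal from layer to layer.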

## Regularization

Regularization reduces overfitting by constraining the model's parameters or by injecting noise during training.

### L1 and L2 Regularization

Applied via the `NeuralNetConfiguration.Builder`:

```java
new NeuralNetConfiguration.Builder()
    .l2(1e-4)         // L2 regularization on all layers
    .l1(1e-5)         // L1 regularization on all layers
    // ...
```

L2 penalizes large weights (encourages small, distributed weights). L1 encourages sparsity (drives some weights to zero). L2 is more common.
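The different behavior of the two penalties is visible in their gradients. In the plain-Java sketch below, the L2 gradient is proportional to each weight, so large weights shrink fastest, while the L1 gradient has constant magnitude, so small weights are pushed all the way to zero. The class name is hypothetical, and the `0.5` factor on the L2 penalty is a common convention assumed here so that its gradient is exactly `lambda * w`.

```java
// L1 and L2 penalty terms and their gradients with respect to the weights.
// Hypothetical helper class for illustration; not part of DL4J.
final class PenaltySketch {

    /** L2 penalty: 0.5 * lambda * sum(w^2). The 0.5 factor is an assumed convention. */
    static double l2Penalty(double[] w, double lambda) {
        double sum = 0.0;
        for (double v : w) {
            sum += v * v;
        }
        return 0.5 * lambda * sum;
    }

    /** L1 penalty: lambda * sum(|w|). */
    static double l1Penalty(double[] w, double lambda) {
        double sum = 0.0;
        for (double v : w) {
            sum += Math.abs(v);
        }
        return lambda * sum;
    }

    /** Gradient of the L2 penalty: lambda * w (proportional to the weight). */
    static double[] l2Gradient(double[] w, double lambda) {
        double[] g = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            g[i] = lambda * w[i];
        }
        return g;
    }

    /** (Sub)gradient of the L1 penalty: lambda * sign(w) (constant magnitude). */
    static double[] l1Gradient(double[] w, double lambda) {
        double[] g = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            g[i] = lambda * Math.signum(w[i]);
        }
        return g;
    }

    public static void main(String[] args) {
        double[] w = {3.0, -4.0, 0.01};
        System.out.println("L2 grad: " + java.util.Arrays.toString(l2Gradient(w, 0.1)));
        System.out.println("L1 grad: " + java.util.Arrays.toString(l1Gradient(w, 0.1)));
    }
}
```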

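`WeightDecay` is an alternative to L2 that shrinks the weights directly after the updater has run, rather than adding a penalty term to the gradient beforehand. This decouples the regularization from the updater's gradient scaling (momentum, Adam, and so on). The second argument controls whether the decay is multiplied by the learning rate: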

```java
.l2(0)  // disable L2
.weightDecay(1e-4, true)  // use weight decay instead
```
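The difference between the two approaches shows up only when the updater rescales the gradient. The plain-Java sketch below models the updater as a single multiplier, `updaterScale`, which is a hypothetical stand-in for what an adaptive updater such as Adam does per parameter. With a scale of 1 (plain SGD) and `applyLR` set to `true`, both steps give the same result; with any other scale they diverge. This illustrates the idea under those assumptions and is not DL4J's code.

```java
// One-parameter comparison of L2-in-the-gradient versus weight decay.
// Hypothetical helper class for illustration; not part of DL4J.
final class WeightDecaySketch {

    /** L2 style: the penalty gradient (coeff * w) passes through the updater. */
    static double l2Step(double w, double grad, double lr, double coeff, double updaterScale) {
        double update = updaterScale * (grad + coeff * w);
        return w - lr * update;
    }

    /** Weight decay style: the updater sees only the raw gradient; decay is applied afterwards. */
    static double weightDecayStep(double w, double grad, double lr, double coeff,
                                  double updaterScale, boolean applyLR) {
        double decay = (applyLR ? lr : 1.0) * coeff * w;
        return w - lr * updaterScale * grad - decay;
    }

    public static void main(String[] args) {
        double w = 1.0, grad = 0.5, lr = 0.1, coeff = 0.01;
        System.out.println("SGD-like, L2:       " + l2Step(w, grad, lr, coeff, 1.0));
        System.out.println("SGD-like, decay:    " + weightDecayStep(w, grad, lr, coeff, 1.0, true));
        System.out.println("rescaled, L2:       " + l2Step(w, grad, lr, coeff, 10.0));
        System.out.println("rescaled, decay:    " + weightDecayStep(w, grad, lr, coeff, 10.0, true));
    }
}
```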

### Dropout

Randomly zeros a fraction of activations during training:

```java
// In the configuration builder (applies to all layers):
.dropOut(0.5)  // probability of retaining each activation (0.5 keeps about half)

// Or as a specific layer:
new DropoutLayer.Builder(0.5).build()
```
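A minimal sketch of the mechanism, assuming the usual "inverted dropout" formulation: each activation is kept with the retain probability and scaled by its reciprocal, so the expected value of the output matches the input and no rescaling is needed at inference time. The class name is hypothetical and the code is plain Java, not DL4J's implementation.

```java
import java.util.Random;

// Inverted dropout on a vector of activations.
// Hypothetical helper class for illustration; not part of DL4J.
final class DropoutSketch {

    /** Keeps each activation with probability retainProb and scales survivors by 1 / retainProb. */
    static double[] apply(double[] activations, double retainProb, Random rng) {
        double[] out = new double[activations.length];
        for (int i = 0; i < activations.length; i++) {
            boolean keep = rng.nextDouble() < retainProb;
            out[i] = keep ? activations[i] / retainProb : 0.0;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] x = new double[10_000];
        java.util.Arrays.fill(x, 1.0);
        double[] dropped = apply(x, 0.8, new Random(7));
        double mean = java.util.Arrays.stream(dropped).average().orElse(0.0);
        System.out.println("mean after dropout: " + mean); // close to the input mean of 1.0
    }
}
```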

Variants available in `org.deeplearning4j.nn.conf.dropout`:

| Dropout Type    | Class             | Description                          |
| --------------- | ----------------- | ------------------------------------ |
| Standard        | `Dropout`         | Keep with probability p, else zero   |
| Gaussian        | `GaussianDropout` | Multiply by N(1, rate)               |
| Gaussian Noise  | `GaussianNoise`   | Add N(0, stddev) noise               |
| Alpha Dropout   | `AlphaDropout`    | For SELU networks                    |
| Spatial Dropout | `SpatialDropout`  | Drops entire feature maps (for CNNs) |

### Weight Noise

Adds noise to weights during training:

```java
.weightNoise(new WeightNoise(new NormalDistribution(0, 0.01)))
```

`DropConnect` is a variant that randomly zeros weight values (not activations):

```java
.weightNoise(new DropConnect(0.5))
```

### Parameter Constraints

Constrain parameter values after each update:

```java
new DenseLayer.Builder()
    .constrainWeights(new MaxNormConstraint(2.0, 1))    // max L2 norm of 2.0
    .constrainBias(new NonNegativeConstraint())           // biases >= 0
    .build()
```

Available constraints: `MaxNormConstraint`, `MinMaxNormConstraint`, `NonNegativeConstraint`, `UnitNormConstraint`.
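As an illustration of what a max-norm constraint does, the plain-Java sketch below rescales a weight vector whenever its L2 norm exceeds the limit and leaves it untouched otherwise. The class name is hypothetical, and the sketch handles a single vector; in DL4J the constraint is applied along the dimensions passed to the constructor.

```java
// Max-norm constraint applied to one weight vector after an update.
// Hypothetical helper class for illustration; not part of DL4J.
final class MaxNormSketch {

    /** Returns w unchanged if ||w||_2 <= maxNorm, otherwise w rescaled to have norm maxNorm. */
    static double[] constrain(double[] w, double maxNorm) {
        double sumSq = 0.0;
        for (double v : w) {
            sumSq += v * v;
        }
        double norm = Math.sqrt(sumSq);
        double scale = norm > maxNorm ? maxNorm / norm : 1.0;
        double[] out = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            out[i] = w[i] * scale;
        }
        return out;
    }

    public static void main(String[] args) {
        // Norm 5.0 exceeds the limit of 2.0, so the vector is scaled by 0.4.
        System.out.println(java.util.Arrays.toString(constrain(new double[] {3.0, 4.0}, 2.0)));
        // Norm 0.5 is within the limit, so the vector is returned as-is.
        System.out.println(java.util.Arrays.toString(constrain(new double[] {0.3, 0.4}, 2.0)));
    }
}
```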

### Gradient Normalization

Clips or rescales gradients before the parameter update, which guards against exploding gradients:

```java
new NeuralNetConfiguration.Builder()
    .gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)
    .gradientNormalizationThreshold(1.0)
    // ...
```

Options: `None`, `RenormalizeL2PerLayer`, `RenormalizeL2PerParamType`, `ClipElementWiseAbsoluteValue`, `ClipL2PerLayer`, `ClipL2PerParamType`.
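The option names describe two families: clipping (act only when a threshold is exceeded) and renormalizing (always rescale). The plain-Java sketch below shows one plausible reading of each on a single gradient vector. The class and method names are hypothetical, and the per-layer versus per-parameter-type distinction is left out; check the `GradientNormalization` Javadoc for the exact semantics of each option.

```java
// Three gradient normalization strategies on a single gradient vector.
// Hypothetical helper class for illustration; not part of DL4J.
final class GradientClipSketch {

    /** L2 norm of a vector. */
    static double l2Norm(double[] g) {
        double sumSq = 0.0;
        for (double v : g) {
            sumSq += v * v;
        }
        return Math.sqrt(sumSq);
    }

    /** Element-wise clipping: clamps every component to [-threshold, threshold]. */
    static double[] clipElementWise(double[] g, double threshold) {
        double[] out = new double[g.length];
        for (int i = 0; i < g.length; i++) {
            out[i] = Math.max(-threshold, Math.min(threshold, g[i]));
        }
        return out;
    }

    /** L2 clipping: rescales the whole vector only if its norm exceeds the threshold. */
    static double[] clipL2(double[] g, double threshold) {
        double norm = l2Norm(g);
        double scale = norm > threshold ? threshold / norm : 1.0;
        double[] out = new double[g.length];
        for (int i = 0; i < g.length; i++) {
            out[i] = g[i] * scale;
        }
        return out;
    }

    /** L2 renormalization: always rescales to unit norm (zero vectors are returned unchanged). */
    static double[] renormalizeL2(double[] g) {
        double norm = l2Norm(g);
        double[] out = new double[g.length];
        for (int i = 0; i < g.length; i++) {
            out[i] = norm == 0.0 ? 0.0 : g[i] / norm;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] g = {0.5, -3.0, 2.0};
        System.out.println(java.util.Arrays.toString(clipElementWise(g, 1.0))); // prints [0.5, -1.0, 1.0]
        System.out.println(java.util.Arrays.toString(clipL2(g, 1.0)));
        System.out.println(java.util.Arrays.toString(renormalizeL2(g)));
    }
}
```

Element-wise clipping changes the direction of the gradient, whereas the L2 variants preserve direction and only change its length.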
