[Layer] Dense

안암동컴맹 · April 23, 2024

Dense Layer

Introduction

The dense layer, or fully connected layer, is an essential building block in many neural network architectures used for a broad range of machine learning tasks. These layers are characterized by their fully connected structure: every input is connected to every output through a learnable weight, and each output additionally receives a learnable bias. Understanding how backpropagation works in dense layers is critical for optimizing neural network training, as it allows these parameters to be adjusted efficiently based on the gradient of the loss.

Background and Theory

Dense Layer Structure

A dense layer transforms its input through a linear combination followed by a nonlinear activation function. Mathematically, the output $\mathbf{y}$ of a dense layer given an input vector $\mathbf{x}$ is:

$$\mathbf{y} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$$

where:

  • $\mathbf{W}$ is the weight matrix,
  • $\mathbf{b}$ is the bias vector,
  • $\sigma$ is the nonlinear activation function.
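
To make this concrete, here is a minimal NumPy sketch of the forward computation, assuming a ReLU activation; the function and variable names are illustrative only.

```python
import numpy as np

def relu(z):
    # ReLU activation: max(0, z) element-wise
    return np.maximum(0, z)

def dense_forward(x, W, b, activation=relu):
    """Forward pass of a dense layer: y = activation(W @ x + b)."""
    z = W @ x + b       # linear combination (pre-activation)
    y = activation(z)   # nonlinear activation
    return y, z         # z is kept because the backward pass needs it

# Example: 4 input features -> 3 output units
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = np.zeros(3)
x = rng.normal(size=4)
y, z = dense_forward(x, W, b)
print(y.shape)  # (3,)
```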

Backpropagation Overview

Backpropagation is a method used to compute the gradient of the loss function of a neural network with respect to its weights and biases. It involves two main phases:

  1. Forward Pass: Compute the output of the network, layer by layer, using the input data.
  2. Backward Pass: Propagate the error back through the network to compute the gradients.

Mathematical Formulation

Forward Pass

During the forward pass, each neuron in the dense layer receives an input, applies a weighted sum with its weights and bias, and then passes the result through an activation function. The output $\mathbf{y}$ of the dense layer is computed as mentioned previously.

Backward Pass (Backpropagation)

The backward pass involves calculating the gradients of the loss function with respect to each parameter (weights and biases).

Let $L$ denote the loss function of the network. The gradients are calculated as follows:

Gradient w.r.t. Weights

The gradient of the loss $L$ with respect to the weight matrix $\mathbf{W}$ is given by:

$$\frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{y}} \cdot \frac{\partial \mathbf{y}}{\partial \mathbf{W}}$$

Here $\frac{\partial L}{\partial \mathbf{y}}$ is the gradient of the loss with respect to the output of the layer, and $\mathbf{x}$ is the input to the layer. Applying the chain rule through the activation gives:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{W}} = \sigma'(\mathbf{W}\mathbf{x} + \mathbf{b}) \otimes \mathbf{x}^\top$$

so that, combining the two factors,

$$\frac{\partial L}{\partial \mathbf{W}} = \left(\frac{\partial L}{\partial \mathbf{y}} \odot \sigma'(\mathbf{W}\mathbf{x} + \mathbf{b})\right) \mathbf{x}^\top$$

Gradient w.r.t. Biases

Similarly, the gradient of the loss with respect to the bias vector $\mathbf{b}$ is:

$$\frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{y}} \cdot \frac{\partial \mathbf{y}}{\partial \mathbf{b}}$$

Since the derivative of $\mathbf{y}$ with respect to $\mathbf{b}$ is simply the derivative of the activation function,

$$\frac{\partial \mathbf{y}}{\partial \mathbf{b}} = \sigma'(\mathbf{W}\mathbf{x} + \mathbf{b})$$

the bias gradient becomes $\frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{y}} \odot \sigma'(\mathbf{W}\mathbf{x} + \mathbf{b})$.
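
Both gradients therefore share the element-wise factor $\frac{\partial L}{\partial \mathbf{y}} \odot \sigma'(\mathbf{W}\mathbf{x} + \mathbf{b})$. The following minimal NumPy sketch computes the two formulas for a single sample, assuming a ReLU activation (so $\sigma'(z)$ is 1 where $z > 0$ and 0 elsewhere):

```python
import numpy as np

def relu_grad(z):
    # Derivative of ReLU: 1 where z > 0, 0 elsewhere
    return (z > 0).astype(z.dtype)

def dense_backward(dL_dy, z, x):
    """Gradients of a dense layer y = relu(W x + b) for a single sample.

    dL_dy : upstream gradient dL/dy, shape (out_features,)
    z     : pre-activation W x + b cached from the forward pass
    x     : layer input, shape (in_features,)
    """
    delta = dL_dy * relu_grad(z)   # dL/dy ⊙ sigma'(z)
    dL_dW = np.outer(delta, x)     # shape (out_features, in_features), same as W
    dL_db = delta                  # same shape as b
    return dL_dW, dL_db

# Example shapes: 4 input features -> 3 output units
rng = np.random.default_rng(0)
x, W, b = rng.normal(size=4), rng.normal(size=(3, 4)), np.zeros(3)
z = W @ x + b
dL_dy = rng.normal(size=3)         # stand-in for the gradient flowing back from the loss
dL_dW, dL_db = dense_backward(dL_dy, z, x)
print(dL_dW.shape, dL_db.shape)    # (3, 4) (3,)
```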

Update Rule

The weights and biases are updated using the gradients calculated during backpropagation, typically with an update rule such as:

$$\mathbf{W} = \mathbf{W} - \eta \frac{\partial L}{\partial \mathbf{W}}, \quad \mathbf{b} = \mathbf{b} - \eta \frac{\partial L}{\partial \mathbf{b}}$$

where $\eta$ is the learning rate.
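
To see the full cycle in one place, the following self-contained NumPy sketch runs repeated forward passes, backward passes, and gradient-descent updates on a toy single-sample regression problem with a ReLU activation and squared-error loss; all names and values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, size=4)       # toy input, kept positive so every ReLU unit starts active
W = rng.uniform(0.0, 0.2, size=(3, 4))  # small positive initial weights
b = np.zeros(3)
target = np.array([0.5, 1.0, 0.2])      # arbitrary regression target
eta = 0.1                               # learning rate

for step in range(100):
    # Forward pass
    z = W @ x + b
    y = np.maximum(0, z)                     # ReLU activation
    loss = 0.5 * np.sum((y - target) ** 2)   # squared-error loss

    # Backward pass
    dL_dy = y - target
    delta = dL_dy * (z > 0)                  # dL/dy ⊙ sigma'(z)
    dL_dW = np.outer(delta, x)
    dL_db = delta

    # Update rule: W <- W - eta * dL/dW, b <- b - eta * dL/db
    W -= eta * dL_dW
    b -= eta * dL_db

print(f"loss after 100 steps: {loss:.6f}")   # decreases toward zero on this toy problem
```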

Implementation

Parameters

  • in_features: int
    Number of input features
  • out_features: int
    Number of output features
  • activation: FuncType, default = ‘relu’
    Activation function
  • initializer: InitType, default = ‘auto’
    Type of weight initializer
  • optimizer: Optimizer, default = None
    Optimizer for weight update
  • lambda_: float, default = 0.0
    L2-regularization strength
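
For illustration, the sketch below wraps these parameters in a minimal, self-contained toy class; the `Dense` name, its constructor, and the handling of `'auto'` initialization are all assumptions made for the example, not the actual implementation documented here.

```python
import numpy as np

class Dense:
    """Self-contained toy stand-in for a layer with the parameters above.

    This sketch is illustrative only; it is not the implementation the
    parameter list refers to.
    """
    def __init__(self, in_features, out_features, activation="relu",
                 initializer="auto", optimizer=None, lambda_=0.0):
        self.in_features, self.out_features = in_features, out_features
        self.activation, self.initializer = activation, initializer
        self.optimizer, self.lambda_ = optimizer, lambda_
        # Assumption: treat 'auto' as He initialization, a common default for ReLU
        scale = np.sqrt(2.0 / in_features)
        self.W = np.random.randn(out_features, in_features) * scale
        self.b = np.zeros(out_features)

    def forward(self, x):
        z = self.W @ x + self.b
        return np.maximum(0, z) if self.activation == "relu" else z

layer = Dense(in_features=4, out_features=3, activation="relu", lambda_=0.01)
print(layer.forward(np.ones(4)).shape)  # (3,)
```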

Applications

  • Image Recognition: Dense layers aggregate learned features from convolutional layers to classify images.
  • Natural Language Processing: Used in neural networks to interpret and make predictions based on textual data.
  • Regression Tasks: Predict continuous variables using patterns learned from data.

Strengths and Limitations

Strengths

  • Universal Approximation: Capable of approximating any continuous function given sufficient neurons and layers.
  • Simplicity: Straightforward implementation and integration into various architectures.

Limitations

  • Computationally Intensive: Requires a lot of computational power for large networks.
  • Prone to Overfitting: Especially in networks with a large number of parameters relative to the amount of training data.

Advanced Topics

Optimization Enhancements

Backpropagation-based training can be improved with optimization techniques such as momentum, RMSprop, and Adam, which typically converge faster and more reliably than plain stochastic gradient descent.
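
As a concrete example of one such enhancement, the sketch below implements a plain SGD-with-momentum update on a toy quadratic objective; the function name and hyperparameter values are illustrative.

```python
import numpy as np

def sgd_momentum_step(W, velocity, grad, eta=0.05, beta=0.9):
    """One SGD-with-momentum update: v <- beta * v + grad; W <- W - eta * v."""
    velocity = beta * velocity + grad
    W = W - eta * velocity
    return W, velocity

# Toy objective: 0.5 * ||W||^2, whose gradient is simply W
W = np.array([1.0, -2.0, 3.0])
velocity = np.zeros_like(W)
for _ in range(200):
    W, velocity = sgd_momentum_step(W, velocity, grad=W)
print(W)  # all components end up very close to zero
```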

Regularization Techniques

Methods like dropout, L1/L2 regularization, and early stopping are crucial for preventing overfitting and improving the generalization of dense layers in large networks.
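
For instance, under the common convention that L2 regularization adds $\frac{\lambda}{2}\lVert\mathbf{W}\rVert^2$ to the loss (as the `lambda_` parameter above suggests), the weight gradient simply gains an extra $\lambda \mathbf{W}$ term; a minimal sketch:

```python
import numpy as np

def l2_regularized_weight_grad(dL_dW, W, lambda_=0.01):
    """Adding (lambda/2) * ||W||^2 to the loss contributes an extra
    lambda * W term to the weight gradient."""
    return dL_dW + lambda_ * W

W = np.array([[1.0, -2.0], [0.5, 3.0]])
dL_dW = np.zeros_like(W)                  # stand-in for the data-term gradient
print(l2_regularized_weight_grad(dL_dW, W, lambda_=0.1))
```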
