[Paper Review] RNN + Long Short-Term Memory

6X10 · April 13, 2024

NLP


Before the beginning...


What are Recurrent Neural Networks (RNNs)?

RNNs use the idea of processing sequential information. The term “recurrent” applies as they perform the same task over each instance of the sequence such that the output is dependent on the previous computations and results. Generally, a fixed-size vector is produced to represent a sequence by feeding tokens one by one to a recurrent unit. In a way, RNNs have “memory” over previous computations and use this information in current processing.

  • Structure of Simple RNN

x_t: the input to the network at time step t
s_t: the hidden state at time step t

The calculation of s_t follows the recurrence s_t = f(U x_t + W s_{t-1}):

-> s_t is calculated from the current input and the previous time step's hidden state
-> s_t acts as the network's memory element, accumulating information from earlier time steps

The function f: a non-linear transformation such as tanh or ReLU
U, V, W: weights that are shared across time
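
To make the recurrence concrete, here is a minimal NumPy sketch of a simple RNN forward pass, assuming tanh for f; the function name `rnn_forward` and all dimensions are illustrative assumptions, not taken from any paper.

```python
import numpy as np

def rnn_forward(xs, U, W, V, s0=None):
    """Simple RNN: s_t = tanh(U x_t + W s_{t-1}), o_t = V s_t."""
    s = np.zeros(W.shape[0]) if s0 is None else s0
    states, outputs = [], []
    for x in xs:                      # one input vector per time step
        s = np.tanh(U @ x + W @ s)    # hidden state mixes current input and memory
        states.append(s)
        outputs.append(V @ s)         # per-step output
    return states, outputs

# Toy usage: 5 steps of 3-dimensional inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
states, outputs = rnn_forward([rng.normal(size=3) for _ in range(5)], U, W, V)
print(states[-1], outputs[-1])
```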

  • Properties of RNN
    • Pros
      • Given that an RNN performs sequential processing by modeling units in sequence, it has the ability to capture the inherent sequential nature present in language, where units are characters, words or even sentences.
      • RNNs have flexible computational steps that provide better modeling capability and create the possibility to capture unbounded context.
    • Cons
      • Simple RNN networks suffer from the infamous vanishing gradient problem, which makes it really hard to learn and tune the parameters of the earlier layers in the network.
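
To see the vanishing-gradient issue numerically, the toy loop below (purely illustrative, with made-up constants) tracks the factor by which a gradient is scaled as it is propagated back through a single tanh unit; because |w · f'| < 1, the factor shrinks geometrically.

```python
import numpy as np

# Toy illustration: a single recurrent unit s_t = tanh(w * s_{t-1} + x_t).
# The gradient reaching a step k steps back is scaled by the product of
# w * tanh'(pre-activation) over those k steps; with |w * f'| < 1 it vanishes.
w, s, factor = 0.9, 0.0, 1.0
for t in range(1, 51):
    pre = w * s + 0.1            # constant toy input
    s = np.tanh(pre)
    factor *= w * (1.0 - s**2)   # local Jacobian: w * tanh'(pre)
    if t % 10 == 0:
        print(f"{t:2d} steps: gradient factor ~ {factor:.2e}")
```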

Conventional BPTT (e.g., Williams and Zipser 1992)

  • Network Architecture

    • Let the network have n units, with m external input lines.

    • Let y(t) denote the n-tuple of outputs of the units in the network at time t.

    • Let x^{net}(t) denote the m-tuple of external input signals to the network at time t.

    • We also define x(t) to be the (m+n)-tuple obtained by concatenating x^{net}(t) and y(t) in some convenient fashion.

    • Let U denote the set of indices k such that x_k, the k^{th} component of x, is the output of a unit in the network.

    • Let I denote the set of indices k for which x_k is an external input.

    • Furthermore, we assume that the indices on y and x^{net} are chosen to correspond to those of x, so that

      x_k(t) = x^{net}_k(t) if k \in I, and x_k(t) = y_k(t) if k \in U.

    • Let W denote the weight matrix for the network, with a unique weight between every pair of units and also from each input line to each unit.

    • The element w_{ij} represents the weight on the connection to the i^{th} unit from either the j^{th} unit, if j \in U, or the j^{th} input line, if j \in I.

    • Furthermore, note that to accommodate a bias for each unit we simply include among the m input lines one input whose value is always 1; the corresponding column of the weight matrix contains as its i^{th} element the bias for unit i. In general, our naming convention dictates that we regard the weight w_{ij} as having x_j as its “presynaptic” signal and y_i as its “postsynaptic” signal.

    • For each k \in U, the intermediate variable s_k(t) represents the net input to the k^{th} unit at time t. Its value at time t+1 is computed in terms of both the state of and input to the network at time t by

      s_k(t+1) = \sum_{l \in U} w_{kl} y_l(t) + \sum_{l \in I} w_{kl} x^{net}_l(t) = \sum_{l \in U \cup I} w_{kl} x_l(t)

      The longer form clarifies how the unit outputs and the external inputs are both used in the computation, while the more compact expression illustrates why we introduced x and the corresponding indexing convention above.

    • The output of such a unit at time t+1 is then expressed in terms of the net input by

      y_k(t+1) = f_k(s_k(t+1)),

      where f_k is the unit's squashing function. (A small code sketch of this forward step appears after this list.)

    • In those cases where a specific assumption about these squashing functions is required (e.g., differentiability), it will be assumed that all units use the logistic function.
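
A minimal sketch of one forward step of such a fully recurrent network, under the assumptions stated above (logistic squashing, a bias line whose value is always 1) and with an arbitrary column ordering chosen for the "convenient" concatenation; the helper name `step` and the toy sizes are my own.

```python
import numpy as np

def step(y, x_net, W):
    """One time step of a fully recurrent network with n units and m inputs.

    y      : (n,)      unit outputs y(t)
    x_net  : (m,)      external inputs x^net(t), including a constant 1 bias line
    W      : (n, m+n)  weight matrix, columns ordered as [input lines | units]
    """
    x = np.concatenate([x_net, y])           # x(t): concatenation of inputs and outputs
    s_next = W @ x                           # net input s_k(t+1) = sum_l w_{kl} x_l(t)
    y_next = 1.0 / (1.0 + np.exp(-s_next))   # logistic squashing function f_k
    return s_next, y_next

# Toy usage: n = 3 units, m = 2 external input lines (one is the constant bias 1)
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 5))
y = np.zeros(3)
for t in range(4):
    x_net = np.array([rng.normal(), 1.0])    # input signal plus bias line
    s, y = step(y, x_net, W)
print(y)
```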

  • Network Performance Measure

    • Assume that the task to be performed by the network is a sequential supervised learning task, meaning that certain of the units’ output values are to match specified target values (which we also call teacher signals) at specified times.
    • Let T(t) denote the set of indices k \in U for which there exists a specified target value d_k(t) that the output of the k^{th} unit should match at time t. Then define a time-varying n-tuple e by

      e_k(t) = d_k(t) - y_k(t) if k \in T(t), and e_k(t) = 0 otherwise.

      -> Note that this formulation allows for the possibility that target values are specified for different units at different times.
    • Let J(t) denote the negative of the overall network error at time t,

      J(t) = -\frac{1}{2} \sum_{k \in U} [e_k(t)]^2

    • A natural objective of learning might be to maximize the negative of the total error, J^{total}(t', t) = \sum_{\tau = t'+1}^{t} J(\tau), over some appropriate time period (t', t].

    • One natural way to make the weight changes is along a constant positive multiple of the performance measure gradient, so that

      \Delta w_{ij} = \eta \frac{\partial J^{total}(t', t)}{\partial w_{ij}}

      for each i and j, where \eta is a positive learning rate parameter.
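
A small sketch of the performance measure and the gradient-ascent update, assuming the gradient of J^{total} has already been computed (e.g., by BPTT); the helper names and the dictionary representation of T(t) are illustrative assumptions.

```python
import numpy as np

def error_vector(y, targets):
    """e_k(t) = d_k(t) - y_k(t) where a target is specified, else 0."""
    e = np.zeros_like(y)
    for k, d_k in targets.items():   # targets: {unit index k in T(t): d_k(t)}
        e[k] = d_k - y[k]
    return e

def performance(y, targets):
    """J(t) = -1/2 * sum_k e_k(t)^2, the negative of the network error."""
    e = error_vector(y, targets)
    return -0.5 * np.sum(e**2)

def update(W, grad_J_total, eta=0.1):
    """Gradient ascent on J^total: delta w_ij = eta * dJ^total/dw_ij."""
    return W + eta * grad_J_total

# Toy usage: 3 units, a target specified only for unit 0 at this time step
y = np.array([0.2, 0.7, 0.5])
print(performance(y, {0: 1.0}))   # -0.32
```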

LSTM


1. Introduction

  • With conventional "Back-Propagation Through Time" (BPTT) or "Real-Time Recurrent Learning" (RTRL), error signals "flowing backwards in time" tend to either (1) blow up or (2) vanish
    (For background on BPTT, see: https://velog.io/@jody1188/BPTT)
    -> LSTM is designed to overcome these error back-flow problems. It can learn to bridge time intervals in excess of 1000 steps even in case of noisy, incompressible input sequences, without loss of short time lag capabilities.

3. Constant Error Backprop

3.1 Exponentially Decaying Error

3.2 Constant Error Flow: NAIVE APPROACH

  • A single unit
    • To avoid vanishing error signals through a single unit j with a single connection to itself: at time t, j's local error back flow is \vartheta_j(t) = f_j'(net_j(t)) \, \vartheta_j(t+1) \, w_{jj}
    • To enforce constant error flow through j, we require f_j'(net_j(t)) \, w_{jj} = 1.0
  • The constant error carrousel
    • Integrating the differential equation above, we obtain f_j(net_j(t)) = \frac{net_j(t)}{w_{jj}} for arbitrary net_j(t)
    • This means: f_j has to be linear, and unit j's activation has to remain constant: y_j(t+1) = f_j(net_j(t+1)) = f_j(w_{jj} \, y_j(t)) = y_j(t)
    • In the experiments, this will be ensured by using the identity function f_j: f_j(x) = x, \forall x, and by setting w_{jj} = 1.0. We refer to this as the constant error carrousel (CEC).
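
A toy computation contrasting the two cases (an illustrative sketch, not code from the paper): with the identity f_j and w_{jj} = 1.0 the back-flowing error stays constant, while a squashing unit with |f_j' w_{jj}| < 1 lets it vanish.

```python
def backflow(w_self, f_prime, steps=100):
    """Local error back flow through one self-connected unit:
    theta(t) = f'(net) * theta(t+1) * w_jj, iterated over `steps` steps."""
    theta = 1.0
    for _ in range(steps):
        theta *= f_prime * w_self
    return theta

print(backflow(w_self=1.0, f_prime=1.0))   # CEC: identity f_j, w_jj = 1 -> stays 1.0
print(backflow(w_self=0.9, f_prime=0.8))   # squashing unit -> vanishes toward zero
```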

4. LONG SHORT-TERM MEMORY

  • Memory cells and gate units
    • To construct an architecture that allows for constant error flow through special, self-connected units without the disadvantages of the naive approach, we extend the constant error carrousel (CEC) embodied by the self-connected, linear unit j from Section 3.2 by introducing additional features.
    • A multiplicative input gate unit is introduced to protect the memory contents stored in j from perturbation by irrelevant inputs. Likewise, a multiplicative output gate unit is introduced which protects other units from perturbation by currently irrelevant memory contents stored in j.

  • c_j denotes the j^{th} memory cell.
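
A minimal sketch of one memory cell built around the CEC with input and output gates, in the spirit of the 1997 architecture (no forget gate). Everything here is a simplifying assumption for illustration: the gates see only the current input x, tanh squashes both the cell input and the cell output, and the names `memory_cell_step`, `w_in`, `w_out`, `w_cell` are mine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def memory_cell_step(x, s_c, params):
    """One step of a single memory cell c_j guarded by input/output gates.

    x   : (d,)  input vector at time t
    s_c : float internal state of the CEC (self-connection weight fixed at 1.0)
    """
    w_in, w_out, w_cell = params           # each of shape (d,)
    y_in  = sigmoid(w_in  @ x)             # input gate: shields the cell from irrelevant inputs
    y_out = sigmoid(w_out @ x)             # output gate: shields other units from the cell
    g     = np.tanh(w_cell @ x)            # squashed cell input
    s_c   = s_c + y_in * g                 # CEC: linear unit with self-connection weight 1.0
    y_c   = y_out * np.tanh(s_c)           # cell output released through the output gate
    return s_c, y_c

# Toy usage over a short sequence of 4-dimensional inputs
rng = np.random.default_rng(2)
params = tuple(rng.normal(size=4) for _ in range(3))
s_c = 0.0
for x in rng.normal(size=(5, 4)):
    s_c, y_c = memory_cell_step(x, s_c, params)
print(s_c, y_c)
```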

<References>

  • Williams, R. J., & Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity.
  • Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735-1780.
