Week4 Day3

김종영 · February 18, 2021

📋 $Sequence$ $to$ $Sequence$ $with$ $Attention$

📌 $Seq2Seq$ $Model$

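A $Seq2Seq$ model consists of an $encoder$ and a $decoder$: it takes a $sequence$ of words as input and produces a $sequence$ of words as output.
Drawback: no matter how long the input $sequence$ gets, all of its information has to be packed into a fixed-size $vector$, on top of the $Long$ $term$ $dependency$ problem.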

📌 $Seq2Seq$ $Model$ $with$ $Attention$

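An $Attention$ $mechanism$ lets the $decoder$ focus, at every time step, on the parts of the $source$ $sequence$ that matter most.
This also relieves the $bottleneck$ problem.
Each $encoder$ $hidden$ $state$ is dotted with the $decoder$ $hidden$ $state$, the resulting scores are normalized into a probability distribution, and that distribution is used to take a weighted sum of the $encoder$ $hidden$ $state$ vectors.
This weighted sum is then joined with the $decoder$ $hidden$ $state$ by $concat$ and used as the output.
As a result, the $gradient$ computed from the $loss$ can flow directly to each $encoder$ $hidden$ $state$.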
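
The steps above can be written compactly as follows. The notation is an assumption of this sketch rather than something the post defines: $h_i$ is the $i$-th $encoder$ $hidden$ $state$, $s_t$ is the $decoder$ $hidden$ $state$ at step $t$, and $S_L$ is the source length.

$$
e_{t,i} = s_t^\top h_i, \qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{S_L} \exp(e_{t,j})}, \qquad
a_t = \sum_{i=1}^{S_L} \alpha_{t,i}\, h_i, \qquad
o_t = [\, s_t \,;\, a_t \,]
$$

Here $\alpha_{t,i}$ corresponds to `attn_scores` and $a_t$ to `attn_values` in the code below.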

import torch
import torch.nn as nn
import torch.nn.functional as F

class DotAttention(nn.Module):
  def __init__(self):
    super().__init__()

  def forward(self, decoder_hidden, encoder_outputs):  # (1, B, d_h), (S_L, B, d_h)
    query = decoder_hidden.squeeze(0)  # (B, d_h)
    key = encoder_outputs.transpose(0, 1)  # (B, S_L, d_h)

    # Dot product between the query and every encoder hidden state.
    energy = torch.sum(torch.mul(key, query.unsqueeze(1)), dim=-1)  # (B, S_L)

    # Normalize over source positions, then take the weighted sum of encoder states.
    attn_scores = F.softmax(energy, dim=-1)  # (B, S_L)
    attn_values = torch.sum(torch.mul(key, attn_scores.unsqueeze(2)), dim=1)  # (B, d_h)

    return attn_values, attn_scores
    
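class Decoder(nn.Module):
  # __init__ is an assumed completion that mirrors the Decoder defined later in this post;
  # vocab_size, embedding_size and hidden_size are expected to be defined at module level.
  def __init__(self, attention):
    super().__init__()
    self.embedding = nn.Embedding(vocab_size, embedding_size)
    self.attention = attention
    self.rnn = nn.GRU(embedding_size, hidden_size)
    self.output_linear = nn.Linear(2*hidden_size, vocab_size)

  def forward(self, batch, encoder_outputs, hidden):  # (B), (S_L, B, d_h), (1, B, d_h)
    batch_emb = self.embedding(batch).unsqueeze(0)  # (1, B, d_w)
    outputs, hidden = self.rnn(batch_emb, hidden)  # (1, B, d_h), (1, B, d_h)

    attn_values, attn_scores = self.attention(hidden, encoder_outputs)  # (B, d_h), (B, S_L)
    concat_outputs = torch.cat((outputs, attn_values.unsqueeze(0)), dim=-1)  # (1, B, 2d_h)

    return self.output_linear(concat_outputs).squeeze(0), hidden  # (B, V), (1, B, d_h)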
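
To make the dot-product attention step concrete, here is a small worked example in plain Python, without PyTorch and for a single sentence rather than a batch. The function name `dot_attention` and the toy vectors are made up for illustration; it is a sketch of the same computation, not the code used above.

```python
import math

def dot_attention(query, keys):
    """query: list of d floats (decoder state); keys: list of S vectors of d floats (encoder states)."""
    # Dot product between the query and each encoder state.
    energy = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax over source positions (subtracting the max for numerical stability).
    max_e = max(energy)
    exps = [math.exp(e - max_e) for e in energy]
    total = sum(exps)
    weights = [x / total for x in exps]
    # Weighted sum of the encoder states.
    context = [sum(w * key[j] for w, key in zip(weights, keys)) for j in range(len(query))]
    return context, weights

# Toy values, chosen only for illustration.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
context, weights = dot_attention(decoder_state, encoder_states)
print(weights)  # positions 0 and 2 get equal, larger weights than position 1
print(context)
```

The first and third encoder states have the same dot product with the decoder state (2 versus 0 for the second), so they receive equal weight and dominate the context vector.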

📌 $Different$ $Attention$ $Mechanisms$

Some variants of the $Attention$ $Mechanism$ measure similarity not with a plain dot product but with additional operations that have learnable parameters.
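
One such variant, concat (additive) attention, can be sketched as follows. The symbols are assumptions of this note: $s_t$ is the $decoder$ $hidden$ $state$, $h_i$ the $i$-th $encoder$ $hidden$ $state$, and $W$ and $v$ are learnable parameters matching `self.w` and `self.v` in the code.

$$
e_{t,i} = v^\top \tanh\!\left( W \, [\, s_t \,;\, h_i \,] \right), \qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{S_L} \exp(e_{t,j})}, \qquad
a_t = \sum_{i=1}^{S_L} \alpha_{t,i}\, h_i
$$

Only the score $e_{t,i}$ differs from the dot-product version; the softmax and the weighted sum are unchanged.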

class ConcatAttention(nn.Module):
  def __init__(self):
    super().__init__()

    self.w = nn.Linear(2*hidden_size, hidden_size, bias=False)
    self.v = nn.Linear(hidden_size, 1, bias=False)

  def forward(self, decoder_hidden, encoder_outputs):  # (1, B, d_h), (S_L, B, d_h)
    src_max_len = encoder_outputs.shape[0]

    decoder_hidden = decoder_hidden.transpose(0, 1).repeat(1, src_max_len, 1)  # (B, S_L, d_h)
    encoder_outputs = encoder_outputs.transpose(0, 1)  # (B, S_L, d_h)

    concat_hiddens = torch.cat((decoder_hidden, encoder_outputs), dim=2)  # (B, S_L, 2d_h)
    energy = torch.tanh(self.w(concat_hiddens))  # (B, S_L, d_h)

    attn_scores = F.softmax(self.v(energy), dim=1)  # (B, S_L, 1)
    attn_values = torch.sum(torch.mul(encoder_outputs, attn_scores), dim=1)  # (B, d_h)

    return attn_values, attn_scores.squeeze(-1)  # (B, d_h), (B, S_L)
    
class Decoder(nn.Module):
  def __init__(self, attention):
    super().__init__()

    self.embedding = nn.Embedding(vocab_size, embedding_size)
    self.attention = attention
    self.rnn = nn.GRU(
        embedding_size + hidden_size,
        hidden_size
    )
    self.output_linear = nn.Linear(hidden_size, vocab_size)

  def forward(self, batch, encoder_outputs, hidden):  # batch: (B), encoder_outputs: (S_L, B, d_h), hidden: (1, B, d_h)  
    batch_emb = self.embedding(batch)  # (B, d_w)
    batch_emb = batch_emb.unsqueeze(0)  # (1, B, d_w)

    attn_values, attn_scores = self.attention(hidden, encoder_outputs)  # (B, d_h), (B, S_L)

    concat_emb = torch.cat((batch_emb, attn_values.unsqueeze(0)), dim=-1)  # (1, B, d_w+d_h)

    outputs, hidden = self.rnn(concat_emb, hidden)  # (1, B, d_h), (1, B, d_h)

    return self.output_linear(outputs).squeeze(0), hidden  # (B, V), (1, B, d_h)

📋 $Beam$ $search$

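Ideally, given an input sentence, $translation$ should pick the output sentence whose words have the highest $joint$ $probability$.
However, the $Greedy$ approach of taking the best word at the current step does not necessarily lead to the best sentence overall, since an early choice cannot be undone,
and $Exhaustive$ $search$, which considers every possible sequence, requires far too much computation.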
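
As a sketch in notation that this post does not define itself ($x$ is the input sentence, $y_1, \dots, y_T$ the output words, $V$ the vocabulary size), the quantity to maximize factorizes as:

$$
P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, x)
$$

$Greedy$ decoding maximizes each factor one at a time, while $Exhaustive$ $search$ would have to score on the order of $V^T$ candidate sentences.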

📌 $Beam$ $search$

At every $time$ step of the $decoder$, keep $tracking$ the $k$ most probable partial $translation$ candidates (hypotheses).
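
The idea can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not a production decoder: `step_log_probs` stands in for the $decoder$, and the `TOY` probability table, the token names, and the `<s>` / `</s>` markers are all made up for the example.

```python
import math

def beam_search(step_log_probs, k, max_len, bos="<s>", eos="</s>"):
    """Keep the k best partial hypotheses at every decoding step.

    step_log_probs(prefix) returns a dict: next token -> log P(token | prefix).
    """
    beams = [([bos], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in step_log_probs(tuple(tokens)).items():
                candidates.append((tokens + [tok], score + lp))
        # Keep only the k highest-scoring expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses that reached max_len without eos
    return max(finished, key=lambda c: c[1])

# Hypothetical next-token distributions; any prefix not listed ends the sentence.
TOY = {
    ("<s>",): {"a": 0.6, "b": 0.4},
    ("<s>", "a"): {"x": 0.5, "y": 0.5},
    ("<s>", "b"): {"z": 0.9, "w": 0.1},
}

def toy_log_probs(prefix):
    dist = TOY.get(prefix, {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in dist.items()}

greedy_tokens, greedy_score = beam_search(toy_log_probs, k=1, max_len=5)
beam_tokens, beam_score = beam_search(toy_log_probs, k=2, max_len=5)
print(greedy_tokens, math.exp(greedy_score))
print(beam_tokens, math.exp(beam_score))
```

With `k=1` the search is greedy: it commits to `a` (probability 0.6) and ends with a sentence of probability 0.3. With `k=2` the hypothesis starting with `b` survives the first step and wins with probability 0.36. Scores are summed in log space to avoid underflow on long sentences.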

📋 $BLEU$ $score$

📌 $Precision$ $and$ $Recall$

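$precision$ : of the words that were predicted, the fraction that are correct, i.e. the accuracy a user perceives when shown the prediction
$recall$ : of the words in the ground truth, the fraction that were predicted, i.e. how completely the reference is covered
$F$ - $measure$ : the harmonic mean of $precision$ and $recall$, which leans toward the smaller of the two values
However, none of these measures can account for word order.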
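
A minimal sketch of these three measures, assuming whitespace tokenization, non-empty sentences, and bag-of-words matching in which each reference word can be matched at most once. The function name and the example sentences are made up for illustration.

```python
from collections import Counter

def precision_recall_f1(reference, prediction):
    ref_tokens = reference.split()
    pred_tokens = prediction.split()
    # Clipped overlap: a word counts as matched at most as often as it occurs in both sentences.
    overlap = sum((Counter(ref_tokens) & Counter(pred_tokens)).values())
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

reference = "the cat sat on the mat"
in_order = precision_recall_f1(reference, "the cat sat on mat")
shuffled = precision_recall_f1(reference, "mat the on sat cat")
print(in_order)
print(shuffled)
```

Both predictions get exactly the same scores even though the second one is scrambled, which is the word-order blindness noted above.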

📌 $BiLingual$ $Evaluation$ $Understudy$ $(BLEU)$

Measures $N-gram$ $overlap$ between the prediction and the reference
Computes $precision$ for four $n$ - $gram$ sizes (1 to 4)
Adds a $brevity$ $penalty$, because short predictions have an advantage in $precision$
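
The three points above can be sketched for a single sentence and a single reference as follows. This is a simplified illustration with assumptions of its own: whitespace tokenization, no smoothing (any $n$ - $gram$ size with zero matches gives a score of 0), and the $brevity$ $penalty$ in the exponential form exp(1 - r/c) from the original BLEU definition, where r is the reference length and c the prediction length. Some course materials use the simpler min(1, c/r) instead.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, prediction, max_n=4):
    ref, pred = reference.split(), prediction.split()
    precisions = []
    for n in range(1, max_n + 1):
        pred_counts = ngrams(pred, n)
        ref_counts = ngrams(ref, n)
        total = sum(pred_counts.values())
        # Clipped matches: each reference n-gram can be used at most as often as it occurs.
        matched = sum((pred_counts & ref_counts).values())
        precisions.append(matched / total if total else 0.0)
    if min(precisions) == 0.0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: only predictions shorter than the reference are penalized.
    brevity_penalty = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return brevity_penalty * geo_mean

reference = "the cat sat on the mat"
exact = bleu(reference, "the cat sat on the mat")
shuffled = bleu(reference, "mat the on sat cat the")
shorter = bleu(reference, "the cat sat on the")
print(exact, shuffled, shorter)
```

The scrambled prediction shares every word with the reference but no 2-gram, so its score drops to 0, while the truncated prediction has perfect $n$ - $gram$ $precision$ and is lowered only by the $brevity$ $penalty$.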
