A News Three-Line Summary Service Using KlueBERT, Part 4 (ft. Evaluation)

shooting star · March 29, 2023


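Introduction

Once a model is built, it has to be evaluated to check whether it works properly and where it goes wrong. There are many evaluation metrics, but for extractive summarization I will use Rouge. This post explains why I chose Rouge, looks at which Rouge variant is the most appropriate, and then runs the evaluation. On a side note, Rouge2.0 did not run on Colab, the project's development environment. The alternative was to run it locally, but my local machine was bought for web/app development and has no GPU. So I reworked the code to make evaluation possible on Colab, and I will go over that part as well.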

1. Rouge

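(1) What is Rouge?

Rouge stands for Recall-Oriented Understudy for Gisting Evaluation. It is a standard metric for evaluating text generation tasks such as text summarization and machine translation. Rouge is computed from the n-gram recall between the reference text and the generated text.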

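(2) Types of Rouge

ROUGE-1, ROUGE-2, ROUGE-L

I used ROUGE as the evaluation metric. Its most common variants are ROUGE-1, ROUGE-2, and ROUGE-L.

ROUGE-1: ROUGE-1 is based on unigram (1-gram) overlap, that is, on the individual words shared by the reference summary and the generated summary. It measures how similar the two summaries are without considering word order.
ROUGE-2: ROUGE-2 is similar to ROUGE-1 but is based on bigrams (2-grams), pairs of adjacent words. ROUGE-2 therefore takes local word order into account.
ROUGE-L: ROUGE-L is based on the longest common subsequence (LCS). The LCS is the longest sequence of words that appears in both the reference summary and the generated summary in the same order, though not necessarily contiguously. ROUGE-L therefore evaluates similarity while taking word order into account.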
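
To make the three variants concrete, here is a minimal pure-Python sketch. It assumes whitespace tokenization and a single reference, and the function names (ngrams, rouge_n, lcs_length, rouge_l) are made up for illustration. It is not the exact computation performed by the rouge package used later, only the underlying idea.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, hypothesis, n):
    """Return (recall, precision, f1) for n-gram overlap on whitespace tokens."""
    ref = ngrams(reference.split(), n)
    hyp = ngrams(hypothesis.split(), n)
    overlap = sum((ref & hyp).values())  # shared n-grams, clipped per n-gram
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(hyp.values()), 1)
    f1 = 2 * recall * precision / (recall + precision) if overlap else 0.0
    return recall, precision, f1


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, hypothesis):
    """Return (recall, precision, f1) based on the LCS of whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    lcs = lcs_length(ref, hyp)
    recall = lcs / max(len(ref), 1)
    precision = lcs / max(len(hyp), 1)
    f1 = 2 * recall * precision / (recall + precision) if lcs else 0.0
    return recall, precision, f1


reference = "the cat sat on the mat"
hypothesis = "the cat lay on the mat"

print("ROUGE-1:", rouge_n(reference, hypothesis, 1))
print("ROUGE-2:", rouge_n(reference, hypothesis, 2))
print("ROUGE-L:", rouge_l(reference, hypothesis))
```

In this toy pair only one word differs, so five of six unigrams overlap, while only three of five bigrams survive because the changed word breaks two adjacent pairs. That is the sense in which ROUGE-2 is stricter about word order than ROUGE-1.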

recall, precision, F1-score

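In addition, each score can be reported as recall, precision, or F1-score. For ROUGE, recall is the share of the reference summary's n-grams that also appear in the generated summary, and precision is the share of the generated summary's n-grams that also appear in the reference.

Recall matters more when misclassifying a positive example as negative is costly.
Precision matters more when misclassifying a negative example as positive is costly.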

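Final evaluation metric

The final metric is the F1-score, which balances recall and precision. I chose it because it is the harmonic mean of precision and recall, so it rewards an extracted summary that both covers the important information and matches the reference summary accurately.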
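
For reference, the three quantities can be written out for ROUGE-N as below. The notation is my own shorthand rather than something taken from the library: $c_{\text{ref}}(g)$ and $c_{\text{hyp}}(g)$ are the number of times n-gram $g$ occurs in the reference summary and in the generated summary, and the sums run over all distinct n-grams $g$.

```latex
\begin{align*}
R_n &= \frac{\sum_{g} \min\bigl(c_{\text{ref}}(g),\, c_{\text{hyp}}(g)\bigr)}{\sum_{g} c_{\text{ref}}(g)} \\[4pt]
P_n &= \frac{\sum_{g} \min\bigl(c_{\text{ref}}(g),\, c_{\text{hyp}}(g)\bigr)}{\sum_{g} c_{\text{hyp}}(g)} \\[4pt]
F_1 &= \frac{2\, P_n R_n}{P_n + R_n}
\end{align*}
```

Because the harmonic mean is pulled toward the smaller of the two values, a summary cannot score a high F1 by being very long (high recall, low precision) or very short (high precision, low recall).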

2. Evaluation code

(1) Library

First, instead of pyrouge 2.0, which I had originally planned to use, I used a different library.

pip install rouge

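(2) Loading the data

This part is the same as the code in the "Extracting Text and Extractive and merging categories" section covered earlier in the data post. The data is loaded with it, but because the extraction takes a long time, I modified it to take only 5% of the data from each category (the 0.05 factor in the code).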

import json
import os
import re

from tqdm import tqdm

def data_load(DATAPATH):
    # Collect the JSON files in the directory in a stable order
    filenames = sorted(x for x in os.listdir(DATAPATH) if x.endswith('json'))
    list_dic = []

    # Captures the text between "sentence': '" and "'highlight_indices"
    p = re.compile(r"(?<=sentence': ')(.*?)(?='highlight_indices)")

    for file in filenames:
        filelocation = os.path.join(DATAPATH, file)

        with open(filelocation, 'r') as json_file:
            data = json.load(json_file)['documents']

        # Keep only the first 5% of the documents in each category
        data_len = round(len(data) * 0.05)
        data = data[:data_len]

        for doc in tqdm(data):
            # Stringify the nested structure and unify the quote characters
            text = str(doc['text']).replace('"', "'")

            # Replace missing (None) extractive indices with 0
            extractive = [0 if value is None else value for value in doc['extractive']]

            # Drop the three trailing characters "', " left on each match
            sentences = [t[:-3] for t in p.findall(text)]

            list_dic.append({'text': sentences, 'extractive': extractive})

    return list_dic
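
The regular expression is the least obvious step, so here it is run in isolation on a tiny sample. The sample structure (a list of paragraphs, each a list of dicts with index, sentence, and highlight_indices keys) is an assumption inferred from the pattern itself, and extract_sentences is a hypothetical helper name.

```python
import re

# Hypothetical sample shaped like one document's 'text' field (an assumption)
sample_text = [[
    {"index": 0, "sentence": "First sentence.", "highlight_indices": ""},
    {"index": 1, "sentence": "Second sentence.", "highlight_indices": "3,5"},
]]

# Same pattern as in data_load, written as a raw string
pattern = re.compile(r"(?<=sentence': ')(.*?)(?='highlight_indices)")


def extract_sentences(text):
    """Stringify the nested structure and pull out every sentence value."""
    flat = str(text).replace('"', "'")
    # Each match ends with "', " (quote, comma, space), hence the [:-3]
    return [match[:-3] for match in pattern.findall(flat)]


sentences = extract_sentences(sample_text)
print(sentences)
```

Note that this approach depends on the exact string form Python produces for the nested dicts, so it only works as long as the sentence key is immediately followed by the highlight_indices key.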

(3) Extracting the source text

I extracted the source texts to be summarized.

# list_dic is the list returned by data_load() in step (2)
input_text2 = [doc['text'] for doc in tqdm(list_dic)]

(4) Extracting the reference summaries

Then I extracted the reference summaries needed for the evaluation.

# For each document, pick the sentences whose indices are listed in 'extractive'
selected_texts = []
for doc in tqdm(list_dic):
    selected_texts.append([doc['text'][i] for i in doc['extractive']])

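(5) Extractive summarization of the source text

import sys
from SRC.train import new_inference

# Checkpoint of the trained model used for inference
test_from = "MODEL/KLUE/bert_transformer_result/model_step_50000.pt"

# new_inference is expected to return the selected sentences for one document
hypotheses = []
for sentences in tqdm(input_text2):
    hypotheses.append(new_inference(sentences, test_from, "transformer", "0", "0", 1))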

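(6) Running the evaluation

Before running the evaluation, the summaries are converted into the form the evaluation expects. Each three-line summary, which consisted of three values, is merged into a single value, so the overall structure changes from a nested list to a flat list.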

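references_list = []
hypotheses_list = []

# Join each three-sentence summary into a single string.
# The loop variable is named ref, not re, so that it does not shadow the re module.
for ref, hyp in tqdm(zip(selected_texts, hypotheses), total=len(hypotheses)):
    references_list.append(' '.join(ref))
    hypotheses_list.append(' '.join(hyp))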

Finally, this is the code that runs the evaluation.

from rouge import Rouge

def rouge_f_r_summary(hypotheses, references):
    rouge = Rouge()

    # avg=True returns the scores averaged over all (hypothesis, reference) pairs
    scores = rouge.get_scores(hypotheses, references, avg=True)
    rouge_f_1 = scores['rouge-1']['f']
    rouge_r_1 = scores['rouge-1']['r']
    rouge_f_2 = scores['rouge-2']['f']
    rouge_r_2 = scores['rouge-2']['r']
    rouge_f_L = scores['rouge-l']['f']
    rouge_r_L = scores['rouge-l']['r']
    # Note the order: F1 scores come back as (1, L, 2), recall scores as (1, 2, L)
    return rouge_f_1, rouge_f_L, rouge_f_2, rouge_r_1, rouge_r_2, rouge_r_L

rouge_f_1, rouge_f_L, rouge_f_2, rouge_r_1, rouge_r_2, rouge_r_L = rouge_f_r_summary(hypotheses_list, references_list)
print("--------------------------Rouge-F---------------------------")
print("Rouge-F_1:", rouge_f_1)
print("Rouge-F_2:", rouge_f_2)
print("Rouge-F_L:", rouge_f_L)
print("")
print("--------------------------Rouge-R---------------------------")
print("Rouge-R_1:", rouge_r_1)
print("Rouge-R_2:", rouge_r_2)
print("Rouge-R_L:", rouge_r_L)

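3. Modifying the code for evaluation

The original code could only summarize .txt files fed in one at a time, so I picked apart the argparse handling and modified it to run extractive summarization on multiple source documents passed in as a list. To do this, I implemented new functions and classes by prefixing new_ to the ones the original extractive summarization relied on. Uploading every modification would be too much, so I only post the code for the entry point.

The newly modified functions and classes are listed below. Looking at new_inference() alone should be enough to see the overall flow.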

new_inference()
new_txt2input()
new_Summarizer()
new_Dataloader()
new_DataIterator()
new_build_trainer()
new_Trainer()

# Assumes torch, BertConfig, data_loader, _lazy_dataset_loader and the new_* helpers
# are already imported or defined in the module this function lives in.
def new_inference(input_data, test_from, encoder, visible_gpus, gpu_ranks, world_size):
    # Idea: take a list of sentences directly instead of opening a file with `with`
    temp_dir = test_from

    # Model hyperparameters, fixed here instead of being read from argparse
    ff_size = 2048
    heads = 4
    dropout = 0.1
    inter_layers = 2
    rnn_size = 512
    hidden_size = 128
    param_init = 0
    param_init_glorot = True

    batch_size = 1000
    use_interval = True

    accum_count = 1
    model_path = '../models/'
    report_every = 1

    input_list = new_txt2input(input_data)

    device = "cpu" if visible_gpus == '-1' else "cuda"
    device_id = 0 if device == "cuda" else -1

    # Recover the training step from a file name like model_step_50000.pt
    try:
        step = int(test_from.split('.')[-2].split('_')[-1])
    except (IndexError, ValueError):
        step = 0

    checkpoint = torch.load(test_from, map_location=lambda storage, loc: storage)
    opt = vars(checkpoint['opt'])

    config = BertConfig.from_pretrained('klue/bert-base')

    model = new_Summarizer(temp_dir, encoder, ff_size, heads, dropout, inter_layers,
                 rnn_size, hidden_size, param_init, param_init_glorot,
                 device, load_pretrained_bert=False, bert_config=config)

    model.load_cp(checkpoint)
    model.eval()

    test_iter = data_loader.new_Dataloader(use_interval, _lazy_dataset_loader(input_list),
                                batch_size, device,
                                shuffle=False, is_test=True)
    trainer = new_build_trainer(visible_gpus, accum_count,
                                world_size, gpu_ranks, temp_dir,
                                model_path, report_every, device_id, model, None)
    result = trainer.summ(test_iter, step)

    # Map the top three predicted indices back to the non-empty input sentences
    sentences = list(filter(None, input_data))
    final = [sentences[i] for i in result[0][:3]]
    # Return the summary; a bare `return` would hand None to the evaluation code
    return final
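
The step-parsing expression inside new_inference is easy to misread, so here is a small standalone sketch of it. The helper name parse_step is hypothetical and not part of the project code.

```python
def parse_step(checkpoint_path):
    """Return the step encoded in a name like 'model_step_50000.pt', or 0 if absent."""
    try:
        # 'a/model_step_50000.pt' -> 'a/model_step_50000' -> '50000' -> 50000
        return int(checkpoint_path.split('.')[-2].split('_')[-1])
    except (IndexError, ValueError):
        # No extension, or the last underscore-separated piece is not a number
        return 0


print(parse_step("MODEL/KLUE/bert_transformer_result/model_step_50000.pt"))
print(parse_step("model.pt"))
```

Catching only IndexError and ValueError keeps the fallback to 0 for malformed names while letting unrelated errors surface, which a bare except would hide.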

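Closing

So far we have covered the paper, the data, the model, training, and evaluation. One way to improve the evaluation would be to measure Rouge F1 at the eojeol (space-delimited word) level. Since this is extractive summarization, one could of course also judge the model by how many of the predicted sentences match the reference sentences. But even when the model fails to extract exactly the same sentence, it may have extracted a similar, highly correlated one. Taking such cases into account should give a more precise Rouge F1. What remains now is deploying the model so that it can be used in a service. The next post will cover implementing the service deployment.

Go to GitHub: A News Three-Line Summary Service Using KlueBERT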
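
The sentence-level check mentioned above, counting how many predicted sentences coincide with the reference ones, could look roughly like the sketch below. It assumes that both the prediction and the reference are available as lists of sentence indices; sentence_match is a hypothetical helper, not part of the project.

```python
def sentence_match(predicted, reference):
    """Return (hits, precision, recall) over two collections of sentence indices."""
    pred, ref = set(predicted), set(reference)
    hits = len(pred & ref)  # sentences chosen by both
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return hits, precision, recall


# Two of the three predicted sentences are also in the reference summary
print(sentence_match([0, 3, 7], [0, 2, 7]))
```

This exact-match view gives no credit for a near-duplicate sentence at a different index, which is the gap the word-overlap Rouge score fills.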
