[Dataset] CNN Daily Mail

Wonkwang · August 14, 2023

1. Introduction

  • CNN/Daily Mail is a dataset for text summarization. It was originally built for question answering: human-written abstractive summary bullets from news stories on the CNN and Daily Mail websites were turned into fill-in-the-blank questions (with one of the entities hidden), and the stories served as the passages from which a system was expected to answer them.
  • Download:
    • tensorflow datasets: Link
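
As a minimal sketch of reading the data through TensorFlow Datasets, the snippet below assumes the `tensorflow_datasets` package is installed and that the dataset is registered under the name `cnn_dailymail` with `article` and `highlights` string features. The helper name `iter_examples` is hypothetical, not part of any library.

```python
def iter_examples(split="validation"):
    """Yield (article, highlights) string pairs from one split.

    Assumption: the TFDS dataset name is "cnn_dailymail" and each example
    exposes "article" and "highlights" as byte strings.
    The first call downloads and prepares the data, which can take a while.
    """
    # Imported lazily so that merely defining this helper stays cheap.
    import tensorflow_datasets as tfds

    dataset = tfds.load("cnn_dailymail", split=split)
    for example in tfds.as_numpy(dataset):
        article = example["article"].decode("utf-8")
        highlights = example["highlights"].decode("utf-8")
        yield article, highlights
```

For a quick look at one pair, something like `article, highlights = next(iter_examples("test"))` should be enough; the split names `train`, `validation`, and `test` are assumed here.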

2. Characteristics

  • Language: English
  • News articles paired with summaries; each example consists of 2 features
    • Article: text of news article, used as the document to be summarized
    • Highlights: joined text of highlights with <s> and </s> around each highlight, which is the target summary
  • Average number of tokens
    • Article: 781 tokens
    • Highlights: 56 tokens
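  • It is primarily an abstractive summarization dataset, but the gold summaries are very similar to sentences in the source article, so it is also widely used as an extractive summarization dataset by selecting the article sentences most similar to the gold summary, based on ROUGE score or a similar metric
  • Dataset size (per TensorFlow Datasets)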
    • train: 287,113
    • val: 13,368
    • test: 11,490
    • Data size: (download: 558.32 MB, dataset: 1.27 GB)
  • Example
    • Article
[Article]: By. Associated Press. PUBLISHED:. 14:11 EST, 25 October 2013. |. UPDATED:. 15:36 EST, 25 October 2013. The bishop of the Fargo Catholic Diocese in North Dakota has exposed potentially hundreds of church members in Fargo, Grand Forks and Jamestown to the hepatitis A virus in late September and early October. The state Health Department has issued an advisory of exposure for anyone who attended five churches and took communion. Bishop John Folda (pictured) of the Fargo Catholic Diocese in North Dakota has exposed potentially hundreds of church members in Fargo, Grand Forks and Jamestown to the hepatitis A. State Immunization Program Manager Molly Howell says the risk is low, but officials feel it's important to alert people to the possible exposure. The diocese announced on Monday that Bishop John Folda is taking time off after being diagnosed with hepatitis A. The diocese says he contracted the infection through contaminated food while attending a conference for newly [...]
[Highlights]: Bishop John Folda, of North Dakota, is taking time off after being diagnosed. He contracted the infection through contaminated food in Italy. Church members in Fargo, Grand Forks and Jamestown could have been exposed.
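
The extractive use mentioned above (picking the article sentences most similar to the gold summary) can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure of any particular paper: `rouge1_f1` is a simplified unigram-overlap F1 standing in for a real ROUGE implementation, tokenization is plain lowercased whitespace splitting, and the function names and the `max_sentences` parameter are hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap F1, a simplified stand-in for ROUGE-1."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(sentences, summary, max_sentences=3):
    """Greedily select sentence indices whose union best matches the summary.

    At each step, add the sentence that raises the score the most;
    stop when no remaining sentence improves it.
    """
    reference = summary.lower().split()
    selected, best_score = [], 0.0
    for _ in range(max_sentences):
        best_index = None
        for i in range(len(sentences)):
            if i in selected:
                continue
            # Score the candidate set in original article order.
            chosen = sorted(selected + [i])
            candidate = " ".join(sentences[j] for j in chosen).lower().split()
            score = rouge1_f1(candidate, reference)
            if score > best_score:
                best_score, best_index = score, i
        if best_index is None:
            break
        selected.append(best_index)
    return sorted(selected)
```

The returned indices can then serve as binary sentence labels for training an extractive model. In practice a stemmed ROUGE-1/ROUGE-2 scorer and a proper sentence splitter would likely replace the simplifications made here.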