Coreference resolution
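To handle pronouns such as "it", I tried preprocessing the body text with coreference resolution.
I tested with the first six sentences of the Wikipedia article on Korea.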
String text = "Korea (officially the \"Korean Peninsula\") is a region in East Asia. Since 1945 it has been divided into the two parts which soon became the two sovereign states: North Korea (officially the \"Democratic People's Republic of Korea\") and South Korea (officially the \"Republic of Korea\"). Korea consists of the Korean Peninsula, Jeju Island, and several minor islands near the peninsula. It is bordered by China to the northwest and Russia to the northeast. It is separated from Japan to the east by the Korea Strait and the Sea of Japan (East Sea). During the first half of the 1st millennium, Korea was divided between the three competing states of Goguryeo, Baekje, and Silla, together known as the Three Kingdoms of Korea.";
Properties props = PropertiesUtils.asProperties(
"annotators", "tokenize,ssplit,pos,lemma,ner,parse,coref"
);
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
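tokenize, ssplit, pos, lemma, ner, parse, coref
The annotators must be listed in exactly this order, because each one depends on the output of earlier ones (I wasted a lot of time on this).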
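As a rough illustration of why the order matters, here is a minimal sketch that checks an annotator string against a prerequisite table. The table is hand-written and simplified for this example (an assumption, not something read from CoreNLP), and the class and method names are made up.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class AnnotatorOrderCheck {
    // Simplified prerequisite table written by hand for this sketch.
    // It is an assumption for illustration, not CoreNLP's own dependency data.
    static final Map<String, List<String>> REQUIRES = new LinkedHashMap<>();
    static {
        REQUIRES.put("tokenize", Collections.<String>emptyList());
        REQUIRES.put("ssplit", Arrays.asList("tokenize"));
        REQUIRES.put("pos", Arrays.asList("tokenize", "ssplit"));
        REQUIRES.put("lemma", Arrays.asList("pos"));
        REQUIRES.put("ner", Arrays.asList("pos", "lemma"));
        REQUIRES.put("parse", Arrays.asList("tokenize", "ssplit"));
        REQUIRES.put("coref", Arrays.asList("ner", "parse"));
        REQUIRES.put("natlog", Arrays.asList("lemma", "parse"));
        REQUIRES.put("openie", Arrays.asList("natlog"));
    }

    // Returns the first annotator listed before one of its prerequisites,
    // or null when every prerequisite appears earlier in the list.
    static String firstMisplaced(String annotators) {
        Set<String> seen = new HashSet<>();
        for (String raw : annotators.split(",")) {
            String name = raw.trim();
            for (String req : REQUIRES.getOrDefault(name, Collections.<String>emptyList())) {
                if (!seen.contains(req)) {
                    return name;
                }
            }
            seen.add(name);
        }
        return null;
    }

    public static void main(String[] args) {
        // Order used in this post: no annotator is misplaced.
        System.out.println(firstMisplaced("tokenize,ssplit,pos,lemma,ner,parse,coref"));
        // coref listed too early: it is reported as misplaced.
        System.out.println(firstMisplaced("tokenize,ssplit,coref,pos,lemma,ner,parse"));
    }
}
```

Running a check like this before building the pipeline would point at the offending annotator directly instead of leaving you to guess from a runtime error.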
Simply splitting the text on "." breaks on words such as "Dr. Kim".
→ use the pipeline's sentence splitting (ssplit) instead
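To make the problem concrete, here is a small sketch that uses only the JDK (no CoreNLP). The class name and the sample text are made up for illustration. It shows how splitting on periods cuts the abbreviation off as its own fragment.

```java
import java.util.Arrays;
import java.util.List;

class NaiveSplitDemo {
    // Naive approach: treat every period (plus trailing whitespace) as a sentence boundary.
    static List<String> splitOnPeriod(String text) {
        return Arrays.asList(text.split("\\.\\s*"));
    }

    public static void main(String[] args) {
        // Hypothetical sample containing two real sentences.
        String sample = "Dr. Kim lives in Seoul. He teaches history.";
        // Prints three fragments, because "Dr." is treated as the end of a sentence.
        for (String piece : splitOnPeriod(sample)) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

The two sentences come back as three pieces ("Dr", "Kim lives in Seoul", "He teaches history"), which is the failure that the ssplit annotator avoids.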
Annotation docu = new Annotation(text);
pipeline.annotate(docu);
List<String> sentList = new ArrayList<>();
for (CoreMap sentence : docu.get(CoreAnnotations.SentencesAnnotation.class)) {
sentList.add(sentence.get(CoreAnnotations.TextAnnotation.class));
}
The resulting sentences are stored in sentList.
Collection<CorefChain> values = docu.get(CorefCoreAnnotations.CorefChainAnnotation.class).values();
for (CorefChain cc : values) {
System.out.println("\t" + cc);
}
A CorefChain represents the links between mentions, that is, pronouns and the nouns they refer to.
CHAIN20-["the Korean Peninsula" in sentence 3, "the peninsula" in sentence 3]
CHAIN29-["Japan" in sentence 5, "Japan" in sentence 5]
CHAIN31-["Korea" in sentence 3, "It" in sentence 4, "It" in sentence 5, "Korea" in sentence 6, "Korea" in sentence 6]
The output is a list of chains like this, and all mentions within one chain refer to the same entity.
cc.getMentionsInTextualOrder()
Calling this method on a chain returns its mentions sorted in textual order.
["the Korean Peninsula" in sentence 3, "the peninsula" in sentence 3]
["Japan" in sentence 5, "Japan" in sentence 5]
["Korea" in sentence 3, "It" in sentence 4, "It" in sentence 5, "Korea" in sentence 6, "Korea" in sentence 6]
Collection<CorefChain> chains = docu.get(CorefCoreAnnotations.CorefChainAnnotation.class).values();
for (CorefChain cc : chains) {
    List<CorefChain.CorefMention> mentionsInTextualOrder = cc.getMentionsInTextualOrder();
    // Use the first mention in the chain as the core noun
    String coreWord = mentionsInTextualOrder.get(0).mentionSpan;
    for (CorefChain.CorefMention m : mentionsInTextualOrder) {
        String mention = m.mentionSpan;          // text of this mention (possibly a pronoun)
        int sentNum = m.sentNum - 1;             // sentNum is 1-based, the list index is 0-based
        String modiSent = sentList.get(sentNum); // sentence to modify
        // Replace the mention with the core noun. String.replace treats the mention as
        // literal text, so parentheses or dots in it are not read as regex syntax.
        modiSent = modiSent.replace(mention, coreWord);
        sentList.set(sentNum, modiSent);         // store the modified sentence
    }
}
// Join the modified sentences back into one text
String newText = String.join(" ", sentList);
System.out.println(text);
System.out.println("--------------------------------------------");
System.out.println(newText);
I took the first mention in each chain as the core noun,
treated the mentions that follow it as replacement targets,
and replaced each mention in its sentence with the core noun.
Korea (officially the "Korean Peninsula") is a region in East Asia. Since 1945 it has been divided into the two parts which soon became the two sovereign states: North Korea (officially the "Democratic People's Republic of Korea") and South Korea (officially the "Republic of Korea"). Korea consists of the Korean Peninsula, Jeju Island, and several minor islands near the peninsula. It is bordered by China to the northwest and Russia to the northeast. It is separated from Japan to the east by the Korea Strait and the Sea of Japan (East Sea). During the first half of the 1st millennium, Korea was divided between the three competing states of Goguryeo, Baekje, and Silla, together known as the Three Kingdoms of Korea.
--------------------------------------------
Korea (officially the "Korean Peninsula") is a region in East Asia. Since 1945 it has been divided into the two parts which soon became the two sovereign states: North Korea (officially the "Democratic People's Republic of Korea") and South Korea (officially the "Republic of Korea"). Korea consists of the Korean Peninsula, Jeju Island, and several minor islands near the Korean Peninsula. Korea is bordered by China to the northwest and Russia to the northeast. Korea is separated from Japan to the east by the Korea Strait and the Sea of Japan (East Sea). During the first half of the 1st millennium, Korea was divided between the three competing states of Goguryeo, Baekje, and Silla, together known as the Three Kingdoms of Korea.
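Everything was replaced correctly except in the second sentence, where "it" was left unchanged (none of the chains shown above includes a mention from sentence 2).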
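One caveat with the replacement step above: it matches the mention text anywhere in the sentence, including inside longer words. The following is a minimal sketch of a boundary-aware alternative using only java.util.regex. The helper name replaceMention and the sample sentence are hypothetical, and this is not what produced the output above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MentionReplacer {
    // Replace `mention` with `coreWord` only where the mention is not glued to
    // neighboring letters or digits. Both strings are treated as literal text.
    static String replaceMention(String sentence, String mention, String coreWord) {
        Pattern p = Pattern.compile(
                "(?<![\\p{L}\\p{N}])" + Pattern.quote(mention) + "(?![\\p{L}\\p{N}])");
        return p.matcher(sentence).replaceAll(Matcher.quoteReplacement(coreWord));
    }

    public static void main(String[] args) {
        String s = "It borders Italy.";
        // Plain substring replacement also rewrites the "It" inside "Italy".
        System.out.println(s.replace("It", "Korea"));
        // The boundary-aware version only touches the standalone "It".
        System.out.println(replaceMention(s, "It", "Korea"));
    }
}
```

Pattern.quote and Matcher.quoteReplacement keep mentions that contain parentheses or dollar signs from being interpreted as regex syntax.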
props = PropertiesUtils.asProperties(
"annotators", "tokenize,ssplit,pos,lemma,parse,natlog,openie"
);
props.setProperty("openie.max_entailments_per_clause","100");
props.setProperty("openie.triple.strict","false");
pipeline = new StanfordCoreNLP(props);
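tokenize, ssplit, pos, lemma, ner, parse, coref, natlog, openie
Again, the annotators must be listed in this order (I wasted a lot of time on this). The pipeline above uses this order with ner and coref left out.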
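openie.max_entailments_per_clause
: the maximum number of entailments produced per clause. Lowering it from the default of 1000 to 100 cuts down on duplicate triples.
openie.triple.strict
: whether to apply the strict extraction rule. Setting it to false gave more accurate results for me.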
docu = new Annotation(newText);
pipeline.annotate(docu);
int sentNo = 0;
for (CoreMap sentence : docu.get(CoreAnnotations.SentencesAnnotation.class)) {
System.out.println("Sentence #" + ++sentNo + ": " + sentence.get(CoreAnnotations.TextAnnotation.class));
// Get the OpenIE triples for the sentence
Collection<RelationTriple> triples = sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
// Print the triples
for (RelationTriple triple : triples) {
System.out.println(triple.confidence + "\t" +
"<"+triple.subjectGloss()+">" + "\t" +
"<"+triple.relationGloss()+">" + "\t" +
"<"+triple.objectGloss()+">");
}
System.out.println("\n");
}
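Feeding newText back into the pipeline and generating triples with openie gives the results below. For comparison, the triples from the original text come first.
Before preprocessing: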
Sentence #1: Korea (officially the "Korean Peninsula") is a region in East Asia.
1.0 <region> <is in> <East Asia>
1.0 <Korea> <is region in> <East Asia>
1.0 <Korea> <is> <region>
Sentence #2: Since 1945 it has been divided into the two parts which soon became the two sovereign states: North Korea (officially the "Democratic People's Republic of Korea") and South Korea (officially the "Republic of Korea").
1.0 <it> <has> <has divided>
1.0 <it> <has divided Since> <1945>
1.0 <it> <has divided into> <two parts>
Sentence #3: Korea consists of the Korean Peninsula, Jeju Island, and several minor islands near the peninsula.
1.0 <Korea> <consists of> <Korean Peninsula>
Sentence #4: It is bordered by China to the northwest and Russia to the northeast.
1.0 <It> <is bordered by> <China>
1.0 <It> <is bordered to> <northeast>
Sentence #5: It is separated from Japan to the east by the Korea Strait and the Sea of Japan (East Sea).
1.0 <It> <is separated from> <Japan>
1.0 <It> <is separated to> <east by Korea Strait>
1.0 <It> <is separated to> <east>
1.0 <It> <is> <separated>
Sentence #6: During the first half of the 1st millennium, Korea was divided between the three competing states of Goguryeo, Baekje, and Silla, together known as the Three Kingdoms of Korea.
1.0 <Korea> <was divided During> <first half>
1.0 <Korea> <was> <divided>
1.0 <Korea> <was divided between> <three competing states of Goguryeo>
1.0 <Korea> <was divided During> <half of millennium>
1.0 <Korea> <was divided During> <half>
1.0 <Korea> <was divided During> <first half of 1st millennium>
1.0 <Korea> <was divided between> <three states of Goguryeo>
1.0 <Korea> <was divided between> <three competing states>
1.0 <Korea> <was divided During> <half of 1st millennium>
1.0 <Korea> <was divided During> <first half of millennium>
1.0 <Korea> <was divided between> <three states>
After preprocessing:
Sentence #1: Korea (officially the "Korean Peninsula") is a region in East Asia.
1.0 <region> <is in> <East Asia>
1.0 <Korea> <is region in> <East Asia>
1.0 <Korea> <is> <region>
Sentence #2: Since 1945 it has been divided into the two parts which soon became the two sovereign states: North Korea (officially the "Democratic People's Republic of Korea") and South Korea (officially the "Republic of Korea").
1.0 <it> <has> <has divided>
1.0 <it> <has divided Since> <1945>
1.0 <it> <has divided into> <two parts>
Sentence #3: Korea consists of the Korean Peninsula, Jeju Island, and several minor islands near the Korean Peninsula.
1.0 <Korea> <consists of> <Korean Peninsula>
Sentence #4: Korea is bordered by China to the northwest and Russia to the northeast.
1.0 <Korea> <is bordered to> <northeast>
1.0 <Korea> <is bordered by> <China>
Sentence #5: Korea is separated from Japan to the east by the Korea Strait and the Sea of Japan (East Sea).
1.0 <Korea> <is separated to> <east by Korea Strait>
1.0 <Korea> <is separated to> <east>
1.0 <Korea> <is> <separated>
1.0 <Korea> <is separated from> <Japan>
Sentence #6: During the first half of the 1st millennium, Korea was divided between the three competing states of Goguryeo, Baekje, and Silla, together known as the Three Kingdoms of Korea.
1.0 <Korea> <was divided During> <first half>
1.0 <Korea> <was> <divided>
1.0 <Korea> <was divided between> <three competing states of Goguryeo>
1.0 <Korea> <was divided During> <half of millennium>
1.0 <Korea> <was divided During> <half>
1.0 <Korea> <was divided During> <first half of 1st millennium>
1.0 <Korea> <was divided between> <three states of Goguryeo>
1.0 <Korea> <was divided between> <three competing states>
1.0 <Korea> <was divided During> <half of 1st millennium>
1.0 <Korea> <was divided During> <first half of millennium>
1.0 <Korea> <was divided between> <three states>
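Sentence #6 still produces many overlapping triples even with the entailment cap. One possible post-processing step, shown here only as a sketch, is to keep the triple with the longest object for each subject and relation pair. The Triple class below is a plain hypothetical holder, not CoreNLP's RelationTriple, and I did not apply this step to the output above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TripleDedup {
    // Hypothetical plain holder for one triple (not a CoreNLP class).
    static final class Triple {
        final String subject;
        final String relation;
        final String object;

        Triple(String subject, String relation, String object) {
            this.subject = subject;
            this.relation = relation;
            this.object = object;
        }

        @Override
        public String toString() {
            return "<" + subject + ">\t<" + relation + ">\t<" + object + ">";
        }
    }

    // For each (subject, relation) pair, keep only the triple with the longest object.
    static List<Triple> keepLongestObject(List<Triple> triples) {
        Map<String, Triple> best = new LinkedHashMap<>();
        for (Triple t : triples) {
            String key = t.subject + "\u0000" + t.relation;
            Triple current = best.get(key);
            if (current == null || t.object.length() > current.object.length()) {
                best.put(key, t);
            }
        }
        return new ArrayList<>(best.values());
    }

    public static void main(String[] args) {
        List<Triple> triples = new ArrayList<>();
        triples.add(new Triple("Korea", "was divided During", "first half"));
        triples.add(new Triple("Korea", "was", "divided"));
        triples.add(new Triple("Korea", "was divided During", "first half of 1st millennium"));
        triples.add(new Triple("Korea", "was divided between", "three states"));
        triples.add(new Triple("Korea", "was divided between", "three competing states of Goguryeo"));
        for (Triple t : keepLongestObject(triples)) {
            System.out.println(t);
        }
    }
}
```

Keeping the longest object is a crude heuristic: it favors the most specific triple but can drop shorter ones that are useful in their own right, so whether it fits depends on how the triples are consumed later.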
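Of the 370 sentences, only 4 to 5 still contained "it".
I think that level of accuracy is usable,
but running the full body of the Wikipedia Korea page takes more than 4 minutes.
I am wondering whether I should summarize the body text first.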
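For reference, a count like the one above can be produced with a few lines of plain Java. This is a minimal sketch that assumes the preprocessed sentences are available as a list of strings (such as sentList). The class name is made up, and it only looks for the word "it", not other pronouns.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PronounCounter {
    // Matches "it" as a whole word, ignoring case, so "Italy" or "with" do not count.
    private static final Pattern IT = Pattern.compile("\\bit\\b", Pattern.CASE_INSENSITIVE);

    // Counts the sentences that contain "it" as a standalone word.
    static long countSentencesWithIt(List<String> sentences) {
        return sentences.stream().filter(s -> IT.matcher(s).find()).count();
    }

    public static void main(String[] args) {
        // Hypothetical sample; in practice this would be the preprocessed sentence list.
        List<String> sentences = Arrays.asList(
                "Korea is bordered by China to the northwest.",
                "Since 1945 it has been divided into two parts.",
                "Italy is far away.");
        System.out.println(countSentencesWithIt(sentences) + " of " + sentences.size());
    }
}
```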