🔍 2. Ranked Retrieval Search Algorithm

김지윤 · October 22, 2023

  • Boolean Retrieval can return either too few results or too many.

  • The Boolean Retrieval query syntax is inconvenient for ordinary users.

  • Ranked Retrieval was designed to make up for these shortcomings.


🔍 Query-document matching scores

  • Score how well each document matches the query.

  • The more often a query term appears in a document, the higher the score.

  • The term-document incidence matrix (binary incidence matrix) is extended to record how many times each term occurs.

  • The result is called the count matrix.

  • Bag of words model: a model that ignores where terms appear and considers only how often they appear.
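
To make the count matrix concrete, here is a minimal sketch in Python. The two toy documents and the helper name `count_matrix` are made up for illustration and do not come from the lecture.

```python
from collections import Counter

# Hypothetical toy collection; the texts are invented for illustration.
docs = {
    "d1": "caesar praised brutus and caesar",
    "d2": "brutus left rome",
}

def count_matrix(docs):
    """Return (terms, matrix) where matrix[term][doc_id] is the raw count."""
    # Bag of words: only the number of occurrences is kept, not the positions.
    counts = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
    terms = sorted({t for c in counts.values() for t in c})
    matrix = {t: {doc_id: counts[doc_id][t] for doc_id in docs} for t in terms}
    return terms, matrix

terms, matrix = count_matrix(docs)
for t in terms:
    print(t, matrix[t])
```

Replacing every nonzero count with 1 would give back the binary incidence matrix used in Boolean Retrieval.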




🔍 tf - term frequency

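  • Formula: the weight is computed from tf_{t,d}, the number of times term t occurs in document d.

  • When the query contains several words,
    the results for the individual words are added together.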
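
The summation over query terms can be sketched as follows. This assumes the log-frequency weighting that is common in IR textbooks, 1 + log10(tf) for tf > 0 and 0 otherwise. The exact formula used in the lecture is not reproduced in the text, so treat this form, and the names `log_tf_weight` and `tf_score`, as assumptions.

```python
import math
from collections import Counter

def log_tf_weight(tf):
    """Assumed log-frequency weight: 1 + log10(tf) for tf > 0, otherwise 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def tf_score(query, document):
    """Sum the log-tf weights of every distinct query term found in the document."""
    counts = Counter(document.split())
    return sum(log_tf_weight(counts[t]) for t in set(query.split()))

doc = "caesar praised brutus and caesar"  # made-up example text
print(tf_score("caesar brutus", doc))
print(tf_score("rome", doc))
```

The logarithm dampens the effect of repetition: a term that occurs ten times counts for more than a term that occurs once, but not ten times more.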




🔍 idf

  • A weighting that accounts for stop words (words that have little effect on the search).

  • A term that occurs frequently across the whole collection is a common word.

  • So we compute df, the number of documents in which the term appears.

    ex) In a collection of 100 documents, a term with df = 100 is extremely common, so its idf = log(N/df) = log(100/100) = 0.
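
The example above can be checked with a few lines of Python. The base-10 logarithm is an assumption (any base gives 0 when df equals the collection size), and the function name `idf` is chosen here for illustration.

```python
import math

def idf(df, n_docs):
    """Inverse document frequency log10(N / df); assumes 1 <= df <= n_docs."""
    return math.log10(n_docs / df)

N = 100  # collection size from the example above
for df in (1, 10, 100):
    # The rarer the term (small df), the larger its idf.
    print(df, idf(df, N))
```

A term that appears in every document gets weight 0, which is how idf suppresses stop words without needing an explicit stop list.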




🔍 tf-idf weights (logarithm)

  • In short, the score is computed with a formula that combines the two: tf multiplied by idf.

Ultimately, what drives the ranking is tf. When the query contains several terms, however, idf determines how much each term matters.
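
Written out, one common form of the combined weight looks like this. The log-scaled tf and the base-10 logarithm are assumptions taken from standard IR textbooks, since the lecture's own formula is not reproduced in the text.

```latex
w_{t,d} = \left(1 + \log_{10} \mathrm{tf}_{t,d}\right) \cdot \log_{10}\frac{N}{\mathrm{df}_t}
\quad (\mathrm{tf}_{t,d} > 0), \qquad
\mathrm{Score}(q, d) = \sum_{t \in q \cap d} w_{t,d}
```

For a one-term query the idf factor is the same for every document, so the ordering depends on tf alone. With several query terms, the idf factor decides how much each term contributes to the sum.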




🔍 tf-idf ➡ vector representation

  • ํ•˜๋‚˜์˜ ๋‹จ์–ด๋Š” x,y ๋“ฑ์˜ ํ•˜๋‚˜์˜ ์ถ•์„ ๋‹ด๋‹นํ•œ๋‹ค.

  • document๋Š” ์ ์ด๋‚˜ ๋ฒกํ„ฐ๋กœ ํ‘œํ˜„๋  ์ˆ˜ ์žˆ๋‹ค.

  • ์งˆ์˜์–ด๋„ ๋ฒกํ„ฐ๋กœ ๋ณ€ํ˜•ํ•  ์ˆ˜ ์žˆ๋‹ค.

  • ์งˆ์˜์–ด g์™€ document ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ๊ฐ€ ์งง์„์ˆ˜๋ก ranking์ด ๋†’์•„์ง„๋‹ค. ์œ ์‚ฌ๋„๊ฐ€ ๋†’๋‹ค.
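
As a rough sketch of this idea, the function below maps a text to a vector with one tf-idf component per vocabulary term. The vocabulary, the df values, and the collection size of 1000 are hypothetical, and the weight formula assumes log-scaled tf with a base-10 idf.

```python
import math
from collections import Counter

def tfidf_vector(text, vocabulary, df, n_docs):
    """Map text to a vector with one (assumed log-scaled) tf-idf weight per axis."""
    counts = Counter(text.split())
    vector = []
    for term in vocabulary:  # each vocabulary term is one axis
        tf = counts[term]
        if tf > 0:
            vector.append((1 + math.log10(tf)) * math.log10(n_docs / df[term]))
        else:
            vector.append(0.0)
    return vector

# Hypothetical statistics for a 1000-document collection.
vocabulary = ["brutus", "caesar", "rome"]
df = {"brutus": 10, "caesar": 100, "rome": 1000}

doc_vec = tfidf_vector("caesar praised brutus and caesar", vocabulary, df, 1000)
query_vec = tfidf_vector("brutus caesar", vocabulary, df, 1000)
print(doc_vec)
print(query_vec)
```

Because the document and the query share the same axes, they can be compared directly as vectors.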




🔍 Finding documents similar to the query (computing query-document similarity)

  • The Euclidean distance formula is not a good fit for information retrieval.

    • The more often a word occurs, the farther the vector moves away, even when the content is the same.
  • Instead, the smaller the angle between two vectors, the more similar they are.
    (Look at the angle as seen from the origin.)

    (The figure shows an experiment with a document d' made by repeating the content of d twice.)


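  • In other words, the larger cos θ is, the higher the similarity.

  • Length normalization: divide each vector by its length so that every vector has unit length.

  • The similarity is obtained by multiplying the final weights of the document and the query term by term and adding up all the products.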
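
The d and d' experiment and the normalized dot product can be sketched together. The weight values below are hypothetical, and raw counts are used as weights so that d' is exactly twice d. That is a simplification for illustration, not the lecture's exact setup.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def normalize(v):
    """Length normalization: divide every component by the vector's L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine_similarity(u, v):
    """Multiply the normalized weights term by term and sum the products."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

# Hypothetical weights; d_prime is d with its content repeated twice.
d = [3.0, 1.0, 0.0]
d_prime = [2 * x for x in d]
q = [1.0, 1.0, 0.0]

print(euclidean_distance(d, d_prime))  # nonzero although the content is the same
print(cosine_similarity(d, d_prime))   # approximately 1: the angle between them is zero
print(cosine_similarity(q, d), cosine_similarity(q, d_prime))  # equal up to rounding
```

Since d and d' point in the same direction, the query is equally similar to both under the cosine measure, while Euclidean distance would treat them as different documents.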
