๐Ÿ“– Mistral 7B

oceannยท2024๋…„ 1์›” 27์ผ

๐Ÿ—ž๏ธ ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ

๋ชฉ๋ก ๋ณด๊ธฐ
1/2
post-thumbnail

citation
Jiang, Albert Q., et al. "Mistral 7B."ย arXiv preprint arXiv:2310.06825ย (2023).

Abstract

  • Mistral 7B ์†Œ๊ฐœ
  • Llama 2 13B ์„ฑ๋Šฅ์„ ๋›ฐ์–ด๋„˜์Œ
    • ์ถ”๋ก , ์ˆ˜ํ•™, ์ฝ”๋“œ ์ƒ์„ฑ ์˜์—ญ์—์„œ
  • GQA(Grouped-Query Attention) + SWA(Sliding Window Attention)

1. Introduction

Mistral AI์— ์–ด๋–ค ์˜ํ–ฅ?

GQA

  • ์ถ”๋ก  ์†๋„ ํ–ฅ์ƒ
  • ๋””์ฝ”๋”ฉ ๊ณผ์ •์—์„œ์˜ ๋ฉ”๋ชจ๋ฆฌ ์š”๊ตฌ๋Ÿ‰ ๊ฐ์†Œ
  • ์‹ค์‹œ๊ฐ„ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์˜ ์ค‘์š”ํ•œ ์‚ฌํ•ญ์ธ ๋†’์€ batch size โ†’ ๋†’์€ throughput

SWA

  • ๋” ๊ธด ์‹œํ€€์Šค์— ๋Œ€ํ•ด ์ปดํ“จํŒ… ์ž์›์„ ์ ˆ์•ฝํ•˜๋ฉฐ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ์Œ

2. Architectural Details

๊ธฐ๋ณธ ๊ตฌ์กฐ

transformer ๊ธฐ๋ฐ˜

Table 1: Model Architecture

Sliding Window Attention

  • transformer์˜ ๊ฐ layer๋ฅผ ๋…ธ์ถœํ•˜์—ฌ W(window size)๋ณด๋‹ค ํฐ ์ •๋ณด๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์Œ
  • hi(k๋ฒˆ์งธ layer์˜ i๋ฒˆ์งธ ์œ„์น˜์˜ ์ƒํƒœ)๋Š” i๋ฒˆ์งธ ์œ„์น˜ ์ด์ „ layer์˜ ๋ชจ๋“  ์ˆจ๊ฒจ์ง„ ์ƒํƒœ์™€ ๊ด€๋ จ ์žˆ์Œ
  • ๋ฐ˜๋ณต์ ์œผ๋กœ, hi๋Š” ์ž…๋ ฅ layer์˜ token๋“ค์— W*k๊นŒ์ง€์˜ ์œ„์น˜์— ์ ‘๊ทผ ๊ฐ€๋Šฅ(Figure 1)
  • ๋งˆ์ง€๋ง‰ layer์—์„œ๋Š” W=4096์„ ์‚ฌ์šฉํ•˜์—ฌ ์ด๋ก ์ ์œผ๋กœ ์•ฝ 131K๊ฐœ token์˜ attention span์ด ์žˆ์Œ
  • ์‹ค์ œ๋กœ ์ ์šฉํ–ˆ์„ ๋•Œ, 16K ๊ธธ์ด์˜ ์‹œํ€€์Šค์™€ W=4096์— ๋Œ€ํ•ด์„œ vanilla attention baseline์— ๋น„ํ•ด FlashAttention๊ณผ xFormers์˜ ์†๋„๊ฐ€ 2๋ฐฐ ๊ฐœ์„ ๋จ

Figure 1: Sliding Window Attention
  • vanilla attention์—์„œ์˜ ์—ฐ์‚ฐ๋Ÿ‰์€ ์‹œํ€€์Šค ๊ธธ์ด์— ๋Œ€ํ•ด 4๋ฐฐ, ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์€ ํ† ํฐ์˜ ๊ฐœ์ˆ˜์— ๋Œ€ํ•ด ์„ ํ˜•์ ์œผ๋กœ ์ฆ๊ฐ€
  • cache availability๊ฐ€ ๊ฐ์†Œํ•˜์—ฌ ์ถ”๋ก  ์‹œ๊ฐ„์— ๋Œ€ํ•ด ๋” ํฐ latency์™€ ์ž‘์€ throughput ๋ฐœ์ƒ
  • ํ•˜์ง€๋งŒ, Mistral 7B๋Š” sliding window attention์„ ์‚ฌ์šฉํ•˜์˜€๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ ํ† ํฐ์€ ์ด์ „ layer์—์„œ ์ตœ๋Œ€ W๊ฐœ์˜ token์— ์ ‘๊ทผ ๊ฐ€๋Šฅ
    • ์ด ๋•Œ sliding window ๋ฐ”๊นฅ์ชฝ token๋“ค์€ ์—ฌ์ „ํžˆ ๋‹ค์Œ ๋‹จ์–ด ์˜ˆ์ธก์— ์˜ํ–ฅ์„ ์คŒ
  • ๊ฐ attention layer์—์„œ ์ •๋ณด๊ฐ€ W๊ฐœ์˜ token๋“ค ์•ž์œผ๋กœ ์ด๋™ ๊ฐ€๋Šฅ โ†’ ๋”ฐ๋ผ์„œ k๊ฐœ์˜ attention layers ์ดํ›„์— ์ •๋ณด๊ฐ€ k*W๊ฐœ token๋“ค ์•ž์œผ๋กœ ์ด๋™ ๊ฐ€๋Šฅ

Rolling Buffer Cache

  • fixed attention span: rolling buffer cache๋ฅผ ์‚ฌ์šฉํ•ด์„œ ์ตœ๋Œ€ cache ์‚ฌ์ด์ฆˆ๋ฅผ ์ •ํ•˜๋Š” ๊ธฐ๋ฒ•
Figure 2: Rolling buffer cache
  • cache ํฌ๊ธฐ๊ฐ€ W=4๋กœ ๊ณ ์ •๋˜์–ด ์žˆ์„ ๋•Œ, i๋ฒˆ์งธ ๋‹จ์–ด๋Š” i mod W๋ฒˆ์งธ์˜ cache์— ์ €์žฅ
  • i๊ฐ€ W๋ณด๋‹ค ํฌ๋‹ค๋ฉด ํ•ด๋‹น ์œ„์น˜์˜ cache๋Š” overwritten
  • 32k์˜ ์‹œํ€€์Šค ๊ธธ์ด์— ๋Œ€ํ•ด์„œ ๋ชจ๋ธ ์„ฑ๋Šฅ์„ ํ•ด์น˜์ง€ ์•Š์œผ๋ฉด์„œ cache memory ์‚ฌ์šฉ๋Ÿ‰์„ 8x๋กœ ์ค„์ด๋Š” ํšจ๊ณผ

Pre-fill and Chunking

  • ์‹œํ€€์Šค๋ฅผ ์ƒ์„ฑํ•  ๋•Œ ํ† ํฐ์„ ์ผ๋Œ€์ผ๋กœ ์˜ˆ์ธกํ•˜๋Š”๋ฐ, ์ด ๋•Œ ๊ฐ ํ† ํฐ์€ ์ด์ „ ํ† ํฐ์— ์˜ํ–ฅ์„ ๋ฐ›์Œ
  • ์‚ฌ์ „ ์ž‘์„ฑ๋œ prompt๋ฅผ ํ†ตํ•ด (k, v) cache๋ฅผ ๋ฏธ๋ฆฌ ์ฑ„์šธ ์ˆ˜ ์žˆ์Œ(?)
    • prompt๊ฐ€ ๋งค์šฐ ํฌ๋ฉด ์ž‘์€ ์กฐ๊ฐ(chunk)์œผ๋กœ ๋ฏธ๋ฆฌ ์ž๋ฅธ ๋’ค cache๋ฅผ ๊ฐ chunk๋กœ ์ฑ„์›€
  • ํšจ๊ณผ: chunk size๋กœ window size๋ฅผ ์ •ํ•  ์ˆ˜ ์žˆ์Œ
  • ๊ฐ chunk์— ๋Œ€ํ•ด์„œ cache์™€ chunk์— ๋Œ€ํ•œ attention์„ ๊ณ„์‚ฐํ•ด์•ผ ํ•จ

Figure 3: Pre-fill and chunking

3. Results

Mistral 7B vs. Llama

  • re-run all benchmarks
  • outperformed

Size and Efficiency

  • ์ถ”๋ก , ์ดํ•ด, STEM ์ถ”๋ก ์—์„œ Llama 2์™€ ๋น„๊ตํ•˜์—ฌ 3๋ฐฐ ์ด์ƒ ์ž‘์€ ์‚ฌ์ด์ฆˆ๋กœ ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์คŒ
  • Knowledge ๋ฒค์น˜๋งˆํฌ์—์„œ๋Š” 1.9x ์ •๋„ ๋‚ฎ์€ ์„ฑ๋Šฅ
    • ์ ์€ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ์ €์žฅํ•  ์ˆ˜ ์žˆ๋Š” ์ง€์‹์˜ ์–‘์— ํ•œ๊ณ„ ์กด์žฌ

Evaluation Differences

  • ์ผ๋ถ€ ๋ฒค์น˜๋งˆํฌ์— ๋Œ€ํ•ด์„œ Llama 2์™€ ๋ณธ ๋…ผ๋ฌธ์˜ ํ‰๊ฐ€ ์ง€ํ‘œ์˜ ์ฐจ์ด์ 
    • MBPP์—์„œ hand-verified subset ์‚ฌ์šฉ sanitized-mbpp.json
    • TriviaQA์—์„œ ์œ„ํ‚คํ”ผ๋””์•„๋ฅผ ์ œ๊ณต X

4. Instruction Finetuning

  • ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ์„ ๊ฐ•ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ์ง์ ‘ fine-tuned ๊ฒฐ๊ณผ
  • ๋Œ€์ถฉ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ๋„ ์ข‹์•˜๋‹ค๋Š” ์ด์•ผ๊ธฐ
    • Hugging Face์˜ ์ด ํŽ˜์ด์ง€ ์—์„œ Mistral 7B์™€ Llama 2์˜ ์„ฑ๋Šฅ ์ง์ ‘ ๋น„๊ต ๊ฐ€๋Šฅ

5. Adding Guardrails for Front-Facing Applications

  • AI ์‹œ๋Œ€๊ฐ€ ๋‹ค๊ฐ€์˜ค๋ฉฐ guardrail ๊ฐ•ํ™” ๋Šฅ๋ ฅ์˜ ์ค‘์š”์„ฑ ๋Œ€๋‘

5.1 System prompt to enforce guardrails

Always assist with care, respect, and truth.
Respond with utmost utility yet securely.
Avoid harmful, unethical, prejudiced, or negative content.
Ensure replies promote fairness and positivity.
  • ์œ„์™€ ๊ฐ™์€ 175๊ฐœ์˜ prompt๋ฅผ ์ ์šฉํ•จ์œผ๋กœ์จ ์•ˆ์ •์„ฑ ์ถ”๊ตฌ
  • ๋˜ํ•œ โ€˜How to kill a linux processโ€™๋ผf๋Š” ์งˆ๋ฌธ์— ๋Œ€ํ•ด์„œ ์ ์ ˆํ•œ ์‘๋‹ต ์ œ๊ณต
    • Mistral 7B๋Š” โ€˜killโ€™ ๋ช…๋ น์–ด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฆฌ๋ˆ…์Šค๋ฅผ ์ข…๋ฃŒํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์„ค๋ช…
    • Llama 2๋Š” โ€˜killโ€™์ด๋ผ๋Š” ๋‹จ์–ด๊ฐ€ ๋„๋•์ ์ด์ง€ ๋ชปํ•˜๋‹ค๋ฉฐ ์ ์ ˆํ•˜์ง€ ์•Š์€ ์‘๋‹ต ์ œ๊ณต

5.2 Content moderation with self-reflection

  • prompt ํ˜น์€ ๋ชจ๋ธ์ด ์ƒ์„ฑํ•œ ์‘๋‹ต์— ๋Œ€ํ•ด ์Šค์Šค๋กœ ๊ฒ€์‚ฌํ•  ์ˆ˜ ์žˆ๋„๋ก self-reflection prompt ์ œ์ž‘
  • ๊ฒฐ๊ณผ์ ์œผ๋กœ recall 95.6%, accuracy 99.4% ๋„๋‹ฌ

6. Conclusion

  • Mistral 7B๋Š” ์–ธ์–ด ๋ชจ๋ธ์ด ๊ธฐ์กด์˜ ์ƒ๊ฐ์„ ์••์ถ•ํ•˜๊ธฐ๋ณด๋‹ค ์ง€์‹์„ ์••์ถ•ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ์‹œ์‚ฌํ•œ๋‹ค.
  • ๊ธฐ์กด: 2์ฐจ์›(๋ชจ๋ธ ์„ฑ๋Šฅ, ํ•™์Šต ๋น„์šฉ) ๋‚ด์—์„œ ๊ธฐ์ค€์„ ์ •ํ•˜๋Š” ๊ฒƒ์„ ์ค‘์š”์‹œ
  • Mistral 7B: ์‹ค์ œ์ ์ธ ๋ฌธ์ œ๋Š” 3์ฐจ์›(๋ชจ๋ธ ์„ฑ๋Šฅ, ํ•™์Šต ๋น„์šฉ, ์ถ”๋ก  ๋น„์šฉ)์ด๋ฉฐ, ๊ฐ€๋Šฅํ•œ ์ž‘์€ ๋ชจ๋ธ๋กœ ์ตœ๊ณ ์˜ ์„ฑ๋Šฅ์„ ๋‚ด๋Š” ๊ฒƒ์— ์žˆ๋‹ค.
profile
๐ŸŒˆ๐ŸŒผ๐ŸŒธโ˜€๏ธ

0๊ฐœ์˜ ๋Œ“๊ธ€