
Working with text data

Build a Large Language Model: Chapter 2

💡
This post contains my personal notes from reading Build a Large Language Model.

This chapter covers

  • Preparing text for large language model training

  • Splitting text into word and subword tokens

  • Byte pair encoding, a more advanced way to tokenize text

  • Sampling training examples with a sliding window approach

  • Converting tokens into vectors that feed into a large language model

2.1. Understanding word embeddings

  • ๋ฐ์ดํ„ฐ๋ฅผ ๋ฒกํ„ฐ ํ˜•์‹์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๊ฐœ๋…์„ ์ž„๋ฒ ๋”ฉ์ด๋ผ๊ณ  ํ•จ

  • ์ž„๋ฒ ๋”ฉ ๊ฒฐ๊ณผ ์‹œ๊ฐํ™”

  • ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ํ˜•์‹์—๋Š” ๊ฐ๊ธฐ ๋‹ค๋ฅธ ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ์ด ํ•„์š”

    • e.g. ํ…์ŠคํŠธ ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ์€ ์˜ค๋””์˜ค or ๋น„๋””์˜ค ์ž„๋ฒ ๋”ฉ์— ๋ถ€์ ์ ˆ
  • Word2Vec๊ณผ ๊ฐ™์€ ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ ์“ธ ์ˆ˜๋„ ์žˆ๋Š”๋ฐ, ์ผ๋ฐ˜์ ์œผ๋กœ ์ž์ฒด ์ž„๋ฒ ๋”ฉ์„ ์ƒ์„ฑํ•จ

2.2. Tokenizing text

  • Tokenizing with a regex

      re.split(r'([,.]|\s)', text)
    
  • Input text → Tokenized text → Token IDs → Token embeddings
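
The regex split above can be run end to end; a minimal sketch with a sample sentence of my own:

```python
import re

text = "Hello, world. This is a test."

# Split on commas, periods, and whitespace; the capturing group
# keeps the delimiters in the result instead of discarding them.
tokens = re.split(r'([,.]|\s)', text)

# Drop empty strings and bare whitespace left over from the split.
tokens = [t.strip() for t in tokens if t.strip()]
print(tokens)  # ['Hello', ',', 'world', '.', 'This', 'is', 'a', 'test', '.']
```

Keeping the punctuation as separate tokens is what allows it to get its own token IDs later.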

2.3. Converting tokens into token IDs

  • Mapping to token IDs requires building a vocabulary

  • Vocabulary: the complete set of words/tokens the model can handle

    • Defines how each unique word and special character maps to a unique integer

    • In the chapter's example, tokens are simply sorted alphabetically

  • Dictionary: the lookup table produced by applying this mapping

  • Encoding: converting text into IDs using the dictionary

  • Decoding: converting IDs back into text using the dictionary
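
The vocabulary, encoding, and decoding steps above can be sketched in a few lines (the tiny corpus is my own example, not the book's):

```python
corpus = "the quick brown fox jumps over the lazy dog ."

# Collect unique tokens and sort them alphabetically, as in the
# chapter's example, then assign each one an integer ID.
tokens = corpus.split()
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
inverse_vocab = {i: tok for tok, i in vocab.items()}

def encode(text):
    # Map each token to its integer ID via the vocabulary.
    return [vocab[tok] for tok in text.split()]

def decode(ids):
    # Map IDs back to tokens and rejoin them.
    return " ".join(inverse_vocab[i] for i in ids)

ids = encode("the lazy fox")
print(ids)          # [8, 5, 3]
print(decode(ids))  # round-trips back to "the lazy fox"
```

Encoding and decoding are exact inverses as long as every token in the input appears in the vocabulary, which is what motivates the special tokens in the next section.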

2.4. Adding special context tokens

  • Special tokens are added to handle particular contexts

  • GPT models use only the <|endoftext|> token, for simplicity

    • Such tokens act as markers signaling the start or end of a segment
  • GPT models use byte pair encoding, which breaks words down into subword units

2.5. Byte pair encoding

  • An algorithm that breaks existing words apart

  • Bottom-up: builds the vocabulary incrementally, starting from individual characters

  • Words are decomposed into smaller subwords or individual characters, which are added to the vocabulary

  • Repeatedly finds the most frequent adjacent character pair and merges it into a single symbol

  • tiktoken: A fast BPE tokeniser
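
The merge loop described above can be sketched in plain Python. This is a toy illustration of the classic BPE training procedure over a made-up word-frequency table, not the actual GPT-2 tokenizer:

```python
from collections import Counter

# Toy corpus: words with frequencies, split into characters
# (the bottom-up starting point).
words = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 6}

def most_frequent_pair(words):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for syms, freq in words.items():
        for pair in zip(syms, syms[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)

def merge(words, pair):
    # Merge every occurrence of `pair` into a single new symbol.
    new_words = {}
    for syms, freq in words.items():
        out, i = [], 0
        while i < len(syms):
            if syms[i:i + 2] == pair:
                out.append(syms[i] + syms[i + 1])
                i += 2
            else:
                out.append(syms[i])
                i += 1
        new_words[tuple(out)] = freq
    return new_words

for _ in range(2):
    pair = most_frequent_pair(words)
    words = merge(words, pair)
    print("merged:", pair)
print(words)  # after two merges, "low" has become a single symbol
```

Each merged pair becomes a new vocabulary entry, so frequent words end up as single tokens while rare words remain as subword pieces.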

2.6. Data sampling with a sliding window

  • ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ: ํ…์ŠคํŠธ๋ฅผ ์ผ์ • ํฌ๊ธฐ์˜ ๊ฒน์น˜๋Š” ๊ตฌ๊ฐ„์œผ๋กœ ๋‚˜๋ˆ„์–ด ์ž…๋ ฅ๊ณผ ํƒ€๊ฒŸ์„ ์ƒ์„ฑ

  • max_length: ํ•œ ๋ฒˆ์— ์ฒ˜๋ฆฌํ•  ํ† ํฐ ์ˆ˜, ์ถ”๋ก  ์‹œ ์ปจํ…์ŠคํŠธ ์œˆ๋„์šฐ์˜ ์ƒํ•œ

  • stride: ์œˆ๋„์šฐ๊ฐ€ ์ด๋™ํ•˜๋Š” ๊ฐ„๊ฒฉ

  • stride < max_length์œผ๋กœ ์„ค์ •ํ•˜์—ฌ ๊ณผ์ ํ•ฉ(overfitting)์„ ๋ฐฉ์ง€
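
The sampling above can be sketched without any framework; the toy token IDs are my own stand-in for an encoded text:

```python
# Toy token IDs standing in for an encoded text.
token_ids = list(range(10))  # [0, 1, ..., 9]

def sliding_windows(ids, max_length, stride):
    # Each input chunk is paired with a target chunk shifted one
    # position to the right: the model learns next-token prediction.
    inputs, targets = [], []
    for i in range(0, len(ids) - max_length, stride):
        inputs.append(ids[i:i + max_length])
        targets.append(ids[i + 1:i + max_length + 1])
    return inputs, targets

# stride == max_length: consecutive chunks do not overlap.
x, y = sliding_windows(token_ids, max_length=4, stride=4)
print(x)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(y)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

With stride=1 instead, every window would shift by a single token and the chunks would overlap almost entirely.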

2.7. Creating token embeddings

  • Why embeddings are needed

    • Deep neural networks require continuous vector representations

    • They are trained with the backpropagation algorithm

    • Gradients cannot be computed on integer token IDs

  • How it works

    • Lookup: the token ID selects the corresponding row of the weight matrix

    • e.g. token ID 3 → the 4th row of the weight matrix (0-indexed)

    • An efficient implementation of one-hot encoding followed by matrix multiplication
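
The lookup-equals-one-hot-matmul equivalence can be checked directly with a toy weight matrix (the values below are my own illustration, not trained weights):

```python
# Toy embedding weight matrix: vocab size 4, embedding dimension 3.
weights = [
    [0.1, 0.2, 0.3],  # row 0
    [0.4, 0.5, 0.6],  # row 1
    [0.7, 0.8, 0.9],  # row 2
    [1.0, 1.1, 1.2],  # row 3
]

token_id = 3

# Lookup: select the row directly (what an embedding layer does).
lookup = weights[token_id]

# Equivalent one-hot encoding followed by a matrix multiplication.
one_hot = [1.0 if i == token_id else 0.0 for i in range(len(weights))]
matmul = [sum(one_hot[i] * weights[i][j] for i in range(len(weights)))
          for j in range(len(weights[0]))]

print(lookup)  # [1.0, 1.1, 1.2] -- the 4th row (0-indexed ID 3)
print(matmul)  # the same vector, computed the slow way
```

The direct row selection skips multiplying out all the zeros in the one-hot vector, which is why embedding layers are implemented as lookups.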

2.8. Encoding word positions

  • Limitations of token embeddings

    • The self-attention mechanism by itself cannot perceive the position or order of tokens

    • The embedding layer maps a token ID to the same vector regardless of where it appears

  • The need for positional embeddings

    • Absolute positional embeddings: tied to specific positions in the sequence; a unique embedding for each position is added to the token embedding

    • Relative positional embeddings: focus on the relative distance between tokens, generalizing better to sequence lengths not seen during training