[DataScience] The re Module

Mary · May 16, 2024


📢 What is a regular expression?

A formal language for describing a set of strings that follow a particular rule.

📢 The re module

Python ships with a built-in package called re for working with regular expressions.

⚙️ Usage

import re

txt = "The rain in Spain"
# Matches only if the string starts with "The" and ends with "Spain"
x = re.search(r"^The.*Spain$", txt)

▶️ RegEx Functions

Function  Description
findall   Returns a list containing all matches
search    Returns a Match object if there is a match anywhere in the string
split     Returns a list where the string has been split at each match
sub       Replaces one or many matches with a string

📢 re module functions

⚙️ match(pattern, string, flags)

Checks whether the pattern matches starting at the very beginning of the string. Returns a Match object on success and None otherwise.

import re

print(re.match('a','ab'))
print(re.match('a','bba'))

# <re.Match object; span=(0, 1), match='a'>
# None
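Because match() returns None when the pattern does not match at position 0, the result is usually checked before it is used. A minimal sketch, where the helper name leading_number is made up for illustration:

```python
import re

def leading_number(text):
    """Return the digits at the start of text, or None if there are none."""
    m = re.match(r"\d+", text)
    # re.match returns None when nothing matches at position 0
    return m.group() if m else None

print(leading_number("2024 report"))   # 2024
print(leading_number("report 2024"))   # None
```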

⚙️ search(pattern, string, flags)

Similar to match(), but the pattern does not have to match at the beginning of the string; the first match found anywhere in the string is returned.

import re

print(re.search('a','ab'))
print(re.search('a','bba'))

# <re.Match object; span=(0, 1), match='a'>
# <re.Match object; span=(2, 3), match='a'>
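The difference between the two functions is easiest to see by running the same pattern through both. The sample text below is invented for this sketch:

```python
import re

text = "order id: 4821, backup id: 9377"
pattern = r"\d+"

at_start = re.match(pattern, text)    # only tries position 0
anywhere = re.search(pattern, text)   # scans forward to the first match

print(at_start)           # None
print(anywhere.group())   # 4821
```

Note that search() stops at the first match; the second number is not reported.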

⚙️ findall(pattern, string, flags)

Finds every non-overlapping match of the pattern in the string and returns them as a list.

import re

print(re.findall('a','a'))
print(re.findall('a','aba'))
print(re.findall(r'\d', 'numbers 123 appear 56 like this 8'))
print(re.findall(r'\d+', 'numbers 123 appear 56 like this 8'))

# ['a']
# ['a', 'a']
# ['1', '2', '3', '5', '6', '8']
# ['123', '56', '8']
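One detail worth knowing: when the pattern contains capturing groups, findall() returns the groups rather than the whole match. A small sketch with a made-up input string:

```python
import re

log = "x=10, y=25, z=7"

# No groups: each item is the whole match
whole = re.findall(r"\w=\d+", log)

# One group: each item is just that group
values = re.findall(r"\w=(\d+)", log)

# Two groups: each item is a tuple of the groups
pairs = re.findall(r"(\w)=(\d+)", log)

print(whole)    # ['x=10', 'y=25', 'z=7']
print(values)   # ['10', '25', '7']
print(pairs)    # [('x', '10'), ('y', '25'), ('z', '7')]
```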

⚙️ finditer(pattern, string, flags)

Similar to findall(), but instead of a list of matched strings it returns an iterator that yields Match objects.

import re

re_iter = re.finditer('a', 'baa')
for s in re_iter:
    print(s)

# <re.Match object; span=(1, 2), match='a'>
# <re.Match object; span=(2, 3), match='a'>
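Since each item is a Match object, finditer() is handy when the position of every match is needed as well as its text. A short sketch on an invented string:

```python
import re

text = "cat, bat, hat"

# Collect (matched text, start index) for every match
positions = [(m.group(), m.start()) for m in re.finditer(r"[a-z]at", text)]

print(positions)   # [('cat', 0), ('bat', 5), ('hat', 10)]
```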

⚙️ fullmatch(pattern, string, flags)

Returns a Match object only when the entire string, from start to end, matches the pattern.

match() succeeds as soon as the pattern matches at the beginning of the string, whereas fullmatch() requires the match to cover the whole string.

import re

print(re.fullmatch('a', 'a'))
print(re.fullmatch('a', 'aaa'))
print(re.fullmatch('a', 'ab'))
print(re.fullmatch('a', 'ba'))
print(re.fullmatch('a', 'baa'))


# <re.Match object; span=(0, 1), match='a'>
# None
# None
# None
# None
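A typical use of fullmatch() is input validation, where nothing may come before or after the expected text. The function name is_four_digit_pin is hypothetical and only serves this sketch:

```python
import re

def is_four_digit_pin(value):
    """True only if value consists of exactly four digits and nothing else."""
    return re.fullmatch(r"\d{4}", value) is not None

print(is_four_digit_pin("1234"))     # True
print(is_four_digit_pin("12345"))    # False
print(is_four_digit_pin("1234\n"))   # False: the trailing newline is not allowed
```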

⚙️ split(pattern, string, maxsplit, flags)

Splits the string at every match of the pattern and returns the pieces as a list.

If the third argument (maxsplit) is given, at most that many splits are made and the rest of the string is returned unsplit as the last element.

import re

print(re.split('a', 'abaabca'))
print(re.split('a', 'abaabca', maxsplit=2))


# ['', 'b', '', 'bc', '']
# ['', 'b', 'abca']
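Where re.split() goes beyond str.split() is in splitting on several delimiters at once. The following sketch uses an invented input line; it also shows that a capturing group in the pattern keeps the delimiters in the result:

```python
import re

line = "apple, banana;cherry  date"

# Split on any run of commas, semicolons, or whitespace
fields = re.split(r"[,;\s]+", line)

# With a capturing group, the delimiters are kept in the list
with_delims = re.split(r"([,;])", "a,b;c")

print(fields)        # ['apple', 'banana', 'cherry', 'date']
print(with_delims)   # ['a', ',', 'b', ';', 'c']
```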

⚙️ sub(pattern, replacement, string, count, flags)

Replaces every part of the string that matches the pattern with the second argument (the replacement string).

If count is given, at most that many matches are replaced, working from left to right; anything after that is left unchanged.

import re


print(re.sub('a', 'z', 'ab'))
print(re.sub('a', 'zxc', 'ab'))
print(re.sub('a', 'z', 'aaaab'))
print(re.sub('a', 'z', 'aaaab', count=1))


# zb
# zxcb
# zzzzb
# zaaab
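The replacement does not have to be a fixed string. As a sketch (the sample data and the helper double are made up here), it can refer back to captured groups or be a function that receives each Match object:

```python
import re

# \1, \2, \3 in the replacement refer to the captured groups
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2024-05-16")

def double(m):
    """Return the matched number doubled, as a string."""
    return str(int(m.group()) * 2)

# A function is called once per match and must return the replacement text
doubled = re.sub(r"\d+", double, "3 apples and 12 pears")

print(swapped)   # 16/05/2024
print(doubled)   # 6 apples and 24 pears
```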

⚙️ subn(pattern, replacement, string, count, flags)

Behaves like sub(), but returns a tuple of (new string, number of substitutions made).

import re


print(re.subn('a', 'z', 'ab'))
print(re.subn('a', 'zxc', 'ab'))
print(re.subn('a', 'z', 'aaaab'))
print(re.subn('a', 'z', 'aaaab', count=1))

# ('zb', 1)
# ('zxcb', 1)
# ('zzzzb', 4)
# ('zaaab', 1)
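The count returned by subn() is useful for finding out whether anything was changed at all. A minimal sketch with a hypothetical helper redact_digits:

```python
import re

def redact_digits(text):
    """Replace every digit with '#' and report how many were replaced."""
    cleaned, n = re.subn(r"\d", "#", text)
    return cleaned, n

print(redact_digits("pin 4821"))    # ('pin ####', 4)
print(redact_digits("no digits"))   # ('no digits', 0)
```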

⚙️ compile(pattern, flags)

When the same pattern and flags are used repeatedly, compile() builds a reusable pattern object once. The functions described above are then available as methods on that object.

import re

c = re.compile('a')

print(c.sub('zxc', 'abcdefg'))
print(c.search('vcxdfsa'))


# zxcbcdefg
# <re.Match object; span=(6, 7), match='a'>
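A common arrangement is to compile the pattern once and reuse it inside a loop. The names USER_ID and lines below are assumptions made for this sketch:

```python
import re

# Compile once, outside the loop; flags are fixed at compile time
USER_ID = re.compile(r"[a-z]+\d+", re.IGNORECASE)

lines = ["user42 logged in", "no match here", "Admin7 logged out"]

hits = []
for line in lines:
    m = USER_ID.search(line)
    if m:
        hits.append(m.group())

print(hits)   # ['user42', 'Admin7']
```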

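⚙️ purge()

The re module keeps recently compiled patterns, including those created with compile(), in an internal cache. The cache has a limited capacity (usually described as 100 entries; the exact limit is an implementation detail that differs between Python versions), and cached entries are discarded once the limit is exceeded.

purge() clears this cache immediately, without waiting for the limit to be reached.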

import re


re.purge()

⚙️ escape(pattern)

Takes a string and returns it with the special regular-expression characters escaped with backslashes.

import re


print(re.escape(r'(\d)'))


# \(\\d\)
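escape() matters most when text from outside the program has to be matched literally. In this sketch the variable names and sample text are invented; without escaping, the "+" is read as "one or more" and the wrong text is found:

```python
import re

user_input = "1+1=2"
text = "We know that 1+1=2, not 11=2."

# Unescaped: "1+" means "one or more 1s", so this finds "11=2"
unescaped = re.findall(user_input, text)

# Escaped: every character is matched literally
escaped = re.findall(re.escape(user_input), text)

print(unescaped)   # ['11=2']
print(escaped)     # ['1+1=2']
```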

⚙️ Match object methods

match(), search(), and fullmatch() return a Match object (or None when nothing matches), and finditer() yields Match objects. findall(), split(), sub(), and subn() return plain lists, strings, or tuples instead.

A Match object provides methods such as group(), start(), end(), and span(), which return the matched text or its position in the string.

For example, once search() has found a match:

  • group() returns the matched text

  • start() returns the index where the match begins

  • end() returns the index just past the end of the match

  • span() returns both as a (start, end) tuple

import re

result = re.search('aa', 'baab')
print(result.group())
print(result.start())
print(result.end())
print(result.span())


# aa
# 1
# 3
# (1, 3)
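The indices follow Python's slicing convention, so slicing the original string with span() gives back the same text as group(). A brief sketch:

```python
import re

text = "baab"
m = re.search("aa", text)

# end() is exclusive, exactly like the stop index of a slice
start, end = m.span()
same_text = text[start:end]

print(same_text == m.group())   # True
```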

⚙️ groupdict()

To use groupdict(), the groups in the pattern must be given names.

A group is named with the (?P<name>...) syntax. The parentheses are required here as well; leaving them out raises an error.

import re

result = re.match(r'(?P<front>\d{2})-(?P<middle>\d{3,4})-(?P<rear>\d{4})', '02-123-1234')

print(result.groupdict())
print(result.groups())
print(result.group(1))
print(result.group('front'))


# {'front': '02', 'middle': '123', 'rear': '1234'}
# ('02', '123', '1234')
# 02
# 02
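Named groups make it straightforward to turn a match into structured data. The pattern DATE and the function parse_date are hypothetical names chosen for this sketch:

```python
import re

DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

def parse_date(text):
    """Return {'year': ..., 'month': ..., 'day': ...} as ints, or None."""
    m = DATE.fullmatch(text)
    if m is None:
        return None
    # groupdict() maps each group name to the text it captured
    return {name: int(value) for name, value in m.groupdict().items()}

print(parse_date("2024-05-16"))   # {'year': 2024, 'month': 5, 'day': 16}
print(parse_date("16/05/2024"))   # None
```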

📢 Flags (= modifiers)

The last argument of the functions above accepts flags. The re module supports the following flags:

Acronym  Name        Description
re.I     IGNORECASE  Matches letters without regard to case
re.M     MULTILINE   ^ and $ also match at the start and end of each line
re.S     DOTALL      The . metacharacter also matches newline characters
re.X     VERBOSE     Ignores whitespace inside the pattern
re.U     UNICODE     Uses Unicode rules for character classes

Example usage

import re

s = """
c
b
A
"""
print(re.search('a', s, re.M|re.I))


# <re.Match object; span=(5, 6), match='A'>

📌 To use several flags at once, combine them with |.
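The effect of each flag is clearest when the same pattern is run with and without it. The sample text and the phone pattern below are assumptions made for this sketch:

```python
import re

text = "first line\nSecond line\nthird line"

# re.M: ^ matches at the start of every line, not only the whole string
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, re.M)

# re.S: "." also matches "\n", so ".*" can cross line boundaries
dot_default = re.search(r"first.*third", text)
dot_all = re.search(r"first.*third", text, re.S)

# Flags combined with |: line-anchored and case-insensitive
s_lines = re.findall(r"^s\w+", text, re.M | re.I)

# re.X: whitespace in the pattern is ignored and # starts a comment
phone = re.compile(r"""
    (\d{2})    # area code
    -
    (\d{3,4})  # middle part
    -
    (\d{4})    # last four digits
""", re.X)

print(starts_default)     # ['first']
print(starts_multiline)   # ['first', 'Second', 'third']
print(s_lines)            # ['Second']
print(phone.fullmatch("02-123-1234").groups())   # ('02', '123', '1234')
```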


