import re [regex] - Regular Expressions

star_is_mine · December 10, 2022

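Commonly used character classes

Character class : Description
\d : a digit; same as [0-9].
\D : a non-digit; same as [^0-9].
\w : a word character (letter, digit, or underscore); same as [a-zA-Z0-9_]. 💖💖💖
\W : a non-word character; same as [^a-zA-Z0-9_].
\s : whitespace; same as [ \t\n\r\f\v].
\S : non-whitespace; same as [^ \t\n\r\f\v].
\b : word boundary (the boundary between \w and \W)
\B : non-word boundary

Note: for str patterns in Python 3 these equivalents are exact only with the re.ASCII flag. By default \d, \w, and \s also match Unicode digits, letters (including Hangul), and whitespace.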
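
As a quick check of the classes above, here is a minimal sketch that runs a few of them through re.findall. The sample string and the variable names are made up for illustration.

```python
import re

sample = "user_1 paid 30 dollars"  # made-up sample string

digits = re.findall(r"\d+", sample)            # runs of digits
words = re.findall(r"\w+", sample)             # runs of word characters (underscore included)
spaces = re.findall(r"\s", sample)             # individual whitespace characters
whole_word = re.findall(r"\bpaid\b", sample)   # "paid" only as a whole word

print(digits)       # ['1', '30']
print(words)        # ['user_1', 'paid', '30', 'dollars']
print(len(spaces))  # 3
print(whole_word)   # ['paid']
```

Note that user_1 stays in one piece because \w includes the underscore.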

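. Any character

. : matches any single character except the newline character \n (with the re.DOTALL flag it matches \n as well).
When . is used inside [], as in [.], it keeps its literal meaning: a period ( . ).

a.b # means 'a + any character + b'

aab # the a between a and b counts as "any character", so it matches
a0b # the 0 between a and b counts as "any character", so it matches
abc # there is no character between a and b, so it does not match

a[.]b Unlike the pattern above, this does not mean any character between a and b; it means a literal period (.).

a.b # there is a period between a and b, so it matches
a0b # there is no period between a and b, so it does not match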
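
The difference between the two patterns can be checked with a small sketch like the one below. It uses re.fullmatch so that the whole candidate string has to fit the pattern; the candidate list is just the examples from this section.

```python
import re

candidates = ["aab", "a0b", "abc", "a.b"]

# a.b : any single character between a and b
any_char = [s for s in candidates if re.fullmatch(r"a.b", s)]

# a[.]b : a literal period between a and b
literal_dot = [s for s in candidates if re.fullmatch(r"a[.]b", s)]

# a\.b : escaping the dot is another way to write the same thing
escaped_dot = [s for s in candidates if re.fullmatch(r"a\.b", s)]

print(any_char)     # ['aab', 'a0b', 'a.b']
print(literal_dot)  # ['a.b']
print(escaped_dot)  # ['a.b']
```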

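How to remove only special characters from a string

re.sub(pattern, replacement, string) replaces every part of string that matches the regular expression pattern with replacement. If the replacement is an empty string (""), the matching characters are simply removed.

The examples below remove only the special characters from a string. (More precisely, they remove every character that is not Hangul, an English letter, a digit, or whitespace.)

A ^ at the start of a character class [...] means "not"
\uAC00-\uD7A3 : all Hangul syllables (가-힣)
a-z : lowercase English letters
A-Z : uppercase English letters
0-9 : digits
\s : whitespace (spaces, tabs, newlines)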

import re

# Example 1
text = "AA**BB#@$CC 가나다-123"
new_text = re.sub(r"[^\uAC00-\uD7A30-9a-zA-Z\s]", "", text)
print(new_text)
# Output: AABBCC 가나다123

# Example 2
text = "🌱영어회화 | 기초 영어 - 문장 만들기-  Lesson 002"
new_text = re.sub(r"[^\uAC00-\uD7A30-9a-zA-Z\s]", "", text)
print(new_text)
# Output: 영어회화  기초 영어  문장 만들기  Lesson 002

re.sub(r"[^\uAC00-\uD7A30-9a-zA-Z\s]", "", text)

The statement above can be read as follows.

In the string text, look for everything in the set below!

\uAC00-\uD7A3 - all Hangul syllables
a-z - all lowercase letters
A-Z - all uppercase letters
0-9 - all digits
\s - whitespace

Once all of those are accounted for,
^ - not: everything other than what was found gets replaced with the empty string "". That is the whole instruction.
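
Example 2 leaves double spaces where the removed symbols used to be. As a possible follow-up, the sketch below wraps the same character class in a helper and then collapses runs of whitespace. The names NOT_ALLOWED and strip_special are hypothetical, and the whitespace cleanup is an extra step that is not part of the original examples.

```python
import re

# Same character class as above: anything that is NOT a Hangul syllable,
# a digit, an ASCII letter, or whitespace
NOT_ALLOWED = re.compile(r"[^\uAC00-\uD7A30-9a-zA-Z\s]")

def strip_special(text):
    """Hypothetical helper: drop special characters, then tidy whitespace."""
    cleaned = NOT_ALLOWED.sub("", text)
    # Collapse each run of whitespace into one space and trim both ends
    return re.sub(r"\s+", " ", cleaned).strip()

print(strip_special("AA**BB#@$CC 가나다-123"))
# AABBCC 가나다123
print(strip_special("🌱영어회화 | 기초 영어 - 문장 만들기-  Lesson 002"))
# 영어회화 기초 영어 문장 만들기 Lesson 002
```

Compiling the pattern once with re.compile is mostly a readability choice here; it also avoids repeating the long character class.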

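How to extract only the date

Extracting a date in the form YYYY-MM-DD

import re
from datetime import datetime

text = "Posted on 2022-12-10 by star_is_mine"  # sample input, not part of the original snippet
match = re.search(r'\d{4}-\d{2}-\d{2}', text)
if match:  # re.search returns None when nothing matches
    date = datetime.strptime(match.group(), '%Y-%m-%d').date()
    print(date)  # 2022-12-10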
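
The pattern \d{4}-\d{2}-\d{2} only checks the shape of the text, so something like 2022-13-45 also matches. A minimal sketch of one way to collect every date in a string and skip the impossible ones is shown below; extract_dates and the log string are hypothetical names and data chosen for illustration.

```python
import re
from datetime import datetime

def extract_dates(text):
    """Hypothetical helper: return every valid YYYY-MM-DD date found in text."""
    dates = []
    for candidate in re.findall(r"\d{4}-\d{2}-\d{2}", text):
        try:
            dates.append(datetime.strptime(candidate, "%Y-%m-%d").date())
        except ValueError:
            # Fits the digit pattern but is not a real calendar date
            continue
    return dates

log = "created 2022-12-10, updated 2022-13-45, closed 2023-01-05"
print([d.isoformat() for d in extract_dates(log)])
# ['2022-12-10', '2023-01-05']
```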

How to extract an email address

text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'\w+@\w+', text)
if match:
    print(match.group())  ## 'b@google'
# \w does not match the hyphen (-) or the period (.), so this pattern loses
# the 'alice-' before the hyphen and the '.com' at the end. See the version below.

match = re.search(r'[\w.-]+@[\w.-]+', text)
if match:
    print(match.group())  ## 'alice-b@google.com'


# Let's learn just a 'little' more from the code above.
# This is the match.group feature.
text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
    print(match.group())   ## 'alice-b@google.com' (the whole match)
    print(match.group(1))  ## 'alice-b' (the username, group 1)
    print(match.group(2))  ## 'google.com' (the host, group 2)

When a single string contains several email addresses

## Suppose we have a text with many email addresses
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w.-]+@[\w.-]+', text) ## ['alice@google.com', 'bob@abc.com']
for email in emails:
    # do something with each found email string
    print(email)
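
re.findall can also be combined with the capture groups shown earlier. When the pattern contains groups, findall returns a list of tuples, one item per group, instead of a list of whole matches. A small sketch, with variable names chosen for illustration:

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# Two groups in the pattern -> findall returns (username, host) tuples
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)
print(pairs)  # [('alice', 'google.com'), ('bob', 'abc.com')]

for user, host in pairs:
    print(user, host)
```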

Finding a text pattern inside a file

# Open the file; the with block closes it automatically when done
with open('test.txt', 'r') as f:
    # Feed the file text into findall(); it returns a list of all the found strings
    strings = re.findall(r'some pattern', f.read())
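
Reading the whole file with f.read() is fine for small files. For larger files, or when the line number of each hit matters, one option is to scan line by line. The sketch below writes a throwaway temporary file so that it runs on its own; the file contents and the choice of the email pattern are assumptions for illustration.

```python
import os
import re
import tempfile

# Write a small throwaway file so the sketch is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first line\ncontact: alice@google.com\nno match here\nbob@abc.com wrote back\n")
    path = tmp.name

pattern = re.compile(r"[\w.-]+@[\w.-]+")
hits = []
with open(path, encoding="utf-8") as f:
    # Iterating over the file object yields one line at a time
    for line_no, line in enumerate(f, start=1):
        for found in pattern.findall(line):
            hits.append((line_no, found))

os.remove(path)  # clean up the temporary file
print(hits)  # [(2, 'alice@google.com'), (4, 'bob@abc.com')]
```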

Checking whether a string starts with a given prefix

This one can actually be done in plain Python, without a regex.

s = "Techie"
word = "Tech"

res = s.startswith(word)
print(res)    # True

The regex version looks like this.

import re
 
if __name__ == '__main__':
 
    s = "Techie"
 
    res = re.match(r'^Tech', s) is not None
    print(res)    # True
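
re.match only tries the pattern at the beginning of the string, which is why it works as a "starts with" check (and why the ^ in the pattern is optional there). re.search, by contrast, scans the whole string. A short sketch of the difference, with variable names made up for illustration:

```python
import re

s = "Techie"

starts_with = re.match(r"Tech", s) is not None      # pattern tried at position 0 only
match_middle = re.match(r"chie", s) is not None     # "chie" is not at the start
search_middle = re.search(r"chie", s) is not None   # search looks anywhere in the string
search_anchored = re.search(r"^chie", s) is not None  # ^ pins search to the start again

print(starts_with)      # True
print(match_middle)     # False
print(search_middle)    # True
print(search_anchored)  # False
```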

💖 Highly recommended: PythonVerbalExpressions

Link

I learned this from 코딩애플 (Coding Apple).
See the original YouTube video.

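# Assumes the PythonVerbalExpressions package is installed; its README imports VerEx like this
from verbalexpressions import VerEx

# Create an example of how to test for correctly formed URLs
verbal_expression = VerEx()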
tester = (verbal_expression.
            start_of_line().
            find('http').
            maybe('s').
            find('://').
            maybe('www.').
            anything_but(' ').
            end_of_line()
)

# Create an example URL
test_url = "https://www.google.com"

# Test if the URL is valid
if tester.match(test_url):
    print("Valid URL")

# Print the generated regex
print(tester.source()) # => ^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$
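
For comparison, the generated regex shown in the last comment can be used directly with the standard re module. The sketch below copies that pattern as-is and tries it on a few URLs; the candidate list and the variable names are made up for illustration, and the pattern is only as strict as the VerEx chain above (it accepts anything without a space after the scheme).

```python
import re

# Pattern copied from the tester.source() output above
url_pattern = re.compile(r"^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$")

candidates = [
    "https://www.google.com",
    "http://example.org",
    "ftp://example.org",     # wrong scheme
    "https://bad url.com",   # contains a space
]

results = {url: bool(url_pattern.match(url)) for url in candidates}
for url, ok in results.items():
    print(url, ok)
```

The first two candidates match, and the last two do not.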