[MIT] Missing Semester Week 4: Data Wrangling

승톨 · January 8, 2021

regex

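Summary

Data wrangling: massaging data from one format into the shape you actually want.

The data may be text, or it may be binary data such as images.

Working with regular expressions

Suppose you are reading server logs and only want the entries for a particular user. How do you get them? By manipulating the text.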

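journalctl | grep ssh | grep "Disconnected from" | less : filters the log down to the entries of interest (here, disconnections) and pages the result.

less is a pager: it shows long output one screen at a time instead of dumping everything to the terminal. (To do: look into it in more detail.)

cat ssh.log | less : opens a scrollable view of the file (the navigation feels like vim).

cat ssh.log | sed 's/.*Disconnected from//' : deletes everything up to and including "Disconnected from" on each line and prints the rest. The quoted argument is a sed expression.

The s (substitute) expression is delimited by slashes: the first part is the search pattern and the second is the replacement string. Here the replacement is empty.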

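echo 'aba' | sed 's/[ab]//' : replaces the first a or b on the line with nothing.

output: ba

echo 'abac' | sed 's/[ab]//g' : the g flag replaces every a or b on the line, however many there are.

output: c

echo 'abcaba' | sed 's/\(ab\)*//g' : replaces zero or more consecutive occurrences of 'ab' with nothing. Without -E, sed treats ( and ) as literal characters, so the backslashes are what give the parentheses their grouping meaning.

output: ca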

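cat ssh.log | sed -E 's/^.*Disconnected from (invalid )?user (.*) [0-9.]+ port [0-9]+$//' : ^ anchors the match at the start of the line and $ at the end. ? makes the preceding item optional (zero or one occurrence), so (invalid )? matches whether or not "invalid " is present.

Parentheses create a capture group. Whatever a group matched can be referenced in the replacement (the part after the second slash) as \1, \2, and so on. For example, \2 replaces the whole matched line with just the contents of the second group.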
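A minimal sketch of the capture-group idea, assuming the regex is written with `^.*` and a single `(.*)` group for the user name. The two log lines are made-up samples shaped the way that regex expects (not real logs), and `sample_log` and `usernames` are names introduced only for this illustration.

```shell
# Made-up sample lines in the shape the regex expects (not real log output).
sample_log='Jan 17 03:13:00 host sshd[2631]: Disconnected from invalid user admin 203.0.113.5 port 50424
Jan 17 03:14:10 host sshd[2640]: Disconnected from user alice 198.51.100.7 port 40022'

# \2 keeps only what the second group (the user name) matched.
usernames=$(printf '%s\n' "$sample_log" |
  sed -E 's/^.*Disconnected from (invalid )?user (.*) [0-9.]+ port [0-9]+$/\2/')

printf '%s\n' "$usernames"
```

Because the whole line is matched from ^ to $, replacing it with \2 leaves nothing but the user name on each line.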

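uniq -c : uniq collapses adjacent duplicate lines into one. -c also prefixes each remaining line with the number of times it occurred.

How do we find out which user names were used most often?

cat ssh.log | sed -E 's/^.*Disconnected from (invalid )?user (.*) [0-9.]+ port [0-9]+$/\2/' | sort | uniq -c | sort -nk1,1 : -n sorts numerically, which works here because the first column (the count from uniq -c) is a number. -k selects the sort key by column; 1,1 means the key starts at the first column and ends at the first column. sort orders ascending by default, so the most frequent names come last. With no options at all, sort compares lines as text using the locale's collation order.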
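The counting half of that pipeline can be tried without a log file. This is a small sketch on a hypothetical list of user names (the values in `users` are invented for illustration), showing that the most frequent name ends up on the last line.

```shell
# Hypothetical user names, one per line, standing in for the sed output.
users='root
admin
root
guest
root
admin'

# sort groups equal names together, uniq -c counts each group,
# and sort -nk1,1 orders the groups by that count (ascending).
counts=$(printf '%s\n' "$users" | sort | uniq -c | sort -nk1,1)
printf '%s\n' "$counts"

# The last line holds the most frequent name; awk picks its second column.
most_common=$(printf '%s\n' "$counts" | tail -n1 | awk '{print $2}')
printf 'most common: %s\n' "$most_common"
```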

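awk '{print $2}' : awk is a tool for column-oriented data. sed edits its input line by line as a stream, whereas awk splits each line into whitespace-separated fields and lets you refer to them as $1, $2, and so on. This command prints the second column.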
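A short sketch of awk's column handling, using hypothetical "count name" rows like the ones uniq -c prints (the rows and the variable names are assumptions for illustration). It also shows that a condition placed before the braces filters which rows get printed.

```shell
# Hypothetical "count name" rows, with the leading spaces uniq -c adds.
rows='      1 guest
      2 admin
      3 root'

# Print the second whitespace-separated field of every row.
names=$(printf '%s\n' "$rows" | awk '{print $2}')

# A pattern before the action filters rows: keep names whose count is at least 2.
frequent=$(printf '%s\n' "$rows" | awk '$1 >= 2 {print $2}')

printf '%s\n' "$names"
printf '%s\n' "$frequent"
```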

paste -sd, : paste joins lines together. -s merges all input lines into a single line, and -d, uses a comma as the delimiter.
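A minimal sketch of paste on a hypothetical list of names. The trailing `-` is an addition for portability: it names standard input explicitly, which some paste implementations require.

```shell
# Hypothetical names, one per line.
names='guest
admin
root'

# -s joins all lines into one; -d, puts a comma between them.
joined=$(printf '%s\n' "$names" | paste -sd, -)
printf '%s\n' "$joined"
```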

gnuplot : a command-line program for plotting graphs.

Takeaway from this lecture: when raw data is not in the shape you want, reach for these tools with confidence and reshape it.

Exercises

Take this short interactive regex tutorial.
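- Lesson 2 was interesting.
  - To match the four-character strings that end in a period, use the pattern ...\. (any three characters followed by a literal period).
- Lesson 4: all the words end in og, but bog has to be excluded, so use [^b]og.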
Find the number of words (in /usr/share/dict/words) that contain at least three as and don’t have a 's ending. What are the three most common last two letters of those words?
sed’s y command, or the tr program, may help you with case insensitivity. How many of those two-letter combinations are there? And for a challenge: which combinations do not occur?

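/usr/share/dict/words may not exist on your machine. In that case, searching for /usr/share/dict/words in Chrome turns up copies of the word list that you can save locally.

Task: count the words in that file that contain at least three a's and do not end in 's.

tr is a program that translates or deletes characters.
- tr "[:upper:]" "[:lower:]" : converts uppercase letters to lowercase.
- grep -E "^([^a]*a){3}.*$" : matches words that contain at least three a's. The group ([^a]*a) matches any run of non-a characters followed by an a, and {3} repeats it three times between the start (^) and end ($) of the word.
- grep -v "'s$" : keeps only the words that do not end in 's. (-v inverts the match, printing only the lines that do not match the pattern.) The apostrophe needs no backslash inside double quotes; GNU grep can interpret \' as an anchor rather than a literal apostrophe, so it is safer to leave the backslash out.
Putting it together: cat words | tr "[:upper:]" "[:lower:]" | grep -E "^([^a]*a){3}.*$" | grep -v "'s$" (append | wc -l to get the count)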
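To check the filter without the real dictionary, here is a sketch that runs the same steps on a tiny hypothetical word list (the six words and the variable names are made up for illustration). The apostrophe in the last grep is written without a backslash.

```shell
# Hypothetical stand-in for /usr/share/dict/words.
words="Anagram
banana's
Alabama
apple
Caravan
bazaar's"

# Lowercase, keep words with at least three a's, drop words ending in 's.
matched=$(printf '%s\n' "$words" |
  tr '[:upper:]' '[:lower:]' |
  grep -E '^([^a]*a){3}.*$' |
  grep -v "'s$")

# wc -l counts the surviving words; tr strips the padding some wc versions add.
count=$(printf '%s\n' "$matched" | wc -l | tr -d ' ')

printf '%s\n' "$matched"
printf 'count: %s\n' "$count"
```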

Task: find the three most common last-two-letter endings among those words.

sed -E "s/.*([a-z]{2})$/\1/" : keeps only the last two letters of each word, using a capture group.
sort | uniq -c | sort -n | tail -n3 : sorts alphabetically, collapses duplicates while counting them, sorts by that count, and takes the last three lines.
sed -E "s/.*([a-z]{2})$/\1/" | sort | uniq -c | sort -n | tail -n3
- Without the sort before uniq, the duplicates are not fully collapsed. This follows from how uniq works, as its man page notes:
Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use 'sort -u' without 'uniq'.
Also, comparisons honor the rules specified by 'LC_COLLATE'.
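The adjacency rule is easy to see on a small input. A sketch, assuming a hypothetical list of two-letter endings in which the repeated value never appears on consecutive lines:

```shell
# Hypothetical endings; "an" occurs three times but never on adjacent lines.
endings='an
ma
an
am
an'

# Without sorting, uniq sees no adjacent duplicates and keeps every line.
unsorted_groups=$(printf '%s\n' "$endings" | uniq -c | wc -l | tr -d ' ')

# After sorting, equal lines sit next to each other and collapse into one group each.
sorted_groups=$(printf '%s\n' "$endings" | sort | uniq -c | wc -l | tr -d ' ')

printf 'groups without sort: %s\n' "$unsorted_groups"
printf 'groups with sort: %s\n' "$sorted_groups"
```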

Which combinations do not occur? : skipped.

How many of those two-letter combinations are there? : skipped.
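One possible approach to these two questions, offered only as a sketch and not as the course's solution: generate every pair from aa to zz and compare it against the endings that were actually seen, using comm. The `seen` list below is a hypothetical stand-in for the output of the sed step, and the temporary file names are arbitrary.

```shell
# Hypothetical endings that do occur (stand-in for the sed output).
seen='an
ma
an
am'

tmpdir=$(mktemp -d)

# All 26 * 26 lowercase pairs, generated with awk so no bash-only syntax is needed.
awk 'BEGIN { for (i = 97; i <= 122; i++) for (j = 97; j <= 122; j++) printf "%c%c\n", i, j }' |
  sort > "$tmpdir/all"

# Distinct endings that occur, sorted so comm can compare the two files.
printf '%s\n' "$seen" | sort -u > "$tmpdir/seen"

# How many distinct combinations occur.
distinct=$(wc -l < "$tmpdir/seen" | tr -d ' ')

# comm -23 prints the lines found only in the first file: pairs that never occur.
missing=$(comm -23 "$tmpdir/all" "$tmpdir/seen" | wc -l | tr -d ' ')

rm -r "$tmpdir"
printf 'distinct: %s, missing: %s\n' "$distinct" "$missing"
```

Dropping the final wc -l from the comm line would list the missing combinations themselves instead of counting them.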

To do in-place substitution it is quite tempting to do something like sed s/REGEX/SUBSTITUTION/ input.txt > input.txt. However this is a bad idea, why? Is this particular to sed? Use man sed to find out how to accomplish this.
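- The shell sets up the > input.txt redirection, truncating the file, before sed starts reading it, so sed sees an empty file and the original contents are lost. This is not specific to sed; any command whose output is redirected to its own input file behaves the same way. It is also a good reason to work on a copy of the original data.
- sed's -i option edits the file in place, and giving it a suffix (for example -i.bak) keeps a backup of the original.

Reference: Quick note about sed's edit in place option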
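Both behaviors can be observed safely in a scratch directory. This sketch assumes a sed that accepts -i with an attached suffix (GNU and BSD sed both do); the file contents and variable names are invented for illustration.

```shell
tmpdir=$(mktemp -d)
printf 'hello world\n' > "$tmpdir/input.txt"

# The shell truncates input.txt for the redirection before sed reads it,
# so sed gets an empty file and writes nothing back.
sed 's/hello/goodbye/' "$tmpdir/input.txt" > "$tmpdir/input.txt"
size_after_redirect=$(wc -c < "$tmpdir/input.txt" | tr -d ' ')

# Recreate the file and edit it in place, keeping a backup with a .bak suffix.
printf 'hello world\n' > "$tmpdir/input.txt"
sed -i.bak 's/hello/goodbye/' "$tmpdir/input.txt"
edited=$(cat "$tmpdir/input.txt")
backup=$(cat "$tmpdir/input.txt.bak")

rm -r "$tmpdir"
printf 'size after redirect: %s\n' "$size_after_redirect"
printf 'edited: %s\nbackup: %s\n' "$edited" "$backup"
```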

Find your average, median, and max system boot time over the last ten boots. Use journalctl on Linux and log show on macOS, and look for log timestamps near the beginning and end of each boot. On Linux, they may look something like:

Logs begin at ...

and

systemd[577]: Startup finished in ...

On macOS, look for:

=== system boot:

and

Previous shutdown cause: 5

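Final command: journalctl | grep -E "systemd\[577\]: Startup finished in" > bootlog (the brackets are escaped because an unescaped [577] is a character class matching a single 5 or 7, not the literal text [577]; 577 is the PID from the exercise's example and may differ on your machine)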

`cat bootlog | awk '{print $2}' | sed -E "s/^.*:(.*)/\1/" | xargs -n2 | awk '{print $2"-"$1}' | bc | R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'`
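If R is not installed, the summary step can be approximated with sort and awk. This is a swapped-in technique, not the command above: a sketch that assumes the boot times have already been extracted as one number of seconds per line. The values in `times` are hypothetical, not measurements.

```shell
# Hypothetical boot times in seconds, one per line (not measured values).
times='12.5
9.0
15.5
10.0
11.0'

# Sort numerically so the median is the middle element and the max is the last one.
stats=$(printf '%s\n' "$times" | sort -n | awk '
  { v[NR] = $1; sum += $1 }
  END {
    # Odd count: middle value. Even count: mean of the two middle values.
    median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
    printf "avg=%.2f median=%.2f max=%.2f\n", sum / NR, median, v[NR]
  }')

printf '%s\n' "$stats"
```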

Look for boot messages that are not shared between your past three reboots (see journalctl’s b flag). Break this task down into multiple steps. First, find a way to get just the logs from the past three boots. There may be an applicable flag on the tool you use to extract the boot logs, or you can use sed '0,/STRING/d' to remove all lines previous to one that matches STRING. Next, remove any parts of the line that always varies (like the timestamp). Then, de-duplicate the input lines and keep a count of each one (uniq is your friend). And finally, eliminate any line whose count is 3 (since it was shared among all the boots).
- Skipped: I have never rebooted my AWS virtual machine.

Find an online data set like this one, this one, or maybe one from here. Fetch it using curl and extract out just two columns of numerical data. If you’re fetching HTML data, [pup](https://github.com/EricChiang/pup) might be helpful. For JSON data, try [jq](https://stedolan.github.io/jq/). Find the min and max of one column in a single command, and the sum of the difference between the two columns in another.
- skip
