Embulk is a data loader that uses plugins to pull data from many kinds of data sources and load it in parallel.
curl --create-dirs -o ~/.embulk/bin/embulk -L "https://dl.embulk.org/embulk-latest.jar"
chmod +x ~/.embulk/bin/embulk
echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
Verify the installation:
embulk -version
embulk example ./try1
Running the command above generates the example files that Embulk provides.
2022-06-28 21:20:02.682 +0900: Embulk v0.9.24
Creating ./try1 directory...
Creating ./try1/
Creating ./try1/csv/
Creating ./try1/csv/sample_01.csv.gz
Creating ./try1/seed.yml
Run following subcommands to try embulk:
1. embulk guess ./try1/seed.yml -o config.yml
2. embulk preview config.yml
3. embulk run config.yml
You can see that seed.yml and sample_01.csv.gz have been created.
The seed.yml file contains the following:
in:                 # source: where to read from
  type: file        # type of the source
  path_prefix: '/Users/jongpil-won/./try1/csv/sample_' # path prefix of the input files
out:                # target: where to load into
  type: stdout      # type of the target
Now let's go through the subcommands that were printed after running embulk example.
Run following subcommands to try embulk:
1. embulk guess ./try1/seed.yml -o config.yml
2. embulk preview config.yml
3. embulk run config.yml
embulk guess ./try1/seed.yml -o config.yml
jongpil-won@jongpil-won ~ % embulk guess ./try1/seed.yml -o config.yml
2022-06-28 21:26:56.972 +0900: Embulk v0.9.24
2022-06-28 21:26:57.652 +0900 [WARN] (main): DEPRECATION: JRuby org.jruby.embed.ScriptingContainer is directly injected.
2022-06-28 21:26:59.874 +0900 [INFO] (main): Gem's home and path are set by default: "/Users/jongpil-won/.embulk/lib/gems"
2022-06-28 21:27:00.865 +0900 [INFO] (main): Started Embulk v0.9.24
2022-06-28 21:27:00.978 +0900 [INFO] (0001:guess): Listing local files at directory '/Users/jongpil-won/./try1/csv' filtering filename by prefix 'sample_'
2022-06-28 21:27:00.979 +0900 [INFO] (0001:guess): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2022-06-28 21:27:00.980 +0900 [INFO] (0001:guess): Loading files [/Users/jongpil-won/./try1/csv/sample_01.csv.gz]
2022-06-28 21:27:00.992 +0900 [INFO] (0001:guess): Try to read 32,768 bytes from input source
2022-06-28 21:27:01.108 +0900 [INFO] (0001:guess): Loaded plugin embulk (0.9.24)
2022-06-28 21:27:01.130 +0900 [INFO] (0001:guess): Loaded plugin embulk (0.9.24)
2022-06-28 21:27:01.154 +0900 [INFO] (0001:guess): Loaded plugin embulk (0.9.24)
2022-06-28 21:27:01.175 +0900 [INFO] (0001:guess): Loaded plugin embulk (0.9.24)
in:
  type: file
  path_prefix: /Users/jongpil-won/./try1/csv/sample_
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    null_string: 'NULL'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out: {type: stdout}
Created 'config.yml' file.
The generated config.yml file looks like this:
in: # Input plugin options.
  type: file
  path_prefix: /Users/jongpil-won/./try1/csv/sample_
  decoders:
  - {type: gzip} # If the input is file-based, decoder plugin decodes compression or encryption (built-in gzip, bzip2, zip, tar.gz, etc).
  parser: # If the input is file-based, parser plugin parses a file format (built-in csv, json, etc).
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    null_string: 'NULL'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out: {type: stdout} # Output plugin options.
As you can see, the column names and types, along with the other parser settings, have been filled in automatically.
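To make the guessed column definitions more concrete, here is a minimal Python sketch that applies the same types and timestamp formats to one row of the sample data. It is not Embulk's CSV parser; `COLUMNS` and `parse_row` are hypothetical names introduced only for illustration.

```python
from datetime import datetime

# Column definitions copied from the guessed config.yml.
COLUMNS = [
    {"name": "id", "type": "long"},
    {"name": "account", "type": "long"},
    {"name": "time", "type": "timestamp", "format": "%Y-%m-%d %H:%M:%S"},
    {"name": "purchase", "type": "timestamp", "format": "%Y%m%d"},
    {"name": "comment", "type": "string"},
]


def parse_row(fields):
    """Convert raw CSV fields into typed values (illustrative only)."""
    record = {}
    for column, raw in zip(COLUMNS, fields):
        if column["type"] == "long":
            record[column["name"]] = int(raw)
        elif column["type"] == "timestamp":
            # The format string tells the parser how to read the timestamp.
            record[column["name"]] = datetime.strptime(raw, column["format"])
        else:
            record[column["name"]] = raw
    return record


# First data row of sample_01.csv.gz, as shown in the run output.
row = parse_row(["1", "32864", "2015-01-27 19:23:49", "20150127", "embulk"])
print(row["time"].isoformat(), row["purchase"].date())
# prints: 2015-01-27T19:23:49 2015-01-27
```

The point is only that each `format` string describes how the raw text of that column is read; the real parser also handles quoting, null strings, and time zones.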
embulk preview try1/config.yml
This previews the result of running config.yml: it reads a sample buffer from the input source and prints it to the console.
2022-06-28 21:27:43.919 +0900: Embulk v0.9.24
2022-06-28 21:27:44.601 +0900 [WARN] (main): DEPRECATION: JRuby org.jruby.embed.ScriptingContainer is directly injected.
2022-06-28 21:27:46.543 +0900 [INFO] (main): Gem's home and path are set by default: "/Users/jongpil-won/.embulk/lib/gems"
2022-06-28 21:27:47.513 +0900 [INFO] (main): Started Embulk v0.9.24
2022-06-28 21:27:47.609 +0900 [INFO] (0001:preview): Listing local files at directory '/Users/jongpil-won/./try1/csv' filtering filename by prefix 'sample_'
2022-06-28 21:27:47.610 +0900 [INFO] (0001:preview): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2022-06-28 21:27:47.611 +0900 [INFO] (0001:preview): Loading files [/Users/jongpil-won/./try1/csv/sample_01.csv.gz]
2022-06-28 21:27:47.617 +0900 [INFO] (0001:preview): Try to read 32,768 bytes from input source
+---------+--------------+-------------------------+-------------------------+----------------------------+
| id:long | account:long | time:timestamp | purchase:timestamp | comment:string |
+---------+--------------+-------------------------+-------------------------+----------------------------+
| 1 | 32,864 | 2015-01-27 19:23:49 UTC | 2015-01-27 00:00:00 UTC | embulk |
| 2 | 14,824 | 2015-01-27 19:01:23 UTC | 2015-01-27 00:00:00 UTC | embulk jruby |
| 3 | 27,559 | 2015-01-28 02:20:02 UTC | 2015-01-28 00:00:00 UTC | Embulk "csv" parser plugin |
| 4 | 11,270 | 2015-01-29 11:54:36 UTC | 2015-01-29 00:00:00 UTC | |
+---------+--------------+-------------------------+-------------------------+----------------------------+
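By default the preview reads a 32 KB (32,768-byte) sample from the input, but this can be changed with the preview_sample_buffer_bytes setting.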
exec:
  preview_sample_buffer_bytes: 65536 # 64KB
in:
  type: ...
  ...
out:
  type: ...
  ...
embulk run try1/config.yml executes the config file and produces the actual output.
2022-06-28 21:28:03.938 +0900: Embulk v0.9.24
2022-06-28 21:28:04.894 +0900 [WARN] (main): DEPRECATION: JRuby org.jruby.embed.ScriptingContainer is directly injected.
2022-06-28 21:28:08.307 +0900 [INFO] (main): Gem's home and path are set by default: "/Users/jongpil-won/.embulk/lib/gems"
2022-06-28 21:28:10.362 +0900 [INFO] (main): Started Embulk v0.9.24
2022-06-28 21:28:10.523 +0900 [INFO] (0001:transaction): Listing local files at directory '/Users/jongpil-won/./try1/csv' filtering filename by prefix 'sample_'
2022-06-28 21:28:10.524 +0900 [INFO] (0001:transaction): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2022-06-28 21:28:10.526 +0900 [INFO] (0001:transaction): Loading files [/Users/jongpil-won/./try1/csv/sample_01.csv.gz]
2022-06-28 21:28:10.639 +0900 [INFO] (0001:transaction): Using local thread executor with max_threads=20 / output tasks 10 = input tasks 1 * 10
2022-06-28 21:28:10.647 +0900 [INFO] (0001:transaction): {done: 0 / 1, running: 0}
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,Embulk "csv" parser plugin
4,11270,2015-01-29 11:54:36,20150129,
2022-06-28 21:28:10.773 +0900 [INFO] (0001:transaction): {done: 1 / 1, running: 0}
2022-06-28 21:28:10.783 +0900 [INFO] (main): Committed.
2022-06-28 21:28:10.783 +0900 [INFO] (main): Next config diff: {"in":{"last_path":"/Users/jongpil-won/./try1/csv/sample_01.csv.gz"},"out":{}}
The last line of the output above contains the setting last_path: /Users/jongpil-won/./try1/csv/sample_01.csv.gz. Files are tracked by name, so if this last_path value is set in the config file, the next run of that config reads only the files that come after sample_01.csv.gz.
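The config diff on that last log line is plain JSON after the "Next config diff: " marker. As a minimal sketch, assuming the log line format shown in the output above, it can be extracted like this; `extract_config_diff` is a hypothetical helper, not part of Embulk.

```python
import json

MARKER = "Next config diff: "


def extract_config_diff(log_line):
    """Return the config diff dict embedded in an Embulk log line."""
    _, _, payload = log_line.partition(MARKER)
    if not payload:
        raise ValueError("no config diff found in log line")
    return json.loads(payload)


# Last line of the `embulk run` output shown above.
log_line = (
    "2022-06-28 21:28:10.783 +0900 [INFO] (main): Next config diff: "
    '{"in":{"last_path":"/Users/jongpil-won/./try1/csv/sample_01.csv.gz"},"out":{}}'
)
diff = extract_config_diff(log_line)
print(diff["in"]["last_path"])
# prints: /Users/jongpil-won/./try1/csv/sample_01.csv.gz
```

In practice there is no need to parse the log by hand, because the -c option writes the same diff to a file.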
embulk run try1/config.yml -c try1/diff.yml
diff.yml
in: {last_path: /Users/jongpil-won/./try1/csv/sample_01.csv.gz}
out: {}
in:
  type: file
  path_prefix: /path/to/files/sample_
  last_path: /path/to/files/sample_01.csv # Put the value stored in try1/diff.yml here and the config can be reused without further edits.
  parser:
    ...
`-- path
    `-- to
        `-- files
            |-- sample_01.csv -> skip
            |-- sample_02.csv -> read
            |-- sample_03.csv -> read
            |-- sample_04.csv -> read
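The skip/read behavior in the tree above can be sketched in a few lines of Python. This is an illustration only, assuming that files are matched by prefix and compared with last_path in plain lexicographic order; `files_to_read` is a hypothetical name, not an Embulk function.

```python
def files_to_read(paths, path_prefix, last_path=None):
    """Return the paths that match the prefix and sort after last_path.

    Assumption: plain string comparison stands in for the file ordering.
    """
    matching = sorted(p for p in paths if p.startswith(path_prefix))
    if last_path is None:
        return matching
    return [p for p in matching if p > last_path]


paths = [f"/path/to/files/sample_0{i}.csv" for i in range(1, 5)]
for path in files_to_read(paths, "/path/to/files/sample_", "/path/to/files/sample_01.csv"):
    print(path)
# prints:
# /path/to/files/sample_02.csv
# /path/to/files/sample_03.csv
# /path/to/files/sample_04.csv
```

This is also why file names built from zero-padded numbers or dates work well with last_path: their string order matches the order in which they were created.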
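Using variables
- Embulk can embed variables in a config file through the Liquid template engine.
- The official site notes that this feature may be removed or changed in the future.
- A file to be included must have '_' prefixed to its file name.
- To include the file _diff.yml.liquid, write {% include 'diff' %}.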
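The naming rule for included files can be written down as a small Python sketch. It assumes that the partial is looked up relative to the including config file, gets a '_' prefix, and ends with .yml.liquid; `include_target` and the path try1/config.yml.liquid are hypothetical and used only for illustration.

```python
from pathlib import PurePosixPath


def include_target(config_path, include_name):
    """Map an include name to the partial file it refers to.

    Assumption: the partial lives relative to the including config,
    has '_' prefixed to its file name, and ends with '.yml.liquid'.
    """
    name = PurePosixPath(include_name)
    partial = name.with_name("_" + name.name + ".yml.liquid")
    return PurePosixPath(config_path).parent / partial


# {% include 'diff' %} written inside try1/config.yml.liquid
print(include_target("try1/config.yml.liquid", "diff"))
# prints: try1/_diff.yml.liquid
```

Note that the include tag uses the bare name ('diff'), while the file on disk carries both the '_' prefix and the .yml.liquid suffix.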