Grunt
: an interactive command-line interpreter for Pig
Script
: a Pig script saved to a file and run from the command line
Ambari/Hue

Similar to the Map stage of MapReduce
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS
(userID:int, movieID:int, rating:int, ratingTime:int);
DUMP ratings;
-- sample output: (660,229,2,891406212)
-- syntax: {relation_name} = LOAD '{HDFS_path}' AS {schema};
{relation_name}
: name of the relation
LOAD
: function used to load data from disk
{HDFS_path}
: path in HDFS of the file to read
{schema}
: how the data should be read in (field names and types)
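To see what the schema contributes, here is a minimal sketch of the same LOAD without an AS clause. Pig still reads the file, but fields have no names, default to bytearray, and can only be referenced by position ($0, $1, ...). The relation names rawRatings and userMoviePairs are illustrative and not part of the original notes.

```pig
-- Assumption: same tab-delimited u.data file as above.
-- No AS clause, so fields are unnamed and default to bytearray.
rawRatings = LOAD '/user/maria_dev/ml-100k/u.data';

-- Without a schema, fields are addressed by position:
-- $0 is the first column, $1 the second. Casts give them types.
userMoviePairs = FOREACH rawRatings GENERATE (int)$0 AS userID, (int)$1 AS movieID;

-- Check which names and types Pig ended up with.
DESCRIBE userMoviePairs;
```

Declaring the schema in LOAD, as the notes do, avoids the casts and makes later statements easier to read.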
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING
PigStorage('|') AS (movieID:int, movieTitle:chararray,
releaseDate:chararray, videoRelease:chararray,
imdbLink:chararray);
DUMP metadata;
-- sample output: (1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%20(1995))
nameLookup = FOREACH metadata GENERATE movieID, movieTitle,
ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')) AS releaseTime;
DUMP nameLookup;
-- sample output: (1,Toy Story (1995),788918400)
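USING PigStorage('delimiter')
: specifies the field delimiter (the default is a tab). The u.item file is delimited by |, so Pig has to be told.
DUMP
: prints the result; mostly used for debugging

A stage similar to MapReduce's shuffle and sort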
ratingsByMovie = GROUP ratings BY movieID;
DUMP ratingsByMovie;
-- sample output: (1,{(807,1,4,892528231),(554,1,3,876231938), … })
DESCRIBE ratingsByMovie;
-- output: ratingsByMovie: {group: int,ratings: {(userID: int,movieID: int,rating: int,ratingTime: int)}}
GROUP {relation} BY {attribute};
DESCRIBE
: shows the structure and types of a relation

Similar to the Reduce stage of MapReduce
avgRatings = FOREACH ratingsByMovie GENERATE group AS movieID, AVG(ratings.rating) AS avgRating;
DUMP avgRatings;
-- sample output: (1,3.8783185840707963)
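The same FOREACH ... GENERATE pattern works with other built-in aggregates. As a sketch, assuming ratingsByMovie from the GROUP step, the following computes a rating count next to the average; ratingStats and numRatings are illustrative names.

```pig
-- Assumes ratingsByMovie from the GROUP step above.
-- COUNT takes the whole bag; AVG takes one projected field of the bag.
ratingStats = FOREACH ratingsByMovie GENERATE
    group AS movieID,
    COUNT(ratings) AS numRatings,
    AVG(ratings.rating) AS avgRating;

DESCRIBE ratingStats;
```

Keeping the count alongside the average makes it possible to filter out movies with only a handful of ratings later.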
fiveStarMovies = FILTER avgRatings BY avgRating > 4.0;
FILTER
: keeps only the rows of a relation that satisfy a boolean expression
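A short sketch of FILTER with compound conditions, assuming the avgRatings and ratings relations defined earlier; the names midRatedMovies and lowRatings are illustrative.

```pig
-- Assumes avgRatings and ratings from the earlier steps.
-- Conditions can be combined with AND, OR and NOT.
midRatedMovies = FILTER avgRatings BY avgRating >= 3.0 AND avgRating <= 4.0;

-- FILTER works on any relation, not only on aggregated ones.
lowRatings = FILTER ratings BY rating == 1 OR rating == 2;
```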
fiveStarsWithData = JOIN fiveStarMovies BY movieID, nameLookup BY movieID;
DESCRIBE fiveStarsWithData;
-- output: fiveStarsWithData: {fiveStarMovies::movieID: int,fiveStarMovies::avgRating: double,
-- nameLookup::movieID: int,nameLookup::movieTitle: chararray,nameLookup::releaseTime: long}
JOIN ~ BY ~ (join type), ~ BY ~
(join type)
: can be LEFT OUTER, RIGHT OUTER, or FULL OUTER (the OUTER keyword is optional)
oldestFiveStarMovies = ORDER fiveStarsWithData BY
nameLookup::releaseTime;
DUMP oldestFiveStarMovies;
-- sample output: (493,4.15,493,Thin Man, The (1934),-1136073600)
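The join types and ORDER can be combined in other ways. The sketch below, assuming nameLookup and avgRatings from the earlier steps, uses a LEFT OUTER join and a descending sort; allMovies, bestFirst and topTen are illustrative names.

```pig
-- Assumes nameLookup and avgRatings from the earlier steps.
-- LEFT OUTER keeps every movie in nameLookup, even those with no ratings;
-- the avgRatings fields are null for such rows.
allMovies = JOIN nameLookup BY movieID LEFT OUTER, avgRatings BY movieID;

-- ORDER sorts ascending by default; DESC reverses it.
bestFirst = ORDER allMovies BY avgRatings::avgRating DESC;

-- LIMIT keeps the output small when inspecting results.
topTen = LIMIT bestFirst 10;
DUMP topTen;
```

After a join, fields are disambiguated with the relation::field form, which is why the sort key is written as avgRatings::avgRating.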
STORE
DISTINCT
FOREACH/GENERATE
MAPREDUCE
STREAM
SAMPLE
COGROUP
CROSS
: Cartesian product
CUBE
RANK
: assigns a rank number to each row
LIMIT
: creates a new relation limited to a given number of rows, for when you don't want to dump the entire thing
UNION
: combines two relations into one
SPLIT
: splits a relation into more than one relation

EXPLAIN
: gives some insight into how Pig intends to execute a given query
ILLUSTRATE
: takes a sample from each relation and shows exactly what Pig does with each piece of data

REGISTER
: imports a JAR file that contains user-defined functions
DEFINE
: assigns names to those functions so you can use them within your Pig scripts
IMPORT
: imports macros, reusable bits of Pig code saved in a separate file, into other Pig scripts

TextLoader
JsonLoader
: JSON
AvroStorage
ParquetLoader
: column-oriented data format
OrcStorage
: compressed format
HBaseStorage
: HBase
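Several of the operators listed above can be sketched on the relations from the earlier steps. This is only an illustration: all new relation names, the macro above_threshold and the output path are hypothetical.

```pig
-- Assumes ratings and avgRatings from the earlier steps.

-- DISTINCT removes duplicate rows; here, the set of users who rated anything.
userIDs = FOREACH ratings GENERATE userID;
uniqueUsers = DISTINCT userIDs;

-- SAMPLE keeps roughly the given fraction of rows (about 10% here).
someRatings = SAMPLE ratings 0.1;

-- SPLIT routes rows into several relations by condition.
SPLIT ratings INTO lowRatings IF rating <= 2, highRatings IF rating >= 4;

-- UNION puts two relations with compatible schemas back together.
extremeRatings = UNION lowRatings, highRatings;

-- RANK adds a rank column in front of each row.
rankedMovies = RANK avgRatings BY avgRating DESC;

-- A macro (hypothetical) defined in the same script; it could also live in
-- its own file and be pulled in with IMPORT.
DEFINE above_threshold(rel, threshold) RETURNS result {
    $result = FILTER $rel BY avgRating > $threshold;
};
goodMovies = above_threshold(avgRatings, 4.0);

-- EXPLAIN prints the execution plan; ILLUSTRATE walks sample rows through it.
EXPLAIN avgRatings;
ILLUSTRATE avgRatings;

-- STORE writes a relation to HDFS as comma-separated text (hypothetical path).
STORE rankedMovies INTO '/user/maria_dev/output/rankedMovies' USING PigStorage(',');
```

Unlike DUMP, STORE persists the result, so it is the usual last statement of a script that is run from the command line rather than from Grunt.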