pySpark1 - Word count

박성현·2024년 6월 2일

pySpark

목록 보기

2/17

word.txt

hello world
hello world
hello world
hello world
hello world


import pyspark


sc = pyspark.SparkContext.getOrCreate()
test_file = "file:///home/jovyan/work/sample/word.txt"
text_file = sc.textFile(test_file)

lines = text_file.flatMap(lambda line : line.split(" ")).map(lambda a : (a,1)).reduceByKey(lambda a, b : a+b)
lines.collect()

[('hello', 5), ('world', 5)]

step 1

a= text_file.flatMap(lambda line : line.split(" "))
print(a.collect())

flatMap(): RDD의 각 요소에 함수를 적용한 후, 결과를 다시 평평하게 펼쳐 새로운 RDD를 만듭니다.
lambda line: line.split(" "): 무명 함수 (lambda 함수)로 각 라인(line)을 공백(" ")을 기준으로 단어 단위로 분리합니다.

['hello', 'world', 'hello', 'world', 'hello', 'world', 'hello', 'world', 'hello', 'world']

step 2

a = text_file.flatMap(lambda line : line.split(" ")).map(lambda a : (a,1))
print(a.collect())

map(lambda a: (a, 1))

map: RDD의 각 요소에 함수를 적용하여 새로운 RDD를 만듭니다.
lambda a: (a, 1): 각 단어(a)를 받아 (단어, 1) 형태의 튜플로 변환하는 무명 함수입니다.
이는 각 단어가 한 번 등장했다는 것을 의미하며, 추후 reduceByKey 단계에서 단어별 빈도를 계산하는 데 사용됩니다.

[('hello', 1), ('world', 1), ('hello', 1), ('world', 1), ('hello', 1), ('world', 1), ('hello', 1), ('world', 1), ('hello', 1), ('world', 1)]

step 3

a = text_file.flatMap(lambda line : line.split(" ")).map(lambda a : (a,1)).reduceByKey(lambda a, b : a+b)
print(a.collect())

.reduceByKey(lambda a, b: a + b):

reduceByKey: 동일한 키(여기서는 단어)를 가진 값들을 함수를 사용하여 통합합니다.
lambda a, b: a + b: 동일한 단어에 대한 값(1)들을 더하여 해당 단어의 총 빈도를 계산하는 무명 함수입니다.
이 과정을 통해 각 단어별로 빈도가 집계된 RDD가 생성됩니다. (예: [("Hello", 1), ("world", 1), ("Hello", 1)] -> [("Hello", 2), ("world", 1)])

[('hello', 5), ('world', 5)]

박성현

다소Good한 데이터 엔지니어

이전 포스트

시작하기 Apache Spark with Docker

다음 포스트