[Modern Java in Action] Collecting Data with Streams

신명철 · October 26, 2022

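Introduction

Whenever I used streams, I always finished with collect(Collectors.toList()) or collect(Collectors.toSet()). I already felt that I knew little about streams, but reading the Collector material in Modern Java in Action reminded me how much I was missing. So I wrote this post to organize what I studied while reading the book.

What is a Collector?

Collector is the interface that is passed as the argument to collect, a terminal operation of a stream.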

public interface Collector<T, A, R> {
	Supplier<A> supplier();
	BiConsumer<A, T> accumulator();
	Function<A, R> finisher();
	BinaryOperator<A> combiner();
	Set<Characteristics> characteristics();
}

You can build your own custom Collector by implementing this interface, or you can use the ready-made ones through the static factory methods of the Collectors class.

Collectors

The features of Collectors fall into three broad categories, depending on how they gather the data.

Summarization

counting
maxBy, minBy
summingInt, summingLong, summingDouble
averagingInt, averagingLong, averagingDouble
summarizingInt, summarizingLong, summarizingDouble
joining
toList, toSet, toCollection

Grouping

groupingBy, collectingAndThen

Partitioning

partitioningBy

The collectors in the first category, summarization, are easy to use and mostly familiar, so I will skip them here.
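For reference only, here is a minimal sketch of a few of the summarization collectors listed above. The word list is made-up sample data and the class name SummarizingSketch is hypothetical; neither comes from the book.

```java
import java.util.Arrays;
import java.util.IntSummaryStatistics;
import java.util.List;
import java.util.stream.Collectors;

class SummarizingSketch {
    // Hypothetical sample data, chosen only for illustration.
    static final List<String> WORDS = Arrays.asList("stream", "collector", "map");

    static long count() {
        return WORDS.stream().collect(Collectors.counting());
    }

    static int totalLength() {
        return WORDS.stream().collect(Collectors.summingInt(String::length));
    }

    static double averageLength() {
        return WORDS.stream().collect(Collectors.averagingInt(String::length));
    }

    static IntSummaryStatistics stats() {
        // Count, sum, min, max and average in a single pass.
        return WORDS.stream().collect(Collectors.summarizingInt(String::length));
    }

    static String joined() {
        return WORDS.stream().collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        System.out.println(count());          // 3
        System.out.println(totalLength());    // 18
        System.out.println(averageLength());  // 6.0
        System.out.println(stats().getMax()); // 9
        System.out.println(joined());         // stream, collector, map
    }
}
```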

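1. Grouping

Grouping means bundling elements together according to some key value.

groupingBy

Modern Java in Action goes beyond simply grouping a stream: it also shows how to manipulate the grouped elements and how to group on multiple levels.

1-1. Simple grouping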

public static void main(String[] args) {
		List<Dish> menu = Arrays.asList(
			new Dish("불고기", DishType.MEAT),
			new Dish("제육볶음", DishType.MEAT),
			new Dish("매운탕", DishType.FISH),
			new Dish("연어덮밥", DishType.FISH)
		);
		
		Map<DishType, List<Dish>> dishes = menu.stream()
				.collect(Collectors.groupingBy(Dish::getType));
}

{MEAT=[불고기, 제육볶음], FISH=[매운탕, 연어덮밥]}

You might instead want to group the dishes by calories rather than by DishType.

Map<CaloricalLevel, List<Dish>> result = menu.stream()
				.collect(Collectors.groupingBy(dish -> {
					if(dish.getCalory() <= 400) return CaloricalLevel.DIET;
					if(dish.getCalory() <= 700) return CaloricalLevel.NORMAL;
					else return CaloricalLevel.FAT;
				}));
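The post does not show the Dish, DishType, or CaloricalLevel definitions, so the following is a self-contained sketch under my own assumptions: a Dish with a name, a type, and a calorie count, romanized dish names, and invented calorie values chosen to be consistent with the results printed in this post. It also moves the classification lambda into a named method, which tends to read better once the rule has several branches.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class CaloricGroupingSketch {
    enum DishType { MEAT, FISH }
    enum CaloricalLevel { DIET, NORMAL, FAT }

    // Assumed shape of Dish; the calorie values below are invented.
    static final class Dish {
        private final String name;
        private final DishType type;
        private final int calory;

        Dish(String name, DishType type, int calory) {
            this.name = name;
            this.type = type;
            this.calory = calory;
        }

        DishType getType() { return type; }
        int getCalory() { return calory; }
        @Override public String toString() { return name; }
    }

    static List<Dish> menu() {
        return Arrays.asList(
            new Dish("Bulgogi", DishType.MEAT, 750),
            new Dish("Jeyuk-bokkeum", DishType.MEAT, 800),
            new Dish("Maeuntang", DishType.FISH, 450),
            new Dish("Yeoneo-deopbap", DishType.FISH, 350));
    }

    // Same thresholds as the lambda above, extracted into a named method.
    static CaloricalLevel levelOf(Dish dish) {
        if (dish.getCalory() <= 400) return CaloricalLevel.DIET;
        if (dish.getCalory() <= 700) return CaloricalLevel.NORMAL;
        return CaloricalLevel.FAT;
    }

    static Map<CaloricalLevel, List<Dish>> byLevel(List<Dish> menu) {
        return menu.stream()
            .collect(Collectors.groupingBy(CaloricGroupingSketch::levelOf));
    }

    public static void main(String[] args) {
        // Key order in the printed map is not guaranteed.
        System.out.println(byLevel(menu()));
    }
}
```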

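Sometimes you also need to manipulate the elements of each resulting group after grouping.

1-2. Manipulating grouped elements

Suppose we want to build a new collection that contains only the dishes above a certain calorie level.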

Map<DishType, List<Dish>> result = menu.stream()
				.filter(dish -> dish.getCalory() > 500)
                .collect(groupingBy(Dish::getType));
                
>>> Result
{MEAT = [불고기, 제육볶음]}

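The first idea that comes to mind is probably to filter the data and then apply groupingBy. This approach has one problem, though: dishes at or below 500 calories are discarded before grouping, so a type whose dishes are all filtered out (FISH here) disappears from the result map altogether.

This can be solved by passing a Collector as the second argument of groupingBy.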

Map<DishType, List<Dish>> result = menu.stream()
				.collect(groupingBy(Dish::getType,
						 filtering(dish -> dish.getCalory() > 500, toList())));

>>> Result
{MEAT = [불고기, 제육볶음], FISH = []}

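The filtering method takes a Predicate and a downstream Collector. Because it runs as the downstream collector of groupingBy, the key for each DishType is created first and the predicate is then applied to the elements inside each group. That is why a group whose elements were all filtered out still appears, with an empty list.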
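To make the difference concrete, here is a small side-by-side sketch. The Dish shape, the romanized dish names, and the calorie values are my own assumptions for illustration; Collectors.filtering requires Java 9 or later.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class FilteringSketch {
    enum DishType { MEAT, FISH }

    // Assumed shape of Dish; the calorie values below are invented.
    static final class Dish {
        private final String name;
        private final DishType type;
        private final int calory;

        Dish(String name, DishType type, int calory) {
            this.name = name;
            this.type = type;
            this.calory = calory;
        }

        DishType getType() { return type; }
        int getCalory() { return calory; }
        @Override public String toString() { return name; }
    }

    static List<Dish> menu() {
        return Arrays.asList(
            new Dish("Bulgogi", DishType.MEAT, 750),
            new Dish("Jeyuk-bokkeum", DishType.MEAT, 800),
            new Dish("Maeuntang", DishType.FISH, 450),
            new Dish("Yeoneo-deopbap", DishType.FISH, 350));
    }

    // filter() runs before grouping, so FISH never reaches groupingBy.
    static Map<DishType, List<Dish>> filterThenGroup(List<Dish> menu) {
        return menu.stream()
            .filter(dish -> dish.getCalory() > 500)
            .collect(Collectors.groupingBy(Dish::getType));
    }

    // filtering() runs inside each group, so the FISH key survives with an empty list.
    static Map<DishType, List<Dish>> groupThenFilter(List<Dish> menu) {
        return menu.stream()
            .collect(Collectors.groupingBy(Dish::getType,
                Collectors.filtering(dish -> dish.getCalory() > 500, Collectors.toList())));
    }

    public static void main(String[] args) {
        System.out.println(filterThenGroup(menu()).containsKey(DishType.FISH)); // false
        System.out.println(groupThenFilter(menu()).containsKey(DishType.FISH)); // true
    }
}
```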

Instead of filtering, you can pass mapping to transform the elements.

Suppose the final result should hold a list of dish prices as the value for each DishType. That is, the type becomes Map<DishType, List<Integer>> rather than Map<DishType, List<Dish>>.

Map<DishType, List<Integer>> result = menu.stream()
				.collect(groupingBy(Dish::getType,
						 mapping(Dish::getPrice, toList())));

This time, suppose we are given the following map from dish name to tags and need to extract the tags of the dishes per DishType. Flattening two levels of lists into one requires a flatMap-style operation.

Map<String, List<String>> dishTags = new HashMap<>();
		dishTags.put("불고기", Arrays.asList("sweet"));
		dishTags.put("제육볶음", Arrays.asList("salty", "sweet", "spicy"));
		dishTags.put("매운탕", Arrays.asList("peppery", "salty"));
		dishTags.put("연어덮밥", Arrays.asList("rich"));

Now let's collect all the tags of the dishes that share a DishType.
flatMapping flattens the two levels of lists into one.

Map<DishType, Set<String>> result = menu.stream()
				.collect(groupingBy(Dish::getType,
						 flatMapping(dish -> dishTags.get(dish.getName()).stream(), toSet())));
                         
>>> Result
{MEAT = [sweet, salty, spicy], FISH = [peppery, salty, rich]}

flatMapping takes two arguments: a Function that turns each element into a Stream, and a Collector that gathers the flattened elements.
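Below is a runnable sketch of the same idea, assuming a Dish that only has a name and a type, with romanized dish names. One deliberate difference from the snippet above: it uses getOrDefault so that a dish without a tag entry contributes an empty stream instead of causing a NullPointerException. Collectors.flatMapping requires Java 9 or later.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class FlatMappingSketch {
    enum DishType { MEAT, FISH }

    // Assumed minimal shape of Dish for this example.
    static final class Dish {
        private final String name;
        private final DishType type;

        Dish(String name, DishType type) {
            this.name = name;
            this.type = type;
        }

        String getName() { return name; }
        DishType getType() { return type; }
    }

    static List<Dish> menu() {
        return Arrays.asList(
            new Dish("Bulgogi", DishType.MEAT),
            new Dish("Jeyuk-bokkeum", DishType.MEAT),
            new Dish("Maeuntang", DishType.FISH),
            new Dish("Yeoneo-deopbap", DishType.FISH));
    }

    static Map<String, List<String>> dishTags() {
        Map<String, List<String>> tags = new HashMap<>();
        tags.put("Bulgogi", Arrays.asList("sweet"));
        tags.put("Jeyuk-bokkeum", Arrays.asList("salty", "sweet", "spicy"));
        tags.put("Maeuntang", Arrays.asList("peppery", "salty"));
        tags.put("Yeoneo-deopbap", Arrays.asList("rich"));
        return tags;
    }

    // Each dish becomes a stream of its tags; toSet() removes duplicates per type.
    static Map<DishType, Set<String>> tagsByType(List<Dish> menu, Map<String, List<String>> tags) {
        return menu.stream()
            .collect(Collectors.groupingBy(Dish::getType,
                Collectors.flatMapping(
                    dish -> tags.getOrDefault(dish.getName(), Collections.emptyList()).stream(),
                    Collectors.toSet())));
    }

    public static void main(String[] args) {
        // Order inside each set and between keys is not guaranteed.
        System.out.println(tagsByType(menu(), dishTags()));
    }
}
```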

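1-3. Multilevel grouping

Collectors.groupingBy accepts a classification function and a collector. So by passing an inner groupingBy, which defines a second criterion for classifying the stream's items, to the outer groupingBy, you can group the items on two levels.

In other words, instead of Map<DishType, List<Dish>> we end up with Map<DishType, Map<CaloricalLevel, List<Dish>>>. Let's look at the code.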

Map<DishType, Map<CaloricalLevel, List<Dish>>> result = menu.stream()
				.collect(groupingBy(Dish::getType, // first-level classification function
						 groupingBy(dish -> { // second-level classification function
							 if(dish.getCalory() < 400) return CaloricalLevel.DIET;
							 else if(dish.getCalory() < 700) return CaloricalLevel.NORMAL;
							 else return CaloricalLevel.FAT;
						 })));

>>> Result

{MEAT = {FAT=[불고기, 제육볶음]}, FISH = {DIET=[연어덮밥], NORMAL=[매운탕]}}

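The inner groupingBy receives only one argument, the classification Function; the one-argument form is shorthand for groupingBy(f, toList()).

The book suggests thinking of groupingBy in terms of buckets. The first groupingBy creates a bucket for each key, and each bucket is then filled by the downstream collector applied to its substream; repeating this yields n-level grouping.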
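As a sketch of the bucket idea, the innermost bucket does not have to hold a list: any collector can fill it. The example below counts dishes per type and calorie level. The Dish shape, romanized names, and calorie values are my own assumptions; the thresholds follow the snippet above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class MultiLevelSketch {
    enum DishType { MEAT, FISH }
    enum CaloricalLevel { DIET, NORMAL, FAT }

    // Assumed shape of Dish; the calorie values below are invented.
    static final class Dish {
        private final String name;
        private final DishType type;
        private final int calory;

        Dish(String name, DishType type, int calory) {
            this.name = name;
            this.type = type;
            this.calory = calory;
        }

        DishType getType() { return type; }
        int getCalory() { return calory; }
        @Override public String toString() { return name; }
    }

    static List<Dish> menu() {
        return Arrays.asList(
            new Dish("Bulgogi", DishType.MEAT, 750),
            new Dish("Jeyuk-bokkeum", DishType.MEAT, 800),
            new Dish("Maeuntang", DishType.FISH, 450),
            new Dish("Yeoneo-deopbap", DishType.FISH, 350));
    }

    static CaloricalLevel levelOf(Dish dish) {
        if (dish.getCalory() < 400) return CaloricalLevel.DIET;
        if (dish.getCalory() < 700) return CaloricalLevel.NORMAL;
        return CaloricalLevel.FAT;
    }

    // Outer bucket per type, inner bucket per level, each inner bucket filled by counting().
    static Map<DishType, Map<CaloricalLevel, Long>> countByTypeAndLevel(List<Dish> menu) {
        return menu.stream()
            .collect(Collectors.groupingBy(Dish::getType,
                Collectors.groupingBy(MultiLevelSketch::levelOf,
                    Collectors.counting())));
    }

    public static void main(String[] args) {
        // Key order in the printed maps is not guaranteed.
        System.out.println(countByTypeAndLevel(menu()));
    }
}
```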

1-4. Collecting data in subgroups

In the simplest case, a second collector passed to groupingBy aggregates the elements that land in each bucket, as shown below.

// Count the dishes of each type
Map<DishType, Long> typesCount = menu.stream()
				.collect(groupingBy(Dish::getType, counting()));	

// Find the highest-calorie dish of each type
Map<DishType, Optional<Dish>> mostCaloricByType = menu.stream()
				.collect(groupingBy(Dish::getType, 
                		 maxBy(comparingInt(Dish::getCalory))));

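The second stream returns Optional values, but the map values do not actually need to be Optional: groupingBy adds a key only when an element for it first appears, so no group is ever empty. Collectors.collectingAndThen lets you convert the result returned by a collector into another type.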

Map<DishType, Dish> mostCaloricByType = menu.stream()
				.collect(groupingBy(Dish::getType, // classification function
                		 collectingAndThen(
                         	maxBy(comparingInt(Dish::getCalory)), // wrapped collector
                            Optional::get))); // transformation function

collectingAndThen takes the collector to apply and a transformation function, and returns a new collector.
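A self-contained sketch of this pattern follows, again with an assumed Dish shape, romanized names, and invented calorie values. The wrapped maxBy produces an Optional<Dish> per group, and Optional::get unwraps it so the map values are plain Dish objects.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

class CollectingAndThenSketch {
    enum DishType { MEAT, FISH }

    // Assumed shape of Dish; the calorie values below are invented.
    static final class Dish {
        private final String name;
        private final DishType type;
        private final int calory;

        Dish(String name, DishType type, int calory) {
            this.name = name;
            this.type = type;
            this.calory = calory;
        }

        String getName() { return name; }
        DishType getType() { return type; }
        int getCalory() { return calory; }
    }

    static List<Dish> menu() {
        return Arrays.asList(
            new Dish("Bulgogi", DishType.MEAT, 750),
            new Dish("Jeyuk-bokkeum", DishType.MEAT, 800),
            new Dish("Maeuntang", DishType.FISH, 450),
            new Dish("Yeoneo-deopbap", DishType.FISH, 350));
    }

    // Safe here because groupingBy never produces an empty group.
    static Map<DishType, Dish> mostCaloricByType(List<Dish> menu) {
        return menu.stream()
            .collect(Collectors.groupingBy(Dish::getType,
                Collectors.collectingAndThen(
                    Collectors.maxBy(Comparator.comparingInt(Dish::getCalory)),
                    Optional::get)));
    }

    public static void main(String[] args) {
        Map<DishType, Dish> result = mostCaloricByType(menu());
        System.out.println(result.get(DishType.MEAT).getName()); // Jeyuk-bokkeum
        System.out.println(result.get(DishType.FISH).getName()); // Maeuntang
    }
}
```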

2. Partitioning

Partitioning is a special case of grouping that uses a Predicate as the classification function. Since a Predicate returns a boolean, partitioning classifies the data into two groups, true and false.

Map<Boolean, List<Dish>> partition = menu.stream()
				.collect(partitioningBy(Dish::isVegiterian));
                
>>> Result
{ false = [불고기, 제육볶음, 연어덮밥, 매운탕],
  true = [야채 볶음밥]}

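Imagine doing the same classification with filter and the advantage of partitioningBy becomes clear.
- With filter, the elements that fail the predicate are gone, whereas partitioningBy keeps both the matching and the non-matching elements.

You can also pass another Collector as the second argument of partitioningBy. Let's pass groupingBy to regroup each partition by DishType.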

Map<Boolean, Map<DishType, List<Dish>>> partition = menu.stream()
				.collect(partitioningBy(Dish::isVegiterian, // partitioning function
                		 groupingBy(Dish::getType))); // second collector
                         
>>> Result
{ false = { MEAT = [불고기, 제육볶음], FISH = [연어덮밥, 매운탕] },
  true = { OTHER = [야채 볶음밥] }}

This time, let's pass collectingAndThen as the collector.

Map<Boolean, Dish> partition = menu.stream()
				.collect(partitioningBy(Dish::isVegiterian, // partitioning function
                		 collectingAndThen(maxBy(comparingInt(Dish::getCalory)),
                         Optional::get))); // second collector

>>> Result
{ false = 불고기, true = 야채 볶음밥 }
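One difference from groupingBy is worth a small sketch. As far as I understand from the Collectors Javadoc, the map returned by partitioningBy always contains both the true and the false key, even when one side has no elements. With a menu that has no vegetarian dish, Optional::get inside collectingAndThen would therefore throw for the empty side, so the sketch below keeps the Optional instead. The Dish shape, romanized names, and values are my own assumptions; the method name isVegiterian is spelled as in the post.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

class PartitioningSketch {
    // Assumed shape of Dish; the calorie values below are invented.
    static final class Dish {
        private final String name;
        private final boolean vegiterian;
        private final int calory;

        Dish(String name, boolean vegiterian, int calory) {
            this.name = name;
            this.vegiterian = vegiterian;
            this.calory = calory;
        }

        String getName() { return name; }
        boolean isVegiterian() { return vegiterian; }
        int getCalory() { return calory; }
    }

    // A menu without any vegetarian dish, to exercise the empty partition.
    static List<Dish> meatOnlyMenu() {
        return Arrays.asList(
            new Dish("Bulgogi", false, 750),
            new Dish("Jeyuk-bokkeum", false, 800));
    }

    static Map<Boolean, List<Dish>> partition(List<Dish> menu) {
        return menu.stream()
            .collect(Collectors.partitioningBy(Dish::isVegiterian));
    }

    // Keeping Optional avoids a NoSuchElementException for an empty partition.
    static Map<Boolean, Optional<Dish>> mostCaloric(List<Dish> menu) {
        return menu.stream()
            .collect(Collectors.partitioningBy(Dish::isVegiterian,
                Collectors.maxBy(Comparator.comparingInt(Dish::getCalory))));
    }

    public static void main(String[] args) {
        System.out.println(partition(meatOnlyMenu()).get(true).isEmpty());     // true
        System.out.println(mostCaloric(meatOnlyMenu()).get(true).isPresent()); // false
        System.out.println(mostCaloric(meatOnlyMenu()).get(false).get().getName()); // Jeyuk-bokkeum
    }
}
```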

The Collector interface

public interface Collector<T, A, R> {
	Supplier<A> supplier();
	BiConsumer<A, T> accumulator();
	Function<A, R> finisher();
	BinaryOperator<A> combiner();
	Set<Characteristics> characteristics();
}

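Each element is described below.

supplier()
- Provides the container that holds the result
accumulator()
- Returns the function that performs the reduction step, that is, adds an element to the result container
finisher()
- Returns the function that converts the accumulated container into the final result
combiner()
- Merges two result containers produced during parallel processing
characteristics()
- Provides hints about how the collector operation can be run
- UNORDERED : the result of the reduction is not affected by the order in which elements are traversed or accumulated
- CONCURRENT : accumulator() may be called concurrently from multiple threads, so this collector can perform a parallel reduction of the stream
- IDENTITY_FINISH : the accumulator object can be used directly as the final result of the reduction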
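Before writing a full class, the pieces above can be seen side by side with Collector.of, a JDK factory method that builds a Collector from functions. This is only a sketch: the class and method names are hypothetical, and the collector simply gathers elements into a read-only list. Because its finisher is not the identity function, it declares no IDENTITY_FINISH flag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collector;

class CollectorOfSketch {
    // Hypothetical helper: collects elements into an unmodifiable list.
    static <T> Collector<T, List<T>, List<T>> toReadOnlyList() {
        return Collector.<T, List<T>, List<T>>of(
            ArrayList::new,                                        // supplier: new empty container
            List::add,                                             // accumulator: add one element
            (left, right) -> { left.addAll(right); return left; }, // combiner: merge two containers
            Collections::unmodifiableList);                        // finisher: wrap the container
        // No Characteristics are passed: not IDENTITY_FINISH, not CONCURRENT, not UNORDERED.
    }

    static List<String> collect(List<String> input) {
        return input.stream().collect(toReadOnlyList());
    }

    static List<String> collectInParallel(List<String> input) {
        // The combiner is what makes the parallel case work.
        return input.parallelStream().collect(toReadOnlyList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "c");
        System.out.println(collect(words));           // [a, b, c]
        System.out.println(collectInParallel(words)); // [a, b, c]
    }
}
```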

So far we have used the utility collectors that Collectors provides as factory methods; this time let's implement a Collector ourselves.

public class DishTypeCollector implements Collector<Dish, Map<DishType, List<String>>, Map<DishType, List<String>>>{

	@Override
	public Supplier<Map<DishType, List<String>>> supplier() {
		return () -> new HashMap<>() {{ // provide the result container
			put(DishType.FISH, new ArrayList<>());
			put(DishType.MEAT, new ArrayList<>());
			put(DishType.OTHER, new ArrayList<>());
		}};
	}

	@Override
	public BiConsumer<Map<DishType, List<String>>, Dish> accumulator() {
		return (Map<DishType, List<String>> result, Dish dish) -> { // perform the accumulation
			result.get(dish.getType()).add(dish.getName());
		};
	}

	@Override
	public BinaryOperator<Map<DishType, List<String>>> combiner() {
		return (Map<DishType, List<String>> map1, Map<DishType, List<String>> map2) -> { // merge two containers during parallel processing
			for(DishType dishType : map2.keySet()) {
				map1.get(dishType).addAll(map2.get(dishType));
			}
			
			return map1;
		};
	}

	@Override
	public Function<Map<DishType, List<String>>, Map<DishType, List<String>>> finisher() {
		return Function.identity(); // identity function
	}

	@Override
	public Set<Characteristics> characteristics() {
		return Collections.unmodifiableSet(EnumSet.of( // set the collector flags
				Characteristics.IDENTITY_FINISH)); // no CONCURRENT: HashMap and ArrayList are not thread-safe
	}
}

List<Dish> menu = Arrays.asList(
			new Dish("불고기", DishType.MEAT),
			new Dish("제육볶음", DishType.MEAT),
			new Dish("매운탕", DishType.FISH),
			new Dish("연어덮밥", DishType.FISH),
            new Dish("야채 볶음밥", DishType.OTHER)
);

Map<DishType, List<String>> result = menu.stream().collect(new DishTypeCollector());
Map<DishType, List<String>> parallelResult = menu.parallelStream().collect(new DishTypeCollector());

Expressing the same behavior with only stream operations and the built-in collectors looks like this.

List<Dish> menu = Arrays.asList(
			new Dish("불고기", DishType.MEAT),
			new Dish("제육볶음", DishType.MEAT),
			new Dish("매운탕", DishType.FISH),
			new Dish("연어덮밥", DishType.FISH),
            new Dish("야채 볶음밥", DishType.OTHER)
);

Map<DishType, List<String>> result = menu.stream()
				.map(dish -> {
					Map<DishType, String> dishTypes = new HashMap<>();
					dishTypes.put(dish.getType(), dish.getName());
					return dishTypes;
				})
				.flatMap(map -> map.entrySet().stream())
				.collect(groupingBy(Map.Entry::getKey, 
						 Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

When data has to be collected in a complicated way, I came to feel that writing a custom Collector is by far the better approach.