[프로젝트] 이중 이상의 Join 시 문제점과 개선 + (쿼리 분리 / 인덱스 / 반정규화)

sobaman·2025년 2월 21일

Querydsl db sql

DB & SQL

목록 보기

5/7

프로젝트를 진행하며 피드백을 받은 사항인데

만약 Join 되는 테이블의 데이터가 매우 방대하다고 가정했을 때

이중, 삼중 이상의 Join 이 일어나게 되면 어떻게 될 지 한 번 생각해 보라는 말씀이 있었다.

A, B, C 테이블 Join + Where 조건을 적용했을 때 가정

테이블의 데이터 수는 다음과 같다고 가정해보자

A 테이블 데이터 : 1,000,000개

B 테이블 데이터 : 100,000개

C 테이블 데이터 : 10개

A와 B 가 먼저 Join 됨
합쳐진 데이터 셋에서 Where 조건이 적용됨
Where 조건이 적용된 데이터셋을 다시 C 와 Join
나온 결과에 Where 조건 적용
최종 결과 반환

만약 A * B 가 먼저 Join 되면 1,000,000 * 100,000 로 데이터 셋이 매우 커지게 된다

반대로 A * C 가 먼저 Join 되면 1,000,000 * 10 의 데이터 셋이 반환되고

Where 조건까지 거치면 B와 Join 될 데이터 셋이 상대적으로 작이지게 된다

즉, 테이블의 데이터 양에 따라 Join 순서가 성능에 영향을 미친다

무조건은 아니지만 일반적으로 가장 큰 테이블과 작은 테이블로

먼저 Join + where 조건으로 거르는 게성능적으로 유리한 선택이다

프로젝트에서 작성한 쿼리 개선

기존 프로젝트 쿼리

private SearchPostCursorResponse findByHashtag(String keyword, Long lastPostId, int size) {
		List<SearchPostResponse> searchPostResponses = jpaQueryFactory
			.select(Projections.constructor(
				SearchPostResponse.class,
				imageEntity.post.id,
				imageEntity.imageUrl))
			.from(imageEntity)
			.join(imageEntity.post, postEntity)
			.join(postHashtagEntity).on(postHashtagEntity.post.eq(postEntity))
			.join(postHashtagEntity.hashtag, hashtagEntity)
			.where(
				lastPostId != null ? imageEntity.post.id.lt(lastPostId) : null,
				hashtagEntity.content.containsIgnoreCase(keyword),
				imageEntity.id.eq(
					JPAExpressions
						.select(imageEntity.id.min())
						.from(imageEntity)
						.where(imageEntity.post.id.eq(postEntity.id))))
			.orderBy(imageEntity.post.id.desc())
			.limit(size + 1)
			.fetch();
	}

현재 프로젝트에서 작성한 검색 쿼리이다

일단 Join 이 삼중 Join 으로 되어있고, 심지어 서브쿼리까지 적용되어있다.

각 테이블의 데이터가 방대해지면 한 번에 들고 와야 할 데이터 셋이 매우 커질 위험도가 크고

필터링 되는 각 레코드마다 서브쿼리가 적용되어 성능 저하를 일으킨다

내가 찾아낸 해당 쿼리를 개선하는 방법은 다음과 같다

쿼리를 분리하여 날리기
서브쿼리를 어플리케이션 레벨에서 대체
인덱스 적용
반정규화를 통한 검색 테이블 따로 생성하기 (데이터를 중복 저장해 조회 성능 높이기)

쿼리를 분리해서 날리기

// 1. 먼저 해시태그 조건에 맞는 포스트 ID만 조회
List<Long> postIds = jpaQueryFactory
    .select(postEntity.id)
    .from(postEntity)
    .join(postHashtagEntity).on(postHashtagEntity.post.eq(postEntity))
    .join(postHashtagEntity.hashtag, hashtagEntity)
    .where(
        lastPostId != null ? postEntity.id.lt(lastPostId) : null,
        hashtagEntity.content.containsIgnoreCase(keyword))
    .groupBy(postEntity.id) // 중복 제거
    .orderBy(postEntity.id.desc())
    .limit(size + 1)
    .fetch();

// 2. 선택된 포스트 ID에 대한 이미지 조회
List<SearchPostResponse> searchPostResponses = jpaQueryFactory
    .select(Projections.constructor(
        SearchPostResponse.class,
        imageEntity.post.id,
        imageEntity.imageUrl))
    .from(imageEntity)
    .where(
        imageEntity.post.id.in(postIds), // 위에서 추출한 postIds in절로
        imageEntity.id.eq(
            JPAExpressions
                .select(imageEntity.id.min())
                .from(imageEntity)
                .where(imageEntity.post.id.eq(imageEntity.post.id))))
    .orderBy(imageEntity.post.id.desc())
    .fetch();

일단

해시태그 조건에 맞는 id를 먼저 조회하고
선택한 포스트 id 에 대한 이미지 조회를 수행

한 번 날릴 쿼리를 두번 나눠 날린다는 점에서 비효율적으로 보일수도 있지만

Join 문 으로 인해 데이터 셋이 방대해지는 경우에는 이 쪽이 훨씬 나은 선택이 될 수 있다

추가로 서브쿼리까지 개선 → 어플리케이션 레벨에서 처리

여기서 욕심을 더 내보자면,

서브쿼리 역시 성능 저하를 일으킬 수 있는 잠재 요소가 크기 때문에

해당 부분을 어플리케이션 레벨에서 처리하는 방식으로 변경할 수 있다

private SearchPostCursorResponse findByHashtag(String keyword, Long lastPostId, int size) {

		List<Long> postIds = jpaQueryFactory
			.select(postEntity.id)
			.from(postEntity)
			.join(postHashtagEntity).on(postHashtagEntity.post.eq(postEntity))
			.join(postHashtagEntity.hashtag, hashtagEntity)
			.where(
				lastPostId != 0L ? postEntity.id.lt(lastPostId) : null,
				hashtagEntity.content.eq(keyword)
			)
			.orderBy(postEntity.id.desc())
			.limit(size + 1)
			.fetch();

		List<ImageDataWithPostId> allImages = jpaQueryFactory
			.select(Projections.constructor(
				ImageDataWithPostId.class,
				imageEntity.post.id,
				imageEntity.imageUrl))
			.from(imageEntity)
			.where(
					imageEntity.post.id.in(postIds)
				// 서브쿼리 삭제
				// , imageEntity.id.eq(
				// 	JPAExpressions
				// 		.select(imageEntity.id.min())
				// 		.from(imageEntity)
				// 		.where(imageEntity.post.id.eq(postEntity.id))))
			.orderBy(imageEntity.post.id.desc(), imageEntity.id.asc())
			.limit(size + 1)
			.fetch();
			
		// 서브쿼리를 어플리케이션 레벨에서 대체
		Map<Long, SearchPostResponse> mapWithFirstImage = new LinkedHashMap<>();
		
		// postId 개수 만큼의 item 이 생성됨
		for (ImageDataWithPostId imageData : allImages) {
			if(!mapWithFirstImage.containsKey(imageData.postId())) {
				mapWithFirstImage.put(
					imageData.postId(), SearchPostResponse.create(imageData.postId(), imageData.imgUrl()));
			}
		}
}

현재 동일한 postId 의 image 중 가장 첫번째 image 만 추출하기 위해서

다음과 같은 서브쿼리가 적용되어 있는데,

위와 같이 어플리케이션 레벨로 옮겨서 처리하는 방식으로 대체할 수 있다

현재 allImages 는 imageEntity.post.id.desc(), imageEntity.id.asc() 로 정렬이 적용된 상태,

mapWithFirstImage 에는 유니크한 postId 키를 가지고

가장 첫번째 있는 image 만을 추출하여 반환하게 된다.

최종적으로 Join 도 분리되고, 서브쿼리도 제거된 최적화 쿼리가 완성됐다.

인덱스

다음 링크 참조

반정규화를 이용한 검색 테이블 따로 생성하기

CREATE TABLE hashtag_search (
  post_id BIGINT,
  hashtag_content VARCHAR(255),
  img_url VARCHAR(512),
  PRIMARY KEY (post_id, hashtag_content)
)

해당 검색 자체가 매우 빈번하게 일어날 경우 DB 구조 자체를 바꾸는 방법도 있다

아예 검색 조회 테이블을 따로 만들어 빼놓는 것이다

반정규화란 뜻으로 알 수 있듯이 DB 내에 중복 데이터를 허용하게 함으로

추가적인 DB 저장 공간의 할당을 요구하긴 하지만

검색 성능의 최적화가 매우 필요하다고 생각하면 도입하기 좋은 방법이다

sobaman

백엔드 공부 정리 블로그

이전 포스트

인덱스에 대해 알아보자 + (B+Tree 알고리즘)

다음 포스트