[Elasticsearch] Unpacking Data Grouped by a Bucket Aggregation - top_hits, reverse_nested

조갱 · November 13, 2022


In the previous post, we looked at Aggregations (hereafter "aggs"), the Elasticsearch feature for grouping data and working on the groups.

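In MySQL (with the default ONLY_FULL_GROUP_BY mode), a GROUP BY query can SELECT only the grouped columns and aggregate values.
In ES, you can also retrieve the documents themselves, not just the fields you grouped by.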
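The requests in this post only work if `gender` and `address.city` can be aggregated on and `address` is a `nested` field. The post does not show the index mapping, so the following is an assumed mapping, a minimal sketch that would make the examples behave as shown. It would need to be created before indexing the sample documents below.

```
# Assumed mapping (not part of the original post).
# gender / address.city are keyword so terms aggregations work on them,
# and address is nested so that nested / reverse_nested apply.
PUT person
{
  "mappings": {
    "properties": {
      "name":   { "type": "text" },
      "gender": { "type": "keyword" },
      "age":    { "type": "integer" },
      "address": {
        "type": "nested",
        "properties": {
          "personId": { "type": "long" },
          "country":  { "type": "keyword" },
          "city":     { "type": "keyword" }
        }
      }
    }
  }
}
```

With dynamic mapping instead, `gender` would become a `text` field with a `gender.keyword` sub-field and `address` would be a plain object, so neither the terms aggregation on `gender` nor the nested aggregations would work as written.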

PUT person/_doc/1
{
  "name":"김남자",
  "gender":"M",
  "age": 27,
  "address": {
    "personId": 1,
    "country": "KR",
    "city": "Suwon"
  }
}

PUT person/_doc/2
{
  "name":"김이상",
  "gender":"M",
  "age": 2,
  "address": {
    "personId": 2,
    "country": "KR",
    "city": "Seoul"
  }
}

PUT person/_doc/3
{
  "name":"최여자",
  "gender":"F",
  "age": 25,
  "address": {
    "personId": 3,
    "country": "KR",
    "city": "Seoul"
  }
}

top_hits

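As the name suggests, top_hits returns the top-ranked documents.
Used as a sub-aggregation, it lets you see the documents inside each group.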

GET person/_search
{
  "size": 0, 
  "aggs": {
    "byGender": {
      "terms": {
        "field": "gender"
      }
    }
  }
}

A plain terms aggregation on gender, like the one above, returns the following:

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "byGender": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "M",
          "doc_count": 2
        },
        {
          "key": "F",
          "doc_count": 1
        }
      ]
    }
  }
}

As shown above, only the gender keys and their document counts come back.
Let's attach a top_hits aggregation to it.

GET person/_search
{
  "size": 0, 
  "aggs": {
    "byGender": {
      "terms": {
        "field": "gender"
      },
      "aggs": {
        "top_hits": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "byGender": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "M",
          "doc_count": 2,
          "top_hits": {
            "hits": {
              "total": {
                "value": 2,
                "relation": "eq"
              },
              "max_score": 1,
              "hits": [
                {
                  "_index": "person",
                  "_id": "2",
                  "_score": 1,
                  "_source": {
                    "name": "김이상",
                    "gender": "M",
                    "age": 2,
                    "address": {
                      "personId": 2,
                      "country": "KR",
                      "city": "Seoul"
                    }
                  }
                }
              ]
            }
          }
        },
        {
          "key": "F",
          "doc_count": 1,
          "top_hits": {
            "hits": {
              "total": {
                "value": 1,
                "relation": "eq"
              },
              "max_score": 1,
              "hits": [
                {
                  "_index": "person",
                  "_id": "3",
                  "_score": 1,
                  "_source": {
                    "name": "최여자",
                    "gender": "F",
                    "age": 25,
                    "address": {
                      "personId": 3,
                      "country": "KR",
                      "city": "Seoul"
                    }
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}

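As shown above, one document is returned per gender. top_hits ranks hits by score by default, but this request has no query, so every document scores 1 and the hit shown is not meaningfully "the most relevant" one.
The number of documents returned by top_hits is set with the size field of the top_hits aggs,
and (by default) at most 100 can be returned.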
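Which document ends up in each bucket can be made explicit with the `sort` option of top_hits, and `_source` filtering trims the returned fields. The request below is a sketch, not something from the original post: it sorts by `age` descending and keeps only `name` and `age`. The aggregation name `oldest` is an arbitrary choice.

```
# Sketch: return the oldest person per gender instead of an arbitrary hit.
# "oldest" is just a label chosen for this example.
GET person/_search
{
  "size": 0,
  "aggs": {
    "byGender": {
      "terms": {
        "field": "gender"
      },
      "aggs": {
        "oldest": {
          "top_hits": {
            "size": 1,
            "sort": [
              { "age": { "order": "desc" } }
            ],
            "_source": {
              "includes": ["name", "age"]
            }
          }
        }
      }
    }
  }
}
```

With the sample data, the M bucket should now return 김남자 (age 27) rather than 김이상 (age 2).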

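This limit can be changed through the index.max_inner_result_window index setting (which caps from + size for top_hits),
but raising it can put a strain on ES performance, so if you need more than 100 results,
consider a different approach.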
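For reference, this is roughly how the setting could be changed on the sample index. The value 200 is an arbitrary assumption for illustration, and the performance caveat above still applies.

```
# Sketch: raise the top_hits / inner hits limit for the person index.
# 200 is an arbitrary example value, not a recommendation.
PUT person/_settings
{
  "index": {
    "max_inner_result_window": 200
  }
}
```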

reverse_nested

Sometimes you need to run a terms aggregation on data inside a nested field
and then get the parent-level documents for each bucket.

First, let's run a terms aggs on city, which lives inside the nested field address.

GET person/_search
{
  "size": 0,
  "aggs": {
    "addrNested": {
      "nested": {
        "path": "address"
      },
      "aggs": {
        "byCity": {
          "terms": {
            "field": "address.city"
          }
        }
      }
    }
  }
}

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "addrNested": {
      "doc_count": 3,
      "byCity": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "Seoul",
            "doc_count": 2
          },
          {
            "key": "Suwon",
            "doc_count": 1
          }
        ]
      }
    }
  }
}

Seoul and Suwon have been aggregated.
Now let's look at the documents with top_hits.

GET person/_search
{
  "size": 0,
  "aggs": {
    "addrNested": {
      "nested": {
        "path": "address"
      },
      "aggs": {
        "byCity": {
          "terms": {
            "field": "address.city"
          },
          "aggs": {
            "top_hits": {
              "top_hits": {
                "size": 1
              }
            }
          }
        }
      }
    }
  }
}

... (omitted)
  "aggregations": {
    "addrNested": {
      "doc_count": 3,
      "byCity": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "Seoul",
            "doc_count": 2,
            "top_hits": {
              "hits": {
                "total": {
                  "value": 2,
                  "relation": "eq"
                },
                "max_score": 1,
                "hits": [
                  {
                    "_index": "person",
                    "_id": "2",
                    "_nested": {
                      "field": "address",
                      "offset": 0
                    },
                    "_score": 1,
                    "_source": {
                      "country": "KR",
                      "city": "Seoul",
                      "personId": 2
                    }
                  }
                ]
              }
            }
          },
... (omitted)

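The top_hits aggs returned only the nested address field.
The reason has to do with how nested fields are stored:

each nested object is indexed as its own separate document, so inside the nested aggregation
the top document that top_hits returns is the nested address document, not the person document.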

To get the person document, let's use reverse_nested.

GET person/_search
{
  "size": 0,
  "aggs": {
    "addrNested": {
      "nested": {
        "path": "address"
      },
      "aggs": {
        "byCity": {
          "terms": {
            "field": "address.city"
          },
          "aggs": {
            "reverse": {
              "reverse_nested": {}, // Adding a path field here joins back to that specific nested level.
                // Without a path field, it joins back to the root document.
              "aggs": {
                "top_hits": {
                  "top_hits": {
                    "size": 1
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

... (omitted)
  "aggregations": {
    "addrNested": {
      "doc_count": 3,
      "byCity": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "Seoul",
            "doc_count": 2,
            "reverse": {
              "doc_count": 2,
              "top_hits": {
                "hits": {
                  "total": {
                    "value": 2,
                    "relation": "eq"
                  },
                  "max_score": 1,
                  "hits": [
                    {
                      "_index": "person",
                      "_id": "2",
                      "_score": 1,
                      "_source": {
                        "name": "김이상",
                        "gender": "M",
                        "age": 2,
                        "address": {
                          "personId": 2,
                          "country": "KR",
                          "city": "Seoul"
                        }
                      }
                    }
                  ]
                }
              }
            }
          },
          {
            "key": "Suwon",
            "doc_count": 1,
            "reverse": {
              "doc_count": 1,
              "top_hits": {
                "hits": {
                  "total": {
                    "value": 1,
                    "relation": "eq"
                  },
                  "max_score": 1,
                  "hits": [
                    {
                      "_index": "person",
                      "_id": "1",
                      "_score": 1,
                      "_source": {
                        "name": "김남자",
                        "gender": "M",
                        "age": 27,
                        "address": {
                          "personId": 1,
                          "country": "KR",
                          "city": "Suwon"
                        }
                      }
                    }
                  ]
                }
              }
            }
          }
        ]
      }
    }
  }
... (omitted)

Now the full person documents can be read.
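The `path` option mentioned in the request comment matters once nested fields are nested more than one level deep. The sample index has no such field, so the sketch below assumes a hypothetical `address.phones` nested field with a keyword `type` sub-field. It buckets by phone type, then uses `reverse_nested` with `"path": "address"` to step back up to the address level (rather than all the way to the root) and bucket by city there.

```
# Sketch only: address.phones and address.phones.type are hypothetical fields
# that do not exist in the sample person index.
GET person/_search
{
  "size": 0,
  "aggs": {
    "phoneNested": {
      "nested": {
        "path": "address.phones"
      },
      "aggs": {
        "byPhoneType": {
          "terms": {
            "field": "address.phones.type"
          },
          "aggs": {
            "backToAddress": {
              "reverse_nested": {
                "path": "address"
              },
              "aggs": {
                "byCity": {
                  "terms": {
                    "field": "address.city"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
```

Leaving `path` out, as in the request earlier in this section, joins back to the root person document instead.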
