Elasticsearch: a composite aggregation that counts the children matching a condition in a join relation

a worker on the blue dot · March 18, 2020

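Documents with a join relation

The mapping is shown below.
comment is the child of post, and each post holds the people it mentions as an array.
(On versions below 7, the properties have to be wrapped in _doc under mappings.)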

PUT blog
{
  "mappings": {
    "properties": {
      "post": {
        "properties": {
          "postId": {
            "type": "keyword"
          },
          "mentionedPeople": {
            "type": "keyword"
          }
        }
      },
      "comment": {
        "properties": {
          "commentId": {
            "type": "keyword"
          },
          "content": {
            "type": "text"
          }
        }
      },
      "join": {
        "type": "join",
        "relations": {
          "post": "comment"
        }
      }
    }
  }
}

The data was inserted as shown below. To tell the documents apart, every comment's content (comment.content) is "hello", except comment 26, which is "goodbye".

### Insert the two parent posts

PUT blog/_doc/1
{
  "post":{
    "postId" :1,
    "mentionedPeople":["a","b","c"]
  },
  "join":"post"
}
PUT blog/_doc/2
{
  "post":{
    "postId" :2,
    "mentionedPeople":["b","c","d","e"]
  },
  "join":"post"
}

### Insert 11 comments in the same way as below (5 under post 1, 6 under post 2; only two are shown)
PUT blog/_doc/11?routing=1
{
  "comment": {
    "commentId": 11,
    "content": "hello"
  },
  "join": {
    "name": "comment",
    "parent": 1
  }
}
## Only the 6th comment of post 2 has its content set to goodbye.
PUT blog/_doc/26?routing=2
{
  "comment": {
    "commentId": 26,
    "content": "goodbye"
  },
  "join": {
    "name": "comment",
    "parent": 2
  }
}

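Let's count the comments whose content is hello, per mentioned person.

A post mentions several people, and a post also has several comments.
Following the chain mentioned person > post > comments that meet a condition, I want to aggregate the comments per person.

That is, b is mentioned in posts 1 and 2 and so owns all 11 comments, of which 10 say hello; a is mentioned only in post 1 and owns 5 comments that say hello. Those are the numbers the aggregation should return.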
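The target numbers can be checked by hand before touching Elasticsearch. The following is a minimal in-memory sketch, not an Elasticsearch call: the class name ExpectedMentionCounts and the two maps are hypothetical stand-ins for the indexed documents, and the comment contents are assumed from the description above (five hello comments under post 1, five hello plus one goodbye under post 2).

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ExpectedMentionCounts {
    // postId -> people mentioned in that post (same values as the sample documents).
    static final Map<Integer, List<String>> MENTIONS = Map.of(
            1, List.of("a", "b", "c"),
            2, List.of("b", "c", "d", "e"));

    // postId -> content of each child comment (assumed: 5 under post 1, 6 under post 2).
    static final Map<Integer, List<String>> COMMENTS = Map.of(
            1, List.of("hello", "hello", "hello", "hello", "hello"),
            2, List.of("hello", "hello", "hello", "hello", "hello", "goodbye"));

    // For every mentioned person, sum the comments with the wanted content
    // over all posts that mention that person.
    static Map<String, Integer> countByPerson(String wantedContent) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> post : MENTIONS.entrySet()) {
            int matching = 0;
            for (String content : COMMENTS.getOrDefault(post.getKey(), List.of())) {
                if (content.equals(wantedContent)) {
                    matching++;
                }
            }
            for (String person : post.getValue()) {
                counts.merge(person, matching, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {a=5, b=10, c=10, d=5, e=5}
        System.out.println(countByPerson("hello"));
    }
}
```

Any query that is supposed to solve the problem should reproduce exactly these five values.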

1. Aggregate by mentionedPeople -> then aggregate the comments

Because comment is the child type, the aggregation has to go through a children aggregation.

GET blog/_search
{
  "query": {
    "has_child": {
      "type": "comment",
      "inner_hits": {
        "_source": false,
        "size": 0
      },
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "comment.content": "hello"
              }
            }
          ]
        }
      }
    }
  },
  "_source": false,
  "aggs": {
    "byworkspaceId": {
      "terms": {
        "field": "post.mentionedPeople"
      },
      "aggs": {
        "commentCount": {
          "children": {
            "type": "comment"
          },
          "aggs": {
            "commentHowMany": {
              "value_count": {
                "field": "comment.commentId"
              }
            }
          }
        }
      }
    }
  }
}

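The result is shown below.

In inner_hits, post 1 and post 2 each have 5 comments matching hello,
but in the buckets b and c get 11 and d and e get 6, so the goodbye comment is not excluded. The has_child query only decides which parent posts match; the children aggregation then collects every child of those posts.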

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 4,
    "successful": 4,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "blog",
        "_type": "_doc",
        "_id": "2",
        "_score": 1,
        "inner_hits": {
          "comment": {
            "hits": {
              "total": 5,
              "max_score": 0,
              "hits": []
            }
          }
        }
      },
      {
        "_index": "blog",
        "_type": "_doc",
        "_id": "1",
        "_score": 1,
        "inner_hits": {
          "comment": {
            "hits": {
              "total": 5,
              "max_score": 0,
              "hits": []
            }
          }
        }
      }
    ]
  },
  "aggregations": {
    "byworkspaceId": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "b",
          "doc_count": 2,
          "commentCount": {
            "doc_count": 11,
            "commentHowMany": {
              "value": 11
            }
          }
        },
        {
          "key": "c",
          "doc_count": 2,
          "commentCount": {
            "doc_count": 11,
            "commentHowMany": {
              "value": 11
            }
          }
        },
        {
          "key": "a",
          "doc_count": 1,
          "commentCount": {
            "doc_count": 5,
            "commentHowMany": {
              "value": 5
            }
          }
        },
        {
          "key": "d",
          "doc_count": 1,
          "commentCount": {
            "doc_count": 6,
            "commentHowMany": {
              "value": 6
            }
          }
        },
        {
          "key": "e",
          "doc_count": 1,
          "commentCount": {
            "doc_count": 6,
            "commentHowMany": {
              "value": 6
            }
          }
        }
      ]
    }
  }
}
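Why the buckets in the response above still include the goodbye comment can be reproduced with a small in-memory sketch. It is only a rough model of the two stages, not Elasticsearch code: the class UnfilteredChildrenCount and its maps are hypothetical, and the comment contents per post are assumed from the sample data.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class UnfilteredChildrenCount {
    // postId -> people mentioned in that post.
    static final Map<Integer, List<String>> MENTIONS = Map.of(
            1, List.of("a", "b", "c"),
            2, List.of("b", "c", "d", "e"));

    // postId -> content of each child comment (assumed from the sample data).
    static final Map<Integer, List<String>> COMMENTS = Map.of(
            1, List.of("hello", "hello", "hello", "hello", "hello"),
            2, List.of("hello", "hello", "hello", "hello", "hello", "goodbye"));

    // Stage 1, like has_child: a post matches if at least one child has the wanted content.
    static boolean hasChild(int postId, String wantedContent) {
        return COMMENTS.getOrDefault(postId, List.of()).contains(wantedContent);
    }

    // Stage 2, like terms > children: every child of a matching post is counted,
    // whatever its own content is.
    static Map<String, Integer> childrenPerPerson(String wantedContent) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> post : MENTIONS.entrySet()) {
            if (!hasChild(post.getKey(), wantedContent)) {
                continue;
            }
            int allChildren = COMMENTS.get(post.getKey()).size();
            for (String person : post.getValue()) {
                counts.merge(person, allChildren, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {a=5, b=11, c=11, d=6, e=6}
        System.out.println(childrenPerPerson("hello"));
    }
}
```

The condition is applied to posts only, so post 2 brings all six of its comments into the b, c, d and e buckets.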

2. Apply the filter inside the aggregation as well.

GET blog/_search
{
  "query": {
    "has_child": {
      "type": "comment",
      "inner_hits": {
        "_source": false,
        "size": 0
      },
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "comment.content": "hello"
              }
            }
          ]
        }
      }
    }
  },
  "_source": false,
  "aggs": {
    "byworkspaceId": {
      "terms": {
        "field": "post.mentionedPeople",
        "size" : 5
      },
      "aggs": {
        "childCount": {
          "children": {
            "type": "comment"
          },
          "aggs": {
            "inner_filter": {
              "filter": {
                "bool": {
                  "should": [
                    {
                      "match": {
                        "comment.content": "hello"
                      }
                    }
                  ]
                }
              },
              "aggs": {
                "commentHowMany": {
                  "value_count": {
                    "field": "comment.commentId"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

With a filter aggregation placed inside the children aggregation, the goodbye comment is excluded and the counts come out right. The result is shown below.

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 4,
    "successful": 4,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "blog",
        "_type": "_doc",
        "_id": "2",
        "_score": 1,
        "inner_hits": {
          "comment": {
            "hits": {
              "total": 5,
              "max_score": 0,
              "hits": []
            }
          }
        }
      },
      {
        "_index": "blog",
        "_type": "_doc",
        "_id": "1",
        "_score": 1,
        "inner_hits": {
          "comment": {
            "hits": {
              "total": 5,
              "max_score": 0,
              "hits": []
            }
          }
        }
      }
    ]
  },
  "aggregations": {
    "byworkspaceId": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "b",
          "doc_count": 2,
          "childCount": {
            "doc_count": 11,
            "inner_filter": {
              "doc_count": 10,
              "commentHowMany": {
                "value": 10
              }
            }
          }
        },
        {
          "key": "c",
          "doc_count": 2,
          "childCount": {
            "doc_count": 11,
            "inner_filter": {
              "doc_count": 10,
              "commentHowMany": {
                "value": 10
              }
            }
          }
        },
        {
          "key": "a",
          "doc_count": 1,
          "childCount": {
            "doc_count": 5,
            "inner_filter": {
              "doc_count": 5,
              "commentHowMany": {
                "value": 5
              }
            }
          }
        },
        {
          "key": "d",
          "doc_count": 1,
          "childCount": {
            "doc_count": 6,
            "inner_filter": {
              "doc_count": 5,
              "commentHowMany": {
                "value": 5
              }
            }
          }
        },
        {
          "key": "e",
          "doc_count": 1,
          "childCount": {
            "doc_count": 6,
            "inner_filter": {
              "doc_count": 5,
              "commentHowMany": {
                "value": 5
              }
            }
          }
        }
      ]
    }
  }
}

3. Does it fetch everything at once? What about paging?

In the query above the terms aggregation sets size to 5, as shown below; if size is not specified, the default is 10.

  "aggs": {
    "byworkspaceId": {
      "terms": {
        "field": "post.mentionedPeople",
        "size" : 5
      },

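But what if there are ten thousand or a hundred thousand mentionedPeople?

There would be ten thousand or a hundred thousand buckets, and returning all of them in a single response can cause memory problems.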

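4. Adding a composite aggregation

A composite aggregation builds buckets from combinations of sources, but it is also used to page through all buckets of an aggregation.
Adding composite as below makes it possible to fetch the aggregation page by page.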

GET blog/_search
{
  "query": {
    "has_child": {
      "type": "comment",
      "inner_hits": {
        "_source": false,
        "size": 0
      },
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "comment.content": "hello"
              }
            }
          ]
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "my_compostie": {
      "composite": {
        "sources": [
          {
            "byMentionedPeople": {
              "terms": {
                "field": "post.mentionedPeople"
              }
            }
          }
        ],
        "size": 2
      },
      "aggs": {
        "childCount": {
          "children": {
            "type": "comment"
          },
          "aggs": {
            "inner_filter": {
              "filter": {
                "bool": {
                  "should": [
                    {
                      "match": {
                        "comment.content": "hello"
                      }
                    }
                  ]
                }
              },
              "aggs": {
                "commentHowMany": {
                  "value_count": {
                    "field": "comment.commentId"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

The result of the query above contains the buckets a and b.

Note the "after_key" in the result below.


 "aggregations": {
    "my_compostie": {
      "after_key": {
        "byMentionedPeople": "b"
      },

In the next query, pass that map as-is to the after parameter of the composite aggregation to continue from there.

"aggs": {
    "my_compostie": {
      "composite": {
        "sources": [
          {
            "byMentionedPeople": {
              "terms": {
                "field": "post.mentionedPeople"
              }
            }
          }
        ],
        "size": 2,
        "after": {
          "byMentionedPeople": "b"
        }
      },

Asking for the buckets after b like this returns c and d.
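The paging loop itself can be sketched without a cluster. The following is a plain-Java simulation of the after_key handshake, not a call to Elasticsearch: the class AfterKeyPaging and its methods are hypothetical, and the sorted set stands in for the ordered composite keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

class AfterKeyPaging {
    // Returns up to `size` keys strictly greater than `after`;
    // a null `after` means "start from the first key".
    static List<String> page(NavigableSet<String> sortedKeys, String after, int size) {
        NavigableSet<String> rest = (after == null) ? sortedKeys : sortedKeys.tailSet(after, false);
        List<String> result = new ArrayList<>();
        for (String key : rest) {
            if (result.size() == size) {
                break;
            }
            result.add(key);
        }
        return result;
    }

    // Keeps requesting pages, feeding the last key of each page back in as `after`,
    // until a page comes back empty.
    static List<List<String>> allPages(NavigableSet<String> sortedKeys, int size) {
        List<List<String>> pages = new ArrayList<>();
        String after = null;
        while (true) {
            List<String> current = page(sortedKeys, after, size);
            if (current.isEmpty()) {
                break;
            }
            pages.add(current);
            after = current.get(current.size() - 1); // plays the role of after_key
        }
        return pages;
    }

    public static void main(String[] args) {
        NavigableSet<String> people = new TreeSet<>(List.of("a", "b", "c", "d", "e"));
        // prints [[a, b], [c, d], [e]]
        System.out.println(allPages(people, 2));
    }
}
```

With size 2 the five people arrive in three pages, and the client only ever holds one page of buckets at a time.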

5. Written with the Java high-level REST client

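    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.search.join.ScoreMode;
    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.index.query.BoolQueryBuilder;
    import org.elasticsearch.index.query.InnerHitBuilder;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.join.aggregations.ChildrenAggregationBuilder;
    import org.elasticsearch.join.query.HasChildQueryBuilder;
    import org.elasticsearch.join.query.JoinQueryBuilders;
    import org.elasticsearch.search.aggregations.AggregationBuilders;
    import org.elasticsearch.search.aggregations.bucket.composite.CompositeAggregationBuilder;
    import org.elasticsearch.search.aggregations.bucket.composite.CompositeValuesSourceBuilder;
    import org.elasticsearch.search.aggregations.bucket.composite.TermsValuesSourceBuilder;
    import org.elasticsearch.search.builder.SearchSourceBuilder;
    import org.elasticsearch.search.fetch.subphase.FetchSourceContext;

    // ESClient is a helper class from this project, not part of the Elasticsearch client.
    RestHighLevelClient metaEs = ESClient.getClient("localhost", 9200);

    // has_child query: posts with at least one comment matching "hello".
    BoolQueryBuilder shouldMatchHello = QueryBuilders.boolQuery()
            .should(QueryBuilders.matchQuery("comment.content", "hello"));
    HasChildQueryBuilder hasChildQueryBuilder =
            JoinQueryBuilders.hasChildQuery("comment", shouldMatchHello, ScoreMode.None);
    // innerHit() without an argument is only a getter; pass a builder to request
    // inner_hits with size 0 and no _source, as in the JSON query.
    hasChildQueryBuilder.innerHit(new InnerHitBuilder()
            .setSize(0)
            .setFetchSourceContext(FetchSourceContext.DO_NOT_FETCH_SOURCE));

    // Composite aggregation keyed by mentioned person, two buckets per page.
    List<CompositeValuesSourceBuilder<?>> sources = new ArrayList<>();
    sources.add(new TermsValuesSourceBuilder("byMentionedPeople").field("post.mentionedPeople"));
    CompositeAggregationBuilder compositeAggregation =
            new CompositeAggregationBuilder("my_composite", sources).size(2);

    // children > filter: count only the comments that match "hello".
    compositeAggregation.subAggregation(
            new ChildrenAggregationBuilder("childCount", "comment")
                    .subAggregation(AggregationBuilders.filter("inner_filter", shouldMatchHello)));

    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    sourceBuilder.query(hasChildQueryBuilder);
    sourceBuilder.aggregation(compositeAggregation);
    sourceBuilder.size(0);

    SearchRequest searchRequest = new SearchRequest();
    searchRequest.indices("blog");
    searchRequest.source(sourceBuilder);

    try {
        SearchResponse response = metaEs.search(searchRequest, RequestOptions.DEFAULT);
        System.out.println(searchRequest);
        System.out.println(response);
    } catch (IOException e) {
        System.out.println(e);
    }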

6. In fact

In fact, once inner_filter was added, the value_count aggregation inside it became redundant,

because the filter result already carries a doc_count. Anyway, the complicated requirement is solved.
