The _source object is not directly used when searching for docs.
An analyzer processes text values before they are stored in the data structures used for search.
3 components of an analyzer:
Character filters: add, remove, or change characters before the text is tokenized
ex) html_strip filter (strips HTML tags)
Tokenizer: an analyzer contains exactly one tokenizer, which splits the string into tokens
ex) ["I", "really", "like", "beer"]
Token filters: receive the output of the tokenizer as input and add, remove, or modify tokens
ex) lowercase filter
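For example, the character filter stage can be tried in isolation with the _analyze API (a minimal sketch; the HTML sample text is made up, and token filters are omitted):

POST /_analyze
{
  "text" : "<p>I <strong>really</strong> like beer</p>",
  "char_filter" : ["html_strip"],
  "tokenizer" : "standard"
}

html_strip removes the tags before the standard tokenizer runs, so the resulting tokens are ["I", "really", "like", "beer"].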
POST /_analyze
{
  "text" : "2 guys wal into a bar, but the third... DUCKS! :-)",
  "analyzer" : "standard"
}
Output
{
  "tokens" : [
    {
      "token" : "2",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "guys",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "wal",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "into",
      "start_offset" : 11,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    ...
  ]
}
// equivalent to the request above that used the standard analyzer, broken into its components
POST /_analyze
{
  "text" : "2 guys wal into a bar, but the third... DUCKS! :-)",
  "char_filter" : [],
  "tokenizer" : "standard",
  "filter" : ["lowercase"]
}
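Such a combination can be registered as a reusable custom analyzer in the index settings (a minimal sketch; my_index and my_custom_analyzer are placeholder names):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

Fields can then reference it through the analyzer mapping parameter.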
A field's values are stored in one of several data structures, depending on the field's data type, which ensures efficient data access.
An inverted index enables efficient searching of docs by term.
The inverted index also contains additional information used for relevance scoring
(ranking docs by how well they match a query).
An inverted index is created for each text field.
Fields with data types other than text use different index data structures (for example, numeric, date, and geo fields use BKD trees).
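As an illustration (the index and field names below are made up), only the text field gets an inverted index built from analyzed tokens:

PUT /products
{
  "mappings": {
    "properties": {
      "description": { "type": "text" },
      "brand": { "type": "keyword" },
      "price": { "type": "integer" }
    }
  }
}

The keyword field is indexed as single, unanalyzed terms, and the integer field uses a BKD tree structure optimized for range queries.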