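Mapping parameters and dynamically mapped fields in Elasticsearch, and why aggregations do not support the text type
When creating a mapping in Elasticsearch, I used a Map type:
private Map<String, Object> specs;
Checking the dynamically mapped type in Kibana, I found: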
"specs": {
"properties": {
"CPU品牌": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
rather than the usual:
"price": {
"type": "long"
},
"skus": {
"type": "keyword",
"index": false
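},
simple types like these. The structure above and its fields parameter were unfamiliar to me, so I went looking for material and eventually found the answer in the official documentation: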
fields
It is often useful to index the same field in different ways for different purposes. This is the purpose of multi-fields. For instance, a string field could be mapped as a text field for full-text search, and as a keyword field for sorting or aggregations:
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
PUT my_index/_doc/1
{
  "city": "New York"
}
PUT my_index/_doc/2
{
  "city": "York"
}
GET my_index/_search
{
  "query": {
    "match": {
      "city": "york"
    }
  },
  "sort": {
    "city.raw": "asc"
  },
  "aggs": {
    "Cities": {
      "terms": {
        "field": "city.raw"
      }
    }
  }
}
The city.raw field is a keyword version of the city field.
The city field can be used for full text search.
The city.raw field can be used for sorting and aggregations.
Multi-fields do not change the original _source field.
The fields setting is allowed to have different settings for fields of the same name in the same index. New multi-fields can be added to existing fields using the PUT mapping API.
Multi-fields with multiple analyzers
Another use case of multi-fields is to analyze the same field in different ways for better relevance. For instance we could index a field with the standard analyzer which breaks text up into words, and again with the english analyzer which stems words into their root form:
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "text": {
          "type": "text",
          "fields": {
            "english": {
              "type": "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}
PUT my_index/_doc/1
{ "text": "quick brown fox" }
PUT my_index/_doc/2
{ "text": "quick brown foxes" }
GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "quick brown foxes",
      "fields": [
        "text",
        "text.english"
      ],
      "type": "most_fields"
    }
  }
}
The text field uses the standard analyzer.
The text.english field uses the english analyzer.
Index two documents, one with fox and the other with foxes.
Query both the text and text.english fields and combine the scores.
The text field contains the term fox in the first document and foxes in the second document. The text.english field contains fox for both documents, because foxes is stemmed to fox.
The query string is also analyzed by the standard analyzer for the text field, and by the english analyzer for the text.english field. The stemmed field allows a query for foxes to also match the document containing just fox. This allows us to match as many documents as possible. By also querying the unstemmed text field, we improve the relevance score of the document which matches foxes exactly.
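To make the multi-field idea from the excerpt above concrete, here is a minimal Python sketch of the city / city.raw example. It is only a toy model and not how Elasticsearch is implemented: standard_tokens is a hypothetical stand-in for the standard analyzer (lowercase, split on non-word characters), and the two dictionaries play the roles of the analyzed text field and the untouched keyword sub-field.

```python
import re


def standard_tokens(value):
    """Rough stand-in for the standard analyzer (an assumption, not the real one)."""
    return [t for t in re.split(r"\W+", value.lower()) if t]


def index_multi_field(docs):
    """Index each value twice: analyzed terms for 'city', exact value for 'city.raw'."""
    inverted = {}  # term -> set of doc ids, models the text field
    raw = {}       # doc id -> original string, models the keyword sub-field
    for doc_id, city in docs.items():
        raw[doc_id] = city
        for term in standard_tokens(city):
            inverted.setdefault(term, set()).add(doc_id)
    return inverted, raw


docs = {1: "New York", 2: "York"}
inverted, raw = index_multi_field(docs)

# A match query on "city" looks up analyzed terms, so "york" finds both documents.
hits = sorted(inverted.get("york", set()))

# Sorting on "city.raw" orders the hits by the exact, unanalyzed value.
sorted_hits = sorted(hits, key=lambda doc_id: raw[doc_id])
print(hits, [raw[d] for d in sorted_hits])
```

The point of the sketch is that the text side never stores "New York" as a single term, which is why sorting and aggregations need the keyword side.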
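Problem solved: fields indexes the same field in different ways for different purposes. For example, a string field can be mapped as a text field for full-text search and also as a keyword field for sorting or aggregations. Here, the text field of CPU品牌 can be used for analyzed (tokenized) search, while its keyword sub-field can be used for aggregations. Test: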
GET /goods/_search
{
"size": 0,
"aggs": {
"NAME": {
"terms": {
"field": "specs.内存",
"size": 10
}
}
}
}
The search returned:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [specs.内存] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "goods",
"node": "sjredvFNT729Jrv0wvucVA",
"reason": {
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [specs.内存] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
}
}
]
},
"status": 400
}
Why does this error occur?
I found the explanation in the official documentation:
fielddata
Most fields are indexed by default, which makes them searchable. Sorting, aggregations, and accessing field values in scripts, however, requires a different access pattern from search.
Search needs to answer the question "Which documents contain this term?", while sorting and aggregations need to answer a different question: "What is the value of this field for this document?".
Most fields can use index-time, on-disk doc_values for this data access pattern, but text fields do not support doc_values.
Instead, text fields use a query-time in-memory data structure called fielddata. This data structure is built on demand the first time that a field is used for aggregations, sorting, or in a script. It is built by reading the entire inverted index for each segment from disk, inverting the term ↔ document relationship, and storing the result in memory, in the JVM heap.
Fielddata is disabled on text fields by default
Fielddata can consume a lot of heap space, especially when loading high cardinality text fields. Once fielddata has been loaded into the heap, it remains there for the lifetime of the segment. Also, loading fielddata is an expensive process which can cause users to experience latency hits. This is why fielddata is disabled by default.
If you try to sort, aggregate, or access values from a script on a text field, you will see this exception:
Fielddata is disabled on text fields by default. Set fielddata=true on [your_field_name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory.
Before enabling fielddata
Before you enable fielddata, consider why you are using a text field for aggregations, sorting, or in a script. It usually doesn't make sense to do so.
A text field is analyzed before indexing so that a value like New York can be found by searching for new or for york. A terms aggregation on this field will return a new bucket and a york bucket, when you probably want a single bucket called New York.
Instead, you should have a text field for full text searches, and an unanalyzed keyword field with doc_values enabled for aggregations, as follows:
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
Use the my_field field for searches.
Use the my_field.keyword field for aggregations, sorting, or in scripts.
Enabling fielddata on text fields
You can enable fielddata on an existing text field using the PUT mapping API as follows:
PUT my_index/_mapping/_doc
{
  "properties": {
    "my_field": {
      "type": "text",
      "fielddata": true
    }
  }
}
The mapping that you specify for my_field should consist of the existing mapping for that field, plus the fielddata parameter.
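The "uninverting" step that the documentation describes for fielddata can be pictured with a small Python sketch. This is a toy model under my own assumptions (plain dictionaries, a hypothetical uninvert function), not the actual Lucene data structures: it only shows how a term -> documents index is flipped into a document -> terms table, which then has to be held in memory.

```python
def uninvert(inverted_index):
    """Flip term -> doc ids into doc id -> terms (a toy model of fielddata)."""
    fielddata = {}
    for term, doc_ids in inverted_index.items():
        for doc_id in doc_ids:
            fielddata.setdefault(doc_id, []).append(term)
    # Sort each document's terms so the result is deterministic.
    for terms in fielddata.values():
        terms.sort()
    return fielddata


# Inverted index for two documents: 1 = "New York", 2 = "York".
inverted_index = {"new": {1}, "york": {1, 2}}

# Search asks "which documents contain this term?" and reads inverted_index.
# Aggregations ask "which terms does this document have?" and need fielddata.
fielddata = uninvert(inverted_index)
print(fielddata)
```

Every term of every document ends up in the flipped table, which gives a feel for why a high-cardinality text field is expensive to load this way.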
fielddata_frequency_filter
Fielddata filtering can be used to reduce the number of terms loaded into memory, and thus reduce memory usage. Terms can be filtered by frequency:
The frequency filter allows you to only load terms whose document frequency falls between a min and max value, which can be expressed as an absolute number (when the number is bigger than 1.0) or as a percentage (eg 0.01 is 1% and 1.0 is 100%). Frequency is calculated per segment. Percentages are based on the number of docs which have a value for the field, as opposed to all docs in the segment.
Small segments can be excluded completely by specifying the minimum number of docs that the segment should contain with min_segment_size:
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "tag": {
          "type": "text",
          "fielddata": true,
          "fielddata_frequency_filter": {
            "min": 0.001,
            "max": 0.1,
            "min_segment_size": 500
          }
        }
      }
    }
  }
}
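The min / max rule quoted above can be sketched in a few lines of Python. This is only an illustration of the rule as the documentation words it, for a single segment: the function name filter_terms_by_frequency is hypothetical, treating both bounds as inclusive is my assumption, and min_segment_size is left out because the excerpt does not spell out its exact behaviour.

```python
def filter_terms_by_frequency(inverted_index, docs_with_value, min_freq, max_freq):
    """Keep only terms whose document frequency lies between min_freq and max_freq.

    Bounds above 1.0 are absolute document counts; bounds up to 1.0 are
    fractions of the documents that have a value for the field.
    Inclusive bounds are an assumption of this sketch.
    """
    def to_count(bound):
        return bound if bound > 1.0 else bound * docs_with_value

    low, high = to_count(min_freq), to_count(max_freq)
    return {
        term: doc_ids
        for term, doc_ids in inverted_index.items()
        if low <= len(doc_ids) <= high
    }


# One segment in which 10 documents have a value for the field.
segment_index = {
    "rare": {1},                  # 10% of documents
    "mid": {1, 2, 3},             # 30% of documents
    "common": set(range(1, 11)),  # 100% of documents
}

kept = filter_terms_by_frequency(segment_index, docs_with_value=10,
                                 min_freq=0.2, max_freq=0.5)
print(sorted(kept))
```

Only the terms that survive the filter would be loaded into fielddata, so very rare and very common terms stop costing heap space.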
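In short: fielddata is disabled on text fields by default, and Elasticsearch's dynamic mapping adds a keyword sub-field for aggregations and similar operations. Why is it disabled on text fields? There are two reasons. First, fielddata consumes a lot of heap space when loading high-cardinality text fields. Second, aggregating on a text field usually makes no sense. For example:
A text field is analyzed before indexing so that a value like New York can be found by searching for new or for york. A terms aggregation on this field will return a new bucket and a york bucket, when you probably want a single bucket called New York.
Instead, you should have a text field for full text searches, and an unanalyzed keyword field with doc_values enabled for aggregations, as follows:
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
You wanted to aggregate on "New York", but because the value was analyzed at index time, the aggregation produces two buckets, new and york. To get a single bucket, you should have an unanalyzed keyword field with doc_values enabled for aggregations, and keep the text field for full-text search. That gives you the best of both.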
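The difference between the two bucket layouts can be reproduced with a short Python sketch. It is a simplified model, not Elasticsearch itself: analyze is a hypothetical stand-in for the analyzer (lowercase, split on non-word characters), and each terms aggregation is reduced to counting documents per term.

```python
import re
from collections import Counter


def analyze(value):
    """Rough stand-in for text analysis (an assumption, not the real analyzer)."""
    return [t for t in re.split(r"\W+", value.lower()) if t]


def terms_agg_on_text(values):
    """Bucket by analyzed term, counting each document once per distinct term."""
    counts = Counter()
    for value in values:
        counts.update(set(analyze(value)))
    return counts


def terms_agg_on_keyword(values):
    """Bucket by the exact, unanalyzed value."""
    return Counter(values)


values = ["New York", "New York", "York"]
text_buckets = terms_agg_on_text(values)
keyword_buckets = terms_agg_on_keyword(values)
print(dict(text_buckets))
print(dict(keyword_buckets))
```

On the analyzed side "New York" disappears as a bucket and its documents are spread over new and york, while the keyword side keeps one bucket per original value.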
Next, the same aggregation is run against the keyword sub-field, and it succeeds:
GET /goods/_search
{
"size": 0,
"aggs": {
"NAME": {
"terms": {
"field": "specs.内存.keyword",
"size": 10
}
}
}
}
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 182,
"max_score": 0,
"hits": []
},
"aggregations": {
"NAME": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "4GB",
"doc_count": 75
},
{
"key": "3GB",
"doc_count": 49
},
{
"key": "6GB",
"doc_count": 48
},
{
"key": "2GB",
"doc_count": 23
},
{
"key": "8GB",
"doc_count": 2
}
]
}
}
}
Finally, thanks to Google and Baidu for helping solve the problem.