2020-03-23 14:29 (edited) · ByteDance, Commercialization, Senior R&D Engineer


Mapping parameters and auto-mapped fields in Elasticsearch, and an investigation into why aggregations do not support the text type

When setting up the mapping for an Elasticsearch index, I used a Map-typed field:

private Map<String, Object> specs;

Checking the automatically generated mapping in Kibana, I found this:

"specs": {
  "properties": {
    "CPU品牌": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },

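rather than the usual simple types, such as:

"price": {
  "type": "long"
},
"skus": {
  "type": "keyword",
  "index": false
},

The first form and its fields were unfamiliar to me, so I went looking for material and eventually found the answer in the official documentation: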

fields

It is often useful to index the same field in different ways for different purposes. This is the purpose of multi-fields. For instance, a string field could be mapped as a text field for full-text search, and as a keyword field for sorting or aggregations:
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": { 
              "type":  "keyword"
            }
          }
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "city": "New York"
}

PUT my_index/_doc/2
{
  "city": "York"
}

GET my_index/_search
{
  "query": {
    "match": {
      "city": "york" 
    }
  },
  "sort": {
    "city.raw": "asc" 
  },
  "aggs": {
    "Cities": {
      "terms": {
        "field": "city.raw" 
      }
    }
  }
}

The city.raw field is a keyword version of the city field.

The city field can be used for full text search.

The city.raw field can be used for sorting and aggregations.

Multi-fields do not change the original _source field.

The fields setting is allowed to have different settings for fields of the same name in the same index. New multi-fields can be added to existing fields using the PUT mapping API.
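To make the difference between the two sub-fields concrete, here is a minimal Python sketch of the idea. It is not how Elasticsearch is implemented: `analyze` is a simplified stand-in for the standard analyzer, and the dictionaries `text_index` and `raw_values` are hypothetical structures that only mimic "analyzed tokens for search" versus "one exact value for sorting".

```python
import re

def analyze(value):
    """Toy stand-in for the standard analyzer: lowercase, split on non-word characters."""
    return [token for token in re.split(r"\W+", value.lower()) if token]

docs = {1: "New York", 2: "York"}

# "city" (text): an inverted index built from the analyzed tokens.
text_index = {}
for doc_id, city in docs.items():
    for token in analyze(city):
        text_index.setdefault(token, set()).add(doc_id)

# "city.raw" (keyword): the whole value is kept as a single, unanalyzed term.
raw_values = dict(docs)

def match(query):
    """Find docs containing any analyzed query token, sorted by the raw value ascending."""
    hits = set()
    for token in analyze(query):
        hits |= text_index.get(token, set())
    return sorted(hits, key=lambda doc_id: raw_values[doc_id])

print(match("york"))
```

Searching for york finds both documents through the analyzed tokens, while the ordering comes from the untouched values "New York" and "York".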

Multi-fields with multiple analyzers

Another use case of multi-fields is to analyze the same field in different ways for better relevance. For instance we could index a field with the standard analyzer which breaks text up into words, and again with the english analyzer which stems words into their root form:
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "text": { 
          "type": "text",
          "fields": {
            "english": { 
              "type":     "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}

PUT my_index/_doc/1
{ "text": "quick brown fox" } 

PUT my_index/_doc/2
{ "text": "quick brown foxes" } 

GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "quick brown foxes",
      "fields": [ 
        "text",
        "text.english"
      ],
      "type": "most_fields" 
    }
  }
}

The text field uses the standard analyzer.

The text.english field uses the english analyzer.

Index two documents, one with fox and the other with foxes.

Query both the text and text.english fields and combine the scores.

The text field contains the term fox in the first document and foxes in the second document. The text.english field contains fox for both documents, because foxes is stemmed to fox.

The query string is also analyzed by the standard analyzer for the text field, and by the english analyzer for the text.english field. The stemmed field allows a query for foxes to also match the document containing just fox. This allows us to match as many documents as possible. By also querying the unstemmed text field, we improve the relevance score of the document which matches foxes exactly.
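The scoring effect described above can be sketched in a few lines of Python. Everything here is a simplification made for illustration: `standard` and `english` are toy analyzers (the second one only strips a plural ending and is not a real stemmer), and the score is a plain count of matching terms rather than the relevance scoring Elasticsearch actually uses.

```python
def standard(text):
    """Toy stand-in for the standard analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def english(text):
    """Toy stand-in for the english analyzer: crude plural stripping, not a real stemmer."""
    stemmed = []
    for token in standard(text):
        if token.endswith("es") and len(token) > 3:
            token = token[:-2]
        elif token.endswith("s") and len(token) > 3:
            token = token[:-1]
        stemmed.append(token)
    return stemmed

docs = {1: "quick brown fox", 2: "quick brown foxes"}

def field_score(analyzer, query, doc_text):
    """Count query terms found in the document when both pass through the same analyzer."""
    doc_terms = set(analyzer(doc_text))
    return sum(1 for term in analyzer(query) if term in doc_terms)

def most_fields(query):
    """Add up the per-field scores, mirroring the idea behind type=most_fields."""
    return {
        doc_id: field_score(standard, query, text) + field_score(english, query, text)
        for doc_id, text in docs.items()
    }

print(most_fields("quick brown foxes"))
```

Both documents match through the stemmed field, but document 2 also matches foxes in the unstemmed field, so its combined score is higher.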

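Problem solved: fields indexes the same field in different ways for different purposes. For example, a string field can be mapped as a text field for full-text search and also as a keyword field for sorting or aggregations. In my case, the text field CPU品牌 can be used for analyzed search, while its keyword sub-field can be used for aggregations. A test: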

GET /goods/_search
{
  "size": 0,
  "aggs": {
    "NAME": {
      "terms": {
        "field": "specs.内存",
        "size": 10
      }
    }
  }
}

The search returned:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [specs.内存] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "goods",
        "node": "sjredvFNT729Jrv0wvucVA",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [specs.内存] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
        }
      }
    ]
  },
  "status": 400
}

Why is this error reported?

I found the explanation in the official documentation:

fielddata

Most fields are indexed by default, which makes them searchable. Sorting, aggregations, and accessing field values in scripts, however, requires a different access pattern from search.

Search needs to answer the question "Which documents contain this term?", while sorting and aggregations need to answer a different question: "What is the value of this field for this document?".

Most fields can use index-time, on-disk doc_values for this data access pattern, but text fields do not support doc_values.

Instead, text fields use a query-time in-memory data structure called fielddata. This data structure is built on demand the first time that a field is used for aggregations, sorting, or in a script. It is built by reading the entire inverted index for each segment from disk, inverting the term ↔︎ document relationship, and storing the result in memory, in the JVM heap.
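The "uninverting" step can be pictured with a small Python sketch. The `inverted` dictionary is a hypothetical in-memory stand-in for a segment's inverted index; the real structure is on disk and far more compact, so this only shows the direction of the transformation.

```python
from collections import defaultdict

# Inverted index: term -> ids of the documents that contain it (what search uses).
inverted = {"new": [1], "york": [1, 2]}

def uninvert(inverted_index):
    """Flip term -> docs into doc -> terms, which is what sorting and aggregations need."""
    per_doc = defaultdict(list)
    for term in sorted(inverted_index):
        for doc_id in inverted_index[term]:
            per_doc[doc_id].append(term)
    return dict(per_doc)

fielddata = uninvert(inverted)
print(fielddata)
```

Building `fielddata` requires walking every term of the index, which is why the documentation describes loading it as expensive and memory-hungry.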

Fielddata is disabled on text fields by default

Fielddata can consume a lot of heap space, especially when loading high cardinality text fields. Once fielddata has been loaded into the heap, it remains there for the lifetime of the segment. Also, loading fielddata is an expensive process which can cause users to experience latency hits. This is why fielddata is disabled by default.

If you try to sort, aggregate, or access values from a script on a text field, you will see this exception:

Fielddata is disabled on text fields by default. Set fielddata=true on [your_field_name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory.

Before enabling fielddata

Before you enable fielddata, consider why you are using a text field for aggregations, sorting, or in a script. It usually doesn’t make sense to do so.

A text field is analyzed before indexing so that a value like New York can be found by searching for new or for york. A terms aggregation on this field will return a new bucket and a york bucket, when you probably want a single bucket called New York.

Instead, you should have a text field for full text searches, and an unanalyzed keyword field with doc_values enabled for aggregations, as follows:
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": { 
          "type": "text",
          "fields": {
            "keyword": { 
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

Use the my_field field for searches.

Use the my_field.keyword field for aggregations, sorting, or in scripts.
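The bucket behaviour described in this section can be reproduced with a small Python sketch. It is only an illustration: `analyze` is a toy tokenizer and `terms_agg` is a hypothetical helper that counts documents per term, not the Elasticsearch terms aggregation itself.

```python
from collections import Counter

def analyze(value):
    """Toy stand-in for text analysis: lowercase and split on whitespace."""
    return value.lower().split()

def terms_agg(values_per_doc):
    """Count, for every distinct term, how many documents contain it."""
    counts = Counter()
    for values in values_per_doc:
        counts.update(set(values))
    return dict(counts)

cities = ["New York", "York"]

# Aggregating on the analyzed text field: one bucket per token.
text_buckets = terms_agg(analyze(city) for city in cities)

# Aggregating on the keyword sub-field: one bucket per whole value.
keyword_buckets = terms_agg([city] for city in cities)

print(text_buckets)
print(keyword_buckets)
```

The analyzed field yields a new bucket and a york bucket, while the keyword field yields the New York and York buckets that a reader would normally expect.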

Enabling fielddata on text fields

You can enable fielddata on an existing text field using the PUT mapping API as follows:
PUT my_index/_mapping/_doc
{
  "properties": {
    "my_field": { 
      "type":     "text",
      "fielddata": true
    }
  }
}

The mapping that you specify for my_field should consist of the existing mapping for that field, plus the fielddata parameter.
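As a rough illustration of "existing mapping plus the fielddata parameter", the request body can be assembled by copying the field's current mapping and adding one key. This is a sketch only: `with_fielddata` is a hypothetical helper, the `existing` mapping is an assumed example, and sending the body to the cluster is left out.

```python
import copy
import json

def with_fielddata(existing_field_mapping):
    """Return a copy of the field's current mapping with fielddata switched on."""
    updated = copy.deepcopy(existing_field_mapping)
    updated["fielddata"] = True
    return updated

# Assumed current mapping of my_field, including a keyword sub-field.
existing = {"type": "text", "fields": {"keyword": {"type": "keyword"}}}

# Body for the PUT mapping request: everything that was there before, plus fielddata.
request_body = {"properties": {"my_field": with_fielddata(existing)}}
print(json.dumps(request_body, indent=2))
```

Copying first keeps sub-fields such as keyword in the request, so they are not lost from the definition that is sent.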

fielddata_frequency_filter

Fielddata filtering can be used to reduce the number of terms loaded into memory, and thus reduce memory usage. Terms can be filtered by frequency:

The frequency filter allows you to only load terms whose document frequency falls between a min and max value, which can be expressed as an absolute number (when the number is bigger than 1.0) or as a percentage (e.g. 0.01 is 1% and 1.0 is 100%). Frequency is calculated per segment. Percentages are based on the number of docs which have a value for the field, as opposed to all docs in the segment.

Small segments can be excluded completely by specifying the minimum number of docs that the segment should contain with min_segment_size:
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "tag": {
          "type": "text",
          "fielddata": true,
          "fielddata_frequency_filter": {
            "min": 0.001,
            "max": 0.1,
            "min_segment_size": 500
          }
        }
      }
    }
  }
}
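The filtering rule can be sketched in Python using the values from the mapping above. Two points are assumptions on my part rather than statements from the documentation: the bounds are treated as inclusive, and "excluded" is read as "the filter is skipped for segments below min_segment_size, so all their terms are loaded". The function names and the sample term counts are hypothetical.

```python
def threshold(value, docs_with_field):
    """Values above 1.0 are absolute counts; otherwise a fraction of docs that have the field."""
    return value if value > 1.0 else value * docs_with_field

def filter_terms(doc_freqs, segment_docs, docs_with_field,
                 min_value=0.001, max_value=0.1, min_segment_size=500):
    """Return the terms of one segment that would be loaded into fielddata."""
    if segment_docs < min_segment_size:
        # Assumption: small segments bypass the frequency filter entirely.
        return set(doc_freqs)
    low = threshold(min_value, docs_with_field)
    high = threshold(max_value, docs_with_field)
    # Assumption: both bounds are inclusive.
    return {term for term, freq in doc_freqs.items() if low <= freq <= high}

# Hypothetical document frequencies for one segment of 2000 docs.
freqs = {"the": 1800, "elasticsearch": 80, "xyzzy": 1}

print(filter_terms(freqs, segment_docs=2000, docs_with_field=2000))
print(filter_terms(freqs, segment_docs=100, docs_with_field=100))
```

With 2000 documents the bounds work out to roughly 2 and 200, so the very common term and the very rare term are dropped and only elasticsearch is loaded.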

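Put simply: fielddata is disabled on text fields by default, and Elasticsearch's automatic mapping supplies a keyword field for aggregations and similar operations. Why is it disabled on text fields? There are two reasons. First, fielddata consumes a lot of heap space when it loads high-cardinality text fields. Second, aggregating on a text field usually does not make sense. As the documentation puts it: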

A text field is analyzed before indexing so that a value like New York can be found by searching for new or for york. A terms aggregation on this field will return a new bucket and a york bucket, when you probably want a single bucket called New York.

Instead, you should have a text field for full text searches, and an unanalyzed keyword field with doc_values enabled for aggregations, as follows:
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": { 
          "type": "text",
          "fields": {
            "keyword": { 
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

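You wanted to aggregate on "New York", but the value was tokenized at index time, so the aggregation returns two buckets, new and york. To get a single bucket, you should have an unanalyzed keyword field with doc_values enabled for aggregations and keep the text field for full-text search, which gives you the best of both.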
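The keyword sub-field used in the next test was never declared by hand; it comes from the automatic mapping of string values seen at the top of this post. Here is a simplified Python sketch of that rule as I understand the defaults. `dynamic_mapping` is a hypothetical function, and it ignores features such as date detection and custom dynamic templates.

```python
def dynamic_mapping(value):
    """Guess a mapping for a JSON value the way default dynamic mapping roughly does."""
    if isinstance(value, dict):
        return {"properties": {key: dynamic_mapping(val) for key, val in value.items()}}
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        # Strings become text with a keyword sub-field for sorting and aggregations.
        return {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        }
    raise TypeError(f"unsupported value type: {type(value).__name__}")

# A Map<String, Object> such as specs arrives as a JSON object with string values.
specs = {"内存": "4GB"}
print(dynamic_mapping(specs))
```

This is why specs.内存 is a text field while specs.内存.keyword is available for aggregations.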

Below, the aggregation is run against the keyword field, and the test succeeds:

GET /goods/_search
{
  "size": 0,
  "aggs": {
    "NAME": {
      "terms": {
        "field": "specs.内存.keyword",
        "size": 10
      }
    }
  }
}

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 182,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "NAME": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "4GB",
          "doc_count": 75
        },
        {
          "key": "3GB",
          "doc_count": 49
        },
        {
          "key": "6GB",
          "doc_count": 48
        },
        {
          "key": "2GB",
          "doc_count": 23
        },
        {
          "key": "8GB",
          "doc_count": 2
        }
      ]
    }
  }
}

Finally, thanks to Google and Baidu for helping me solve the problem.
