IMPORTANT: This documentation is no longer updated. Refer to Elastic's version policy and the latest documentation.

Sampler Aggregation

This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.

A filtering aggregation that limits the processing of any sub-aggregations to a sample of the top-scoring documents.

Example use cases:

  • Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
  • Reducing the running cost of aggregations that can produce useful results from a sample alone, e.g. significant_terms

Example:

{
    "query": {
        "match": {
            "text": "iphone"
        }
    },
    "aggs": {
        "sample": {
            "sampler": {
                "shard_size": 200
            },
            "aggs": {
                "keywords": {
                    "significant_terms": {
                        "field": "text"
                    }
                }
            }
        }
    }
}

Response:

{
    ...
    "aggregations": {
        "sample": {
            "doc_count": 1000,
            "keywords": {
                "doc_count": 1000,
                "buckets": [
                    ...
                    {
                        "key": "bend",
                        "doc_count": 58,
                        "score": 37.982536582524276,
                        "bg_count": 103
                    },
                    ...
}

In total, 1000 documents were sampled because we asked for a maximum of 200 per shard from an index with 5 shards. The cost of performing the nested significant_terms aggregation was therefore limited rather than unbounded.
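
The arithmetic behind that total can be sketched in a few lines of Python. The function name max_sampled_docs is hypothetical and used only for illustration; the result is an upper bound, since a shard that matches fewer documents than shard_size contributes fewer.

```python
def max_sampled_docs(shard_size: int, num_shards: int) -> int:
    """Upper bound on the documents the sampler passes to its sub-aggregations.

    Hypothetical helper for illustration only; not part of any Elasticsearch API.
    """
    return shard_size * num_shards


# The example above: shard_size of 200 on an index with 5 shards.
print(max_sampled_docs(200, 5))  # 1000
```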

shard_size

The shard_size parameter limits how many top-scoring documents are collected in the sample processed on each shard. The default value is 100.
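
As a rough conceptual illustration of what shard_size does on a single shard, the following Python sketch keeps only the highest-scoring documents from a list of (doc_id, score) pairs. It is not how Elasticsearch implements the sampler; the function name sample_shard and the toy scores are assumptions made for this example.

```python
import heapq

DEFAULT_SHARD_SIZE = 100  # the default value stated above


def sample_shard(scored_docs, shard_size=DEFAULT_SHARD_SIZE):
    """Keep at most shard_size (doc_id, score) pairs with the highest scores.

    Conceptual stand-in for per-shard sampling; hypothetical helper only.
    """
    return heapq.nlargest(shard_size, scored_docs, key=lambda doc: doc[1])


# Toy data: four matching documents on one shard, with made-up scores.
shard = [("a", 0.2), ("b", 1.7), ("c", 0.9), ("d", 1.1)]
top = sample_shard(shard, shard_size=2)
print([doc_id for doc_id, _ in top])  # ['b', 'd']
```

Only the documents retained this way on each shard would be visible to the sub-aggregations.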

Limitations

Cannot be nested under breadth_first aggregations

Being a quality-based filter, the sampler aggregation needs access to the relevance score produced for each document. It therefore cannot be nested under a terms aggregation whose collect_mode has been switched from the default depth_first mode to breadth_first, because that mode discards scores. In this situation an error is thrown.
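
One way to catch this before sending a request is a small client-side check over the aggregation tree. The Python sketch below is a hypothetical helper, not part of Elasticsearch or its clients. It assumes the request's aggregations are available as nested dictionaries, and the field name "tag" in the example is invented.

```python
def find_invalid_samplers(aggs, under_breadth_first=False, path=()):
    """Return paths of sampler aggs nested under a breadth_first terms agg.

    Hypothetical client-side check; assumes a well-formed aggregation tree.
    """
    problems = []
    for name, body in aggs.items():
        here = path + (name,)
        if "sampler" in body and under_breadth_first:
            problems.append("/".join(here))
        # A terms aggregation in breadth_first mode affects everything below it.
        is_breadth_first = body.get("terms", {}).get("collect_mode") == "breadth_first"
        # Sub-aggregations may be declared under "aggs" or "aggregations".
        children = body.get("aggs") or body.get("aggregations") or {}
        problems.extend(
            find_invalid_samplers(children, under_breadth_first or is_breadth_first, here)
        )
    return problems


# A sampler nested under a terms aggregation that uses breadth_first.
request_aggs = {
    "tags": {
        "terms": {"field": "tag", "collect_mode": "breadth_first"},
        "aggs": {
            "sample": {
                "sampler": {"shard_size": 200},
                "aggs": {"keywords": {"significant_terms": {"field": "text"}}},
            }
        },
    }
}
print(find_invalid_samplers(request_aggs))  # ['tags/sample']
```

A sampler at the top level, as in the first example on this page, produces an empty list.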