ICU Tokenizer
Tokenizes text into words on word boundaries, as defined in
UAX #29: Unicode Text Segmentation.
It behaves much like the standard
tokenizer,
but adds better support for some Asian languages by using a dictionary-based
approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and
using custom rules to break Myanmar and Khmer text into syllables.
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
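To try the tokenizer against the index created above, you can send an _analyze request like the following sketch. The Thai sample phrase is an illustrative assumption, not output taken from the original documentation:

GET icu_sample/_analyze
{
  "analyzer": "my_icu_analyzer",
  "text": "สวัสดีครับ"
}

Because Thai is written without spaces between words, the standard tokenizer would typically emit such a run of characters as a single token, whereas the icu_tokenizer's dictionary-based segmentation should split it into individual words.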