
Compound Word Token Filter


Token filters that decompose compound words. Two types are available: dictionary_decompounder and hyphenation_decompounder. The dictionary_decompounder matches subwords by brute-force lookup of a token's substrings in the word list, while the hyphenation_decompounder first applies hyphenation patterns to find candidate subwords and then checks those candidates against the word list.
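For illustration, here is a minimal sketch of a dictionary_decompounder; the filter name german_decompounder and the word list are made up for this example:

    filter:
        german_decompounder:
            type: dictionary_decompounder
            word_list: [Donau, dampf, schiff]

Given the token Donaudampfschiff, this filter keeps the original token and additionally emits the subword tokens Donau, dampf, and schiff at the same position.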

The following settings can be configured for a compound word token filter:

Setting                      Description

word_list                    A list of words to use.

word_list_path               A path (either relative to the config location, or absolute) to a list of words.

hyphenation_patterns_path    A path (either relative to the config location, or absolute) to a FOP XML
                             hyphenation pattern file (see http://offo.sourceforge.net/hyphenation/).
                             Required for the hyphenation_decompounder.

min_word_size                Minimum word size (integer). Defaults to 5.

min_subword_size             Minimum subword size (integer). Defaults to 2.

max_subword_size             Maximum subword size (integer). Defaults to 15.

only_longest_match           Whether to keep only the longest matching subword (Boolean). Defaults to false.
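The only_longest_match setting matters when dictionary entries overlap. A minimal sketch, with a made-up filter name and an overlapping word list:

    filter:
        myDecompounder:
            type: dictionary_decompounder
            word_list: [dampf, dampfschiff, schiff]
            only_longest_match: true

For the token Donaudampfschiff, both dampf and dampfschiff match at the same starting offset; with only_longest_match enabled, only the longer subword dampfschiff is emitted for that offset (schiff still matches, and is emitted, at its own offset).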

Here is a full example configuration:

index:
    analysis:
        analyzer:
            myAnalyzer2:
                type: custom
                tokenizer: standard
                filter: [myTokenFilter1, myTokenFilter2]
        filter:
            myTokenFilter1:
                type: dictionary_decompounder
                word_list: [one, two, three]
            myTokenFilter2:
                type: hyphenation_decompounder
                # required for hyphenation_decompounder; the path below is a placeholder
                hyphenation_patterns_path: path/to/hyphenation_patterns.xml
                word_list_path: path/to/words.txt
                max_subword_size: 22
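To verify the behavior, the analyzer can be exercised with the analyze API. A sketch, assuming the settings above were applied to an index named myindex (a made-up name); the query-string form shown matches Elasticsearch versions of this document's era:

    curl -XGET 'localhost:9200/myindex/_analyze?analyzer=myAnalyzer2&text=Donaudampfschiff'

The response lists the tokens the analyzer produced, which makes it easy to confirm that compound words are being split as expected.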