[[analysis-pattern-analyzer]]
=== Pattern Analyzer

An analyzer of type `pattern` that splits text into terms using a
regular expression. It accepts the following settings:

[cols="<,<",options="header",]
|===================================================================
|Setting |Description
|`lowercase` |Whether terms should be lowercased. Defaults to `true`.
|`pattern` |The regular expression pattern. Defaults to `\W+`.
|`flags` |The regular expression flags.
|`stopwords` |A list of stopwords to initialize the stop filter with.
Defaults to an 'empty' stopword list added[1.0.0.RC1, Previously 
defaulted to the English stopwords list]. Check
<<analysis-stop-analyzer,Stop Analyzer>> for more details.
|===================================================================

*IMPORTANT*: The regular expression should match the *token separators*,
not the tokens themselves.
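
The separator rule can be tried outside Elasticsearch with plain
`java.util.regex`. The following is only a rough sketch of the settings
described above, not the analyzer's actual implementation: the class
`PatternSplitSketch` and its `analyze` method are hypothetical names, and
the assumption is that stopwords are removed after lowercasing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Illustrative only: approximates the `pattern`, `lowercase` and `stopwords`
// settings with java.util.regex. Not the real analyzer code.
class PatternSplitSketch {

    // Splits `text` on every match of `separator`, optionally lowercases
    // each piece, and drops pieces found in `stopwords`.
    static List<String> analyze(String text, String separator,
                                boolean lowercase, Set<String> stopwords) {
        List<String> terms = new ArrayList<>();
        for (String piece : Pattern.compile(separator).split(text)) {
            if (piece.isEmpty()) {
                continue; // a separator at the very start yields an empty piece
            }
            String term = lowercase ? piece.toLowerCase(Locale.ROOT) : piece;
            if (!stopwords.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> none = new HashSet<>();
        // The pattern describes the separators, so the words survive.
        System.out.println(analyze("Foo,Bar baz", "\\W+", true, none)); // [foo, bar, baz]
        // A pattern that matches the words themselves leaves only the punctuation.
        System.out.println(analyze("Foo,Bar baz", "\\w+", true, none));
        // Stopwords are compared against the lowercased term (assumption).
        Set<String> stop = new HashSet<>(Arrays.asList("the"));
        System.out.println(analyze("The quick fox", "\\W+", true, stop)); // [quick, fox]
    }
}
```

The second call shows the common mistake: a pattern written to match the
tokens throws the words away and keeps only what lies between them.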

Flags should be pipe-separated, e.g. `"CASE_INSENSITIVE|COMMENTS"`. See the
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary[Java
Pattern API] for details about the available `flags` options.
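
As a rough illustration of how a pipe-separated flags string relates to
the Java constants, the sketch below turns the names into the bitmask
that `Pattern.compile` expects. The helper `parseFlags` is hypothetical
and covers only a subset of the constants; it is not the code
Elasticsearch uses.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Illustrative only: maps flag names onto java.util.regex.Pattern constants.
class FlagsSketch {

    // Hypothetical helper: "CASE_INSENSITIVE|COMMENTS" -> combined int bitmask.
    static int parseFlags(String flags) {
        int mask = 0;
        for (String raw : flags.split("\\|")) {
            String name = raw.trim();
            if (name.isEmpty()) {
                continue; // tolerate an empty flags string
            }
            switch (name) {
                case "CASE_INSENSITIVE": mask |= Pattern.CASE_INSENSITIVE; break;
                case "COMMENTS":         mask |= Pattern.COMMENTS;         break;
                case "MULTILINE":        mask |= Pattern.MULTILINE;        break;
                case "DOTALL":           mask |= Pattern.DOTALL;           break;
                case "UNICODE_CASE":     mask |= Pattern.UNICODE_CASE;     break;
                case "CANON_EQ":         mask |= Pattern.CANON_EQ;         break;
                case "UNIX_LINES":       mask |= Pattern.UNIX_LINES;       break;
                case "LITERAL":          mask |= Pattern.LITERAL;          break;
                default:
                    throw new IllegalArgumentException("Unknown regex flag: " + name);
            }
        }
        return mask;
    }

    public static void main(String[] args) {
        int mask = parseFlags("CASE_INSENSITIVE|COMMENTS");
        // COMMENTS lets the pattern carry whitespace and a trailing comment;
        // CASE_INSENSITIVE makes "x+" match "XxX" as well.
        Pattern separator = Pattern.compile("x+  # one or more x", mask);
        System.out.println(Arrays.toString(separator.split("fooXxXbar"))); // [foo, bar]
    }
}
```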

[float]
==== Pattern Analyzer Examples

To try out these examples, delete the `test` index before running each
one:

[source,js]
--------------------------------------------------
    curl -XDELETE localhost:9200/test
--------------------------------------------------

[float]
===== Whitespace tokenizer

[source,js]
--------------------------------------------------
    curl -XPUT 'localhost:9200/test' -d '
    {
        "settings":{
            "analysis": {
                "analyzer": {
                    "whitespace":{
                        "type": "pattern",
                        "pattern":"\\s+"
                    }
                }
            }
        }
    }'

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=whitespace' -d 'foo,bar baz'
    # "foo,bar", "baz"
--------------------------------------------------

[float]
===== Non-word character tokenizer

[source,js]
--------------------------------------------------

    curl -XPUT 'localhost:9200/test' -d '
    {
        "settings":{
            "analysis": {
                "analyzer": {
                    "nonword":{
                        "type": "pattern",
                        "pattern":"[^\\w]+"
                    }
                }
            }
        }
    }'

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'foo,bar baz'
    # "foo,bar baz" becomes "foo", "bar", "baz"

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'type_1-type_4'
    # "type_1","type_4"
--------------------------------------------------

[float]
===== CamelCase tokenizer

[source,js]
--------------------------------------------------

    curl -XPUT 'localhost:9200/test?pretty=1' -d '
    {
        "settings":{
            "analysis": {
                "analyzer": {
                    "camel":{
                        "type": "pattern",
                        "pattern":"([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
                    }
                }
            }
        }
    }'

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=camel' -d '
        MooseX::FTPClass2_beta
    '
    # "moose","x","ftp","class","2","beta"
--------------------------------------------------

The regex above is easier to understand as:

[source,js]
--------------------------------------------------

      ([^\p{L}\d]+)                 # swallow non letters and numbers,
    | (?<=\D)(?=\d)                 # or non-number followed by number,
    | (?<=\d)(?=\D)                 # or number followed by non-number,
    | (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
      (?=\p{Lu})                    #   followed by upper case,
    | (?<=\p{Lu})                   # or upper case
      (?=\p{Lu}                     #   followed by upper case
        [\p{L}&&[^\p{Lu}]]          #   then lower case
      )
--------------------------------------------------
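
To experiment with the CamelCase expression without an index, it can be
run through `java.util.regex` directly. This is a minimal sketch, assuming
that splitting on the pattern and lowercasing each piece approximates what
the analyzer does; `CamelCaseSketch` and `terms` are hypothetical names.
Note that the doubled backslashes here are Java string escaping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Illustrative only: applies the CamelCase separator pattern with Pattern.split.
class CamelCaseSketch {

    static final Pattern CAMEL = Pattern.compile(
          "([^\\p{L}\\d]+)"                                   // non letters and numbers
        + "|(?<=\\D)(?=\\d)"                                  // non-number then number
        + "|(?<=\\d)(?=\\D)"                                  // number then non-number
        + "|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})"             // lower case then upper case
        + "|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])");    // upper, then upper + lower

    // Splits on the pattern, skips empty pieces, lowercases the rest.
    static List<String> terms(String text) {
        List<String> out = new ArrayList<>();
        for (String piece : CAMEL.split(text)) {
            if (!piece.isEmpty()) {
                out.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(terms("MooseX::FTPClass2_beta"));
        // [moose, x, ftp, class, 2, beta]
    }
}
```

Four of the five alternatives are zero-width lookarounds, so they mark a
boundary without consuming any characters; only the first alternative
removes text from the input.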