docs/reference/mapping/types/attachment-type.asciidoc


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90

[[mapping-attachment-type]]
=== Attachment Type

The `attachment` type allows to index different "attachment" type field
(encoded as `base64`), for example, Microsoft Office formats, open
document formats, ePub, HTML, and so on (full list can be found
http://lucene.apache.org/tika/0.10/formats.html[here]).

The `attachment` type is provided as a
https://github.com/elasticsearch/elasticsearch-mapper-attachments[plugin
extension]. The plugin is a simple zip file that can be downloaded and
placed under `$ES_HOME/plugins` location. It will be automatically
detected and the `attachment` type will be added.

Note, the `attachment` type is experimental.

Using the attachment type is simple, in your mapping JSON, simply set a
certain JSON element as attachment, for example:

[source,js]
--------------------------------------------------
{
    "person" : {
        "properties" : {
            "my_attachment" : { "type" : "attachment" }
        }
    }
}
--------------------------------------------------

In this case, the JSON to index can be:

[source,js]
--------------------------------------------------
{
    "my_attachment" : "... base64 encoded attachment ..."
}
--------------------------------------------------

Or it is possible to use more elaborated JSON if content type or
resource name need to be set explicitly:

[source,js]
--------------------------------------------------
{
    "my_attachment" : {
        "_content_type" : "application/pdf",
        "_name" : "resource/name/of/my.pdf",
        "content" : "... base64 encoded attachment ..."
    }
}
--------------------------------------------------

The `attachment` type not only indexes the content of the doc, but also
automatically adds meta data on the attachment as well (when available).
The metadata supported are: `date`, `title`, `author`, and `keywords`.
They can be queried using the "dot notation", for example:
`my_attachment.author`.

Both the meta data and the actual content are simple core type mappers
(string, date, ...), thus, they can be controlled in the mappings. For
example:

[source,js]
--------------------------------------------------
{
    "person" : {
        "properties" : {
            "file" : {
                "type" : "attachment",
                "fields" : {
                    "file" : {"index" : "no"},
                    "date" : {"store" : true},
                    "author" : {"analyzer" : "myAnalyzer"}
                }
            }
        }
    }
}
--------------------------------------------------

In the above example, the actual content indexed is mapped under
`fields` name `file`, and we decide not to index it, so it will only be
available in the `_all` field. The other fields map to their respective
metadata names, but there is no need to specify the `type` (like
`string` or `date`) since it is already known.

The plugin uses http://lucene.apache.org/tika/[Apache Tika] to parse
attachments, so many formats are supported, listed
http://lucene.apache.org/tika/0.10/formats.html[here].