1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head>
<meta http-equiv="Content-Language" content="en">
<title>Fix invalid UTF-8 Sequences (mmutf8fix)</title></head>
<body>
<a href="rsyslog_conf_modules.html">back</a>
<h1>Fix invalid UTF-8 Sequences (mmutf8fix)</h1>
<p><b>Module Name: mmutf8fix</b></p>
<p><b>Author: </b>Rainer Gerhards <rgerhards@adiscon.com></p>
<p><b>Available since</b>: 7.5.4</p>
<p><b>Description</b>:</p>
<p>The mmutf8fix module permits to fix invalid UTF-8 sequences.
Most often, such invalid sequences result from syslog sources sending
in non-UTF character sets, e.g. ISO 8859. As syslog does not have a way
to convey the character set information, these sequences are not properly
handled. While they are typically uncritical with plain text files, they can
cause big headache with database sources as well as systems like ElasticSearch.
<p>The module supports different "fixing" modes and fixes. The current
implementation will always replace invalid bytes with a single US ASCII
character. Additional replacement modes will probably be added in the future,
depending on user demand. In the longer term
it could also be evolved into an any-charset-to-UTF8 converter. But
first let's see if it really gets into widespread enough use.
<p><b>Proper Usage</b>:</p>
<p>Some notes are due for proper use of this module. This is a message modification
module utilizing the action interface, which means you call it like an action.
This gives great flexibility on the question on when and how to call this module.
Note that once it has been called, it actually modifies the message. The original
messsage is then no longer available. However, this does <b>not</b> change any
properties set, used or extracted before the modification is done.
<p>One potential use case is to normalize all messages. This is done by simply calling
mmutf8fix right in front of all other actions.
<p>If only a specific source (or set of sources) is known to cause problems,
mmutf8fix can be conditionally called only on messages from them. This also offers
performance benefits. If such multiple sources exists, it probably is a good idea
to define different listeners for their incoming traffic, bind them to specific
<a href="multi_ruleset.html">ruleset</a> and call mmutf8fix as first action in this
ruleset.
<p><b>Module Configuration Parameters</b>:</p>
<p>Currently none.
<p> </p>
<p><b>Action Confguration Parameters</b>:</p>
<ul>
<li><b>mode</b> - <b>utf-8</b>/controlcharacters<br>
This sets the basic detection mode.
<br>In <b>utf-8</b> mode (the default), proper
UTF-8 encoding is checked and bytes which are not proper UTF-8 sequences
are acted on. If a proper multi-byte start sequence byte is detected but
any of the following bytes is invalid, the whole sequence is replaced by
the replacement method. This mode is most useful with non-US-ASCII character
sets, which validly includes multibyte sequences. Note that in this mode
control characters are NOT being replaced, because they are valid UTF-8.
<br>In <b>controlcharacters</b> mode, all bytes which do not represent a
printable US-ASCII character (codes 32 to 126) are replaced. Note that this
also mangles valid UTF-8 multi-byte sequences, as these are (deliberately) outside
of that character range. This mode is most useful if it is known that no
characters outside of the US-ASCII alphabet need to be processed.
<li><b>replacementChar</b> - default " " (space), a single character<br>
This is the character that invalid sequences are replaced by. Currently, it
MUST be a <b>printable</b> US-ASCII character.
</ul>
<p><b>Caveats/Known Bugs:</b>
<ul>
<li>overlong UTF-8 encodings are currently not detected in utf-8 mode.
</ul>
<p><b>Samples:</b></p>
<p>In this snippet, we write one file without fixing UTF-8 and another one
with the message fixed. Note that once mmutf8fix has run, access to the
original message is no longer possible.
<p><textarea rows="5" cols="60">module(load="mmutf8fix")
action(type="omfile" file="/path/to/non-fixed.log")
action(type="mmutf8fix")
action(type="omfile" file="/path/to/fixed.log")
</textarea>
<p>In this sample, we fix only message originating from host 10.0.0.1.
<p><textarea rows="5" cols="60">module(load="mmutf8fix")
if $fromhost-ip == "10.0.0.1" then
action(type="mmutf8fix")
# all other actions here...
</textarea>
<p>This is mostly the same as the previous sample, but uses "controlcharacters"
processing mode.
<p><textarea rows="5" cols="60">module(load="mmutf8fix")
if $fromhost-ip == "10.0.0.1" then
action(type="mmutf8fix" mode="controlcharacters")
# all other actions here...
</textarea>
<p>[<a href="rsyslog_conf.html">rsyslog.conf overview</a>] [<a href="manual.html">manual
index</a>] [<a href="http://www.rsyslog.com/">rsyslog site</a>]</p>
<p><font size="2">This documentation is part of the
<a href="http://www.rsyslog.com/">rsyslog</a> project.<br>
Copyright © 2013 by <a href="http://www.gerhards.net/rainer">Rainer Gerhards</a> and
<a href="http://www.adiscon.com/">Adiscon</a>. Released under the GNU GPL
version 3 or higher.</font></p>
</body></html>
|