author | Michael Biebl <biebl@debian.org> | 2014-04-03 03:08:50 +0200 |
---|---|---|
committer | Michael Biebl <biebl@debian.org> | 2014-04-03 03:08:50 +0200 |
commit | 9374a46543e9c43c009f80def8c3b2506b0b377e (patch) | |
tree | 8853fd40ee8d55ff24304ff8a4421640f3493c58 /doc/messageparser.html | |
parent | 209e193f14ec562df5aad945f04cd88b227cc602 (diff) | |
download | rsyslog-9374a46543e9c43c009f80def8c3b2506b0b377e.tar.gz |
Imported Upstream version 8.2.0 upstream/8.2.0
Diffstat (limited to 'doc/messageparser.html')
-rw-r--r-- | doc/messageparser.html | 222 |
1 files changed, 0 insertions, 222 deletions
diff --git a/doc/messageparser.html b/doc/messageparser.html deleted file mode 100644 index d22021d..0000000 --- a/doc/messageparser.html +++ /dev/null @@ -1,222 +0,0 @@ -<html> -<head> -<title>Message parsers in rsyslog</title> -</head> -<body> -<a href="manual.html">rsyslog documentation</a> - -<h1>Message parsers in rsyslog</h1> -<p><small><i>Written by <a href="http://www.gerhards.net/rainer">Rainer Gerhards</a> -(2009-11-06)</i></small></p> -<h2>Intro</h2> -<p>Message parsers are a feature of rsyslog 5.3.4 and above. In this article, I describe what -message parsers are, what they can do and how they relate to the relevant standards. I will -also describe what you can not do with them. Finally, I give some advice on implementing your -own custom parser. - -<h2>What are message parsers?</h2> -<p>Well, the quick answer is that message parsers are the component of rsyslog that -parses the syslog message after it has been received. Prior to rsyslog 5.3.4, message parsers -were built into the rsyslog core itself and could not be modified (other than by modifying -the rsyslog code). -<p>In 5.3.4, we changed that: message parsers are now loadable modules (just -like input and output modules). That means that new message parsers can be added without -modifying the rsyslog core, even without contributing something back to the -project. -<p>But that doesn't answer what a message parser really is. What does it mean to "parse a -message" and, maybe more importantly, what is a message? To answer these questions correctly, -we need to dig down into the relevant standards. -<a href="http://tools.ietf.org/html/rfc5424">RFC5424</a> specifies a layered architecture -for the syslog protocol: -<p align="center"><img src="rfc5424layers.png" alt="RFC5424 syslog protocol layers"> -<p>What is important for us is the distinction between the syslog transport and the upper layers. 
-The transport layer specifies how a stream of messages is assembled at the sender side and -how this stream of messages is disassembled into the individual messages at the receiver -side. In networking terminology, this is called "framing". The core idea is that -each message is put into a so-called "frame", which then is transmitted over the communications -link. -<p>The framing used depends on the protocol. For example, in UDP the "frame"-equivalent is -a packet that is being sent (this also means that no two messages can travel within a single -UDP packet). In "plain tcp syslog", the industry standard, LF is used as a frame delimiter -(which also means that no multi-line message can properly be transmitted, a "design" flaw -in plain tcp syslog). In <a href="http://tools.ietf.org/html/rfc5425">RFC5425</a> there is -a header in front of each frame that contains the size of the message. With this framing, -any message content can properly be transferred. -<p>And now comes the important part: <b>message parsers do NOT operate at the transport -layer</b>, they operate, as their name implies, on messages. So we can not use message -parsers to change the underlying framing. For example, if a sender splits (for whatever -reason) a single message into two and encapsulates these into two frames, there is no way -a message parser could undo that. -<p>A typical example may be a multi-line message: let's assume some originator has generated -a message of the format "A\nB" (where \n means LF). If that message is being transmitted -via plain tcp syslog, the frame delimiter is LF. So the sender will delimit the frame with -LF, but otherwise send the message unmodified onto the wire (because that is how things are --unfortunately- done in plain tcp syslog...). So the wire will see "A\nB\n". When this -arrives at the receiver, the transport layer will undo the framing. 
When it sees the LF -after A, it thinks it finds a valid frame delimiter (in fact, this is the correct view!). So -the receiver will extract one complete message A and one complete message B, not knowing -that they once were both part of a large multi-line message. These two messages are then -passed to the upper layers, where the message parsers receive them and extract information. -However, the message parsers never know (or even have a chance to see) that A and B -belonged together. Even further, in rsyslog there is no guarantee that A will be parsed -before B - concurrent operations may cause the reverse order (and do so very validly). -<p>The important lesson is: <b>message parsers can not be used to fix a broken framing</b>. -You need a full protocol implementation to do that, which is the domain of input and -output modules. -<p>I have now told you what you can not do with message parsers. But what are they good for? -Thankfully, broken framing is not the primary problem of the syslog world. A wealth of different -formats is. Unfortunately, many real-world implementations violate the relevant standards -in one way or another. That often makes it very hard to extract meaningful information from -a message or to process messages from different sources by the same rules. In my article -<a href="syslog_parsing.html">syslog parsing in rsyslog</a> I have elaborated on all -the real-world evil that you can usually see. So I won't repeat that here. But in short, the -real problem is not the framing, but how to make malformed messages look well-formed. 
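The "A\nB" framing example discussed above can be reproduced with a small Python sketch. This is only an illustration of the two framing styles, not rsyslog code: the function names are hypothetical, and the octet-counted variant assumes the "length, space, message" layout that RFC5425 describes.

```python
def frame_lf(msg):
    """Plain tcp syslog framing: the message is sent as-is, followed by LF."""
    return msg + b"\n"


def frame_octet_counted(msg):
    """Octet-counted framing: decimal message length, one space, then the message."""
    return str(len(msg)).encode("ascii") + b" " + msg


def split_lf_framed(stream):
    """Receiver side for LF framing: every LF ends a frame."""
    frames = stream.split(b"\n")
    # A stream ending in LF yields one empty trailing piece; drop it.
    if frames and frames[-1] == b"":
        frames.pop()
    return frames


def split_octet_counted(stream):
    """Receiver side for octet counting: read the length, then exactly that many bytes."""
    frames = []
    pos = 0
    while pos < len(stream):
        space = stream.index(b" ", pos)
        length = int(stream[pos:space].decode("ascii"))
        start = space + 1
        frames.append(stream[start:start + length])
        pos = start + length
    return frames


message = b"A\nB"  # one multi-line message

# LF framing: the embedded LF is indistinguishable from a frame delimiter.
print(split_lf_framed(frame_lf(message)))

# Octet counting: the embedded LF is just payload, so the message survives.
print(split_octet_counted(frame_octet_counted(message)))
```

With LF framing the receiver ends up with two separate messages, and nothing a message parser sees later indicates that they ever belonged together. With octet counting the single message arrives intact.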
-<p><b>This is what message parsers permit you to do: take a (well-known) malformed message, parse -it according to its semantics and generate perfectly valid internal message representations -from it.</b> So as long as messages are consistently in the same wrong format (and they usually -are!), a message parser can look at that format, parse it, and make the message processable just -as if it were well formed in the first place. Plus, one can abuse the interface to do some other -"interesting" tricks, but that would take us too far. -<p>While this functionality may not sound exciting, it actually solves a very big issue (that you -only really understand if you have managed a system with various different syslog sources). -Note that we were often able to process malformed messages in the past with the help of the -property replacer and regular expressions. While this is nice, it has a performance hit. A -message parser is C code, compiled to native code, and thus typically much faster than -any regular expression based method (depending, of course, on the quality of the implementation...). - -<h2>How are message parsers used?</h2> -<p>In a simplified view, rsyslog -<ol> -<li>first receives messages (via the input module), -<li><i>then parses them (at the message level!)</i> and -<li>then processes them (operating on the internal message representation). -</ol> -Message parsers are utilized in the second step (written in italics). -Thus, they take the raw message (NOT frame!) received from the remote system and create -the internal structure out of it that the other parts of rsyslog need in order to perform -their processing. Parsing is vital, because an unparsed message can not be processed in the -third stage, the actual application-level processing (like forwarding or writing to files). -<h3>Parser Chains and how they Operate</h3> -Rsyslog chains parsers together to provide flexibility. 
-A <b>parser chain</b> -contains all parsers that can potentially be used to parse a message. -It is assumed that there is some -way a parser can detect if the message it is presented with is supported by it. If so, the parser -will tell the rsyslog engine and parse the message. The rsyslog engine now calls each parser -inside the chain (in sequence!) until the first parser is able to parse the message. After one -parser has been found, the message is considered parsed and no other parsers are called on that -message. -<p>Side-note: this method implies there are some "not-so-dirty" tricks available to modify -the message by a parser module that declares itself as "unable to parse" but still does -some message modification. This was not a primary design goal, but may be utilized, and the -interface probably extended, to support generic filter modules. These would need to go -to the root of the parser chain. As mentioned, the current system already supports this. -<p>The position inside the parser chain can be thought of as a priority: parsers sitting -earlier in the chain take precedence over those sitting later in it. So more specific -parsers should go earlier in the chain. A good example of how this works is the default parser -set provided by rsyslog: rsyslog.rfc5424 and rsyslog.rfc3164, each of which parses according to the -RFC it is named after. RFC5424 was designed to be distinguishable from RFC3164 messages by the -sequence "1 " immediately after the so-called PRI-part (don't worry about these words, it is -sufficient if you understand there is a well-defined sequence used to identify RFC5424 -messages). In contrast, RFC3164 actually permits everything as a valid message. Thus the -RFC3164 parser will always parse a message, sometimes with quite unexpected outcome (there is -a lot of guesswork involved in that parser, which unfortunately is unavoidable due to -existing technology limits). 
So the default parser chain is to try the RFC5424 parser first -and after it the RFC3164 parser. If we have a 5424-formatted message, that parser will -identify and parse it and the rsyslog engine will stop processing. But if we receive a -legacy syslog message, the RFC5424 parser will detect that it can not parse it, return this status -to the engine which then calls the next parser inside the chain. That usually happens to be -the RFC3164 parser, which will always process the message. But there could also be any other -parser inside the chain, and then each one would be called until one that is able to parse -is found. -<p>If we reversed the parser order, RFC5424 messages would be incorrectly parsed. Why? Because the -RFC3164 parser will always parse every message, so if it were asked first, it would parse -(and misinterpret) the 5424-formatted message, report that it did so and the rsyslog engine would -never call the 5424 parser. So the order of the sequence is very important. -<p>What happens if no parser in the chain could parse a message? Well, then we could not -obtain the in-memory representation that is needed to further process the message. In that -case, rsyslog has no other choice than to discard the message. If it does so, it will emit -a warning message, but only in the first 1,000 incidents. This limit is a safety measure -against message-loops, which otherwise could quickly result from a parser chain -misconfiguration. <b>If you cannot tolerate the loss of unparsable messages, you must ensure -that each message can be parsed.</b> You can easily achieve this by always using the -"rsyslog.rfc3164" parser as the <i>last</i> parser inside parser chains. That may result -in invalid parsing, but you will have a chance to see the invalid message (in debug mode, -a warning message will be written to the debug log each time a message is dropped due to -inability to parse it). -<h3>Where are parser chains used?</h3> -<p>We now know what parser chains are and how they operate. 
The question is now how many -parser chains can be active and how it is decided which parser chain is used on which message. -This is controlled via <a href="multi_ruleset.html">rsyslog's rulesets</a>. In short, multiple -rulesets can be defined and there always exists at least one ruleset (for specifics, follow -the <a href="multi_ruleset.html">link</a>). A parser chain is bound to a specific ruleset. -This is done by virtue of defining parsers via the -<a href="rsconf1_rulesetparser.html">$RulesetParser</a> configuration directive (for specifics, -see there). If no such directive is specified, the default parser chain is used. As of this -writing, the default parser chain always consists of "rsyslog.rfc5424", "rsyslog.rfc3164", in -that order. As soon as a parser is configured, the default list is cleared and the new parser -is added to the end of the (initially empty) ruleset's parser chain. -<p>The important point to know is that parser chains are defined on a per-ruleset basis. -<h3>Can I use different parser chains for different devices?</h3> -<p>The correct answer is: generally yes, but it depends. First of all, remember that input -modules (and specific listeners) may be bound to specific rulesets. As parser chains "reside" -in rulesets, binding to a ruleset also binds to the parser chain that is bound to that ruleset. -As a number one prerequisite, the input module must support binding to different rulesets. Not -all do, but their number is growing. For example, the important -<a href="imudp.html">imudp</a> and <a href="imtcp.html">imtcp</a> input modules support -that functionality. Those that do not (for example <a href="im3195">im3195</a>) can only -utilize the default ruleset and thus the parser chain defined in that ruleset. -<p>If you do not know if the input module in question supports ruleset binding, check -its documentation page. Those that support it have the required directives. 
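The chain semantics and the per-ruleset binding described above can be modelled in a few lines of Python. This is a toy model under stated assumptions, not the rsyslog implementation: real parsers are C modules, and the function names, the ruleset names, the port numbers and the fallback PRI value used here are all hypothetical choices made for illustration.

```python
import re

PRI = re.compile(r"^<(\d{1,3})>")


def parse_rfc5424(raw):
    """Accept only messages with '1 ' right after the PRI part; otherwise decline."""
    m = PRI.match(raw)
    if m is None or not raw.startswith("1 ", m.end()):
        return None  # "unable to parse": the engine tries the next parser
    return {"parser": "rsyslog.rfc5424", "pri": int(m.group(1)),
            "rest": raw[m.end() + 2:]}


def parse_rfc3164(raw):
    """Legacy parser: never declines, it guesses when the format is unclear."""
    m = PRI.match(raw)
    pri = int(m.group(1)) if m else 13  # illustrative fallback value
    rest = raw[m.end():] if m else raw
    return {"parser": "rsyslog.rfc3164", "pri": pri, "rest": rest}


def run_chain(chain, raw):
    """Call each parser in sequence; the first one that accepts wins."""
    for parser in chain:
        result = parser(raw)
        if result is not None:
            return result
    return None  # no parser accepted: the message would be discarded


# Hypothetical rulesets, each with its own parser chain.
RULESET_CHAINS = {
    "default": [parse_rfc5424, parse_rfc3164],
    "legacy_only": [parse_rfc3164],
}

# Hypothetical listeners bound to rulesets (and thereby to parser chains).
LISTENER_RULESET = {514: "default", 10514: "legacy_only"}


def parse_from_listener(port, raw):
    """Pick the chain via the listener's ruleset, then run it."""
    return run_chain(RULESET_CHAINS[LISTENER_RULESET[port]], raw)


new_style = "<34>1 2009-11-06T12:00:00Z host app - - - hello"
old_style = "<34>Nov  6 12:00:00 host app: hello"

print(parse_from_listener(514, new_style)["parser"])
print(parse_from_listener(514, old_style)["parser"])
# Reversed order: the catch-all legacy parser swallows the RFC5424 message.
print(run_chain([parse_rfc3164, parse_rfc5424], new_style)["parser"])
```

Putting the catch-all parser last is what keeps the chain lossless. A chain without it returns nothing for a message no parser recognises, which corresponds to the discard case described in the text.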
-<p>Note that it is currently under evaluation if rsyslog will support binding parser chains -to specific inputs directly, without depending on the ruleset. There are some concerns that -this may not be necessary but would add considerable complexity to the configuration. So this may -or may not be possible in the future. In any case, if we decide to add it, input modules -need to support it, so this functionality would require some time to implement. -<p>The cookbook recipe for using different parsers for different devices is given -as an actual in-depth example in the <a href="rscon1_rulesetsparser.html">$RulesetParser</a> -configuration directive doc page. In short, it is accomplished by defining specific rulesets -for the required parser chains, defining different listener ports for each of the devices -with a different format and binding these listeners to the correct ruleset (and thus parser -chains). Using that approach, a variety of different message formats can be supported -via a single rsyslog instance. - -<h2>Which message parsers are available</h2> -<p>As of this writing, there exist only two message parsers, one for RFC5424 format and one for -legacy syslog (loosely described in -<a href="http://tools.ietf.org/html/rfc3164">RFC3164</a>). These parsers are built-in and -need not be explicitly loaded. However, message parsers can be added with relative ease -by anyone who knows how to code in C. Then, they can be loaded via $ModLoad just like any -other loadable module. It is expected that additional message parsers will be contributed -to the rsyslog project over time, so that at some point there hopefully is a rich choice of them -(I intend to add a browsable repository as soon as new parsers pop up). -<h3>How to write a message parser?</h3> -<p>As a prerequisite, you need to know the exact format that the device is sending. Then, you need -moderate C coding skills, and a little bit of rsyslog internals. 
I guess the rsyslog specific part -should not be that hard, as almost all information can be gained from the existing parsers. They -are rather simple in structure and can be found under the "./tools" directory. They are named -pmrfc3164.c and pmrfc5424.c. You need to follow the usual loadable module guidelines. -It is my expectation that writing a parser should typically not take longer than a single -day, with maybe a day more to get acquainted with rsyslog. Of course, I am not sure if the number -is actually right. -<p>If you can not program or have no time to do it, Adiscon can also write a message parser -for you as -part of the <a href="http://www.rsyslog/professional-services">rsyslog professional services -offering</a>. -<h2>Conclusion</h2> -<p>Malformed syslog messages are a pain and unfortunately often seen in practice. Message parsers -provide a fast and efficient solution for this problem. Different parsers can be defined for -different devices, and they all convert message information into rsyslog's well-defined -internal format. Message parsers were first introduced in rsyslog 5.3.4 and also offer -some interesting ideas that may be explored in the future - up to full message normalization -capabilities. It is strongly recommended that anyone with a heterogeneous environment take -a look at message parser capabilities. - -<p>[<a href="rsyslog_conf.html">rsyslog.conf overview</a>] [<a href="manual.html">manual -index</a>] [<a href="http://www.rsyslog.com/">rsyslog site</a>]</p> -<p><font size="2">This documentation is part of the -<a href="http://www.rsyslog.com/">rsyslog</a> project.<br> -Copyright © 2009 by <a href="http://www.gerhards.net/rainer">Rainer Gerhards</a> and -<a href="http://www.adiscon.com/">Adiscon</a>. Released under the GNU GPL version 3 or higher.</font></p> -</body> -</html> |