Imported Upstream version 8.2.0upstream/8.2.0

author: Michael Biebl <biebl@debian.org> 2014-04-03 03:08:50 +0200
committer: Michael Biebl <biebl@debian.org> 2014-04-03 03:08:50 +0200
commit: 9374a46543e9c43c009f80def8c3b2506b0b377e (patch)
tree: 8853fd40ee8d55ff24304ff8a4421640f3493c58 /doc/messageparser.html
parent: 209e193f14ec562df5aad945f04cd88b227cc602 (diff)
download: rsyslog-9374a46543e9c43c009f80def8c3b2506b0b377e.tar.gz
1 files changed, 0 insertions, 222 deletions
diff --git a/doc/messageparser.html b/doc/messageparser.html
deleted file mode 100644
index d22021d..0000000
--- a/doc/messageparser.html
+++ /dev/null
@@ -1,222 +0,0 @@
-<html>
-<head>
-<title>Message parsers in rsyslog</title>
-</head>
-<body>
-<a href="manual.html">rsyslog documentation</a>
-
-<h1>Message parsers in rsyslog</h1>
-<p><small><i>Written by <a href="http://www.gerhards.net/rainer">Rainer Gerhards</a>
-(2009-11-06)</i></small></p>
-<h2>Intro</h2>
-<p>Message parsers are a feature of rsyslog 5.3.4 and above. In this article, I describe what
-message parsers are, what they can do and how they relate to the relevant standards. I will
-also describe what you can not do with time. Finally, I give some advice on implementing your
-own custom parser.
-
-<h2>What are message parsers?</h2>
-<p>Well, the quick answer is that message parsers are the component of rsyslog that
-parses the syslog message after it is being received. Prior to rsyslog 5.3.4, message parsers
-where built in into the rsyslog core itself and could not be modified (other than by modifying 
-the rsyslog code).
-<p>In 5.3.4, we changed that: message parsers are now loadable modules (just
-like input and output modules). That means that new message parsers can be added without
-modifying the rsyslog core, even without contributing something back to the 
-project.
-<p>But that doesn't answer what a message parser really is. What does it mean to &quot;parse a
-message&quot; and, maybe more importantly, what is a message? To answer these questions correctly,
-we need to dig down into the relevant standards.
-<a href="http://tools.ietf.org/html/rfc5424">RFC5424</a> specifies a layered architecture
-for the syslog protocol:
-<p align="center"><img src="rfc5424layers.png" alt="RFC5424 syslog protocol layers">
-<p>For us important is the distinction between the syslog transport and the upper layers.
-The transport layer specifies how a stream of messages is assembled at the sender side and
-how this stream of messages is disassembled into the individual messages at the receiver
-side. In networking terminology, this is called &quot;framing&quot;. The core idea is that
-each message is put into a so-called "frame", which then is transmitted over the communications
-link.
-<p>The framing used is depending on the protocol. For example, in UDP the "frame"-equivalent is 
-a packet that is being sent (this also means that no two messages can travel within a single
-UDP packet). In "plain tcp syslog", the industry standard, LF is used as a frame delimiter
-(which also means that no multi-line message can properly be transmitted, a "design" flaw
-in plain tcp syslog). In <a href="http://tools.ietf.org/html/rfc5425">RFC5425</a> there is
-a header in front of each frame that contains the size of the message. With this framing,
-any message content can properly be transferred.
-<p>And now comes the important part: <b>message parsers do NOT operate at the transport
-layer</b>, they operate, as their name implies, on messages. So we can not use message
-parsers to change the underlying framing. For example, if a sender splits (for whatever 
-reason) a single message into two and encapsulates these into two frames, there is no way
-a message parser could undo that.
-<p>A typical example may be a multi-line message: let's assume some originator has generated
-a message for the format "A\nB" (where \n means LF). If that message is being transmitted
-via plain tcp syslog, the frame delimiter is LF. So the sender will delimit the frame with
-LF, but otherwise send the message unmodified onto the wire (because that is how things are
--unfortunately- done in plain tcp syslog...). So wire will see "A\nB\n". When this
-arrives at the receiver, the transport layer will undo the framing. When it sees the LF
-after A, it thinks it finds a valid frame delimiter (in fact, this is the correct view!). So
-the receive will extract one complete message A and one complete message B, not knowing
-that they once were both part of a large multi-line message. These two messages are then
-passed to the upper layers, where the message parsers receive them and extract information.
-However, the message parsers never know (or even have a chance to see) that A and B
-belonged together. Even further, in rsyslog there is no guarantee that A will be parsed
-before B - concurrent operations may cause the reverse order (and do so very validly).
-<p>The important lesson is: <b>message parsers can not be used to fix a broken framing</b>.
-You need a full protocol implementation to do that, what is the domain of input and
-output modules.
-<p>I have now told you what you can not do with message parsers. But what they are good for? 
-Thankfully, broken framing is not the primary problem of the syslog world. A wealth of different
-formats is. Unfortunately, many real-world implementations violate the relevant standards
-in one way or another. That makes it often very hard to extract meaningful information from
-a message or to process messages from different sources by the same rules. In my article
-<a href="syslog_parsing.html">syslog parsing in rsyslog</a> I have elaborated on all
-the real-world evil that you can usually see. So I won't repeat that here. But in short, the
-real problem is not the framing, but how to make malformed messages well-looking.
-<p><b>This is what message parsers permit you to do: take a (well-known) malformed message, parse
-it according to its semantics and generate perfectly valid internal message representations
-from it.</b> So as long as messages are consistently in the same wrong format (and they usually
-are!), a message parser can look at that format, parse it, and make the message processable just
-like it were well formed in the first place. Plus, one can abuse the interface to do some other
-"interesting" tricks, but that would take us to far.
-<p>While this functionality may not sound exciting, it actually solves a very big issue (that you
-only really understand if you have managed a system with various different syslog sources).
-Note that we were often able to process malformed messages in the past with the help of the
-property replacer and regular expressions. While this is nice, it has a performance hit. A
-message parser is a C code, compiled to native language, and thus typically much faster than
-any regular expression based method (depending, of course, on the quality of the implementation...).
-
-<h2>How are message parsers used?</h2>
-<p>In a simlified view, rsyslog
-<ol>
-<li>first receives messages (via the input module),
-<li><i>then parses them (at the message level!)</i> and
-<li>then processes them (operating on the internal message representation).
-</ol>
-Message parsers are utilized in the second step (written in italics).
-Thus, they take the raw message (NOT frame!) received from the remote system and create
-the internal structure out of it that the other parts of rsyslog need in order to perform
-their processing. Parsing is vital, because an unparsed message can not be processed in the
-third stage, the actual application-level processing (like forwarding or writing to files).
-<h3>Parser Chains and how they Operate</h3>
-Rsyslog chains parsers together to provide flexibility.
-A <b>parser chain</b>
-contains all parsers that can potentially be used to parse a message.
-It is assumed that there is some
-way a parser can detect if the message it is being presented is supported by it. If so, the parser
-will tell the rsyslog engine and parse the message. The rsyslog engine now calls each parser
-inside the chain (in sequence!) until the first parser is able to parse the message. After one
-parser has been found, the message is considered parsed and no others parsers are called on that
-message.
-<p>Side-note: this method implies there are some "not-so-dirty" tricks available to modify
-the message by a parser module that declares itself as "unable to parse" but still does
-some message modification. This was not a primary design goal, but may be utilized, and the 
-interface probably extended, to support generic filter modules. These would need to go
-to the root of the parser chain. As mentioned, the current system already supports this.
-<p>The position inside the parser chain can be thought of as a priority: parser sitting
-earlier in the chain take precedence over those sitting later in it. So more specific 
-parser should go earlier in the chain. A good example of how this works is the default parser
-set provided by rsyslog: rsyslog.rfc5424 and rsyslog.rfc3164, each one parses according to the
-rfc that has named it. RFC5424 was designed to be distinguishable from RFC3164 message by the
-sequence "1 " immediately after the so-called PRI-part (don't worry about these words, it is
-sufficient if you understand there is a well-defined sequence used to identify RFC5424
-messages). In contrary, RFC3164 actually permits everything as a valid message. Thus the
-RFC3164 parser will always parse a message, sometimes with quite unexpected outcome (there is
-a lot of guesswork involved in that parser, which unfortunately is unavoidable due to
-existing technology limits). So the default parser chain is to try the RFC5424 parser first
-and after it the RFC3164 parser. If we have a 5424-formatted message, that parser will
-identify and parse it and the rsyslog engine will stop processing. But if we receive a
-legacy syslog message, the RFC5424 will detect that it can not parse it, return this status
-to the engine which then calls the next parser inside the chain. That usually happens to be
-the RFC3164 parser, which will always process the message. But there could also be any other
-parser inside the chain, and then each one would be called unless one that is able to parse
-can be found.
-<p>If we reversed the parser order, RFC5424 messages would incorrectly parsed. Why? Because the
-RFC3164 parser will always parse every message, so if it were asked first, it would parse
-(and misinterpret) the 5424-formatted message, return it did so and the rsyslog engine would
-never call the 5424 parser. So oder of sequence is very important.
-<p>What happens if no parser in the chain could parse a message? Well, then we could not 
-obtain the in-memory representation that is needed to further process the message. In that
-case, rsyslog has no other choice than to discard the message. If it does so, it will emit
-a warning message, but only in the first 1,000 incidents. This limit is a safety measure 
-against message-loops, which otherwise could quickly result from a parser chain 
-misconfiguration. <b>If you do not tolerate loss of unparsable messages, you must ensure
-that each message can be parsed.</b> You can easily achieve this by always using the
-"rsyslog-rfc3164" parser as the <i>last</i> parser inside parser chains. That may result
-in invalid parsing, but you will have a chance to see the invalid message (in debug mode,
-a warning message will be written to the debug log each time a message is dropped due to
-inability to parse it).
-<h3>Where are parser chains used?</h3>
-<p>We now know what parser chains are and how they operate. The question is now how many
-parser chains can be active and how it is decided which parser chain is used on which message.
-This is controlled via <a href="multi_ruleset.html">rsyslog's rulesets</a>. In short, multiple
-rulesets can be defined and there always exist at least one ruleset (for specifics, follow
-the <a href="multi_ruleset.html">link</a>). A parser chain is bound to a specific ruleset. 
-This is done by virtue of defining parsers via the
-<a href="rsconf1_rulesetparser.html">$RulesetParser</a> configuration directive (for specifics,
-see there). If no such directive is specified, the default parser chain is used. As of this
-writing, the default parser chain always consists of "rsyslog.rfc5424", "rsyslog.rfc3164", in
-that order. As soon as a parser is configured, the default list is cleared and the new parser
-is added to the end of the (initially empty) ruleset's parser chain.
-<p>The important point to know is that parser chains are defined on a per-ruleset basis.
-<h3>Can I use different parser chains for different devices?</h3>
-<p>The correct answer is: generally yes, but it depends. First of all, remember that input
-modules (and specific listeners) may be bound to specific rulesets. As parser chains "reside"
-in rulesets, binding to a ruleset also binds to the parser chain that is bound to that ruleset.
-As a number one prerequisite, the input module must support binding to different rulesets. Not
-all do, but their number is growing. For example, the important 
-<a href="imudp.html">imudp</a> and <a href="imtcp.html">imtcp</a> input modules support
-that functionality. Those that do not (for example <a href="im3195">im3195</a>) can only
-utilize the default ruleset and thus the parser chain defined in that ruleset.
-<p>If you do not know if the input module in question supports ruleset binding, check
-its documentation page. Those that support it have the required directives.
-<p>Note that it is currently under evaluation if rsyslog will support binding parser chains
-to specific inputs directly, without depending on the ruleset. There are some concerns that
-this may not be necessary but adds considerable complexity to the configuration. So this may
-or may not be possible in the future. In any case, if we decide to add it, input modules 
-need to support it, so this functionality would require some time to implement.
-<p>The cookbook recipe for using different parsers for different devices is given
-as an actual in-depth example in the <a href="rscon1_rulesetsparser.html">$RulesetParser</a>
-configuration directive doc page. In short, it is accomplished by defining specific rulesets
-for the required parser chains, defining different listener ports for each of the devices
-with different format and binding these listeners to the correct ruleset (and thus parser
-chains). Using that approach, a variety of different message formats can be supported
-via a single rsyslog instance.
-
-<h2>Which message parsers are available</h2>
-<p>As of this writing, there exist only two message parsers, one for RFC5424 format and one for
-legacy syslog (loosely described in
-<a href="http://tools.ietf.org/html/rfc3164">RFC3164</a>). These parsers are built-in and
-must not be explicitly loaded. However, message parsers can be added with relative ease
-by anyone knowing to code in C. Then, they can be loaded via $ModLoad just like any
-other loadable module. It is expected that the rsyslog project will be contributed additional
-message parsers over time, so that at some point there hopefully is a rich choice of them
-(I intend to add a browsable repository as soon as new parsers pop up).
-<h3>How to write a message parser?</h3>
-<p>As a prerequisite, you need to know the exact format that the device is sending. Then, you need
-moderate C coding skills, and a little bit of rsyslog internals. I guess the rsyslog specific part 
-should not be that hard, as almost all information can be gained from the existing parsers. They
-are rather simple in structure and can be found under the "./tools" directory. They are named
-pmrfc3164.c and pmrfc5424.c. You need to follow the usual loadable module guidelines.
-It is my expectation that writing a parser should typically not take longer than a single
-day, with maybe a day more to get acquainted with rsyslog. Of course, I am not sure if the number
-is actually right.
-<p>If you can not program or have no time to do it, Adiscon can also write a message parser
-for you as
-part of the <a href="http://www.rsyslog/professional-services">rsyslog professional services
-offering</a>.
-<h2>Conclusion</h2>
-<p>Malformed syslog messages are a pain and unfortunately often seen in practice. Message parsers
-provide a fast and efficient solution for this problem. Different parsers can be defined for
-different devices, and they all convert message information into rsyslog's well-defined
-internal format. Message parsers were first introduced in rsyslog 5.3.4 and also offer
-some interesting ideas that may be explored in the future - up to full message normalization
-capabilities. It is strongly recommended that anyone with a heterogeneous environment take
-a look at message parser capabilities.
-
-<p>[<a href="rsyslog_conf.html">rsyslog.conf overview</a>] [<a href="manual.html">manual 
-index</a>] [<a href="http://www.rsyslog.com/">rsyslog site</a>]</p>
-<p><font size="2">This documentation is part of the
-<a href="http://www.rsyslog.com/">rsyslog</a> project.<br>
-Copyright &copy; 2009 by <a href="http://www.gerhards.net/rainer">Rainer Gerhards</a> and
-<a href="http://www.adiscon.com/">Adiscon</a>. Released under the GNU GPL version 3 or higher.</font></p>
-</body>
-</html>
author	Michael Biebl <biebl@debian.org>	2014-04-03 03:08:50 +0200
committer	Michael Biebl <biebl@debian.org>	2014-04-03 03:08:50 +0200
commit	9374a46543e9c43c009f80def8c3b2506b0b377e (patch)
tree	8853fd40ee8d55ff24304ff8a4421640f3493c58 /doc/messageparser.html
parent	209e193f14ec562df5aad945f04cd88b227cc602 (diff)
download	rsyslog-9374a46543e9c43c009f80def8c3b2506b0b377e.tar.gz