From 9374a46543e9c43c009f80def8c3b2506b0b377e Mon Sep 17 00:00:00 2001 From: Michael Biebl Date: Thu, 3 Apr 2014 03:08:50 +0200 Subject: Imported Upstream version 8.2.0 --- doc/syslog_parsing.html | 210 ------------------------------------------------ 1 file changed, 210 deletions(-) delete mode 100644 doc/syslog_parsing.html (limited to 'doc/syslog_parsing.html') diff --git a/doc/syslog_parsing.html b/doc/syslog_parsing.html deleted file mode 100644 index 1ccec6f..0000000 --- a/doc/syslog_parsing.html +++ /dev/null @@ -1,210 +0,0 @@ - -syslog parsing in rsyslog - - -

syslog parsing in rsyslog

Written by Rainer Gerhards -(2008-09-23)

We regularly receive messages asking why rsyslog -parses this or that message incorrectly. Of course, it turns out that rsyslog does -the right thing, but the message sender does not. And also of course, this is not even -of the slightest help to the end user experiencing the problem ;). So I thought I write this -paper. It describes the problem source and shows potential solutions (aha!). -

Syslog Standardization

-The syslog protocol has not been standardized until relatively recently.The first document "smelling" a bit -like a standard is RFC 3164, which dates back -to August 2001. The problem is that this document is no real standard. It has assigned "informational" -status by the IETF which means it provides some hopefully -useful information but does not demand anything. It is impossible to "comply" to an informational -document. This, of course, doesn't stop marketing guys from telling they comply to RFC3164 and -it also does not stop some techs to tell you "this and that does not comply to RFC3164, so it is -<anybody else but them>'s fault". -

Then, there is RFC3195, which is -a real standard. In it's section 3 it makes (a somewhat questionable) reference to (informational) -RFC 3164 which may be interpreted in a way that RFC3195 standardizes the format layed out -in RFC 3164 by virtue of referencing them. So RFC3195 seems to extend its standardization -domain to the concepts layed out in RFC 3164 (which is why I tend to find that refrence -questionable). In that sense, RFC3195 standardizes the format informationally described in -RFC3164, Section 4. But it demands it only for the scope of RFC3195, which is syslog over -BEEP - and NOT syslog over UDP. So one may argue whether or not the RFC3164 format could -be considered a standard for any non-BEEP (including UDP) syslog, too. In the strict view -I tend to have, it does not. Refering to the RFC3195 context usually does not help, -because there are virtually no RFC3195 implementations available (at this time, -I would consider this RFC a failure). -

Now let's for a short moment assume that RFC3195 would somehow be able to demand -RFC3164 format for non-BEEP syslog. So we could use RFC3164 format as a standard. But does -that really help? Let's cite RFC 3164, right at the begining of section 4 (actually, this -is the first sentence): -

-   The payload of any IP packet that has a UDP destination port of 514
-   MUST be treated as a syslog message. 
--

Think a bit about it: this means that whatever is send to port 514 must be considered -a valid syslog message. No format at all is demanded. So if "this is junk" is sent to -UDP port 514 - voila, we have a valid message (interestingly, it is no longer a syslog -message if it is sent to port 515 ;)). You may now argue that I am overdoing. So let's -cite RFC 3164, Section 5.4, Example 2: -

-  Example 2
-
-        Use the BFG!
-
-   While this is a valid message, it has extraordinarily little useful
-   information.
-

As you can see, RFC3164 explicitely states that no format at all is required. -

Now a side-note is due: all of this does not mean that the RFC3164 authors -did not know what they were doing. No, right the contrary is true: RFC3164 mission -is to describe what has been seen in practice as syslog messages and the -conclusion is quite right that there is no common understanding on the -message format. This is also the reason why RFC3164 is an informational document: -it provides useful information, but does not precisely specify anything. -

After all of this bashing, I now have to admit that RFC3164 has some format -recommendations layed out in section 4. The format described has quite some -value in it and implementors recently try to follow it. This format is usually meant -when someone tells you that a software is "RFC3164 compliant" or expects "RFC3164 compliant messages". -I also have to admit that rsyslog also uses this format and, in the sense outlined here, -expects messages received to be "RFC3164 compliant" (knowingly that such a beast does not -exist - I am simply lying here ;)). -

Please note that there is some relief of the situation in reach. There is a new normative -syslog RFC series upcoming, and it specifies a standard message format. At the time of -this writing, the main documents are sitting in the RFC editor queue waiting for a transport -mapping to be completed. I personally expect them to be assigned RFC numbers in 2009. -

Practical Format Requirements

From a practical point of view, the message format expected (and generated by -default in legacy mode) is: -


-<PRI>TIMESTAMP SP HOST SP TAG MSG(Freetext)
-

SP is the ASCII "space" character and the definition of the rest of the fields -can be taken from RFC3164. Please note that there also is a lot of confusion on what -syntax and semantics the TAG actually has. This format is called "legacy syslog" because -it is not well specified (as you know by now) and has been "inherited from the real world". -

Rsyslog offers two parsers: one for the upcoming RFC series and one for legacy format. We -concentrate on the later. That parser applies some logic to detect missing hostnames, -is able to handle various ways the TIMESTAMP is typically malformed. In short it applies -a lot of guesswork in trying to figure out what a message really means. I am sure the -guessing algorithm can be improved, and I am always trying that when I see new malformed -messages (and there is an ample set of them...). However, this finds its limits where -it is not possible to differentiate between two entities which could be either. -For example, look at this message: -


-<144>Tue Sep 23 11:40:01 taghost sample message
-

Does it contain a hostname? Mabye. The value "taghost" is a valid hostname. Of course, it is -also a valid tag. If it is a hostname, the tag's value is "sample" and the msg value is "message". -Or is the hostname missing, the tag is "taghost" and msg is "sample message"? As a human, I tend -to say the later interpretation is correct. But that's hard to tell the message parser (and, no, I do -not intend to apply artificial intelligence just to guess what the hostname value is...). -

One approach is to configure the parser so that it never expects hostnames. This becomes problematic -if you receive messages from multiple devices. Over time, I may implement parser conditionals, -but this is not yet available and I am not really sure if it is needed comlexity... -

Things like this, happen. Even more scary formats happen in practice. Even from mainstream -vendors. For example, I was just asked about this message (which, btw, finally made me -write this article here): -


-"<130> [ERROR] iapp_socket_task.c 399: iappSocketTask: iappRecvPkt returned error"
-

If you compare it with the format RFC3164 "suggests", you'll quickly notice that -the message is "a bit" malformed. Actually, even my human intelligence is not sufficient -to guess if there is a TAG or not (is "[ERROR]" a tag or part of the message). I may not be -the smartest guy, but don't expect me to program a parser that is smarter than me. -

To the best of my konwledge, these vendor's device's syslog format can be configured, so it -would proabably be a good idea to include a (sufficiently well-formed) timestamp, -the sending hostname and (maybe?) a tag to make this message well parseable. -I will also once again take this sample and see if we can apply some guesswork. -For example, "[" can not be part of a well-formed TIMESTAMP, so logic can conclude -there is not TIMESTAMP. Also, "[" can not be used inside a valid hostname, so -logic can conclude that the message contains no hostname. Even if I implement this -logic (which I will probably do), this is a partial solution: it is impossible to -guess if there is a tag or not (honestly!). And, even worse, it is a solution only for -those set of messages that can be handled by the logic described. Now consider this -hypothetical message: -


-"<130> [ERROR] host.example.net 2008-09-23 11-40-22 PST iapp_socket_task.c 399: iappSocketTask: iappRecvPkt returned error"
-

Obviously, it requires additional guesswork. If we iterate over all the cases, we -can very quickly see that it is impossible to guess everything correct. In the example above -we can not even surely tell if PST should be a timezone or some other message property. -

A potential solution is to generate a parser-table based parser, but this requires -considerable effort and also has quite some runtime overhead. I try to avoid this for -now (but I may do it, especially if someone sponsors this work ;)). Side-note: if you want -to be a bit scared about potential formats, you may want to have a look at my paper -"On the Nature of Syslog Data". -

Work-Around

The number one work-around is to configure your devices so that they emit -(sufficiently) well-formed messages. You should by now know what these look -like. -

If that cure is not available, there are some things you can do in rsyslog to -handle the situation. First of all, be sure to read about -rsyslog.conf format -and the property replacer and properties specifically. -You need to understand that everything is configured in rsyslog. And that the message is parsed -into properties. There are also properties available which do not stem back directly to parsing. -Most importantly, %fromhost% property holds the name of the system rsyslog received -the message from. In non-relay cases, this can be used instead of hostname. In relay cases, -there is no cure other than to either fix the orginal sender or at least one of the -relays in front of the rsyslog instance in question. Similarly, you can use %timegenerated% -instead of %timereported%. Timegenerated is the time the message hit rsyslog for the first -time. For non-relayed, locally connected peers, Timegenerated should be a very close approximation -of the actual time a message was formed at the sender (depending, of course, on potential -internal queueing inside the sender). -Also, you may use the -%rawmsg% property together with the several extraction modes the property replacer supports. -Rawmsg contains the message as it is received from the remote peer. In a sense, you can -implement a post-parser with this method. -

To use these properties, you need to define your own templates and assign them. Details -can be found in the above-quoted documentation. Just let's do a quick example. Let's say -you have the horrible message shown above and can not fix the sending device for -some good reason. In rsyslog.conf, you used to say: -


-*.* /var/log/somefile
-

Of course, things do not work out well with that ill-formed message. So you decide -to dump the rawmsg to the file and pull the remote host and time of message generation -from rsyslog's internal properties (which, btw, is clever, because otherwise there is no -indication of these two properties...). So you need to define a template for that and -make sure the template is used with your file logging action. This is how it may look: -


-$template, MalfromedMsgFormater,"%timegenerated% %fromhost% %rawmsg:::drop-last-lf%\n"
-*.* /var/log/somefile;MalformedMsgFormatter
-

This will make your log much nicer, but not look perfect. Experiment a bit -with the available properties and replacer extraction options to fine-tune it -to your needs. -

The Ultimate Solution...

Is available with rsyslog 5.3.4 and above. Here, we can define so-called custom -parsers. These are plugin modules, written in C and adapted to a specific message format -need. The big plus of custom parsers is that they offer excellent performance and unlimited -possibilities - far better than any work-around could do. Custom parsers can be -bound to specific rule sets -(and thus listening) ports with relative ease. The only con is that they must be written. -However, if you are lucky, a parser for your device may already exist. If not, you can -opt to write it yourself, what is not too hard if you know some C. Alternatively, -Adiscon can program one for you as part of the -rsyslog professional services offering. -In any case, you should seriously consider custom parsers as an alternative if you can not -reconfigure your device to send decent message format. -

Wrap-Up

Syslog message format is not sufficiently standardized. There exists a weak -"standard" format, which is used by a good number of implementations. However, there -exist many others, including mainstream vendor implementations, which have a -(sometimes horribly) different format. Rsyslog tries to deal with anomalies but -can not guess right in all instances. If possible, the sender should be configured -to submit well-formed messages. If that is not possible, you can work around these -issues with rsyslog's property replacer and template system. Or you can use a suitable -message parser or write one for your needs. -

I hope this is a useful guide. You may also have a look at the -rsyslog troubleshooting guide for further help and places where -to ask questions. -

[manual index] [rsyslog site]

- -- cgit v1.2.3