summaryrefslogtreecommitdiff
path: root/src/pmlogcheck/RFC
diff options
context:
space:
mode:
authorIgor Pashev <pashev.igor@gmail.com>2014-10-26 12:33:50 +0400
committerIgor Pashev <pashev.igor@gmail.com>2014-10-26 12:33:50 +0400
commit47e6e7c84f008a53061e661f31ae96629bc694ef (patch)
tree648a07f3b5b9d67ce19b0fd72e8caa1175c98f1a /src/pmlogcheck/RFC
downloadpcp-debian.tar.gz
Debian 3.9.10debian/3.9.10debian
Diffstat (limited to 'src/pmlogcheck/RFC')
-rw-r--r--src/pmlogcheck/RFC64
1 files changed, 64 insertions, 0 deletions
diff --git a/src/pmlogcheck/RFC b/src/pmlogcheck/RFC
new file mode 100644
index 0000000..5ff62e4
--- /dev/null
+++ b/src/pmlogcheck/RFC
@@ -0,0 +1,64 @@
+In the beginning all PCP archives were created by pmlogger. So a
+corrupt PCP archive meant pmlogger had a bug or was interrupted in
+some manner.
+
+We fixed the bugs (!) and hardened the checking of archives to ensure
+we could process as much as possible of an interrupted archive.
+
+But things changed and archives could be created in more ways ...
+- pmlogmerge, so more checks to ensure the semantic consistency of
+ the input archives, but again we could assume pmlogmerge would
+ create correct archives
+- pmlogreduce, same as pmlogmerge
+
+With the introduction of libpcp_import and bindings for Perl and
+Python, we now have the possibility of and infinite number of scripts
+creating archives using low-level calls that can be combined to
+produce and infinite variety of corrupted archives. The first example
+of this class is https://bugzilla.redhat.com/show_bug.cgi?id=958745
+but we should expect more of these to be lurking in the future.
+
+When these problems appear, the initial triage effort is directed
+(rightly) towards the replay tool that is failing, and it takes
+considerable time and effort to determine that the root cause is
+a corrupted archive, not an application or libpcp failure.
+
+Some corruption we can (and do) catch in libpcp. We could probably
+do more there, but the most common usage with "interp" mode replay
+makes it almost impossible to check timestamps on the fly, so the
+but above would be most unlikely to be found there.
+
+So, in the spirit of the original Unix filesystem, I'm proposing
+an ncheck/icheck (none of you're whimpy fsck in those days) tool,
+pmlogcheck.
+
+The objective would be to have one anally retentive tool that can
+assert the "goodness" of a PCP archive, as the first step in any
+triage, even before pmdumplog is used.
+
+pmlogcheck would certainly be a multi-pass tool, initially using
+no libpcp services to read blocks of the files, and then graduate
+to the low-level libpcp routines once basic sanity has been
+established.
+
+The sorts of checks it might try would include:
+
+x. process temporal index if any
+ [ ] check label
+ [ ] if missing, warn
+ [ ] else load temporal index
+x. process meta file
+ [ ] check label
+ [ ] check header-trailer len for every record
+ [ ] check timestamps for indoms are monotonic increasing (and >= label
+ record start)
+ [ ] check timestamp and offset against temporal index (if available)
+ [ ] if OK load metadata
+x. process each metric volume file
+ [ ] check label
+ [ ] check header-trailer len for every record
+ [ ] check timestamps are monotonic increasing (and >= label record start)
+ [ ] check timestamp and offset against temporal index (if available)
+ [ ] check pmids are defined in meta data
+ [ ] check instances are defined in metadata
+ [ ] check value encoding matches metadata type