 man/html/howto.diskperf.html | 754 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 754 insertions(+), 0 deletions(-)
diff --git a/man/html/howto.diskperf.html b/man/html/howto.diskperf.html
new file mode 100644
index 0000000..ffedfa0
--- /dev/null
+++ b/man/html/howto.diskperf.html
@@ -0,0 +1,754 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
+<!--
+ (c) Copyright 2000-2004 Silicon Graphics Inc. All rights reserved.
+ Permission is granted to copy, distribute, and/or modify this document
+ under the terms of the Creative Commons Attribution-Share Alike, Version
+ 3.0 or any later version published by the Creative Commons Corp. A copy
+ of the license is available at
+ http://creativecommons.org/licenses/by-sa/3.0/us/ .
+-->
+<HTML>
+<HEAD>
+ <meta http-equiv="content-type" content="text/html; charset=utf-8">
+ <meta http-equiv="content-style-type" content="text/css">
+ <link href="pcpdoc.css" rel="stylesheet" type="text/css">
+ <link href="images/pcp.ico" rel="icon" type="image/ico">
+ <TITLE>How to understand disk performance</TITLE>
+</HEAD>
+<BODY LANG="en-AU" TEXT="#000060" DIR="LTR">
+<TABLE WIDTH=100% BORDER=0 CELLPADDING=0 CELLSPACING=0 STYLE="page-break-before: always">
+ <TR> <TD WIDTH=64 HEIGHT=64><FONT COLOR="#000080"><A HREF="http://pcp.io/"><IMG SRC="images/pcpicon.png" NAME="pmcharticon" ALIGN=TOP WIDTH=64 HEIGHT=64 BORDER=0></A></FONT></TD>
+ <TD WIDTH=1><P>&nbsp;&nbsp;&nbsp;&nbsp;</P></TD>
+ <TD WIDTH=500><P VALIGN=MIDDLE ALIGN=LEFT><A HREF="index.html"><FONT COLOR="#cc0000">Home</FONT></A>&nbsp;&nbsp;&middot;&nbsp;<A HREF="lab.pmchart.html"><FONT COLOR="#cc0000">Charts</FONT></A>&nbsp;&nbsp;&middot;&nbsp;<A HREF="timecontrol.html"><FONT COLOR="#cc0000">Time Control</FONT></A></P></TD>
+ </TR>
+</TABLE>
+<H1 ALIGN=CENTER STYLE="margin-top: 0.48cm; margin-bottom: 0.32cm"><FONT SIZE=7>How to understand measures of disk performance</FONT></H1>
+<TABLE WIDTH=15% BORDER=0 CELLPADDING=5 CELLSPACING=10 ALIGN=RIGHT>
+ <TR><TD BGCOLOR="#e2e2e2"><IMG SRC="images/system-search.png" WIDTH=16 HEIGHT=16 BORDER=0>&nbsp;&nbsp;<I>Tools</I><BR><PRE>
+pmchart
+sar
+</PRE></TD></TR>
+</TABLE>
+<P>This chapter of the Performance Co-Pilot tutorial provides some hints
+on how to interpret and understand the various measures of disk
+performance.</P>
+<P>For an explanation of Performance Co-Pilot terms and acronyms, consult
+the <A HREF="glossary.html">PCP glossary</A>.</P>
+
+<P><BR></P>
+<TABLE WIDTH=100% BORDER=0 CELLPADDING=0 CELLSPACING=0 BGCOLOR="#e2e2e2">
+ <TR><TD WIDTH=100% BGCOLOR="#081c59"><P ALIGN=LEFT><FONT SIZE=5 COLOR="#ffffff"><B>Reconciling sar -d and PCP disk performance metrics</B></FONT></P></TD></TR>
+</TABLE>
+<P>
+Both <I>sar</I> and Performance Co-Pilot (PCP) use a common collection
+of disk performance instrumentation from the block layer in the kernel;
+however, the disk performance metrics provided by <I>sar</I> and PCP
+differ in their derivation and semantics.&nbsp;&nbsp;This document
+is an attempt to explain these differences. </P>
+<P>
+It is convenient to define the ``response time'', the time to complete a
+disk operation, as the sum of the time spent (a small numeric sketch
+follows the list):</P>
+<UL>
+ <LI>
+  entering the read() or write() system call and setting up for an I/O
+ operation (time here is CPU bound and is assumed to be negligible per
+ I/O)
+ <LI>
+ in a queue of pending requests waiting to be handed to the device
+ controller (the ``queue time'')
+ <LI>
+ the time between the request being handed to the device controller and
+ the end of transfer interrupt (the ``(device) service time''),
+ typically composed of delays due to request scheduling at the
+ controller, bus arbitration, possible seek time, rotational latency,
+ data transfer, etc.
+ <LI>
+ time to process the end of transfer interrupt, housekeeping at the end
+ of an I/O operation and return from the read() or write() system call
+ (time here is CPU bound and also assumed to be negligible per I/O)
+</UL>
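+<P>
+A minimal numeric sketch (in Python) of this decomposition; the values
+are hypothetical and chosen only to illustrate the assumption that the
+CPU-bound components are negligible per I/O:</P>
+<PRE>
+# Hypothetical per-I/O times, all in milliseconds.
+cpu_setup    = 0.01   # read()/write() entry and request set-up (assumed tiny)
+queue_time   = 6.0    # waiting to be handed to the device controller
+service_time = 13.0   # controller scheduling, seek, rotation, data transfer
+cpu_finish   = 0.01   # end of transfer interrupt and return (assumed tiny)
+
+response_time = cpu_setup + queue_time + service_time + cpu_finish
+print(f"response time {response_time:.2f} msec "
+      f"(dominated by queue {queue_time} + service {service_time})")
+</PRE>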
+<P>
+Note that while the CPU time per I/O is assumed to be small in
+relation to the times involving operations at the device level,
+when the system-wide I/O rate is high (and it could be tens of
+thousands of I/Os per second on a very large configuration), the <B>aggregate</B>
+ CPU demand to support this I/O activity may be significant.</P>
+<P>
+The kernel agents for PCP export the following metrics for each disk spindle:</P>
+<TABLE BORDER="1">
+ <CAPTION ALIGN="BOTTOM"><B>Table 1: Raw PCP disk metrics</B></CAPTION>
+ <TR VALIGN="TOP">
+ <TH>Metric</TH>
+ <TH>Units</TH>
+ <TH>Semantics</TH>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>disk.dev.read</TT></I></TD>
+ <TD>number</TD>
+ <TD>running total of <B>read</B> I/O requests since boot time</TD>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>disk.dev.write</TT></I></TD>
+ <TD>number</TD>
+ <TD>running total of <B>write</B> I/O requests since boot time</TD>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>disk.dev.total</TT></I></TD>
+ <TD>number</TD>
+ <TD>running total of I/O requests since boot time, equals <I><TT>disk.dev.read</TT></I>
+ + <I><TT>disk.dev.write</TT></I></TD>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>disk.dev.blkread</TT></I></TD>
+ <TD>number</TD>
+ <TD>running total of data <B>read</B> since boot time in units
+ of 512-byte blocks</TD>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>disk.dev.blkwrite</TT></I></TD>
+ <TD>number</TD>
+ <TD>running total of data <B>written</B> since boot time in
+ units of 512-byte blocks</TD>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>disk.dev.blktotal</TT></I></TD>
+ <TD>number</TD>
+ <TD>running total of data <B>read</B> or <B>written</B> since
+    boot time in units of 512-byte blocks, equals <I><TT>disk.dev.blkread
+ + disk.dev.blkwrite</TT></I></TD>
+ </TR>
+ <TR>
+ <TD><I><TT>disk.dev.read_bytes</TT></I></TD>
+ <TD>Kbytes</TD>
+ <TD>running total of data <B>read</B> since boot time in units
+ of Kbytes</TD>
+ </TR>
+ <TR>
+ <TD><I><TT>disk.dev.write_bytes</TT></I></TD>
+ <TD>Kbytes</TD>
+ <TD>running total of data <B>written</B> since boot time in
+ units of Kbytes</TD>
+ </TR>
+ <TR>
+ <TD><I><TT>disk.dev.bytes</TT></I></TD>
+ <TD>Kbytes</TD>
+ <TD>running total of data <B>read</B> or <B>written</B> since
+ boot time in units of Kbytes, equals <I><TT>disk.dev.read_bytes
+ + disk.dev.write_bytes</TT></I></TD>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>disk.dev.active</TT></I></TD>
+ <TD>milliseconds</TD>
+ <TD>running total (milliseconds since boot time) of time this
+ device has been busy servicing at least one I/O request</TD>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>disk.dev.response</TT></I></TD>
+ <TD>milliseconds</TD>
+ <TD>running total (milliseconds since boot time) of the
+ response time for all completed I/O requests</TD>
+ </TR>
+</TABLE>
+<P>
+These metrics are all &quot;counters&quot;, so when displayed with most
+PCP tools they are sampled periodically and the differences between
+consecutive values are converted to rates or time utilization over the
+sample interval, as follows (a sketch of this rate conversion appears
+after Table 2):</P>
+<TABLE BORDER="1">
+ <CAPTION ALIGN="BOTTOM"><B>Table 2: PCP disk metrics as reported by
+ most PCP tools</B></CAPTION>
+ <TR VALIGN="TOP">
+ <TH>Metric</TH>
+ <TH>Units</TH>
+ <TH>Semantics</TH>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>disk.dev.read</TT></I></TD>
+ <TD>number per second</TD>
+ <TD><B>read</B> I/O requests per second (or <B>read</B> IOPS)</TD>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>disk.dev.write</TT></I></TD>
+ <TD>number per second</TD>
+ <TD><B>write</B> I/O requests per second (or <B>write</B> IOPS)</TD>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>disk.dev.total</TT></I></TD>
+ <TD>number per second</TD>
+ <TD>I/O requests per second (or IOPS)</TD>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>disk.dev.blkread</TT></I></TD>
+ <TD>number per second</TD>
+ <TD>2 * (Kbytes <B>read</B> per second)</TD>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>disk.dev.blkwrite</TT></I></TD>
+ <TD>number per second</TD>
+ <TD>2 * (Kbytes <B>written </B>per second)</TD>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>disk.dev.blktotal</TT></I></TD>
+ <TD>number per second</TD>
+ <TD>2 * (Kbytes <B>read</B> or <B>written</B> per second)</TD>
+ </TR>
+ <TR>
+ <TD><I><TT>disk.dev.read_bytes</TT></I></TD>
+ <TD>Kbytes per second</TD>
+ <TD>Kbytes <B>read</B> per second</TD>
+ </TR>
+ <TR>
+ <TD><I><TT>disk.dev.write_bytes</TT></I></TD>
+ <TD>Kbytes per second</TD>
+ <TD>Kbytes <B>written </B>per second</TD>
+ </TR>
+ <TR>
+ <TD><I><TT>disk.dev.bytes</TT></I></TD>
+ <TD>Kbytes per second</TD>
+ <TD>Kbytes <B>read</B> or <B>written</B> per second</TD>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>disk.dev.active</TT></I></TD>
+ <TD>time utilization</TD>
+ <TD>fraction of time device was &quot;busy&quot; over the
+ sample interval (either in the range 0.0-1.0 or expressed as a
+    percentage in the range 0-100); in this context &quot;busy&quot; means
+ servicing one or more I/O requests</TD>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>disk.dev.response</TT></I></TD>
+ <TD>time utilization</TD>
+    <TD>time average of the response time over the interval; this
+    is a slightly strange metric in that values larger than 1.0 (or 100%)
+    imply either device saturation, controller saturation or a very
+    ``bursty'' request arrival pattern -- in isolation there is <B>no
+    sensible interpretation</B> of the rate converted value of
+    this metric</TD>
+ </TR>
+</TABLE>
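+<P>
+As an illustration of this rate conversion, the following sketch (in
+Python, with made-up counter values; the dictionary keys mirror a few of
+the metric names in Table 1) turns two consecutive samples of the raw
+counters into the rates and utilization of Table 2:</P>
+<PRE>
+# Two hypothetical samples of the Table 1 counters, taken 10 seconds apart.
+interval = 10.0                                              # seconds
+prev = {'read': 51000, 'blkread': 816000, 'active': 731000}  # earlier sample
+curr = {'read': 51400, 'blkread': 822400, 'active': 736000}  # later sample
+
+read_rate   = (curr['read'] - prev['read']) / interval            # read IOPS
+kb_read     = (curr['blkread'] - prev['blkread']) / 2 / interval  # Kbytes/sec
+utilization = (curr['active'] - prev['active']) / (interval * 1000.0)
+
+print(f"read/s {read_rate:.1f}  read Kbytes/s {kb_read:.1f}  "
+      f"busy {100.0 * utilization:.0f}%")
+# with these made-up values: read/s 40.0  read Kbytes/s 320.0  busy 50%
+</PRE>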
+<P>
+The <I>sar</I> metrics <I><TT>avque</TT></I>, <I><TT>avwait</TT></I>
+ and <I><TT>avserv</TT></I> are subject to widespread
+misinterpretation, and so warrant some special explanation. They may be
+understood with the aid of a simple illustrative example. Consider the
+following snapshot of disk activity in which the response time has been
+simplified to be a multiple of 10 milliseconds for each I/O operation
+over a 100 millisecond sample interval (this is an unlikely
+simplification, but makes the arithmetic easier).</P>
+<CENTER><P ALIGN="CENTER">
+<IMG SRC="images/sar-d.png" WIDTH="529" HEIGHT="152"></P>
+</CENTER><P>
+Each green block represents a 4 Kbyte read. Each red block represents a
+16 Kbyte write.</P>
+<DL>
+ <DT>
+ <I><TT>avque</TT></I>
+ <DD>
+ <P>
+ The <B><I>stochastic</I></B> <B><I>average</I></B> of the
+ &quot;queue&quot; length sampled just before each I/O is complete,
+ where ``queue'' here includes those requests in the queue <B>and</B>
+ those being serviced by the device controller. Unfortunately the <B><I>stochastic</I></B>
+ <B><I>average</I></B> of a queue length is not the same as the
+ more commonly understood <B><I>temporal</I></B> or <B><I>time</I></B>
+ <B><I>average</I></B> of a queue length. </P>
+ <P>
+ In the table below, <B>R</B> is the contribution to the sum of the
+ response times, <B>Qs</B> is the contribution to the sum of the
+ queue length used to compute the <B><I>stochastic</I></B> average
+ and <B>Qt</B> is the contribution to the sum of the queue length
+ &#215; time used to compute the <B><I>temporal</I></B> average. </P>
+</DL>
+ <CENTER>
+ <TABLE BORDER="1">
+ <TR>
+ <TH ALIGN="CENTER">
+ <B>Time</B><BR>
+ (msec)</TH>
+ <TH ALIGN="CENTER"><B>Event</B></TH>
+ <TH ALIGN="CENTER"><B>R</B><BR>
+ (msec)</TH>
+ <TH ALIGN="CENTER"><B>Qs</B></TH>
+ <TH ALIGN="CENTER"><B>Qt</B><BR>
+ (msec)</TH>
+ </TR>
+ <TR>
+ <TD ALIGN="RIGHT">300</TD>
+ <TD>Start I/O #1 (write)</TD>
+ <TD ALIGN="RIGHT">&nbsp;</TD>
+ <TD ALIGN="RIGHT">&nbsp;</TD>
+ <TD ALIGN="RIGHT">&nbsp;</TD>
+ </TR>
+ <TR>
+ <TD ALIGN="RIGHT">320</TD>
+ <TD>End I/O #1</TD>
+ <TD ALIGN="RIGHT">20</TD>
+ <TD ALIGN="RIGHT">1</TD>
+ <TD ALIGN="RIGHT">1&#215;20</TD>
+ </TR>
+ <TR>
+ <TD ALIGN="RIGHT">320</TD>
+ <TD>Start I/O #2 (read)</TD>
+ <TD ALIGN="RIGHT">&nbsp;</TD>
+ <TD ALIGN="RIGHT">&nbsp;</TD>
+ <TD ALIGN="RIGHT">&nbsp;</TD>
+ </TR>
+ <TR>
+ <TD ALIGN="RIGHT">320</TD>
+ <TD>Start I/O #3 (read)</TD>
+ <TD ALIGN="RIGHT">&nbsp;</TD>
+ <TD ALIGN="RIGHT">&nbsp;</TD>
+ <TD ALIGN="RIGHT">&nbsp;</TD>
+ </TR>
+ <TR>
+ <TD ALIGN="RIGHT">330</TD>
+ <TD>End I/O #2</TD>
+ <TD ALIGN="RIGHT">10</TD>
+ <TD ALIGN="RIGHT">2</TD>
+ <TD ALIGN="RIGHT">2&#215;10</TD>
+ </TR>
+ <TR>
+ <TD ALIGN="RIGHT">340</TD>
+ <TD>End I/O #3</TD>
+ <TD ALIGN="RIGHT">20</TD>
+ <TD ALIGN="RIGHT">1</TD>
+ <TD ALIGN="RIGHT">1&#215;10</TD>
+ </TR>
+ <TR>
+ <TD ALIGN="RIGHT">360</TD>
+ <TD>Start I/O #4 (write)</TD>
+ <TD ALIGN="RIGHT">&nbsp;</TD>
+ <TD ALIGN="RIGHT">&nbsp;</TD>
+ <TD ALIGN="RIGHT">&nbsp;</TD>
+ </TR>
+ <TR>
+ <TD ALIGN="RIGHT">360</TD>
+ <TD>Start I/O #5 (read)</TD>
+ <TD ALIGN="RIGHT">&nbsp;</TD>
+ <TD ALIGN="RIGHT">&nbsp;</TD>
+ <TD ALIGN="RIGHT">&nbsp;</TD>
+ </TR>
+ <TR>
+ <TD ALIGN="RIGHT">360</TD>
+ <TD>Start I/O #6 (read)</TD>
+ <TD ALIGN="RIGHT">&nbsp;</TD>
+ <TD ALIGN="RIGHT">&nbsp;</TD>
+ <TD ALIGN="RIGHT">&nbsp;</TD>
+ </TR>
+ <TR>
+ <TD ALIGN="RIGHT">370</TD>
+ <TD>End I/O #6</TD>
+ <TD ALIGN="RIGHT">10</TD>
+ <TD ALIGN="RIGHT">3</TD>
+ <TD ALIGN="RIGHT">3&#215;10</TD>
+ </TR>
+ <TR>
+ <TD ALIGN="RIGHT">380</TD>
+ <TD>End I/O #5</TD>
+ <TD ALIGN="RIGHT">20</TD>
+ <TD ALIGN="RIGHT">2</TD>
+ <TD ALIGN="RIGHT">2&#215;10</TD>
+ </TR>
+ <TR>
+ <TD ALIGN="RIGHT">400</TD>
+ <TD>End I/O #4</TD>
+ <TD ALIGN="RIGHT">40</TD>
+ <TD ALIGN="RIGHT">1</TD>
+ <TD ALIGN="RIGHT">1&#215;20</TD>
+ </TR>
+</TABLE>
+</CENTER>
+<DL>
+ <DT>
+ &nbsp;
+ <DD>
+ <P>
+ The (stochastic) average response time is sum(<B>R</B>) / 6 = 120 /
+ 6 = 20 msec.</P>
+ <P>
+ The <B><I>stochastic</I></B> <B><I>average</I></B> of the queue
+ length is sum(<B>Qs</B>) / 6 = 10 / 6 = 1.67.</P>
+ <P>
+ The <B><I>temporal </I></B> <B><I>average</I></B> of the queue
+ length is sum(<B>Qt</B>) / 100 = 120 / 100 = 1.20.</P>
+ <P>
+ Even in this simple example, the two methods for computing the &quot;average&quot;
+ queue length produce different answers. As the inter-arrival rate
+ for I/O requests becomes more variable, and particularly when many I/O
+ requests are issued in a short period of time followed by a period of
+ quiescence, the two methods produce radically different results.</P>
+ <P>
+    For example, if the idle period in the example above was 420 msec rather
+    than 20 msec, then the <B><I>stochastic</I></B> <B><I>average</I></B>
+    would remain unchanged at 1.67, but the <B><I>temporal average</I></B>
+    would fall to 120/500 = 0.24. Given that this disk is now <B>idle</B>
+    for 420/500 = 84% of the time, one can see how misleading the <B><I>stochastic</I></B>
+    <B><I>average</I></B> can be. Unfortunately many disks are subject
+    to exactly this pattern of short bursts in which many I/Os are enqueued,
+    followed by long periods of comparative calm (consider the flushing of
+    dirty blocks by <I>bdflush</I> in IRIX or the DBWR process in Oracle).
+    Under these circumstances, <I><TT>avque</TT></I> as reported by <I>sar</I>
+    can be very misleading.</P>
+ <DT>
+ <I><TT>avserv</TT></I>
+ <DD>
+ <P>
+ Because multiple operations may be processed by the controller at the
+ same time, and the order of completion is not necessarily the same as
+    the order of dispatch, the notion of an individual service time is
+    difficult (if not impossible) to measure. Rather, <I>sar</I>
+    approximates it as the total time the disk was busy servicing at
+ least one request divided by the number of completed requests.</P>
+ <P>
+    In the example above the device was busy for 80 msec, in which time
+    6 I/Os were completed, so the average service time is 80 / 6 = 13.33 msec.</P>
+ <DT>
+ <I><TT>avwait</TT></I>
+ <DD>
+ <P>
+ For reasons similar to those applying to <I><TT>avserv</TT></I> the
+ average time spent waiting cannot be split between waiting in the
+ queue of requests to be sent to the controller and waiting at the
+ controller while some other concurrent request is being processed. So <I>sar</I>
+ computes the total time spent waiting as the total response time minus
+ the total service time, and then averages over the number of completed
+ requests.</P>
+ <P>
+    In the example above this translates to a total waiting time of 120
+    msec - 80 msec = 40 msec, in which time 6 I/Os were completed, so the
+    average waiting time is 40 / 6 = 6.67 msec (the sketch following this
+    list reproduces all of these numbers).</P>
+</DL>
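+<P>
+A short sketch (in Python) that reproduces all of these numbers from the
+completion events in the table above may make the distinctions clearer:</P>
+<PRE>
+# Completion events from the table above: (response time in msec,
+# queue length just before completion, queue length x time contribution).
+events = [(20, 1, 1 * 20), (10, 2, 2 * 10), (20, 1, 1 * 10),
+          (10, 3, 3 * 10), (20, 2, 2 * 10), (40, 1, 1 * 20)]
+interval_ms = 100.0   # length of the sample interval
+busy_ms = 80.0        # time the device was servicing at least one request
+
+n = len(events)
+resp_sum = sum(r for r, qs, qt in events)
+qs_sum = sum(qs for r, qs, qt in events)
+qt_sum = sum(qt for r, qs, qt in events)
+
+print("average response time:", resp_sum / n, "msec")          # 20.0
+print("stochastic average queue length (avque):", qs_sum / n)  # 1.67
+print("temporal average queue length:", qt_sum / interval_ms)  # 1.2
+print("avserv:", busy_ms / n, "msec")                           # 13.33
+print("avwait:", (resp_sum - busy_ms) / n, "msec")              # 6.67
+</PRE>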
+<P>
+When run with a <B>-d</B> option, <I>sar</I> reports the following for
+each disk spindle:</P>
+<TABLE BORDER="1">
+ <CAPTION ALIGN="BOTTOM"><B>Table 3: PCP and sar metric equivalents</B></CAPTION>
+ <TR VALIGN="TOP">
+ <TH>Metric</TH>
+ <TH>Units</TH>
+ <TH>PCP equivalent<BR>
+ (in terms of the rate converted metrics in Table 2)</TH>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>%busy</TT></I></TD>
+ <TD>percent</TD>
+ <TD>100 * <I><TT>disk.dev.active</TT></I> </TD>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>avque</TT></I></TD>
+ <TD>I/O operations</TD>
+ <TD>N/A (see above)</TD>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>r+w/s</TT></I></TD>
+ <TD>I/Os per second</TD>
+ <TD><I><TT>disk.dev.total</TT></I></TD>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>blks/s</TT></I></TD>
+ <TD>512-byte blocks per second</TD>
+ <TD><I><TT>disk.dev.blktotal</TT></I></TD>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>w/s</TT></I></TD>
+ <TD><B>write</B> I/Os per second</TD>
+ <TD><I><TT>disk.dev.write</TT></I></TD>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>wblks/s</TT></I></TD>
+ <TD>512-byte blocks <B>written</B> per second</TD>
+ <TD><I><TT>disk.dev.blkwrite</TT></I></TD>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>avwait</TT></I></TD>
+ <TD>milliseconds</TD>
+ <TD>1000 * (<I><TT>disk.dev.response</TT></I> <I><TT>-
+ disk.dev.active)</TT></I> / <I><TT>disk.dev.total</TT></I></TD>
+ </TR>
+ <TR VALIGN="TOP">
+ <TD><I><TT>avserv</TT></I></TD>
+ <TD>milliseconds</TD>
+ <TD>1000 * <I><TT>disk.dev.active</TT></I> / <I><TT>disk.dev.total</TT></I></TD>
+ </TR>
+</TABLE>
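+<P>
+A sketch (in Python) of the <I><TT>avwait</TT></I> and <I><TT>avserv</TT></I>
+equivalences from Table 3, using rate-converted values hard-coded from the
+worked example above (in live use they would come from a PCP tool):</P>
+<PRE>
+active   = 0.8    # disk.dev.active: fraction of the interval the disk was busy
+response = 1.2    # disk.dev.response: 120 msec of response time per 100 msec
+total    = 60.0   # disk.dev.total: I/O requests completed per second
+
+pct_busy = 100.0 * active                        # sar %busy
+avserv   = 1000.0 * active / total               # sar avserv (msec)
+avwait   = 1000.0 * (response - active) / total  # sar avwait (msec)
+
+print(f"%busy {pct_busy:.0f}  avserv {avserv:.2f} msec  "
+      f"avwait {avwait:.2f} msec")
+# expected: %busy 80  avserv 13.33 msec  avwait 6.67 msec
+</PRE>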
+<P>
+The table below shows how the PCP tools and <I>sar</I> would report the
+disk performance over the 100 millisecond interval from the example
+above:</P>
+<TABLE BORDER="1">
+  <CAPTION ALIGN="BOTTOM"><B>Table 4: Illustrative values and
+ calculations</B></CAPTION>
+ <TR>
+ <TH>Rate converted PCP metric<BR>
+    (as in Table 2)</TH>
+ <TH>sar metrics</TH>
+ <TH>Explanation</TH>
+ </TR>
+ <TR>
+ <TD><I><TT>disk.dev.read</TT></I></TD>
+ <TD>N/A</TD>
+ <TD>4 reads in 100 msec = 40 reads per second</TD>
+ </TR>
+ <TR>
+ <TD><I><TT>disk.dev.write</TT></I></TD>
+ <TD><I><TT>w/s</TT></I></TD>
+ <TD>2 writes in 100 msec = 20 writes per second</TD>
+ </TR>
+ <TR>
+ <TD><I><TT>disk.dev.total</TT></I></TD>
+ <TD><I><TT>r+w/s</TT></I></TD>
+    <TD>4 reads + 2 writes in 100 msec = 60 I/Os per second</TD>
+ </TR>
+ <TR>
+ <TD><I><TT>disk.dev.blkread</TT></I></TD>
+ <TD>N/A</TD>
+    <TD>4 reads * 4 Kbytes = 16 Kbytes = 32 blocks in 100 msec = 320
+    blocks read per second</TD>
+ </TR>
+ <TR>
+ <TD><I><TT>disk.dev.blkwrite</TT></I></TD>
+ <TD><I><TT>wblks/s</TT></I></TD>
+    <TD>2 writes * 16 Kbytes = 32 Kbytes = 64 blocks in 100 msec = 640
+    blocks written per second</TD>
+ </TR>
+ <TR>
+ <TD><I><TT>disk.dev.blktotal</TT></I></TD>
+ <TD><I><TT>blks/s</TT></I></TD>
+ <TD>96 blocks in 100 msec = 960 blocks per second</TD>
+ </TR>
+ <TR>
+ <TD><I><TT>disk.dev.read_bytes</TT></I></TD>
+ <TD>N/A</TD>
+ <TD>4 * 4 Kbytes = 16 Kbytes in 100 msec = 160 Kbytes per second</TD>
+ </TR>
+ <TR>
+ <TD><I><TT>disk.dev.write_bytes</TT></I></TD>
+ <TD>N/A</TD>
+ <TD>2 * 16 Kbytes = 32 Kbytes in 100 msec = 320 Kbytes per
+ second</TD>
+ </TR>
+ <TR>
+ <TD><I><TT>disk.dev.bytes</TT></I></TD>
+ <TD>N/A</TD>
+ <TD>48 Kbytes in 100 msec = 480 Kbytes per second</TD>
+ </TR>
+ <TR>
+ <TD><I><TT>disk.dev.active</TT></I></TD>
+ <TD><I><TT>%busy</TT></I></TD>
+ <TD>80 msec active in 100 msec = 0.8 or 80%</TD>
+ </TR>
+ <TR>
+ <TD><I><TT>disk.dev.response</TT></I></TD>
+ <TD>N/A</TD>
+ <TD>Disregard (see comments in Table 2)</TD>
+ </TR>
+ <TR>
+ <TD>N/A</TD>
+ <TD><I><TT>avque</TT></I></TD>
+ <TD>1.67 requests (see derivation above)</TD>
+ </TR>
+ <TR>
+ <TD>N/A</TD>
+ <TD><I><TT>avwait</TT></I></TD>
+ <TD>6.67 msec (see derivation above)</TD>
+ </TR>
+ <TR>
+ <TD>N/A</TD>
+ <TD><I><TT>avserv</TT></I></TD>
+ <TD>13.33 msec (see derivation above)</TD>
+ </TR>
+</TABLE>
+<P>
+In practice many of these metrics are of little use. Fortunately the
+most common performance problems related to disks can be identified
+quite simply as follows:</P>
+<DL>
+ <DT>
+ <B>Device saturation</B>
+ <DD>
+  Occurs when <I><TT>disk.dev.active</TT></I> is close to 1.0
+  (which is the same as <I><TT>%busy</TT></I> being close to 100%).
+ <DT>
+ <B>Device throughput</B>
+ <DD>
+  Use <I><TT>disk.dev.bytes</TT></I> (or <I><TT>blks/s</TT></I>
+  divided by 2 to produce Kbytes per second).
+ <DD>
+ The peak value depends on the bus and disk characteristics, and is
+ subject to significant variation depending on the distribution, size
+ and type of requests. Fortunately in many environments the peak value
+ does not change over time, so once established, monitoring thresholds
+ tend to remain valid.
+ <DT>
+ <B>Read/write mix</B>
+ <DD>
+ For some disks (and RAID devices in particular) writes may be slower
+ than reads. The ratio of <I><TT>disk.dev.write</TT></I> to <I><TT>disk.dev.total</TT></I>
+ (or <I><TT>w/s</TT></I> to <I><TT>r+w/s</TT></I>) indicates the
+  fraction of I/O requests that are writes. A sketch combining these
+  three checks follows this list.
+</DL>
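+<P>
+A sketch (in Python) of how these three checks might be coded against
+rate-converted values of the Table 2 metrics; the sample values are
+hypothetical and the thresholds are examples only, not recommendations:</P>
+<PRE>
+# Hypothetical rate-converted samples for one disk (see Table 2).
+active = 0.97      # disk.dev.active: utilization over the interval
+kbytes = 41000.0   # disk.dev.bytes: Kbytes per second
+writes = 150.0     # disk.dev.write: write I/Os per second
+total  = 400.0     # disk.dev.total: total I/Os per second
+
+# Example thresholds only; real values depend on bus and disk characteristics.
+if active &gt;= 0.95:
+    print("device saturation: busy", round(100 * active), "% of the interval")
+if kbytes &gt;= 40000.0:
+    print("high throughput:", kbytes, "Kbytes per second")
+
+write_fraction = writes / total if total else 0.0
+print("fraction of I/O requests that are writes:", round(write_fraction, 2))
+</PRE>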
+<P>
+In terms of the available instrumentation from the IRIX kernel, one
+potentially useful metric would be the stochastic average of the
+response time per completed I/O operation, which in the sample above
+would be 20 msec. Unfortunately no performance tool reports this
+directly.</P>
+<UL>
+ <LI>
+ For <I>sar</I>, this metric is the sum of <I><TT>avwait</TT></I>
+ and <I><TT>avserv</TT></I>.
+ <P>
+ </P>
+ <LI>
+ The common PCP tools only support temporal rate conversion for
+  counters; however, the stochastic average of the response time can be
+ computed with the PCP inference engine (<I>pmie</I>) using an
+ expression of the form:
+ <PRE>
+<TT>avg_resp = 1000 * disk.dev.response / disk.dev.total;</TT>
+</PRE>
+</UL>
+
+<P><BR></P>
+<TABLE WIDTH=100% BORDER=0 CELLPADDING=0 CELLSPACING=0 BGCOLOR="#e2e2e2">
+ <TR><TD WIDTH=100% BGCOLOR="#081c59"><P ALIGN=LEFT><FONT SIZE=5 COLOR="#ffffff"><B>A real example</B></FONT></P></TD></TR>
+</TABLE>
+<P>
+Consider this data from <B>sar -d</B> with a <B>10 minute</B> update
+interval:</P>
+<PRE>
+ device %busy avque r+w/s blks/s w/s wblks/s avwait avserv
+ dks0d2 34 12.8 32 988 29 874 123.1 10.5
+ dks0d5 34 12.5 33 1006 29 891 119.0 10.4
+</PRE>
+<P>
+At first impression, queue lengths of 12-13 requests and wait times of
+120 msec look pretty bad.</P>
+<P>
+But further investigation is warranted ...</P>
+<UL>
+ <LI>
+ most of the I/Os are writes (58 of 65 I/Os per second)
+ <LI>
+  the average write size is (874+891)*512/(29+29) = 15580 bytes ... close to
+  the default 16 Kbyte filesystem block size
+ <LI>
+  sustaining (874+891)*512 = 903680 bytes of write throughput per second
+  for at least 10 minutes means a lot of file writes are being done
+ <LI>
+ the disks are not unduly busy at 34% utilization
+ <LI>
+  consider what happens when <I>bdflush</I>, <I>pdflush</I> and
+  friends run ... let's make some simplifying assumptions to keep the
+  arithmetic easy (a sketch reproducing this arithmetic follows the list)
+ <UL>
+ <LI>
+ we are dirtying (writing) 60 x 16 Kbyte pages (983040 bytes) per second
+ <LI>
+ flushing goes off every 10 seconds, but the page cache is scanned in
+ something under 10 msec
+ <LI>
+ to keep up, each flush must push out 600 pages
+ <LI>
+ I/O is balanced across 2 disks
+ <LI>
+ disk service time is 10 msec per I/O
+ <LI>
+ after the flushing code has scanned the page cache, all 300 writes per
+ disk are on the queue <B>before</B> the first one is done (this
+ is what skews the wait time and queue lengths)
+ </UL>
+ <LI>
+ disk utilization is 300 * 10 / (10 * 1000) = 0.3 = 30%
+ <LI>
+  the stochastic average wait time is (0 + 10 + 20 + ... + 2990) / 300
+  = 1495 msec, i.e. almost 1.5 seconds
+ <LI>
+ time to empty the queue after a flush is 3 seconds
+ <LI>
+  the temporal average queue length is 0 * 7/10 + 150 * 3/10 = 45
+  (the queue drains linearly from 300 to 0 over the 3 seconds, so it
+  averages 150 while draining)
+</UL>
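+<P>
+A small sketch (in Python) that reproduces the burst arithmetic above,
+under the same simplifying assumptions:</P>
+<PRE>
+# Simplified model of one flush burst on one disk, per the assumptions above.
+writes_per_burst = 300       # writes queued (almost) instantaneously
+service_ms       = 10.0      # per-I/O device service time
+cycle_ms         = 10000.0   # flush period: one burst every 10 seconds
+
+busy_ms = writes_per_burst * service_ms   # 3000 msec to drain the queue
+util    = busy_ms / cycle_ms              # 0.3, i.e. 30% busy
+
+# the i-th write waits (i - 1) * service_ms before being serviced
+avg_wait = sum((i - 1) * service_ms
+               for i in range(1, writes_per_burst + 1)) / writes_per_burst
+
+# the queue drains linearly from 300 to 0 over 3 seconds (average 150),
+# and is empty for the remaining 7 seconds of the cycle
+temporal_avg_queue = (writes_per_burst / 2.0) * (busy_ms / cycle_ms)
+
+print(f"utilization {util:.0%}  average wait {avg_wait:.0f} msec  "
+      f"temporal average queue {temporal_avg_queue:.0f}")
+# expected: utilization 30%  average wait 1495 msec  temporal average queue 45
+</PRE>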
+<P>
+The complicating issue here is that the I/O demand is very bursty and
+this is what skews the &quot;average&quot; measures.</P>
+<P>
+In this case, the I/O is probably <B>asynchronous</B> with respect to
+the process(es) doing the writing. Under these circumstances,
+performance is unlikely to improve dramatically if the aggregate I/O
+bandwidth were increased (e.g. by spreading the writes across more disk
+spindles).</P>
+<P>
+However if the I/O is <B>synchronous</B> (e.g. if it was read dominated,
+or the I/O was to a raw disk), then more I/O bandwidth would reduce
+application running time.</P>
+<P>
+There are also <B>hybrid</B> scenarios in which a small number of
+synchronous reads are seriously slowed down during the bursts of
+asynchronous writes. In the example above, a read could have the
+misfortune of being queued behind 300 writes (or delayed for 3 seconds).</P>
+
+<P><BR></P>
+<TABLE WIDTH=100% BORDER=0 CELLPADDING=0 CELLSPACING=0 BGCOLOR="#e2e2e2">
+ <TR><TD WIDTH=100% BGCOLOR="#081c59"><P ALIGN=LEFT><FONT SIZE=5 COLOR="#ffffff"><B>Beware of Wait I/O</B></FONT></P></TD></TR>
+</TABLE>
+<P>
+PCP (and <I>sar</I> and <I>osview</I> and ...) all report CPU
+utilization broken down into:</P>
+<UL>
+ <LI>
+ user
+ <LI>
+ system (sys, intr)
+ <LI>
+ idle
+ <LI>
+ wait (for file system I/O, graphics, physical I/O and swap I/O)
+</UL>
+<P>
+Because I/O does not &quot;belong&quot; to any processor (and in some
+cases may not &quot;belong&quot; to any current process), a CPU that is
+&quot;waiting for I/O&quot; is more accurately described as an
+&quot;idle CPU while at least one I/O is outstanding&quot;.</P>
+<P>
+Anomalous Wait I/O time occurs under light load when a small number of <B>processes</B>
+are waiting for I/O while many <B>CPUs</B> are otherwise idle, yet those
+idle CPUs appear in the &quot;Wait for I/O&quot; state. When the number of CPUs
+increases to 30, 60 or 120, then 1 process doing I/O can make all but one
+of the CPUs look like they are waiting for I/O, but clearly no
+amount of I/O bandwidth increase is going to make any difference to
+these CPUs. And if that one process is doing asynchronous I/O and not
+blocking, then additional I/O bandwidth will not make it run faster
+either.</P>
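+<P>
+As a rough illustration of how the CPU count inflates Wait I/O, consider
+the extreme (and simplified) case where exactly one process is doing I/O
+and nothing else is runnable; exactly how wait time is attributed varies
+between kernels, so the sketch below is only indicative:</P>
+<PRE>
+# One process blocked on I/O and every other CPU idle: in the worst case
+# each otherwise-idle CPU is reported as "waiting for I/O".
+for ncpu in (4, 30, 60, 120):
+    wait_pct = 100.0 * (ncpu - 1) / ncpu
+    print(f"{ncpu:4d} CPUs: up to {wait_pct:.0f}% of CPU time shown as Wait I/O")
+</PRE>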
+
+<TABLE WIDTH=100% BORDER=0 CELLPADDING=10 CELLSPACING=20>
+ <TR><TD BGCOLOR="#e2e2e2" WIDTH=70%><BR><IMG SRC="images/stepfwd_on.png" WIDTH=16 HEIGHT=16 BORDER=0>&nbsp;&nbsp;&nbsp;Using <I>pmchart</I> to display concurrent disk and CPU activity (aggregated over all CPUs and all disks respectively).<BR>
+<PRE><B>
+$ source /etc/pcp.conf
+$ tar xzf $PCP_DEMOS_DIR/tutorials/diskperf.tgz
+$ pmchart -t 2sec -O -0sec -a diskperf/waitio -c diskperf/waitio.view
+</B></PRE>
+<P>The system has 4 CPUs, several disks and only 1 process really doing I/O.</P>
+<P>Note that over time:</P>
+<UL>
+ <LI>
+ in the top chart as the CPU user (blue) and system (red) time
+ increases, the Wait I/O (pale blue) time decreases
+ <LI>
+ from the bottom chart, the I/O rate is pretty constant throughout
+ <LI>
+ in the bursts where the I/O rate falls, the Wait I/O time becomes CPU
+ idle (green) time
+</UL>
+</TD></TR>
+</TABLE>
+
+<P><BR></P>
+<HR>
+<CENTER>
+<TABLE WIDTH=100% BORDER=0 CELLPADDING=0 CELLSPACING=0>
+  <TR> <TD WIDTH=50%><P>Copyright &copy; 2007-2010 <A HREF="http://www.aconex.com/"><FONT COLOR="#000060">Aconex</FONT></A><BR>Copyright &copy; 2000-2004 <A HREF="http://www.sgi.com/"><FONT COLOR="#000060">Silicon Graphics Inc</FONT></A></P></TD>
+  <TD WIDTH=50%><P ALIGN=RIGHT><A HREF="http://pcp.io/"><FONT COLOR="#000060">PCP Site</FONT></A><BR>Copyright &copy; 2012-2014 <A HREF="http://www.redhat.com/"><FONT COLOR="#000060">Red Hat</FONT></A></P></TD> </TR>
+</TABLE>
+</CENTER>
+</BODY>
+</HTML>