Add some notes with thoughts on what a parallel bulk build system should do.

author: dmcmahill <dmcmahill> 2003-03-16 13:45:12 +0000
committer: dmcmahill <dmcmahill> 2003-03-16 13:45:12 +0000
commit: 3428fc0fcd9e2138c37b2c92eb1570c6fe72eb4d (patch)
tree: 24f941a112b56429754cf6d9d63cfa753fe268a3 /mk
parent: c39d3e1fd00eb57dd0723db4f57aed9b80fe9115 (diff)
download: pkgsrc-3428fc0fcd9e2138c37b2c92eb1570c6fe72eb4d.tar.gz
1 files changed, 210 insertions, 0 deletions
diff --git a/mk/bulk/parallel.txt b/mk/bulk/parallel.txt
new file mode 100644
index 00000000000..9b4bf96f4e2
--- /dev/null
+++ b/mk/bulk/parallel.txt
@@ -0,0 +1,210 @@
+# $Id: parallel.txt,v 1.1 2003/03/16 13:45:12 dmcmahill Exp $
+#
+
+These are my thoughts on how one would want a parallel bulk build to
+work. 
+
+
+====================================================================
+Single Machine Build Process
+====================================================================
+
+The current (as of 2003-03-16) bulk build system works in the
+following manner:
+
+1)  All installed packages are removed.
+
+2)  Packages listed in the BULK_PREREQ variable are installed.  This
+    must be done before step 3 as some packages (like xpkgwedge) can
+    affect the dependencies of other packages when installed.
+
+3)  Each package directory is visited and its explicitly listed
+    dependencies are extracted and put in a 'dependstree' file.  The
+    mk/bulk/tflat script is used to generate flattened dependencies
+    for all packages from this dependstree file in both the up and
+    down directions.  The result is a file 'dependsfile' which has one
+    line per package that lists all build dependencies.  Additionally,
+    a 'supportsfile' is created which has one line for each package
+    and lists all packages which depend upon the listed package.
+    Finally, tsort(1) is applied to the 'dependstree' file to
+    determine the correct build order for the bulk build.  The build
+    order is stored in a 'buildorder' file.  This is all achieved via
+    the 'bulk-cache' top level target.  By extracting dependencies in
+    this fashion, we avoid highly redundant recursive make calls.  For
+    example, we no longer need to use a recursive make to find the
+    dependencies for libtool literally thousands and thousands of
+    times throughout the build.
+
+4)  During the build, the 'buildorder' file is consulted to figure out
+    which package should be built next.   Then to build the package,
+    the following steps are taken:
+
+    a)  Check for the existence of a '.broken' file in the package
+    directory.  If this file exists, then the package is already
+    broken for some reason so move on to the next package.
+
+    b)  Remove all packages which are not needed to build the current
+    package.  This dependency list is obtained from the 'dependsfile'
+    created in step 3 and the BULK_PREREQ variable.
+
+    c)  Install via pkg_add all packages which are needed to build the
+    current package.  We are able to do this because we have been
+    building our packages in a bottom up order so all dependencies
+    should have been built.
+
+    d)  Build and package the package.
+
+    e)  If the package build fails, then we copy over the build log to
+    a .broken file and in addition, we consult the 'supportsfile' and
+    mark all packages which depend upon this one as broken by adding a
+    line to their .broken files (creating them if needed).  By going
+    ahead and marking these packages as broken, we avoid wasting time
+    on them later.
+
+    f)  Append the package directory name to the top level pkgsrc
+    '.make' file to indicate that we have processed this package.
+
+5)  Run the mk/bulk/post-build script to collect the summary and
+    generate html pages and the email we've all seen.
+
+====================================================================
+Single Machine Build Comments
+====================================================================
+
+There are several features of this approach that are worth mentioning
+explicitly.  
+
+1)  Packages are built in the correct order.  We don't want to rebuild
+    the gnome meta-pkg and then rebuild gnome-libs for example.
+
+2)  Restarting the build is a cheap operation.  Remember that this
+    build can take weeks or more.  In fact the 1.6 build took nearly 6
+    weeks on a sparc 20!  If for some reason, the build needs to be
+    interrupted, it can be easily restarted because in step 4f we keep
+    track of what has been built in a file.  The lines in the build
+    script which control this are:
+
+      for pkgdir in `cat $ORDERFILE` ; do
+        if ! grep -q "^${pkgdir}\$" $BUILDLOG ; then
+          (cd $pkgdir && \
+             nice -n 20 ${BMAKE} USE_BULK_CACHE=yes bulk-package)
+        fi
+      done
+
+    In addition to storing the progress to disk, the bulk cache files
+    (the 'dependstreefile', 'dependsfile', 'supportsfile', and
+    'orderfile') are stored on disk so they do not need to be
+    recreated if a build is stopped and then restarted.
+
+3)  By leaving packages installed and only deleting the ones which are
+    not needed before each build, we reduce the amount of installing
+    and deinstalling needed during the build.  For example, it is
+    quite common to build several packages in a row which all need GNU
+    make or perl.  
+
+4)  Using the 'supportsfile' to mark all packages which depend upon a
+    package which has just failed to build can greatly reduce the time
+    wasted on trying to build packages with known broken dependencies.
+
+====================================================================
+Parallel Build Thoughts
+====================================================================
+
+To exploit multiple machines in an attempt to reduce the build time,
+many of the same ideas used in the single machine build can still be
+used.  My view of how a parallel build should work is detailed here.
+
+master   == master machine.  This machine is in charge of directing
+            the build and may or may not actively participate in it.
+            In addition, this machine might not be of the same
+            architecture or operating system as the slaves (unless it
+            is to be used as a slave as well).
+
+slave#x  == slave machine #x.  All slave machines are of the same
+            MACHINE_ARCH and have the same operating system and access
+            the same pkgsrc tree via NFS and access the same binary
+            packages directory.  
+
+            If the master machine is also to be used as a build
+            machine, then it is also considered a slave. 
+
+Prior to starting the build, the master directs one of the slaves to
+extract the dependency information per steps 1-3 in the single machine
+case. 
+
+The actual build should progress as follows:
+
+1)  For each slave which needs a job, the master assigns a package to
+    build based on the rule that only packages that have had all their
+    dependencies built will be sent to slaves for compilation.
+
+2)  When a slave finishes, the master either notes that the binary
+    package is now available for use as a depends _or_ notes failure
+    and marks all packages which depend upon it as broken as in step
+    4e of the single machine build.
+
+
+Each slave builds a package in the same way as it would in a single
+machine build (steps 4a-d).
+
+====================================================================
+Important Parallel Build Considerations
+====================================================================
+
+
+1)  Security.  Packages are installed as root prior to packaging.
+
+2)  All state kept by the master should be stored to disk to
+    facilitate restarting a build.  Remember this could take weeks so
+    we don't want to have to start over.
+
+3)  The master needs to be able to monitor all slaves for signs of
+    life.  I.e., if a slave machine is simply shut off, the master
+    should detect that it's no longer there and re-assign that slave's
+    current job.
+
+3a) The master must be able to distinguish between a slave failing to
+    compile a package due to the package failing vs a
+    network/power/disk/etc. failure.  The former causes the package to
+    be marked as broken, the latter causes the slave to be marked as
+    broken.
+
+4)  Security.
+
+5)  Ability to add and remove slaves from the cluster during a build.
+    Again, a build may take a long time so we want to add/remove
+    slaves while the build is in progress.
+
+====================================================================
+Additional Thoughts
+====================================================================
+
+This is mostly related to using slaves which are not on a local
+network.
+
+-  maybe a hook could be put in place which rsync's the binary package
+   tree between the binary package repository machine and the slave
+   machine before and after each package is built? 
+
+-  security
+
+-  Support for kerberos?
+
+====================================================================
+Implementation Thoughts
+====================================================================
+
+-  Can this all be written around using ssh to send out tasks?  How do
+   we monitor slaves for signs of life?  How do we indicate 'build
+   failed/build succeeded/slave failed' conditions?
+
+-  Maybe we could have a file listing slaves and the master consults
+   this each time it needs a slave.  That would make adding/removing
+   slaves easy.  There would need to be another file to keep track of
+   which slaves are busy (and with what).
+
+-  Do we want to use something like pvm instead?  There is a
+   p5-Parallel-Pvm package and perl nicely deals with parsing some of
+   these files and sorting dependencies although I hate to add any
+   extra dependencies to the build system.
+
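The scheduling rule from the "Parallel Build Thoughts" section of the notes (only hand out packages whose dependencies have all been built, and on failure mark every dependent package as broken) could be sketched in sh roughly as follows. This is an illustration, not part of the commit: the function names `ready_packages` and `mark_broken`, the single `donefile` and `brokenfile` state files, and the one-line-per-package file layout are all assumptions made for the example, and the real bulk build records failures in per-package .broken files instead.

```sh
#!/bin/sh
# Sketch only.  Assumed (hypothetical) file formats, one line per package:
#   dependsfile:  "<pkgdir> <dep> <dep> ..."    flattened build dependencies
#   supportsfile: "<pkgdir> <dependent> ..."    packages depending on pkgdir
# donefile and brokenfile hold one <pkgdir> per line.  They must exist
# (possibly empty) and are the on-disk state that survives a restart.

# Print every package that is neither done nor broken and whose
# dependencies are all listed in donefile.
ready_packages() {
    dependsfile=$1 donefile=$2 brokenfile=$3
    awk -v donef="$donefile" -v brokenf="$brokenfile" '
        BEGIN {
            while ((getline line < donef) > 0)   built[line] = 1
            while ((getline line < brokenf) > 0) skip[line] = 1
        }
        ($1 in built) || ($1 in skip) { next }
        {
            # Fields 2..NF are the dependencies; all must be built.
            for (i = 2; i <= NF; i++)
                if (!($i in built)) next
            print $1
        }
    ' "$dependsfile"
}

# Record a failed package, and everything that depends on it, as broken.
# Because supportsfile is assumed to be flattened, one lookup is enough.
mark_broken() {
    pkg=$1 supportsfile=$2 brokenfile=$3
    {
        echo "$pkg"
        awk -v p="$pkg" '$1 == p { for (i = 2; i <= NF; i++) print $i }' \
            "$supportsfile"
    } >> "$brokenfile"
}
```

A master loop built on this would call `ready_packages` whenever a slave is idle, append a package to `donefile` when its binary package becomes available, and call `mark_broken` when a build fails. Keeping both lists in plain files means a restarted master can pick up where it left off. `mark_broken` may append duplicate entries, which is harmless for the lookup above.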