# $Id: parallel.txt,v 1.2 2003/04/16 09:55:53 wiz Exp $
#

These are my thoughts on how one would want a parallel bulk build to
work. 


====================================================================
Single Machine Build Process
====================================================================

The current (as of 2003-03-16) bulk build system works in the
following manner:

1)  All installed packages are removed.

2)  Packages listed in the BULK_PREREQ variable are installed.  This
    must be done before step 2 as some packages (like xpkgwedge) can
    affect the dependencies of other packages when installed.

3)  Each package directory is visited and its explicitly listed
    dependencies are extracted and put in a 'dependstree' file.  The
    mk/bulk/tflat script is used to generate flattened dependencies
    for all packages from this dependstree file in both the up and
    down directions.  The result is a file 'dependsfile' which has one
    line per package that lists all build dependencies.  Additionally,
    a 'supportsfile' is created which has one line for each package
    and lists all packages which depend upon the listed pacakge.
    Finally, tsort(1) is applied to the 'dependstree' file to
    determine the correct build order for the bulk build.  The build
    order is stored in a 'buildorder' file.  This is all achieved via
    the 'bulk-cache' top level target.  By extracting dependencies in
    this fashion, we avoid highly redundant recursive make calls.  For
    example, we no longer need to use a recursive make to find the
    dependencies for libtool literally thousands and thousands of
    times throughout the build.

4)  During the build, the 'buildorder' file is consulted to figure out
    which package should be built next.  Then to build the package,
    the following steps are taken:

    a)  Check for the existance of a '.broken' file in the package
    directory.  If this file exists, then the package is already
    broken for some reason so move on to the next package.

    b)  Remove all packages which are not needed to build the current
    package.  This dependency list is obtained from the 'dependsfile'
    created in step 3 and the BULK_PREREQ variable.

    c)  Install via pkg_add all packages which are needed to build the
    current package.  We are able to do this because we have been
    building our packages in a bottom up order so all dependencies
    should have been build.

    d)  Build and package the package.

    e)  If the package build fails, then we copy over the build log to
    a .broken file and in addition, we consult the 'supportsfile' and
    mark all packages which depend upon this one as broken by adding a
    line to their .broken files (creating them if needed).  By going
    ahead and marking these packages as broken, we avoid wasting time
    on them later.

    f)  Append the package directory name to the top level pkgsrc
    '.make' file to indicate that we have processed this package.

5)  Run the mk/bulk/post-build script to collect the summary and
    generate html pages and the email we've all seen.

====================================================================
Single Machine Build Comments
====================================================================

There are several features of this approach that are worth mentioning
explicitly.  

1)  Packages are built in the correct order.  We don't want to rebuild
    the gnome meta-pkg and then rebuild gnome-libs for example.

2)  Restarting the build is a cheap operation.  Remember that this
    build can take weeks or more.  In fact the 1.6 build took nearly 6
    weeks on a sparc 20!  If for some reason, the build needs to be
    interrupted, it can be easily restarted because in step 4f we keep
    track of what has been built in a file.  The lines in the build
    script which control this are:

      for pkgdir in `cat $ORDERFILE` ; do
        if ! grep -q "^${pkgdir}\$" $BUILDLOG ; then
          (cd $pkgdir && \
             nice -n 20 ${BMAKE} USE_BULK_CACHE=yes bulk-package)
        fi
      done

    In addition to storing the progress to disk, the bulk cache files
    (the 'dependstreefile', 'dependsfile', 'supportsfile', and
    'orderfile') are stored on disk so they do not need to be
    recreated if a build is stopped and then restarted.

3)  By leaving packages installed and only deleting the ones which are
    not needed before each build, we reduce the amount of installing
    and deinstalling needed during the build.  For example, it is
    quite common to build several packages in a row which all need GNU
    make or perl.  

4)  Using the 'supportsfile' to mark all packages which depend upon a
    package which has just failed to build can greatly reduce the time
    wasted on trying to build packages which known broken dependencies.

====================================================================
Parallel Build Thoughts
====================================================================

To exploit multiple machines in an attempt to reduce the build time,
many of the same ideas used in the single machine build can still be
used.  My view of how a parallel build should work is detailed here.

master   == master machine.  This machine is in charge of directing
            the build and may or may not actively participate in it.
            In addition, this machine might not be of the same
            architecture or operating system as the slaves (unless it
            is to be used as a slave as well).

slave#x  == slave machine #x.  All slave machines are of the same
            MACHINE_ARCH and have the same operating system and access
            the same pkgsrc tree via NFS and access the same binary
            packages directory.  

            If the master machine is also to be used as a build
            machine, then it is also considered a slave. 

Prior to starting the build, the master directs one of the slaves to
extract the dependency information per steps 1-3 in the single machine
case. 

The actually build should progress as follows:

1)  For each slave which needs a job, the master assigns a package to
    build based on the rule that only packages that have had all their
    dependencies built will be sent to slaves for compilation.

2)  When a slave finishes, the master either notes that the binary
    package is now available for use as a depends _or_ notes failure
    and marks all packages which depend upon it as broken as in step
    4e of the single machine build.


Each slave builds a package in the same way as it would in a single
machine build (steps 4a-d).

====================================================================
Important Parallel Build Considerations
====================================================================


1)  Security.  Packages are installed as root prior to packaging.

2)  All state kept by the master should be stored to disk to
    facilitate restarting a build.  Remember this could take weeks so
    we don't want to have to start over.

3)  The master needs to be able to monitor all slaves for signs of
    life.  I.e., if a slave machine is simply shut off, the master
    should detect that it's no longer there and re-assign that slaves
    current job.

3a) The master must be able to distinguish between a slave failing to
    compile a package due to the package failing vs a
    network/power/disk/etc. failure.  The former causes the package to
    be marked as broken, the latter causes the slave to be marked as
    broken.

4)  Ability to add and remove slaves from the cluster during a build.
    Again, a build may take a long time so we want to add/remove
    slaves while the build is in progress.

====================================================================
Additional Thoughts
====================================================================

This is mostly related to using slaves which are not on a local
network.

-  maybe a hook could be put in place which rsync's the binary package
   tree between the binary package repository machine and the slave
   machine before and after each package is built? 

-  security

-  support for Kerberos?

====================================================================
Implementation Thoughts
====================================================================

-  Can this all be written around using ssh to send out tasks?  How do
   we monitor slaves for signs of life?  How do we indicate 'build
   failed/build succeeded/slave failed' conditions?

-  Maybe we could have a file listing slaves and the master consults
   this each time it needs a slave.  That would make adding/removing
   slaves easy.  There would need to be another file to keep track of
   which slaves are busy (and with what).

-  Do we want to use something like pvm instead?  There is a
   p5-Parallel-Pvm package and perl nicely deals with parsing some of
   these files and sorting dependencies although I hate to add any
   extra dependencies to the build system.