author | dmcmahill <dmcmahill@pkgsrc.org> | 2003-03-16 13:45:12 +0000 |
---|---|---|
committer | dmcmahill <dmcmahill@pkgsrc.org> | 2003-03-16 13:45:12 +0000 |
commit | 5fb18e18270fb0f0c6692d5e1302fd4c17865b37 (patch) | |
tree | 24f941a112b56429754cf6d9d63cfa753fe268a3 /mk/bulk/parallel.txt | |
parent | c76dd1e5ba08b2b038ba7968c3a57238e9e50c53 (diff) | |
Add some notes with thoughts on what a parallel bulk build system should do.
Diffstat (limited to 'mk/bulk/parallel.txt')
-rw-r--r-- | mk/bulk/parallel.txt | 210 |
1 files changed, 210 insertions, 0 deletions
diff --git a/mk/bulk/parallel.txt b/mk/bulk/parallel.txt
new file mode 100644
index 00000000000..9b4bf96f4e2
--- /dev/null
+++ b/mk/bulk/parallel.txt
@@ -0,0 +1,210 @@
+# $Id: parallel.txt,v 1.1 2003/03/16 13:45:12 dmcmahill Exp $
+#
+
+These are my thoughts on how one would want a parallel bulk build to
+work.
+
+
+====================================================================
+Single Machine Build Process
+====================================================================
+
+The current (as of 2003-03-16) bulk build system works in the
+following manner:
+
+1) All installed packages are removed.
+
+2) Packages listed in the BULK_PREREQ variable are installed. This
+   must be done before step 3 as some packages (like xpkgwedge) can
+   affect the dependencies of other packages when installed.
+
+3) Each package directory is visited and its explicitly listed
+   dependencies are extracted and put in a 'dependstree' file. The
+   mk/bulk/tflat script is used to generate flattened dependencies
+   for all packages from this dependstree file in both the up and
+   down directions. The result is a file 'dependsfile' which has one
+   line per package that lists all build dependencies. Additionally,
+   a 'supportsfile' is created which has one line for each package
+   and lists all packages which depend upon the listed package.
+   Finally, tsort(1) is applied to the 'dependstree' file to
+   determine the correct build order for the bulk build. The build
+   order is stored in a 'buildorder' file. This is all achieved via
+   the 'bulk-cache' top level target. By extracting dependencies in
+   this fashion, we avoid highly redundant recursive make calls. For
+   example, we no longer need to use a recursive make to find the
+   dependencies for libtool literally thousands and thousands of
+   times throughout the build.
+
+4) During the build, the 'buildorder' file is consulted to figure out
+   which package should be built next.
+   Then to build the package, the following steps are taken:
+
+   a) Check for the existence of a '.broken' file in the package
+      directory. If this file exists, then the package is already
+      broken for some reason so move on to the next package.
+
+   b) Remove all packages which are not needed to build the current
+      package. This dependency list is obtained from the 'dependsfile'
+      created in step 3 and the BULK_PREREQ variable.
+
+   c) Install via pkg_add all packages which are needed to build the
+      current package. We are able to do this because we have been
+      building our packages in a bottom up order so all dependencies
+      should have been built.
+
+   d) Build and package the package.
+
+   e) If the package build fails, then we copy over the build log to
+      a .broken file and in addition, we consult the 'supportsfile' and
+      mark all packages which depend upon this one as broken by adding a
+      line to their .broken files (creating them if needed). By going
+      ahead and marking these packages as broken, we avoid wasting time
+      on them later.
+
+   f) Append the package directory name to the top level pkgsrc
+      '.make' file to indicate that we have processed this package.
+
+5) Run the mk/bulk/post-build script to collect the summary and
+   generate HTML pages and the email we've all seen.
+
+====================================================================
+Single Machine Build Comments
+====================================================================
+
+There are several features of this approach that are worth mentioning
+explicitly.
+
+1) Packages are built in the correct order. We don't want to rebuild
+   the gnome meta-pkg and then rebuild gnome-libs for example.
+
+2) Restarting the build is a cheap operation. Remember that this
+   build can take weeks or more. In fact the 1.6 build took nearly 6
+   weeks on a sparc 20!
+   If for some reason the build needs to be interrupted, it can be
+   easily restarted because step 4f keeps track of what has been built
+   in a file. The lines in the build script which control this are:
+
+     for pkgdir in `cat $ORDERFILE` ; do
+       if ! grep -q "^${pkgdir}\$" $BUILDLOG ; then
+         (cd $pkgdir && \
+           nice -n 20 ${BMAKE} USE_BULK_CACHE=yes bulk-package)
+       fi
+     done
+
+   In addition to storing the progress to disk, the bulk cache files
+   (the 'dependstreefile', 'dependsfile', 'supportsfile', and
+   'orderfile') are stored on disk so they do not need to be
+   recreated if a build is stopped and then restarted.
+
+3) By leaving packages installed and only deleting the ones which are
+   not needed before each build, we reduce the amount of installing
+   and deinstalling needed during the build. For example, it is
+   quite common to build several packages in a row which all need GNU
+   make or perl.
+
+4) Using the 'supportsfile' to mark all packages which depend upon a
+   package which has just failed to build can greatly reduce the time
+   wasted on trying to build packages with known broken dependencies.
+
+====================================================================
+Parallel Build Thoughts
+====================================================================
+
+To exploit multiple machines in an attempt to reduce the build time,
+many of the same ideas used in the single machine build can still be
+used. My view of how a parallel build should work is detailed here.
+
+master == master machine. This machine is in charge of directing
+          the build and may or may not actively participate in it.
+          In addition, this machine might not be of the same
+          architecture or operating system as the slaves (unless it
+          is to be used as a slave as well).
+
+slave#x == slave machine #x. All slave machines are of the same
+           MACHINE_ARCH and have the same operating system and access
+           the same pkgsrc tree via NFS and access the same binary
+           packages directory.
+
+           If the master machine is also to be used as a build
+           machine, then it is also considered a slave.
+
+Prior to starting the build, the master directs one of the slaves to
+extract the dependency information per steps 1-3 in the single machine
+case.
+
+The actual build should progress as follows:
+
+1) For each slave which needs a job, the master assigns a package to
+   build based on the rule that only packages that have had all their
+   dependencies built will be sent to slaves for compilation.
+
+2) When a slave finishes, the master either notes that the binary
+   package is now available for use as a depends _or_ notes failure
+   and marks all packages which depend upon it as broken as in step
+   4e of the single machine build.
+
+
+Each slave builds a package in the same way as it would in a single
+machine build (steps 4a-d).
+
+====================================================================
+Important Parallel Build Considerations
+====================================================================
+
+
+1) Security. Packages are installed as root prior to packaging.
+
+2) All state kept by the master should be stored to disk to
+   facilitate restarting a build. Remember this could take weeks so
+   we don't want to have to start over.
+
+3) The master needs to be able to monitor all slaves for signs of
+   life. I.e., if a slave machine is simply shut off, the master
+   should detect that it's no longer there and re-assign that slave's
+   current job.
+
+3a) The master must be able to distinguish between a slave failing to
+    compile a package due to the package failing vs a
+    network/power/disk/etc. failure. The former causes the package to
+    be marked as broken, the latter causes the slave to be marked as
+    broken.
+
+4) Security.
+
+5) Ability to add and remove slaves from the cluster during a build.
+   Again, a build may take a long time so we want to add/remove
+   slaves while the build is in progress.
+
+====================================================================
+Additional Thoughts
+====================================================================
+
+This is mostly related to using slaves which are not on a local
+network.
+
+- maybe a hook could be put in place which rsyncs the binary package
+  tree between the binary package repository machine and the slave
+  machine before and after each package is built?
+
+- security
+
+- Support for Kerberos?
+
+====================================================================
+Implementation Thoughts
+====================================================================
+
+- Can this all be written around using ssh to send out tasks? How do
+  we monitor slaves for signs of life? How do we indicate 'build
+  failed/build succeeded/slave failed' conditions?
+
+- Maybe we could have a file listing slaves and the master consults
+  this each time it needs a slave. That would make adding/removing
+  slaves easy. There would need to be another file to keep track of
+  which slaves are busy (and with what).
+
+- Do we want to use something like PVM instead? There is a
+  p5-Parallel-Pvm package and perl nicely deals with parsing some of
+  these files and sorting dependencies although I hate to add any
+  extra dependencies to the build system.
+
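The master's bookkeeping described under "Parallel Build Thoughts" (hand
out only packages whose dependencies are all built, mark dependents broken
on a package failure, and re-queue the job when the slave itself fails)
could be sketched roughly as below. This is only an illustration and is not
part of the commit: the Master class and its method names are hypothetical,
Python is used purely for brevity, and the inputs are assumed to be the
already-parsed contents of the 'dependsfile' and 'buildorder' files.

```python
class Master:
    """Hypothetical in-memory build state kept by the master (sketch only)."""

    def __init__(self, depends, order):
        # depends: package -> set of build dependencies (assumed to come
        #          from the 'dependsfile'); order: list in 'buildorder' order.
        self.order = list(order)
        self.depends = {pkg: set(depends.get(pkg, ())) for pkg in self.order}
        # Inverted map, playing the role of the 'supportsfile'.
        self.supports = {}
        for pkg, deps in self.depends.items():
            for dep in deps:
                self.supports.setdefault(dep, set()).add(pkg)
        self.built = set()
        self.broken = set()
        self.running = {}  # slave name -> package it is currently building

    def next_job(self, slave):
        """Assign the first package whose dependencies are all built."""
        busy = set(self.running.values())
        for pkg in self.order:
            if pkg in self.built or pkg in self.broken or pkg in busy:
                continue
            if self.depends[pkg] <= self.built:
                self.running[slave] = pkg
                return pkg
        return None  # nothing is ready right now

    def report(self, slave, ok):
        """Record a finished build; on failure mark all dependents broken."""
        pkg = self.running.pop(slave)
        if ok:
            self.built.add(pkg)
            return
        # Walk the inverted map so indirect dependents are covered even
        # if the dependency data was not flattened.
        stack = [pkg]
        while stack:
            cur = stack.pop()
            if cur not in self.broken:
                self.broken.add(cur)
                stack.extend(self.supports.get(cur, ()))

    def slave_lost(self, slave):
        """The slave died: its package is NOT broken, just unassigned."""
        return self.running.pop(slave, None)
```

A driver would call next_job() whenever a slave is idle, report() when a
slave returns a build result, and slave_lost() when a slave stops
responding. Writing the built and broken sets to disk after each call would
cover the restart requirement, which this sketch leaves out.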