Table of Contents:
- General overview
- The definition
- Using catalogs
- Some examples
- How to tune catalog usage
- How to debug catalog processing
- How to create and maintain catalogs
- The implementor corner quick review of the API
- Other resources

What is a catalog? Basically it's a lookup mechanism used when an
entity (a file or a remote resource) references another entity. The catalog
lookup is inserted between the moment the reference is recognized by the
software (XML parser, stylesheet processing, or even images referenced for
inclusion in a rendering) and the time when loading that resource is
actually started.

It is basically used for 3 things:
- mapping between "logical" names, the public identifiers, and a
  more concrete name usable for download (a URI). For example it can
  associate the logical name
  "-//OASIS//DTD DocBook XML V4.1.2//EN"
  of the DocBook 4.1.2 XML DTD with the actual URL where it
  can be downloaded:
  http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd
- remapping from a given URL to another one, like an
  HTTP indirection saying that
  "http://www.oasis-open.org/committes/tr.xsl"
  should really be looked up at
  "http://www.oasis-open.org/committes/entity/stylesheets/base/tr.xsl"
- providing a local cache mechanism that allows loading
  the entities associated with public identifiers or remote resources
  from local copies. This is a really important feature for any significant
  deployment of XML or SGML since it avoids the hazards and delays
  associated with fetching remote resources.
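
The first two uses can be pictured as table lookups performed before any download is attempted. The following is a minimal Python sketch of that idea, not libxml2's implementation: PUBLIC_MAP, SYSTEM_MAP and resolve() are hypothetical names introduced for illustration, and the order of the checks is a simplification of what a real resolver does.

```python
# Minimal sketch of a catalog lookup consulted before a resource is loaded.
# PUBLIC_MAP, SYSTEM_MAP and resolve() are hypothetical names for illustration.

# Use 1: map a "logical" public identifier to a concrete URI.
PUBLIC_MAP = {
    "-//OASIS//DTD DocBook XML V4.1.2//EN":
        "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd",
}

# Use 2: remap one URL to another.
SYSTEM_MAP = {
    "http://www.oasis-open.org/committes/tr.xsl":
        "http://www.oasis-open.org/committes/entity/stylesheets/base/tr.xsl",
}


def resolve(public_id=None, system_id=None):
    """Return the URI to load for a reference.

    Falls back to the system identifier as given when no entry matches.
    """
    if system_id is not None and system_id in SYSTEM_MAP:
        return SYSTEM_MAP[system_id]
    if public_id is not None and public_id in PUBLIC_MAP:
        return PUBLIC_MAP[public_id]
    # No catalog entry: the caller loads the original reference unchanged.
    return system_id
```

The third use, the local cache, falls out of the same tables: if the values on the right-hand side are file: URIs pointing at installed copies, nothing is fetched from the network.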

Libxml, as of 2.4.3, implements 2 kinds of catalogs:
- the older SGML catalogs; the official spec is SGML Open Technical
  Resolution TR9401:1997, but it is better understood by reading the SP
  Catalog page from James Clark. This is relatively old and not the
  preferred mode of operation of libxml.
- XML Catalogs, which are far more flexible, more recent, use an XML
  syntax and should scale much better. This is the default option of libxml.

In a normal environment libxml2 will by default check for the presence
of a catalog in /etc/xml/catalog, and assuming it has been
correctly populated, the processing is completely transparent to the document
user. To take a concrete example, suppose you are authoring a DocBook document
which starts with the following DOCTYPE definition:

<?xml version='1.0'?>
<!DOCTYPE book PUBLIC "-//Norman Walsh//DTD DocBk XML V3.1.4//EN"
"http://nwalsh.com/docbook/xml/3.1.4/db3xml.dtd">

When validating the document with libxml, the catalog will be
automatically consulted to look up the public identifier
"-//Norman Walsh//DTD DocBk XML V3.1.4//EN" and the system identifier
"http://nwalsh.com/docbook/xml/3.1.4/db3xml.dtd", and if
these entities have been installed on your system and the catalogs actually
point to them, libxml will fetch them from the local disk.

Note: Really don't use this DOCTYPE example, it's a really old version,
but it is fine as an example.

Libxml2 will check the catalog each time it is requested to load
an entity; this includes DTDs, external parsed entities, stylesheets, etc.
If your system is correctly configured, all the authoring and
processing phases should use only local files, even though your document stays
portable because it uses the canonical public and system IDs referencing the
remote document.

Here are a couple of fragments from XML Catalogs used in the libxml2 early
regression tests in test/catalogs:

<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC
"-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
"http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<public publicId="-//OASIS//DTD DocBook XML V4.1.2//EN"
uri="http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd"/>
...

This is the beginning of a catalog for DocBook 4.1.2. XML Catalogs are
written in XML, and there is a specific namespace for catalog
elements: "urn:oasis:names:tc:entity:xmlns:xml:catalog". The first entry
in this catalog is a public mapping; it associates a Public
Identifier with a URI.

...
<rewriteSystem systemIdStartString="http://www.oasis-open.org/docbook/"
rewritePrefix="file:///usr/share/xml/docbook/"/>
...

A rewriteSystem is a very powerful instruction: it says that any
URI starting with a given prefix should be looked up at another URI constructed by
replacing the prefix with a new one. In effect this acts like a cache system
for a full area of the Web. In practice it is extremely useful with a file
prefix if you have installed a copy of those resources on your local system.

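
The prefix replacement performed by a rewriteSystem entry can be sketched in a few lines of Python. This is an illustration only, not libxml2's code: REWRITE_RULES and rewrite_system() are hypothetical names, each rule mirrors one systemIdStartString / rewritePrefix attribute pair, and the longest matching start string is preferred when several rules apply, as the XML Catalogs specification requires.

```python
# Sketch of rewriteSystem-style prefix rewriting (not libxml2's actual code).
# Each rule is a (systemIdStartString, rewritePrefix) pair.
REWRITE_RULES = [
    ("http://www.oasis-open.org/docbook/", "file:///usr/share/xml/docbook/"),
]


def rewrite_system(system_id, rules=REWRITE_RULES):
    """Replace the longest matching start string with its rewrite prefix.

    Returns None when no rule applies, so the caller can try other entries.
    """
    matches = [(start, prefix) for start, prefix in rules
               if system_id.startswith(start)]
    if not matches:
        return None
    # When several rules match, prefer the longest start string.
    start, prefix = max(matches, key=lambda rule: len(rule[0]))
    return prefix + system_id[len(start):]
```

With the rule above, the DocBook 4.1.2 DTD URL would be rewritten to a path under file:///usr/share/xml/docbook/, while an unrelated URL yields None and is left to other catalog entries.
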
<delegatePublic publicIdStartString="-//OASIS//DTD XML Catalog //"
catalog="file:///usr/share/xml/docbook.xml"/>
<delegatePublic publicIdStartString="-//OASIS//ENTITIES DocBook XML"
catalog="file:///usr/share/xml/docbook.xml"/>
<delegatePublic publicIdStartString="-//OASIS//DTD DocBook XML"
catalog="file:///usr/share/xml/docbook.xml"/>
<delegateSystem systemIdStartString="http://www.oasis-open.org/docbook/"
catalog="file:///usr/share/xml/docbook.xml"/>
<delegateURI uriStartString="http://www.oasis-open.org/docbook/"
catalog="file:///usr/share/xml/docbook.xml"/>
...

Delegation is the core feature which allows building a tree of
catalogs, easier to maintain than a single catalog. Based on Public
Identifier, System Identifier or URI prefixes, it instructs the
catalog software to look up entries in another resource. This feature allows
building hierarchies of catalogs; the set of entries presented should be
sufficient to redirect the resolution of all DocBook references to the specific
catalog in /usr/share/xml/docbook.xml. This one in turn could
delegate all references for DocBook 4.2.1 to a specific catalog installed at
the same time as the DocBook resources on the local machine.

The user can change the default catalog behaviour by redirecting
queries to their own set of catalogs. This can be done by
setting the XML_CATALOG_FILES environment variable to a list of
catalogs; an empty one should deactivate loading the
default /etc/xml/catalog catalog.

Setting the XML_DEBUG_CATALOG environment variable
will make libxml2 output debugging information for each catalog
operation, for example:

orchis:~/XML -> xmllint --memory --noout test/ent2
warning: failed to load external entity "title.xml"
orchis:~/XML -> export XML_DEBUG_CATALOG=
orchis:~/XML -> xmllint --memory --noout test/ent2
Failed to parse catalog /etc/xml/catalog
Failed to parse catalog /etc/xml/catalog
warning: failed to load external entity "title.xml"
Catalogs cleanup
orchis:~/XML ->

The test/ent2 file references an entity; running the parser from memory makes the
base URI unavailable and the "title.xml" entity cannot be loaded. Setting
the debug environment variable makes it possible to detect that an attempt is made to
load /etc/xml/catalog, but since it's not present the resolution
fails.

But the most advanced way to debug XML catalog processing is to
use the xmlcatalog command shipped with libxml2; it allows
loading catalogs and making resolution queries to see what is going on. This
is also used for the regression tests:

orchis:~/XML -> ./xmlcatalog test/catalogs/docbook.xml \
"-//OASIS//DTD DocBook XML V4.1.2//EN"
http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd
orchis:~/XML ->

For debugging what is going on, adding one -v flag increases
the verbosity level to indicate the processing done (adding a second flag
also indicates what elements are recognized at parsing):

orchis:~/XML -> ./xmlcatalog -v test/catalogs/docbook.xml \
"-//OASIS//DTD DocBook XML V4.1.2//EN"
Parsing catalog test/catalogs/docbook.xml's content
Found public match -//OASIS//DTD DocBook XML V4.1.2//EN
http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd
Catalogs cleanup
orchis:~/XML ->

A shell interface is also available to debug and process
multiple queries (and for regression tests):

orchis:~/XML -> ./xmlcatalog -shell test/catalogs/docbook.xml \
"-//OASIS//DTD DocBook XML V4.1.2//EN"
> help
Commands available:
public PublicID: make a PUBLIC identifier lookup
system SystemID: make a SYSTEM identifier lookup
resolve PublicID SystemID: do a full resolver lookup
add 'type' 'orig' 'replace' : add an entry
del 'values' : remove values
dump: print the current catalog state
debug: increase the verbosity level
quiet: decrease the verbosity level
exit: quit the shell
> public "-//OASIS//DTD DocBook XML V4.1.2//EN"
http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd
> quit
orchis:~/XML ->

This should be sufficient for most debugging purposes; it was actually used
heavily to debug the XML Catalog implementation itself.

Basically XML Catalogs are XML files; you can either use XML tools to manage
them or use xmlcatalog for this. The basic step is to create a
catalog; the -create option provides this facility:

orchis:~/XML -> ./xmlcatalog --create tst.xml
<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
"http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"/>
orchis:~/XML ->

By default xmlcatalog does not overwrite the original catalog and
saves the result on the standard output; this can be overridden using
the -noout option. The -add command allows adding entries
to the catalog:

orchis:~/XML -> ./xmlcatalog --noout --create --add "public" \
"-//OASIS//DTD DocBook XML V4.1.2//EN" \
http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd tst.xml
orchis:~/XML -> cat tst.xml
<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
"http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<public publicId="-//OASIS//DTD DocBook XML V4.1.2//EN"
uri="http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd"/>
</catalog>
orchis:~/XML ->

The -add option will always take 3 parameters even if
some of the XML Catalog constructs (like nextCatalog) have only
a single argument; just pass a third empty string, it will be ignored.

Similarly the -del option removes matching entries
from the catalog:

orchis:~/XML -> ./xmlcatalog --del \
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" tst.xml
<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
"http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"/>
orchis:~/XML ->

The catalog is now empty. Note that the matching
of -del is exact and would have worked in a similar fashion with
the Public ID string.

This is rudimentary but should be sufficient to manage a not
too complex catalog tree of resources.

First, and like for every other module of libxml, there is
an automatically generated API page for catalog support.

The header for the catalog interfaces should be included as:

#include <libxml/catalog.h>

The API is deliberately kept very simple. First it is not
obvious that applications really need access to it since it is the default
behaviour of libxml2 (Note: it is possible to completely override the libxml2
default catalog by using xmlSetExternalEntityLoader to plug in
an application-specific resolver).

Basically libxml2 supports 2 catalog lists:
- the default one, global and shared by the whole application
- a per-document catalog; this one is built if the document
  uses the oasis-xml-catalog PIs to specify its own catalog list.
  It is associated with the parser context and destroyed when the
  parsing context is destroyed.
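
One way to picture the two lists is as two lookup tables consulted in sequence. The Python sketch below is only an illustration and does not reflect libxml2's data structures: it assumes the per-document list takes precedence over the global one, and resolve_with_lists() and the dictionary representation are hypothetical.

```python
# Illustrative model of the two catalog lists; not libxml2's data structures.

def resolve_with_lists(system_id, document_catalog, global_catalog):
    """Look up system_id in the per-document list first, then the global one.

    Both catalogs are plain dicts mapping identifiers to URIs (an assumption
    made for this sketch); document_catalog may be None when the document
    declares no catalog of its own. Returns None when neither has an entry.
    """
    for catalog in (document_catalog, global_catalog):
        if catalog and system_id in catalog:
            return catalog[system_id]
    return None
```

A document without its own catalog list simply passes None as the first table, in which case only the global list is searched.
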
The document one will be used first if it exists.

Initialization routines:
xmlInitializeCatalog(), xmlLoadCatalog() and xmlLoadCatalogs()
should be used at startup to initialize the catalog. If the catalog
should be initialized with specific values, xmlLoadCatalog()
or xmlLoadCatalogs() should be called before xmlInitializeCatalog(), which
would otherwise do a default initialization first.

The xmlCatalogAddLocal() call is used by the parser to grow the document's own
catalog list if needed.

Preferences setup:
The XML Catalog spec requires the possibility to select
default preferences between public and system
delegation; xmlCatalogSetDefaultPrefer() allows this. xmlCatalogSetDefaults()
and xmlCatalogGetDefaults() control whether XML Catalogs resolution
should be forbidden, allowed for the global catalog, for the document catalog or both;
the default is to allow both.

And of course xmlCatalogSetDebug() allows generating
debug messages (through the xmlGenericError() mechanism).

Querying routines:
xmlCatalogResolve(), xmlCatalogResolveSystem(), xmlCatalogResolvePublic()
and xmlCatalogResolveURI() are relatively explicit if you read the XML Catalog
specification; they correspond to the section 7 algorithms. They should also work
if you have loaded an SGML catalog, with simplified semantics.

xmlCatalogLocalResolve() and xmlCatalogLocalResolveURI() are the
same but operate on the document catalog list.

Cleanup and Miscellaneous:
xmlCatalogCleanup() frees up the global catalog; xmlCatalogFreeLocal() is the
per-document equivalent.

xmlCatalogAdd() and xmlCatalogRemove() are used to dynamically
modify the first catalog in the global list, and xmlCatalogDump() allows
dumping a catalog state. Those routines are primarily designed for xmlcatalog;
I'm not sure that exposing more complex interfaces (like navigation ones)
would be really useful.

xmlParseCatalogFile() is a function used to load XML Catalog files. It's
similar to xmlParseFile() except that it bypasses all catalog lookups; it's provided
because this functionality may be useful for client tools.

Threaded environments:
Since the catalog tree is built progressively, some care has been
taken to try to avoid troubles in multithreaded environments. The code is
now thread safe, assuming that the libxml2 library has been compiled
with threads support.

The XML Catalog specification is relatively recent so there
isn't much literature to point at:

If you have suggestions for corrections or additions, simply contact me:

Daniel Veillard