From a7e9d3f37d5e9fba4b9acaa43e7c12b6d9a669ae Mon Sep 17 00:00:00 2001 From: Mike Hommey Date: Thu, 8 Jun 2006 10:59:26 +0200 Subject: Load /tmp/libxml2-2.6.26 into libxml2/branches/upstream/current. --- doc/encoding.html | 312 +++++++++++++++++++++++++++--------------------------- 1 file changed, 156 insertions(+), 156 deletions(-) (limited to 'doc/encoding.html') diff --git a/doc/encoding.html b/doc/encoding.html index 1f4558d..8db787e 100644 --- a/doc/encoding.html +++ b/doc/encoding.html @@ -7,44 +7,44 @@ H1 {font-family: Verdana,Arial,Helvetica} H2 {font-family: Verdana,Arial,Helvetica} H3 {font-family: Verdana,Arial,Helvetica} A:link, A:visited, A:active { text-decoration: underline } -Encodings support
Action against software patentsGnome2 LogoW3C LogoRed Hat Logo
Made with Libxml2 Logo

The XML C parser and toolkit of Gnome

Encodings support

Main Menu
Related links

If you are not really familiar with Internationalization (usual shortcut -is I18N) , Unicode, characters and glyphs, I suggest you read a presentation -by Tim Bray on Unicode and why you should care about it.

If you don't understand why it does not make sense to have a string -without knowing what encoding it uses, then as Joel Spolsky said please do not -write another line of code until you finish reading that article.. It is -a prerequisite to understand this page, and avoid a lot of problems with -libxml2, XML or text processing in general.

Table of Content:

  1. What does internationalization support - mean ?
  2. -
  3. The internal encoding, how and - why
  4. +Encodings support
    Action against software patentsGnome2 LogoW3C LogoRed Hat Logo
    Made with Libxml2 Logo

    The XML C parser and toolkit of Gnome

    Encodings support

    Main Menu
    Related links

    If you are not really familiar with Internationalization (usual
shortcut is I18N), Unicode, characters and glyphs, I suggest you read a
presentation by Tim Bray on Unicode and why you should care about it.

    If you don't understand why it does not make sense to have a string
without knowing what encoding it uses, then as Joel Spolsky said, please do
not write another line of code until you finish reading that article. It is
a prerequisite to understand this page, and to avoid a lot of problems with
libxml2, XML or text processing in general.

    Table of Contents:

    1. What does internationalization support mean ?
    2. +
    3. The internal encoding, how and why
    4. How is it implemented ?
    5. Default supported encodings
    6. -
    7. How to extend the existing - support
    8. -

    What does internationalization support mean ?

    XML was designed from the start to allow the support of any character set -by using Unicode. Any conformant XML parser has to support the UTF-8 and -UTF-16 default encodings which can both express the full unicode ranges. UTF8 -is a variable length encoding whose greatest points are to reuse the same -encoding for ASCII and to save space for Western encodings, but it is a bit -more complex to handle in practice. UTF-16 use 2 bytes per character (and -sometimes combines two pairs), it makes implementation easier, but looks a -bit overkill for Western languages encoding. Moreover the XML specification -allows the document to be encoded in other encodings at the condition that -they are clearly labeled as such. For example the following is a wellformed -XML document encoded in ISO-8859-1 and using accentuated letters that we -French like for both markup and content:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    +  
  5. How to extend the existing support
  6. +

    What does internationalization support mean ?

    XML was designed from the start to allow the support of any character set
by using Unicode. Any conformant XML parser has to support the UTF-8 and
UTF-16 default encodings, which can both express the full Unicode range. UTF-8
is a variable length encoding whose greatest points are to reuse the same
encoding for ASCII and to save space for Western encodings, but it is a bit
more complex to handle in practice. UTF-16 uses 2 bytes per character (and
sometimes combines two pairs), which makes implementation easier, but looks a
bit overkill for Western languages encoding. Moreover the XML specification
allows the document to be encoded in other encodings on the condition that
they are clearly labeled as such. For example the following is a well-formed
XML document encoded in ISO-8859-1 and using accentuated letters that we
French like for both markup and content:

    <?xml version="1.0" encoding="ISO-8859-1"?>
     <très>là</très>

    Having internationalization support in libxml2 means the following:

    • the document is properly parsed
    • information about its encoding is saved
    • it can be modified
    • it can be saved in its original encoding
    • -
    • it can also be saved in another encoding supported by libxml2 (for - example straight UTF8 or even an ASCII form)
    • -

    Another very important point is that the whole libxml2 API, with the -exception of a few routines to read with a specific encoding or save to a -specific encoding, is completely agnostic about the original encoding of the -document.

    It should be noted too that the HTML parser embedded in libxml2 now obey -the same rules too, the following document will be (as of 2.2.2) handled in -an internationalized fashion by libxml2 too:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
    +  
  7. it can also be saved in another encoding supported by libxml2 (for
     example straight UTF8 or even an ASCII form)
  8. +

    Another very important point is that the whole libxml2 API, with the
exception of a few routines to read with a specific encoding or save to a
specific encoding, is completely agnostic about the original encoding of the
document.
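
    For example, here is a rough C sketch (error handling kept minimal)
parsing the small ISO-8859-1 document shown above and reading its content
back; whatever the input encoding was, the strings handed back by the API are
UTF-8:

  #include <stdio.h>
  #include <string.h>
  #include <libxml/parser.h>
  #include <libxml/tree.h>
  #include <libxml/xmlmemory.h>

  int main(void) {
      /* The ISO-8859-1 document from the example above, as raw bytes */
      const char *buf =
          "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
          "<tr\xE8s>l\xE0</tr\xE8s>\n";

      xmlDocPtr doc = xmlReadMemory(buf, (int) strlen(buf), "isolat1.xml",
                                    NULL, 0);
      if (doc == NULL) {
          fprintf(stderr, "parsing failed\n");
          return 1;
      }

      /* The original encoding is remembered on the document node,
         but the tree content itself is stored in UTF-8 */
      xmlNodePtr root = xmlDocGetRootElement(doc);
      xmlChar *content = xmlNodeGetContent(root);
      printf("encoding: %s\n", (const char *) doc->encoding);
      printf("root: %s, content: %s\n",
             (const char *) root->name, (const char *) content);

      xmlFree(content);
      xmlFreeDoc(doc);
      xmlCleanupParser();
      return 0;
  }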

    It should be noted that the HTML parser embedded in libxml2 now obeys the
same rules; the following document will be (as of 2.2.2) handled in an
internationalized fashion by libxml2 too:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
                           "http://www.w3.org/TR/REC-html40/loose.dtd">
     <html lang="fr">
     <head>
    @@ -52,59 +52,60 @@ an internationalized fashion by libxml2 too:

    <!DOCTYPE HTML PUBLIC "-
     </head>
     <body>
     <p>W3C crée des standards pour le Web.</body>
    -</html>

    The internal encoding, how and why

    One of the core decisions was to force all documents to be converted to a -default internal encoding, and that encoding to be UTF-8, here are the -rationales for those choices:

    • keeping the native encoding in the internal form would force the libxml - users (or the code associated) to be fully aware of the encoding of the - original document, for examples when adding a text node to a document, - the content would have to be provided in the document encoding, i.e. the - client code would have to check it before hand, make sure it's conformant - to the encoding, etc ... Very hard in practice, though in some specific - cases this may make sense.
    • -
    • the second decision was which encoding. From the XML spec only UTF8 and - UTF16 really makes sense as being the two only encodings for which there - is mandatory support. UCS-4 (32 bits fixed size encoding) could be - considered an intelligent choice too since it's a direct Unicode mapping - support. I selected UTF-8 on the basis of efficiency and compatibility - with surrounding software: -
      • UTF-8 while a bit more complex to convert from/to (i.e. slightly - more costly to import and export CPU wise) is also far more compact - than UTF-16 (and UCS-4) for a majority of the documents I see it used - for right now (RPM RDF catalogs, advogato data, various configuration - file formats, etc.) and the key point for today's computer - architecture is efficient uses of caches. If one nearly double the - memory requirement to store the same amount of data, this will trash - caches (main memory/external caches/internal caches) and my take is - that this harms the system far more than the CPU requirements needed - for the conversion to UTF-8
      • -
      • Most of libxml2 version 1 users were using it with straight ASCII - most of the time, doing the conversion with an internal encoding - requiring all their code to be rewritten was a serious show-stopper - for using UTF-16 or UCS-4.
      • -
      • UTF-8 is being used as the de-facto internal encoding standard for - related code like the pango - upcoming Gnome text widget, and a lot of Unix code (yet another place - where Unix programmer base takes a different approach from Microsoft - - they are using UTF-16)
      • +</html>

    The internal encoding, how and why

    One of the core decisions was to force all documents to be converted to a
default internal encoding, and that encoding to be UTF-8; here are the
rationales for those choices:

    • keeping the native encoding in the internal form would force the libxml
      users (or the code associated) to be fully aware of the encoding of the
      original document; for example when adding a text node to a document,
      the content would have to be provided in the document encoding, i.e. the
      client code would have to check it beforehand, make sure it's conformant
      to the encoding, etc. Very hard in practice, though in some specific
      cases this may make sense.
    • +
    • the second decision was which encoding. From the XML spec only UTF-8 and
      UTF-16 really make sense, being the only two encodings for which there
      is mandatory support. UCS-4 (32 bits fixed size encoding) could be
      considered an intelligent choice too since it's a direct Unicode mapping
      support. I selected UTF-8 on the basis of efficiency and compatibility
      with surrounding software:
      • UTF-8, while a bit more complex to convert from/to (i.e. slightly
        more costly to import and export CPU wise), is also far more compact
        than UTF-16 (and UCS-4) for a majority of the documents I see it used
        for right now (RPM RDF catalogs, advogato data, various configuration
        file formats, etc.), and the key point for today's computer
        architecture is efficient use of caches. If one nearly doubles the
        memory requirement to store the same amount of data, this will trash
        caches (main memory/external caches/internal caches) and my take is
        that this harms the system far more than the CPU requirements needed
        for the conversion to UTF-8
      • +
      • Most of libxml2 version 1 users were using it with straight ASCII
        most of the time; doing the conversion with an internal encoding
        requiring all their code to be rewritten was a serious show-stopper
        for using UTF-16 or UCS-4.
      • +
      • UTF-8 is being used as the de-facto internal encoding standard for
        related code like pango, the upcoming Gnome text widget, and a lot of
        Unix code (yet another place where the Unix programmer base takes a
        different approach from Microsoft - they are using UTF-16)
    • -

    What does this mean in practice for the libxml2 user:

    • xmlChar, the libxml2 data type is a byte, those bytes must be assembled - as UTF-8 valid strings. The proper way to terminate an xmlChar * string - is simply to append 0 byte, as usual.
    • -
    • One just need to make sure that when using chars outside the ASCII set, - the values has been properly converted to UTF-8
    • -

    How is it implemented ?

    Let's describe how all this works within libxml, basically the I18N -(internationalization) support get triggered only during I/O operation, i.e. -when reading a document or saving one. Let's look first at the reading -sequence:

    1. when a document is processed, we usually don't know the encoding, a - simple heuristic allows to detect UTF-16 and UCS-4 from encodings where - the ASCII range (0-0x7F) maps with ASCII
    2. -
    3. the xml declaration if available is parsed, including the encoding - declaration. At that point, if the autodetected encoding is different - from the one declared a call to xmlSwitchEncoding() is issued.
    4. -
    5. If there is no encoding declaration, then the input has to be in either - UTF-8 or UTF-16, if it is not then at some point when processing the - input, the converter/checker of UTF-8 form will raise an encoding error. - You may end-up with a garbled document, or no document at all ! Example: +

      What does this mean in practice for the libxml2 user:

    • xmlChar, the libxml2 data type, is a byte; those bytes must be assembled
      as valid UTF-8 strings. The proper way to terminate an xmlChar * string
      is simply to append a 0 byte, as usual.
      • +
    • One just needs to make sure that when using chars outside the ASCII set,
      the values have been properly converted to UTF-8
      • +
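
    A minimal sketch of those two rules when building a tree by hand (the
element names and the "café" string below are just illustrative):

  #include <stdio.h>
  #include <libxml/tree.h>

  int main(void) {
      xmlDocPtr doc = xmlNewDoc(BAD_CAST "1.0");

      /* xmlChar * strings are plain 0-terminated byte arrays holding UTF-8;
         "café" is written with its UTF-8 bytes, 0xC3 0xA9 for the 'é' */
      xmlNodePtr root = xmlNewDocNode(doc, NULL, BAD_CAST "menu", NULL);
      xmlDocSetRootElement(doc, root);
      xmlNewTextChild(root, NULL, BAD_CAST "drink", BAD_CAST "caf\xC3\xA9");

      xmlDocDump(stdout, doc);   /* serialized in UTF-8, the internal form */
      xmlFreeDoc(doc);
      return 0;
  }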

      How is it implemented ?

    Let's describe how all this works within libxml. Basically the I18N
(internationalization) support gets triggered only during I/O operations,
i.e. when reading a document or saving one. Let's look first at the reading
sequence:

    1. when a document is processed, we usually don't know the encoding; a
       simple heuristic allows detecting UTF-16 and UCS-4 from encodings where
       the ASCII range (0-0x7F) maps with ASCII
      2. +
    3. the xml declaration, if available, is parsed, including the encoding
       declaration. At that point, if the autodetected encoding is different
       from the one declared, a call to xmlSwitchEncoding() is issued.
      4. +
    5. If there is no encoding declaration, then the input has to be in either
       UTF-8 or UTF-16; if it is not, then at some point when processing the
       input, the converter/checker of UTF-8 form will raise an encoding error.
       You may end up with a garbled document, or no document at all ! Example:
        ~/XML -> ./xmllint err.xml 
         err.xml:1: error: Input is not proper UTF-8, indicate encoding !
         <très>là</très>
        @@ -113,94 +114,93 @@ err.xml:1: error: Bytes: 0xE8 0x73 0x3E 0x6C
         <très>là</très>
            ^
      6. -
      7. xmlSwitchEncoding() does an encoding name lookup, canonicalize it, and - then search the default registered encoding converters for that encoding. - If it's not within the default set and iconv() support has been compiled - it, it will ask iconv for such an encoder. If this fails then the parser - will report an error and stops processing: +
    8. xmlSwitchEncoding() does an encoding name lookup, canonicalizes it, and
       then searches the default registered encoding converters for that
       encoding. If it's not within the default set and iconv() support has
       been compiled in, it will ask iconv for such an encoder. If this fails
       then the parser will report an error and stop processing:
        ~/XML -> ./xmllint err2.xml 
         err2.xml:1: error: Unsupported encoding UnsupportedEnc
         <?xml version="1.0" encoding="UnsupportedEnc"?>
                                                      ^
      9. -
      10. From that point the encoder processes progressively the input (it is - plugged as a front-end to the I/O module) for that entity. It captures - and converts on-the-fly the document to be parsed to UTF-8. The parser - itself just does UTF-8 checking of this input and process it - transparently. The only difference is that the encoding information has - been added to the parsing context (more precisely to the input - corresponding to this entity).
      11. -
      12. The result (when using DOM) is an internal form completely in UTF-8 - with just an encoding information on the document node.
      13. -

      Ok then what happens when saving the document (assuming you -collected/built an xmlDoc DOM like structure) ? It depends on the function -called, xmlSaveFile() will just try to save in the original encoding, while -xmlSaveFileTo() and xmlSaveFileEnc() can optionally save to a given -encoding:

      1. if no encoding is given, libxml2 will look for an encoding value - associated to the document and if it exists will try to save to that - encoding, +
    2. From that point the encoder progressively processes the input (it is
       plugged as a front-end to the I/O module) for that entity. It captures
       and converts on-the-fly the document to be parsed to UTF-8. The parser
       itself just does UTF-8 checking of this input and processes it
       transparently. The only difference is that the encoding information has
       been added to the parsing context (more precisely to the input
       corresponding to this entity).
      3. +
    4. The result (when using DOM) is an internal form completely in UTF-8
       with just the encoding information on the document node.
      5. +
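
    This whole sequence is transparent when using the xmlRead calls; one case
where the caller may have to step in is when the document carries no encoding
declaration and is not in UTF-8, in which case the encoding can be passed
explicitly. A rough sketch, reusing the err.xml example above:

  #include <stdio.h>
  #include <libxml/parser.h>

  int main(void) {
      /* err.xml has no encoding declaration and is not UTF-8, so a plain
         parse fails; passing the encoding by hand works around that. */
      xmlDocPtr doc = xmlReadFile("err.xml", NULL, 0);
      if (doc == NULL) {
          fprintf(stderr, "parse failed, retrying as ISO-8859-1\n");
          doc = xmlReadFile("err.xml", "ISO-8859-1", 0);
      }
      if (doc != NULL) {
          printf("parsed, stored encoding: %s\n",
                 doc->encoding ? (const char *) doc->encoding : "none");
          xmlFreeDoc(doc);
      }
      xmlCleanupParser();
      return 0;
  }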

    Ok then what happens when saving the document (assuming you
collected/built an xmlDoc DOM like structure) ? It depends on the function
called: xmlSaveFile() will just try to save in the original encoding, while
xmlSaveFileTo() and xmlSaveFileEnc() can optionally save to a given
encoding:

    1. if no encoding is given, libxml2 will look for an encoding value
       associated to the document and if it exists will try to save to that
       encoding,

        otherwise everything is written in the internal form, i.e. UTF-8

      2. -
      3. so if an encoding was specified, either at the API level or on the - document, libxml2 will again canonicalize the encoding name, lookup for a - converter in the registered set or through iconv. If not found the - function will return an error code
      4. -
      5. the converter is placed before the I/O buffer layer, as another kind of - buffer, then libxml2 will simply push the UTF-8 serialization to through - that buffer, which will then progressively be converted and pushed onto - the I/O layer.
      6. -
      7. It is possible that the converter code fails on some input, for example - trying to push an UTF-8 encoded Chinese character through the UTF-8 to - ISO-8859-1 converter won't work. Since the encoders are progressive they - will just report the error and the number of bytes converted, at that - point libxml2 will decode the offending character, remove it from the - buffer and replace it with the associated charRef encoding &#123; and - resume the conversion. This guarantees that any document will be saved - without losses (except for markup names where this is not legal, this is - a problem in the current version, in practice avoid using non-ascii - characters for tag or attribute names). A special "ascii" encoding name - is used to save documents to a pure ascii form can be used when - portability is really crucial
      8. +
    9. so if an encoding was specified, either at the API level or on the
       document, libxml2 will again canonicalize the encoding name and look
       for a converter in the registered set or through iconv. If not found
       the function will return an error code
      10. +
    11. the converter is placed before the I/O buffer layer, as another kind
        of buffer, then libxml2 will simply push the UTF-8 serialization
        through that buffer, which will then progressively be converted and
        pushed onto the I/O layer.
      12. +
    13. It is possible that the converter code fails on some input, for
        example trying to push a UTF-8 encoded Chinese character through the
        UTF-8 to ISO-8859-1 converter won't work. Since the encoders are
        progressive they will just report the error and the number of bytes
        converted; at that point libxml2 will decode the offending character,
        remove it from the buffer, replace it with the associated charRef
        encoding &#123; and resume the conversion. This guarantees that any
        document will be saved without losses (except for markup names where
        this is not legal; this is a problem in the current version, so in
        practice avoid using non-ascii characters for tag or attribute
        names). A special "ascii" encoding name can be used to save documents
        to a pure ascii form when portability is really crucial

      Here are a few examples based on the same test document:

      ~/XML -> ./xmllint isolat1 
       <?xml version="1.0" encoding="ISO-8859-1"?>
       <très>là</très>
       ~/XML -> ./xmllint --encode UTF-8 isolat1 
       <?xml version="1.0" encoding="UTF-8"?>
       <très>là  </très>
      -~/XML -> 
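
    The equivalent from C, using the saving functions discussed above (the
output file names are just placeholders):

  #include <libxml/parser.h>
  #include <libxml/tree.h>

  int main(void) {
      xmlDocPtr doc = xmlReadFile("isolat1", NULL, 0); /* the test document */
      if (doc == NULL)
          return 1;

      /* Saves back in the original encoding recorded on the document
         (ISO-8859-1 here), like the first xmllint run above. */
      xmlSaveFile("copy-isolat1.xml", doc);

      /* Forces a target encoding, like xmllint --encode UTF-8. */
      xmlSaveFileEnc("copy-utf8.xml", doc, "UTF-8");

      xmlFreeDoc(doc);
      xmlCleanupParser();
      return 0;
  }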

      The same processing is applied (and reuse most of the code) for HTML I18N -processing. Looking up and modifying the content encoding is a bit more -difficult since it is located in a <meta> tag under the <head>, -so a couple of functions htmlGetMetaEncoding() and htmlSetMetaEncoding() have -been provided. The parser also attempts to switch encoding on the fly when -detecting such a tag on input. Except for that the processing is the same -(and again reuses the same code).

      Default supported encodings

      libxml2 has a set of default converters for the following encodings -(located in encoding.c):

      1. UTF-8 is supported by default (null handlers)
      2. +~/XML ->

    The same processing is applied (and reuses most of the code) for HTML I18N
processing. Looking up and modifying the content encoding is a bit more
difficult since it is located in a <meta> tag under the <head>, so a couple
of functions, htmlGetMetaEncoding() and htmlSetMetaEncoding(), have been
provided. The parser also attempts to switch encoding on the fly when
detecting such a tag on input. Except for that the processing is the same
(and again reuses the same code).
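
    A short sketch of those two helpers (the file names are only
placeholders):

  #include <stdio.h>
  #include <libxml/HTMLparser.h>
  #include <libxml/HTMLtree.h>

  int main(void) {
      htmlDocPtr doc = htmlReadFile("page.html", NULL, 0);
      if (doc == NULL)
          return 1;

      /* Encoding advertised by the <meta> tag, if any */
      const xmlChar *enc = htmlGetMetaEncoding(doc);
      printf("meta encoding: %s\n", enc ? (const char *) enc : "none");

      /* Update the <meta> tag and save accordingly */
      htmlSetMetaEncoding(doc, BAD_CAST "UTF-8");
      htmlSaveFileEnc("page-utf8.html", doc, "UTF-8");

      xmlFreeDoc(doc);
      return 0;
  }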

    Default supported encodings

    libxml2 has a set of default converters for the following encodings
(located in encoding.c):

    1. UTF-8 is supported by default (null handlers)
    2. UTF-16, both little and big endian
    3. ISO-Latin-1 (ISO-8859-1) covering most western languages
    4. ASCII, useful mostly for saving
    5. -
    6. HTML, a specific handler for the conversion of UTF-8 to ASCII with HTML - predefined entities like &copy; for the Copyright sign.
    7. -

    More over when compiled on an Unix platform with iconv support the full -set of encodings supported by iconv can be instantly be used by libxml. On a -linux machine with glibc-2.1 the list of supported encodings and aliases fill -3 full pages, and include UCS-4, the full set of ISO-Latin encodings, and the -various Japanese ones.

    To convert from the UTF-8 values returned from the API to another encoding -then it is possible to use the function provided from the encoding module like UTF8Toisolat1, or use the -POSIX iconv() -API directly.

    Encoding aliases

    From 2.2.3, libxml2 has support to register encoding names aliases. The -goal is to be able to parse document whose encoding is supported but where -the name differs (for example from the default set of names accepted by -iconv). The following functions allow to register and handle new aliases for -existing encodings. Once registered libxml2 will automatically lookup the -aliases when handling a document:

    • int xmlAddEncodingAlias(const char *name, const char *alias);
    • +
    • HTML, a specific handler for the conversion of UTF-8 to ASCII with HTML
      predefined entities like &copy; for the Copyright sign.
    • +

    Moreover, when compiled on a Unix platform with iconv support, the full
set of encodings supported by iconv can instantly be used by libxml. On a
Linux machine with glibc-2.1 the list of supported encodings and aliases
fills 3 full pages, and includes UCS-4, the full set of ISO-Latin encodings,
and the various Japanese ones.
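
    One simple way to check whether a given encoding is usable by a
particular build (either through the built-in converters or through iconv)
is to ask for its handler, for instance:

  #include <stdio.h>
  #include <libxml/encoding.h>

  int main(void) {
      const char *name = "EUC-JP";   /* any encoding name to probe */
      xmlCharEncodingHandlerPtr handler = xmlFindCharEncodingHandler(name);

      if (handler != NULL) {
          printf("%s is supported (built-in or via iconv)\n", name);
          xmlCharEncCloseFunc(handler);  /* releases an iconv-based handler */
      } else {
          printf("%s is not supported by this build\n", name);
      }
      return 0;
  }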

    To convert from the UTF-8 values returned from the API to another
encoding, it is possible to use the functions provided by the encoding
module, like UTF8Toisolat1, or use the POSIX iconv() API directly.
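
    A minimal sketch of UTF8Toisolat1() usage (the buffer size here is picked
arbitrarily):

  #include <stdio.h>
  #include <string.h>
  #include <libxml/encoding.h>

  int main(void) {
      /* "là" in UTF-8: 'l' then the bytes 0xC3 0xA0 for 'à' */
      const unsigned char utf8[] = "l\xC3\xA0";
      unsigned char latin1[64];

      int inlen = (int) strlen((const char *) utf8);
      int outlen = (int) sizeof(latin1) - 1;

      int ret = UTF8Toisolat1(latin1, &outlen, utf8, &inlen);
      if (ret < 0) {
          fprintf(stderr, "conversion failed\n");
          return 1;
      }
      latin1[outlen] = 0;   /* outlen now holds the number of bytes written */
      printf("converted %d bytes in, %d bytes out\n", inlen, outlen);
      return 0;
  }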

      Encoding aliases

    From 2.2.3, libxml2 has support to register encoding name aliases. The
goal is to be able to parse documents whose encoding is supported but where
the name differs (for example from the default set of names accepted by
iconv). The following functions allow registering and handling new aliases
for existing encodings. Once registered, libxml2 will automatically look up
the aliases when handling a document:

      • int xmlAddEncodingAlias(const char *name, const char *alias);
      • int xmlDelEncodingAlias(const char *alias);
      • const char * xmlGetEncodingAlias(const char *alias);
      • void xmlCleanupEncodingAliases(void);
      • -
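
    For instance, a small sketch registering a made-up alias for an encoding
libxml2 already knows about:

  #include <stdio.h>
  #include <libxml/encoding.h>

  int main(void) {
      /* Documents declaring encoding="MY-LATIN-1" will now be handled
         with the ISO-8859-1 converter. */
      xmlAddEncodingAlias("ISO-8859-1", "MY-LATIN-1");

      printf("MY-LATIN-1 -> %s\n", xmlGetEncodingAlias("MY-LATIN-1"));

      xmlDelEncodingAlias("MY-LATIN-1");   /* remove a single alias */
      xmlCleanupEncodingAliases();         /* or drop all registered aliases */
      return 0;
  }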

      How to extend the existing support

      Well adding support for new encoding, or overriding one of the encoders -(assuming it is buggy) should not be hard, just write input and output -conversion routines to/from UTF-8, and register them using -xmlNewCharEncodingHandler(name, xxxToUTF8, UTF8Toxxx), and they will be -called automatically if the parser(s) encounter such an encoding name -(register it uppercase, this will help). The description of the encoders, -their arguments and expected return values are described in the encoding.h -header.

      Daniel Veillard

    +

    How to extend the existing support

    Well, adding support for a new encoding, or overriding one of the encoders
(assuming it is buggy) should not be hard: just write input and output
conversion routines to/from UTF-8, and register them using
xmlNewCharEncodingHandler(name, xxxToUTF8, UTF8Toxxx), and they will be
called automatically if the parser(s) encounter such an encoding name
(register it uppercase, this will help). The encoders, their arguments and
expected return values are described in the encoding.h header.
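
    As a toy sketch (the encoding name is made up, and the converters simply
reuse the ISO-8859-1 routines, which only makes sense for an 8-bit encoding
that happens to match Latin-1):

  #include <libxml/encoding.h>

  /* Hypothetical encoding "X-MYLATIN" that is byte-identical to ISO-8859-1,
     so the stock Latin-1 routines can serve as its converters. */
  int myToUTF8(unsigned char *out, int *outlen,
               const unsigned char *in, int *inlen) {
      return isolat1ToUTF8(out, outlen, in, inlen);
  }

  int UTF8Tomy(unsigned char *out, int *outlen,
               const unsigned char *in, int *inlen) {
      return UTF8Toisolat1(out, outlen, in, inlen);
  }

  int main(void) {
      /* Registered uppercase, as advised above; the parser will pick it up
         whenever a document declares encoding="X-MYLATIN". */
      xmlNewCharEncodingHandler("X-MYLATIN", myToUTF8, UTF8Tomy);
      return 0;
  }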

    Daniel Veillard
