From a7e9d3f37d5e9fba4b9acaa43e7c12b6d9a669ae Mon Sep 17 00:00:00 2001 From: Mike Hommey Date: Thu, 8 Jun 2006 10:59:26 +0200 Subject: Load /tmp/libxml2-2.6.26 into libxml2/branches/upstream/current. --- doc/encoding.html | 312 +++++++++++++++++++++++++++--------------------------- 1 file changed, 156 insertions(+), 156 deletions(-) (limited to 'doc/encoding.html') diff --git a/doc/encoding.html b/doc/encoding.html index 1f4558d..8db787e 100644 --- a/doc/encoding.html +++ b/doc/encoding.html @@ -7,44 +7,44 @@ H1 {font-family: Verdana,Arial,Helvetica} H2 {font-family: Verdana,Arial,Helvetica} H3 {font-family: Verdana,Arial,Helvetica} A:link, A:visited, A:active { text-decoration: underline } -Encodings support
Action against software patentsGnome2 LogoW3C LogoRed Hat Logo
Made with Libxml2 Logo

The XML C parser and toolkit of Gnome

Encodings support

Main Menu
Related links

If you are not really familiar with Internationalization (usual shortcut -is I18N) , Unicode, characters and glyphs, I suggest you read a presentation -by Tim Bray on Unicode and why you should care about it.

If you don't understand why it does not make sense to have a string -without knowing what encoding it uses, then as Joel Spolsky said please do not -write another line of code until you finish reading that article.. It is -a prerequisite to understand this page, and avoid a lot of problems with -libxml2, XML or text processing in general.

Table of Content:

  1. What does internationalization support - mean ?
  2. -
  3. The internal encoding, how and - why
  4. +Encodings support
    Action against software patentsGnome2 LogoW3C LogoRed Hat Logo
    Made with Libxml2 Logo

    The XML C parser and toolkit of Gnome

    Encodings support

    Main Menu
    Related links

    If you are not really familiar with Internationalization (usual
shortcut is I18N), Unicode, characters and glyphs, I suggest you read a
presentation by Tim Bray on Unicode and why you should care about it.

    If you don't understand why it does not make sense to have a string
without knowing what encoding it uses, then as Joel Spolsky said, please do
not write another line of code until you finish reading that article. It is
a prerequisite to understand this page, and to avoid a lot of problems with
libxml2, XML or text processing in general.

    Table of Contents:

    1. What does internationalization support mean ?
    2. +
    3. The internal encoding, how and why
    4. How is it implemented ?
    5. Default supported encodings
    6. -
    7. How to extend the existing - support
    8. -

    What does internationalization support mean ?

    XML was designed from the start to allow the support of any character set -by using Unicode. Any conformant XML parser has to support the UTF-8 and -UTF-16 default encodings which can both express the full unicode ranges. UTF8 -is a variable length encoding whose greatest points are to reuse the same -encoding for ASCII and to save space for Western encodings, but it is a bit -more complex to handle in practice. UTF-16 use 2 bytes per character (and -sometimes combines two pairs), it makes implementation easier, but looks a -bit overkill for Western languages encoding. Moreover the XML specification -allows the document to be encoded in other encodings at the condition that -they are clearly labeled as such. For example the following is a wellformed -XML document encoded in ISO-8859-1 and using accentuated letters that we -French like for both markup and content:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    +  
  5. How to extend the existing support
  6. +

    What does internationalization support mean ?

    XML was designed from the start to allow the support of any character set
by using Unicode. Any conformant XML parser has to support the UTF-8 and
UTF-16 default encodings, which can both express the full Unicode range. UTF-8
is a variable length encoding whose greatest points are to reuse the same
encoding for ASCII and to save space for Western encodings, but it is a bit
more complex to handle in practice. UTF-16 uses 2 bytes per character (and
sometimes combines two pairs), which makes implementation easier, but looks a
bit overkill for Western languages encoding. Moreover the XML specification
allows the document to be encoded in other encodings on the condition that
they are clearly labeled as such. For example the following is a well-formed
XML document encoded in ISO-8859-1 and using accentuated letters that we
French like for both markup and content:

    <?xml version="1.0" encoding="ISO-8859-1"?>
     <très>là</très>

    Having internationalization support in libxml2 means the following:

    • the document is properly parsed
    • information about its encoding is saved
    • it can be modified
    • it can be saved in its original encoding
    • -
    • it can also be saved in another encoding supported by libxml2 (for - example straight UTF8 or even an ASCII form)
    • -

    Another very important point is that the whole libxml2 API, with the -exception of a few routines to read with a specific encoding or save to a -specific encoding, is completely agnostic about the original encoding of the -document.

    It should be noted too that the HTML parser embedded in libxml2 now obey -the same rules too, the following document will be (as of 2.2.2) handled in -an internationalized fashion by libxml2 too:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
    +  
  7. it can also be saved in another encoding supported by libxml2 (for
     example straight UTF8 or even an ASCII form)
  8. +

    Another very important point is that the whole libxml2 API, with the
exception of a few routines to read with a specific encoding or save to a
specific encoding, is completely agnostic about the original encoding of the
document.
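
    For example, here is a rough C sketch (error handling kept minimal)
parsing the small ISO-8859-1 document shown above and reading its content
back; whatever the input encoding was, the strings handed back by the API are
UTF-8:

  #include <stdio.h>
  #include <string.h>
  #include <libxml/parser.h>
  #include <libxml/tree.h>
  #include <libxml/xmlmemory.h>

  int main(void) {
      /* The ISO-8859-1 document from the example above, as raw bytes */
      const char *buf =
          "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
          "<tr\xE8s>l\xE0</tr\xE8s>\n";

      xmlDocPtr doc = xmlReadMemory(buf, (int) strlen(buf), "isolat1.xml",
                                    NULL, 0);
      if (doc == NULL) {
          fprintf(stderr, "parsing failed\n");
          return 1;
      }

      /* The original encoding is remembered on the document node,
         but the tree content itself is stored in UTF-8 */
      xmlNodePtr root = xmlDocGetRootElement(doc);
      xmlChar *content = xmlNodeGetContent(root);
      printf("encoding: %s\n", (const char *) doc->encoding);
      printf("root: %s, content: %s\n",
             (const char *) root->name, (const char *) content);

      xmlFree(content);
      xmlFreeDoc(doc);
      xmlCleanupParser();
      return 0;
  }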

    It should be noted that the HTML parser embedded in libxml2 now obeys the
same rules; the following document will be (as of 2.2.2) handled in an
internationalized fashion by libxml2 too:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
                           "http://www.w3.org/TR/REC-html40/loose.dtd">
     <html lang="fr">
     <head>
    @@ -52,59 +52,60 @@ an internationalized fashion by libxml2 too:

    <!DOCTYPE HTML PUBLIC "-
     </head>
     <body>
     <p>W3C crée des standards pour le Web.</body>
    -</html>

    The internal encoding, how and why

    One of the core decisions was to force all documents to be converted to a -default internal encoding, and that encoding to be UTF-8, here are the -rationales for those choices:

    • keeping the native encoding in the internal form would force the libxml - users (or the code associated) to be fully aware of the encoding of the - original document, for examples when adding a text node to a document, - the content would have to be provided in the document encoding, i.e. the - client code would have to check it before hand, make sure it's conformant - to the encoding, etc ... Very hard in practice, though in some specific - cases this may make sense.
    • -
    • the second decision was which encoding. From the XML spec only UTF8 and - UTF16 really makes sense as being the two only encodings for which there - is mandatory support. UCS-4 (32 bits fixed size encoding) could be - considered an intelligent choice too since it's a direct Unicode mapping - support. I selected UTF-8 on the basis of efficiency and compatibility - with surrounding software: -
      • UTF-8 while a bit more complex to convert from/to (i.e. slightly - more costly to import and export CPU wise) is also far more compact - than UTF-16 (and UCS-4) for a majority of the documents I see it used - for right now (RPM RDF catalogs, advogato data, various configuration - file formats, etc.) and the key point for today's computer - architecture is efficient uses of caches. If one nearly double the - memory requirement to store the same amount of data, this will trash - caches (main memory/external caches/internal caches) and my take is - that this harms the system far more than the CPU requirements needed - for the conversion to UTF-8
      • -
      • Most of libxml2 version 1 users were using it with straight ASCII - most of the time, doing the conversion with an internal encoding - requiring all their code to be rewritten was a serious show-stopper - for using UTF-16 or UCS-4.
      • -
      • UTF-8 is being used as the de-facto internal encoding standard for - related code like the pango - upcoming Gnome text widget, and a lot of Unix code (yet another place - where Unix programmer base takes a different approach from Microsoft - - they are using UTF-16)
      • +</html>

    The internal encoding, how and why

    One of the core decisions was to force all documents to be converted to a
default internal encoding, and that encoding to be UTF-8; here are the
rationales for those choices:

    • keeping the native encoding in the internal form would force the libxml
      users (or the code associated) to be fully aware of the encoding of the
      original document; for example when adding a text node to a document,
      the content would have to be provided in the document encoding, i.e. the
      client code would have to check it beforehand, make sure it's conformant
      to the encoding, etc. Very hard in practice, though in some specific
      cases this may make sense.
    • +
    • the second decision was which encoding. From the XML spec only UTF-8 and
      UTF-16 really make sense, being the only two encodings for which there
      is mandatory support. UCS-4 (32 bits fixed size encoding) could be
      considered an intelligent choice too since it's a direct Unicode mapping
      support. I selected UTF-8 on the basis of efficiency and compatibility
      with surrounding software:
      • UTF-8, while a bit more complex to convert from/to (i.e. slightly
        more costly to import and export CPU wise), is also far more compact
        than UTF-16 (and UCS-4) for a majority of the documents I see it used
        for right now (RPM RDF catalogs, advogato data, various configuration
        file formats, etc.), and the key point for today's computer
        architecture is efficient use of caches. If one nearly doubles the
        memory requirement to store the same amount of data, this will trash
        caches (main memory/external caches/internal caches) and my take is
        that this harms the system far more than the CPU requirements needed
        for the conversion to UTF-8
      • +
      • Most of libxml2 version 1 users were using it with straight ASCII
        most of the time; doing the conversion with an internal encoding
        requiring all their code to be rewritten was a serious show-stopper
        for using UTF-16 or UCS-4.
      • +
      • UTF-8 is being used as the de-facto internal encoding standard for
        related code like pango, the upcoming Gnome text widget, and a lot of
        Unix code (yet another place where the Unix programmer base takes a
        different approach from Microsoft - they are using UTF-16)
    • -

    What does this mean in practice for the libxml2 user:

    • xmlChar, the libxml2 data type is a byte, those bytes must be assembled - as UTF-8 valid strings. The proper way to terminate an xmlChar * string - is simply to append 0 byte, as usual.
    • -
    • One just need to make sure that when using chars outside the ASCII set, - the values has been properly converted to UTF-8
    • -

    How is it implemented ?

    Let's describe how all this works within libxml, basically the I18N -(internationalization) support get triggered only during I/O operation, i.e. -when reading a document or saving one. Let's look first at the reading -sequence:

    1. when a document is processed, we usually don't know the encoding, a - simple heuristic allows to detect UTF-16 and UCS-4 from encodings where - the ASCII range (0-0x7F) maps with ASCII
    2. -
    3. the xml declaration if available is parsed, including the encoding - declaration. At that point, if the autodetected encoding is different - from the one declared a call to xmlSwitchEncoding() is issued.
    4. -
    5. If there is no encoding declaration, then the input has to be in either - UTF-8 or UTF-16, if it is not then at some point when processing the - input, the converter/checker of UTF-8 form will raise an encoding error. - You may end-up with a garbled document, or no document at all ! Example: +

      What does this mean in practice for the libxml2 user:

    • xmlChar, the libxml2 data type, is a byte; those bytes must be assembled
      as valid UTF-8 strings. The proper way to terminate an xmlChar * string
      is simply to append a 0 byte, as usual.
      • +
    • One just needs to make sure that when using chars outside the ASCII set,
      the values have been properly converted to UTF-8
      • +
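
    A minimal sketch of those two rules when building a tree by hand (the
element names and the "café" string below are just illustrative):

  #include <stdio.h>
  #include <libxml/tree.h>

  int main(void) {
      xmlDocPtr doc = xmlNewDoc(BAD_CAST "1.0");

      /* xmlChar * strings are plain 0-terminated byte arrays holding UTF-8;
         "café" is written with its UTF-8 bytes, 0xC3 0xA9 for the 'é' */
      xmlNodePtr root = xmlNewDocNode(doc, NULL, BAD_CAST "menu", NULL);
      xmlDocSetRootElement(doc, root);
      xmlNewTextChild(root, NULL, BAD_CAST "drink", BAD_CAST "caf\xC3\xA9");

      xmlDocDump(stdout, doc);   /* serialized in UTF-8, the internal form */
      xmlFreeDoc(doc);
      return 0;
  }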

      How is it implemented ?

    Let's describe how all this works within libxml. Basically the I18N
(internationalization) support gets triggered only during I/O operations,
i.e. when reading a document or saving one. Let's look first at the reading
sequence:

    1. when a document is processed, we usually don't know the encoding; a
       simple heuristic allows detecting UTF-16 and UCS-4 from encodings where
       the ASCII range (0-0x7F) maps with ASCII
      2. +
    3. the xml declaration, if available, is parsed, including the encoding
       declaration. At that point, if the autodetected encoding is different
       from the one declared, a call to xmlSwitchEncoding() is issued.
      4. +
    5. If there is no encoding declaration, then the input has to be in either
       UTF-8 or UTF-16; if it is not, then at some point when processing the
       input, the converter/checker of UTF-8 form will raise an encoding error.
       You may end up with a garbled document, or no document at all ! Example:
        ~/XML -> ./xmllint err.xml 
         err.xml:1: error: Input is not proper UTF-8, indicate encoding !
         <très>là</très>
        @@ -113,94 +114,93 @@ err.xml:1: error: Bytes: 0xE8 0x73 0x3E 0x6C
         <très>là</très>
            ^
      6. -
      7. xmlSwitchEncoding() does an encoding name lookup, canonicalize it, and - then search the default registered encoding converters for that encoding. - If it's not within the default set and iconv() support has been compiled - it, it will ask iconv for such an encoder. If this fails then the parser - will report an error and stops processing: +
    8. xmlSwitchEncoding() does an encoding name lookup, canonicalizes it, and
       then searches the default registered encoding converters for that
       encoding. If it's not within the default set and iconv() support has
       been compiled in, it will ask iconv for such an encoder. If this fails
       then the parser will report an error and stop processing:
        ~/XML -> ./xmllint err2.xml 
         err2.xml:1: error: Unsupported encoding UnsupportedEnc
         <?xml version="1.0" encoding="UnsupportedEnc"?>
                                                      ^
      9. -
      10. From that point the encoder processes progressively the input (it is - plugged as a front-end to the I/O module) for that entity. It captures - and converts on-the-fly the document to be parsed to UTF-8. The parser - itself just does UTF-8 checking of this input and process it - transparently. The only difference is that the encoding information has - been added to the parsing context (more precisely to the input - corresponding to this entity).
      11. -
      12. The result (when using DOM) is an internal form completely in UTF-8 - with just an encoding information on the document node.
      13. -

      Ok then what happens when saving the document (assuming you -collected/built an xmlDoc DOM like structure) ? It depends on the function -called, xmlSaveFile() will just try to save in the original encoding, while -xmlSaveFileTo() and xmlSaveFileEnc() can optionally save to a given -encoding:

      1. if no encoding is given, libxml2 will look for an encoding value - associated to the document and if it exists will try to save to that - encoding, +
    2. From that point the encoder progressively processes the input (it is
       plugged as a front-end to the I/O module) for that entity. It captures
       and converts on-the-fly the document to be parsed to UTF-8. The parser
       itself just does UTF-8 checking of this input and processes it
       transparently. The only difference is that the encoding information has
       been added to the parsing context (more precisely to the input
       corresponding to this entity).
      3. +
    4. The result (when using DOM) is an internal form completely in UTF-8
       with just the encoding information on the document node.
      5. +
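
    This whole sequence is transparent when using the xmlRead calls; one case
where the caller may have to step in is when the document carries no encoding
declaration and is not in UTF-8, in which case the encoding can be passed
explicitly. A rough sketch, reusing the err.xml example above:

  #include <stdio.h>
  #include <libxml/parser.h>

  int main(void) {
      /* err.xml has no encoding declaration and is not UTF-8, so a plain
         parse fails; passing the encoding by hand works around that. */
      xmlDocPtr doc = xmlReadFile("err.xml", NULL, 0);
      if (doc == NULL) {
          fprintf(stderr, "parse failed, retrying as ISO-8859-1\n");
          doc = xmlReadFile("err.xml", "ISO-8859-1", 0);
      }
      if (doc != NULL) {
          printf("parsed, stored encoding: %s\n",
                 doc->encoding ? (const char *) doc->encoding : "none");
          xmlFreeDoc(doc);
      }
      xmlCleanupParser();
      return 0;
  }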

    Ok then what happens when saving the document (assuming you
collected/built an xmlDoc DOM like structure) ? It depends on the function
called: xmlSaveFile() will just try to save in the original encoding, while
xmlSaveFileTo() and xmlSaveFileEnc() can optionally save to a given
encoding:

    1. if no encoding is given, libxml2 will look for an encoding value
       associated to the document and if it exists will try to save to that
       encoding,

        otherwise everything is written in the internal form, i.e. UTF-8

      2. -
      3. so if an encoding was specified, either at the API level or on the - document, libxml2 will again canonicalize the encoding name, lookup for a - converter in the registered set or through iconv. If not found the - function will return an error code
      4. -
      5. the converter is placed before the I/O buffer layer, as another kind of - buffer, then libxml2 will simply push the UTF-8 serialization to through - that buffer, which will then progressively be converted and pushed onto - the I/O layer.
      6. -
      7. It is possible that the converter code fails on some input, for example - trying to push an UTF-8 encoded Chinese character through the UTF-8 to - ISO-8859-1 converter won't work. Since the encoders are progressive they - will just report the error and the number of bytes converted, at that - point libxml2 will decode the offending character, remove it from the - buffer and replace it with the associated charRef encoding &#123; and - resume the conversion. This guarantees that any document will be saved - without losses (except for markup names where this is not legal, this is - a problem in the current version, in practice avoid using non-ascii - characters for tag or attribute names). A special "ascii" encoding name - is used to save documents to a pure ascii form can be used when - portability is really crucial
      8. +
    9. so if an encoding was specified, either at the API level or on the
       document, libxml2 will again canonicalize the encoding name and look
       for a converter in the registered set or through iconv. If not found
       the function will return an error code
      10. +
    11. the converter is placed before the I/O buffer layer, as another kind
        of buffer, then libxml2 will simply push the UTF-8 serialization
        through that buffer, which will then progressively be converted and
        pushed onto the I/O layer.
      12. +
    13. It is possible that the converter code fails on some input, for
        example trying to push a UTF-8 encoded Chinese character through the
        UTF-8 to ISO-8859-1 converter won't work. Since the encoders are
        progressive they will just report the error and the number of bytes
        converted; at that point libxml2 will decode the offending character,
        remove it from the buffer, replace it with the associated charRef
        encoding &#123; and resume the conversion. This guarantees that any
        document will be saved without losses (except for markup names where
        this is not legal; this is a problem in the current version, so in
        practice avoid using non-ascii characters for tag or attribute
        names). A special "ascii" encoding name can be used to save documents
        to a pure ascii form when portability is really crucial

      Here are a few examples based on the same test document:

      ~/XML -> ./xmllint isolat1 
       <?xml version="1.0" encoding="ISO-8859-1"?>
       <très>là</très>
       ~/XML -> ./xmllint --encode UTF-8 isolat1 
       <?xml version="1.0" encoding="UTF-8"?>
       <très>là  </très>
      -~/XML -> 
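
    The equivalent from C, using the saving functions discussed above (the
output file names are just placeholders):

  #include <libxml/parser.h>
  #include <libxml/tree.h>

  int main(void) {
      xmlDocPtr doc = xmlReadFile("isolat1", NULL, 0); /* the test document */
      if (doc == NULL)
          return 1;

      /* Saves back in the original encoding recorded on the document
         (ISO-8859-1 here), like the first xmllint run above. */
      xmlSaveFile("copy-isolat1.xml", doc);

      /* Forces a target encoding, like xmllint --encode UTF-8. */
      xmlSaveFileEnc("copy-utf8.xml", doc, "UTF-8");

      xmlFreeDoc(doc);
      xmlCleanupParser();
      return 0;
  }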

      The same processing is applied (and reuse most of the code) for HTML I18N -processing. Looking up and modifying the content encoding is a bit more -difficult since it is located in a <meta> tag under the <head>, -so a couple of functions htmlGetMetaEncoding() and htmlSetMetaEncoding() have -been provided. The parser also attempts to switch encoding on the fly when -detecting such a tag on input. Except for that the processing is the same -(and again reuses the same code).

      Default supported encodings

      libxml2 has a set of default converters for the following encodings -(located in encoding.c):

      1. UTF-8 is supported by default (null handlers)
      2. +~/XML ->

    The same processing is applied (and reuses most of the code) for HTML I18N
processing. Looking up and modifying the content encoding is a bit more
difficult since it is located in a <meta> tag under the <head>, so a couple
of functions, htmlGetMetaEncoding() and htmlSetMetaEncoding(), have been
provided. The parser also attempts to switch encoding on the fly when
detecting such a tag on input. Except for that the processing is the same
(and again reuses the same code).
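
    A short sketch of those two helpers (the file names are only
placeholders):

  #include <stdio.h>
  #include <libxml/HTMLparser.h>
  #include <libxml/HTMLtree.h>

  int main(void) {
      htmlDocPtr doc = htmlReadFile("page.html", NULL, 0);
      if (doc == NULL)
          return 1;

      /* Encoding advertised by the <meta> tag, if any */
      const xmlChar *enc = htmlGetMetaEncoding(doc);
      printf("meta encoding: %s\n", enc ? (const char *) enc : "none");

      /* Update the <meta> tag and save accordingly */
      htmlSetMetaEncoding(doc, BAD_CAST "UTF-8");
      htmlSaveFileEnc("page-utf8.html", doc, "UTF-8");

      xmlFreeDoc(doc);
      return 0;
  }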

    Default supported encodings

    libxml2 has a set of default converters for the following encodings
(located in encoding.c):

    1. UTF-8 is supported by default (null handlers)
    2. UTF-16, both little and big endian
    3. ISO-Latin-1 (ISO-8859-1) covering most western languages
    4. ASCII, useful mostly for saving
    5. -
    6. HTML, a specific handler for the conversion of UTF-8 to ASCII with HTML - predefined entities like &copy; for the Copyright sign.
    7. -

    More over when compiled on an Unix platform with iconv support the full -set of encodings supported by iconv can be instantly be used by libxml. On a -linux machine with glibc-2.1 the list of supported encodings and aliases fill -3 full pages, and include UCS-4, the full set of ISO-Latin encodings, and the -various Japanese ones.

    To convert from the UTF-8 values returned from the API to another encoding -then it is possible to use the function provided from the encoding module like UTF8Toisolat1, or use the -POSIX iconv() -API directly.

    Encoding aliases

    From 2.2.3, libxml2 has support to register encoding names aliases. The -goal is to be able to parse document whose encoding is supported but where -the name differs (for example from the default set of names accepted by -iconv). The following functions allow to register and handle new aliases for -existing encodings. Once registered libxml2 will automatically lookup the -aliases when handling a document:

    • int xmlAddEncodingAlias(const char *name, const char *alias);
    • +
    • HTML, a specific handler for the conversion of UTF-8 to ASCII with HTML
      predefined entities like &copy; for the Copyright sign.
    • +

    Moreover, when compiled on a Unix platform with iconv support, the full
set of encodings supported by iconv can instantly be used by libxml. On a
Linux machine with glibc-2.1 the list of supported encodings and aliases
fills 3 full pages, and includes UCS-4, the full set of ISO-Latin encodings,
and the various Japanese ones.
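
    One simple way to check whether a given encoding is usable by a
particular build (either through the built-in converters or through iconv)
is to ask for its handler, for instance:

  #include <stdio.h>
  #include <libxml/encoding.h>

  int main(void) {
      const char *name = "EUC-JP";   /* any encoding name to probe */
      xmlCharEncodingHandlerPtr handler = xmlFindCharEncodingHandler(name);

      if (handler != NULL) {
          printf("%s is supported (built-in or via iconv)\n", name);
          xmlCharEncCloseFunc(handler);  /* releases an iconv-based handler */
      } else {
          printf("%s is not supported by this build\n", name);
      }
      return 0;
  }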

    To convert from the UTF-8 values returned from the API to another
encoding, it is possible to use the functions provided by the encoding
module, like UTF8Toisolat1, or use the POSIX iconv() API directly.
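
    A minimal sketch of UTF8Toisolat1() usage (the buffer size here is picked
arbitrarily):

  #include <stdio.h>
  #include <string.h>
  #include <libxml/encoding.h>

  int main(void) {
      /* "là" in UTF-8: 'l' then the bytes 0xC3 0xA0 for 'à' */
      const unsigned char utf8[] = "l\xC3\xA0";
      unsigned char latin1[64];

      int inlen = (int) strlen((const char *) utf8);
      int outlen = (int) sizeof(latin1) - 1;

      int ret = UTF8Toisolat1(latin1, &outlen, utf8, &inlen);
      if (ret < 0) {
          fprintf(stderr, "conversion failed\n");
          return 1;
      }
      latin1[outlen] = 0;   /* outlen now holds the number of bytes written */
      printf("converted %d bytes in, %d bytes out\n", inlen, outlen);
      return 0;
  }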

      Encoding aliases

    From 2.2.3, libxml2 has support to register encoding name aliases. The
goal is to be able to parse documents whose encoding is supported but where
the name differs (for example from the default set of names accepted by
iconv). The following functions allow registering and handling new aliases
for existing encodings. Once registered, libxml2 will automatically look up
the aliases when handling a document:

      • int xmlAddEncodingAlias(const char *name, const char *alias);
      • int xmlDelEncodingAlias(const char *alias);
      • const char * xmlGetEncodingAlias(const char *alias);
      • void xmlCleanupEncodingAliases(void);
      • -
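
    For instance, a small sketch registering a made-up alias for an encoding
libxml2 already knows about:

  #include <stdio.h>
  #include <libxml/encoding.h>

  int main(void) {
      /* Documents declaring encoding="MY-LATIN-1" will now be handled
         with the ISO-8859-1 converter. */
      xmlAddEncodingAlias("ISO-8859-1", "MY-LATIN-1");

      printf("MY-LATIN-1 -> %s\n", xmlGetEncodingAlias("MY-LATIN-1"));

      xmlDelEncodingAlias("MY-LATIN-1");   /* remove a single alias */
      xmlCleanupEncodingAliases();         /* or drop all registered aliases */
      return 0;
  }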

      How to extend the existing support

      Well adding support for new encoding, or overriding one of the encoders -(assuming it is buggy) should not be hard, just write input and output -conversion routines to/from UTF-8, and register them using -xmlNewCharEncodingHandler(name, xxxToUTF8, UTF8Toxxx), and they will be -called automatically if the parser(s) encounter such an encoding name -(register it uppercase, this will help). The description of the encoders, -their arguments and expected return values are described in the encoding.h -header.

      Daniel Veillard

    +

    How to extend the existing support

    Well, adding support for a new encoding, or overriding one of the encoders
(assuming it is buggy) should not be hard: just write input and output
conversion routines to/from UTF-8, and register them using
xmlNewCharEncodingHandler(name, xxxToUTF8, UTF8Toxxx), and they will be
called automatically if the parser(s) encounter such an encoding name
(register it uppercase, this will help). The encoders, their arguments and
expected return values are described in the encoding.h header.
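
    As a toy sketch (the encoding name is made up, and the converters simply
reuse the ISO-8859-1 routines, which only makes sense for an 8-bit encoding
that happens to match Latin-1):

  #include <libxml/encoding.h>

  /* Hypothetical encoding "X-MYLATIN" that is byte-identical to ISO-8859-1,
     so the stock Latin-1 routines can serve as its converters. */
  int myToUTF8(unsigned char *out, int *outlen,
               const unsigned char *in, int *inlen) {
      return isolat1ToUTF8(out, outlen, in, inlen);
  }

  int UTF8Tomy(unsigned char *out, int *outlen,
               const unsigned char *in, int *inlen) {
      return UTF8Toisolat1(out, outlen, in, inlen);
  }

  int main(void) {
      /* Registered uppercase, as advised above; the parser will pick it up
         whenever a document declares encoding="X-MYLATIN". */
      xmlNewCharEncodingHandler("X-MYLATIN", myToUTF8, UTF8Tomy);
      return 0;
  }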

    Daniel Veillard
