Diffstat (limited to 'doc/encoding.html')
-rw-r--r-- doc/encoding.html 30
1 files changed, 20 insertions, 10 deletions
diff --git a/doc/encoding.html b/doc/encoding.html
index 93de5bf..7c7953f 100644
--- a/doc/encoding.html
+++ b/doc/encoding.html
@@ -13,7 +13,8 @@ by Tim Bray on Unicode and why you should care about it.</p><p>If you don't unde
without knowing what encoding it uses</b>, then as Joel Spolsky said <a href="http://www.joelonsoftware.com/articles/Unicode.html">please do not
write another line of code until you finish reading that article.</a>. It is
a prerequisite to understand this page, and avoid a lot of problems with
-libxml2, XML or text processing in general.</p><p>Table of Content:</p><ol><li><a href="encoding.html#What">What does internationalization support
+libxml2, XML or text processing in general.</p><p>Table of Content:</p><ol>
+ <li><a href="encoding.html#What">What does internationalization support
mean ?</a></li>
<li><a href="encoding.html#internal">The internal encoding, how and
why</a></li>
@@ -33,7 +34,8 @@ allows the document to be encoded in other encodings at the condition that
they are clearly labeled as such. For example the following is a wellformed
XML document encoded in ISO-8859-1 and using accentuated letters that we
French like for both markup and content:</p><pre>&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
-&lt;très&gt;là &lt;/très&gt;</pre><p>Having internationalization support in libxml2 means the following:</p><ul><li>the document is properly parsed</li>
+&lt;très&gt;là &lt;/très&gt;</pre><p>Having internationalization support in libxml2 means the following:</p><ul>
+ <li>the document is properly parsed</li>
<li>information about it's encoding is saved</li>
<li>it can be modified</li>
<li>it can be saved in its original encoding</li>
@@ -54,7 +56,8 @@ an internationalized fashion by libxml2 too:</p><pre>&lt;!DOCTYPE HTML PUBLIC "-
&lt;p&gt;W3C crée des standards pour le Web.&lt;/body&gt;
&lt;/html&gt;</pre><h3><a name="internal" id="internal">The internal encoding, how and why</a></h3><p>One of the core decisions was to force all documents to be converted to a
default internal encoding, and that encoding to be UTF-8, here are the
-rationales for those choices:</p><ul><li>keeping the native encoding in the internal form would force the libxml
+rationales for those choices:</p><ul>
+ <li>keeping the native encoding in the internal form would force the libxml
users (or the code associated) to be fully aware of the encoding of the
original document, for examples when adding a text node to a document,
the content would have to be provided in the document encoding, i.e. the
@@ -67,7 +70,8 @@ rationales for those choices:</p><ul><li>keeping the native encoding in the inte
considered an intelligent choice too since it's a direct Unicode mapping
support. I selected UTF-8 on the basis of efficiency and compatibility
with surrounding software:
- <ul><li>UTF-8 while a bit more complex to convert from/to (i.e. slightly
+ <ul>
+ <li>UTF-8 while a bit more complex to convert from/to (i.e. slightly
more costly to import and export CPU wise) is also far more compact
than UTF-16 (and UCS-4) for a majority of the documents I see it used
for right now (RPM RDF catalogs, advogato data, various configuration
@@ -86,8 +90,10 @@ rationales for those choices:</p><ul><li>keeping the native encoding in the inte
upcoming Gnome text widget, and a lot of Unix code (yet another place
where Unix programmer base takes a different approach from Microsoft
- they are using UTF-16)</li>
- </ul></li>
-</ul><p>What does this mean in practice for the libxml2 user:</p><ul><li>xmlChar, the libxml2 data type is a byte, those bytes must be assembled
+ </ul>
+ </li>
+</ul><p>What does this mean in practice for the libxml2 user:</p><ul>
+ <li>xmlChar, the libxml2 data type is a byte, those bytes must be assembled
as UTF-8 valid strings. The proper way to terminate an xmlChar * string
is simply to append 0 byte, as usual.</li>
<li>One just need to make sure that when using chars outside the ASCII set,
@@ -95,7 +101,8 @@ rationales for those choices:</p><ul><li>keeping the native encoding in the inte
</ul><h3><a name="implemente" id="implemente">How is it implemented ?</a></h3><p>Let's describe how all this works within libxml, basically the I18N
(internationalization) support get triggered only during I/O operation, i.e.
when reading a document or saving one. Let's look first at the reading
-sequence:</p><ol><li>when a document is processed, we usually don't know the encoding, a
+sequence:</p><ol>
+ <li>when a document is processed, we usually don't know the encoding, a
simple heuristic allows to detect UTF-16 and UCS-4 from encodings where
the ASCII range (0-0x7F) maps with ASCII</li>
<li>the xml declaration if available is parsed, including the encoding
@@ -136,7 +143,8 @@ err2.xml:1: error: Unsupported encoding UnsupportedEnc
collected/built an xmlDoc DOM like structure) ? It depends on the function
called, xmlSaveFile() will just try to save in the original encoding, while
xmlSaveFileTo() and xmlSaveFileEnc() can optionally save to a given
-encoding:</p><ol><li>if no encoding is given, libxml2 will look for an encoding value
+encoding:</p><ol>
+ <li>if no encoding is given, libxml2 will look for an encoding value
associated to the document and if it exists will try to save to that
encoding,
<p>otherwise everything is written in the internal form, i.e. UTF-8</p>
@@ -175,7 +183,8 @@ so a couple of functions htmlGetMetaEncoding() and htmlSetMetaEncoding() have
been provided. The parser also attempts to switch encoding on the fly when
detecting such a tag on input. Except for that the processing is the same
(and again reuses the same code).</p><h3><a name="Default" id="Default">Default supported encodings</a></h3><p>libxml2 has a set of default converters for the following encodings
-(located in encoding.c):</p><ol><li>UTF-8 is supported by default (null handlers)</li>
+(located in encoding.c):</p><ol>
+ <li>UTF-8 is supported by default (null handlers)</li>
<li>UTF-16, both little and big endian</li>
<li>ISO-Latin-1 (ISO-8859-1) covering most western languages</li>
<li>ASCII, useful mostly for saving</li>
@@ -193,7 +202,8 @@ goal is to be able to parse document whose encoding is supported but where
the name differs (for example from the default set of names accepted by
iconv). The following functions allow to register and handle new aliases for
existing encodings. Once registered libxml2 will automatically lookup the
-aliases when handling a document:</p><ul><li>int xmlAddEncodingAlias(const char *name, const char *alias);</li>
+aliases when handling a document:</p><ul>
+ <li>int xmlAddEncodingAlias(const char *name, const char *alias);</li>
<li>int xmlDelEncodingAlias(const char *alias);</li>
<li>const char * xmlGetEncodingAlias(const char *alias);</li>
<li>void xmlCleanupEncodingAliases(void);</li>
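
The reading sequence quoted in the hunk at `@@ -95,7 +101,8 @@` mentions a simple heuristic that tells UTF-16 and UCS-4 apart from ASCII-compatible encodings. A minimal sketch of such a heuristic follows, assuming the approach of looking at the first four bytes for a byte order mark or for the byte pattern of a leading `<` or `<?`. This is not libxml2's actual code: the `sketch_encoding` enum and the `sketch_detect_encoding()` function are hypothetical names introduced only for illustration.

```c
#include <stddef.h>

/* Hypothetical result type, not part of libxml2. */
typedef enum {
    SKETCH_ENC_UNKNOWN,  /* no marker found: probably ASCII-compatible */
    SKETCH_ENC_UTF8,     /* UTF-8 byte order mark seen */
    SKETCH_ENC_UTF16LE,
    SKETCH_ENC_UTF16BE,
    SKETCH_ENC_UCS4LE,
    SKETCH_ENC_UCS4BE
} sketch_encoding;

/*
 * Guess the encoding family from the first bytes of a document.
 * Assumption: the document starts with an optional byte order mark
 * followed by '<' (0x3C), as an XML document normally does.
 */
static sketch_encoding
sketch_detect_encoding(const unsigned char *in, size_t len)
{
    if (in == NULL)
        return SKETCH_ENC_UNKNOWN;

    if (len >= 4) {
        /* '<' stored as one 4-byte UCS-4 unit, no byte order mark */
        if (in[0] == 0x00 && in[1] == 0x00 && in[2] == 0x00 && in[3] == 0x3C)
            return SKETCH_ENC_UCS4BE;
        if (in[0] == 0x3C && in[1] == 0x00 && in[2] == 0x00 && in[3] == 0x00)
            return SKETCH_ENC_UCS4LE;
        /* UCS-4 byte order marks; checked before the 2-byte marks
         * because FF FE 00 00 also starts with the UTF-16LE mark */
        if (in[0] == 0x00 && in[1] == 0x00 && in[2] == 0xFE && in[3] == 0xFF)
            return SKETCH_ENC_UCS4BE;
        if (in[0] == 0xFF && in[1] == 0xFE && in[2] == 0x00 && in[3] == 0x00)
            return SKETCH_ENC_UCS4LE;
        /* "<?" stored as two 2-byte UTF-16 units, no byte order mark */
        if (in[0] == 0x00 && in[1] == 0x3C && in[2] == 0x00 && in[3] == 0x3F)
            return SKETCH_ENC_UTF16BE;
        if (in[0] == 0x3C && in[1] == 0x00 && in[2] == 0x3F && in[3] == 0x00)
            return SKETCH_ENC_UTF16LE;
    }
    if (len >= 3 && in[0] == 0xEF && in[1] == 0xBB && in[2] == 0xBF)
        return SKETCH_ENC_UTF8;
    if (len >= 2) {
        /* UTF-16 byte order marks */
        if (in[0] == 0xFE && in[1] == 0xFF)
            return SKETCH_ENC_UTF16BE;
        if (in[0] == 0xFF && in[1] == 0xFE)
            return SKETCH_ENC_UTF16LE;
    }
    /* Nothing recognised: the caller would go on to read the
     * encoding name from the XML declaration, if there is one. */
    return SKETCH_ENC_UNKNOWN;
}
```

A caller would pass the first bytes read from the input. A result of `SKETCH_ENC_UNKNOWN` only means that the ASCII range can be read directly, so the XML declaration can then be parsed to find the declared encoding.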