Diffstat (limited to 'doc/draft/draft-ietf-idn-dude-00.txt')
-rw-r--r-- doc/draft/draft-ietf-idn-dude-00.txt 596
1 files changed, 596 insertions, 0 deletions
diff --git a/doc/draft/draft-ietf-idn-dude-00.txt b/doc/draft/draft-ietf-idn-dude-00.txt
new file mode 100644
index 00000000..86e05d46
--- /dev/null
+++ b/doc/draft/draft-ietf-idn-dude-00.txt
@@ -0,0 +1,596 @@
+Internet Engineering Task Force (IETF)                       Mark Welter
+INTERNET-DRAFT                                        Brian W. Spolarich
+draft-ietf-idn-dude-00.txt                                   WALID, Inc.
+November 16, 2000                                   Expires May 16, 2001
+
+
+ DUDE: Differential Unicode Domain Encoding
+
+
+Status of this memo
+
+This document is an Internet-Draft and is in full conformance with all
+provisions of Section 10 of RFC2026.
+
+Internet-Drafts are working documents of the Internet Engineering Task
+Force (IETF), its areas, and its working groups. Note that other
+groups may also distribute working documents as Internet-Drafts.
+
+Internet-Drafts are draft documents valid for a maximum of six months
+and may be updated, replaced, or obsoleted by other documents at any
+time. It is inappropriate to use Internet-Drafts as reference
+material or to cite them other than as "work in progress."
+
+ The list of current Internet-Drafts can be accessed at
+ http://www.ietf.org/ietf/1id-abstracts.txt
+
+ The list of Internet-Draft Shadow Directories can be accessed at
+ http://www.ietf.org/shadow.html.
+
+The distribution of this document is unlimited.
+
+Copyright (c) The Internet Society (2000). All Rights Reserved.
+
+Abstract
+
+This document describes a transformation method for representing
+Unicode character codepoints in host name parts in a fashion that is
+completely compatible with the current Domain Name System. It provides
+for very efficient representation of typical Unicode sequences as
+host name parts, while preserving simplicity. It is proposed as a
+potential candidate for an ASCII-Compatible Encoding (ACE) for supporting
+the deployment of an internationalized Domain Name System.
+
+
+Table of Contents
+
+1. Introduction
+1.1 Terminology
+2. Hostname Part Transformation
+2.1 Post-Converted Name Prefix
+2.2 Radix Selection
+2.3 Hostname Preparation
+2.4 Definitions
+2.5 DUDE Encoding
+2.5.1 Extended Variable Length Hex Encoding
+2.5.2 DUDE Compression Algorithm
+2.5.3 Forward Transformation Algorithm
+2.6 DUDE Decoding
+2.6.1 Extended Variable Length Hex Decoding
+2.6.2 DUDE Decompression Algorithm
+2.6.3 Reverse Transformation Algorithm
+3. Examples
+3.1 'www.walid.com' (in Arabic)
+4. DUDE Extensions
+4.1 Extended DUDE Encoding
+4.1.1 Modified Extended Variable Length Hex Encoding
+4.1.2 Extended Compression Algorithm
+4.1.3 Extended Forward Transformation Algorithm
+4.2 Extended DUDE Decoding
+4.2.1 Modified Extended Variable Length Hex Decoding
+4.2.2 Extended Decompression Algorithm
+4.2.3 Extended Reverse Transformation Algorithm
+5. Security Considerations
+6. References
+
+
+1. Introduction
+
+DUDE describes an encoding scheme of the ISO/IEC 10646 [ISO10646]
+character set (whose character code assignments are synchronized
+with Unicode [UNICODE3]), and the procedures for using this scheme
+to transform host name parts containing Unicode character sequences
+into sequences that are compatible with the current DNS protocol
+[STD13]. As such, it satisfies the definition of a 'charset' as
+defined in [IDNREQ].
+
+1.1 Terminology
+
+The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
+"MAY" in this document are to be interpreted as described in RFC 2119
+[RFC2119].
+
+Hexadecimal values are shown preceded with an "0x". For example,
+"0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values are
+shown preceded with an "0b". For example, a nine-bit value might be
+shown as "0b101101111".
+
+Examples in this document use the notation from the Unicode Standard
+[UNICODE3] as well as the ISO 10646 names. For example, the letter "a"
+may be represented as either "U+0061" or "LATIN SMALL LETTER A".
+
+DUDE converts strings with internationalized characters into
+strings of US-ASCII that are acceptable as host name parts in current
+DNS host naming usage. The former are called "pre-converted" and the
+latter are called "post-converted". This specification defines both
+a forward and reverse transformation algorithm.
+
+
+2. Hostname Part Transformation
+
+According to [STD13], hostname parts must start and end with a letter
+or digit, and contain only letters, digits, and the hyphen character
+("-"). This, of course, excludes most characters used by non-English
+speakers, as well as many other characters in the ASCII character
+repertoire. Further, domain name parts must be 63 octets or shorter
+in length.
+
+2.1 Post-Converted Name Prefix
+
+This document defines the string 'dq--' as a prefix to identify
+DUDE-encoded sequences. For the purposes of comparison in the IDN
+Working Group activities, the 'dq--' prefix should be used solely to
+identify DUDE sequences. However, should this document proceed beyond
+draft status the prefix should be changed to whatever prefix, if any,
+is the final consensus of the IDN working group.
+
+Note that the prepending of a fixed identifier sequence is only one
+mechanism for differentiating ASCII character encoded international
+domain names from 'ordinary' domain names. One method, as proposed in
+[IDNRACE], is to include a character prefix or suffix that does not
+appear in any name in any zone file. A second method is to insert a
+domain component which pushes off any international names one or more
+levels deeper into the DNS hierarchy. There are trade-offs between
+these two methods which are independent of the Unicode to ASCII
+transcoding method finally chosen. We do not address the international
+vs. 'ordinary' name differentiation issue in this paper.
+
+2.2 Radix Selection
+
+There are many proposed methods for representing Unicode characters
+within the allowed target character set, which can be split into groups
+on the basis of the underlying radix. We have chosen a method with
+radix 16 because both UTF-16 and ASCII are represented by even multiples
+of four bits. This allows a Unicode character to be encoded as a
+whole number of ASCII characters, and permits easier manipulation of
+the resulting encoded data by humans.
+
+2.3 Hostname Preparation
+
+The hostname part is assumed to contain at least one character
+disallowed by [STD13], and to have been processed for logically
+equivalent character mapping, filtering of disallowed characters (if
+any), and compatibility composition/decomposition before presentation
+to the DUDE conversion algorithm.
+
+While it is possible to invent a transcoding mechanism that relies
+on certain Unicode characters being deemed illegal within domain names
+and hence available to the transcoding mechanism for improving encoding
+efficiency, we feel that such a proposal would complicate matters
+excessively. We also believe that Unicode name preprocessing for
+both name resolution and name registration should be considered as
+separate, independent issues, which we will address in a separate
+document.
+
+2.4 Definitions
+
+For clarity:
+
+ 'integer' is an unsigned binary quantity;
+ 'byte' is an 8-bit integer quantity;
+ 'nibble' is a 4-bit integer quantity.
+
+2.5 DUDE Encoding
+
+The idea behind this scheme is to provide compression by encoding only
+the contiguous least significant nibbles of a character that differ from
+the preceding character. The scheme uses a variant of the variable
+length hex encoding described in [IDNDUERST] and elsewhere. Because the
+variant can encode leading zero nibbles, the decoder can recover the
+differential length. The encoding is, with some practice, easy to
+perform manually.
+
+There are two extensions to this basic idea: one enables encoding the
+preferred case for each character (for reverse DNS resolution) and
+another improves the worst case behaviour related to surrogates. The
+basic algorithms will be formally described first and then the extended
+algorithms will be described.
+
+2.5.1 Extended Variable Length Hex Encoding
+
+The variable length hex encoding algorithm was introduced by Duerst in
+[IDNDUERST]. It encodes an integer value in a slight modification of
+traditional hexadecimal notation, the difference being that the most
+significant digit is represented with an alternate set of "digits"
+-- 'g' through 'v' are used to represent 0 through 15. The result is a
+variable length encoding which can efficiently represent integers of
+arbitrary length.
+
+This specification extends the variable length hex encoding algorithm
+to support the compression scheme defined below by potentially not
+suppressing leading zero nibbles.
+
+The extended variable length hex encoding of an integer, C,
+to length N, is defined as follows:
+
+ 1. Let I be the Nth nibble of C, counting from the least
+ significant nibble (which is nibble 1);
+
+ 2. Emit the Ith character of the sequence [ghijklmnopqrstuv],
+ counting from zero (so that 'g' represents 0);
+
+ 3. Continue from the most to least significant, encoding each
+ remaining nibble J by emitting the Jth character of the
+ sequence [0123456789abcdef], again counting from zero.
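The three steps above can be sketched in Python as follows. This is an illustrative reading of the text rather than a reference implementation; the names vlhex_encode, LEAD_DIGITS and HEX_DIGITS are hypothetical and do not appear in the draft.

```python
LEAD_DIGITS = "ghijklmnopqrstuv"  # digits for the most significant nibble
HEX_DIGITS = "0123456789abcdef"   # digits for every following nibble


def vlhex_encode(c: int, n: int) -> str:
    """Encode the n least significant nibbles of c, keeping leading zeros."""
    if c < 0 or n < 1:
        raise ValueError("c must be non-negative and n at least 1")
    # Collect nibbles n..1, most significant first.
    nibbles = [(c >> (4 * k)) & 0xF for k in range(n - 1, -1, -1)]
    # The first nibble uses the alternate digit set, the rest plain hex.
    return LEAD_DIGITS[nibbles[0]] + "".join(HEX_DIGITS[j] for j in nibbles[1:])
```

Under this reading, vlhex_encode(0x645, 3) gives "m45", and encoding 0x5 to length 3 gives "g05", which shows how the leading zero nibbles are preserved.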
+
+2.5.2 DUDE Compression Algorithm
+
+ 1. Let PREV = 0;
+
+ 2. If there are no more characters in the input, terminate successfully;
+
+ 3. Let C be the next character in the input;
+
+ 4. If C != '-', then go to step 6;
+
+ 5. Consume the input character, emit '-', and go to step 2;
+
+ 6. Let D be the result of PREV exclusive ORed with C;
+
+ 7. Find the least positive value N such that
+ D bitwise ANDed with M is zero
+ where M = the bitwise complement of (16**N) - 1;
+
+ 8. Let V be C ANDed with the bitwise complement of M;
+
+ 9. Variable length hex encode V to length N and emit the result;
+
+ 10. Consume the input character, let PREV = C, and go to step 2.
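A minimal Python sketch of the compression loop follows, assuming the input contains only characters below 0x10000 so that each Python character corresponds to one UTF-16 code unit. The function names are hypothetical, and the small encoder is repeated so that the block runs on its own.

```python
LEAD_DIGITS = "ghijklmnopqrstuv"
HEX_DIGITS = "0123456789abcdef"


def vlhex_encode(v: int, n: int) -> str:
    # Extended variable length hex encoding of v to length n.
    nibbles = [(v >> (4 * k)) & 0xF for k in range(n - 1, -1, -1)]
    return LEAD_DIGITS[nibbles[0]] + "".join(HEX_DIGITS[j] for j in nibbles[1:])


def dude_compress(part: str) -> str:
    out = []
    prev = 0
    for ch in part:
        if ch == "-":
            out.append("-")       # hyphens pass through; PREV is unchanged
            continue
        c = ord(ch)
        if c > 0xFFFF:
            raise ValueError("sketch assumes 16-bit code units only")
        d = prev ^ c              # nibbles that differ from the previous unit
        n = 1
        while d >> (4 * n):       # least N with no differing bits above nibble N
            n += 1
        v = c & ((1 << (4 * n)) - 1)
        out.append(vlhex_encode(v, n))
        prev = c
    return "".join(out)
```

For the first Arabic label of section 3.1, dude_compress("\u0645\u0648\u0642\u0639") yields "m45oij9".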
+
+
+2.5.3 Forward Transformation Algorithm
+
+The DUDE transformation algorithm accepts a string in UTF-16
+[ISO10646] format as input. The encoding algorithm is as follows:
+
+ 1. Break the hostname string into dot-separated hostname parts.
+ For each hostname part which contains one or more characters
+ disallowed by [STD13], perform steps 2 and 3 below;
+
+ 2. Compress the hostname part using the method described in section
+ 2.5.2 above, and encode using the encoding described in section
+ 2.5.1;
+
+ 3. Prepend the post-converted name prefix 'dq--' (see section 2.1
+ above) to the resulting string.
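The forward transformation can be sketched as below. The test for "characters disallowed by [STD13]" is simplified here to a letter, digit, and hyphen membership check, which is an assumption of this sketch (it does not check the start and end rules or the 63 octet limit). The names to_dude, compress and LDH are hypothetical.

```python
import string

LDH = set(string.ascii_letters + string.digits + "-")
LEAD_DIGITS = "ghijklmnopqrstuv"
PREFIX = "dq--"


def compress(part: str) -> str:
    # Differential compression as in section 2.5.2 (16-bit code units assumed).
    out, prev = [], 0
    for ch in part:
        if ch == "-":
            out.append("-")
            continue
        c = ord(ch)
        n = 1
        while (prev ^ c) >> (4 * n):
            n += 1
        digits = f"{c & ((1 << (4 * n)) - 1):0{n}x}"   # n hex digits, zero padded
        out.append(LEAD_DIGITS[int(digits[0], 16)] + digits[1:])
        prev = c
    return "".join(out)


def to_dude(hostname: str) -> str:
    parts = []
    for part in hostname.split("."):
        if all(ch in LDH for ch in part):
            parts.append(part)                 # already acceptable, left as is
        else:
            parts.append(PREFIX + compress(part))
    return ".".join(parts)
```

With this sketch, a name that mixes ordinary and internationalized parts, such as "www" followed by an Arabic label and "com", keeps its ASCII parts unchanged and only the Arabic part gains the 'dq--' prefix.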
+
+
+2.6 DUDE Decoding
+
+2.6.1 Extended Variable Length Hex Decoding
+
+Decoding extended variable length hex encoded strings is identical
+to the standard variable length hex decoding, and is defined as
+follows:
+
+ 1. Let CL be the lower case of the first input character,
+
+ If CL is not in set [ghijklmnopqrstuv],
+ return error,
+ else
+ consume the input character;
+
+ 2. Let R = CL - 'g',
+ Let N = 1;
+
+ 3. If no more input characters exist, go to step 9.
+
+ 4. Let CL be the lower case of the next input character;
+
+ 5. If CL is not in the set [0123456789abcdef], go to Step 9;
+
+ 6. Consume the next input character,
+ Let N = N + 1;
+ Let R = R * 16;
+
+ 7. If CL is in set [0123456789],
+ then let R = R + (CL - '0')
+ else let R = R + (CL - 'a') + 10;
+
+ 8. Go to step 3;
+
+ 9. Let MASK be the bitwise complement of (16**N) - 1;
+
+ 10. Return decoded result R as well as MASK.
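The decoding steps can be sketched in Python as follows, accumulating the value of each character read into R. The function name and the extra return value (the position after the decoded token) are assumptions of this sketch. MASK is returned as a Python integer with all bits above nibble N set, which behaves like the bitwise complement described in step 9.

```python
LEAD_DIGITS = "ghijklmnopqrstuv"
HEX_DIGITS = "0123456789abcdef"


def vlhex_decode(s: str, pos: int = 0):
    """Decode one token starting at s[pos]; return (R, MASK, next position)."""
    if pos >= len(s) or s[pos].lower() not in LEAD_DIGITS:
        raise ValueError("token must start with a character in g..v")
    r = LEAD_DIGITS.index(s[pos].lower())
    n = 1
    pos += 1
    # Consume plain hex digits until the next lead character, hyphen, or end.
    while pos < len(s) and s[pos].lower() in HEX_DIGITS:
        r = r * 16 + HEX_DIGITS.index(s[pos].lower())
        n += 1
        pos += 1
    mask = ~((16 ** n) - 1)
    return r, mask, pos
```

Calling vlhex_decode("m45oij9") consumes only "m45" and returns the value 0x645 with a mask that clears the three lowest nibbles; the caller then continues at the returned position.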
+
+2.6.2 DUDE Decompression Algorithm
+
+ 1. Let PREV = 0;
+
+ 2. If there are no more input characters then terminate successfully;
+
+ 3. Let C be the next input character;
+
+ 4. If C == '-', append '-' to the result string, consume the character,
+ and go to step 2;
+
+ 5. Let VPART, MASK be the next variable length hex decoded
+ value and mask;
+
+ 6. If VPART > 0xFFFF then return error status;
+
+ 7. Let CU = (PREV bitwise-AND MASK) + VPART,
+ Let PREV = CU;
+
+ 8. Append the UTF-16 character CU to the result string;
+
+ 9. Go to step 2.
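A self-contained Python sketch of the decompression loop is given below. The names are hypothetical, and the hex decoding of section 2.6.1 is written inline so that the block runs on its own.

```python
LEAD_DIGITS = "ghijklmnopqrstuv"
HEX_DIGITS = "0123456789abcdef"


def dude_decompress(encoded: str) -> str:
    out, prev, pos = [], 0, 0
    while pos < len(encoded):
        if encoded[pos] == "-":
            out.append("-")                    # hyphens pass through unchanged
            pos += 1
            continue
        first = encoded[pos].lower()
        if first not in LEAD_DIGITS:
            raise ValueError("expected a character in g..v")
        vpart, n = LEAD_DIGITS.index(first), 1
        pos += 1
        while pos < len(encoded) and encoded[pos].lower() in HEX_DIGITS:
            vpart = vpart * 16 + HEX_DIGITS.index(encoded[pos].lower())
            n += 1
            pos += 1
        if vpart > 0xFFFF:
            raise ValueError("value does not fit in a 16-bit code unit")
        mask = ~((16 ** n) - 1)                # keep the nibbles above nibble N
        cu = (prev & mask) + vpart
        prev = cu
        out.append(chr(cu))
    return "".join(out)
```

Because the characters are lower-cased before lookup, "M45OIJ9" and "m45oij9" decode to the same string in this sketch, matching the case-insensitive behaviour of the DNS.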
+
+
+2.6.3 Reverse Transformation Algorithm
+
+ 1. Break the string into dot-separated components and apply steps
+ 2 and 3 to each component that begins with the 'dq--' prefix;
+ components without the prefix are left unchanged;
+
+ 2. Remove the post converted name prefix 'dq--' (see Section 2.1);
+
+ 3. Decompress the component using the decompression algorithm
+ described above;
+
+ 4. Concatenate the decoded segments with dot separators and return.
+
+3. Examples
+
+The examples below illustrate the encoding algorithm and provide
+comparisons to alternate encoding schemes. UTF-5 sequences are
+prefixed with '----', as no ACE prefix was defined for that encoding.
+
+3.1 'www.walid.com' (in Arabic):
+
+ UTF-16: U+0645 U+0648 U+0642 U+0639 . U+0648 U+0644 U+064A U+062F .
+ U+0634 U+0631 U+0643 U+0629
+
+ DUDE: dq--m45oij9.dq--m48kqif.dq--m34hk3i9
+
+ UTF-6: wq--ymk5k8k2j9.wq--ymk8k4kaif.wq--ymj4j1k3i9
+
+ UTF-5: ----m45m48m42m39.----m48m44m4am2f.----m34m31m43m29
+
+ RACE: bq--azcuqqrz.bq--azeeisrp.bq--ay2dcqzj
+
+ LACE: bq--aqdekscche.bq--aqdeqrckf5.bq--aqddimkdfe
+
+ (more examples to come)
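To make the DUDE line of the example easier to check by hand, the following sketch prints, for each code unit, the exclusive OR with the previous unit (D), the differential length (N), and the characters emitted. The function name trace is hypothetical, and the sketch does not handle hyphens since the example contains none.

```python
LEAD_DIGITS = "ghijklmnopqrstuv"


def trace(part: str):
    """Return (code point, D, N, emitted) rows for one hostname part."""
    rows, prev = [], 0
    for ch in part:
        c = ord(ch)
        d = prev ^ c
        n = 1
        while d >> (4 * n):
            n += 1
        digits = f"{c & ((1 << (4 * n)) - 1):0{n}x}"
        emitted = LEAD_DIGITS[int(digits[0], 16)] + digits[1:]
        rows.append((f"U+{c:04X}", f"{d:04x}", n, emitted))
        prev = c
    return rows


for row in trace("\u0645\u0648\u0642\u0639"):
    print(*row)
```

For the first label this gives the rows (U+0645, 0645, 3, m45), (U+0648, 000d, 1, o), (U+0642, 000a, 1, i) and (U+0639, 007b, 2, j9), which concatenate to "m45oij9".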
+
+4. DUDE Extensions
+
+The first extension to the DUDE concept recognizes that the first
+character emitted by the variable length hex encoding algorithm is
+always alphabetic. We encode the case (if any) of the original Unicode
+character in the case of the initial "hex" character. Because the DNS
+performs case-insensitive comparisons, mixed case international domain
+names behave in exactly the same way as traditional domain names.
+In particular, this enables reverse lookups to return names in the
+preferred case.
+
+The second extension concerns the treatment of Unicode surrogate
+characters. If surrogates are not expanded, two 16-bit surrogates are
+needed to represent a single codepoint in the range of 0x10000
+through 0x10FFFF. This cuts the worst case limits in half for most
+proposals. We will assume that our input and output Unicode are in
+UTF-32 format -- that is, any surrogates are expanded to their UCS-4
+equivalents. If the input codes all fall under 0x10000, then the
+extended method will emit the same length string as the basic method.
+One final modification takes note of the fact that the only
+codepoints forcing the use of six hex digits are those with a "10"
+as the fifth and sixth digits. We will encode the fifth digit using
+a seventeenth digit as a special case to avoid this extra expansion.
+
+4.1 Extended DUDE Encoding
+
+4.1.1 Modified Extended Variable Length Hex Encoding
+
+The modified extended variable length hex encoding of an integer C to
+length N with case U is performed as follows:
+
+ 1. If C > 0x10FFFF return error status;
+
+ 2. If N < 6 go to step 5; (this is true for characters from
+ the first 16 Planes)
+
+ 3. If U is 'Uppercase' then emit 'W'
+ else emit 'w'; (special case for the 17th Plane)
+
+ 4. go to step 7;
+
+ 5. Let I be the Nth nibble from the right of C;
+
+ 6. If U is 'Uppercase'
+ then emit the Ith character of sequence [GHIJKLMNOPQRSTUV],
+ else emit the Ith character of sequence [ghijklmnopqrstuv];
+
+ 7. Let N = N - 1;
+
+ 8. Continue from N to 1, encoding each remaining nibble, J, by
+ emitting the Jth character of sequence [0123456789abcdef].
+
+
+4.1.2 Extended Compression Algorithm
+
+ 1. Let PREV = 0;
+
+ 2. If there are no more characters in the input, terminate successfully;
+
+ 3. Let U be the case of the next character in the input;
+ Let C be the lowercase value of the next input character;
+
+ 4. If C != '-', then go to step 6;
+
+ 5. Consume the input character, emit '-', and go to step 2;
+
+ 6. Let D be the result of PREV exclusive ORed with C;
+
+ 7. Find the least positive value N such that
+ D bitwise ANDed with M is zero
+ where M = the bitwise complement of (16**N) - 1;
+
+ 8. Let V = C ANDed with the bitwise complement of M;
+
+ 9. Emit the modified variable length hex encoding of V to length
+ N with case U;
+
+ 10. Consume the input character, let PREV = C, and go to step 2.
+
+4.1.3 Extended Forward Transformation Algorithm
+
+The overall extended encoding algorithm is as follows:
+
+ 1. Break the hostname string into dot-separated hostname parts.
+ For each hostname part, perform steps 2 and 3 below;
+
+ 2. Compress the component using the method described in section
+ 4.1.2 above, and encode using the encoding described in section
+ 4.1.1;
+
+ 3. Prepend the post-converted name prefix 'dq--' (see section 2.1
+ above) to the resulting string.
+
+4.2 Extended DUDE Decoding
+
+4.2.1 Modified Extended Variable Length Hex Decoding
+
+ 1. Let U be the case of the next input character,
+ Let C0 be the lower case of the next input character;
+
+ 2. If C0 is not in set [ghijklmnopqrstuvw] then return error status,
+ else, consume the input character;
+
+ 3. Let R = C0 - 'g'
+ Let N = 1;
+
+ 4. If no more input characters exist then go to step 8;
+
+ 5. Let CL be the lower case of the next input character,
+ If CL is not in set [0123456789abcdef] then go to step 8;
+
+ 6. Consume the next input character,
+ Let N = N + 1,
+ Let R = R * 16,
+ If CL is in set [0-9]
+ then let R = R + (CL - '0')
+ else let R = R + (CL - 'a') + 10;
+
+ 7. Go to step 4;
+
+ 8. If R < 0x100000 then go to step 10;
+
+ 9. Let N = N + 1,
+ If (N > 6) or (C0 != 'w')
+ then return error status;
+
+ 10. Let MASK be the bitwise complement of (16**N) - 1. Return
+ result R, MASK, and U.
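The following Python sketch follows the ten decoding steps literally. It assumes that the set of permitted first characters includes 'v' (the digit for 15) in addition to the special character 'w'. The function name, the boolean used for the case U, and the returned position are assumptions of this sketch.

```python
LEAD_DIGITS_17 = "ghijklmnopqrstuvw"  # 'w' acts as a seventeenth digit (16)
HEX_DIGITS = "0123456789abcdef"


def modified_vlhex_decode(s: str, pos: int = 0):
    """Return (R, MASK, is_upper, next position) for the token at s[pos]."""
    if pos >= len(s):
        raise ValueError("no input")
    is_upper = s[pos].isupper()           # U: case carried by the first character
    c0 = s[pos].lower()
    if c0 not in LEAD_DIGITS_17:
        raise ValueError("token must start with a character in g..w")
    r, n = LEAD_DIGITS_17.index(c0), 1
    pos += 1
    while pos < len(s) and s[pos].lower() in HEX_DIGITS:
        r = r * 16 + HEX_DIGITS.index(s[pos].lower())
        n += 1
        pos += 1
    if r >= 0x100000:
        # Only 'w' followed by four hex digits may reach the 17th plane.
        n += 1
        if n > 6 or c0 != "w":
            raise ValueError("value out of range")
    mask = ~((16 ** n) - 1)
    return r, mask, is_upper, pos
```

In this reading, "w1234" decodes to 0x101234 with a six-nibble mask, while "M45" decodes to 0x645 with the uppercase flag set.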
+
+4.2.2 Extended Decompression Algorithm
+
+ 1. Let PREV = 0;
+
+ 2. If there are no more input characters then terminate successfully;
+
+ 3. Let C be the next input character;
+
+ 4. If C == '-', append '-' to the result
+ string, consume the character, and go to step 2;
+
+ 5. Let VPART, MASK, and U be the result of the modified extended
+ variable length decoded value;
+
+ 6. Let CU = (PREV 'bitwise AND' MASK) + VPART,
+ Let PREV = CU;
+
+ 7. If U == 'Uppercase' then let CU = the corresponding upper case value
+ of CU;
+
+ 8. Append CU to the result string and go to step 2.
+
+4.2.3 Extended Reverse Transformation Algorithm
+
+ 1. Break the string into dot-separated components and apply steps
+ 2 and 3 to each component;
+
+ 2. Remove the post converted name prefix 'dq--' (see Section 2.1);
+
+ 3. Decompress the component using the extended decompression
+ algorithm described in section 4.2.2 above;
+
+ 4. Concatenate the decoded segments with dot separators and return.
+
+Note that DUDE decoding will return an error for input strings which
+do not comply with RFC1035.
+
+5. Security Considerations
+
+Much of the security of the Internet relies on the DNS and any
+change to the characteristics of the DNS may change the security of
+much of the Internet. Therefore DUDE makes no changes to the DNS itself.
+
+DUDE is designed so that distinct Unicode sequences map to distinct
+domain name sequences (modulo the Unicode and DNS equivalence rules).
+Therefore use of DUDE with DNS will not negatively affect security.
+
+
+6. References
+
+[IDNCOMP] Paul Hoffman, "Comparison of Internationalized Domain Name
+Proposals", draft-ietf-idn-compare;
+
+[IDNRACE] Paul Hoffman, "RACE: Row-Based ASCII Compatible Encoding for
+IDN", draft-ietf-idn-race;
+
+[IDNREQ] James Seng, "Requirements of Internationalized Domain Names",
+draft-ietf-idn-requirement;
+
+[IDNNAMEPREP] Paul Hoffman and Marc Blanchet, "Preparation of
+Internationalized Host Names", draft-ietf-idn-nameprep;
+
+[IDNDUERST] M. Duerst, "Internationalization of Domain Names",
+draft-duerst-dns-i18n;
+
+[ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information
+technology -- Universal Multiple-Octet Coded Character Set (UCS) --
+Part 1: Architecture and Basic Multilingual Plane. Five amendments and
+a technical corrigendum have been published up to now. UTF-16 is
+described in Annex Q, published as Amendment 1. 17 other amendments are
+currently at various stages of standardization;
+
+[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
+Requirement Levels", March 1997, RFC 2119;
+
+[STD13] Paul Mockapetris, "Domain names - implementation and
+specification", November 1987, STD 13 (RFC 1035);
+
+[UNICODE3] The Unicode Consortium, "The Unicode Standard -- Version
+3.0", ISBN 0-201-61633-5. Described at
+<http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>.
+
+
+A. Acknowledgements
+
+The structure (and some of the structural text) of this document is
+intentionally borrowed from the LACE IDN draft (draft-ietf-idn-lace-00)
+by Mark Davis and Paul Hoffman.
+
+B. IANA Considerations
+
+There are no IANA considerations in this document.
+
+
+C. Author Contact Information
+
+Mark Welter
+Brian W. Spolarich
+WALID, Inc.
+State Technology Park
+2245 S. State St.
+Ann Arbor, MI 48104
++1-734-822-2020
+
+mwelter@walid.com
+briansp@walid.com
+