Diffstat (limited to 'doc/draft/draft-ietf-idn-lace-00.txt')
-rw-r--r-- doc/draft/draft-ietf-idn-lace-00.txt | 522
1 files changed, 522 insertions, 0 deletions
diff --git a/doc/draft/draft-ietf-idn-lace-00.txt b/doc/draft/draft-ietf-idn-lace-00.txt
new file mode 100644
index 00000000..464b8755
--- /dev/null
+++ b/doc/draft/draft-ietf-idn-lace-00.txt
@@ -0,0 +1,522 @@
+Internet Draft Mark Davis
+draft-ietf-idn-lace-00.txt IBM
+November 6, 2000 Paul Hoffman
+Expires May 6, 2001 IMC & VPNC
+
+ LACE: Length-based ASCII Compatible Encoding for IDN
+
+Status of this memo
+
+This document is an Internet-Draft and is in full conformance with all
+provisions of Section 10 of RFC2026.
+
+Internet-Drafts are working documents of the Internet Engineering Task
+Force (IETF), its areas, and its working groups. Note that other
+groups may also distribute working documents as Internet-Drafts.
+
+Internet-Drafts are draft documents valid for a maximum of six months
+and may be updated, replaced, or obsoleted by other documents at any
+time. It is inappropriate to use Internet-Drafts as reference
+material or to cite them other than as "work in progress."
+
+ The list of current Internet-Drafts can be accessed at
+ http://www.ietf.org/ietf/1id-abstracts.txt
+
+ The list of Internet-Draft Shadow Directories can be accessed at
+ http://www.ietf.org/shadow.html.
+
+
+
+Abstract
+
+This document describes a transformation method for representing
+non-ASCII characters in host name parts in a fashion that is completely
+compatible with the current DNS. It is a potential candidate for an
+ASCII-Compatible Encoding (ACE) for internationalized host names, as
+described in the comparison document from the IETF IDN Working Group.
+This method is based on the observation that many internationalized host
+name parts will have a few substrings from a small number of rows of the
+ISO 10646 repertoire. Run-length encoding for these types of
+host names will be fairly compact, and is fairly easy to describe.
+
+
+1. Introduction
+
+There is a strong world-wide desire to use characters other than plain
+ASCII in host names. Host names have become the equivalent of business
+or product names for many services on the Internet, so there is a need
+to make them usable by people whose native scripts are not representable
+by ASCII. The requirements for internationalizing host names are
+described in the IDN WG's requirements document, [IDNReq].
+
+The IDN WG's comparison document [IDNComp] describes three potential
+main architectures for IDN: arch-1 (just send binary), arch-2 (send
+binary or ACE), and arch-3 (just send ACE). LACE is an ACE, called
+Length-based ACE or LACE, that can be used with protocols that match arch-2
+or arch-3. LACE specifies an ACE format as specified in ace-1 in
+[IDNComp]. Further, it specifies an identifying mechanism for ace-2 in
+[IDNComp], namely ace-2.1.1 (add hopefully-unique legal tag to the
+beginning of the name part).
+
+In formal terms, LACE describes a character encoding scheme of the
+ISO/IEC 10646 [ISO10646] coded character set (whose assignment of
+characters is synchronized with Unicode [Unicode3]) and the rules for
+using that scheme in the DNS. As such, it could also be called a
+"charset" as defined in [IDNReq].
+
+The LACE protocol has the following features:
+
+- There is exactly one way to convert internationalized host parts to
+and from LACE parts. Host name part uniqueness is preserved.
+
+- Host parts that have no international characters are not changed.
+
+- Names using LACE can include more internationalized characters than
+with other ACE protocols that have been suggested to date. LACE-encoded
+names are variable length, depending on the number of transitions
+between rows in the ISO 10646 repertoire that appear in the name part.
+Name parts that cannot be compressed using run-length encoding can have
+up to 17 characters, and names that can be compressed can have up to 35
+characters. Further, a name that has just a few row transitions
+typically can have over 30 characters.
+
+It is important to note that the following sections contain many
+normative statements with "MUST" and "MUST NOT". Any implementation that
+does not follow these statements exactly is likely to cause damage to
+the Internet by creating non-unique representations of host names.
+
+1.1 Terminology
+
+The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
+"MAY" in this document are to be interpreted as described in RFC 2119
+[RFC2119].
+
+Hexadecimal values are shown preceded with an "0x". For example,
+"0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values are
+shown preceded with an "0b". For example, a nine-bit value might be
+shown as "0b101101111".
+
+Examples in this document use the notation from the Unicode Standard
+[Unicode3] as well as the ISO 10646 names. For example, the letter "a"
+may be represented as either "U+0061" or "LATIN SMALL LETTER A".
+
+LACE converts strings with internationalized characters into
+strings of US-ASCII that are acceptable as host name parts in current
+DNS host naming usage. The former are called "pre-converted" and the
+latter are called "post-converted".
+
+1.2 IDN summary
+
+Using the terminology in [IDNComp], LACE specifies an ACE format as
+specified in ace-1. Further, it specifies an identifying mechanism for
+ace-2, namely ace-2.1.1 (add hopefully-unique legal tag to the beginning
+of the name part).
+
+LACE has the following length characteristics. In this list, "row" means
+a row from ISO 10646.
+
+- LACE-encoded names are variable length, depending on the number of
+transitions between rows that appear in the name part.
+
+- Name parts that cannot be compressed using run-length encoding can
+have up to 17 characters.
+
+- Names that can be compressed can have up to 35 characters.
+
+- A name that has just a few row transitions typically can have over 30
+characters.
+
+
+2. Host Part Transformation
+
+According to [STD13], host parts must be case-insensitive, start and
+end with a letter or digit, and contain only letters, digits, and the
+hyphen character ("-"). This, of course, excludes any internationalized
+characters, as well as many other characters in the ASCII character
+repertoire. Further, domain name parts must be 63 octets or shorter in
+length.
+
+2.1 Name tagging
+
+All post-converted name parts that contain internationalized characters
+begin with the string "bq--". (Of course, because host name parts are
+case-insensitive, this might also be represented as "Bq--" or "bQ--" or
+"BQ--".) The string "bq--" was chosen because it is extremely unlikely
+to exist in host parts before this specification was produced. As a
+historical note, in late August 2000, none of the second-level host name
+parts in any of the .com, .edu, .net, and .org top-level domains began
+with "bq--"; there are many tens of thousands of other strings of three
+characters followed by a hyphen that have this property and could be
+used instead. The string "bq--" will change to other strings with the
+same properties in future versions of this draft.
+
+Note that a zone administrator might still choose to use "bq--" at the
+beginning of a host name part even if that part does not contain
+internationalized characters. Zone administrators SHOULD NOT create host
+part names that begin with "bq--" unless those names are post-converted
+names. Creating host part names that begin with "bq--" but that are not
+post-converted names may cause two distinct problems. Some display
+systems, after converting the post-converted name part back to an
+internationalized name part, might display the name parts in a
+possibly-confusing fashion to users. More seriously, some resolvers,
+after converting the post-converted name part back to an
+internationalized name part, might reject the host name if it contains
+illegal characters.
+
+2.2 Converting an internationalized name to an ACE name part
+
+To convert a string of internationalized characters into an ACE name
+part, the following steps MUST be performed in the exact order of the
+subsections given here.
+
+If a name part consists exclusively of characters that conform to the
+host name requirements in [STD13], the name MUST NOT be converted to
+LACE. That is, a name part that can be represented without LACE MUST NOT
+be encoded using LACE. This absolute requirement prevents there from
+being two different encodings for a single DNS host name.
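A check for this rule might look like the following Python sketch. The name `is_plain_host_part` and the regular expression are illustrative assumptions, not part of this draft; the pattern encodes the [STD13] rules as summarized in section 2 (letters, digits, and hyphen only, a letter or digit at each end, at most 63 octets).

```python
import re

# Hypothetical helper: one letter or digit, optionally followed by up to 61
# letters, digits, or hyphens and a final letter or digit (63 octets at most).
_LDH_PART = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_plain_host_part(part: str) -> bool:
    """Return True if `part` already satisfies the host name rules of section 2."""
    return _LDH_PART.fullmatch(part) is not None
```

A converter would call such a check first and pass the name part through unchanged when it returns True.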
+
+If any checking for prohibited name parts (such as ones that contain
+prohibited characters), case-folding, or canonicalization is to be done,
+it MUST be done before doing the conversion to an ACE name part.
+
+The input name string consists of characters from the ISO 10646
+character set in big-endian UTF-16 encoding. This is the pre-converted
+string.
+
+Characters outside the first plane of characters
+(those with codepoints above U+FFFF) MUST be represented using surrogates, as
+described in the UTF-16 description in ISO 10646.
+
+2.2.1 Compress the pre-converted string
+
+The entire pre-converted string MUST be compressed using the compression
+algorithm specified in section 2.4. The result of this step is the
+compressed string.
+
+2.2.2 Check the length of the compressed string
+
+The compressed string MUST be 36 octets or shorter. If the compressed
+string is 37 octets or longer, the conversion MUST stop with an error.
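The draft does not say where the 36-octet limit comes from. One plausible reading, sketched here in Python, is that 36 octets is the largest compressed string whose Base32 form plus the "bq--" tag still fits in the 63-octet limit on a name part. The function name `encoded_length` is an assumption made for this illustration.

```python
MAX_PART_OCTETS = 63  # limit on a domain name part, from section 2
TAG = "bq--"          # tag from section 2.1


def encoded_length(compressed_octets: int) -> int:
    """Length of the tagged name part for a compressed string of the given size."""
    # Base32 emits one character per five bits, rounding the last group up.
    base32_chars = (compressed_octets * 8 + 4) // 5
    return len(TAG) + base32_chars


# 36 octets -> 58 Base32 characters -> 62 with the tag, which fits in 63.
# 37 octets -> 60 Base32 characters -> 64 with the tag, which does not.
fits_36 = encoded_length(36) <= MAX_PART_OCTETS
fits_37 = encoded_length(37) <= MAX_PART_OCTETS
```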
+
+2.2.3 Encode the compressed string with Base32
+
+The compressed string MUST be converted using the Base32 encoding
+described in section 2.5. The result of this step is the encoded string.
+
+2.2.4 Prepend "bq--" to the encoded string and finish
+
+Prepend the characters "bq--" to the encoded string. This is the host
+name part that can be used in DNS resolution.
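The four steps of section 2.2 can be strung together as in the following Python sketch. It is a reading of this draft, not a reference implementation: the names `to_lace`, `_compress`, and `_base32` are assumptions, the compression helper advances by 2*COUNT octets per run (which is what the examples in section 2.4.3 require), and Base32 padding uses zero bits as in the example in section 2.5.3. No prohibited-character checking, case-folding, or canonicalization is attempted.

```python
import re

TAG = "bq--"
BASE32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567"  # Table 1, in bit order
_LDH_PART = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def _compress(data: bytes) -> bytes:
    """Section 2.4.1 applied to big-endian UTF-16 octets."""
    out, ip = bytearray(), 0
    while ip < len(data):
        high, count = data[ip], 0
        # Count consecutive pairs, starting at ip, that share the high octet.
        while ip + 2 * count < len(data) and data[ip + 2 * count] == high:
            count += 1
        out += bytes([count, high]) + data[ip + 1 : ip + 2 * count : 2]
        ip += 2 * count
    # Keep the run-length form only if it is not longer than the input.
    return bytes(out) if len(out) <= len(data) else b"\xff" + data


def _base32(data: bytes) -> str:
    """Section 2.5.1: five bits per character, zero-padded on the right."""
    bits = "".join(format(octet, "08b") for octet in data)
    bits += "0" * (-len(bits) % 5)
    return "".join(BASE32_ALPHABET[int(bits[i : i + 5], 2)]
                   for i in range(0, len(bits), 5))


def to_lace(part: str) -> str:
    """Convert one pre-converted name part to its post-converted form."""
    if not part:
        raise ValueError("empty name part")
    if _LDH_PART.fullmatch(part):
        return part  # MUST NOT be converted (section 2.2)
    compressed = _compress(part.encode("utf-16-be"))  # 2.2.1
    if len(compressed) > 36:                          # 2.2.2
        raise ValueError("compressed string is longer than 36 octets")
    return TAG + _base32(compressed)                  # 2.2.3 and 2.2.4
```

Python's `utf-16-be` codec emits surrogate pairs for characters above U+FFFF and writes no byte order mark, which matches the input form this section asks for.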
+
+2.3 Converting a host name part to an internationalized name
+
+The input string for conversion is a valid host name part. Note that if
+any checking for prohibited name parts (such as prohibited characters),
+case-folding, or canonicalization is to be done, it MUST be done after
+doing the conversion from an ACE name part.
+
+If a decoded name part consists exclusively of characters that conform
+to the host name requirements in [STD13], the conversion from LACE MUST
+fail. Because a name part that can be represented without LACE MUST NOT
+be encoded using LACE, the decoding process MUST check for name parts
+that consist exclusively of characters that conform to the host name
+requirements in [STD13] and, if such a name part is found, it MUST be
+considered an error (and possibly a security violation).
+
+2.3.1 Strip the "bq--"
+
+The input string MUST begin with the characters "bq--". If it does not,
+the conversion MUST stop with an error. Otherwise, remove the characters
+"bq--" from the input string. The result of this step is the stripped
+string.
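A minimal Python sketch of this step follows. The function name `strip_tag` is an assumption; the comparison ignores case because section 2.1 notes that the tag may also appear as "Bq--", "bQ--", or "BQ--".

```python
def strip_tag(part: str, tag: str = "bq--") -> str:
    """Remove the LACE tag from a host name part, or fail if it is absent."""
    if part[: len(tag)].lower() != tag:
        raise ValueError("name part does not begin with the LACE tag")
    return part[len(tag):]
```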
+
+2.3.2 Decode the stripped string with Base32
+
+The entire stripped string MUST be checked to see if it is valid Base32
+output. The entire stripped string MUST be changed to all lower-case
+letters and digits. If any resulting characters are not in Table 1, the
+conversion MUST stop with an error; the input string is the
+post-converted string. Otherwise, the entire resulting string MUST be
+converted to a binary format using the Base32 decoding described in
+section 2.5. The result of this step is the decoded string.
+
+2.3.3 Decompress the decoded string
+
+The entire decoded string MUST be converted to ISO 10646 characters
+using the decompression algorithm described in section 2.4. The result
+of this is the internationalized string.
+
+2.4 Compression algorithm
+
+The basic method for compression is to reduce a substring that consists
+of characters all from a single row of the ISO 10646 repertoire to a
+count octet followed by the row header followed by the lower octets of
+the characters. If this ends up being longer than the input, the string
+is not compressed, but instead has a unique one-octet header attached.
+
+Although the uncompressed mode limits the number of characters in a LACE
+name part to 17, this is still generally enough for almost all names in
+almost all scripts. Also, this limit is close to the limits set by other
+encoding proposals.
+
+Note that the compression and decompression rules MUST be followed
+exactly. This requirement prevents a single host name part from having
+two encodings. Thus, for any input to the algorithm, there is only one
+possible output. An implementation cannot choose between the compressed
+form and the uncompressed form using anything other than the logic given
+in this section.
+
+2.4.1 Compressing a string
+
+The input string is in big-endian UTF-16 encoding with no byte order
+mark.
+
+Design note: No checking is done on the input to this algorithm. It is
+assumed that all checking for valid ISO/IEC 10646 characters has already
+been done by a previous step in the conversion process.
+
+1) If the length of the input is not even, or is less than 2, stop with
+an error.
+
+2) Set the input pointer, called IP, to the first octet of the input
+string.
+
+3) Set the variable called HIGH to the octet at IP.
+
+4) Determine the number of consecutive pairs, starting at IP, that have
+HIGH as the first octet; call this COUNT.
+
+5) Put into an output buffer the single octet for COUNT followed by the
+single octet for HIGH, followed by all those low octets. Move IP to the
+end of those pairs; that is, set IP to IP+(2*COUNT).
+
+6) If IP is not at the end of the input string, go to step 3.
+
+7) If the length of the output buffer is less than or equal to the
+length of the input buffer (in octets, not in characters), output the
+buffer. Otherwise, output the octet 0xFF followed by the input buffer.
+Note that there can only be one possible representation for a name part,
+so that outputting the wrong name part is a serious security error.
+Decompression schemes MUST accept only the valid form and MUST NOT
+accept invalid forms.
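The compression steps can be sketched in Python as follows. This is one reading of the steps, not the authors' code: the name `lace_compress` is an assumption, COUNT is taken to mean the run of consecutive pairs starting at IP, and IP advances by 2*COUNT octets per run, which is what the examples in section 2.4.3 require.

```python
def lace_compress(data: bytes) -> bytes:
    """Compress big-endian UTF-16 octets (no byte order mark)."""
    # Step 1: the input must hold at least one whole 16-bit code unit.
    if len(data) < 2 or len(data) % 2 != 0:
        raise ValueError("input must be a non-empty run of 16-bit code units")
    out = bytearray()
    ip = 0                                   # step 2
    while ip < len(data):                    # step 6 loops back to step 3
        high = data[ip]                      # step 3
        count = 0                            # step 4
        while ip + 2 * count < len(data) and data[ip + 2 * count] == high:
            count += 1
        # Step 5: COUNT, HIGH, then the low octet of every pair in the run.
        # A run of more than 255 pairs cannot be stored in one octet;
        # bytearray.append raises ValueError in that case.
        out.append(count)
        out.append(high)
        out.extend(data[ip + 1 : ip + 2 * count : 2])
        ip += 2 * count
    # Step 7: keep the run-length form unless it is longer than the input.
    if len(out) <= len(data):
        return bytes(out)
    return b"\xff" + data
```

For instance, `lace_compress(bytes.fromhex("012E00D0014A"))` falls through to the last line, because the run-length form would take nine octets against six for the input.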
+
+
+2.4.2 Decompressing a string
+
+1. Set the input pointer, called IP, to the first octet of the input
+string. If there is no first octet, stop with an error.
+
+2. If the octet at IP is 0xFF, go to step 10.
+
+3. Get the octet at IP, call it COUNT. Set IP to IP+1. If IP is now at
+the end of the input string, stop with an error.
+
+4. Get the octet at IP, call it HIGH. Set IP to IP+1. If IP is now at
+the end of the input string, stop with an error.
+
+5. Get the octet at IP, call it LOW. Set IP to IP+1.
+
+6. Output HIGH, then LOW, to the output buffer.
+
+7. Decrement COUNT. If COUNT is greater than 0, go to step 5.
+
+8. If IP is not at the end of the input buffer, go to step 3.
+
+9. Compare the length of the input string with the length of the output
+buffer. If the length of the input string is longer than the length of
+the output buffer, stop with an error because the wrong compression form
+was used. Otherwise, send out the output buffer and stop.
+
+10. Set IP to IP+1. Copy the rest of the input buffer to the output
+buffer. Compress the output buffer into a separate comparison buffer
+following the steps for compression above. If the length of the
+comparison buffer is less than or equal to the length of the output
+buffer, stop with an error because the wrong compression form was used.
+Otherwise, send out the output buffer and stop.
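A Python sketch of decompression is given below, with several assumptions stated up front. The names `lace_decompress` and `_run_length_form` are illustrative. The length check in the compressed branch rejects input that is longer than its decompressed output, so that the examples in section 2.4.3 round-trip. The checks for a zero COUNT, a truncated run, and an odd-length uncompressed body are extra guards that the numbered steps do not spell out.

```python
def _run_length_form(data: bytes) -> bytes:
    """Steps 2 to 6 of section 2.4.1: the buffer before the length comparison."""
    out, ip = bytearray(), 0
    while ip < len(data):
        high, count = data[ip], 0
        while ip + 2 * count < len(data) and data[ip + 2 * count] == high:
            count += 1
        out += bytes([count, high]) + data[ip + 1 : ip + 2 * count : 2]
        ip += 2 * count
    return bytes(out)


def lace_decompress(data: bytes) -> bytes:
    """Return big-endian UTF-16 octets, or raise ValueError on an invalid form."""
    if not data:                                  # step 1
        raise ValueError("empty input")
    if data[0] == 0xFF:                           # step 2, then step 10
        body = data[1:]
        if len(body) < 2 or len(body) % 2 != 0:
            raise ValueError("uncompressed body is not whole code units")
        if len(_run_length_form(body)) <= len(body):
            raise ValueError("wrong compression form was used")
        return body
    out = bytearray()
    ip = 0
    while ip < len(data):                         # step 8 loops back to step 3
        if ip + 2 > len(data):
            raise ValueError("truncated run header")
        count, high = data[ip], data[ip + 1]      # steps 3 and 4
        ip += 2
        if count == 0 or ip + count > len(data):
            raise ValueError("truncated or empty run")
        for low in data[ip : ip + count]:         # steps 5 to 7
            out += bytes([high, low])
        ip += count
    if len(data) > len(out):                      # step 9
        raise ValueError("wrong compression form was used")
    return bytes(out)
```

A stricter decoder could recompress the output and require it to equal the input exactly; that goes beyond the length comparison described in the steps above.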
+
+2.4.3 Compression examples
+
+The five input characters <U+30E6 U+30CB U+30B3 U+30FC U+30C9> are
+represented in big-endian UTF-16 as the ten octets <30 E6 30 CB 30 B3 30
+FC 30 C9>. All the code units are in the same row (30). The output
+buffer has seven octets <05 30 E6 CB B3 FC C9>, which is shorter than
+the input string. Thus the output is <05 30 E6 CB B3 FC C9>.
+
+The four input characters <U+012E U+0110 U+014A U+00C5> are represented
+in big-endian UTF-16 as the eight octets <01 2E 01 10 01 4A 00 C5>. The
+output buffer has eight octets <03 01 2E 10 4A 01 00 C5>, which is the
+same length as the input string. Thus, the output is <03 01 2E 10 4A 01
+00 C5>.
+
+The three input characters <U+012E U+00D0 U+014A> are represented in
+big-endian UTF-16 as the six octets <01 2E 00 D0 01 4A>. The output
+buffer is nine octets <01 01 2E 01 00 D0 01 01 4A>, which is longer than
+the input buffer. Thus, the output is <FF 01 2E 00 D0 01 4A>.
+
+2.5 Base32
+
+In order to encode non-ASCII characters in DNS-compatible host name parts,
+they must be converted into legal characters. This is done with Base32
+encoding, described here.
+
+Table 1 shows the mapping between input bits and output characters in
+Base32. Design note: the digits used in Base32 are "2" through "7"
+instead of "0" through "5" in order to avoid digits "0" and "1". This
+helps reduce errors for users who are entering a Base32 stream and may
+misinterpret a "0" for an "O" or a "1" for an "l".
+
+ Table 1: Base32 conversion
+ bits char hex bits char hex
+ 00000 a 0x61 10000 q 0x71
+ 00001 b 0x62 10001 r 0x72
+ 00010 c 0x63 10010 s 0x73
+ 00011 d 0x64 10011 t 0x74
+ 00100 e 0x65 10100 u 0x75
+ 00101 f 0x66 10101 v 0x76
+ 00110 g 0x67 10110 w 0x77
+ 00111 h 0x68 10111 x 0x78
+ 01000 i 0x69 11000 y 0x79
+ 01001 j 0x6a 11001 z 0x7a
+ 01010 k 0x6b 11010 2 0x32
+ 01011 l 0x6c 11011 3 0x33
+ 01100 m 0x6d 11100 4 0x34
+ 01101 n 0x6e 11101 5 0x35
+ 01110 o 0x6f 11110 6 0x36
+ 01111 p 0x70 11111 7 0x37
+
+2.5.1 Encoding octets as Base32
+
+The input is a stream of octets. However, the octets are then treated
+as a stream of bits.
+
+Design note: The assumption that the input is a stream of octets
+(instead of a stream of bits) was made so that no padding was needed.
+If you are reusing this algorithm for a stream of bits, you must add a
+padding mechanism in order to differentiate different lengths of input.
+
+1) Set the read pointer to the beginning of the input bit stream.
+
+2) Look at the five bits after the read pointer. If there are not five
+bits, go to step 5.
+
+3) Look up the value of the set of five bits in the bits column of
+Table 1, and output the character from the char column (whose hex value
+is in the hex column).
+
+4) Move the read pointer five bits forward. If the read pointer is at
+the end of the input bit stream (that is, there are no more bits in the
+input), stop. Otherwise, go to step 2.
+
+5) Pad the bits seen with zero bits on the right until there are five bits.
+
+6) Look up the value of the set of five bits in the bits column of
+Table 1, and output the character from the char column (whose hex value
+is in the hex column).
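The encoding steps can be sketched in Python as shown here. The name `base32_encode` is an assumption, and the final group is padded with zero bits, which is what the worked example in section 2.5.3 does.

```python
BASE32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567"  # Table 1, in bit order


def base32_encode(data: bytes) -> str:
    """Encode octets as lower-case Base32 characters from Table 1."""
    # Treat the octets as one stream of bits.
    bits = "".join(format(octet, "08b") for octet in data)
    # Step 5: pad the final group with zero bits up to five bits.
    bits += "0" * (-len(bits) % 5)
    # Steps 2 to 4 and 6: map every five-bit group through Table 1.
    return "".join(BASE32_ALPHABET[int(bits[i : i + 5], 2)]
                   for i in range(0, len(bits), 5))
```

Building a bit string is slow for large inputs but keeps the sketch close to the wording of the steps; name parts are at most a few dozen octets.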
+
+2.5.2 Decoding Base32 as octets
+
+The input is octets in network byte order. The input octets MUST be
+values from the char column in Table 1.
+
+1) Set the read pointer to the beginning of the input octet stream.
+
+2) Look up the character value of the octet in the char column (or hex
+value in hex column) of Table 1, and output the five bits from the bits
+column.
+
+3) Move the read pointer one octet forward. If the read pointer is at
+the end of the input octet stream (that is, there are no more octets in
+the input), stop. Otherwise, go to step 2.
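A matching Python sketch of decoding follows. The name `base32_decode` is an assumption. The steps above produce five bits per character and do not say what happens to bits left over after the last whole octet; this sketch drops them, on the assumption that they are the padding added during encoding.

```python
BASE32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567"  # Table 1, in bit order


def base32_decode(text: str) -> bytes:
    """Decode Base32 characters from Table 1 back into octets."""
    groups = []
    for ch in text.lower():            # host name parts are case-insensitive
        index = BASE32_ALPHABET.find(ch)
        if index < 0:
            raise ValueError("character is not in Table 1")
        groups.append(format(index, "05b"))
    bits = "".join(groups)
    # Assumption: trailing bits short of a whole octet are encoder padding.
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i : i + 8], 2) for i in range(0, usable, 8))
```

A stricter decoder might also reject input whose leftover bits are not all zero, since such input could not have come from the encoder described in section 2.5.1.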
+
+2.5.3 Base32 example
+
+Assume you want to encode the value 0x3a270f93. The bit string is:
+
+3 a 2 7 0 f 9 3
+00111010 00100111 00001111 10010011
+
+Broken into chunks of five bits, this is:
+
+00111 01000 10011 10000 11111 00100 11
+
+Padding is added to make the last chunk five bits:
+
+00111 01000 10011 10000 11111 00100 11000
+
+The output of encoding is:
+
+00111 01000 10011 10000 11111 00100 11000
+ h i t q 7 e y
+or "hitq7ey".
+
+
+3. Security Considerations
+
+Much of the security of the Internet relies on the DNS. Thus, any
+change to the characteristics of the DNS can change the security of
+much of the Internet. Thus, LACE makes no changes to the DNS
+itself.
+
+Host names are used by users to connect to Internet servers. The
+security of the Internet would be compromised if a user entering a
+single internationalized name could be connected to different servers
+based on different interpretations of the internationalized host
+name.
+
+LACE is designed so that every internationalized host name part
+can be represented as one and only one DNS-compatible string. If there
+is any way to follow the steps in this document and get two or more
+different results, it is a severe and fatal error in the protocol.
+
+
+4. References
+
+[IDNComp] Paul Hoffman, "Comparison of Internationalized Domain Name Proposals",
+draft-ietf-idn-compare.
+
+[IDNReq] James Seng, "Requirements of Internationalized Domain Names",
+draft-ietf-idn-requirement.
+
+[ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information
+technology -- Universal Multiple-Octet Coded Character Set (UCS) --
+Part 1: Architecture and Basic Multilingual Plane. Five amendments and
+a technical corrigendum have been published up to now. UTF-16 is
+described in Annex Q, published as Amendment 1. 17 other amendments are
+currently at various stages of standardization. [[[ THIS REFERENCE
+NEEDS TO BE UPDATED AFTER DETERMINING ACCEPTABLE WORDING ]]]
+
+[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
+Requirement Levels", March 1997, RFC 2119.
+
+[STD13] Paul Mockapetris, "Domain names - implementation and
+specification", November 1987, STD 13 (RFC 1035).
+
+[Unicode3] The Unicode Consortium, "The Unicode Standard -- Version
+3.0", ISBN 0-201-61633-5. Described at
+<http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>.
+
+
+A. Acknowledgements
+
+Base32 is quite obviously inspired by the tried-and-true Base64
+Content-Transfer-Encoding from MIME.
+
+
+B. IANA Considerations
+
+There are no IANA considerations in this document.
+
+
+C. Author Contact Information
+
+Mark Davis
+IBM
+10275 N. De Anza Blvd
+Cupertino, CA 95014
+mark.davis@us.ibm.com and mark.davis@macchiato.com
+
+Paul Hoffman
+Internet Mail Consortium and VPN Consortium
+127 Segre Place
+Santa Cruz, CA 95060 USA
+paul.hoffman@imc.org and paul.hoffman@vpnc.org