- From: Chris Lilley <chris@w3.org>
- Date: Mon, 14 Oct 2002 09:05:39 +0200
- To: www-tag@w3.org, www-tag-request@w3.org, Elliotte Rusty Harold <elharo@metalab.unc.edu>
- CC: xml-names-editor@w3.org, www-international@w3.org
On Friday, October 11, 2002, 9:40:26 PM, Elliotte wrote:

ERH> A further thought on IRIs based on my experience today trying to add
ERH> support for them to XOM:

ERH> These things are complex. The process of taking a UTF-16 encoded Java
ERH> (or C++, or C#) string, encoding it in UTF-8, and then hex escaping
ERH> some of it, is non-trivial. It's absolutely doable, but it requires
ERH> way more knowledge of Unicode and the intricacies of various
ERH> encodings of the Unicode character set than most developers possess.

I found that report surprising. I don't dispute that you found it tricky,
since you say that you did (or worried that other, less skilled
practitioners would find it tricky), but the process seemed, on first
examination, well established to me:

1) Look at the Unicode FAQ (most programmers will go looking for a FAQ)
   http://www.unicode.org/unicode/faq/utf_bom.html

2) Go from there to the definition of UTF-8
   http://www.unicode.org/unicode/reports/tr27/#conformance
   http://www.unicode.org/unicode/reports/tr27/#notation

   Ah - okay, it's presented as a diff to Unicode 3.0 section 2.3,
   Encoding Forms
   http://www.unicode.org/unicode/uni2book/ch02.pdf
   I can see that might be an issue. It would be preferable to have one
   document that defined UTF-8 and how to convert to and from it.

3) Read this FAQ section:

   Q: Is it correct to interpret a surrogate pair encoded in UTF-8 as two
   separate 3-byte sequences? For example, interpreting the UTF-8
   sequence <ED A0 80 ED B0 80> as UTF-16 <D800 DC00> (equivalently as
   UTF-32 <00010000>)?

   A: There is a widespread practice of generating those types of
   sequences in older software, especially software which pre-dates the
   introduction of UTF-16. However, such an encoding is not conformant to
   UTF-8 as defined. The Unicode Technical Committee has debated this
   issue at length and is preparing a technical report on the subject.
   The encoding is referred to as CESU-8. A proposed draft technical
   report is available here: UTR #26: Compatibility Encoding Scheme for
   UTF-16. [MD]

This seems to tell the programmer that this is not UTF-8 but something
else, so they will lose interest in that and go and find out how to
generate UTF-8, not CESU-8.

The FAQ also has links to handy tutorial information:
http://www-106.ibm.com/developerworks/library/utfencodingforms/
and
http://www-106.ibm.com/developerworks/unicode/library/u-encode.html

ERH> Converting plane-1 characters encoded with surrogate pairs into UTF-8
ERH> is especially tricky. Most programmers will not know there's anything
ERH> special here they have to watch out for. This is very much an experts
ERH> only job.

I partly disagree there. From that second reference, this summary seems
terse yet complete and clearly warns the programmer about surrogates
.... ah. Though good in general, it predates Unicode 3.2.

   UTF-8 encoding

   UTF-8 encoding is variable-length, and characters are encoded with
   one, two, three, or four bytes. The first 128 characters of Unicode
   (BMP), U+0000 through U+007F, are encoded with a single byte, and are
   equivalent to ASCII. U+0080 through U+07FF (BMP) are encoded with two
   bytes, and U+0800 through U+FFFF (still BMP) are encoded with three
   bytes. The 1,048,576 characters of the 16 Supplementary Planes are
   encoded with four bytes.

   UTF-16 encoding

   UTF-16 encoding is a variable-length 16-bit representation. Each
   character is made up of one or two 16-bit units. In terms of bytes,
   each character is made up of two or four bytes.

   The single 16-bit portion of this encoding is used to encode the
   entire BMP, except for 2,048 code points known as "surrogates" that
   are used in pairs to encode the 1,048,576 characters of the 16
   Supplementary Planes. U+D800 through U+DBFF are the 1,024 high
   surrogates, and U+DC00 through U+DFFF are the 1,024 low surrogates. A
   high plus low surrogate (that is, two 16-bit units) represent a single
   character in the 16 Supplementary Planes.
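Just to make the surrogate point concrete, the conversion an IRI producer
has to get right looks roughly like the sketch below. This is purely
illustrative code of my own - the class and method names are invented, and
for brevity it only escapes the non-ASCII characters - but it shows where
the surrogate arithmetic has to happen so that a supplementary-plane
character comes out as a single four-byte UTF-8 sequence and not as CESU-8:

    public class IriEscaper {

        // Illustrative sketch only (names are mine, not from any spec):
        // turn the non-ASCII characters of an IRI, held as a Java String
        // (i.e. UTF-16), into %HH-escaped UTF-8 octets. A surrogate pair
        // is combined into one code point first, so a supplementary-plane
        // character becomes one four-byte UTF-8 sequence, not CESU-8.
        public static String escapeNonAscii(String iri) {
            StringBuffer out = new StringBuffer();
            for (int i = 0; i < iri.length(); i++) {
                int cp = iri.charAt(i);
                if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < iri.length()) {
                    char low = iri.charAt(i + 1);
                    if (low >= 0xDC00 && low <= 0xDFFF) {
                        // high + low surrogate -> supplementary code point
                        cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                        i++;
                    }
                }
                // (An unpaired surrogate reaching this point ought to be
                // reported as an error; that check is omitted here.)
                if (cp < 0x80) {
                    out.append((char) cp);      // ASCII passes through
                } else if (cp < 0x800) {        // two-byte UTF-8
                    escape(out, 0xC0 | (cp >> 6));
                    escape(out, 0x80 | (cp & 0x3F));
                } else if (cp < 0x10000) {      // three-byte UTF-8
                    escape(out, 0xE0 | (cp >> 12));
                    escape(out, 0x80 | ((cp >> 6) & 0x3F));
                    escape(out, 0x80 | (cp & 0x3F));
                } else {                        // four-byte UTF-8, planes 1-16
                    escape(out, 0xF0 | (cp >> 18));
                    escape(out, 0x80 | ((cp >> 12) & 0x3F));
                    escape(out, 0x80 | ((cp >> 6) & 0x3F));
                    escape(out, 0x80 | (cp & 0x3F));
                }
            }
            return out.toString();
        }

        // Append one escaped octet as %HH (the octet is always >= 0x80
        // here, so Integer.toHexString always yields two digits).
        private static void escape(StringBuffer out, int octet) {
            out.append('%');
            out.append(Integer.toHexString(octet).toUpperCase());
        }
    }

A real converter would of course also have to escape the ASCII characters
that are disallowed in URIs, and do so component by component rather than
over the whole reference.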
OK, you have convinced me that there is a need for a single, authoritative,
and up-to-date page dealing with the definition of UTF-8 and of the UTF-16
forms and their interconversion - specifically, those conversions required
by the generation and reading of IRIs.

ERH> Unfortunately, there is no support for this in the standard
ERH> libraries, at least in Java. Worse yet many of the functions that
ERH> allege to do part of this actually have various subtle bugs that
ERH> cause them to generate incorrect output. For instance, in Java 1.3
ERH> and earlier the URLEncoder class uses the platform default character
ERH> set instead of UTF-8. In Java 1.4, there's finally an option to
ERH> specify UTF-8; but if you don't, you still get the platform default
ERH> encoding. Even then, a programmer still has to break up an IRI into
ERH> parts and encode only some of them. For instance
ERH> URLEncoder.encode("http://www.yahoo.com:80/") will encode the colons
ERH> and the slashes, even though they should not be encoded.

I agree that Java has bugs there, from your description, and that
programmers might assume that these functions are correct and just call
them with default parameters.
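To show how easy the trap is to fall into, here is a tiny test of my own
(only an illustration; it uses nothing beyond java.net.URLEncoder). It
exhibits both problems at once: the one-argument form uses the platform
default encoding, and even the two-argument form added in 1.4 is designed
for form data, so it escapes the very delimiters that must be left alone:

    import java.io.UnsupportedEncodingException;
    import java.net.URLEncoder;

    public class UrlEncoderPitfall {
        public static void main(String[] args)
                throws UnsupportedEncodingException {
            String iri = "http://www.yahoo.com:80/";

            // Pre-1.4 form: uses the platform default character encoding,
            // so any non-ASCII characters would be escaped differently
            // from one machine to the next (this example happens to be
            // all ASCII, so the damage below is the same either way).
            System.out.println(URLEncoder.encode(iri));

            // 1.4 form: the encoding can at last be specified, but the
            // method is meant for x-www-form-urlencoded form data, so it
            // still escapes ':' and '/', printing
            //   http%3A%2F%2Fwww.yahoo.com%3A80%2F
            System.out.println(URLEncoder.encode(iri, "UTF-8"));
        }
    }

So even the corrected 1.4 API is only safe if the programmer has already
split the reference into components and knows which characters to leave
untouched in each one.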
ERH> I suspect, over time, if IRIs are adopted, the libraries will catch
ERH> up; and eventually the bugs will be worked out. However, we should be
ERH> prepared for a lot of buggy, non-conforming code in the meantime.
ERH> Worst case scenario: this will be like early HTML where
ERH> implementation bugs become standard features out of necessity.

Eek, I hope not. So, if the IRI spec had an appendix pointing to good
sources of information for implementers and specifically warning about
common programming errors, would that be an improvement? .... reads on ...
OK, you suggest the same thing. Good.

ERH> Some
ERH> older methods in Java to this day generate incorrect UTF-8 in the
ERH> name of backwards compatibility with errors made in Java 1.0 in 1995.

Sheesh.

ERH> One way to alleviate the problems: specs that specify IRIs (or
ERH> reinvent them as older, pre-IRI specs like XLink do) should include
ERH> detailed pseudo-code and perhaps actual code for making the
ERH> conversion to URIs. They should not rely on handwaving about
ERH> converting strings to UTF-8 and hex encoding certain bytes. The
ERH> conversion to UTF-8 will be screwed up, repeatedly. We've seen this
ERH> in many other APIs in the past, not the least of which is the Java
ERH> class library itself. It is important to warn implementers of the
ERH> location of the mines in the field they are about to cross.

Thanks for your report. Good call. This is less well specified (or at
least, less conveniently specified) than I had thought before following up
your report.

www-international is copied on this reply, since they are likely to be
able to contribute meaningfully to the discussion and the I18N WG can then
suggest an optimal resolution.

Were you intending to open a TAG issue, "IRI deployment at risk from UTF-8
implementation practice", or does this response meet your immediate needs
for a resolution mechanism?

--
Chris                            mailto:chris@w3.org

Received on Monday, 14 October 2002 03:05:43 UTC