
Re: IRIs everywhere (including XML namespaces)

From: Chris Lilley <chris@w3.org>
Date: Mon, 14 Oct 2002 09:05:39 +0200
Message-ID: <8890605515.20021014090539@w3.org>
To: www-tag@w3.org, www-tag-request@w3.org, Elliotte Rusty Harold <elharo@metalab.unc.edu>
CC: xml-names-editor@w3.org, www-international@w3.org

On Friday, October 11, 2002, 9:40:26 PM, Elliotte wrote:


ERH> A further thought on IRIs based on my experience today trying to add 
ERH> support for them to XOM:

ERH> These things are complex. The process of taking a UTF-16 encoded Java 
ERH> (or C++, or C#) string, encoding it in UTF-8, and then hex escaping 
ERH> some of it, is non-trivial. It's absolutely doable, but it requires 
ERH> way more knowledge of Unicode and the intricacies of various 
ERH> encodings of the Unicode character set than most developers possess.
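The core of the process Elliotte describes -- UTF-8 encode the string, then percent-escape the non-ASCII bytes -- can be sketched in a few lines of Java (a minimal illustration only; the class and method names are mine, and a real implementation would also have to escape disallowed ASCII characters such as spaces):

```java
import java.io.UnsupportedEncodingException;

public class IriEscape {
    // Percent-escape the non-ASCII characters of an IRI using their
    // UTF-8 bytes. ASCII, including reserved characters like ':' and
    // '/', is copied through untouched.
    static String escapeNonAscii(String iri) throws UnsupportedEncodingException {
        StringBuilder out = new StringBuilder();
        for (byte b : iri.getBytes("UTF-8")) {
            int v = b & 0xFF;
            if (v < 0x80) {
                out.append((char) v);                    // ASCII: copy as-is
            } else {
                out.append(String.format("%%%02X", v));  // escape the UTF-8 byte
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // U+00E9 (e-acute) is UTF-8 C3 A9, hence %C3%A9 twice below.
        System.out.println(escapeNonAscii("http://example.org/r\u00E9sum\u00E9"));
        // http://example.org/r%C3%A9sum%C3%A9
    }
}
```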

I found that report surprising. I don't dispute that you found it
tricky, since you say that you did (or worried that other less skilled
practitioners would find it tricky) but the process seemed, on first
examination, well established to me:

1) Look at the Unicode FAQ (most programmers will go looking for a FAQ)
http://www.unicode.org/unicode/faq/utf_bom.html

2) Go from there to the definition of UTF-8
http://www.unicode.org/unicode/reports/tr27/#conformance
http://www.unicode.org/unicode/reports/tr27/#notation
Ah - okay, it's presented as a diff to Unicode 3.0 section 2.3 Encoding
Forms
http://www.unicode.org/unicode/uni2book/ch02.pdf
I can see that might be an issue. It would be preferable to have one
document that defined UTF-8 and how to convert to and from it.

3) Read this FAQ section
Q: Is it correct to interpret a surrogate pair encoded in UTF-8 as two
separate 3-byte sequences? For example, interpreting the UTF-8
sequence <ED A0 80 ED B0 80> as UTF-16 <D800 DC00> (equivalently as
UTF-32 <00010000>)?

A: There is a widespread practice of generating those types of
sequences in older software, especially software which pre-dates the
introduction of UTF-16. However, such an encoding is not conformant to
UTF-8 as defined. The Unicode Technical Committee has debated this
issue at length and is preparing a technical report on the subject.
The encoding is referred to as CESU-8. A proposed draft technical
report is available here: UTR #26: Compatibility Encoding Scheme for
UTF-16. [MD]

Seems to tell the programmer that this is not UTF-8 but something
else, so they will lose interest in that and go find how to generate
UTF-8 not CESU-8.
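The difference is easy to demonstrate on a current JDK, where String.getBytes produces conformant UTF-8 for a surrogate pair (a small sketch; the class name is mine):

```java
public class Utf8NotCesu8 {
    public static void main(String[] args) throws Exception {
        // U+10000, the first supplementary-plane character, appears in
        // a Java (UTF-16) string as the surrogate pair <D800 DC00>.
        String s = "\uD800\uDC00";

        // Conformant UTF-8 encodes the single code point U+10000 as
        // the four bytes F0 90 80 80. CESU-8 would instead encode each
        // surrogate separately, giving the six bytes ED A0 80 ED B0 80.
        for (byte b : s.getBytes("UTF-8")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // F0 90 80 80
    }
}
```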

The FAQ also has links to handy tutorial information
http://www-106.ibm.com/developerworks/library/utfencodingforms/
and
http://www-106.ibm.com/developerworks/unicode/library/u-encode.html

ERH> Converting plane-1 characters encoded with surrogate pairs into UTF-8
ERH> is especially tricky. Most programmers will not know there's anything 
ERH> special here they have to watch out for. This is very much an experts 
ERH> only job.

I partly disagree there. From that second reference, this summary
seems terse yet complete and clearly warns the programmer about
surrogates ... ah. Though good in general, it predates Unicode 3.2.

UTF-8 encoding

UTF-8 encoding is variable-length, and characters are encoded with
one, two, three, or four bytes. The first 128 characters of Unicode
(BMP), U+0000 through U+007F, are encoded with a single byte, and are
equivalent to ASCII. U+0080 through U+07FF (BMP) are encoded with two
bytes, and U+0800 through U+FFFF (still BMP) are encoded with three
bytes. The 1,048,576 characters of the 16 Supplementary Planes are
encoded with four bytes.

UTF-16 encoding

UTF-16 encoding is a variable-length 16-bit representation. Each
character is made up of one or two 16-bit units. In terms of bytes,
each character is made up of two or four bytes. The single 16-bit
portion of this encoding is used to encode the entire BMP, except for
2,048 code points known as "surrogates" that are used in pairs to
encode the 1,048,576 characters of the 16 Supplementary Planes.

U+D800 through U+DBFF are the 1,024 high surrogates, and U+DC00
through U+DFFF are the 1,024 low surrogates. A high plus low surrogate
(that is, two 16-bit units) represent a single character in the 16
Supplementary Planes.
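The arithmetic implied by those ranges -- the step a programmer has to get right before UTF-8 encoding a supplementary character -- reduces to one formula (a sketch with a class name of my choosing; Java 5's Character.toCodePoint now performs the same computation):

```java
public class SurrogateMath {
    // Combine a high and a low surrogate into the supplementary code
    // point they represent, per the ranges quoted above: each
    // surrogate contributes 10 bits, offset from U+10000.
    static int toCodePoint(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        System.out.printf("U+%04X%n", toCodePoint('\uD800', '\uDC00')); // U+10000
        System.out.printf("U+%04X%n", toCodePoint('\uDBFF', '\uDFFF')); // U+10FFFF
    }
}
```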

OK, you have convinced me that there is a need for a single,
authoritative, and up to date page dealing with the definition of
UTF-8 and of the UTF-16 forms and their interconversion, specifically,
those conversions required by the generation and reading of IRIs.

ERH> Unfortunately, there is no support for this in the standard 
ERH> libraries, at least in Java. Worse yet many of the functions that 
ERH> allege to do part of this actually have various subtle bugs that 
ERH> cause them to generate incorrect output. For instance, in Java 1.3 
ERH> and earlier the URLEncoder class uses the platform default character 
ERH> set instead of UTF-8. In Java 1.4, there's finally an option to 
ERH> specify UTF-8; but if you don't, you still get the platform default 
ERH> encoding. Even then, a programmer still has to break up an IRI into 
ERH> parts and encode only some of them. For instance 
ERH> URLEncoder.encode("http://www.yahoo.com:80/") will encode the colons 
ERH> and the slashes, even though they should not be encoded.

I agree, from your description, that Java has bugs there, and that
programmers might assume that these functions are correct and just
call them with default parameters.
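The failure mode is easy to reproduce with the Java 1.4 two-argument form (a small demonstration; note too that URLEncoder is really a forms encoder, so it also turns spaces into '+' rather than '%20'):

```java
import java.net.URLEncoder;

public class EncoderPitfall {
    public static void main(String[] args) throws Exception {
        // Passing a whole URL through URLEncoder mangles the reserved
        // characters: ':' becomes %3A and '/' becomes %2F.
        System.out.println(URLEncoder.encode("http://www.yahoo.com:80/", "UTF-8"));
        // http%3A%2F%2Fwww.yahoo.com%3A80%2F

        // It is only safe on individual components, such as one query
        // value -- and even there it produces form encoding ('+').
        System.out.println(URLEncoder.encode("na\u00EFve caf\u00E9", "UTF-8"));
        // na%C3%AFve+caf%C3%A9
    }
}
```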

ERH> I suspect, over time, if IRIs are adopted, the libraries will catch 
ERH> up; and eventually the bugs will be worked out. However, we should be 
ERH> prepared for a lot of buggy, non-conforming code in the meantime. 
ERH> Worst case scenario: this will be like early HTML where 
ERH> implementation bugs become standard features out of necessity.

Eek, I hope not.   So, if the IRI spec had an appendix pointing to
good sources of information for implementers and specifically warning
about common programming errors, would that be an improvement? ....
reads on ... Ok you suggest the same thing. Good.

ERH> Some
ERH> older methods in Java to this day generate incorrect UTF-8 in the 
ERH> name of backwards compatibility with errors made in Java 1.0 in 1995.

Sheesh.
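One such method is DataOutputStream.writeUTF, whose "modified UTF-8" is documented behavior to this day: U+0000 comes out as the overlong pair C0 80, and supplementary characters come out CESU-8 style. A quick demonstration (class name mine):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(bos);
        dos.writeUTF("\u0000"); // a single NUL character

        // writeUTF emits a 2-byte length prefix (00 02) followed by
        // the overlong encoding C0 80 -- not the conformant single
        // byte 00 that real UTF-8 requires.
        for (byte b : bos.toByteArray()) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // 00 02 C0 80
    }
}
```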

ERH> One way to alleviate the problems: specs that specify IRIs (or 
ERH> reinvent them as older, pre-IRI specs like XLink do) should include 
ERH> detailed pseudo-code and perhaps actual code for making the 
ERH> conversion to URIs. They should not rely on handwaving about 
ERH> converting strings to UTF-8 and hex encoding certain bytes. The 
ERH> conversion to UTF-8 will be screwed up, repeatedly. We've seen this 
ERH> in many other APIs in the past, not the least of which is the Java 
ERH> class library itself. It is important to warn implementers of the 
ERH> location of the mines in the field they are about to cross.

Thanks for your report. Good call. This is less well specified, at
least, less conveniently specified, than I had thought before
following up your report.

www-international copied on this reply since they are likely to be
able to contribute meaningfully to the discussion and the I18N WG can
then suggest an optimal resolution.

Were you intending to open a TAG issue "IRI deployment at risk from
UTF-8 implementation practice", or does this response meet your
immediate needs for a resolution mechanism?

-- 
 Chris                            mailto:chris@w3.org
Received on Monday, 14 October 2002 03:05:43 GMT
