Re: IBM's XML 1.1 tests from Karl Waclawek on 2003-11-07 (public-xml-testsuite@w3.org from November 2003)

From: Karl Waclawek <karl@waclawek.net>
Date: Fri, 7 Nov 2003 16:20:20 -0500
To: <public-xml-testsuite@w3.org>
Cc: "Glenn Marcy" <gmarcy@us.ibm.com>, "Sandra Martinez" <sandra.martinez@nist.gov>, "Richard Tobin" <richard@cogsci.ed.ac.uk>
Message-ID: <006b01c3a574$f2cd5c40$9e539696@citkwaclaww2k>

> > > (1) ibm-valid-P02-ibm02v01.xml
> > > The UTF-8 code for LSEP (2028) in this file seems to be wrong.
> > > I believe it should be e2 80 a8, the file has e0 9f ac which is
> > > a non-shortest UTF-8 sequence for something else.
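
The byte values quoted above can be checked with a short Python sketch. Python is used here only for illustration and is not part of the test suite; the variable names are made up for this example.

```python
# U+2028 LINE SEPARATOR, the character the test file is meant to contain.
lsep_bytes = "\u2028".encode("utf-8")
print(lsep_bytes.hex())  # e280a8

# The three bytes actually found in the file.
found = bytes([0xE0, 0x9F, 0xAC])
try:
    found.decode("utf-8")  # strict error handling is the default
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)  # False: a strict decoder rejects e0 9f ac
```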

I am no expert on Unicode, but I sometimes need to understand it.
According to Table 3.1B of Unicode 3.2, the sequence e0 9f ac is not
a valid UTF-8 sequence. That much I understand. Up to this point I was
assuming that this table lets one check for valid/shortest sequences.

> > > [GM] Agree, a typo, the byte sequence corresponds to the character #x7EC
> > > and should be changed to e2 80 a8, but its still a valid document.
> >
> >It's not the shortest sequence for 7EC, so it's a UTF-8 error and
> >therefore not well-formed.

Now this confuses me. The UTF-8 table allows this sequence, but then it
cannot map to 7EC; it must map to somewhere in the range 1000 to CFFF.
So is it in fact a sequence for 7EC, and if so, where am I wrong?
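
For reference, the bit arithmetic behind the question can be sketched as follows. This is a minimal illustration, not code from any parser; the helper names raw_payload and is_shortest_form are hypothetical, and the sketch only covers the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx.

```python
def raw_payload(b0, b1, b2):
    """Concatenate the payload bits of a 1110xxxx 10xxxxxx 10xxxxxx pattern.

    No validity checks are made, so ill-formed input is accepted too.
    """
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_shortest_form(b0, b1, b2):
    """A 3-byte sequence is shortest only if its value needs 3 bytes (>= 0x800)."""
    return raw_payload(b0, b1, b2) >= 0x800


bad = raw_payload(0xE0, 0x9F, 0xAC)
good = raw_payload(0xE2, 0x80, 0xA8)
print(hex(bad), is_shortest_form(0xE0, 0x9F, 0xAC))   # 0x7ec False
print(hex(good), is_shortest_form(0xE2, 0x80, 0xA8))  # 0x2028 True
```

Read this way, the raw bits of e0 9f ac do add up to 7EC, but the value is below 800 and so fails the shortest-form check. That would be consistent with a table that only admits a0..bf as the second byte after e0.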

Karl

Received on Friday, 7 November 2003 16:20:29 UTC