- From: Karl Waclawek <karl@waclawek.net>
- Date: Fri, 7 Nov 2003 16:20:20 -0500
- To: <public-xml-testsuite@w3.org>
- Cc: "Glenn Marcy" <gmarcy@us.ibm.com>, "Sandra Martinez" <sandra.martinez@nist.gov>, "Richard Tobin" <richard@cogsci.ed.ac.uk>
> > > (1) ibm-valid-P02-ibm02v01.xml
> > > The UTF-8 code for LSEP (2028) in this file seems to be wrong.
> > > I believe it should be e2 80 a8, the file has e0 9f ac which is
> > > a non-shortest UTF-8 sequence for something else.

I am no expert on Unicode, but have sometimes the need to understand it.
According to table 3.1B of Unicode 3.2, the sequence e0 9f ac is not
a valid UTF-8 sequence. That much I understand. At this point I was
assuming that this table allows one to check valid/shortest sequences.

> > > [GM] Agree, a typo, the byte sequence corresponds to the character #x7EC
> > > and should be changed to e2 80 a8, but its still a valid document.
> >
> >It's not the shortest sequence for 7EC, so it's a UTF-8 error and
> >therefore not well-formed.

Now, this confuses me. The UTF-8 table allows this sequence, but it
cannot map to 7EC, but must map to somewhere in the range 1000 to CFFF.
So, is it now a sequence for 7EC, and if yes, where am I wrong?

Karl
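
The bit arithmetic behind the two byte sequences discussed above can be checked with a small sketch. This is only an illustration, not part of the test suite: the helper names (`decode_3byte_raw`, `is_shortest_3byte`, `strict_decode_ok`) are made up for this example, and the raw decoder deliberately skips the validity checks that a conforming UTF-8 decoder must apply.

```python
def decode_3byte_raw(b: bytes) -> int:
    """Extract the 16 payload bits of a 1110xxxx 10xxxxxx 10xxxxxx pattern.

    Hypothetical helper: performs NO shortest-form check, so it also
    "decodes" sequences that a conforming decoder has to reject.
    """
    if (len(b) != 3
            or (b[0] & 0xF0) != 0xE0
            or (b[1] & 0xC0) != 0x80
            or (b[2] & 0xC0) != 0x80):
        raise ValueError("not a 3-byte UTF-8 bit pattern")
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


def is_shortest_3byte(b: bytes) -> bool:
    """A 3-byte form is shortest only for values >= 0x800; anything
    below that fits in one or two bytes."""
    return decode_3byte_raw(b) >= 0x800


def strict_decode_ok(b: bytes) -> bool:
    """Ask Python's built-in strict UTF-8 decoder whether it accepts b."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


file_bytes = bytes([0xE0, 0x9F, 0xAC])   # what the test file contains
lsep_bytes = bytes([0xE2, 0x80, 0xA8])   # what was proposed instead

print(hex(decode_3byte_raw(file_bytes)))  # 0x7ec
print(is_shortest_3byte(file_bytes))      # False
print(strict_decode_ok(file_bytes))       # False
print(hex(decode_3byte_raw(lsep_bytes)))  # 0x2028
print(strict_decode_ok(lsep_bytes))       # True
```

Read this way, the payload bits of e0 9f ac do spell out 7EC, but since 7EC is below 800 the three-byte form is not the shortest one, which is consistent with both the [GM] remark and the reply to it. With lead byte e0, a second byte in the range a0..bf gives values from 800 to FFF.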
Received on Friday, 7 November 2003 16:20:29 UTC