- From: Alex Milowski <alex@milowski.com>
- Date: Sat, 9 Nov 2013 19:05:12 -0800
- To: "Eric Prud'hommeaux" <eric@w3.org>
- Cc: "public-rdf-comments@w3.org" <public-rdf-comments@w3.org>
This seems quite sufficient.  I'd like to re-check my implementation against
whatever changes were made to the test suite, but I'm not sure when I can make
that happen; at this point, probably not until sometime after mid-December.

On Sat, Nov 2, 2013 at 5:02 PM, Eric Prud'hommeaux <eric@w3.org> wrote:

> * Alex Milowski <alex@milowski.com> [2013-05-17 11:17-0700]
>> The attached file (UTF-8 version) is essentially the same as the one I got
>> from the test suite with just slightly different line endings.
>>
>> The problem here is that U+10000 and U+EFFFF are not representable in UCS-2
>> and it appears that most browsers are using UCS-2 at some level for the
>> strings in Javascript.  As such, these code points are turned into surrogate
>> pairs (U+D800 + U+DC00 and U+DB7F + U+DFFF) which aren't actually part of
>> UCS-2 as legal code points.  The surrogate characters are also not allowed
>> by the productions in prefixes and other such names.
>>
>> The test is unusable in an environment, like most current browsers, that
>> produces surrogate pairs from the UTF-8 encoding.  That is, unusable without
>> some enhanced level of support for surrogates.
>>
>> There are a number of options to resolve this:
>>
>> 1. The test will just fail for such environments without special support
>>    for surrogates.
>> 2. We mark this test as requiring surrogate / UTF-16 handling.  Passing
>>    the test is a quality-of-implementation question.
>> 3. Such environments are expected to check for pairs of surrogate values
>>    in place of the code points [#x10000-#xEFFFF] even though the resulting
>>    identifier is unlikely to be generally usable.
>>
>> Option (1) means I just fail the test and move on.  I'm not particularly
>> partial to this.
>>
>> Option (2) allows greater diversity of processor environments but certainly
>> runs counter to promoting proper Unicode support.  It would be good to have
>> a variant that tests only those in the Basic Multilingual Plane.
>>
>> Option (3) allows a processor to pass and punts the use of the code point
>> back to the user.  In the case of these particular code points and
>> identifiers, the user of the result will have similar issues when
>> inspecting or using values from the processor.  This will be no different
>> for data received from other places that has the same code points, as
>> Javascript will treat them all the same and produce surrogate pairs.
>>
>> There absolutely needs to be much more documentation attached to this test.
>> The identifiers (e.g. prefix) are constructed by taking the first and last
>> character of each range in the PN_CHARS_BASE production.  I have to admit I
>> really didn't recognize that until much later in my research into why this
>> test didn't pass.  In addition, the general issue of UCS-2 vs UTF-16
>> representations and surrogates needs to be minimally referenced in the
>> description.
>>
>> It would be good for everyone else to avoid a lot of the research I just did
>> with a well-written test description.
>>
>> BTW, there's a good write up of Javascript's issues at [1]
>>
>> [1] http://mathiasbynens.be/notes/javascript-encoding
>
> Dear Alex,
>
> On 17 May, in
> <http://www.w3.org/mid/CABp3FNKKtN8vac7=7gjzL2Y_KfH9wq0p56zAA5KAaO8Rv-SmEg@mail.gmail.com>,
> you proposed 3 alternatives for addressing characters outside of the
> range of #0000-#FFFD.  Your third choice got support from others on
> rdf-comments:
> [[
> 3. Such environments are expected to check for pairs of surrogate values
> in place of the code points [#x10000-#xEFFFF] even though the resulting
> identifier is unlikely to be generally usable.
> ]]
>
> In response to this, we added the following text to the README in the
> turtle test suite:
> [[
> CHARACTER ENCODING:
>
> The Turtle language uses UTF-8 encoding.  The following tests include
> non-ascii characters:
>   localName_with_assigned_nfc_bmp_PN_CHARS_BASE_character_boundaries
>   localName_with_assigned_nfc_PN_CHARS_BASE_character_boundaries *
>   localName_with_nfc_PN_CHARS_BASE_character_boundaries *
>   labeled_blank_node_with_PN_CHARS_BASE_character_boundaries *
>   LITERAL1_with_UTF8_boundaries *
>   LITERAL_LONG1_with_UTF8_boundaries *
>   LITERAL2_with_UTF8_boundaries *
>   LITERAL_LONG2_with_UTF8_boundaries *
>
> Those marked with a * include characters with codepoints greater than
> U+FFFD and are thus expressed as a pair of surrogate characters when
> represented in UCS2.
> ]]
>
> This did annotate the tests with non-BMP characters but did neither of:
>
>   Define comparison in terms of surrogate pairs (instead of code points).
>   Indicate that those were not required for conformance.
>
> Noting that none of the test submitters failed those tests:
> <http://www.w3.org/2011/rdf-wg/wiki/Turtle_Candidate_Recommendation_Comments#Implementations>
> including your own implementation report:
> <http://www.w3.org/mid/CABp3FNLOOHZHwVSUdUK09Kmp_yujfXZmKOeNBC4TC5V8iUp1Nw@mail.gmail.com>
> I propose to close this comment as satisfactorily addressed by the
> additional comments in README.  If you disagree, please indicate what
> would satisfy your comment.  If you agree, please respond with the
> subject prefixed by "[RESOLVED]".
>
>
>> On Fri, May 17, 2013 at 4:37 AM, Eric Prud'hommeaux <eric@w3.org> wrote:
>>
>> > * Alex Milowski <alex@milowski.com> [2013-05-16 23:44-0700]
>> > > In looking at test:
>> > >
>> > > prefix_with_PN_CHARS_BASE_character_boundaries.ttl [1]
>> > >
>> > > There are the code points u+dc00, u+db7f, and u+dfff in the last part of
>> > > the prefix.  The code points u+d800-u+dfff are not valid unicode
>> > > characters.
>> > >
>> > > Why are these in a positive test?
>> > >
>> > > [1]
>> > > https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/tests-ttl/prefix_with_PN_CHARS_BASE_character_boundaries.ttl
>> >
>> > In the prefix, I see these codepoints: u+41 u+5a u+61 u+7a u+c0 u+d6
>> > u+d8 u+f6 u+f8 u+2ff u+370 u+37d u+37f u+1fff u+200c u+200d u+2070
>> > u+218f u+2c00 u+2fef u+3001 u+d7ff u+f900 u+fdcf u+fdf0 u+fffd u+10000
>> > u+effff .
>> >
>> > I've attached two variants of the test, one encoded in UTF-8, which is
>> > legal turtle and has only the codepoints above, and UTF-16, which uses
>> > surrogates to encode the codepoints u+10000 and u+effff .  Is it
>> > possible that your buffer got re-encoded as UTF-16 at some point?
>> >
>> > If this message resolves your comment, please reply with "[RESOLVED]"
>> > in the subject.
>> >
>> > > --
>> > > --Alex Milowski
>> > > "The excellence of grammar as a guide is proportional to the paucity of
>> > > the inflexions, i.e. to the degree of analysis effected by the language
>> > > considered."
>> > >
>> > > Bertrand Russell in a footnote of Principles of Mathematics
>> >
>> > --
>> > -ericP
>> >
>>
>>
>>
>> --
>> --Alex Milowski
>> "The excellence of grammar as a guide is proportional to the paucity of the
>> inflexions, i.e. to the degree of analysis effected by the language
>> considered."
>>
>> Bertrand Russell in a footnote of Principles of Mathematics
>
> --
> -ericP
>
> office: +1.617.599.3509
> mobile: +33.6.80.80.35.59
>
> (eric@w3.org)
> Feel free to forward this message to any list for any purpose other than
> email address distribution.
>
> There are subtle nuances encoded in font variation and clever layout
> which can only be seen by printing this message on high-clay paper.

--
--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered."

Bertrand Russell in a footnote of Principles of Mathematics
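
For anyone retracing the arithmetic in this thread: the surrogate pairs quoted
above (U+D800 + U+DC00 for U+10000, U+DB7F + U+DFFF for U+EFFFF) follow from
the standard UTF-16 encoding rule. The Python below is only a minimal sketch of
that rule and of what option (3) might look like in a processor that sees
16-bit code units. The function names are hypothetical and come from neither
the Turtle test suite nor any implementation discussed here.

```python
from typing import Tuple

# Hypothetical helpers for illustration; not taken from any Turtle processor.

def to_surrogates(cp: int) -> Tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000                      # 20-bit value
    high = 0xD800 + (offset >> 10)             # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)            # bottom 10 bits
    return high, low

def from_surrogates(high: int, low: int) -> int:
    """Recombine a high/low surrogate pair into the code point it encodes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def pair_in_pn_chars_base(high: int, low: int) -> bool:
    """Option (3) sketch: accept a surrogate pair only if it decodes into
    the [#x10000-#xEFFFF] range of the PN_CHARS_BASE production."""
    try:
        cp = from_surrogates(high, low)
    except ValueError:
        return False
    return 0x10000 <= cp <= 0xEFFFF

print([hex(u) for u in to_surrogates(0x10000)])   # ['0xd800', '0xdc00']
print([hex(u) for u in to_surrogates(0xEFFFF)])   # ['0xdb7f', '0xdfff']
```

A lone surrogate, or a pair that decodes above U+EFFFF, is rejected by
pair_in_pn_chars_base, which matches the observation in the thread that the
surrogate code points themselves are not allowed by the name productions.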
Received on Sunday, 10 November 2013 03:05:40 UTC