- From: Alex Milowski <alex@milowski.com>
- Date: Fri, 17 May 2013 14:17:53 -0700
- To: Peter Occil <poccil14@gmail.com>
- Cc: Richard Cyganiak <richard@cyganiak.de>, "Eric Prud'hommeaux" <eric@w3.org>, "public-rdf-comments@w3.org" <public-rdf-comments@w3.org>
- Message-ID: <CABp3FNLy6vZUwTNDmEYuT52oHRmwgosGx1uwfaFiZ991tdcCbg@mail.gmail.com>
It would be great to mark these tests with something to indicate these concerns. Option (3) is really a solution you can implement.

On Fri, May 17, 2013 at 2:13 PM, Peter Occil <poccil14@gmail.com> wrote:

> I believe the point of what Alex is saying is that an implementation in
> JavaScript that represents Turtle documents as strings ought to take care
> to interpret Unicode characters in strings correctly, since Unicode
> supports many more characters than the 16-bit code units in JavaScript
> strings can represent. For example, the charCodeAt method in JavaScript
> returns only the value of the 16-bit code unit at the given index of the
> string, not necessarily the Unicode character found there; and therefore
> can return a surrogate code point.
>
> This applies not only to JavaScript, but also Java, .NET, and any other
> environment whose strings consist of 16-bit Unicode code units.
>
> Accordingly, I believe this issue can be resolved by option 2.
>
> --Peter
>
> *From:* Richard Cyganiak <richard@cyganiak.de>
> *Sent:* Friday, May 17, 2013 3:45 PM
> *To:* Alex Milowski <alex@milowski.com>
> *Cc:* Eric Prud'hommeaux <eric@w3.org> ; public-rdf-comments@w3.org
> *Subject:* Re: Surrogate Code Points in Tests?
>
> I understand only half of what's being discussed in this thread. Maybe
> this is relevant: RDF Concepts says that IRIs and literals SHOULD be in
> Unicode normal form C.
>
> Apologies if that's not relevant here.
>
> Best,
> Richard
>
> On 17 May 2013, at 19:17, Alex Milowski <alex@milowski.com> wrote:
>
> The attached file (UTF-8 version) is essentially the same as the one I
> got from the test suite with just slightly different line endings.
>
> The problem here is that U+10000 and U+EFFFF are not representable in
> UCS-2 and it appears that most browsers are using UCS-2 at some level for
> the strings in Javascript. As such, these code points are turned into
> surrogate pairs (U+D800 + U+DC00 and U+DB7F + U+DFFF) which aren't actually
> part of UCS-2 as legal code points. The surrogate characters are also not
> allowed by the productions in prefixes and other such names.
>
> The test is unusable in an environment, like most current browsers, that
> produces surrogate pairs from the UTF-8 encoding. That is, unusable without
> some enhanced level of support for surrogates.
>
> There are a number of options to resolve this:
>
> 1. The test will just fail for such environments without special
>    support for surrogates.
> 2. We mark this test as requiring surrogate / UTF-16 handling. Passing
>    the test is a quality-of-implementation question.
> 3. Such environments are expected to check for pairs of surrogate
>    values in place of the code points [#x10000-#xEFFFF] even though the
>    resulting identifier is unlikely to be generally usable.
>
> Option (1) means I just fail the test and move on. I'm not particularly
> partial to this.
>
> Option (2) allows greater diversity of processor environments but
> certainly runs counter to promoting proper Unicode support. It would be
> good to have a variant that tests only those in the Basic Multilingual
> Plane.
>
> Option (3) allows a processor to pass and punts the use of the code point
> back to the user. In the case of these particular code points and
> identifiers, the user of the result will have similar issues when
> inspecting or using values from the processor. This will be no different
> for data received from other places that has the same code points as
> Javascript will treat them all the same and produce surrogate pairs.
>
> There absolutely needs to be much more documentation attached to this
> test. The identifiers (e.g. prefix) are constructed by taking the first
> and last character of each range in the PN_CHARS_BASE production. I have
> to admit I really didn't recognize that until much later in my research
> into why this test didn't pass. In addition, the general issue of UCS-2
> vs UTF-16 representations and surrogates needs to be minimally referenced
> in the description.
>
> It would be good for everyone else to avoid a lot of the research I just
> did with a well-written test description.
>
> BTW, there's a good write up of Javascript's issues at [1]
>
> [1] http://mathiasbynens.be/notes/javascript-encoding
>
> On Fri, May 17, 2013 at 4:37 AM, Eric Prud'hommeaux <eric@w3.org> wrote:
>
>> * Alex Milowski <alex@milowski.com> [2013-05-16 23:44-0700]
>> > In looking at test:
>> >
>> > prefix_with_PN_CHARS_BASE_character_boundaries.ttl [1]
>> >
>> > There are the code points u+dc00, u+db7f, and u+dfff in the last part of
>> > the prefix. The code points u+d800-u+dfff are not valid unicode
>> > characters.
>> >
>> > Why are these in a positive test?
>> >
>> > [1]
>> > https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/tests-ttl/prefix_with_PN_CHARS_BASE_character_boundaries.ttl
>>
>> In the prefix, I see these codepoints: u+41 u+5a u+61 u+7a u+c0 u+d6
>> u+d8 u+f6 u+f8 u+2ff u+370 u+37d u+37f u+1fff u+200c u+200d u+2070
>> u+218f u+2c00 u+2fef u+3001 u+d7ff u+f900 u+fdcf u+fdf0 u+fffd u+10000
>> u+effff .
>>
>> I've attached two variants of the test, one encoded in UTF-8, which is
>> legal turtle and has only the codepoints above, and UTF-16, which uses
>> surrogates to encode the codepoints u+10000 and u+effff . Is it
>> possible that your buffer got re-encoded as UTF-16 at some point?
>>
>> If this message resolves your comment, please reply with "[RESOLVED]"
>> in the subject.
>>
>> > --
>> > --Alex Milowski
>> > "The excellence of grammar as a guide is proportional to the paucity of
>> > the inflexions, i.e. to the degree of analysis effected by the language
>> > considered."
>> >
>> > Bertrand Russell in a footnote of Principles of Mathematics
>>
>> --
>> -ericP
>
> --
> --Alex Milowski
> "The excellence of grammar as a guide is proportional to the paucity of the
> inflexions, i.e. to the degree of analysis effected by the language
> considered."
>
> Bertrand Russell in a footnote of Principles of Mathematics

--
--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered."

Bertrand Russell in a footnote of Principles of Mathematics
Received on Friday, 17 May 2013 21:18:24 UTC
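
For readers following the thread, here is a minimal TypeScript sketch of what option (3) might look like in an environment whose strings are sequences of 16-bit code units. It is not taken from any Turtle implementation: the helper names (isHighSurrogate, isLowSurrogate, codePointAtIndex, isPnCharsBase) are hypothetical, and the range table is simply built from the first/last code points Eric lists in the thread, on the assumption stated by Alex that they are the boundaries of each PN_CHARS_BASE range.

```typescript
// Hypothetical helpers, for illustration only.

/** True if the 16-bit code unit is a high (leading) surrogate. */
function isHighSurrogate(unit: number): boolean {
  return unit >= 0xd800 && unit <= 0xdbff;
}

/** True if the 16-bit code unit is a low (trailing) surrogate. */
function isLowSurrogate(unit: number): boolean {
  return unit >= 0xdc00 && unit <= 0xdfff;
}

/**
 * Reads the code point that starts at `index`, combining a surrogate pair
 * when one is present. A lone surrogate is returned unchanged with length 1,
 * so the caller can reject it.
 */
function codePointAtIndex(
  s: string,
  index: number
): { codePoint: number; length: number } {
  const first = s.charCodeAt(index);
  if (isHighSurrogate(first) && index + 1 < s.length) {
    const second = s.charCodeAt(index + 1);
    if (isLowSurrogate(second)) {
      const codePoint =
        0x10000 + ((first - 0xd800) << 10) + (second - 0xdc00);
      return { codePoint, length: 2 };
    }
  }
  return { codePoint: first, length: 1 };
}

// Range boundaries as listed in the thread (assumed to be PN_CHARS_BASE).
const PN_CHARS_BASE_RANGES: ReadonlyArray<[number, number]> = [
  [0x41, 0x5a], [0x61, 0x7a], [0xc0, 0xd6], [0xd8, 0xf6],
  [0xf8, 0x2ff], [0x370, 0x37d], [0x37f, 0x1fff], [0x200c, 0x200d],
  [0x2070, 0x218f], [0x2c00, 0x2fef], [0x3001, 0xd7ff],
  [0xf900, 0xfdcf], [0xfdf0, 0xfffd], [0x10000, 0xeffff],
];

/** True if the code point falls in one of the ranges above. */
function isPnCharsBase(codePoint: number): boolean {
  for (const [low, high] of PN_CHARS_BASE_RANGES) {
    if (codePoint >= low && codePoint <= high) {
      return true;
    }
  }
  return false;
}

// U+10000 followed by U+EFFFF, written as the surrogate pairs from the thread.
const sample = "\ud800\udc00\udb7f\udfff";
const firstUnit = sample.charCodeAt(0); // 0xD800: a code unit, not a character
const firstChar = codePointAtIndex(sample, 0); // U+10000, two code units
const secondChar = codePointAtIndex(sample, firstChar.length); // U+EFFFF
```

Checking the combined code point rather than each code unit is what lets the last range match: 0xD800 on its own falls outside every range above, while the decoded 0x10000 falls inside the final one.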