- From: Gregg Kellogg <gregg@greggkellogg.net>
- Date: Thu, 21 Mar 2013 07:38:08 -0700
- To: Andy Seaborne <andy.seaborne@epimorphics.com>
- Cc: "public-rdf-wg@w3.org" <public-rdf-wg@w3.org>
On Mar 21, 2013, at 3:47 AM, Andy Seaborne <andy.seaborne@epimorphics.com> wrote:

> On 21/03/13 03:51, Eric Prud'hommeaux wrote:
>> RDF and I18N folks, we have an interesting situation where we permit
>> U+F900-U+FA0D to appear in local names, but advise against anything
>> which is not NFC. So, what do we test?
>
> The grammar is wider than the acceptable URIs in several places - it's inevitable. We're expecting URI checking to be done after parsing in a very strict implementation.
>
> So test good practice and recognize that not everywhere is completely up-to-date on everything.

+1

Gregg

> Andy
>
>> Everything a Turtle parser
>> could encounter? Currently assigned characters that are in NFC?
>> Identifiers consisting of a single letter 'a', under the assumption
>> that all others will work by extension?
>>
>>
>> * Gavin Carothers <gavin@carothers.name> [2013-03-20 13:17-0700]
>>> http://www.unicode.org/charts/PDF/U1F00.pdf U+1FFF is not a character.
>>> http://www.unicode.org/charts/PDF/U2150.pdf U+218F is not a character.
>>> No chart for code point U+2FEF could be located. Most likely this is
>>> because no character is assigned to this code point yet.
>>> http://www.unicode.org/charts/PDF/UD7B0.pdf U+D7FF is not a character.
>>> http://www.unicode.org/charts/PDF/UFB50.pdf U+FDCF is not a character.
>>> No chart for code point U+EFFFF could be located. Most likely this is
>>> because no character is assigned to this code point yet.
>>>
>>>
>>> New string based on the above missing characters tested in Python 3.3
>>> (earlier versions of Python not supported, only one with Unicode 6.1.0)
>>
>> I banged briefly on finding an Ubuntu package for Python 3.3
>> (currently at 3.2). Ended up with something called perl. sigh.
>>
>> use Unicode::Normalize;
>> $s = "AZaz\x{00c0}\x{00d6}\x{00d8}\x{00f6}\x{00f8}\x{02ff}\x{0370}\x{037d}\x{0384}\x{1ffe}\x{200c}\x{200d}\x{2070}\x{217f}\x{2c00}\x{2fcf}\x{3001}\x{d7fb}\x{f900}\x{fdc7}\x{fdf0}\x{fffd}\x{00010000}\x{0001f52b}";
>> p $s cmp NFC($s);
>> => 1 -- strings are different. so now to look for the first candidate:
>>
>> for (0xf900..0xfdcf) {
>>   if (ord(Unicode::Normalize::NFC(chr($_))) == $_) {
>>     printf("%x\n", $_);
>>     last;
>>   }
>> }
>> => fa0e
>>
>> # checked with
>> $s = "AZaz\x{00c0}\x{00d6}\x{00d8}\x{00f6}\x{00f8}\x{02ff}\x{0370}\x{037d}\x{0384}\x{1ffe}\x{200c}\x{200d}\x{2070}\x{217f}\x{2c00}\x{2fcf}\x{3001}\x{d7fb}\x{fa0e}\x{fdc7}\x{fdf0}\x{fffd}\x{00010000}\x{0001f52b}";
>> p $s cmp NFC($s);
>> => 0 -- equivalent
>>
>> The currently unassigned characters don't impact NFC:
>> $s = "AZaz\x{00c0}\x{00d6}\x{00d8}\x{00f6}\x{00f8}\x{02ff}\x{0370}\x{037d}\x{037f}\x{1fff}\x{200c}\x{200d}\x{2070}\x{218f}\x{2c00}\x{2fef}\x{3001}\x{d7ff}\x{fa0e}\x{fdcf}\x{fdf0}\x{fffd}\x{10000}\x{effff}";
>> p $s cmp NFC($s);
>> => 0 -- equivalent
>>
>>
>>> import unicodedata
>>> s = "AZaz\u00c0\u00d6\u00d8\u00f6\u00f8\u02ff\u0370\u037d\u0384\u1ffe\u200c\u200d\u2070\u217f\u2c00\u2fcf\u3001\ud7fb\uf900\ufdc7\ufdf0\ufffd\U00010000\U0001f52b"
>>>
>>> def display_string(s):
>>>     for c in s:
>>>         print("""Character: {c!s}
>>> Codepoint: {code:x}
>>> Name: {name}
>>> Combining: {combining}
>>> """.format(
>>>             c=c,
>>>             code=ord(c),
>>>             name=unicodedata.name(c),
>>>             combining=unicodedata.combining(c),
>>>         ))
>>>
>>> n = unicodedata.normalize("NFC", s)
>>>
>>> display_string(s)
>>> print("\n ------------------ \n ")
>>> display_string(n)
>>>
>>> assert n == s
>>>
>>> Yeah, they aren't the same. The offending character is f900:
>>>
>>> CJK COMPATIBILITY IDEOGRAPH-F900 which in normal form is CJK UNIFIED
>>> IDEOGRAPH-8C48
>>>
>>> Finding something in the F900ish range is left to Eric. Script above can be
>>> modified until it passes.
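The Perl search above ports directly to Python's `unicodedata` module. What follows is a minimal sketch, not the script anyone in the thread ran: the helper names (`is_nfc`, `first_nfc_stable`, `changed_by_nfc`) are made up for illustration, and results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def is_nfc(text):
    """Return True if NFC normalization leaves text unchanged."""
    return unicodedata.normalize("NFC", text) == text


def first_nfc_stable(start, end):
    """Return the first code point in [start, end] that NFC leaves
    unchanged, or None if every code point in the range is rewritten."""
    for cp in range(start, end + 1):
        if is_nfc(chr(cp)):
            return cp
    return None


def changed_by_nfc(text):
    """List the code points of text that change when normalized on their own."""
    return [ord(c) for c in text if not is_nfc(c)]


# Test string from the thread (the version containing U+F900).
original = (
    "AZaz\u00c0\u00d6\u00d8\u00f6\u00f8\u02ff\u0370\u037d\u0384\u1ffe"
    "\u200c\u200d\u2070\u217f\u2c00\u2fcf\u3001\ud7fb\uf900\ufdc7"
    "\ufdf0\ufffd\U00010000\U0001f52b"
)

# Same range the Perl loop scanned: the [#xF900-#xFDCF] part of PN_CHARS_BASE.
candidate = first_nfc_stable(0xF900, 0xFDCF)
if candidate is None:
    raise SystemExit("no NFC-stable code point found in the range")

# Swap the offending character for the first stable one and re-check.
repaired = original.replace("\uf900", chr(candidate))

print("offending code points:", [f"{cp:x}" for cp in changed_by_nfc(original)])
print("first NFC-stable code point:", f"{candidate:x}")
print("repaired string is NFC:", is_nfc(repaired))
```

Note that `changed_by_nfc` normalizes each character in isolation, so it would miss a problem caused only by a sequence (for example a base letter followed by a combining mark); the whole-string `is_nfc` check is the one to rely on.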
>>>
>>> Cheers,
>>> Gavin
>>>
>>>
>>> On Wed, Mar 20, 2013 at 12:13 PM, Eric Prud'hommeaux <eric@w3.org> wrote:
>>>
>>>> * Andy Seaborne <andy.seaborne@epimorphics.com> [2013-03-20 17:36+0000]
>>>>> The TTL has U+037E but ...
>>>>>
>>>>> PN_CHARS_BASE has a hole specifically for that
>>>>>
>>>>> [#x0370-#x037D] | [#x037F-#x1FFF]
>>>>>
>>>>> => not a legal char.
>>>>
>>>> Yeah, I screwed that up. I should have gone the other way 'cause it's at
>>>> the bottom of a range (unlike all the other unassigned chars). Attached are
>>>> the same tests with s/37f/384/. Could you chop off after the "AZaz" and see
>>>> if that works and do a binary search to see what it's complaining about?
>>>>
>>>> I18N folks, could you tell me why an NFC validator is objecting to this
>>>> (beautiful) IRI and if there's some validator I can use for testing?
>>>> <http://a.example/AZazÀÖØöø˿Ͱͽ΄῾⁰↉Ⰰ⿕、ퟻ豈ﷇﷰ�𐀀>
>>>> The goal is to test as much as possible the valid input to
>>>> <http://www.w3.org/TR/turtle/#grammar-production-PrefixedName>. In Turtle,
>>>> the localName gets appended to the namespace, hence the url above.
>>>> The
>>>>
>>>> [163s] PN_CHARS_BASE ::= [A-Z] | [a-z] | [#x00C0-#x00D6] |
>>>> [#x00D8-#x00F6] | [#x00F8-#x02FF] | [#x0370-#x037D] | [#x037F-#x1FFF] |
>>>> [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] |
>>>> [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
>>>>
>>>> production is taken from <http://www.w3.org/TR/REC-xml/#NT-NameStartChar>:
>>>>
>>>> [4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] |
>>>> [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] |
>>>> [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] |
>>>> [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
>>>>
>>>>
>>>>> Removing it (Greek question mark), I then get:
>>>>>
>>>>> WARN [line: 2, col: 43] Bad IRI:
>>>>> <http://a.example/AZaz???????????????????????> Code: 46/NOT_NFC in
>>>>> PATH: The IRI is not in Unicode Normal Form C.
>>>>> WARN [line: 2, col: 43] Bad IRI:
>>>>> <http://a.example/AZaz???????????????????????> Code: 47/NOT_NFKC in
>>>>> PATH: The IRI is not in Unicode Normal Form KC.
>>>>> WARN [line: 2, col: 43] Bad IRI:
>>>>> <http://a.example/AZaz???????????????????????> Code:
>>>>> 56/COMPATIBILITY_CHARACTER in PATH: TODO
>>>>>
>>>>> with or without the last char.
>>>>>
>>>>>> I poked around looking for composing characters in the PN_CHARS_BASE
>>>>>> character ranges. \u02ff MODIFIER LETTER LOW LEFT ARROW seemed like it
>>>>>> could be a culprit, but fileformat.info claims it's not in a combining
>>>>>> class. Likewise \ufffd REPLACEMENT CHARACTER
>>>>>>
>>>>>> There are a bunch of yet-unassigned characters which could be confusing
>>>>>> a vigilant IRI checker.
>>>>>> I've mapped those to the highest currently-assigned
>>>>>> characters in their respective range (per fileformat.info):
>>>>>>
>>>>>> \u037f 37e
>>>>>> \u1fff 1ffe
>>>>>> \u218f 2189
>>>>>> \u2fef 2fd5
>>>>>> \ud7ff d7fb
>>>>>> \ufdcf fdc7
>>>>>> \U000effff e01ef
>>>>>>
>>>>>> attached is a variant of
>>>>>> localName_with_PN_CHARS_BASE_character_boundaries.{nt,ttl}
>>>>>> with the values substituted. (I pass this modified test so there
>>>>>> shouldn't be any typos in it.) If it still doesn't work, try chopping
>>>>>> off the last character 'cause it's a variation selector which ostensibly
>>>>>> is NF{,K}{C,D} valid, but may not have been when jjc wrote your checker.
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> -ericP
>>>>
>>>>
>>
>
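The substitution table above can be checked mechanically against the PN_CHARS_BASE ranges quoted earlier in the thread, which is how the U+037E problem Andy pointed out would surface. This is a minimal sketch that assumes the ranges exactly as quoted; the names `PN_CHARS_BASE_RANGES`, `in_pn_chars_base` and `SUBSTITUTIONS` are illustrative and not part of any Turtle library.

```python
# Inclusive code point ranges of PN_CHARS_BASE, copied from the production
# quoted in the thread.
PN_CHARS_BASE_RANGES = [
    (0x0041, 0x005A),    # [A-Z]
    (0x0061, 0x007A),    # [a-z]
    (0x00C0, 0x00D6),
    (0x00D8, 0x00F6),
    (0x00F8, 0x02FF),
    (0x0370, 0x037D),
    (0x037F, 0x1FFF),    # U+037E is deliberately left out
    (0x200C, 0x200D),
    (0x2070, 0x218F),
    (0x2C00, 0x2FEF),
    (0x3001, 0xD7FF),
    (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]


def in_pn_chars_base(cp):
    """Return True if code point cp falls inside one of the quoted ranges."""
    return any(low <= cp <= high for low, high in PN_CHARS_BASE_RANGES)


# Range boundary -> substitute, as listed in the table in the thread.
SUBSTITUTIONS = {
    0x037F: 0x037E,
    0x1FFF: 0x1FFE,
    0x218F: 0x2189,
    0x2FEF: 0x2FD5,
    0xD7FF: 0xD7FB,
    0xFDCF: 0xFDC7,
    0xEFFFF: 0xE01EF,
}

# Substitutes that the grammar itself would reject.
rejected = [sub for sub in SUBSTITUTIONS.values() if not in_pn_chars_base(sub)]

for boundary, sub in SUBSTITUTIONS.items():
    verdict = "ok" if in_pn_chars_base(sub) else "outside PN_CHARS_BASE"
    print(f"U+{boundary:04X} -> U+{sub:04X}: {verdict}")
```

Only the U+037E substitute falls outside the grammar, which matches the later s/37f/384/ fix: U+0384 sits inside [#x037F-#x1FFF].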
Received on Thursday, 21 March 2013 14:38:43 UTC