- From: Gavin Carothers <gavin@carothers.name>
- Date: Wed, 20 Mar 2013 13:17:13 -0700
- To: "Eric Prud'hommeaux" <eric@w3.org>
- Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, I18N folks <www-international@w3.org>, RDF-WG WG <public-rdf-wg@w3.org>
- Message-ID: <CAPqY83yEgy9DbtjpfdKNTuPXheFUwqxu8n28inQC4BHwX-UXDQ@mail.gmail.com>
http://www.unicode.org/charts/PDF/U1F00.pdf
U+1FFF is not a character.

http://www.unicode.org/charts/PDF/U2150.pdf
U+218F is not a character.

No chart for code point U+2FEF could be located. Most likely this is
because no character is assigned to this code point yet.

http://www.unicode.org/charts/PDF/UD7B0.pdf
U+D7FF is not a character.

http://www.unicode.org/charts/PDF/UFB50.pdf
U+FDCF is not a character.

No chart for code point U+EFFFF could be located. Most likely this is
because no character is assigned to this code point yet.

New string based on the above missing characters, tested in Python 3.3
(earlier versions of Python are not supported; 3.3 is the only one with
Unicode 6.1.0):

import unicodedata

# Boundary characters of the PN_CHARS_BASE ranges, with unassigned
# code points swapped for assigned ones.
s = "AZaz\u00c0\u00d6\u00d8\u00f6\u00f8\u02ff\u0370\u037d\u0384\u1ffe\u200c\u200d\u2070\u217f\u2c00\u2fcf\u3001\ud7fb\uf900\ufdc7\ufdf0\ufffd\U00010000\U0001f52b"

def display_string(s):
    # Print code point, name and combining class for each character.
    for c in s:
        print("""Character: {c!s}
Codepoint: {code:x}
Name: {name}
Combining: {combining}
""".format(
            c=c,
            code=ord(c),
            # The default avoids a ValueError if an unnamed code point
            # is substituted into s.
            name=unicodedata.name(c, "<no name>"),
            combining=unicodedata.combining(c),
        ))

n = unicodedata.normalize("NFC", s)
display_string(s)
print("\n ------------------ \n ")
display_string(n)
# Fails for the string above: s is not in NFC.
assert n == s

Yeah, they aren't the same. The offending character is f900:
CJK COMPATIBILITY IDEOGRAPH-F900, which in normal form is
CJK UNIFIED IDEOGRAPH-8C48.

Finding something in the F900ish range is left to Eric. The script above
can be modified until it passes.

Cheers,
Gavin

On Wed, Mar 20, 2013 at 12:13 PM, Eric Prud'hommeaux <eric@w3.org> wrote:

> * Andy Seaborne <andy.seaborne@epimorphics.com> [2013-03-20 17:36+0000]
> > The TTL has U+037E but ...
> >
> > PN_CHARS_BASE has a hole specifically for that
> >
> > [#x0370-#x037D] | [#x037F-#x1FFF]
> >
> > => not a legal char.
>
> Yeah, I screwed that up. I should have gone the other way 'cause it's at
> the bottom of a range (unlike all the other unassigned chars). Attached are
> the same tests with s/37f/384/.
> Could you chop off after the "AZaz" and see
> if that works and do a binary search to see what it's complaining about?
>
> I18N folks, could you tell me why an NFC validator is objecting to this
> (beautiful) IRI and if there's some validator I can use for testing:?
> <http://a.example/AZazÀÖØöø˿Ͱͽ΄῾⁰↉Ⰰ⿕、ퟻ豈ﷇﷰ�𐀀󠇯>
> The goal is to test as much as possible the valid input to
> <http://www.w3.org/TR/turtle/#grammar-production-PrefixedName>. In turtle,
> the localName gets appended to the namespace, hence the url above. The
>
> [163s] PN_CHARS_BASE ::= [A-Z] | [a-z] | [#x00C0-#x00D6] |
> [#x00D8-#x00F6] | [#x00F8-#x02FF] | [#x0370-#x037D] | [#x037F-#x1FFF] |
> [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] |
> [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
>
> production is taken from <http://www.w3.org/TR/REC-xml/#NT-NameStartChar>:
>
> [4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] |
> [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] |
> [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] |
> [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
>
>
> > Removing it (Greek question mark), I then get:
> >
> > WARN [line: 2, col: 43] Bad IRI:
> > <http://a.example/AZaz???????????????????????> Code: 46/NOT_NFC in
> > PATH: The IRI is not in Unicode Normal Form C.
> > WARN [line: 2, col: 43] Bad IRI:
> > <http://a.example/AZaz???????????????????????> Code: 47/NOT_NFKC in
> > PATH: The IRI is not in Unicode Normal Form KC.
> > WARN [line: 2, col: 43] Bad IRI:
> > <http://a.example/AZaz???????????????????????> Code:
> > 56/COMPATIBILITY_CHARACTER in PATH: TODO
> >
> > with or without the last char.
> >
> > >I poked around looking for composing characters in the PN_CHARS_BASE
> > >character ranges. \u02ff MODIFIER LETTER LOW LEFT ARROW seemed like it
> > >could be a culprit, but fileformat.info claims it's not in a combining
> > >class.
> > >Likewise \ufffd REPLACEMENT CHARACTER
> > >
> > >There are a bunch of yet-unassigned characters which could be confusing
> > >a vigilant IRI checker. I've mapped those to the highest currently-
> > >assigned characters in their respective range (per fileformat.info):
> > >
> > >    \u037f      37e
> > >    \u1fff      1ffe
> > >    \u218f      2189
> > >    \u2fef      2fd5
> > >    \ud7ff      d7fb
> > >    \ufdcf      fdc7
> > >\U000effff      e01ef
> > >
> > >attached is a variant of
> > >  localName_with_PN_CHARS_BASE_character_boundaries.{nt,ttl}
> > >with the values substituted. (I pass this modified test so there
> > >shouldn't be any typos in it.) If it still doesn't work, try chopping
> > >off the last character 'cause it's a variation selector which ostensibly
> > >is NF{,K}{C,D} valid, but may not have been when jjc wrote your checker.
> > >
> >
>
> --
> -ericP
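
For the two open tasks in the thread (locating what an NFC check objects to, and finding a replacement near F900), here is a minimal sketch that was not part of the original exchange. It pinpoints characters that change under NFC without a manual binary search, and scans each PN_CHARS_BASE range for the lowest and highest code points that are both assigned and NFC-stable. The helper names (is_usable, nfc_offenders, boundary_candidates) are made up for illustration, and the results depend on the Unicode version bundled with the Python interpreter. The check is per character, so it would not catch a base character followed by a combining mark that composes under NFC.

```python
import unicodedata

# PN_CHARS_BASE ranges as quoted from the Turtle grammar in this thread.
PN_CHARS_BASE_RANGES = [
    (0x41, 0x5A), (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]


def is_usable(cp):
    """True if cp is assigned (category is not Cn) and unchanged by NFC."""
    c = chr(cp)
    return (unicodedata.category(c) != "Cn"
            and unicodedata.normalize("NFC", c) == c)


def nfc_offenders(s):
    """Return (index, character) pairs for characters that NFC rewrites."""
    return [(i, c) for i, c in enumerate(s)
            if unicodedata.normalize("NFC", c) != c]


def boundary_candidates(lo, hi):
    """Lowest and highest usable code points in [lo, hi], or None."""
    first = next((cp for cp in range(lo, hi + 1) if is_usable(cp)), None)
    last = next((cp for cp in range(hi, lo - 1, -1) if is_usable(cp)), None)
    return first, last


def fmt(cp):
    """Format a code point as U+XXXX, or 'none' if nothing was found."""
    return "none" if cp is None else "U+{:04X}".format(cp)


if __name__ == "__main__":
    # Report which characters of a test string are not NFC-stable.
    for index, char in nfc_offenders("AZaz\uf900\ufdc7"):
        print("index {}: U+{:04X} {}".format(
            index, ord(char), unicodedata.name(char, "<no name>")))

    # Suggest boundary characters for each grammar range.
    for lo, hi in PN_CHARS_BASE_RANGES:
        first, last = boundary_candidates(lo, hi)
        print("[{}-{}] -> {} .. {}".format(
            fmt(lo), fmt(hi), fmt(first), fmt(last)))
```

Substituting the suggested code points into s in the script above and re-running it until the assert passes would be one way to arrive at a test IRI that a strict checker accepts.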
Received on Wednesday, 20 March 2013 20:17:42 UTC