Re: claimed completion on "ACTION-233: Publish the consolidated test suite"

From: Eric Prud'hommeaux <eric@w3.org>
Date: Wed, 20 Mar 2013 23:51:04 -0400
To: Gavin Carothers <gavin@carothers.name>
Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, I18N folks <www-international@w3.org>, RDF-WG WG <public-rdf-wg@w3.org>
Message-ID: <20130321035101.GA17807@w3.org>
RDF and I18N folks, we have an interesting situation where we permit
U+F900-U+FA0D to appear in local names, but advise against anything
which is not NFC. So, what do we test? Everything a Turtle parser
could encounter? Currently assigned characters that are in NFC?
Identifiers consisting of a single letter 'a', under the assumption
that all others will work by extension?


* Gavin Carothers <gavin@carothers.name> [2013-03-20 13:17-0700]
> http://www.unicode.org/charts/PDF/U1F00.pdf U+1FFF is not a character.
> http://www.unicode.org/charts/PDF/U2150.pdf U+218F is not a character.
> No chart for code point U+2FEF could be located. Most likely this is
> because no character is assigned to this code point yet.
> http://www.unicode.org/charts/PDF/UD7B0.pdf U+D7FF is not a character.
> http://www.unicode.org/charts/PDF/UFB50.pdf U+FDCF is not a character.
> No chart for code point U+EFFFF could be located. Most likely this is
> because no character is assigned to this code point yet
> 
> 
> New string based on the above missing characters tested in Python 3.3
> (earlier versions of python not supported, only one with Unicode 6.1.0)

I banged briefly on finding an ubuntu package for Python 3.3
(currently at 3.2). Ended up with something called perl. sigh.

use Unicode::Normalize;
$s = "AZaz\x{00c0}\x{00d6}\x{00d8}\x{00f6}\x{00f8}\x{02ff}\x{0370}\x{037d}\x{0384}\x{1ffe}\x{200c}\x{200d}\x{2070}\x{217f}\x{2c00}\x{2fcf}\x{3001}\x{d7fb}\x{f900}\x{fdc7}\x{fdf0}\x{fffd}\x{00010000}\x{0001f52b}";
p $s cmp NFC($s);
=> 1 -- strings are different. so now to look for the first candidate:

for (0xf900..0xfdcf) {
    if (ord(Unicode::Normalize::NFC(chr($_))) == $_) {
        printf("%x\n", $_);
        last;
    }
}
=> fa0e
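
For anyone without Perl at hand, the same search can be sketched in Python
with the standard unicodedata module. The function name first_nfc_stable is
made up for illustration, and the answer depends on the Unicode tables that
ship with the interpreter:

```python
import unicodedata

def first_nfc_stable(lo, hi):
    """Return the first code point in [lo, hi] whose NFC form is itself, or None."""
    for cp in range(lo, hi + 1):
        ch = chr(cp)
        # A character is NFC-stable if normalizing it alone leaves it unchanged.
        if unicodedata.normalize("NFC", ch) == ch:
            return cp
    return None

# Same range as the Perl loop above.
print("%x" % first_nfc_stable(0xF900, 0xFDCF))
```

This should agree with the fa0e found by the Perl loop. It only tests each
code point in isolation, which is enough for this range.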

# checked with
$s = "AZaz\x{00c0}\x{00d6}\x{00d8}\x{00f6}\x{00f8}\x{02ff}\x{0370}\x{037d}\x{0384}\x{1ffe}\x{200c}\x{200d}\x{2070}\x{217f}\x{2c00}\x{2fcf}\x{3001}\x{d7fb}\x{fa0e}\x{fdc7}\x{fdf0}\x{fffd}\x{00010000}\x{0001f52b}";
p $s cmp NFC($s);
=> 0 -- equivalent

The currently unassigned characters don't impact NFC:
$s = "AZaz\x{00c0}\x{00d6}\x{00d8}\x{00f6}\x{00f8}\x{02ff}\x{0370}\x{037d}\x{037f}\x{1fff}\x{200c}\x{200d}\x{2070}\x{218f}\x{2c00}\x{2fef}\x{3001}\x{d7ff}\x{fa0e}\x{fdcf}\x{fdf0}\x{fffd}\x{10000}\x{effff}";
p $s cmp NFC($s);
=> 0 -- equivalent
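
A rough Python cross-check of the same claim, again using only unicodedata.
The helper name unnamed_code_points is hypothetical, and which code points
come back as unnamed will vary with the interpreter's Unicode version, since
some of these slots may have been assigned since:

```python
import unicodedata

# Boundary string with the upper ends of each PN_CHARS_BASE range,
# and fa0e substituted for f900 as above.
s = ("AZaz\u00c0\u00d6\u00d8\u00f6\u00f8\u02ff\u0370\u037d\u037f\u1fff"
     "\u200c\u200d\u2070\u218f\u2c00\u2fef\u3001\ud7ff\ufa0e\ufdcf"
     "\ufdf0\ufffd\U00010000\U000effff")

def unnamed_code_points(text):
    """List code points that have no character name in this Unicode version."""
    return [ord(c) for c in text if unicodedata.name(c, None) is None]

# Which characters does this interpreter consider nameless?
print(["%x" % cp for cp in unnamed_code_points(s)])
# Does NFC leave the whole string untouched?
print(unicodedata.normalize("NFC", s) == s)
```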


> import unicodedata
> s = "AZaz\u00c0\u00d6\u00d8\u00f6\u00f8\u02ff\u0370\u037d\u0384\u1ffe\u200c\u200d\u2070\u217f\u2c00\u2fcf\u3001\ud7fb\uf900\ufdc7\ufdf0\ufffd\U00010000\U0001f52b"
> 
> def display_string(s):
>     for c in s:
>         print("""Character: {c!s}
> Codepoint: {code:x}
> Name: {name}
> Combining: {combining}
> """.format(
>             c=c,
>             code=ord(c),
>             name=unicodedata.name(c),
>             combining=unicodedata.combining(c),
>         ))
> 
> n = unicodedata.normalize("NFC", s)
> 
> display_string(s)
> print("\n ------------------ \n ")
> display_string(n)
> 
> assert n == s
> 
> Yeah, they aren't the same. The offending character is f900:
> 
> CJK COMPATIBILITY IDEOGRAPH-F900 which in normal form is CJK UNIFIED
> IDEOGRAPH-8C48
> 
> Finding something in the F900ish range is left to Eric. Script above can be
> modified until it passes.
> 
> Cheers,
> Gavin
> 
> 
> 
> 
> 
> On Wed, Mar 20, 2013 at 12:13 PM, Eric Prud'hommeaux <eric@w3.org> wrote:
> 
> > * Andy Seaborne <andy.seaborne@epimorphics.com> [2013-03-20 17:36+0000]
> > > The TTL has U+037E but ...
> > >
> > > PN_CHARS_BASE has a hole specifically for that
> > >
> > > [#x0370-#x037D] | [#x037F-#x1FFF]
> > >
> > > => not a legal char.
> >
> > Yeah, I screwed that up. I should have gone the other way 'cause it's at
> > the bottom of a range (unlike all the other unassigned chars). Attached are
> > the same tests with s/37f/384/. Could you chop off after the "AZaz" and see
> > if that works and do a binary search to see what it's complaining about?
> >
> > I18N folks, could you tell me why an NFC validator is objecting to this
> > (beautiful) IRI, and whether there's some validator I can use for testing?
> >   <http://a.example/AZazÀÖØöø˿Ͱͽ΄῾‌‍⁰↉Ⰰ⿕、ퟻ豈ﷇﷰ�𐀀>
> > The goal is to test as much as possible of the valid input to
> > <http://www.w3.org/TR/turtle/#grammar-production-PrefixedName>. In Turtle,
> > the localName gets appended to the namespace, hence the URL above. The
> >
> >   [163s] PN_CHARS_BASE ::=    [A-Z] | [a-z] | [#x00C0-#x00D6] |
> > [#x00D8-#x00F6] | [#x00F8-#x02FF] | [#x0370-#x037D] | [#x037F-#x1FFF] |
> > [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] |
> > [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
> >
> > production is taken from <http://www.w3.org/TR/REC-xml/#NT-NameStartChar>:
> >
> >   [4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] |
> > [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] |
> > [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] |
> > [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
> >
> >
> >
> > > Removing it (Greek question mark), I then get:
> > >
> > > WARN  [line: 2, col: 43] Bad IRI:
> > > <http://a.example/AZaz???????????????????????> Code: 46/NOT_NFC in
> > > PATH: The IRI is not in Unicode Normal Form C.
> > > WARN  [line: 2, col: 43] Bad IRI:
> > > <http://a.example/AZaz???????????????????????> Code: 47/NOT_NFKC in
> > > PATH: The IRI is not in Unicode Normal Form KC.
> > > WARN  [line: 2, col: 43] Bad IRI:
> > > <http://a.example/AZaz???????????????????????> Code:
> > > 56/COMPATIBILITY_CHARACTER in PATH: TODO
> > >
> > > with or without the last char.
> > >
> > > >I poked around looking for composing characters in the PN_CHARS_BASE
> > > >character ranges. \u02ff MODIFIER LETTER LOW LEFT ARROW seemed like it
> > > >could be a culprit, but fileformat.info claims it's not in a combining
> > > >class. Likewise \ufffd REPLACEMENT CHARACTER
> > > >
> > > >There are a bunch of yet-unassigned characters which could be confusing
> > > >a vigilant IRI checker. I've mapped those to the highest
> > > >currently-assigned characters in their respective range (per fileformat.info):
> > > >
> > > >     \u037f   37e
> > > >     \u1fff  1ffe
> > > >     \u218f  2189
> > > >     \u2fef  2fd5
> > > >     \ud7ff  d7fb
> > > >     \ufdcf  fdc7
> > > >\U000effff e01ef
> > > >
> > > >attached is a variant of
> > > >   localName_with_PN_CHARS_BASE_character_boundaries.{nt,ttl}
> > > >with the values substituted. (I pass this modified test so there
> > > >shouldn't be any typos in it.) If it still doesn't work, try chopping
> > > >off the last character 'cause it's a variation selector which ostensibly
> > > >is NF{,K}{C,D} valid, but may not have been when jjc wrote your checker.
> > > >
> > > >
> > >
> >
> > --
> > -ericP
> >
> >
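
The PN_CHARS_BASE production quoted above can be turned into a small
membership check, which makes holes like U+037E easy to spot before blaming
NFC. This is only a sketch: the names in_pn_chars_base and first_illegal are
invented here, and it covers PN_CHARS_BASE alone, not the full PrefixedName
grammar:

```python
# Ranges copied from the PN_CHARS_BASE production quoted above.
PN_CHARS_BASE_RANGES = [
    (0x41, 0x5A), (0x61, 0x7A),
    (0x00C0, 0x00D6), (0x00D8, 0x00F6), (0x00F8, 0x02FF),
    (0x0370, 0x037D), (0x037F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

def in_pn_chars_base(ch):
    """True if the single character ch falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PN_CHARS_BASE_RANGES)

def first_illegal(local_name):
    """Index of the first character outside PN_CHARS_BASE, or -1 if none."""
    for i, ch in enumerate(local_name):
        if not in_pn_chars_base(ch):
            return i
    return -1

# U+037E (Greek question mark) sits in the hole between the two Greek ranges.
print(first_illegal("AZaz\u037d\u037e\u037f"))
```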

-- 
-ericP
Received on Thursday, 21 March 2013 03:51:37 GMT
