RE: Final ping on turtle i18n issues from Phillips, Addison on 2012-12-11 (www-international@w3.org from October to December 2012)

From: Phillips, Addison <addison@lab126.com>
Date: Mon, 10 Dec 2012 22:35:03 -0800
To: "Eric Prud'hommeaux" <eric@w3.org>
CC: Sandro Hawke <sandro@w3.org>, "www-international@w3.org" <www-international@w3.org>, RDF WG <public-rdf-wg@w3.org>
Message-ID: <131F80DEA635F044946897AFDA9AC34773A90CC92C@EX-SEA31-D.ant.amazon.com>

Hello Eric, et al,

Some responses follow.

Addison

> -----Original Message-----
> From: Eric Prud'hommeaux [mailto:ericw3c@gmail.com] 
> Sent: Monday, December 10, 2012 3:46 PM
> 
> = I18N-ISSUE-178 =
> > I'm *not* satisfied with the resolution of this issue.
> >
> > In Section 6.5 you added a direct link to the IANA Language Subtag registry,
> calling the contents "registered language tags" (which is incorrect, these are
> language subtags, used to form language tags).
> 
> s{registered language tags}{registered language subtags} done

Thanks.

> 
> > No reference is made to BCP 47. TURTLE effectively does not specify any
> standard for language tags and does not require validity or even "well-
> formedness" of language tags. Please address this issue by adding an explicit
> reference to BCP 47 (preferably a normative reference).
> 
> Section 6.5 specifies the syntactic constraints on the language. Section 7
> specifies the binding to RDF, including
>   "Literals are composed of a lexical form and an optional language tag [BCP47]
> or datatype IRI."
> Also, the href on "language tag" points to the part of the RDF concepts doc
> which defines language tag:
>   "a non-empty language tag as defined by [BCP47]. The language tag must be
> well-formed according to section 2.2.9 of [BCP47]…"

I swear I looked at this. Must. Drink. More. Coffee.

> 
> > The link to the language subtag registry, please note, occurs in this sentence:
> >
> >    The strings @prefix and @base match the pattern for LANGTAG, though
> neither "prefix" nor "base" are registered language tags.
> >
> > Shouldn't the EBNF include the magick strings "prefix" and "base" in the
> LANGTAG production? While that production "permits" these strings, it is often
> customary to call out reserved values.
> 
> I floated a version of the grammar which explicitly included "prefix" and "base"
> as alternatives in a production for langtag
> <http://www.w3.org/mid/20120615185648.GA27073@w3.org> but that thread
> got derailed by a discussion about how parsers work. Because these language
> tags aren't registered and couldn't be tested by valid documents, the WG
> decided to leave their inclusion undefined
> <http://www.w3.org/2011/rdf-wg/meeting/2012-06-20#line0176>.

No worries. I don't think this is critical.
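
For readers following along, here is a minimal sketch of why the question comes up at all. It assumes the LANGTAG production reads '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*, and the DIRECTIVES set and classify_at_token function are hypothetical names invented for illustration, not anything from the spec or an implementation.

```python
import re

# LANGTAG pattern, assumed to be: '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
LANGTAG = re.compile(r"@[a-zA-Z]+(-[a-zA-Z0-9]+)*")

# Hypothetical reserved list: the two directive keywords that also
# happen to match the LANGTAG pattern.
DIRECTIVES = {"@prefix", "@base"}


def classify_at_token(token: str) -> str:
    """Classify an '@'-token as a directive, a language tag, or invalid."""
    if token in DIRECTIVES:
        return "directive"
    if LANGTAG.fullmatch(token):
        return "langtag"
    return "invalid"


for tok in ("@prefix", "@base", "@en-US", "@1en"):
    print(tok, classify_at_token(tok), bool(LANGTAG.fullmatch(tok)))
```

Both "@prefix" and "@base" satisfy the pattern on their own, so a tokenizer has to check the reserved strings first. Note that the pattern only tests the shape of the tag; it says nothing about BCP 47 well-formedness or registry validity.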
> 
> 
> = I18N-ISSUE-180 =
> > you have added the Unicode references requested, but these take the wrong
> form. For example, at #turtle-literals in your draft, you say:
> >
> >       Literals delimited by ' (U+27), may not contain the characters ', LF (U+0A),
> or CR (U+0D).
> >
> > Unicode literals should always be at least four hex digits (U+0027, U+000D,
> etc.). Please search your document for "U+" and fix each one (this is a quick
> job).
> 
> done, apart from the grammar which follows the XML notation #xN
> <http://www.w3.org/TR/xml/#sec-notation>.

Thanks!
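
In case it helps with the search-and-fix, a small sketch of the notation being asked for: the code point in uppercase hex, zero-padded to at least four digits. The helper name u_plus is made up for this illustration.

```python
def u_plus(ch: str) -> str:
    """Format one character in U+ notation, padded to at least four hex digits."""
    return f"U+{ord(ch):04X}"


print(u_plus("'"))           # apostrophe
print(u_plus("\n"))          # LF
print(u_plus("\r"))          # CR
print(u_plus("\U0001F600"))  # supplementary-plane code points keep all their digits
```

Padding is a minimum width, so code points above U+FFFF are written with five or six digits rather than being truncated.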

> (This changed "(U+0 to U+FFFF)" to "(U+0000 to U+FFFF)" in the media type,
> which will require a registry update some time this century.)

(Sets alarm on the Long Now clock...)

> 
> 
> = I18N-ISSUE-183 =
> > After your proposal, there were several comments from I18N and from other
> people. You didn't change the example so that the data types would be a good
> example. I think we're not satisfied with this one, although I would class this as
> an editorial comment and recognize that the point of the example is to show
> the different data types, not to model any real-life data. Still... bad examples
> like this are the bane of my existence.
> 
> I'm sympathetic to your existential bane, having several myself. Perhaps I can
> enlist you to help formulate a better example.
> 
...
> 
> While in theory it's better to use e.g. gYear, it's vanishingly rare to find real
> examples. Also, the example is explicitly for the syntactic shortcuts in Turtle
> which produce integers, decimals and doubles.
> 
> My runner-up example is from chemistry, a not terribly alien domain:
> 
> [[
> @prefix : <http://example.org/elements> .
> <http://en.wikipedia.org/wiki/Helium>
>     :atomicNumber 2 ;               # xsd:integer
>     :atomicMass 4.002602 ;          # xsd:decimal
>     :specificGravity 1.663E-4 .     # xsd:double
> ]]
> (Half-lives and that sort of thing get us back into the gYear problem.)
> 
> This one's a bit specialized, plus the element that has an atomic mass of
> 4.002602 isn't really the same thing as the gas with the specific gravity.
> Thoughts?

I like this example, even if it is slightly arcane. It meets the various quibbles. As I said, I hate bad examples, but I'd need to spend some time to invent a similar example. Perhaps on my flight to Seattle tomorrow.....
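
To check that the three values in the helium example really do land on three different datatypes, here is a rough sketch. The patterns are my approximation of the INTEGER, DECIMAL and DOUBLE productions in the Turtle grammar and may not match the current draft exactly; numeric_datatype is a hypothetical helper, not part of any Turtle parser.

```python
import re

# Approximations of the Turtle numeric productions (an assumption, see above).
EXPONENT = r"[eE][+-]?[0-9]+"
INTEGER = re.compile(r"[+-]?[0-9]+")
DECIMAL = re.compile(r"[+-]?[0-9]*\.[0-9]+")
DOUBLE = re.compile(
    rf"[+-]?([0-9]+\.[0-9]*{EXPONENT}|\.[0-9]+{EXPONENT}|[0-9]+{EXPONENT})"
)


def numeric_datatype(token: str):
    """Return the XSD datatype implied by a numeric shorthand, or None."""
    if DOUBLE.fullmatch(token):
        return "xsd:double"
    if DECIMAL.fullmatch(token):
        return "xsd:decimal"
    if INTEGER.fullmatch(token):
        return "xsd:integer"
    return None


for tok in ("2", "4.002602", "1.663E-4"):
    print(tok, numeric_datatype(tok))
```

The exponent is what separates a double from a decimal, and the decimal point is what separates a decimal from an integer, which is the whole point of the example.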

> 
> = I18N-ISSUE-189 =
> > We requested that you incorporate the obs-language-tag production directly
> or by reference. I'm satisfied with your reasons for not modifying the EBNF, but
> not with the text that describes the handling of language tags (see issue 178
> above).
> 
> You may be now that I've pointed out that the parsing section stipulates that
> LANGTAGs must be BCP47-compliant.

Yup.
> 
> 
> = I18N-ISSUE-190 =
> > Eric explained that PN_CHARS_BASE was derived from [2] and "presumably
> leveraged the wisdom that went into XML identifiers". However, the XML
> reference is incomplete in "erasing" combining marks (which was, in fact, the
> purpose) and it was created a Long Time Ago. The additional complexity that
> having this production introduces to TURTLE is probably unnecessary. However,
> I have no objection to keeping things as they are, as EBNF contains no ready
> means of doing anything better and it doesn't hurt anything to keep some
> combining marks from being used badly.
> 
> I read your response as saying that today the XML WG would use a different
> production because they missed removing some combining characters (so XML
> names are limited to those characters which have a single codepoint in NFC?).
> Presumably, this would change no existing data. Do you have the list of the
> changes?

It's... not that simple.

XML names are not limited to characters with single NFC code points and purposefully so. While we often use Latin letters and combining accents as examples of combining marks, these are, in practice, rarely used and the recommendation is to use NFC (pre-composed) code points where possible for interoperability reasons. However, what is often overlooked is that many scripts use combining marks as an inherent and unavoidable way that the text is encoded.  Cf. http://people.w3.org/rishida/docs/unicode-tutorial/part3#vowel-signs


So the problem isn't that there are a few more combining marks in the block reserved for that. It is that there are many combining marks strewn around Unicode and that making an EBNF list of them is prohibitively complex. What's more, new scripts get added to Unicode in each release and many of these require combining marks (making the old list obsolete). So what I'm basically saying is that, were we to do it over, we might use the Unicode properties instead of trying to make a list or we might handle this in some different manner (several approaches exist). While I applaud Turtle for trying to address the normalization problem, in practice, though, it's somewhat moot. Defining a triple name (or other token) that starts with a combining mark won't happen naturally with "real" data and may hurt if you do it--so enforcing it at the grammar level may not be a huge priority. Somewhat like we used to (try to) say that NFC/"include normalization" needed to be enforced tooth-and-nail, but now mostly confine ourselves to health warnings.

Ideally, I think you'd put an NFC health warning ("use NFC content for interoperability"), remove the extra rules, and point out that starting a token with a combining mark is a Bad Idea. But given that you have working implementations, et al, I'd just leave most of it alone.
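
To make the distinction concrete, a minimal sketch using Python's standard unicodedata module. The three helper names are invented for this illustration, and treating every general category that starts with "M" as a combining mark is a simplifying assumption.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """True if the text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


def has_combining_mark(text: str) -> bool:
    """True if any character has a Mark general category (Mn, Mc, Me)."""
    return any(unicodedata.category(c).startswith("M") for c in text)


def starts_with_combining_mark(text: str) -> bool:
    """True if the first character is a mark: the Bad Idea case."""
    return bool(text) and unicodedata.category(text[0]).startswith("M")


latin = "e\u0301"          # e + COMBINING ACUTE ACCENT, composes under NFC
devanagari = "\u0915\u093f"  # KA + VOWEL SIGN I, no precomposed form exists

print(is_nfc(latin), is_nfc(unicodedata.normalize("NFC", latin)))
print(is_nfc(devanagari), has_combining_mark(devanagari))
print(starts_with_combining_mark("\u0301abc"))
```

The Latin sequence stops containing a combining mark once normalized, while the Devanagari one is already NFC and still contains one. That is why "limit names to single NFC code points" cannot work, and why only the leading-mark check is worth a warning.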

> > Perhaps we can resolve things "live"?
> 
> That'd be like party time at TPAC.
> Let's see what remains after a few cycles and then we can see if we need to
> board each others' ships.

Sounds good (resists temptation to add pirate joke).
Received on Tuesday, 11 December 2012 06:36:32 UTC