
RE: Final ping on turtle i18n issues

From: Eric Prud'hommeaux <eric@w3.org>
Date: Mon, 10 Dec 2012 18:46:25 -0500
To: "Phillips, Addison" <addison@lab126.com>
Cc: Sandro Hawke <sandro@w3.org>, "www-international@w3.org" <www-international@w3.org>, RDF WG <public-rdf-wg@w3.org>
Message-ID: <20121210234622.GE25523@w3.org>
* Phillips, Addison <addison@lab126.com> [2012-12-10 08:41-0800]
> Hello Sandro,
> Thanks for this ping. I regret that I didn't see all of these messages at the time (we responded to a few of them, but the bulk of your responses appear to have come while I was on vacation, and so this is, alas, my first look at them). The complete list of our comments is at [1].
> In reviewing your responses, I have the following comments, which are based on the assumption that your current editor's draft is:
>    http://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/index.html 

= I18N-ISSUE-178 =
> I'm *not* satisfied with the resolution of this issue.
> In Section 6.5 you added a direct link to the IANA Language Subtag registry, calling the contents "registered language tags" (which is incorrect, these are language subtags, used to form language tags).

s{registered language tags}{registered language subtags} done

> No reference is made to BCP 47. TURTLE effectively does not specify any standard for language tags and does not require validity or even "well-formedness" of language tags. Please address this issue by adding an explicit reference to BCP 47 (preferably a normative reference).

Section 6.5 specifies the syntactic constraints on the language. Section 7 specifies the binding to RDF, including
  "Literals are composed of a lexical form and an optional language tag [BCP47] or datatype IRI."
Also, the href on "language tag" points to the part of the RDF concepts doc which defines language tag:
  "a non-empty language tag as defined by [BCP47]. The language tag must be well-formed according to section 2.2.9 of [BCP47]…"

You'll see BCP47 at the top of the Normative References <http://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/index.html#normative-references>.

> The link to the language subtag registry, please note, occurs in this sentence:
>    The strings @prefix and @base match the pattern for LANGTAG, though neither "prefix" nor "base" are registered language tags.
> Shouldn't the EBNF include the magick strings "prefix" and "base" in the LANGTAG production? While that production "permits" these strings, it is often customary to call out reserved values.

I floated a version of the grammar which explicitly included "prefix" and "base" as alternatives in a production for langtag <http://www.w3.org/mid/20120615185648.GA27073@w3.org> but that thread got derailed by a discussion about how parsers work. Because these language tags aren't registered and couldn't be tested by valid documents, the WG decided to leave their inclusion undefined <http://www.w3.org/2011/rdf-wg/meeting/2012-06-20#line0176>.
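
To make the overlap concrete, here is a minimal Python sketch. It assumes the LANGTAG terminal reads '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*; the regex transcription and the helper name matches_langtag are illustrative assumptions, not text from the draft.

```python
import re

# Assumed transcription of the Turtle LANGTAG terminal:
#   '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
LANGTAG = re.compile(r"@[a-zA-Z]+(?:-[a-zA-Z0-9]+)*")


def matches_langtag(token: str) -> bool:
    """Return True if the whole token matches the LANGTAG pattern."""
    return LANGTAG.fullmatch(token) is not None


if __name__ == "__main__":
    # "@prefix" and "@base" are syntactically indistinguishable from
    # language tags at this level, which is the point raised above.
    for token in ["@en", "@en-US", "@prefix", "@base", "@", "@en_US"]:
        print(token, matches_langtag(token))
```

The pattern only checks syntax; whether a subtag such as "prefix" is registered is a separate question that a grammar of this kind cannot answer.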

= I18N-ISSUE-180 =
> you have added the Unicode references requested, but these take the wrong form. For example, at #turtle-literals in your draft, you say:
>       Literals delimited by ' (U+27), may not contain the characters ', LF (U+0A), or CR (U+0D).
> Unicode literals should always be at least four hex digits (U+0027, U+000D, etc.). Please search your document for "U+" and fix each one (this is a quick job).

done, apart from the grammar which follows the XML notation #xN <http://www.w3.org/TR/xml/#sec-notation>.
(This changed "(U+0 to U+FFFF)" to "(U+0000 to U+FFFF)" in the media type, which will require a registry update some time this century.)
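
For anyone repeating this search-and-fix elsewhere, a small Python sketch of the padding rule follows. The function name normalize_u_plus is hypothetical, and the regex is an assumption about what a "U+" reference looks like in running prose.

```python
import re

# Matches "U+" followed by one or more hex digits (assumed prose form).
U_PLUS = re.compile(r"U\+([0-9A-Fa-f]+)")


def normalize_u_plus(text: str) -> str:
    """Pad every U+ reference to at least four uppercase hex digits."""
    def pad(match: re.Match) -> str:
        # zfill leaves references that already have four or more digits alone.
        return "U+" + match.group(1).upper().zfill(4)

    return U_PLUS.sub(pad, text)


if __name__ == "__main__":
    print(normalize_u_plus("' (U+27), LF (U+0A), or CR (U+0D)"))
    print(normalize_u_plus("(U+0 to U+FFFF)"))
```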

= I18N-ISSUE-183 =
> After your proposal, there were several comments from I18N and from other people. You didn't change the example so that the data types would be a good example. I think we're not satisfied with this one, although I would class this as an editorial comment and recognize that the point of the example is to show the different data types, not to model any real-life data. Still... bad examples like this are the bane of my existence.

I'm sympathetic to your existential bane, having several myself. Perhaps I can enlist you to help formulate a better example.

Current text:
@prefix : <http://example.org/stats> .
    :censusYear 2007 ;              # xsd:integer
    :birthsPerPerson .0135 ;        # xsd:decimal
    :gdpDollars 14074.2E9 ;         # xsd:double
+ terse -- no extra triples.
+ approachable -- not some specialized domain.
+ realistic -- actual values from 2007 census.
- bad practice -- all these literals have implicit datatypes.

Richard proposed some ideas for either pure numbers or numbers where the datatype was structurally set apart from the value <http://www.w3.org/mid/6FBBDE2A-9EB1-4204-9BA9-53BB1E326966@cyganiak.de>.
I have to disagree. Neither untyped integers nor untyped doubles are hard to find; examples for integers include world rankings (e.g., 34th in the world) and unit multipliers (e.g., 9 for billions); examples for doubles include ratios (e.g., change compared to previous year) and, indeed, structured typed values:

  :landArea [ :unit :km2; :value 9.827E6 ];

The current example does not just lack rigour, but shows two examples of poor practice:

1. the use of integers where a more appropriate XSD type (gYear) exists; don't do that!
2. the use of floating point values for currencies; don't do that!

Also, a census does not produce GDP figures. Call it </stat-facts2007> or whatever.
The last point is of course trivial to take care of; the challenges lie in the first two points.

While in theory it's better to use e.g. gYear, it's vanishingly rare to find real examples. Also, the example is explicitly for the syntactic shortcuts in Turtle which produce integers, decimals and doubles.
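
As a rough illustration of those shortcuts, the Python sketch below maps a bare numeric token to the datatype it would imply. The regexes are my transcription of the INTEGER, DECIMAL and DOUBLE terminals and should be treated as assumptions; implied_datatype is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumed transcriptions of the Turtle numeric terminals.
EXPONENT = r"[eE][+-]?[0-9]+"
INTEGER = re.compile(r"[+-]?[0-9]+")
DECIMAL = re.compile(r"[+-]?[0-9]*\.[0-9]+")
DOUBLE = re.compile(
    rf"[+-]?(?:[0-9]+\.[0-9]*{EXPONENT}|\.[0-9]+{EXPONENT}|[0-9]+{EXPONENT})"
)


def implied_datatype(token: str) -> Optional[str]:
    """Return the XSD datatype implied by a bare numeric token, or None."""
    # Check the most specific form first: only doubles carry an exponent.
    if DOUBLE.fullmatch(token):
        return "xsd:double"
    if DECIMAL.fullmatch(token):
        return "xsd:decimal"
    if INTEGER.fullmatch(token):
        return "xsd:integer"
    return None


if __name__ == "__main__":
    for token in ["2007", ".0135", "14074.2E9"]:
        print(token, implied_datatype(token))
```

Nothing here picks gYear or a currency-safe type; the shorthand only ever yields these three datatypes, which is why the example is constrained the way it is.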

My runner-up example is from chemistry, a not terribly alien domain:

@prefix : <http://example.org/elements> .
    :atomicNumber 2 ;               # xsd:integer
    :atomicMass 4.002602 ;          # xsd:decimal
    :specificGravity 1.663E-4 .     # xsd:double
(Half-lives and that sort of thing get us back into the gYear problem.)

This one's a bit specialized, plus the element that has an atomic mass of 4.002602 isn't really the same thing as the gas with the specific gravity.

= I18N-ISSUE-184 =
> See issue 180 (above). The format of your U+ syntax is invalid.


= I18N-ISSUE-187 =
> We're okay with \u and \U syntaxes, but you didn't address part of our comment, which is that the \u syntax doesn't address surrogate pair handling. You might do so by saying "Unicode character" instead of "Unicode codepoint" (not all code points represent characters).

done (@#$%ing surrogates)
tx for the wording correction.
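
To show what the surrogate concern amounts to in practice, here is a hedged Python sketch of decoding \u and \U escapes. Rejecting surrogate code points outright is an assumption of this sketch, not a statement of what the draft requires; decode_numeric_escapes is a hypothetical name.

```python
import re

# \uXXXX (four hex digits) or \UXXXXXXXX (eight hex digits).
# Simplification: an escaped backslash such as "\\u0041" is not handled.
NUMERIC_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_numeric_escapes(text: str) -> str:
    """Replace \\u and \\U escapes with the characters they denote."""
    def replace(match: re.Match) -> str:
        code_point = int(match.group(1) or match.group(2), 16)
        # Surrogates (U+D800 to U+DFFF) are code points but not characters.
        if 0xD800 <= code_point <= 0xDFFF:
            raise ValueError(f"surrogate code point U+{code_point:04X}")
        if code_point > 0x10FFFF:
            raise ValueError(f"beyond Unicode range: {code_point:#x}")
        return chr(code_point)

    return NUMERIC_ESCAPE.sub(replace, text)


if __name__ == "__main__":
    print(decode_numeric_escapes(r"caf\u00E9"))
    print(decode_numeric_escapes(r"\U0001F600"))
```

Under this reading, a supplementary character has to be written with a single \U escape; a pair of \u escapes spelling out its two surrogates raises an error instead of being silently combined.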

= I18N-ISSUE-188 =
> I'm satisfied by Eric's response.

= I18N-ISSUE-189 =
> We requested that you incorporate the obs-language-tag production directly or by reference. I'm satisfied with your reasons for not modifying the EBNF, but not with the text that describes the handling of language tags (see issue 178 above).

You may be, now that I've pointed out that the parsing section stipulates that LANGTAGs must be BCP47-compliant.

= I18N-ISSUE-190 =
> Eric explained that PN_CHARS_BASE was derived from [2] and "presumably leveraged the wisdom that went into XML identifiers". However, the XML reference is incomplete in "erasing" combining marks (which was, in fact, the purpose) and it was created a Long Time Ago. The additional complexity that having this production introduces to TURTLE is probably unnecessary. However, I have no objection to keeping things as they are, as EBNF contains no ready means of doing anything better and it doesn't hurt anything to keep some combining marks from being used badly.

I read your response as saying that today the XML WG would use a different production because they missed removing some combining characters (so XML names are limited to those characters which have a single codepoint in NFC?). Presumably, this would change no existing data. Do you have the list of the changes?

> ===
> We have a small timing problem, given that our next WG teleconference is scheduled for 10a.m. EST. Internationalization Working Group members should comment on-list if they feel my responses above are not consistent with working group consensus. If your WG has concerns with any of the above, it seems that our teleconferences are at the same time. Perhaps we can resolve things "live"?

That'd be like party time at TPAC.
Let's see what remains after a few cycles and then we can see if we need to board each others' ships.

> Thanks,
> Addison
> Addison Phillips
> Globalization Architect (Lab126)
> Chair (W3C I18N WG)
> Internationalization is not a feature.
> It is an architecture.
> [1] http://www.w3.org/International/track/products/34 
> [2] http://www.w3.org/TR/2008/REC-xml-20081126/#NT-NameStartChar 
> > -----Original Message-----
> > From: Sandro Hawke [mailto:sandro@w3.org]
> > Sent: Friday, December 07, 2012 8:45 AM
> > To: www-international@w3.org
> > Cc: Eric Prud'hommeaux; RDF WG
> > Subject: Final ping on turtle i18n issues
> > 
> > We still haven't heard back from you on our proposals [1] for addressing your
> > review comments [2] on the LC WD of Turtle.  It's been almost two months
> > since our responses to your comments, and we need to move
> > forward.   If we don't hear from you by 10am ET on Wed 12 Dec, we'll
> > assume our responses are satisfactory.  If they're not, please let us know ASAP,
> > since we'd like to resolve things and move forward at our 12 Dec meeting.
> > 
> >        -- Sandro
> > 
> > [1] everything by Gavin Carothers or Eric Prud'hommeaux in
> > http://lists.w3.org/Archives/Public/www-international/2012OctDec/author.html
> > [2] http://www.w3.org/International/track/products/34
> > 
> > 

Received on Monday, 10 December 2012 23:46:58 UTC
