Re: Equality of rdf:langString from Eric Prud'hommeaux on 2013-05-07 (public-rdf-wg@w3.org from May 2013)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Tue, 7 May 2013 06:49:47 -0400
To: Andy Seaborne <andy.seaborne@epimorphics.com>
Cc: RDF-WG <public-rdf-wg@w3.org>
Message-ID: <20130507104945.GB28539@w3.org>
* Andy Seaborne <andy.seaborne@epimorphics.com> [2013-05-07 10:12+0100]
> BCP47 does not require a fixed case for language tags; it talks
> about equivalence.  We should do the same.

Turtle comment 31 was about this as well <http://www.w3.org/2011/rdf-wg/wiki/Turtle_Candidate_Recommendation_Comments#c31>.
The commenter was satisfied with the textual proposals in <http://www.w3.org/mid/20130329165216.GC887@w3.org>. The commenter's ack <http://www.w3.org/mid/OF191FDF83.9AD29A20-ONC1257B41.002BECC9-C1257B41.003030AC@agfa.com> indicated that he'll be satisfied with langtags being compared case-insensitively and some mild encouragement to use appropriate cases per BCP47.

This differs from your proposal in that it states that "中国"@zh-Hans and "中国"@zh-hans are the same RDF term but does not demand folding to lowercase. It also notes that the lower-case form is not "RECOMMENDED" by BCP47, which uses the ISO15924 script codes (see entry 501 in <http://www.unicode.org/iso15924/iso15924-num.html>). BCP47 says:
[[
2.1.1.  Formatting of Language Tags

   At all times, language tags and their subtags, including private use
   and extensions, are to be treated as case insensitive: there exist
   conventions for the capitalization of some of the subtags, but these
   MUST NOT be taken to carry meaning.
]] — <http://tools.ietf.org/html/bcp47#section-2.1.1>
and
[[
   Although case distinctions do not carry meaning in language tags,
   consistent formatting and presentation of language tags will aid
   users.  The format of subtags in the registry is RECOMMENDED as the
   form to use in language tags.  This format generally corresponds to
   the common conventions for the various ISO standards from which the
   subtags are derived.

   These conventions include:

   o  [ISO639-1] recommends that language codes be written in lowercase
      ('mn' Mongolian).

   o  [ISO15924] recommends that script codes use lowercase with the
      initial letter capitalized ('Cyrl' Cyrillic).

   o  [ISO3166-1] recommends that country codes be capitalized ('MN'
      Mongolia).
]] — http://tools.ietf.org/html/bcp47#page-1-7
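
The conventions quoted above can be sketched in a few lines. This is only an illustration for presentation purposes, not proposed spec text: the function name format_langtag is made up, and the rule it applies (lowercase everything, except that two-letter subtags become uppercase and four-letter subtags become titlecase when they are neither first in the tag nor after a singleton) is my reading of BCP47 section 2.1.1. It does not consult the subtag registry, so it assumes the tag is already well-formed.

```python
def format_langtag(tag: str) -> str:
    """Return `tag` in the conventional BCP47 capitalization (hypothetical helper).

    Assumes a well-formed tag; no registry lookup or validation is done.
    """
    formatted = []
    after_singleton = False  # becomes True once a one-character subtag is seen
    for position, subtag in enumerate(tag.split("-")):
        if position > 0 and not after_singleton and len(subtag) == 2:
            # Two-letter subtag that is not first: region-style, e.g. 'MN'.
            formatted.append(subtag.upper())
        elif position > 0 and not after_singleton and len(subtag) == 4:
            # Four-letter subtag that is not first: script-style, e.g. 'Cyrl'.
            formatted.append(subtag.capitalize())
        else:
            # Everything else, including extension and private-use subtags.
            formatted.append(subtag.lower())
        if len(subtag) == 1:
            after_singleton = True
    return "-".join(formatted)


if __name__ == "__main__":
    for raw in ("zh-hans", "MN-cYRL-mn", "EN-ca-X-CA"):
        print(raw, "->", format_langtag(raw))
```

Reformatting like this is cosmetic only. Since BCP47 says case MUST NOT carry meaning, a tag and its reformatted version have to name the same language tag, which is the argument for comparing case-insensitively instead of storing a folded form.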

I think that comparing case insensitively as opposed to requiring case folding will keep us from transforming valid language tags into invalid language tags. If we don't specify some case cleverness, equivalence will be at the mercy of RDF authors who probably haven't read ISO15924 (slackers that they are) and won't know how fussy language tags really are.


> Proposal:
> 
> Two literals of rdf:langString are considered equal if they have the
> same lexical form and have equivalent language tags by BCP47.
> 
> Change to RDF Semantics:
> 
> [[
> The value space of rdf:langString is the set of all pairs of a
> string with a language tag.
> ]]
> ==>
> [[
> The value space of rdf:langString is the set of all pairs of a
> string with a language tag converted to lower case (US-ASCII).
> ]]

case-insensitive comparison: use former text.


> Change to RDF Concepts:
> 
> [[
> a non-empty language tag as defined by [BCP47]. The language tag
> must be well-formed according to section 2.2.9 of [BCP47], and must
> be normalized to lowercase.
> ]]
> ==>
> [[
> a non-empty language tag as defined by [BCP47]. The language tag
> must be well-formed according to section 2.2.9 of [BCP47].
> ]]

case-insensitive comparison: use latter text.


> The section on literal term equality remains unchanged:
> [[
> Literal equality: Two literals are equal if and only if the two
> lexical forms, the two datatype IRIs, and the two language tags (if
> any) compare equal, character by character.
> ]]

case-insensitive comparison:
[[
Literal equality: Two literals are equal if and only if the two
lexical forms and the two datatype IRIs are identical Unicode strings
and the language tags are equivalent. Two language tags are equivalent
if the strings formed by mapping each character to lower case are
identical. No collation for the lower case mapping is required as
language tags use only ASCII characters.

Note: While BCP47 section-2.1.1 specifies the appropriate case for
various sub-language forms, RDF treats as equal all variations in
case. For example, a literal "中国" with a language tag of "zh-Hans"
is the same term as that literal with a language tag of "zh-hans" or
"zh-HANS".
]]
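
As a rough sketch of what that equality text asks of an implementation (the Literal type and the function names here are invented for illustration, not taken from any RDF library): lexical forms and datatype IRIs are compared as-is, while language tags are compared after an ASCII-only lower-case mapping. The tag is stored exactly as the author wrote it.

```python
import string
from typing import NamedTuple, Optional

RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"

# Language tags are ASCII-only, so an explicit A-Z to a-z table is enough
# and avoids any locale- or Unicode-dependent case mapping.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


class Literal(NamedTuple):
    """Hypothetical minimal literal: the language tag is kept as written."""
    lexical_form: str
    datatype: str
    lang: Optional[str] = None


def langtags_equivalent(a: str, b: str) -> bool:
    """Two tags are equivalent if their lower-cased forms are identical."""
    return a.translate(_ASCII_LOWER) == b.translate(_ASCII_LOWER)


def literals_equal(x: Literal, y: Literal) -> bool:
    """Literal equality with case-insensitive language tag comparison."""
    if x.lexical_form != y.lexical_form or x.datatype != y.datatype:
        return False
    if x.lang is None or y.lang is None:
        # Equal only if neither literal carries a language tag.
        return x.lang is None and y.lang is None
    return langtags_equivalent(x.lang, y.lang)


if __name__ == "__main__":
    a = Literal("中国", RDF_LANGSTRING, "zh-Hans")
    b = Literal("中国", RDF_LANGSTRING, "zh-hans")
    print(literals_equal(a, b), a.lang, b.lang)
```

With this shape, "中国"@zh-Hans, "中国"@zh-hans and "中国"@zh-HANS all compare equal, and none of them has its tag rewritten. An implementation that needs hashing or indexing would have to key on the lower-cased tag as well, which this sketch leaves out.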

-- 
-ericP
Received on Tuesday, 7 May 2013 10:50:18 UTC