Re: Equality of rdf:langString

Comparison is slippery: the answer depends on when the comparison is done.

Case 1: Compare on access:

Asking whether the data contains a triple with object "中国"@zh-Hans 
when the data has "中国"@zh-hans

Case 2: Compare when constructing the graph:

This follows from the set nature of an RDF graph.  Comparing at the 
point of graph construction is the same as case folding, modulo which 
form is actually put in the graph.
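
A rough Python sketch of the two cases, purely for illustration.  The 
names (LangLiteral, same_term, contains_on_access, build_graph_folding, 
build_graph_first_wins) are made up here and are not taken from any RDF 
system.  Language tags are assumed to be ASCII, so str.lower() is enough.

```python
from typing import Iterable, NamedTuple, Set


class LangLiteral(NamedTuple):
    """Hypothetical stand-in for an rdf:langString literal."""
    lexical: str
    lang: str


def same_term(a: LangLiteral, b: LangLiteral) -> bool:
    # Lexical forms must match exactly; tags are compared ignoring case.
    return a.lexical == b.lexical and a.lang.lower() == b.lang.lower()


# Case 1: compare on access. The graph stores tags exactly as written.
def contains_on_access(objects: Iterable[LangLiteral], probe: LangLiteral) -> bool:
    return any(same_term(o, probe) for o in objects)


# Case 2a: compare at construction by folding every tag to lower case.
def build_graph_folding(literals: Iterable[LangLiteral]) -> Set[LangLiteral]:
    return {LangLiteral(l.lexical, l.lang.lower()) for l in literals}


# Case 2b: compare at construction, keeping whichever spelling came first.
def build_graph_first_wins(literals: Iterable[LangLiteral]) -> Set[LangLiteral]:
    kept = {}
    for l in literals:
        kept.setdefault((l.lexical, l.lang.lower()), l)
    return set(kept.values())
```

Both 2a and 2b end up with one literal per equivalence class; they 
differ only in which spelling of the tag survives in the graph.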

Hence the idea of a survey as to what systems currently actually do.

On 07/05/13 11:49, Eric Prud'hommeaux wrote:
> * Andy Seaborne <andy.seaborne@epimorphics.com> [2013-05-07 10:12+0100]
>> BCP47 does not require a fixed case for language tags; it talks
>> about equivalence.  We should do the same.
>
> Turtle comment 31 was about this as well <http://www.w3.org/2011/rdf-wg/wiki/Turtle_Candidate_Recommendation_Comments#c31>.
> The commentor was satisfied with the textual proposals in <http://www.w3.org/mid/20130329165216.GC887@w3.org>. The commentor's ack <http://www.w3.org/mid/OF191FDF83.9AD29A20-ONC1257B41.002BECC9-C1257B41.003030AC@agfa.com> indicated that he'll be satisfied with langtags being compared case-insensitively and some mild encouragement to use appropriate cases per BCP47.

The text isn't in the doc?

The core seems to be "treated as the same node" which can be taken to be 
at graph construction or graph access.

There is then a proposed edit to concepts - it's better text but is it 
going into concepts?


[[
The I18N folks might look askance at having ill-formed language tag
]]

Rest easy.  They are not ill-formed by BCP47.  "Well-formed-ness" is 
acceptance by the grammar, and these tags are accepted (the grammar is 
case insensitive - mostly, and a bit unclear about irregulars).

>
> This differs from your proposal in that it states that "中国"@zh-Hans and "中国"@zh-hans are the same RDF term but does not demand folding to lowercase. It also notes that the lower-case form is not "RECOMMENDED" by BCP47, which uses the ISO15924 script codes c.f. entry 501 in <http://www.unicode.org/iso15924/iso15924-num.html>. BCP47 says
> [[
> 2.1.1.  Formatting of Language Tags
>
>     At all times, language tags and their subtags, including private use
>     and extensions, are to be treated as case insensitive: there exist
>     conventions for the capitalization of some of the subtags, but these
>     MUST NOT be taken to carry meaning.
> ]] — <http://tools.ietf.org/html/bcp47#section-2.1.1>
> and
> [[
>     Although case distinctions do not carry meaning in language tags,
>     consistent formatting and presentation of language tags will aid
>     users.  The format of subtags in the registry is RECOMMENDED as the
>     form to use in language tags.  This format generally corresponds to
>     the common conventions for the various ISO standards from which the
>     subtags are derived.
>
>     These conventions include:
>
>     o  [ISO639-1] recommends that language codes be written in lowercase
>        ('mn' Mongolian).
>
>     o  [ISO15924] recommends that script codes use lowercase with the
>        initial letter capitalized ('Cyrl' Cyrillic).
>
>     o  [ISO3166-1] recommends that country codes be capitalized ('MN'
>        Mongolia).
> ]] — http://tools.ietf.org/html/bcp47#page-1-7
>
> I think that comparing case insensitively as opposed to requiring case folding will keep us from transforming valid language tags into invalid language tags. If we don't specify some case cleverness, equivalence will be at the mercy of RDF authors who probably haven't read ISO15924 (slackers that they are) and won't know how fussy language tags really are.

Tut, tut.
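
For authors who have not read the ISO standards: the capitalization 
conventions quoted above can be approximated without the registry, 
following the rule of thumb in BCP47 section 2.1.1 (lower case 
everywhere, except that two-letter subtags not at the start are upper 
case and four-letter subtags not at the start are title case).  A 
sketch in Python; the function name is made up, and treating everything 
after the first singleton (such as "x") as lower case is my reading of 
the "after singletons" exception, not something settled here.

```python
def bcp47_recommended_case(tag: str) -> str:
    """Format a language tag in the conventional (not required) case."""
    out = []
    seen_singleton = False  # assumption: stays set once a singleton is seen
    for i, sub in enumerate(tag.split("-")):
        if i > 0 and not seen_singleton and len(sub) == 2:
            out.append(sub.upper())        # region, e.g. 'MN'
        elif i > 0 and not seen_singleton and len(sub) == 4:
            out.append(sub.capitalize())   # script, e.g. 'Cyrl'
        else:
            out.append(sub.lower())        # language and everything else
        seen_singleton = seen_singleton or len(sub) == 1
    return "-".join(out)
```

This only changes presentation; under case-insensitive comparison the 
input and output name the same language tag.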

>> Proposal:
>>
>> Two literals of rdf:langString are considered equal if they have the
>> same lexical form and have equivalent language tags by BCP47.
>>
>> Change to RDF Semantics:
>>
>> [[
>> The value space of rdf:langString is the set of all pairs of a
>> string with a language tag.
>> ]]
>> ==>
>> [[
>> The value space of rdf:langString is the set of all pairs of a
>> string with a language tag converted to lower case (US-ASCII).
>> ]]
>
> case-insensitive comparison: use former text.

This is defining the value space - we can use the former text if it's 
understood that it refers to the abstract concept of a language tag. 
(Not depending on comparison).

>
>
>> Change to RDF Concepts:
>>
>> [[
>> a non-empty language tag as defined by [BCP47]. The language tag
>> must be well-formed according to section 2.2.9 of [BCP47], and must
>> be normalized to lowercase.
>> ]]
>> ==>
>> [[
>> a non-empty language tag as defined by [BCP47]. The language tag
>> must be well-formed according to section 2.2.9 of [BCP47].
>> ]]
>
> case-insensitive comparison: use latter text.
>
>
>> The section on literal term equality remains unchanged:
>> [[
>> Literal equality: Two literals are equal if and only if the two
>> lexical forms, the two datatype IRIs, and the two language tags (if
>> any) compare equal, character by character.
>> ]]
>
> case-insensitive comparison:
> [[
> Literal equality: Two literals are equal if and only if the two
> lexical forms and the two datatype IRIs are identical unicode strings
> and the language tags are equivalent. Two language tags are equivalent
> if the strings formed by mapping each character to lower case are
> equivalent. No collation for the lower case mapping is required as
> language tags use only ASCII characters.
>
> Note: While BCP47 section-2.1.1 specifies the appropriate case for
> various sub-language forms, RDF treats as equal all variations in
> case. For example, a literal "中国" with a language tag of "zh-Hans"
> is the same term as that literal with a language tag of "zh-hans" or
> "zh-HANS".
> ]]
>
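
To make the case-insensitive wording concrete, here is a minimal Python 
sketch of that literal equality rule.  The Literal class and 
literal_equal function are hypothetical names for illustration only.  
The lower-case mapping touches A-Z only, on the assumption that language 
tags are pure ASCII.

```python
import string
from dataclasses import dataclass
from typing import Optional

# Map only ASCII A-Z to a-z; no locale or collation involved.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


@dataclass(frozen=True)
class Literal:
    """Hypothetical RDF literal: lexical form, datatype IRI, optional tag."""
    lexical: str
    datatype: str
    lang: Optional[str] = None


def literal_equal(a: Literal, b: Literal) -> bool:
    # Lexical forms and datatype IRIs must be identical strings.
    if a.lexical != b.lexical or a.datatype != b.datatype:
        return False
    # A literal with a tag never equals one without.
    if a.lang is None or b.lang is None:
        return a.lang is b.lang
    # Language tags are equivalent if their lower-cased forms match.
    return a.lang.translate(_ASCII_LOWER) == b.lang.translate(_ASCII_LOWER)
```

Note that the dataclass's own == still compares tags character by 
character; literal_equal is the separate, case-insensitive comparison.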

Received on Tuesday, 7 May 2013 11:46:05 UTC