Re: Are literal language tags case sensitive? from Eric Prud'hommeaux on 2017-01-13 (semantic-web@w3.org from January 2017)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Fri, 13 Jan 2017 04:25:24 -0500
To: Pat Hayes <phayes@ihmc.us>
Cc: Semantic Web IG <semantic-web@w3.org>, Stian Soiland-Reyes <soiland-reyes@manchester.ac.uk>
Message-ID: <20170113092522.GA6394@w3.org>
* Pat Hayes <phayes@ihmc.us> [2017-01-12 22:47-0800]
> FWIW, my recollection of the intent of the RDF 1.1 specs agrees with Peter’s. There was never any intention in the design of RDF 1.1 to weaken or change this aspect of the RDF 1.0 specs. The key point is "BCP47 states that language tags are to be treated as case insensitive, so en-GB is not distinct from en-gb." Given this, stated in a normative standard external to RDF 1.1, the requirement that tags be *converted* to lowercase is overkill: rather, tags should be allowed to be written in any case (and this not altered needlessly, on general grounds of make-no-unnecessary-changes) but MUST be compared for identity ignoring case. So in this sense, the RDF 1.1 specs state the requirements slightly more accurately.

I believe this supports a test like:

  home/eric$ sparql -e 'ASK {FILTER (sameTerm("ab"@eN, "ab"@En))}'

Jena: yes
  http://sparql.org/books/sparql?query=ASK+%7BFILTER+%28sameTerm%28%22ab%22%40eN%2C+%22ab%22%40En%29%29%7D&output=text&stylesheet=%2Fxml-to-html.xsl
SWObjects: true
This should be true for every compliant engine.
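
For anyone who wants the comparison rule spelled out in code, here is a minimal Python sketch of "keep the tag as written, but ignore case when comparing". It is not how Jena or SWObjects implement it; the names LangString and same_term are made up for illustration, and it only models language-tagged strings (no other datatypes).

```python
from dataclasses import dataclass


@dataclass(frozen=True, eq=False)
class LangString:
    """Hypothetical model of a language-tagged string (illustration only)."""
    lexical: str
    lang: str  # stored exactly as written, e.g. "eN" or "en-GB"

    def _key(self):
        # BCP47 tags are ASCII, so lower() is enough for a
        # case-insensitive comparison key.
        return (self.lexical, self.lang.lower())

    def __eq__(self, other):
        if not isinstance(other, LangString):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Hash on the same key so equal terms land in the same bucket.
        return hash(self._key())


def same_term(a, b):
    """Term equality that ignores language-tag case."""
    return a == b


a = LangString("ab", "eN")
b = LangString("ab", "En")
print(same_term(a, b))   # the two spellings compare as the same term
print(a.lang, b.lang)    # each literal still carries its original spelling
```

Hashing on the lowercased tag matters as much as the equality test: without it, a set or dictionary of literals could hold both spellings as separate entries.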


What the spec doesn't say is what case you get back in a SELECT or CONSTRUCT:

  home/eric$ sparql -8 -e 'SELECT ?x ?y {BIND("ab"@eN AS ?x) BIND("ab"@En AS ?y)}'

SWObjects takes the first utterance and ignores subsequent respellings:
  ┌─────────┬─────────┐
  │ ?x      │ ?y      │
  │ "ab"@eN │ "ab"@eN │
  └─────────┴─────────┘

Jena does a very nice job of preserving the input data:
  ---------------------
  | x       | y       |
  =====================
  | "ab"@eN | "ab"@En |
  ---------------------
  http://sparql.org/books/sparql?query=SELECT+%3Fx+%3Fy+%7BBIND%28%22ab%22%40eN+AS+%3Fx%29+BIND%28%22ab%22%40En+AS+%3Fy%29%7D&output=text&stylesheet=%2Fxml-to-html.xsl
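
The two result tables are consistent with two different storage strategies. The Python sketch below is only a guess at the general shape, not the actual SWObjects or Jena code; InterningStore and preserve are hypothetical names. One strategy interns terms under a case-folded key, so the first spelling seen wins; the other hands back whatever spelling it was given.

```python
class InterningStore:
    """Hypothetical term table: first spelling of a language tag wins."""

    def __init__(self):
        self._terms = {}

    def intern(self, lexical, lang):
        # Look the term up under a lowercased tag; if it is new, remember
        # the spelling used this first time and return it from then on.
        key = (lexical, lang.lower())
        return self._terms.setdefault(key, (lexical, lang))


def preserve(lexical, lang):
    """Hypothetical alternative: return the term exactly as written."""
    return (lexical, lang)


store = InterningStore()
x = store.intern("ab", "eN")
y = store.intern("ab", "En")
print(x, y)                    # both carry the first spelling, "eN"
print(preserve("ab", "En"))    # spelling kept as given
```

Both strategies give the same answer to sameTerm as long as the comparison ignores case; they differ only in which spelling shows up in results.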


> Pat Hayes
> 
> > On Jan 12, 2017, at 9:45 AM, Peter F. Patel-Schneider <pfpschneider@gmail.com> wrote:
> > 
> > Summary:  "Hello"@en-gb and "Hello"@en-GB are term equal.  If they are not term
> > equal then they have different literal values.
> > 
> > 
> > The answers to your questions should be easily available from
> > https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal
> > but of course the situation is somewhat murky.
> > 
> > From that section a language-tagged string consists of three elements
> > 1/ a lexical form, which is a Unicode string
> > 2/ the datatype IRI http://www.w3.org/1999/02/22-rdf-syntax-ns#langString
> > 3/ a language tag which is well-formed according to section 2.2.9 of [BCP47].
> > 
> > So, is (the Turtle literal) "Hello"@en-GB a language-tagged string and, if so,
> > what is its language tag?  As en-GB meets all the requirements to be a
> > language tag in BCP47 "Hello"@en-GB is indeed a language-tagged string.  Its
> > lexical form is the Unicode string Hello.  BCP47 states that language tags are
> > to be treated as case insensitive, so en-GB is not distinct from en-gb.  The
> > language tag of "Hello"@en-GB is thus a case-insensitive string, i.e., one
> > where the Unicode character G is considered the same as the Unicode character
> > g.  The language tags en-gb and en-GB then compare equal character by character.
> > 
> > So there is a fairly strong argument to be made that "Hello"@en-gb and
> > "Hello"@en-GB are indeed term equal.  This is also a fairly strong argument
> > that RDF systems SHOULD (not just MAY) convert language tags to lower case.
> > 
> > I haven't said anything about the value space of language tags.  Indeed the
> > value space of language tags doesn't actually affect anything in RDF.   The
> > literal value for a language-tagged string is just "a pair consisting of its
> > lexical form and its language tag".  There is no conversion to value spaces
> > going on at all here.  So if "Hello"@en-gb and "Hello"@en-GB are not
> > term-equal then they have different literal values.  This is yet another
> > argument for their term equality.
> > 
> > peter
> > 
> > PS:  It shouldn't have been so difficult to tease this all out.  There should
> > have been tests in the RDF 1.1 test suite to cover this, but I can't find one.
> > 
> > 
> > 
> > On 01/12/2017 08:14 AM, Stian Soiland-Reyes wrote:
> >> January.. just the right time for some semantic questions, right?
> >> 
> >> I just asked on public-rdf-comments@ about "Are literal language tags compared
> >> in lowercase?" [1] and I think the conclusion was that RDF 1.1 is slightly
> >> ambiguous about this - depending on the reader.
> >> 
> >> Could RDF practitioners (in particular implementers) help me clarify
> >> if RDF Literal's language tags are case sensitive?
> >> 
> >> This came up as a potential bug in Commons RDF [2]
> >> but I guess it is a more general question.
> >> 
> >> 
> >> Example:
> >> 
> >>    "Hello"@en-gb
> >>    "Hello"@en-GB
> >> 
> >> Are they equal?
> >> 
> >> We can agree they are _value equal_, as the *value space* of language tags is
> >> lower case [3] and BCP47 says casing MUST NOT be taken to carry meaning [4].
> >> 
> >> But are these literals _term equal_? Well, they won't compare directly
> >> "character by character" according to RDF 1.1 [5]:
> >> 
> >>> Literal term equality: Two literals are term-equal (the same RDF literal) if
> >>> and only if the two lexical forms, the two datatype IRIs, and the two
> >>> language tags (if any) compare equal, character by character. 
> >> 
> >> 
> >> However they COULD in some implementations still be _term equal_, because [3]:
> >> 
> >>> Lexical representations of language tags may be converted to lower case.
> >> 
> >> And thus I think it is ambiguous how to compare language tags when determining
> >> if two RDF literals are term equal or not in RDF 1.1 - or at least there might
> >> not be consistent behaviour across implementations.
> >> 
> >> So which one is it? What's the actual practice for comparing such language tags, for
> >> instance in SPARQL queries or graph.contains() kind of operations?
> >> 
> >> Note that BCP47 [4] does recommend showing/preserving language tag casing
> >> according to its recommended casing style (e.g. "en-US") - so I think it's right if
> >> an RDF 1.1 implementation preserves the language tag -- (however it seems not
> >> currently permitted to magically transform them to the recommended style from
> >> lowercase!)
> >> 
> >> 
> >> Your views..? :-)
> >> 
> > 
> > 
> 
> ------------------------------------------------------------
> IHMC                                     (850)434 8903 home
> 40 South Alcaniz St.            (850)202 4416   office
> Pensacola                            (850)202 4440   fax
> FL 32502                              (850)291 0667   mobile (preferred)
> phayes@ihmc.us       http://www.ihmc.us/users/phayes
> 
> 
> 
> 
> 

-- 
-ericP

office: +1.617.599.3509
mobile: +33.6.80.80.35.59

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

There are subtle nuances encoded in font variation and clever layout
which can only be seen by printing this message on high-clay paper.
Received on Friday, 13 January 2017 09:25:34 UTC