Fwd: Language Tag Case Conflict (between RDF1.1 and BCP47)

Taking this to the WG list.

From a technical point of view, this is a reasonable comment and might be worth considering; at least a pointer to the BCP 47 normalisation strategy would be helpful for implementers.

But from a process point of view, the window for RDF 1.1 might be closed? What options still exist at this point?

Best,
Richard



Begin forwarded message:

> Resent-From: public-rdf-comments@w3.org
> From: "Vladimir Alexiev" <vladimir.alexiev@ontotext.com>
> Subject: Re: Language Tag Case Conflict (between RDF1.1 and BCP47)
> Date: 20 January 2014 09:02:20 GMT
> To: <public-rdf-comments@w3.org>
> Reply-To: <vladimir.alexiev@ontotext.com>
> 
> RDF1.1 talks about lowercasing of language tags:
> http://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal
> "Lexical representations of language tags may be converted to lower case. The value space of language tags is always in lower case."
> 
> Normalizing tags makes them easier to compare. Lowercasing is a CHEAP way to normalize. However, there is a BETTER way, which I'll call BCP47-normalization:
> 
> BCP47 2.1.1. "Formatting of Language Tags" says:
> "Although case distinctions do not carry meaning in language tags, consistent formatting and presentation of language tags will aid users.  The format of subtags in the registry is RECOMMENDED as the form to use in language tags.  This format generally corresponds to the common conventions for the various ISO standards from which the subtags are derived."
> and goes on to describe that:
> - script codes are in title case (e.g. 'Cyrl' for Cyrillic).
> - country (region) codes are in upper case (e.g. 'MN' for Mongolia).
> 
> I posted an algorithm to do BCP47-normalization (in Perl) in Sep 2013:
> https://rt.cpan.org/Public/Ticket/Attachment/1267147/670949/lang_normalize.pl
> 
> There are issues posted against these implementations:
> - Sesame RIO:
> https://openrdf.atlassian.net/browse/SES-1659
> https://openrdf.atlassian.net/browse/SES-1999
> - Perl RDF::Trine::Node::Literal:
> https://rt.cpan.org/Public/Bug/Display.html?id=88964
> - Note: Jena appears to store the lang tag as provided, which IMHO is better than storing as lowercase:
> http://grepcode.com/file_/repo1.maven.org/maven2/org.apache.jena/jena-core/2.11.0/com/hp/hpl/jena/graph/impl/LiteralLabelImpl.java/?v=source
> 
> I therefore propose to change the above RDF1.1 text to:
> 
> "Lexical representations of language tags MAY be normalized according to BCP47 section 2.1.1, 'Formatting of Language Tags' (country codes in upper case, script codes in title case, the rest in lower case).
> Language tags MAY also be normalized by converting them entirely to lower case, but BCP47 normalization is preferred.
> No matter which method is chosen, the semantics of language tags MUST NOT depend on case.
> In particular, implementations MUST NOT store two statements that differ only in the case of their language tags as separate statements."
> 
> I also propose to say this about SPARQL lang()
> http://www.w3.org/TR/rdf-sparql-query/#func-lang
> "lang() MAY normalize the language tag as described in RDF 1.1 Concepts and Abstract Syntax, section 3.3 Literals.
> It is recommended that lang() normalize the language tag according to BCP47 section 2.1.1, and not by converting it all to lower case."
> 
> Best regards!
> --
> Vladimir Alexiev, PhD, PMP
> Lead, Data and Ontology Management Group
> Ontotext Corp, www.ontotext.com
> Sirma Group Holding, www.sirma.com
> Email: vladimir.alexiev@ontotext.com, skype:valexiev1  
> Mobile: +359 888 568 132, SMS: 359888568132@sms.mtel.net
> Landline: +359 (988) 106 084, Fax: +359 (2) 975 3226
> Calendar: https://www.google.com/calendar/embed?src=vladimir%40sirma.bg
> 
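
For implementers, the BCP47 section 2.1.1 formatting described in the forwarded message can be sketched without access to the subtag registry. The following is only an illustrative sketch in Python, not the Perl script linked above; the function names format_language_tag and same_language_tag are hypothetical. It assumes the input is already a well-formed tag, and it reads "after singletons" as "anywhere after the first single-character subtag", which is one interpretation of the BCP47 wording.

```python
def format_language_tag(tag: str) -> str:
    """Format a language tag in the style of BCP47 section 2.1.1.

    Assumption: the tag is well-formed; no validation is done here.
    """
    subtags = tag.lower().split("-")
    formatted = []
    seen_singleton = False  # set once an extension or private-use marker appears
    for index, subtag in enumerate(subtags):
        if len(subtag) == 1:
            # Singleton such as 'x' or 'u': everything after it stays lower case
            # (this reading of "after singletons" is an assumption).
            seen_singleton = True
        elif index > 0 and not seen_singleton:
            if len(subtag) == 2:
                subtag = subtag.upper()        # region code, e.g. 'MN'
            elif len(subtag) == 4:
                subtag = subtag.capitalize()   # script code, e.g. 'Cyrl'
        formatted.append(subtag)
    return "-".join(formatted)


def same_language_tag(a: str, b: str) -> bool:
    """Compare two tags without regard to case."""
    return a.lower() == b.lower()


if __name__ == "__main__":
    print(format_language_tag("mn-cyrl-mn"))
    print(same_language_tag("en-US", "EN-us"))
```

Whichever form is stored or displayed, comparison stays case-insensitive, so the choice of normalization does not change which literals are considered equal.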

Received on Monday, 20 January 2014 10:13:46 UTC