Re: Language Tag Case Conflict (between RDF1.1 and BCP47) from Vladimir Alexiev on 2014-01-20 (public-rdf-comments@w3.org from January 2014)

From: Vladimir Alexiev <vladimir.alexiev@ontotext.com>
Date: Mon, 20 Jan 2014 11:02:20 +0200
To: <public-rdf-comments@w3.org>
Message-ID: <002f01cf15be$53d0dc00$fb729400$@alexiev@ontotext.com>

RDF1.1 talks about lowercasing of language tags:
http://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal
"Lexical representations of language tags may be converted to lower case. The value space of language tags is always in lower case."

Normalizing tags makes them easier to compare. Lowercasing is a CHEAP way to normalize. However, there is a BETTER way, which I'll call BCP47-normalization:

BCP47 2.1.1. "Formatting of Language Tags" says:
"Although case distinctions do not carry meaning in language tags, consistent formatting and presentation of language tags will aid users. The format of subtags in the registry is RECOMMENDED as the form to use in language tags. This format generally corresponds to the common conventions for the various ISO standards from which the subtags are derived."
and goes on to describe that:
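- script codes are in title case, i.e. first letter upper case, the rest lower case (e.g. 'Cyrl' Cyrillic).
- country (region) codes are in upper case (e.g. 'MN' Mongolia).
- all other subtags are in lower case (e.g. 'mn' Mongolian).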

I posted an algorithm to do BCP47-normalization (in Perl) in Sep 2013:
https://rt.cpan.org/Public/Ticket/Attachment/1267147/670949/lang_normalize.pl
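
For readers who don't want to open the attachment, here is a rough Python sketch of the BCP47 2.1.1 formatting rule. It is NOT a transcription of the Perl script above; the function name bcp47_format is made up for illustration. It only adjusts case and does not validate the tag against the registry. It assumes that "after singletons" in BCP47 means anywhere after the first one-character subtag (so extension and private-use subtags all stay in lower case).

```python
def bcp47_format(tag):
    """Case-format a language tag per BCP47 section 2.1.1 (no validation)."""
    out = []
    after_singleton = False  # assumption: once a singleton is seen, the rest stays lower case
    for i, sub in enumerate(tag.split("-")):
        if i == 0 or after_singleton:
            out.append(sub.lower())       # first subtag, extensions, private use
        elif len(sub) == 2:
            out.append(sub.upper())       # two-letter subtag: region code, e.g. 'MN'
        elif len(sub) == 4:
            out.append(sub.capitalize())  # four-letter subtag: script code, e.g. 'Cyrl'
        else:
            out.append(sub.lower())       # everything else, including the singleton itself
        if len(sub) == 1:
            after_singleton = True
    return "-".join(out)


print(bcp47_format("MN-cyrl-mn"))  # mn-Cyrl-MN
print(bcp47_format("EN-ca-X-CA"))  # en-CA-x-ca
```

The two sample tags follow the examples given in BCP47 itself ("en-CA-x-ca") and in the list above ('Cyrl', 'MN').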

There are issues posted against these implementations:
- Sesame RIO:
https://openrdf.atlassian.net/browse/SES-1659
https://openrdf.atlassian.net/browse/SES-1999
- perl RDF::Trine::Node::Literal:
https://rt.cpan.org/Public/Bug/Display.html?id=88964
- Note: Jena appears to store the lang tag as provided, which IMHO is better than storing as lowercase:
http://grepcode.com/file_/repo1.maven.org/maven2/org.apache.jena/jena-core/2.11.0/com/hp/hpl/jena/graph/impl/LiteralLabelImpl.java/?v=source

I therefore propose to change the above RDF1.1 text to:

"Lexical representations of language tags MAY be normalized according to BCP47 section 2.1.1, 'Formatting of Language Tags' (country codes in upper case, script codes capitalized, the rest in lower case).
Language tags MAY also be normalized by converting them entirely to lower case, but BCP47 normalization is preferred.
No matter which method is chosen, the semantics of language tags MUST NOT depend on case.
In particular, implementations MUST NOT store, as separate statements, two statements that differ only in the case of their language tags."
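
To illustrate the last requirement, here is a minimal Python sketch of one way a store could avoid keeping case-variant duplicates while still preserving the tag as first provided. The names literal_key and dedupe_literals are hypothetical, and literals are modelled as plain (lexical form, language tag) pairs rather than any real library's classes.

```python
def literal_key(lexical, lang):
    """Comparison key: the language tag is folded to lower case, the lexical form is not."""
    return (lexical, lang.lower())


def dedupe_literals(literals):
    """Keep the first-seen spelling of each literal, ignoring language tag case."""
    seen = {}
    for lexical, lang in literals:
        # setdefault keeps the earlier entry when the key is already present
        seen.setdefault(literal_key(lexical, lang), (lexical, lang))
    return list(seen.values())


literals = [("colour", "en-GB"), ("colour", "en-gb"), ("color", "en-US")]
print(dedupe_literals(literals))  # [('colour', 'en-GB'), ('color', 'en-US')]
```

Here the comparison key is lowercased, but the stored tag keeps whatever case was supplied first; a store could equally apply BCP47 formatting to the stored value.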

I also propose to say this about SPARQL lang():
http://www.w3.org/TR/rdf-sparql-query/#func-lang
"lang() MAY normalize the language tag as described in RDF 1.1 Concepts and Abstract Syntax sec 3.3 Literals.
It is recommended that lang() normalize the language tag according to BCP47 section 2.1.1, and not by converting it all to lower case."

Best regards!
--
Vladimir Alexiev, PhD, PMP
Lead, Data and Ontology Management Group
Ontotext Corp, www.ontotext.com
Sirma Group Holding, www.sirma.com
Email: vladimir.alexiev@ontotext.com, skype:valexiev1
Mobile: +359 888 568 132, SMS: 359888568132@sms.mtel.net
Landline: +359 (988) 106 084, Fax: +359 (2) 975 3226
Calendar: https://www.google.com/calendar/embed?src=vladimir%40sirma.bg

Received on Monday, 20 January 2014 09:02:45 UTC