- From: Vladimir Alexiev <vladimir.alexiev@ontotext.com>
- Date: Mon, 20 Jan 2014 11:02:20 +0200
- To: <public-rdf-comments@w3.org>

RDF 1.1 talks about lowercasing of language tags:
http://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal
"Lexical representations of language tags may be converted to lower case. The value space of language tags is always in lower case."

Normalizing tags makes them easier to compare. Lowercasing is a CHEAP way to normalize. However, there is a BETTER way, which I'll call BCP47-normalization.

BCP47 section 2.1.1 "Formatting of Language Tags" says:
"Although case distinctions do not carry meaning in language tags, consistent formatting and presentation of language tags will aid users. The format of subtags in the registry is RECOMMENDED as the form to use in language tags. This format generally corresponds to the common conventions for the various ISO standards from which the subtags are derived."
and goes on to describe that:
- script codes are in title case, i.e. first letter capitalized (e.g. 'Cyrl' Cyrillic).
- country codes are in upper case (e.g. 'MN' Mongolia).

I posted an algorithm to do BCP47-normalization (in Perl) in Sep 2013:
https://rt.cpan.org/Public/Ticket/Attachment/1267147/670949/lang_normalize.pl

There are issues posted against these implementations:
- Sesame RIO:
  https://openrdf.atlassian.net/browse/SES-1659
  https://openrdf.atlassian.net/browse/SES-1999
- Perl RDF::Trine::Node::Literal:
  https://rt.cpan.org/Public/Bug/Display.html?id=88964

Note: Jena appears to store the lang tag as provided, which IMHO is better than storing it as lowercase:
http://grepcode.com/file_/repo1.maven.org/maven2/org.apache.jena/jena-core/2.11.0/com/hp/hpl/jena/graph/impl/LiteralLabelImpl.java/?v=source

I therefore propose to change the above RDF 1.1 text to:
"Lexical representations of language tags MAY be normalized according to BCP47 section 2.1.1 'Formatting of Language Tags' (country codes in upper case, script codes capitalized, the rest in lower case). Language tags MAY also be normalized by converting all to lower case, but BCP47 normalization is preferred.
No matter which method is chosen, the semantics of language tags MUST NOT depend on case. In particular, implementations MUST NOT store two statements that differ only in the case of their language tags as separate statements."

I also propose to say this about SPARQL lang():
http://www.w3.org/TR/rdf-sparql-query/#func-lang
"lang() MAY normalize the language tag as described in RDF 1.1 Concepts and Abstract Syntax sec 3.3 Literals. It is recommended that lang() normalize the language tag according to BCP47 section 2.1.1, rather than by converting it all to lower case."

Best regards!
--
Vladimir Alexiev, PhD, PMP
Lead, Data and Ontology Management Group
Ontotext Corp, www.ontotext.com
Sirma Group Holding, www.sirma.com
Email: vladimir.alexiev@ontotext.com, skype:valexiev1
Mobile: +359 888 568 132, SMS: 359888568132@sms.mtel.net
Landline: +359 (988) 106 084, Fax: +359 (2) 975 3226
Calendar: https://www.google.com/calendar/embed?src=vladimir%40sirma.bg
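
For illustration, the BCP47 section 2.1.1 formatting convention referred to in this message can be sketched in a few lines of Python. This is not the Perl script linked in the message; the function name `bcp47_format` is hypothetical, and the sketch assumes the registry-free rule from BCP47 2.1.1: everything is lower case, except that two-letter subtags become upper case and four-letter subtags become title case when they are neither the first subtag nor located after a singleton. Reading "after a singleton" as "anywhere after the first singleton" is an interpretation made here, not something the message states. The sketch does not validate the tag.

```python
def bcp47_format(tag: str) -> str:
    """Format a language tag per the case convention of BCP47 section 2.1.1.

    Assumption: once a one-character subtag (singleton such as 'x' or 'u')
    has been seen, all following subtags are kept in lower case.
    """
    formatted = []
    seen_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        lowered = subtag.lower()
        if index == 0 or seen_singleton:
            # The primary language subtag and anything after a singleton
            # (extensions, private use) stay in lower case.
            formatted.append(lowered)
        elif len(lowered) == 2:
            # Two-letter subtag in this position: region code, upper case.
            formatted.append(lowered.upper())
        elif len(lowered) == 4:
            # Four-letter subtag in this position: script code, title case.
            formatted.append(lowered.capitalize())
        else:
            formatted.append(lowered)
        if len(lowered) == 1:
            seen_singleton = True
    return "-".join(formatted)


def same_language_tag(a: str, b: str) -> bool:
    """Compare two tags case-insensitively, as case carries no meaning."""
    return a.lower() == b.lower()


if __name__ == "__main__":
    for example in ("mn-cyrl-mn", "EN-us", "AZ-LATN-X-LATN", "sgn-be-fr"):
        print(example, "->", bcp47_format(example))
```

Whichever output form is chosen, comparison stays case-insensitive (as in `same_language_tag`), so a store that keeps or emits the BCP47 form still treats "en-us" and "en-US" as the same tag.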
Received on Monday, 20 January 2014 09:02:45 UTC