RE: Need XML in multilingual prefLabels: choose XMLLiteral datatype or language tags?

Quick suggestion: generalise to proposals for optional extensions in a new version of SKOS?

-----Original Message-----
From: Christoph LANGE [mailto:c.lange@cs.bham.ac.uk] 
Sent: 30 August 2012 09:41
To: public-esw-thes@w3.org
Subject: Need XML in multilingual prefLabels: choose XMLLiteral datatype or language tags?

Dear all,

I am one of the developers of the Mathematics Subject Classification
(MSC) SKOS dataset (see http://msc2010.org/resources/MSC/2010/info/ and http://thedatahub.org/dataset/msc).

Some of the skos:prefLabels in this dataset contain MathML formulas, and we have labels in different languages.  Thus, if the RDF data model allowed it, we would prefer using (with MathML abbreviated as LaTeX for easier reading):

msc2010:11B57
   skos:prefLabel
     "Farey sequences; the sequences <math>{1^k, 2^k, \cdots}</math>"@en^^rdf:XMLLiteral .

But RDF does not allow a literal to carry both a language tag and a datatype, so it seems we have to choose between a rock and a hard place, and I'd like to ask you for advice on which to choose:

Choice 1: Don't use the rdf:XMLLiteral datatype, i.e. use "text <math/>"@language.

Con: We can no longer convey to applications the information that the label consists of well-balanced XML content.

Con: Applications that process the labels but don't expect XML content here will display XML source code.
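
To illustrate the first con: under Choice 1 a consumer only sees a language-tagged string and has to guess whether it holds markup. A minimal Python sketch of such a guess follows. It assumes the label is wrapped in a dummy <label> root element so that mixed content can be parsed; the function name looks_like_xml_content is made up for illustration and is not part of any SKOS tooling.

```python
import xml.etree.ElementTree as ET


def looks_like_xml_content(lexical_form: str) -> bool:
    """Guess whether a plain literal holds well-balanced XML with markup in it."""
    try:
        # "text <math/>" alone is not an XML document, so wrap it in a dummy root.
        root = ET.fromstring(f"<label>{lexical_form}</label>")
    except ET.ParseError:
        return False
    # Report True only when at least one element is present, not for bare text.
    return len(root) > 0


print(looks_like_xml_content("Farey sequences <math>x</math>"))  # True
print(looks_like_xml_content("Farey sequences"))                 # False
print(looks_like_xml_content("a < b"))                           # False
```

This is only a heuristic: a label that happens to parse is not guaranteed to have been intended as XML, which is exactly the information the datatype would have conveyed.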

Choice 2: Encode the language information into the XML, i.e. "text <math xml:lang='en'/>"^^rdf:XMLLiteral.

Pro: Applications that don't know XML will fail (as they should).

Con: In the multilingual case, a skos:Concept would have multiple datatyped skos:prefLabels with "no language".  This violates the convention that skos:prefLabel is only used with plain literals (http://www.w3.org/TR/skos-reference/#L2655).  It _may_ also violate the integrity condition S14 that "a resource has no more than one value of skos:prefLabel per language tag" (http://www.w3.org/TR/skos-reference/#L1567; however "no language tag" is not "a language tag").

Con: Slows down SPARQL queries: Filtering by language would have to be done by treating the label as text and matching it against a regular expression such as "xml:lang='en'".

Con: As the majority of labels don't contain formulas, we would most reasonably represent them as plain literals, thus ending up with a mixture of language-tagged plain literals and XML literals.
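
The three cons of Choice 2 can be made concrete with a small Python sketch. It models a label as a hypothetical (lexical form, language tag, datatype) triple rather than using any real RDF library, and it assumes the convention that the first element carrying xml:lang decides the language. The names Label and effective_language are illustrative only.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import NamedTuple, Optional

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
XML_LITERAL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"


class Label(NamedTuple):
    lexical: str
    lang: Optional[str] = None      # language tag of a plain literal
    datatype: Optional[str] = None  # datatype IRI of a typed literal


def effective_language(label: Label) -> Optional[str]:
    """Language of a label under the mixed scheme of Choice 2."""
    if label.datatype != XML_LITERAL:
        return label.lang
    # Assumed convention: the first element carrying xml:lang decides.
    root = ET.fromstring(f"<label>{label.lexical}</label>")
    for element in root.iter():
        if XML_LANG in element.attrib:
            return element.attrib[XML_LANG]
    return None


labels = [
    Label("Binomial coefficient <math xml:lang='en'>C</math>", datatype=XML_LITERAL),
    Label("Coefficient binomial <math xml:lang='fr'>C</math>", datatype=XML_LITERAL),
]

# What a check based on language tags sees: both labels have "no language".
tags_seen = Counter(label.lang for label in labels)
# What a consumer has to compute instead, by looking inside the markup.
langs = [effective_language(label) for label in labels]
print(tags_seen, langs)
```

A plain literal such as Label("Farey sequences", lang="en") goes through the cheap branch, so any consumer of the mixed dataset needs both code paths.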

Note that we absolutely need the formulas in the labels; there is no way of separating them out of the literals into some auxiliary structures, for the following reasons:

* Some labels contain more than one mathematical formula, scattered over multiple places in the text.
* While this is not yet the case in the labels of the MSC dataset and their translations, note that mathematical notation varies with language; consider "Binomial coefficient <math>\binom{n}{k}</math>"@en vs. "Coefficient binomial <math>C^k_n</math>"@fr.

And there is no way of doing without MathML.  We have 23 labels that can't be expressed by just using Unicode, and some more that could theoretically be expressed in Unicode but where practically available fonts don't support it.

What would you recommend?

Cheers, and thanks in advance,

Christoph

--
Christoph Lange, School of Computer Science, University of Birmingham http://cs.bham.ac.uk/~langec, Skype duke4701

→ Building & Exploring Web Based Environments.  Seville, Spain, 27 Jan–1 Feb 2013.  Deadline 22 Sep.
http://iaria.org/conferences2013/WEB13.html

Received on Thursday, 30 August 2012 08:49:59 UTC