Need XML in multilingual prefLabels: choose XMLLiteral datatype or language tags?

Dear all,

I am one of the developers of the Mathematics Subject Classification 
(MSC) SKOS dataset (see http://msc2010.org/resources/MSC/2010/info/ and 
http://thedatahub.org/dataset/msc).

Some of the skos:prefLabels in this dataset contain MathML formulas, and 
we have labels in different languages.  Thus, if the RDF data model 
allowed it, we would prefer using (with MathML abbreviated as LaTeX for 
easier reading):

msc2010:11B57
   skos:prefLabel
     "Farey sequences; the sequences <math>{1^k, 2^k, 
\cdots}</math>"@en^^rdf:XMLLiteral .

However, an RDF literal can carry either a language tag or a datatype, 
but not both.  So it seems we are caught between a rock and a hard 
place, and I'd like to ask for your advice on which to choose:

Choice 1: Don't use the rdf:XMLLiteral datatype, i.e. use "text 
<math/>"@language.

Con: We can no longer convey to applications the information that the 
label consists of well-balanced XML content.

Con: Applications that process the labels but don't expect XML content 
here will display XML source code.
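
To illustrate the first Con: without the datatype, an application can 
only guess whether a plain literal holds XML.  A minimal sketch of such 
a guess, using Python's standard library (the function names and the 
<wrapper> element are hypothetical, not part of our tooling):

```python
import xml.etree.ElementTree as ET


def parse_as_content(label):
    """Parse a label as XML content by wrapping it in a dummy root element.

    Returns the dummy root, or None if the label is not well-formed.
    """
    try:
        return ET.fromstring("<wrapper>" + label + "</wrapper>")
    except ET.ParseError:
        return None


def is_balanced_xml(label):
    """True if the label is well-balanced XML content."""
    return parse_as_content(label) is not None


def contains_markup(label):
    """True if the label is well-balanced and has at least one element."""
    root = parse_as_content(label)
    return root is not None and len(root) > 0
```

A label without any markup, such as "Farey sequences", also passes 
is_balanced_xml, whereas a plain label containing a bare "<" fails it.  
So the check remains a heuristic; the datatype would state the intent 
explicitly.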

Choice 2: Encode the language information into the XML, i.e. "text <math 
xml:lang='en'/>"^^rdf:XMLLiteral.

Pro: Applications that don't know XML will fail (as they should).

Con: In the multilingual case, a skos:Concept would have multiple 
datatyped skos:prefLabels with "no language".  This violates the 
convention that skos:prefLabel is only used with plain literals 
(http://www.w3.org/TR/skos-reference/#L2655).  It _may_ also violate the 
integrity condition S14 that "a resource has no more than one value of 
skos:prefLabel per language tag" 
(http://www.w3.org/TR/skos-reference/#L1567; however "no language tag" 
is not "a language tag").
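
The ambiguity in S14 can be made concrete with a small check.  This is 
only a sketch under my own assumptions: labels are given as (concept, 
text, language) triples with None for "no language tag", and the 
count_untagged switch is a hypothetical way of choosing between the two 
readings of S14:

```python
from collections import Counter


def s14_violations(pref_labels, count_untagged=True):
    """Return the (concept, lang) pairs that have more than one prefLabel.

    pref_labels: iterable of (concept, text, lang); lang is None for
        literals without a language tag (e.g. XML literals).
    count_untagged: hypothetical switch. If True, "no language tag" is
        treated like a language tag; if False, untagged labels are ignored.
    """
    counts = Counter(
        (concept, lang)
        for concept, _text, lang in pref_labels
        if lang is not None or count_untagged
    )
    violations = [key for key, n in counts.items() if n > 1]
    # Sort with None mapped to "" so that mixed keys stay comparable.
    return sorted(violations, key=lambda key: (key[0], key[1] or ""))
```

With two untagged XML-literal labels on one concept, the strict reading 
(count_untagged=True) reports a violation, while the lenient reading 
reports none.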

Con: Slows down SPARQL queries: Filtering by language would have to be 
done by treating the label as text and matching it against a regular 
expression such as "xml:lang='en'".
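
Besides being slow, the textual match is brittle.  The following Python 
sketch (standard library only; the function names and the <wrapper> 
element are hypothetical) contrasts a regular-expression test in the 
spirit of a SPARQL regex() filter with actually parsing the literal:

```python
import re
import xml.etree.ElementTree as ET

# Expanded name under which ElementTree reports the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def lang_by_regex(label, lang):
    """Textual test for xml:lang='<lang>' with either quote style."""
    pattern = "xml:lang=['\"]" + re.escape(lang) + "['\"]"
    return re.search(pattern, label) is not None


def lang_by_parsing(label):
    """Return the first xml:lang value found in the label, or None."""
    root = ET.fromstring("<wrapper>" + label + "</wrapper>")
    for element in root.iter():
        lang = element.get(XML_LANG)
        if lang is not None:
            return lang
    return None
```

Both functions agree on "text <math xml:lang='en'/>", but the regular 
expression misses an equally valid spelling such as xml:lang = 'en' 
(with whitespace around the equals sign), which the parser handles.  
With language-tagged plain literals, neither step would be needed.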

Con: As the majority of labels don't contain formulas, we would most 
reasonably represent them as plain literals, thus ending up with a 
mixture of language-tagged plain literals and XML literals.

Note that we absolutely need the formulas in the labels; there is no way 
of separating them out of the literals into some auxiliary structures, 
for the following reasons:

* Some labels contain more than one mathematical formula, scattered over 
multiple places in the text.
* While this is not yet the case in the labels of the MSC dataset and 
their translations, note that mathematical notation varies with 
language; consider "Binomial coefficient <math>\binom{n}{k}</math>"@en 
vs. "Coefficient binomial <math>C^k_n</math>"@fr.

And there is no way of doing without MathML.  We have 23 labels that 
can't be expressed by just using Unicode, and some more that could 
theoretically be expressed in Unicode but where practically available 
fonts don't support it.

What would you recommend?

Cheers, and thanks in advance,

Christoph

-- 
Christoph Lange, School of Computer Science, University of Birmingham
http://cs.bham.ac.uk/~langec, Skype duke4701

→ Building & Exploring Web Based Environments.  Seville, Spain, 27 Jan–
   1 Feb 2013.  Deadline 22 Sep. 
http://iaria.org/conferences2013/WEB13.html

Received on Thursday, 30 August 2012 08:41:53 UTC