Summary of strings, markup, and language tagging in RDF (resend)

I was wondering why I was not getting any answers on this
message. Now I just discovered that I had forgotten to copy
the RDF Core WG. So I'm resending it.    Martin.


------------------------------------------------------------


This is my attempt to summarize the issues in RDF with language
tagging for XML Literals, and the related issues surrounding the
representation of text strings in general.

The earlier part of this mail is mainly a summary/intro to RDF
for the I18N WG. The later part is an analysis of the problem
based on my opinions and an extensive discussion with Ralph
Swick (coauthor of the first RDF spec) and some minutes of
TimBL dropping into that conversation. I have also copied
Misha Wolf, former chair of the I18N WG and member of the original
RDF WG. There is a chance that Misha may supply some important
knowledge about the history of language tagging in RDF.
(even if you find the start boring, you may want to read
later parts).
[if you reply, please don't copy the whole mail!]


(super-short) Intro to RDF (for i18n)
               ============

RDF (Resource Description Framework) is a technology to represent
data in a very simple and straightforward way so that it can
easily be reused. Although there are many other uses of RDF,
I will use an example here that illustrates the name of the
technology, namely an example where RDF is used to describe
a resources, in particular a Web page. It is a very simple
single statement that associates W3C's Web page,
http://www.w3.org/, with the text "World Wide Web Consortium",
saying that the later is the 'title' of the former
(title being defined as the title according to Dublin
Core metadata conventions). To have a look at the example,
please go to:

http://jiggles.w3.org/servlet/ARPServlet?RDF=%3C%3Fxml+version%3D%221.0%22%3 
F%3E%0D%0A%0D%0A%3Crdf%3ARDF+xmlns%3Ardf%3D%22http%3A%2F%2Fwww.w3.org%2F1999 
%2F02%2F22-rdf-syntax-ns%23%22%0D%0A++++++++xmlns%3Adc%3D%22http%3A%2F%2Fpur 
l.org%2Fdc%2Felements%2F1.1%2F%22%3E+%0D%0A+%3Crdf%3ADescription+rdf%3Aabout 
%3D%22http%3A%2F%2Fwww.w3.org%2F%22%3E+%0D%0A++%3Cdc%3Atitle%3EWorld+Wide+We 
b+Consortium%3C%2Fdc%3Atitle%3E+%0D%0A+%3C%2Frdf%3ADescription%3E+%0D%0A%3C% 
2Frdf%3ARDF%3E%0D%0A

Or go to http://www.w3.org/RDF/Validator/ and just click on the
'parse RDF' button. You will then get a result page that shows
the three main representations of RDF:

- As triples: Each statement is listed as a triple of subject,
   predicate, object.
- As a (labeled directed) graph.
- Encoded in XML. This is usually a bit confusing, in particular
   because of the high overhead for this small example.

In some way, this is really all you need to know about RDF, in
practice, there are just more than one statement (which means
more triples, more nodes and arks in the graph, more XML,...).
One advantage is that different collections of RDF statements
can easily be merged, which means that even very small pieces
of RDF have their value.


In what follows, we will concentrate on what RDF calls 'literals',
the square boxes in the graphs. The literal we have used above
is the text string "World Wide Web Consortium". In table/triple
terms, only objects can be literals. This means that literals
are used to say things about resources, but cannot be talked
about (at least not directly).


The orgininal RDF spec (http://www.w3.org/TR/1999/REC-rdf-syntax-19990222)
had very few variation for literals. Besides them being plain
text strings, it was also possible that they included XML markup.
The purpose of this was to allow to include markup in cases
such as titles or descriptions, among else (but not only)
to address I18N needs for marking up stretches of different
language, ruby, bidirectionality,... This raised the question
of what to do with xml:lang. This is what the original RDF spec
says:

 >>>>
The xml:lang attribute may be used as defined by [XML] to associate
a language with the property value. There is no specific data model
representation for xml:lang (i.e., it adds no triples to the data model);
the language of a literal is considered by RDF to be a part of the literal.
An application may ignore language tagging of a string. All RDF applications
must specify whether or not language tagging in literals is significant;
that is, whether or not language is considered when performing string
matching or other processing.
 >>>>

While this is not very explicit, it was always assumed that, RDF
(in XML) being an XML application, xml:lang would behave as it
behaved in any other XML application, namely that it was inherited.
As an example, in the following book title example:

<stockNS:booksInStock xml:lang='en'>
     ...
     <dc:title>A Midsummer Night's Dream</dc:title>
     ...
     <dc:title xml:lang='it'>La traviata</dc:title>
     ...
     <dc:title rdf:parseType='Literal'><html:span xml:lang='fr'
         >La Boheme</html:span> in Full Score</dc:title>


"A Midsummer Night's Dream" would be 'en', "La traviata" would
be 'it', "La Boheme" would be 'fr', and "in Full Score" would
be 'en'.

This example also shows the use of "parseType='Literal'" to indicate
that the element is to be parsed as a literal, rather than as nested
RDF. [For mixed content, the RDF parser would theoretically have
been able to figure this out, because nested RDF cannot be mixed,
but this would have complicated processing too much.]

Please note that according to Ralph, parseType='Literal' was only
an indication of how to parse the element content, and didn't
indicate a difference in underlying nature. I.e. changing
     <dc:title>A Midsummer Night's Dream</dc:title>
to
     <dc:title rdf:parseType='Literal'>A Midsummer Night's Dream</dc:title>
would still result in the same text string.


A while ago, the RDF Core WG was chartered to rework and clean
up the RDF specification, and they have worked hard on this leading
to the last calls at the start of this year.

The two areas they have worked on that are relevant for this
discussion are:

- The handwaving in the original RDF specification (cited above) that
   RDF applications can either consider or not consider language
   information when comparing literals.
   It is always possible that an application decides to consider
   two strings that differ only in language as equivalent if the
   RDF machinery includes language information (and therefore in
   effect treats these two strings as different). However, this
   does not work the other way round. Therefore, RDF choose to
   treat literals with different languages as different.

- The addition of typed literals, e.g. integers, dates, and so on,
   based on the model of XML Schema Datatypes
   (http://www.w3.org/TR/xmlschema-2/), i.e.:
   - distinction between lexical space and value space
   - locale/language independent representation
   Datatypes other than those of XML Schema can be used, but only
   if they can fit the model of XML Schema Datatypes. Reflecting
   the principle of locale/language-independence, for such datatypes,
   xml:lang is irrelevant.


The RDF Core WG choose to use datatypes to model XML literals
(datatype http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral,
or rdf:XMLLiteral for short), but to create an exception in that
for XML literals, xml:lang is relevant. The RDF Core WG recently
has decided to remove this exception, which lead to the start
of this discussion.

XML Schema datatypes also added the datatype 'string'
(http://www.w3.org/2001/XMLSchema#string, or xsd:string for short).
In the last call draft, this results in the following literal
typology:

      plain literal: inherits xml:lang
      XML literal: inherits xml:lang
      XML Schema string: xml:lang is irrelevant
      [other datatyped literals: xml:lang is irrelevant]

[a question from Ralph to RDF Core: are plain literals
characterized by the absence of a type (i.e. type=no_type),
or just by the fact that their type is (temporarily, maybe)
unknown?]


[From here on, to simplily, I'll put xml:lang directly on the
element in question, and assume that there is none higher up in
the tree, although in practice, it can always be inherited from
higher up, or removed by using xml:lang=''.]

According to this, the following all are treated as different
in the (new) RDF model:

   a plain literal (no language)
     <dc:title>A Midsummer Night's Dream</dc:title>
   an XML literal without markup nor language
     <dc:title rdf:parseType='Literal'>A Midsummer Night's Dream</dc:title>
   an XML Schema string:
     <dc:title rdf:datatype='http://www.w3.org/2001/XMLSchema#string'
        >A Midsummer Night's Dream</dc:title>

Also the following are treated as different
in the (new) RDF model (and different from all the above):

   a plain literal with language:
     <dc:title xml:lang='en'>A Midsummer Night's Dream</dc:title>
   an XML literal with language but without markup:
     <dc:title xml:lang='en' rdf:parseType='Literal'
        >A Midsummer Night's Dream</dc:title>

The current last call draft treats the following differently:

   an XML literal without markup nor language
     <dc:title rdf:parseType='Literal'>A Midsummer Night's Dream</dc:title>
   an XML literal with language but without markup:
     <dc:title xml:lang='en' rdf:parseType='Literal'
        >A Midsummer Night's Dream</dc:title>
   an XML literal with another language:
     <dc:title xml:lang='en-gb' rdf:parseType='Literal'
        >A Midsummer Night's Dream</dc:title>

However, the newest change by the RDF Core WG ignores (external)
xml:lang on XML literals, and therefore all the above become
the same. In order to be able to express that in:

     <dc:title rdf:parseType='Literal'><html:span xml:lang='fr'
         >La Boheme</html:span> in Full Score</dc:title>

'in Full Score' is actually 'en', the RDF Core WG proposed that

     <dc:title rdf:parseType='Literal'><html:span xml:lang='en'
         ><html:span xml:lang='fr'
         >La Boheme</html:span> in Full Score</html:span></dc:title>

could be used.

This situation is not at all satisfactory from the viewpoint
of I18N because:
- We have worked hard to eliminate artificial differences between
   text strings that are essentially the same:
   - by basing XML and RDF on Unicode, and therefore eliminating
     differences in character encoding.
   - by working on normalization (NFC) to reduce or avoid accidental
     differences based on remaining encoding choices in Unicode
   It would be very bad if after all that work, we were left with
   gratuitously different ways of representing textual strings due
   to idiosyncrasies of a type system.

- Language tagging is an important aspect of internationalization.
   Also, small-scale markup is important for internationalization
   (multilanguage strings, bidirectionality, ruby, glyph variants,...).
   Both are in many ways natural extensions of plain text strings
   as soon as markup is available.

   The current handling of XML literal strings without any actual
   markup, as well as the recent change to ignore xml:lang on XML
   literals, break this natural extension.

   In addition, the recent change to ignore xml:lang on XML
   literals makes language tagging more tedious in the prevalent
   case of monolingual or mostly monolingual data.



In our discussion, Ralph came up with some nice ideas:

It looks like we have the following things to actually
represent and work with:

1) plain text strings without anything attached
2) text with language and/or markup
3) 'real' datatypes such as integer, date,...

Now here is how Ralph proposed to map the various XML
phenomenas to the above three categories:

   a plain literal (no language)
         <dc:title>A Midsummer Night's Dream</dc:title>
      absence of xml:lang (or alternatively xml:lang='') => 1)

   an XML literal without markup nor language
         <dc:title rdf:parseType='Literal'>A Midsummer Night's Dream</dc:title>
      absence of xml:lang or markup => 1)

   an XML Schema string:
        <dc:title rdf:datatype='http://www.w3.org/2001/XMLSchema#string'
           >A Midsummer Night's Dream</dc:title>
      xsd:string => 1)

   a plain literal with language:
        <dc:title xml:lang='en'>A Midsummer Night's Dream</dc:title>
      xml:lang => 2) (with 'en')

   an XML literal with language but without markup:
        <dc:title xml:lang='en' rdf:parseType='Literal'
           >A Midsummer Night's Dream</dc:title>
      xml:lang => 2) (with 'en')


This would solve the current problems, and would better model
the reality of the actual data. Of course, other solutions
may be available, too.


Regards,    Martin.

Received on Thursday, 5 June 2003 14:54:52 UTC