Re: Summary of strings, markup, and language tagging in RDF (resend) from Martin Duerst on 2003-06-26 (w3c-rdfcore-wg@w3.org from June 2003)

From: Martin Duerst <duerst@w3.org>
Date: Thu, 26 Jun 2003 18:38:44 -0400
To: Dan Connolly <connolly@w3.org>
Cc: w3c-i18n-ig@w3.org, "Ralph R. Swick" <swick@w3.org>, misha.wolf@reuters.com, Tim Berners-Lee <timbl@w3.org>, w3c-rdfcore-wg@w3.org, lilley@w3.org
Message-Id: <4.2.0.58.J.20030626173942.00a86c40@localhost>

Hello Dan,

Many thanks for your replies.

At 09:03 03/06/26 -0500, Dan Connolly wrote:

>Hi Martin and company,
>
>The RDF Core WG discussed this stuff last Friday
>http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2003Jun/0156.html
>and I took the ball to get back to you.
>
>First, to clarify a bit...
>
>On Thu, 2003-06-05 at 13:54, Martin Duerst wrote:
>[...]
> > The current last call draft treats the following differently:
> >
> >    an XML literal without markup nor language
> >      <dc:title rdf:parseType='Literal'>A Midsummer Night's Dream</dc:title>
> >    an XML literal with language but without markup:
> >      <dc:title xml:lang='en' rdf:parseType='Literal'
> >         >A Midsummer Night's Dream</dc:title>
> >    an XML literal with another language:
> >      <dc:title xml:lang='en-gb' rdf:parseType='Literal'
> >         >A Midsummer Night's Dream</dc:title>
>
>The RDF specs specify two relationships:
>(1) between an XML document and an RDF graph,
>aka hunk of syntax composed of literal terms, URI terms,
>bnode terms, and the like
>
>(2) between those terms and what they denote
>in an interpretation.
>
>Indeed, per the last call specs, those three are
>both treated as different RDF graphs and
>the terms in them denote different things.
>
>It would be useful to know if making the denotations
>work out to be the same would suffice, or if
>your requirement is actually that the graphs
>work out the same.

I think it is very difficult for us to answer such a question.
I guess nobody in the I18N group is familiar with the concept
of 'denotation', and just trying to guess what it might be is
dangerous. If you could describe what effect this difference
has on test cases, on applications, and so on, that would
help us a lot.


For the above example, I'm not sure whether you copied the wrong
piece, or whether there was some misunderstanding. I explicitly
wrote "the current last call draft". Unfortunately, the RDF Core
WG some weeks ago decided that all the three above examples,
together with others such as

       <dc:title xml:lang='fr' rdf:parseType='Literal'
          >A Midsummer Night's Dream</dc:title>
       <dc:title xml:lang='it' rdf:parseType='Literal'
          >A Midsummer Night's Dream</dc:title>

and so on, would all be the same, i.e. that for rdf:parseType='Literal',
any higher-up language information would effectively be ignored.

If the RDF Core WG has reversed the decision they took a few
weeks ago, then that solves one of our issues (easily the most
important one). Can you please confirm or deny?


I don't remember exactly with whom and when, but we also had some
discussions (among others, with Chris Lilley in Budapest) that there
would ideally be some kind of equivalence based on the structure
of the language tag, e.g. that it should be possible to conclude

       <dc:title xml:lang='en' rdf:parseType='Literal'
          >A Midsummer Night's Dream</dc:title>

from
       <dc:title xml:lang='en-gb' rdf:parseType='Literal'
          >A Midsummer Night's Dream</dc:title>

but not the other way round. But maybe this is asking too much
at the level of RDF. Please note that we have not asked for this
one-way-but-not-the-other behavior in any of our comments.
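
To illustrate the one-way relationship described above, here is a minimal
sketch in Python. It assumes that "related" means that the more general tag
is a whole-subtag prefix of the more specific one. The function name
lang_entails is made up for this illustration and is not taken from any
RDF or I18N specification.

```python
def lang_entails(specific: str, general: str) -> bool:
    """Return True if text tagged `specific` may also be read as `general`.

    Assumption for this sketch: `general` must match `specific` on a
    whole-subtag prefix, compared case-insensitively.
    """
    s = specific.lower().split("-")
    g = general.lower().split("-")
    # The general tag may not be longer than the specific one, and every
    # one of its subtags must match the corresponding subtag exactly.
    return len(g) <= len(s) and s[:len(g)] == g
```

With this definition, a title tagged 'en-gb' could be concluded to be 'en',
but a title tagged 'en' could not be concluded to be 'en-gb'.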

Summary up to here: Having some relationship between identical
strings with related languages may be desirable. Ignoring
language information for rdf:parseType='Literal' is unacceptable.



> > However, the newest change by the RDF Core WG ignores (external)
> > xml:lang on XML literals, and therefore all the above become
> > the same. In order to be able to express that in:
> >
> >      <dc:title rdf:parseType='Literal'><html:span xml:lang='fr'
> >          >La Boheme</html:span> in Full Score</dc:title>
> >
> > 'in Full Score' is actually 'en', the RDF Core WG proposed that
> >
> >      <dc:title rdf:parseType='Literal'><html:span xml:lang='en'
> >          ><html:span xml:lang='fr'
> >          >La Boheme</html:span> in Full Score</html:span></dc:title>
> >
> > could be used.
> >
> > This situation is not at all satisfactory from the viewpoint
> > of I18N because:
> > - We have worked hard to eliminate artificial differences between
> >    text strings that are essentially the same:
> >    - by basing XML and RDF on Unicode, and therefore eliminating
> >      differences in character encoding.
> >    - by working on normalization (NFC) to reduce or avoid accidental
> >      differences based on remaining encoding choices in Unicode
> >    It would be very bad if after all that work, we were left with
> >    gratuitously different ways of representing textual strings due
> >    to idiosyncrasies of a type system.
>
>I presented this as an I18N requirement on RDF, and we
>discussed the proposed design and some nearby designs,
>but I didn't manage to convince the group to accept
>the requirement.

Are you sure that the requirement(s) were correctly understood?

To make sure, I'll list them here, in short form:

1) Consider language information (xml:lang) for text
    (this is not disputed)

2) Consider language information (xml:lang) in a uniform way
    for plain and XML literals.
    (this was the case in the last call draft, but has been
    changed by the decision of the RDF Core WG a few weeks ago)

3) Treat the same text with the same associated information
    as being the same.
    (we had to discover that this was no longer the case when we
     were informed about the decision of the RDF Core WG a few
     weeks ago)



>It would help some of us if you could cite relevant parts
>of the I18N specs, e.g. charmod.

A year or two ago, we explicitly decided to not deal with
language tagging issues in charmod, sorry. And if we had
added something about language tagging, then most
probably that would just have said: A) tag your content,
and B) make sure specs and implementations handle this
information. This would address 1) above, which does not
seem to be at stake, but does not specifically address
2) and 3).

The best specification to point to for issue 2) is the XML
specification (http://www.w3.org/TR/REC-xml#sec-lang-tag):
"The intent declared with xml:lang is considered to apply to all
attributes and content of the element where it is specified,
unless overridden with an instance of xml:lang on another
element within that content."

How are you or I supposed to explain to anybody on the planet
that if we see rdf:parseType='Literal' (which the RDF spec
calls *X*M*L* literals!), the rules that XML defines for
xml:lang suddenly don't apply?
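
As a rough illustration of the XML scoping rule quoted above, the following
Python sketch walks an element tree and reports each run of text together
with the language that is in scope for it. The function name
text_runs_with_lang and the namespace declarations in the sample document
are assumptions added for this example. The sketch shows what an XML-aware
reader would see; it does not claim to show what any RDF parser does.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def text_runs_with_lang(elem, inherited=None):
    """Yield (text, lang) pairs, applying xml:lang scoping."""
    # An xml:lang on this element overrides the inherited value.
    lang = elem.get(XML_LANG, inherited)
    if elem.text:
        yield elem.text, lang
    for child in elem:
        yield from text_runs_with_lang(child, lang)
        # Tail text follows the child but belongs to this element,
        # so it takes this element's language, not the child's.
        if child.tail:
            yield child.tail, lang


# Sample input; the xmlns declarations are assumed for this example.
DOC = """<dc:title xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:html="http://www.w3.org/1999/xhtml"
    xml:lang="en" rdf:parseType="Literal"><html:span xml:lang="fr"
    >La Boheme</html:span> in Full Score</dc:title>"""

runs = list(text_runs_with_lang(ET.fromstring(DOC)))
```

Here the outer xml:lang='en' applies to ' in Full Score' without any extra
wrapper element, while the inner xml:lang='fr' overrides it for 'La Boheme'.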



>The idea of treating these two the same
>seemed to mix layers in our design in distasteful ways...
> >          <dc:title>A Midsummer Night's Dream</dc:title>
>
> >          <dc:title rdf:parseType='Literal'>A Midsummer Night's Dream</dc:title>

I had a discussion with Ralph a couple weeks ago, and
he clearly said that in the original RDF spec, these
two were intended to be equivalent. rdf:parseType='Literal'
just indicated that the parser would have to watch out,
not that there was anything fundamentally different.
This is similar to the other parseType values, e.g.
parseType="Resource", which do not create different
things, but just different ways to serialize and parse
the same thing.


>I explained that this could be handled by the parser
>(i.e. they'd result in the same graph, which would make
>their denotations line up naturally);

Yes, of course.


>that was less
>distasteful than what some folks thought the proposal was:
>that they'd be different graph terms but have the same denotation.
>But even so, the idea that you wouldn't know what sort of
>term you have until you reached </dc:title> was unacceptable
>to several in the group.

Why was it unacceptable? If we say that these two are the same,
then they are not different sorts of terms.



>Hmm... I'm not sure what to suggest as a next step.

First, let's make sure we haven't forgotten anything.

Besides
a)    <dc:title>A Midsummer Night's Dream</dc:title>
and
b)    <dc:title rdf:parseType='Literal'>A Midsummer Night's Dream</dc:title>
there was also
c)    <dc:title rdf:datatype='http://www.w3.org/2001/XMLSchema#string'
         >A Midsummer Night's Dream</dc:title>

For the average user, it's even more obvious that a) and c) should
be the same than that a) and b) (or c) and b)) should be the same.

When Tim briefly dropped in on the discussion between Ralph and me,
he very quickly agreed that a) and c) should be the same. If RDF
decides to be completely agnostic about Schema types, then that
may be difficult at this level, but OWL has to support 'string',
so it could easily deal with this equivalence.
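
The equivalence of a), b), and c) can be sketched as a normalization step.
The following Python is only an illustration under stated assumptions: the
names Text and normalize are hypothetical, marked-up content and non-string
datatypes are deliberately left out, and nothing here reflects what the RDF
specifications currently define.

```python
from dataclasses import dataclass
from typing import Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


@dataclass(frozen=True)
class Text:
    """A text value; lang=None means a plain string with nothing attached."""
    value: str
    lang: Optional[str] = None


def normalize(value, lang=None, parse_type=None, datatype=None,
              has_markup=False):
    """Map a property element's content to a Text, or None if out of scope."""
    if datatype is not None:
        # xsd:string is treated as a plain string; other datatypes
        # ('real' datatypes such as integer or date) are not handled here.
        return Text(value) if datatype == XSD_STRING else None
    if parse_type == "Literal" and has_markup:
        # Text with actual markup would need XML-aware comparison.
        return None
    # Plain literals and markup-free XML literals are treated alike;
    # the language in scope, if any, is kept.
    return Text(value, lang.lower() if lang else None)


title = "A Midsummer Night's Dream"
a = normalize(title)
b = normalize(title, parse_type="Literal")
c = normalize(title, datatype=XSD_STRING)
```

Under these assumptions a, b, and c compare equal, while the same string
tagged 'en' stays distinct from both the untagged string and the one
tagged 'fr'.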


So the next step is to make sure that we really understand
each other.


Regards,    Martin.


> > - Language tagging is an important aspect of internationalization.
> >    Also, small-scale markup is important for internationalization
> >    (multilanguage strings, bidirectionality, ruby, glyph variants,...).
> >    Both are in many ways natural extensions of plain text strings
> >    as soon as markup is available.
> >
> >    The current handling of XML literal strings without any actual
> >    markup, as well as the recent change to ignore xml:lang on XML
> >    literals, break this natural extension.
> >
> >    In addition, the recent change to ignore xml:lang on XML
> >    literals makes language tagging more tedious in the prevalent
> >    case of monolingual or mostly monolingual data.
> >
> >
> >
> > In our discussion, Ralph came up with some nice ideas:
> >
> > It looks like we have the following things to actually
> > represent and work with:
> >
> > 1) plain text strings without anything attached
> > 2) text with language and/or markup
> > 3) 'real' datatypes such as integer, date,...
> >
> > Now here is how Ralph proposed to map the various XML
> > phenomena to the above three categories:
> >
> >    a plain literal (no language)
> >          <dc:title>A Midsummer Night's Dream</dc:title>
> >       absence of xml:lang (or alternatively xml:lang='') => 1)
> >
> >    an XML literal without markup nor language
> >          <dc:title rdf:parseType='Literal'>A Midsummer Night's Dream</dc:title>
> >       absence of xml:lang or markup => 1)
> >
> >    an XML Schema string:
> >         <dc:title rdf:datatype='http://www.w3.org/2001/XMLSchema#string'
> >            >A Midsummer Night's Dream</dc:title>
> >       xsd:string => 1)
> >
> >    a plain literal with language:
> >         <dc:title xml:lang='en'>A Midsummer Night's Dream</dc:title>
> >       xml:lang => 2) (with 'en')
> >
> >    an XML literal with language but without markup:
> >         <dc:title xml:lang='en' rdf:parseType='Literal'
> >            >A Midsummer Night's Dream</dc:title>
> >       xml:lang => 2) (with 'en')
> >
> >
> > This would solve the current problems, and would better model
> > the reality of the actual data. Of course, other solutions
> > may be available, too.
> >
> >
> > Regards,    Martin.
>--
>Dan Connolly, W3C http://www.w3.org/People/Connolly/
Received on Thursday, 26 June 2003 18:39:58 UTC