Re: Summary of strings, markup, and language tagging in RDF (resend) from Dan Connolly on 2003-06-26 (w3c-rdfcore-wg@w3.org from June 2003)

From: Dan Connolly <connolly@w3.org>
Date: 26 Jun 2003 09:03:37 -0500
To: Martin Duerst <duerst@w3.org>
Cc: w3c-i18n-ig@w3.org, "Ralph R. Swick" <swick@w3.org>, misha.wolf@reuters.com, Tim Berners-Lee <timbl@w3.org>, w3c-rdfcore-wg@w3.org
Message-Id: <1056636216.24287.1395.camel@dirk.dm93.org>
Hi Martin and company,

The RDF Core WG discussed this stuff last Friday
http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2003Jun/0156.html
and I took the ball to get back to you.

First, to clarify a bit...

On Thu, 2003-06-05 at 13:54, Martin Duerst wrote:
[...]
> The current last call draft treats the following differently:
> 
>    an XML literal without markup nor language
>      <dc:title rdf:parseType='Literal'>A Midsummer Night's Dream</dc:title>
>    an XML literal with language but without markup:
>      <dc:title xml:lang='en' rdf:parseType='Literal'
>         >A Midsummer Night's Dream</dc:title>
>    an XML literal with another language:
>      <dc:title xml:lang='en-gb' rdf:parseType='Literal'
>         >A Midsummer Night's Dream</dc:title>

The RDF specs specify two relationships:
(1) between an XML document and an RDF graph,
aka hunk of syntax composed of literal terms, URI terms,
bnode terms, and the like

(2) between those terms and what they denote
in an interpretation.

Indeed, per the last call specs, those three are
both treated as different RDF graphs and
the terms in them denote different things.

It would be useful to know if making the denotations
work out to be the same would suffice, or if
your requirement is actually that the graphs
work out the same.



> However, the newest change by the RDF Core WG ignores (external)
> xml:lang on XML literals, and therefore all the above become
> the same. In order to be able to express that in:
> 
>      <dc:title rdf:parseType='Literal'><html:span xml:lang='fr'
>          >La Boheme</html:span> in Full Score</dc:title>
> 
> 'in Full Score' is actually 'en', the RDF Core WG proposed that
> 
>      <dc:title rdf:parseType='Literal'><html:span xml:lang='en'
>          ><html:span xml:lang='fr'
>          >La Boheme</html:span> in Full Score</html:span></dc:title>
> 
> could be used.
> 
> This situation is not at all satisfactory from the viewpoint
> of I18N because:
> - We have worked hard to eliminate artificial differences between
>    text strings that are essentially the same:
>    - by basing XML and RDF on Unicode, and therefore eliminating
>      differences in character encoding.
>    - by working on normalization (NFC) to reduce or avoid accidental
>      differences based on remaining encoding choices in Unicode
>    It would be very bad if after all that work, we were left with
>    gratuitously different ways of representing textual strings due
>    to idiosyncrasies of a type system.

I presented this as an I18N requirement on RDF, and we
discussed the proposed design and some nearby designs,
but I didn't manage to convince the group to accept
the requirement.

It would help some of us if you could cite relevant parts
of the I18N specs, e.g. charmod.

The idea of having treating these two the same
seemed to mix layers in our design in distasteful ways...
>          <dc:title>A Midsummer Night's Dream</dc:title>

>          <dc:title rdf:parseType='Literal'>A Midsummer Night's Dream</dc:title>

I explained that this could be handled by the parser
(i.e. they'd result in the same graph, which would make
their denotations line up naturally); that was less
distasteful than what some folks thought the proposal was:
that they'd be different graph terms but have the same denotation.
But even so, the idea that you wouldn't know what sort of
term you have until you reached </dc:title> was unacceptable
to several in the group.

Hmm... I'm not sure what to suggest as a next step.

> - Language tagging is an important aspect of internationalization.
>    Also, small-scale markup is important for internationalization
>    (multilanguage strings, bidirectionality, ruby, glyph variants,...).
>    Both are in many ways natural extensions of plain text strings
>    as soon as markup is available.
> 
>    The current handling of XML literal strings without any actual
>    markup, as well as the recent change to ignore xml:lang on XML
>    literals, break this natural extension.
> 
>    In addition, the recent change to ignore xml:lang on XML
>    literals makes language tagging more tedious in the prevalent
>    case of monolingual or mostly monolingual data.
> 
> 
> 
> In our discussion, Ralph came up with some nice ideas:
> 
> It looks like we have the following things to actually
> represent and work with:
> 
> 1) plain text strings without anything attached
> 2) text with language and/or markup
> 3) 'real' datatypes such as integer, date,...
> 
> Now here is how Ralph proposed to map the various XML
> phenomenas to the above three categories:
> 
>    a plain literal (no language)
>          <dc:title>A Midsummer Night's Dream</dc:title>
>       absence of xml:lang (or alternatively xml:lang='') => 1)
> 
>    an XML literal without markup nor language
>          <dc:title rdf:parseType='Literal'>A Midsummer Night's Dream</dc:title>
>       absence of xml:lang or markup => 1)
> 
>    an XML Schema string:
>         <dc:title rdf:datatype='http://www.w3.org/2001/XMLSchema#string'
>            >A Midsummer Night's Dream</dc:title>
>       xsd:string => 1)
> 
>    a plain literal with language:
>         <dc:title xml:lang='en'>A Midsummer Night's Dream</dc:title>
>       xml:lang => 2) (with 'en')
> 
>    an XML literal with language but without markup:
>         <dc:title xml:lang='en' rdf:parseType='Literal'
>            >A Midsummer Night's Dream</dc:title>
>       xml:lang => 2) (with 'en')
> 
> 
> This would solve the current problems, and would better model
> the reality of the actual data. Of course, other solutions
> may be available, too.
> 
> 
> Regards,    Martin.
-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
Received on Thursday, 26 June 2003 10:03:23 UTC