- From: Martin Duerst <duerst@w3.org>
- Date: Thu, 05 Jun 2003 14:54:16 -0400
- To: w3c-i18n-ig@w3.org
- Cc: swick@w3.org, misha.wolf@reuters.com, "Tim Berners-Lee" <timbl@w3.org>, w3c-rdfcore-wg@w3.org
I was wondering why I was not getting any answers on this message. Now I just discovered that I had forgotten to copy the RDF Core WG. So I'm resending it. Martin.

------------------------------------------------------------

This is my attempt to summarize the issues in RDF with language tagging for XML Literals, and the related issues surrounding the representation of text strings in general.

The earlier part of this mail is mainly a summary/intro to RDF for the I18N WG. The later part is an analysis of the problem based on my opinions and an extensive discussion with Ralph Swick (coauthor of the first RDF spec) and some minutes of TimBL dropping into that conversation.

I have also copied Misha Wolf, former chair of the I18N WG and member of the original RDF WG. There is a chance that Misha may supply some important knowledge about the history of language tagging in RDF. (Even if you find the start boring, you may want to read the later parts.)

[if you reply, please don't copy the whole mail!]

(super-short) Intro to RDF (for i18n)
============

RDF (Resource Description Framework) is a technology to represent data in a very simple and straightforward way so that it can easily be reused. Although there are many other uses of RDF, I will use an example here that illustrates the name of the technology, namely an example where RDF is used to describe a resource, in particular a Web page. It is a very simple single statement that associates W3C's Web page, http://www.w3.org/, with the text "World Wide Web Consortium", saying that the latter is the 'title' of the former (title being defined as the title according to Dublin Core metadata conventions).
To have a look at the example, please go to:

http://jiggles.w3.org/servlet/ARPServlet?RDF=%3C%3Fxml+version%3D%221.0%22%3F%3E%0D%0A%0D%0A%3Crdf%3ARDF+xmlns%3Ardf%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%22%0D%0A++++++++xmlns%3Adc%3D%22http%3A%2F%2Fpurl.org%2Fdc%2Felements%2F1.1%2F%22%3E+%0D%0A+%3Crdf%3ADescription+rdf%3Aabout%3D%22http%3A%2F%2Fwww.w3.org%2F%22%3E+%0D%0A++%3Cdc%3Atitle%3EWorld+Wide+Web+Consortium%3C%2Fdc%3Atitle%3E+%0D%0A+%3C%2Frdf%3ADescription%3E+%0D%0A%3C%2Frdf%3ARDF%3E%0D%0A

Or go to http://www.w3.org/RDF/Validator/ and just click on the 'parse RDF' button.

You will then get a result page that shows the three main representations of RDF:

- As triples: Each statement is listed as a triple of subject, predicate, object.
- As a (labeled directed) graph.
- Encoded in XML. This is usually a bit confusing, in particular because of the high overhead for this small example.

In some ways, this is really all you need to know about RDF; in practice, there is just more than one statement (which means more triples, more nodes and arcs in the graph, more XML, ...). One advantage is that different collections of RDF statements can easily be merged, which means that even very small pieces of RDF have their value.

In what follows, we will concentrate on what RDF calls 'literals', the square boxes in the graphs. The literal we have used above is the text string "World Wide Web Consortium". In table/triple terms, only objects can be literals. This means that literals are used to say things about resources, but cannot be talked about (at least not directly).

The original RDF spec (http://www.w3.org/TR/1999/REC-rdf-syntax-19990222) had very few variations for literals. Besides being plain text strings, they could also include XML markup.
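For readers who prefer code to a validator page, the single statement in the example can be sketched in a few lines of Python. This is only an illustration, not an RDF parser: it handles nothing beyond rdf:Description elements with an rdf:about attribute and plain-text property elements, and the function name simple_triples is made up for this sketch. The RDF/XML is the document encoded in the URL above.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The RDF/XML document that is URL-encoded in the link above.
RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
 <rdf:Description rdf:about="http://www.w3.org/">
  <dc:title>World Wide Web Consortium</dc:title>
 </rdf:Description>
</rdf:RDF>
"""


def simple_triples(rdf_xml):
    """Yield (subject, predicate, object) for plain-text properties only.

    Hypothetical helper for illustration; real RDF/XML has many more forms.
    """
    root = ET.fromstring(rdf_xml)
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree spells tags as "{namespace}local"; the predicate
            # URI is the namespace name followed by the local name.
            namespace, local = prop.tag[1:].split("}")
            yield (subject, namespace + local, prop.text)


triples = list(simple_triples(RDF_XML))
for triple in triples:
    print(triple)
```

The one triple it yields has http://www.w3.org/ as subject, the Dublin Core title property as predicate, and the text string as the (literal) object.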
The purpose of allowing XML markup in literals was to cover cases such as titles or descriptions, among other things (but not only) to address I18N needs for marking up stretches of different language, ruby, bidirectionality, ...

This raised the question of what to do with xml:lang. This is what the original RDF spec says:

>>>>
The xml:lang attribute may be used as defined by [XML] to associate a language with the property value. There is no specific data model representation for xml:lang (i.e., it adds no triples to the data model); the language of a literal is considered by RDF to be a part of the literal. An application may ignore language tagging of a string. All RDF applications must specify whether or not language tagging in literals is significant; that is, whether or not language is considered when performing string matching or other processing.
>>>>

While this is not very explicit, it was always assumed that, RDF (in XML) being an XML application, xml:lang would behave as it behaves in any other XML application, namely that it is inherited. As an example, in the following book title example:

    <stockNS:booksInStock xml:lang='en'>
      ...
      <dc:title>A Midsummer Night's Dream</dc:title>
      ...
      <dc:title xml:lang='it'>La traviata</dc:title>
      ...
      <dc:title rdf:parseType='Literal'><html:span xml:lang='fr'
        >La Boheme</html:span> in Full Score</dc:title>

"A Midsummer Night's Dream" would be 'en', "La traviata" would be 'it', "La Boheme" would be 'fr', and "in Full Score" would be 'en'.

This example also shows the use of "parseType='Literal'" to indicate that the element is to be parsed as a literal, rather than as nested RDF. [For mixed content, the RDF parser would theoretically have been able to figure this out, because nested RDF cannot be mixed, but this would have complicated processing too much.]

Please note that according to Ralph, parseType='Literal' was only an indication of how to parse the element content, and didn't indicate a difference in underlying nature. I.e.
changing

    <dc:title>A Midsummer Night's Dream</dc:title>

to

    <dc:title rdf:parseType='Literal'>A Midsummer Night's Dream</dc:title>

would still result in the same text string.

A while ago, the RDF Core WG was chartered to rework and clean up the RDF specification, and they have worked hard on this, leading to the last calls at the start of this year. The two areas they have worked on that are relevant for this discussion are:

- The handwaving in the original RDF specification (cited above) that RDF applications can either consider or not consider language information when comparing literals. It is always possible that an application decides to consider two strings that differ only in language as equivalent if the RDF machinery includes language information (and therefore in effect treats these two strings as different). However, this does not work the other way round. Therefore, RDF chose to treat literals with different languages as different.

- The addition of typed literals, e.g. integers, dates, and so on, based on the model of XML Schema Datatypes (http://www.w3.org/TR/xmlschema-2/), i.e.:
  - distinction between lexical space and value space
  - locale/language independent representation
  Datatypes other than those of XML Schema can be used, but only if they can fit the model of XML Schema Datatypes. Reflecting the principle of locale/language-independence, for such datatypes, xml:lang is irrelevant.

The RDF Core WG chose to use datatypes to model XML literals (datatype http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral, or rdf:XMLLiteral for short), but to create an exception in that for XML literals, xml:lang is relevant. The RDF Core WG recently has decided to remove this exception, which led to the start of this discussion.

XML Schema datatypes also added the datatype 'string' (http://www.w3.org/2001/XMLSchema#string, or xsd:string for short).
In the last call draft, this results in the following literal typology:

    plain literal:                 inherits xml:lang
    XML literal:                   inherits xml:lang
    XML Schema string:             xml:lang is irrelevant
    [other datatyped literals:     xml:lang is irrelevant]

[a question from Ralph to RDF Core: are plain literals characterized by the absence of a type (i.e. type=no_type), or just by the fact that their type is (temporarily, maybe) unknown?]

[From here on, to simplify, I'll put xml:lang directly on the element in question, and assume that there is none higher up in the tree, although in practice, it can always be inherited from higher up, or removed by using xml:lang=''.]

According to this, the following all are treated as different in the (new) RDF model:

a plain literal (no language):

    <dc:title>A Midsummer Night's Dream</dc:title>

an XML literal without markup or language:

    <dc:title rdf:parseType='Literal'>A Midsummer Night's Dream</dc:title>

an XML Schema string:

    <dc:title rdf:datatype='http://www.w3.org/2001/XMLSchema#string'
      >A Midsummer Night's Dream</dc:title>

Also the following are treated as different in the (new) RDF model (and different from all the above):

a plain literal with language:

    <dc:title xml:lang='en'>A Midsummer Night's Dream</dc:title>

an XML literal with language but without markup:

    <dc:title xml:lang='en' rdf:parseType='Literal'
      >A Midsummer Night's Dream</dc:title>

The current last call draft treats the following differently:

an XML literal without markup or language:

    <dc:title rdf:parseType='Literal'>A Midsummer Night's Dream</dc:title>

an XML literal with language but without markup:

    <dc:title xml:lang='en' rdf:parseType='Literal'
      >A Midsummer Night's Dream</dc:title>

an XML literal with another language:

    <dc:title xml:lang='en-gb' rdf:parseType='Literal'
      >A Midsummer Night's Dream</dc:title>

However, the newest change by the RDF Core WG ignores (external) xml:lang on XML literals, and therefore all the above become the same.
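The distinctions just listed can be pictured as comparison keys. The following Python sketch is a rough model under stated assumptions, not the RDF Core definitions: a literal is reduced to (text, language, datatype), XML literals are compared as raw strings rather than in any canonical XML form, and the names Literal, last_call_key and after_change_key are invented here.

```python
from typing import NamedTuple, Optional

XML_LITERAL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


class Literal(NamedTuple):
    text: str
    lang: Optional[str] = None      # xml:lang in scope, if any
    datatype: Optional[str] = None  # None stands for a plain literal


def last_call_key(lit):
    """Last call draft: xml:lang counts for plain and XML literals only."""
    if lit.datatype in (None, XML_LITERAL):
        return (lit.text, lit.lang, lit.datatype)
    return (lit.text, None, lit.datatype)


def after_change_key(lit):
    """Newest change: external xml:lang is ignored on XML literals as well."""
    if lit.datatype is None:
        return (lit.text, lit.lang, None)
    return (lit.text, None, lit.datatype)


title = "A Midsummer Night's Dream"

# The five forms listed as mutually different in the (new) RDF model.
five_forms = [
    Literal(title),
    Literal(title, datatype=XML_LITERAL),
    Literal(title, datatype=XSD_STRING),
    Literal(title, lang="en"),
    Literal(title, lang="en", datatype=XML_LITERAL),
]

# The three XML literals that differ only in external xml:lang.
xml_forms = [
    Literal(title, datatype=XML_LITERAL),
    Literal(title, lang="en", datatype=XML_LITERAL),
    Literal(title, lang="en-gb", datatype=XML_LITERAL),
]

distinct_five = len({last_call_key(lit) for lit in five_forms})
distinct_xml_before = len({last_call_key(lit) for lit in xml_forms})
distinct_xml_after = len({after_change_key(lit) for lit in xml_forms})
print(distinct_five, distinct_xml_before, distinct_xml_after)
```

Under this model the five forms stay distinct, while the three XML literals collapse from three different values to one once external xml:lang is ignored.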
In order to be able to express that in:

    <dc:title rdf:parseType='Literal'><html:span xml:lang='fr'
      >La Boheme</html:span> in Full Score</dc:title>

'in Full Score' is actually 'en', the RDF Core WG proposed that

    <dc:title rdf:parseType='Literal'><html:span xml:lang='en'
      ><html:span xml:lang='fr'
      >La Boheme</html:span> in Full Score</html:span></dc:title>

could be used.

This situation is not at all satisfactory from the viewpoint of I18N because:

- We have worked hard to eliminate artificial differences between text strings that are essentially the same:
  - by basing XML and RDF on Unicode, and therefore eliminating differences in character encoding.
  - by working on normalization (NFC) to reduce or avoid accidental differences based on remaining encoding choices in Unicode.
  It would be very bad if after all that work, we were left with gratuitously different ways of representing textual strings due to idiosyncrasies of a type system.

- Language tagging is an important aspect of internationalization. Also, small-scale markup is important for internationalization (multilanguage strings, bidirectionality, ruby, glyph variants, ...). Both are in many ways natural extensions of plain text strings as soon as markup is available. The current handling of XML literal strings without any actual markup, as well as the recent change to ignore xml:lang on XML literals, break this natural extension.

In addition, the recent change to ignore xml:lang on XML literals makes language tagging more tedious in the prevalent case of monolingual or mostly monolingual data.

In our discussion, Ralph came up with some nice ideas:

It looks like we have the following things to actually represent and work with:

1) plain text strings without anything attached
2) text with language and/or markup
3) 'real' datatypes such as integer, date, ...
Now here is how Ralph proposed to map the various XML phenomena to the above three categories:

a plain literal (no language):

    <dc:title>A Midsummer Night's Dream</dc:title>

    absence of xml:lang (or alternatively xml:lang='') => 1)

an XML literal without markup or language:

    <dc:title rdf:parseType='Literal'>A Midsummer Night's Dream</dc:title>

    absence of xml:lang or markup => 1)

an XML Schema string:

    <dc:title rdf:datatype='http://www.w3.org/2001/XMLSchema#string'
      >A Midsummer Night's Dream</dc:title>

    xsd:string => 1)

a plain literal with language:

    <dc:title xml:lang='en'>A Midsummer Night's Dream</dc:title>

    xml:lang => 2) (with 'en')

an XML literal with language but without markup:

    <dc:title xml:lang='en' rdf:parseType='Literal'
      >A Midsummer Night's Dream</dc:title>

    xml:lang => 2) (with 'en')

This would solve the current problems, and would better model the reality of the actual data. Of course, other solutions may be available, too.

Regards, Martin.
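As a companion to the mapping above, here is a rough Python sketch of it as a classification function. It is one possible reading, not a specification: the function name category and its parameters are invented for illustration, and the treatment of xsd:string as always falling under 1) follows from xml:lang being irrelevant for it in the last call draft.

```python
XML_LITERAL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def category(lang=None, datatype=None, has_markup=False):
    """Return 1, 2 or 3 following the proposed mapping (hypothetical helper).

    1: plain text string without anything attached
    2: text with language and/or markup
    3: 'real' datatype such as integer or date
    """
    if datatype not in (None, XML_LITERAL, XSD_STRING):
        return 3
    if datatype == XSD_STRING:
        # xml:lang is irrelevant for xsd:string, so it is always plain text.
        return 1
    # Plain literal or XML literal: language or markup moves it to 2).
    # An empty xml:lang='' is falsy here and counts as "no language".
    if lang or has_markup:
        return 2
    return 1


examples = [
    category(),                                   # plain literal, no language
    category(datatype=XML_LITERAL),               # XML literal, no markup/language
    category(datatype=XSD_STRING),                # XML Schema string
    category(lang="en"),                          # plain literal with language
    category(lang="en", datatype=XML_LITERAL),    # XML literal with language
]
print(examples)
```

The five examples from the mapping come out as 1, 1, 1, 2, 2; an XML literal with inline markup such as the "La Boheme" title would be 2), and an integer would be 3).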
Received on Thursday, 5 June 2003 14:54:52 UTC