Updated summary from I18N on 'XML Literals'

As promised yesterday, here is an updated list of arguments
from the I18N side on the issues surrounding 'XML Literals',
and our understanding behind it.

Many thanks to many of you for helping me to understand
how to better explain things; anything that is not yet
clear is clearly my fault.

The rationale is split into various parts, such as technical
and procedural arguments.


Technical Arguments
-------------------

One of the needs in internationalization is the need for what we
may call 'micro-markup'. Although this is not necessarily very
frequent, it turns up repeatedly in all kinds of instances such as:
   - multilingual text fragments
   - bidirectionality
   - ruby
   - special glyph variants
   - translation hints for localization
There are other cases where similar micro-markup is desirable,
such as using mathematical or chemical formulae in text.

In the RDF context, the typical example for this is a book title.
The RDF M&S specification uses a Math example, and not an i18n
example, mainly because that was felt that it was easier for a
wide audience to understand.

The need for micro-markup for i18n is not limited to RDF at all.
We have had long discussions with XML Schema about it, which have
led to the 'text' type described in the XML Schema primer (see 
http://www.w3.org/TR/xmlschema-0/#any), which ended up in the
XML Schema type library (see
http://www.w3.org/2001/03/XMLSchema/TypeLibrary.xsd
http://www.w3.org/2001/03/XMLSchema/TypeLibrary-text.xsd
http://www.w3.org/2001/03/XMLSchema/TypeLibrary-nn-text.xsd,
you may have to do 'view source' to get the full picture).

We have also made review comments to other WGs requesting that
elements rather than attributes be used for anything that looks
like natural text rather than enumerations, numeric values, and
the like, in order to be able to use micro-markup. It is also
leading to changes from XHTML 1.0 to XHTML 2.0.

The need for micro-markup in various ways makes it quite important
that 'text' and 'text-with-markup' are not seen as two completely
different things, but that text-with-markup, to whatever extent
possible, be seen and handled as an extension of text. This is even
more important because the need for micro-markup may often not
appear for very long, in particular if mainly data of particular
languages is handled. Therefore, it should be as natural as possible
to make the transition from plain text to text with micro-markup.

All this is not just an issue from the view of RDF/XML (Pat's
view X), but very much also, or actually first and foremost,
applies to the model and the graph (Pat's view G). The best thing
would be the way M&S treats literals, with language information,
and with literals containing markup as a natural extension of
literals with only plain text. Please note that this does not at
all preclude other usages of XML content in RDF literals, be it
larger textual pieces (typical example is documentation) or,
as some people in this discussion have, to my surprise, suggested,
as blobs of data-oriented XML. There is no need for every XML literal
to have a language in the Graph (we just want it to be possible to
have one), and in RDF/XML, xml:lang="" does the job.


We understand that there are some implementation problems, primarily
motivated by the desire to store plain strings as strings without
escaping, but we see this as an implementation issue, not as a
reason for making plain literals and XML literals completely
different things, and having them behave completely differently.


We see parseType='Literal' as what it says, namely an instruction
about how to parse RDF/XML, similar to the other values of parseType,
and not as something that should be directly reflected in the
graph.


 From an user point of view, users familar with XML will be surprised
that xml:lang tag does affect plain literals (as the XML 1.0 REC says)
but does not affect xml literals. Why are the rules for XML switched
off just for XML?


The idea that applications use an application-specific wrapper element
is in conflict with the idea of micro-markup, which should be used
only when necessary, and with the concept of markup integrity,
which means that in some way or another, markup may always be
significant, and changing it may be unapropriate.


Procedural
----------

- It is our understanding that RDF Core was chartered with
   clarifying the RDF M&S spec, not changing it. Already by
   separating plain literals and XML literals, and much more
   by removing language information from XML literals, the
   new spec is a clear change from M&S, rather than a
   reinterpretation.

- We agreed in Cannes that the ambiguity in M&S that RDF applications
   may or may not consider language information would be resolved
   to that the RDF graph would provide the language information.

- Later, RDF Core asked us about the problem of integrating
   arbitrary pieces of XML without language information into
   an RDF/XML document. The same problem was brought up by
   XML Signature (or was it encryption) and SOAP. The I18N
   WG recognized this problem, checked with the experts on
   language tagging standards, and recommended to XML Core
   to issue an erratum to define xml:lang="" for this case,
   which they did.

- Later, RDF Core asked about the applicability of language
   information to datatypes such as (XML Schema) integer.
   We told them that these were designed as language- and
   locale-independent datatypes, and so it would be appropriate
   to specify that they did not carry language information.

- Although this was rather implicit (in the sense of a common
   understanding that didn't have to be made explicit), I think
   neither side ever assumed that removing language information
   from XML Schema simple datatypes would affect plain literals
   or XML literals.

- After last call, RDF Core asked us whether we would be okay
   with removing language information from XML literals. It was
   nice for them to ask, but it also clearly indicates that they
   understood it to break our previous agreement. We had a look
   at it and decided that, for the reasons explained above, it
   would not be okay. It also helped us to understand that the
   RDF M&S design for literals had been changed rather substantially,
   with undesired consequences for internationalization, and that
   ideally, more than just putting language information back on
   XML literals was needed, but that if really necessary, we
   could live with only that change back (to the last call state).



The bigger picture
------------------

Some people have said that the M&S solution for dealing with
language information is not very good. We don't deny that there
may be other, potentially better solutions. We also would have
been ready (and would still be ready), I guess, to discuss such
solutions with RDF Core. But we are not ready to trade something
that we have currently (M&S, or with somewhat less satisfaction,
the last call status) with something that is according to our
understanding clearly less consistent and less useful, with
or without the potential to 'get things fixed' in the future.



Regards,    Martin.

Received on Thursday, 10 July 2003 00:55:33 UTC