RE: Encoding arbitrary literals in RDF/XML from Jon Hanna on 2004-09-23 (www-rdf-interest@w3.org from September 2004)

From: Jon Hanna <jon@hackcraft.net>
Date: Thu, 23 Sep 2004 10:28:36 +0100
To: "www-rdf-interest@w3.org" <www-rdf-interest@w3.org>
Message-ID: <1095931716.415297447c818@82.195.128.192>

> Reason for this discussion is that XML does
> not support the full range of Unicode characters. More specifically,
> the issue concerns the null character (hex value 0x0). From what I've
> learned about Unicode in the past days, I understand that this is a
> perfectly legal Unicode character, but the XML specs do not allow you to
> include it in an XML document.

It is a perfectly legal Unicode character but, just for the record, it is
perfectly valid for an application to apply special meaning to Unicode
characters (and "not allowed" counts as a special meaning) and the control code
characters (including U+0000) are particularly common here - indeed they were
designed to have special meanings in applications. So there is nothing untoward
about XML prohibiting it, and it in no way indicates a failure on the part of
XML in its support of Unicode.
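
To illustrate the prohibition, here is a minimal Python sketch (not part of the
original exchange) using the standard library's xml.etree.ElementTree. The
element name "a" and the helper parses_as_xml are made up for the example. It
shows that a conforming parser refuses U+0000 even when it is written as a
character reference.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(document: str) -> bool:
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# U+0000 written as a character reference: the parser should reject this.
print(parses_as_xml("<a>&#0;</a>"))

# An ordinary character reference (U+0041, "A"): the parser should accept this.
print(parses_as_xml("<a>&#65;</a>"))
```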

> A workaround that I have been thinking about is to encode such literals
> in hex or base64 and to include an attribute in the surrounding element
> that indicates this. This sounds like a bit of a hack, though, and I'm
> not sure whether this is completely standards compliant.

Not a hack, not even a workaround; this is how you encode such data.

The important thing is that:

1. The "attribute in the surrounding element that indicates this" is
rdf:datatype.

2. The property in question is defined in such a way as to allow this. In
particular "strings" on the web do not include null characters (whether they are
being used in RDF, HTML, XML, URIs, HTTP headers or just about any other web
technology) so the property mustn't be defined in such a way as to allow only a
"string", but rather it must be defined so that base64 values are allowed
(though it's both valid to allow strings *as well* and to consider the value
you obtain when parsing to be a "string" in the context of your application).

3. Unless there is only one encoding possible for the string before it becomes
base64 encoded (and this encoding is well documented in all applicable places)
you will need to somehow state this encoding in the RDF - for while an XML
document in one encoding (say UTF-8) can only include strings encoded in UTF-8
(conceptually it just contains the string of characters, the encoding is an
issue for a lower level of abstraction) it can include base64 encoded text from
another encoding (say UTF-16). Generally I'd recommend just mandating that the
text must be in UTF-8 before it is base64 encoded; it makes your life simpler
and since there are no characters that can't be encoded in UTF-8 there won't be
any nasty edge cases (whereas there would with one of the ISO 8859 family).

4. As a rule you won't want to do the sort of character escapes necessary with
XML (e.g. "&lt;" for "<") before base64 encoding. However some people may
expect that, reasoning along the lines of "XML + < = &lt;", so you should be
clear on this in your documentation.
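
The four points above can be sketched in Python roughly as follows. This is an
illustration under stated assumptions, not a definitive implementation: the
ex: namespace, the ex:value property, the subject URI and the helper names are
all hypothetical, and xsd:base64Binary is assumed as the rdf:datatype. The
literal is encoded as UTF-8 and then base64, with no XML escaping beforehand.

```python
import base64
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD_BASE64 = "http://www.w3.org/2001/XMLSchema#base64Binary"
EX_NS = "http://example.org/ns#"  # hypothetical vocabulary for this sketch


def encode_literal(text: str) -> str:
    # Point 3: always UTF-8 first. Point 4: no "&lt;"-style escaping first.
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_literal(lexical: str) -> str:
    # Reverse the two steps: base64 to bytes, then UTF-8 bytes to characters.
    return base64.b64decode(lexical).decode("utf-8")


def to_rdfxml(subject: str, text: str) -> str:
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("ex", EX_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": subject}
    )
    # Point 1: rdf:datatype is the attribute that flags the encoding.
    prop = ET.SubElement(
        desc, f"{{{EX_NS}}}value", {f"{{{RDF_NS}}}datatype": XSD_BASE64}
    )
    prop.text = encode_literal(text)
    return ET.tostring(root, encoding="unicode")


original = "a<b\x00c"  # contains both "<" and U+0000
document = to_rdfxml("http://example.org/thing", original)
print(document)

# Round trip: parse the RDF/XML back and recover the original characters.
parsed = ET.fromstring(document)
lexical = parsed.find(f"{{{RDF_NS}}}Description/{{{EX_NS}}}value").text
assert decode_literal(lexical) == original
```

Whether the decoded value is then treated as a "string" is up to the
application, as per point 2.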

Received on Thursday, 23 September 2004 09:28:51 UTC