I18N Issue alternative: collapsing plain and xml literals

After discussing this informally over lunch, Danbri asked me to send it 
to the list to make our consideration of it explicit.

This is an alternative design for literals.  The idea is to drop the 
rdf:XMLLiteral datatype and allow plain literals to contain markup.  Two 
test cases illustrate:

<rdf:Description>
   <eg:prop>foo &lt;br /&gt; bar</eg:prop>
</rdf:Description>

parses to:

_:a eg:prop "foo &lt;br /&gt; bar" .

<rdf:Description>
   <eg:prop rdf:parseType="Literal">foo <br /> bar</eg:prop>
</rdf:Description>

parses to:

_:a eg:prop "foo <br></br> bar" .

The definition of a plain literal changes.  The lexical space of plain 
literals becomes the lexical space of rdf:XMLLiteral, i.e. it is 
restricted to (the Unicode representation of) canonicalised, well-formed, 
balanced XML markup.  The denotation of a plain literal remains - it is a 
sequence of Unicode characters - permitting string comparison for 
equality testing.
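
As a rough illustration of the restricted lexical space, the Python 
sketch below tests whether a string is well-formed, balanced XML content 
by wrapping it in a dummy element and parsing it.  The function name and 
the wrapper element are assumptions made for illustration, and the sketch 
does not check that the content is in canonical form.

```python
import xml.etree.ElementTree as ET


def is_balanced_xml_content(text: str) -> bool:
    """Return True if text parses as well-formed, balanced XML content.

    Hypothetical helper: wraps the text in a dummy root element so that
    mixed content (text plus child elements) can be parsed.  It does NOT
    check canonicalisation, only well-formedness and balance.
    """
    try:
        ET.fromstring("<wrapper>" + text + "</wrapper>")
    except ET.ParseError:
        return False
    return True


# Both test-case results are in the lexical space.
print(is_balanced_xml_content("foo &lt;br /&gt; bar"))
print(is_balanced_xml_content("foo <br></br> bar"))
# A bare "<" or an unclosed tag is not.
print(is_balanced_xml_content("<"))
print(is_balanced_xml_content("foo <br> bar"))
```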

Advantages:

I think this provides everything that Martin has been asking for:

   - no discontinuity between plain and xml literals
   - lang on mixed content
   - no use of datatypes

Disadvantages:

   - a bigger change than alternatives
   - builds XML into the core of the RDF model
   - breaks current implementations (but see below)

Ameliorating the Disadvantages - implementation strategy

The above design says that e.g. "<" is not in the lexical space of plain 
literals, and many (all?) current implementations will store
"<" in their representation of a graph.  The point to note is that 
implementations are free to represent literals any way they please. 
Thus "<" is just the way this implementation represents the literal "&lt;".

The implementation does need to distinguish between markup and plain 
text.  To do this, it adds a single bit to literals to indicate whether 
they are stored in escaped or unescaped form.  The above example was in 
unescaped form, which cannot represent markup.  To represent markup, the 
literal must be stored in escaped form.

Literal comparison becomes more complex - literals stored in unescaped 
form should first be escaped and then canonicalized.  Various 
optimization strategies can be employed here.
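
A minimal Python sketch of this strategy follows, under stated 
assumptions: the class StoredLiteral, its fields and the function 
same_literal are hypothetical names, language tags are ignored, and 
canonicalization is not implemented (text stored in escaped form is 
assumed to be canonical already).  Escaping covers only "&", "<" and ">".

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass(frozen=True)
class StoredLiteral:
    """Hypothetical in-memory representation of a literal."""
    text: str      # characters exactly as the implementation stores them
    escaped: bool  # the single bit: True if text is already in escaped form

    def lexical_form(self) -> str:
        # Unescaped storage holds plain text only, so escape it to obtain
        # the lexical form.  Escaped storage is returned unchanged;
        # canonicalization is assumed to have happened already.
        return self.text if self.escaped else escape(self.text)


def same_literal(a: StoredLiteral, b: StoredLiteral) -> bool:
    # Compare lexical forms, so both storage forms are comparable.
    return a.lexical_form() == b.lexical_form()


# "<" stored unescaped represents the literal "&lt;".
plain = StoredLiteral("<", escaped=False)
markup = StoredLiteral("&lt;", escaped=True)
print(same_literal(plain, markup))
```

One possible optimization is to compare the stored text directly when 
both literals carry the same flag, and escape only when the flags differ.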

By this strategy, it may be possible to argue that this approach does 
not break current implementations of plain literals.  It simply makes 
clearer what xml literals are.

Brian

Received on Monday, 8 September 2003 07:52:24 UTC