Re: I18N Issue alternative: collapsing plain and xml literals

pat hayes wrote:
>> After discussing this informally over lunch, Danbri asked me to send 
>> it to the list to make our consideration of it explicit.
>>
>> This is an alternative design for literals.  The  idea is to drop the 
>> rdf:XMLLiteral datatype and allow plain literals to contain markup.
> 
> 
> Allow, or require? That is, if they happen to contain the symbol '<', is 
> that *required* to be considered to be XML markup?

Yes.  If you want to represent "<" you must use "&lt;".
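
As a rough illustration of that escaping rule, here is a minimal Python sketch. It assumes the standard library's xml.sax.saxutils.escape is an acceptable stand-in for whatever escaping a serializer would apply; the helper name to_lexical_form is hypothetical and not taken from any RDF specification.

```python
from xml.sax.saxutils import escape


def to_lexical_form(text: str) -> str:
    """Escape plain text so that no character in it reads as markup.

    Hypothetical helper: under the design described here, a bare "<"
    would always be treated as markup, so plain text has to be escaped
    before it goes into the graph.
    """
    # escape() replaces "&", "<" and ">" with "&amp;", "&lt;" and "&gt;".
    return escape(text)


print(to_lexical_form("<"))    # &lt;
print(to_lexical_form("2<3"))  # 2&lt;3
```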

[...]

> 
> I take it that this is intended to illustrate that without the 
> parseType, the literal string is rendered exactly as it is (?)
> How about
> <rdf:Description>
>   <eg:prop><<</eg:prop>
> </rdf:Description>
> 
> does that parse to
> 
> _:a eg:prop "<<" .
> 
> ?

No - that is not legal rdf/xml.
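
One way to see this is to hand the fragment to any XML parser. The Python sketch below only checks XML well-formedness, which is a weaker test than the RDF/XML grammar, but it is enough to reject the example. The eg namespace URI and the helper name is_well_formed are assumptions made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Wrapper document for the property element. The eg namespace URI is a
# placeholder chosen for this sketch, not one used in the thread.
TEMPLATE = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:eg="http://example.org/">'
    "<rdf:Description><eg:prop>{}</eg:prop></rdf:Description>"
    "</rdf:RDF>"
)


def is_well_formed(content: str) -> bool:
    """Return True if the document with this element content parses as XML."""
    try:
        ET.fromstring(TEMPLATE.format(content))
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<<"))        # False
print(is_well_formed("&lt;&lt;"))  # True
```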
> 
> 
>>
>> <rdf:Description>
>>   <eg:prop rdf:parseType="Literal"><br /></eg:prop>
>> </rdf:Description>
>>
>> parses to:
>>
>> _:a eg:prop "foo <br></br> bar" .
> 
> 
> ?? Eh? Where do the foo and bar come from?

Oh bu**er.  Typo introduced in editing the email.  Should have been:

  _:a eg:prop "<br></br>" .
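
The step from `<br />` in the source to `<br></br>` in the literal is canonicalisation, which writes every empty element as a start tag followed by an end tag. A small Python sketch of that step follows. It assumes Python 3.8 or later, whose xml.etree.ElementTree.canonicalize implements C14N 2.0; that is not necessarily the canonicalisation the RDF drafts specify, but it treats empty elements the same way. The helper name canonical_literal is hypothetical.

```python
import xml.etree.ElementTree as ET


def canonical_literal(markup: str) -> str:
    """Return a canonical form of a single well-formed XML element.

    Assumption: C14N 2.0 as implemented by ElementTree (Python 3.8+),
    used here only to illustrate the treatment of empty elements.
    """
    return ET.canonicalize(markup)


print(canonical_literal("<br />"))  # <br></br>
```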
> 
>>
>> The definition of a plain literal changes.  The lexical space of plain 
>> literal becomes the lexical space of rdf:XMLLiteral, i.e. is 
>> restricted to (the unicode representation of) canonicalised well 
>> formed balanced xml markup.
> 
> 
> That is unacceptable right there. Applications may want to have plain 
> literals that are not XML, eg the Reuters applications where literals 
> are used to capture free-text paragraphs.

I don't see a problem.   XML can represent free-text paragraphs.

> 
>>  The denotation of a plain literal remains - it is a sequence of 
>> unicode characters - permitting string comparison for equality testing.
> 
> 
> ?? So this amounts to a proposal to get rid of plain literals, in 
> effect, and to just not mention the 'XMLLiteral' type explicitly?

Pretty much.  Replace plain literals with XMLLiterals, since plain 
literals are a subset of XMLLiterals (modulo appropriate escaping).

I've tried to be careful not to describe it as a proposal.  This is an 
alternative design.  I'm not proposing it, just describing it.

> 
>>
>> Advantages:
>>
>> I think this provides everything that Martin has been asking for:
>>
>>   - no discontinuity between plain and xml literals
> 
> 
> Indeed, but I do not want us to concede this point to Martin. He is 
> WRONG about this, and we should refuse to let him (or i18n) bully us 
> into conceding this issue.  Plain text is not the same as XML without 
> markup; that view only makes sense in a completely XML-centric view of 
> the entire world of lexicography and notation. Most of the world's 
> languages and notations are not dialects of XML. SCL, for one, is not 
> XML without markup. Virtually every piece of program code ever written 
> is not XML without markup. The mathematical statement (I quote) "2<3" is 
> not XML without markup, and it certainly isn't XML with markup.  And 
> "2&lt;3" is just gibberish.

"2&lt;3" is an encoding of "2<3".  I fear I haven't got this right in 
what I wrote, and I fear even more to get into model theory, however, 
what if:

   "2&lt;3" denotes "2<3" in all interpretations.

Now I have a problem with what does

   "<br></br>" denote.   Given the above, it can't be the xsd:string, 
undermining something I wrote earlier.  Martin previously suggested it 
might denote something like  seq(markup("<br>"), markup("</br>")).

I was trying to avoid getting into defining a new value space for 
literals to allow for distinguishing markup, hence the attempt to keep 
the denotation as strings.  I think I'm beginning to see that does not work.

Regarding not conceding this point to Martin, I think where Martin is 
coming from is something like the following view:

  - simple sequences of characters are sufficient to represent all 
expressions in common western languages, but not in all languages.

  - where simple sequences of characters are allowed in formal 
languages, such as programming languages and the like, they are 
sufficient to represent expressions in common western languages but not 
in all languages.  This represents discrimination by the dominant 
technical community against non-western languages.

  - to avoid such discrimination, wherever simple sequences of 
characters are allowed in a formal language, xml markup should be 
allowed so that expressions in other languages are also permitted.



[...]

>>
>> The above design says that e.g. "<" is not in the lexical space of 
>> plain literals
> 
> 
> Wait a minute. If plain literals do not have a datatype, then this 
> 'lexical space' terminology is meaningless. Are they typed or not?

No, they are not typed.  I was using language loosely.  I needed 
terminology to distinguish between the representation in the graph of a 
literal and whatever that literal denotes.

> 
>> , and many (all?) current implementations will store
>> "<" in their representation of a graph.  The point to note is that 
>> implementations are free to represent literals any way they please. 
>> Thus "<" is just the way this implementation represents the literal 
>> "&lt;".
> 
> 
> But that is INSANE.

Really.  As an implementor I can choose to represent literals using a 
sequence of f*rts in morse code if it suits my purpose, right?

> Here I am wanting to write an RDF ontology about
> mathematical expressions, say, and I want to refer to pieces of 
> mathematical text like "2<3" (two is less than three). But now I can't 
> do that because this is short for "2&lt;3", which is completely 
> meaningless.

No - if you want to represent "2<3" in the graph, you use "2&lt;3".  In 
my implementation I can represent that in turn as "2<3" but that is a 
private matter for my implementation.
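
A minimal sketch of that separation, in Python. It assumes the literal contains no markup at all (a literal such as "<br></br>" would need the implementation to track which characters are markup, which this sketch does not attempt). The class name StoredLiteral and its methods are hypothetical.

```python
from xml.sax.saxutils import escape, unescape


class StoredLiteral:
    """Hypothetical literal whose private storage differs from its graph form.

    Assumption: the literal is markup-free, so unescaping on the way in
    and escaping on the way out round-trips exactly.
    """

    def __init__(self, graph_form: str) -> None:
        # Private representation: "2&lt;3" is held internally as "2<3".
        self._stored = unescape(graph_form)

    def graph_form(self) -> str:
        # What the rest of the world sees: the escaped form again.
        return escape(self._stored)

    def __eq__(self, other: object) -> bool:
        # Equality is decided on the graph form, not the private storage.
        if not isinstance(other, StoredLiteral):
            return NotImplemented
        return self.graph_form() == other.graph_form()


lit = StoredLiteral("2&lt;3")
print(lit._stored)       # 2<3
print(lit.graph_form())  # 2&lt;3
```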

> 
> The basic point is that the RDF machinery is not intended to be 
> restricted to only REFER to XML text. It is required to be encodable in 
> XML, but it is that already. This proposal makes it impossible to refer 
> to non-XML text.
> 
>> The implementation does need to distinguish between markup and plain 
>> text.
> 
> 
> No, the SEMANTICS needs to distinguish them.

Right.  This is where I'm aware this design is on dodgy ground.

[...]

> 
> It completely destroys the idea of a plain literal.

Replaces it with XML literal, but xml literals can represent anything a 
plain literal can represent (modulo lang tag).

> 
> I think that my 'wet fish' proposal is better than this, if we have to 
> accede to Martin.  I would prefer not to accede, since Martin has not 
> responded to the technical objections adequately, and has not given 
> actual technical arguments for his requests. They amount to statements 
> of opinion about the proper role of XML in semiotics, opinions with 
> which one may have legitimate disagreement.

Brian

Received on Tuesday, 9 September 2003 08:31:04 UTC