Re: Representation of strings and characters in XML version of ixml from C. M. Sperberg-McQueen on 2021-12-27 (public-ixml@w3.org from December 2021)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Mon, 27 Dec 2021 09:27:44 -0700
To: Steven Pemberton <steven.pemberton@cwi.nl>
Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, "Liam R. E. Quin" <liam@fromoldbooks.org>, ixml <public-ixml@w3.org>
Message-Id: <19D5D67D-361A-4178-A114-812BD3408F04@blackmesatech.com>

> On 27,Dec2021, at 6:40 AM, Steven Pemberton <steven.pemberton@cwi.nl> wrote:
> 
>> The more I think about it, the more I think that preserving
>> the distinction between dstring and sstring is just a relic of the
>> time when the design wanted to preserve the accidentals of the
>> ixml grammar, and is at best misleading. So I now lean towards
>> “let us mark them both as @string”.
> This discussion has persuaded me of that too.
> 
>> Hex notation, on the other hand, I continue to regard as
>> something I’d like to preserve.
> 
> I agree, but now you need to explain to me why that doesn't count for @from and @to.

Because in @from and @to, the string ‘#a’ is unambiguously 
a reference to the character U+000A and the strings ‘#’ and
‘a’ are unambiguously references to characters U+0023 and
U+0061, respectively.
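
As a minimal sketch of that rule (the function name is hypothetical and not taken from the ixml spec; it assumes a @from or @to value is either exactly one character or a hash mark followed by hex digits):

```python
import string


def decode_range_endpoint(value: str) -> str:
    """Return the single character denoted by a @from or @to value.

    Assumption for illustration: the value is either exactly one
    character (taken literally) or '#' followed by hex digits.
    """
    if len(value) == 1:
        # A one-character value stands for itself, even when it is '#'.
        return value
    digits = value[1:]
    if value.startswith("#") and digits and all(c in string.hexdigits for c in digits):
        # Anything longer must be a hex reference to one code point.
        return chr(int(digits, 16))
    raise ValueError(f"not a single character or hex reference: {value!r}")
```

Here the length of the value alone decides how to read it, so no extra attribute or escape is needed.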

If we reduced sstring, dstring, and hex encoded strings all to
the same attribute, we would need to find a way to determine
whether the string ‘#a’ denoted

  - the character sequence U+0023, U+0061, or
  - the character sequence U+000A

We could introduce some escaping mechanism for ‘#’, I
suppose.  By analogy with the escaping mechanisms for
single and double quotes we might say that a single
hash mark introduces a hex sequence denoting a single
character, and ‘##’ denotes a literal hash mark.  And then
we need a way to signal the end of the hex sequence.
Maybe another hash mark?  Is the end-of-sequence marker
obligatory or can it be omitted if the next character is not
a legal hex character?  I.e. can we write “yes,#asir!” or 
must we write “yes,#a#sir!”?  You will note that without
even perceiving it myself I have shifted from thinking of
hex-encoded strings as an alternative to literal strings, as
they are now, to hex-encoding as something we can 
embed in a larger string.  And suddenly I find myself at the
bottom of a slope that turned out to be a little slippery
at the top.  Not too bad a slope, but still …
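
To make the slope concrete, here is a rough sketch of the hypothetical escaping scheme just described. Nothing in it is part of ixml: the function name, the `require_terminator` flag, and the error behavior are all assumptions made for illustration.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def decode_escaped(text: str, require_terminator: bool = True) -> str:
    """Decode a string under the hypothetical scheme sketched above.

    '##' denotes a literal hash mark; a single '#' introduces hex
    digits for one character, closed by another '#'. When
    require_terminator is False, the closing '#' may be omitted if
    the next character is not a hex digit.
    """
    out = []
    i = 0
    while i < len(text):
        if text[i] != "#":
            out.append(text[i])
            i += 1
            continue
        if text.startswith("##", i):
            # Doubled hash mark: one literal '#'.
            out.append("#")
            i += 2
            continue
        # Single hash mark: collect the hex digits that follow it.
        j = i + 1
        while j < len(text) and text[j] in HEX_DIGITS:
            j += 1
        if j == i + 1:
            raise ValueError("'#' is not followed by hex digits")
        if j < len(text) and text[j] == "#":
            end = j + 1  # consume the closing hash mark
        elif require_terminator:
            raise ValueError("hex sequence is missing its closing '#'")
        else:
            end = j
        out.append(chr(int(text[i + 1:j], 16)))
        i = end
    return "".join(out)
```

With an optional terminator, "#abad idea" is read as U+ABAD followed by " idea", not as a newline followed by "bad idea", which is the kind of extra rule the spec would then have to spell out.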

If we serialize both dstring and sstring as @string, and
hex as @hex, the spec becomes simpler.

If we serialize all three as @string, the spec becomes slightly
more complex because of the rules about how to tell
when hex escaping is being used.

When I started this mail I was preparing to say that if
we did the same for literals as we do for from and to — allow
hex encoding or conventional encoding — that would be OK,
too.  But I have persuaded myself that unless there is a simpler
way to do it than I have come up with so far, it would not be
an improvement, because the simplification is only apparent,
not real.

The reason from and to don’t need distinct attributes
for their conventional form and their hex-encoded form
is that the length of the value reliably distinguishes them.
The reason literals do need distinct attributes is that the
distinct attributes are a simple way to carry the distinction,
and appear to be simpler than any alternative.

I think I’ve persuaded myself; did I also persuade you?

Michael

Received on Monday, 27 December 2021 16:28:04 UTC