Re: 4.5 TEI Extended Pointer Locators?

[From a message from Lou Burnard.  I'm not sure if this went to the WG,
but if it didn't, Lou, I hope you don't mind my replyinh there.]

In message <009B10EA.F5FEC08B.4@vax.ox.ac.uk> Lou Burnard writes:
> |Fnd I am mystified by the example in A.1.1.13 (the 3rd, 4th 5th tokens
> |in "This is _not_ a very good idea"  are apparently:
> |"not_", "a" "very"
> |The _ acts as a delimiter if it leads a token but not if it trails it?
> |(Any default use of _ as a delimiter will cause scientists a lot of
> |grief, I suspect).]
> The rule is that the first SGML namecharacter found starts a new token, and the
The rule is in TEI, not SGML I assume?  Therefore we don't *have* to accept
it precisely for XML...?

> first non namecharacter terminates it, with the proviso that any character
> whose status as a namecharacter is not defined in the current SGML declaration
> is treated as a non-name-char (i.e. as white space). See TEI P3 p 419, from
> which the XML spec has been derived). This means that the underscore is, by
> default, regarded as white space, so the location specification TOKEN (3 5)

I understand the logic!  This is going to create great fun with numbers.
Let's try something like:

<NUMBERS>-.2 +1.3 0.13E-14 -2.3E+14</NUMBERS>

These are all (I think) valid IEE representations.  If the rule is applied
as stated then (I think) namechars include '-' but not '+'.  Remarkably the
first 3 numbers appear to survive, but the fourth gets its exponent

In fact, given this rule, I cannot see how numbers can be transmitted in
text (and I'm sure that there are even worse things we want to transmit!

I would very much like to avoid having to use the construction:
<NUMBER>-.2</NUMBER><NUMBER>+1.3</NUMBER>, etc. This will drive
scientists away quite rapidly.  Many languages (including XML) have a concept
of whitespace, so please can we qualify TOKEN with (say) DELIMITER="XML-S"
or similar if we want it.

> starts with the "not" of "not_". The example is perhaps a little confusing
> because it requests a *span* of tokens (from the 3rd i.e. "not"  to the 5th
> i.e. "very"), and a span, naturally enough, includes all the characters from
> the start of its beginning to the end of its end, including therefore one of
> the underscores, but not the other. Seemed a good idea at the time!
It always does!  The challenge is how can we use it for free format material
which is not solely 'text'.  I think we must address this.


Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences