- From: David Durand <dgd@cs.bu.edu>
- Date: Wed, 21 May 1997 13:31:39 -0500
- To: w3c-sgml-wg@w3.org
At 11:38 AM -0700 5/18/97, Tim Bray wrote:
>A lot of people want to support addressing by char count, token
>count, or regexp, within #PCDATA (or [danger, Will Robinson!] mixed)
>content.

My suggestion about merging pseudo-elements makes this much easier (which is why I like it despite its being weird on the face of it).

>This is obviously a good idea for many applications. It is also
>somewhat more difficult than you'd expect in the context of
>wide Unicode characters. Opinions are solicited as to whether
>this should be done for V1.0, and if so, which ones should be done,
>and if so, what to do about the internationalization issues.

Unicode defines the rules for wide characters (and yes, they are a bit complicated). We can refer to those rules (and expect that some implementations may well get it wrong if they are reading Unicode for a script they don't handle properly). We could also use offsets by Unicode character code (i.e. 16-bit chunk), and require that authors only create such pointers at legitimate character boundaries. This means that only those creating data in a script have to do the work to make sure that they handle wide characters properly.

If we do regexp (or even string match) in a Unicode context, we may need to allow options to specify whether character canonicalization (with respect to precomposed forms, etc.) should be applied before the match. If we don't provide the user with options, we still need to specify our own policy with respect to that in any case. I'm not sure how I'd decide that: it's a choice between giving authors the task of finding every way to express, in Unicode, the string they want to search for, versus prescribing significant work for implementors, or leaving implementors the option of how to implement it (which damages interoperability, and essentially pushes the requirement back to the author, at least if they are responsible authors). In the Unicode world, counting tokens or characters is actually easier than matching strings.
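[A minimal sketch of the two pitfalls above, in modern Python rather than anything available on a 1997 toolchain; the `unicodedata` module stands in for whatever canonicalization library an implementor would use. Not part of the original message.]

```python
import unicodedata

# "café" written two canonically-equivalent ways:
# precomposed U+00E9, and decomposed "e" + combining acute U+0301.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

# A naive code-point comparison misses the match...
assert precomposed != decomposed

# ...but both normalize to the same canonical form (NFC here), so a
# matcher that canonicalizes both sides first does find it.
assert unicodedata.normalize("NFC", precomposed) == \
       unicodedata.normalize("NFC", decomposed)

# Offsets differ between the two spellings as well: the same visible
# word is 4 character codes in one form and 5 in the other, so a
# character-count pointer made against one form is wrong for the other.
assert len(precomposed) == 4
assert len(decomposed) == 5
```

This is exactly why a pointer by character code is only safe if authors create it against a known, canonical form of the data.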
There's also the question of whether you want to ignore accents on matches, etc., which is probably out of scope for XML. Maybe token counting (with an explicit list of delimiters) is actually the only safe thing. Character canonicalization makes pattern matching hard, and SGML compatibility makes character counting impossible, because we can't tell if whitespace "counts" or not.

-- David

_________________________________________
David Durand                dgd@cs.bu.edu  \  david@dynamicDiagrams.com
Boston University Computer Science          \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/     \  Dynamic Diagrams
----------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                      \__________________________
Received on Wednesday, 21 May 1997 14:05:03 UTC