- From: David Durand <dgd@cs.bu.edu>
- Date: Wed, 21 May 1997 13:31:39 -0500
- To: w3c-sgml-wg@w3.org
At 11:38 AM -0700 5/18/97, Tim Bray wrote:
>A lot of people want to support addressing by char count, token
>count, or regexp, within #PCDATA (or [danger, Will Robinson!] mixed)
>content.

My suggestion about merging pseudo-elements makes this much easier (which is why I like it despite its being weird on the face of it).

>This is obviously a good idea for many applications. It is also
>somewhat more difficult than you'd expect in the context of
>wide Unicode characters. Opinions are solicited as to whether
>this should be done for V1.0, and if so, which ones should be done,
>and if so, what to do about the internationalization issues.

Unicode defines the rules for wide characters (and yes, they are a bit complicated). We can refer to those rules (and expect that some implementations may well get it wrong if they are reading Unicode for a script they don't handle properly). We could also use offsets by Unicode character code (i.e. 16-bit chunk), and require that authors only create such pointers at legitimate character boundaries. This means that only those creating data in a script have to do the work to make sure that they handle wide characters properly.

If we do regexp (or even string match) in a Unicode context, we may need to allow options to specify whether character canonicalization (with respect to precomposed forms, etc.) should be applied before the match. If we don't provide the user with options, we still need to specify our own policy with respect to that in any case. I'm not sure how I'd decide that: it's a choice between giving authors the task of finding every way to express, in Unicode, the string they want to search for, versus prescribing significant work for implementors, or leaving implementors the option of how to implement it (which damages interoperability, and essentially pushes the requirement back to the author, at least if they are responsible authors). In the Unicode world, counting tokens or characters is actually easier than matching strings.
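[A minimal sketch of the two pitfalls above, in modern Python rather than anything available on a 1997 toolchain; the `unicodedata` module stands in for whatever canonicalization library an implementor would use. Not part of the original message.]

```python
import unicodedata

# "café" written two canonically-equivalent ways:
# precomposed U+00E9, and decomposed "e" + combining acute U+0301.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

# A naive code-point comparison misses the match...
assert precomposed != decomposed

# ...but both normalize to the same canonical form (NFC here), so a
# matcher that canonicalizes both sides first does find it.
assert unicodedata.normalize("NFC", precomposed) == \
       unicodedata.normalize("NFC", decomposed)

# Offsets differ between the two spellings as well: the same visible
# word is 4 character codes in one form and 5 in the other, so a
# character-count pointer made against one form is wrong for the other.
assert len(precomposed) == 4
assert len(decomposed) == 5
```

This is exactly why a pointer by character code is only safe if authors create it against a known, canonical form of the data.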
There's also the question of whether you want to ignore accents on matches, etc., which is probably out of scope for XML. Maybe token counting (with an explicit list of delimiters) is actually the only safe thing. Character canonicalization makes pattern matching hard, and SGML compatibility makes character counting impossible, because we can't tell if whitespace "counts" or not.

-- David

_________________________________________
David Durand                dgd@cs.bu.edu  \  david@dynamicDiagrams.com
Boston University Computer Science          \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/     \  Dynamic Diagrams
----------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                      \__________________________
Received on Wednesday, 21 May 1997 14:05:03 UTC