Re: addressing into char content with xml-link
At 20:04 10/04/97 -0700, Paul Grosso wrote:
>In discussions with others over the last couple of days, I've come
>to the conclusion we should consider adding to xml-link the capability
>to address into data character content (aka dataloc).
>The requirement I see is that users will expect an interface that
>allows them to highlight some text in one document, highlight some
>text in a second document, and make a link from one to the other.
>If the target is a three word phrase in the middle of a very long
>paragraph element, making the entire paragraph the target is unacceptable.
I don't think the issue here is so much whether this is a desirable
capability but whether it can be done robustly and whether it can be
>Note that Char != byte, but if we can expect the XML processor to know what
>Char is when it's parsing an XML file, I figure we can expect it to know
>what a Char is when it's addressing into an XML file.
There are many things in addition to the char/byte distinction that can mess
up character counting:
- line terminators: you move your document from a Unix to a DOS system and
suddenly all your links break because your lines now end with CR/LF rather
than LF
- RS/RE ignoring rules: you parse with an SGML-based XML parser, which does
its standard RS/RE ignoring thing
- white space collapsing: consider an application that by default does
white-space collapsing a la HTML
I do not believe simple char counting is going to be robust. Counting just
non-white space characters would be an improvement but still quite fragile.
Counting words or tokens doesn't work for many Asian languages.
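To make the fragility concrete, here is a minimal Python sketch. The sample text and the helper names char_offset and nonspace_offset are invented for illustration and are not part of any xml-link proposal. It locates the same word by raw character offset and by a count of non-white-space characters, before and after a line-terminator change and HTML-style white-space collapsing.

```python
import re

def char_offset(text: str, phrase: str) -> int:
    """0-based character offset of the first occurrence of phrase."""
    return text.index(phrase)

def nonspace_offset(text: str, phrase: str) -> int:
    """Number of non-white-space characters preceding the first occurrence."""
    start = text.index(phrase)
    return sum(1 for ch in text[:start] if not ch.isspace())

# Hypothetical sample content: one line break and one double space.
unix_text = "A very long\nparagraph,  making the target unacceptable."
dos_text = unix_text.replace("\n", "\r\n")   # Unix -> DOS line terminators
collapsed = re.sub(r"\s+", " ", unix_text)   # HTML-style white-space collapsing

for label, text in [("unix", unix_text), ("dos", dos_text), ("collapsed", collapsed)]:
    print(label, char_offset(text, "target"), nonspace_offset(text, "target"))
```

In this sample the raw offset of "target" comes out as 35, 36 and 34 for the three variants, while the non-white-space count stays at 28. That illustrates why skipping white space is an improvement; it does nothing for changes other than white space.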
One possibility would be something like:
STRING ("making the entire paragraph the target is unacceptable" 1) ("the" 2)
to find the second occurrence of the string "the" in the first occurrence of
the string "making the...unacceptable" in the location source.
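A rough sketch of how such a locator might be resolved, in Python. The STRING syntax above is only a suggestion, so everything here is an assumption: the function names, the representation of the locator as a list of (string, occurrence) pairs, exact character matching, and counting overlapping occurrences separately. Each step narrows the search to the span matched by the previous step.

```python
def nth_occurrence(text, needle, n, start, end):
    """Offset of the n-th (1-based) occurrence of needle inside text[start:end], or -1."""
    if n < 1:
        raise ValueError("occurrence numbers are 1-based")
    pos = start - 1
    for _ in range(n):
        # Resume one character past the previous hit, so overlapping hits count.
        pos = text.find(needle, pos + 1, end)
        if pos == -1:
            return -1
    return pos

def resolve_string(source, steps):
    """Resolve [(string, n), ...] to a (start, end) span in source, or None."""
    start, end = 0, len(source)
    for needle, n in steps:
        pos = nth_occurrence(source, needle, n, start, end)
        if pos == -1:
            return None  # the link is broken rather than silently misplaced
        start, end = pos, pos + len(needle)
    return start, end

# Hypothetical location source.
source = ("If the target is a three word phrase, making the entire "
          "paragraph the target is unacceptable.")
span = resolve_string(source, [
    ("making the entire paragraph the target is unacceptable", 1),
    ("the", 2),
])
print(span, repr(source[span[0]:span[1]]))
```

One property of this approach is that a failed match is detectable (the function returns None), whereas a stale character count simply points at the wrong text. As written, the sketch matches white space exactly, so it would still need some normalization to survive the line-terminator and collapsing problems listed earlier.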
However, I still think this would be too hard for XML. In particular I
think you are asking a lot of a style sheet mechanism to be able to attach
styles to arbitrary spans of character data that are not marked up as elements.