RE: Link-6: Addressing at the sub-element level from Steven J. DeRose on 1997-05-20 (w3c-sgml-wg@w3.org from May 1997)

From: Steven J. DeRose <sjd@eps.inso.com>
Date: Tue, 20 May 1997 12:17:57 -0400
To: Peter Flynn <pflynn@curia.ucc.ie>, w3c-sgml-wg@w3.org
Message-Id: <2.2.32.19970520161757.00b27a10@pop>

At 08:44 AM 05/20/97 +0100, Peter Flynn wrote:
>James Tauber writes:
>> Ah yes, the problematic definition of 'word'---brings back memories of
>> first-year linguistics :-)
>> 
>> Would *glyph* counting solve any of the problems?
>
>Not really, because of the problems of ligaturing. If someone is counting
>based on the typeset result in one system which ligatures fi ff fl ffi ffl
>and maybe even st, ct, and the; and another system treats them as separate
>glyphs, you've got a problem.

I'm a little confused about why these are really problems. Certainly Unicode
contains many cases where, if you only examine the rendered glyph sequence
without having access to the data stream, you can't tell what you're looking
at. Among them are:

* ligatures
* compound vs. overstruck diacritics
* hard vs. soft space
* space sequence vs. tab
* hard vs. soft vs. en vs. em vs. other dashes/hyphens

and LOTS more (note that some of these even come up in Latin-1!). Pardon me
if this is dumb, but I wonder, "so what?" If the user interaction model is
that people create links by staring at formatted output and counting glyphs,
of course there's a problem -- but that seems absurd.
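
A small Python sketch of the point, assuming only the standard `unicodedata` module: two strings can render alike and still differ in the data stream, and the data stream is unambiguous. The sample strings are illustrative assumptions, not taken from any particular document.

```python
import unicodedata

# Pairs that typically render the same but differ in the source stream.
ligature = "\ufb01"       # LATIN SMALL LIGATURE FI: one source character
plain = "fi"              # two source characters
precomposed = "\u00e9"    # e with acute as a single character
combining = "e\u0301"     # e followed by COMBINING ACUTE ACCENT

# Counting source characters is deterministic for everyone
# who has access to the same data stream.
source_counts = {
    "ligature": len(ligature),
    "plain": len(plain),
    "precomposed": len(precomposed),
    "combining": len(combining),
}

# A system that silently normalizes changes those counts, which is
# exactly the "lost track of the original data" failure.
folded_ligature = unicodedata.normalize("NFKC", ligature)
composed = unicodedata.normalize("NFC", combining)
```

Offsets computed before and after such a normalization step would disagree, so a link has to state which stream it counts in.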

If you count characters of the source content, why won't everyone count them
the same way? The only problem I see would be with a system that silently
"normalized" data somehow, and lost track of what the original data was. But
such a system will have many other problems, such as:

* Its "export" and "copy" commands will produce data that surprises
knowledgeable users.

* Text that doesn't even exist in the source will have interesting effects:
list item numbers and bullets, headings that generate "Chapter n", footnote
reference symbols, and so on and on.

* Text that is in the source but hidden will have the opposite interesting
effects: footnotes, unexpanded sections, etc.

* Text that is moved will have incorrect/indeterminate position (you *can't*
count it): footnotes, pop-up annotations, floating sidebars, etc.

* Rendering choices that depend on layout or context will lead to broken
behavior. For example, if it hyphenates for display and forgets that those
added hyphens aren't part of the source, won't it accumulate tons of bogus
hyphens if you resize your display several times? If it tries to manage
Arabic initial/medial/final forms and you type in a space somewhere, won't
it be hopelessly confused?
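
The hyphenation case in the last item can be sketched in Python. The function `hyphenate_for_display` is a hypothetical, deliberately naive line breaker (it breaks at a fixed width and always adds a hyphen); it stands in for whatever a real formatter does.

```python
def hyphenate_for_display(text: str, width: int) -> list[str]:
    """Naively break text into lines of at most `width` characters,
    appending a display-only hyphen at every break. Assumes width >= 2."""
    lines = []
    while len(text) > width:
        lines.append(text[: width - 1] + "-")
        text = text[width - 1 :]
    lines.append(text)
    return lines

source = "internationalization"

# Sound behavior: every resize re-renders from the untouched source.
for width in (8, 6, 8):
    display = hyphenate_for_display(source, width)

# Broken behavior: the rendered result is written back as the data,
# so each resize bakes more bogus hyphens into the "source".
buggy = source
for width in (8, 6, 8):
    buggy = "".join(hyphenate_for_display(buggy, width))
```

After three resizes `source` is unchanged, while `buggy` has picked up hyphens that were never in the data.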

It seems to me that the basic problem arises only if you assume that systems
count rendered objects when creating links, but source objects otherwise. Of
course, if we assume that, we're hosed. But why not count source objects
entirely, and why not make that part of our definition for the relevant
constructs? I think any workable system has to know both streams anyway.
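
One way a system might "know both streams" is sketched below in Python. The names `render_list` and `display_to_source` are hypothetical, and the source stream is assumed to be simply the list items concatenated; a real system would track markup as well.

```python
from __future__ import annotations


def render_list(items: list[str]) -> tuple[str, list[int | None]]:
    """Render items with generated numbers. Returns the display string
    and, per display character, its source offset (None if generated)."""
    display: list[str] = []
    origin: list[int | None] = []
    src = 0
    for n, item in enumerate(items, start=1):
        label = f"{n}. "                      # generated, not in the source
        display.append(label)
        origin.extend([None] * len(label))
        display.append(item)                  # real source characters
        origin.extend(range(src, src + len(item)))
        src += len(item)
        display.append("\n")                  # generated line break
        origin.append(None)
    return "".join(display), origin


def display_to_source(origin: list[int | None], start: int, end: int) -> tuple[int, int]:
    """Map a display selection [start, end) to source offsets,
    skipping generated characters."""
    offsets = [o for o in origin[start:end] if o is not None]
    if not offsets:
        raise ValueError("selection contains only generated text")
    return offsets[0], offsets[-1] + 1


items = ["alpha", "beta"]
source = "".join(items)
display, origin = render_list(items)

# The user selects "beta" on screen; the link is stored in source offsets.
sel_start = display.index("beta")
src_start, src_end = display_to_source(origin, sel_start, sel_start + len("beta"))
```

The user may point at rendered text, but the stored link only ever records positions in the source stream, so generated numbers and line breaks cannot shift it.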

Steven J. DeRose, Ph.D., Chief Scientist
Inso Electronic Publishing Solutions
   (formerly EBT)

Received on Tuesday, 20 May 1997 12:21:12 UTC