- From: Steven J. DeRose <sjd@eps.inso.com>
- Date: Tue, 20 May 1997 12:17:57 -0400
- To: Peter Flynn <pflynn@curia.ucc.ie>, w3c-sgml-wg@w3.org
At 08:44 AM 05/20/97 +0100, Peter Flynn wrote:
>James Tauber writes:
>> Ah yes, the problematic definition of 'word'---brings back memories of
>> first-year linguistics :-)
>>
>> Would *glyph* counting solve any of the problems?
>
>Not really, because of the problems of ligaturing. If someone is counting
>based on the typeset result in one system which ligatures fi ff fl ffi ffl
>and maybe even st, ct, and the; and another system treats them as separate
>glyphs, you've got a problem.

I'm a little confused about why these are really problems. Certainly
Unicode contains many cases where, if you only examine the rendered glyph
sequence without having access to the data stream, you can't tell what
you're looking at. Among them are:

* ligatures
* compound vs. overstruck diacritics
* hard vs. soft space
* space sequence vs. tab
* hard vs. soft vs. en vs. em vs. other dashes/hyphens

and LOTS more (note that some of these even come up in Latin-1!).

Pardon me if this is dumb, but I wonder, "so what?"

If the user interaction model is that people create links by staring at
formatted output and counting glyphs, of course there's a problem -- but
that seems absurd. If you count characters of the source content, why
won't everyone count them the same way?

The only problem I see would be with a system that silently "normalized"
data somehow, and lost track of what the original data was. But such a
system will have many other problems, such as:

* Its "export" and "copy" commands will produce data that surprises
  knowledgeable users.

* Text that doesn't even exist in the source will have interesting
  effects: list item numbers and bullets, headings that generate
  "Chapter n", footnote reference symbols, and so on and on.

* Text that is in the source but hidden will have the opposite
  interesting effects: footnotes, unexpanded sections, etc.

* Text that is moved will have incorrect/indeterminate position (you
  *can't* count it): footnotes, pop-up annotations, floating sidebars,
  etc.

* Rendering choices that depend on layout or context will lead to broken
  behavior. For example, if it hyphenates for display and forgets that
  those added hyphens aren't part of the source, won't it accumulate tons
  of bogus hyphens if you resize your display several times? If it tries
  to manage Arabic initial/medial/final forms and you type in a space
  somewhere, won't it be hopelessly confused?

It seems to me that the basic problem arises only if you assume that
systems count rendered objects when creating links, but source objects
otherwise. Of course if we assume that, we're hosed. But why not count
source objects entirely, and why not make that part of our definition for
the relevant constructs? I think any workable system has to know both
streams anyway.

Steven J. DeRose, Ph.D., Chief Scientist
Inso Electronic Publishing Solutions
(formerly EBT)
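
A minimal sketch of the source-versus-rendered distinction argued above, in Python. It is an illustration only, not anything proposed in this thread: the sample strings and the helper name `source_offset` are hypothetical, and the standard `unicodedata.normalize` call stands in for a system that silently "normalizes" its data. Counting code points in the source is deterministic, but the count shifts if the data is normalized and the original is forgotten.

```python
import unicodedata

# Two source strings that a renderer might display identically.
separate = "office"        # o f f i c e: six code points
ligated = "o\uFB03ce"      # U+FB03 LATIN SMALL LIGATURE FFI: four code points

# Compound vs. overstruck diacritic: same appearance, different source.
composed = "\u00E9"        # precomposed e with acute
decomposed = "e\u0301"     # e followed by COMBINING ACUTE ACCENT


def source_offset(text: str, target: str) -> int:
    """Offset of target counted in source code points (hypothetical helper)."""
    return text.index(target)


# Everyone who counts the *source* gets the same answer for a given stream.
offset_in_source = source_offset(ligated, "c")

# A system that normalizes silently and loses the original changes the count.
normalized = unicodedata.normalize("NFKC", ligated)
offset_after_normalizing = source_offset(normalized, "c")

print(len(separate), len(ligated))                 # code point counts differ
print(len(composed), len(decomposed))              # likewise for the diacritic
print(offset_in_source, offset_after_normalizing)  # the link target has moved
```

The offsets only disagree once the normalized stream replaces the original, which is the case described above as the real source of trouble.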
Received on Tuesday, 20 May 1997 12:21:12 UTC