Re: word separators (was: Ignoring empty paragraphs) from Chris Lilley on 2000-04-14 (www-html@w3.org from April 2000)

From: Chris Lilley <chris@w3.org>
Date: Fri, 14 Apr 2000 14:11:37 +0200
To: daniel.glazman@polytechnique.org
CC: Dan Connolly <connolly@w3.org>, webmaster@richinstyle.com, www-html@w3.org, www-style@w3.org
Message-ID: <38F70AF9.24E9D767@w3.org>
Daniel Glazman wrote:
> 
> Dan Connolly wrote:
> 
> > But examining the formalities... that is a paragraph with exactly
> > one word in it. The HTML definition of a word is not impacted
> > by stylesheets:

Yes, that is part of the problem: the HTML spec strays outside of markup
and starts to describe presentation, often loosely, which creates problems
of harmonisation and integration for those tasked with the work of
formatting. The Q element is a classic case.

Imagine your dismay for example if some spec said in passing

"for URLs (we use the term "links" for URLs) they may be readily
distinguished since they always start with the string "www." and end in
".htm". When following links, traverse them according to the rules in HTTP
or whatever other weird protocol you may be using."

And someone then asserted that the definition of both 'link' and of 'URL'
was "independent of the HTTP spec, XLink spec or the URL spec" and pointed
to that carefully crafted piece of naive misdirection to back them up. 

> > "For all HTML elements except PRE, sequences of white space separate
> > "words" (we use the term "word" here to mean "sequences of non-white
> > space characters"). When formatting text, user agents should identify
> > these words and lay them out according to the conventions of the
> > particular written language (script) and target medium."
> > -- http://www.w3.org/TR/1999/REC-html401-19991224/struct/text.html#h-9.1

Works fine for American, I mean, English. I can supply a screenshot of what
happens when this algorithm is followed literally with Arabic. The words
themselves are correct and read from right to left, but the sequence of
words reads from left to right, because each word was displayed individually
and the Unicode bidi algorithm never got a chance to kick in properly.
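
A toy sketch of that failure, assuming a purely right-to-left paragraph and
using capital letters to stand in for Arabic letters. It is not the Unicode
bidirectional algorithm, only an illustration of reordering per word versus
per line; both function names are made up for this example.

```python
def visual_order_whole_line(logical):
    """Reorder a purely right-to-left line as a single unit."""
    return logical[::-1]


def visual_order_word_by_word(logical):
    """Reorder each word on its own, then place the words left to right."""
    return " ".join(word[::-1] for word in logical.split(" "))


logical = "ABC DEF GHI"  # storage order; the first word is ABC

print(visual_order_whole_line(logical))    # IHG FED CBA
print(visual_order_word_by_word(logical))  # CBA FED IHG
```

Reading the second result from the right, the first word met is GHI, which
is the last word of the text: each word is intact, but their sequence runs
the wrong way.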

I recall someone saying that formatting was easy - you just lay out the
characters one by one until you hit the right margin, then move down a
line. ;-(

> Just FYI, for the information of the other readers: one of my best friends
> is a Latinist. A few weeks ago he had to put on an intranet the exact copy
> of a Romanian wall inscription where words are separated by a colon,
> not by whitespace. It means that the common formatting algorithms in
> browsers don't work and he *has* to insert whitespace in that quotation.

Well, it could be argued that the exact line breaks in the inscription
should also be preserved, for example to demonstrate that Roman
inscriptions often broke words in the middle. But I take your point in
general. I have also seen middle dots used as word separators in Latin,
Greek and Gaulish inscriptions.
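
A minimal sketch of what a configurable word separator could look like,
assuming the separator set is simply handed to the splitter. The function
name, its parameter and the sample string are illustrative only; the sample
is not the inscription discussed here.

```python
import re


def split_words(text, separators=" \t\n\r\f"):
    """Split text on any run of the given separator characters."""
    pattern = "[" + re.escape(separators) + "]+"
    return [word for word in re.split(pattern, text) if word]


sample = "SENATVS·POPVLVSQVE·ROMANVS"

print(split_words(sample))       # ['SENATVS·POPVLVSQVE·ROMANVS']
print(split_words(sample, "·"))  # ['SENATVS', 'POPVLVSQVE', 'ROMANVS']
```

With the default separators the whole line is one word; naming the middle
dot as a separator recovers three words without altering the source text.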

> It also
> means that a copy/paste of the quotation is incorrect. I told him that
> some useful whitespace is not so important; he turned red and explained
> to me for ten minutes that the original text has no spaces and Science (with
> a big S) needs to show/print the text as it stands on the Romanian wall.

Right. Similarly, readers of Thai and of Japanese will be amused to
discover that their text consists of a single word, due to that ill-advised
text cited above. I mention this before someone points out that their
customers don't care about 2000-year-old dead languages. There are, I
heard, something like 12 million Japanese speakers in the USA.
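
To make that concrete, here is a small sketch that applies the quoted
definition literally. The white space set is my reading of section 9.1 of
HTML 4.01 (space, tab, form feed, zero-width space, plus CR and LF); the
function name and the two sample sentences are assumptions for illustration.

```python
import re

# White space characters as I read HTML 4.01 section 9.1 (an assumption):
# space, tab, form feed, zero-width space, and the line breaks CR and LF.
HTML401_WHITE_SPACE = "\u0020\u0009\u000C\u200B\r\n"


def html401_words(text):
    """Return the "sequences of non-white space characters" in text."""
    pattern = "[" + re.escape(HTML401_WHITE_SPACE) + "]+"
    return [word for word in re.split(pattern, text) if word]


english = "I live in Tokyo"
japanese = "私は東京に住んでいます"  # the same sentence, written without spaces

print(len(html401_words(english)))   # 4
print(len(html401_words(japanese)))  # 1
```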

Incidentally, if that Romanian inscription were marked up in SVG, it could
show the exact letter forms, including contractions, omissions and variant
forms, while retaining the text as searchable Unicode text with a full
description as well, in multiple languages. Just FYI.


> I raised this issue some time ago. Sometimes, document providers need
> to specify, on a per-element basis, which char should be interpreted as
> a word separator.

Right. And the definitions above are woolly, to say the least. What is
whitespace? Is it the whitespace as defined in the S production of XML? Does
it include zwnj and ideographic space and so forth? The problem arises
because the HTML spec, mainly concerned with space markup, went outside its
scope into formatting of words.
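
A short sketch of how far two plausible definitions disagree. It compares
the S production of XML 1.0 (space, tab, CR, LF) with Python's own
str.isspace() test, which stands in here for "some Unicode-aware notion of
whitespace"; the helper name and the choice of characters are mine.

```python
import unicodedata

# XML 1.0 production S: (#x20 | #x9 | #xD | #xA)+
XML_S = {"\u0020", "\u0009", "\u000D", "\u000A"}

CANDIDATES = ["\u0020", "\u00A0", "\u200B", "\u200C", "\u3000"]


def classify(ch):
    """Report how each definition treats a single character."""
    return {
        "xml_s": ch in XML_S,
        "python_isspace": ch.isspace(),
        "category": unicodedata.category(ch),
    }


for ch in CANDIDATES:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {classify(ch)}")
```

Ideographic space (U+3000) is whitespace for str.isspace() but not for XML
S, and zwnj (U+200C) is whitespace for neither, so "sequences of white
space" means different things depending on which definition a reader picks.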
> 
> > But I'd like to discourage authors from relying on <p><p><p>
> > to skootch their text down a little bit.
> 
> Yes, of course. But that's not enough; they can still write
> <br><br><br>, which is IMHO ugly too...

And which should be collapsed, or should not be collapsed, depending on who
you ask.


--
Chris
<pre>















</pre>
skootched!
Received on Friday, 14 April 2000 08:11:51 UTC