Re: too late for <wbr>, too soon for &#x200B; ?

From: Chris Lilley <chris@w3.org>
Date: Fri, 24 Mar 2000 19:58:19 +0100
Message-ID: <38DBBACB.E2FE393C@w3.org>
To: Doug Cooper <doug@th.net>
CC: www-international@w3.org

Doug Cooper wrote:
> At 15:37 23/3/00 +0100, Chris Lilley wrote:
> >Have you tried XML browsers, or just HTML ones?


> >OK, so you are saying that the expectation is that explicit segmentation
> >using a zero-width space is acceptable to the community of SEA
> >non-segmented language users? Certainly it is a lot more tractable than
> >per-language dictionary lookup, for implementors.
>   Unfortunately, this isn't an either/or situation.  An explicit zero-width
> space is necessary because:
>   -- names, loanwords, neologisms, misspellings, etc. create situations in
>      which standard approaches to word breaking produce errors,
>   -- since their bounds are not easily identified, these unknown areas can
>      make much longer sequences unsegmentable, or lead to incorrect
>      segmentation,
>   -- you can't assume that reliable dictionaries (or national interchange
>      standards, for that matter) are available for render-time breaking.

OK so it is belt and braces, or occasional overrides.

>    On the other hand, because a) &#x200B; more than doubles a doc's
> 'text payload' size,

Not if you declare it upfront as a short name, like w, and then use &w;
which is three characters (three bytes in UTF-8). Of course, you can encode
your file in UTF-16 and actually put the explicit character right where it
is needed, without entities, and then it is only two bytes.

Recall that XML lets you do this sort of thing; you can do this in XHTML
and it will work (if the browser uses an XML parser, which is supposed to
be the point).
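
To make the arithmetic concrete, here is a minimal Python sketch of the short-entity approach. It assumes a toy document whose internal DTD subset declares the entity w, and it uses the standard library's ElementTree parser as a stand-in for an XML-aware browser. The names DOC, parse_text and encoded_sizes are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the internal subset declares the short name "w"
# for ZERO WIDTH SPACE (U+200B), as suggested in the message.
DOC = (
    '<!DOCTYPE p [\n'
    '  <!ENTITY w "&#x200B;">\n'
    ']>\n'
    '<p>word&w;break</p>'
)


def parse_text(doc: str) -> str:
    """Parse the document and return the root element's text,
    with the &w; entity expanded by the XML parser."""
    return ET.fromstring(doc).text


def encoded_sizes() -> dict:
    """Bytes needed for one break opportunity in each notation."""
    zwsp = "\u200b"
    return {
        "&#x200B;": len("&#x200B;".encode("utf-8")),   # numeric reference
        "&w;": len("&w;".encode("utf-8")),             # declared short entity
        "raw UTF-8": len(zwsp.encode("utf-8")),        # literal character
        "raw UTF-16": len(zwsp.encode("utf-16-le")),   # literal, no BOM
    }


if __name__ == "__main__":
    print(repr(parse_text(DOC)))
    print(encoded_sizes())
```

The numeric reference costs eight bytes per break, the declared entity three, and the literal character three bytes in UTF-8 or two in UTF-16.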

> and b) most apps that generate HTML do not
> insert breaks, some mechanism for breaking at render-time is needed.

So, you see these breaks as occasional overrides for, as you said,
neologisms and the like - in which case the effect on file size should not
be great.

> However, the solution is certainly _not_ to enshrine one particular
> approach -- especially if that approach (dictionary-based maximal
> matching) is known to be flakey.
>    IMHO, a better way is to provide a hook, called just before the standard
> line-breaking code, that lets a local app insert zero-width spaces as needed.
> Maximal matching can be provided as a default local app, but there are
> other, lighter-weight approaches to weak segmentation for less-well-
> documented languages (Burmese, say), as well as more robust methods
> for better-studied systems like Thai.
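
For readers who want a picture of the hook idea quoted above, here is a rough Python sketch. It is an assumption about how such a hook might look, not a description of any browser's code: insert_breaks is a hypothetical hook that keeps any zero-width spaces the author already supplied and passes the runs between them to a pluggable segmenter, and maximal_match is a naive greedy longest-match default. The lexicon uses Latin placeholder words purely for illustration.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def maximal_match(text, lexicon):
    """Greedy longest-match segmentation (illustrative default).

    Characters that start no lexicon word are collected into a single
    'unknown' piece, so no break is invented inside an unknown span.
    """
    longest = max((len(word) for word in lexicon), default=0)
    pieces, unknown, i = [], "", 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        match = next(
            (text[i:i + n]
             for n in range(min(longest, len(text) - i), 0, -1)
             if text[i:i + n] in lexicon),
            None,
        )
        if match is None:
            unknown += text[i]
            i += 1
            continue
        if unknown:
            pieces.append(unknown)
            unknown = ""
        pieces.append(match)
        i += len(match)
    if unknown:
        pieces.append(unknown)
    return pieces


def insert_breaks(text, segmenter):
    """Hypothetical pre-line-breaking hook.

    Explicit ZWSPs in the source act as overrides and are kept; only the
    runs between them are handed to the segmenter.
    """
    runs = [run for run in text.split(ZWSP) if run]
    return ZWSP.join(ZWSP.join(segmenter(run)) for run in runs)


if __name__ == "__main__":
    lexicon = {"the", "cat", "sat"}  # placeholder lexicon, not real data
    segment = lambda run: maximal_match(run, lexicon)
    print(insert_breaks("thexyzcat" + ZWSP + "sat", segment).split(ZWSP))
```

Because the segmenter is passed in as a parameter, a lighter-weight or more robust method could replace maximal matching without touching the hook itself.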

I would need to see a more fully-worked-out proposal before commenting on

> >>   I'm raising this issue now both in the hope of resurrecting <wbr>,
> >Unlikely ...
>   Yet, hope springs eternal;-).  I just got this from ftang@netscape.com:
> >We are going to release the beta of Netscape 6 in early April
> >I believe <wbr> works in Netscape 6 beta 1.

Well if so, I sincerely hope it is spelled <wbr/>

>   But it's the bigger picture that I want to address.  If a tag of this
> importance (eg, it's ballpark 50% of the text volume of many Thai
> html pages) can disappear, then either:
>   a) somebody is not articulating SEA needs clearly, or
>   b) somebody is not listening.

Or perhaps both. But hey, you seem to be articulating your needs and people
are listening, so it looks good from here on out, no?

Incidentally, was wbr ever part of any HTML spec? And how is it different
from the zwnj entity in HTML 4.0?

Received on Friday, 24 March 2000 13:58:40 UTC
