Re: HTML Streaming from Dan Sugalski on 1997-09-08 (www-html@w3.org from September 1997)

From: Dan Sugalski <sugalsd@lbcc.cc.or.us>
Date: Mon, 08 Sep 1997 09:52:55 -0700
To: Albertfine@aol.com, www-html@w3.org
Message-Id: <3.0.3.32.19970908095255.008a5830@stargate.lbcc.cc.or.us>

At 07:55 AM 9/8/97 -0400, Albertfine@aol.com wrote:
>sugalsd@lbcc.cc.or.us (Dan Sugalski) wrote:
>
>>You're not going to get significant enough compression to make it
>>worthwhile. Adding even 30% overhead to a web page to make it render faster
>>is not worth it, and I think you'll find it difficult to get much smaller
>>than that and still stay within the limits of a less-than-seven-bit
>
>Right now, I am not concerned with the degree of compression.

You should be, though. The original point of your proposal, as I understood
it, was to create some mechanism such that a rough sketch of the page could
be displayed, with the details filled in as the page was received. If the
rough draft takes a significant amount of time to load, relative to the
size of the page it describes, it's going to be counterproductive.

>                                                              I agree, it 
>does seem it would be inefficient but the global description is not even 
>complete. There are more things to consider than text. It may be very 
>different from what it is now and may have a better degree of compression.
>If it is not an error free description it can still be of use. The W3,
>unlike the members of this list, thinks that things should at least be
>tried to see if there are any benefits. Take the width attribute in the
>pre element. It uses 
>a character only description.
>
>   width = integer
>
>          This attribute provides a hint to visual user agents about the
>          desired width of the formatted block. The user agent can use
>          this information to select an appropriate font size or to
>          indent the content appropriately. The desired width is
>          expressed in number of characters. This attribute is not widely
>          supported currently.

This is a very specific case, and is valid *because the visual properties
of PRE text are defined*. PRE text is preformatted, and is supposed to be
displayed in a fixed width font, with basically *no* messing about by the
browser.

You'll also note HEIGHT and WIDTH attributes for OBJECT and IMG elements.
The *lack* of those attributes, or something like them, on other elements
is a pretty good clue that someone decided they weren't worth the trouble.

>Btw you wouldn't need this attribute if events existed. If a better 
>description comes along, such as characters and spaces, it wouldn't need to 
>be changed or wait for support. This is one of the benefits of events. The
>call element could also help here.
>
>>character set. (Honestly, since you're looking for accurate rendering,
>>you're going to have to have an intimate knowledge of the font metrics used
>>to render the page, and you just can't have that at page-generation time. A
>>truly accurate mockup of the page is an impossibility, alas)
>
>You can calculate the font metrics from the panose number. If it is the 
>base font, it can also be calculated before the text is actually downloaded.

Panose numbers aren't going to do you a bit of good *without the text that
generated the number*! Giving the bounding box for a word under a
particular font, even with panose numbers to describe that font, doesn't
give enough info to generate a bounding box for a font with different
properties.

I'm thinking you misunderstand the point of Panose numbers. They describe
the properties of a font well enough to let a program choose a different
font that reasonably resembles the original. The point here is
*substitution* of fonts, and it works from a Panose database that, for
user purposes, does *not* have font metrics in it. The problem you have to
deal with is the case where the font is *fixed* at the browser.

Here's an example that might help. You have a word, say 'mmm', and generate
a bounding box with Times as your font. I, however, have my browser set to
use Courier. My browser gets the panose number for the font, and the
bounding box. Now, that box could describe 'mmm', which means three
characters. It could *also* describe 'iiiiii', which in Times is
approximately the same height and width. For me in Courier, though, the
bounding box is fairly radically different.
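
A minimal sketch of the example above, in Python. The advance widths are
hypothetical round numbers (in tenths of an em) picked only to mirror the
'mmm' versus 'iiiiii' case; they are not real Times or Courier metrics, and
`text_width` is a made-up helper.

```python
# Hypothetical advance widths in tenths of an em; NOT real font metrics.
PROPORTIONAL = {"m": 8, "i": 4}   # stands in for Times
MONOSPACE = {"m": 6, "i": 6}      # stands in for Courier


def text_width(text, widths):
    """Sum the per-character advance widths of text."""
    return sum(widths[ch] for ch in text)


for word in ("mmm", "iiiiii"):
    # Same width in the proportional table, different in the monospace one.
    print(word, text_width(word, PROPORTIONAL), text_width(word, MONOSPACE))
```

Under these assumed widths both words fill the same box in the proportional
font but differ by a factor of two in the monospace one, so the box by
itself cannot be rescaled without knowing the text.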

>>I think the assumption that this is going to be done exclusively by HTML
>>editors is something that's going to have to get jettisoned. If you work on
>>that assumption then you might as well pack it in now, since you'll never
>>get a significant enough user-base to get any of the significant browser
>>makers to bother with it.
>
>I imagine it could be a program separate from the HTML editor similar to an 
>assembler. The program would take editor written or hand written code and add
>pre rendering attributes, organize text etc. I think HTML editors are a
>significant part of user base. One comes with every copy of Communicator. 
>HTML is becoming more complicated. Using notepad or bbedit is becoming very 
>inefficient. Simply editing an HTML file is one of the considerations behind 
>style sheets. I think users will consider its use for its benefit and not
>where it is available.

I think you're wildly misjudging the direction that web design is going. An
increasingly large fraction of pages are being generated automatically,
either on the fly or in batches, from databases and other non-UI systems.
The site I run, as an example, has about 2800 pages, of which maybe 10% are
*not* generated by a program.

>>Also, from what I've seen you're counting on tools and info that's not
>>easily available to the majority of the engines generating web pages, i.e.
>>CGI scripts and database-driven pages.
>
>It would not be impossible to do, though. Incomplete events would still
>be of some help to the browser. For example, starting the Java compiler would
>be of help to many modular browsers.
>
>>Actually, it *is* a printing problem, and it would be as inaccurate as I
>>say. Worse, really. When printing, you can make some very valid page size
>>assumptions (8.5x11 or A4 paper) which you can't make for browsers. A web
>>page in 12 point Times is going to have a significantly different layout in
>>a 400x400 window and a 1200x900 one.
>
>Technically it is a display problem. Font degradability is only classically a
>printing problem. I thought looking at it from various perspectives would 
>help. It really makes no difference if the layout of the window is 400x400 
>or 1200x900. The browser would know how much room it has for display,
>just as the printer knows the size of the paper. The only real difference 
>is that it may break at any particular point. In printing, this is not really
>a concern because the information is already displayed.

Sure, printing is a subset of displaying, one that's simpler in many ways.
The issue I was trying to raise with you is that, because the end display
medium is *so* varied, you can't make *any* assumptions about it when
generating your page descriptions.

>>That pretty much leaves you describing individual words and the rectangle
>>they take. English has an average word size less than six characters. Do
>>you really think you can do it in an average less than 2? (Don't forget
>>you'll need both width and height because you can't assume that font
>>metrics won't change from word to word, or even character to character)
>
>This is only one way of describing text. It fails with the addition of more 
>complicated HTML like HTML math. The description that I am developing now 
>works from its own database and describes a larger variety of data. Again, it
>may not be error free but would be of some benefit.

Bounding boxes are the easiest, and most compact, way to do what you seem
to be talking about. HTML math items (which are currently dead, though they
might reappear) can be treated like any other OBJECT, and have a HEIGHT and
WIDTH attribute on them. 
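
For a rough sense of the cost of per-word boxes, here is a back-of-envelope
sketch in Python. It assumes six characters per word (the upper bound from
the quoted estimate) and a two-character descriptor per word, and it ignores
spaces and markup; `overhead` is a hypothetical helper, not part of any
proposal.

```python
def overhead(avg_word_chars, descriptor_chars):
    """Fraction by which per-word descriptors grow the text."""
    return descriptor_chars / avg_word_chars


# Assumed figures: 6 characters per word, 2 characters per descriptor.
print(f"{overhead(6, 2):.0%}")
```

Since the average word is shorter than six characters, the real fraction
would be somewhat higher than this figure.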

>>And you did specify you were shooting for lossless, and you just can't have
>>that. Now, if you shoot for 'mostly accurate' and take a good guess at
>>standard Times metrics, your assumption will probably be valid, or close
>>enough for several varieties of Times on different platforms, for most
>>(60-80%) of the people viewing the page. OTOH, 20-40% of the people *won't*
>>be able to use your assumptions, thus incurring the overhead of downloading
>>a data description they can't use.
>
>If you assume all the fonts are described with the panose matching system;
>this would not be a problem.

I think you misunderstand the uses of panose font numbers. They just
*don't* give you enough info to calculate changes in bounding boxes.

>>By all means, work it out, it's a good exercise. I think, unfortunately,
>>you'll find that the increase in size your additions make will end up
>>slowing down the ultimate display of the page enough to make it
>>counterproductive.
>
>You take a very pessimistic view of this. It may have been the way I 
>introduced it; very rough and in the developmental stage. Too late :) I think
>events already show numerous benefits. I just finished checking the load 
>times of various plugins and compilers versus download time of the HTML file,
>placement of the commands and browser shell commands. There are numerous 
>benefits to the program and even the OS.

You found pages where the time to fire up a reasonably sized plug-in was
*less* than the time it took to start displaying the page? I think if
you'll check, you'll see that the page in question was using IMG elements
without size attributes, or possibly tables.

If you check, the only two elements in HTML right now that have to load
more info than just the initial element tag to display are IMGs without
size attributes (need a network connection to fetch the image file, at
least the header with the size) and tables (need the whole table to figure
out the internal dimensions).

IMGs can (and should, IMHO) have HEIGHT and WIDTH attributes attached
already. That really only leaves, in the current spec, TABLE elements as
'non-streamable'. While what you're talking about doing *might* fix that,
the expense is going to be significant enough that it won't be
cost-effective.
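
As an illustration of the two cases above, here is a small Python sketch
that scans a page for them using the standard library's `html.parser`. The
class name, the `find_blockers` helper, and the sample markup are all
hypothetical; this flags candidates only and says nothing about how any
particular browser actually renders.

```python
from html.parser import HTMLParser


class StreamabilityCheck(HTMLParser):
    """Collect elements that may hold up incremental rendering."""

    def __init__(self):
        super().__init__()
        self.blockers = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        names = {name for name, _ in attrs}
        if tag == "img" and not {"width", "height"} <= names:
            self.blockers.append("img without size")
        elif tag == "table":
            self.blockers.append("table")


def find_blockers(html):
    checker = StreamabilityCheck()
    checker.feed(html)
    checker.close()
    return checker.blockers


sample = (
    '<p>Hi</p><IMG SRC="a.gif">'
    '<img src="b.gif" width="10" height="20">'
    "<table><tr><td>x</td></tr></table>"
)
print(find_blockers(sample))
```

On the sample, only the first IMG and the TABLE are reported; the IMG that
carries both WIDTH and HEIGHT is not.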

					Dan

----------------------------------------"it's like this"-------------------
Dan Sugalski   (541) 917-4364           even samurai
Programmer/SysAdmin                     have teddy bears
Linn-Benton Community College           and even the teddy bears
sugalsd@lbcc.cc.or.us                   get drunk
Received on Monday, 8 September 1997 12:48:53 UTC