RE: allow UTF-16 not just UTF-8 (PR#6774) from BIGELOW,JIM (HP-Boise,ex1) on 2003-10-17 (www-html@w3.org from October 2003)

From: BIGELOW,JIM (HP-Boise,ex1) <jim.bigelow@hp.com>
Date: Thu, 16 Oct 2003 21:15:04 -0400
To: Steven Pemberton <steven.pemberton@cwi.nl>, don@lexmark.com
Cc: w3c-html-wg@w3.org, voyager-issues@mn.aptest.com, elliott.bradshaw@zoran.com, www-html@w3.org, mike@easysw.com
Message-ID: <020A3CF87FB5AC47AA67966B33845755067E07A8@xboi22.boise.itc.hp.com>

Don and Steven,

I want to expand on what you have said:
Don wrote:
> > 1) Every XHTML tag will require twice as many bytes when 
> > represented in UTF-16 versus UTF-8
> > 2) Every English XHTML-Print print job will be twice as 
> > big encoded with UTF-16 versus UTF-8
> > 3) Every "Latin 1" print job will be larger approaching 
> > 2X in size.
> >
> > When you double the data's size, buffers have to double to 
> > be able to hold and manipulate an equivalent amount of print 
> > stream content.  

This statement is only true for some print streams. See the discussion below
in "The problem space".

Steven wrote:
> UTF 16 and UTF 8 are *external* representations. The internal 
> amount of storage needed for them is identical, and 
> completely up to you how you store.

If a printer uses 16 bits internally to represent a character, then there
shouldn't be a difference in buffering requirements between utf-8 and utf-16
encoded files (see below for a more complete discussion).  However, if a
printer uses 8 bits per character, then it has restricted itself to handling
only a subset of possible documents: those with ASCII or Latin 1 characters.  This is
a product-specific decision akin to that of whether to make a device print
in color or black & white or support landscape as well as portrait printing.
Therefore, I suggest that the spec say that a printer should support utf-16,
just as it now says it should support CSS, landscape printing, and color --
within the limits of the device.  If a user buys a low-cost device that can
only print ASCII characters in portrait orientation, without color, style
sheets, or images, hopefully the price was in line with the printer's
abilities and other, more expensive, more capable devices are available as
needed.

Jim


The problem space
----------------------
There is a document composition continuum from documents with only text,
through mixed text and images, to documents that contain only images.  At
the text-only end of the continuum, encoding an ASCII document in UTF-16
rather than UTF-8 doubles the document size. At the image-only end of the
continuum, the effect of the encoding on the document size is overshadowed
by the image data.

The table below illustrates three points on the document composition
continuum:
1. Text-only: a document that prints as one page of ASCII text (Times, 10pt,
8in by 11in paper) [1].  Size, in bytes, is 6,282.

2. Text & Image: a one page document with one 3in x 5in image (166.7K bytes)
and the remainder text [2]. Size, in bytes, of document and image is
171,531.

3. Image-only: a one page document with eight 2in x 3.25in images (703.2K
bytes) and no text. [3] Size, in bytes, of document and eight images is
705,108.

Size of the markup and text in bytes, and as a percentage of the whole
document (markup, text, and images):

              utf-8   %doc   utf-16   %doc
Text-only:    6,282   100    12,566   100
Text+Image:   4,776   3.2     9,554   5.4   (9,554 / (9,554 + 166,675) * 100)
Image-only:   1,916   .27     3,834   .54
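
The utf-16 figures above are each exactly 2n + 2 for a utf-8 size of n. That
is consistent with two bytes per ASCII character plus a two-byte byte order
mark, but the byte order mark is an assumption here, not something the
measurements state. A minimal Python sketch that reproduces the figures under
that assumption (the function names are made up for illustration):

```python
# Assumption: the text is pure ASCII and the UTF-16 file starts with a
# 2-byte byte order mark (BOM), so its size is 2n + 2 bytes.

def utf16_size_of_ascii(n_chars: int) -> int:
    """Bytes needed to store n_chars ASCII characters as UTF-16 with a BOM."""
    text = "a" * n_chars               # stand-in ASCII content
    return len(text.encode("utf-16"))  # Python's "utf-16" codec prepends a BOM


def text_share(text_bytes: int, image_bytes: int) -> float:
    """Markup and text as a percentage of the whole document."""
    return 100 * text_bytes / (text_bytes + image_bytes)


# utf-8 sizes taken from the table above.
utf8_sizes = {"Text-only": 6282, "Text+Image": 4776, "Image-only": 1916}
utf16_sizes = {name: utf16_size_of_ascii(n) for name, n in utf8_sizes.items()}

for name, size in utf16_sizes.items():
    print(f"{name}: utf-8 {utf8_sizes[name]} bytes, utf-16 {size} bytes")

# The parenthetical calculation from the Text+Image row.
print(f"Text+Image utf-16 share: {text_share(9554, 166675):.1f}%")
```

The doubling applies only to the markup and text; the image bytes are the same
under either encoding, which is why the share stays small once images are
present.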

There is another point of variability: the characters in the text portions
of the document. This is another continuum from ASCII only at one end to
Japanese, Chinese, Korean, and Hindi at the other.  

"Table 1: UTF types" of [4] gives the following average bytes per code point

                                    utf-8  utf-16
English                             1      2
Latin-1                             1.1    2
Greek, Russian, Arabic, Hebrew      1.7    2
Japanese, Chinese, Korean, Hindi    3      2

As the language/script of the text portion of the document changes from
English-only toward other scripts and languages, the size difference between
utf-8 and utf-16 decreases.  For Japanese, Chinese, Korean, and Hindi it
reverses: utf-8 becomes the larger encoding.
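
The per-script averages can be checked on any text with a few lines of
Python. This is a minimal sketch; the sample strings are hypothetical and
chosen only for illustration, so the results will not match the averages in
[4] exactly.

```python
def bytes_per_code_point(text: str, encoding: str) -> float:
    """Average number of encoded bytes per Unicode code point in text."""
    return len(text.encode(encoding)) / len(text)


# Hypothetical sample strings, one per script.
samples = {
    "English": "print job",
    "Greek": "αβγδε",
    "Japanese": "印刷ジョブ",
}

for script, text in samples.items():
    u8 = bytes_per_code_point(text, "utf-8")
    # "utf-16-le" omits the byte order mark, so only the characters are counted.
    u16 = bytes_per_code_point(text, "utf-16-le")
    print(f"{script:9} utf-8: {u8:.1f}  utf-16: {u16:.1f}")
```

Pure Greek letters come out at 2 bytes each in utf-8. An average such as 1.7
for real text is presumably lower because running text also contains ASCII
spaces, digits, and punctuation, and an XHTML document adds ASCII markup on
top of that.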


End-to-end solution
-------------------
If you look at the end-to-end solution, from the sending application to the
printer, the stages can be thought of as:
1. Sending Device: the data as represented in the sending device (a cell
phone for example)
2. Transmission: the data combined with markup and style information as an
XHTML-Print data stream and then encoded in either UTF-8 or UTF-16
3. Receiving Device: the printer -- breaking this into two parts gives:
3.a The XHTML-Print data stream as received 
3.b The data without markup and style information and before printing. How
the data is stored is implementation dependent, and how much memory is used
depends on how a character is represented (8 or 16 bits) and on how much
of the document is buffered.  Each printer makes these choices;
8 bits/char restricts the documents processed to those with Latin 1 characters.



Stage   Size    utf-8   utf-16
1. app   n       -         -
2. xmit  n       n-3n*    2n   
3a. Pr   n       n-3n     2n
3b. Pr** n       n-2n     n-2n

* n-3n shows the variable sizing depending on the characters being encoded:
English only (n), CJK (3n)
** at Stage 3b, representing a character with 8 bits restricts the characters
that can be represented to ASCII or Latin 1; 16 bits can represent all
characters.

Internal representation
-----------------------

If a printer uses 16 bits internally to represent a character, then there
shouldn't be a difference in buffering requirements between utf-8 and utf-16
encoded files.  However, if a printer uses 8 bits, then it has restricted
itself to handling only a subset of documents.  This is a product-specific
decision akin to that of supporting color or not.  Therefore, I suggest that
the spec say that a printer should support utf-16 just as it now says it
should support CSS, landscape printing, and color -- within the limits of
the device.  If a user buys a low-cost device that can only print ASCII
characters in portrait orientation, without color, images or style,
hopefully the price is in line with the printer's abilities and other, more
expensive, more capable devices are available as needed.
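
The point about internal storage can be illustrated with a short Python
sketch. It assumes a hypothetical printer that decodes the incoming stream
and keeps every character as one 16-bit unit; the function name and the
sample document are made up, and a real device may buffer differently.

```python
def to_internal_16bit(stream: bytes, wire_encoding: str) -> bytes:
    """Decode a received stream and store it as 16-bit units.

    Hypothetical internal format, used here only for illustration.
    """
    text = stream.decode(wire_encoding)  # "utf-16" also consumes a leading BOM
    return text.encode("utf-16-le")      # 2 bytes per character, no BOM


# Hypothetical document fragment mixing ASCII, Latin 1, and CJK characters.
document = "<p>Grüße, 印刷</p>"

as_utf8 = document.encode("utf-8")    # stage 2/3a: bytes on the wire
as_utf16 = document.encode("utf-16")  # stage 2/3a: bytes on the wire, with BOM

buf_from_utf8 = to_internal_16bit(as_utf8, "utf-8")     # stage 3b
buf_from_utf16 = to_internal_16bit(as_utf16, "utf-16")  # stage 3b

print("wire sizes:    ", len(as_utf8), len(as_utf16))
print("internal sizes:", len(buf_from_utf8), len(buf_from_utf16))
print("same buffer:   ", buf_from_utf8 == buf_from_utf16)
```

The wire sizes differ, but once decoded the two buffers are identical, so
under this assumption the choice of transfer encoding affects only the
receive stage (3a), not the stored document (3b).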



[1] http://www.pwg.org/xhtml-print/W3C-Version/georgeb.html
[2] http://www.pwg.org/xhtml-print/W3C-Version/text+image.html
[3] http://www.pwg.org/xhtml-print/W3C-Version/image-only.html

[4] http://www-106.ibm.com/developerworks/library/utfencodingforms/
Received on Thursday, 16 October 2003 21:18:01 UTC