Re: allow UTF-16 not just UTF-8 (PR#6774) from Steven Pemberton on 2003-10-17 (www-html@w3.org from October 2003)

From: Steven Pemberton <steven.pemberton@cwi.nl>
Date: Fri, 17 Oct 2003 14:55:07 +0200
To: "BIGELOW,JIM (HP-Boise,ex1)" <jim.bigelow@hp.com>, <don@lexmark.com>
Cc: <w3c-html-wg@w3.org>, <voyager-issues@mn.aptest.com>, <elliott.bradshaw@zoran.com>, <www-html@w3.org>, <mike@easysw.com>
Message-ID: <18a401c394ad$e54fef60$df13fea9@srx41p>
UTF-8 and UTF-16 are just definitions of how you send a Unicode character
stream in an interoperable way over the wire. The character set is the same,
the characters are the same, it is just the encoding that is different.

It is orthogonal to questions of how characters are stored internally. You
can do what you like internally, it is completely up to you. It has no
effect on the memory requirements of the receiving device, because you have
to convert to your internal form anyway.
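
A minimal Python sketch of that point, assuming a receiver whose internal
form is simply a list of code points. The sample string and the to_internal
helper are hypothetical and not taken from any printer implementation:

```python
def to_internal(wire_bytes: bytes, encoding: str) -> list[int]:
    """Decode a wire stream and return the receiver's internal form.

    The internal form here is a list of Unicode code points. That choice is
    an assumption for illustration; it does not depend on the wire encoding.
    """
    return [ord(ch) for ch in wire_bytes.decode(encoding)]


# Hypothetical sample content: ASCII markup, one Latin-1 letter, three CJK.
text = "<p>Caf\u00e9 \u65e5\u672c\u8a9e</p>"

utf8_wire = text.encode("utf-8")
utf16_wire = text.encode("utf-16")  # Python prepends a byte order mark

internal_from_utf8 = to_internal(utf8_wire, "utf-8")
internal_from_utf16 = to_internal(utf16_wire, "utf-16")

print(len(utf8_wire), len(utf16_wire))            # wire sizes differ
print(internal_from_utf8 == internal_from_utf16)  # internal form is the same
```

Whatever arrives on the wire, the decoded characters are identical, so the
storage cost after decoding is set by the internal form alone.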

Steven

----- Original Message ----- 
From: "BIGELOW,JIM (HP-Boise,ex1)" <jim.bigelow@hp.com>
To: "Steven Pemberton" <steven.pemberton@cwi.nl>; <don@lexmark.com>
Cc: <w3c-html-wg@w3.org>; <voyager-issues@mn.aptest.com>;
<elliott.bradshaw@zoran.com>; <www-html@w3.org>; <mike@easysw.com>
Sent: Friday, October 17, 2003 3:15 AM
Subject: RE: allow UTF-16 not just UTF-8 (PR#6774)


> Don and Steven,
>
> I want to expand on what you have said:
> Don wrote:
> > > 1) Every XHTML tag will require twice as many bytes when
> > > represented in UTF-16 versus UTF-8
> > > 2) Every English XHTML-Print print job will be twice as
> > > big encoded with UTF-16 versus UTF-8
> > > 3) Every "Latin 1" print job will be larger approaching
> > > 2X in size.
> > >
> > > When you double the data's size, buffers have to double to
> > > be able to hold and manipulate an equivalent amount of print
> > > stream content.
>
> This statement is true only for some print streams. See the discussion
> below in "The problem space".
>
> Steven wrote:
> > UTF 16 and UTF 8 are *external* representations. The internal
> > amount of storage needed for them is identical, and
> > completely up to you how you store.
>
> If a printer uses 16 bits internally to represent a character, then there
> shouldn't be a difference in buffering requirements between utf-8 and
> utf-16 encoded files (see below for a more complete discussion).  However,
> if a printer uses 8 bits per character, then it has restricted itself to
> handling only a subset of possible documents, those with ASCII characters.
> This is a product-specific decision akin to that of whether to make a
> device print in color or black & white or support landscape as well as
> portrait printing.  Therefore, I suggest that the spec say that a printer
> should support utf-16, just as it now says it should support CSS, landscape
> printing, and color -- within the limits of the device.  If a user buys a
> low-cost device that can only print ASCII characters in portrait
> orientation, without color, style sheets, or images, hopefully the price
> was in line with the printer's abilities and other, more expensive, more
> capable devices are available as needed.
>
> Jim
>
>
> The problem space
> ----------------------
> There is a document composition continuum from documents with only text,
> through mixed text and images, to documents that contain only images.  At
> the text-only end of the continuum, the effect of UTF-16 versus UTF-8 is a
> doubling of document size.  At the image-only end of the continuum, the
> effect on document size of encoding in UTF-16 versus UTF-8 is overshadowed
> by the image data.
>
> The table below illustrates three points on the document composition
> continuum:
> 1. Text-only: a document that prints as one page of ASCII text (Times,
> 10pt, 8in by 11in paper) [1].  Size, in bytes, is 6,282.
>
> 2. Text & Image: a one-page document with one 3in x 5in image (166.7K
> bytes) and the remainder text [2].  Size, in bytes, of document and image
> is 171,531.
>
> 3. Image-only: a one-page document with eight 2in x 3.25in images (703.2K
> bytes) and no text [3].  Size, in bytes, of document and eight images is
> 705,108.
>
> Size (bytes)  utf-8   %doc   utf-16   %doc
> Text-only:    6,282   100    12,566   100
> Text+Image:   4,776   2.8     9,554   5.4   (9,554 / (9,554 + 166,675) * 100)
> Image-only:   1,916   0.27    3,834   0.54
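
The percentage columns can be re-derived from the byte counts. A small Python
sketch, assuming each percentage is the text-and-markup size divided by that
size plus the image bytes. The text_share helper is hypothetical, and the
image-only byte count is derived here from the totals quoted above rather
than stated in the message:

```python
def text_share(text_bytes: int, image_bytes: int) -> float:
    """Return text-and-markup bytes as a percentage of the whole print job."""
    return 100.0 * text_bytes / (text_bytes + image_bytes)


# Text+Image row: 166,675 image bytes, as in the parenthetical in the table.
print(round(text_share(9_554, 166_675), 1))     # UTF-16 share of the job

# Image-only row: image bytes derived as 705,108 - 1,916 (an assumption).
image_only = 705_108 - 1_916
print(round(text_share(1_916, image_only), 2))  # UTF-8 share of the job
print(round(text_share(3_834, image_only), 2))  # UTF-16 share of the job
```

Even after the text doubles in size, it stays a small fraction of any job
that carries images.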
>
> There is another point of variability: the characters in the text portions
> of the document. This is another continuum from ASCII only at one end to
> Japanese, Chinese, Korean, and Hindi at the other.
>
> "Table 1: UTF types" of [4] gives the following average bytes per code
> point:
>
>                                    utf-8  utf-16
> English                            1      2
> Latin-1                            1.1    2
> Greek, Russian, Arabic, Hebrew     1.7    2
> Japanese, Chinese, Korean, Hindi   3      2
>
> As the language/script of the text portion of the document changes from
> English-only toward other scripts and languages, the size difference
> between utf-8 and utf-16 decreases, and for Japanese, Chinese, Korean, and
> Hindi text utf-8 becomes the larger of the two.
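
The per-script trend can be checked directly. A rough Python sketch, using
hypothetical one-word samples with no spaces or markup, so the Greek figure
comes out at 2 rather than the 1.7 average cited from [4]:

```python
def bytes_per_code_point(sample: str, encoding: str) -> float:
    """Average encoded bytes per code point for one sample string."""
    return len(sample.encode(encoding)) / len(sample)


# Hypothetical single-word samples; real documents mix scripts and markup.
samples = {
    "English": "printer",
    "Greek": "\u03b5\u03ba\u03c4\u03c5\u03c0\u03c9\u03c4\u03ae\u03c2",
    "Japanese": "\u5370\u5237\u6a5f",
}

for name, sample in samples.items():
    utf8 = bytes_per_code_point(sample, "utf-8")
    # "utf-16-le" is used so the 2-byte byte order mark is not counted.
    utf16 = bytes_per_code_point(sample, "utf-16-le")
    print(f"{name:9} utf-8: {utf8:.1f}  utf-16: {utf16:.1f}")
```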
>
>
> End-to-end solution
> -------------------
> If you look at the end-to-end solution, from the sending application to
> the printer, the stages can be thought of as:
> 1. Sending Device: the data as represented in the sending device (a cell
> phone for example)
> 2. Transmission: the data combined with markup and style information as an
> XHTML-Print data stream and then encoded in either UTF-8 or UTF-16
> 3. Receiving Device: the printer -- breaking this into two parts gives:
> 3.a The XHTML-Print data stream as received
> 3.b The data without markup and style information and before printing.  How
> the data is stored is implementation dependent, and how much memory is used
> depends on how a character is represented (8 or 16 bits) and how much of
> the document is buffered.  Each printer makes these choices; 8 bits/char
> restricts the documents processed to Latin-1 characters.
>
>
>
> Stage     Size   utf-8    utf-16
> 1. app    n      -        -
> 2. xmit   n      n-3n*    2n
> 3a. Pr    n      n-3n     2n
> 3b. Pr**  n      n-2n     n-2n
>
> * n-3n shows the variable sizing depending on the characters being encoded:
> English only (n), CJK (3n)
> ** at Stage 3b, representing a character with 8 bits restricts the
> characters that can be represented to ASCII or Latin-1; 16 bits per
> character can represent every character in the Basic Multilingual Plane.
>
> Internal representation
> -----------------------
>
> If a printer uses 16 bits internally to represent a character, then there
> shouldn't be a difference in buffering requirements between utf-8 and
> utf-16 encoded files.  However, if a printer uses 8 bits, then it has
> restricted itself to handling only a subset of documents.  This is a
> product-specific decision akin to that of supporting color or not.
> Therefore, I suggest that the spec say that a printer should support utf-16
> just as it now says it should support CSS, landscape printing, and color --
> within the limits of the device.  If a user buys a low-cost device that can
> only print ASCII characters in portrait orientation, without color, images
> or style, hopefully the price is in line with the printer's abilities and
> other, more expensive, more capable devices are available as needed.
>
>
>
> [1] http://www.pwg.org/xhtml-print/W3C-Version/georgeb.html
> [2] http://www.pwg.org/xhtml-print/W3C-Version/text+image.html
> [3] http://www.pwg.org/xhtml-print/W3C-Version/image-only.html
>
> [4] http://www-106.ibm.com/developerworks/library/utfencodingforms/
>
>
Received on Friday, 17 October 2003 09:01:03 UTC