
Re: allow UTF-16 not just UTF-8 (PR#6774)

From: <don@lexmark.com>
Date: Fri, 17 Oct 2003 17:00:32 -0400
To: "Steven Pemberton" <steven.pemberton@cwi.nl>
Cc: "BIGELOW,JIM (HP-Boise,ex1)" <jim.bigelow@hp.com>, <don@lexmark.com>, <w3c-html-wg@w3.org>, <voyager-issues@mn.aptest.com>, <elliott.bradshaw@zoran.com>, <www-html@w3.org>, <mike@easysw.com>
Message-ID: <OF3B58CBE8.7037824D-ON85256DC2.0072103D@lexmark.com>


Steven:

Your perception of how this works in an embedded device, especially in a
printer that will use this in Bluetooth, UPnP and other environments, is
clearly tainted by your experience of this with the Web and PCs.

0) Of course UTF-8 versus UTF-16 is orthogonal to the internal
representation of the "printer," but only once the data is in the "printer"
and off the "network."

1)  As defined to be used by Bluetooth and in other environments, the data
is PUSHed to the device rather than being pulled.  You have less control
over the amount of data being sent.

2) The network buffers are in the same constrained memory space as the
processor for XHTML-Print.  Chunks from the network have to be buffered by
the network process until they can be dealt with by the TCP process, which
buffers them until they can be dealt with by the XHTML-Print process.  All
this is done in that same limited, constrained memory space.  If I'm going
to maintain the performance levels customers expect, I need to be able to
hold equivalent amounts of CONTENT across multiple buffers, and English
content encoded in UTF-16 takes TWICE as many bytes as in UTF-8.  It is
unreasonable to expect the network or TCP process within the device to
convert UTF-16 to the internal format; that happens when it actually hits
the "printer."  So while it might not take any more memory in the "printer"
because the content is converted to an internal format, before it reaches
the "printer" but while it is in the embedded physical device called a
printer, it does.
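
To put a rough number on point 2, here is a minimal Python sketch of how
many fixed-size receive buffers the same content occupies on the wire.  The
512-byte buffer size, the sample markup and the name buffers_needed are
illustrative assumptions, not figures from any real printer or from the
XHTML-Print spec.

```python
import math

def buffers_needed(text: str, encoding: str, buffer_size: int = 512) -> int:
    """Count the fixed-size buffers needed to hold `text` in a wire encoding.

    buffer_size is a hypothetical figure chosen only for illustration.
    """
    return math.ceil(len(text.encode(encoding)) / buffer_size)

# Hypothetical English XHTML-Print fragment, repeated to simulate a print job.
job = '<p class="body">The quick brown fox jumps over the lazy dog.</p>\n' * 100

utf8_buffers = buffers_needed(job, "utf-8")
# "utf-16-le" encodes without a byte order mark, so only payload bytes count.
utf16_buffers = buffers_needed(job, "utf-16-le")

print("UTF-8 buffers: ", utf8_buffers)
print("UTF-16 buffers:", utf16_buffers)
```

For pure ASCII markup and text every character is one byte in UTF-8 and two
in UTF-16, so the buffer count before any conversion roughly doubles.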

Do you get it yet?  In the PC world, the user agent doesn't have to worry
about all the underlying details necessary to have the content delivered
from the network.  We don't have that luxury in the embedded space.  All
that work is done by the same processor and with the same limited memory.
How else do you think we can sell printers for $29??

*******************************************
Don Wright                 don@lexmark.com

Chair,  IEEE SA Standards Board
Member, IEEE-ISTO Board of Directors
f.wright@ieee.org / f.wright@computer.org

Director, Alliances and Standards
Lexmark International
740 New Circle Rd C14/082-3
Lexington, Ky 40550
859-825-4808 (phone) 603-963-8352 (fax)
*******************************************




"Steven Pemberton" <steven.pemberton@cwi.nl> on 10/17/2003 08:55:07 AM

To:    "BIGELOW,JIM (HP-Boise,ex1)" <jim.bigelow@hp.com>, <don@lexmark.com>
cc:    <w3c-html-wg@w3.org>, <voyager-issues@mn.aptest.com>,
       <elliott.bradshaw@zoran.com>, <www-html@w3.org>, <mike@easysw.com>
Subject:    Re: allow UTF-16 not just UTF-8 (PR#6774)


UTF-8 and UTF-16 are just definitions of how you send a Unicode character
stream in an interoperable way over the wire. The character set is the
same, the characters are the same, it is just the encoding that is
different.

It is orthogonal to questions of how characters are stored internally. You
can do what you like internally, it is completely up to you. It has no
effect on the memory requirements of the receiving device, because you have
to convert to your internal form anyway.

Steven

----- Original Message -----
From: "BIGELOW,JIM (HP-Boise,ex1)" <jim.bigelow@hp.com>
To: "Steven Pemberton" <steven.pemberton@cwi.nl>; <don@lexmark.com>
Cc: <w3c-html-wg@w3.org>; <voyager-issues@mn.aptest.com>;
<elliott.bradshaw@zoran.com>; <www-html@w3.org>; <mike@easysw.com>
Sent: Friday, October 17, 2003 3:15 AM
Subject: RE: allow UTF-16 not just UTF-8 (PR#6774)


> Don and Steven,
>
> I want to expand on what you have said:
> Don wrote:
> > > 1) Every XHTML tag will require twice as many bytes when
> > > represented in UTF-16 versus UTF-8
> > > 2) Every English XHTML-Print print job will be twice as
> > > big encoded with UTF-16 versus UTF-8
> > > 3) Every "Latin 1" print job will be larger approaching
> > > 2X in size.
> > >
> > > When you double the data's size, buffers have to double to
> > > be able to hold and manipulate an equivalent amount of print
> > > stream content.
>
> This statement is only true for some print streams. See the discussion
> below in "The problem space".
>
> Steven wrote:
> > UTF 16 and UTF 8 are *external* representations. The internal
> > amount of storage needed for them is identical, and
> > completely up to you how you store.
>
> If a printer uses 16 bits internally to represent a character, then there
> shouldn't be a difference in buffering requirements between utf-8 and
> utf-16 encoded files (see below for a more complete discussion).  However,
> if a printer uses 8 bits per character, then it has restricted itself to
> only handle a subset of possible documents, those with ASCII characters.
> This is a product-specific decision akin to that of whether to make a
> device print in color or black & white or support landscape as well as
> portrait printing.  Therefore, I suggest that the spec say that a printer
> should support utf-16, just as it now says it should support CSS,
> landscape printing, and color -- within the limits of the device.  If a
> user buys a low-cost device that can only print ASCII characters in
> portrait orientation, without color, style sheets, or images, hopefully
> the price was in line with the printer's abilities and other, more
> expensive, more capable devices are available as needed.
>
> Jim
>
>
> The problem space
> ----------------------
> There is a document composition continuum from documents with only text,
> through mixed text and images, to documents that contain only images.  At
> the text-only end of the continuum, the effect on the document size of
> UTF-16 vs. UTF-8 is a doubling of document size. At the image-only end of
> the continuum, the effect on the document size of encoding in UTF-16
> versus UTF-8 is overshadowed by the image data.
>
> The table below illustrates three points on the document composition
> continuum:
> 1. Text-only: a document that prints as one page of ASCII text (Times,
> 10pt, 8in by 11in paper) [1].  Size, in bytes, is 6,282.
>
> 2. Text & Image: a one page document with one 3in x 5in image (166.7K
> bytes) and the remainder text [2]. Size, in bytes, of document and image
> is 171,531.
>
> 3. Image-only: a one page document with eight 2in x 3.25in images (703.2K
> bytes) and no text [3]. Size, in bytes, of document and eight images is
> 705,108.
>
> Size (bytes): utf-8: %doc : utf-16: %doc
> Text-only:    6,282: 100  : 12,566: 100
> Text+Image:   4,776: 3.2  :  9,554: 5.4  (9,554 / (9,554 + 166,675) * 100)
> Image-only:   1,916: .27  :  3,834: .54
>
> There is another point of variability: the characters in the text
> portions of the document. This is another continuum from ASCII only at
> one end to Japanese, Chinese, Korean, and Hindi at the other.
>
> "Table 1: UTF types" of [4] gives the following average bytes per code
> point:
>
>          utf-8  utf-16
> English  1      2
> Latin-1  1.1    2
> Greek,
> Russian,
> Arabic,
> Hebrew   1.7    2
> Japanese,
> Chinese
> Korean
> Hindi    3      2
>
> As the language/script of the text portion of the document changes from
> English-only toward other scripts and languages, the size difference
> between utf-8 and utf-16 decreases.
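
The averages in the table above can be approximated with a few lines of
Python.  This is only a sketch: the sample strings and the name
avg_bytes_per_code_point are made up for illustration, and real documents
mix in ASCII spaces, digits and markup, which is why measured averages for
non-Latin scripts tend to fall below the per-letter byte count.

```python
def avg_bytes_per_code_point(text: str, encoding: str) -> float:
    """Average encoded bytes per Unicode code point in `text`."""
    # len() on a Python str counts code points, not bytes.
    return len(text.encode(encoding)) / len(text)

# Hypothetical samples: one ASCII, one Greek with an ASCII space, one CJK.
samples = {
    "English": "print job",
    "Greek": "αβγ δεζ",
    "Chinese": "印刷",
}

for name, text in samples.items():
    utf8 = avg_bytes_per_code_point(text, "utf-8")
    # "utf-16-le" omits the byte order mark so it does not skew the average.
    utf16 = avg_bytes_per_code_point(text, "utf-16-le")
    print(f"{name}: utf-8 {utf8:.2f}, utf-16 {utf16:.2f}")
```
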
>
>
> End-to-end solution
> -------------------
> If you look at the end-to-end solution, from the sending application to
> the printer, the stages can be thought of as:
> 1. Sending Device: the data as represented in the sending device (a cell
> phone for example)
> 2. Transmission: the data combined with markup and style information as
> an XHTML-Print data stream and then encoded in either UTF-8 or UTF-16
> 3. Receiving Device: the printer -- breaking this into two parts gives:
> 3.a The XHTML-Print data stream as received
> 3.b The data without markup and style information and before printing.
> How the data is stored is implementation dependent, and how much memory
> is used depends on how a character is represented (8 or 16 bits) and how
> much of the document is buffered.  Each printer makes these choices;
> 8 bits/char restricts the documents processed to Latin-1 characters.
>
>
>
> Stage   Size    utf-8   utf-16
> 1. app   n       -         -
> 2. xmit  n       n-3n*    2n
> 3a. Pr   n       n-3n     2n
> 3b. Pr** n       n-2n     n-2n
>
> * n-3n shows the variable sizing depending on the characters being
> encoded: English only (n), CJK (3n)
> ** at Stage 3b, representing a character with 8 bits restricts the
> characters that can be represented to ASCII or Latin-1; 16 bits can
> represent all characters.
>
> Internal representation
> -----------------------
>
> If a printer uses 16 bits internally to represent a character, then there
> shouldn't be a difference in buffering requirements between utf-8 and
> utf-16 encoded files.  However, if a printer uses 8 bits, then it has
> restricted itself to only handle a subset of documents.  This is a
> product-specific decision akin to that of supporting color or not.
> Therefore, I suggest that the spec say that a printer should support
> utf-16 just as it now says it should support CSS, landscape printing, and
> color -- within the limits of the device.  If a user buys a low-cost
> device that can only print ASCII characters in portrait orientation,
> without color, images or style, hopefully the price is in line with the
> printer's abilities and other, more expensive, more capable devices are
> available as needed.
>
>
>
> [1] http://www.pwg.org/xhtml-print/W3C-Version/georgeb.html
> [2] http://www.pwg.org/xhtml-print/W3C-Version/text+image.html
> [3] http://www.pwg.org/xhtml-print/W3C-Version/image-only.html
>
> [4] http://www-106.ibm.com/developerworks/library/utfencodingforms/
>
>
Received on Friday, 17 October 2003 17:04:42 GMT
