- From: <don@lexmark.com>
- Date: Fri, 17 Oct 2003 17:00:32 -0400
- To: "Steven Pemberton" <steven.pemberton@cwi.nl>
- Cc: "BIGELOW,JIM (HP-Boise,ex1)" <jim.bigelow@hp.com>, <don@lexmark.com>, <w3c-html-wg@w3.org>, <voyager-issues@mn.aptest.com>, <elliott.bradshaw@zoran.com>, <www-html@w3.org>, <mike@easysw.com>
Steven:

Your perception of how this works in an embedded device, especially in a printer that will use this in Bluetooth, UPNP and other environments, is clearly tainted by your experience of this with the Web and PCs.

0) Of course UTF-8 versus UTF-16 is orthogonal to the internal representation of the "printer", but not until the data is in the "printer" and off the "network".

1) As defined to be used by Bluetooth and in other environments, the data is PUSHed to the device rather than being pulled. You have less control over the amount of data being sent.

2) The network buffers are in the same constrained memory space as the processor for XHTML-Print. Chunks from the network have to be buffered by the network process until they can be dealt with by the TCP process, which buffers them until they can be dealt with by the XHTML-Print process. All this is done in that same limited, constrained memory space.

If I'm going to maintain the performance levels customers expect, I need to be able to hold equivalent amounts of CONTENT across multiple buffers, and English content encoded in UTF-16 takes TWICE as many bytes as in UTF-8. It is unreasonable to expect the network or TCP process within the device to convert UTF-16 to the internal format; that happens when it actually hits the "printer." So while UTF-16 might not take any more memory in the "printer", because the content is converted to an internal format, it does take more memory before it reaches the "printer" but while it is in the embedded physical device called a printer.

Do you get it yet? In the PC world, the user agent doesn't have to worry about all the underlying details necessary to have the content delivered from the network. We don't have that luxury in the embedded space. All that work is done by the same processor and with the same limited memory. How else do you think we can sell printers for $29??
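The buffering argument in point 2 can be sketched with a few lines of Python. This is only an illustration under assumptions that are not in the message itself: the 4096-byte buffer size, the sample markup, and the helper name chars_per_buffer are all hypothetical. It estimates how many characters of ASCII-only XHTML fit in one fixed-size receive buffer under each encoding.

```python
# Hypothetical figures: the buffer size and the sample markup are assumptions
# made for illustration, not values taken from any real printer.
BUFFER_BYTES = 4096  # assumed size of one receive buffer on the device


def chars_per_buffer(sample: str, encoding: str,
                     buffer_bytes: int = BUFFER_BYTES) -> int:
    """Estimate how many characters of sample-like content fit in one buffer."""
    encoded = sample.encode(encoding)
    bytes_per_char = len(encoded) / len(sample)
    return int(buffer_bytes // bytes_per_char)


# ASCII-only markup and text, as in an English print job.
sample = '<p class="body">Plain English text in an XHTML-Print job.</p>'

utf8_capacity = chars_per_buffer(sample, "utf-8")
# "utf-16-le" omits the byte order mark, so every ASCII character is 2 bytes.
utf16_capacity = chars_per_buffer(sample, "utf-16-le")

print("characters per buffer, UTF-8: ", utf8_capacity)
print("characters per buffer, UTF-16:", utf16_capacity)
```

With ASCII-only content the UTF-16 buffer holds half as many characters, which is the doubling described above. For text outside ASCII the ratio would differ.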

*******************************************
Don Wright                 don@lexmark.com

Chair, IEEE SA Standards Board
Member, IEEE-ISTO Board of Directors
f.wright@ieee.org / f.wright@computer.org

Director, Alliances and Standards
Lexmark International
740 New Circle Rd C14/082-3
Lexington, Ky 40550
859-825-4808 (phone) 603-963-8352 (fax)
*******************************************

"Steven Pemberton" <steven.pemberton@cwi.nl> on 10/17/2003 08:55:07 AM

To:      "BIGELOW,JIM (HP-Boise,ex1)" <jim.bigelow@hp.com>, <don@lexmark.com>
cc:      <w3c-html-wg@w3.org>, <voyager-issues@mn.aptest.com>, <elliott.bradshaw@zoran.com>, <www-html@w3.org>, <mike@easysw.com>
Subject: Re: allow UTF-16 not just UTF-8 (PR#6774)

UTF 8 and UTF 16 are just definitions of how you send a Unicode character stream in an interoperable way over the wire. The character set is the same, the characters are the same, it is just the encoding that is different. It is orthogonal to questions of how characters are stored internally. You can do what you like internally, it is completely up to you.

It has no effect on the memory requirements of the receiving device, because you have to convert to your internal form anyway.

Steven

----- Original Message -----
From: "BIGELOW,JIM (HP-Boise,ex1)" <jim.bigelow@hp.com>
To: "Steven Pemberton" <steven.pemberton@cwi.nl>; <don@lexmark.com>
Cc: <w3c-html-wg@w3.org>; <voyager-issues@mn.aptest.com>; <elliott.bradshaw@zoran.com>; <www-html@w3.org>; <mike@easysw.com>
Sent: Friday, October 17, 2003 3:15 AM
Subject: RE: allow UTF-16 not just UTF-8 (PR#6774)

> Don and Steven,
>
> I want to expand on what you have said:
> Don wrote:
> > > 1) Every XHTML tag will require twice as many bytes when
> > > represented in UTF-16 versus UTF-8
> > > 2) Every English XHTML-Print print job will be twice as
> > > big encoded with UTF-16 versus UTF-8
> > > 3) Every "Latin 1" print job will be larger approaching
> > > 2X in size.
> > >
> > > When you double the data's size, buffers have to double to
> > > be able to hold and manipulate an equivalent amount of print
> > > stream content.
>
> This statement is only true for some print streams. See the discussion below
> in "The problem space".
>
> Steven wrote:
> > UTF 16 and UTF 8 are *external* representations. The internal
> > amount of storage needed for them is identical, and
> > completely up to you how you store.
>
> If a printer uses 16 bits internally to represent a character, then there
> shouldn't be a difference in buffering requirements between utf-8 and utf-16
> encoded files (see below for a more complete discussion). However, if a
> printer uses 8 bits per character, then it has restricted itself to only
> handle a subset of possible documents, those with ASCII characters. This is
> a product-specific decision akin to that of whether to make a device print
> in color or black & white or support landscape as well as portrait printing.
> Therefore, I suggest that the spec say that a printer should support utf-16,
> just as it now says it should support CSS, landscape printing, and color --
> within the limits of the device. If a user buys a low-cost device that can
> only print ASCII characters in portrait orientation, without color, style
> sheets, or images, hopefully the price was in line with the printer's
> abilities and other, more expensive, more capable devices are available as
> needed.
>
> Jim
>
>
> The problem space
> ----------------------
> There is a document composition continuum from documents with only text,
> through mixed text and images, to documents that contain only images. At
> the text-only end of the continuum, the effect on the document size of
> UTF-16 vs. UTF-8 is a doubling of document size. At the image-only end of
> the continuum, the effects on the document size of encoding in UTF-16 versus
> UTF-8 are overshadowed by the image data.
>
> The table below illustrates three points on the document composition
> continuum:
> 1. Text-only: a document that prints as one page of ASCII text (times, 10pt,
> 8in by 11in paper) [1]. Size, in bytes, is 6,282.
>
> 2. Text & Image: a one page document with one 3in x 5in image (166.7K bytes)
> and the remainder text [2]. Size, in bytes, of document and image is
> 171,531.
>
> 3. Image-only: a one page document with eight 2in x 3.25in images (703.2K
> bytes) and no text. [3] Size, in bytes, of document and eight images is
> 705,108.
>
> Size (bytes):  utf-8:  %doc :  utf-16:  %doc
> Text-only:     6,282:  100  :  12,566:  100
> Text+Image:    4,776:  3.2  :   9,554:  5.4  (9,554 /(9,554+166,675)* 100)
> Image-only:    1,916:  .27  :   3,834:  .54
>
> There is another point of variability: the characters in the text portions
> of the document. This is another continuum from ASCII only at one end to
> Japanese, Chinese, Korean, and Hindi at the other.
>
> "Table 1: UTF types" of [4] gives the following average bytes per code point
>
>              utf-8   utf-16
> English        1       2
> Latin-1        1.1     2
> Greek,
> Russian,
> Arabic,
> Hebrew         1.7     2
> Japanese,
> Chinese,
> Korean,
> Hindi          3       2
>
> As the language/script of the text portion of the document changes from
> English-only toward other scripts and languages, the size difference between
> utf-8 and utf-16 decreases.
>
>
> End-to-end solution
> -------------------
> If you look at the end-to-end solution, from the sending application to the
> printer, the stages can be thought of as:
> 1. Sending Device: the data as represented in the sending device (a cell
> phone for example)
> 2. Transmission: the data combined with markup and style information as an
> XHTML-Print data stream and then encoded in either UTF-8 or UTF-16
> 3. Receiving Device: the printer -- breaking this into two parts gives:
> 3.a The XHTML-Print data stream as received
> 3.b The data without markup and style information and before printing.
> How the data is stored is implementation dependent, and how much memory is
> used depends on how a character is represented -- 8 or 16 bits -- and how
> much of the document is buffered. Each printer makes these choices;
> 8 bits/char restricts the documents processed to Latin-1 characters.
>
>
>
> Stage      Size   utf-8    utf-16
> 1. app     n      -        -
> 2. xmit    n      n-3n*    2n
> 3a. Pr     n      n-3n     2n
> 3b. Pr**   n      n-2n     n-2n
>
> * n-3n shows the variable sizing depending on the characters being encoded:
> English only (n), CJK (3n)
> ** at Stage 3b, representing a character with 8 bits restricts the characters
> that can be represented to ASCII or Latin-1; 16 bits can represent all
> characters.
>
> Internal representation
> -----------------------
>
> If a printer uses 16 bits internally to represent a character, then there
> shouldn't be a difference in buffering requirements between utf-8 and utf-16
> encoded files. However, if a printer uses 8 bits, then it has restricted
> itself to only handle a subset of documents. This is a product-specific
> decision akin to that of supporting color or not. Therefore, I suggest that
> the spec say that a printer should support utf-16 just as it now says it
> should support CSS, landscape printing, and color -- within the limits of
> the device. If a user buys a low-cost device that can only print ASCII
> characters in portrait orientation, without color, images or style,
> hopefully the price is in line with the printer's abilities and other, more
> expensive, more capable devices are available as needed.
>
>
>
> [1] http://www.pwg.org/xhtml-print/W3C-Version/georgeb.html
> [2] http://www.pwg.org/xhtml-print/W3C-Version/text+image.html
> [3] http://www.pwg.org/xhtml-print/W3C-Version/image-only.html
>
> [4] http://www-106.ibm.com/developerworks/library/utfencodingforms/
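
The bytes-per-code-point figures quoted from [4] can be spot-checked with a short Python sketch. The sample strings and the helper name bytes_per_code_point are assumptions chosen for illustration; they are not taken from the thread or from [4]. Short samples made of a single script will not reproduce the published averages exactly, since running text in most scripts also contains ASCII spaces and punctuation, which pulls the UTF-8 average down.

```python
def bytes_per_code_point(text: str, encoding: str) -> float:
    """Average encoded bytes per Unicode code point (len(str) counts code points)."""
    return len(text.encode(encoding)) / len(text)


# Hypothetical samples, written as escapes so the source file stays ASCII.
samples = {
    "English": "print job",
    "Greek": "\u03b5\u03ba\u03c4\u03cd\u03c0\u03c9\u03c3\u03b7",  # eight Greek letters
    "CJK": "\u5370\u5237",  # two CJK ideographs from the Basic Multilingual Plane
}

for name, text in samples.items():
    utf8 = bytes_per_code_point(text, "utf-8")
    # "utf-16-le" is used so that no byte order mark is counted.
    utf16 = bytes_per_code_point(text, "utf-16-le")
    print(f"{name:8s} utf-8: {utf8:.1f}  utf-16: {utf16:.1f}")
```

For these samples UTF-16 stays at 2 bytes per code point, while UTF-8 ranges from 1 byte (ASCII) to 3 bytes (CJK), which matches the direction of the table above.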
Received on Friday, 17 October 2003 17:04:42 UTC