- From: BIGELOW,JIM (HP-Boise,ex1) <jim.bigelow@hp.com>
- Date: Thu, 16 Oct 2003 21:15:04 -0400
- To: Steven Pemberton <steven.pemberton@cwi.nl>, don@lexmark.com
- Cc: w3c-html-wg@w3.org, voyager-issues@mn.aptest.com, elliott.bradshaw@zoran.com, www-html@w3.org, mike@easysw.com
Don and Steven,

I want to expand on what you have said:

Don wrote:
> > 1) Every XHTML tag will require twice as many bytes when
> >    represented in UTF-16 versus UTF-8
> > 2) Every English XHTML-Print print job will be twice as
> >    big encoded with UTF-16 versus UTF-8
> > 3) Every "Latin 1" print job will be larger approaching
> >    2X in size.
> >
> > When you double the data's size, buffers have to double to
> > be able to hold and manipulate an equivalent amount of print
> > stream content.

This statement is only true for some print streams. See the discussion
below in "The problem space".

Steven wrote:
> UTF 16 and UTF 8 are *external* representations. The internal
> amount of storage needed for them is identical, and
> completely up to you how you store.

If a printer uses 16 bits internally to represent a character, then
there shouldn't be a difference in buffering requirements between utf-8
and utf-16 encoded files (see below for a more complete discussion).
However, if a printer uses 8 bits per character, then it has restricted
itself to handling only a subset of possible documents: those with ASCII
(or at most Latin-1) characters. This is a product-specific decision
akin to that of whether to make a device print in color or black &
white, or support landscape as well as portrait printing.

Therefore, I suggest that the spec say that a printer should support
utf-16, just as it now says it should support CSS, landscape printing,
and color -- within the limits of the device. If a user buys a low-cost
device that can only print ASCII characters in portrait orientation,
without color, style sheets, or images, hopefully the price was in line
with the printer's abilities and other, more expensive, more capable
devices are available as needed.

Jim

The problem space
-----------------

There is a document composition continuum from documents with only text,
through mixed text and images, to documents that contain only images.
At the text-only end of the continuum, the effect on the document size
of UTF-16 vs. UTF-8 is a doubling of document size. At the image-only
end of the continuum, the effect on the document size of encoding in
UTF-16 versus UTF-8 is overshadowed by the image data.

The table below illustrates three points on the document composition
continuum:

1. Text-only: a document that prints as one page of ASCII text (Times,
   10pt, 8in by 11in paper) [1]. Size, in bytes, is 6,282.
2. Text & Image: a one-page document with one 3in x 5in image (166.7K
   bytes) and the remainder text [2]. Size, in bytes, of document and
   image is 171,531.
3. Image-only: a one-page document with eight 2in x 3.25in images
   (703.2K bytes) and no text [3]. Size, in bytes, of document and
   eight images is 705,108.

Size (bytes):  utf-8   %doc   utf-16   %doc
Text-only:     6,282   100    12,566   100
Text+Image:    4,776   3.2     9,554   5.4  (9,554 / (9,554 + 166,675) * 100)
Image-only:    1,916   .27     3,834   .54

There is another point of variability: the characters in the text
portions of the document. This is another continuum, from ASCII only at
one end to Japanese, Chinese, Korean, and Hindi at the other. "Table 1:
UTF types" of [4] gives the following average bytes per code point:

                                   utf-8   utf-16
English                            1       2
Latin-1                            1.1     2
Greek, Russian, Arabic, Hebrew     1.7     2
Japanese, Chinese, Korean, Hindi   3       2

As the language/script of the text portion of the document changes from
English-only toward other scripts and languages, the size difference
between utf-8 and utf-16 decreases, and for Japanese, Chinese, Korean,
and Hindi text utf-8 becomes the larger encoding.

End-to-end solution
-------------------

If you look at the end-to-end solution, from the sending application to
the printer, the stages can be thought of as:

1. Sending Device: the data as represented in the sending device (a
   cell phone, for example)
2. Transmission: the data combined with markup and style information as
   an XHTML-Print data stream and then encoded in either UTF-8 or UTF-16
3. Receiving Device: the printer -- breaking this into two parts gives:
   3.a The XHTML-Print data stream as received
   3.b The data without markup and style information and before
       printing. How the data is stored is implementation dependent,
       and how much memory is used depends on how a character is
       represented (8 or 16 bits) and how much of the document is
       buffered. Each printer makes these choices; 8 bits/char
       restricts the documents processed to Latin-1 characters.

Stage      Size   utf-8    utf-16
1. app     n      -        -
2. xmit    n      n-3n*    2n
3a. Pr     n      n-3n     2n
3b. Pr**   n      n-2n     n-2n

*  n-3n shows the variable sizing depending on the characters being
   encoded: English only (n), CJK (3n)
** at Stage 3b, representing a character with 8 bits restricts the
   characters that can be represented to ASCII or Latin-1; 16 bits can
   represent all characters.

Internal representation
-----------------------

If a printer uses 16 bits internally to represent a character, then
there shouldn't be a difference in buffering requirements between utf-8
and utf-16 encoded files. However, if a printer uses 8 bits, then it has
restricted itself to handling only a subset of documents. This is a
product-specific decision akin to that of supporting color or not.

Therefore, I suggest that the spec say that a printer should support
utf-16, just as it now says it should support CSS, landscape printing,
and color -- within the limits of the device. If a user buys a low-cost
device that can only print ASCII characters in portrait orientation,
without color, images, or style, hopefully the price is in line with the
printer's abilities and other, more expensive, more capable devices are
available as needed.

[1] http://www.pwg.org/xhtml-print/W3C-Version/georgeb.html
[2] http://www.pwg.org/xhtml-print/W3C-Version/text+image.html
[3] http://www.pwg.org/xhtml-print/W3C-Version/image-only.html
[4] http://www-106.ibm.com/developerworks/library/utfencodingforms/
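
As a rough check on the bytes-per-code-point averages quoted from [4],
the ratios can be measured directly. The following is only a minimal
Python sketch: the sample strings and the names SAMPLES,
bytes_per_code_point, and averages are illustrative assumptions, not
taken from [4] or from the XHTML-Print spec, and the mixed-script
figures will vary with the text chosen.

```python
# Hypothetical sample strings; real averages depend on the actual text.
SAMPLES = {
    "English": "The quick brown fox jumps over the lazy dog",
    "Latin-1": "Le caf\u00e9 est tr\u00e8s agr\u00e9able",
    "Greek": "\u0393\u03c1\u03ae\u03b3\u03bf\u03c1\u03b7 \u03b1\u03bb\u03b5\u03c0\u03bf\u03cd",
    "Japanese": "\u7d20\u65e9\u3044\u8336\u8272\u306e\u72d0",
}


def bytes_per_code_point(text, encoding):
    """Average encoded bytes per Unicode code point of `text`."""
    # len() on a Python str counts code points, not bytes.
    return len(text.encode(encoding)) / len(text)


def averages(samples=SAMPLES):
    """Map each sample name to (utf-8 average, utf-16 average)."""
    # "utf-16-be" is used so that no byte-order mark is counted.
    return {
        name: (
            bytes_per_code_point(text, "utf-8"),
            bytes_per_code_point(text, "utf-16-be"),
        )
        for name, text in samples.items()
    }


if __name__ == "__main__":
    for name, (u8, u16) in averages().items():
        print(f"{name:10s} utf-8: {u8:.2f}  utf-16: {u16:.2f}")
```

With these samples the English row comes out at 1 and 2 bytes per code
point and the Japanese row at 3 and 2, matching the ends of the table;
the Latin-1 and Greek rows fall between 1 and 2 for utf-8, by an amount
that depends on how many non-ASCII characters the sample contains.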
Received on Thursday, 16 October 2003 21:18:01 UTC