- From: <elliott.bradshaw@zoran.com>
- Date: Thu, 16 Oct 2003 10:17:38 -0400
- To: Rowland Shaw <Rowland.Shaw@crystaldecisions.com>
- Cc: "'don@lexmark.com'" <don@lexmark.com>, "BIGELOW,JIM (HP-Boise,ex1)" <jim.bigelow@hp.com>, Steven Pemberton <steven.pemberton@cwi.nl>, voyager-issues@mn.aptest.com, w3c-html-wg@w3.org, www-html@w3.org
Don,

I agree with the argument that a front end can convert from UTF-16 to UTF-8 or whatever internal form is used, and have essentially no impact on memory needs. "A couple of dozen bytes" might be a little optimistic for this logic :^) , but it's pretty straightforward:

- look at first 16 bits to detect a UTF-16 mark
- for each double byte emit the UTF-8 (or other) equivalent

Of course a printer could choose to store Asian data differently than Latin, and save some space compared to native UTF-8. This decision is orthogonal to the form of the input. But this logic may not be worth it and is not needed for compliance.

Frugally,
Elliott

--------------------------------------------------------------------------------
Elliott Bradshaw
Director, Software Engineering
Zoran Imaging Division (formerly Oak Technology Imaging Group)
781 638-7534

Rowland Shaw <Rowland.Shaw@crystaldecisions.com>
10/16/2003 09:16 AM

To: "'don@lexmark.com'" <don@lexmark.com>, Steven Pemberton <steven.pemberton@cwi.nl>
cc: "BIGELOW,JIM (HP-Boise,ex1)" <jim.bigelow@hp.com>, w3c-html-wg@w3.org, voyager-issues@mn.aptest.com, elliott.bradshaw@zoran.com, www-html@w3.org
Subject: RE: allow UTF-16 not just UTF-8 (PR#6774)

...and for every Asian language, each character can take up to three bytes (in UTF-8 vs. two in UTF-16)

Taking a complete random Japanese character (Hiragana Letter Small A) U+3041, in UTF-8 as 0xE3 0x81 0x81 -- this assumes that you are willing to deal with characters as a MBCS, and that you aren't going to convert to UCS2 internally.

English has the biggest saving by saving as UTF-8 (so let it), but for most other languages, there is no benefit or worse, a 50% growth in sizes (vs. UTF-16).

If UTF-16 is disallowed, it's no longer an XML application (which may be a road to go down) by definition on the minimum bar set for XML (back in the days of 486's and 8Mb machines). Thinking about it, my printer nowadays at home has more RAM in it than my PC when XML was being created...
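For readers following the thread, here is a minimal sketch of the two-step front end described above (check the first 16 bits for a UTF-16 mark, then emit the UTF-8 equivalent of each double byte). It is an illustration only, not code from any printer: the function names `encode_utf8` and `utf16_to_utf8` are made up for this note, and it assumes well-formed input that starts with a byte-order mark.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (1 to 4 bytes)."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_to_utf8(data: bytes) -> bytes:
    """Hypothetical front end: pass UTF-8 through, transcode UTF-16.

    Assumption: UTF-16 input always begins with a byte-order mark and
    contains no unpaired surrogates. Input without a mark is returned
    unchanged and treated as UTF-8.
    """
    # Step 1: look at the first 16 bits to detect a UTF-16 mark.
    if data[:2] == b"\xfe\xff":
        byteorder = "big"
    elif data[:2] == b"\xff\xfe":
        byteorder = "little"
    else:
        return data

    # Step 2: for each double byte, emit the UTF-8 equivalent.
    out = bytearray()
    high = None  # pending high surrogate, if any
    for i in range(2, len(data) - 1, 2):
        unit = int.from_bytes(data[i:i + 2], byteorder)
        if 0xD800 <= unit <= 0xDBFF:
            high = unit  # first half of a surrogate pair; wait for the second
            continue
        if high is not None and 0xDC00 <= unit <= 0xDFFF:
            cp = 0x10000 + ((high - 0xD800) << 10) + (unit - 0xDC00)
            high = None
        else:
            cp = unit
        out += encode_utf8(cp)
    return bytes(out)


# U+3041 (Hiragana Letter Small A) arriving as big-endian UTF-16 with a mark.
print(utf16_to_utf8(b"\xfe\xff\x30\x41").hex())  # e38181
```

The only state carried between code units is one pending surrogate, which is consistent with the claim that the conversion itself needs essentially no extra buffer; the code size on a real device would of course depend on the target and toolchain.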
-----Original Message-----
From: don@lexmark.com [mailto:don@lexmark.com]
Sent: 16 October 2003 14:00
To: Steven Pemberton
Cc: don@lexmark.com; BIGELOW,JIM (HP-Boise,ex1); w3c-html-wg@w3.org; voyager-issues@mn.aptest.com; elliott.bradshaw@zoran.com; www-html@w3.org
Subject: Re: allow UTF-16 not just UTF-8 (PR#6774)

Steven:

I think your answer proves my point that the XML community did not and does not consider the limitations of low cost, constrained embedded environments when developing XML. You make the assertion that no extra memory is required yet the reality is quite the opposite. Please tell me if I'm wrong, but my understanding of UTF-8 and UTF-16 is that:

1) Every XHTML tag will require twice as many bytes when represented in UTF-16 versus UTF-8
2) Every English XHTML-Print print job will be twice as big encoded with UTF-16 versus UTF-8
3) Every "Latin 1" print job will be larger approaching 2X in size.

When you double the data's size, buffers have to double to be able to hold and manipulate an equivalent amount of print stream content. There are real cost and performance penalties to be paid to deal with UTF-16 encoding especially when dealing with western character sets. When a device is designed to deal with the far east "characters" there are other penalties to be paid in things like the size of the font load that mitigate the UTF-16 versus UTF-8 encoding issue.
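The three size claims above can be checked on small samples. The sketch below is an illustration added for readers, not part of the original exchange: the sample strings and the helper name `encoded_sizes` are made up, and the UTF-16 count excludes the two-byte byte-order mark.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count without a byte-order mark)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


# Hypothetical samples: an XHTML tag, English text, Latin-1 text, one kana.
samples = {
    "tag": "<p>",
    "english": "<p>Hello, printer</p>",
    "latin1": "caf\u00e9",
    "hiragana": "\u3041",
}

for name, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{name}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes")
```

On these samples the ASCII-only strings double in UTF-16 (3 vs. 6 and 21 vs. 42 bytes), the Latin-1 word grows by less than 2X (5 vs. 8) because the accented letter already takes two bytes in UTF-8, and the kana shrinks from 3 bytes to 2.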
*******************************************
Don Wright don@lexmark.com

Chair, IEEE SA Standards Board
Member, IEEE-ISTO Board of Directors
f.wright@ieee.org / f.wright@computer.org

Director, Alliances and Standards
Lexmark International
740 New Circle Rd C14/082-3
Lexington, Ky 40550
859-825-4808 (phone) 603-963-8352 (fax)
*******************************************

"Steven Pemberton" <steven.pemberton@cwi.nl> on 10/15/2003 07:26:24 PM

To: <don@lexmark.com>
cc: "BIGELOW,JIM (HP-Boise,ex1)" <jim.bigelow@hp.com>, <w3c-html-wg@w3.org>, <don@lexmark.com>, <voyager-issues@mn.aptest.com>, <elliott.bradshaw@zoran.com>, <www-html@w3.org>
Subject: Re: allow UTF-16 not just UTF-8 (PR#6774)

But support for UTF-16 adds a few dozen bytes of code, and no extra memory requirements. It is simpler than UTF-8! What's the problem?

Steven

----- Original Message -----
From: <don@lexmark.com>
To: "Steven Pemberton" <Steven.Pemberton@cwi.nl>
Cc: "BIGELOW,JIM (HP-Boise,ex1)" <jim.bigelow@hp.com>; <w3c-html-wg@w3.org>; <don@lexmark.com>; <voyager-issues@mn.aptest.com>; <elliott.bradshaw@zoran.com>; <www-html@w3.org>
Sent: Thursday, October 16, 2003 12:20 AM
Subject: Re: allow UTF-16 not just UTF-8 (PR#6774)

> Steven, et al:
>
> The real problem is that the entire XML architecture was designed assuming high end boxes like the 3 GHz Pentium with 512 megabytes of memory. We have already seen push back in other standards groups that consumer electronic devices and other smaller, lighter devices cannot afford all the luxuries demanded by an obese XML architecture. Unless the XML community accepts subsetting, we can't expect the broadest support for XML to happen at the low end until the price/performance ratios experience another order or two of magnitude improvement. As recently reported in several of the trade magazines focused on IT professionals, the deployment of XML and Web Services is having significant negative impacts on the IT infrastructure especially in the area of bandwidth utilization. This is just another symptom of the same problem.
>
> I know I will lose this argument in the W3C but the realities of the XHTML-Print implementations will blow off UTF-16 as more fat with no benefit and simply not support it, "interoperable" or not.
>
> Sorry I'm not pure but practical.
>
> *******************************************
> Don Wright don@lexmark.com
>
> Chair, IEEE SA Standards Board
> Member, IEEE-ISTO Board of Directors
> f.wright@ieee.org / f.wright@computer.org
>
> Director, Alliances and Standards
> Lexmark International
> 740 New Circle Rd C14/082-3
> Lexington, Ky 40550
> 859-825-4808 (phone) 603-963-8352 (fax)
> *******************************************
>
> "Steven Pemberton" <Steven.Pemberton@cwi.nl> on 10/15/2003 09:18:15 AM
>
> To: "BIGELOW,JIM (HP-Boise,ex1)" <jim.bigelow@hp.com>, <w3c-html-wg@w3.org>, <don@lexmark.com>
> cc: <voyager-issues@mn.aptest.com>, <elliott.bradshaw@zoran.com>, <www-html@w3.org>
> Subject: Re: allow UTF-16 not just UTF-8 (PR#6774)
>
> > From: don@lexmark.com [mailto:don@lexmark.com]
> >
> > So let me understand this....
> >
> > Because people have poorly designed and written XML applications running on 3 GHz Pentium 4s with 512 megabytes of real memory that do not allow the control over whether UTF-8 or UTF-16 are emitted, we are expecting to burden $49 printers with code to be able to detect and interpret both.
>
> No Don. It is about interoperability and conforming to standards. XML allows documents to be encoded in either UTF-8 or UTF-16: consumers must accept both, producers may produce either. An XHTML-Print printer will be just a consumer of an XML byte-stream at some IP address; we don't want to burden every program in the world that can produce XML with a switch that says "this output is going to a poor lowly XHTML Print processor that can't deal with UTF-16, so please produce UTF-8", especially since UTF-16 is the easy one to implement, and can only cost a few dozen bytes at best.
>
> If we changed this, XHTML Print would have to go back to last call, and you can bet your boots that the XML community would rise up against us, as it has in the past, and I can tell you we don't want to go there, and we would have a hundred people registering objections.
>
> Conforming to XML requirements comes with the territory of being XHTML. The XML community will not take lightly to us messing with their standards.
>
> Best wishes,
>
> Steven Pemberton
Received on Thursday, 16 October 2003 10:24:25 UTC