RE: allow UTF-16 not just UTF-8 (PR#6774) from elliott.bradshaw@zoran.com on 2003-10-16 (www-html@w3.org from October 2003)

From: <elliott.bradshaw@zoran.com>
Date: Thu, 16 Oct 2003 10:17:38 -0400
To: Rowland Shaw <Rowland.Shaw@crystaldecisions.com>
Cc: "'don@lexmark.com'" <don@lexmark.com>, "BIGELOW,JIM (HP-Boise,ex1)" <jim.bigelow@hp.com>, Steven Pemberton <steven.pemberton@cwi.nl>, voyager-issues@mn.aptest.com, w3c-html-wg@w3.org, www-html@w3.org
Message-ID: <OFF42D3C0F.D00CECB0-ON85256DC1.004DC63D-85256DC1.004EC457@ne.oaktech.com>
Don,

I agree with the argument that a front end can convert from UTF-16 to UTF-8
or whatever internal form is used, and have essentially no impact on memory
needs.

"A couple of dozen bytes" might be a little optimistic for this logic :^),
but it's pretty straightforward:
  - look at the first 16 bits to detect a UTF-16 byte order mark
  - for each 16-bit unit, emit the UTF-8 (or other) equivalent
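
The two steps above can be sketched roughly as follows. This is only an illustration in Python, not printer firmware: the function names utf16_to_utf8 and encode_utf8 are made up for this example, and the surrogate-pair handling is an addition the steps above do not mention (it is needed for characters beyond U+FFFF). Unpaired surrogates are not rejected here, which a real converter would have to do.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode code point as 1 to 4 UTF-8 bytes."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_to_utf8(data: bytes) -> bytes:
    """Convert BOM-prefixed UTF-16 input to UTF-8, one 16-bit unit at a time."""
    # Step 1: look at the first 16 bits to detect the byte order mark.
    if data[:2] == b"\xff\xfe":
        order = "little"
    elif data[:2] == b"\xfe\xff":
        order = "big"
    else:
        raise ValueError("no UTF-16 byte order mark found")

    out = bytearray()
    pending_high = None  # high surrogate waiting for its low half
    # Step 2: for each 16-bit unit, emit the UTF-8 equivalent.
    for i in range(2, len(data) - 1, 2):
        unit = int.from_bytes(data[i:i + 2], order)
        if 0xD800 <= unit <= 0xDBFF:
            pending_high = unit
            continue
        if 0xDC00 <= unit <= 0xDFFF and pending_high is not None:
            # Combine the surrogate pair into one code point above U+FFFF.
            cp = 0x10000 + ((pending_high - 0xD800) << 10) + (unit - 0xDC00)
            pending_high = None
        else:
            cp = unit  # unpaired surrogates are NOT validated in this sketch
        out += encode_utf8(cp)
    return bytes(out)
```

The only state carried between units is one pending surrogate, so the conversion needs no buffer beyond what the parser already holds.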

Of course a printer could choose to store Asian data differently than
Latin, and save some space compared to native UTF-8.  This decision is
orthogonal to the form of the input.  But this logic may not be worth it
and is not needed for compliance.

  Frugally,
  Elliott


--------------------------------------------------------------------------------

Elliott Bradshaw
Director, Software Engineering
Zoran Imaging Division (formerly Oak Technology Imaging Group)
781 638-7534



                                                                                                          
Rowland Shaw <Rowland.Shaw@crystaldecisions.com>
10/16/2003 09:16 AM

To:      "'don@lexmark.com'" <don@lexmark.com>, Steven Pemberton <steven.pemberton@cwi.nl>
cc:      "BIGELOW,JIM (HP-Boise,ex1)" <jim.bigelow@hp.com>, w3c-html-wg@w3.org,
         voyager-issues@mn.aptest.com, elliott.bradshaw@zoran.com, www-html@w3.org
Subject: RE: allow UTF-16 not just UTF-8 (PR#6774)




...and for every Asian language, each character can take up to three bytes
(in UTF-8 vs. two in UTF-16)

Taking a completely random Japanese character (Hiragana Letter Small A),
U+3041 is encoded in UTF-8 as 0xE3 0x81 0x81 -- this assumes that you are
willing to deal with characters as an MBCS, and that you aren't going to
convert to UCS-2 internally.
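
As a quick check of the byte values quoted above, the following Python snippet encodes U+3041 with the standard codecs. It is only an illustration; the variable names are arbitrary.

```python
ch = "\u3041"  # HIRAGANA LETTER SMALL A

utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")  # big-endian, no byte order mark

print(" ".join(f"{b:02x}" for b in utf8_bytes))   # e3 81 81
print(" ".join(f"{b:02x}" for b in utf16_bytes))  # 30 41
```

Three bytes in UTF-8 against two in UTF-16 is where the 50% growth figure for such text comes from.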

English has the biggest saving by saving as UTF-8 (so let it), but for most
other languages there is no benefit or, worse, a 50% growth in size (vs.
UTF-16).

If UTF-16 is disallowed, it's by definition no longer an XML application
(which may be a road to go down), given the minimum bar set for XML (back in
the days of 486s and 8 MB machines). Thinking about it, my printer nowadays
at home has more RAM in it than my PC had when XML was being created...


-----Original Message-----
From: don@lexmark.com [mailto:don@lexmark.com]
Sent: 16 October 2003 14:00
To: Steven Pemberton
Cc: don@lexmark.com; BIGELOW,JIM (HP-Boise,ex1); w3c-html-wg@w3.org;
voyager-issues@mn.aptest.com; elliott.bradshaw@zoran.com; www-html@w3.org
Subject: Re: allow UTF-16 not just UTF-8 (PR#6774)



Steven:

I think your answer proves my point that the XML community did not and
does not consider the limitations of low cost, constrained embedded
environments when developing XML.

You make the assertion that no extra memory is required yet the reality is
quite the opposite.

Please tell me if I'm wrong, but my understanding of UTF-8 and UTF-16 is
that:

1) Every XHTML tag will require twice as many bytes when represented in
UTF-16 versus UTF-8
2) Every English XHTML-Print print job will be twice as big encoded with
UTF-16 versus UTF-8
3) Every "Latin 1" print job will be larger approaching 2X in size.
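
The three size claims above can be checked with a few lines of Python. The sample strings are invented for this illustration and the helper name sizes is arbitrary; a Japanese sample is included for contrast. Sizes exclude any byte order mark.

```python
def sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

markup = '<p class="note">Hello</p>'   # pure ASCII tags and English text
latin1 = "Où est la bibliothèque?"     # mostly ASCII with two accented letters
japanese = "\u3041" * 10               # ten Hiragana characters

print(sizes(markup))    # (25, 50)  UTF-16 is exactly twice the size
print(sizes(latin1))    # (25, 46)  UTF-16 approaches twice the size
print(sizes(japanese))  # (30, 20)  UTF-8 is 50% larger
```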

When you double the data's size, buffers have to double to be able to hold
and manipulate an equivalent amount of print stream content.  There are real
cost and performance penalties to be paid to deal with UTF-16 encoding,
especially when dealing with Western character sets.  When a device is
designed to deal with Far East "characters", there are other penalties
to be paid in things like the size of the font load that mitigate the
UTF-16 versus UTF-8 encoding issue.

*******************************************
Don Wright                 don@lexmark.com

Chair,  IEEE SA Standards Board
Member, IEEE-ISTO Board of Directors
f.wright@ieee.org / f.wright@computer.org

Director, Alliances and Standards
Lexmark International
740 New Circle Rd C14/082-3
Lexington, Ky 40550
859-825-4808 (phone) 603-963-8352 (fax)
*******************************************





"Steven Pemberton" <steven.pemberton@cwi.nl> on 10/15/2003 07:26:24 PM

To:    <don@lexmark.com>
cc:    "BIGELOW,JIM \(HP-Boise,ex1\)" <jim.bigelow@hp.com>,
       <w3c-html-wg@w3.org>, <don@lexmark.com>,
       <voyager-issues@mn.aptest.com>, <elliott.bradshaw@zoran.com>,
       <www-html@w3.org>
Subject:    Re: allow UTF-16 not just UTF-8 (PR#6774)


But support for UTF-16 adds a few dozen bytes of code, and no extra memory
requirements. It is simpler than UTF-8! What's the problem?

Steven

----- Original Message -----
From: <don@lexmark.com>
To: "Steven Pemberton" <Steven.Pemberton@cwi.nl>
Cc: "BIGELOW,JIM (HP-Boise,ex1)" <jim.bigelow@hp.com>;
<w3c-html-wg@w3.org>;
<don@lexmark.com>; <voyager-issues@mn.aptest.com>;
<elliott.bradshaw@zoran.com>; <www-html@w3.org>
Sent: Thursday, October 16, 2003 12:20 AM
Subject: Re: allow UTF-16 not just UTF-8 (PR#6774)


>
> Steven, et al:
>
> The real problem is that the entire XML architecture was designed assuming
> high end boxes like the 3 GHz Pentium with 512 megabytes of memory.  We
> have already seen push back in other standards groups that consumer
> electronic devices and other smaller, lighter devices cannot afford all the
> luxuries demanded by an obese XML architecture.  Unless the XML community
> accepts subsetting, we can't expect the broadest support for XML to happen
> at the low end until the price/performance ratios experience another order
> or two of magnitude improvement.  As recently reported in several of the
> trade magazines focused on IT professionals, the deployment of XML and Web
> Services is having significant negative impacts on the IT infrastructure,
> especially in the area of bandwidth utilization.  This is just another
> symptom of the same problem.
>
> I know I will lose this argument in the W3C but the realities of the
> XHTML-Print implementations will blow off UTF-16 as more fat with no
> benefit and simply not support it, "interoperable" or not.
>
> Sorry I'm not pure but practical.
>
> *******************************************
> Don Wright                 don@lexmark.com
>
> Chair,  IEEE SA Standards Board
> Member, IEEE-ISTO Board of Directors
> f.wright@ieee.org / f.wright@computer.org
>
> Director, Alliances and Standards
> Lexmark International
> 740 New Circle Rd C14/082-3
> Lexington, Ky 40550
> 859-825-4808 (phone) 603-963-8352 (fax)
> *******************************************
>
>
>
>
> "Steven Pemberton" <Steven.Pemberton@cwi.nl> on 10/15/2003 09:18:15 AM
>
> To:    "BIGELOW,JIM \(HP-Boise,ex1\)" <jim.bigelow@hp.com>,
>        <w3c-html-wg@w3.org>, <don@lexmark.com>
> cc:    <voyager-issues@mn.aptest.com>, <elliott.bradshaw@zoran.com>,
>        <www-html@w3.org>
> Subject:    Re: allow UTF-16 not just UTF-8 (PR#6774)
>
>
> > From: don@lexmark.com [mailto:don@lexmark.com]
>
> > So let me understand this....
> >
> > Because people have poorly designed and written XML applications running
> > on 3 GHz Pentium 4s with 512 megabytes of real memory that do not allow
> > the control over whether UTF-8 or UTF-16 are emitted, we are expecting to
> > burden $49 printers with code to be able to detect and interpret both.
>
> No Don. It is about interoperability and conforming to standards. XML
> allows documents to be encoded in either UTF-8 or UTF-16: consumers must
> accept both, producers may produce either. An XHTML-Print printer will be
> just a consumer of an XML byte-stream at some IP address; we don't want to
> burden every program in the world that can produce XML with a switch that
> says "this output is going to a poor lowly XHTML-Print processor that can't
> deal with UTF-16, so please produce UTF-8", especially since UTF-16 is the
> easy one to implement, and can only cost a few dozen bytes at most.
>
> If we changed this, XHTML Print would have to go back to last call, and
> you can bet your boots that the XML community would rise up against us, as
> it has in the past, and I can tell you we don't want to go there, and we
> would have a hundred people registering objections.
>
> Conforming to XML requirements comes with the territory of being XHTML.
> The XML community will not take lightly to us messing with their standards.
>
> Best wishes,
>
> Steven Pemberton
>
Received on Thursday, 16 October 2003 10:24:25 UTC