Re: BOCU-1, SCSU, etc. from Frank Ellermann on 2008-01-29 (public-html-comments@w3.org from January 2008)

From: Frank Ellermann <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com>
Date: Tue, 29 Jan 2008 16:21:46 +0100
To: <public-html-comments@w3.org>
Cc: "Henri Sivonen" <hsivonen@iki.fi>
Message-ID: <00ad01c8628a$abbe6650$4fa0b43e@xyzzy>

Henri Sivonen wrote:

> It would be interesting to see a large-scale study of the
> compactness of UTF-8 vs. UTF-16 vs. BOCU-1 vs. SCSU vs.
> well-supported applicable legacy encodings, and all
> of them gzipped, as applied to real-world *Web* content.

Yes, same here.  Apart from what is covered in UTN #14,
here are my own test results for permutations of MES-1 + BOM:

Format BOM (hex) Bytes
UTF-32 0000FEFF   1344
UTF-16 FEFF        672
UTF-8  EFBBBF      595
UTF-7  2B2F76382D  836
UTF-4  849F9E9F9F  789
UTF-1  F7644C      578
BOCU-1 FBEE28      514
B( 80)             627
B(  1)             377

For BOCU-1 I tried to catch worst (627) and best (377) cases,
but it is quite possible that I missed worse / better cases.
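
The fixed-width rows and the BOM signatures in the table can be
cross-checked with a short sketch.  It assumes Python 3 and its
built-in codecs, which cover UTF-32, UTF-16, UTF-8 and UTF-7 but
not UTF-4, UTF-1 or BOCU-1.  The helper names bom_hex and
encoded_sizes are illustrative only and are not taken from
bocu.cmd.

```python
# Sketch only: cross-check BOM signatures and fixed-width sizes from the table.
BOM = "\ufeff"
# The "-be" variants are used so that Python does not prepend a BOM of its own.
CODECS = ["utf-32-be", "utf-16-be", "utf-8", "utf-7"]


def bom_hex(codec):
    """Hex form of U+FEFF encoded on its own in the given codec."""
    return BOM.encode(codec).hex().upper()


def encoded_sizes(text):
    """Byte counts of BOM + text for each codec (hypothetical helper)."""
    return {codec: len((BOM + text).encode(codec)) for codec in CODECS}


if __name__ == "__main__":
    for codec in CODECS:
        print(f"{codec:10} {bom_hex(codec)}")
    # 1344 / 4 and 672 / 2 both give 336 code points,
    # i.e. 335 BMP characters plus the BOM.
    print(1344 // 4, 672 // 2)
    # Arbitrary sample text, not the MES-1 repertoire.
    print(encoded_sizes("na\u00efve \u20ac"))
```

The variable-width figures (UTF-8, UTF-7) depend on the actual
MES-1 repertoire, so the sketch does not try to reproduce them.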

Of course "permutation of MES-1 + BOM" is totally unrelated
to "real-world Web content".  For the script of this test,
see <http://purl.net/xyzzy/src/bocu.cmd> - but its BOCU-1
code is unsuited for real applications (no error handling).
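
For the comparison on real Web content suggested in the quoted
text, a measurement along these lines might be a starting point.
It is only a sketch, assuming Python 3 with the standard gzip
module.  The function name raw_and_gzipped and the sample string
are made up for illustration, and BOCU-1 and SCSU are left out
because the standard library has no codecs for them.

```python
import gzip


def raw_and_gzipped(text, codecs=("utf-8", "utf-16-le", "utf-32-le")):
    """Map codec name -> (raw byte count, gzipped byte count).

    Hypothetical helper; the codec list is an assumption, not a
    recommendation.
    """
    result = {}
    for codec in codecs:
        data = text.encode(codec)
        result[codec] = (len(data), len(gzip.compress(data)))
    return result


if __name__ == "__main__":
    # Stand-in text only; a real study would read fetched Web pages.
    sample = "<p>Gr\u00fc\u00dfe aus K\u00f6ln</p>\n" * 200
    for codec, (raw, packed) in raw_and_gzipped(sample).items():
        print(f"{codec:10} raw={raw:6} gzip={packed:6}")
```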

 Frank

Received on Tuesday, 29 January 2008 15:21:45 UTC