[whatwg] Encoding: big5 and big5-hkscs

From: Anne van Kesteren <annevk@opera.com>
Date: Wed, 04 Apr 2012 18:05:14 +0200
Message-ID: <op.wb80a0zw64w2qv@annevk-macbookpro.local>
On Fri, 30 Mar 2012 14:00:38 +0200, Anne van Kesteren <annevk at opera.com> wrote:
> Ideally someone does detailed content analysis to figure out what the  
> best path forward is here, though I'm not entirely sure how.

I still don't know how, but thanks to Simon Pieters I gathered some URLs
from http://dotnetdotcom.org/ and found that 22 pages out of 609 (at least
two of which are big5-hkscs encoded) contain byte sequences in the ranges
where big5 and big5-hkscs differ in most implementations (in IE the two are
identical; in Opera big5-hkscs is a superset, I believe).
The byte sequences found per URL are published here:

To go from (lead, trail) to an index usable in big5.json you can use a  
function such as:

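def get_index(lead, trail):
     # Assumption: RANGE is the size of the low trail-byte range 0x40..0x7E (63).
     RANGE = 0x7E - 0x40 + 1
     row = 0xFE - 0xA1 + RANGE + 1  # cells per lead byte: 94 high + 63 low = 157
     # Trail bytes are 0x40..0x7E or 0xA1..0xFE; high-range cells follow the low ones.
     cell = (trail - 0xA1 + RANGE) if trail > (0x7E + 1) else trail - 0x40
     return (lead - 0x81) * row + cell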

I can do that for the dataset, but I need someone who is able to interpret  
the results to see which decoding makes more sense.
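
As a rough illustration of what running that conversion over the dataset could look like, here is a minimal sketch. It assumes the data is available as a mapping from URL to a list of (lead, trail) byte pairs. The names pair_to_index and tally_indices, the input shape, and the sample URLs and bytes are all made up for this example and are not taken from the published data.

```python
from collections import Counter

# Trail bytes fall in two ranges; the high range is numbered after the low one.
LOW_TRAIL = range(0x40, 0x7F)    # 0x40..0x7E, 63 values
HIGH_TRAIL = range(0xA1, 0xFF)   # 0xA1..0xFE, 94 values
CELLS_PER_ROW = len(LOW_TRAIL) + len(HIGH_TRAIL)  # 157


def pair_to_index(lead, trail):
    """Map a (lead, trail) byte pair to a flat index (hypothetical helper)."""
    if not 0x81 <= lead <= 0xFE:
        raise ValueError("lead byte out of range: 0x%02X" % lead)
    if trail in LOW_TRAIL:
        cell = trail - 0x40
    elif trail in HIGH_TRAIL:
        cell = trail - 0xA1 + len(LOW_TRAIL)
    else:
        raise ValueError("trail byte out of range: 0x%02X" % trail)
    return (lead - 0x81) * CELLS_PER_ROW + cell


def tally_indices(pairs_by_url):
    """Count, for each index, how many URLs contain it at least once."""
    counts = Counter()
    for pairs in pairs_by_url.values():
        # A set ensures each URL contributes at most once per index.
        for index in {pair_to_index(lead, trail) for lead, trail in pairs}:
            counts[index] += 1
    return counts


# Placeholder input, not real data from the crawl.
sample = {
    "http://example.org/a": [(0x81, 0x40), (0xA4, 0x40)],
    "http://example.org/b": [(0xA4, 0x40), (0xFE, 0xFE)],
}
counts = tally_indices(sample)
```

Counting per URL rather than per occurrence keeps one page with many repeated sequences from dominating the tally; whether that is the right weighting for this analysis is an open question.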

Anne van Kesteren
Received on Wednesday, 4 April 2012 09:05:14 UTC
