[Bug 28141] treatment of invalid 2-byte sequence is different in different CJK encodings from bugzilla@jessica.w3.org on 2015-03-06 (www-international@w3.org from January to March 2015)

From: <bugzilla@jessica.w3.org>
Date: Fri, 06 Mar 2015 19:18:51 +0000
To: www-international@w3.org
Message-ID: <bug-28141-4285-nJfgRqpm3A@http.www.w3.org/Bugs/Public/>

https://www.w3.org/Bugs/Public/show_bug.cgi?id=28141

--- Comment #2 from Jungshik Shin <jshin@chromium.org> ---
Another piece of information:

While tightening Chromium's Big5 table, I found that it has a lot of "holes"
(unmapped trail bytes) in the ASCII range. Below is what I found (all values
in hexadecimal).

lead: trail byte holes in the ASCII range 
87: 76
89: 42 44 45 4A 4B
8A: 42 63 75
8B: 54
8D: 41
9B: 61
9F: 4E
A0: 54 57 5A 62 72

They're all in [a-zA-Z]. So, arguably, the XSS risk is lower than it would be
with 'punctuation-mark-like characters' in the ASCII range.
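
The claim that every hole listed above falls in [a-zA-Z] can be checked
mechanically. The following is only a small sketch: the HOLES dictionary is
transcribed by hand from the list above, and the helper names (hole_chars,
all_ascii_letters) are made up for illustration; they are not part of
Chromium or ICU.

```python
import string

# Holes transcribed from the list above: lead byte -> unmapped trail bytes.
HOLES = {
    0x87: [0x76],
    0x89: [0x42, 0x44, 0x45, 0x4A, 0x4B],
    0x8A: [0x42, 0x63, 0x75],
    0x8B: [0x54],
    0x8D: [0x41],
    0x9B: [0x61],
    0x9F: [0x4E],
    0xA0: [0x54, 0x57, 0x5A, 0x62, 0x72],
}


def hole_chars(holes):
    """Map each lead byte to the ASCII characters its hole trail bytes spell."""
    return {lead: "".join(chr(t) for t in trails)
            for lead, trails in holes.items()}


def all_ascii_letters(holes):
    """Return True if every hole trail byte is an ASCII letter."""
    return all(chr(t) in string.ascii_letters
               for trails in holes.values()
               for t in trails)


if __name__ == "__main__":
    for lead, chars in hole_chars(HOLES).items():
        print(f"{lead:02X}: {chars}")
    print("all in [a-zA-Z]:", all_ascii_letters(HOLES))
```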

In the case of EUC-KR (windows-949), the trail bytes in the ASCII range are
limited to [a-zA-Z]. So, without the 'adding back to the stream' clause, we'd
only eat up [a-zA-Z].


Unless we're sure that [a-zA-Z] is harmless when eaten up, we should keep the
'adding back to the stream if the trail is [0, 7F]' clause. (In the case of
ICU, the overall memory/perf impact of keeping the current spec is perhaps
somewhere between neutral and a small net loss; I haven't compared yet.)
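
To make the difference concrete, here is a toy two-byte decoder with the
'adding back to the stream' clause as a switch. It is a simplified sketch and
not the spec's algorithm or ICU's code: decode_two_byte, the (lead, trail)
table shape, the lead byte range and the prepend_ascii_trail flag are all
assumptions made for illustration.

```python
REPLACEMENT = "\ufffd"


def decode_two_byte(data, table, lead_range=(0x81, 0xFE),
                    prepend_ascii_trail=True):
    """Decode bytes with a toy lead/trail table (hypothetical, simplified).

    table maps (lead, trail) pairs to characters. When a pair is unmapped,
    U+FFFD is emitted. If prepend_ascii_trail is set and the trail is in
    [0x00, 0x7F], the trail is put back and decoded again as ASCII.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b < 0x80:
            out.append(chr(b))           # plain ASCII byte
            continue
        if not (lead_range[0] <= b <= lead_range[1]):
            out.append(REPLACEMENT)      # not a valid lead byte
            continue
        if i >= len(data):
            out.append(REPLACEMENT)      # lead byte at end of input
            break
        trail = data[i]
        i += 1
        ch = table.get((b, trail))
        if ch is not None:
            out.append(ch)
            continue
        out.append(REPLACEMENT)          # unmapped pair (a "hole")
        if prepend_ascii_trail and trail < 0x80:
            i -= 1                       # add the ASCII trail back
    return "".join(out)


if __name__ == "__main__":
    data = b"\x89Bx"                     # 0x89 0x42 is one of the Big5 holes
    print(repr(decode_two_byte(data, {})))
    print(repr(decode_two_byte(data, {}, prepend_ascii_trail=False)))
```

With the clause the 'B' survives after the U+FFFD; without it the 'B' is
eaten up along with the lead byte.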

Anyway, it occurred to me that we might think about this, too.

Received on Friday, 6 March 2015 19:18:53 UTC