[whatwg/encoding] Shift_JIS decoder (#270) from Ludovic Delabre on 2021-08-07 (public-webapps-github@w3.org from August 2021)

From: Ludovic Delabre <notifications@github.com>
Date: Sat, 07 Aug 2021 13:34:48 -0700
To: whatwg/encoding <encoding@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/encoding/issues/270@github.com>

Hi,
I need to implement a full-blown HTML5 parsing library in C# (with all its quirks); the first layer is byte-stream decoding.
After implementing Shift_JIS as described in https://encoding.spec.whatwg.org/#shift_jis-decoder, I ran a full conformance check against https://www.w3.org/International/tests/repo/encoding/legacy-mb-japanese/shift_jis/sjis_chars.html.

First, byte 0x5C (which is ASCII) must be mapped to U+00A5, and likewise 0x7E to U+203E; this seems to be missing from the spec.
Both characters are marked as "Modified ASCII character" at https://en.wikipedia.org/wiki/Shift_JIS.
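
To make the difference concrete, here is a minimal sketch of the two single-byte conventions. The function and table names are hypothetical. The first function follows the single-byte steps of the WHATWG Shift_JIS decoder as understood here (ASCII bytes and 0x80 pass through unchanged; 0xA1 to 0xDF map to half-width katakana); the second layers the two "Modified ASCII character" overrides on top of it.

```python
from typing import Optional

# Assumption: the two overrides from the "Modified ASCII character" reading.
JIS_X_0201_ROMAN_OVERRIDES = {0x5C: "\u00A5", 0x7E: "\u203E"}


def decode_single_byte_whatwg(byte: int) -> Optional[str]:
    """Single-byte branch in the style of the WHATWG Shift_JIS decoder."""
    if byte <= 0x80:
        # ASCII bytes and 0x80 map to the code point with the same value.
        return chr(byte)
    if 0xA1 <= byte <= 0xDF:
        # Half-width katakana block starting at U+FF61.
        return chr(0xFF61 - 0xA1 + byte)
    # Lead bytes and invalid bytes are handled elsewhere in a real decoder.
    return None


def decode_single_byte_jisx0201(byte: int) -> Optional[str]:
    """Same as above, but with 0x5C and 0x7E remapped."""
    if byte in JIS_X_0201_ROMAN_OVERRIDES:
        return JIS_X_0201_ROMAN_OVERRIDES[byte]
    return decode_single_byte_whatwg(byte)
```

With this sketch, `decode_single_byte_whatwg(0x5C)` yields U+005C while `decode_single_byte_jisx0201(0x5C)` yields U+00A5, which is exactly the discrepancy described above.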

But my main issue is with the byte sequence 0x81 0x7C, which according to https://encoding.spec.whatwg.org/index-jis0208.txt can be decoded as either U+2211 or U+FF0C.
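
One way to check this is to compute the index pointer for the sequence and look up that single pointer in the index file. The sketch below follows the pointer arithmetic of the WHATWG Shift_JIS decoder as understood here; the function name is hypothetical and the index lookup itself is left out.

```python
from typing import Optional


def shift_jis_pointer(lead: int, byte: int) -> Optional[int]:
    """Return the index-jis0208 pointer for a two-byte sequence, or None."""
    if not (0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xFC):
        return None  # not a valid lead byte
    if not (0x40 <= byte <= 0x7E or 0x80 <= byte <= 0xFC):
        return None  # not a valid trail byte
    offset = 0x40 if byte < 0x7F else 0x41
    lead_offset = 0x81 if lead < 0xA0 else 0xC1
    return (lead - lead_offset) * 188 + byte - offset


print(shift_jis_pointer(0x81, 0x7C))  # 60
```

Since the arithmetic yields exactly one pointer per byte sequence, the entry for pointer 60 in the index file should be the only candidate for 0x81 0x7C when decoding.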

Did I misinterpret something?

Thanks for your help,
Ludovic.

PS: I noticed the same trouble with EUC-JP: the sequence 0xA1 0xDD can decode as either U+2211 or U+FF0C (different sequences but same code points?)
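
For reference, the EUC-JP pointer arithmetic can be sketched in the same way. This assumes the two-byte jis0208 branch of the WHATWG EUC-JP decoder as understood here (both bytes in 0xA1 to 0xFE); the function name is hypothetical, and the jis0212 and half-width katakana branches are left out.

```python
from typing import Optional


def euc_jp_pointer(lead: int, byte: int) -> Optional[int]:
    """Return the index-jis0208 pointer for a two-byte EUC-JP sequence, or None."""
    if 0xA1 <= lead <= 0xFE and 0xA1 <= byte <= 0xFE:
        return (lead - 0xA1) * 94 + byte - 0xA1
    return None


print(euc_jp_pointer(0xA1, 0xDD))  # 60
```

The Shift_JIS arithmetic for 0x81 0x7C gives (0x81 - 0x81) * 188 + 0x7C - 0x40 = 60 as well, so both sequences land on the same pointer in index-jis0208, which would explain why they show the same code points.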

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/270

Received on Saturday, 7 August 2021 20:35:00 UTC