Re: [whatwg/encoding] Corner cases arising from Big5 encoder not excluding HKSCS codes with lead bytes 0xFA–FE (#252) from Anne van Kesteren on 2021-02-16 (public-webapps-github@w3.org from February 2021)

From: Anne van Kesteren <notifications@github.com>
Date: Tue, 16 Feb 2021 02:41:08 -0800
To: whatwg/encoding <encoding@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/encoding/issues/252/779750696@github.com>

When I run
```python
import json

# Load the index data; "big5" is a list where the position is the pointer.
with open("indexes.json", "r") as f:
    data = json.load(f)

big5 = data["big5"]

# Map each code point to every pointer at which it appears in the index.
code_points = {}
for pointer, code_point in enumerate(big5):
    if code_point is not None:
        code_points.setdefault(code_point, []).append(pointer)

# Report duplicated code points and whether their pointers fall below 5024.
for code_point, pointers in code_points.items():
    if len(pointers) > 1:  # It's either 1 or 2
        excluded = "no"
        if pointers[0] < 5024 and pointers[1] < 5024:
            excluded = "yes"
        elif pointers[0] < 5024 or pointers[1] < 5024:
            excluded = "partial"

        print("U+" + hex(code_point).upper()[2:], pointers, excluded)
```
it seems there are many other duplicated code points whose pointers we probably want to keep excluding? If so, the fix here would likely be to special-case the code points listed in the OP.
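
To make the suggestion concrete, here is a minimal sketch of what special-casing could look like on the encoder side. It assumes the encoder keeps ignoring pointers below 5024 (the threshold the script above tests against) and otherwise returns the first matching pointer. The function name `encoder_pointer` and the `overrides` mapping are hypothetical, and the code points from the OP are deliberately not filled in; which pointer each of them should map to is exactly what would need to be decided.

```python
EXCLUDED_BELOW = (0xA1 - 0x81) * 157  # 5024, same threshold as in the script above


def encoder_pointer(index, code_point, overrides=None):
    """Return the pointer to encode code_point with, or None if there is none.

    index     -- list of code points (or None), positioned by pointer
    overrides -- hypothetical mapping {code point: pointer} for special cases;
                 consulted before the general rule
    """
    # Special cases win over the general exclusion rule.
    if overrides and code_point in overrides:
        return overrides[code_point]

    # General rule (assumed): skip every pointer below the threshold and
    # return the first remaining pointer that maps to code_point.
    for pointer, candidate in enumerate(index):
        if pointer >= EXCLUDED_BELOW and candidate == code_point:
            return pointer
    return None
```

With this shape, the duplicates the script reports as "yes" or "partial" stay excluded unless they are explicitly listed in `overrides`.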

-- 
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/252#issuecomment-779750696

Received on Tuesday, 16 February 2021 10:41:20 UTC