Re: i18n-ISSUE-473: Can RDF data be generated from non UTF-8 encoded CSV data from Anne van Kesteren on 2015-06-10 (public-csv-wg@w3.org from June 2015)

From: Anne van Kesteren <annevk@annevk.nl>
Date: Wed, 10 Jun 2015 11:30:51 +0200
To: Jeni Tennison <jeni@jenitennison.com>
Cc: "www-international@w3.org" <www-international@w3.org>, public-csv-wg@w3.org, Steven Atkin <atkin@us.ibm.com>
Message-ID: <CADnb78i8R3zRcD-5Ro-jbg8jERzcNJVju2zXd99eRhQ4aXjb+Q@mail.gmail.com>

On Wed, Jun 10, 2015 at 11:24 AM, Jeni Tennison <jeni@jenitennison.com> wrote:
> As currently defined, the encoding is specified through a flag (see http://w3c.github.io/csvw/syntax/#dfn-encoding) and must be one of the values specified in the encoding spec. It is either set explicitly in the JSON metadata document supplied for the CSV file or through the charset in the Content-Type header. Otherwise it defaults to utf-8.
>
> Would you recommend an alternative approach?

Well, there are labels and encodings. Labels are strings; encodings
are their own type. So you need to use
https://encoding.spec.whatwg.org/#concept-encoding-get to go from one
to the other.
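
To illustrate the label/encoding distinction, here is a minimal sketch of
that lookup in Python. It assumes the steps of the "get an encoding"
algorithm (strip leading and trailing ASCII whitespace, then match
ASCII case-insensitively) and carries only a small, incomplete subset of
the label table. The names get_encoding, LABELS, ASCII_WHITESPACE and
ASCII_LOWER are made up for this example.

```python
# ASCII whitespace as used by the Encoding Standard: TAB, LF, FF, CR, SPACE.
ASCII_WHITESPACE = "\t\n\x0c\r "

# Lowercase only A-Z, so non-ASCII characters (for example U+212A KELVIN SIGN)
# are never folded onto ASCII letters the way str.lower() would fold them.
ASCII_LOWER = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"
)

# Illustrative, incomplete subset of the label table (assumption: a real
# implementation would carry the full table from the Encoding Standard).
LABELS = {
    "unicode-1-1-utf-8": "UTF-8",
    "utf-8": "UTF-8",
    "utf8": "UTF-8",
    "ascii": "windows-1252",
    "iso-8859-1": "windows-1252",
    "latin1": "windows-1252",
    "us-ascii": "windows-1252",
    "windows-1252": "windows-1252",
    "utf-16be": "UTF-16BE",
    "utf-16": "UTF-16LE",
    "utf-16le": "UTF-16LE",
}


def get_encoding(label):
    """Return the encoding name for a label, or None if the label is unknown."""
    key = label.strip(ASCII_WHITESPACE).translate(ASCII_LOWER)
    return LABELS.get(key)
```

With this subset, get_encoding(" UTF8 ") and get_encoding("utf-8") both
give "UTF-8", get_encoding("latin1") gives "windows-1252", and an unknown
label gives None, which a caller would have to treat as a failure.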

>>> If the encoding is not a Unicode encoding, use a normalizing transcoder
>>> to normalize into Unicode Normal Form C as defined in [UAX15].
>>
>> 1) What is a Unicode encoding?
>
> What would you recommend that we say? The comments from Steven on behalf of the I18N WG simply said “not in Unicode”, would that be a better way of framing it than “not a Unicode encoding”?

Well, everything is in Unicode once decoded, and before that it's just
bytes. I guess you could check if the encoding is not utf-8, utf-16be,
or utf-16le...
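
A rough sketch of such a check, in Python. It assumes the label has
already been resolved to an encoding name, so it compares names rather
than raw labels. UNICODE_ENCODINGS and needs_normalizing_transcoder are
hypothetical names for this example, not terms from either spec.

```python
# The three encodings named above, written as resolved encoding names.
UNICODE_ENCODINGS = frozenset({"UTF-8", "UTF-16BE", "UTF-16LE"})


def needs_normalizing_transcoder(encoding_name):
    """True when the resolved encoding is none of UTF-8, UTF-16BE, UTF-16LE."""
    return encoding_name not in UNICODE_ENCODINGS
```

Under that assumption, "windows-1252" would be flagged and "UTF-8" would
not.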

>> 2) What encodings would be affected by this?
>
> Are you asking us to list the encodings that aren’t Unicode encodings, in the spec?

I'm just wondering what the expected benefit of this normalization is.
I'm not aware of any legacy encoding producing non-NFC code points.
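
One way to probe this question for a particular input is to decode the
bytes and compare the result with its NFC form. The sketch below does
that with Python's standard unicodedata module. decode_and_check_nfc is
a hypothetical helper name, and a single sample only says something
about that sample, not about an encoding as a whole.

```python
import unicodedata


def decode_and_check_nfc(data, codec):
    """Decode bytes with the given Python codec and report NFC status.

    Returns (text, is_nfc), where is_nfc is True when normalizing the
    decoded text to NFC leaves it unchanged.
    """
    text = data.decode(codec)
    return text, unicodedata.normalize("NFC", text) == text
```

For example, decode_and_check_nfc(b"caf\xe9", "cp1252") gives
("caf\u00e9", True), while UTF-8 bytes for "cafe" followed by U+0301
COMBINING ACUTE ACCENT decode to text that is not in NFC.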

-- 
https://annevankesteren.nl/

Received on Wednesday, 10 June 2015 09:31:19 UTC