
Re: i18n-ISSUE-466: Can JSON data be generated from non UTF-8 encoded CSV data

From: Steven Atkin <atkin@us.ibm.com>
Date: Wed, 10 Jun 2015 20:10:43 +0100
To: Jeni Tennison <jeni@jenitennison.com>
Cc: public-csv-wg@w3.org, www-international@w3.org
Message-ID: <OF1BF70002.4B353F29-ON80257E60.0069519B-80257E60.00695A34@us.ibm.com>

These changes satisfy ISSUE-466.



Steven Atkin, Ph.D.
STSM - Chief Globalization Architect
IBM Globalization Center of Competency
atkin@us.ibm.com
http://www-3.ibm.com/software/globalization/index.jsp



From:	Jeni Tennison <jeni@jenitennison.com>
To:	www-international@w3.org, Steven Atkin/Austin/IBM@IBMUS
Cc:	public-csv-wg@w3.org
Date:	06/10/2015 09:54 AM
Subject:	Re: i18n-ISSUE-466: Can JSON data be generated from non UTF-8
            encoded CSV data



Hi Steven,

Thank you for raising this issue, which we turned into
https://github.com/w3c/csvw/issues/575

We have added text in the definition of the tabular data model
(http://w3c.github.io/csvw/syntax/#model) to make it clear that all string
values it contains are Unicode strings:

  String values within the tabular data model (such as column titles or
  cell string values) MUST contain only Unicode characters.

We have also added text in step 5 of the non-normative parsing algorithm
for CSV at http://w3c.github.io/csvw/syntax/#parsing which describes how to
create a model from CSV and now says:

5. Read the file using the encoding, as specified in [encoding], using the
   replacement error mode. If the encoding is not a Unicode encoding, use a
   normalizing transcoder to normalize into Unicode Normal Form C as defined
   in [UAX15].

   NOTE

   The replacement error mode ensures that any non-Unicode characters within
   the CSV file are replaced by U+FFFD, ensuring that strings within the
   tabular data model such as column titles and cell string values only
   contain valid Unicode characters.
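
As an illustration only (not part of the spec text), step 5 might be
sketched in Python as below. The function name read_csv_text and the
UNICODE_CODECS set are hypothetical, and Python's errors="replace" handler
is used here as an approximation of the replacement error mode described in
[encoding]; Python's codecs are not guaranteed to match that specification
byte for byte.

```python
import codecs
import unicodedata

# Assumption: the codecs treated as "Unicode encodings" for this sketch.
UNICODE_CODECS = {
    "utf-8", "utf-8-sig",
    "utf-16", "utf-16-le", "utf-16-be",
    "utf-32", "utf-32-le", "utf-32-be",
}


def read_csv_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode CSV bytes, substituting U+FFFD for undecodable sequences.

    If the source encoding is not a Unicode encoding, the decoded text is
    additionally normalized to NFC, mirroring the wording of step 5.
    """
    # Undecodable byte sequences become U+FFFD instead of raising an error.
    text = raw.decode(encoding, errors="replace")

    # Resolve aliases such as "UTF8" or "latin-1" to Python's canonical name.
    canonical = codecs.lookup(encoding).name
    if canonical not in UNICODE_CODECS:
        text = unicodedata.normalize("NFC", text)
    return text
```

For example, read_csv_text(b"caf\xe9", "utf-8") yields "caf" followed by
U+FFFD, because the lone byte 0xE9 is not valid UTF-8, while the same bytes
read as "latin-1" yield "café".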

We have added text to the JSON conversion document at
http://w3c.github.io/csvw/csv2json/#generating-json which says:

   The [tabular-data-model] specifies that string values within tabular data
   (such as column titles or cell string values) must contain only Unicode
   characters. No Unicode normalization (as specified in [UAX15]) is applied
   to these string values during the conversion to JSON.
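
To make the "no normalization" point concrete, here is a minimal Python
sketch. The helper name cells_to_json and the sample data are hypothetical;
the sketch only shows that serializing to JSON can leave a decomposed
string (e followed by U+0301) exactly as it was in the model, rather than
composing it to U+00E9.

```python
import json
import unicodedata


def cells_to_json(row: dict) -> str:
    """Serialize one row of cell values to JSON without touching the strings.

    ensure_ascii=False writes non-ASCII characters as-is; either setting
    round-trips to the same code points, so no normalization takes place.
    """
    return json.dumps(row, ensure_ascii=False)


# "e" followed by COMBINING ACUTE ACCENT (decomposed form).
decomposed = "e\u0301"
# The precomposed NFC equivalent, U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

round_tripped = json.loads(cells_to_json({"title": decomposed}))["title"]

# The value survives the conversion in its original, decomposed form.
assert round_tripped == decomposed
assert round_tripped != composed
```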

Please can you confirm that these changes satisfy this comment?

Thanks,

Jeni
--
Jeni Tennison
http://www.jenitennison.com/

On 1 June 2015 at 17:53:45, Steven Atkin (atkin@us.ibm.com) wrote:
>
>
> 4.2 Generating JSON
> http://www.w3.org/TR/2015/WD-csv2json-20150416/#generating-json
>
> There is no mention of whether or not CSV data must be encoded in UTF-8.
> The model for tabular data indicates that non UTF-8 CSV data should specify
> the charset in the Content-Type header. The specification should clearly
> indicate that a conversion to UTF-8 needs to be performed when the CSV data
> is not in Unicode.
> See 7.2 Encoding
> http://www.w3.org/TR/2015/WD-tabular-data-model-20150416/#encoding
>
>
> Steven Atkin, Ph.D.
> STSM - Chief Globalization Architect
> IBM Globalization Center of Competency
> atkin@us.ibm.com
> http://www-3.ibm.com/software/globalization/index.jsp








Received on Wednesday, 10 June 2015 19:15:00 UTC
