Re: i18n-ISSUE-466: Can JSON data be generated from non UTF-8 encoded CSV data

From: Jeni Tennison <jeni@jenitennison.com>
Date: Wed, 10 Jun 2015 10:29:34 +0100
To: www-international@w3.org, Steven Atkin <atkin@us.ibm.com>
Cc: public-csv-wg@w3.org
Message-ID: <etPan.5578037e.6fbc81e6.ed@jenit.local>
Apologies, this should have referenced our issue https://github.com/w3c/csvw/issues/579

Jeni
--  
Jeni Tennison
http://www.jenitennison.com/

On 10 June 2015 at 09:54:20, Jeni Tennison (jeni@jenitennison.com) wrote:
> Hi Steven,
>  
> Thank you for raising this issue which we turned into https://github.com/w3c/csvw/issues/575  
>  
> We have added text in the definition of the tabular data model (http://w3c.github.io/csvw/syntax/#model)  
> to make it clear that all string values it contains are Unicode strings:
>  
> String values within the tabular data model (such as column titles or cell string values)  
> MUST contain only Unicode characters.
>  
> We have also added text in step 5 of the non-normative parsing algorithm for CSV at http://w3c.github.io/csvw/syntax/#parsing  
> which describes how to create a model from CSV and now says:
>  
> 5. Read the file using the encoding, as specified in [encoding], using the replacement  
> error mode. If the encoding is not a Unicode encoding, use a normalizing transcoder
> to normalize into Unicode Normal Form C as defined in [UAX15].
>  
> NOTE
>  
> The replacement error mode ensures that any non-Unicode characters within the CSV
> file are replaced by U+FFFD, ensuring that strings within the tabular data model
> such as column titles and cell string values only contain valid Unicode characters.  
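[Editorial illustration, not part of the original message.] Step 5 quoted above can be sketched roughly as follows. This is a minimal sketch under stated assumptions: it uses Python's built-in codecs rather than the decoders defined in [encoding], so label handling and error recovery may differ in detail. The function name `decode_csv_bytes` and the `UNICODE_ENCODINGS` set are hypothetical, and applying NFC after decoding is only an approximation of a "normalizing transcoder".

```python
import codecs
import unicodedata

# Assumption: these codec names (as normalized by codecs.lookup) are the
# ones treated as "Unicode encodings" for the purposes of this sketch.
UNICODE_ENCODINGS = {
    "utf-8", "utf-8-sig", "utf-7",
    "utf-16", "utf-16-le", "utf-16-be",
    "utf-32", "utf-32-le", "utf-32-be",
}


def decode_csv_bytes(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode CSV bytes into a Unicode string (hypothetical helper).

    Byte sequences that are invalid in the given encoding are replaced
    by U+FFFD. Text from a non-Unicode (legacy) encoding is additionally
    normalized to NFC; text from a Unicode encoding is left as-is.
    """
    text = raw.decode(encoding, errors="replace")
    codec_name = codecs.lookup(encoding).name
    if codec_name not in UNICODE_ENCODINGS:
        text = unicodedata.normalize("NFC", text)
    return text


if __name__ == "__main__":
    # An invalid UTF-8 byte becomes U+FFFD instead of raising an error.
    print(repr(decode_csv_bytes(b"ab\xffcd", "utf-8")))
    # Legacy-encoded input is transcoded to a Unicode string.
    print(repr(decode_csv_bytes(b"caf\xe9", "latin-1")))
```

In this sketch, input that is already in a Unicode encoding is deliberately not normalized, which matches the wording of step 5 as quoted.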
>  
> We have added text to the JSON conversion document at http://w3c.github.io/csvw/csv2json/#generating-json  
> which says:
>  
> The [tabular-data-model] specifies that string values within tabular data (such as  
> column titles or cell string values) must contain only Unicode characters. No Unicode  
> normalization (as specified in [UAX15]) is applied to these string values during the  
> conversion to JSON.
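[Editorial illustration, not part of the original message.] The practical effect of "no Unicode normalization during the conversion to JSON" can be sketched as below. This is an illustrative sketch only, assuming Python's standard `json` module as the serializer; `cell_to_json` is a hypothetical helper, not something defined by the csv2json document.

```python
import json
import unicodedata


def cell_to_json(value: str) -> str:
    """Serialize a cell string value to JSON without normalizing it.

    The string is passed through exactly as it appears in the tabular
    data model (hypothetical helper for illustration).
    """
    return json.dumps(value, ensure_ascii=False)


if __name__ == "__main__":
    composed = "\u00e9"      # e with acute as one precomposed code point
    decomposed = "e\u0301"   # e followed by a combining acute accent

    # The two forms are canonically equivalent under NFC...
    print(unicodedata.normalize("NFC", decomposed) == composed)
    # ...but remain distinct in the JSON output, because nothing
    # normalizes them during conversion.
    print(cell_to_json(composed) == cell_to_json(decomposed))
    # The decomposed form survives a round trip unchanged.
    print(json.loads(cell_to_json(decomposed)) == decomposed)
```

Any normalization therefore has to happen earlier, when the tabular data model is built from the CSV file, not at JSON generation time.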
>  
> Please can you confirm that these changes satisfy this comment?
>  
> Thanks,
>  
> Jeni
> --
> Jeni Tennison
> http://www.jenitennison.com/
>  
> On 1 June 2015 at 17:53:45, Steven Atkin (atkin@us.ibm.com) wrote:
> >
> >
> > 4.2 Generating JSON
> > http://www.w3.org/TR/2015/WD-csv2json-20150416/#generating-json
> >
> > There is no mention of whether or not CSV data must be encoded in UTF-8.
> > The model for tabular data indicates that non UTF-8 CSV data should specify
> > the charset in the Content-Type header. The specification should clearly
> > indicate that a conversion to UTF-8 needs to be performed when the CSV data
> > is not in Unicode.
> > See 7.2 Encoding
> > http://www.w3.org/TR/2015/WD-tabular-data-model-20150416/#encoding
> >
> >
> > Steven Atkin, Ph.D.
> > STSM - Chief Globalization Architect
> > IBM Globalization Center of Competency
> > atkin@us.ibm.com
> > http://www-3.ibm.com/software/globalization/index.jsp
Received on Wednesday, 10 June 2015 09:29:59 UTC
