- From: Jeni Tennison <jeni@jenitennison.com>
- Date: Wed, 10 Jun 2015 09:54:20 +0100
- To: www-international@w3.org, Steven Atkin <atkin@us.ibm.com>
- Cc: public-csv-wg@w3.org
Hi Steven,

Thank you for raising this issue which we turned into https://github.com/w3c/csvw/issues/575

We have added text in the definition of the tabular data model (http://w3c.github.io/csvw/syntax/#model) to make it clear that all string values it contains are Unicode strings:

  String values within the tabular data model (such as column titles or cell string values) MUST contain only Unicode characters.

We have also added text in step 5 of the non-normative parsing algorithm for CSV at http://w3c.github.io/csvw/syntax/#parsing which describes how to create a model from CSV and now says:

  5. Read the file using the encoding, as specified in [encoding], using the replacement error mode. If the encoding is not a Unicode encoding, use a normalizing transcoder to normalize into Unicode Normal Form C as defined in [UAX15].

  NOTE The replacement error mode ensures that any non-Unicode characters within the CSV file are replaced by U+FFFD, ensuring that strings within the tabular data model such as column titles and cell string values only contain valid Unicode characters.

We have added text to the JSON conversion document at http://w3c.github.io/csvw/csv2json/#generating-json which says:

  The [tabular-data-model] specifies that string values within tabular data (such as column titles or cell string values) must contain only Unicode characters. No Unicode normalization (as specified in [UAX15]) is applied to these string values during the conversion to JSON.

Please can you confirm that these changes satisfy this comment?

Thanks,

Jeni
--
Jeni Tennison
http://www.jenitennison.com/

On 1 June 2015 at 17:53:45, Steven Atkin (atkin@us.ibm.com) wrote:
>
>
> 4.2 Generating JSON
> http://www.w3.org/TR/2015/WD-csv2json-20150416/#generating-json
>
> There is no mention of whether or not CSV data must be encoded in UTF-8.
> The model for tabular data indicates that non UTF-8 CSV data should specify
> the charset in the Content-Type header. The specification should clearly
> indicate that a conversion to UTF-8 needs to be performed when the CSV data
> is not in Unicode.
> See 7.2 Encoding
> http://www.w3.org/TR/2015/WD-tabular-data-model-20150416/#encoding
>
>
> Steven Atkin, Ph.D.
> STSM - Chief Globalization Architect
> IBM Globalization Center of Competency
> atkin@us.ibm.com
> http://www-3.ibm.com/software/globalization/index.jsp
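
For illustration only, step 5 of the parsing algorithm quoted in this message (decode with the replacement error mode, then normalize to NFC when the source encoding is not a Unicode encoding) might be sketched in Python roughly as below. The function name read_csv_text and the UNICODE_ENCODINGS set are hypothetical, and Python's codec registry is assumed here as a stand-in for the labels defined in [encoding]; the two do not agree on every label, so this is not a conforming implementation.

```python
import codecs
import unicodedata

# Assumption: the codecs treated here as "Unicode encodings".
# Names are Python's canonical codec names, not [encoding] labels.
UNICODE_ENCODINGS = {
    "utf-8", "utf-8-sig",
    "utf-16", "utf-16-le", "utf-16-be",
    "utf-32", "utf-32-le", "utf-32-be",
}


def read_csv_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw CSV bytes into a Unicode string (hypothetical helper)."""
    # errors="replace" substitutes U+FFFD for undecodable byte sequences,
    # which approximates the replacement error mode.
    text = raw.decode(encoding, errors="replace")

    # Resolve aliases such as "UTF8" or "latin1" to a canonical codec name.
    canonical = codecs.lookup(encoding).name

    # Only transcoded legacy data is normalized to NFC; text that was
    # already in a Unicode encoding is left exactly as decoded.
    if canonical not in UNICODE_ENCODINGS:
        text = unicodedata.normalize("NFC", text)
    return text
```

With this sketch, read_csv_text(b"a\xffb", "utf-8") yields "a", U+FFFD, "b", while a string that arrives as UTF-8 in decomposed form is returned unnormalized, matching the statement that no normalization is applied beyond the transcoding step.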
Received on Wednesday, 10 June 2015 08:54:45 UTC