- From: Jeni Tennison <jeni@jenitennison.com>
- Date: Wed, 10 Jun 2015 09:45:18 +0100
- To: www-international@w3.org, Steven Atkin <atkin@us.ibm.com>
- Cc: public-csv-wg@w3.org
Hi Steven,

Thank you for raising this issue, which we turned into https://github.com/w3c/csvw/issues/575

We have added text in the definition of the tabular data model (http://w3c.github.io/csvw/syntax/#model) to make it clear that all string values it contains are Unicode strings:

    String values within the tabular data model (such as column titles or cell string values) MUST contain only Unicode characters.

We have also added text in step 5 of the non-normative parsing algorithm for CSV at http://w3c.github.io/csvw/syntax/#parsing which describes how to create a model from CSV and now says:

    5. Read the file using the encoding, as specified in [encoding], using the replacement error mode. If the encoding is not a Unicode encoding, use a normalizing transcoder to normalize into Unicode Normal Form C as defined in [UAX15].

    NOTE The replacement error mode ensures that any non-Unicode characters within the CSV file are replaced by U+FFFD, ensuring that strings within the tabular data model such as column titles and cell string values only contain valid Unicode characters.

We are in the process (https://github.com/w3c/csvw/pull/601) of adding text to the RDF conversion document which will say:

    The [tabular-data-model] specifies that string values within tabular data (such as column titles or cell string values) must contain only Unicode characters. No Unicode normalization (as specified in [UAX15]) is applied to these string values during the conversion to RDF.

    NOTE If a CSV file is originally encoded as UTF-8, it should not go through Unicode normalization during parsing, nor in conversion to RDF. This can result in RDF literals that are not in Normal Form C as they should be according to [rdf11-concepts].

Please can you confirm that these changes satisfy this comment?
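To illustrate step 5 of the parsing algorithm quoted above, here is a minimal Python sketch. It is not taken from the specification: the function name `read_csv_text` and the `UNICODE_CODECS` set are hypothetical, Python codec names only approximate the labels defined in [encoding], and Python's `errors="replace"` handler is assumed to be close enough to the replacement error mode for illustration.

```python
import codecs
import unicodedata

# Assumption: Python codec names standing in for "Unicode encodings".
UNICODE_CODECS = {
    "utf-8", "utf-8-sig",
    "utf-16", "utf-16-le", "utf-16-be",
    "utf-32", "utf-32-le", "utf-32-be",
}


def read_csv_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode CSV bytes, roughly following step 5 of the parsing algorithm."""
    # Malformed byte sequences become U+FFFD instead of raising an error.
    text = raw.decode(encoding, errors="replace")

    # Only content from a non-Unicode encoding is normalized to NFC;
    # content that was already in a Unicode encoding is left untouched.
    codec_name = codecs.lookup(encoding).name
    if codec_name not in UNICODE_CODECS:
        text = unicodedata.normalize("NFC", text)
    return text


if __name__ == "__main__":
    # An invalid byte in UTF-8 input is replaced by U+FFFD.
    print(ascii(read_csv_text(b"a\xffb", "utf-8")))
    # Decomposed UTF-8 input is passed through without normalization.
    print(ascii(read_csv_text("e\u0301".encode("utf-8"), "utf-8")))
```

The second call shows the case the NOTE in the RDF conversion text describes: UTF-8 input that is not already in Normal Form C stays that way, so any RDF literal built from it is not in Normal Form C either.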
Thanks,

Jeni
--
Jeni Tennison
http://www.jenitennison.com/

On 1 June 2015 at 17:42:56, Steven Atkin (atkin@us.ibm.com) wrote:
> 4.2 Generating RDF
> http://www.w3.org/TR/2015/WD-csv2rdf-20150416/#generating-rdf
>
> There is no mention of whether or not CSV data must be encoded in UTF-8.
> The model for tabular data indicates that non UTF-8 CSV data should specify
> the charset in the Content-Type header. The specification should clearly
> indicate that a conversion to UTF-8 needs to be performed when the CSV data
> is not in Unicode.
> See 7.2 Encoding
> http://www.w3.org/TR/2015/WD-tabular-data-model-20150416/#encoding
>
> Steven Atkin, Ph.D.
> STSM - Chief Globalization Architect
> IBM Globalization Center of Competency
> atkin@us.ibm.com
> http://www-3.ibm.com/software/globalization/index.jsp
Received on Wednesday, 10 June 2015 08:45:45 UTC