- From: Steven Atkin <atkin@us.ibm.com>
- Date: Wed, 10 Jun 2015 20:09:31 +0100
- To: Jeni Tennison <jeni@jenitennison.com>
- Cc: public-csv-wg@w3.org, www-international@w3.org
- Message-ID: <OF621A5788.758C0C4A-ON80257E60.006934DF-80257E60.00693DE2@us.ibm.com>
These changes satisfy ISSUE-475.
Steven Atkin, Ph.D.
STSM - Chief Globalization Architect
IBM Globalization Center of Competency
atkin@us.ibm.com
http://www-3.ibm.com/software/globalization/index.jsp
From: Jeni Tennison <jeni@jenitennison.com>
To: www-international@w3.org, Steven Atkin/Austin/IBM@IBMUS
Cc: public-csv-wg@w3.org
Date: 06/10/2015 09:52 AM
Subject: Re: i18n-ISSUE-475: Do Unicode Strings get normalized when placed into RDF
Hi Steven,
Thank you for raising this issue, which we turned into
https://github.com/w3c/csvw/issues/577
We have added text in the definition of the tabular data model
(http://w3c.github.io/csvw/syntax/#model) to make it clear that all string
values it contains are Unicode strings:
String values within the tabular data model (such as column titles or
cell string values) MUST contain only Unicode characters.
We have also added text in step 5 of the non-normative parsing algorithm
for CSV at http://w3c.github.io/csvw/syntax/#parsing which describes how to
create a model from CSV and now says:
5. Read the file using the encoding, as specified in [encoding], using the
replacement error mode. If the encoding is not a Unicode encoding, use a
normalizing transcoder to normalize into Unicode Normal Form C as defined
in [UAX15].
NOTE
The replacement error mode ensures that any non-Unicode characters within
the CSV file are replaced by U+FFFD, ensuring that strings within the
tabular data model such as column titles and cell string values only
contain valid Unicode characters.
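As a rough illustration of step 5 (a sketch only, not text from the spec), the read could look like this in Python. The function name read_csv_text and the UNICODE_ENCODINGS set are hypothetical, and Python's errors="replace" handler merely stands in for the replacement error mode of [encoding]; it may not match that mode exactly in how many U+FFFD characters it emits.

```python
import codecs
import unicodedata

# Assumption: this set is our own approximation of "Unicode encodings",
# using the canonical codec names that Python reports.
UNICODE_ENCODINGS = {
    "utf-8", "utf-8-sig",
    "utf-16", "utf-16-le", "utf-16-be",
    "utf-32", "utf-32-le", "utf-32-be",
}


def read_csv_text(raw: bytes, encoding: str) -> str:
    """Decode raw CSV bytes, roughly following step 5 (hypothetical helper)."""
    # Undecodable byte sequences become U+FFFD instead of raising an error.
    text = raw.decode(encoding, errors="replace")
    # Resolve aliases such as "utf8" or "UTF-16LE" to one canonical name.
    canonical = codecs.lookup(encoding).name
    if canonical not in UNICODE_ENCODINGS:
        # Legacy encoding: normalize the transcoded text into NFC.
        text = unicodedata.normalize("NFC", text)
    return text
```

With this sketch, bytes that are already UTF-8 pass through with no normalization, while text transcoded from a legacy encoding such as windows-1258 comes out in Normal Form C.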
We are in the process (https://github.com/w3c/csvw/pull/601) of adding text
to the RDF conversion document which will say:
The [tabular-data-model] specifies that string values within tabular data
(such as column titles or cell string values) must contain only Unicode
characters. No Unicode normalization (as specified in [UAX15]) is applied
to these string values during the conversion to RDF.
NOTE
If a CSV file is originally encoded as UTF-8, it should not go through
Unicode normalization during parsing, nor in conversion to RDF. This can
result in RDF literals that are not in Normal Form C as they should be
according to [rdf11-concepts].
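To make the consequence in that NOTE concrete, here is a small Python sketch (illustrative only; the is_nfc helper and the sample value are assumptions, not part of any spec text). A cell value stored in UTF-8 in decomposed form passes through unchanged, so the resulting literal is not in Normal Form C.

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """Return True if s is already in Unicode Normal Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", s) == s


# "e" followed by U+0301 COMBINING ACUTE ACCENT: a decomposed "é".
decomposed = "e\u0301"

# Simulate a UTF-8 CSV cell: encode, then decode, with no normalization step.
cell_value = decomposed.encode("utf-8").decode("utf-8")

# The value that would become the RDF literal is still decomposed.
literal_is_nfc = is_nfc(cell_value)

# The precomposed form U+00E9 is what NFC would have produced instead.
precomposed = unicodedata.normalize("NFC", cell_value)
```

Here literal_is_nfc is False, and precomposed differs from cell_value even though both render as the same character, which is the mismatch with [rdf11-concepts] that the NOTE describes.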
Please can you confirm that these changes satisfy this comment?
Thanks,
Jeni
--
Jeni Tennison
http://www.jenitennison.com/
On 1 June 2015 at 17:47:18, Steven Atkin (atkin@us.ibm.com) wrote:
>
>
> 4.3 Interpreting datatypes
> http://www.w3.org/TR/2015/WD-csv2rdf-20150416/#datatypes
>
> When strings are parsed from the CSV data, are they first normalized
> before being mapped to the xsd:string datatype? For example, are the
> strings normalized into Unicode Normal Form C?
>
> It is recommended that text not be normalized if it is already in a
> Unicode encoding. If the text is not in Unicode then a normalizing
> transcoder should be used and the Unicode Normal Form C should be used.
>
>
>
> Steven Atkin, Ph.D.
> STSM - Chief Globalization Architect
> IBM Globalization Center of Competency
> atkin@us.ibm.com
> http://www-3.ibm.com/software/globalization/index.jsp
Received on Wednesday, 10 June 2015 19:15:00 UTC