Re: i18n-ISSUE-475: Do Unicode Strings get normalized when placed into RDF

Hi Steven,

Thank you for raising this issue, which we have turned into https://github.com/w3c/csvw/issues/577

We have added text in the definition of the tabular data model (http://w3c.github.io/csvw/syntax/#model) to make it clear that all string values it contains are Unicode strings:

  String values within the tabular data model (such as column titles or cell string values) 
  MUST contain only Unicode characters.

We have also added text to step 5 of the non-normative parsing algorithm for CSV (http://w3c.github.io/csvw/syntax/#parsing), which describes how to create a model from CSV. That step now says:

5. Read the file using the encoding, as specified in [encoding], using the replacement 
   error mode. If the encoding is not a Unicode encoding, use a normalizing transcoder 
   to normalize into Unicode Normal Form C as defined in [UAX15].

   NOTE

   The replacement error mode ensures that any non-Unicode characters within the CSV 
   file are replaced by U+FFFD, ensuring that strings within the tabular data model 
   such as column titles and cell string values only contain valid Unicode characters.
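
As a rough illustration of step 5 (not part of the spec text), here is a minimal Python sketch. The function name read_csv_text and the UNICODE_ENCODINGS set are assumptions made for this example, and Python's codec names are not the same as the labels defined in [encoding], so treat it as an approximation of the behaviour rather than a conforming implementation:

```python
import codecs
import unicodedata

# Assumption: the set of Python codec names treated as "Unicode encodings" here.
UNICODE_ENCODINGS = {
    "utf-8", "utf-8-sig",
    "utf-16", "utf-16-le", "utf-16-be",
    "utf-32", "utf-32-le", "utf-32-be",
}


def read_csv_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode CSV bytes using the replacement error mode.

    Undecodable byte sequences become U+FFFD. Text from a non-Unicode
    encoding is normalized to NFC; text from a Unicode encoding is
    left exactly as decoded.
    """
    text = raw.decode(encoding, errors="replace")
    # Resolve aliases such as "UTF8" or "latin1" to one canonical codec name.
    canonical = codecs.lookup(encoding).name
    if canonical not in UNICODE_ENCODINGS:
        text = unicodedata.normalize("NFC", text)
    return text


# An invalid UTF-8 byte is replaced rather than raising an error.
replaced = read_csv_text(b"caf\xff", "utf-8")

# Decomposed text that arrives as UTF-8 is passed through untouched.
from_utf8 = read_csv_text("e\u0301".encode("utf-8"), "utf-8")

# The same text arriving in a legacy encoding (windows-1258 has combining
# marks) is composed into a single precomposed character.
from_legacy = read_csv_text("e\u0301".encode("cp1258"), "cp1258")
```

Here from_utf8 keeps its two code points (e followed by U+0301), while from_legacy is composed to the single code point U+00E9.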

We are also in the process of adding text to the RDF conversion document (see https://github.com/w3c/csvw/pull/601), which will say:

   The [tabular-data-model] specifies that string values within tabular data (such as 
   column titles or cell string values) must contain only Unicode characters. No Unicode 
   normalization (as specified in [UAX15]) is applied to these string values during the 
   conversion to RDF.

   NOTE

   If a CSV file is originally encoded as UTF-8, it should not go through Unicode 
   normalization during parsing, nor in conversion to RDF. This can result in RDF literals 
   that are not in Normal Form C as they should be according to [rdf11-concepts].
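
To make the consequence in that NOTE concrete, here is a small Python sketch. The helper to_rdf_literal is hypothetical and only escapes backslashes and double quotes, so it is a simplification of real RDF serialization. It shows a decomposed UTF-8 value passing through unchanged and how a consumer might detect that the resulting literal is not in Normal Form C (unicodedata.is_normalized needs Python 3.8 or later):

```python
import unicodedata


def to_rdf_literal(cell_value: str) -> str:
    """Wrap a cell value in quotes as a plain string literal.

    Hypothetical helper for illustration only. No Unicode normalization
    is applied, mirroring the proposed conversion text.
    """
    escaped = cell_value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


# A cell value read from a UTF-8 CSV file in decomposed form:
# "Zoe" followed by U+0308 COMBINING DIAERESIS.
decomposed = "Zoe\u0308".encode("utf-8").decode("utf-8")

literal = to_rdf_literal(decomposed)

# False for this input: the literal still contains the combining mark.
is_nfc = unicodedata.is_normalized("NFC", literal)
```

A consumer that needs Normal Form C would have to apply unicodedata.normalize("NFC", ...) itself after the conversion.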

Could you please confirm whether these changes address your comment?

Thanks,

Jeni
-- 
Jeni Tennison
http://www.jenitennison.com/

On 1 June 2015 at 17:47:18, Steven Atkin (atkin@us.ibm.com) wrote:
>  
>  
> 4.3 Interpreting datatypes
> http://www.w3.org/TR/2015/WD-csv2rdf-20150416/#datatypes
>  
> When strings are parsed from the CSV data, are they first normalized before
> being mapped to the xsd:string datatype? For example, are the strings
> normalized into Unicode Normal Form C?
>  
> It is recommended that text not be normalized if it is already in a Unicode
> encoding. If the text is not in Unicode then a normalizing transcoder
> should be used and the Unicode Normal Form C should be used.
>  
>  
>  
> Steven Atkin, Ph.D.
> STSM - Chief Globalization Architect
> IBM Globalization Center of Competency
> atkin@us.ibm.com
> http://www-3.ibm.com/software/globalization/index.jsp

Received on Wednesday, 10 June 2015 08:53:11 UTC