Re: i18n-ISSUE-467: What are the rules for string equality when column names are matched with annotations

These changes satisfy ISSUE-467.


Steven Atkin, Ph.D.
STSM - Chief Globalization Architect
IBM Globalization Center of Competency
atkin@us.ibm.com
http://www-3.ibm.com/software/globalization/index.jsp



From:	Jeni Tennison <jeni@jenitennison.com>
To:	www-international@w3.org, Steven Atkin/Austin/IBM@IBMUS
Cc:	public-csv-wg@w3.org
Date:	06/10/2015 09:56 AM
Subject:	Re: i18n-ISSUE-467: What are the rules for string equality when
            column names are matched with annotations



Hi Steven,

Thank you for raising this issue, which we turned into
https://github.com/w3c/csvw/issues/578

We have added text in the definition of the tabular data model
(http://w3c.github.io/csvw/syntax/#model) to make it clear that all string
values it contains are Unicode strings:

  String values within the tabular data model (such as column titles or
  cell string values) MUST contain only Unicode characters.

We have also added text in step 5 of the non-normative parsing algorithm
for CSV at http://w3c.github.io/csvw/syntax/#parsing which describes how to
create a model from CSV and now says:

5. Read the file using the encoding, as specified in [encoding], using the
   replacement error mode. If the encoding is not a Unicode encoding, use a
   normalizing transcoder to normalize into Unicode Normal Form C as defined
   in [UAX15].

   NOTE

   The replacement error mode ensures that any non-Unicode characters within
   the CSV file are replaced by U+FFFD, ensuring that strings within the
   tabular data model such as column titles and cell string values only
   contain valid Unicode characters.
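
As a rough illustration of step 5 (not text from the spec), the decoding
could be sketched in Python as below. The function name read_csv_text and the
UNICODE_ENCODINGS set are assumptions made for this example, and Python's
codec registry stands in for the labels defined in [encoding], which it does
not match exactly.

```python
import codecs
import unicodedata

# Assumption: which codecs count as "Unicode encodings" for this sketch.
UNICODE_ENCODINGS = {
    codecs.lookup(label).name
    for label in ("utf-8", "utf-8-sig", "utf-16", "utf-16-le", "utf-16-be",
                  "utf-32", "utf-32-le", "utf-32-be")
}


def read_csv_text(raw: bytes, encoding: str) -> str:
    """Decode raw CSV bytes roughly as step 5 describes (hypothetical helper)."""
    codec_name = codecs.lookup(encoding).name
    # errors="replace" substitutes U+FFFD for byte sequences that cannot be
    # decoded, similar in spirit to the replacement error mode.
    text = raw.decode(codec_name, errors="replace")
    # Only text transcoded from a non-Unicode encoding is normalized to NFC;
    # text that already arrived in a Unicode encoding is left as it is.
    if codec_name not in UNICODE_ENCODINGS:
        text = unicodedata.normalize("NFC", text)
    return text
```

With this sketch, an invalid byte in UTF-8 input becomes U+FFFD, while a
decomposed sequence such as "e" followed by U+0301 in UTF-8 input is passed
through without being recomposed.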

We have changed the rules on comparisons of titles to ensure that these are
always case-sensitive
(http://w3c.github.io/csvw/metadata/#schema-compatibility):

  Column descriptions are compatible under the following conditions:

  1. If either column description has neither name nor titles properties.
  2. If there is a case-sensitive match between the name properties of the
     columns.
  3. If there is a non-empty case-sensitive intersection between the titles
     values, where matches must have a matching language; und matches any
     language, and languages match if they are equal when truncated, as
     defined in [BCP47], to the length of the shortest language tag.
  4. If not validating, and one schema has a name property but not a titles
     property, and the other has a titles property but not a name property.
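
To make the four conditions concrete, here is a minimal Python sketch, not
the normative algorithm. It assumes a column description is a dict with an
optional "name" string and an optional "titles" dict mapping language tags to
lists of strings. The helper names are hypothetical, truncation is
interpreted as dropping trailing subtags, and language tags are compared
case-insensitively as BCP47 allows.

```python
def languages_match(tag_a: str, tag_b: str) -> bool:
    """Compare two language tags; "und" matches any language."""
    if tag_a == "und" or tag_b == "und":
        return True
    # Assumption: truncate the longer tag to the subtag count of the shorter.
    subtags_a = tag_a.lower().split("-")
    subtags_b = tag_b.lower().split("-")
    shortest = min(len(subtags_a), len(subtags_b))
    return subtags_a[:shortest] == subtags_b[:shortest]


def columns_compatible(col_a: dict, col_b: dict, validating: bool = True) -> bool:
    """Apply the four compatibility conditions to two column descriptions."""
    name_a, name_b = col_a.get("name"), col_b.get("name")
    titles_a = col_a.get("titles") or {}
    titles_b = col_b.get("titles") or {}

    # 1. Either description has neither name nor titles.
    if (name_a is None and not titles_a) or (name_b is None and not titles_b):
        return True
    # 2. Case-sensitive match between names (plain string equality).
    if name_a is not None and name_a == name_b:
        return True
    # 3. Non-empty case-sensitive intersection of titles in matching languages.
    for lang_a, values_a in titles_a.items():
        for lang_b, values_b in titles_b.items():
            if languages_match(lang_a, lang_b) and set(values_a) & set(values_b):
                return True
    # 4. When not validating: one side has only a name, the other only titles.
    if not validating:
        only_name_a = name_a is not None and not titles_a
        only_name_b = name_b is not None and not titles_b
        only_titles_a = name_a is None and bool(titles_a)
        only_titles_b = name_b is None and bool(titles_b)
        if (only_name_a and only_titles_b) or (only_name_b and only_titles_a):
            return True
    return False
```

Under these assumptions, titles {"en": ["Tree"]} and {"en-GB": ["Tree"]} are
compatible, while {"en": ["tree"]} and {"en": ["Tree"]} are not, because the
comparison of the title strings themselves is case-sensitive.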

Please can you confirm that these changes satisfy this comment?

Thanks,

Jeni
--
Jeni Tennison
http://www.jenitennison.com/

On 1 June 2015 at 17:55:17, Steven Atkin (atkin@us.ibm.com) wrote:
>
>
> 6.2 Example with single table and rich annotations
> http://www.w3.org/TR/2015/WD-csv2json-20150416/#example-tree-ops-ext
>
> When the names of the columns in the CSV data are compared with the names
> of the columns in the annotations what is the rule for determining if they
> are the same? For example, is equality based solely on the UTF-8 raw byte
> sequence or is some form of Unicode Normalization applied first and does
> case matter when making comparisons?
>
> It is recommended that Unicode text not be normalized if it is already in a
> Unicode encoding. If text needs to be converted into Unicode, then a
> normalizing transcoder should be used and text be normalized into Unicode
> Normal Form C.
>
> It is recommended that case sensitive matching be used when making
> comparisons.
>
>
> Steven Atkin, Ph.D.
> STSM - Chief Globalization Architect
> IBM Globalization Center of Competency
> atkin@us.ibm.com
> http://www-3.ibm.com/software/globalization/index.jsp

Received on Wednesday, 10 June 2015 19:15:00 UTC