Re: Column merging is not clear... from Gregg Kellogg on 2015-02-01 (public-csv-wg@w3.org from February 2015)

From: Gregg Kellogg <gregg@greggkellogg.net>
Date: Sun, 1 Feb 2015 13:10:04 -0800
To: Ivan Herman <ivan@w3.org>
Cc: W3C CSV on the Web Working Group <public-csv-wg@w3.org>
Message-Id: <DAC53FE9-E269-42E3-B113-06BCA1E3F913@greggkellogg.net>
> On Jan 31, 2015, at 2:51 AM, Ivan Herman <ivan@w3.org> wrote:
> 
> Just a first reaction (maybe we do have to turn this into an issue to get it properly tracked...): I do prefer the second version, which makes an exception for the language.
> 
> However... I have spotted another issue. The _property value_ for a natural language property says:
> 
> [[[
>  • if the metadata value is a string, that string
>  • if the metadata value is an array, the strings in that array
>  • if the metadata value is an object, the string or strings that are the value of the property of that object whose name is value of the lang inherited property on that description, or und if no lang is defined.
> ]]]
> 
> (We know that 'lang' has to be replaced by the value of '@language', but that is another matter). So let us consider:
> 
> metadata A
> 
> {
>  "@context" : { "@language": "en"}
>  ...
>  "tableSchema" :
>     "columns" : [
>        {
>           "title" : { "en" : "My title" }
> ...
> }
> 
> metadata B:
> 
> {
>  "@context" : { "@language": "fr"}
>  ...
>  "tableSchema" :
>     "columns" : [
>        {
>           "title" : { "en" : "My title", "fr" : "Mon titre" }
> ...
> }
> 
> 
> I think we can agree that we want these two to match. However, does the property value for 'title' in B include "My title"?

We need to be careful to separate _property value_ from the use within the merge algorithm. The merge language says the following:

[[[
 • If the property is a natural language property, the result is an object whose properties are language codes and where the values of those properties are arrays. The suitable language code for the values is either explicit within the existing value or determined through the default language in the metadata document; if it can't be determined the language code und must be used. The arrays should provide the values from A followed by those from B that were not already a value in A.
]]]
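
As a rough illustration of that paragraph, here is a minimal Python sketch. The function names (to_language_map, merge_natural_language) are hypothetical and not from the spec, and I am assuming that "not already a value in A" is checked per language code.

```python
def to_language_map(value, default_lang=None):
    """Normalize a natural language property to {language code: [strings]}."""
    lang = default_lang or "und"  # fall back to "und" when no language is known
    if isinstance(value, str):
        return {lang: [value]}
    if isinstance(value, list):
        return {lang: list(value)}
    # Object form: keys are already language codes; wrap bare strings in arrays.
    return {k: [v] if isinstance(v, str) else list(v) for k, v in value.items()}


def merge_natural_language(a, b, lang_a=None, lang_b=None):
    """Values from A first, then those from B that are not already in A."""
    merged = to_language_map(a, lang_a)
    for lang, strings in to_language_map(b, lang_b).items():
        existing = merged.setdefault(lang, [])
        for s in strings:
            if s not in existing:
                existing.append(s)
    return merged


title_a = {"en": "My title"}
title_b = {"en": "My title", "fr": "Mon titre"}
merged = merge_natural_language(title_a, title_b, "en", "fr")
print(merged)  # {'en': ['My title'], 'fr': ['Mon titre']}
```

Note that this sketch keeps the values as arrays, as the quoted text requires.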

So, the result of merging A and B would be the following:

     "title" : { "en" : "My title", "fr" : "Mon titre" }

This merge happens because of the definition of schema merge:

[[[
When an array of column descriptions B is merged into an original array of column descriptions A, each column description within B is combined into the original array A by:

 • if there is a column description at the same index within A and that column description has the same name, the column description from B is merged into the matching column description in A
 • otherwise, if there is a column description at the same index within A and that column description has a title, is also in A, and the column default language is the same in both A and B, the column description from B is imported into the matching column description in A
 • otherwise, if there is no column description at the same index within A, then the column description is taken from that index of B
 • otherwise, the column description is ignored. A validator must issue a warning if such a column description is encountered.
]]]
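
A minimal Python sketch of how I read these four rules, assuming the second rule means "A and B share at least one title in the same language". The names titles and combine_action and the returned labels are hypothetical, chosen only for illustration.

```python
def titles(column, default_lang=None):
    """Return the set of (language, title) pairs for a column description."""
    value = column.get("title")
    if value is None:
        return set()
    lang = default_lang or "und"
    if isinstance(value, str):
        return {(lang, value)}
    if isinstance(value, list):
        return {(lang, v) for v in value}
    pairs = set()
    for code, v in value.items():
        for s in ([v] if isinstance(v, str) else v):
            pairs.add((code, s))
    return pairs


def combine_action(cols_a, index, col_b, lang_a=None, lang_b=None):
    """Decide how the column description at `index` of B is combined into A."""
    if index >= len(cols_a):
        return "take-from-b"  # rule 3: nothing at that index in A
    col_a = cols_a[index]
    if "name" in col_a and col_a["name"] == col_b.get("name"):
        return "merge"  # rule 1: same name at the same index
    if titles(col_a, lang_a) & titles(col_b, lang_b):
        return "import"  # rule 2 (assumed reading): a shared language-tagged title
    return "ignore-with-warning"  # rule 4


cols_a = [{"title": {"en": "My title"}}]
col_b = {"title": {"en": "My title", "fr": "Mon titre"}}
print(combine_action(cols_a, 0, col_b, "en", "fr"))  # import
```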

This allows the columns to merge because both A and B have a column description with the same title ("My title"@en). After the merge, the schema looks like the following:

    {
     "@context" : { "@language": "en"}
     ...
     "tableSchema" :
        "columns" : [
           {
              "title" : { "en" : "My title", "fr" : "Mon titre" }
    ...
    }

Because the merged context has @language: en, the property value of title would be "My title"@en.
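
For illustration, a minimal Python sketch of that reading of _property value_ (selecting by the default language). The function name property_value is hypothetical, and always returning a list is my assumption.

```python
def property_value(value, default_lang=None):
    """Select the strings of a natural language property for the default language."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, list):
        return list(value)
    # Object form: pick the entry for the default language, or "und" if none is set.
    selected = value.get(default_lang or "und", [])
    return [selected] if isinstance(selected, str) else list(selected)


merged_title = {"en": "My title", "fr": "Mon titre"}
print(property_value(merged_title, "en"))  # ['My title']
```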

> It is not really clear from the definition of the property value. I presume the intention is that the property value for B is
> 
> ["My title", "Mon titre"]
> 
> maybe it should be
> 
> [{ "en" : "My title" }, { "fr" : "Mon titre" }]
> 
> ie, some sort of a canonical value...

My interpretation may very well be faulty, and the property value of a natural language property should include all titles in all languages. On reflection, I think this is the correct interpretation, and the property value should include all defined values in all languages.

We also need to consider precisely which atomic properties make use of @language, which is defined to be the default language of properties in the document. The only one which could possibly make sense is name, which I've taken to be a string without language, but maybe it should get the language of @language or of the title which is promoted. Given how it's used, I don't think having a language for name is useful, but we should be explicit. In that case, the only things @language applies to are common properties and title, where it is a string or array of strings.

> Maybe the definition of property value should be something like:
> 
> [[[
> The _property value_ for natural language property is an array of objects, each object having a single key/value pair of the form
> 
>   { [language code] : [String] }
> 
> The objects are created from the original definition as follows:
> 
>  * if the metadata value is a string, the language code in the resulting object is either the value of "@language", if it exists, or "und" otherwise
>  * if the metadata value is an array, the language code for each resulting object is either the value of "@language", if it exists, or "und" otherwise
>  * if the metadata value is an object, the array consists of the constituent key/value pairs, after possibly flattening values that are themselves arrays with the common key
> ]]]
> 
> 
> (this obviously needs refinement). With this definition, the second alternative below seems to work well.
> 
> WDYT?

Why wouldn't it just be the same representation as the merged value?

[[[ an object whose properties are language codes and where the values of those properties are arrays]]]

This has the advantage of being equivalent to the JSON-LD representation. I think this makes more sense than using yet another representation involving arrays of objects with tags. Since all metadata used for extracting property values is the result of a merge, that is the most natural representation to use. So, I'd be inclined to go with the following:

[[[
The _property value_ for natural language property is an object whose properties are language codes and where the values of those properties are arrays (see <cite><a href="http://www.w3.org/TR/json-ld/#language-maps">Language Maps</a></cite> in [[!JSON-LD]]).
]]]

Given that, when processing, all metadata is the result of a merge (possibly with the default metadata used for extracted embedded metadata), it shouldn't be necessary to re-define the procedure for normalizing values to the object form.

I created the issue #183 for this, and will create a pull request accordingly.

Gregg

> Ivan
> 
> 
>> On 30 Jan 2015, at 20:35 , Gregg Kellogg <gregg@greggkellogg.net> wrote:
>> 
>>> On Jan 30, 2015, at 12:42 AM, Ivan Herman <ivan@w3.org> wrote:
>>> 
>>> Gregg,
>>> 
>>> I do not want to add this as an issue, because it may just be my bad understanding. Here is what the (new) document says on merging columns:
>>> 
>>> [[[
>>> When an array of column descriptions B is merged into an original array of column descriptions A, each column description within B is combined into the original array A by:
>>> 
>>>  • if there is a column description at the same index within A and that column description has the same name, the column description from B is merged into the matching column description in A
>>>  • otherwise, if there is a column description at the same index within A and that column description has a title, is also in A, and the column default language is the same in both A and B, the column description from B is imported into the matching column description in A
>>>  • otherwise, if there is no column description at the same index within A, then the column description is taken from that index of B
>>>  • otherwise, the column description is ignored. A validator must issue a warning if such a column description is encountered.
>>> ]]]
>>> 
>>> I do not really understand the second entry, and I wonder whether there is a misspelling. What does 'is also in A' mean? Or should that be 'is also in B', meaning that the same title should appear on both sides? What happens if the title is an array (which can happen)? Does it mean that there should be at least one agreement in a title? Also, if A says:
>> 
>> Might be better worded as the following:
>> 
>> [[[
>> * otherwise, if there is a column description at the same index within A and that column description has a title in _property value_ which is also in B, considering the language of each title, the column description from B is imported into the matching column description in A.
>> ]]]
>> 
>> Basically, A and B match if they share a title, considering the language of each title in A and B.
>> 
>>> {
>>> "@context" : { "@language" : "en" },
>>> "tableSchema" :
>>>   "columns" : [
>>>      {
>>>         "title" : "my Title"
>>>      }
>>> 
>>> and B says
>>> 
>>> {
>>> "tableSchema" :
>>>   "columns" : [
>>>      {
>>>         "title" : "my Title",
>>>         "name"  : "my-title"
>>>      }
>>> 
>>> according to these rules you cannot merge the two, because one of the two has a language tag, the other does not. Is it what we want?
>> 
>> No, they don’t match as currently defined, although we might make an exception if the language is undefined.
>> 
>> This page has some interesting word comparisons: http://edl.ecml.at/LanguageFun/Sameworddifferentmeaning/tabid/3103/language/en-GB/Default.aspx.
>> 
>> For example, “Bad” means different things in many Germanic languages and English, so “bad”@de != “bad”@en. But would we say that “bad”@en == “bad”^^xsd:string (SPARQL says no)? If we did, then we could simplify the creation of embedded metadata by not needing to use @language and `lang` in the extracted metadata. In this case, the wording might be the following:
>> 
>> [[[
>> * otherwise, if there is a column description at the same index within A and that column description has a title in _property value_ which is also in B, considering the language of each title where an undefined language value matches a value in any other language, the column description from B is imported into the matching column description in A.
>> ]]]
>> 
>> As always, suggestions on improving the description to make it less cryptic or more accurate are welcome.
>> 
>> Gregg
>> 
>>> I think some clarifications may be necessary...
>>> 
>>> Ivan
>>> 
>>> 
>>> ----
>>> Ivan Herman, W3C
>>> Digital Publishing Activity Lead
>>> Home: http://www.w3.org/People/Ivan/
>>> mobile: +31-641044153
>>> ORCID ID: http://orcid.org/0000-0003-0782-2704
Received on Sunday, 1 February 2015 21:10:34 UTC