Re: Column merging is not clear... from Gregg Kellogg on 2015-02-02 (public-csv-wg@w3.org from February 2015)

From: Gregg Kellogg <gregg@greggkellogg.net>
Date: Mon, 2 Feb 2015 09:43:00 -0800
To: Ivan Herman <ivan@w3.org>
Cc: W3C CSV on the Web Working Group <public-csv-wg@w3.org>
Message-Id: <1A76B926-B6C0-4D2E-8649-24675C7AB94D@greggkellogg.net>
> On Feb 2, 2015, at 2:02 AM, Ivan Herman <ivan@w3.org> wrote:
> 
> Hi Gregg,
> 
> 
>> On 01 Feb 2015, at 22:10 , Gregg Kellogg <gregg@greggkellogg.net> wrote:
>> 
> 
> <skip/>
> 
>>> Maybe the definition of property value should be something like:
>>> 
>>> [[[
>>> The _property value_ for a natural language property is an array of objects, each object having a single key/value pair of the form
>>> 
>>> { [language code] : [String] }
>>> 
>>> The objects are created from the original definition as follows:
>>> 
>>> * if the metadata value is a string, the language code in the resulting object is either the value of "@language", if it exists, or "und" otherwise
>>> * if the metadata value is an array, the language code for each resulting object is either the value of "@language", if it exists, or "und" otherwise
>>> * if the metadata value is an object, the array consists of the constituent key/value pairs, after possibly flattening values that are themselves arrays with the common key
>>> ]]]
>>> 
>>> 
>>> (this obviously needs refinement). With this definition, the second alternative below seems to work well.
>>> 
>>> WDYT?
>> 
>> Why wouldn't it just be the same representation as the merged value?
>> 
>> [[[ an object whose properties are language codes and where the values of those properties are arrays]]]
>> 
>> This has the advantage of being equivalent to the JSON-LD representation. I think this makes more sense than using yet another representation involving arrays of objects with tags. Since all metadata used for extracting property values is the result of a merge, that is the most natural representation to use. So, I'd be inclined to go with the following:
>> 
>> [[[
>> The _property value_ for a natural language property is an object whose properties are language codes and where the values of those properties are arrays (see <cite><a href="http://www.w3.org/TR/json-ld/#language-maps">Language Maps</a></cite> in [[!JSON-LD]]).
>> ]]]
>> 
>> Given that, when processing, all metadata is the result of a merge (possibly with the default metadata used for extracted embedded metadata), it shouldn't be necessary to re-define the procedure for normalizing values to the object form.
>> 
>> I created the issue #183 for this, and will create a pull request accordingly.
>> 
> 
> I have the impression that we agree on what we want to achieve, and the difference is in how we want to formulate it in the spec. I have given it some thought and, at least for me, the whole issue of handling natural language properties boils down to a conceptual model. Something like (with my notes):
> 
> 
> 1. Conceptually, the _property value_ for a natural language property is an array of language tagged literals. The language tag may either be an ISOXXX tag or the string "und" if the tag is not defined.
> 
>  Note: this is a conceptual representation. The metadata document describes the surface syntax which, in the most general case, is the JSON-LD like structure but may also boil down to an array of strings or indeed a single string, depending on the existence and the possible value of @language

Sure.
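
For illustration only, here is a minimal Python sketch of the normalization described in point 1 and in the earlier bullet list: a string, an array of strings, or a language map is turned into an array of language-tagged literals. The function name `normalize`, the `(language, string)` tuple representation, and the `default_language` parameter (standing in for the value of @language) are assumptions made for this sketch, not anything the spec defines.

```python
def normalize(value, default_language=None):
    """Return a list of (language, string) pairs for a natural language property.

    `default_language` stands in for the value of @language; when it is
    absent, the tag "und" is used, as described in the thread.
    """
    lang = default_language or "und"
    if isinstance(value, str):
        return [(lang, value)]
    if isinstance(value, list):
        return [(lang, text) for text in value]
    if isinstance(value, dict):
        pairs = []
        for code, strings in value.items():
            # A language-map entry may be a single string or an array of strings.
            if isinstance(strings, str):
                strings = [strings]
            pairs.extend((code, text) for text in strings)
        return pairs
    raise TypeError("unsupported natural language value")


print(normalize("My Title"))
print(normalize("My Title", "en"))
print(normalize({"en": ["My Title", "Title"], "de": "Titel"}))
```

An implementation is free to use any internal representation; the tuples here are only a convenient way to show the conceptual array.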

> 2. Two language tagged literals are equal if either one of the two is using the language tag "und" and the literals are equal, or if both the language tags and the literals are equal

Yes.
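
A hedged sketch of the equality rule in point 2, assuming literals are represented as `(language, string)` tuples (an assumption of this sketch, as is the name `literals_equal`):

```python
def literals_equal(a, b):
    """Compare two (language, string) literals; "und" matches any language."""
    lang_a, text_a = a
    lang_b, text_b = b
    if text_a != text_b:
        return False
    return lang_a == lang_b or "und" in (lang_a, lang_b)


print(literals_equal(("en", "My Title"), ("und", "My Title")))
print(literals_equal(("en", "My Title"), ("de", "My Title")))
```

The first call compares an "en" literal with an "und" literal of the same text, the second compares the same text under two different defined languages.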

> 3. Merging two language-tagged literals means concatenating their property values
> 
>  Note: it is not clear whether there is a uniqueness requirement for a property value, i.e., whether it is allowed to have repeated language-tagged literals in the array. I believe there is, i.e., such a uniqueness check must be done after a merge, too.

The merge text says the following:

[[[
... The arrays should provide the values from A followed by those from B that were not already a value in A.
]]]

So, I think we're covered on this. However, there is another issue: if the embedded metadata has no language but the specified metadata does, then after merging, the result would contain both the "und" and the language-defined values. I think we only want the language-defined version, so this might need some refinement. For example, specified metadata might be:

{
  "@context": {"@language": "en"},
  "columns": [{"title": "My Title"}]
}

and after extracting the embedded metadata we'd get

{
  "columns": [{"title": "My Title"}]
}

As written now, when merged we'd get the following:

{
  "@context": {"@language": "en"},
  "columns": [{"title": [{"en": "My Title"}, {"und": "My Title"}]}]
}
}

So, this requires some more text to only result in the English-language version.
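
To make the problem concrete, here is a small Python sketch. `merge` follows the quoted merge text using exact comparison of `(language, string)` tuples. `drop_redundant_und` is only one hypothetical refinement, not agreed spec text: it drops an "und" value when the same string is already present under a defined language. Both function names and the tuple representation are assumptions of this sketch.

```python
def merge(a, b):
    """Values from a, followed by those from b that are not already in a."""
    return a + [value for value in b if value not in a]


def drop_redundant_und(values):
    """Hypothetical refinement: remove ("und", s) if s also has a defined language."""
    tagged = {text for lang, text in values if lang != "und"}
    return [
        (lang, text)
        for lang, text in values
        if lang != "und" or text not in tagged
    ]


specified = [("en", "My Title")]   # from metadata with @language "en"
embedded = [("und", "My Title")]   # extracted from the CSV, no language

merged = merge(specified, embedded)
print(merged)                      # both the "en" and "und" values survive
print(drop_redundant_und(merged))  # only the "en" value remains
```

Whether the refinement belongs in the merge step itself or in a later clean-up pass is left open here.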

> 4. If the "title" attribute is used for identifying two column descriptions, the criterion is whether the two property values have a non-empty intersection

Yes. I believe that is equivalent to what the merge text on "columns" says:

[[[
...
 • otherwise, if there is a column description at the same index within A with a title that is also a title in A, considering the language of each title where und matches a value in any language, the column description from B is imported into the matching column description in A.
]]]
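
As a rough sketch of the "non-empty intersection" test from point 4, with "und" matching a value in any language, something like the following would do. The name `titles_intersect` and the `(language, string)` tuples are assumptions of this sketch.

```python
def titles_intersect(titles_a, titles_b):
    """True if some title in titles_a matches some title in titles_b.

    Two titles match when their strings are equal and either the
    languages are equal or one of them is "und".
    """
    return any(
        text_a == text_b and (lang_a == lang_b or "und" in (lang_a, lang_b))
        for lang_a, text_a in titles_a
        for lang_b, text_b in titles_b
    )


column_a = [("en", "My Title"), ("de", "Mein Titel")]
column_b = [("und", "My Title")]
print(titles_intersect(column_a, column_b))
```

Here the "und" title in `column_b` matches the "en" title in `column_a`, so the two column descriptions would be treated as the same column.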

> 5. If the "title" attribute is used to give a value to "name", this means taking either the first element in the array, whatever its language tag (in case there is no @language specified), or the first value that is tagged with either "und" or the defined language tag.
> 
>  Note: this may lead to 'undefined', in which case the '_col.[N]' alternative comes in.

Yes, the language worked before, as it used the _property value_ of title, which would have done this, but after revision, it no longer does. I'll update the text accordingly.
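
A minimal sketch of the selection in point 5, under the assumption that titles are `(language, string)` tuples and that the caller supplies the column number N for the `_col.[N]` fallback. The function name `column_name` is hypothetical, and whether N starts at 0 or 1 is not settled in this thread.

```python
def column_name(titles, index, default_language=None):
    """Pick a name from the titles, falling back to "_col.<index>".

    With no default language, the first title wins regardless of its tag.
    Otherwise the first title tagged "und" or with the default language wins.
    """
    for lang, text in titles:
        if default_language is None or lang in ("und", default_language):
            return text
    return f"_col.{index}"


titles = [("de", "Titel"), ("en", "Title")]
print(column_name(titles, 1))                  # no @language: first title
print(column_name(titles, 1, "en"))            # first "und" or "en" title
print(column_name([("de", "Titel")], 3, "en")) # nothing matches: fallback
```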

> At least in my head this approach works well... Whether the property value is in line with the JSON-LD representation is, frankly, beside the point; actually, I may very well imagine that a JSON-LD implementation might decide to do something like that internally! But the _property value_, in the spec, is a conceptual representation only anyway.

The specifics of how this information is represented can be left up to an implementation, as long as the results are correct. Note that we don't have any specific tests for how metadata merging or property values actually work, just for what the resulting RDF or JSON serializations are.

Sounds like we're about done with this.

Gregg

> Ivan
> 
> 
> 
> 
> ----
> Ivan Herman, W3C
> Digital Publishing Activity Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> ORCID ID: http://orcid.org/0000-0003-0782-2704
> 
> 
> 
>
Received on Monday, 2 February 2015 17:43:30 UTC