RE: Questions on the url property in Table annotation and on dialect being a core property

Gregg,

On Thursday, December 10, 2015 7:17 PM, Gregg Kellogg wrote:
> To: Svensson, Lars
> Cc: W3C CSV on the Web Working Group; Jeni Tennison; Ivan Herman
> Subject: Re: Questions on the url property in Table annotation and on dialect
> being a core property
> 
> > On Dec 10, 2015, at 2:01 AM, Svensson, Lars <L.Svensson@dnb.de> wrote:
> >
> > Gregg,
> >
> > On Wednesday, December 09, 2015 6:52 PM, Gregg Kellogg wrote:
> >
> >>> On Dec 9, 2015, at 12:52 AM, Ivan Herman <ivan@w3.org> wrote:
> >>>
> >>> (Cc-ing to Gregg & Jeni, as an additional ping to get their attention…)
> >>>
> >>> Hey Lars,
> >>>
> >>>
> >>>> On 4 Dec 2015, at 12:26, Svensson, Lars <L.Svensson@dnb.de> wrote:
> >>>>

<trim/>

> >> However, you might create a metadata document to respond to from the link
> >> header that is compatible with the CSV that is downloaded. It won’t
> >> validate as being compatible, but it should be usable for generating RDF
> >> or JSON from the result, as long as the column descriptions match those in
> >> the CSV file.
> >
> > Hmm. Why shouldn't it validate? If I read §6 Processing Tables [1] correctly,
> > I can start by downloading a data file and then rely on the application
> > finding the proper metadata. And as long as the metadata matches the table,
> > it should validate. Or have I misunderstood something?
> 
> The data model section 6.2 says that processors MUST ensure that metadata
> and tabular data file are compatible, as defined in 5.4.3 of the Metadata
> document. The first statement there is that they have equivalent normalized
> url properties. So, if the url properties are not the same, they are not
> considered compatible, and so are not valid. However, this does not necessarily
> mean that processing stops.

Over the week-end I read and re-read section 6.2 and I must admit that I'm now more confused than before...

Prerequisite: A customer has triggered a process that creates a tabular data file with object descriptions. The file contains a header line and then one line per object. The order of the columns is known (the customer cannot change that). The tabular data file is available at a URL that is accessible only to that customer and SHOULD NOT be known to other customers. When customers (or their applications) download the tabular data file, we want to supply additional metadata in a Link header (and possibly through site-wide location configuration).
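
To make the setup concrete, here is a minimal sketch in Python (using the requests library) of how a customer's application might fetch the file and discover the metadata advertised in the Link header. The URL is a made-up placeholder, and I am assuming the link uses rel="describedby" with type application/csvm+json:

import requests

# hypothetical customer-specific URL for the tabular data file
CSV_URL = "https://example.org/customer-4711/objects.csv"

csv_response = requests.get(CSV_URL, headers={"Accept": "text/csv"})

# requests exposes the parsed Link header as response.links, keyed by rel;
# for CSVW metadata the relation is "describedby"
metadata_link = csv_response.links.get("describedby", {})
print(metadata_link.get("url"), metadata_link.get("type"))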

Start processing (§6.1):
1) Retrieve the tabular file, retrieve the first metadata file (referenced through the Link header). This is what §6.1 calls FM. Continue as if FM was supplied metadata.
2) Treat FM as if it were user-supplied metadata and rename it to UM
3) Normalise UM. This includes resolving all URLs against the base URL (if supplied). This means that if the metadata contains no URL (or if the url is null), the value of the url is set to either the url of the metadata document or to the base url.
4) "For each table (TM) in UM in order, create one or more annotated tables" (Step 3 in the process).
In order to do this, I need to find the tables referenced in UM and process each table. In the UM, the URL identifying my tabular data file will be equivalent to the URL of the metadata file or the base URL (see 3, above). 
Even though it's not explicit in the text, it seems to be expected that the processor, when processing the metadata, dereferences the URLs pointing to the tabular data files.

In short:
1) download the tabular data file
2) download the metadata for the tabular data file
3) from the URLs in the metadata, download the tabular data file _again_.
4) proceed with processing.
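
To spell that out, here is a rough sketch of that reading in Python (requests; hypothetical URL, metadata simplified to a single table description). The point is the second fetch in step 3, which is driven purely by the normalized url in the metadata:

import requests
from urllib.parse import urljoin

CSV_URL = "https://example.org/customer-4711/objects.csv"  # hypothetical

# 1) download the tabular data file
csv_response = requests.get(CSV_URL, headers={"Accept": "text/csv"})

# 2) download the metadata referenced in the Link header (FM, treated as UM)
metadata_url = urljoin(CSV_URL, csv_response.links["describedby"]["url"])
um = requests.get(metadata_url).json()

# normalization: a missing or null url resolves against the base /
# metadata document URL
table_url = urljoin(metadata_url, um.get("url") or "")

# 3) steps 3.1-3.6 then dereference the tabular data file identified by
#    the normalized url, i.e. it is downloaded a second time
csv_again = requests.get(table_url, headers={"Accept": "text/csv"})

# 4) proceed with processing csv_again ...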

My processing model would rather be:
1) download the tabular data file
2) download the metadata for the tabular data file
3) start processing the downloaded tabular data file (using the normalised URLs in the metadata as identifiers only), executing steps 3.1-3.4.
4) when executing step 3.5, verify that the data is compatible. Here a validator will raise an error, since it cannot guarantee that the metadata actually fits the tabular data file (and that could be fine).
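
For comparison, a sketch of that alternative (same hypothetical URLs as above): the body that has already been downloaded is processed directly, and the normalized url from the metadata only serves as an identifier and as input to the compatibility check in step 3.5.

import csv
import io
import requests
from urllib.parse import urljoin

CSV_URL = "https://example.org/customer-4711/objects.csv"  # hypothetical

# 1) + 2) download the tabular data file and its metadata
csv_response = requests.get(CSV_URL, headers={"Accept": "text/csv"})
metadata_url = urljoin(CSV_URL, csv_response.links["describedby"]["url"])
um = requests.get(metadata_url).json()

# 3) steps 3.1-3.4: process the body we already have, no second fetch
rows = list(csv.reader(io.StringIO(csv_response.text)))
header, data_rows = rows[0], rows[1:]

# step 3.5: compatibility (5.4.3) requires equivalent normalized url
# properties; here the url in the metadata normalizes to the metadata
# document's URL rather than to CSV_URL, so a validator would complain
normalized_table_url = urljoin(metadata_url, um.get("url") or "")
if normalized_table_url != CSV_URL:
    print("warning: metadata url %s does not match tabular data url %s"
          % (normalized_table_url, CSV_URL))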

Looking at it again, I propose the following change to the processing model:

Current text:
[[
If processing starts with a tabular data file, implementations:
[...]
3)    Proceed as if the process starts with FM.
]]

Proposed text:
[[
If processing starts with a tabular data file, implementations:
[...]
3)    Normalize UM using the process defined in Normalization in [tabular-metadata], coercing UM into a table group description, if necessary.
4) For each table (TM) in the tabular data file, create one or more annotated tables as if the process starts with a metadata file, steps 3.1-3.6. 
]]

> >>>> This boils down to the following question(s):
> >>>>
> >>>> 1) Is my understanding of the use of the url property in the table
> metadata
> >> correct?
> >>>> 2) If so, can I solve it by simply setting it to null?
> >>>
> >>> That is my reading and, I think, that was our intention. If set to null, that
> >> means that the implementation makes the 'pairing' between the metadata
> and
> >> the data itself which, as far as I can see, is exactly what you do.

If this was your intention, you should add some text that makes this clear to implementors.

> >> As I said, I don’t think so. If it’s set to null, it is interpreted as an
> >> empty string, which is a relative URL. However, this should just issue a
> >> warning. Note, however, that if the CSV and the metadata are both available
> >> at the same URL subject to content-negotiation, this would be valid.
> >
> > I'm afraid you're losing me here: In the specification of the tabular data
> > model, particularly §5 (locating metadata) [2], content negotiation is not
> > listed as a method to find metadata for a csv file.
> 
> You’re right that we don’t say anything explicit about this. However, the Link is
> typed application/csvm+json, which would be reasonable for a client to use
> when retrieving the resource referenced by the link, but I can find no spec
> which recommends this.

Then it should be explicitly mentioned somewhere.
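
For what it's worth, the client behaviour you describe could look roughly like this (Python again, hypothetical conneg-enabled URL, and explicitly nothing the current spec requires):

import requests
from urllib.parse import urljoin

CSV_URL = "https://example.org/customer-4711/objects"  # hypothetical, conneg-enabled

csv_response = requests.get(CSV_URL, headers={"Accept": "text/csv"})
link = csv_response.links.get("describedby", {})

# if the Link target is the same URL as the CSV itself, re-request that URL
# with the media type given in the Link header to obtain the metadata
if urljoin(CSV_URL, link.get("url", "")) == CSV_URL:
    metadata = requests.get(
        CSV_URL,
        headers={"Accept": link.get("type", "application/csvm+json")},
    ).json()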

> >> But, if you’re downloading
> >> the CSV and it has no location, there would be no way for the metadata to
> >> locate it anyway.
> >
> > One option for the user could be to copy the download URL and paste it into
> > an application that downloads the file, locates the metadata through the
> > methods listed in §5 and processes it.
> 
> Yes, if you use the URL of the original downloaded file as the URL when
> comparing with the metadata. You might also imagine an unspecified provision
> that if the CSV file has no URL, then comparing it with the “url” property of the
> metadata makes no sense, but the spec does not say anything about this.

Would it be possible to add such text?

> >> One thing which would be good within the spec, if not within an existing
> >> implementation, is to set the Location or Content-Location header to be the
> >> same as the metadata. A client which is aware of this would see that the
> >> location of the CSV was the same as the metadata referenced using the Link
> >> header and consider it compatible.
> >
> > From my understanding this would break the http contract for Location and
> > Content-Location. In RFC 7231, §3.1.4.2 Content-Location [3] specifies:
> > [[
> > The "Content-Location" header field references a URI that can be used
> > as an identifier for a specific resource corresponding to the
> > representation in this message's payload.
> > ]]
> > which to me means that it references the location of the resource I just
> > downloaded, not of its metadata.
> 
> The proper header would be Content-Location to describe the URL of the
> resource downloaded; the Link header references the metadata using its own
> URL. Based on questionable reasoning above, they might share the same URL,
> but if one were retrieved using Accept: text/csv, and the other using Accept:
> application/csvm+json (or application/ld+json), they could result in
> different representations. I’m certainly reaching here, but I don’t think it’s
> inconsistent with the spec.
> 
> > For the Location header, RFC 7231, §7.1.2 [4] only mentions its use in the
> > context of 201 (Created) and 3XX (Redirection) response codes.
> >
> > I'm not saying it won't work, but it would at least help me if you could
> > elaborate a bit on how this would work and also point me to the appropriate
> > text of the tabular data specifications.
> 
> As I said, if you download the URL and get a CSV with a link header with type
> application/csvm+json at the same URL, you might reasonably use that
> content-type when downloading the metadata and so get a different
> representation. This is not specified, but is also not inconsistent with the
> spec. If we had considered this option, we might have added a suggestion that
> the metadata be retrieved using application/csvm+json and application/ld+json.

I think that this stretches content negotiation a bit. When I negotiate, I negotiate for representations of _the same_ resource (for any definition of "same"). Saying that a tabular data file and its metadata are "the same" resource sounds a bit far-fetched.
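
Just to spell out what the Location/Content-Location idea further up would amount to on the client side (again a hypothetical URL, and nothing in the spec defines this behaviour): the client would compare the Content-Location of the CSV response with the Link target and treat the two as representations of the same resource only if they coincide.

import requests
from urllib.parse import urljoin

CSV_URL = "https://example.org/customer-4711/objects"  # hypothetical

csv_response = requests.get(CSV_URL, headers={"Accept": "text/csv"})
link = csv_response.links.get("describedby")

if link:
    content_location = urljoin(
        CSV_URL, csv_response.headers.get("Content-Location", CSV_URL))
    metadata_url = urljoin(CSV_URL, link["url"])
    # consider metadata and tabular data two representations of the same
    # resource only if the two locations coincide
    same_resource = (content_location == metadata_url)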

> I might also point you to a draft Note on the use of HTML for containing both
> metadata and tabular data: http://w3c.github.io/csvw/html-note/. If data were
> encoded in HTML tables, then this would provide you a mechanism for
> including both metadata and tabular data in the same resource and be another
> way of avoiding your issue. When published as a WG Note, it does not have the
> force of recommendation, but is expected to be quite popular. I’d certainly like
> to know if this might satisfy some of your issues.

Yes, I see where this could help. In my particular case, however, the customers explicitly signed up for CSV...

Thanks,

Lars
