Re: Question about implementing triple pattern fragments client

> On Jan 16, 2015, at 12:43 AM, Ruben Verborgh <ruben.verborgh@ugent.be> wrote:
> 
>> 1. Provide the client with the endpoint URL `http://fragments.dbpedia.org/2014/en` (known a priori)
> 
> Note that, with triple pattern fragments,
> there isn't something like “the” endpoint URL.
> Each fragment can serve as a starting point.
> 
> For instance, these fragments could be starting points
> of the same dataset:
> - http://fragments.dbpedia.org/2014/en?subject=&predicate=http%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23type&object=http%3A%2F%2Fdbpedia.org%2Fontology%2FArtist
> - http://fragments.dbpedia.org/2014/en?subject=&predicate=&object=http%3A%2F%2Fdbpedia.org%2Fresource%2FBelgium
> - http://fragments.dbpedia.org/2014/en?subject=&predicate=http%3A%2F%2Fdbpedia.org%2Fontology%2FbirthPlace&object=
> 
> This is why my implementation calls it a "start fragment"
> rather than an endpoint.

Yes, I understand that. However, simply for usability reasons, won’t there often be a URI that you pass around as a preferred entry fragment (e.g. the fragment with the shortest URI)? Or do you honestly believe that any time somebody wants to access a new TPF server they’ll use an arbitrary/random fragment as the entry point?

I obviously come at this from a SPARQL perspective, but as I began to look at the TPF spec and the DBPedia fragments server, I had to do much more mucking about with curl and rapper than I had expected just to figure out what the URI template was. I had expected the front page to point me explicitly at an entry point URI; instead, one is silently linked from the “Welcome to DBPedia!” text near the top (and that link differs from the one on the “DBpedia – Linked Data Fragments” page header text). The text "This page is the entry point of the Triple Pattern Fragments interface” further confused me, as “this page” seemed to be an entry point into a small metadata dataset, not an entry point into the DBPedia dataset.
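To make concrete what the discovery step involves: here's a toy sketch of pulling the URI template out of a fragment's metadata. The Turtle snippet is a hypothetical, trimmed stand-in for what a server might return; a real client should use a proper RDF parser (which is essentially what I was doing with rapper), not the naive regex scan below.

```python
import re

# Hypothetical, trimmed version of a fragment's metadata (Turtle).
# A real client should parse this as RDF; the regex scan is only a
# sketch of the discovery step a client has to perform.
metadata = '''
_:pattern <http://www.w3.org/ns/hydra/core#template>
    "http://fragments.dbpedia.org/2014/en{?subject,predicate,object}" .
'''

match = re.search(r'hydra/core#template>\s+"([^"]+)"', metadata)
template = match.group(1) if match else None
print(template)
# prints http://fragments.dbpedia.org/2014/en{?subject,predicate,object}
```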


> 
>> my concern is that the URL dereferenced in step 2 may end up being the same as the URL for the unbounded triple pattern `{ ?s ?p ?o }`.
> 
> This could be, and seems logical to humans,
> but doesn't have to be, as illustrated above.
> 
>> This seems to be the case for the DBPedia endpoint, and while the DBPedia endpoint pages its data, the TPF spec is [pretty clear](http://www.hydra-cg.com/spec/latest/triple-pattern-fragments/#paging) about paging being optional. So is it the case that all TPF clients need to be concerned about the possibility of requesting the entire dataset when all they are after is the hypermedia controls?
> 
> That is indeed a valid concern. However,
> - The start fragment can be chosen arbitrarily
>  (But yes, its URL needs to be obtained from somewhere.
>   However, the server could generate URLs independently of the controls.)
> - While such a fragment might be large, it can be parsed in a streaming way,
>  and the client can stop retrieving it as soon as the controls have arrived.

Parsing the result in a streaming fashion doesn’t help much if the response payload contains an entire, huge dataset, and the hypermedia controls are only appended to the end of the dataset triples. FWIW, appending the hypermedia controls to the data seems to be exactly what the DBPedia server does.
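To illustrate why control placement matters, here's a toy model (the 'C'/'D' encoding is mine, not from the spec) of a client that stops consuming the stream once it has seen the controls:

```python
def read_until_controls(lines):
    """Consume a streamed fragment until a control statement is seen.
    Toy model: 'C' = a control line, 'D' = a data triple line.
    Returns how many lines had to be consumed."""
    consumed = 0
    for line in lines:
        consumed += 1
        if line == "C":
            break
    return consumed

# Controls first: the client can stop almost immediately.
print(read_until_controls(["C", "D", "D", "D"]))   # 1
# Controls appended after the data: the whole dataset streams by first.
print(read_until_controls(["D"] * 1000 + ["C"]))   # 1001
```

With controls appended last, "stop as soon as the controls have arrived" still means reading every data triple first.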

> - If the fragment is large, the server itself strongly benefits from pagination.
>  It is thus not unreasonable to assume the server will page content.

It might benefit a server to choose to page, but that isn’t the concern I asked about. As someone thinking about implementing a *client*, I can’t rely on every server doing the sensible thing. If a server doesn’t page, either by choice or through misconfiguration, I’m left with a client that might accidentally request massive resources unnecessarily. Furthermore, if that does happen, and downloading all of the data doesn’t hit resource limits on the client or server, the TPF querying algorithms you’ve proposed never seem to consider that it might be best at that point to abandon the TPF query algorithm and just run the entire query locally. I think this issue needs to be discussed if the spec is going to leave the choice of paging entirely to the implementation and/or the server configuration.
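The kind of client-side guard I have in mind might look like this sketch (the threshold, the `paged`/`count` names, and the decision rule are all my own assumptions, not anything from the spec or your algorithms):

```python
MAX_UNPAGED_TRIPLES = 10_000  # hypothetical client-side safety limit

def plan_execution(metadata):
    """Decide between the TPF algorithm and local evaluation.
    `metadata` holds the fragment's advertised triple count and whether
    paging controls (e.g. hydra:next) were present; both field names
    are illustrative, not from the spec."""
    if metadata["paged"]:
        return "tpf"    # pages bound the size of each response
    if metadata["count"] <= MAX_UNPAGED_TRIPLES:
        return "tpf"    # unpaged, but small enough to fetch whole
    # Unpaged and huge: repeatedly fetching per-pattern fragments may
    # cost more than downloading the data once and querying locally.
    return "local"

print(plan_execution({"paged": False, "count": 5_000_000}))  # local
```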


>> 1. Suggest ("MUST" or "SHOULD") that the public endpoint URL that is used as the entry point to a TPF server when no other URL is known be different from the URL of the unbound triple pattern fragment.
> 
> As there is no dedicated endpoint URL, we cannot suggest it exactly like that.
> However, it might be possible to suggest that the URL of some small fragment is communicated.
> (It could even be empty.)

OK. Are you opposed to the idea of a separate entry point URL that contains the hypermedia controls but isn’t a fragment? Or are you suggesting that this “small [empty] fragment” could explicitly be an entry point URL that necessarily contained no fragment data? I think having an explicitly designated entry point would be good for usability generally, and especially for people already familiar with things like SPARQL.

> 
>> 2. Either require paging ("MUST" or at least "SHOULD") or require that there be a separate URL to retrieve estimated counts for a fragment.
> 
> I'm less in favor of this, as it complicates the API,
> but foremost, because the resulting media type would be less useful.
> If we compare this to the human Web, we wouldn't have a page
> that has only a number on it.

It would complicate the API, but without that or some other complication (like the one you discuss below), I don’t think TPFs can be used effectively for query answering in the way much of your work suggests. They can obviously be used when certain assumptions hold, but those assumptions (like requiring paging) don’t seem to be explicit.

> On a side note, we are currently benchmarking the influence
> of page size on clients (and caches) of triple pattern fragments.
> One particular case we consider, is that the first page is empty,
> and that the second page contains all triples.
> This would lead to the effect you describe.

Yes, that would address this issue. If nothing else, I think algorithms and pseudocode that discuss using the triple counts for query execution need to address the potential case where data is not paged. If updates to TPF are made that suggest something like an empty first page, I think that would resolve this concern.


> 
>>> The use of the requested URL in the representation is especially important in
>>> the common case in which fragments or pages are accessible through (subtly)
>>> different URLs, such as `http://example.org/example?s=a&page=3` and
>>> `http://example.org/example?page=3&s=a`.
>> 
>> If the server provides a URI template for this resource, presumably something like `http://example.org/example{?s,p,o}`, why wouldn't a request with the URI `http://example.org/example?page=3&s=a` be invalid?
> 
> Sure, it would be an invalid expansion of the template!
> But the server is still allowed to serve a fragment at this URL
> (yet this behavior is entirely server-dependent).
> Maybe “a” is part of another set of hypermedia controls
> that extends triple pattern fragments; maybe it's something else.
> 
> The point is that, as a client, you shouldn't make any assumptions.
> Indeed, you are only allowed to expand the template with s/p/o,
> but the server has more freedom.
> 
> To address this issue, I would add the above explanation
> to the specification document, especially the “invalid” remark.

I’m afraid I wasn’t very clear about my concern here. The NOTE in section 3.6 talks about what the server should do when it receives various requests from a client. It gives two example URLs that seem like they should reference the same resource, the first with "?s=a&page=3” and the second with "?page=3&s=a”. The text doesn’t indicate where those URLs came from. We’re left to assume they originated with the server, but intuitively the server should have generated only one of them (unless the server implementation is order-agnostic with respect to query parameters). My concern is with what a server is required to implement when the requested URL *didn’t* originate with the server, but was simply constructed by the client.
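To spell out the ambiguity: the two URLs are distinct as opaque identifiers, yet carry the same query parameters, and only one parameter ordering can fall out of expanding a template like `{?s,p,o}`. A quick sketch with Python's standard library:

```python
from urllib.parse import urlsplit, parse_qs

a = "http://example.org/example?s=a&page=3"
b = "http://example.org/example?page=3&s=a"

# As opaque identifiers the two URLs are distinct...
print(a == b)  # False

# ...but their query parameters are equivalent:
print(parse_qs(urlsplit(a).query) == parse_qs(urlsplit(b).query))  # True

# An RFC 6570 expansion of a single template produces parameters in
# template order only, so at most one of these orderings could have
# been generated by the server's own template.
```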

thanks,
.greg

Received on Tuesday, 20 January 2015 07:50:02 UTC