Re: QUERY Verb Proposal from henry.story@bblfish.net on 2015-01-20 (public-ldp-wg@w3.org from January 2015)

From: <henry.story@bblfish.net>
Date: Tue, 20 Jan 2015 16:16:31 +0100
To: Sandro Hawke <sandro@w3.org>
Cc: Yves Lafon <ylafon@w3.org>, ashok malhotra <ashok.malhotra@oracle.com>, public-ldp-wg@w3.org
Message-Id: <333E9063-A610-4998-90D8-F161827B0FA3@bblfish.net>
> On 20 Jan 2015, at 15:43, Sandro Hawke <sandro@w3.org> wrote:
> 
> On 01/20/2015 09:00 AM, henry.story@bblfish.net wrote:
>>> On 20 Jan 2015, at 14:22, Yves Lafon <ylafon@w3.org> wrote:
>>> 
>>> On Tue, 20 Jan 2015, henry.story@bblfish.net wrote:
>>> 
>>>>> One of the reasons the HTTP WG is very unlikely to standardize this is that there's so little technical advantage to doing this with a new verb (at least as far as I can see).  The main reasons would be queries > 2k, but your saved queries solve that, and allowing intermediate nodes to understand and cache based on query semantics, ... and MAYBE the Get option would allow that.
>>>> Some of the disadvantages of your approach I can think of at present:
>>>> 
>>>> • Queries are limited to < 2k
>>> Source?
>>> 
>>> http://tools.ietf.org/html/rfc7230#section-3.1.1
>>> <<
>>>  HTTP does not place a predefined limit on the length of a
>>>  request-line, as described in Section 2.5.  A server that receives a
>>>  method longer than any that it implements SHOULD respond with a 501
>>>  (Not Implemented) status code.  A server that receives a request-target
>>>  longer than any URI it wishes to parse MUST respond with a 414 (URI Too
>>>  Long) status code (see Section 6.5.12 of [RFC7231]).
>>> 
>>>  Various ad hoc limitations on request-line length are found in
>>>  practice.  It is RECOMMENDED that all HTTP senders and recipients
>>>  support, at a minimum, request-line lengths of 8000 octets.
>> Things may have changed. When I worked at AltaVista in 2001 there was a
>> limit of 2k for URLs due to old proxies, etc. Still, there is a limit as
>> indicated there, and there has to be one, or else denial of service attacks
>> through the creation of infinitely long URIs would be all too easy. (I broke
>> a server once by just sending it infinitely long headers, via a shell
>> script :-)
>> 
>> Consider also that you may be using a large number of ontologies, each
>> with a prefix declaration that you then have to encode and paste onto your
>> initial query URL. It should be clear that by putting URLs inside of URLs
>> you are going to end up breaking the limit above.
> 
> As we said earlier, queries that are too long to put in the URL can be sent separately to the server as a saved query.   It's not perfect, but it should work fine.

Saved queries and PATCH seem to work nicely. Saved queries and GET on a different
URL just sound like a hack.
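
To make the length concern above concrete, here is a minimal sketch of what happens when a SPARQL query with its PREFIX declarations is percent-encoded into a GET URL. The prefix set, the resource URL http://example.org/card and the parameter name "query" are assumptions chosen for illustration; they are not taken from any proposal in this thread.

```python
from urllib.parse import quote

# Hypothetical prefix declarations; a real query may use more or fewer.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "ldp": "http://www.w3.org/ns/ldp#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/terms/",
}


def build_query(prefixes):
    """Return a small SPARQL query preceded by its PREFIX declarations."""
    header = "\n".join(f"PREFIX {name}: <{iri}>" for name, iri in prefixes.items())
    body = "SELECT ?name WHERE { ?person foaf:name ?name }"
    return header + "\n" + body


def query_url(resource, query):
    """Percent-encode the whole query into one (assumed) 'query' parameter."""
    return resource + "?query=" + quote(query, safe="")


raw = build_query(PREFIXES)
url = query_url("http://example.org/card", raw)

# Every ':', '/', '<', '>', space and newline in the query becomes three
# characters, so the URL grows much faster than the query text itself.
print(len(raw), len(url))
```

Each additional ontology adds one more URL inside the URL, so the encoded form grows with every prefix that is declared.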
> 
> There's an argument to be made that every query should be treated as a saved query anyway.   First-class resources, and all that.     (My current code doesn't have saved queries, but makes running queries be first-class resources, so you can inspect them, abort them, modify them, etc.)
> 
> Maybe the GET-based workaround should REQUIRE using saved queries, so there's little worry about length.   (The query fill-in parameters could still get long, in theory, but I doubt that'll be much of a problem.)
> 
>>>> • URLs are no longer opaque. You can see this by considering the following:
>>>> - if a cache wants to use the query URL to build up a partial representation of
>>>> the original document, it would need to parse the query URL. So we end up with mime
>>>> type information in the URL.
>>> URL templates anyone? <http://tools.ietf.org/html/rfc6570>
> 
> Except that means we're not supposed to pass them via Link headers.
> 
>   URI Templates are not URIs: they do not identify an abstract or
>   physical resource, they are not parsed as URIs, and they should not
>   be used in places where a URI would be expected unless the template
>   expressions will be expanded by a template processor prior to use.
> 
>  -- https://tools.ietf.org/html/rfc6570
> 
> Unless maybe we can ignore that, and do it anyway.
> 
> Maybe there's a hack using Target Attributes.
> 
> Or we just use a new HTTP Header.
> 
> Or we just recognize that using GET instead of QUERY is a hack anyway.
> 
> Actually...  it's not clear how one could use URI Templates anyway, for the stored-query case, since there are two levels of parameters.   The first level has url-of-queried-resource and url-of-stored-query-to-use, and the second level has all the parameters to the stored query.    Maybe there's a way to do this with RFC 6570, but I think it would still involve inventing something new, and if you're going to invent something new, I'm not sure there's much value in using RFC 6570.
> 
>> To mandate that breaks web architecture and is bad for security.
>> What if you want to develop URLs that are as opaque as possible
>> to avoid people reading links being able to determine what it is
>> referring to?
>> This type of thing works for form based queries because the
>> form is generated by the server that then is going to parse the
>> query. If you want an open web where any service can make a query
>> then you need something more generic than that. If you make it as
>> generic as the QUERY verb proposed here, then you end up putting
>> a language into the URL with a mime type as indeed was proposed by
>> Sandro.
>> 
>> URL encoding SPARQL queries is just ugly for any number of reasons.
>> 
>>>> - If the cache sees the query URL but does not know that the original resource
>>>> is pointing to it, then it cannot build up the cache ( and it cannot know this
>>>> without itself doing a GET on the original URL, because otherwise how would it deal
>>>> with lying resources that claim to be partial representations of other URLs? )
>>>> • URL explosion: one ends up with a lot more URLs - and hence resources - than needed,
>>>> with most resources being just partial representations of resources, instead of
>>>> slowly building up complete representations of resources.
>>> Querying something on the web using URIs is hardly new.
>> Here the aim is not to query the Web, as with AltaVista, but to query a
>> resource directly, to get relevant subsets of the resource. It is an interesting
>> question whether on querying an LDPC you can also query its contents.
>> 
>>>> • caching
>>>> - etags don't work the same way on two resources with two URLs as with one
>>>>  and the same URL
>>>> - the same is true with time-to-live etc.
>>>> - A PUT, PATCH, DELETE on the main resource won't tell the cache that it should
>>>>  update all the thousands of other resources that are just views on the
>>>>  original one
>>> Why? This is an implementation detail server-side.
>> Because the cache may not have seen the PUT, PATCH or DELETE. You may have
>> done that using another proxy. eg: one at home and the other at work.
> 
> In summary, I think the big argument for a QUERY verb is that it makes it much more practical to implement caches which understand they are caching RDF, and can short-cut some queries because they have the relevant triples cached.
> 
> Until there's at least one major web infrastructure player who actually wants to do that, it's hard to make a case for standardizing a QUERY verb.

First, QUERY is not limited to RDF; it could also work with other query frameworks such
as XQuery, or some JSON equivalent.
Also, you don't need a major infrastructure player to use it, though having one would
be nice. As I explained, an LDP client needs to
  • have a local cache (e.g. in the browser, or for servers on the server)
  • because of CORS limitations in browsers it is easier for JS code to fetch remote
   resources via the "personal LDP server" that itself fetches the remote resources.
   Such servers can have a lot more memory available to them compared to web browser 
   clients and can also be programmed much more flexibly, enabling the creation of
   new interesting protocols. These servers can end up acting as caches on which QUERY
   requests would be very useful.
  • One can also imagine OS level caches - eg for Semantic Desktop projects
 
So there are many areas where caching and proxying can be useful. This does not only
need to be at the major infrastructure layer.
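
As a rough illustration of such a cache, here is a minimal sketch of a personal-server cache that answers a query itself when it already holds the complete representation of a resource, and forwards the query otherwise. Everything in it is an assumption made for the sketch: triples are plain tuples, a "query" is a single triple pattern with None as a wildcard rather than real SPARQL, and the names QueryAwareCache, store_full, invalidate and fetch_remote are hypothetical.

```python
from typing import Callable, Dict, Optional, Set, Tuple

Triple = Tuple[str, str, str]
# A pattern is a triple in which None stands for "match anything".
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def match(triples: Set[Triple], pattern: Pattern) -> Set[Triple]:
    """Return the triples that agree with every non-wildcard position."""
    return {
        t for t in triples
        if all(p is None or p == v for p, v in zip(pattern, t))
    }


class QueryAwareCache:
    """Hypothetical cache keyed by the URL of the resource being queried."""

    def __init__(self, fetch_remote: Callable[[str, Pattern], Set[Triple]]):
        self._fetch_remote = fetch_remote
        self._complete: Dict[str, Set[Triple]] = {}
        self.remote_calls = 0

    def store_full(self, url: str, triples: Set[Triple]) -> None:
        """Record a complete representation, e.g. after a plain GET."""
        self._complete[url] = set(triples)

    def invalidate(self, url: str) -> None:
        """Drop the representation, e.g. after seeing a PUT, PATCH or DELETE."""
        self._complete.pop(url, None)

    def query(self, url: str, pattern: Pattern) -> Set[Triple]:
        cached = self._complete.get(url)
        if cached is not None:
            return match(cached, pattern)      # answered locally
        self.remote_calls += 1
        return self._fetch_remote(url, pattern)  # forwarded to the origin


# A stand-in for the remote resource, used only for this example.
CARD = "http://example.org/card"
ORIGIN = {CARD: {("ex:me", "foaf:name", "Henry"),
                 ("ex:me", "foaf:knows", "ex:sandro")}}


def fetch_remote(url: str, pattern: Pattern) -> Set[Triple]:
    return match(ORIGIN[url], pattern)


cache = QueryAwareCache(fetch_remote)
name_pattern = (None, "foaf:name", None)
first = cache.query(CARD, name_pattern)    # nothing cached yet: forwarded
cache.store_full(CARD, ORIGIN[CARD])
second = cache.query(CARD, name_pattern)   # complete copy held: local answer
print(first == second, cache.remote_calls)
```

Because the query is addressed to the URL of the resource itself, one cache entry serves every query on it, and a single invalidation after a write is enough.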

What I think we can learn from a QUERY proposal, by discussion with the IETF, is
 1. how we should define such a verb for it to be correct at the HTTP layer
  ( ignoring issues of infrastructural deployment )
 2. how it could tie into LDP elegantly
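
Purely as a starting point for that discussion, here is a sketch of what a QUERY request could look like on the wire. The shape is an assumption, not a definition: it supposes that the query travels in the request body, that its language is named by Content-Type (application/sparql-query is used as the example), and that the request target stays the URL of the resource being queried. The helper name build_query_request is hypothetical.

```python
def build_query_request(host, path, query, media_type="application/sparql-query"):
    """Serialize a hypothetical QUERY request as raw HTTP/1.1 bytes."""
    body = query.encode("utf-8")
    head = (
        f"QUERY {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Content-Type: {media_type}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return head.encode("ascii") + body


sparql = (
    "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
    "SELECT ?name WHERE { ?person foaf:name ?name }"
)
request = build_query_request("example.org", "/card", sparql)
print(request.decode("utf-8"))
```

In this shape the URL carries no query language and no encoded prefixes, so it can stay as opaque as the server wishes, and the length of the query is bounded by body limits rather than by request-line limits.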
 

> 
>       -- Sandro
> 
>>>> - The cache cannot itself respond to queries
>>>>   A cache that is SPARQL aware should be able to respond
>>>>   to a SPARQL query if it has already received the whole representation
>>>>   of the resource (or indeed even a relevant partial representation).
>>>>   This means that a client can send a QUERY to the resource via the cache
>>>>   and the cache should be able to respond as well as the remote resource
>>>> • Access Control
>>>>  Now you have a huge number of URLs referring to resources with exactly the same
>>>>  access control rules as the non query resource, with all that can go wrong, when
>>>>  those resources are not clearly linked to the original
>>>> • The notion of a partial representation of an original resource is much more opaque
>>>> if not lost without the QUERY verb. The system is no longer thinking: "x is a partial
>>>> representation of something bigger, that it would be interesting to have a more complete
>>>> representation of"
>>>> 
>>>> Btw. Do we have a trace of the arguments made in favor of PATCH? Then it would be a case
>>>> of seeing if we can invert some of those arguments to see if we are missing any here.
>>>> 
>>>>> BTW, all my query work these days is on standing queries, not one time queries.  As such, I think you don't actually want the query results to come back like this.   You want to POST to create a Query, and in that query you specify the result stream that the query results should come back on.  And then you GET that stream, which could include results from many different queries.   That's my research hypothesis, at least.
>>>>> 
>>>>>     -- Sandro
>>>>> 
>>>>>>> Assume the HTTP WG will say no for the first several years, after which maybe you can start to transition from GET to QUERY.
>>>>>>> 
>>>>>>> Alternatively, resources can signal exactly which versions of the QUERY spec they implement, and the QUERY operation can include a parameter saying which version of the query spec is to be used. But this won't give you caching like GET.   So better to just use that signaling for constructing a GET URL.
>>>>>> Gimme a little more to help me understand how this would work.
>>>>>>>     -- Sandro
>>>>>>> 
>>>> Social Web Architect
>>>> http://bblfish.net/
>>>> 
>>>> 
>>>> 
>>> -- 
>>> Baroula que barouleras, au tiéu toujou t'entourneras.
>>> 
>>>       ~~Yves
>>> 
>> Social Web Architect
>> http://bblfish.net/
>> 
>> 
> 

Social Web Architect
http://bblfish.net/
Received on Tuesday, 20 January 2015 15:17:31 UTC