Sandro's review of Graph Store HTTP Protocol from Sandro Hawke on 2011-11-29 (public-rdf-dawg@w3.org from October to December 2011)

From: Sandro Hawke <sandro@w3.org>
Date: Mon, 28 Nov 2011 23:54:17 -0500
To: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <1322542457.14130.304.camel@waldron>
SUMMARY: lots and lots of little suggestions, plus some major confusion
about URIs, a few issues with terminology, a problem with using RFC 2616
for response codes, and one problem with how SD is used. 

All in all it's a very thorough document, and generally says what needs
to be said.  My suggestions are generally about ways to make it clearer
to people (like me) who are not already expert in what it's saying.  As
such, most of my suggestions probably won't matter to the hardcore
implementors who need it most -- they'll generally sort through it and
figure out what it means.   I'm reading it more as someone who might be
trying to figure out what this is all about, and thinking about how to
make it clearer for them.

I reviewed:
        http://www.w3.org/2009/sparql/docs/http-rdf-update/Overview.html
        Revision 1.79  2011/11/16 01:46:26  cogbuji
        
title

        I know we went through a long WG decision process (twice) to
        arrive at the current title, but in actually talking about this
        document to a few people, I find the only way to have it make
        any sense is to use the word "RESTful".    So, I propose we
        amend the title to:
        
        SPARQL 1.1 Graph Store (RESTful) HTTP Protocol

        I'm not in love with that; I just think RESTful is by far the
        most important word in that title.   I could even do without
        SPARQL -- this has almost nothing to do with SPARQL.   If I were
        starting from scratch, I'd probably go with "RDF RESTful API".
        
        (In the REST world, people call these things "RESTful APIs", not
        "protocols", in my experience.)
        
abstract

        
        I think the abstract should explain the relationship between this
        document and SPARQL.    Maybe add a second sentence:
        
                This interface is essentially an alternative to the
                SPARQL 1.1 Query and Update protocols; (nearly)
                everything that can be done through this interface can
                be done using that interface, but for some clients
                and/or for some servers, this interface may be easier to
                implement or work with.
                
        The "nearly" is because I don't think UPDATE gives a way to
        generate a new graph URI as POST-to-the-dataset-URI SHOULD.
        Trivial.
        
1 introduction

        As with the abstract, I think a little more needs to be said at
        the start about how this relates to the rest of SPARQL,
        including perhaps explaining why this is even considered part of
        SPARQL.
        
        maybe s/self-descriptive/self-describing/
        
        I found the paragraph beginning, "It emphasizes..." pretty hard
        to make sense of.  If it's important to have this kind of
        argument about how this is RESTful it would probably be clearer
        to put it into the numbered list it follows.   In item 1, we can
        talk about constraint 1 and how it's met, etc.
        
        s/an SPARQL Update equivalent/a SPARQL Update equivalent/
        
        After the link to the XML Results format, we should have a link
        to the JSON results format,
        http://www.w3.org/TR/sparql11-results-json/
        
2 terminology

        "Resource - A network-accessible data object or service
        identified by an IRI, as defined in [RFC2616]."    But this
        isn't the definition RDF uses.   RDF-MT says "no assumptions are
        made here about the nature of resources; 'resource' is treated
        here as synonymous with 'entity', i.e. as a generic term for
        anything in the universe of discourse."    Perhaps the best we
        can do is acknowledge this difference and then say it doesn't
        matter for this spec, or that we're using the RFC2616 def'n in
        this spec, unless we say "RDF Resource".
        
        "RDF document - A serialization of an RDF Graph into a concrete
        syntax."  Maybe add "typically an RDF/XML or Turtle document."
        
        "Graph IRI" - I wish the definition used the word "dataset";
        without it, it's not stated what the relationship is between the
        IRI and the graph in the underlying stuff.   We're left to
        assume it is the IRI-graph pairing in the dataset.
        
        "RDF Graph content".   I can't figure out how this is different
        from "RDF Graph", or "Named Graph" (as the document uses the
        term elsewhere, meaning the second element in a graph-naming
        pair).    So I'm confused by both the term (why not use "RDF
        Graph"?), and the definition.   Sorry.  :-(
        
        "Implementations of this protocol are HTTP/1.1 servers [RFC2616]
        MUST interpret request messages..."   I think there's a word
        missing here.  I can't parse the sentence.
        
        Also, "Implementations of this protocol" doesn't seem quite
        right; clients implement this protocol, too.  I think we
        mean "Servers implementing this protocol", or "conforming
        servers", or "SPARQL 1.1 Graph Store HTTP Protocol Servers".
        Maybe we can introduce a term for these servers?   "RESTful
        Graph Stores" comes to mind.
            
        (Which makes me think we're missing a conformance clause, as per
        http://www.w3.org/TR/qaframe-spec/#specifying-conformance
        ...  I'm not sure it matters.)

3 protocol model

        s/DOS/Denial-of-Service/  (best to avoid acronyms)
        
4 graph identification
        
        Before we get into that, let's talk about URIs.   I felt like I
        was dumped into the middle of a conversation, missing all the
        context.   By the end of the document, I think I had it mapped
        out.   Did I get it right?   
        
        - There is a Service URI.  This is used for:
        
                - constructing indirect reference URIs, which is
                necessary if the server doesn't serve all the Graph
                IRIs, or if we want to access the default graph
                - obtaining the Service Description, which we need in
                order to find out the Dataset URI (see below).
                
                
                Does this have anything to do with a SPARQL service
                endpoint address?   The fact that I can get an SD by
                doing a GET on it was my only clue that it probably is,
                in fact, the same thing.   Can we be quite explicit
                about this, even if it's just to say RESTful Graph
                Stores and SPARQL service endpoints MAY use the same
                address?  I know it gets complicated with ER, since one
                dataset may have multiple EPs.
        
        - There is a Dataset URI.  This is used for:
        
                - asking the service to invent a new Graph IRI
                
                When we have a multigraph syntax (eg TriG) standardized,
                it seems clear to me that a GET of the Dataset URI would
                return a complete dump of the dataset, and a PUT would
                replace the dataset.   Can we say something
                forward-looking like this?  I think so.   Without this,
                the Dataset URI seems pretty out-of-place here, used
                only for this invent-a-new-Graph-IRI function.   Maybe,
                in any case that function could be done, instead, via
                POST to the Service URI?  (That would be distinguished
                from a QUERY or UPDATE operation by the mime type of the
                POST.)
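
        A minimal sketch of that mime-type dispatch, assuming a
        hypothetical classify_post helper.  The media types are the
        registered ones for SPARQL and for common RDF syntaxes; the
        routing and the returned labels are illustrative only and are
        not something the draft specifies.

```python
# Hypothetical sketch: route a POST to the Service URI by its Content-Type.
# The function name and the returned labels are illustrative assumptions.

SPARQL_TYPES = {
    "application/sparql-query": "query",
    "application/sparql-update": "update",
    "application/x-www-form-urlencoded": "query-or-update",
}

RDF_TYPES = {"application/rdf+xml", "text/turtle"}


def classify_post(content_type: str) -> str:
    """Return which operation a POST body of this media type would trigger."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in SPARQL_TYPES:
        return SPARQL_TYPES[media_type]
    if media_type in RDF_TYPES:
        # An RDF payload asks the store to invent a new Graph IRI for it.
        return "create-graph"
    return "unsupported"
```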
        
        Then we get into the Direct and Indirect identification URIs.
        I suggest we start with some explanation of the two above URIs,
        then, before we get into 4.1, we give a little overview of these
        two, like:
        
                For a client to use this protocol to access individual
                graphs in the graph store, it needs a URL for each
                graph.   Inside the store, each graph (except the
                default graph) is labeled with a Graph IRI.  In some
                cases ("Direct Graph Identification"), those Graph IRIs
                can be used (possibly after IRI-to-URI conversion) as
                the URLs for HTTP access.   In other cases ("Indirect
                Graph Identification"), the Service URI is used to
                construct URLs for each graph. 
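
        A minimal sketch of the indirect case, assuming the "graph" and
        "default" query parameters used by the draft under review and a
        Service URI that has no query component of its own.  The
        function name and example URIs are hypothetical.

```python
from typing import Optional
from urllib.parse import quote


def indirect_graph_url(service_uri: str, graph_iri: Optional[str]) -> str:
    """Build the URL a client would use to reach a graph via the Service URI."""
    if graph_iri is None:
        # The default graph has no Graph IRI, so it gets a bare flag.
        return f"{service_uri}?default"
    # Percent-encode every reserved and non-ASCII character so the whole
    # Graph IRI travels as a single query-parameter value.
    return f"{service_uri}?graph={quote(graph_iri, safe='')}"
```

        For example, indirect_graph_url("http://example.org/store",
        "http://example.com/g1") yields
        http://example.org/store?graph=http%3A%2F%2Fexample.com%2Fg1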
                
        Which reminds me -- what happens if someone uses those Indirect
        graph URLs as Graph IRIs in the same store?   :-(    Maybe
        there's nothing helpful we can say about that.   
        
        I'm not sure if you realize it, but it's quite possible these
        indirect graph URLs will see a great deal of life outside of
        this protocol -- for provenance and other metadata.  They
        provide a way to refer to a Graph Container inside a SPARQL
        server.   (This was discussed at the last RDF F2F.)   To help
        support this usage, it's probably worthwhile to strongly push
        for Service URIs and SPARQL endpoint addresses to be the same.

4.1 direct graph identification
        
        I think the first sentence needs to be qualified with a
        "sometimes".  I'd start this section with a list of the
        situations in which one can use Direct.
        
        "Intuitively, the set of interpretations that satisfy [RDF-MT]
        the RDF graph that the RDF document is a serialization of can be
        thought of as this RDF graph content."    uuuuuuummmm what?
        Give me an example?    Or something?  I can't make sense of
        this.   The "Graph Content" is a set of interpretations?
        
        The layout of the diagram seems off.   There's a computer
        labeling the arrow.  I would expect one computer at each end of
        the arrow.  Plus it's got the MT stuff in it, which I don't see
        the reason for.   If you want I can try to draw the diagram as
        I'm picturing it...
        
        Oh, maybe this MT stuff is because of ER...!   If so, can we
        call that out explicitly, and try to hide the complexity
        from people who don't care about it?
        
        "Any server that implements this protocol and receives a request
        URI in this form SHOULD invoke the indicated operation..."
        Instead of "invoke" can we say "perform"?   I first thought
        "invoke" meant it should pass it on to the server for that graph
        (there might be one).
        
        "The embedded URI MUST be an absolute URI and the server MUST
        respond with a 400 Bad Request if it is not."  I think that's
        too strict.  I think there are some bits of the URI grammar that
        folks sometimes violate in their SPARQL graph IRIs.  I know when
        I've written RDF parsers that checked the syntax of the IRIs, I
        had to turn off that checking when I hit other people's data.
        Maybe things are better now.
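
        A minimal sketch of a lenient version of that check, assuming
        "absolute" is taken to mean only "has a scheme" and nothing else
        about the URI grammar is validated.  The function name is
        hypothetical, and 200 stands in for whatever the operation
        would otherwise return.

```python
from urllib.parse import urlsplit


def status_for_embedded_uri(embedded: str) -> int:
    """Return 400 for a non-absolute embedded URI, else 200 as a placeholder."""
    parts = urlsplit(embedded)
    # Lenient on purpose: require a scheme, but do not reject characters
    # that a strict URI grammar check would flag.
    is_absolute = bool(parts.scheme)
    return 200 if is_absolute else 400
```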
        
        (We should have some test cases about IRI/URI conversion for
        this embedding.)
        
        "As will be discussed later in this document, both HTTP OPTIONS
        and GET requests can be sent to the service and the response to
        such a request is a service description document."   But later
        it's only a SHOULD.  Do we mean that the Service MAY provide RDF
        content, but if it does, that content MUST be an SD?
        
5 graph management

        I'm a little hesitant about privileging RDF/XML like this.  The
        sense I get from the RDF WG is that in a year, 90% of the pure
        RDF content on the Web (ie excluding RDFa and microdata), will
        be Turtle, not RDF/XML.    But, I don't really have a better
        idea.
        
5.1 status codes

        "then the server should respond with a 400 Bad Request."  Is
        that supposed to be all-caps SHOULD?
        
        "should receive a response with a 405 Method Not Allowed".
        Again, is that meant to be all-caps?
        
        Most of the status codes are SHOULD, but two are MUST: 201
        Created, and 404 Not Found -- but only on a DELETE.   I'm
        guessing these are editing errors, and should be SHOULD.  If
        not, an explanation seems warranted.    Personally, I'd lean
        toward all these response codes being MUSTs.   I wonder about
        pulling them out of the text into a separate decision table.   I
        guess there are a lot of meaningful response codes never
        mentioned in this text, though....   I notice that RFC2616
        mostly uses RFC2119 language in talking about what the client is
        to do about these codes; it doesn't say the server
        MUST/SHOULD/MAY send a 404, for instance.     We could do the
        same and just talk about the meaning of response codes. 
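
        A minimal sketch of what such a decision table could look like.
        The four status codes are the ones discussed above; the
        situation labels and the pairing of 201 with PUT and POST are
        assumptions made for illustration, not wording from the draft.

```python
# Hypothetical decision table: (method, situation) -> status code.
# The situation labels are illustrative shorthand, not terms from the draft.
DECISION_TABLE = {
    ("PUT", "graph-created"): 201,
    ("POST", "graph-created"): 201,
    ("DELETE", "graph-missing"): 404,
    ("ANY", "malformed-request"): 400,
    ("ANY", "method-unsupported"): 405,
}


def status_for(method: str, situation: str):
    """Look up a method-specific row first, then fall back to an ANY row."""
    key = (method.upper(), situation)
    if key in DECISION_TABLE:
        return DECISION_TABLE[key]
    # Returns None when the table has no row for this combination.
    return DECISION_TABLE.get(("ANY", situation))
```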
        
5.2 http put

        "A request that uses the HTTP PUT method SHOULD store the
        enclosed RDF payload as RDF graph content."    How about: "A
        request that uses the HTTP PUT method indicates the enclosed RDF
        payload is to be stored as RDF graph content."
        
        The examples here and in 5.4 and 5.5 (but not 5.3) are a little
        confusing in formatting.    The required blank line after the
        HTTP headers is missing, but instead, after a blank line, we
        have the SPARQL text for comparison.   It's clear in 5.3 because
        there is separating text.   Borders around the examples would
        solve the problem as well.
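
        A minimal sketch of the message layout in question, assuming a
        Turtle payload.  The helper name, path and host are
        hypothetical; the point is only the empty line that has to sit
        between the headers and the body.

```python
def build_put_request(path: str, host: str, turtle_body: str) -> bytes:
    """Serialise an HTTP/1.1 PUT request carrying a Turtle payload."""
    body = turtle_body.encode("utf-8")
    head = "\r\n".join([
        f"PUT {path} HTTP/1.1",
        f"Host: {host}",
        "Content-Type: text/turtle; charset=utf-8",
        f"Content-Length: {len(body)}",
    ])
    # The empty line (CRLF CRLF) ends the header section; everything
    # after it is the payload.
    return head.encode("ascii") + b"\r\n\r\n" + body
```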
        
        "Either the request or the encoded URI (embedded in the query
        component) identifies the RDF payload enclosed with the request
        as RDF graph content."   I don't think it does.   Why would one
        be trying to identify the payload...?     Are we trying to say
        something like, "To complete this operation, the Service MUST
        store the given payload as the new content of the RDF graph
        container labeled with the given Graph IRI."  ?  That's more in
        the style of the current RDF WG discussion, and less in the
        style of the rest of this document; I'm not sure how to write
        it in the existing style.
        
        "The server MUST NOT attempt to apply the request to some other
        resource."  Do we really need to say this?   It kind of opens
        the door, via "the exception proves the rule", to all sorts of
        other crazy behavior we didn't explicitly rule out.    Maybe we
        can give some example as to why this rule isn't obvious?

        "Developers should refer to [SPARQL-UPDATE] for the specifics of
        how to handle empty graphs.  In particular, if the request body
        is empty and there is sufficient authorization to create a new
        named graph with an IRI of that indicated by the request URI,
        then an empty graph would need to be created."   That's not how
        I read UPDATE.   I read UPDATE to be saying some
        "implementations" keep track of empty graphs and some don't.  For
        those who don't, a PUT with a new graph IRI and an empty request
        body has no effect.
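
        A minimal sketch of the two behaviours, using a toy in-memory
        store.  The class, its flag and the method name are
        hypothetical and only illustrate the reading of UPDATE given
        above.

```python
class GraphStore:
    """Toy store mapping Graph IRIs to sets of triples."""

    def __init__(self, tracks_empty_graphs: bool):
        self.tracks_empty_graphs = tracks_empty_graphs
        self.graphs = {}

    def put(self, graph_iri, triples):
        """Replace the content of the graph labeled graph_iri."""
        triples = set(triples)
        if not triples and not self.tracks_empty_graphs:
            # This kind of store cannot record an empty graph, so an
            # empty PUT to a new Graph IRI leaves it unchanged.
            self.graphs.pop(graph_iri, None)
            return
        self.graphs[graph_iri] = triples
```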
        
5.3 http delete

        I don't really like the human intervention bit, the fact that
        success can be reported even if it's not done yet, and the
        "inaccessible location" notion, but I see they are just copied
        out of RFC 2616, ... so I'm not sure what to say.
        
5.4 http post

        "Within a service description document for an implementation of
        this protocol, the URI of an instance of the sd:Dataset class is
        understood to be the identifier of the Graph Store."     I'm not
        sure I'd call it the "identifier of the graph store"; see
        earlier text about Dataset IRI.    IMPORTANT POINT: it shouldn't
        be the type sd:Dataset that matters; it should be the
        sd:defaultDatasetDescription arc.   I expect the type is
        optional, but more importantly, we want to be able to merge SDs.
        Where one gets the SD from matters for trust, but its meaning
        isn't supposed to depend on which endpoint one gets it from.
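
        A minimal sketch of why the arc matters once SDs are merged,
        using plain tuples as triples.  The property name is the one
        used in the text above and should be checked against the SD
        vocabulary; the helper names and example URIs are hypothetical.

```python
SD = "http://www.w3.org/ns/sparql-service-description#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def datasets_by_type(triples):
    """Every node typed sd:Dataset, regardless of which service it belongs to."""
    return {s for s, p, o in triples if p == RDF_TYPE and o == SD + "Dataset"}


def datasets_by_arc(triples, service):
    """Only the nodes that this service points at with the dataset arc."""
    arc = SD + "defaultDatasetDescription"
    return {o for s, p, o in triples if s == service and p == arc}


# Two service descriptions from different endpoints, merged into one graph.
merged = [
    ("http://a.example/sparql", SD + "defaultDatasetDescription", "http://a.example/ds"),
    ("http://a.example/ds", RDF_TYPE, SD + "Dataset"),
    ("http://b.example/sparql", SD + "defaultDatasetDescription", "http://b.example/ds"),
    ("http://b.example/ds", RDF_TYPE, SD + "Dataset"),
]
```

        With the merged data, datasets_by_type returns both datasets
        with no way to tell them apart, while datasets_by_arc for
        http://a.example/sparql returns only http://a.example/ds.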
        
other
        
        Shouldn't we say somewhere that all this applies to HTTPS as
        well?  It's obvious, of course.
        
I guess that's about it...      Now, to try to get some REST.  

     -- Sandro


cf. ACTION-563
Received on Tuesday, 29 November 2011 04:54:28 UTC