Re: (Lost in the noise perhaps - so asking again) - Is a trailing slash 'better' than a trailing hash for vocabs namespace IRIs? from Hugh Glaser on 2022-10-11 (semantic-web@w3.org from October 2022)

From: Hugh Glaser <hugh@glasers.org>
Date: Tue, 11 Oct 2022 13:22:15 +0100
To: Pat McBennett <patm@inrupt.com>
Cc: Pierre-Antoine Champin <pierre-antoine@w3.org>, semantic-web@w3.org
Message-Id: <450EDA77-3F08-433D-9CBF-9C448ECB1C43@glasers.org>
Hi Pat,

(I’ve tried sorting out the quotation levels a bit)

I like your proposal.
However, I think that arguing that slash is no less efficient than hash in terms of network traffic is just wrong.
But it is a price that may well be worth paying in general.
After all, I still think that systems don’t resolve vocabs much once they go live.

> On 11 Oct 2022, at 12:54, Pat McBennett <patm@inrupt.com> wrote:
> 
> Hiya Hugh,
> 
> Thanks so much for engaging!
> 
> On Tue, Oct 11, 2022 at 10:23 AM Hugh Glaser <hugh@glasers.org> wrote:
>> Hi,
>> 
>> > On 11 Oct 2022, at 09:49, Pat McBennett <patm@inrupt.com> wrote:
>> > 
>> > Hiya Pierre-Antoine,
>> > 
>> > I'm going to try and reply in-line this time - hopefully GMail won't garble the formatting this time (and I've prefixed my responses with [PMcB] too):
>> > 
>> > On Mon, Oct 10, 2022 at 11:21 AM Pierre-Antoine Champin <pierre-antoine@w3.org> wrote:
>> > Dear Pat,
>> > 
>> > I just wanted to make sure we were on the same page regarding the "best of both worlds" situation, but clearly we are.
>> > 
>> > To answer your question about my points c) and d) below:
>> > when the client retrieves something from http://ex.co/x/, it contains some triples about http://ex.co/x/Z. But when the client wants to know exactly what http://ex.co/x/Z is, how does it determine that it does not need to retrieve http://ex.co/x/Z, because it already retrieved everything there is to know about http://ex.co/x/Z when it retrieved http://ex.co/x/ ?
>> > 
>> > [PMcB] - I'd say it simply queries (i.e., locally, in-memory) the response it got from the server when it de-referenced http://ex.co/x/ (i.e., the "large" representation), to see if that response already contains triples for http://ex.co/x/Z. In other words, I'd expect the Best Practice guidance to state that all vocab term metadata for all vocab terms be returned in that "large" representation, and not just a subset of term metadata, or only the `rdfs:isDefinedBy` triples for vocab terms.
>> > 
>> > And yes, to perform such a query does need a client-side library (like RDF4J or Jena for Java, or RDF-JS for JavaScript, or rdflib for Python, etc.) - but given we're talking about RDF here in the first place, I don't see that as a huge ask. (Caveat: I do know and recognize of course that the current mainstream RDF libraries are very low-level and therefore 'complex', which is why we at Inrupt (and others, such as Ghent University) are actively trying to produce open-source, higher-level, easier-to-use SDKs to make doing such things much, much easier, especially for devs not familiar with RDF at all).
>> > But without getting into that whole client-side library debate(!), I think the fundamental, inevitable answer to your perfectly valid question of "how does it determine...?" has to be "it checks/asks/looks-inside/queries/looks-up the response from the server", and therefore a minimal level of 'client understanding' of server responses will always be necessary.
>> I like slash URIs, but sorry, I can’t like this.
>> The basis of Linked Data is that you find the triples of authority for an entity by resolving the URI, and munching what you get back.
>> 
> [PMcB] Yeah, but that's exactly my whole argument for using slashes :) ! In my view, the IRI of a single vocab term is literally the authority for *just* the triples associated with that one vocab term. I think it's debatable what a vocab's namespace IRI is the authority over (which is why I suggest that it be just a Best Practice guidance that it be considered the authority over the entire vocab's metadata, including *all* the metadata for *all* the terms defined by that vocab too).
> 
> This is also why I say using slashes is 'more correct', as it allows for a clear distinction over what an IRI is an authority over. Hashes conflate that by basically saying the authority over a single vocab term's metadata is actually the entire vocab, instead of the individual term itself. And presumably the vocab namespace IRI is *also* an identifier for that same entire vocab. So this inability of hashes to clearly distinguish these 'authorities' seems to me to be much 'less correct'.
Fetching a hash URI fetches the document at the base URI (the URI without the hash), by definition.
With the simplest caching, this means that no subsequent term look-up needs to do any network activity for all the hash URIs that have the same base URI (or whatever the technical term is).
So that is always known to be the “authority” - no further fetching is required.
This does not apply to a slash URI.
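To make the network difference concrete, here is a minimal sketch of the "simplest caching" described above. It only counts how many distinct documents a client would request; nothing is actually fetched. The function names and the http://ex.co/x#... IRIs are made up for illustration.

```python
from urllib.parse import urldefrag


def document_url(term_iri: str) -> str:
    """Return the URL an HTTP client requests for a term IRI (fragment removed)."""
    return urldefrag(term_iri).url


def count_fetches(term_iris) -> int:
    """Count network fetches needed when every fetched document is cached."""
    cache = set()
    fetches = 0
    for iri in term_iris:
        doc = document_url(iri)
        if doc not in cache:
            cache.add(doc)
            fetches += 1
    return fetches


# Hypothetical term IRIs for the same three terms, hash style and slash style.
hash_terms = ["http://ex.co/x#A", "http://ex.co/x#B", "http://ex.co/x#Z"]
slash_terms = ["http://ex.co/x/A", "http://ex.co/x/B", "http://ex.co/x/Z"]

print(count_fetches(hash_terms))   # 1
print(count_fetches(slash_terms))  # 3
```

The slash count only drops if the client knows some extra convention (such as following rdfs:isDefinedBy), which is the extra work under discussion.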
>  
>> Simply looking in your triplestore to see if you have triples about that entity is not enough.
>> 
> [PMcB] Why not? One of the biggest lightbulb moments I've ever had with RDF was when I first grokked the elegance and beauty of being able to store all my T-Box data (i.e., schema metadata) alongside all the A-Box (instance) data in a single triplestore repository.
> 
> But Ok, let's say we don't load all the T-Box data from all the vocabs we reference (as that's perfectly reasonable too!). Well then, yeah, you're *always* going to need to 'do more work' to discover the metadata for vocab terms. But I think we'd all agree that you can *never* just assume that a lookup of a single vocab term's IRI will always give you all the entire defining-vocab's info too (because obviously none of the slash-based vocabs will do that, like Schema.org won't today, nor will QUDT, nor gist, etc.). So in other words, I think you'll always have to 'do more work', and I think that necessarily means dereferencing any IRI you've got, and then understanding and processing the response you get back from the vocab-hosting server.
> 
> If that IRI was a hash IRI, then you still have to parse that response to extract the term's info, and you also have to parse that response to determine if it also contains any other vocab term metadata too ('cos you certainly can't (or shouldn't!) just assume that it does). And at this point, for hash IRIs, you now know that you can cache that entire server response, and from now on do cache lookups for more terms from that vocab - great.
You have a different process to me, I would say.
When you fetch LD, you keep it (somewhere - could be an RDF store), having parsed the whole document.
When you want to resolve a URI, you see if you have already fetched it, and then don’t bother if you have (caching).
Then, as normal for any query, you query your RDF store for whatever data you may need for that term, so you can use it for a purpose.
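A rough sketch of that process, assuming a toy in-memory store rather than a real triplestore. CachingStore, FAKE_WEB and the injected fetch function are hypothetical names; a real system would parse the RDF returned over HTTP.

```python
from urllib.parse import urldefrag


class CachingStore:
    """Toy store: remembers which documents were fetched and keeps their triples."""

    def __init__(self, fetch):
        self.fetch = fetch        # callable: document URL -> iterable of (s, p, o)
        self.fetched = set()      # document URLs already resolved
        self.triples = set()      # everything parsed so far
        self.network_calls = 0

    def resolve(self, iri):
        """Fetch the document for iri unless it has been fetched before."""
        doc = urldefrag(iri).url
        if doc in self.fetched:
            return                # cache hit: no network activity
        self.network_calls += 1
        self.triples.update(self.fetch(doc))
        self.fetched.add(doc)

    def about(self, iri):
        """Ordinary query step: predicate/object pairs held for a subject."""
        return {(p, o) for (s, p, o) in self.triples if s == iri}


# Stand-in for the network (an assumption for the example).
FAKE_WEB = {
    "http://ex.co/x": [
        ("http://ex.co/x#A", "rdf:type", "rdfs:Class"),
        ("http://ex.co/x#Z", "rdf:type", "rdf:Property"),
    ],
}

store = CachingStore(lambda doc: FAKE_WEB.get(doc, []))
store.resolve("http://ex.co/x#A")
store.resolve("http://ex.co/x#Z")
print(store.network_calls)               # 1
print(store.about("http://ex.co/x#Z"))   # {('rdf:type', 'rdf:Property')}
```

Resolving and querying are kept as separate steps here: the cache answers "have I fetched this?", and the store answers "what do I know about this term?".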
> 
> But the only extra work for a slash-based IRI would be that you have to look for an `rdfs:isDefinedBy` triple, dereference that, and cache that server response for all further vocab term lookups - done. But of course, you only have to do that extra lookup *if* you know you want to retrieve (and cache, presumably) all the info for all the other vocab-defined terms too.
> 
> So again, I think that *potential* extra work (i.e., only do it if you really need it) for slash-based IRIs is well worth the great flexibility it can afford users (and potential users) forever into the future.
>  
>> What happens if some other source has triples with that URI in?
>> rdfs:isDefinedBy might mitigate this to some extent, but even then, why should I think that is any more authoritative than anything else?
>> 
> [PMcB] I'm not sure I follow. Using `rdfs:isDefinedBy` is as authoritative as it's possible to get, as it's metadata asserted on the individual vocab term itself (by definition). But yeah, as I've said before too, I do think providing an `rdfs:isDefinedBy` triple-per-vocab-term should just be a Best Practice *regardless* of this entire slash vs hash discussion (and again, it's just guidance, a recommendation - if you can't, or don't want to, or can't afford the extra T-Box triples - then don't (but just know that you'll be *potentially* hurting some users of your vocab)).
>  
> 
>> Of course, if you give vocabs/ontologies special status, then you can do this sort of thing.
>> But if you are just treating them as the RDF/Linked Data that they are, then you are in trouble saying this.
>> My standard system with caching triplestore etc. would always want to know it had got the resolved URI at some time.
>> 
> [PMcB] I'm not sure I follow this point either. I most certainly agree with treating vocabs as the RDF/Linked Data that they are (as I said above, that was a big lightbulb moment for me!), and I certainly don't think that there's any need to treat vocabs in any way specially. That's why all vocab terms *must*, by definition, have explicit RDF types stating 'what they are' (i.e., they *must* use `rdfs:Class`, or `rdf:Property`, or `owl:NamedIndividual`, or `owl:Class`, etc.). So when I dereference any IRI at all, I should be able to determine if that response contains info on just a single vocab term, or multiple/all vocab terms, or vocab metadata (e.g., an `rdf:type owl:Ontology` triple), or if any term metadata contains `rdfs:isDefinedBy` triples, etc.
> 
> But I suspect I may be missing your point here and in the preceding point!
Possibly :-)
Yes, to Best Practice.
What I mean is that you need to look very carefully where the rdfs:isDefinedBy triple comes from, when you consult your RDF and find one of them.
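One possible way to "look carefully", sketched under the assumption that the store records which document each triple came from (quads rather than triples). The policy shown, trusting an rdfs:isDefinedBy triple only when it came from resolving the term's own URI, is just one illustrative choice, and trusted_definers is a hypothetical helper.

```python
from urllib.parse import urldefrag

RDFS_IS_DEFINED_BY = "http://www.w3.org/2000/01/rdf-schema#isDefinedBy"


def trusted_definers(quads, term_iri):
    """Return isDefinedBy targets asserted by the document the term IRI resolves to.

    Each quad is (subject, predicate, object, source_document_url).
    """
    own_document = urldefrag(term_iri).url
    return {
        o
        for (s, p, o, source) in quads
        if s == term_iri and p == RDFS_IS_DEFINED_BY and source == own_document
    }


# Hypothetical data: the same term described by its own document and by a third party.
quads = [
    ("http://ex.co/x/Z", RDFS_IS_DEFINED_BY, "http://ex.co/x/",
     "http://ex.co/x/Z"),
    ("http://ex.co/x/Z", RDFS_IS_DEFINED_BY, "http://elsewhere.example/v/",
     "http://elsewhere.example/data"),
]

print(trusted_definers(quads, "http://ex.co/x/Z"))  # {'http://ex.co/x/'}
```

The third-party triple is still in the store and still queryable; it is simply not treated as telling you where the definition lives.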

Cheers
Hugh
>  
> 
> This is why I think use cases are needed - slash is great for pre-loaded/engineered systems, but for the properly dynamic aims of the Semantic Web, it will incur extra fetching costs for the terms.
> 
> [PMcB] So yeah, but again, to the 'use-cases are needed' point - we can *never* know all the potential use-cases up front. Even if you create a vocab intended *only* for a narrowly defined set of use-cases, you still can't know or predict how potential future users might *want* to use it. (And again, this is only a guidance - if you really, really want even potential future users to always have rigid expectations from your vocab, then sure, go ahead and use a hash, and just explain to them why - that'd be perfectly fine with me).
> 
> But my main point is that the 'extra fetching' costs can be massively alleviated if the vocab simply follows the Best Practice of providing `rdfs:isDefinedBy` triples, and with just a little bit of extra smarts on the client (but only to handle the cases where you don't already know the vocab's namespace IRI - 'cos if you do, you'd just dereference that and you're done).
> 
> So if you know you want the entire vocab info, and you already know the vocab's namespace IRI, then just dereference that and you have literally zero extra fetching costs (i.e., you get exactly the same response, regardless of slash or hash).
> 
> But *if* all you have is an IRI, and that IRI happens to be a single term's IRI, then after you dereference it and parse it, if it's a slash-based IRI, then the only extra work you need to do is to look for an `rdfs:isDefinedBy` triple and dereference that IRI - that's it.
> 
> For all the extra flexibility and consistency (and the 'more correct-ness', in my view) that comes from using slashes, I think there's only, at worst, a tiny extra cost, and something that our RDF libraries and tools can easily handle for us anyway.
> 
> Cheers,
> 
> Pat.
> 
>  
> 
> Best
> Hugh
> > 
> > One way to achieve this would be to include, in the content of http://ex.co/x/, the triple
> >     <http://ex.co/x/Z> rdfs:isDefinedBy <http://ex.co/x/>.
> > 
> > but again, that is a convention that both the server and the client have to share.
> > 
> > [PMcB] - Yep, exactly! I already make doing that a strongly recommended Best Practice for all vocabs I produce or work with, and so I'd love to see that become a more universally shared convention. But yes, it would just be a Best Practice guidance, one that I'd hope would become more and more widespread over time. For sure, we can't enforce it, but we can point at good examples from major, highly successful vocabs out there today, like QUDT and gist and Schema.org and DPV and ...!
> > 
> 
> 
Received on Tuesday, 11 October 2022 12:22:35 UTC