Re: (Lost in the noise perhaps - so asking again) - Is a trailing slash 'better' than a trailing hash for vocabs namespace IRIs? from Pat McBennett on 2022-10-12 (semantic-web@w3.org from October 2022)

From: Pat McBennett <patm@inrupt.com>
Date: Wed, 12 Oct 2022 01:53:43 +0100
To: Hugh Glaser <hugh@glasers.org>
Cc: Pierre-Antoine Champin <pierre-antoine@w3.org>, semantic-web@w3.org
Message-ID: <CABgQ8mJZX1FTXhUH4XZ5xHirWDwV8RsUD_g5FAKO=CusborGjw@mail.gmail.com>
Hiya Hugh,

On Tue, Oct 11, 2022 at 1:22 PM Hugh Glaser <hugh@glasers.org> wrote:

> Hi Pat,
>
> (I’ve tried sorting out the quotation levels a bit)
>

[PMcB] Thanks!


>
> I like your proposal.
> However, I think that arguing that slash is no less efficient than hash in
> terms of network is just wrong.
>

[PMcB] Well, just to be clear, I never said it was *no* less efficient :) !
What I was trying to say was that in the case of simply dereferencing a
vocab's namespace IRI, *in that case*, it's no less efficient - i.e., in
both cases, slash and hash, you'd expect to get back exactly the same
full-vocab-metadata response in a single HTTP request. So if you don't want
to pay any inefficiency cost, then, if possible, just dereference the
vocab's namespace IRI up-front to get everything you need in one single
HTTP request, and just cache it for all further term lookups. That'll give
you exactly the same efficiency as using hash namespace IRI - but only if
you know the namespace IRI beforehand, and can dereference it up-front.
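
To make that "same efficiency" case concrete, here is a minimal sketch in plain Python (no RDF library). Everything in it is an assumption made for illustration: `FAKE_WEB` is an in-memory stand-in for the network, `CachingClient` is a hypothetical helper, and the `http://ex.co/h#` vocab and the `Y` terms are made up (only `http://ex.co/x/` and `http://ex.co/x/Z` come from earlier in this thread). It counts how many "HTTP requests" are needed when the namespace IRI is known up-front and the response is cached.

```python
from urllib.parse import urldefrag

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Stand-in for the network: document URL -> triples served there.
# All URLs and triples here are hypothetical.
FAKE_WEB = {
    "http://ex.co/h": [  # hash vocab, namespace IRI http://ex.co/h#
        ("http://ex.co/h#Y", RDFS_LABEL, "Y"),
        ("http://ex.co/h#Z", RDFS_LABEL, "Z"),
    ],
    "http://ex.co/x/": [  # slash vocab, namespace IRI http://ex.co/x/
        ("http://ex.co/x/Y", RDFS_LABEL, "Y"),
        ("http://ex.co/x/Z", RDFS_LABEL, "Z"),
    ],
}


class CachingClient:
    """Dereferences IRIs against FAKE_WEB, caching each document once."""

    def __init__(self, web):
        self.web = web
        self.cache = {}    # document URL -> triples
        self.requests = 0  # number of simulated HTTP requests

    def dereference(self, iri):
        # The fragment is never sent to the server, so strip it first.
        doc_url, _fragment = urldefrag(iri)
        if doc_url not in self.cache:
            self.requests += 1
            self.cache[doc_url] = self.web.get(doc_url, [])
        return self.cache[doc_url]

    def describe(self, term_iri, namespace_iri):
        # Fetch (or reuse) the full-vocab document, then filter locally.
        triples = self.dereference(namespace_iri)
        return [t for t in triples if t[0] == term_iri]


hash_client = CachingClient(FAKE_WEB)
hash_client.describe("http://ex.co/h#Y", "http://ex.co/h#")
hash_client.describe("http://ex.co/h#Z", "http://ex.co/h#")

slash_client = CachingClient(FAKE_WEB)
slash_client.describe("http://ex.co/x/Y", "http://ex.co/x/")
slash_client.describe("http://ex.co/x/Z", "http://ex.co/x/")

print(hash_client.requests, slash_client.requests)  # prints "1 1"
```

Under those assumptions both vocabs cost exactly one request, however many terms get looked up afterwards. The sketch assumes the server returns the full-vocab metadata for the namespace IRI, which is the behaviour I'm suggesting as guidance, not something a client can take for granted.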

I accept indeed that it will be *less efficient* in the case of looking up
a single vocab term's IRI from a slash-based vocab, since yes, you need to
first dereference that single term IRI, then parse out (hopefully) a
`rdfs:isDefinedBy` triple, and then you have to dereference the RDF Object
value of that triple to get all the metadata for all the vocab terms. So
yes indeed, in that specific case, using slash is 'less efficient' (i.e.,
it requires a bit more client-side processing and knowledge of the
`rdfs:isDefinedBy` predicate, and it's one extra HTTP request). But it
should only be one extra HTTP request per vocab (when you store/save/cache
the server responses), regardless of the number of terms in each vocab - so
not unreasonable I think, and only needed when you don't already know a
vocab's namespace IRI up-front.
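
The steps above can be sketched roughly as follows, again in plain Python with a hypothetical in-memory `FAKE_WEB` in place of real HTTP. `TermResolver`, `term_triples` and the `Y` term are made-up names for illustration, and the sketch assumes the vocab publishes an `rdfs:isDefinedBy` triple per term and serves all term metadata from its namespace IRI.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
IS_DEFINED_BY = RDFS + "isDefinedBy"
LABEL = RDFS + "label"
VOCAB = "http://ex.co/x/"  # slash-based namespace IRI


def term_triples(name):
    """Triples a (hypothetical) server publishes for one vocab term."""
    term = VOCAB + name
    return [(term, LABEL, name), (term, IS_DEFINED_BY, VOCAB)]


# Stand-in for the network: each term IRI serves only its own triples,
# the namespace IRI serves the triples for every term.
FAKE_WEB = {
    VOCAB + "Y": term_triples("Y"),
    VOCAB + "Z": term_triples("Z"),
    VOCAB: term_triples("Y") + term_triples("Z"),
}


class TermResolver:
    def __init__(self, web):
        self.web = web
        self.vocab_cache = {}  # vocab IRI -> triples of the full-vocab doc
        self.requests = 0      # number of simulated HTTP requests

    def _get(self, url):
        self.requests += 1
        return self.web[url]

    def describe(self, term_iri):
        # 1. Query what we already hold locally before touching the network.
        for triples in self.vocab_cache.values():
            hits = [t for t in triples if t[0] == term_iri]
            if hits:
                return hits
        # 2. Dereference the single term IRI.
        term_doc = self._get(term_iri)
        # 3. Look for rdfs:isDefinedBy, but only in the term's own response.
        defined_by = [o for s, p, o in term_doc
                      if s == term_iri and p == IS_DEFINED_BY]
        if not defined_by:
            return [t for t in term_doc if t[0] == term_iri]
        # 4. One extra request for the whole vocab, cached from now on.
        vocab_iri = defined_by[0]
        self.vocab_cache[vocab_iri] = self._get(vocab_iri)
        return [t for t in self.vocab_cache[vocab_iri] if t[0] == term_iri]


resolver = TermResolver(FAKE_WEB)
resolver.describe(VOCAB + "Y")
after_first_term = resolver.requests   # term document + vocab document
resolver.describe(VOCAB + "Z")
after_second_term = resolver.requests  # served from the cached vocab
print(after_first_term, after_second_term)  # prints "2 2"
```

So in this sketch the first term costs two requests and every later term from the same vocab costs none, which is the "one extra HTTP request per vocab" I mean. If the client decides it doesn't want the whole vocab, it can simply stop after step 2.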

> But it is a price that may well be worth paying in general.
> After all, I still think that systems don’t resolve vocab much once they
> go live.
>

[PMcB] Yeah, I indeed think it is a price well worth paying (even if *just*
so that people (in general) can have a single, simple piece of *guidance*
to follow, if they so choose). In other words, I think it's vastly better
(especially for newbies) than saying (to paraphrase Sarven's position
(sorry Sarven, I'll reply more thoroughly to your thoughts separately :) ))
- i.e., "Well, you need to decide for yourself between slash and hash for
your new vocab, by weighing up: your specific use case; reflecting on
empirical evidence, e.g., what characteristics do the majority of the
vocabs share?; and helping the URI owners when considering persistence
policies". To be honest, I feel that kind of guidance is precisely what
results in newbies running screaming to the hills... :)

And yes, I totally agree too that (from my experience anyway) systems don’t
resolve vocabs much at all (including when they go live). But regardless of
whether they do or not, I think adopting slashes (as mere guidance) helps
pave the way for Linked Data clients to *be able to* more easily and
efficiently choose for themselves to resolve entire-vocab metadata and/or
individual-vocab-term metadata at runtime more and more in the future
(e.g., to drive user interfaces from vocab metadata, to help drive dynamic
queries via link traversals, etc.). Whereas just sticking with the current
empirical evidence of vocabs in the wild today (i.e., hashes) can only
result in limiting future choices for vocab users.

Cheers,

Pat.


>
> > On 11 Oct 2022, at 12:54, Pat McBennett <patm@inrupt.com> wrote:
> >
> > Hiya Hugh,
> >
> > Thanks so much for engaging!
> >
> > On Tue, Oct 11, 2022 at 10:23 AM Hugh Glaser <hugh@glasers.org> wrote:
> >> Hi,
> >>
> >> > On 11 Oct 2022, at 09:49, Pat McBennett <patm@inrupt.com> wrote:
> >> >
> >> > Hiya Pierre-Antoine,
> >> >
> >> > I'm going to try and reply in-line this time - hopefully GMail won't
> garble the formatting this time (and I've prefixed my responses with [PMcB]
> too):
> >> >
> >> > On Mon, Oct 10, 2022 at 11:21 AM Pierre-Antoine Champin <
> pierre-antoine@w3.org> wrote:
> >> > Dear Pat,
> >> >
> >> > I just wanted to make sure we were on the same page regarding the
> "best of both worlds" situation, but clearly we are.
> >> >
> >> > To answer your question about my points c) and d) below:
> >> > when the client retrieves something from http://ex.co/x/, it
> contains some triples about http://ex.co/x/Z. But when the client wants
> to know exactly what http://ex.co/x/Z is, how does it determine that it
> does not need to retrieve http://ex.co/x/Z, because it already retrieved
> everything there is to know about http://ex.co/x/Z when it retrieved
> http://ex.co/x/ ?
> >> >
> >> > [PMcB] - I'd say it simply queries (i.e., locally, in-memory) the
> response it got from the server when it de-referenced http://ex.co/x/
> (i.e., the "large" representation), to see if that response already
> contains triples for http://ex.co/x/Z. In other words, I'd expect the
> Best Practice guidance to state that all vocab term metadata for all vocab
> terms be returned in that "large" representation, and not just a subset of
> term metadata, or only the `rdfs:isDefinedBy` triples for vocab terms.
> >> >
> >> > And yes, to perform such a query does need a client-side library
> (like RDF4J or Jena for Java, or RDF-JS for JavaScript, or rdflib for
> Python, etc.) - but given we're talking about RDF here in the first place,
> I don't see that as a huge ask. (Caveat: I do know and recognize of course
> that the current mainstream RDF libraries are very low-level and therefore
> 'complex', which is why we at Inrupt (and others, such as Ghent University)
> are actively trying to produce open-source, higher-level, easier-to-use
> SDKs to make doing such things much, much easier, especially for devs not
> familiar with RDF at all).
> >> > But without getting into that whole client-side library debate(!), I
> think the fundamental, inevitable answer to your perfectly valid question
> of "how does it determine...?" has to be "it
> checks/asks/looks-inside/queries/looks-up the response from the server",
> and therefore a minimal level of 'client understanding' of server responses
> will always be necessary.
> >> I like slash URIs, but sorry, I can’t like this.
> >> The basis of Linked Data is that you find the triples of authority for
> an entity by resolving the URI, and munching what you get back.
> >>
> > [PMcB] Yeah, but that's exactly my whole argument for using slashes :) !
> In my view, the IRI of a single vocab term is literally the authority for
> *just* the triples associated with that one vocab term. I think it's
> debatable what a vocab's namespace IRI is the authority over (which is why
> I suggest that it be just a Best Practice guidance that it be considered
> the authority over the entire vocab's metadata, including *all* the
> metadata for *all* the terms defined by that vocab too).
> >
> > This is also why I say using slashes is 'more correct', as it allows for
> a clear distinction over what an IRI is an authority over. Hashes conflate
> that by basically saying the authority over a single vocab term's metadata
> is actually the entire vocab, instead of the individual term itself. And
> presumably the vocab namespace IRI is *also* an identifier for that same
> entire vocab. So this inability of hashes to clearly distinguish these
> 'authorities' seems to me to be much 'less correct'.
> A hash URI fetch, fetches the document at the base URI without the hash,
> by definition.
> With the simplest caching, this means that no subsequent term look-up
> needs to do any network activity for all the hash URIs that have the same
> base URI (or whatever the technical term is).
> So that is always known to be the “authority” - no further fetching is
> required.
> This does not apply to a slash URI.
> >
> >> Simply looking in your triplestore to see if you have triples about
> that entity is not enough.
> >>
> > [PMcB] Why not? One of the biggest lightbulb moments I've ever had with
> RDF was when I first grokked the elegance and beauty of being able to store
> all my T-Box data (i.e., schema metadata) alongside all the A-Box
> (instance) data in a single triplestore repository.
> >
> > But Ok, let's say we don't load all the T-Box data from all the vocabs
> we reference (as that's perfectly reasonable too!). Well then, yeah, you're
> *always* going to need to 'do more work' to discover the metadata for vocab
> terms. But I think we'd all agree that you can *never* just assume that a
> lookup of a single vocab term's IRI will always give you all the entire
> defining-vocab's info too (because obviously none of the slash-based vocabs
> will do that, like Schema.org won't today, nor will QUDT, nor gist, etc.).
> So in other words, I think you'll always have to 'do more work', and I
> think that necessarily means dereferencing any IRI you've got, and then
> understanding and processing the response you get back from the
> vocab-hosting server.
> >
> > If that IRI was a hash IRI, then you still have to parse that response
> to extract the term's info, and you also have to parse that response to
> determine if it also contains any other vocab term metadata too ('cos you
> certainly can't (or shouldn't!) just assume that it does). And at this
> point, for hash IRIs, you now know that you can cache that entire server
> response, and from now on do cache lookups for more terms from that vocab -
> great.
> You have a different process to me, I would say.
> When you fetch LD, you keep it (somewhere - could be an RDF store), having
> parsed the whole document.
> When you want to resolve a URI, you see if you have already fetched it,
> and then don’t bother if you have (caching).
> Then, as normal for any query, you query your RDF store for whatever data
> you may need for that term, so you can use it for a purpose.
> >
> > But the only extra work for a slash-based IRI would be that you have to
> look for an `rdfs:isDefinedBy` triple, dereference that, and cache that
> server response for all further vocab term lookups - done. But of course,
> you only have to do that extra lookup *if* you know you want to retrieve
> (and cache, presumably) all the info for all the other vocab-defined terms
> too.
> >
> > So again, I think that *potential* extra work (i.e., only do it if you
> really need it) for slash-based IRIs is well worth the great flexibility it
> can afford users (and potential users) forever into the future.
> >
> >> What happens if some other source has triples with that URI in?
> >> rdfs:isDefinedBy might mitigate this to some extent, but even then, why
> should I think that is any more authoritative than anything else.
> >>
> > [PMcB] I'm not sure I follow. Using `rdfs:isDefinedBy` is as
> authoritative as it's possible to get, as it's metadata asserted on the
> individual vocab term itself (by definition). But yeah, as I've said before
> too, I do think providing an `rdfs:isDefinedBy` triple-per-vocab-term
> should just be a Best Practice *regardless* of this entire slash vs hash
> discussion (and again, it's just guidance, a recommendation - if you can't,
> or don't want to, or can't afford the extra T-Box triples - then don't (but
> just know that you'll be *potentially* hurting some users of your vocab)).
> >
> >
> >> Of course, if you give vocabs/ontologies special status, then you can
> do this sort of thing.
> >> But if you are just treating them as the RDF/Linked Data that they are,
> then you are in trouble saying this.
> >> My standard system with caching triplestore etc. would always want to
> know it had got the resolved URI at some time.
> >>
> > [PMcB] I'm not sure I follow this point either. I most certainly agree
> with treating vocabs as the RDF/Linked Data that they are (as I said above,
> that was a big lightbulb moment for me!), and I certainly don't think that
> there's any need to treat vocabs in any way specially. That's why all vocab
> terms *must*, by definition, have explicit RDF types stating 'what they
> are' (i.e., they *must* use `rdfs:Class`, or `rdf:Property`, or
> `owl:NamedIndividual`, or `owl:Class`, etc.). So when I dereference any IRI
> at all, I should be able to determine if that response contains info on
> just a single vocab term, or multiple/all vocab terms, or vocab metadata
> (e.g., an `rdf:type owl:Ontology` triple), or if any term metadata contains
> `rdfs:isDefinedBy` triples, etc.
> >
> > But I suspect I may be missing your point here and in the preceding
> point!
> Possibly :-)
> Yes, to Best Practice.
> What I mean is that you need to look very carefully where the
> rdfs:isDefinedBy triple comes from, when you consult your RDF and find one
> of them.
>
> Cheers
> Hugh
> >
> >
> > This is why I think use cases are needed - slash is great for
> pre-loaded/engineered systems, but for the properly dynamic aims of the
> Semantic Web, it will incur extra fetching costs for the terms.
> >
> > [PMcB] So yeah, but again, to the 'use-cases are needed' point - we can
> *never* know all the potential use-cases up front. Even if you create a
> vocab intended *only* for a narrowly defined set of use-cases, you still
> can't know or predict how potential future users might *want* to use it.
> (And again, this is only guidance - if you really, really want even
> potential future users to always have rigid expectations from your vocab,
> then sure, go ahead and use a hash, and just explain to them why - that'd
> be perfectly fine with me).
> >
> > But my main point is that the 'extra fetching' costs can be massively
> alleviated if the vocab simply follows Best Practice of providing
> `rdfs:isDefinedBy` triples, and with just a little bit of extra smarts on
> the client (but only to handle the cases where you don't already know the
> vocab's namespace IRI - 'cos if you do, you'd just dereference that and
> you're done).
> >
> > So if you know you want the entire vocab info, and you do already know
> the vocab's namespace IRI, then just dereference that and you have
> literally zero extra fetching costs (i.e., you get exactly the same
> response, regardless of slash or hash).
> >
> > But *if* all you have is an IRI, and that IRI happens to be a single
> term's IRI, then after you dereference it and parse it, if it's a
> slash-based IRI, then the only extra work you need to do is to look for an
> `rdfs:isDefinedBy` triple and dereference that IRI - that's it.
> >
> > For all the extra flexibility and consistency (and the 'more
> correct-ness', in my view) that comes from using slashes, I think there's
> only, at worst, a tiny extra cost, and something that our RDF libraries and
> tools can easily handle for us anyway.
> >
> > Cheers,
> >
> > Pat.
> >
> >
> >
> > Best
> > Hugh
> > >
> > > One way to achieve this would be to include, in the content of
> http://ex.co/x/, the triple
> > >     <http://ex.co/x/Z> rdfs:isDefinedBy <http://ex.co/x/>.
> > >
> > > but again, that is a convention that both the server and the client
> have to share.
> > >
> > > [PMcB] - Yep, exactly! I already make doing that a strongly
> recommended Best Practice for all vocabs I produce or work with, and so I'd
> love to see that become a more universally shared convention. But yes, it
> would just be a Best Practice guidance, one that I'd hope would become more
> and more widespread over time. For sure, we can't enforce it, but we can
> point at good examples from major, highly successful vocabs out there
> today, like QUDT and gist and Schema.org and DPV and ...!
> > >
> >
> >
> > This e-mail, and any attachments thereto, is intended only for use by
> the addressee(s) named herein and may contain legally privileged,
> confidential and/or proprietary information. If you are not the intended
> recipient of this e-mail (or the person responsible for delivering this
> document to the intended recipient), please do not disseminate, distribute,
> print or copy this e-mail, or any attachment thereto. If you have received
> this e-mail in error, please respond to the individual sending the message,
> and permanently delete the email.
>
>

Received on Wednesday, 12 October 2022 00:54:08 UTC