Re: W3C position on URIs http:// vs. https:// from Hugh Glaser on 2023-06-15 (semantic-web@w3.org from June 2023)

From: Hugh Glaser <hugh@glasers.org>
Date: Thu, 15 Jun 2023 12:09:23 +0100
To: Chris Mungall <cjmungall@lbl.gov>
Cc: semantic-web <semantic-web@w3.org>
Message-Id: <7D8662F3-760A-4960-9B0C-4227D66FEEF6@glasers.org>
Good questions.
I’ll try and be brief and get across the gist.

First thing to say is that we are very much in the Linked Data world.
So the RDF associated with each URI is primarily fetched by resolution, or as the SCBD (Symmetric Concise Bounded Description) from a known triplestore.
When using any single URI, it is assumed that it is one of a possible set of URIs that are known to be sameAs for this particular application and context.
(This sameAs information can be part of a triplestore, of course.)
But the main way we keep such information is in specialised KBs about URIs, informing sameAs (and differentFrom, exactMatch, etc.) services with their own API.

> On 14 Jun 2023, at 23:54, Chris Mungall <cjmungall@lbl.gov> wrote:
> 
> How does this work in practice? Especially if we have billions of URIs? Is each sameAs materialized or is this some kind of regex-based thing?
Yes, there are versions of the sameAs service that allow arbitrary scripting.
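As a rough illustration of the two options raised in the question (materialised pairs versus a scripted rule), here is a minimal sketch that combines both. It is not the sameas.org implementation: the class `SameAsService`, the rule `scheme_variants`, and the example URIs are all hypothetical names chosen for the example.

```python
import re
from typing import Callable, Dict, List, Set

# Hypothetical scripted rule: the http and https forms of a URI are equivalent.
_SCHEME = re.compile(r"^https?://")


def scheme_variants(uri: str) -> Set[str]:
    """Return the http and https spellings of a URI (empty set if neither)."""
    if not _SCHEME.match(uri):
        return set()
    rest = _SCHEME.sub("", uri, count=1)
    return {"http://" + rest, "https://" + rest}


class SameAsService:
    """Toy equivalence lookup: stored bundles plus scripted rules."""

    def __init__(self, rules: List[Callable[[str], Set[str]]]):
        self._bundle_of: Dict[str, Set[str]] = {}  # uri -> bundle it belongs to
        self._rules = rules

    def assert_same(self, a: str, b: str) -> None:
        """Materialise one sameAs pair by merging the two bundles."""
        merged = self._bundle_of.get(a, {a}) | self._bundle_of.get(b, {b})
        for uri in merged:
            self._bundle_of[uri] = merged

    def equivalents(self, uri: str) -> Set[str]:
        """Close the URI under the stored bundles and the rules."""
        seen: Set[str] = set()
        todo = [uri]
        while todo:
            current = todo.pop()
            if current in seen:
                continue
            seen.add(current)
            todo.extend(self._bundle_of.get(current, ()))
            for rule in self._rules:
                todo.extend(rule(current))
        return seen
```

With a rule like this, the http/https pairs never need to be stored, which matters at billions of URIs; only the pairs that no rule can derive are materialised.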
> 
> Is the expectation that architecture such as triplestores uses entailment, such that queries can use either? Or is the expectation that the client will essentially do the entailment with a join? Or that the sameAs service is somehow used transparently by the SPARQL query engine?
For a full deployment, there is an infrastructure that has to query all the SPARQL (and other) stores, using the sets of URIs.
It can all get put into stores on demand, and then the store does all the work, but we rarely do that.
Of course, we could extend the SPARQL store to use sameAs services, but we haven’t.
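One way to picture "query all the stores, using the sets of URIs" is to build one query per URI set with a SPARQL `VALUES` clause and send it to each store. This is only a sketch under that assumption: the function name `describe_query` is hypothetical, and actually sending the query to each endpoint is left out.

```python
from typing import Iterable


def describe_query(equivalent_uris: Iterable[str]) -> str:
    """Build a SPARQL query matching any URI in the set as a subject."""
    uris = sorted(set(equivalent_uris))
    if not uris:
        raise ValueError("need at least one URI")
    for uri in uris:
        # Refuse characters that are not allowed inside a SPARQL IRIREF.
        bad_char = any(ch in uri for ch in '<>"{}|^`\\')
        control_or_space = any(ord(ch) <= 0x20 for ch in uri)
        if bad_char or control_or_space:
            raise ValueError(f"not a safe IRI: {uri!r}")
    values = " ".join(f"<{uri}>" for uri in uris)
    return (
        "SELECT ?s ?p ?o WHERE {\n"
        f"  VALUES ?s {{ {values} }}\n"
        "  ?s ?p ?o .\n"
        "}"
    )
```

The same query text can then be posted to every store in the federation, and the client merges the results, so no store needs to know about sameAs at all.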
> 
> What about if we want to distribute a subset of the triples separate from the source document/triplestore.
The way to distribute triples is by URI resolution.
If you wanted to have a subset, then I guess you could mint a URI that resolved to a document with those triples in, and declare it sameAs.
We would view different sets of triples as describing possibly different concepts.
> Should these be shadowed by the sameAs triples? Or is the expectation you would do a federated query on the triplestore with the sameAs triples?
I’m aware that we are not thinking of things the same way.
My best answer is that we are doing the federated query that you mention.
> 
> Sorry if I'm being dense and missing something obvious, I am trying to think how to write documentation for such a system that a non-semweb guru could implement and follow.
No - excellent questions.
I’m afraid I struggle to answer - it is over 10 years since I thought about this in any depth.

If you like, you can mail me off-list with more questions, in case we are going off-topic and bore people too much.

Best
Hugh
> 
> On Wed, Jun 14, 2023 at 2:15 PM Hugh Glaser <hugh@glasers.org> wrote:
> We have used a sameas.org compliant service that simply sameAs-ed all http & https URIs, when it becomes an issue.
> So just part of the standard architecture for us.
> 
> Cheers
> 
> > On 14 Jun 2023, at 16:11, Andrea Splendiani <andrea.splendiani@iscb.org> wrote:
> > 
> > Hi,
> > Adding to the thread: is it naïve to move to https when it will be needed or desired in new releases of data resources, and count on sameAs to assess equivalence?
> > Overkill on one hand, but very common on another as URL rewriting happens.
> > 
> > Best,
> > Andrea
> > 
> > 
> >> On 14 Jun 2023, at 16:32, Chris Mungall <cjmungall@lbl.gov> wrote:
> >> 
> >> 
> >> 
> >> On Wed, Jun 14, 2023 at 6:28 AM Pierre-Antoine Champin <pierre-antoine@w3.org> wrote:
> >> 
> >> On 14/06/2023 00:53, Chris Mungall wrote:
> >>> Hi Pat!
> >>> 
> >>> While this could work in principle, in practice there are likely millions of lines of code like this:
> >>> 
> >>> >>> if pred == "http://www.w3.org/2004/02/skos/core#altLabel": 
> >>> >>>   ...
> >> This is not really the issue, I believe. My (mis?)reading of Pat's suggestion is that the https transparency should be implemented whenever these IRIs are used as URLs. I.e. at the "linked data" level, not the "RDF level".
> >> In other words, all RDF files, RDF database, and code dealing with them, should use the <http://www.w3.org/2004/02/skos/core#altLabel> (no "s"). That's the identifier of the "alternative label" property in Skos, we should not change it.
> >> However, any code that wishes to dereference this identifier to get more info about what it means, could (should?) be updated to automatically replace the http: at the beginning by https:. And fall back to http:// if the former attempt fails.
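A minimal sketch of that "https first, then fall back" dereferencing might look like the following. The function names are hypothetical, the `Accept: text/turtle` header is an assumption, and the identifier itself is never rewritten: only the URL used for the fetch changes.

```python
import urllib.request
from typing import Callable, List


def candidate_urls(iri: str) -> List[str]:
    """URLs to try, in order, when dereferencing an http IRI."""
    if iri.startswith("http://"):
        return ["https://" + iri[len("http://"):], iri]
    return [iri]


def http_get(url: str) -> bytes:
    """Default fetcher: a GET that asks for Turtle (assumed media type)."""
    req = urllib.request.Request(url, headers={"Accept": "text/turtle"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()


def dereference(iri: str, fetch: Callable[[str], bytes] = http_get) -> bytes:
    """Try the https form first, then fall back to the IRI as written."""
    last_error = None
    for url in candidate_urls(iri):
        try:
            return fetch(url)
        except OSError as err:  # urllib's URLError is an OSError subclass
            last_error = err
    raise last_error
```

Passing the fetcher in as a parameter keeps the fallback logic testable without any network access.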
> >> Oh I see, well that's relatively straightforward, but it's a non-use case for us in OBO. I'm not aware of anyone ever writing code to obtain more information about an entity by dereferencing its URI. Although we originally set up a lot of infrastructure to do this, this follow-your-nose thing has always been a fantasy in the biomedical linked data world, and I suspect in other domains too. I don't really get what kind of code would be able to do anything meaningful with what it gets by grabbing the turtle for skos:altLabel.
> >>  In fact most biomedical ontology class PURLs resolve only to HTML:
> >> 
> >> curl -L -H "Accept: text/turtle"  http://purl.obolibrary.org/obo/CL_0000540
> >> 
> >> => html
> >> 
> >> Only humans care about the links. Machines use the whole ontology.
> >> 
> >> 
> >>   pa
> >>> 
> >>> or
> >>> 
> >>> >>> if pred == SKOS.altLabel:
> >>> >>>    ...
> >>> 
> >>> That would need to be rewritten to be s-transparent. Perhaps not Y2K code rewrite levels, but a lot. For some of those codebases there may be efficiency considerations - string equality is fast, string processing can be slow.
> >>> 
> >>> A lot of libraries use objects rather than strings, which would allow for custom definitions of ==, but this would be a big breaking change: some applications may depend on http and https being unequal.
> >>> 
> >>> Nevertheless it might be an idea to build for the future. Core libraries like rdflib, jena, owlapi could provide sTransparentEquals operations and sNormalize functions such that developers can start writing more future-proof code. Care would have to be taken in defining how sTransparent and legacy codebases interact. It may be difficult for sTransparent code to be s-preserving, which would necessitate complicated re-normalization if codebases are to be mixed. I'm imagining strange bugs in what is already quite a complicated layered stack (owl over rdf, I'm looking at you). And I fear that using a non-standard equality operator would make a lot of semweb code look even more opaque than it already is. 
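To make the idea concrete, here is a sketch of what such helpers might look like on plain strings. They are hypothetical: nothing in this thread suggests rdflib, Jena or the OWL API ship functions with these names, and choosing http as the normal form is an assumption made for the example.

```python
def s_normalize(iri: str) -> str:
    """Map an https IRI to its http form; leave everything else unchanged."""
    if iri.startswith("https://"):
        return "http://" + iri[len("https://"):]
    return iri


def s_transparent_equals(a: str, b: str) -> bool:
    """Compare two IRIs while ignoring an http/https scheme difference."""
    return s_normalize(a) == s_normalize(b)


SKOS_ALT_LABEL = "http://www.w3.org/2004/02/skos/core#altLabel"
pred = "https://www.w3.org/2004/02/skos/core#altLabel"

print(pred == SKOS_ALT_LABEL)                      # False: plain string equality
print(s_transparent_equals(pred, SKOS_ALT_LABEL))  # True: scheme is ignored
```

Note that `s_normalize` is lossy: once applied, the original scheme cannot be recovered, which is the s-preserving difficulty described above.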
> >>> 
> >>> I think a lot of information ecosystems would opt to keep the code simple, and if forced to make the change, just bite the bullet, rewire all accessible RDF and provide converters to help do this.
> >>> 
> >>> Both options have high costs, which is why in OBO we have no plans to change our existing http PURLs. But we don't know if there will be further developments that make continued use of http difficult.
> >>> 
> >>>    ...
> >>>    ...
> >>> 
> >>> On Tue, Jun 13, 2023 at 12:49 PM Patrick J. Hayes <phayes@ihmc.org> wrote:
> >>> (On a more constructive note…) 
> >>> 
> >>> Chris, greetings. I agree with everything you say here, but wonder whether there might be a slightly less painful way to bring the Sweb up to date than rewriting every extant ontology. 
> >>> 
> >>> The Web is much bigger than the total Sweb, including all the RDF/OWL ontologies, but that is probably bigger than the sum total of the code of Sweb tools that manipulate these ontologies. So on the principle of making the fix where it causes least pain, could we not encourage semantic web tool-builders to make their engines treat URIs in an s-transparent way, so that http:foodleblax and https:foodleblax are simply treated as identical when occurring in any RDF triple? I am not a developer but surely this would not be too onerous a task, would it? It's a tweak to some low-level part of the code that extracts URIs from datastructure or text. Call such an RDF tool 'S-transparent', then asking Sweb developers to ensure 'S-transparency' would seem (?) to solve the problem and still keep other Web developers happy, for surely they do not care what happens to URIs embedded inside RDF triples, which are never used as Web identifiers in any transfer protocol. (Or do they?)
> >>> 
> >>> Anyway, I will leave y'all with this thought. I'm sure it must have occurred to someone already in any case. 
> >>> 
> >>> If this is nonsense or unworkable, please just ignore it.
> >>> 
> >>> Best wishes
> >>> 
> >>> Pat Hayes
> >>> 
> >>>> On Jun 13, 2023, at 10:01 AM, Chris Mungall <cjmungall@lbl.gov> wrote:
> >>>> 
> >>>> I think it's important for the semantic web community to communicate clearly, simply, unambiguously, and non-dogmatically when it comes to this issue.
> >>>> 
> >>>> While I agree with many points in the TimBL article, the ship has long sailed. I can't show that article to web developers who are asking me why we don't change our PURLs to https, because chrome refuses to allow downloads of them when linked from an https site. They don't understand why we are reluctant to change, because frankly using URLs for identifiers was a pretty odd thing to do in the first place, mixing two separate concerns (semantic identity and network protocols). Browsers and http libraries can happily treat http and https as equivalent, but this is obviously a massive problem for semantic web interoperability.
> >>>> 
> >>>> The lack of guidance has led to confusion. For example, it looks like schema.org is in some superposition state where http or https is considered canonical for semantic identifiers.
> >>>> 
> >>>> https://github.com/solid/solid-namespace/issues/21
> >>>> https://github.com/linkeddata/rdflib.js/issues/550
> >>>> 
> >>>> We are faced with this problem in the OBO community, we adopted http PURLs for both OWL classes and OWL ontologies around 15 years ago, rejecting URN-based LSIDs. We are now faced with the situation where things are breaking as various pieces of web infrastructure start making life for http difficult. 
> >>>> 
> >>>> We tried reading 
> >>>> https://www.w3.org/blog/2016/05/https-and-the-semantic-weblinked-data/
> >>>> But the advice about URI and HSTS is hard to follow for a bunch of ontologists. We just want to make useful ontologies, and not be forced to be network engineers.
> >>>> 
> >>>> Our discussion and eventual decisions are recorded here, if it's useful (and comments welcome if we are doing things incorrectly):
> >>>> 
> >>>> https://github.com/OBOFoundry/purl.obolibrary.org/issues/705
> >>>> 
> >>>> Summary:
> >>>> 
> >>>> 1. Our infrastructure supports both https and http URLs, for both terms and ontologies, these both 302 redirect to the relevant browser or download (using cloudflare)
> >>>> 2. We encourage web sites that need to link to an ontology download to use the https URLs in HTML, but to make it clear that the PURL is the http URI, and the http PURL *must* be used in RDF documents
> >>>> 3. Even though we support https variants of http PURLs for OWL classes, with both 302 redirecting to the same location, we strongly discourage their use in any context, because this can lead to confusion about the canonical URL to use in RDF/OWL documents. We don't want to end up in the schema.org situation. We are building lots of tooling that will check for cases where https is used accidentally in a linked data context, as we expect this to happen a lot.
> >>>> 
> >>>> This has been sufficient to placate frustrated web developers, but it feels like we are delaying the inevitable and that there will one day be pressure to deprecate our http PURLs and switch to https. This would have a massive cost in terms of rewiring massive distributed troves of RDF data and OWL documents, database tables, and a highly painful, long, and confusing transition period. But we are hoping that this day never comes or we can delay it as long as possible, or LLMs will make the whole thing irrelevant.
> >>>> 
> >>>> On Tue, Jun 13, 2023 at 8:48 AM Melvin Carvalho <melvincarvalho@gmail.com> wrote:
> >>>> 
> >>>> 
> >>>> On Tue, 13 Jun 2023 at 17:37, Hubauer, Thomas <thomas.hubauer@siemens.com> wrote:
> >>>> Hi SemWeb community,
> >>>>  
> >>>> One of my projects is considering making some of our ontologies accessible to customers. As part of these considerations, we have been discussing resolving ontology references (e.g. for imports) which lead us to some lengthy arguments about http:// vs. https:// as protocol part in our URIs (primarily ontology URIs, potentially element URIs as well). 
> >>>>  
> >>>> I am aware of a 2016 post (https://www.w3.org/blog/2016/05/https-and-the-semantic-weblinked-data/) stating that W3C currently considers http and https to be “equivalent” for w3c.org. However, the security guys I am working with are not too happy with this as using a http URI for downloading imported ontologies is vulnerable to a man-in-the-middle attack.
> >>>>  
> >>>> I was unable to find any more recent statement by the W3C on the use of http vs. https. Specifically, I’d be interested to understand if this community (and the W3C) intend to stick with http for the foreseeable future, or if there are any plans to migrate some/all URIs (e.g. ontology URIs but not element URIs) to https? Would be nice for us to understand what “the outer world” plans so we can maybe take this as a blueprint for our own guidance on URIs.
> >>>> 
> >>>> I'm with TimBL on this:
> >>>> 
> >>>> "HTTPS Everywhere" considered harmful
> >>>> 
> >>>> https://www.w3.org/DesignIssues/Security-NotTheS.html
> >>>> 
> >>>> The Semantic Web has been around for a couple of decades.  Is there any documented instance of an MITM attack on an ontology ever causing an issue?
> >>>>   
> >>>>  
> >>>> Best regards,
> >>>> Thomas
> >>>>  
> >>>>  
> >>> 
> 
>
Received on Thursday, 15 June 2023 11:09:48 UTC