RE: The use of https: IRIs on the semantic web from Stian Soiland-Reyes on 2017-07-10 (semantic-web@w3.org from July 2017)

From: Stian Soiland-Reyes <soiland-reyes@manchester.ac.uk>
Date: Mon, 10 Jul 2017 10:56:49 +0000
To: Richard Smith <richard@ex-parrot.com>, "semantic-web@w3.org" <semantic-web@w3.org>
Message-ID: <D5780135E58FC940BDB87E7D499910184B3F08C0@MBXP14.ds.man.ac.uk>
I would wholeheartedly support using an https namespace for a new ontology, coupled with a longevity-secured location. With https://letsencrypt.org/ there is really no excuse today for running an unencrypted web service, even for experimental work.





Probably the biggest real danger with an https:// namespace is that someone forgets to renew the SSL certificate (or, as has happened recently, SSL clients start requiring stronger encryption) and the vocabulary becomes “inaccessible”.



SSL expiry is, however, not as bad as the expiry of a custom domain name for the ontology (a bad idea), since there are at least technical ways to bypass the expiry warnings.
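
To catch a lapsing certificate before clients do, a check along the following lines could run from a scheduled job. This is only a sketch using Python's standard ssl module; the function names are made up for illustration and are not part of any tool mentioned in this thread.

```python
import socket
import ssl


def fetch_not_after(host, port=443, timeout=10):
    """Return the 'notAfter' string of the certificate served by host.

    Needs network access. Note that an already expired certificate makes
    the handshake itself fail with an ssl.SSLError, which is a signal too.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now):
    """Days between 'now' (POSIX seconds) and the certificate's notAfter time."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400
```

A job could call days_until_expiry(fetch_not_after("w3id.org"), time.time()) and raise an alert once the result drops below whatever renewal margin suits the deployment.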





The JSON-LD community has gone with https from day one, as nobody wants their @context resolution to be vulnerable, or to fail when loaded from an https:// web application.



Thus I find that vocabularies using PURLs with a https://w3id.org/ registered namespace are pretty much always on https, although some of them redirect to an insecure http location (a redirection I know Java’s URL support will protest against) or may have accidentally used http://w3id.org/ in their declared namespaces. Here are some of them (where people added an accompanying README): https://github.com/perma-id/w3id.org/search?utf8=%E2%9C%93&q=ontology&type=
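
The https-to-http redirection mentioned above is easy to spot once the chain of URLs a client visited is known. The sketch below assumes the chain has already been collected by some HTTP client as a list of URL strings; the function name and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def find_downgrades(chain):
    """Return (source, target) pairs where an https URL redirects to an http URL."""
    downgrades = []
    for source, target in zip(chain, chain[1:]):
        if urlsplit(source).scheme == "https" and urlsplit(target).scheme == "http":
            downgrades.append((source, target))
    return downgrades


# Hypothetical chain: a w3id.org PURL that ends up on a plain-http host.
chain = [
    "https://w3id.org/example/ns",
    "http://vocab.example.org/ns.ttl",
]
for source, target in find_downgrades(chain):
    print("insecure redirect:", source, "->", target)
```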





--
Stian Soiland-Reyes, eScience Lab
School of Computer Science, The University of Manchester
http://orcid.org/0000-0001-9842-9718



From: Richard Smith<mailto:richard@ex-parrot.com>
Sent: 07 July 2017 22:45
To: semantic-web@w3.org<mailto:semantic-web@w3.org>
Subject: The use of https: IRIs on the semantic web



I hope this is an appropriate mailing list to ask this
question.  I'd be happy to be directed elsewhere if not.

I am defining a new vocabulary.  It's not an extension of an
existing vocabulary, nor will it use the same domain as any
existing vocabulary.  Should I use https: IRIs?

Every source I've consulted says I should prefer http: IRIs.
This includes the Linked Data book [1], the W3 note on "Cool
URIs" [2], and the W3 note on best practices for RDF
vocabularies [3].

This surprises me slightly.  The world seems to be moving
away from HTTP to HTTPS, yet I know of no vocabulary that
uses https: IRIs, and none of the documents quoted above
even discuss the question.  I can find discussion on why
ftp: or urn: are less well suited, but nothing about https:.

Even though IRIs on the semantic web are primarily
identifiers rather than locators, certainly in the linked
data world, the IRI is assumed to be a good place to look
for more information about an entity, and various authorities
recommend 303 redirects to documents like RDF or OWL schemas
or other descriptive documents with further information.
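
As an illustration of what such a lookup involves, here is a sketch in Python that builds (but does not send) the request a linked data client might make. The IRI is a made-up example and the Accept value is just one plausible preference order, not something the cited documents prescribe.

```python
from urllib.parse import urldefrag
from urllib.request import Request

# Assumed preference: Turtle first, RDF/XML as a fallback.
RDF_ACCEPT = "text/turtle, application/rdf+xml;q=0.9"


def build_lookup_request(iri):
    """Build a GET request for the document that describes the given IRI."""
    # The fragment is never sent to the server, so strip it explicitly.
    document_url, _fragment = urldefrag(iri)
    return Request(document_url, headers={"Accept": RDF_ACCEPT})


request = build_lookup_request("https://example.org/vocab#Thing")
print(request.full_url)
print(request.get_header("Accept"))
```

Passing the request to urllib.request.urlopen would follow a 303 redirect automatically, so the client ends up at the schema document rather than at the identifier itself.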

If the IRI is just an identifier, the choice of IRI scheme
is largely irrelevant, and https: is neither better nor
worse than http:.  If it's used as a locator, then at least
the initial request to an http: IRI will be made over plain
HTTP.  It may then be redirected to HTTPS, and HSTS headers
may mean subsequent requests go directly over HTTPS, but the
first request is still unencrypted.  This has the following
potential problems:

* It is susceptible to a man-in-the-middle attack.  A
   malicious party could inject deliberately inconsistent
   schema information that may affect processing decisions
   made by applications, potentially causing a DoS or otherwise
   disrupting the user experience.

* ISPs may do a MITM themselves to inject adverts into
   content (e.g. [4]).  Really they oughtn't to do this for
   non-HTML content (well, they shouldn't do it at all, but
   that's another matter), but that relies on the ISP caring
   enough to get this right.

* The request might be tracked by ISPs.  The fact that a
   user is using an application that consults a particular
   schema is itself valuable information about the customer
   that can be sold to advertisers, which the US Senate
   recently voted to make legal [5].  People are increasingly
   privacy conscious and want to minimise this.

These are all mitigated to a significant degree by using
https: IRIs.  So far as I can see, the counter-arguments are
as follows:

* Not all HTTP client libraries support TLS, but few if any
   only support HTTP over TLS.

* Best practice with TLS changes more frequently than plain
   HTTP.  An HTTP client from 20 years ago will still
   probably work, but TLS has moved on a lot and few servers
   now support SSL 3.

* MITM attacks and injection can still happen over TLS, as
   the "Superfish" fiasco demonstrated [6], and are probably
   better prevented with digital signatures.

* Use of TLS at the transport level prevents HTTP caching on
   intermediate servers, unless a trusted root certificate is
   used.

* ISP tracking still happens over TLS because the SNI field
   is not encrypted, and encrypted SNI seems to have been
   dropped from TLS 1.3 [7].

* Whether TLS is used at the transport level should be an
   implementation detail that is not exposed in the
   vocabulary, a point Berners-Lee has made forcefully [8].

* In the future it's likely that the functionality of
   HSTS will be put in DNS [9].

These seem fairly weak arguments to me.  Digital signatures
can be used regardless of whether the resource was fetched
over TLS, and adding authentication at the top of the
semantic web stack shouldn't preclude encryption at the
bottom.  HSTS-in-DNS technologies, which in conjunction with
DNSSEC would alleviate the problem, seem to be stalled, and
I've seen no drafts on the subject since 2011 [9].
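
The HSTS behaviour described earlier (first request over plain HTTP, later ones upgraded) can be sketched as a small client-side cache. This is a simplified illustration in Python, not a full implementation: the class and function names are invented here, and includeSubDomains and port handling are ignored.

```python
from urllib.parse import urlsplit


def parse_max_age(header_value):
    """Return max-age in seconds from a Strict-Transport-Security value, or None."""
    for directive in header_value.split(";"):
        name, _, value = directive.strip().partition("=")
        if name.strip().lower() == "max-age":
            return int(value.strip().strip('"'))
    return None


class HstsCache:
    """Remembers which hosts asked to be contacted over HTTPS only."""

    def __init__(self):
        self._expiry = {}  # hostname -> time (seconds) at which the policy lapses

    def note_response(self, url, header_value, now):
        parts = urlsplit(url)
        if parts.scheme != "https":
            return  # the header is only honoured on secure responses
        max_age = parse_max_age(header_value)
        if max_age is None:
            return
        if max_age == 0:
            self._expiry.pop(parts.hostname, None)  # max-age=0 clears the policy
        else:
            self._expiry[parts.hostname] = now + max_age

    def upgrade(self, url, now):
        """Rewrite http: to https: if the host has an unexpired policy."""
        parts = urlsplit(url)
        if parts.scheme == "http" and self._expiry.get(parts.hostname, 0) > now:
            return parts._replace(scheme="https").geturl()
        return url
```

With an empty cache, upgrade() returns an http: IRI unchanged, which is exactly the unencrypted first request at issue; only after a secure response carrying the header has been noted are later requests rewritten.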

I'm wondering whether there's something I'm missing, because
almost universally people are still defining vocabularies
using http: IRIs.

I can see why converting an existing vocabulary from http:
to https: would be difficult, to the point of being
undesirable; I can see too that there are logistic
conveniences to having all vocabulary IRIs on a given domain
use the same IRI scheme, both points Berners-Lee makes in
[8].  But these don't apply to new vocabularies.

Is there some other consideration I'm missing?

Richard


[1] http://linkeddatabook.com/editions/1.0/#htoc10
[2] https://www.w3.org/TR/cooluris/
[3] https://www.w3.org/TR/swbp-vocab-pub/
[4] http://preview.tinyurl.com/om3xxdb
[5] http://preview.tinyurl.com/ybeka8yv
[6] https://brennan.io/2015/02/20/superfish-explained/
[7] https://www.ietf.org/mail-archive/web/tls/current/msg23251.html
[8] https://www.w3.org/DesignIssues/Security-NotTheS.html
[9] https://tools.ietf.org/html/draft-hallambaker-esrv-01
Received on Monday, 10 July 2017 10:57:22 UTC