The use of https: IRIs on the semantic web

I hope this is an appropriate mailing list to ask this 
question.  I'd be happy to be directed elsewhere if not.

I am defining a new vocabulary.  It's not an extension of an 
existing vocabulary, nor will it use the same domain as any 
existing vocabulary.  Should I use https: IRIs?

Every source I've consulted says I should prefer http: IRIs. 
This includes the Linked Data book [1], the W3C note on "Cool 
URIs" [2], and the W3C note on best practices for RDF 
vocabularies [3].

This surprises me slightly.  The world seems to be moving 
away from HTTP to HTTPS, yet I know of no vocabulary that 
uses https: IRIs, and none of the documents cited above 
even discuss the question.  I can find discussion on why 
ftp: or urn: are less well suited, but nothing about https:.

Even though IRIs on the semantic web are primarily 
identifiers rather than locators, certainly in the linked 
data world, the IRI is assumed to be a good place to look 
for more information about the entity, and various authorities 
recommend 303 redirects to documents like RDF or OWL schemas 
or other descriptive documents with further information.
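
As a rough illustration of the lookup described above, here 
is a minimal Python sketch of a client following a 303 
redirect from an entity IRI to a descriptive document.  The 
host vocab.example.org, the paths, and the in-memory 
RESPONSES table are all hypothetical stand-ins for a real 
server; no network access is involved.

```python
from urllib.parse import urldefrag

# Hypothetical server behaviour, keyed by request URL:
# (status code, Location header or None).
RESPONSES = {
    "http://vocab.example.org/ns/Person":
        (303, "http://vocab.example.org/doc/ns.ttl"),
    "http://vocab.example.org/doc/ns.ttl":
        (200, None),
}

def dereference(iri, responses, max_hops=5):
    """Return the list of URLs requested while looking up `iri`."""
    url = urldefrag(iri).url  # a fragment is never sent to the server
    chain = [url]
    for _ in range(max_hops):
        status, location = responses[url]
        if status != 303:
            break  # reached a document (or an error); stop here
        url = location
        chain.append(url)
    return chain

chain = dereference("http://vocab.example.org/ns/Person", RESPONSES)
```

Here chain holds two URLs: the entity IRI itself, then the 
document the 303 points to.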

If the IRI is just an identifier, the choice of IRI scheme 
is largely irrelevant, and https: is neither better nor 
worse than http:.  If it's used as a locator, then at least 
the initial request to an http: IRI will be made over plain 
HTTP.  It may then be redirected to HTTPS, and HSTS headers 
may mean subsequent requests go directly over HTTPS, but the 
first request is still unencrypted.  This has the following 
potential problems:

* It is susceptible to a man-in-the-middle attack.  A
   malicious party could inject deliberately inconsistent
   schema information that may affect processing decisions
   made by applications, potentially causing a DoS or otherwise
   disrupting the user experience.

* ISPs may do a MITM themselves to inject adverts into
   content (e.g. [4]).  Really they oughtn't to do this for
   non-HTML content (well, they shouldn't do it at all, but
   that's another matter), but that relies on the ISP caring
   enough to get this right.

* The request might be tracked by ISPs.  The fact that a
   user is using an application that consults a particular
   schema is itself valuable information about the customer
   that can be sold to advertisers, a practice the US Senate
   recently voted to make legal [5].  People are increasingly
   privacy-conscious and want to minimise this.
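
To make the request sequence described before this list 
concrete, here is a minimal Python sketch of which requests 
for an http: IRI travel in cleartext.  The host name and the 
hsts_hosts set are hypothetical, and the HSTS handling is 
deliberately simplified: it ignores ports, max-age and 
includeSubDomains.

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit

def effective_url(iri, hsts_hosts):
    """Return the URL a client would actually request for `iri`.

    An http: URL is upgraded to https: only if the client has
    already recorded an HSTS policy for that host.
    """
    parts = urlsplit(urldefrag(iri).url)  # drop any fragment
    if parts.scheme == "http" and parts.hostname in hsts_hosts:
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)

def is_cleartext(url):
    """True if a request for `url` would go over plain HTTP."""
    return urlsplit(url).scheme == "http"

vocab_iri = "http://vocab.example.org/ns#Person"  # hypothetical IRI
hsts_hosts = set()  # a fresh client knows no HSTS policies

# First lookup: nothing is known about the host, so plain HTTP is used.
first = effective_url(vocab_iri, hsts_hosts)

# Assume the server redirected to HTTPS and sent an HSTS header there,
# so the client now remembers the host.
hsts_hosts.add("vocab.example.org")

# Later lookups of the same http: IRI are upgraded before being sent.
second = effective_url(vocab_iri, hsts_hosts)

print(first, is_cleartext(first))
print(second, is_cleartext(second))
```

With an https: IRI, is_cleartext() would be false from the 
very first request, whatever the client had seen before.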

These are all mitigated to a significant degree by using 
https: IRIs.  So far as I can see, the counter-arguments are 
as follows:

* Not all HTTP client libraries support TLS, whereas few if
   any support only HTTP over TLS.

* Best practice with TLS changes more frequently than plain
   HTTP.  An HTTP client from 20 years ago will still
   probably work, but TLS has moved on a lot and few servers
   now support SSL 3.

* MITM attacks and injection can still happen over TLS, as
   the "Superfish" fiasco demonstrated [6], and are probably
   better prevented with digital signatures.

* Use of TLS at the transport level prevents HTTP caching on
   intermediate servers, unless a trusted root certificate is
   used.

* ISP tracking still happens over TLS because the SNI field
   is not encrypted, and encrypted SNI seems to have been
   dropped from TLS 1.3 [7].

* Whether TLS is used at the transport level should be an
   implementation detail that is not exposed in the
   vocabulary, a point Berners-Lee has made forcefully [8].

* In the future it's likely that the functionality of
   HSTS will be put in DNS [9].

These seem fairly weak arguments to me.  Digital signatures 
can be used regardless of whether the resource was fetched 
over TLS, and adding authentication at the top of the 
semantic web stack shouldn't preclude encryption at the 
bottom.  HSTS-in-DNS technologies, which in conjunction with 
DNSSEC would alleviate the problem, seem to be stalled, and 
I've seen no drafts on the subject since 2011 [9].

I'm wondering whether there's something I'm missing, because 
almost universally people are still defining vocabularies 
using http: IRIs.

I can see why converting an existing vocabulary from http: 
to https: would be difficult, to the point of being 
undesirable; I can see too that there are logistic 
conveniences to having all vocabulary IRIs on a given domain 
use the same IRI scheme, both points Berners-Lee makes in 
[8].  But these don't apply to new vocabularies.

Is there some other consideration I'm missing?

Richard


[1] http://linkeddatabook.com/editions/1.0/#htoc10
[2] https://www.w3.org/TR/cooluris/
[3] https://www.w3.org/TR/swbp-vocab-pub/
[4] http://preview.tinyurl.com/om3xxdb
[5] http://preview.tinyurl.com/ybeka8yv
[6] https://brennan.io/2015/02/20/superfish-explained/
[7] https://www.ietf.org/mail-archive/web/tls/current/msg23251.html
[8] https://www.w3.org/DesignIssues/Security-NotTheS.html
[9] https://tools.ietf.org/html/draft-hallambaker-esrv-01

Received on Friday, 7 July 2017 21:37:33 UTC