RE: The use of https: IRIs on the semantic web from Pete Rivett on 2017-07-07 (semantic-web@w3.org from July 2017)

From: Pete Rivett <pete.rivett@adaptive.com>
Date: Fri, 7 Jul 2017 22:03:29 +0000
To: Richard Smith <richard@ex-parrot.com>, "semantic-web@w3.org" <semantic-web@w3.org>
Message-ID: <98F37BCF7352AE42AA013633E8CFE9DD028040CB@MBX028-W1-CA-2.exch028.domain.local>
I agree with your points, Richard.
FWIW the Financial Industry Business Ontology (FIBO), which I'm involved in, uses https:// for all IRIs - see the published ontology files [1]. This is an extensive ontology developed by the Enterprise Data Management Council (EDMC) and supported by many major financial institutions.

Regards
Pete

[1] https://spec.edmcouncil.org/fibo/ontology/master/latest/tree.html 


Pete  Rivett (pete.rivett@adaptive.com)
CTO, Adaptive Inc
65 Enterprise, Aliso Viejo, CA 92656
cell: +1 949 338 3794 
Follow me on Twitter @rivettp or http://twitter.com/rivettp






-----Original Message-----
From: Richard Smith [mailto:richard@ex-parrot.com] 
Sent: Friday, July 7, 2017 2:35 PM
To: semantic-web@w3.org
Subject: The use of https: IRIs on the semantic web


I hope this is an appropriate mailing list to ask this question.  I'd be happy to be directed elsewhere if not.

I am defining a new vocabulary.  It's not an extension of an existing vocabulary, nor will it use the same domain as any existing vocabulary.  Should I use https: IRIs?

Every source I've consulted says I should prefer http: IRIs. 
This includes the Linked Data book [1], the W3 note on "Cool URIs" [2], and the W3 note on best practices for RDF vocabularies [3].

This surprises me slightly.  The world seems to be moving away from HTTP to HTTPS, yet I know of no vocabulary that uses https: IRIs, and none of the documents cited above even discuss the question.  I can find discussion on why ftp: or urn: are less well suited, but nothing about https:.

Even though IRIs on the semantic web are primarily identifiers rather than locators, certainly in the linked data world, the IRI is assumed to be a good place to look for more information about the entity, and various authorities recommend 303 redirects to documents like RDF or OWL schemas or other descriptive documents with further information.
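As a rough illustration of the 303 pattern mentioned above, here is a minimal sketch of how a server might choose the redirect target from the Accept header.  The base URL, the file suffixes and the function name see_other are all made up for the example; none of them come from the documents cited.

```python
# Hypothetical vocabulary layout; these URLs are assumptions for illustration.
VOCAB_BASE = "https://example.org/vocab"

# Media type -> document suffix (an assumed naming convention).
DOCUMENTS = {
    "text/turtle": ".ttl",
    "application/rdf+xml": ".rdf",
    "text/html": ".html",
}


def see_other(term, accept):
    """Return (status, headers) for a GET on the IRI of `term`."""
    # Pick the first media type in the Accept header that we can serve.
    # q-values are ignored to keep the sketch short.
    for item in accept.split(","):
        media_type = item.split(";")[0].strip()
        if media_type in DOCUMENTS:
            suffix = DOCUMENTS[media_type]
            break
    else:
        # Nothing matched: fall back to the human-readable page.
        suffix = DOCUMENTS["text/html"]
    location = f"{VOCAB_BASE}/{term}{suffix}"
    # 303 See Other: the IRI names the entity, Location names a document about it.
    return 303, {"Location": location, "Vary": "Accept"}
```

A client asking for Turtle would be sent to the .ttl document, while a browser would end up at the HTML page; the IRI of the term itself never returns a body.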

If the IRI is just an identifier, the choice of IRI scheme is largely irrelevant, and https: is neither better nor worse than http:.  If it's used as a locator, then at least the initial request to an http: IRI will be made over plain HTTP.  It may then be redirected to HTTPS, and HSTS headers may mean subsequent requests go directly over HTTPS, but the first request is still unencrypted.  This has the following potential problems:

* It is susceptible to a man-in-the-middle attack.  A
   malicious party could inject deliberately inconsistent
   schema information that may affect processing decisions
   made by applications, potentially causing a DoS or otherwise
   disrupting the user experience.

* ISPs may do a MITM themselves to inject adverts into
   content (e.g. [4]).  Really they oughtn't to do this for
   non-HTML content (well, they shouldn't do it at all, but
   that's another matter), but that relies on the ISP caring
   enough to get this right.

* The request might be tracked by ISPs.  The fact that a
   user is using an application that consults a particular
   schema is itself valuable information about the customer
   that can be sold to advertisers, a practice the US Senate
   recently voted to make legal [5].  People are increasingly
   privacy conscious and want to minimise this.
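The exposure of that first request can be modelled with a small sketch.  It assumes a server that redirects every http: request to the same URL under https:, and a client that keeps a set of hosts it has already seen an HSTS header from; the function names and example.org are invented for the illustration.

```python
from urllib.parse import urlsplit


def request_sequence(iri, hsts_hosts):
    """List the URLs a client contacts when dereferencing `iri`.

    `hsts_hosts` is the set of hosts the client already knows, from an
    earlier HSTS header, must be reached over HTTPS.  The model assumes
    the server redirects every http: request to its https: equivalent.
    """
    parts = urlsplit(iri)
    secure = parts._replace(scheme="https").geturl()
    if parts.scheme == "https" or parts.hostname in hsts_hosts:
        # Either the IRI is already https:, or HSTS upgrades it locally.
        return [secure]
    # First contact with an http: IRI: one plaintext hop, then the redirect.
    return [iri, secure]


def plaintext_requests(iri, hsts_hosts=frozenset()):
    """Return the requests in the sequence that travel unencrypted."""
    return [url for url in request_sequence(iri, hsts_hosts)
            if url.startswith("http://")]
```

In this model an http: IRI exposes one request on first contact and none once the host is in the HSTS set, whereas an https: IRI never exposes any.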

These are all mitigated to a significant degree by using
https: IRIs.  So far as I can see, the counter-arguments are as follows:

* Not all HTTP client libraries support TLS, but few if any
   only support HTTP over TLS.

* Best practice with TLS changes more frequently than plain
   HTTP.  An HTTP client from 20 years ago will still
   probably work, but TLS has moved on a lot and few servers
   now support SSL 3.

* MITM attacks and injection can still happen over TLS, as
   the "Superfish" fiasco demonstrated [6], and are probably
   better prevented with digital signatures.

* Use of TLS at the transport level prevents HTTP caching on
   intermediate servers, unless a trusted root certificate is
   used.

* ISP tracking still happens over TLS because the SNI field
   is not encrypted, and encrypted SNI seems to have been
   dropped from TLS 1.3 [7].

* Whether TLS is used at the transport level should be an
   implementation detail that is not exposed in the
   vocabulary, a point Berners-Lee has made forcefully [8].

* In the future it's likely that the functionality of
   HSTS will be put in DNS [9].

These seem fairly weak arguments to me.  Digital signatures can be used regardless of whether the resource was fetched over TLS, and adding authentication at the top of the semantic web stack shouldn't preclude encryption at the bottom.  HSTS-in-DNS technologies, which in conjunction with DNSSEC would alleviate the problem, seem to be stalled, and I've seen no drafts on the subject since 2011 [9].

I'm wondering whether there's something I'm missing, because almost universally people are still defining vocabularies using http: IRIs.

I can see why converting an existing vocabulary from http: to https: would be difficult, to the point of being undesirable; I can see too that there are logistical conveniences to having all vocabulary IRIs on a given domain use the same IRI scheme, both points Berners-Lee makes in [8].  But these don't apply to new vocabularies.

Is there some other consideration I'm missing?

Richard


[1] http://linkeddatabook.com/editions/1.0/#htoc10
[2] https://www.w3.org/TR/cooluris/
[3] https://www.w3.org/TR/swbp-vocab-pub/
[4] http://preview.tinyurl.com/om3xxdb
[5] http://preview.tinyurl.com/ybeka8yv
[6] https://brennan.io/2015/02/20/superfish-explained/
[7] https://www.ietf.org/mail-archive/web/tls/current/msg23251.html
[8] https://www.w3.org/DesignIssues/Security-NotTheS.html
[9] https://tools.ietf.org/html/draft-hallambaker-esrv-01
Received on Friday, 7 July 2017 22:04:16 UTC