RE: xsd:anyURI, rdf URIs, information resources from Booth, David (HP Software - Boston) on 2008-07-02 (public-awwsw@w3.org from July 2008)

From: Booth, David (HP Software - Boston) <dbooth@hp.com>
Date: Wed, 2 Jul 2008 18:05:49 +0000
To: Alan Ruttenberg <alanruttenberg@gmail.com>, "public-awwsw@w3.org" <public-awwsw@w3.org>
CC: Stasinos Konstantopoulos <konstant@iit.demokritos.gr>, Ivan Herman <ivan@w3.org>, Dan Connolly <connolly@w3.org>, Phil Archer <parcher@icra.org>, W3C SW Coordination Group <w3c-semweb-cg@w3.org>, Matt Womer <mdw@w3.org>, "Peter F. Patel-Schneider" <pfps@research.bell-labs.com>
Message-ID: <184112FE564ADF4F8F9C3FA01AE50009FCF7E25961@G1W0486.americas.hpqcorp.net>
Hi Alan,

This analysis looks rather tricky, and I'm not sure I've properly understood it all, but here are some comments.

> From: Alan Ruttenberg
>
> This note is triggered by a discussion on the SWCG group about POWDER
> and its desire to discuss, in OWL, the relation of a URI to the thing
> it denotes. Specifically they want to have a regular expression on
> the URI define a class of resources.
> I think this is a bit tricky, and it raises an interesting, and
> possibly problematic, interaction between http and rdf.
> It is also prompted by a comment in the OWL working group suggesting
> xsd:anyURI is a subclass of xsd:string.
>
> Suppose we have:
>
> <http://purl.org/obo/obi.owl>  rdf:has_name "http://purl.org/obo/obi.owl"^^xsd:anyURI

One thing you didn't mention: where would such an assertion come from?  In the n3 ontology and rules that I drafted to describe the semantics of HTTP, I've assumed that the *parser* of an RDF document would implicitly add such an assertion to the triples that were explicitly asserted by the document, because it relates the syntax of the URI to the semantics of the resource it denotes:
http://esw.w3.org/topic/AwwswDboothsRules
[[
 # We assume that the parser has automatically asserted:

 <http://example/people#dan> uri:hasURI
         "http://example/people#dan"^^xsd:anyURI .
]]

>
> Now.
>
> As I understand it, URIs in RDF are compared character by character,
> as lexically written.
> http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#dfn-URI-reference
>
> > Two RDF URI references are equal if and only if they compare as
> > equal, character by character, as Unicode strings.
> >
>
> In XML Schema terms, the mapping of lexical space to value space of
> rdf:anyURI (if one was defined) would be identity.

Is that really true?  Later in your message you seem to point out that RDF-MT assumes that there could be some kind of URI normalization applied before the semantics are examined:
http://www.w3.org/TR/rdf-mt/#urisandlit
[[
This document does not take any position on the way that URI references may be composed from other expressions, e.g. from relative URIs or QNames; the semantics simply assumes that such lexical issues have been resolved in some way that is globally coherent, so that a single URI reference can be taken to have the same meaning wherever it occurs.
]]

> For xsd:anyURI the
> lexical to value mapping must take into account the (scheme
> dependent) unescaping of percent encoded characters. (This would seem
> to me to make the pattern facet of xsd:anyURI rather difficult to
> implement in practice, as the pattern matching happens in
> value space).

Why would that make it difficult?  Wouldn't it just mean that URIs would be normalized (to value space) before a pattern is applied?  Or are you saying that this normalization would be difficult?  It's true that normalization isn't free, as the URI spec points out:
http://tools.ietf.org/html/rfc3986#section-6
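
As a rough illustration of "normalize to value space, then apply the pattern", here is a minimal Python sketch. It assumes that normalization means only decoding percent-escapes of unreserved characters, which is one of the steps described in RFC 3986 section 6. The helper names decode_unreserved and matches_pattern and the example pattern are hypothetical, and a real xsd:anyURI processor may behave differently.

```python
import re
import string

# Unreserved characters per RFC 3986 section 2.3.
UNRESERVED = frozenset(string.ascii_letters + string.digits + "-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri: str) -> str:
    """Decode %XX escapes only when they stand for an unreserved character."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        # Reserved characters such as "/" keep their escape (hex uppercased).
        return ch if ch in UNRESERVED else "%" + match.group(1).upper()
    return _PCT.sub(repl, uri)


def matches_pattern(lexical: str, pattern: str) -> bool:
    """Hypothetical facet check: normalize first, then match the whole value."""
    return re.fullmatch(pattern, decode_unreserved(lexical)) is not None


# Example pattern (an assumption, not taken from POWDER or XML Schema).
PATTERN = r"http://neurocommons\.org/page/[A-Za-z_]+"

for uri in (
    "http://neurocommons.org/page/Main%5FPage",
    "http://neurocommons.org/page/Main_Page",
    "http://neurocommons.org/page%2FMainPage",
):
    print(uri, matches_pattern(uri, PATTERN))
```

Under these assumptions the first two URIs match the pattern and the third does not, because "%2F" encodes a reserved character and is left escaped.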

>
>
> http://www.w3.org/TR/xmlschema-2/#rf-pattern
> > *pattern* provides for:
> >
> > Constraining a *value space* to values that are denoted by literals
> > which match a specific *regular expression*.
> >
>
>
> So compare:
>
> 1. http://neurocommons.org/page/Main%5FPage
> 2. http://neurocommons.org/page/Main_Page
> 3. http://neurocommons.org/page%2FMainPage
>
> All three are different URIs, but 1 and 2, but not 3, will get you to
> the same web page.
>
> Specifically, the value space of xsd:anyURI is the canonicalized URI,
> that is, the unescaped version, as far as I can tell, and this is
> dependent on the scheme, as escaping and unescaping is scheme
> dependent. By my read, 1 and 2 are equal in value space and the value
> is string-equal to #2.
>
> However, RDF URIs don't work like this. Effectively, equality (a
> value comparison) is checked in what corresponds to the *lexical*
> space of anyURI. So these are three different URIs.

But it sounds like RDF-MT admits the possibility of normalization before the semantics is applied, i.e., before the character-by-character comparison of URIs:
http://www.w3.org/TR/rdf-mt/#urisandlit
[[
This document does not take any position on the way that URI references may be composed from other expressions, e.g. from relative URIs or QNames; the semantics simply assumes that such lexical issues have been resolved in some way that is globally coherent, so that a single URI reference can be taken to have the same meaning wherever it occurs.
]]
However I'm not certain that I'm reading that correctly.
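
To make the two readings concrete, the following Python sketch compares the three example URIs both ways: character by character, as RDF Concepts specifies, and after a normalization step applied first. The normalization used here (decoding only escaped unreserved characters) is an assumption chosen for illustration, and rdf_equal, normalized_equal and decode_unreserved are hypothetical names.

```python
import re
import string

# Unreserved characters per RFC 3986 section 2.3.
UNRESERVED = frozenset(string.ascii_letters + string.digits + "-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri: str) -> str:
    """Decode %XX escapes only when they stand for an unreserved character."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else "%" + match.group(1).upper()
    return _PCT.sub(repl, uri)


def rdf_equal(a: str, b: str) -> bool:
    """Equality as stated in RDF Concepts: compare as Unicode strings."""
    return a == b


def normalized_equal(a: str, b: str) -> bool:
    """Equality if normalization were applied before comparison (assumption)."""
    return decode_unreserved(a) == decode_unreserved(b)


URI_1 = "http://neurocommons.org/page/Main%5FPage"
URI_2 = "http://neurocommons.org/page/Main_Page"
URI_3 = "http://neurocommons.org/page%2FMainPage"

for a, b in ((URI_1, URI_2), (URI_2, URI_3)):
    print(a, b, rdf_equal(a, b), normalized_equal(a, b))
```

With plain string comparison all three URIs are distinct. With the assumed normalization, 1 and 2 become equal while 3 stays different.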

>
> These all say the same thing:
>
> <http://purl.org/obo/obi.owl>  rdf:has_name "http://purl.org/obo/obi.owl"^^xsd:anyURI
> <http://purl.org/obo/obi.owl>  rdf:has_name "http://purl.org/obo/obi%2Eowl"^^xsd:anyURI
> <http://purl.org/obo/obi.owl>  rdf:has_name "http://purl.org/obo/obi.%6Fwl"^^xsd:anyURI
> <http://purl.org/obo/obi.owl>  rdf:has_name "http://purl.org/obo/obi.ow%77"^^xsd:anyURI
> ...
>
> IF has_name's range is xsd:anyURI. But depending on perhaps the type
> of <http://purl.org/obo/obi.owl>  either one or all of these might be
> true.
>
> <http://purl.org/obo/obi.owl>  rdf:has_name "http://purl.org/obo/obi.owl"
> <http://purl.org/obo/obi.owl>  rdf:has_name "http://purl.org/obo/obi%2Eowl"
> <http://purl.org/obo/obi.owl>  rdf:has_name "http://purl.org/obo/obi.%6Fwl"
> <http://purl.org/obo/obi.owl>  rdf:has_name "http://purl.org/obo/obi.ow%77"

I can see that if you write
"http://purl.org/obo/obi%2Eowl" as an untyped string literal then its mapping to value space would be direct, i.e., not normalized, whereas if you write
"http://purl.org/obo/obi%2Eowl"^^xsd:anyURI the URI normalization would be applied in going to the value space for type xsd:anyURI.

But I don't see why the type of <http://purl.org/obo/obi.owl> has much bearing on whether any of the above are true.  Can you explain?

>
> --------------------------
>
> Suppose I have a resource <http://neurocommons.org/page/Main_Page>
> and I make a request for
> GET http://neurocommons.org/page/Main%5FPage (implicitly xsd:anyURI)
>
> Should or should not the response be the same as if I did
> GET http://neurocommons.org/page/Main_Page (implicitly xsd:anyURI)
>
> In fact, I will get the same responses (always, and by definition of
> the http protocol)
>
> If <http://purl.org/obo/obi.owl> is an IR, and it is strictly defined
> by the function that maps to representations, then we would conclude
> that <http://neurocommons.org/page/Main_Page> owl:sameAs
> <http://neurocommons.org/page/Main%5FPage>

Not quite.  You can conclude that the IR *aspects* of those two resources are identical.  But a resource can have characteristics of an IR (i.e., it can satisfy all of the assertions required for it to qualify as being an IR) *and* it can have characteristics of other things (i.e., other assertions can also be true of it).

>
> However, what should happen if <http://purl.org/obo/obi.owl> is not
> an IR?
>
> According to RDF, <http://neurocommons.org/page/Main%5FPage> a priori
> could have *absolutely nothing* to do with
> <http://neurocommons.org/page/Main_Page>.  The above owl:sameAs is
> concluded not based on
> anything in RDF, but by analysis of HTTP. However,  we have no
> separate way to ask for these two resources using HTTP.

The fact that the URI declarations of those two URIs turn out to be the exact same page (via a 303 redirect) doesn't matter.  If the page makes assertions involving one of those URIs and not the other, then the other is unconstrained: you don't know what it denotes in RDF, though you might guess.

> One might
> argue that since the 303 response is just "see other" or "you might
> be interested in this too", there is no harm done. (Using "#" doesn't
> fix this, btw). But if people put RDF there, and we believe the RDF,
> then there could be mistakes easily made.

I don't follow what you mean.  What potential harm?  What mistakes?

>
> So I think we should be worried about the RDF/Web connection if my
> analysis is right. a) This might be turned into an argument why HTTP
> isn't appropriate for SemWeb use.

I don't follow that.  Can you explain?

> b) It points to a possible
> *actual* difference between IRs and non IRs that ought to be
> measurable  in some sense (first that I know of, other than the
> tautological 200 response).

The difference I see is this.  If you get a 200 response when you dereference
http://neurocommons.org/page/Main%5FPage
then you learn *both* that
http://neurocommons.org/page/Main%5FPage
denotes an IR *and* you learn that
http://neurocommons.org/page/Main_Page
denotes an IR, and the IR aspects of the resource are the same.
Whereas if you get a 303 response that redirects to a URI declaration page, then you might only learn what *one* of those URIs denotes.

> c) It makes life difficult for those poor
> POWDER folks trying to figure out how to use OWL to do their bidding.

Hmm.

> d) Means we have to look a little more carefully at dbooth's hasURI
> relation.

At present the range of uri:hasURI is defined as xsd:anyURI:
http://esw.w3.org/topic/AwwswDboothsRules
[[
 uri:hasURI a rdf:Property ;
    rdf:label "hasURI" ;
    rdf:comment ". . . " ;
    rdfs:subPropertyOf log:uri ;
    # rdfs:domain rdfs:Resource ;
    rdfs:range xsd:anyURI .
]]
which, if your analysis of the xsd:anyURI type is correct, uses URI normalization in going to value space.  So if
<http://neurocommons.org/page/Main%5FPage> and
<http://neurocommons.org/page/Main_Page> really are intended to be treated as different URIs in the RDF semantics, then I guess I should change the range of uri:hasURI to be an untyped literal string.

>
> I have assumed in the above, that in the absence of a crystal clear
> stance on the issue of URI references in RDF-MT
> > This document does not take any position on the way that URI
> > references may be composed from other expressions, e.g. from
> > relative URIs or QNames; the semantics simply assumes that such
> > lexical issues have been resolved in some way that is globally
> > coherent, so that a single URI reference can be taken to have the
> > same meaning wherever it occurs.
> >
> >
>
> the RDF/XML equality conditions on RDF URI references are normative.
>
> If you wanted to repair this in a quick, hacky way, one could amend
> the RDF and/or RDF/XML specifications so that they take into
> account the http escaping rules for names.

Are you saying that the RDF/XML spec does not specify URI normalization, but RDF-MT admits the possibility of URI normalization, and hence there is an ambiguity in determining which URI(s) denote a particular resource?  So for example, if the following n3 assertion is parsed:

  <http://neurocommons.org/page/Main%5FPage> _:a _:b .

we will not know whether the parser will assert

  <http://neurocommons.org/page/Main%5FPage> uri:hasURI
       "http://neurocommons.org/page/Main%5FPage"^^xsd:anyURI .

or

  <http://neurocommons.org/page/Main%5FPage> uri:hasURI
       "http://neurocommons.org/page/Main_Page"^^xsd:anyURI .

or both, and we will not know whether

  <http://neurocommons.org/page/Main_Page> _:a _:b .

has been asserted.  Is that what you mean?
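
The ambiguity can be sketched as a parser policy choice. In the Python below, implicit_has_uri is a hypothetical function, and the three policy names are invented for illustration. It returns the uri:hasURI triples a parser might add for a subject URI, depending on whether it records the URI as written, a normalized form, or both. The normalization is again assumed to be decoding of escaped unreserved characters only.

```python
import re
import string

# Unreserved characters per RFC 3986 section 2.3.
UNRESERVED = frozenset(string.ascii_letters + string.digits + "-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri: str) -> str:
    """Decode %XX escapes only when they stand for an unreserved character."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else "%" + match.group(1).upper()
    return _PCT.sub(repl, uri)


def implicit_has_uri(uri_as_written: str, policy: str) -> set:
    """Return the implicit uri:hasURI triples under a given (hypothetical) policy."""
    normalized = decode_unreserved(uri_as_written)
    if policy == "as-written":
        forms = {uri_as_written}
    elif policy == "normalized":
        forms = {normalized}
    elif policy == "both":
        forms = {uri_as_written, normalized}
    else:
        raise ValueError(f"unknown policy: {policy}")
    # The subject stays as written; only the literal's lexical form varies.
    return {
        (f"<{uri_as_written}>", "uri:hasURI", f'"{form}"^^xsd:anyURI')
        for form in forms
    }


SUBJECT = "http://neurocommons.org/page/Main%5FPage"
for policy in ("as-written", "normalized", "both"):
    for triple in sorted(implicit_has_uri(SUBJECT, policy)):
        print(policy, *triple)
```

Nothing in the sketch decides which policy is correct; it only shows that the three outcomes listed above are distinguishable sets of triples.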

>
> http://www.w3.org/TR/xmlschema-2/#anyURI
> > 3.2.17 anyURI
> >
> > [Definition:]   anyURI represents a Uniform Resource Identifier
> > Reference (URI). An anyURI value can be absolute or relative, and
> > may have an optional fragment identifier (i.e., it may be a URI
> > Reference). This type should be used to specify the intention that
> > the value fulfills the role of a URI as defined by [RFC 2396], as
> > amended by [RFC 2732].
> >
> > The mapping from anyURI values to URIs is as defined by the URI
> > reference escaping procedure defined in Section 5.4 Locator
> > Attribute of [XML Linking Language] (see also Section 8 Character
> > Encoding in URI References of [Character Model]). This means that a
> > wide range of internationalized resource identifiers can be
> > specified when an anyURI is called for, and still be understood as
> > URIs per [RFC 2396], as amended by [RFC 2732], where appropriate to
> > identify resources.
> >
> > Note:  Section 5.4 Locator Attribute of [XML Linking Language]
> > requires that relative URI references be absolutized as defined in
> > [XML Base] before use. This is an XLink-specific requirement and is
> > not appropriate for XML Schema, since neither the *lexical space*
> > nor the *value space* of the anyURI type are restricted to absolute
> > URIs. Accordingly absolutization must not be performed by schema
> > processors as part of schema validation.
> > Note:  Each URI scheme imposes specialized syntax rules for URIs in
> > that scheme, including restrictions on the syntax of allowed
> > fragment identifiers. Because it is impractical for processors to
> > check that a value is a context-appropriate URI reference, this
> > specification follows the lead of [RFC 2396] (as amended by [RFC
> > 2732]) in this matter: such rules and restrictions are not part of
> > type validity and are not checked by *minimally conforming*
> > processors. Thus in practice the above definition imposes only very
> > modest obligations on *minimally conforming* processors.
>
>
> http://www.cs.tut.fi/~jkorpela/rfc/2396/full.html#2.4.2
>
> > 2.4.2. When to Escape and Unescape
> >
> >    A URI is always in an "escaped" form, since escaping or unescaping a
> >    completed URI might change its semantics.  Normally, the only time
> >    escape encodings can safely be made is when the URI is being created
> >    from its component parts; each component may have its own set of
> >    characters that are reserved, so only the mechanism responsible for
> >    generating or interpreting that component can determine whether or
> >    not escaping a character will change its semantics.  Likewise, a URI
> >    must be separated into its components before the escaped characters
> >    within those components can be safely decoded.
> >
> >    In some cases, data that could be represented by an unreserved
> >    character may appear escaped; for example, some of the unreserved
> >    "mark" characters are automatically escaped by some systems.  If the
> >    given URI scheme defines a canonicalization algorithm, then
> >    unreserved characters may be unescaped according to that algorithm.
> >    For example, "%7e" is sometimes used instead of "~" in an http URL
> >    path, but the two are equivalent for an http URL.
> >




David Booth, Ph.D.
HP Software
+1 617 629 8881 office  |  dbooth@hp.com
http://www.hp.com/go/software

Statements made herein represent the views of the author and do not necessarily represent the official views of HP unless explicitly so stated.
Received on Wednesday, 2 July 2008 18:10:25 UTC