xsd:anyURI, rdf URIs, information resources from Alan Ruttenberg on 2008-06-29 (public-awwsw@w3.org from June 2008)

From: Alan Ruttenberg <alanruttenberg@gmail.com>
Date: Sun, 29 Jun 2008 05:08:06 -0400
To: public-awwsw@w3.org
Cc: Stasinos Konstantopoulos <konstant@iit.demokritos.gr>, Ivan Herman <ivan@w3.org>, Dan Connolly <connolly@w3.org>, Phil Archer <parcher@icra.org>, W3C SW Coordination Group <w3c-semweb-cg@w3.org>, Matt Womer <mdw@w3.org>, "Peter F. Patel-Schneider" <pfps@research.bell-labs.com>
Message-Id: <7B9BE3ED-3443-4346-998E-DE10E30F6FA7@gmail.com>
This note is triggered by a discussion on the SWCG group about POWDER
and its desire to discuss, in OWL, the relation of a URI to the thing
it denotes. Specifically, they want to have a regular expression on
the URI define a class of resources.
I think this is a bit tricky, and it raises an interesting, and
possibly problematic, interaction between HTTP and RDF.
It is also prompted by a comment in the OWL working group suggesting
that xsd:anyURI is a subclass of xsd:string.

Suppose we have:

<http://purl.org/obo/obi.owl>  rdf:has_name "http://purl.org/obo/obi.owl"^^xsd:anyURI

Now.

As I understand it, URIs in RDF are compared character by character,  
as lexically written.
http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#dfn-URI-reference

> Two RDF URI references are equal if and only if they compare as  
> equal, character by character, as Unicode strings.
>

In XML Schema terms, the mapping from lexical space to value space of
rdf:anyURI (if one were defined) would be the identity. For xsd:anyURI,
the lexical-to-value mapping must take into account the
(scheme-dependent) unescaping of percent-encoded characters. (This
would seem to me to make the pattern facet of xsd:anyURI rather
difficult to implement in practice, as the pattern matching happens in
value space.)
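
To make the difficulty concrete, here is a minimal Python sketch. The
pattern is hypothetical (it is not taken from POWDER or any schema), and
urllib.parse.unquote is only a stand-in for the lexical-to-value mapping,
which is an assumption on my part. It shows that the same literal can
fail the pattern as written and pass it once unescaped.

```python
import re
from urllib.parse import unquote

# Hypothetical pattern, chosen only for illustration.
PATTERN = re.compile(r"http://neurocommons\.org/page/Main_Page")

lexical = "http://neurocommons.org/page/Main%5FPage"
# Assumption: plain percent-decoding approximates the lexical-to-value mapping.
value = unquote(lexical)

# The same pattern gives different answers depending on the space it is applied to.
matches_lexical = PATTERN.fullmatch(lexical) is not None
matches_value = PATTERN.fullmatch(value) is not None
print(matches_lexical, matches_value)
```

Whichever space an implementation picks, a pattern author has to know
which one it is in order to write a pattern that behaves as intended.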


http://www.w3.org/TR/xmlschema-2/#rf-pattern
> ·pattern· provides for:
>
> Constraining a ·value space· to values that are denoted by literals  
> which match a specific ·regular expression·.
>


So compare:

1. http://neurocommons.org/page/Main%5FPage
2. http://neurocommons.org/page/Main_Page
3. http://neurocommons.org/page%2FMainPage

All three are different URIs, but 1 and 2 (and not 3) will get you to
the same web page.
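
Why 3 behaves differently can be sketched in Python, following the rule
in RFC 2396 section 2.4.2 (quoted at the end of this note) that a URI
must be split into components before its escapes are decoded. The two
helper functions are hypothetical names introduced for illustration.

```python
from urllib.parse import urlsplit, unquote

def path_segments(uri):
    """Split the path into segments first, then decode each segment."""
    raw_path = urlsplit(uri).path
    return [unquote(segment) for segment in raw_path.split("/") if segment]

def path_segments_decoded_too_early(uri):
    """Decode the whole URI first, then split: an escaped "/" becomes a separator."""
    raw_path = urlsplit(unquote(uri)).path
    return [segment for segment in raw_path.split("/") if segment]

uri = "http://neurocommons.org/page%2FMainPage"
print(path_segments(uri))                    # ['page/MainPage']
print(path_segments_decoded_too_early(uri))  # ['page', 'MainPage']
```

Decoded in the right order, URI 3 has a single path segment containing
a slash, so it does not name the same path as a URI with two segments.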

Specifically, the value space of xsd:anyURI is the canonicalized URI,  
that is, the unescaped version, as far as I can tell, and this is  
dependent on the scheme, as escaping and unescaping is scheme  
dependent. By my read, 1 and 2 are equal in value space and the value  
is string-equal to #2.

However, RDF URIs don't work like this. Effectively, equality (a  
value comparison) is checked in what corresponds to the *lexical*  
space of anyURI. So these are three different URIs.
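
The two notions of equality can be put side by side in a short Python
sketch. The normalization below is an assumption: it decodes only
escapes of unreserved characters, using the RFC 3986 unreserved set
rather than anything defined by RDF or XML Schema, and the function name
is made up for this example.

```python
import re

# Assumption: the RFC 3986 unreserved set (letters, digits, "-", ".", "_", "~").
UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def decode_unreserved(uri):
    """Decode %XX escapes only when they stand for an unreserved character."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        # Leave escapes of reserved characters in place, upper-casing the hex digits.
        return char if char in UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", replace, uri)

uris = [
    "http://neurocommons.org/page/Main%5FPage",
    "http://neurocommons.org/page/Main_Page",
    "http://neurocommons.org/page%2FMainPage",
]

# RDF-style comparison: character by character, so all three are distinct.
rdf_distinct = len(set(uris))

# After decoding unreserved escapes, 1 and 2 collapse and 3 stays apart.
normalized = [decode_unreserved(u) for u in uris]
print(rdf_distinct, len(set(normalized)))
```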

These all say the same thing:

<http://purl.org/obo/obi.owl>  rdf:has_name "http://purl.org/obo/obi.owl"^^xsd:anyURI
<http://purl.org/obo/obi.owl>  rdf:has_name "http://purl.org/obo/obi%2Eowl"^^xsd:anyURI
<http://purl.org/obo/obi.owl>  rdf:has_name "http://purl.org/obo/obi.%6Fwl"^^xsd:anyURI
<http://purl.org/obo/obi.owl>  rdf:has_name "http://purl.org/obo/obi.o%77l"^^xsd:anyURI
...

if the range of has_name is xsd:anyURI. But, depending perhaps on the
type of <http://purl.org/obo/obi.owl>, either one or all of these
might be true:

<http://purl.org/obo/obi.owl>  rdf:has_name "http://purl.org/obo/obi.owl"
<http://purl.org/obo/obi.owl>  rdf:has_name "http://purl.org/obo/obi%2Eowl"
<http://purl.org/obo/obi.owl>  rdf:has_name "http://purl.org/obo/obi.%6Fwl"
<http://purl.org/obo/obi.owl>  rdf:has_name "http://purl.org/obo/obi.o%77l"
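
The contrast between the typed and the plain literals can be sketched
in Python. This is an illustration under assumptions, not an RDF
implementation: literal_value is a hypothetical helper, and
percent-decoding with urllib.parse.unquote stands in for the xsd:anyURI
lexical-to-value mapping as I read it above.

```python
from urllib.parse import unquote

XSD_ANYURI = "http://www.w3.org/2001/XMLSchema#anyURI"

def literal_value(lexical, datatype=None):
    """Decoded string for xsd:anyURI (an assumption); unchanged for plain literals."""
    if datatype == XSD_ANYURI:
        return unquote(lexical)
    return lexical

lexical_forms = [
    "http://purl.org/obo/obi.owl",
    "http://purl.org/obo/obi%2Eowl",
    "http://purl.org/obo/obi.%6Fwl",
]

# Typed as xsd:anyURI, the forms denote one value; as plain literals, three.
typed_values = {literal_value(form, XSD_ANYURI) for form in lexical_forms}
plain_values = {literal_value(form) for form in lexical_forms}
print(len(typed_values), len(plain_values))
```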

--------------------------

Suppose I have a resource <http://neurocommons.org/page/Main_Page>
and I make a request for
GET http://neurocommons.org/page/Main%5FPage (implicitly xsd:anyURI)

Should or should not the response be the same as if I did
GET http://neurocommons.org/page/Main_Page (implicitly xsd:anyURI)

In fact, I will get the same response (always, and by definition of
the HTTP protocol).

If <http://purl.org/obo/obi.owl> is an IR, and it is strictly defined
by the function that maps to representations, then we would conclude
that
<http://neurocommons.org/page/Main_Page> owl:sameAs <http://neurocommons.org/page/Main%5FPage>

However, what should happen if <http://purl.org/obo/obi.owl> is not  
an IR?

According to RDF, <http://neurocommons.org/page/Main%5FPage> a priori
could have *absolutely nothing* to do with
<http://neurocommons.org/page/Main_Page>. The above owl:sameAs is
concluded not based on anything in RDF, but by analysis of HTTP.
However, we have no separate way to ask for these two resources using
HTTP. One might argue that since the 303 response is just "see other"
or "you might be interested in this too", there is no harm done.
(Using "#" doesn't fix this, btw.) But if people put RDF there, and we
believe the RDF, then mistakes could easily be made.

So I think we should be worried about the RDF/Web connection if my
analysis is right. a) This might be turned into an argument why HTTP
isn't appropriate for SemWeb use. b) It points to a possible
*actual* difference between IRs and non-IRs that ought to be
measurable in some sense (the first that I know of, other than the
tautological 200 response). c) It makes life difficult for those poor
POWDER folks trying to figure out how to use OWL to do their bidding.
d) It means we have to look a little more carefully at dbooth's hasURI
relation.

I have assumed in the above that, in the absence of a crystal clear
stance on the issue of URI references in RDF-MT
> This document does not take any position on the way that URI  
> references may be composed from other expressions, e.g. from  
> relative URIs or QNames; the semantics simply assumes that such  
> lexical issues have been resolved in some way that is globally  
> coherent, so that a single URI reference can be taken to have the  
> same meaning wherever it occurs.
>
>

the RDF/XML equality conditions on RDF URI references are normative.

If one wanted to repair this in a quick, hacky way, one could amend
either the RDF or the RDF/XML specification so that it takes into
account the HTTP escaping rules for names.

Best,
Alan


http://www.w3.org/TR/xmlschema-2/#anyURI
> 3.2.17 anyURI
>
> [Definition:]   anyURI represents a Uniform Resource Identifier  
> Reference (URI). An anyURI value can be absolute or relative, and  
> may have an optional fragment identifier (i.e., it may be a URI  
> Reference). This type should be used to specify the intention that  
> the value fulfills the role of a URI as defined by [RFC 2396], as  
> amended by [RFC 2732].
>
> The mapping from anyURI values to URIs is as defined by the URI  
> reference escaping procedure defined in Section 5.4 Locator  
> Attribute of [XML Linking Language] (see also Section 8 Character  
> Encoding in URI References of [Character Model]). This means that a  
> wide range of internationalized resource identifiers can be  
> specified when an anyURI is called for, and still be understood as  
> URIs per [RFC 2396], as amended by [RFC 2732], where appropriate to  
> identify resources.
>
> Note:  Section 5.4 Locator Attribute of [XML Linking Language]  
> requires that relative URI references be absolutized as defined in  
> [XML Base] before use. This is an XLink-specific requirement and is  
> not appropriate for XML Schema, since neither the ·lexical space·  
> nor the ·value space· of the anyURI type are restricted to absolute  
> URIs. Accordingly absolutization must not be performed by schema  
> processors as part of schema validation.
> Note:  Each URI scheme imposes specialized syntax rules for URIs in  
> that scheme, including restrictions on the syntax of allowed  
> fragment identifiers. Because it is impractical for processors to  
> check that a value is a context-appropriate URI reference, this  
> specification follows the lead of [RFC 2396] (as amended by [RFC  
> 2732]) in this matter: such rules and restrictions are not part of  
> type validity and are not checked by ·minimally conforming·  
> processors. Thus in practice the above definition imposes only very  
> modest obligations on ·minimally conforming· processors.


http://www.cs.tut.fi/~jkorpela/rfc/2396/full.html#2.4.2

> 2.4.2. When to Escape and Unescape
>
>    A URI is always in an "escaped" form, since escaping or unescaping a
>    completed URI might change its semantics.  Normally, the only time
>    escape encodings can safely be made is when the URI is being created
>    from its component parts; each component may have its own set of
>    characters that are reserved, so only the mechanism responsible for
>    generating or interpreting that component can determine whether or
>    not escaping a character will change its semantics. Likewise, a URI
>    must be separated into its components before the escaped characters
>    within those components can be safely decoded.
>
>    In some cases, data that could be represented by an unreserved
>    character may appear escaped; for example, some of the unreserved
>    "mark" characters are automatically escaped by some systems.  If the
>    given URI scheme defines a canonicalization algorithm, then
>    unreserved characters may be unescaped according to that algorithm.
>    For example, "%7e" is sometimes used instead of "~" in an http URL
>    path, but the two are equivalent for an http URL.
>
Received on Sunday, 29 June 2008 09:08:50 UTC