Re: which layer for URI processing? from John Cowan on 2000-05-24 (xml-uri@w3.org from May 2000)

From: John Cowan <jcowan@reutershealth.com>
Date: Wed, 24 May 2000 16:02:56 -0400
To: "Simon St.Laurent" <simonstl@simonstl.com>, "xml-uri@w3.org" <xml-uri@w3.org>
Message-ID: <392C3570.38D7656C@reutershealth.com>

"Simon St.Laurent" wrote:

> I'd appreciate it if you could explain why you think it is so critical
> that lower layers of processing handle the considerable amount of effort
> involved in treating URIs _as URIs_ rather than as strings for purposes
> of comparison,

What "considerable amount of effort"?  Here's some Perl code to do the whole
RFC 2396 resolution.  Given the base URI as an argument, it reads URI
references from the standard input and sends resolved forms to the standard output.

#!/usr/bin/perl
use strict;
use warnings;

# Split a URI reference into scheme, authority, path, query, and fragment
# (after RFC 2396, Appendix B); each delimiter stays with its component.
my $parts =
        qr%^([A-Za-z][A-Za-z0-9+.-]*:)?(//[^/?#]*)?([^?#]*)(\?[^#]*)?(#.*)?$%;

my $base = shift @ARGV;
die "usage: $0 base-uri < uri-references\n" unless defined $base;
my ($bscheme, $bauth, $bpath) = map { defined $_ ? $_ : "" } $base =~ $parts;
(my $bpath2 = $bpath) =~ s%[^/]+$%%;    # base path without final component

while (<>) {
        chomp;
        if ($_ eq "" || /^#/) {
                print "[current document (not necessarily $base)]$_\n";
                next;
                }
        my ($scheme, $auth, $path, $query, $frag) =
                map { defined $_ ? $_ : "" } /$parts/;
        if ($scheme ne "") {
                print $_, "\n";                 # absolute URI
                next;
                }
        if ($auth eq "") {                      # not a network-path reference
                $auth = $bauth;
                if (substr($path, 0, 1) ne "/") {       # relative-path reference
                        $path = $bpath2 . $path;
                        $path =~ s%(?<![^/])\./%%g;     # remove "." segments
                        $path =~ s%(?<![^/])\.$%%;
                        # remove "<segment>/.." pairs, leftmost first,
                        # where <segment> is not itself ".."
                        1 while $path =~ s%(?<![^/])(?!\.\./)[^/]+/\.\./%%;
                        $path =~ s%(?<![^/])(?!\.\./)[^/]+/\.\.$%%;
                        }
                }
        print $bscheme, $auth, $path, $query, $frag, "\n";
        }

This would be easy to translate into C or any other assembly language.  :-)

> and why higher layers (like RDF and other models) can't be trusted with
> that responsibility.

Here's a concrete example.

Let's suppose that we have an XML 1.0 + Namespaces
parser that interns all namespace names; in other words, the strings
returned as namespace names are guaranteed to be the same object iff they have
the same text.  This satisfies the Namespace Rec as written.

Now suppose that an RDF decoder is layered over this parser.  It uses
namespace names to locate RDF schemas for the RDF vocabularies in its
input.  (This need not mean that it just accesses the namespace name
as a URL to fetch the schema; there may be some kind of indirection here
without affecting my point.)  It would like to store the schemas in a
hashtable keyed on the namespace names, to minimize schema-fetching.

This will not work under the status quo, because the relative namespace
name "foo" used in two different documents (with different base URIs) will
correspond to two different RDF schemas, but the XML parser will intern
"foo" as a single string.
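
To make the collision concrete, here is a minimal sketch in Perl. It is
not a real parser or RDF decoder: the two document base URIs are invented
for illustration, the "schemas" are placeholder strings, and resolve() is
a hypothetical helper that handles only simple relative names (no "." or
".." segments). It compares a schema cache keyed on the namespace name as
written with one keyed on the name resolved against each document's base.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: resolve a simple relative name against a base URI
# by replacing the base's final path segment.  Absolute names pass through.
sub resolve {
    my ($base, $ref) = @_;
    return $ref if $ref =~ /^[A-Za-z][A-Za-z0-9+.-]*:/;
    (my $dir = $base) =~ s{[^/]*\z}{};      # drop the final path segment
    return $dir . $ref;
}

# Two made-up documents, each declaring the relative namespace name "foo".
my @docs = (
    { base => "http://example.org/a/doc.xml", ns => "foo" },
    { base => "http://example.com/b/doc.xml", ns => "foo" },
);

my (%by_name, %by_uri);     # schema caches keyed two different ways
for my $doc (@docs) {
    my $abs = resolve($doc->{base}, $doc->{ns});
    # Keyed on the name as written: the second document overwrites the first.
    $by_name{ $doc->{ns} } = "schema fetched for $abs";
    # Keyed on the resolved URI: each document keeps its own schema.
    $by_uri{$abs} = "schema fetched for $abs";
}

printf "keyed on namespace name: %d entry\n",   scalar keys %by_name;
printf "keyed on resolved URI:   %d entries\n", scalar keys %by_uri;
```

Run as written, this should report one entry in the name-keyed cache and
two in the URI-keyed one: the decoder needs both schemas, but a key that
is just the interned string "foo" leaves room for only one of them.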

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,           || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)

Received on Wednesday, 24 May 2000 16:03:32 UTC