- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Tue, 02 Aug 2011 04:45:34 +0200
- To: www-archive@w3.org
Hi,

Finding bugs in syntax-related code is of course quite easy, as humans tend to be extraordinarily bad at writing parsers, so there is actually little need to automate finding them, especially if you have a bit of experience with the kinds of errors people like best. For instance, we have Goldfarb's First Law of Text Processing, which states that if a text processing system has bugs, one of them relates to white space processing.

Then again, automating this kind of thing is also trivial when systems expose important implementation details properly. If, for instance, the Content-Disposition header's file name were properly exposed through some scripting interface, you could trivially make a test site that finds and exposes differences and bugs among implementations (beyond testing the particular scripting interface feature, if you can assume implementations re-use code internally).

Of course, such interfaces only very rarely exist, so for something, you know, totally odd, like processing URIs in Web applications, there is no API to extract components from them, or to make relative ones absolute, or to do any number of things you'd never ever think of doing in a browser app.

There are of course other "platforms" where lobotomy isn't formally mandated to use them, so I took the RFC 3986 implementation I started some rainy day, finished it, and wrapped it in a little harness that compares how RFC 3986, WinINet, and Perl's URI.pm turn relative references into absolute ones, and points out when they do not agree (as determined by URI.pm's equality testing interface).

Some things I noticed about the behavior of URI.pm when I briefly looked at the results:

* It incorrectly adds the fragment of the base to the result
* If the relative reference is in <...> or "..." that's stripped
* It does not remove dot segments correctly, e.g. when the relative reference is '/.'
* Strange things happen with "relative" inputs like `$:`
* The escaping rules URI.pm uses seem a bit odd

I also noticed that RFC 3986's transformation algorithm is buggy: there is a `merge` function that's invoked with two paths, but the merge function actually needs to know whether the base reference had an authority, and that is not passed to the function.

package RFC3986;
use strict;
use warnings;

# Split a URI reference into its five components (regex from RFC 3986,
# Appendix B). Components that are absent come back as undef.
sub parse {
  $_[0] =~ m|^(([^:/?\#]+):)?(//([^/?\#]*))?
             ([^?\#]*)(\?([^\#]*))?(\#(.*))?$|x;

  # $_[0] =~ m|^(([a-zA-Z0-9+.-]+):)?(//([^/?\#]*))?
  #            ([^?\#]*)(\?([^\#]*))?(\#(.*))?$|x;

  return scheme    => $2,
         authority => $4,
         path      => $5,
         query     => $7,
         fragment  => $9;
}

# Merge a relative-path reference with the base path (section 5.2.3).
sub merge {
  my ($base, $ref, $base_has_authority) = @_;
  return "/$ref" if $base eq "" and $base_has_authority;
  # Keep everything up to and including the last "/" of the base path.
  return "$1$ref" if $base =~ m|^(.*/)|s;
  return $ref;
}

# Remove "." and ".." segments from a path (section 5.2.4).
sub remove_dot_segments {
  my $in = shift;
  my $ou = "";
  while (length $in) {
    next if $in =~ s!^\.\.?/!!;
    next if $in =~ s!(^/\.(/|$))!/!;
    next if $in =~ s!^/\.\.(/|$)!/! and $ou =~ s!/?[^/]*$!!;
    next if $in =~ s!^\.\.?$!!;
    $in =~ s!^(/?[^/]*)!!;
    $ou .= $1;
  }
  return $ou;
}

# Resolve reference R against a base URI (section 5.2.2).
sub transform {
  my %R    = parse(shift);
  my %Base = parse(shift);
  my %T;

  if (defined $R{scheme}) {
    $T{scheme}    = $R{scheme};
    $T{authority} = $R{authority};
    # Section 5.2.2 removes dot segments in this branch as well.
    $T{path}      = remove_dot_segments($R{path});
    $T{query}     = $R{query};
  } else {
    if (defined $R{authority}) {
      $T{authority} = $R{authority};
      $T{path}      = remove_dot_segments($R{path});
      $T{query}     = $R{query};
    } else {
      if ($R{path} eq "") {
        $T{path} = $Base{path};
        if (defined $R{query}) {
          $T{query} = $R{query};
        } else {
          $T{query} = $Base{query};
        }
      } else {
        if ($R{path} =~ m|^/|) {
          $T{path} = remove_dot_segments($R{path});
        } else {
          $T{path} = merge($Base{path}, $R{path},
            defined $Base{authority}); # Bug in RFC 3986
          $T{path} = remove_dot_segments($T{path});
        }
        $T{query} = $R{query};
      }
      $T{authority} = $Base{authority};
    }
    $T{scheme} = $Base{scheme};
  }

  $T{fragment} = $R{fragment};
  return %T;
}

# Recompose the components into a URI reference string (section 5.3).
sub compose {
  my %U = @_;
  my $result = "";
  $result .= $U{scheme} . ":" if defined $U{scheme};
  $result .= "//" . $U{authority} if defined $U{authority};
  $result .= $U{path};
  $result .= "?" . $U{query} if defined $U{query};
  $result .= "#" . $U{fragment} if defined $U{fragment};
  return $result;
}

package main;
use Parse::RandGen;
use URI;
use Win32::Internet;
use strict;

# Generate random relative references from a restricted character set.
my $g = Parse::RandGen::Regexp
  ->new(qr/^[^\x00-\x202-7B-Zb-z\x7f-\xFF)]+$/);

my $i = Win32::Internet->new;
my $bas = "http://x/1/2/3//?xxx";

for (0..1000000) {
  eval {
    my $rel = $g->pick;
    my $uri = URI->new_abs($rel, $bas);
    my $win = $i->CombineURL($bas, $rel);
    my $rfc = RFC3986::compose(RFC3986::transform($rel, $bas));

    # $rfc and $win are plain strings, so compare them through URI::eq
    # rather than calling ->eq on a string.
    return if $uri->eq($rfc) and URI::eq($rfc, $win);
    print join "\t", $rel, $rfc, $uri, $win, "\n";
  }
}

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Tuesday, 2 August 2011 02:46:14 UTC
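
The point about `merge` needing to know whether the base has an authority can be illustrated with a small standalone sketch. It is not part of the original message. It assumes a three-argument `merge` along the lines of the one in the message, plus the same dot-segment removal. The first three test cases use the base path of the RFC 3986 section 5.4 examples (`/b/c/d;p`). The last two, with an empty base path, are made-up inputs chosen to show that the same two paths give different results depending on the authority flag.

```perl
use strict;
use warnings;

# Three-argument merge: the third argument says whether the base
# reference had an authority component (RFC 3986, section 5.2.3).
sub merge {
    my ($base_path, $ref_path, $base_has_authority) = @_;
    return "/$ref_path" if $base_path eq "" and $base_has_authority;
    # Keep the base path up to and including its last "/", if any.
    return "$1$ref_path" if $base_path =~ m|^(.*/)|s;
    return $ref_path;
}

# Dot-segment removal (RFC 3986, section 5.2.4).
sub remove_dot_segments {
    my $in  = shift;
    my $out = "";
    while (length $in) {
        next if $in =~ s!^\.\.?/!!;                  # leading "../" or "./"
        next if $in =~ s!^/\.(/|$)!/!;               # "/./" or trailing "/."
        next if $in =~ s!^/\.\.(/|$)!/!              # "/../" or trailing "/.."
            and $out =~ s!/?[^/]*$!!;                # drop last output segment
        next if $in =~ s!^\.\.?$!!;                  # lone "." or ".."
        $in =~ s!^(/?[^/]*)!!;                       # move one segment over
        $out .= $1;
    }
    return $out;
}

# Each case: base path, base has authority?, reference path, expected path.
my @cases = (
    [ "/b/c/d;p", 1, "g",          "/b/c/g" ],
    [ "/b/c/d;p", 1, "../g",       "/b/g"   ],
    [ "/b/c/d;p", 1, "../../../g", "/g"     ],
    [ "",         1, "g",          "/g"     ],  # hypothetical: base like http://a
    [ "",         0, "g",          "g"      ],  # hypothetical: base without authority
);

for my $case (@cases) {
    my ($base, $auth, $ref, $want) = @$case;
    my $got = remove_dot_segments(merge($base, $ref, $auth));
    printf "%-4s base=%-10s ref=%-12s => %s\n",
        ($got eq $want ? "ok" : "FAIL"), "'$base'", "'$ref'", $got;
}
```

The last two cases pass identical path arguments and differ only in the flag, which is why a two-argument `merge(Base.path, R.path)` cannot distinguish them on its own.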