Randomized testing of URI processing behavior from Bjoern Hoehrmann on 2011-08-02 (www-archive@w3.org from August 2011)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Tue, 02 Aug 2011 04:45:34 +0200
To: www-archive@w3.org
Message-ID: <8cne379bunebld7h382core9416hheclko@hive.bjoern.hoehrmann.de>
Hi,

  Finding bugs in syntax-related code is of course quite easy, as humans
tend to be extraordinarily bad at writing parsers, so there is actually
little need to automate finding them, especially if you have a bit of
experience with the kinds of errors people like best. For instance, we
have Goldfarb's First Law of Text Processing, which states that if a
text processing system has bugs, one of them relates to white space
processing.

Then again, automating this kind of thing is also trivial when systems
expose important implementation details properly. If, for instance, the
Content-Disposition header's file name was properly exposed through some
scripting interface, you could trivially make a test site that finds and
exposes differences and bugs among implementations (beyond testing the
particular scripting interface feature if you can assume implementations
re-use code internally).

Of course, such interfaces only very rarely exist, so for something, you
know, totally odd, like processing URIs in Web applications, there is no
API to extract components from them or to make relative ones absolute or
any number of things you'd never ever think of doing in a browser app.

There are of course other "platforms" where lobotomy isn't formally man-
dated to use them, so I took the RFC 3986 implementation I started some
rainy day, finished it, and wrapped it in a little harness that compares
how RFC 3986, WinINet, and Perl's URI.pm turn relative references into
absolute ones, and points out when they do not agree (as determined by
URI.pm's equality testing interface).

Some things I've noticed when I briefly looked at the results about the
behavior of URI.pm:

  * It incorrectly adds the fragment of the base to the result
  * If the relative reference is in <...> or "...", the delimiters are
    stripped
  * It does not remove dot segments correctly, e.g. for the rel '/.'
  * Strange things happen with "relative" inputs like `$:`
  * The escaping rules URI.pm uses seem a bit odd

I also noticed that RFC 3986's transformation algorithm is buggy: it
invokes a `merge` function with two paths, but `merge` actually needs to
know whether the base URI has an authority, and that is not passed to
the function.
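
To spell out what `merge` needs, here is a minimal standalone sketch of
section 5.2.3 with the missing input made explicit. The name
`merge_paths` and its third parameter are my own, for illustration only;
the RFC's pseudocode passes just the two paths.

```perl
use strict;
use warnings;

# Sketch of RFC 3986 section 5.2.3 "merge". $base_has_authority is an
# added parameter (an assumption of this sketch, not part of the RFC's
# pseudocode signature).
sub merge_paths {
  my ($base_path, $ref_path, $base_has_authority) = @_;

  # Base has an authority but an empty path: "/" plus the reference path.
  return "/$ref_path" if $base_has_authority and $base_path eq "";

  # Otherwise keep the base path up to and including its right-most "/",
  # or nothing at all if the base path contains no "/".
  my ($dir) = $base_path =~ m|^(.*/)|s;
  return (defined $dir ? $dir : "") . $ref_path;
}

# Empty base path with and without an authority, then an ordinary path.
for my $case (["", "g", 1], ["", "g", 0], ["/b/c/d", "g", 1]) {
  printf "%s\n", merge_paths(@$case);
}
```

The first two calls differ only in the flag and give "/g" and "g", which
is exactly the distinction the two-argument signature cannot express.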

  package RFC3986;
  use strict;
  use warnings;
  
  sub parse {
    $_[0] =~ m|^(([^:/?\#]+):)?(//([^/?\#]*))?
                ([^?\#]*)(\?([^\#]*))?(\#(.*))?$|x;
  
  #  $_[0] =~ m|^(([a-zA-Z0-9+.-]+):)?(//([^/?\#]*))?
  #              ([^?\#]*)(\?([^\#]*))?(\#(.*))?$|x;
  
    return scheme => $2,
      authority   => $4,
      path        => $5,
      query       => $7,
      fragment    => $9;
  }
  
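  sub merge {
    my ($base, $ref, $base_has_authority) = @_;
    return "/$ref" if $base eq "" and $base_has_authority;
    # Keep the base path up to and including its right-most "/".
    return "$1$ref" if $base =~ m|^(.*/)|s;
    # No "/" in the base path: the whole base path is dropped.
    return $ref;
  }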
  
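  sub remove_dot_segments {
    my $in = shift;
    my $ou = "";
    while (length $in) {
      # A: drop a leading "../" or "./"
      next if $in =~ s!^\.\.?/!!;
      # B: "/./" or a final "/." becomes "/"
      next if $in =~ s!(^/\.(/|$))!/!;
      # C: "/../" or a final "/.." becomes "/", and the last output
      #    segment is removed
      next if $in =~ s!^/\.\.(/|$)!/! and $ou =~ s!/?[^/]*$!!;
      # D: drop a lone "." or ".."
      next if $in =~ s!^\.\.?$!!;
      # E: move the first path segment to the output
      $in =~ s!^(/?[^/]*)!!;
      $ou .= $1;
    }
    return $ou;
  }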
  
  sub transform {
    my %R    = parse(shift);
    my %Base = parse(shift);
    my %T;
  
    if (defined $R{scheme}) {
      $T{scheme}    = $R{scheme};
      $T{authority} = $R{authority};
      $T{path}      = remove_dot_segments($R{path});
      $T{query}     = $R{query};
    } else {
  
      if (defined $R{authority}) {
        $T{authority} = $R{authority};
        $T{path}      = remove_dot_segments($R{path});
        $T{query}     = $R{query};
      } else {
  
        if ($R{path} eq "") {
          $T{path} = $Base{path};
          if (defined $R{query}) {
            $T{query} = $R{query};
          } else {
            $T{query} = $Base{query};
          }
        } else {
  
          if ($R{path} =~ m|^/|) {
            $T{path} = remove_dot_segments($R{path});
          } else {
  
            $T{path} = merge($Base{path}, $R{path},
              defined $Base{authority}); # Bug in RFC 3986
            $T{path} = remove_dot_segments($T{path});
  
          }
  
          $T{query} = $R{query};
        }
        $T{authority} = $Base{authority};
      }
      $T{scheme} = $Base{scheme};
    }
  
    $T{fragment} = $R{fragment};
  
    return %T;
  }
  
  sub compose {
    my %U = @_;
    my $result = "";
    $result .= $U{scheme} . ":" if defined $U{scheme};
    $result .= "//" . $U{authority} if defined $U{authority};
    $result .= $U{path};
    $result .= "?" . $U{query} if defined $U{query};
    $result .= "#" . $U{fragment} if defined $U{fragment};
  
    return $result;
  }
  
  package main;
  use Parse::RandGen;
  use URI;
  use Win32::Internet;
  use strict;
  
  my $g = Parse::RandGen::Regexp
    ->new(qr/^[^\x00-\x202-7B-Zb-z\x7f-\xFF)]+$/);
  my $i = Win32::Internet->new;
  
  my $bas = "http://x/1/2/3//?xxx";
  
  for (0..1000000) {
    eval {
      my $rel = $g->pick;
      my $uri = URI->new_abs($rel, $bas);
      my $win = $i->CombineURL($bas, $rel);
      my $rfc = RFC3986::compose(RFC3986::transform($rel, $bas));
      # $rfc and $win are plain strings, so compare them via URI::eq
      return if $uri->eq($rfc) and URI::eq($rfc, $win);
      print join("\t", $rel, $rfc, $uri, $win), "\n";
    }
  }
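
A few spot checks for the dot-segment handling, as a standalone sketch.
`strip_dots` is a made-up name for a compact copy of the
remove_dot_segments logic so that the snippet runs on its own; the first
two inputs are the examples given in RFC 3986 section 5.2.4, the third
is the '/.' case mentioned above.

```perl
use strict;
use warnings;

# Compact copy of the RFC 3986 section 5.2.4 loop; the name strip_dots
# is hypothetical and only used for this illustration.
sub strip_dots {
  my $in  = shift;
  my $out = "";
  while (length $in) {
    next if $in =~ s!^\.\.?/!!;          # A: leading "../" or "./"
    next if $in =~ s!^/\.(?:/|\z)!/!;    # B: "/./" or final "/."
    next if $in =~ s!^/\.\.(?:/|\z)!/!   # C: "/../" or final "/.."
        and $out =~ s!/?[^/]*\z!!;       #    drops the last output segment
    next if $in =~ s!^\.\.?\z!!;         # D: lone "." or ".."
    $in =~ s!^(/?[^/]*)!!;               # E: move one segment to the output
    $out .= $1;
  }
  return $out;
}

for my $path ("/a/b/c/./../../g", "mid/content=5/../6", "/.") {
  printf "%-20s => %s\n", $path, strip_dots($path);
}
```

Following the RFC's steps, these should come out as "/a/g", "mid/6",
and "/" respectively.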
  
regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Tuesday, 2 August 2011 02:46:14 UTC