Beta: Character escapes in system identifiers from Bjoern Hoehrmann on 2002-11-03 (www-validator@w3.org from November 2002)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Sun, 03 Nov 2002 03:55:30 +0100
To: www-validator@w3.org
Message-ID: <3dc68d41.547900048@smtp.bjoern.hoehrmann.de>

Hi,

two documents,

  <?xml version='1.0' encoding='iso-8859-1'?>
  <!DOCTYPE foo SYSTEM
    "http://www.bjoernsworld.de/cgi-bin/dtd.pl?björn">
  <foo>
  <bar/>
  </foo>

and

  <?xml version='1.0' encoding='iso-8859-1'?>
  <!DOCTYPE foo SYSTEM
    "http://www.bjoernsworld.de/cgi-bin/dtd.pl?bj%c3%b6rn">
  <foo>
  <bar/>
  </foo>

As per XML 1.0 Second Edition section 4.2.2, XML processors must treat
these documents as equivalent. The Validator, however, does not: it
reports the second document as valid and the first as invalid. It is
somehow getting confused by the system identifier in the first
example.

Typically, XML processors get it "right" but request ...?bj\xF6rn (the
raw ISO-8859-1 byte), ...?bj\xC3\xB6rn (the raw UTF-8 bytes) or
...?bj%f6rn (the escaped ISO-8859-1 byte) instead of ...?bj%c3%b6rn,
...?bj%C3%b6rn or ...?bj%C3%B6rn (which are all equivalent).
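
For illustration, the escaping that section 4.2.2 asks for can be sketched
roughly as follows: convert the identifier to UTF-8, then write every
non-ASCII byte as %HH. The function name escape_system_id is my own, and
the sketch assumes the identifier has already been decoded to characters
and only handles non-ASCII characters, not the other characters that are
disallowed in URIs.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sketch only: escape the non-ASCII characters of a system identifier.
# The name escape_system_id is hypothetical, not part of any library.
sub escape_system_id {
    my ($id) = @_;                      # string of characters, already decoded
    my $bytes = encode('UTF-8', $id);   # step 1: characters -> UTF-8 bytes
    # step 2: every byte >= 0x80 belongs to a non-ASCII character; write it as %HH
    $bytes =~ s/([\x80-\xFF])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

print escape_system_id("http://www.bjoernsworld.de/cgi-bin/dtd.pl?bj\x{F6}rn"), "\n";
# prints http://www.bjoernsworld.de/cgi-bin/dtd.pl?bj%C3%B6rn
```

A processor that skips the UTF-8 conversion and escapes the ISO-8859-1
byte directly ends up with ...?bj%F6rn, which is one of the wrong requests
listed above.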

dtd.pl is a CGI script that outputs different DTDs depending on whether
the processor is behaving correctly:

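  #!/usr/local/bin/perl -w
  use strict;

  # The DTD always declares foo; bar is declared only when the query
  # string arrives correctly escaped (UTF-8 bytes written as %HH).
  my $query = $ENV{'QUERY_STRING'} || '';
  print "Content-Type: application/xml-dtd;charset=us-ascii\n\n";
  print "<!ELEMENT foo (bar)>\n";
  if ($query eq "bj%c3%b6rn" or
      $query eq "bj%C3%b6rn" or
      $query eq "bj%C3%B6rn")
  {
    print "<!ELEMENT bar EMPTY>\n";
  }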

I.e. the document is valid for conforming processors, invalid for
non-conforming processors.
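
The script above spells out three casings of the escaped query string; the
fourth mixed casing, bj%c3%B6rn, is equivalent as well. One possible way
to cover all of them is to upper-case the hex digits of each escape before
comparing. This is only a sketch: normalize_escapes and
is_conforming_query are names made up for this example, not part of
dtd.pl.

```perl
use strict;
use warnings;

# Upper-case the hex digits of every %HH escape so that escapes
# differing only in case compare equal; other characters are untouched.
sub normalize_escapes {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/'%' . uc($1)/ge;
    return $s;
}

# Hypothetical helper: true when the query string is the correctly
# escaped form, whatever the case of its hex digits.
sub is_conforming_query {
    my ($query) = @_;
    return normalize_escapes($query) eq 'bj%C3%B6rn';
}

for my $q ('bj%c3%b6rn', 'bj%c3%B6rn', 'bj%f6rn') {
    printf "%-12s %s\n", $q,
        is_conforming_query($q) ? 'conforming' : 'non-conforming';
}
```

Only the escapes are normalized, so a request such as BJ%C3%B6RN still
counts as a different query string.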

regards.

Received on Saturday, 2 November 2002 21:55:25 UTC