Re: Problems validating XML

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Sat, 23 Jun 2007 13:31:51 +0900
Message-Id: <>
To: olivier Thereaux <ot@w3.org>
Cc: www-validator@w3.org

Hello Olivier,

Sorry for the delay in answering.

At 02:24 07/06/20, olivier Thereaux wrote:
>Hello Martin,
>On May 30, 2007, at 05:22 , Martin Duerst wrote:
>> I can definitely submit a patch that goes into XML mode if an
>> XML declaration is present.
>Are you (still) working on this?

No, not yet, sorry.

>If not, I am thinking of going ahead with an implementation
>based on the following regexp, using a test for it in the first
>line of the content:

Strictly speaking, an XML declaration can span more than one line.
It has to start with a '<' as the very first character of the file,
but after that it can include line breaks. I see that you have included
\x9|\xD|\xA below, so maybe you were aware of this, and what you wrote
above should have been "a test for it starting in the first line".
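
To illustrate the difference, here is a minimal Python sketch (not the validator's own code). The sample document and the deliberately simplified pattern are assumptions made up for this example; the point is only that a test restricted to the first line misses a declaration that continues on later lines, while a test anchored at the first character and allowed to cross line breaks finds it.

```python
import re

# Hypothetical sample: a well-formed XML declaration spread over three lines.
DOC = "<?xml\n  version='1.0'\n  encoding='UTF-8'?>\n<root/>\n"

# Simplified pattern (an assumption for this sketch, not the regexp under
# discussion): '<?xml', one whitespace character, anything up to '?>'.
DECL = re.compile(r"<\?xml[\x20\x09\x0D\x0A][^?]*\?>")


def first_line_only(text):
    """Test only the first line of the content."""
    first_line = text.split("\n", 1)[0]
    return DECL.match(first_line) is not None


def from_start_of_content(text):
    """Anchor at the very first character, but let the match cross lines."""
    return DECL.match(text) is not None
```

With this sample, the first-line test reports no declaration, while the test that starts in the first line and continues across line breaks reports one.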

I took the liberty of adding an 'x' modifier at the end of this regexp
and laying it out over several lines to understand it better.

I found several issues:
- There was a missing '[' at the start of the WS after encoding.
- There can be space around the equal sign.
- There is no requirement for the XML declaration to be on its own
  line, i.e. the following would be legal:
<?xml version='1.0'?><!-- this is an XML declaration -->
- The version pseudo-attribute is required (assuming we are not
  talking about a text declaration,
  http://www.w3.org/TR/REC-xml/#sec-TextDecl), because we are not
  validating entities, only documents.
- There are two consecutive space groups (the one at the end of
  standalone and the one before ?>). They don't change what's
  matched, but they may hurt performance when backtracking.
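
The last point can be checked with a small sketch. It is Python rather than the validator's own code, and the pattern tail and the sample strings are made up for illustration. It only shows that merging two adjacent optional whitespace groups into one does not change which strings are accepted.

```python
import re

S = r"[\x20\x09\x0D\x0A]"  # the four XML whitespace characters

# Hypothetical tail of a declaration pattern, in two variants.
TWO_GROUPS = re.compile(rf"'yes'{S}*{S}*\?>")  # adjacent optional groups
ONE_GROUP = re.compile(rf"'yes'{S}*\?>")       # merged into a single group

SAMPLES = ["'yes'?>", "'yes' ?>", "'yes' \n\t ?>", "'yes' x ?>", "'yes'"]


def same_verdict(sample):
    """True if both variants agree on whether the sample matches."""
    two = TWO_GROUPS.fullmatch(sample) is not None
    one = ONE_GROUP.fullmatch(sample) is not None
    return two == one
```

The merged form gives the regexp engine fewer ways to split the same run of whitespace, which is why it is the safer choice when a match fails and the engine has to backtrack.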

As a result, I get the following:

/^<\?xml [\x20\x9\xD\xA]+ version
  [\x20\x9\xD\xA]* = [\x20\x9\xD\xA]*
  ("1\.[01]"|'1\.[01]')
  ([\x20\x9\xD\xA]+ encoding
   [\x20\x9\xD\xA]* = [\x20\x9\xD\xA]*
   ("[A-Za-z][A-Za-z0-9._-]*"|'[A-Za-z][A-Za-z0-9._-]*'))?
  ([\x20\x9\xD\xA]+ standalone
   [\x20\x9\xD\xA]* = [\x20\x9\xD\xA]*
   ("(yes|no)"|'(yes|no)'))?
  [\x20\x9\xD\xA]* \?>/x
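
For anyone who wants to try a pattern along these lines outside the validator, here is a Python sketch in which re.VERBOSE plays the role of the 'x' modifier. It rests on assumptions taken from the XML 1.0 and 1.1 grammar rather than from this thread: the version value is a quoted 1.0 or 1.1, the encoding value is an EncName, and the standalone value is yes or no. The name has_xml_declaration is hypothetical.

```python
import re

# Sketch of an XML declaration test. The whitespace class lists the four
# XML whitespace characters; space around '=' is optional.
XML_DECL = re.compile(r"""
    ^<\?xml
    [\x20\x09\x0D\x0A]+ version
    [\x20\x09\x0D\x0A]* = [\x20\x09\x0D\x0A]*
    (?: "1\.[01]" | '1\.[01]' )
    (?: [\x20\x09\x0D\x0A]+ encoding
        [\x20\x09\x0D\x0A]* = [\x20\x09\x0D\x0A]*
        (?: "[A-Za-z][A-Za-z0-9._-]*" | '[A-Za-z][A-Za-z0-9._-]*' )
    )?
    (?: [\x20\x09\x0D\x0A]+ standalone
        [\x20\x09\x0D\x0A]* = [\x20\x09\x0D\x0A]*
        (?: "(?:yes|no)" | '(?:yes|no)' )
    )?
    [\x20\x09\x0D\x0A]* \?>
""", re.VERBOSE)


def has_xml_declaration(text):
    """True if text starts with an XML declaration (version required)."""
    return XML_DECL.match(text) is not None
```

A declaration followed directly by a comment on the same line is accepted, a declaration without version is rejected, and so is one preceded by any character, including whitespace.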

I think this can be shortened and made more readable by using variable
substitution, but you have to try out how variable substitution,
escaping, and the /x interact to make sure you get this right.
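
As a rough idea of what that could look like, here is a Python sketch that builds a cut-down pattern (version only) from named fragments. The fragment names S, EQ and VERSION are hypothetical, and Python's re.VERBOSE stands in for /x, so the details of Perl interpolation still need to be tried out separately. The last two patterns show one interaction to watch for: in extended mode a bare space in a fragment is layout and is dropped, so a literal space has to be written as an escape.

```python
import re

# Fragment names are hypothetical, chosen for this sketch.
S = r"[\x20\x09\x0D\x0A]"        # escapes keep their meaning inside a class
EQ = rf"{S}* = {S}*"             # the spaces around '=' are layout only
VERSION = rf"""(?: "1\.[01]" | '1\.[01]' )"""

# Cut-down declaration pattern assembled by substitution.
XML_DECL = re.compile(
    rf"^<\?xml {S}+ version {EQ} {VERSION} {S}* \?>",
    re.VERBOSE,
)

# Pitfall: in verbose mode an unescaped space is ignored.
BROKEN = re.compile(r"a b", re.VERBOSE)     # matches "ab", not "a b"
FIXED = re.compile(r"a\x20b", re.VERBOSE)   # matches "a b"
```

Keeping the whitespace class in one fragment also means that a fix such as a missing bracket only has to be made in one place.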

>It should match XML declarations as defined by XML 1.0 and 1.1 specs,  
>and not be too greedy.

I definitely don't see a problem with greediness.
Even your version (with a single ']' added) should work; I have rarely
if ever seen anything that makes use of these full rules, and I have
sometimes wished the syntax for XML declarations were much more
restrictive (a single space between pseudo-attributes).

You may also want to compare with the regular expression in the RDF
validator, which I added at
(look for the line containing >>>>RE r = new RE("<?xml<<<<).

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp     
