Re: Problems validating XML from olivier Thereaux on 2007-06-25 (www-validator@w3.org from June 2007)

From: olivier Thereaux <ot@w3.org>
Date: Mon, 25 Jun 2007 14:18:29 +0900
To: Martin Duerst <duerst@it.aoyama.ac.jp>
Cc: www-validator@w3.org
Message-Id: <EE608546-E84F-4AA2-981C-387207887895@w3.org>

Hi Martin,

Thanks for looking into the regexp, and especially for spotting one  
of my mistakes. Much appreciated.

On Jun 23, 2007, at 13:31 , Martin Duerst wrote:
> Strictly speaking, an XML declaration can go over more than one line.
> It has to start with an '<' as the very first character of the file,
> but then it can include linebreaks.

Indeed. Also, as Ville noted, there could be a BOM at the beginning too.
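
One way to deal with a leading BOM is to drop it before testing for the XML declaration. The following is a minimal sketch in Python rather than the validator's own Perl, and it assumes UTF-8 input only; the function name strip_utf8_bom is made up for this illustration.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

After this step the '<' of the declaration is again the first byte, so a pattern anchored with ^ can be applied unchanged. UTF-16 and UTF-32 BOMs would need separate handling.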

> I took the liberty of adding an 'x' at the end of this regexp,
> and laying it out to better understand it.

Great, I didn't know this regexp modifier. Very handy.


> - There can be space around the equal sign.

I see in your resulting regexp that you are using
[\x20|\x9|\xD|\xA]+ = [\x20|\x9|\xD|\xA]+
whereas I suspect it should be
[\x20|\x9|\xD|\xA]* = [\x20|\x9|\xD|\xA]*
since there could also be no space, right?
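
The difference between the two quantifiers can be checked with a small sketch. It is written in Python rather than Perl, and the names S, EQ_STRICT, EQ_LOOSE and accepts are assumptions made for this example. Python's re module needs two-digit hex escapes, hence \x09 instead of \x9.

```python
import re

# XML whitespace: space, tab, CR, LF.
S = r"[\x20\x09\x0D\x0A]"

# "+" variant: at least one whitespace character on each side of '='.
EQ_STRICT = re.compile(rf"version{S}+={S}+")
# "*" variant: whitespace on either side of '=' is optional.
EQ_LOOSE = re.compile(rf"version{S}*={S}*")

def accepts(pattern, text):
    """Return True if the pattern matches somewhere in text."""
    return pattern.search(text) is not None
```

With these definitions, the common form version="1.0" is only accepted by the "*" variant, while both variants accept version = "1.0".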

> - There is no requirement for the XML declaration to be on its own
>   line, i.e. the following would be legal:
> <?xml version='1.0'?><!-- this is an XML declaration -->
> - The version pseudo-attribute is needed (assuming we are not
>   talking about a text declaration
>   (http://www.w3.org/TR/REC-xml/#sec-TextDecl) because we are not
>   validating entities, only documents).

Two good points. Agreed.

> As a result, I get the following:
>
> /^<\?xml [\x20|\x9|\xD|\xA]+ version
>   [\x20|\x9|\xD|\xA]+ = [\x20|\x9|\xD|\xA]+
>   ("1.0"|"1.1"|'1.0'|'1.1')
>   ([\x20|\x9|\xD|\xA]+ encoding
>    [\x20|\x9|\xD|\xA]+ = [\x20|\x9|\xD|\xA]+
>    ("[A-Za-z][a-zA-Z0-9-_]+"|'[A-Za-z][a-zA-Z0-9_]+')
>   )?
>   ([\x20|\x9|\xD|\xA]+)+ standalone
>    [\x20|\x9|\xD|\xA]+ = [\x20|\x9|\xD|\xA]+
>    ("yes"|"no"|'yes'|'no')
>   )?
>   [\x20|\x9|\xD|\xA]* \?>
> /x

Hence, when allowing no space around the equal sign:

/^<\?xml [\x20\x9\xD\xA]+ version
   [\x20\x9\xD\xA]* = [\x20\x9\xD\xA]*
   ("1\.0"|"1\.1"|'1\.0'|'1\.1')
   ([\x20\x9\xD\xA]+ encoding            # optional encoding declaration
    [\x20\x9\xD\xA]* = [\x20\x9\xD\xA]*
    ("[A-Za-z][A-Za-z0-9._-]*"|'[A-Za-z][A-Za-z0-9._-]*')
   )?
   ([\x20\x9\xD\xA]+ standalone          # optional standalone declaration
    [\x20\x9\xD\xA]* = [\x20\x9\xD\xA]*
    ("yes"|"no"|'yes'|'no')
   )?
   [\x20\x9\xD\xA]* \?>
/x
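
For anyone wanting to experiment outside Perl, here is a rough port of that pattern to Python, where re.VERBOSE plays the role of /x. It is a sketch under a few assumptions, not the validator's code: the whitespace class is written without '|' separators, the dot in the version number is escaped, encoding names follow the EncName production of the XML specification, and the standalone part is one optional group. The names XML_DECL and has_xml_declaration are made up for the example.

```python
import re

# Python's re needs two-digit hex escapes (\x09, not \x9).
XML_DECL = re.compile(r"""
    ^<\?xml [\x20\x09\x0D\x0A]+ version
      [\x20\x09\x0D\x0A]* = [\x20\x09\x0D\x0A]*
      ("1\.0"|"1\.1"|'1\.0'|'1\.1')
      ([\x20\x09\x0D\x0A]+ encoding            # optional encoding declaration
        [\x20\x09\x0D\x0A]* = [\x20\x09\x0D\x0A]*
        ("[A-Za-z][A-Za-z0-9._-]*"|'[A-Za-z][A-Za-z0-9._-]*')
      )?
      ([\x20\x09\x0D\x0A]+ standalone          # optional standalone declaration
        [\x20\x09\x0D\x0A]* = [\x20\x09\x0D\x0A]*
        ("yes"|"no"|'yes'|'no')
      )?
      [\x20\x09\x0D\x0A]* \?>
""", re.VERBOSE)

def has_xml_declaration(text):
    """Return True if text starts with a well-formed XML declaration."""
    return XML_DECL.match(text) is not None
```

The pattern is only anchored at the start, so a declaration followed by other markup on the same line is still accepted, and line breaks inside the declaration are treated as ordinary XML whitespace.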



> I think this can be shortened and made more readable by using variable
> substitution, but you have to try out how variable substitution,
> escaping, and the /x interact to make sure you get this right.

I think once you document that the repeated character class lists the
authorized XML whitespace characters (including CR/LF) and point to
http://www.w3.org/TR/REC-xml/#NT-XMLDecl the regexp is reasonably
readable.
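
For comparison, this is roughly what the variable-substitution approach could look like. The sketch is in Python, not Perl, so it does not answer the question of how Perl interpolation, escaping and /x interact. All names (S, EQ, quoted, VERSION, ENC_NAME, YES_NO, XML_DECL) are assumptions made for this illustration.

```python
import re

# XML whitespace (space, tab, CR, LF), see
# http://www.w3.org/TR/REC-xml/#NT-XMLDecl
S = r"[\x20\x09\x0D\x0A]"
# Equal sign with optional whitespace on both sides.
EQ = rf"{S}* = {S}*"

def quoted(body):
    """Accept body in double quotes or in single quotes, not a mix."""
    return rf"""(?: "{body}" | '{body}' )"""

VERSION = quoted(r"1\.[01]")
ENC_NAME = quoted(r"[A-Za-z][A-Za-z0-9._-]*")
YES_NO = quoted(r"(?:yes|no)")

# re.VERBOSE ignores the layout whitespace, like Perl's /x.
XML_DECL = re.compile(rf"""
    ^<\?xml
    {S}+ version {EQ} {VERSION}
    (?: {S}+ encoding {EQ} {ENC_NAME} )?
    (?: {S}+ standalone {EQ} {YES_NO} )?
    {S}* \?>
""", re.VERBOSE)
```

The trade-off is the one discussed in this thread: the composed pattern is shorter, but a reader has to look up each piece, whereas the spelled-out version can be read top to bottom.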

> You may also want to compare with the regular expression in the RDF
> validator, which I added at
> http://dev.w3.org/cvsweb/java/classes/org/w3c/rdf/examples/ARPServlet.java.diff?r1=1.34&r2=1.35
> (look for the line containing >>>>RE r = new RE("<?xml<<<<).

Ah, thanks, I wasn't aware of that piece of code. I guess that regexp
is more clever in how it uses variable substitution for the quote
marks, but I find the one you wrote (quoted above) slightly more
readable. Another difference is that the RDF validator's regexp does
not hardcode 1.0 and 1.1 as the sole XML versions allowed. Should we
be using \d\.\d ?
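
To make the trade-off concrete, here is a small Python sketch comparing a hardcoded version check with the looser \d\.\d form. The names HARDCODED, GENERIC and version_ok are assumptions for this example, and re.ASCII is used so that \d only matches 0-9.

```python
import re

# Only the two versions named in the thread.
HARDCODED = re.compile(r"1\.[01]")
# Any single digit, a dot, any single digit (ASCII digits only).
GENERIC = re.compile(r"\d\.\d", re.ASCII)

def version_ok(pattern, value):
    """Return True if the whole version string matches the pattern."""
    return pattern.fullmatch(value) is not None
```

The generic form would not need updating for a future version number, but it also lets through values such as 2.0 or 9.9 that no XML specification defines.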

Thank you
olivier
Received on Monday, 25 June 2007 05:18:34 UTC