- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Sat, 23 Jun 2007 13:31:51 +0900
- To: olivier Thereaux <ot@w3.org>
- Cc: www-validator@w3.org
Hello Olivier,
Sorry for the delay in replying.
At 02:24 07/06/20, olivier Thereaux wrote:
>Hello Martin,
>
>On May 30, 2007, at 05:22 , Martin Duerst wrote:
>> I can definitely submit a patch that goes into XML mode if an
>> XML declaration is present.
>
>Are you (still) working on this?
No, not yet, sorry.
>If not, I am thinking of going ahead with an implementation
Great!
>based on the following regexp, using a test for it in the first
>line of the content:
Strictly speaking, an XML declaration can span more than one line.
It has to start with a '<' as the very first character of the file,
but after that it can include line breaks. I see that you have included
\x9|\xD|\xA below, so maybe you were aware of this, in which case the
text above should have read "a test for it starting in the first line".
I took the liberty of adding an 'x' at the end of this regexp,
and laying it out to better understand it.
I found several issues:
- There was a missing '[' at the start of the WS after encoding.
- There can be space around the equal sign.
- There is no requirement for the XML declaration to be on its own
  line, i.e. the following would be legal:
  <?xml version='1.0'?><!-- this is an XML declaration -->
- The version pseudo-attribute is required (assuming we are not
  talking about a text declaration,
  http://www.w3.org/TR/REC-xml/#sec-TextDecl, because we are not
  validating entities, only documents).
- There are two consecutive space groups (the one at the end of
  standalone and the one before ?>). They don't change what's
  matched, but they may hurt performance when backtracking.
As a result, I get the following:
/^<\?xml [\x20\x9\xD\xA]+ version
[\x20\x9\xD\xA]* = [\x20\x9\xD\xA]*
("1\.0"|"1\.1"|'1\.0'|'1\.1')
([\x20\x9\xD\xA]+ encoding
[\x20\x9\xD\xA]* = [\x20\x9\xD\xA]*
("[A-Za-z][A-Za-z0-9._-]*"|'[A-Za-z][A-Za-z0-9._-]*')
)?
([\x20\x9\xD\xA]+ standalone
[\x20\x9\xD\xA]* = [\x20\x9\xD\xA]*
("yes"|"no"|'yes'|'no')
)?
[\x20\x9\xD\xA]* \?>
/x
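A minimal sketch of the points listed above, written in Python rather than the validator's Perl, with re.VERBOSE standing in for /x. The pattern follows the XMLDecl production of the XML specification as I read it. The function name has_xml_declaration and the sample strings are illustrative assumptions, not part of any existing code.

```python
import re

# Whitespace and line breaks in the pattern are layout only (re.VERBOSE).
# [\x20\x09\x0D\x0A] is the XML "S" production: space, tab, CR, LF.
XML_DECL = re.compile(r"""
    <\?xml [\x20\x09\x0D\x0A]+ version
    [\x20\x09\x0D\x0A]* = [\x20\x09\x0D\x0A]*
    (?: "1\.0" | "1\.1" | '1\.0' | '1\.1' )
    (?: [\x20\x09\x0D\x0A]+ encoding
        [\x20\x09\x0D\x0A]* = [\x20\x09\x0D\x0A]*
        (?: "[A-Za-z][A-Za-z0-9._-]*" | '[A-Za-z][A-Za-z0-9._-]*' )
    )?
    (?: [\x20\x09\x0D\x0A]+ standalone
        [\x20\x09\x0D\x0A]* = [\x20\x09\x0D\x0A]*
        (?: "yes" | "no" | 'yes' | 'no' )
    )?
    [\x20\x09\x0D\x0A]* \?>
""", re.VERBOSE)


def has_xml_declaration(content: str) -> bool:
    """True if content starts with an XML declaration.

    re.match anchors at the very first character, so the test is
    "starting in the first line", not "contained in the first line".
    """
    return XML_DECL.match(content) is not None


if __name__ == "__main__":
    samples = [
        "<?xml version='1.0'?><!-- this is an XML declaration -->",
        '<?xml version = "1.0"\n encoding = "UTF-8" standalone="yes" ?>',
        "<?xml encoding='UTF-8'?>",   # text declaration only: no version
        "\n<?xml version='1.0'?>",    # '<' is not the first character
    ]
    for sample in samples:
        print(repr(sample), has_xml_declaration(sample))
```

The test is applied to the start of the content as a whole, so a declaration that continues on a second line is still found.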
I think this regexp can be shortened and made more readable by using
variable substitution, but you have to try out how variable substitution,
escaping, and the /x modifier interact to make sure you get it right.
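As a rough illustration of the variable-substitution idea, here is a Python sketch that builds the pattern from named fragments using f-strings and re.VERBOSE. Perl's interpolation and /x rules differ in detail, so this shows only the general shape and one pitfall. All names (S, EQ, quoted, XML_DECL, LITERAL_SPACE) are hypothetical.

```python
import re

# XML whitespace as a character class with escaped code points, so that
# nothing in it is discarded by verbose mode.
S = r"[\x20\x09\x0D\x0A]"
# Optional whitespace around "="; the blanks written here are layout only.
EQ = rf"{S}* = {S}*"


def quoted(body: str) -> str:
    """Wrap a sub-pattern in matching double or single quotes."""
    return rf"""(?: "(?:{body})" | '(?:{body})' )"""


VERSION = quoted(r"1\.[01]")
ENCODING = quoted(r"[A-Za-z][A-Za-z0-9._-]*")
STANDALONE = quoted(r"yes|no")

XML_DECL = re.compile(rf"""
    <\?xml {S}+ version {EQ} {VERSION}
    (?: {S}+ encoding   {EQ} {ENCODING}   )?
    (?: {S}+ standalone {EQ} {STANDALONE} )?
    {S}* \?>
""", re.VERBOSE)

# Pitfall: in verbose mode an unescaped blank in a fragment is dropped,
# so this pattern means "<?xmlversion", not "<?xml version".
LITERAL_SPACE = re.compile(r"<\?xml version", re.VERBOSE)

if __name__ == "__main__":
    print(XML_DECL.match("<?xml version='1.0' encoding='UTF-8'?>") is not None)
    print(LITERAL_SPACE.match("<?xml version") is not None)
```

Keeping the whitespace class in one fragment means a mistake such as a missing bracket needs to be fixed in only one place.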
>It should match XML declarations as defined by XML 1.0 and 1.1 specs,
>and not be too greedy.
I definitely don't see a problem with greediness.
Even your version (with a single ']' added) should work; I have rarely if
ever seen anything that makes use of these full rules, and I have sometimes
wished the syntax for XML declarations were much more restrictive
(a single space between pseudo-attributes).
You may also want to compare with the regular expression in the RDF
validator, which I added at
http://dev.w3.org/cvsweb/java/classes/org/w3c/rdf/examples/ARPServlet.java.diff?r1=1.34&r2=1.35
(look for the line containing >>>>RE r = new RE("<?xml<<<<).
Regards, Martin.
#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
Received on Saturday, 23 June 2007 07:13:47 UTC