- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Sat, 23 Jun 2007 13:31:51 +0900
- To: olivier Thereaux <ot@w3.org>
- Cc: www-validator@w3.org
Hello Olivier,

Sorry for the delay of my answer.

At 02:24 07/06/20, olivier Thereaux wrote:
>Hello Martin,
>
>On May 30, 2007, at 05:22 , Martin Duerst wrote:
>> I can definitely submit a patch that goes into XML mode if an
>> XML declaration is present.
>
>Are you (still) working on this?

No, not yet, sorry.

>If not, I am thinking of going ahead with an implementation

Great!

>based on the following regexp, using a test for it in the first
>line of the content:

Strictly speaking, an XML declaration can go over more than one
line. It has to start with an '<' as the very first character of
the file, but then it can include linebreaks. I see that you have
included \x9|\xD|\xA below, so maybe you were aware of this, and
above, it should have been "a test for it starting in the first
line".

I took the liberty of adding an 'x' at the end of this regexp,
and laying it out to better understand it. I found several issues:

- There was a missing '[' at the start of the WS after encoding.
- There can be space around the equal sign.
- There is no requirement for the XML declaration to be on its
  own line, i.e. the following would be legal:
  <?xml version='1.0'?><!-- this is an XML declaration -->
- The version pseudo-attribute is needed (assuming we are not
  talking about a text declaration
  (http://www.w3.org/TR/REC-xml/#sec-TextDecl)), because we are
  not validating entities, only documents.
- There are two consecutive space groups (the one at the end of
  standalone and the one before ?>). They don't change what's
  matched, but they may hurt performance when backtracking.

As a result, I get the following:

/^<\?xml
  [\x20|\x9|\xD|\xA]+ version [\x20|\x9|\xD|\xA]* = [\x20|\x9|\xD|\xA]*
    ("1.0"|"1.1"|'1.0'|'1.1')
  ([\x20|\x9|\xD|\xA]+ encoding [\x20|\x9|\xD|\xA]* = [\x20|\x9|\xD|\xA]*
    ("[A-Za-z][a-zA-Z0-9-_]+"|'[A-Za-z][a-zA-Z0-9-_]+')
  )?
  ([\x20|\x9|\xD|\xA]+ standalone [\x20|\x9|\xD|\xA]* = [\x20|\x9|\xD|\xA]*
    ("yes"|"no"|'yes'|'no')
  )?
  [\x20|\x9|\xD|\xA]* \?>
/x

I think this can be shortened and made more readable by using
variable substitution, but you have to try out how variable
substitution, escaping, and the /x interact to make sure you
get this right.

>It should match XML declarations as defined by XML 1.0 and 1.1 specs,
>and not be too greedy.

I definitely don't see a problem with greediness. Even your version
(with a single ']' added) should work; I have rarely if ever seen
anything that makes use of these full rules, and I have sometimes
wished the syntax for XML declarations were much more restrictive
(a single space between pseudo-attributes).

You may also want to compare with the regular expression in the
RDF validator, which I added at
http://dev.w3.org/cvsweb/java/classes/org/w3c/rdf/examples/ARPServlet.java.diff?r1=1.34&r2=1.35
(look for the line containing >>>>RE r = new RE("<?xml<<<<).

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
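The variable-substitution idea mentioned above could look roughly like
the following. This is only a sketch, not the validator's code: it uses
Python's re.VERBOSE as a stand-in for Perl's /x, and the names S, EQ,
quoted, XML_DECL and has_xml_declaration are made up for illustration.
It also assumes the whitespace class is meant to be just the four XML
whitespace characters (without '|'), and takes the encoding name from
the EncName production of the XML spec.

```python
import re

# XML whitespace (production S): space, tab, CR, LF.
S = r"[\x20\x09\x0D\x0A]"
# Eq ::= S? '=' S?  (whitespace around the equals sign is optional).
EQ = rf"{S}*={S}*"


def quoted(body: str) -> str:
    """Return a pattern matching body in double or in single quotes."""
    return rf"""(?:"{body}"|'{body}')"""


# Pseudo-attribute values, built once so the verbose pattern stays short.
VERSION = quoted(r"1\.[01]")
ENCODING = quoted(r"[A-Za-z][A-Za-z0-9._-]*")
STANDALONE = quoted(r"(?:yes|no)")

# Layout whitespace below is ignored because of re.VERBOSE; the only
# significant whitespace is what the S class matches.
XML_DECL = re.compile(
    rf"""
    \A<\?xml
    {S}+ version {EQ} {VERSION}
    (?: {S}+ encoding {EQ} {ENCODING} )?
    (?: {S}+ standalone {EQ} {STANDALONE} )?
    {S}* \?>
    """,
    re.VERBOSE,
)


def has_xml_declaration(text: str) -> bool:
    """True if text starts with an XML declaration at its very first character."""
    return XML_DECL.match(text) is not None
```

With this sketch, the one-line example from the list above
(<?xml version='1.0'?><!-- this is an XML declaration -->) is accepted,
as is a declaration spread over several lines, while input with leading
whitespace or without a version pseudo-attribute is rejected.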
Received on Saturday, 23 June 2007 07:13:47 UTC