
Re: Problems validating XML

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Mon, 25 Jun 2007 17:54:53 +0900
Message-Id: <6.0.0.20.2.20070625175033.04bbd180@localhost>
To: olivier Thereaux <ot@w3.org>
Cc: www-validator@w3.org

At 14:18 07/06/25, olivier Thereaux wrote:
>Hi Martin,
>
>Thanks for looking into the regexp, and especially for spotting one  
>of my mistakes. Much appreciated.
>
>On Jun 23, 2007, at 13:31 , Martin Duerst wrote:
>> Strictly speaking, an XML declaration can go over more than one line.
>> It has to start with an '<' as the very first character of the file,
>> but then it can include linebreaks.
>
>Indeed. Also as Ville noted, there could be a BOM at the beginning too.

I'm a bit sceptical here. The BOM is used for guessing the encoding
family, but the regexp is ASCII-based, so it has to run after the
encoding family has been guessed. At least in theory, there was at
one point some code that would also have allowed validating
EBCDIC-based, UTF-16-based, or UTF-32-based documents. The regexp we
are working on here should therefore run only after a general attempt
at transcoding the start of the document, taking the relevant
encoding family into account.
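
To illustrate the ordering I mean, here is a rough Python sketch. It is
not the validator's code; the names guess_family and decode_start, the
256-unit limit, and the choice of cp037 as the EBCDIC flavour are
assumptions made for this example. The byte signatures follow the
autodetection appendix of the XML specification.

```python
# (byte prefix, Python codec name, number of BOM bytes to skip).
# Longer signatures come first so that the UTF-32LE BOM is not
# mistaken for the UTF-16LE BOM.
_SIGNATURES = [
    (b"\x00\x00\xfe\xff", "utf-32-be", 4),   # BOM, UTF-32 big-endian
    (b"\xff\xfe\x00\x00", "utf-32-le", 4),   # BOM, UTF-32 little-endian
    (b"\xef\xbb\xbf", "utf-8", 3),           # BOM, UTF-8
    (b"\xfe\xff", "utf-16-be", 2),           # BOM, UTF-16 big-endian
    (b"\xff\xfe", "utf-16-le", 2),           # BOM, UTF-16 little-endian
    (b"\x00\x00\x00<", "utf-32-be", 0),      # '<' without BOM
    (b"<\x00\x00\x00", "utf-32-le", 0),
    (b"\x00<\x00?", "utf-16-be", 0),         # '<?' without BOM
    (b"<\x00?\x00", "utf-16-le", 0),
    (b"\x4c\x6f\xa7\x94", "cp037", 0),       # '<?xm' in EBCDIC (assumed flavour)
]


def guess_family(head: bytes):
    """Return (codec, bom_length) guessed from the first four bytes."""
    for prefix, codec, skip in _SIGNATURES:
        if head.startswith(prefix):
            return codec, skip
    return "ascii", 0  # fall back to an ASCII-compatible family


def decode_start(data: bytes, limit: int = 256) -> str:
    """Decode only the start of the document so that an ASCII-based
    regexp can be applied to the result afterwards."""
    codec, skip = guess_family(data[:4])
    return data[skip:skip + limit].decode(codec, errors="replace")
```

With something like this in front, the regexp never sees the BOM and
only ever sees text, whatever the encoding family was.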

>> - There can be space around the equal sign.
>
>I see in your resulting regexp that you are using
>[\x20|\x9|\xD|\xA]+ = [\x20|\x9|\xD|\xA]+
>whereas I suspect it should be
>[\x20|\x9|\xD|\xA]* = [\x20|\x9|\xD|\xA]*
>since there could also be no space, right?

Yes, sorry, I think I caught some of that before sending out my
mail, but apparently not everything.
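
For the record, a minimal Python sketch of the pattern with optional
whitespace around the equal sign. This is an illustration only, not the
regexp used in the validator; the names S, EQ, XML_DECL and
declared_encoding are made up for this example. It also leaves the '|'
out of the brackets, because inside a character class '|' is matched
literally rather than acting as alternation.

```python
import re

# XML whitespace (the S production): space, tab, CR, LF.
S = r"[\x20\x09\x0D\x0A]"
# Zero or more whitespace characters on either side of '='.
EQ = rf"{S}*={S}*"

XML_DECL = re.compile(
    rf"""\A<\?xml
    {S}+version{EQ}(?:"[^"]*"|'[^']*')
    (?:{S}+encoding{EQ}(?:"([^"]*)"|'([^']*)'))?
    (?:{S}+standalone{EQ}(?:"(?:yes|no)"|'(?:yes|no)'))?
    {S}*\?>""",
    re.VERBOSE,
)


def declared_encoding(start: str):
    """Return the encoding named in the XML declaration, or None if
    there is no declaration or it carries no encoding."""
    m = XML_DECL.match(start)
    if m is None:
        return None
    return m.group(1) or m.group(2)
```

Because \A anchors the match and S+ accepts linebreaks, this covers a
declaration that spans several lines, but only when '<' is the very
first character of the text being matched.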

Regards,   Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp     
Received on Monday, 25 June 2007 10:11:05 GMT
