Re: Problems validating XML from Martin Duerst on 2007-06-25 (www-validator@w3.org from June 2007)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Mon, 25 Jun 2007 17:54:53 +0900
To: olivier Thereaux <ot@w3.org>
Cc: www-validator@w3.org
Message-Id: <6.0.0.20.2.20070625175033.04bbd180@localhost>

At 14:18 07/06/25, olivier Thereaux wrote:
>Hi Martin,
>
>Thanks for looking into the regexp, and especially for spotting one  
>of my mistakes. Much appreciated.
>
>On Jun 23, 2007, at 13:31 , Martin Duerst wrote:
>> Strictly speaking, an XML declaration can go over more than one line.
>> It has to start with an '<' as the very first character of the file,
>> but then it can include linebreaks.
>
>Indeed. Also as Ville noted, there could be a BOM at the beginning too.

I'm a bit sceptical here. The BOM is part of guessing the encoding
family, but the regexp is ASCII-based, so it comes after guessing
the encoding family. At least in theory, there was at one point
some code that would also have allowed validating EBCDIC-based,
UTF-16-based, or UTF-32-based documents, and the regexp we are
working on here should be applied only after we have made a general
attempt at transcoding the start of the document into the relevant
encoding family.
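
To illustrate the ordering described above, here is a minimal sketch in
Python, not the validator's actual code. It guesses the encoding family
from the leading bytes (patterns adapted from XML 1.0, Appendix F) and
only then transcodes the start of the document, so that an ASCII-based
regexp can be applied to the result. The names guess_family and
decoded_start, the 256-byte limit, and the choice of cp037 as the EBCDIC
variant are assumptions made for the example.

```python
import codecs

# BOMs, longest first: the UTF-32LE BOM begins with the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Without a BOM, the bytes of a leading "<?" (or "<?xm") reveal the family.
_NO_BOM = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
    (b"\x4c\x6f\xa7\x94", "cp037"),  # "<?xm" in EBCDIC; cp037 is an assumed variant
]


def guess_family(data):
    """Return (codec_name, bom_length) for the start of an XML document."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    for prefix, name in _NO_BOM:
        if data.startswith(prefix):
            return name, 0
    # Anything else is treated as the ASCII-compatible family
    # (UTF-8 without BOM, ISO-8859-x, ...).
    return "ascii", 0


def decoded_start(data, limit=256):
    """Transcode the first bytes to text so an ASCII-based regexp can run on it."""
    name, skip = guess_family(data)
    chunk = data[skip:skip + limit]
    # errors="replace": the chunk may cut a character in half, and "ascii"
    # is only a family guess, so non-ASCII bytes must not raise.
    return chunk.decode(name, errors="replace")
```

With this ordering, the BOM has already been consumed by the time the
regexp runs, so the regexp itself does not need to know about it.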

>> - There can be space around the equal sign.
>
>I see in your resulting regexp that you are using
>[\x20|\x9|\xD|\xA]+ = [\x20|\x9|\xD|\xA]+
>whereas I suspect it should be
>[\x20|\x9|\xD|\xA]* = [\x20|\x9|\xD|\xA]*
>since there could also be no space, right?

Yes, sorry, I think I caught some of that before sending out my
mail, but apparently not everything.
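
For reference, a minimal sketch in Python of a pattern with optional
whitespace around the equal sign. It is a hypothetical pattern, not the
regexp used by the validator. It assumes the text has already been
transcoded, accepts only version numbers of the form 1.x, and writes the
whitespace class without "|" separators, because inside brackets a "|"
is a literal character rather than an alternation. The names XML_DECL and
declared_encoding are made up for the example.

```python
import re

# S from the XML 1.0 grammar: space, tab, CR, LF.
S = r"[\x20\x09\x0D\x0A]"
# Eq ::= S? '=' S?  (whitespace around '=' is optional)
EQ = S + "*=" + S + "*"
# EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*
ENC_NAME = r"[A-Za-z][A-Za-z0-9._-]*"


def quoted(inner):
    # Either quote style is allowed, but opening and closing quote must match.
    return '(?:"(' + inner + ')"' + "|'(" + inner + ")')"


# Hypothetical pattern. \A anchors at the very first character of the text;
# line breaks inside the declaration are covered by S.
XML_DECL = re.compile(
    r"\A<\?xml"
    + S + "+version" + EQ + quoted(r"1\.[0-9]+")               # groups 1, 2
    + "(?:" + S + "+encoding" + EQ + quoted(ENC_NAME) + ")?"    # groups 3, 4
)


def declared_encoding(text):
    """Return the declared encoding name, or None if there is no XML
    declaration at the start or it carries no encoding declaration."""
    m = XML_DECL.match(text)
    if m is None:
        return None
    return m.group(3) or m.group(4)
```

A declaration spread over several lines, such as

    <?xml version = '1.0'
      encoding = 'iso-8859-1' ?>

is matched by this pattern, as is the compact form without any space
around the equal signs.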

Regards,   Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp

Received on Monday, 25 June 2007 10:11:05 UTC