Re: ampersands, angle brackets, errors, warnings and xml from Marcos Rubino on 2005-07-18 (www-validator@w3.org from July 2005)

From: Marcos Rubino <contact_marcos@yahoo.es>
Date: Sun, 17 Jul 2005 23:48:59 -0400
To: olivier Thereaux <ot@w3.org>
Cc: www-validator@w3.org
Message-id: <42DB26AB.60603@yahoo.es>

Hi Olivier,

Thanks for your response.  I did quite a bit more digging and I now 
think I understand the situation a little better.  See my responses inline.

olivier Thereaux wrote:
> Hi Marc,
> 
> Thanks for sending this message, especially after obvious serious  
> research.
> 
> I think your conclusions are correct (see below for details), but  
> please note that I am not as much of an expert as others on this  list. 
> Hopefully if I say something completely wrong, they'll jump in :).
> 
> On 12 Jul 2005, at 00:28, Marc Richards wrote:
> 
>> 1) Should the validator be throwing an error instead of a warning  
>> whenever it encounters an ampersand or left angle bracket as data  for 
>> a document served as application/xhtml+xml? i.e. was there a  
>> conscious decision made to only throw a warning or is this simply  one 
>> of the XML parser limitations.
> 
> 
> As far as I know, it is not legal in XML and authorized in SGML (with  
> shorttags). Therefore, in XML mode it should throw an error. Whether  it 
> should be a warning in SGML mode is source of controversy : you'll  get 
> an equal number of people asking for it, for the sake of quality,  and 
> of people complaining that the validator should not dare confuse  people 
> with warnings for a valid construct.
> 
> Instead, what happens is:
> - openSP's XML mode is "limited" (you saw the note)
> - in XML mode, openSP throws a warning for such a construct
> - in SGML mode, openSP accepts such constructs, unless asked to
> - XHTML is always parsed using XML mode (see also Bug 1500)
> 
> [Bug 1500] http://www.w3.org/Bugs/Public/show_bug.cgi?id=1500

Isn't bug 1500 misdirected?  Correct me if I am wrong here, but even if 
the XHTML as text/html pages were processed by the validator in SGML 
mode with an XHTML DTD they would still be "valid" (since XML is a 
subset of SGML) and as a result, bugs would still be filed against 
Mozilla, Opera and Safari as long as people weren't taking advantage of 
the techniques outlined in appendix C.

It may be useful to offer an XHTML 1.0 Appendix C conformance testing 
service (and it seems there have been some forays in that direction[1]) 
so that people could get an idea of how well their pages worked in HTML4 
UAs, but that doesn't mean that the validator is doing anything wrong.

[1]http://qa-dev.w3.org/~bjoern/appendix-c/validator/

A legitimate question still remains: Should the validator be parsing 
XHTML served as text/html in SGML mode or XML mode?

While I think it makes sense for standard HTML4 user-agents to process 
text/html documents in SGML mode for backwards compatibility, the 
majority of the users who test their XHTML pages using the validator are 
looking for forwards compatibility and the well-formedness that XML 
brings to the table.

In an ideal world, HTML4-only UAs would be served the page as text/html 
and XHTML UAs (including the validator) would be served the same page as 
application/xhtml+xml. However, the fact of the matter is that
(a) most people don't have content negotiation set up
(b) serving docs as application/xhtml+xml to current browsers that 
support it is very tricky/error prone (javascript issues, CSS issues, 
browser issues, etc)
(c) people have come to expect the validator to test XHTML pages for xml 
well-formedness
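
To make the content negotiation mentioned in (a) concrete, here is a rough 
sketch of the decision a server would have to make.  The function name 
choose_media_type is made up for illustration, and the rule it applies 
(serve application/xhtml+xml only to clients that list it explicitly in 
the Accept header) is just one conservative policy I am assuming, not 
something taken from a spec or from the validator.

```python
def choose_media_type(accept_header: str) -> str:
    """Pick the media type to serve for one XHTML document.

    Hypothetical helper: serve application/xhtml+xml only when the client
    lists it explicitly with a non-zero q-value; otherwise fall back to
    text/html.  Wildcards such as */* are deliberately ignored.
    """
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value.strip())
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        if media_type == "application/xhtml+xml" and q > 0:
            return "application/xhtml+xml"
    return "text/html"
```

A UA that sends only "text/html, */*;q=0.8" gets text/html under this 
policy, which is the safe fallback for HTML4-only UAs.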

Given the way things stand now I think the best default is for the 
validator to parse and evaluate the pages as XML.  I can't see any value 
to anyone (end-users, web-developers, UA-developers) in evaluating the 
pages as SGML instead of as XML while still using the XHTML DTD. If you 
are testing XML well-formedness, you already have SGML well-formedness 
covered (right?).  There is some value in testing Appendix C 
conformance, but that is a separate issue.


>> If this *is* one of the XML limitations then I think it would be  
>> helpful to compile a short list of common limitations and list them  
>> on a w3c page in plain English.  I have read the OpenSP page[4] a  
>> couple times and I am still not sure whether or not recognizing "<"  
>> and "&" as invalid is a limitation of the parser; The language on  
>> that page is fairly technical.  The validator could link to this  
>> internal page directly and that page would then link to the OpenSP  
>> page as well.
> 
> 
> This could be a good idea. How about starting a scratchpad on the wiki,
> e.g. somewhere like:
> http://esw.w3.org/topic/MarkupValidator/XML_Limitations
> and motivate people on the list to contribute?

Done[2].  Everybody feel free to add, subtract, enhance.

[2] http://esw.w3.org/topic/MarkupValidator/XML_Limitations

>> 2) Why are you issuing a warning for the use of ampersands and left  
>> angle brackets in xhtml but not html.  If the warning is in fact  
>> saying "this may be valid in some contexts, but it is recommended  to 
>> use &amp; or &lt;" then this is an SGML warning and should be  shown 
>> for both HTML and XHTML as text/html.  Ideally with an example like 
>> "R & D valid, R&D invalid".  Is there a bug open for  issuing the 
>> warning for html doctypes as well?
> 
> 
> See above, my remark on the fussy mode. You could search this list  for 
> "fussy" and get an idea of the discussions that happened a while  ago on 
> this topic.

Is it technically possible to get the validator to flag & and < as 
warnings in SGML mode? I couldn't find a bug for this one.

If it is technically doable, I think there is less likelihood of 
backlash from the community (à la fussy mode) if
- users still got the bright green "this page is valid" at the top of 
the page
- the color of the warnings was made a little more neutral (yellow 
instead of pale red)
- the warning text was clear and helpful.

I am not sure how much utility this solution would really have, so I am 
not terribly gung ho about it, but I will file a bug if people think it 
is likely to help users avoid potential errors.
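
To illustrate the kind of warning I have in mind, here is a rough sketch 
that reports every "&" that does not start a complete reference such as 
&amp;, &#38; or &#x26;.  It is only a plain-text scan, not how OpenSP or 
the validator works; the name find_bare_ampersands is made up, and the 
scan knows nothing about comments, CDATA sections or script elements.

```python
import re

# An ampersand that starts a complete reference: &name; &#123; or &#x1F;
_REFERENCE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def find_bare_ampersands(text: str) -> list[int]:
    """Return the offset of every '&' that does not start a complete reference."""
    offsets = []
    for match in re.finditer("&", text):
        # Try to match a full reference starting exactly at this ampersand.
        if not _REFERENCE.match(text, match.start()):
            offsets.append(match.start())
    return offsets
```

Note that this flags both "R & D" and "R&D".  That matches the "it is 
recommended to use &amp;" reading of the warning rather than a strict 
SGML validity test.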

>> Are there open bugs you can point me to? Are there bugs I should file?
> 
> 
> I think 798 and 1500 are the relevant ones. If you think they do not  
> cover the whole span of the issue, feel free to open others.

Assuming that we are agreed about evaluating XHTML documents served as 
text/html in XML mode, is it technically possible to get the validator 
to flag & and < as errors? Unless I am mistaken, this seems to be the 
most obvious area where UAs choke on the well-formedness test (when 
parsing as XML), but the validator just lets you off with a warning.  Of 
course it would be very important to make it clear to users why their 
document doesn't validate using language they can understand, plenty of 
examples, and links to more detailed information.
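
For anyone who wants to see the UA behaviour I am describing, here is a 
small sketch that uses Python's standard XML parser as a stand-in for an 
XML-mode UA.  This is an assumption for illustration only; it is not the 
parser the validator uses, and is_well_formed is a name I made up.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the string parses as well-formed XML, else False."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        # A conforming XML parser must treat this as a fatal error.
        return False
    return True


# A bare "&" or "<" used as data is rejected; the escaped forms are accepted.
samples = [
    "<p>R &amp; D</p>",
    "<p>R & D</p>",
    "<p>1 &lt; 2</p>",
    "<p>1 < 2</p>",
]
for sample in samples:
    print(sample, "->", is_well_formed(sample))
```

An XML parser has to stop at the first such error, which is why a 
warning from the validator understates the problem.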

Is there a bug open for this?  Is it likely to be fixed without major 
architectural changes? Bug 798 seems to be mislabeled.  As far as the 
solution that was found is concerned it should be titled "warnings are 
mistakenly suppressed on valid pages".

> Hope this answered your questions.
> 

Sure did, which of course led to more questions.  Thanks for taking the 
time to answer.


Marc
Received on Monday, 18 July 2005 03:49:29 UTC