Re: SC 4.1.1 source fails but DOM passes - must a page fail? from Mark Rogers on 2019-01-11 (w3c-wai-ig@w3.org from January to March 2019)

From: Mark Rogers <mark.rogers@powermapper.com>
Date: Fri, 11 Jan 2019 13:49:47 +0000
To: "w3c-wai-ig@w3.org" <w3c-wai-ig@w3.org>
Message-ID: <57429A1F-C1B1-4857-9481-1A977D4DBE03@powermapper.com>
The Understanding SC 4.1.1 Parsing doc says 'the Success Criterion requires that the content can be parsed using only the rules of the formal grammar.'

The key bit is 'parsing' - the parsing phase in browsers transforms raw HTML source into the initial DOM tree. Once you have a DOM there's no more parsing involved unless you set innerHTML or outerHTML. If there are parsing problems you may lose information or get unexpected side effects, but in many cases the parser can recover with few problems for the end user.

However, there are some assumptions in the SC that aren't true in practice:

1) The formal grammar (DTDs in the case of HTML 4 and XHTML) doesn't always match the normative text in the same spec, or match up with other specs. See below for examples of things that validate but don't work.

2) Duplicate attributes can't occur in the DOM because the DOM has no way to store duplicate attributes:
https://www.w3.org/TR/DOM-Level-3-Core/core.html#ID-1780488922

Any subsequent attributes with the same name on an element are ignored by the parser, according to the spec.

3) Most mismatched start and end tags aren't a problem.
For example, <h1>Heading</h2> is parsed into <h1>Heading</h1> in the DOM.
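Point 2 above can be sketched in a few lines of Python. This is only an illustration, not a browser parser: it uses the standard library's html.parser, which reports attributes as written in the source (duplicates included), and then applies a "first occurrence wins" rule to approximate what ends up in the DOM. The helper names (StartTagCollector, source_attributes, duplicated_names, first_wins) are made up for this example.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Records each start tag's attributes as the tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []  # list of (tag_name, [(attr_name, attr_value), ...])

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))


def source_attributes(markup):
    """Return the attribute list of the first start tag found in markup."""
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags[0][1]


def duplicated_names(attrs):
    """Attribute names that appear more than once in the source."""
    seen, dupes = set(), []
    for name, _ in attrs:
        if name in seen and name not in dupes:
            dupes.append(name)
        seen.add(name)
    return dupes


def first_wins(attrs):
    """Approximate the DOM view: keep only the first value for each name.

    html.parser reports valueless attributes as None; the DOM stores them
    as an empty string, so None is mapped to "" here.
    """
    kept = {}
    for name, value in attrs:
        kept.setdefault(name, "" if value is None else value)
    return kept


attrs = source_attributes("<p class='a' id='x' class='b'>text</p>")
print(duplicated_names(attrs))  # names a source-level check would flag
print(first_wins(attrs))        # what a DOM-level check would see
```

A source-level check sees the repeated class attribute, while a DOM-level check only ever sees one class attribute carrying the first value.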

Things that do cause problems:

1) Duplicate IDs on different elements - the DOM can contain duplicate IDs, and the DOM spec says behaviour is undefined if they do:
https://www.w3.org/TR/DOM-Level-2-Core/core.html#ID-getElBId

Screen reader behaviour when duplicate IDs are used is very inconsistent:
https://www.powermapper.com/tests/screen-readers/labelling/dupe-ids/


2) IDs and IDREFs in the source that differ only by case. These don't produce validation errors with the HTML 4 doctype, or with other doctypes that specify NAMECASE GENERAL YES in the DTD formal grammar (this makes IDs case-insensitive). The normative text elsewhere in the HTML 4 recommendation says IDs are case-sensitive. These ID/IDREF pairs do produce validation errors with the HTML 5 and XHTML doctypes. For example:

a) This code doesn't validate, and the label is not associated with the input because of the case mismatch:
<!DOCTYPE html>
<title>Example</title>
<label for='TextField'>Name:</label>
<input id='TEXTFIELD' type='text' >

b) The same code with an HTML 4 doctype validates successfully, but the label is still not associated because of the case mismatch:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<title>Example</title>
<label for='TextField'>Name:</label>
<input id='TEXTFIELD' type='text' >

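3) Misquoted attributes - for example:
<img src='rota.png' alt='Teachers' class rota class='shadow' >
is parsed into the DOM as:
<img src="rota.png" alt="Teachers" class="" rota="">
The stray words become empty attributes, and class='shadow' is dropped as a duplicate of the earlier class attribute. This is definitely not what the author intended.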

4) Unterminated HTML comments:
<!-- where does this comment finish 
<html>
...
</html>
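Problems 1 and 2 above (duplicate IDs, and IDREFs that match an ID only when case is ignored) can be checked mechanically. The following is a rough sketch using Python's standard html.parser, not a validator or a browser-grade parser. The names IdCollector and check_ids are invented for this example, and the set of attributes treated as ID references is an assumption that covers only for and aria-labelledby.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: only these attributes are treated as ID references here.
# Real HTML has more (for example "headers", "list", "aria-describedby").
IDREF_ATTRS = ("for", "aria-labelledby")


class IdCollector(HTMLParser):
    """Collects id values and ID references from start tags."""

    def __init__(self):
        super().__init__()
        self.ids = []
        self.refs = []

    def handle_starttag(self, tag, attrs):
        seen = set()
        for name, value in attrs:
            if name in seen:
                continue  # later duplicates of an attribute are ignored
            seen.add(name)
            if not value:
                continue
            if name == "id":
                self.ids.append(value)
            elif name in IDREF_ATTRS:
                # aria-labelledby may hold several space-separated IDs
                self.refs.extend(value.split())


def check_ids(source):
    """Return (duplicate IDs, references that match an ID only by case)."""
    collector = IdCollector()
    collector.feed(source)
    collector.close()
    counts = Counter(collector.ids)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    known = set(collector.ids)
    lowered = {i.lower() for i in known}
    case_mismatches = sorted(
        ref for ref in set(collector.refs)
        if ref not in known and ref.lower() in lowered
    )
    return duplicates, case_mismatches


source = """<!DOCTYPE html>
<title>Example</title>
<label for='TextField'>Name:</label>
<input id='TEXTFIELD' type='text' >
<p id='note'>One</p>
<p id='note'>Two</p>"""

duplicates, case_mismatches = check_ids(source)
print("Duplicate IDs:", duplicates)
print("Case-mismatched references:", case_mismatches)
```

The comparison is deliberately case-sensitive first and case-insensitive second, so a reference is only reported when it would have matched under the HTML 4 DTD's case-insensitive rule but does not match in practice.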

> 4.1.1 Parsing: In content implemented using markup languages,
> elements have complete start and end tags,
> elements are nested according to their specifications,
> elements do not contain duplicate attributes,
> and any IDs are unique,
> except where the specifications allow these features.

If the SC is applied to the DOM most of the things the SC looks for can't happen:

- the DOM can't have incomplete start and end tags because each element is represented as a single Element node https://www.w3.org/TR/dom/#node-tree
- the DOM can't store duplicate attributes https://www.w3.org/TR/dom/#node-tree
- most nesting problems can't happen, other than using nested interactive elements like <button>Button <a href='/'>Link</a></button>
- but duplicate IDs can occur

Best Regards
Mark

-- 
Mark Rogers - mark.rogers@powermapper.com
PowerMapper Software Ltd - www.powermapper.com
Registered in Scotland No 362274 Quartermile 2 Edinburgh EH3 9GL
 



On 10/01/2019, 23:56, "Patrick H. Lauke" <redux@splintered.co.uk> wrote:

    On 10/01/2019 20:35, Bristow, Alan wrote:
    [...]
    > b). some browsers are less capable than others and so some may fail to 
    > ‘mend’ some invalid HTML
    > 
    > that I probably have to follow position 1. since it is unequivocal.
    
    The problem used to be that each browser/engine would do its own flavour 
    of error correction/remediation to turn broken markup into a sensible 
    DOM. This problem has now - mostly - gone away since HTML5 defines in 
    much more detail how browsers should parse markup, including broken 
    markup. As such, I would say the requirement of this SC should focus on 
    the generated DOM itself rather than the source that is sent.
    
    P
    -- 
    Patrick H. Lauke
    
    www.splintered.co.uk | https://github.com/patrickhlauke

    http://flickr.com/photos/redux/ | http://redux.deviantart.com

    twitter: @patrick_h_lauke | skype: patrick_h_lauke
Received on Friday, 11 January 2019 13:50:13 UTC