[whatwg] <!DOCTYPE html><body><table><math><mi>foo</mi></math></table> and other parser questions

On Tue, Dec 13, 2011 at 2:32 PM, Ian Hickson <ian at hixie.ch> wrote:
> On Mon, 12 Dec 2011, Adam Barth wrote:
>> I'm trying to understand how the HTML parsing spec handles the following case:
>>
>> <!DOCTYPE html><body><table><math><mi>foo</mi></math></table>
>>
>> According to the html5lib test data, we should parse that as follows:
>>
>> | <!DOCTYPE html>
>> | <html>
>> |   <head>
>> |   <body>
>> |     <math math>
>> |       <math mi>
>> |         "foo"
>> |     <table>
>>
>> However, I'm not sure whether that's what the spec actually does.
>>
>> Consider the point at which we parse the "f" character token (from "foo").
>> The insertion mode will be "in table". The spec will execute as
>> follows:
>>
>> -> If the current node is a MathML text integration point and the
>> token is a character token
>>   * Process the token according to the rules given in the section
>> corresponding to the current insertion mode in HTML content.
>>
>> -> A character token
>>   * Let the pending table character tokens be an empty list of tokens.
>>   * Let the original insertion mode be the current insertion mode.
>>   * Switch the insertion mode to "in table text" and reprocess the token.
>>
>> -> Any other character token
>>   * Append the character token to the pending table character tokens list.
>>
>> ... the "o" and "o" will be processed similarly and end up in the
>> pending table character tokens list.
>>
>> Now, consider the </mi> token. We're still at a MathML text
>> integration point, but the current token is neither a start token
>> (with certain names) nor a character token, so we process the token
>> according to the rules given in the section for parsing tokens in
>> foreign content.
>>
>> -> Any other end tag
>>   * Run these steps:
>>     ...
>>
>> The net result of which is popping the stack of open elements, but not
>> flushing out the pending table character tokens list. The list will
>> eventually be flushed when we process the </table> token, resulting in
>> these character tokens getting foster parented:
>>
>> | <!DOCTYPE html>
>> | <html>
>> |   <head>
>> |   <body>
>> |     <math math>
>> |       <math mi>
>> |     "foo"
>> |     <table>
>
> On Tue, 18 Oct 2011, David Flanagan wrote:
>>
>> Here's my current workaround:
>>
>> In 13.2.5, in the rules for whether to use the current insertion mode or
>> to insert the token as foreign content, if the token is being inserted
>> because the current node is a math (or HTML, but I'm not sure about
>> that) integration point, then first set a text_integration_mode flag,
>> then invoke the current insertion mode, then clear the flag.
>>
>> And in the in table insertion mode, when a character token is inserted,
>> and the text_integration_mode flag is set, then just process the token
>> using in body mode, and otherwise follow the directions that are there
>> now.
>>
>> I'm not sure that is the best way to fix the spec, but it works for me,
>> in the sense that my parser now passes the tests.
>
> I think the real problem is that there's no need to go into the "table
> text" mode if the current node is not a table model element. So I've
> changed the spec at that point.
>
> Please let me know if that doesn't fix the test case or causes any other
> regressions.

That fix seems to work great.
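
For readers following along, the check described in the quoted fix can be
sketched roughly as below. This is only an illustration under stated
assumptions, not the spec text or any parser's real API: the function name
rules_for_character_token is hypothetical, the set of "table model" elements
is assumed (the thread does not list them), and falling back to the "in body"
rules for other current nodes is likewise an assumption about how the
remaining "in table" rules apply.

```python
# Hypothetical sketch; names are illustrative, not from any real parser.
# Assumed set of "table model" elements (not listed in this thread).
TABLE_MODEL_ELEMENTS = frozenset({"table", "tbody", "tfoot", "thead", "tr"})


def rules_for_character_token(insertion_mode: str, current_node: str) -> str:
    """Pick the rules used for a character token seen in a given mode."""
    if insertion_mode != "in table":
        return insertion_mode
    if current_node in TABLE_MODEL_ELEMENTS:
        # Buffer the character in the pending table character tokens list.
        return "in table text"
    # Current node is something like <mi>: skip the buffering and handle
    # the character with the ordinary "in body" rules instead (assumption).
    return "in body"


for node in ("table", "mi"):
    print(node, "->", rules_for_character_token("in table", node))
```

With a check like this, the "f", "o", "o" tokens seen while <mi> is the
current node would never enter the pending table character tokens list, so
there would be nothing left over to foster parent at </table>.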

Thanks!
Adam

Received on Wednesday, 14 December 2011 10:58:31 UTC