Perhaps I'm missing something, but isn't "foo<caption>bar</caption>" an
invalid use case? Any top-level element that needs a context can't be mixed
with a text node. Are there cases where this isn't true?

I don't know how the actual parsing works, but the following logic seems
reasonable to me:

1. If the first character is not "<", use the default context.
2. Otherwise, read all contiguous characters that are valid in an element
   name.
3. If the element name found is valid, use it to determine the context.
   Otherwise, use the default context.
4. Parse the string using the context determined above.

This should give every possible string a deterministic outcome, based on
existing rules.

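
To make the steps above concrete, here is a rough sketch in Python (chosen
only for illustration). The name guess_context, the CONTEXT_FOR table, and
the pattern used for "valid element name characters" are all my own
assumptions, not anything taken from a spec or from jQuery; the table is
deliberately incomplete, and the "body" default is the one mentioned in the
quoted text below.

```python
import re

# Hypothetical, incomplete mapping from a leading element name to the
# context element it would need. Illustrative only, not from any spec.
CONTEXT_FOR = {
    "caption": "table",
    "colgroup": "table",
    "thead": "table",
    "tbody": "table",
    "tfoot": "table",
    "tr": "tbody",
    "td": "tr",
    "th": "tr",
    "col": "colgroup",
    "option": "select",
    "optgroup": "select",
}

DEFAULT_CONTEXT = "body"

# Assumption: an element name is an ASCII letter followed by ASCII
# letters or digits. Real tag-name rules are more permissive.
_LEADING_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)")


def guess_context(markup: str) -> str:
    """Pick a parsing context by looking only at the start of the string."""
    match = _LEADING_TAG.match(markup)  # anchored at the first character
    if match is None:
        # First character is not "<", or no element name follows it.
        return DEFAULT_CONTEXT
    name = match.group(1).lower()
    # Names not in the table (unknown, or needing no special context)
    # fall back to the default.
    return CONTEXT_FOR.get(name, DEFAULT_CONTEXT)
```

With this sketch, guess_context("<caption>bar</caption>") gives "table",
while both "foo<caption>bar</caption>" and "<coption>" give "body", since
only the very start of the string is inspected and there is no look-ahead.
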
On Wed, May 9, 2012 at 3:51 PM, Ian Hickson <ian@hixie.ch> wrote:
> On Wed, 9 May 2012, Jonas Sicking wrote:
> >
> > I think having to provide a context everywhere you want to
> > parse HTML is creating very bad developer ergonomics.
>
> You wouldn't have to provide it everywhere. The vast majority of the time,
> the default "body" context is fine.
>
> > I think the proposals here, and the fact that jQuery has implemented
> > context-free HTML parsing, prove that it is technically possible.
>
> I don't think look-ahead and magically determining the parse mode from a
> preparse of the string is really a sane solution. It doesn't handle all
> cases (e.g. it doesn't handle the <style> example I gave), and it results
> in very weird results ("very bad developer ergonomics") for cases like
> "1GB of text followed by <caption>" vs "1GB of text followed by <coption>"
> (where the former loses the text and the latter does not).
>
> --
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'