Re: [whatwg/dom] Valid/Invalid characters in document.createElement() (#849)

That's a great point, though I'd like to understand the problem a bit better before considering options.

It seems that currently you can put the parser in a different state by using `:`, `_`, or a code point greater than U+007F from NameStartChar. However, you don't have much room after that as the parser operates on ASCII code points and you only get `.`, `-`, and `0` through `9` in NameChar, none of which have any effect.

The HTML parser, on the other hand, requires `a` through `z` (ASCII case-insensitive) to get into the tag name state, at which point almost anything goes. However, before it gets there (in the tag open state) it also has special handling for `!`, `/`, and `?`, as well as for anything that is not `a` through `z`. This means you could end up in the data state, where `&` and `<` have special meaning.
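To make that branching concrete, here is a minimal TypeScript sketch of where the tokenizer goes after `<`, based on my reading of the tag open state. `NextState` and `stateAfterTagOpen` are hypothetical names for illustration only, and the sketch ignores EOF handling and parse-error reporting.

```typescript
// Hypothetical helper: which tokenizer state follows "<" for a given input.
// Simplified: EOF and parse-error reporting are not modelled.
type NextState =
  | "markup declaration open"
  | "end tag open"
  | "tag name"
  | "bogus comment"
  | "data";

function stateAfterTagOpen(input: string): NextState {
  const c = input.charAt(0); // "" when the input is empty
  if (c === "!") return "markup declaration open";
  if (c === "/") return "end tag open";
  if (/^[A-Za-z]$/.test(c)) return "tag name"; // ASCII alpha only
  if (c === "?") return "bogus comment";
  // Anything else: "<" is emitted as text and the tokenizer is back in data.
  return "data";
}
```

So a name starting with `:`, `_`, or a non-ASCII code point never reaches the tag name state in this model; it falls through to the last branch.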

What does this mean? I'm not entirely sure.

It could be that inputs such as `:&lt;script&gt;alert(1)&lt;&#47;script&gt;` become problematic if we need to account for reparsing attacks, as could simpler inputs such as `:<!--`. (Though see https://lists.w3.org/Archives/Public/www-dom/2014JanMar/0175.html.)

---

I wonder if we can get to more lenient rules safely if we better account for ASCII.

If the first code point is `a` through `z` (ASCII case-insensitive), the following code points can be anything except for the code points that escape the tag name state. This ought to be safe as the HTML parser already operates this way.

If the first code point is not `a` through `z`, I think it's fine if it's greater than U+007F, but below that it ought to match what NameStartChar allows (i.e., `:` or `_`). In that case, subsequent code points should be what NameChar allows, though again I think anything greater than U+007F ought to be okay too, as it cannot influence the parser. (The exception would be if we also need to account for Unicode normalization, but at that point it becomes rather bananas in my opinion.)

This is both stricter and more lenient than what @domenic wrote above. It's stricter when it comes to ASCII, which I think is a good thing as that is the risky area for potential state transitions. It's "everything goes" for non-ASCII, which I think is good mainly from the point of view of reducing the complexity of the checks we need to perform.
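As a rough sketch of the two rules proposed above, in TypeScript: the names `ALPHA_FIRST_NAME`, `OTHER_FIRST_NAME`, and `isValidElementName` are hypothetical. The set of code points that "escape the tag name state" is my assumption: tab, LF, FF, space, `/`, and `>`, plus CR and NULL to be conservative. Because the rules only distinguish ASCII from non-ASCII, matching on UTF-16 code units is enough here, since surrogates are all above U+007F.

```typescript
// Rule 1: first code point is ASCII alpha; the rest may be anything except
// the (assumed) code points that leave the tag name state:
// tab, LF, FF, CR, space, "/", ">", and NULL.
const ALPHA_FIRST_NAME = /^[A-Za-z][^\t\n\f\r \/>\0]*$/;

// Rule 2: first code point is ":", "_", or non-ASCII; the rest must be an
// ASCII NameChar (- . 0-9 : A-Z _ a-z) or non-ASCII.
const OTHER_FIRST_NAME = /^[:_\u0080-\uFFFF][-.0-9:A-Z_a-z\u0080-\uFFFF]*$/;

// Hypothetical validator, not an existing or agreed-upon DOM algorithm.
function isValidElementName(name: string): boolean {
  return ALPHA_FIRST_NAME.test(name) || OTHER_FIRST_NAME.test(name);
}
```

Under this sketch `a<b` is accepted (the HTML parser would stay in the tag name state for it), while `:<!--` and the `:&lt;script&gt;…` input are rejected because `<` and `&` are not in NameChar.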

Thoughts?

-- 
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/dom/issues/849#issuecomment-1058064183

Message ID: <whatwg/dom/issues/849/1058064183@github.com>

Received on Thursday, 3 March 2022 13:55:13 UTC