- From: Lachlan Hunt <lachlan.hunt@lachy.id.au>
- Date: Tue, 24 Jan 2006 22:38:53 +1100
Anne van Kesteren wrote:
> Quoting Henri Sivonen <hsivonen at iki.fi>:
>> I guess the XML style is the simplest thing that could work. :-/
>
> You are talking about conformance, but what do you want the parser to
> do? And also there is talk about whitespace between -- and > but currently all
> kinds of characters are allowed there (including - for instance).

It's important to decide upon what is to be considered a conformant
comment and what is not before we can settle upon the best way to parse
it. That way we can ensure that all conforming comments are handled
correctly and that error handling can be defined in an appropriate and
compatible way.

As for how to parse it, I'll use these test cases to demonstrate what I
consider to be the most sane way to handle comments. (Assume EOF at the
end of each one)

Test Case                          | Comment Content          | Output
-----------------------------------|--------------------------|--------------
PA<!>SS                            | ""                       | PASS
PA<! ->SS                          | " -"                     | PASS
PA<! -->SS                         | " "                      | PASS
PA<!->SS                           | "-"                      | PASS
PA<!- ->SS                         | "- -"                    | PASS
PA<!- ->SS -->                     | "- -"                    | PASS -->
PA<!- <!-->SS -->                  | "- <!"                   | PASS -->
PA<!- <!-- ->SS -->                | "- <!-- -"               | PASS -->
PA<!- -->SS                        | "- "                     | PASS
PA<!- -- >SS                       | "- "                     | PASS
PA<!-- FAIL -->SS                  | " FAIL "                 | PASS
PA<!--> FAIL -->SS                 | "> FAIL "                | PASS
PA<!--> FAIL <!-- -->SS            | "> FAIL <!-- "           | PASS
PA<!--> FAIL <!-- -- -->SS         | "> FAIL <!-- -- "        | PASS
PA<!-- > FAIL -- >SS               | " > FAIL "               | PASS
P<!-- -- >AS<!-- -->S              | " " (2 comments)         | PASS
PA<!-- FAIL -- FAIL -->SS          | " FAIL -- FAIL "         | PASS
P<!-- -- -->AS<!-- -- -->S         | " -- " (2 comments)      | PASS
PA<!-- -- -- -->SS                 | " -- -- "                | PASS
PA<!-- FAIL -- FAIL -- FAIL -->SS  | " FAIL -- FAIL -- FAIL " | PASS
PA<!--- FAIL -->SS                 | "- FAIL "                | PASS
PA<!--- FAIL --->SS                | "- FAIL -"               | PASS
<!-- ->FAIL                        | " ->FAIL"                |
<!--- ->FAIL                       | "- ->FAIL"               |
PA<!--->-->SS                      | "->"                     | PASS
<!-- --- ->                        | (not sure)               |
PA<!-- --- -->SS                   | " --- "                  | PASS
PA<!--- --- --->SS                 | "- --- -"                | PASS

As for actually defining how that is parsed, I believe it should work
something like this. Throughout this algorithm, (x) is used to
represent the input character, not literal characters. The following
isn't perfect, and I'm sure I've made some mistakes, but it should (I
believe) handle the above cases as described.

<!
  * Switch to marked section open state

Marked Section Open State
  --
    * Create comment token
    * Switch to comment state
  DOCTYPE
    * (DOCTYPE state)
  else (easy parse error)
    * Create comment token
    * Append (x) to comment token
    * Switch to comment end state

Comment State
  -
    * Switch to comment dash state
  EOF
    * Emit comment token and stop
  else
    * Append (x) to comment token
    * Remain in comment state

Comment Dash State
  -
    * Switch to comment end state
  EOF
    * Append '-' to the comment token
    * Emit comment token and stop
  else
    * Append '-' and (x) to comment token
    * Switch to comment state

Comment End State
  >
    * Emit comment token
    * Switch to data state
  -
    * Append '-' to comment token
  else (easy parse error)
    * Append '--' to comment token
    * Consume every character up to, but not including, the first
      occurrence of '>' or EOF (whichever comes first)
    * Append the characters to the comment token
    * If the comment token string matches /--\s*$/, then strip those
      characters. (This ensures that <!-- foo --> and <!-- foo -- >
      have the same comment data)
    * Emit the comment token

-- 
Lachlan Hunt
http://lachy.id.au/
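
For anyone who wants to experiment with the states above, here is a minimal Python sketch that transcribes them more or less literally. It is not a reference implementation. The function name tokenize and its (text, comments) return value are made up for illustration. The sketch also assumes a few things the steps leave open: in the comment end state's "else" branch the run of consumed characters starts at (x) itself, the closing '>' is then consumed and the parser returns to the data state, EOF in the comment end state takes that same "else" branch, and DOCTYPE handling is left out entirely. A literal transcription is not expected to reproduce every row of the table (for example, rows that open with a bare <! such as PA<!>SS), which fits the caveat that the steps may contain mistakes.

```python
import re


def tokenize(source):
    """Return (text, comments) for source, following the states as written."""
    text = []        # characters seen in the data state
    comments = []    # data of each emitted comment token
    token = ""       # comment token under construction
    state = "data"
    i, n = 0, len(source)

    while True:
        x = source[i] if i < n else None  # None stands in for EOF

        if state == "data":
            if x is None:
                break
            if source.startswith("<!", i):
                state = "marked section open"
                i += 2
            else:
                text.append(x)  # tags other than <! are out of scope here
                i += 1

        elif state == "marked section open":
            if source.startswith("--", i):
                token = ""
                state = "comment"
                i += 2
            elif source[i:i + 7].upper() == "DOCTYPE":
                raise NotImplementedError("DOCTYPE state is outside this sketch")
            else:  # parse error: start the token with (x), if there is one
                token = x or ""
                state = "comment end"
                if x is not None:
                    i += 1

        elif state == "comment":
            if x is None:
                comments.append(token)
                break
            if x == "-":
                state = "comment dash"
            else:
                token += x
            i += 1

        elif state == "comment dash":
            if x is None:
                comments.append(token + "-")
                break
            if x == "-":
                state = "comment end"
            else:
                token += "-" + x
                state = "comment"
            i += 1

        elif state == "comment end":
            if x == ">":
                comments.append(token)
                state = "data"
                i += 1
            elif x == "-":
                token += "-"
                i += 1
            else:  # parse error (assumption: EOF here is handled the same way)
                end = source.find(">", i)
                if end == -1:
                    end = n
                token += "--" + source[i:end]
                token = re.sub(r"--\s*$", "", token)
                comments.append(token)
                state = "data"
                i = min(end + 1, n)  # assumption: skip the '>' itself

    return "".join(text), comments
```

Calling tokenize("PA<!-- > FAIL -- >SS") exercises the "else" branch of the comment end state, including the /--\s*$/ stripping step, while tokenize("PA<!-- FAIL -->SS") takes only the ordinary path through the comment, comment dash, and comment end states.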
Received on Tuesday, 24 January 2006 03:38:53 UTC