- From: <lee@sq.com>
- Date: Tue, 24 Sep 96 20:38:17 EDT
- To: U35395@UICVM.CC.UIC.EDU, w3c-sgml-wg@w3.org
You know, if XML had the compiler-theory concept of phases, there would be no problem whatsoever with

    &#RE;<!-- comment -->

because the lexical analyser would eat the comments entirely, so that the parser wouldn't ever see comments at all. This is trivial to implement and is how almost all modern languages do it.

It is unfortunate that <! is overloaded in SGML, making comment recognition much harder than it need have been. Had there been a separate comment delimiter, e.g. <-- ..... -->, for use within the document, comments would have been much simpler.

E.g. consider a parser reading a token at a time...

    getToken(startMode)
    {
        -- handle comments:
        if input is "<--" -- start comment
        then
            eat up to "-->"
        endif

        if (inTextMode) {
            -- handle text
            if input is MDO then
                do stuff
            endif
        } else if inMarkupMode {
            -- handle stuff in markup mode
        } else {
            . . .
        }
    }

With this model, which I hope is clear from the incomplete code sketch (the mode names are not meant to correspond to SGML modes in the example), the parser would call getToken() repeatedly to read a file... but comments are never returned, and hence never seen by the parser.

Obviously an editor can't use this model directly -- just as editors for the C programming language have to retain comments today (e.g. Visual C, Turbo C++, Brief, etc).

I suggest that it be made clear that in XML, all comments are elided entirely before other parsing begins. This does not mean that you have to have multiple passes; it can be implemented concurrently and very efficiently using the mechanism I've outlined above.

Now,

    <P> A <?XXX> B </P>

is clearly exactly the same as

    <P> A B </p>

which I propose be considered the same as

    <P> A B </P>

by considering any non-zero amount of whitespace to be exactly the same as a single space -- again, as in just about all other modern languages. Whitespace is any sequence of space, tab, form feed, newline, carriage return, and vertical tab (does anyone still use that??).
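For anyone who would rather see the comment and whitespace rules as running code, here is a minimal Python sketch. It is an illustration under stated assumptions, not a proposed implementation: the "<--" and "-->" delimiters are the hypothetical ones from the sketch above (not SGML's "<!--"), the function names are made up for this example, and it works on a whole string rather than a token at a time.

```python
import re

# Hypothetical comment delimiters from the sketch above (not SGML's "<!--").
COMMENT_OPEN = "<--"
COMMENT_CLOSE = "-->"

# Any run of space, tab, form feed, newline, carriage return or vertical tab.
WHITESPACE_RUN = re.compile(r"[ \t\f\n\r\v]+")


def strip_comments(text):
    """Drop every comment in one left-to-right scan of the input."""
    out = []
    pos = 0
    while True:
        start = text.find(COMMENT_OPEN, pos)
        if start == -1:
            out.append(text[pos:])  # no more comments: keep the rest
            break
        out.append(text[pos:start])  # keep the text before the comment
        end = text.find(COMMENT_CLOSE, start + len(COMMENT_OPEN))
        if end == -1:
            raise ValueError("unterminated comment")
        pos = end + len(COMMENT_CLOSE)  # resume after the comment
    return "".join(out)


def collapse_whitespace(text):
    """Treat any non-zero amount of whitespace as a single space."""
    return WHITESPACE_RUN.sub(" ", text)


def normalise(text):
    """Elide comments first, then collapse whitespace."""
    return collapse_whitespace(strip_comments(text))
```

With these definitions, normalise("<P>\n A <-- note -->\n\tB\n</P>") gives "<P> A B </P>", while "<P> A </P>" and "<P>A</P>" remain different strings.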
Note that this makes

    <P> A </P>

distinct from

    <P>A</P>

but that is a necessary consequence of not requiring the use of a DTD.

It would be possible to use a syntactic distinction between what I shall call formes and sticks. By a forme I mean an element that cannot contain text directly, rather like element content in SGML. By a stick, I mean an element that can contain text directly. For example only, suppose that the names of sticks are to be in lower case (in SGML, you'd have to change NAMECASE (yes?) in the SGML declaration to do this).

    <SEC> <p>An <emph>important</emph> distinction</p> </SEC>

becomes

    <SEC><p>An <emph>important</emph> distinction</p></SEC>

I am not proposing this mechanism for XML. I am describing it so as to make clear that if you don't read the DTD, you have to be able to deduce all the context you need by inspection. If you need varying behaviour from your tokeniser or parser in different contexts, those contexts must be visible syntactically.

But it is simplest to do away with such things altogether. Have a style sheet property telling you to discard leading/trailing spaces in an element if you must, and make the rules the same everywhere.

Note that, just as in TeX, you would not be able to include unquoted program source code with this proposal.

    main()
    {
        print("hello world\n");
    }

might become

    <program-listing>
    main()&nl;
    {&nl;
    &tab;print("hello world\n");&nl;
    }&nl;
    </program-listing>

But this is not so bad. How many troff-set books have I seen where this program is printed as

    main()
    {
        print("hello world0);
    }

because the \n wasn't escaped?

Lee
Received on Tuesday, 24 September 1996 20:38:32 UTC