Re: sgml-lex and CDATA from Daniel W. Connolly on 1996-07-18 (www-html@w3.org from July 1996)

From: Daniel W. Connolly <connolly@w3.org>
Date: Thu, 18 Jul 1996 15:03:58 -0400
To: www-html@w3.org
cc: eric@spyglass.com
Message-Id: <199607181903.PAA24428@anansi.w3.org>
[question and answer forwarded to www-html@w3.org with
permission of Eric]

In message <2.2.32.19960716221423.00bc4408@spyglass.com>, "Eric W. Sink" writes:
>We're still noodling on HTML lexer/parser issues.
>It seems that entity expansion should be handled at the lexer level.

I agree in principle. There are exceptions...

>  This
>is particularly the case if you want to support entities which expand to
>become markup.

Exactly.

>That said, it would appear to us that it's not really easy to make an HTML
>system where the lexer and parser are truly separate if you're going to
>support CDATA as a content model for elements.

I think you can keep a clean boundary between the parser and lexer.  I
haven't addressed this issue in the lexer API and implementation yet,
but I have it designed in my head. I half-documented it:

=================
http://www.w3.org/pub/WWW/MarkUp/SGML/sgml-lex/sgml.l
$Id: sgml.l,v 1.9 1996/02/07 15:32:28 connolly Exp $

 * The CDATA start condition represents the CON recognition
 * mode with the restriction that only end-tags are recognized,
 * as in elements with CDATA declared content.
 * (@# no way to activate it yet: need hook to parser.)
=================

and:

==========================
http://www.w3.org/pub/WWW/MarkUp/SGML/sgml-lex/sgmllib.py
 $Id: sgmllib.py,v 1.3 1995/11/16 00:59:19 connolly Exp $

# XXX There should be a way to distinguish between PCDATA (parsed
# character data -- the normal case), RCDATA (replaceable character
# data -- only char and entity references and end tags are special)
# and CDATA (character data -- only end tags are special).
==========================
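
To make the three-way distinction in that comment concrete, here is a rough Python sketch of which constructs are "special" in each kind of content. It is not taken from sgml.l or sgmllib.py: the regular expressions, the SPECIAL table and the split_markup name are all assumptions made for illustration, and the patterns are simplified (no comments, no declarations, no ">" inside quoted attribute values).

```python
import re

# Simplified patterns (assumptions, not the sgml-lex grammar).
END_TAG = r"</[A-Za-z][A-Za-z0-9.-]*\s*>"
START_TAG = r"<[A-Za-z][^>]*>"
REFERENCE = r"&#[0-9]+;?|&[A-Za-z][A-Za-z0-9.-]*;?"

# What the lexer treats as markup in each kind of content.
SPECIAL = {
    "PCDATA": re.compile("|".join([END_TAG, START_TAG, REFERENCE])),
    "RCDATA": re.compile("|".join([END_TAG, REFERENCE])),
    "CDATA": re.compile(END_TAG),
}

def split_markup(text, mode):
    """Split text into ("data", s) and ("markup", s) pairs for one mode."""
    tokens = []
    pos = 0
    for m in SPECIAL[mode].finditer(text):
        if m.start() > pos:
            tokens.append(("data", text[pos:m.start()]))
        tokens.append(("markup", m.group(0)))
        pos = m.end()
    if pos < len(text):
        tokens.append(("data", text[pos:]))
    return tokens
```

With the input "a &amp; <b>x</b>", PCDATA mode reports the reference and both tags as markup, RCDATA mode reports the reference and the end tag, and CDATA mode reports only the end tag.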

>  If you handle CDATA (where
>entities should not be expanded), then the lexer needs to know the content
>model of the current element, right?

Right. My intent is that there would be an API to access the
start-condition of the lexer, a la:

	typedef enum { SGML_EMPTY, SGML_CDATA, SGML_RCDATA,
			SGML_MIXED, SGML_ELEMENT } SGML_Mode;
	int SGML_lexMode(SGML_Lexer *l, SGML_Mode new_mode);

The parser, on seeing something like <XMP> or <SCRIPT>, would
call:

	SGML_lexMode(l, SGML_CDATA);

On seeing </SCRIPT>, it's the responsibility of the parser to call:

	SGML_lexMode(l, SGML_MIXED);

or whatever the containing mode was. You might need/want to keep a
stack of modes in the parser.
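
A minimal sketch of that mode stack, in Python for brevity. Nothing here exists in sgml-lex: StubLexer, Parser and CONTENT_MODE are hypothetical names, and having lex_mode return the previous mode is an assumption about what the int result of SGML_lexMode might mean.

```python
from enum import Enum

class Mode(Enum):
    # Mirrors the SGML_Mode enum proposed in this message.
    EMPTY = 0
    CDATA = 1
    RCDATA = 2
    MIXED = 3
    ELEMENT = 4

# Hypothetical table: elements whose content switches the lexer mode.
CONTENT_MODE = {"xmp": Mode.CDATA, "script": Mode.CDATA}

class StubLexer:
    """Stand-in for the real lexer; it only records the current mode."""
    def __init__(self):
        self.mode = Mode.MIXED

    def lex_mode(self, new_mode):
        # Assumption: setting the mode hands back the previous one.
        old, self.mode = self.mode, new_mode
        return old

class Parser:
    def __init__(self, lexer):
        self.lexer = lexer
        self.saved = []  # stack of (element name, containing mode)

    def start_tag(self, name):
        name = name.lower()
        new_mode = CONTENT_MODE.get(name, self.lexer.mode)
        self.saved.append((name, self.lexer.lex_mode(new_mode)))

    def end_tag(self, name):
        name = name.lower()
        if not self.saved or self.saved[-1][0] != name:
            raise ValueError("unexpected end tag: " + name)
        _, containing = self.saved.pop()
        self.lexer.lex_mode(containing)  # restore the containing mode
```

The point of the stack is that the parser never has to guess what mode to go back to: each end tag restores whatever mode was in force when the matching start tag was seen.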

Grumble... this should be in the "Future work" section of the tech
report. Sorry.

>We're not SGML experts here, but it looks like CDATA is supposed to be
>honored for some kinds of attributes too, so you have the same problem:  the
>lexer would need to know which attributes are CDATA and which are not, so
>that it could make proper decisions about whether to expand entities within
>attribute values or not.

SGML is really messy in this regard. Consider:

	<!doctype foo [
	 <!entity other-file system "xyz">
	 <!element foo ANY>
	 <!attlist foo
		bar CDATA #IMPLIED>
	]>

	<foo bar="abc &other-file;">

According to the SGML spec, the value of the bar attribute includes the
contents of the other-file entity, i.e. the "xyz" file. I checked,
and sure enough, sgmls implements it this way.

Hence, the lexical aspects of SGML interact with the entity manager. Yuk!

Oh! and note that &entities are expanded even in CDATA attributes.
CDATA means about 7 different things in the SGML spec. In the context
of attribute declarations, it refers to the _value_ of the attribute
_after_ interpretation, not the syntax of the markup, i.e.
the attribute specification[1].

	[1]http://www.w3.org/pub/WWW/MarkUp/SGML/productions.html#prod33

In the interest of keeping my lexer interface clean (i.e. no calls to
an entity manager), I punted the whole issue of interpreting attribute
value literals (prod34) out of the lexer.

This is the exception to the "entities should be expanded by the lexer"
rule that I was talking about above.

There should be an API to do it, but there isn't (yet):

=====================
http://www.w3.org/pub/WWW/MarkUp/SGML/sgml-lex/sgml-lex.html
$Date: 1996/06/15 19:17:34 $

Note that attribute value literals are output verbatim. Interpretation
is left to the client. Section 7.9.3 of SGML says that an attribute
value literal is interpreted as an attribute value by:

 * Removing the quotes
 * Replacing character and entity references
 * Deleting character 10 (ASCII LF)
 * Replacing characters 9 and 13 (ASCII HT and CR) with character 32 (SPACE)
===========================

I started prototyping it in Python:

==================
http://www.w3.org/pub/WWW/MarkUp/SGML/sgml-lex/sgmllib.py
 $Id: sgmllib.py,v 1.3 1995/11/16 00:59:19 connolly Exp $

def sgml_lex_attrval(v): # @@ this should go in sgml_lex API
        #@@ deal with spaces, entity/char references
        return v[1:-1] # strip quotes
==================
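
For illustration, here is one way that prototype might be filled out to cover the four steps quoted from the tech report. This is a sketch, not the sgmllib.py code: the ENTITIES dict is a hypothetical stand-in for an entity manager, unknown references are left as written, and applying the LF/HT/CR handling only to the literal text (not to characters produced by references) is an assumption.

```python
import re

# Hypothetical entity table; a real client would ask its entity manager.
ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "quot": '"'}

# Character references (&#65;) and general entity references (&name;).
_REF = re.compile(r"&#([0-9]+);?|&([A-Za-z][A-Za-z0-9.-]*);?")

def _normalize(text):
    # Delete LF; replace HT and CR with SPACE.
    return text.replace("\n", "").replace("\t", " ").replace("\r", " ")

def _expand(match, entities):
    if match.group(1) is not None:
        return chr(int(match.group(1)))
    # Unknown entity names are left untouched (an assumption).
    return entities.get(match.group(2), match.group(0))

def sgml_lex_attrval(literal, entities=ENTITIES):
    """Interpret an attribute value literal as an attribute value."""
    body = literal
    if len(body) >= 2 and body[0] in "\"'" and body[-1] == body[0]:
        body = body[1:-1]  # strip quotes
    out = []
    pos = 0
    for m in _REF.finditer(body):
        out.append(_normalize(body[pos:m.start()]))
        out.append(_expand(m, entities))
        pos = m.end()
    out.append(_normalize(body[pos:]))
    return "".join(out)
```

Passing entities={"other-file": "xyz"} mimics the doctype example earlier in this message, with the caveat that a real system entity names a file to be read rather than a string.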


>How did you cope/punt with these issues for sgml-lex?

Hope this explains it...

Dan
Received on Thursday, 18 July 1996 15:04:00 UTC