Re: HTML/SGML parsing (re: sgml-lex)

On Wed, 10 Jul 1996, Jim Taylor wrote:
> The question is, what separates tag names from attribute
> specifications? Does SGML explicitly state that attribute
> specifications must be delimited by whitespace, or can any unexpected
> character act as delimiter? 

It must be a delimiter (in the formal SGML sense of "A character string 
assigned to a delimiter role by the concrete syntax" [4.91]) rather than 
just any unexpected character (which is why <abc_def> will fail).

The relevant production for a start-tag is

start-tag = "<", gi, att-spec-list, s*, ">"	(modified [14])

where

att-spec-list = att-spec*			(modified [31])

and

att-spec = s*, (att-name, s*, "=", s*)?, att-val-spec	(modified [32])

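Furthermore, "The leading s can only be omitted from an attribute 
specification that follows a delimiter" [7.9]. A gi is a name rather 
than a delimiter, so at least one s is required between the tag name 
and the first attribute specification.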
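
As a rough illustration only (not the sgml-lex code), the three
productions and the leading-s rule could be sketched in Python as
follows. The names parse_start_tag, ATT_SPEC and so on are my own, and
the sketch assumes the Reference Concrete Syntax, ignores entity
references inside literals, and does not check anything against a DTD.

```python
import re

S = r"[ \r\n\t]"                    # s: SPACE, RE, RS, or SEPCHAR (TAB)
NAME = r"[A-Za-z][A-Za-z0-9.\-]*"   # assumed name syntax (Reference Concrete Syntax)
TOKEN = r"[A-Za-z0-9.\-]+"          # undelimited (unquoted) value, simplified
LITERAL = "\"[^\"]*\"|'[^']*'"      # delimited value, no entity handling

# att-spec = s*, (att-name, s*, "=", s*)?, att-val-spec
ATT_SPEC = re.compile(
    rf"(?P<s>{S}*)(?:(?P<name>{NAME}){S}*={S}*)?(?P<val>{LITERAL}|{TOKEN})"
)


def parse_start_tag(tag):
    """Return (gi, [(att_name_or_None, value), ...]) or raise ValueError."""
    m = re.match(rf"<({NAME})", tag)
    if not m:
        raise ValueError("no generic identifier after '<'")
    gi, pos = m.group(1), m.end()
    attrs = []
    after_delimiter = False  # a gi does not end with a delimiter
    while True:
        m = ATT_SPEC.match(tag, pos)
        if not m:
            break
        # The leading s may be omitted only directly after a delimiter.
        if not m.group("s") and not after_delimiter:
            raise ValueError(f"missing s before attribute at offset {pos}")
        val = m.group("val")
        quoted = val[0] in "\"'"
        attrs.append((m.group("name"), val[1:-1] if quoted else val))
        after_delimiter = quoted  # a literal ends with a closing delimiter
        pos = m.end()
    # start-tag = "<", gi, att-spec-list, s*, ">"
    if not re.fullmatch(rf"{S}*>", tag[pos:]):
        raise ValueError(f"unexpected text at offset {pos}")
    return gi, attrs
```

With this sketch, parse_start_tag("<abc\ndef>") returns
('abc', [(None, 'def')]), while parse_start_tag("<abc_def>") raises
ValueError because "_" is neither an s nor a delimiter.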

> And what exactly is whitespace? SPACE, RE, RS, and SEPCHAR only?

Yes, production [5] defines s to be SPACE, RE, RS or SEPCHAR (i.e. TAB 
in the Reference Concrete Syntax).
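
For anyone hand-coding a lexer, that set can be written down directly.
A minimal sketch in Python, assuming the character numbers the
Reference Concrete Syntax assigns (SPACE 32, RE 13, RS 10, TAB 9);
S_CHARS and is_s are just names I picked for illustration.

```python
# Function characters as numbered in the Reference Concrete Syntax.
SPACE = chr(32)
RE = chr(13)    # record end
RS = chr(10)    # record start
TAB = chr(9)    # the only SEPCHAR in the Reference Concrete Syntax

S_CHARS = frozenset({SPACE, RE, RS, TAB})


def is_s(ch):
    """Return True if ch is a single character allowed by production [5]."""
    return ch in S_CHARS
```

Note that other control characters, such as form feed, are not s under
these assumptions; a different concrete syntax could declare further
SEPCHARs.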

> Does that mean <abc
> def>
> is ok? If not, this could be dangerous if an editor or other process
> hard wraps HTML at spaces. 

This is fine as long as def is "an undelimited name token that is a 
member of a group specified in the declared value for that attribute" 
[7.9.1.2].
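
To make that condition concrete, here is a small sketch in Python of
how a parser might work out which attribute an omitted name refers to.
The function name resolve_omitted_name and the example declarations
are hypothetical, and the sketch assumes names are case-folded, as
they are for general names in the Reference Concrete Syntax.

```python
def resolve_omitted_name(token, declared_groups):
    """Return the attribute whose declared name token group contains token.

    declared_groups maps each attribute name to the tokens in its
    declared value group, as read from the element's attribute
    definition list (hypothetical data structure).
    """
    folded = token.upper()  # assumes case folding of names
    matches = [
        att
        for att, group in declared_groups.items()
        if folded in {t.upper() for t in group}
    ]
    if len(matches) != 1:
        raise ValueError(f"{token!r} does not identify exactly one attribute")
    return matches[0]


# Hypothetical declarations for an element with two enumerated attributes.
example_groups = {
    "align": ("left", "center", "right"),
    "compact": ("compact",),
}
```

Here resolve_omitted_name("compact", example_groups) returns
"compact" and resolve_omitted_name("LEFT", example_groups) returns
"align"; a token such as "def" that is in no group raises ValueError,
which is the case where <abc
def> would not be ok.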

> The flex input file seems to indicate that spaces but not other
> whitespace can come after the attribute name and the =. Is this part
> of SGML syntax?

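No. See the att-spec production above: s* is allowed on either side of 
the "=", and s covers RE, RS and SEPCHAR as well as SPACE.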

James K. Tauber / jtauber@library.uwa.edu.au
University CWIS Coordination Officer
The University of Western Australia

Received on Thursday, 11 July 1996 01:49:59 UTC