Re: Tidying the value of an attribute? from Richard A. O'Keefe on 2000-08-03 (html-tidy@w3.org from July to September 2000)

From: Richard A. O'Keefe <ok@atlas.otago.ac.nz>
Date: Thu, 3 Aug 2000 12:54:45 +1200 (NZST)
To: h.rzepa@ic.ac.uk, html-tidy@w3.org
Cc: g.gkoutos@ic.ac.uk
Message-Id: <200008030054.MAA11903@atlas.otago.ac.nz>

	From: "Rzepa, Henry" <h.rzepa@ic.ac.uk>
	
	An <object>, <embed> or <param> element can have an attribute such as
	eg script="...script commands..."
	
	These "legacy" script commands it turns out have some undesirable features
	
	a) they often contain semantic line breaks. for example
	 # comment
	where the # starts on a new line and the comment ends with a line break
	
	b) Even worse, the scripts can use the  < or  > operators.
	
Problem a).

    The SGML standard says
	"An attribute value literal is interpreted as an attribute value
	 by replacing references within it, ignoring end of entity and
	 start of record, and replacing an RE or SEPCHAR with a SPACE".
    That is, tabs and newlines turn into spaces.  What's more, the
    interpretation appears to be
	first replace character and general entity references
	then replace newlines and tabs by spaces
    so that hacks like
	script='first line&#10;
	second line'
    don't work.  XML is required to do the same mapping.
    I don't see any way around problem (a).
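
    A minimal Python sketch of the two steps in the order described
    above. It assumes only decimal character references need handling,
    and the function name is made up for illustration; it is not an
    SGML parser.

```python
import re

def interpret_attribute_literal(literal: str) -> str:
    """Apply the two replacement steps in the order described above."""
    # Step 1: replace decimal character references such as &#10;
    replaced = re.sub(r"&#([0-9]+);?", lambda m: chr(int(m.group(1))), literal)
    # Step 2: replace newlines and tabs (including ones produced by step 1)
    # with spaces
    return re.sub(r"[\r\n\t]", " ", replaced)

value = interpret_attribute_literal("first line&#10;\nsecond line")
print(repr(value))
```

    Because step 2 runs after step 1, the newline produced by the
    reference is flattened to a space just like the literal one.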

Problem b).

    HTML being an application of SGML, there is not the slightest reason
    for < or > to cause any trouble at all in an attribute value.
[34] attribute value literal = ( lit,  replaceable character data*, lit  )
                             | ( lita, replaceable character data*, lita )
[46] replaceable character data = (data character | character reference
                                  |general entity reference | Ee)*

    (Yes, the stars in [34] are redundant.)
    As Goldfarb summarizes it:
	In replaceable character data, all characters are treated as
	data characters except for those necessary to recognize a
	character or a general entity reference, as well as the
	characters that would terminate the replaceable character data.
    So really it's

	attribute value literal =
	    " ([^"&] | char ref | ent ref)* "
          | ' ([^'&] | char ref | ent ref)* '

    references (character and general entity) are
	    &# digit+ (;)?
	|   &# (RS|RE|SPACE|TAB) (;)?
	|   &#x hexit+ (;)?
	|   & letter (name char)* (;)?
    and & is NOT supposed to be special if it is followed by anything
    other than a letter or #, but don't count on it.  < and > are NOT
    special inside attribute value literals at all.
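
    The simplified grammar above can be written as a regular expression.
    This is only a sketch: the pattern names are invented here, name
    characters are assumed to be letters, digits, "." and "-", and a
    bare & is rejected outright, which is stricter than the standard.

```python
import re

# A reference as summarized above; the trailing ";" is optional there.
REFERENCE = (
    r"&(?:#[0-9]+|#(?:RS|RE|SPACE|TAB)|#x[0-9A-Fa-f]+"
    r"|[A-Za-z][A-Za-z0-9.-]*);?"
)

# Double-quoted or single-quoted literal; < and > are ordinary characters.
ATTRIBUTE_VALUE_LITERAL = re.compile(
    rf'"(?:[^"&]|{REFERENCE})*"'
    rf"|'(?:[^'&]|{REFERENCE})*'"
)

def is_attribute_value_literal(text: str) -> bool:
    return ATTRIBUTE_VALUE_LITERAL.fullmatch(text) is not None

print(is_attribute_value_literal('"a < b"'))
print(is_attribute_value_literal("'x &lt; y > z'"))
print(is_attribute_value_literal('"a & b"'))
```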

    So script='cout << "foo"'  should be perfectly ok in HTML.
    XHTML, however, is an application of XML, and XML 1.0 says
[10] AttValue ::= '"' ([^<&"] | Reference)* '"'
               |  "'" ([^<&'] | Reference)* "'"
    I would be grateful if someone would explain to me why XML goes out
    of its way to ban less than signs in attribute values; they would
    seem to be entirely harmless.
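
    The XML restriction is easy to observe with any conforming parser.
    A small check using Python's standard xml.etree.ElementTree; the
    helper name is made up, and the element and attribute come from the
    example above.

```python
import xml.etree.ElementTree as ET

def is_well_formed(document: str) -> bool:
    """Return True if the XML parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

raw = """<object script='cout << "foo"'/>"""
escaped = """<object script='cout &lt;&lt; "foo"'/>"""

print(is_well_formed(raw))      # literal < in the attribute value
print(is_well_formed(escaped))  # < written as &lt;
print(ET.fromstring(escaped).get("script"))
```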

    For	<	>	&	"	'
    use	&#60;	&#62;	&#38; 	&#34;	&#39;
    or	&lt;	&gt;	&amp;	&quot;	&apos;

    HTML Tidy automatically translates < > to &lt; &gt; in attribute
    value literals, so problem (b) would seem to be solved.
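
    For scripts generated by a program, the same translation can be
    applied before the attribute is written. A sketch using Python's
    standard html module; this is not how Tidy itself is implemented.

```python
import html

script = 'cout << "foo"'

# quote=True also escapes " and ', so the result is safe with either delimiter
escaped = html.escape(script, quote=True)
print(f'script="{escaped}"')

# Unescaping restores the original script text
restored = html.unescape(escaped)
print(restored == script)
```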

Received on Wednesday, 2 August 2000 20:55:10 UTC