Re: Is markup allowed in attribute values?

Murray Altheim (murray@spyglass.com)
Fri, 28 Jun 1996 18:50:57 -0500


Message-Id: <v02110100adf8918634d3@[140.186.34.50]>
Date: Fri, 28 Jun 1996 18:50:57 -0500
To: Paul Prescod <papresco@calum.csclub.uwaterloo.ca>
From: murray@spyglass.com (Murray Altheim)
Subject: Re: Is markup allowed in attribute values?
Cc: www-html@w3.org

Paul Prescod <papresco@calum.csclub.uwaterloo.ca> writes:
>At 08:01 PM 6/26/96 -0500, Murray Altheim wrote:
>>First, note that not all attributes are declared CDATA. I'm not sure what
>>you mean by "sometimes" entities. Most HTML attributes can contain
>>entities; the question is whether or not they will be processed (i.e.,
>>replaced).
>
>I'm getting confused by this discussion, so let me see if I can clarify.
>_ALL_ SGML/HTML attributes may have entity references in them. _ALL_
>SGML/HTML attributes allow entity expansion/processing/replacement (choose
>your favourite term).

Paul,

Sorry to cause any confusion -- you're "mostly" correct. Goldfarb makes a
point about this being confusing, and I likewise get confused when the
discussion is not precise.

The reason I didn't state "always" is that there are several instances of
attributes declared as NAME, NAMES or ID in various HTML DTDs, and in those
cases ampersand and semicolon characters are disallowed. Because I see no
"reasonable" case in which general or character entities would resolve to a
valid NAME, I made the statement as I did. I'll try to explain what I mean
by this below.

In the process of parsing an "attribute value literal" (the text you typed
between quote marks), the parser derives an "attribute value". Any general
or character entities in the _attribute value literal_ are resolved (i.e.,
"expanded/processed/replaced") at this point. Attribute value literals
_can_ contain general or character entities. BUT, if in parsing the
attribute value literal, the derived attribute value doesn't fit the
declared value of the attribute, the markup is invalid.

As an example, note that the "NAME" attribute in META is declared as NAME:

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN" [
    <!ENTITY foo "KEYWORDS">
    ]>
    <HTML>
    <HEAD>
    <META NAME="&foo;" CONTENT="mexico,canada,usa">
    ...

In parsing the attribute value literal "&foo;", the derived attribute value
is "KEYWORDS" (not including the quote marks). This is so far valid HTML
markup. If you were so inclined, you could even declare &foo; as
"K&#69;YWORDS", since &#69; is replaced by "E". But had the general entity
&foo; been declared as "K&#201;YWORDS" (where &#201; is E with acute
accent), the derived attribute value would not be a valid NAME, since the
E+acute character is not an allowed NAME character. Likewise, declaring
&foo; as "KEYWORDS_FRENCH" would be invalid, since the underscore is not a
valid NAME character.
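
To make the two steps concrete (derive the attribute value from the
literal, then check it against the declared value), here is a rough Python
sketch. It is not an SGML parser: the entity table, the function names and
the regular expressions are my own illustrative assumptions, and the NAME
rule assumes the HTML 2.0 SGML declaration (a letter followed by letters,
digits, "." or "-", with NAMELEN 72). It also requires the closing
semicolon on every reference, which real SGML does not always demand.

```python
import re

# Hypothetical stand-in for general entities declared in a DTD subset.
GENERAL_ENTITIES = {"foo": "KEYWORDS"}

# Assumed NAME rule (HTML 2.0 SGML declaration): a letter, then letters,
# digits, "." or "-", at most 72 characters in total.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9.-]{0,71}\Z")

# Matches "&#201;" (numeric character reference) or "&foo;" (general entity).
REFERENCE_RE = re.compile(r"&(#[0-9]+|[A-Za-z][A-Za-z0-9.-]*);")


def derive_value(literal, entities, depth=0):
    """Replace references in an attribute value literal (illustrative only)."""
    if depth > 16:
        raise ValueError("entity references nested too deeply")

    def replace(match):
        ref = match.group(1)
        if ref.startswith("#"):
            # Numeric character reference: replaced by the character itself.
            return chr(int(ref[1:]))
        # General entity: its replacement text may contain further references.
        # An undeclared entity raises KeyError in this sketch.
        return derive_value(entities[ref], entities, depth + 1)

    return REFERENCE_RE.sub(replace, literal)


def is_valid_name(value):
    """Check a derived attribute value against the assumed NAME rule."""
    return NAME_RE.match(value) is not None


value = derive_value("&foo;", GENERAL_ENTITIES)
print(value, is_valid_name(value))
```

With &foo; declared as "KEYWORDS" or "K&#69;YWORDS" the check passes; with
"K&#201;YWORDS" or "KEYWORDS_FRENCH" the derived value fails the NAME test.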

In essence, replacing entities in an attribute declared as NAME must yield
a valid NAME. Since there's no good reason to use numeric character
references for valid NAME characters, I assume that an author would be
using a numeric or ISO character reference (such as "&#201;" or
"&Eacute;"), which would result in an invalid NAME. And since general
entity replacement doesn't occur in mainstream browsers, my example above
doesn't work there either.

So in no "reasonable" instance can entities occur in HTML attributes
declared as NAME, NAMES, or ID [1]. Technically (in true SGML-conformant
HTML) they can, if their replacement results in a valid NAME. But I don't
see this occurring in mainstream HTML. Hence my statement that attributes
declared as NAME, NAMES or ID can't contain entities.

[...]
>So, as I understand it, entity markup is _always_ allowed in attributes.

I'm not clear on the term "entity markup", but I'm assuming you mean the
presence of general or character entities such as &foo; or &Eacute;. Given
the discussion above, yes, entities are always allowed within attribute
value literals, but their replacement must result in an attribute value
that conforms to the attribute declared value in the DTD or DTD subset.

>Less than and greater than symbols are _never_ interpreted as markup within
>attributes (just as they are not in "replaceable character data") so it is
>impossible to put elements in attributes although it is possible (in fact
>quite easy) to put less than and greater than characters in attributes.

In this case, technically, a general entity might resolve to a literal
containing markup. If the attribute was declared as CDATA, the markup
wouldn't be interpreted; if RCDATA, the markup would be interpreted. But
given that general entities are declared in a DTD subset, an SGML feature
that isn't supported in mainstream HTML, and that there are no declared
RCDATA attributes in any HTML DTD I'm aware of, your statement is pretty
safe for current HTML practice, but I wouldn't go so far as to say NEVER. I
have quite a number of SGML/HTML documents that do this type of thing.
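
For the common case you describe, a quick way to see it is with a lenient
HTML parser. The sketch below uses Python's standard html.parser module,
which is not an SGML parser and knows nothing about general entities
declared in a DTD subset; the TagLogger class and the sample markup are my
own illustrative assumptions. It only shows that "<" and ">" arriving in an
attribute value by way of references are kept as data and do not start an
element.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every start tag the parser reports, with its attributes."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names and replaces
        # character and named references in attribute values.
        self.start_tags.append((tag, attrs))


def collect_start_tags(markup):
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.start_tags


sample = '<A HREF="x.html" TITLE="see &lt;B&gt; here">link</A>'
for tag, attrs in collect_start_tags(sample):
    print(tag, attrs)
```

Only the A element is reported; the "<B>" in the TITLE value shows up as
plain characters in the attribute value, not as a B start tag.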

Murray

[1] Some examples in HTML-i18n would be HTTP-EQUIV and NAME in META,
%linktype;, and the ID and CLASS attributes.

[p.s. One mistake I made in the last message: technically, PCDATA is not an
attribute declared value, but a reserved name. The #PCDATA keyword is used
to indicate content occurring in a context in which text is parsed and
markup is recognized.]

```````````````````````````````````````````````````````````````````````````````
     Murray Altheim, Program Manager
     Spyglass, Inc., Cambridge, Massachusetts
     email: <mailto:murray@spyglass.com>
     http:  <http://www.stonehand.com/murray/murray.html>