- From: Misha Wolf <misha.wolf@reuters.com>
- Date: Sat, 08 Mar 1997 01:27:46 +0000 (GMT)
- To: www-html <www-html@w3.org>, www-international <www-international@w3.org>
I'd like some advice on the use of non-standard HTML attributes, in relation
both to the LANG attribute and to Metadata.
RFC 1866 (Hypertext Markup Language - 2.0) states, in section 4.2.1:
    To facilitate experimentation and interoperability between
    implementations of various versions of HTML, the installed base of
    HTML user agents supports a superset of the HTML 2.0 language by
    reducing it to HTML 2.0: markup in the form of a start-tag or end-
    tag, whose generic identifier is not declared is mapped to nothing
    during tokenization. Undeclared attributes are treated similarly. The
    entire attribute specification of an unknown attribute (i.e., the
    unknown attribute and its value, if any) should be ignored.
I haven't found a similar statement in the HTML 3.2 spec, but I assume the
above rule carries over from the HTML 2.0 spec.
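
As a rough illustration of the reduction described in the quoted passage, here is a minimal Python sketch. It assumes a hand-written table (DECLARED) listing only the META attributes of HTML 2.0, not the full DTD, and the names ReducingParser and reduce_markup are made up for this example. It is not how any particular user agent is implemented.

```python
from html.parser import HTMLParser

# Assumption: a tiny hand-written subset of the HTML 2.0 declarations.
# Only META is listed here; a real user agent would know the whole DTD.
DECLARED = {
    "meta": {"http-equiv", "name", "content"},
}


class ReducingParser(HTMLParser):
    """Hypothetical reader that reduces markup to the declared subset."""

    def __init__(self):
        super().__init__()
        self.kept = []  # (tag, [(attribute, value), ...]) pairs that survive

    def handle_starttag(self, tag, attrs):
        declared = DECLARED.get(tag)  # html.parser lowercases names
        if declared is None:
            return  # undeclared element: mapped to nothing
        # Undeclared attributes are dropped together with their values.
        self.kept.append((tag, [(k, v) for k, v in attrs if k in declared]))


def reduce_markup(html):
    parser = ReducingParser()
    parser.feed(html)
    parser.close()
    return parser.kept
```

Under these assumptions, a META element carrying an extra attribute keeps its NAME and CONTENT and silently loses everything else.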
Now, in common with many others, I am keen on the implementation of the LANG
attribute, specified in RFC 2070 (Internationalization of the Hypertext Markup
Language). [This attribute is not part of HTML 3.2, but is to be included in
the next version of HTML.] I am also keen on the use of HTML validators.
There is a tension between these two desires. A validator which checks for
HTML 2.0/3.2 conformance will flag any use of the LANG attribute as an error.
A similar problem arises with the use of Metadata. At the DC-4 Metadata
Workshop in Canberra (March 3-5), we agonised over a difficult choice:
1. Use a clean syntax for "qualified" (explained below) Metadata, even
though it would rely on a use of attributes not defined in HTML 2.0/3.2.
2. Use a dirty (difficult to parse) syntax, conformant with HTML 2.0/3.2.
Consider:
<META NAME = "DC.DATE" CONTENT = "1997-03-07">
The above syntax is unproblematic [Note: DC stands for Dublin Core].
In some cases, though, it is useful to qualify the Metadata, by naming the
particular "scheme" used to encode the value of the CONTENT attribute. For
dates, one such scheme is ISO 8601. The two syntaxes we discussed in Canberra
are:
<META NAME = "DC.DATE" CONTENT = "(scheme=ISO-8601) 1997-03-07">
or:
<META NAME = "DC.DATE" SCHEME = "ISO-8601" CONTENT = "1997-03-07">
An even stronger need for qualification applies to subject classification
schemes (of which there are many), as in:
<META NAME = "DC.SUBJECT" CONTENT = "(scheme=XYZ) Something or other">
or:
<META NAME = "DC.SUBJECT" SCHEME = "XYZ" CONTENT = "Something or other">
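
To make the difference between the two syntaxes concrete, here is a small Python sketch that reads both forms into the same record. The regular expression for the "(scheme=...)" prefix is my own guess at how such a value might be picked apart, and the class name MetaCollector is made up for this example; neither comes from the workshop.

```python
import re
from html.parser import HTMLParser

# Assumption: the embedded form is exactly "(scheme=NAME) value".
# This pattern is a guess for illustration, not an agreed syntax.
EMBEDDED = re.compile(r"^\(scheme=([^)]+)\)\s*(.*)$", re.DOTALL)


class MetaCollector(HTMLParser):
    """Hypothetical collector that normalises both qualifier syntaxes."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)  # html.parser lowercases attribute names
        scheme = attributes.get("scheme")
        content = attributes.get("content") or ""
        if scheme is None:
            # "Dirty" syntax: the qualifier has to be dug out of CONTENT.
            match = EMBEDDED.match(content)
            if match:
                scheme, content = match.group(1), match.group(2)
        self.records.append(
            {"name": attributes.get("name"), "scheme": scheme, "content": content}
        )


def collect_meta(html):
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.records
```

The SCHEME attribute needs no extra parsing at all, whereas the embedded form needs a second, ad hoc grammar inside CONTENT, and a value that legitimately begins with "(scheme=" cannot be told apart from a qualifier.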
Furthermore, we want to be able to qualify the language of the value of the
CONTENT attribute, e.g.:
<META NAME = "DC.SUBJECT" SCHEME = "XYZ" LANG = "xy" CONTENT = "Something or other">
We fed something like the above to Microsoft Word 97. When we examined the
saved file, the unknown attributes (SCHEME and LANG) had vanished, together with
their values. This is, I suppose, one possible interpretation of the phrase
"should be ignored", in the earlier quote from RFC 1866.
In any event, even if Word were persuaded to ignore more gently, the proverbial
HTML validator would complain if offered HTML like the above.
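
For comparison, the kind of complaint I have in mind can be mimicked with a toy checker. This Python sketch is only an approximation: it assumes that HTTP-EQUIV, NAME and CONTENT are the only attributes declared for META in HTML 2.0/3.2, it looks at META elements alone, and the name AttributeChecker and the message format are made up. A real validator works from the full DTD.

```python
from html.parser import HTMLParser

# Assumption: the attributes declared for META in HTML 2.0/3.2.
DECLARED_META = {"http-equiv", "name", "content"}


class AttributeChecker(HTMLParser):
    """Toy stand-in for a validator; it checks META elements only."""

    def __init__(self):
        super().__init__()
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        line, _column = self.getpos()  # position where the tag starts
        for name, _value in attrs:
            if name not in DECLARED_META:
                self.errors.append(
                    f"line {line}: undeclared attribute {name.upper()} on META"
                )


def check(html):
    checker = AttributeChecker()
    checker.feed(html)
    checker.close()
    return checker.errors
```

Fed the DC.SUBJECT example with SCHEME and LANG, this checker reports both attributes, while the "(scheme=...)" form passes without comment.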
How do we reconcile (i) being, in some minor sense, on the leading edge and
(ii) wanting to encourage our users to generate "legal" HTML and to use validators
to make sure it is legal?
Misha
Received on Friday, 7 March 1997 20:26:37 UTC