LANG + Metadata + unknown attributes

Misha Wolf
Sat, 08 Mar 1997 01:27:46 +0000 (GMT)

Date: Sat, 08 Mar 1997 01:27:46 +0000 (GMT)
From: Misha Wolf <>
Subject: LANG + Metadata + unknown attributes
To: www-html <>, www-international <>
Message-Id: <2146270108031997/A07422/REDMS1/11B3405B2A00*@MHS>

I'd like some advice on the use of non-standard HTML attributes, in relation 
both to the LANG attribute and to Metadata.

RFC 1866 (Hypertext Markup Language - 2.0) states, in section 4.2.1:

   To facilitate experimentation and interoperability between
   implementations of various versions of HTML, the installed base of
   HTML user agents supports a superset of the HTML 2.0 language by
   reducing it to HTML 2.0: markup in the form of a start-tag or end-
   tag, whose generic identifier is not declared is mapped to nothing
   during tokenization. Undeclared attributes are treated similarly. The
   entire attribute specification of an unknown attribute (i.e., the
   unknown attribute and its value, if any) should be ignored.

I haven't found a similar statement in the HTML 3.2 spec, but assume the 
above is inherited from the HTML 2.0 spec.
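
As a rough illustration of the behaviour the RFC 1866 passage describes, the 
sketch below reads META elements and silently drops any attribute that is not 
declared for META in HTML 2.0 (HTTP-EQUIV, NAME, CONTENT).  It uses Python's 
standard html.parser module.  The names DECLARED_META_ATTRS, MetaReader and 
read_metas are invented for this example, and it does not claim to show how 
any particular user agent is implemented.

```python
from html.parser import HTMLParser

# Attributes declared for META in HTML 2.0 (RFC 1866).
# html.parser reports attribute names in lower case.
DECLARED_META_ATTRS = {"http-equiv", "name", "content"}


class MetaReader(HTMLParser):
    """Hypothetical reader that keeps only declared META attributes."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # "Ignore" here means: skip the unknown attribute and its value,
        # but keep processing the rest of the element.
        kept = {name: value for name, value in attrs
                if name in DECLARED_META_ATTRS}
        self.metas.append(kept)


def read_metas(html):
    """Return one dict of declared attributes per META element."""
    reader = MetaReader()
    reader.feed(html)
    reader.close()
    return reader.metas
```

Given a META element that carries an extra SCHEME attribute, such a reader 
would still see NAME and CONTENT, and would simply never see SCHEME.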

Now, in common with many others, I am keen on the implementation of the LANG 
attribute, specified in RFC 2070 (Internationalization of the Hypertext Markup 
Language).  [This attribute is not part of HTML 3.2, but is to be included in 
the next version of HTML.]  I am also keen on the use of HTML validators.  
There is a tension between these two desires.  A validator which checks for 
HTML 2.0/3.2 conformance will flag as erroneous the use of the LANG attribute.

A similar problem arises with the use of Metadata.  At the DC-4 Metadata 
Workshop in Canberra (March 3-5), we agonised over a difficult choice:

   1.  Use a clean syntax for "qualified" (explained below) Metadata, even 
       though it would rely on a use of attributes not defined in HTML 2.0/3.2.

   2.  Use a dirty (difficult to parse) syntax, conformant with HTML 2.0/3.2.


   <META NAME = "DC.DATE" CONTENT = "1997-03-07">

The above syntax, for unqualified Metadata, is unproblematic [Note: DC stands 
for Dublin Core].

In some cases, though, it is useful to qualify the Metadata, by naming the 
particular "scheme" used to encode the value of the CONTENT attribute.  For 
dates, one such scheme is ISO 8601.  The two syntaxes we discussed in Canberra 
were:

   <META NAME = "DC.DATE" CONTENT = "(scheme=ISO-8601) 1997-03-07">
   <META NAME = "DC.DATE" SCHEME = "ISO-8601" CONTENT = "1997-03-07">

An even stronger need for qualification applies to subject classification 
schemes (of which there are many), as in:

   <META NAME = "DC.SUBJECT" CONTENT = "(scheme=XYZ) Something or other">
   <META NAME = "DC.SUBJECT" SCHEME = "XYZ" CONTENT = "Something or other">
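
To make the parsing cost of the two forms concrete, here is a minimal sketch 
of a function that reduces both to the same (scheme, value) pair.  It assumes 
the embedded form is exactly "(scheme=NAME)" at the start of CONTENT, followed 
by optional white space; the examples above do not pin the grammar down any 
further.  The function name split_qualified is hypothetical.

```python
import re

# Assumed shape of the embedded form: "(scheme=NAME) value".
_EMBEDDED = re.compile(r"^\(scheme=([^)]+)\)\s*(.*)$", re.DOTALL)


def split_qualified(content, scheme=None):
    """Return (scheme, value) for either of the two syntaxes.

    content -- the value of the CONTENT attribute
    scheme  -- the value of a separate SCHEME attribute, if one was present
    """
    if scheme is not None:
        # Clean syntax: the qualifier arrives in its own attribute,
        # so CONTENT needs no further parsing.
        return scheme, content
    match = _EMBEDDED.match(content)
    if match:
        # Dirty syntax: the qualifier has to be dug out of CONTENT.
        return match.group(1).strip(), match.group(2)
    # Unqualified Metadata.
    return None, content
```

With the separate attribute the function has nothing to parse.  With the 
embedded form it has to guess: a CONTENT value that legitimately begins with 
"(scheme=" would be misread as a qualifier.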

Furthermore, we want to be able to qualify the language of the value of the 
CONTENT attribute, e.g.:

   <META NAME = "DC.SUBJECT" SCHEME = "XYZ" LANG = "xy" CONTENT = "Something or other">

We fed something like the above to Microsoft Word '97.  When we examined the 
saved file, the unknown attributes (SCHEME and LANG) had vanished, together with 
their values.  This is, I suppose, one possible interpretation of the phrase 
"should be ignored", in the earlier quote from RFC 1866.

In any event, even if Word were persuaded to ignore more gently, the proverbial 
HTML validator would complain if offered HTML like the above.
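
For comparison, the validator's side of the problem can be sketched in the 
same way.  The checker below reports every attribute that is not in a table 
of declared attributes, rather than quietly skipping it.  The table DECLARED 
is a deliberately tiny, hand-written stand-in that covers META only.  A real 
validator works from the full DTD, and the names AttributeChecker and 
check_attributes are invented for this example.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately incomplete table: element -> declared attributes.
# Only META, as declared in HTML 2.0 (RFC 1866), is covered.
DECLARED = {"meta": {"http-equiv", "name", "content"}}


class AttributeChecker(HTMLParser):
    """Collects (line, element, attribute) for each undeclared attribute."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        declared = DECLARED.get(tag)
        if declared is None:
            return  # element not covered by this sketch
        line, _column = self.getpos()
        for name, _value in attrs:
            if name not in declared:
                self.problems.append((line, tag, name))


def check_attributes(html):
    """Return the list of undeclared attributes found in html."""
    checker = AttributeChecker()
    checker.feed(html)
    checker.close()
    return checker.problems
```

Fed the DC.SUBJECT example with SCHEME and LANG, this check would report both 
attributes, even though a user agent following RFC 1866 would process the 
same element without complaint.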

How do we reconcile (i) being, in some minor sense, on the leading edge and 
(ii) wanting to encourage our users to generate "legal" HTML and to use validators 
to make sure it is legal?