Re: Doctypes and the dialects of HTML 5

On Mar 31, 2007, at 22:44, Jirka Kosek wrote:

> Henri Sivonen wrote:

> I was thinking about optional version information, not mandatory.

OK.

> I think that you are anticipating that in the future there will be only
> one version of HTML (HTML5) which will be then extended and became
> HTML6, etc.

This WG can have an effect on the future on that point. The future  
doesn't just happen. The versions you cite below did not just emerge.  
They were deliberately specced by the XHTML 2.0 WG (and in the case  
of MP, a group roughly subscribing to similar philosophy, apparently).

> But even now there are several versions of HTML in use:
> Strict, Transitional, Frameset, 1.1, Print, Basic, MP. Future may prove
> that this was big mistake and that there should be only one version, who
> knows. But we (at least me) do not have time machine.

MP and Basic are clearly based on a premise that doesn't work for the  
Web. (If HTML can be subset because someone doesn't bother to ship a  
Web browser, you've got a walled garden – not the Web.) Print is not  
for the Web but for talking to a printer from a phone.

1.1 is roughly a superset of Strict, so as far as 1.1 vs. Strict  
goes, conformance checkers could well default to 1.1. (Mine doesn't,  
yet, because upstream wasn't ready and I was too busy with HTML5 to  
edit the schema from upstream. ;-)

There's no reason why a single setting could not accept the features  
of Transitional and Frameset if versioning was absent. IIRC, James  
Clark did this. The main reason against is that XHTML 1.0 defines  
Transitional and Frameset as separate, but that's no argument for  
what this WG should do going forward.

Strict vs. Transitional is the kind of fluffiness (saying one thing  
while really giving in to another) that HTML5 is supposed to avoid.  
(Yeah, there is the WYSIWYG editor fluffiness in HTML5, but I
disagree with Hixie on that point. :-)

> In my opinion conformance checker should offer at least following
> validation options:
>
> - check document against rules for version which is specified in
> document (or use the latest version if version is unspecified)

What's the use case? Transitional periods, e.g. the transition from
the time when HTML5 is the obvious version to use to the time when
HTML6 is the obvious version to use? Still, that would make it an
HTML6 problem, not an HTML5 problem.

> - check document against the latest version
>
> - check document against any arbitrary chosen version

I agree.

>> 5) A CMS uses an implementation-specific subset (e.g. no scripting and
>> no forms permitted). You want to configure a general-purpose authoring
>> tool to limit auto-completion to this subset.
>>
>> This use case actually has merit. However, it doesn't have merit as a
>> reason for requiring all authors to include a version='5' incantation.
>
> Version information could be just optional.

I'm OK with optional authoring tool configuration hooks. (The word  
"versioning" carries a lot of baggage.)

>> Discussing this issue pretty much reduces to the discussion about the
>> bogosity of xsi:schemaLocation and about the merits of a PI for
>> declaring the location of a RELAX NG schema in a document instance.
>
> I must strongly disagree with this point. Both schema association PI and
> xsi:schemaLocation points to concrete schema written in a particular
> schema language. This is very different from specifying just version of
> HTML used.

The thing they do have in common, though, is that the document  
instance declares which rules it wants to have for itself instead of  
the document instance and the rules being two entirely independent
inputs to the checking process.

> Compare:
...
> The first example specifies concrete schema written in W3C XML schema,
> which is neither very flexible (you might want to use completely
> different schema language), nor very clever (for various performance and
> security reasons you shouldn't fetch schemas from location provided in
> document). However later example gives you much more flexibility because
> it offers one additional level of indirection. It is upon you or your
> application to pick-up correct schema (or processing component, or
> whatever) based on content of version attribute.

Agreed.
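
A minimal sketch of that indirection in Python, assuming a hypothetical
local table that maps the content of a version attribute on the root
element to a schema the checker already has on disk. The table entries,
paths and function name are illustrative only, not anything specced:

```python
import xml.etree.ElementTree as ET

# Hypothetical local registry: the checker, not the document, decides
# which schema a given label means. Nothing is fetched from the network.
LOCAL_SCHEMAS = {
    "5": "schemas/xhtml5.rnc",  # illustrative path
}
DEFAULT_SCHEMA = "schemas/xhtml5-full.rnc"  # illustrative path


def pick_schema(document_text: str) -> str:
    """Return the local schema path chosen for this document."""
    root = ET.fromstring(document_text)
    label = root.get("version")  # None when the attribute is absent
    # Unknown or missing labels fall back to the checker's own default.
    return LOCAL_SCHEMAS.get(label, DEFAULT_SCHEMA)
```

The document only supplies a label. Which schema (or other processing
component) that label selects stays a decision of the consuming
application.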

>> I think XHTML5 should neither require nor forbid PIs for configuring
>> authoring tools. This is between the author and his/her editor and
>> leaving the artifact in a file that gets served on the Web is mostly
>> harmless.
>
> Editing is only one issue. But many companies use subset of XHTML as
> basic format for creating universal text content. They send this subset
> between various components, for editing,

I think we agree on the editing use case.

> proofreading,

I don't understand why version information would be needed here.  
Isn't this either a special case of editing or a process that only  
adds <ins>/<del> in a way specced in HTML5 proper?

> approving,

How much does approval change the document? Do you envision an  
approval process supporting different subsets that require the  
approval to be encoded in different ways?

> down conversion,

I don't understand why this use case requires versioning. A converter  
supports some part of HTML as the input. Labeling the subset used as  
input doesn't really help. Do you mean labeling the output from a  
conversion with an assertion about what subset the converter  
purported to produce?

> publishing, ...

I don't understand what you mean exactly. I assume you mean something  
other than publishing to browsers.

However, as a general comment, a version identifier is relatively  
useless on input as far as consuming the input goes (for any purpose  
other than checking if the input adheres to its self-claimed  
version). It is useful for constraining *output*. But since you put  
the information on *input*, it makes sense for constraining tools  
capable of *round-tripping*, such as editors.

> They need to carry information about version
> of subset used and without some HTML provision for version labeling they
> have to extend HTML with their own element/attribute which of course
> causes problems in many tools that are recognizing only HTML markup.

This is the main reason why I suggested having an optional attribute
with user-defined contents instead of suggesting that those who need
the attribute extend HTML with such an attribute themselves.

>> I am less sympathetic to an attribute on the root element for the same
>> purpose, but I'd be willing to concede to an optional attribute with
>> user-defined contents for the purpose of use as a hook in private
>> authoring workflows. E.g. profile='acme-cms-scriptless-and-formless'.
>
> Such attribute should be allowed not only on root (html) but on any
> element which can act as a root of reasonable HTML fragment.

Slippery slope already. :-)

> Ideally
> this should include all HTML elements, but having it at least on block
> elements is a must for CMS that compose final documents from small pieces.

If a CMS does document assembly like that, surely it can filter out  
the version identifiers from fragments while it is at it.
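
A rough sketch of such filtering in Python, assuming the fragments are
well-formed XML and that the label lives in a hypothetical profile
attribute (the attribute name and the function names are made up for
illustration):

```python
import xml.etree.ElementTree as ET

LABEL_ATTR = "profile"  # hypothetical name of the labeling attribute


def strip_labels(fragment: ET.Element) -> ET.Element:
    """Remove the label attribute from the fragment root and descendants."""
    for element in fragment.iter():
        element.attrib.pop(LABEL_ATTR, None)
    return fragment


def assemble(fragments: list[str]) -> str:
    """Compose fragments into one body, dropping labels along the way."""
    body = ET.Element("body")
    for text in fragments:
        body.append(strip_labels(ET.fromstring(text)))
    return ET.tostring(body, encoding="unicode")
```

The labels remain available to the CMS internally. They just never
reach the assembled document that gets served.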

>> With the schema project for (X)HTML5, fantasai and I have built in some
>> options in the schema for dealing with HTML5 vs. XHTML5 differences and
>> for catering to subsetting in ways that we foresee as reasonable.
>
> Could you please put pointer to this schema here? I would like to see
> how extensibility is handled.

http://syntax.whattf.org/
Please ignore the sentence about XSD, DTD and SGML.

I expect to refactor the rest of the exclusions related to <header>,  
<footer> and sectioning elements from RELAX NG to Schematron soonish.  
If fantasai is OK with it, I'd like to move the exclusions related to  
interactive elements to Schematron as well.

Extensibility (supersetting) is not handled in any particular way.  
Subsetting is. However, if your extensions just add stuff to the  
common content models, you are good to go for supersetting as well.

>> Since subsetters are going to do their own thing anyway, naming the
>> subsets should be user-defined and it would be pointless to try to come
>> up with a closed list of de jure subset names.
>
> But there could/should be some basic naming policy. Remember that people
> need some guidance (at least majority of people ;-).

One option would be having a wiki-like registry with a low barrier of  
entry for documenting existing subsets so that others can use the  
same name for talking about the same subset. (By "low barrier" I mean  
*way* lower than registering anything with the IANA.)

HTML5 has a trial balloon in the spec about solving rel value  
extensibility like this.

Hopefully subsets won't be identified by HTTP URIs. Otherwise after a  
while someone suggests that the URI should dereference into a schema.

>> Indeed. Online conformance checkers should probably default to the
>> broadest feature set they support. For example, allowing embedded SVG
>> and MathML by default. (The reason why mine doesn't, yet, is that I
>> haven't had time to review the SVG and MathML stuff properly, yet.)
>
> With NVDL you don't have to review and study schemas for new languages. ;-D
> You just say that this namespace is handled by this separate schema, no
> need to integrate schemas using their extensibility hooks.

NVDL wouldn't remove the need to do quality assurance on schemata  
obtained from elsewhere. See what happens when I just deploy stuff  
from upstream:
http://golem.ph.utexas.edu/~distler/blog/archives/001206.html#c008677
;-)

>>> This for example means that you can not embeded XHTML page into SOAP
>>> message and identify version of XHTML used.
>>
>> Considering what I said above, versioning XHTML inside SOAP messages
>> should not be necessary. Interchange with loosely affiliated or
>> unaffiliated parties is similar to the browser use case.
>
> In practice many companies use web-services in pretty tightly coupled
> setups with very strange requirements. You say that specifying HTML
> version should not be necessary. I say I have seen real requirements for
> using controlled subset of XHTML inside payload.

If the setup is tightly coupled, why isn't the subsetting part of the  
tight coupling without an explicit flag?

>>> Example 6. More robust way of labeling document as XHTML Print
>>
>> FWIW, I think XHTML Print has remarkably little relevance to Web content
>> or even authoring in editors.
>
> Why do you think so?

According to the abstract of the spec, XHTML Print is designed to be  
a language between a mobile phone and a printer. The way I understood  
this was that a program on a phone generates an XHTML Print document  
and sends it to a nearby printer without the document ever being  
served on the Web or being edited with an authoring tool.

(I admit, though, that I don't really understand why XHTML Print  
exists. I find the premise of the spec bizarre. I don't understand  
why a mobile device couldn't spool as PostScript or PDF and why a  
printer vendor would ever want to embed an XHTML+CSS engine in a  
printer instead of making the printer consume a format that encodes  
final-form geometry either as vector graphics or as straight raster  
data.)

> In my personal opinion if you want to ensure that HTML will not fork,
> you have to provide complete and flexible enough language that will
> allow subsetting for more restricted environments like mail, low-cost
> printers, Joe's 10 tag super-simple HTML, ...
>
> Without providing such facility respective subsets will fork and this
> will lead to great confuse for developers, and to incompatibilities for
> content producers.

The way I see it is that the primary facility is writing a delta spec  
in English. If there's software that has to support multiple subsets  
and distinguish them, I'd be OK with an optional configuration hook  
as a PI before root or as an attribute on the root element.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Sunday, 1 April 2007 21:22:49 UTC