Re: [W3C docs] We should teach by example. from Robert Burns on 2007-07-06 (public-html@w3.org from July 2007)

From: Robert Burns <rob@robburns.com>
Date: Fri, 6 Jul 2007 18:53:13 -0500
To: Smylers <Smylers@stripey.com>
Cc: HTML WG <public-html@w3.org>
Message-Id: <B0161650-DDAE-4BB8-A4BA-0A32D218E240@robburns.com>

On Jul 6, 2007, at 12:16 PM, Smylers wrote:

>
> Robert Burns writes:
>
>> I can understand why someone might find xml-like syntax more human
>> readable.
>
> Agreed.
>
>> I imagine that with practice, one gets more capable of coming up with
>> the end-of-an-element in that manner. However, for novices, or those
>> with a lot of experience reading xml-like HTML, or even just those who
>> have trouble thinking like an SGML parser, I think leaving out
>> closing tags is a human readability issue.
>
> I agree about those with lots of XML experience.
>
> I disagree that it's an issue for somebody just because she is a
> novice or has trouble thinking like an SGML parser.  In fact, quite
> the opposite: I suggest that for some novices this is going to be
> much more natural.  Obviously the people in this group are not the
> same as the people who prefer the XML syntax.
>
> For somebody new to HTML it makes sense to have to use, say, </em> to
> mark where the emphasis should stop, because otherwise the browser
> can't know.  It does not necessarily make sense to have to include
> </body> and </html>: why should the browser need to be told that it's
> reached the end, when it can see that perfectly well for itself for
> the simple reason that there's no more content.

You're now talking from the perspective of a browser (a machine
processor) to justify how someone new to HTML might not need to see
where an element ends. It's a lot to throw at someone that elements
are bounded by start and end tags and then quickly add that some tags
may be missing. It's much easier to simply include all the tags
(not for a machine, but for a person).

> Note that if a beginner wants to put those closing tags on, that's no
> problem; she can do so.  The point of the HTML syntax is that it's
> more lax, more forgiving of unimportant differences.

Yes, we all understand that here. We're not talking about machines  
though, we're talking about a person: and a novice to the language.

> And putting quote marks round attributes is just one more thing for a
> beginner to have to grasp, and remember.  It slightly raises the
> barrier to entry, unnecessarily so.  (And again, it doesn't matter if
> a learner does do it.)

I personally don't find HTML with optional quote marks at all
difficult to read. However, it's not at all a good idea to immediately
burden a novice with all the places they might be able to leave out
the quotation marks. It's much easier to just tell them to include
them always.

> Some users, because of their background or the way their minds work,
> are going to prefer the XML syntax; some are going to prefer the HTML
> syntax.  Having both available makes learning HTML easy for both
> groups.

I've tried to explain how it can be difficult to read complex code  
with end tags omitted. Would you care to explain how including  
optional end tags when they aren't necessary for machine parsing  
makes it difficult to read for you?

> And I'd be surprised if _anybody_ naturally thinks like an SGML
> parser.

I would say that someone who does prefer to read source with optional  
tags omitted is thinking a bit more like an SGML parser than I am  
able to do.

>> The fascination some get from the idea that certain end tags can be
>> left out, to me seems a bit reminiscent of the fascination some
>> pioneering programmer once got when he said "eureka, I can express
>> every year throughout eternity with just two digits,... or at least
>> the important ones."
>
> That's thinking about it backwards; it's thinking about it from an
> expert insider's point of view; we (people on this list) know HTML
> well enough to realize it works like that.
>
> A beginner doesn't (necessarily) think "there's a </body> here but I'm
> allowed to omit it"; he simply doesn't even get as far as thinking
> that the browser needs to be told where <body> ends.  Or even starts,
> since <body> is optional too.  Which is great, cos it means a beginner
> doesn't even need to be told about the concept of <body>; they can
> just write HTML content and have it do the right thing.

I don't see how that is great. It leaves authors using constructs  
they don't understand. It hides something mildly complicated from  
them in a paternalistic way that will only lead to more confusion in
other ways: in particular confusion over ill-formedness, improper  
nesting and the like.

> Note the evolution of HTML: it wasn't that we started with an XHTML
> syntax and then somebody realized that because some tags could be
> unambiguously omitted, an 'advanced' feature was added to cope
> without them; instead it was that those developing HTML early on saw
> no reason to include the unnecessary tags.  Presumably if it'd been
> easier for humans had those optional tags been there, they would have
> been included.

My understanding was that this all predates HTML and is part of SGML.
There the need to conserve bits was much stronger than it is
today. So they took steps to economize. This is where the two-digit
year comes in. It was necessary to economize there too. Adding two
more digits was a big deal. However, today reserving more data width
for dates is ubiquitous. Similarly, economizing by omitting
optional tags is not as important anymore. Including them makes the
language more accessible, in that many more people can understand a
language that doesn't economize on bits.

>> This later led to some problems.
>
> Those (2-digit date) problems were because storing 1969 as 69
> suffers a loss of information; optional tags in HTML have no such
> problem, because it's unambiguous what the assumed content is.

Omitting tags also suffers a loss of information. The structure of the
document then has to be encoded into every UA that processes it. With
explicit tags, a UA does not have to know up front where <anelement>
ends: it hasn't ended until the UA sees the close tag </anelement>. The
UA can ignore what it doesn't understand and simply process what it
does understand. The entire prospect of adding arbitrary namespaces
relies on including explicit opening and closing tags.
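
To make the point about ignoring unknown content concrete, here is a
minimal sketch in Python. It is an illustration only, not how any real
UA parses HTML. The function name strip_unknown, the sample element
x:widget, and the simplified tokenizer are all assumptions made for the
example. The sketch assumes every element has an explicit close tag; it
does not handle comments, self-closing tags, or void elements such as
<img>, which is exactly where per-element knowledge would be needed.

```python
import re

# Simplified tokenizer (an assumption for this sketch): a start tag,
# an end tag, or a run of text. Attributes are matched but not parsed.
TOKEN = re.compile(r"<(/?)([A-Za-z][\w:-]*)[^>]*>|([^<]+)")

def strip_unknown(markup, known):
    """Drop every element (and its content) whose name is not in `known`.

    Only matching start/end tags are used to find where an unknown
    element ends; no per-element parsing rules are consulted.
    """
    out = []
    skip_name = None  # name of the unknown element being skipped, if any
    depth = 0         # open elements of that same name still to be closed
    for match in TOKEN.finditer(markup):
        closing, name = match.group(1), match.group(2)
        if skip_name is not None:
            # Inside an unknown element: count only tags with its name.
            if name == skip_name:
                depth += -1 if closing else 1
                if depth == 0:
                    skip_name = None
            continue
        if name is not None and name not in known:
            if not closing:
                skip_name, depth = name, 1
            continue  # the unknown tag itself is dropped
        out.append(match.group(0))
    return "".join(out)

sample = "<p>a<x:widget><p>b</p></x:widget>c</p>"
print(strip_unknown(sample, {"p"}))  # prints "<p>ac</p>"
```

The function never needs to know what x:widget means, only that it ends
at </x:widget>. With omitted end tags the same job would need a table
of which elements close implicitly and when.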

>> I think now we're seeing similar problems with the optional omission
>> of close tags: not the least of which we're finding our HTML
>> serialization cannot be as expressive as our xml serialization. As
>> examples, the discussion over trying to improve the <img> syntax
>
>
> Allowing <img> to take optional content while also maintaining
> backwards compatibility with standalone 'unclosed' <img> elements
> introduces an ambiguity in "<img>text".

Yes, because of a historical need to economize on tag usage. Just as
the historical need to economize on year digits led to a need for
processors to be hard-wired with explicit century processing.
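
As an illustration of what hard-wired century processing looks like,
here is a small Python sketch of a pivot-year rule. The function name
expand_two_digit_year and the default pivot of 69 are assumptions
chosen for the example (the pivot mirrors the common convention that
69 to 99 means 1969 to 1999 and 00 to 68 means 2000 to 2068); other
systems picked other pivots.

```python
def expand_two_digit_year(yy, pivot=69):
    """Guess the four-digit year for a stored two-digit year.

    Values at or above `pivot` are assumed to be 19xx, the rest 20xx.
    The century is not in the data; it lives in this rule.
    """
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year between 0 and 99")
    century = 1900 if yy >= pivot else 2000
    return century + yy

print(expand_two_digit_year(69))  # prints 1969
print(expand_two_digit_year(7))   # prints 2007
```

Every reader of the data has to carry the same rule, in the same way
that every reader of tag-omitting markup has to carry the same rules
about where elements implicitly end.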

> Unfortunately that ambiguity exists whether or not HTML5 insists on
> XML-style closed tags for all new content, so your proposal does not
> help improve the expressiveness of HTML syntax versus XHTML syntax.

Yes, for legacy reasons. However, we should be thinking about ways to
break from those legacy constraints in backwards compatible ways.

>> Also, as Henri just raised, the desire to include foreign namespaces
>> in the HTML serialization is complicated by the lack of closing tags.
>
> In the message I think you're referring to Henri said:
>
>   For reasons of backwards compatibility, we have only one namespace
>   we can use and this section correctly designates exactly that
>   namespace.
>
>   ... As for foreign namespaces in the text/html serialization, I
>   think the matter of serializing MathML and SVG in text/html has not
>   been pursued far enough yet and is still worth pursuing by this WG.
>
> It isn't immediately obvious to me how a lack of closing tags in newly
> written HTML content is complicating things; please can you elucidate?

Explicit close tags allow UAs to process content from unfamiliar  
namespaces. The UA requires no prior knowledge of the namespace to at  
the very least ignore the content. This is a common mechanism for  
extensibility all over computing.

>> I think simply the presence of XML and XHTML has led to greater
>> awareness among authors of ill-formedness issues and invalidity. It's
>> difficult to communicate proper nesting to authors while
>> simultaneously trying to communicate the benefits of certain tags
>> being omitted.
>
> We do not need to communicate benefits in omitting certain tags!
> Allowing people who prefer to omit them to do so is sufficient reason
> for doing so; there's no need to try to persuade those who prefer to
> include them into _not_ doing so!
>
> Neither syntax has to be superior to the other.

No, they both have their advantages and disadvantages. The advantages
of the economizing syntax have been dwarfed by the advances in
computing power over the last two decades. There may still be places
where economizing is a good idea for optimization purposes. We could
even add a more compact binary serialization to optimize further.
However, in the context of an official document of this WG, I see no
reason we should employ such optimizations at the expense of
confusing authors who might actually turn to our work as their first
example to understand HTML.

Take care,
Rob
Received on Friday, 6 July 2007 23:53:43 UTC