[whatwg] Semantics in HTML

On Nov 2, 2006, at 00:17, Anne van Kesteren wrote:

> On Wed, 01 Nov 2006 20:55:58 +0100, James Graham <jg307 at cam.ac.uk>  
> wrote:
>> To take a slight detour into the (hopefully not too) abstract,  
>> what do people think the fundamental point of semantics in HTML is?

I think the fundamental point is twofold. First, to allow programmatic  
processing of documents in ways that are *useful* and that semantic  
markup makes *practical* but that would be considerably less practical  
with presentation-based heuristics. Second, to enable that processing  
without those who want to do it having to negotiate with the author  
(or to enable the author to get off-the-shelf software for processing  
his/her own documents).

Rendering for media different from the author's primary target is  
such processing done in software controlled by others.

Indexing documents and taking extracts for display in search results  
is such processing done in software controlled by others.

Generating a table of contents could be a case of the author wanting  
to get off-the-shelf software that works with his/her own documents.
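
As a rough illustration of that kind of off-the-shelf processing, here
is a minimal sketch that builds a plain-text table of contents from
h1..h6 elements. It assumes Python and its standard html.parser module;
the names HeadingCollector and table_of_contents are hypothetical, and
real-world markup would likely need more care than this.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collects (level, text) pairs for h1..h6 elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.headings = []   # finished (level, text) pairs
        self._level = None   # level of the heading currently being read
        self._parts = []     # text fragments of that heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a heading.
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            # Collapse whitespace so wrapped headings become one line.
            text = " ".join("".join(self._parts).split())
            self.headings.append((self._level, text))
            self._level = None


def table_of_contents(html):
    """Return the headings as plain-text lines, indented by heading level."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return ["  " * (level - 1) + text for level, text in collector.headings]
```

The point of the sketch is that it needs no negotiation with the
author: it relies only on the headings having been marked up as
headings.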

So I think the merit of semantic elements in HTML should not be  
judged in terms of the willingness of semanticists to express stuff  
but instead the merit should be judged against the willingness of  
software developers to write software that consumes the expression  
for a useful purpose and whether authors in general are  
incentivized to support such processing (either knowingly or as a  
side effect of accomplishing other goals).

> Those elements should then not have any presentational aspect

Why not?

To serve media-independent presentation, having reasonable  
presentations for different media is more useful than having a  
semantic definition.

(What kinds of different media there can be is limited by how you can  
deliver data into a human. In the absence of direct-to-brain  
transfers, you are in practice limited to visual, aural and tactile  
media.)

> We probably don't want things like:
>
>   <sci-fi-serie-title>Stargate Atlantis</sci-fi-serie-title>
>
>
> Although I suppose that at some point you do want to be able to  
> express the latter.

I think we should not care if someone wants to *express* it unless  
there is notable practical interest in *consuming* the expression.  
(Not "would be cool" interest but "would write software" interest.)

>> Henri has been talking about the possibility of making HTML5 more  
>> "semantically lax", and here Anne is interested in where it is not  
>> "semantically pure", presumably with a desire to fix it.

My point is this: if the semantics for a given element are not precise  
enough, or authors aren't incentivized to use them properly, so that  
non-presentation use of the semantics becomes impossible or  
prohibitively impractical, what is left is use for media-independent  
rendering. At that point it is enough to define the element in terms  
of default presentation or, if the element doesn't have a  
distinguishing default presentation, not to include the element.

Example with existing markup:
<dl> has a well-understood default presentation (at least for visual  
media), but on the real Web, it doesn't have precise enough semantics  
to allow heuristic-free reasoning such as compiling a search database  
of definitions for words by scraping the Web. Yet, <dl> is useful for  
achieving a particular kind of organization of pieces of text (list  
of items where the items have an inline label and a block of text) in  
a backwards-compatible way that works even in unstyled HTML.   
Therefore, it is useful to have <dl> around as a media-independent  
grouping device that doesn't have profound semantics.
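
To make the "grouping device" reading concrete, here is a minimal
sketch that pulls (label, text) pairs out of dt/dd markup. It assumes
Python's standard html.parser module; LabelledItemCollector and
labelled_items are hypothetical names. Note what the code cannot do:
nothing in the markup tells it whether a pair is a definition, a line
of dialogue or something else.

```python
from html.parser import HTMLParser


class LabelledItemCollector(HTMLParser):
    """Collects (label, text) pairs from dt/dd elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []        # finished (label, text) pairs
        self._current = None   # "dt" or "dd" while inside one, else None
        self._label = ""       # text of the most recent dt
        self._parts = []       # text fragments of the current dt or dd

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            # End tags for dt/dd are optional, so a new one closes the old one.
            self._flush()
            self._current = tag
            self._parts = []

    def handle_endtag(self, tag):
        if tag in ("dt", "dd", "dl"):
            self._flush()

    def handle_data(self, data):
        if self._current is not None:
            self._parts.append(data)

    def _flush(self):
        if self._current is None:
            return
        text = " ".join("".join(self._parts).split())
        if self._current == "dt":
            self._label = text
        else:
            self.items.append((self._label, text))
        self._current = None


def labelled_items(html):
    """Return the (label, text) pairs found in the given markup."""
    collector = LabelledItemCollector()
    collector.feed(html)
    collector.close()
    return collector.items
```

For example, feeding it "<dl><dt>Alice<dd>Hello there.</dl>" yields a
pair that is clearly not a dictionary definition, which is why a
definitions database built this way would still need heuristics.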

Example against introducing new markup:
In discussions where <i> is assumed to be axiomatically evil and  
semantic alternatives are sought, it often comes up that in text  
discussing biology the taxonomical Latin names of organisms are  
italicized. Should HTML have an element for marking up a piece of  
text as a biological taxonomical name? I say no. For data mining  
(including search engines) it is easier to compile a list of known  
taxonomical names and compare strings against that list than to  
badger every biologist to use the semantic element. As for  
presentation, <i> works just fine. The effects of <i> on aural or  
tactile media probably won't be so bad that most authors would be  
willing to take special steps. For authors themselves getting  
off-the-shelf software that does useful things, the case is probably  
too specific and lacks processing use cases to create a market. What  
authors might want to do, however, is to use the taxonomical names as  
terms in an index in print. But for that use case to cover  
different kinds of text with index terms, you'd want something more  
generic than markup for biological taxonomical names. (An index is  
not needed for interactive screen media, because you can search for  
any string anyway.)
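
The list-comparison approach mentioned above for data mining can be
sketched in a few lines. This is only an illustration under stated
assumptions: it uses Python, the three names in KNOWN_TAXON_NAMES are a
stand-in for a real compiled list, and find_taxon_names is a
hypothetical helper.

```python
import re

# Hypothetical sample list; a real one would be compiled from a
# taxonomy database.
KNOWN_TAXON_NAMES = {
    "Homo sapiens",
    "Escherichia coli",
    "Drosophila melanogaster",
}


def find_taxon_names(text, known_names=KNOWN_TAXON_NAMES):
    """Return the known names that occur in text, sorted alphabetically."""
    # Collapse line breaks and runs of spaces so that names wrapped
    # across lines still match.
    normalized = " ".join(text.split())
    found = set()
    for name in known_names:
        # \b keeps a name from matching inside a longer word.
        if re.search(r"\b" + re.escape(name) + r"\b", normalized):
            found.add(name)
    return sorted(found)
```

This works on plain text regardless of whether the author used <i>, a
dedicated element or no markup at all.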

>> [...] I also don't know which view best fits my position because I  
>> don't really understand what people are trying to achieve with  
>> (the markup in) HTML -- I think there are things I would change in  
>> the current draft, but there seems little point talking about  
>> which markup elements should or shouldn't exist without having  
>> some overall framework against which the merit of various  
>> proposals can be measured.

+1.

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/

Received on Saturday, 4 November 2006 06:40:30 UTC