Re: [XHTML2] CITELANG, TITLELANG attributes from Jukka K. Korpela on 2004-07-28 (www-html@w3.org from July 2004)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Wed, 28 Jul 2004 14:36:42 +0300 (EEST)
To: www-html@w3.org
Message-ID: <Pine.GSO.4.58.0407281323520.25604@korppi.cs.tut.fi>

On Wed, 28 Jul 2004, Ian Hickson wrote:

> There are _some_ uses for more general language information -- Google's
> filtering of content based on language, for instance.

As far as I know, Google uses its own guesswork ("heuristics") when
deciding the language of a document, ignoring both HTTP headers and
language markup in (X)HTML. I still think that it is worthwhile to make
language information available, hoping that search engines and other
parties start making use of it. And maybe it's a good thing that many
authors mistakenly believe that their lang or xml:lang attributes are of
some use - since if authors use them, there are more reasons to make use
of them in search engines. The role of XHTML specifications in this
process is that language information mechanisms should be both
well-defined and easy for authors to use.

> When the title is an attribute, getting the title string to pass to a
> function which is then going to display the title consists of:
>
>    1. Get the value of the attribute.
>
> If the title is the textual content of a child element, you have to:
[ do something more complex ]

I'm not sure I understand the complexity. Can't a parser simply recognize
each <title> element as it sees it and associate it with the internal data
structure corresponding to the parent element?
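
To make the comparison concrete, here is a minimal Python sketch of both
lookups, using the standard xml.etree.ElementTree module. The markup and
the function names are hypothetical illustrations, not taken from any
XHTML 2 draft, and the sketch covers only a well-formed document with
the <title> as a direct child; it says nothing about the mutated-DOM
cases raised elsewhere in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: neither form is quoted from an XHTML 2 draft.
ATTR_FORM = '<a href="doc.html" title="Das Kapital">link text</a>'
ELEM_FORM = '<a href="doc.html"><title>Das <em>Kapital</em></title>link text</a>'


def title_from_attribute(el):
    """Return the value of the title attribute, or None if it is absent."""
    return el.get("title")


def title_from_child(el):
    """Return the flattened text of the first direct <title> child, or None."""
    child = el.find("title")
    if child is None:
        return None
    # itertext() walks the child's text and any inline markup inside it.
    return "".join(child.itertext())


attr_title = title_from_attribute(ET.fromstring(ATTR_FORM))
elem_title = title_from_child(ET.fromstring(ELEM_FORM))
print(attr_title)
print(elem_title)
```

The element form costs one child lookup plus a text flattening step,
and in exchange the title can carry inline markup such as <em>.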

> To summarise, elements are _hard_.

I still don't see the problem, I'm afraid, but if elements are _hard_,
then the problem is in the very idea of markup, which revolves around
elements. Attributes are just properties of elements. If you change
something that is in essence a container for textual data (which might
need some inline markup), hence something that should be an element in
markup, into an attribute containing plain text, for efficiency of
implementation, then I think it's time to consider where this all would
end.

> Note that simply saying "it must be the first element" or "you must not
> nest these elements" and so forth doesn't get you out of any of this,
> since it is trivial to mutate the DOM to get it into these states. The
> behaviour has to be well-defined in all these cases.

Sorry, I fail to see the point here. Surely XHTML specifications need to
define the semantics of valid constructs only.

> > To take an analogous case, we currently have the CAPTION element which
> > may be used (only) inside a TABLE element and the SUMMARY attribute that
> > may be used for a TABLE element.
>
> Great example. Implementing "summary" in a meaningful way is significantly
> easier than implementing "caption". By orders of magnitude.

But as I learned in this thread (thanks Anne!), the current draft has made
<summary> an element, which sounds logical. Are you saying that this
should be taken back? (And since all browsers implement the "caption"
element, though poorly, whereas "summary" is virtually unimplemented,
there's a mismatch between actual browser behavior and the difference in
the difficulty of implementation that you refer to.)

> > I don't see the possibility as extremely rare. Consider a link - a
> > typical element to which we might wish to assign a TITLE. If the
> > document where the link appears is in French and the linked document is
> > in German, for example, it would be very natural to make the "advisory
> > title" contain the name of the linked document in both French and in
> > German, in many cases.
>
> That is a very rare case.

Is it? To the extent that documents contain a mixture of languages in some
sense, this seems to be a very typical example. And if there is no mixture
of languages, the hreflang and citelang issue becomes a non-issue. Well,
I guess we have partly moved in other directions as well.

> > For example, we would like to have a speech browser read the title using
> > adequate algorithms for speech generation for each language. And
> > "advisory titles" are typical examples of _short_ texts where heuristics
> > so often fail - if you just try to guess whether the language changes
> > within such a text, from the characteristics of the short string itself,
> > you can't be very successful.
>
> Like I said. Highly theoretical. :-)

If we regard such issues as negligible, then I think many parts of the WAI
recommendations should be rewritten. I'm especially thinking about the
_priority 1_ requirement that all changes in language in a document be
indicated in markup. It is highly illogical to make such requirements and
to define the markup language so that not all changes _can_ be indicated.
Or should we read the current (and planned) situation so that authors are
required to use Unicode language tags inside attribute values if there is
a single foreign word in any such attribute?
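
As an illustration of the point about indicating language changes, the
following Python sketch shows how a title expressed as an element could
mark a change of language with xml:lang on inline markup, which a plain
attribute value cannot do. The snippet (a French page linking to a
German document) and the function name language_runs are assumptions
made for this example only; the <title> child element is the form
discussed in this thread, not something a specification is known to
define this way.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predeclared; ElementTree exposes it under this name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical markup for a French page linking to a German document.
SNIPPET = (
    '<a xml:lang="fr" href="kapital.html" hreflang="de">'
    '<title>Le Capital (<span xml:lang="de">Das Kapital</span>)</title>'
    'texte du lien</a>'
)


def language_runs(el, inherited=None):
    """Yield (language, text) pairs in document order, inheriting xml:lang."""
    lang = el.get(XML_LANG, inherited)
    if el.text:
        yield (lang, el.text)
    for child in el:
        yield from language_runs(child, lang)
        if child.tail:
            # Tail text belongs to the parent, so it uses the parent's language.
            yield (lang, child.tail)


link = ET.fromstring(SNIPPET)
runs = list(language_runs(link.find("title"), link.get(XML_LANG)))
print(runs)
```

A speech browser could hand each run to the speech generation rules for
its language. With title="Le Capital (Das Kapital)" as an attribute,
the whole string would simply inherit "fr" from the element.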

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Wednesday, 28 July 2004 07:37:47 UTC