Re: updated cite definition - please review from Charles McCathie Nevile on 2013-08-28 (public-html@w3.org from August 2013)

From: Charles McCathie Nevile <chaals@yandex-team.ru>
Date: Wed, 28 Aug 2013 21:58:14 +0200
To: "Bruce Lawson" <brucel@opera.com>, "Jukka K. Korpela" <jukka.k.korpela@kolumbus.fi>
Cc: "HTMLWG WG" <public-html@w3.org>
Message-ID: <op.w2jlrcp2y3oazb@chaals.local>

On Wed, 28 Aug 2013 10:32:41 +0200, Jukka K. Korpela  
<jukka.k.korpela@kolumbus.fi> wrote:

> 2013-08-28 11:12, Bruce Lawson wrote:
>> On 25 August 2013 19:19, Jukka K. Korpela <jukka.k.korpela@kolumbus.fi>  
>> wrote:
>>> If there were an element called <z> in HTML, with italic as default
>>> rendering in browsers,[...] it would be pointless to discuss what the
>>> "right" usage is or to collect statistics of existing usage, or to
>>> study definitions of <z> in past specifications.

No, it wouldn't.

>>> The only sensible thing that browsers, search engines,[...should] do,
>>> is to treat <z> as an element with unknown meaning and no
>>> effect, except for the default rendering (if it is an established  
>>> practice).

Actually, that isn't the case.

Many HTML elements are widely abused, though mostly less than in the past.
Yet search engines can profitably use them, both to search on semantics
and, by comparing what they find with other things in their index, to get
a better idea of whether a given page is using an element correctly.

That in turn supports things like tools for improving existing content.

>> But there isn't a <z> element, so this is a red herring.
>
> The <cite> element is very similar to <z> in uselessness. Well, <cite>  
> causes italic font by default, but you can achieve just the same with  
> the more concise <i>.

Actually, it seems to be rather more useful.

>>   There *is* a
>> <cite> element, which used to be allowed for marking up titles of
>> works and authors of cited works,
>
> That was two different old specs. One of them allowed it for titles, the  
> other allowed it for citations including author names. Either of these  
> could in principle have been a useful definition, since it would at  
> least allow some conceivable processing for the element in search  
> engines, structured data extraction, etc. (even though nothing like that  
> ever happened).

That's a huge claim - can you prove nobody did that?

> The amalgamated “semantics” makes <cite> even theoretically as useless
> as the hypothetical <z>.

No, it legitimises what is widespread practice, while not legitimising  
"any old usage". So it simplifies life for authors (who also now have a  
way of meeting the use case of attributing things to an author) without  
changing anything real for a search engine except that we can now point to  
a spec that better justifies the way we interpret the element.

>> There are people who wish to denote authors, and millions of
>> websites that already use <cite> to denote author name.
> People want to denote many things. Millions of websites probably use  
> <cite> to denote quotations, too. (Saying that it must/should not be  
> used for quotations effectively says that it is.) Should that be thrown  
> in, too, into the “semantics”?

No, in this case that is probably unnecessary. (Your hypothetical here is
useless, since a lot depends on what actually happens on the web.)

>> The fact that software can't tell the difference between a cited work
>> and a cited author is not a reason to keep the spec from specifying
>> common existing practice.
>
> All that matters in the common existing practice is that <cite> is by
> default rendered in italic (when possible). Everything else is just
> idle and confusing “semantics” in the worst meaning of the word – unless
> someone can come up with an example (even a very theoretical thought
> experiment) of what could possibly be done with <cite> on the basis of
> the proposed semantic definition.

There's quite a lot of software out there used to detect plagiarism.
There's also a lot of translation, both human and automated. Knowing when
something is attributed, and being able to compare it based on a search,
even across languages, provides a pretty powerful plagiarism detection
tool, one that can save many people a lot of very boring mechanical work
and let them focus on the real academic merits of something - or go home
earlier, or whatever...
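
As a rough illustration of the first step such a tool might take, here is
a minimal sketch that pulls the text of every <cite> element out of a page
so it can be fed to a search or comparison step. It assumes Python's
standard html.parser module; the names CiteExtractor and extract_cites are
hypothetical, and the sketch makes no attempt to decide whether a given
<cite> holds a title or an author name.

```python
from html.parser import HTMLParser


class CiteExtractor(HTMLParser):
    """Collect the text content of every <cite> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # greater than 0 while inside a <cite>, allows nesting
        self.buffer = []   # text fragments of the <cite> currently being read
        self.cites = []    # completed <cite> contents, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "cite":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "cite" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Join fragments and collapse runs of whitespace.
                text = " ".join("".join(self.buffer).split())
                self.cites.append(text)
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def extract_cites(html):
    """Return the text of each <cite> element found in the given HTML string."""
    parser = CiteExtractor()
    parser.feed(html)
    parser.close()
    return parser.cites
```

Each extracted string would then be a candidate attribution to look up in
an index; how that lookup and any cross-language matching work is outside
this sketch.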

> As far as I can see, any assumption about the meaning, or even  
> structural relationship to the surrounding content (beyond pure  
> syntactic nesting) would conflict with much of existing usage.

How much of a problem that is depends on each particular case. In this
case, I think the work of rescuing <cite>, so that it does some of the
things people expect of it and lets them do things they expect to be able
to do, is worthwhile.

Of course, despite bleatings about living in a data-driven environment,
this is ultimately a judgement call based on a bet about the future: we
can interpret the data any way we want, but "in hindsight" is the only
sure way to get *some* agreement on what it meant.

> “Cite” is a legacy element that has been used to mark up titles of  
> works, names of authors, quotations, and other things. It cannot be  
> defined semantically in any useful way that would not conflict with much  
> of the existing usage.

That is a judgement call. My opinion is that it is wrong in this case.

cheers

Chaals

> Ergo, it should be just documented as one of the elements that cause  
> italic rendering by default. It should be regarded as obsolete, but  
> conforming – there is no reason to punish authors for using it.
>


-- 
Charles McCathie Nevile - Consultant (web standards) CTO Office, Yandex
       chaals@yandex-team.ru         Find more at http://yandex.com
Received on Wednesday, 28 August 2013 19:58:49 UTC