[whatwg] sic element, was: Re: Exposing spelling/grammar suggestions in contentEditable from Benjamin Hawkes-Lewis on 2010-12-31 (public-whatwg-archive@w3.org from December 2010)

From: Benjamin Hawkes-Lewis <bhawkeslewis@googlemail.com>
Date: Fri, 31 Dec 2010 16:30:45 +0000
Message-ID: <AANLkTi=-SNOr83Sv71=Rfei-GGvGEe3DHQNUqbH7A3JE@mail.gmail.com>
On Fri, Dec 31, 2010 at 3:17 PM, Martin Janecke <whatwg.org at kaor.in> wrote:
> On 30.12.2010 at 22:49, Benjamin Hawkes-Lewis wrote:
[snip]
>> 1. What problem(s) does indicating where mistakes have been reproduced
>> solve?
>
> I understand the question in this context as a concrete formulation of
> questions such as "What problem(s) does meta data solve? What problem(s) does
> semantic markup solve?"

Not really. Semantic markup is a tool HTML uses to solve problems. The sort of
problem statements we're looking for are more like this: end users need to find
information within complicated pages quickly. By marking up headings
semantically, we allow users to scan the page visually, select a heading from
a list, or jump to the next heading with a shortcut key.

> Apart from informing human readers about the correct reproduction of a
> misspelled word, a HTML <sic> would indicate the same to web applications.
> Think of a search engine, which, as one factor of their ranking algorithm,
> considers orthography and grammar in a page as quality factor. The search
> engine could be made to ignore (reasonably few) <sic>-marked errors in such
> an algorithm; i.e. not let <sic>-marked errors rank the page lower.

Would search engines benefit from markup for this?

Seems to me it would be fairly easy for a search engine to spot plain text
"[sic]" and act accordingly. Since there's a huge web corpus preceding the
introduction of the "sic" element, since most future authors won't use obscure
semantic markup, and since a lot of content is in the form of non-HTML formats
like plain text, Microsoft Word, and PDF, a good search engine that wanted to
do this would need to develop natural language techniques for detecting "[sic]"
rather than relying on the "sic" element alone.
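
As a rough illustration of the plain-text detection described above, here is a
minimal Python sketch. The function names, the toy dictionary lookup, and the
heuristic that "[sic]" flags only the single word before it are all assumptions
made for this example, not a description of what any real search engine does.

```python
import re

# A bracketed or parenthesised "sic", e.g. "[sic]" or "(sic)".
SIC_MARKER = re.compile(r"\s*[\[(]\s*sic\s*[\])]", re.IGNORECASE)
# Assumption: the single word right before the marker is the flagged text.
FLAGGED_WORD = re.compile(r"(\w+)\s*[\[(]\s*sic\s*[\])]", re.IGNORECASE)


def flagged_words(text):
    """Return the words that are directly followed by a "[sic]" marker."""
    return [m.group(1) for m in FLAGGED_WORD.finditer(text)]


def count_unflagged_errors(text, dictionary):
    """Count words missing from `dictionary`, skipping "[sic]"-flagged ones.

    `dictionary` is a hypothetical set of lowercase known-good words.
    """
    flagged = {w.lower() for w in flagged_words(text)}
    cleaned = SIC_MARKER.sub("", text)  # drop the markers themselves
    words = re.findall(r"[A-Za-z]+", cleaned)
    return sum(
        1
        for w in words
        if w.lower() not in dictionary and w.lower() not in flagged
    )
```

Because plain "[sic]" does not say where the flagged span starts, the one-word
heuristic is a guess; a multi-word error would need something smarter.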

A search engine that placed heavy weight on semantic markup would take into
account data about the original date of authorship (when whole works are
transferred onto the web) and markup like "q" and "blockquote" when you're just
quoting from another source, and penalize you for misquotations rather than
spelling errors. So it could ignore the "sic" element.

In either case, I think the effect of all this on rankings would in practice be
so small that it wouldn't be worth the costs to add a feature to HTML to support
it. Search engine vendor testimony to the opposite would be very useful here.

>> 3. What's the advantage of using markup to do this rather than visible text
>> like deadtree.
>
> Sorry, I don't understand "deadtree". Is this an idiom?

Sorry, it's slang for the world of print:

http://en.wikipedia.org/wiki/Hard_copy

>> What's wrong with "The House of Representatives shall chuse [<span
>> lang="la">sic</span>] their Speaker and other Officers"?
>
> In many cases there's nothing wrong with a visible "[sic]". It has
> successfully been done for decades. And it will be in future. There's also
> nothing wrong with plain text in general; it has been used successfully for
> centuries and will be in future. There's nothing wrong with books that use
> presentation oriented markup either, e.g. italics when emphasizing. They have
> been printed successfully for centuries and will be in future.
>
> What is wrong with "Cats [emphasized] are cute animals"

Doesn't reflect an existing idiom and cannot be made bold or italic with CSS.

> or "<span style='text-style:italics'>Cats</span> are cute animals"? or "<span
> class='emphasized'>Cats</span> are cute animals" instead of "<em>Cats</em>
> are cute animals"?

Mainly, these don't differentiate emphasis (which has a distinctive
presentation across various media) from other common uses of italics (e.g.
headings, foreign terms, titles of works, ship names) which may need different
presentation in different skins, different treatment in different media, and
special handling by user agents (e.g. navigation shortcuts for headings).

> people agreed that [snip] semantic markup is a good thing

They agreed it was a good tool for solving certain types of end-user problems,
not that it was an end in itself. There's a strong bias in the WHATWG community
against semantics for the sake of semantics.

> I think <sic> is a more HTMLish solution than a plain text "[sic]" -- just
> like <ul><li>...<li>... styled with list-style-type:decimal is more HTMLish
> than <div>1. ...<div>2. ...

I think that's like arguing "<sentence>The cat sat on the mat</sentence>" is
more "HTMLish" than "The cat sat on the mat." ;)

> The plain text string "[sic]" doesn't indicate where the start of the
> "[sic]"ed part of text is. That means it provides less information than
> <sic>...</sic>.
>
> "[sic]" can't be handled with @media and CSS in general.

Why does it need to be? Is applying different styling to indicate mistakes in
the original actually a common publisher need (unlike being able to style
headings or block quotations, for example)?

> Note that you can very well style <sic> as "[sic]" with CSS, if that's the
> form of presentation you prefer: sic:after {content:" [sic] "}

You can. However, that loses the linguistic information that "sic" is a Latin
word, which is (theoretically) important for correct pronunciation by speaking
agents.
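
To make the point concrete, here is a small Python sketch of a consumer that
reads markup and records which language each run of text is tagged with. The
class and function names are made up for this example, and the parser is
deliberately naive (it assumes every start tag has a matching end tag). It only
shows that a "sic" written into the document can carry lang="la", whereas a
"sic" produced by CSS generated content never appears in the markup at all.

```python
from html.parser import HTMLParser


class LangTextCollector(HTMLParser):
    """Collect (text, lang) pairs, tracking the innermost lang attribute."""

    def __init__(self):
        super().__init__()
        self.lang_stack = [None]  # language in effect for the current element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Inherit the enclosing language unless this element overrides it.
        lang = dict(attrs).get("lang", self.lang_stack[-1])
        self.lang_stack.append(lang)

    def handle_endtag(self, tag):
        # Naive: assumes well-nested markup without unclosed void elements.
        if len(self.lang_stack) > 1:
            self.lang_stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.chunks.append((data.strip(), self.lang_stack[-1]))


def text_with_lang(markup):
    """Return the text runs in `markup` paired with their declared language."""
    parser = LangTextCollector()
    parser.feed(markup)
    parser.close()
    return parser.chunks
```

Fed 'shall chuse [<span lang="la">sic</span>] their', this yields a "sic" run
tagged "la". Fed '<sic>chuse</sic>', it yields only "chuse" with no language,
since the "[sic]" text would exist only in the style sheet.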

> And here's a transcription that doesn't use "[sic]" in the same place
> although its publisher considered it important to indicate the correct
> reproduction of the original source in some way as well, as you can tell by
> looking into the wiki markup source code, where he added a comment stating
> the fact:
> http://en.wikisource.org/wiki/Constitution_of_the_United_States_of_America#Section_2

I don't see that the comment adds anything that "q" and "blockquote" do not.

> Having "[sic]" numerous times in a text seems to be annoying. It puts too
> much emphasis on errors. It is easily misunderstood as ridiculing someone's
> orthography though often not intended. Also, readers use full text quotes for
> various purposes, e.g. printing a piece of poetic art out and pinning it to a
> wall just like a painting. Printed "[sic]"s are not desirable there, as they
> are not part of the art. An unobtrusive <sic> would preserve the advantages
> of "[sic]" without its disadvantages in full quotes. It carries its
> information even if made invisible to the common reader. Unlike HTML
> comments, which are also invisible, <sic> is semantic, can be easily made
> visible, and isn't stripped by processing scripts without good reason.

I think you'll find there's a strong bias in the WHATWG community towards the
principle that visible data is preferable to invisible metadata:

http://tantek.com/log/2005/06.html#d03t2359

I think it's unlikely you'll find much support for adding a "sic" element under
the expectation that browsers will render it by adding [sic] at the end. I
think it's even more unlikely you'll find much support for adding a "sic"
element that browsers won't render or otherwise treat specially.

>> 4. It seems like "sic" would be a very rarely used feature. Why do we need
>> to include it in the small, core HTML vocabulary rather than an RDF
>> vocabulary imported into HTML via annotations like RDFa, microdata, or
>> microformats?

> Extensions such as microformats are less widely known and probably always
> will be.

Agreed. This seems appropriate for features that would be rarely used.

If it turns out people are inclined to use the feature (e.g. if a microformat
or whatever gains surprisingly common currency), we can reconsider adding it to
the core vocabulary.

> But indicating where mistakes have been reproduced deliberately isn't a
> special interest/topic/technology application. It's a very basic thing to do,
> it occurs whenever quoting occurs. Almost every blogger does it. People on
> discussion boards quote each other all the time. Newspapers do it, scientific
> papers do it.

Indeed. This doesn't mean they'd actually use the element, however. Almost
nobody uses "q"; few people use "cite" in body text. And you're citing examples
where people want to make a visible indication of an error, while proposing an
element that you expect not to have a visible indication. The element would
not solve their problem, so they mostly wouldn't use it.

--
Benjamin Hawkes-Lewis
Received on Friday, 31 December 2010 08:30:45 UTC