[whatwg] sic element, was: Re: Exposing spelling/grammar suggestions in contentEditable from Tab Atkins Jr. on 2010-12-31 (public-whatwg-archive@w3.org from December 2010)

From: Tab Atkins Jr. <jackalmage@gmail.com>
Date: Fri, 31 Dec 2010 08:22:52 -0800
Message-ID: <AANLkTi=ERnmyg=41mwgMvSTqTJ8vFyaYn-bQOyAz+33_@mail.gmail.com>
On Fri, Dec 31, 2010 at 7:17 AM, Martin Janecke <whatwg.org at kaor.in> wrote:
> On 30.12.2010 at 22:49, Benjamin Hawkes-Lewis wrote:
>> 1. What problem(s) does indicating where mistakes have been reproduced solve?
>
> I understand the question in this context as a concrete formulation of questions such as "What problem(s) does meta data solve? What problem(s) does semantic markup solve?" They carry additional information about a text. They solve the problem of not having this information available. Is the additional information worthwhile in this special case? I think so. It's common in plain text ("[sic]") and even spoken language. It's found in scientific papers as well as in respected newspapers.

"Not having this information available" isn't a problem.  What needs
that information, but can't currently have it?  Don't try to
generalize a problem too early - we come up with generic solutions by
looking at many specific problems, not by finding a generic problem
(since generic problems aren't really problems for any particular
person, usually).

In-band annotations of misspellings (that is, just putting "[sic]" in
the text) solve the problem of "Is this misspelling from the source
or from the author?" just fine for newspapers, scientific articles,
and such.  Why is this not acceptable on webpages?  What additional
benefit do you gain from putting this in the markup rather than in the
text?


> Apart from informing human readers about the correct reproduction of a misspelled word, an HTML <sic> would indicate the same to web applications. Think of a search engine, which, as one factor of their ranking algorithm, considers orthography and grammar in a page as a quality factor. The search engine could be made to ignore (reasonably few) <sic>-marked errors in such an algorithm; i.e. not let <sic>-marked errors rank the page lower.

How would <sic> inform human readers about a misspelled word?

Does any search engine currently include misspellings in a page's rank?


>> 3. What's the advantage of using markup to do this rather than visible
>> text like deadtree.
>
> Sorry, I don't understand "deadtree". Is this an idiom?

Deadtree = printed on paper.


>> What's wrong with "The House of Representatives
>> shall chuse [<span lang="la">sic</span>] their Speaker and other
>> Officers"?
>
> In many cases there's nothing wrong with a visible "[sic]". It has successfully been done for decades. And it will be in future. There's also nothing wrong with plain text in general; it has been used successfully for centuries and will be in future. There's nothing wrong with books that use presentation oriented markup either, e.g. italics when emphasizing. They have been printed successfully for centuries and will be in future.
>
> What is wrong with "Cats [emphasized] are cute animals" or "<span style='font-style:italic'>Cats</span> are cute animals" or "<span class='emphasized'>Cats</span> are cute animals" instead of "<em>Cats</em> are cute animals"? I don't think there's anything really wrong with either of these, but apparently people agreed that it's good to use a standardized markup language for markup, that semantic markup is a good thing and that simple markup is a good thing. <sic> in an HTML page would be simple, semantic and consistent HTML.

The first is not common practice.  The second is just a really
longwinded way of writing <i> that will fail if CSS isn't turned on.
The third is unnecessary because we have an element for emphasis.
That last one is begging the question, of course, but emphasis is a
decidedly useful semantic to impart that doesn't have a
widely-accepted plain-text variant, and which has several
presentations based on the medium (visually, it's commonly italicized;
aurally, it's spoken with more emphasis).


> I think <sic> is a more HTMLish solution than a plain text "[sic]" -- just like <ul><li>...<li>... styled with list-style-type:decimal is more HTMLish than <div>1. ...<div>2. ...
>
> The plain text string "[sic]" doesn't indicate where the start of the "[sic]"ed part of text is. That means it provides less information than <sic>...</sic>.

This doesn't appear to have been a problem worth solving in centuries
of printed media.  Why is it important to solve now, specifically for
HTML?


> "[sic]" can't be handled with @media and CSS in general.
>
> Note that you can very well style <sic> as "[sic]" with CSS, if that's the form of presentation you prefer:
> sic:after {content:" [sic] "}

Again, this hasn't been a problem in the centuries of print media -
annotated misspellings in quotes have evolved a fairly definite
presentation that varies very little between authors.  What additional
problems does HTML bring that make it suddenly necessary to restyle
"[sic]"?


> "[sic]" is hardly used in full quotes/transcriptions, although the advantages of using "[sic]" in short quotes apply to full quotes too. For example, here's a short quote that uses "[sic]" visibly:
> http://en.wikipedia.org/wiki/Article_One_of_the_United_States_Constitution#Clause_5:_Speaker_and_other_officers.3B_Impeachment
> And here's a transcription that doesn't use "[sic]" in the same place although its publisher considered it important to indicate the correct reproduction of the original source in some way as well, as you can tell by looking into the wiki markup source code, where he added a comment stating the fact:
> http://en.wikisource.org/wiki/Constitution_of_the_United_States_of_America#Section_2
> Having "[sic]" numerous times in a text seems to be annoying. It puts too much emphasis on errors. It is easily misunderstood as ridiculing someone's orthography though often not intended. Also, readers use full text quotes for various purposes, e.g. printing a piece of poetic art out and pinning it to a wall just like a painting. Printed "[sic]"s are not desirable there, as they are not part of the art. An unobtrusive <sic> would preserve the advantages of "[sic]" without its disadvantages in full quotes. It carries its information even if made invisible to the common reader. Unlike HTML comments, which are also invisible, <sic> is semantic, can be easily made visible, and isn't stripped by processing scripts without good reason.

That, now, is a use-case.


>> 4. It seems like "sic" would be a very rarely used feature. Why do we
>> need to include it in the small, core HTML vocabulary rather than an
>> RDF vocabulary imported into HTML via annotations like RDFa,
>> microdata, or microformats?
>
> <sic> would be a natural enhancement in the tradition of <blockquote>, <q> and <cite>.
>
> HTML is a widely taught and learned language.
> Indicating where mistakes have been reproduced deliberately is a widely known and widely (though not very often) applied habit, even in spoken language and plain text.
>
> Extensions such as microformats are less widely known and probably always will be. Because they build upon languages such as HTML, people won't learn microformats without the language they are used upon, but many people will learn the language they are used upon without learning microformats. Microformats are great to solve very specific problems and people seeking to solve specific problems will dig into them happily. But indicating where mistakes have been reproduced deliberately isn't a special interest/topic/technology application. It's a very basic thing to do, it occurs whenever quoting occurs. Almost every blogger does it. People on discussion boards quote each other all the time. Newspapers do it, scientific papers do it.

In almost all of those cases, a visible "[sic]" in the text would be adequate.

~TJ
Received on Friday, 31 December 2010 08:22:52 UTC