The missing Sentence tag

From: Thomas A. Fine <fine@head.cfa.harvard.edu>
Date: Wed, 05 Dec 2012 12:57:28 -0500
Message-ID: <50BF8B08.5060606@head.cfa.harvard.edu>
To: public-html@w3.org

HTML needs a tag to indicate sentence structure.

So, how do I go about having this tag added?  Is there a formal 
procedure?  Should I submit a bug report?  Is there a specific group or 
mailing list where I should start?  What exactly is the process?

Here's a brief summary of why I think this is needed:

HTML5 has already added a number of other semantic tags which describe 
recognizable pieces of documents which are larger than sentences (e.g. 
SECTION).  And this trend has continued with RDF and Microdata showing 
that there is a significant interest in indicating smaller semantic 
pieces down to the sub-sentence level.

For this reason alone it should be obvious that it would be ludicrous 
for HTML to offer semantic tags for a vast array of different chunks of 
information, and yet ignore the absolutely most common semantic chunk, 
the sentence.

Like other semantic tags, a sentence tag can be useful in attempts to 
extract meaning from a document, or to convert text to speech with more 
reliable inflection, or to provide more reliable translations, and 
probably for many other reasons.

In addition to semantic reasons, my primary interest in this issue is in 
providing a mechanism for sentence spacing.  As HTML could arguably be 
the most consumed document type for the printed word today or in the 
near future, it's shocking that it can't do the one common formatting 
option that typesetters often used for hundreds of years after the 
invention of movable type: wider sentence spacing.

It's not my intention to start or facilitate some kind of war about 
sentence spacing.  Indeed, HTML should absolutely be agnostic on the 
issue.  Unfortunately, its inability to handle what is historically the 
most basic text formatting operation cannot be considered an agnostic 
position.  I've seen arguments of this issue where people hold up HTML 
as evidence that wider sentence spacing is no longer correct.  In other 
words, there is now a belief that the HTML standard has already taken sides.

Here are a few reasons why people might want to adjust sentence formatting:
   * Representation of the look of historical documents.
   * As an aid to new readers, or people learning a new language.
   * As an aid to people with learning or visual disabilities.
   * As an additional means of adding emphasis to text.
   * Simply because they prefer it for aesthetic reasons.

While there are suggested algorithms for detecting sentences, none of 
them works completely reliably.  An accurate solution defies even the 
most advanced AI approach, and in fact even another human being would 
likely fail to accurately guess what the content creator had in mind in 
all cases.
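
To illustrate the detection problem, here is a minimal sketch of one 
common heuristic: split after ".", "!" or "?" when whitespace and a 
capital letter follow.  The function name naive_split and the sample 
text are hypothetical, chosen only for illustration; this is not any 
particular published algorithm.

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' when followed by whitespace and a capital letter.

    Hypothetical heuristic for illustration only.
    """
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)

# The author of this text intended two sentences.
sample = "Dr. Smith arrived at 5 p.m. Tuesday. It was quiet."

for piece in naive_split(sample):
    print(repr(piece))
```

The heuristic returns four pieces instead of two, because it cannot 
tell the periods in "Dr." and "p.m." from sentence-ending ones.  
Adding an abbreviation list helps with "Dr.", but "p.m." can 
legitimately end a sentence, so only the author knows which reading 
was meant.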

If HTML has been given all the modern tools of convenience that we now 
have, shouldn't it also include one of the most basic tools that 
typesetters have been using for centuries?

Received on Wednesday, 5 December 2012 17:58:01 UTC
