
Re: The missing Sentence tag

From: Jirka Kosek <jirka@kosek.cz>
Date: Wed, 05 Dec 2012 20:11:22 +0100
Message-ID: <50BF9C5A.3020904@kosek.cz>
To: "Thomas A. Fine" <fine@head.cfa.harvard.edu>
CC: public-html@w3.org
On 5.12.2012 18:57, Thomas A. Fine wrote:

> HTML needs a tag to indicate sentence structure.

This seems like quite a bold statement, given that most authors of
Web content will be too lazy to mark up sentences.

> Like other semantic tags, a sentence tag can be useful in attempts to
> extract meaning from a document, or to convert text to speech with more
> reliable inflection, or to provide more reliable translations, and
> probably for many other reasons.

Yes, for translation it is sometimes important to segment text into
sentences properly. However, since the semantics of HTML elements are
known in advance, there is usually no problem with this. For some rare
ambiguous cases you can use ITS markup (which can be applied to HTML as
well) to set segmentation boundaries.
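
A minimal sketch of such markup, assuming the its-within-text attribute
from the ITS 2.0 "Elements Within Text" data category (the class name
and the text are made up for illustration):

```html
<p>The price is final.
  <!-- Hypothetical case: this span is styled as a separate block, so it
       is marked as not being part of the surrounding running text. -->
  <span class="side-note" its-within-text="no">Taxes may apply.</span>
  Shipping is free.</p>
```

The value "no" is meant to tell a segmentation tool to treat the span as
its own unit rather than as part of the enclosing sentence flow.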

> While there are suggested algorithms for detecting sentences, none of
> them works completely reliably.  An accurate solution defies even the
> most advanced AI approach, and in fact even another human being would
> likely fail to accurately guess what the content creator had in mind in
> all cases.

Well, if automatic spacing algorithms fail (which is not as often as
you describe, at least in my experience), you can always fix a missing
space by inserting an en- or em-space character manually, which seems a
much lower barrier than putting an element around each sentence.
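
A minimal sketch of that manual fix, using the standard HTML character
references for the two spaces (the sentences are placeholders):

```html
<!-- &ensp; is U+2002 EN SPACE, &emsp; is U+2003 EM SPACE -->
<p>This is the first sentence.&emsp;This is the second one.</p>
<p>This is the first sentence.&ensp;This is the second one.</p>
```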

  Jirka Kosek      e-mail: jirka@kosek.cz      http://xmlguru.cz
       Professional XML consulting and training services
  DocBook customization, custom XSLT/XSL-FO document processing
 OASIS DocBook TC member, W3C Invited Expert, ISO JTC1/SC34 rep.
    Bringing you XML Prague conference    http://xmlprague.cz

Received on Wednesday, 5 December 2012 19:11:52 UTC
