Re: HTML5 Paragraphs, Sentences and Phrases

From: Thomas A. Fine <fine@head.cfa.harvard.edu>
Date: Thu, 12 Apr 2012 16:48:06 -0400 (EDT)
To: public-html-comments@w3.org
Message-Id: <20120412204806.4FF9AD9A993@bugs.localhost>

This is in response to Benjamin Hawkes-Lewis' response to
Adam Sobieski's proposal for sentence and phrase tags.

Speaking to the "necessity" of these tags: while I'm not sure that
any tag, or HTML, or the web, or even a good slice of pizza can really
be described as necessary, these tags can definitely be useful, and
most likely important.  Sentence and phrase markup can be very
useful to:
  People relying on audio conversion to access the web.
  People relying on automated translation.
  People who are just learning to read.
  People who are reading an article not in their native language.
  People who are interested in inter-sentence spacing or inter-phrase spacing.
  People with commercial interests, looking to maximize their reach.

Of course, simply adding tags won't really help any of these people.
The real point is that such tags can facilitate tools that help
these people.

The problem with using span tags is that they won't facilitate tool
development.  In the absence of a real standard, no one is going
to develop software to process sentences by searching for spans
that might be labeled "sentence" or "sent" or "stc" or who knows
what else.  Only in the presence of a standard tag can developers use
it to improve translation, or to emphasize phrasing and sentence
structure for improved readability.
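
To make the span problem concrete, here is a minimal sketch in Python
of what a tool has to do today.  The class names in GUESSED_CLASSES are
hypothetical, taken from the examples above; no standard says which
name an author would actually pick, so any such list is a guess and
will miss markup that uses a different name.

```python
from html.parser import HTMLParser

# Hypothetical class names: without a standard, a tool can only guess.
GUESSED_CLASSES = {"sentence", "sent", "stc"}


class SentenceSpanFinder(HTMLParser):
    """Collects the text of <span> elements whose class is a guessed name."""

    def __init__(self):
        super().__init__()
        self.sentences = []
        self._stack = []   # one flag per open <span>: is it a guessed sentence?
        self._buffer = []  # text collected for the current sentence span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        is_sentence = any(c in GUESSED_CLASSES for c in classes)
        # Start a fresh buffer only for an outermost sentence span.
        if is_sentence and not any(self._stack):
            self._buffer = []
        self._stack.append(is_sentence)

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        was_sentence = self._stack.pop()
        # Emit the text once the outermost sentence span closes.
        if was_sentence and not any(self._stack):
            self.sentences.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if any(self._stack):
            self._buffer.append(data)


def find_sentences(html):
    """Return the text of every span that matches a guessed class name."""
    finder = SentenceSpanFinder()
    finder.feed(html)
    finder.close()
    return finder.sentences
```

A span labeled with any name outside the guessed set is silently
skipped, which is the failure a dedicated tag would avoid.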

Mr. Hawkes-Lewis wrote:
>The web corpus is not going to get marked up with phrases and
>sentences in the absence of NLP advances that would make such markup
>mostly redundant.

Natural Language Processing is riddled with problems, and there is
nothing to suggest that this will change in the near future.  On
the other hand, someone who is authoring content is in the perfect
situation to accurately identify sentences or phrases.  NLP can be
an aid to that user, and can provide hints to help them select
sentence structure.  But as I said above, no such software would
ever be developed to use NLP to aid users in marking sentence
structure unless there were already dedicated sentence and phrase
tags.  So in essence, you are correct, but only because your
argument is a self-fulfilling prophecy.

You also suggest simply using a CSS pseudo-element, and relying on the
Unicode sentence breaking conventions.  However, looking at these
conventions, they are just another attempt at some sort of automated
processing, and they acknowledge that this will not work for all cases.
This is just one more argument in favor of giving content providers
the ability to accurately mark up sentence structure.

I'll further note that any form of automated NLP is wholly inadequate
for users who are interested simply in formatting control.  A mechanism
that leaves the decision of where and when content is formatted to
some outside algorithm, one the author cannot influence, does not
provide any real control over formatting.

If you are saying that you don't think most people will bother, that is
probably true.  But that doesn't mean that there aren't people with
a legitimate and important interest.

So, back to the original question: are these tags necessary?  I would
now say yes, these tags are necessary to the development of software
tools to aid users in marking sentence structure, and they are
necessary to the development of tools that allow content providers
to improve readability of their web pages for several classes of
web users.

     tom
Received on Thursday, 12 April 2012 20:48:36 GMT