Re: HTML5 Paragraphs, Sentences and Phrases

From: Thomas A. Fine <fine@head.cfa.harvard.edu>
Date: Tue, 24 Apr 2012 12:44:06 -0400 (EDT)
To: public-html@w3.org
Message-Id: <20120424164406.758FBDAC8B7@bugs.localhost>

This discussion seems to be focused so far on natural language processing,
but for me this is not the primary issue.  The primary purpose of HTML is
to provide content authors with control over the look of their document.

The use of the Unicode (or any other) sentence algorithm together with
a pseudo-tag is inadequate.  Since the algorithm won't work in all cases,
authors don't actually control what is formatted in this case.  This is
the entire reason a sentence tag is needed.  If any current or proposed
method of parsing sentences 100% matched human interpretation, a
sentence tag would not be necessary.

Marking sentences (and phrases) is certainly a tedious task, and
not something the majority of content creators are likely to be
interested in.  However, there is good evidence that such formatting
is useful (at least) to early readers and new readers coming from
a different language.  There would certainly be those who would be
interested if this were available.

Let's look at a basic case where an author wants to format sentences
with extra space.  A knowledgeable and patient author can mark sentences
with span tags, and then use these to (more or less) achieve the desired
formatting.  Someone with little HTML experience, or someone sitting
in front of a web authoring tool is unlikely to be able to accomplish
this at all, or is likely to be steered towards the incorrect solution
of using nbsp.

Ideally, web authoring tools could aid the user in marking sentences
by using sentence detection algorithms, and allow the user to
override those cases where this method fails.  While this is possible
with a span tag, no such tools are ever likely to be developed in
the absence of a dedicated sentence tag.
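
As a rough illustration of that workflow, here is a minimal Python sketch
of what such a tool might do with today's span tag: guess the boundaries,
let the author correct them, then emit the markup.  Everything in it is
an assumption made for the example: the function names, the "sentence"
class name, the sample text, and above all the boundary rule, which is a
naive punctuation test and not the Unicode sentence algorithm.

```python
import html
import re

# Naive boundary rule (an assumption for illustration only): sentence-ending
# punctuation followed by whitespace. It misfires on abbreviations like "Mr.".
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def detect_boundaries(text):
    """Return the character offsets where a new sentence is guessed to start."""
    return [m.end() for m in _BOUNDARY.finditer(text)]


def mark_sentences(text, remove=(), add=()):
    """Wrap each sentence in <span class="sentence">...</span>.

    `remove` and `add` are the author's overrides: offsets to drop from,
    or add to, the automatically detected boundaries.
    """
    cuts = sorted((set(detect_boundaries(text)) - set(remove)) | set(add))
    starts = [0] + cuts
    ends = cuts + [len(text)]
    spans = []
    for start, end in zip(starts, ends):
        sentence = text[start:end].strip()
        if sentence:
            spans.append(
                '<span class="sentence">%s</span>' % html.escape(sentence)
            )
    return " ".join(spans)


text = "I met Mr. Sobieski. He agreed."

# The detector wrongly splits after "Mr." (offset 10), giving three spans.
guessed = mark_sentences(text)

# The author overrides that one boundary, leaving the two real sentences.
corrected = mark_sentences(text, remove=[10])
```

A style sheet could then target the "sentence" class to add the extra
space.  The override step is the important part: the detector only
proposes boundaries, and the author has the final say.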

So my opinion is that while sentence formatting can be accomplished
with a span tag, easy and accessible sentence formatting is unlikely
to be available to most content creators without a dedicated sentence
tag.  For similar reasons, I'd suggest that the sort of tools Mr.
Sobieski has discussed are also unlikely to make significant progress.

The same arguments can easily be extended to phrase tags, especially
since we can't reasonably suggest any algorithm that might yield
phrases.  Things do get a bit more sticky there, as I don't believe
current CSS models are up to the task of correctly formatting phrases
differently from each other and from sentences.

     tom
Received on Tuesday, 24 April 2012 16:44:37 GMT
