Re: HTML5 Paragraphs, Sentences and Phrases from Benjamin Hawkes-Lewis on 2012-04-09 (public-html@w3.org from April 2012)

From: Benjamin Hawkes-Lewis <bhawkeslewis@googlemail.com>
Date: Mon, 9 Apr 2012 14:20:24 +0100
To: Adam Sobieski <adamsobieski@hotmail.com>
Cc: public-html@w3.org
Message-ID: <CAEhSh3ehTB8SoJQgM2MxFOqGCvrCzQoL+h8wr78DHQ1f7W+xRA@mail.gmail.com>

On Mon, Apr 9, 2012 at 12:41 PM, Adam Sobieski <adamsobieski@hotmail.com> wrote:
> While HTML5 presently has a document structure granularity of
> paragraphs, for sentences and phrases in hypertext, options include the
> <span> element, e.g. <span class="sentence"> and <span class="phrase">, and
> the use of XML from other XMLNS.

Also microdata, RDFa, and Unicode sentence segmentation:

http://www.unicode.org/reports/tr29/#Sentence_Boundaries

> HTML5 markup elements for sentences and phrases are possible.

Possible, but their necessity is undemonstrated.

> In any eventuality, sentences and phrases are important CSS3 usage scenarios.

> A non-exhaustive list of the benefits of sentences in hypertext include:
>
> 1. Sentence-level granularity can be of use to the styling, layout and
> rendering of hypertext. Topics include layout with regard to columns and
> pages as well as intersentence spacing. Sentence and phrase granularity in
> documents can facilitate readability, reading speed and comprehension
> (http://lists.w3.org/Archives/Public/www-style/2012Apr/0153.html).

The web corpus is not going to get marked up with phrases and
sentences in the absence of NLP advances that would make such markup
mostly redundant. If you want a way to tweak this spacing from CSS, a
::sentence pseudo-element (comparable to ::first-letter and
::first-line) that selected sentences based on the Unicode sentence
segmentation algorithm would work reasonably at web scale, whereas a
dedicated <sentence> semantic would only work in the small subset of
documents that applied it. Authors who want to tweak the spacing in
particular cases can use <span>. I suggest you propose ::sentence for
CSS Selectors Level 4.
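
To illustrate the <span> fallback for particular cases, here is a minimal
Python sketch that wraps sentences in <span class="sentence">. The boundary
rule is a deliberately crude stand-in, not the Unicode algorithm: it breaks
only after ".", "!" or "?" followed by whitespace and an ASCII capital, so
it will mis-split text such as "e.g. Smith". The function names are
hypothetical.

```python
import html
import re

# Assumption: a sentence ends at ., ! or ? when followed by whitespace and
# an ASCII uppercase letter. This is far simpler than UAX #29.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Return the sentences of text, each keeping its trailing whitespace."""
    sentences = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        sentences.append(text[start:match.end()])
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])
    return sentences


def wrap_sentences(text):
    """Wrap each sentence in <span class="sentence">, leaving the
    whitespace between sentences outside the spans."""
    parts = []
    for sentence in split_sentences(text):
        body = sentence.rstrip()
        trailing = sentence[len(body):]
        parts.append(
            '<span class="sentence">' + html.escape(body) + "</span>" + trailing
        )
    return "".join(parts)


if __name__ == "__main__":
    print(wrap_sentences("Hello world. How are you? Fine."))
```

A ::sentence pseudo-element would make this preprocessing step unnecessary,
which is the point of proposing it.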

> 2. Media overlays in EPUB, based upon SMIL, "text elements' src attributes
> refer to EPUB Content Document elements by their IDs. The granularity level
> of the Media Overlay therefore depends on how the EPUB Content Document is
> marked up. If the finest level of markup is at the paragraph level, then
> that is the finest possible level at which Media Overlay synchronization can
> be authored. Likewise, if sub-paragraph markup is available, such as span
> elements representing phrases or sentences, then finer granularity is
> possible in the Media Overlay. Finer granularity gives Users more precise
> results for synchronized playback when navigating by word or phrase and when
> searching the text, but increases the file size of the Media Overlay
> Documents."
> (http://idpf.org/epub/30/spec/epub30-mediaoverlays.html#sec-media-overlays-granularity)

From that document, it sounds like <span> already works for their use case?

> 3. Natural language processing of hypertext. See also:
> http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation .

Is NLP that needs markup to discern sentence boundaries still NLP?

> 4. Navigational. Sentence elements with IDs can be navigated to and
> specifically referenced. See also:
> http://idpf.org/epub/linking/cfi/epub-cfi.html .

@id already works for this. Why is the "sentence" semantic needed?

> 5. Sentence-level granularity of structure can facilitate new semantics
> including annotational. For example, the epub:type attribute, resembling the
> role attribute, with some uses indicated at
> http://idpf.org/epub/vocab/structure/#h_document-text including
> "concluding-sentence" and "topic-sentence".

We don't need to introduce new semantics to the core vocabulary to
facilitate annotations that can already be made with microdata/RDFa.

> 6. Speech synthesis. SSML includes paragraphs and sentences
> (http://www.w3.org/TR/speech-synthesis11/#S3.1.8.1). Sentence granularity
> can enhance the audio output of synthesis processors processing hypertext.

The Unicode sentence segmentation algorithm sounds good enough for
this. If it's not, improving the NLP algorithms of text-to-speech
agents is going to be more cost-effective than trying to persuade authors
to add sentence markup to the corpus.

> A non-exhaustive list of the benefits of phrases in hypertext include:
>
> 1. Phrase-level granularity can be of use to styling, layout and rendering.
> Topics include text wrapping. Sentence and phrase granularity in documents
> can facilitate readability, reading speed and comprehension
> (http://lists.w3.org/Archives/Public/www-style/2012Apr/0153.html).

Can you summarize from your reading list what these benefits would be
and why they can't be achieved using existing mechanisms like Unicode
non-breaking spaces?
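
For instance, keeping a phrase from wrapping needs no markup, only U+00A0
(NO-BREAK SPACE) in place of the ordinary spaces inside it. A minimal Python
sketch, with a hypothetical function name and a sample phrase of my own:

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE


def bind_phrase(text, phrase):
    """Replace the spaces inside every occurrence of phrase with
    no-break spaces so the phrase is not split across lines."""
    return text.replace(phrase, phrase.replace(" ", NBSP))


if __name__ == "__main__":
    bound = bind_phrase("See the New York Times today", "New York Times")
    print(repr(bound))
```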

> 2. Media overlays in EPUB [snip]
> 3. Natural language processing of hypertext.
> 4. Phrase-level granularity of structure can facilitate new semantics
> including annotational. For example, the epub:type attribute, resembling the
> role attribute, with some uses indicated
> at http://idpf.org/epub/vocab/structure/#h_document-text including
> "keyword".

Already discussed above.

> 5. Speech synthesis. For example, pauses between words may differ inside and
> between phrase elements.

Do you have an example of this? This behavior sounds like it would be
phrase-specific rather than general to everything authors might mark
up with <phrase>. How are you defining "phrase" here anyway?

--
Benjamin Hawkes-Lewis
Received on Monday, 9 April 2012 13:21:14 UTC