Re: HTML5 Paragraphs, Sentences and Phrases

Text to Speech APIs (especially chrome.tts) are particularly relevant.

At the word level, projects like OmegaWiki are relevant.

There are good reasons for tagging content, but they're not what I'd consider mainstream.
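
To make the span-based tagging discussed in the quoted thread below concrete, here is a minimal Python sketch that wraps each sentence of a paragraph in <span class="sentence">. It rests on some assumptions: the splitter is a crude punctuation heuristic, not the Unicode (UAX #29) sentence segmentation algorithm the thread refers to, and the names split_sentences and wrap_sentences are hypothetical, introduced only for illustration.

```python
import html
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
# This is a rough heuristic for illustration only; it is NOT UAX #29 and
# will mis-split abbreviations such as "e.g. this".
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Return the sentences found in text, using the heuristic above."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]


def wrap_sentences(text, css_class="sentence"):
    """Wrap each sentence in a <span> carrying the given class name."""
    spans = [
        '<span class="{}">{}</span>'.format(
            html.escape(css_class, quote=True),  # keep the attribute well-formed
            html.escape(s),                      # escape &, <, > in the content
        )
        for s in split_sentences(text)
    ]
    return " ".join(spans)


if __name__ == "__main__":
    print(wrap_sentences("It works. Does it? Yes!"))
```

An authoring tool could run something like this once at save time, which is the scenario where heavier language-specific processing is affordable; a browser-side approach would more likely rely on a general segmentation algorithm instead.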

-Charles


On Apr 16, 2012, at 12:21 PM, Adam Sobieski <adamsobieski@hotmail.com> wrote:

> Benjamin Hawkes-Lewis,
>  
> With regard to sentence segmentation and phrase annotation, authoring software can use semi-automated, interactive natural language processing techniques, as well as other natural language processing approaches, that presently may require more computation than is practical during page loading and initialization. While authoring software can provide users with language-specific features, browsers' natural language functionality may be expected to be broadly multilingual. Markup and style can store the output of an authoring tool's natural language processing components in hypertext documents, for uses including dynamic readable layouts.
>  
> Thank you for the information about Unicode sentence segmentation information and the ::sentence pseudo-element idea.
>  
> Span-based solutions are functional and, in scenarios including EPUB3, span for sentences is already a popular usage. Span for phrases could become so as well with CSS3 speech and text features including text-wrap.
>  
> In summary, some research indicates that reading speed, comprehension and retention can be enhanced by text formatting, including phrase-based formatting. With regard to indicating phrase structure in hypertext with markup and style, there are at least two scenarios: in one, indicated phrases are sparse in the hypertext; in the other, regions of hypertext are more or less entirely segmented into phrases. Authors and authoring software could make use of phrase structure for scenarios including keywords, phrasemes, collocations and compound terms.
>  
>  
>  
> Kind regards,
>  
> Adam
>  
>  
> > From: bhawkeslewis@googlemail.com
> > Date: Mon, 9 Apr 2012 14:20:24 +0100
> > To: adamsobieski@hotmail.com
> > CC: public-html@w3.org
> > Subject: Re: HTML5 Paragraphs, Sentences and Phrases
> > 
> > On Mon, Apr 9, 2012 at 12:41 PM, Adam Sobieski <adamsobieski@hotmail.com> wrote:
> > > While HTML5 presently has a document structure granularity of
> > > paragraphs, for sentences and phrases in hypertext, options include the
> > > <span> element, e.g. <span class="sentence"> and <span class="phrase">, and
> > > the use of XML from other XMLNS.
> > 
> > Also microdata, RDFa, and Unicode sentence segmentation:
> > 
> > http://www.unicode.org/reports/tr29/#Sentence_Boundaries
> > 
> > > HTML5 markup elements for sentences and phrases are possible.
> > 
> > Possible, but their necessity is undemonstrated.
> > 
> > > In any eventuality, sentences and phrases are important CSS3 usage scenarios.
> > 
> > > A non-exhaustive list of the benefits of sentences in hypertext includes:
> > >
> > > 1. Sentence-level granularity can be of use to the styling, layout and
> > > rendering of hypertext. Topics include layout with regard to columns and
> > > pages as well as intersentence spacing. Sentence and phrase granularity in
> > > documents can facilitate readability, reading speed and comprehension
> > > (http://lists.w3.org/Archives/Public/www-style/2012Apr/0153.html).
> > 
> > The web corpus is not going to get marked up with phrases and
> > sentences in the absence of NLP advances that would make such markup
> > mostly redundant. If you want a way to tweak this spacing from CSS, a
> > ::sentence pseudo-element (comparable to ::first-letter and
> > ::first-line) that selected sentences based on the Unicode sentence
> > segmentation algorithm would work reasonably at web scale, whereas a
> > dedicated <sentence> semantic would only work in the small subset of
> > documents that applied it. Authors who want to tweak the spacing in
> > particular cases can use <span>. I suggest you propose ::sentence for
> > CSS Selectors Level 4.
> >
> > > 2. Media overlays in EPUB, based upon SMIL: "text elements' src attributes
> > > refer to EPUB Content Document elements by their IDs. The granularity level
> > > of the Media Overlay therefore depends on how the EPUB Content Document is
> > > marked up. If the finest level of markup is at the paragraph level, then
> > > that is the finest possible level at which Media Overlay synchronization can
> > > be authored. Likewise, if sub-paragraph markup is available, such as span
> > > elements representing phrases or sentences, then finer granularity is
> > > possible in the Media Overlay. Finer granularity gives Users more precise
> > > results for synchronized playback when navigating by word or phrase and when
> > > searching the text, but increases the file size of the Media Overlay
> > > Documents."
> > > (http://idpf.org/epub/30/spec/epub30-mediaoverlays.html#sec-media-overlays-granularity)
> > 
> > From that document, it sounds like <span> already works for their use case?
> > 
> > > 3. Natural language processing of hypertext. See also:
> > > http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation .
> > 
> > Is NLP that needs markup to discern sentence boundaries still NLP?
> > 
> > > 4. Navigational. Sentence elements with IDs can be navigated to and
> > > specifically referenced. See also:
> > > http://idpf.org/epub/linking/cfi/epub-cfi.html .
> > 
> > @id already works for this. Why is the "sentence" semantic needed?
> > 
> > > 5. Sentence-level granularity of structure can facilitate new semantics
> > > including annotational. For example, the epub:type attribute, resembling the
> > > role attribute, with some uses indicated at
> > > http://idpf.org/epub/vocab/structure/#h_document-text including
> > > "concluding-sentence" and "topic-sentence".
> > 
> > We don't need to introduce new semantics to the core vocabulary to
> > facilitate annotations that can already be made with microdata/RDFa.
> > 
> > > 6. Speech synthesis. SSML includes paragraphs and sentences
> > > (http://www.w3.org/TR/speech-synthesis11/#S3.1.8.1). Sentence granularity
> > > can enhance the audio output of synthesis processors processing hypertext.
> > 
> > The Unicode sentence segmentation algorithm sounds good enough for
> > this. If it's not, improving the NLP algorithms of text-to-speech
> > agents is going to be more cost-effective than trying to persuade authors
> > to add sentence markup to the corpus.
> > 
> > > A non-exhaustive list of the benefits of phrases in hypertext includes:
> > >
> > > 1. Phrase-level granularity can be of use to styling, layout and rendering.
> > > Topics include text wrapping. Sentence and phrase granularity in documents
> > > can facilitate readability, reading speed and comprehension
> > > (http://lists.w3.org/Archives/Public/www-style/2012Apr/0153.html).
> > 
> > Can you summarize from your reading list what these benefits would be
> > and why they can't be achieved using existing mechanisms like Unicode
> > non-breaking spaces?
> > 
> > > 2. Media overlays in EPUB [snip]
> > > 3. Natural language processing of hypertext.
> > > 4. Phrase-level granularity of structure can facilitate new semantics
> > > including annotational. For example, the epub:type attribute, resembling the
> > > role attribute, with some uses indicated
> > > at http://idpf.org/epub/vocab/structure/#h_document-text including
> > > "keyword".
> > 
> > Already discussed above.
> > 
> > > 5. Speech synthesis. For example, pauses between words may differ inside and
> > > between phrase elements.
> > 
> > Do you have an example of this? This behavior sounds like it would be
> > phrase-specific rather than general to everything authors might mark
> > up with <phrase>. How are you defining "phrase" here anyway?
> > 
> > --
> > Benjamin Hawkes-Lewis
> > 

Received on Monday, 16 April 2012 19:34:14 UTC