[whatwg] Sentence structure from Thomas A. Fine on 2013-01-10 (public-whatwg-archive@w3.org from January 2013)

From: Thomas A. Fine <fine@head.cfa.harvard.edu>
Date: Thu, 10 Jan 2013 14:32:34 -0500 (EST)
To: whatwg@whatwg.org
Message-Id: <20130110193234.9FA761D773DB@bugs.localhost>
[Apologies to those who read public-html or www-style at w3.org
where I've raised these issues (although this is more comprehensive).
The process for modifying HTML is way more complicated than it used
to be, and I'm still trying to figure out all the parts and the
best approach.]

HTML needs support for identifying sentence structure.

Use Cases:
  1. Formatting sentence spacing to approximate the look of
     almost all books in English from 1650-1950.
  2. Formatting sentence spacing because it is very likely an
     aid to scanning text, and there are some indications that it
     is helpful for new readers, readers learning a new language,
     and readers with visual scanning issues and other learning
     disabilities.
  3. Formatting sentence spacing because I like it that way.
  4. Clarifying sentence boundaries would be an aid in machine
     translation software.
  5. Clarifying sentence boundaries would be an aid to screen
     readers to help provide correct inflection.

  When it comes to the formatting use cases, there are a huge number
  of people who currently use two spaces between sentences already,
  even in web content where it currently is wasted.  Some significant
  portion of these people are likely to be interested in sentence
  formatting if such a feature were available in a practical form.

  As for machine-parsed uses, it's not actually my field, so I'm
  not sure how helpful it would be, only that it would be helpful
  to some degree.  In my limited experience with text-to-speech,
  sentence inflection errors are usually a noticeable problem.


Existing practices, with some obvious pluses(+) and minuses(-):
  * The most popular recommendation on the web is to use &nbsp;.
    + Many people are familiar with it.
    - Not so fun to type.
    - Only the non-collapsing aspect is needed; the non-breaking
      aspect interrupts line breaks and creates uneven justification
      (left and right).
    - No fine-grained or dynamic control that CSS could provide.
    - Not really so useful for machine translation or screen readers
      as it doesn't eliminate ambiguity.

  * Use other space entities.
    + Doesn't have the justification problem of &nbsp;.
    + Allows some degree of fine control with different space sizes.
    - There exists no space entity which is the same size as a space
      and which breaks but doesn't collapse.
    - Many content creators are not aware of these entities.
    - Not really so useful for machine translation or screen readers
      as it doesn't eliminate ambiguity.
    - Still not as fine-grained as a CSS solution and no dynamic
      control.

  * Use spans to wrap sentences (not commonly used).
    - Very tedious.
    + Allows fine-grained and dynamic control through CSS.
    - Clean CSS for formatting is not obvious (e.g. some
      recommendations say to use the box model, which disrupts line
      breaking and creates uneven margins).

  * Set white-space to pre-wrap (not commonly used).
    + Very simple for content creators.
    - Doesn't provide unambiguous sentences to machine parsers.
    - Pre-wrap honors new lines which may be undesirable to some
      authors [why isn't there a white-space option that preserves
      spaces but not newlines?].
    - No fine-grained or dynamic control.
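
  To make the span approach concrete, here is a minimal Python sketch
  of a post-processor that takes the tedium out of it.  It assumes the
  sentences have already been identified by some other means, and the
  function name wrap_sentences and the class name "sentence" are
  arbitrary choices for illustration, not anything HTML or CSS defines.

```python
from html import escape


def wrap_sentences(sentences, class_name="sentence"):
    """Wrap each already-identified sentence in its own span.

    The class name is a hypothetical hook for CSS; nothing in HTML
    gives it any meaning by itself.
    """
    spans = [
        '<span class="{}">{}</span>'.format(escape(class_name), escape(s.strip()))
        for s in sentences
    ]
    # Join with one ordinary space so line breaking still works;
    # any wider gap would have to come from CSS.
    return " ".join(spans)


if __name__ == "__main__":
    print(wrap_sentences(["It rained all day.", "Nobody minded."]))
```

  This only produces the markup.  It does nothing about the harder
  problem noted above, that clean CSS for the spacing is not obvious.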


Possible improvements, with some obvious pluses(+) and minuses(-):
  * Detect sentences from text with an off-the-shelf algorithm.
    + Works on all existing content.
    - Available algorithms are some combination of unreliable
      and expensive.
    - Content creator doesn't have any control over what the
      algorithm will decide is or is not a sentence.  Some sort of
      tag or entity could be used only for exceptions but again,
      the content creator wouldn't know where the exceptions might
      occur without a specified algorithm.

  * CSS setting that tells the parser that two spaces after terminal
    punctuation can be trusted as a reliable method of detecting
    sentences without ambiguity.
    + Would work immediately for some existing content.
    + By far the simplest solution for content creators.
    + Gives content creator full control.

  * Explicit sentence tag that surrounds each sentence (and some
    associated CSS to format it).
    + The most "traditional" solution.
    + The only solution here that fully marks the entire sentence,
      not just the end or gap so there is no extra processing to
      find the beginning of a sentence.  (Consider this a minus
      on all the other approaches, even though I didn't list it.)
    - Very tedious to do by hand.
    + A dedicated tag could spur implementation of HTML editors
      that mark the text for you.

  * Dedicated tag to mark the gap between sentences.
    - Somehow this just seems weird to me, a tag whose only
      purpose is to contain a space.
    + An easy substitution to make in an editor or post-processor.

  * New entity that marks the ends of sentences, or the gap between
    two adjacent sentences (and associated CSS to manipulate it).
    + An easy substitution to make in an editor or post-processor.

  * New Unicode character that provides an unambiguous full stop.
    + The language could really use it.
    - There are also ambiguous cases for the question mark and the
      exclamation point, notably when used in a quote within a
      sentence, but also other oddballs too.
    - Don't hold your breath getting anything into Unicode.
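
  To illustrate the difference between guessing at sentences and
  trusting two spaces, here is a small Python sketch.  It is not a
  proposed algorithm: the regular expressions, the function names, and
  the marker string passed to mark_gaps are all assumptions made up for
  this example.  It ignores cases such as closing quotes or
  parentheses after the punctuation.

```python
import re

# Naive guess: terminal punctuation followed by one or more spaces.
NAIVE_GAP = re.compile(r"(?<=[.!?]) +")
# Author-controlled rule: terminal punctuation followed by two or more spaces.
TWO_SPACE_GAP = re.compile(r"(?<=[.!?]) {2,}")


def split_naive(text):
    """Split wherever terminal punctuation is followed by any space."""
    return [s for s in NAIVE_GAP.split(text.strip()) if s]


def split_two_space(text):
    """Split only where the author typed two or more spaces."""
    return [s for s in TWO_SPACE_GAP.split(text.strip()) if s]


def mark_gaps(text, marker):
    """Replace each two-space sentence gap with a caller-supplied marker.

    The marker stands in for whatever tag or entity might be chosen;
    no such tag or entity exists today.
    """
    return TWO_SPACE_GAP.sub(lambda match: marker, text)


if __name__ == "__main__":
    sample = "Dr. Smith arrived.  He sat down."
    print(split_naive(sample))
    print(split_two_space(sample))
    print(mark_gaps(sample, "[GAP]"))
```

  On the sample text the naive version breaks after "Dr." while the
  two-space version does not, which is the kind of control the
  content creator gets from the two-space rule.  The mark_gaps
  function shows why a gap tag or entity would be an easy
  substitution for an editor or post-processor.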

Thanks for your careful consideration in this matter,

      tom
Received on Thursday, 10 January 2013 19:33:01 UTC