Identifying non-flow text

So,

  I started the other day to list criteria for recognising text that is
fixed, pre-formatted, possibly list-like, and that should not be
reflowed. I did not get very far, but...

  * Repeating prefixes and repeating suffixes that are not shared with 
    other pieces or original text. The most common prefix would be white
    space for indented code or other text, but there are also a number 
    of other cases, like lists of references

      [1] ...
      [2] ...

    and other forms of code have similar patterns, like

      <example>
        <... />
        <... />
      </example>

    and it might be worth treating matching punctuation as matches,

      {
        ...
      }

    and for suffixes there are things like

      color: red;
      background-color: white;
      margin: 0; /* ... */
      padding: 0; /* ... */

    where ";" and "*/" are repeating suffixes.

  * Similarly, there is alignment, when you get the same character in
    the same column position in many lines, which might be considered
    repeating infixes,

      example => 1,
      other   => 2,
      ...

    where there is alignment based on the " => " infix, in addition to
    the repeating suffix ",".

  * Similarly, the notion of repeating infixes can be extended beyond
    looking for aligned repetitions, for instance, Outlook blurbs have

      From: ...
      To: ...
      Date: ...
      Subject: ...

    the repeating infix ": ". A simple first step would be to count the
    "words" instead of the characters and then check for alignment. In
    my "data model" I look for URIs as "words", and a typical case where
    you would not want to reflow is when lines consist of almost nothing
    but a single URI (ends-with-URI suffix).

  * Similarly, a simple tolerance could be to accept white space in
    place of a sequence of non-white-space characters as a weak
    criterion, which would more easily detect wrapped list items

      1) ### ### ### ### ### ### ### ### ### ### ### ### ### ###
         ### ### ### ### ### ### ### ### ### ### ### ### ### ###

      2) ### ### ### ### ### ### ### ### ### ### ### ### ### ###
         ### ### ### ### ### ### ### ### ### ### ### ### ### ###

    which technically could be easily reflowed in many cases, but it'd
    often look a bit messy. It's also possible that this would be taken
    care of by recognizing that prefixes repeat not from line to line
    but within the block.

  * A big one is line length. Lines with fixed formatting tend to be a
    good bit shorter than flowed text, and when flowed text is ordinarily
    wrapped, long lines similarly indicate fixed formatting. This is a
    bit of a chicken-and-egg problem though, so on its own it is a very
    weak factor; it works best when you have a lot of both in a single
    mail. However, combined with the other indicators this is probably a
    very good verifier.

  * Relatedly, white space distribution is interesting. If you know the
    typical line length of original text, you can extend other "areas"
    of text to roughly that length and look at the white space and you
    will find that short lines have a lot of white space "to the right"
    and a lot of white space overall, especially relative to "letters".
    White space prefixes, that is, indentation, work the same way, and
    other forms of alignment are the same. To recall an earlier example,

      example => 1,
      other   => 2,
      ...

    That has a sequence of three spaces, which would be very unusual
    in flowed text. Another example would be cases of no inter-word
    white space,

      * ABC
      * DEF
      * GHI

    And frequent and unusually short sequences of non-white-space

      f 0 _ = 0
      f _ 0 = 0
      f x y = x + y

    Or unusually long words

      example = new AbstractSingletonProxyFactoryBean();

    Or for that matter,

      document.getElementsByTagNameNS("http://www.w3.org/2000/svg", ...

  * Distribution of other characters can also be interesting, which can
    be as simple as looking at the letter to non-letter ratio...

  * ...
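
A few of the criteria above (repeating suffixes, aligned infixes, and
unusual runs of white space) are easy to sketch in Python. This is only
a rough illustration under stated assumptions: the function names are
made up for the example, a "suffix" is assumed to be the trailing run of
punctuation on a line, and "repeating" is assumed to mean "seen on at
least two lines".

```python
import re
from collections import Counter


def trailing_punctuation(line):
    """Return the run of punctuation at the end of a line, or ''.

    Assumption: a "suffix" is the final run of characters that are
    neither word characters nor white space.
    """
    match = re.search(r"[^\w\s]+$", line.rstrip())
    return match.group(0) if match else ""


def repeating_suffixes(lines):
    """Map each punctuation suffix seen on at least two lines to its count."""
    counts = Counter(trailing_punctuation(line) for line in lines)
    return {s: n for s, n in counts.items() if s and n >= 2}


def aligned_columns(lines):
    """Return (column, char) pairs where every line has the same
    non-white-space character in the same column position."""
    if not lines:
        return []
    width = min(len(line) for line in lines)
    hits = []
    for col in range(width):
        chars = {line[col] for line in lines}
        if len(chars) == 1:
            (ch,) = chars
            if not ch.isspace():
                hits.append((col, ch))
    return hits


def longest_space_run(line):
    """Length of the longest run of spaces inside a line.

    Leading indentation and trailing white space are ignored.
    """
    runs = re.findall(r" +", line.strip())
    return max((len(run) for run in runs), default=0)


css = [
    "color: red;",
    "background-color: white;",
    "margin: 0; /* ... */",
    "padding: 0; /* ... */",
]
pairs = [
    "example => 1,",
    "other   => 2,",
]

print(repeating_suffixes(css))
print(aligned_columns(pairs))
print(longest_space_run(pairs[1]))
```

Note that aligned_columns also reports shared prefix and suffix
characters, since those are aligned by definition; a real detector
would probably want to treat those cases separately.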

There are various false positives to be aware of:

  * Some people indent all their paragraphs, and some of them also use 
    overly long lines, so you get a repeating pattern but the text is
    actually reflowable.

  * Some people use code-like markers to indicate the text they wrote,
    something like this for instance

      <john>
      ...
      </john>

    Similarly,

      [john] ...
      [john] ...

  * ...
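
The indented-paragraph false positive suggests checking a repeating
white space prefix against line length and the letter to non-letter
ratio before deciding a block is fixed. A minimal Python sketch of that
idea follows; the function names, the typical width of 72 and both
thresholds (0.8 and 0.7) are assumptions picked for illustration, not
measured values.

```python
import statistics


def letter_ratio(line):
    """Share of the non-white-space characters in a line that are letters."""
    visible = [ch for ch in line if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch.isalpha() for ch in visible) / len(visible)


def looks_like_indented_prose(lines, typical_width=72):
    """Guess whether an indented block is ordinary, reflowable prose.

    Assumptions: typical_width is the usual line length of flowed text
    in the mail, and the 0.8 and 0.7 thresholds are arbitrary.
    """
    body = [line.rstrip() for line in lines if line.strip()]
    if len(body) < 2:
        return False
    # Skip the last line: the final line of a paragraph is often short.
    lengths = [len(line) for line in body[:-1]]
    long_enough = statistics.median(lengths) >= 0.8 * typical_width
    wordy = all(letter_ratio(line) >= 0.7 for line in body)
    return long_enough and wordy


prose = [
    "    " + "word " * 14 + "word",
    "    " + "word " * 14 + "word",
    "    word word word",
]
code = [
    "    f 0 _ = 0",
    "    f _ 0 = 0",
    "    f x y = x + y",
]

print(looks_like_indented_prose(prose))
print(looks_like_indented_prose(code))
```

This would not help with the <john> or [john] markers; those probably
need a check on what sits between or after the markers instead.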

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 

Received on Sunday, 17 March 2013 00:16:28 UTC