Enlisting the help of HTML mails

Greetings,

  We are still trying to identify quoted and reformattable text parts in
mails. It was clear from the beginning that, since some people send
HTML-only mails, even as replies in list discussions, I would probably
have to handle them in some form. But common HTML-to-text implementations
are horrendously bad and cause various anomalies. That includes mailers
that "generate HTML and text alternatives from internal data", where you
get interesting issues like text that "is not marked as quoted in the
HTML but the plain text alternative has a not-manually-entered `>` in
front". So I figure the HTML parts need some more attention.

Interestingly, when I started looking into this I did do some cursory,
manual comparisons between plain text versions and HTML versions, and to
my surprise the HTML versions were often in no better shape than the
text versions, so I did not put much faith into that. Even in theory it
may well be that sometimes the HTML version is much better than the text
one while at other times the plain text version is much better. There is
also a lack of good automated HTML-to-text formatting tools, a big
problem being that you would want to put a `>` in front of
`<blockquote>` lines.

In any case, for these issues it would be helpful to be able to compare
plain text versions and HTML alternatives. But last I looked into HTML
diff utilities, when I created <http://www.w3.org/wiki/HtmlDiff>, things
did not look good, to put it mildly. So I figured if it involves diffing
and HTML, I am going to have a bad time. Then again, perhaps it's not so
difficult. So one thing I made is <https://gist.github.com/7126471>.

So that the script can be used stand-alone, it takes just an HTML
document as input and calls `lynx` to get a derived plain text version.
It then uses XML::LibXML's HTML parser to get a DOM representation. Next
it splits the text version into tokens, and likewise splits the DOM text
nodes into tokens (trying not to pick up ones in `<script>` and `<head>`
and so on). Finally it uses the familiar `sdiff` function to match the
two lists of tokens up with one another, so every token in the plain
text version can know the corresponding text node in the DOM of the HTML
version.
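
The matching step can be sketched in a few lines. This is not the script
from the gist (which is Perl and uses XML::LibXML and `sdiff`); it is a
minimal Python sketch that assumes the standard library's `html.parser`
and `difflib.SequenceMatcher` as stand-ins, and a guessed tokenizer. The
names `TextNodes` and `align` are made up for illustration. Rows use the
markers `u` (unchanged), `c` (changed), `-` (only in the HTML) and `+`
(only in the text), and each HTML token remembers whether it sits under
a `<blockquote>`.

```python
import re
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Assumed tokenizer: runs of ASCII letters/digits, or one non-space character.
TOKEN = re.compile(r"[A-Za-z0-9]+|\S")
SKIP = {"script", "style", "head"}                    # text in these is ignored
VOID = {"br", "hr", "img", "meta", "link", "input"}   # elements without end tag


class TextNodes(HTMLParser):
    """Collect (token, inside_blockquote) pairs from visible text nodes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []    # currently open elements
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            while self.stack.pop() != tag:   # also drops unclosed children
                pass

    def handle_data(self, data):
        if SKIP.intersection(self.stack):
            return
        quoted = "blockquote" in self.stack
        self.tokens.extend((tok, quoted) for tok in TOKEN.findall(data))


def align(html, text):
    """Return sdiff-like rows (marker, html_token, text_token, quoted)."""
    parser = TextNodes()
    parser.feed(html)
    parser.close()
    html_toks = parser.tokens
    text_toks = TOKEN.findall(text)
    matcher = SequenceMatcher(None, [t for t, _ in html_toks], text_toks,
                              autojunk=False)
    rows = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        h, t = html_toks[i1:i2], text_toks[j1:j2]
        n = min(len(h), len(t))
        mark = "u" if op == "equal" else "c"
        rows += [(mark, a, b, q) for (a, q), b in zip(h[:n], t[:n])]
        rows += [("-", a, None, q) for a, q in h[n:]]      # HTML only
        rows += [("+", None, b, False) for b in t[n:]]     # text only
    return rows


if __name__ == "__main__":
    demo_html = "<p>1<sup>st</sup> try</p><blockquote>old text</blockquote>"
    for row in align(demo_html, "1st try\n> old text"):
        print(row)
```

A real implementation would want a proper DOM and a more forgiving
treatment of unclosed elements; the stack handling here is only good
enough to answer "is there a blockquote ancestor".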

That way the plain text tokens can have properties like "in the HTML
version this token is a descendant of a blockquote element" or "this
token does not appear in the HTML version" and so on and so forth. One
thing I did with that was to run it over an HTML-heavy list archive,
note changed tokens with non-ascii content, and tabulate them,

   3053 c ’ '
   1808 c “ "
   1799 c ” "
   1149 c – -
    363 c – ­
    324 c ’ ¹
    173 c … .
    161 c ” ²
    159 c “ ³
    124 c · *
     59 c — -

are the most common ones in this data set: mostly asciification and some
encoding errors (disturbingly frequent encoding errors, actually). Note
that this covers only single-token changes. The ellipsis character is
most likely replaced by three full stops, but only one full stop shows
up as a change; the others count as additions, the way I wrote the code.
On an individual message basis it can look like this:

  <63294A1959410048A33AEE161379C8023D02102C2D@SP2-EX07VS02.ds.corp...>
  c 1 1st
  - st
  c ’ '
  c ’ '
  + public-tracking@w3.org<mailto:

One problem here is that there are at least two ways to deal with e.g.

  <x>A<y>B</y></x>

or

  <x>A</x><y>B</y>

It could be handled as "A B" or as "AB" and depending on what `<x>` and
`<y>` are, and what the style sheet says, one or the other behavior may
be best. For that matter, in some cases it may also be best to consider
everything on a single character basis (instead, my code tokenizes with
roughly /[a-z]|./, so punctuation is a single token but ascii words are
possibly very long sequences; that is often the best choice, but not in
all cases).
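
To make the two readings concrete, here is a small Python sketch. The
pattern `WORDS` is my guess at what "roughly /[a-z]|./" amounts to (runs
of ascii letters and digits as one token, anything else as single
characters); `CHARS` and the helper `tokens` are names made up for this
illustration, not something taken from the script.

```python
import re

# Assumed word-level tokenizer: ascii alphanumeric runs, else single characters.
WORDS = re.compile(r"[A-Za-z0-9]+|\S")
# Character-level alternative: every non-space character is its own token.
CHARS = re.compile(r"\S")


def tokens(text_nodes, separator, pattern=WORDS):
    """Join adjacent text nodes with `separator`, then tokenize the result."""
    return pattern.findall(separator.join(text_nodes))


if __name__ == "__main__":
    nodes = ["A", "B"]                    # text nodes of <x>A<y>B</y></x>
    print(tokens(nodes, ""))              # nodes treated as adjacent: "AB"
    print(tokens(nodes, " "))             # nodes treated as separate: "A B"
    print(tokens(["1", "st"], "", CHARS))
    print(tokens(["1", "st"], " ", CHARS))
```

With the word-level pattern the choice of separator changes the token
list, which is what produces spurious changes in the diff. On a single
character basis both readings yield the same tokens, at the price of
much longer lists to match up.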

In the example above, it's a case of `1<sup>st</sup>`. If it had been a
case of `1<p>st</p>` the problem would not show up. Keeping in mind that
the plain text generators are sometimes excruciatingly bad at their job,
there are no clear rules for this, but there is room for improvement. It
would also be possible to specifically scan the `sdiff` output for cases
such as this, but that is not clearly needed as yet.
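
Such a scan could look roughly like the following Python sketch. It
assumes the diff output is available as `(marker, html_token,
text_token)` rows with the markers used above (`u`, `c`, `-`, `+`); both
that row format and the function name `merge_splits` are assumptions for
the sake of the example. A `c` row followed by `-` rows whose HTML tokens
concatenate to the text token is folded into one unchanged row.

```python
def merge_splits(rows):
    """Fold 'c a ab' + '- b' sequences into a single unchanged row."""
    out = []
    i = 0
    while i < len(rows):
        op, html_tok, text_tok = rows[i]
        if op == "c" and text_tok.startswith(html_tok):
            joined, j = html_tok, i + 1
            # Extend with following HTML-only tokens while they still
            # spell out a prefix of the text token.
            while (j < len(rows) and rows[j][0] == "-" and joined != text_tok
                   and text_tok.startswith(joined + rows[j][1])):
                joined += rows[j][1]
                j += 1
            if joined == text_tok and j > i + 1:
                out.append(("u", text_tok, text_tok))
                i = j
                continue
        out.append(rows[i])
        i += 1
    return out


if __name__ == "__main__":
    rows = [("c", "1", "1st"), ("-", "st", None), ("c", "\u2019", "'")]
    print(merge_splits(rows))
```

This only covers the case where the HTML side is split more finely than
the text side; the opposite direction would need the mirror image with
`+` rows.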

The last case is a result of turning 

  <a href='http://example.org'>example website</a>

into something like

  example website<http://example.org>

without considering that's a very stupid thing to do when the input is

  <a href='http://example.org'>http://example.org</a>

as is almost always the case. There are all sorts of "funny" issues with
URLs, as it turns out; one popular instance of text that gets added when
making the plain text version is this:

     30 + <file://localhost/sip/zakim@voip.w3.org
     ...
     13 + <file:///sip/zakim@voip.w3.org
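
For the link case described above, a converter could avoid the
duplication with a simple comparison before appending the target. The
following Python sketch shows one possible rule; `render_link` is a
made-up name, and the normalisation (ignoring a `mailto:` prefix and a
trailing slash) is my assumption about what counts as "the same", not
something any particular mailer is known to do.

```python
def render_link(label, href):
    """Render an <a href=...>label</a> as plain text without repeating itself."""
    label = label.strip()
    target = href.strip()
    # Compare against the address without the mailto: scheme, so that
    # "user@example.org" linked to "mailto:user@example.org" is not doubled.
    bare = target[len("mailto:"):] if target.lower().startswith("mailto:") else target
    if not label:
        return f"<{target}>"
    if label.rstrip("/") == bare.rstrip("/"):
        return label                      # link text already is the target
    return f"{label} <{target}>"


if __name__ == "__main__":
    print(render_link("example website", "http://example.org"))
    print(render_link("http://example.org", "http://example.org"))
    print(render_link("public-tracking@w3.org", "mailto:public-tracking@w3.org"))
```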

Another typical pattern is something like this:

  <CA+Z3oObD7SWU4upN+dXm5gu=k539LKREOVnyj0f94-n3mSBLrA@mail.gmail.com>
  + ********
  + *********
  + *
  + *
  + *
  + *
  + *
  + *
  + *
  + *
  + *
  + ********
  + ********
  + ********
  + ****
  + ****
  + ****
  + ****
  + ********

Right, you see, if you have a `<u>` then that's mapped to `*`, and if
you have a `</u>` that's mapped to another `*`, so with literally

  <u></u><u></u>

you get `****`, so readers of the plain text version can see your
underlined empty strings more clearly. These braindead techniques can
also be combined, for instance,

  http://example.org/**abc-**def<http://example.org/abc-def>

is a common pattern. Here the `<u>` elements may perhaps serve as marks
for wrapping opportunities, but that does not really lower the what-the-
f...actor for anybody.
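
The behaviour inferred above, and one way around it, can be sketched in
Python as follows. Both functions are illustrations with made-up names,
working on a bare string with regular expressions rather than on a DOM,
and the first one only mimics what the converter appears to do; I have
not looked at any actual converter source.

```python
import re


def naive_underline(html):
    """Mimic the apparent rule: every <u> and every </u> becomes a '*'."""
    return re.sub(r"</?u>", "*", html)


def careful_underline(html):
    """Drop <u> elements with no visible content, then mark the rest."""
    # Empty or whitespace-only elements leave just their whitespace behind.
    html = re.sub(r"<u>(\s*)</u>", r"\1", html)
    return re.sub(r"<u>(.*?)</u>", r"*\1*", html, flags=re.S)


if __name__ == "__main__":
    sample = "http://example.org/<u></u>abc-<u></u>def"
    print(naive_underline(sample))
    print(careful_underline(sample))
```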

There are also interesting cases where the text version is abruptly cut
off and lacks much of the text in the HTML variant (consisting of quotes
of quotes of quotes of quotes of quoted text from email terrorists).

One thing I will have to deal with to make use of such information is
mails that pretend to have a plain text alternative but actually do not;
think "your mailer does not support frames^W HTML" style messages,
though I am not sure that's actually ever encoded as an alternative.
With spam, for instance, it's common to have big differences between
HTML and text.

Anyway, the idea would probably be to give little weight to parts that
are not in both the HTML and the text version when determining which
parts are quoted and, as the script above does, to take information like
an element having a blockquote ancestor in the HTML variant as an
indication that the text is in fact quoted (from the parent mail). There
will have to be some research to see how unreliable that is. In the one
test I've tried it worked well...

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 

Received on Friday, 25 October 2013 23:02:48 UTC