Re: ASCII Art

Len,

Yes, I think that looking at the statistical properties of the characters
relative to the language, as you suggest, would be a better method of
detection. Unfortunately, our intern has left the university for a real job
and we haven't got anyone to follow up on this. I'm hoping that the simple
rules already documented should detect at least 90% of ASCII art. Once I get
it coded, I'll start checking sites to see how it works.

Chris


----- Original Message -----
From: Leonard R. Kasday <kasday@acm.org>
To: Chris Ridpath <chris.ridpath@utoronto.ca>
Cc: <w3c-wai-er-ig@w3.org>
Sent: Tuesday, January 04, 2000 5:00 PM
Subject: Re: ASCII Art


> A more general way to find ASCII Art would be to use statistical
properties
> of English (or whatever language is in use).  For example, if you look at
> the frequency of letter pairs, some are relatively high like "th" and some
> are relatively low, like "mq".
>
> There are lots of refs on this.   It's a classic topic.  If you're into
> 50's style Experimental Psychology, You can find references in any
> intermediate psych textbook that deals with "information theory".
Garner's
> a good author.  For the Engineering inclined, check out elmentary
> infromation theory textbooks. Computer science fans can check out
> compression theory.  Cryptography devotees can check out elementary
methods
> for substitution cyphers.   As you see, it's used all over place.
>
> And there are standard statistical tests to see if distributions
> match.  See any intermediate stat book.
>
> So if you compare the contents of <PRE> or <XMP> with the statistics for
> English (or whatever language is in use) and the match is poor, it's
> probably ACSII Art.  Unless you have someone who likes to write long
> strings of Acronyms.  But hey, acronyms are arguably ACSII Art in a sense
> anyway.
>
> So you may want to set your intern loose on this approach...
>
> Actually, what you really want are statistics that take into account use
of
> other characters like underlines, spaces, other ACSII characters.  The
> sorts of things that showed up in the ad hoc rules in
> http://www.w3.org/WAI/ER/IG/ert/AsciiArt.htm.  So what you really want is
a
> program that just does those statistics, which you can turn loose on
> ordinary web pages, and get distributions to compare against.
>
> (of course you can also check statistics of strings of 3, 4, 5... letters
> but for this purpose I bet 2 is enough.)
>
> Len
>
> p.s.
> These statistics can also be used to check what language something is
> written in.
>
>
> At 03:38 PM 1/3/00 -0500, you wrote:
> >For technique 1.1.K (http://www.w3.org/WAI/ER/IG/ert/#Technique1.1.K) we
> >need to determine if a page contains ASCII art. Our intern had a look at
a
> >how ASCII art is used on the web and prepared the following document:
> >
> >http://www.w3.org/WAI/ER/IG/ert/AsciiArt.htm
> >
> >Note: this does not deal with emoticons " :) " etc.
> >
> > >From this report I think we can create an algorithm that will reliably
test
> >a page for ASCII art. I'll code something this week and test it on
several
> >sites.
> >
> >Chris
> >
>
> -------
> Leonard R. Kasday, Ph.D.
> Institute on Disabilities/UAP, and
> Department of Electrical Engineering
> Temple University
> 423 Ritter Annex, Philadelphia, PA 19122
>
> kasday@acm.org
> http://astro.temple.edu/~kasday
>
> (215) 204-2247 (voice)
> (800) 750-7428 (TTY)

Received on Wednesday, 5 January 2000 10:08:48 UTC