- From: Chris Ridpath <chris.ridpath@utoronto.ca>
- Date: Wed, 5 Jan 2000 10:08:30 -0500
- To: "Leonard R. Kasday" <kasday@acm.org>
- Cc: <w3c-wai-er-ig@w3.org>
Len,

Yes, I think that looking at the statistical properties of the characters
relative to the language, as you suggest, would be a better method of
detection. Unfortunately, our intern has left the university for a real job
and we haven't got anyone to follow up on this.

I'm hoping that the simple rules already documented should detect at least
90% of ASCII art. Once I get it coded, I'll start checking sites to see how
it works.

Chris

----- Original Message -----
From: Leonard R. Kasday <kasday@acm.org>
To: Chris Ridpath <chris.ridpath@utoronto.ca>
Cc: <w3c-wai-er-ig@w3.org>
Sent: Tuesday, January 04, 2000 5:00 PM
Subject: Re: ASCII Art

> A more general way to find ASCII Art would be to use statistical properties
> of English (or whatever language is in use). For example, if you look at
> the frequency of letter pairs, some are relatively high like "th" and some
> are relatively low, like "mq".
>
> There are lots of refs on this. It's a classic topic. If you're into
> 50's style Experimental Psychology, you can find references in any
> intermediate psych textbook that deals with "information theory". Garner's
> a good author. For the Engineering inclined, check out elementary
> information theory textbooks. Computer science fans can check out
> compression theory. Cryptography devotees can check out elementary methods
> for substitution cyphers. As you see, it's used all over the place.
>
> And there are standard statistical tests to see if distributions
> match. See any intermediate stat book.
>
> So if you compare the contents of <PRE> or <XMP> with the statistics for
> English (or whatever language is in use) and the match is poor, it's
> probably ASCII Art. Unless you have someone who likes to write long
> strings of Acronyms. But hey, acronyms are arguably ASCII Art in a sense
> anyway.
>
> So you may want to set your intern loose on this approach...
>
> Actually, what you really want are statistics that take into account use of
> other characters like underlines, spaces, other ASCII characters. The
> sorts of things that showed up in the ad hoc rules in
> http://www.w3.org/WAI/ER/IG/ert/AsciiArt.htm. So what you really want is a
> program that just does those statistics, which you can turn loose on
> ordinary web pages, and get distributions to compare against.
>
> (of course you can also check statistics of strings of 3, 4, 5... letters
> but for this purpose I bet 2 is enough.)
>
> Len
>
> p.s.
> These statistics can also be used to check what language something is
> written in.
>
>
> At 03:38 PM 1/3/00 -0500, you wrote:
> >For technique 1.1.K (http://www.w3.org/WAI/ER/IG/ert/#Technique1.1.K) we
> >need to determine if a page contains ASCII art. Our intern had a look at
> >how ASCII art is used on the web and prepared the following document:
> >
> >http://www.w3.org/WAI/ER/IG/ert/AsciiArt.htm
> >
> >Note: this does not deal with emoticons " :) " etc.
> >
> >From this report I think we can create an algorithm that will reliably test
> >a page for ASCII art. I'll code something this week and test it on several
> >sites.
> >
> >Chris
> >
>
> -------
> Leonard R. Kasday, Ph.D.
> Institute on Disabilities/UAP, and
> Department of Electrical Engineering
> Temple University
> 423 Ritter Annex, Philadelphia, PA 19122
>
> kasday@acm.org
> http://astro.temple.edu/~kasday
>
> (215) 204-2247 (voice)
> (800) 750-7428 (TTY)
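
The letter-pair comparison Len describes can be sketched in Python roughly as
follows. This is only an illustration and is not code from the thread. The
function names, the sample reference text, and the 0.8 threshold are all
assumptions made for the example. A real checker would presumably build its
reference distribution from many ordinary web pages, as Len suggests, and
might use a standard test such as chi-square where this sketch uses total
variation distance.

```python
from collections import Counter

# Hypothetical stand-in for statistics gathered from ordinary pages.
REFERENCE_SAMPLE = (
    "the quick brown fox jumps over the lazy dog and then the other dog "
    "runs after the fox through the thick green woods"
)


def pair_distribution(text: str) -> dict:
    """Relative frequency of each adjacent character pair in text.

    Spaces, underlines and other symbols are counted on purpose, since
    those are the characters that tend to mark ASCII art.
    """
    lowered = text.lower()
    pairs = Counter(lowered[i:i + 2] for i in range(len(lowered) - 1))
    total = sum(pairs.values())
    if total == 0:
        return {}
    return {pair: count / total for pair, count in pairs.items()}


def distribution_distance(p: dict, q: dict) -> float:
    """Total variation distance: 0.0 is identical, 1.0 is no shared pairs."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def looks_like_ascii_art(block: str, reference: dict,
                         threshold: float = 0.8) -> bool:
    """Flag a <PRE>/<XMP> block whose pair statistics match the language poorly.

    The threshold is an arbitrary placeholder and would need tuning
    against real pages.
    """
    return distribution_distance(pair_distribution(block), reference) > threshold


if __name__ == "__main__":
    reference = pair_distribution(REFERENCE_SAMPLE)
    box = "+----+\n|    |\n+----+"
    sentence = "the dog runs after the fox then the fox jumps over the dog"
    print(looks_like_ascii_art(box, reference))
    print(looks_like_ascii_art(sentence, reference))
```

Longer strings (3, 4, 5... characters) could be handled by changing the
slice width, though as Len notes pairs are probably enough for this purpose.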
Received on Wednesday, 5 January 2000 10:08:48 UTC