>>>Danny Ayers said:
> <- However, the resulting files generally aren't usually legal Unicode
> <- or thus legal XML, so probably your XML/RDF parser will crash and
> <- burn afterwards on the output anyway if it doesn't get blown away by
> <- memory leaks/growth.
>
> This really seems like a productive area ;-)

Well, some parsers, such as the C ones (Jason Diamond's Repat and my
Rapier), can handle all the data up to the illegal character sequences
and are limited only by I/O speed, not memory. The Java ones all tend
to leak until they collapse, unless your machine has oodles of memory.

> <- Small enough to enclose below (also deletes Adult area for less
> <- embarrassing demos!)
>
> Damn fine idea. I don't speak Perl, what's going on with the 3 values?

As it says:
> <- # 0 - before first Adult topic
> <- # 1 - during Adult topics
> <- # 2 - afterwards

So it has three states in processing. Not really too important...
but best not to create an RDF pr0n database.

Incidentally, Alberto pointed out the original sed code I converted
  http://www-diglib.stanford.edu/diglib/ginf/download/dmoz/
but that runs 10-20x slower than the Perl version.

Dave

Received on Tuesday, 27 March 2001 13:43:35 GMT
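
For readers who don't speak Perl, the three states described above can be sketched in Python. This is not the script from the thread, which isn't reproduced here; it is a minimal illustration under two assumptions: that a topic starts on a line containing `<Topic r:id="...">`, and that the Adult topics form one contiguous run in the dump. The function name `strip_adult` and the regular expression are hypothetical.

```python
import re

# The three states named in the script's comments.
BEFORE, DURING, AFTER = 0, 1, 2

# Assumed shape of a topic start line; the real dump may differ.
TOPIC_START = re.compile(r'<Topic r:id="([^"]*)"')


def is_adult(topic_id):
    """True for the Adult root topic and anything beneath it."""
    return topic_id == "Top/Adult" or topic_id.startswith("Top/Adult/")


def strip_adult(lines):
    """Yield the input lines, dropping the contiguous run of Adult topics."""
    state = BEFORE
    for line in lines:
        # Once past the Adult run there is nothing left to detect,
        # so the pattern match is skipped entirely in state AFTER.
        if state != AFTER:
            match = TOPIC_START.search(line)
            if match:
                adult = is_adult(match.group(1))
                if state == BEFORE and adult:
                    state = DURING   # first Adult topic seen
                elif state == DURING and not adult:
                    state = AFTER    # first non-Adult topic after the run
        if state != DURING:
            yield line
```

Because it works line by line and keeps only a single integer of state, a filter like this needs constant memory regardless of the size of the dump, for example `sys.stdout.writelines(strip_adult(sys.stdin))`.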