
parseURI and others...

From: Anselm Baird-Smith <abaird@w3.org>
Date: Fri, 6 Sep 1996 09:16:50 +0500
Message-Id: <9609061316.AA14223@www18.w3.org>
To: Viktor <viktor@irisz.hu>
Cc: "'www-jigsaw@w3.org'" <www-jigsaw@w3.org>
viktor@irisz.hu writes:
 > Instead of fixing the bug, I wrote a new one,
 > which uses a regexp package by... hmmmmm...
 > Jonathan Payne, from Starwave Co.
 > I lost the original URL, and he didn't
 > write it in his docs. I will look it up in a search engine..
 > So it's: http://www.starwave.com/people/jpayne/java/

Yep, I know about that.

 > Maybe it doesn't fit your needs; it's surely slower
 > than parsing it 'manually'.

Yes, and as this is a sensitive place, I would rather stick to
something "fast".

 > Actually, the header extraction in HTMLResource uses
 > this too (match /<title>\s*(.*)\s*</title>/), but it's terribly
 > slow for files *not* having a title, even when
 > I limited it to the first 1K of the file...
 > So now I'm looking for another solution, gathering
 > information and experimenting with some mystic
 > tools like jax/jell/cup (flex/lalr) etc... (they are mystic to me
 > at least; I'm trying to discover which is for what...).
 > Since I'm using Perl, I really hate writing this kind of
 > thing with a lot of while/if/break/else/continue...
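
For comparison with the regexp approach quoted above, here is a minimal sketch of a hand-written title scan in Java. It is not the HTMLResource code and not the Starwave package: the class name TitleScan, the method names, and the LIMIT constant are all hypothetical, and the 1024-character cap only mirrors the "first 1K" mentioned in the quote. It assumes the start of the file has already been decoded into a String.

```java
class TitleScan {
    // Hypothetical cap, mirroring the "first 1K" limit mentioned in the quoted text.
    static final int LIMIT = 1024;

    // Case-insensitive indexOf: tests each start position once, left to right.
    static int indexOfIgnoreCase(String s, String needle, int from) {
        int last = s.length() - needle.length();
        for (int i = Math.max(from, 0); i <= last; i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns the trimmed title text, or null when no complete
    // <title>...</title> pair lies within the first LIMIT characters.
    static String extractTitle(String html) {
        String head = html.length() > LIMIT ? html.substring(0, LIMIT) : html;
        int open = indexOfIgnoreCase(head, "<title", 0);
        if (open < 0) {
            return null; // no title: one forward pass, then give up
        }
        int start = head.indexOf('>', open);
        if (start < 0) {
            return null;
        }
        int end = indexOfIgnoreCase(head, "</title", start + 1);
        if (end < 0) {
            return null;
        }
        return head.substring(start + 1, end).trim();
    }
}
```

The sketch deliberately ignores comments, scripts, and tags that merely begin with "title", so it is closer to the 'manual' parsing mentioned earlier than to a real HTML parser.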

How about using the avh HTML parser (let me know if you need the
URL)? It's pretty flexible, and quite fast. Note that parsing the file
should be done in FileResource.updateFileAttributes, so that it gets
done only once, when the file (i.e. the content) of the resource
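
The "parse only once, when the file changes" idea can be sketched on its own, outside Jigsaw. The message does not show the signature of FileResource.updateFileAttributes, so the example below does not try to reproduce it: CachedFileAttribute and its members are hypothetical names, and the file's last-modified time stands in as an assumed change signal.

```java
import java.io.File;
import java.util.function.Function;

// Hypothetical helper, not part of Jigsaw: caches a value derived from a
// file and recomputes it only when the file's modification time changes.
class CachedFileAttribute<T> {
    private final File file;
    private final Function<File, T> parser;
    private long stamp = -1L; // -1 means "never parsed yet"
    private T value;

    CachedFileAttribute(File file, Function<File, T> parser) {
        this.file = file;
        this.parser = parser;
    }

    synchronized T get() {
        // File.lastModified() returns 0L if the file does not exist.
        long modified = file.lastModified();
        if (modified != stamp) {
            value = parser.apply(file); // the expensive parse runs here only
            stamp = modified;
        }
        return value;
    }
}
```

With this shape, a file without a title costs one parse rather than one parse per request, because a null result is cached just like any other value.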

Received on Friday, 6 September 1996 09:17:39 UTC
