- From: David Rennie Hinshelwood <hinsheld@crl.nmsu.edu>
- Date: Wed, 19 Jul 2000 23:18:30 -0600
- To: "Georgopoulos N" <ngeorgopoulos@city.ac.uk>
- Cc: "'Html-Tidy" <html-tidy@w3.org>
- Message-ID: <NEBBJDDLAMECEDNEDDLKMECMCBAA.hinsheld@crl.nmsu.edu>
I used JTidy on NT4.0 with the following config settings: EncloseText = false FixBackslash = true FixComments = true HideEndTags = false IndentAttributes = false IndentContent = false KeepFileTimes = true LogicalEmphasis = false MakeClean = false NumEntities = false OnlyErrors = false Quiet = false QuoteAmpersand = true QuoteMarks = false QuoteNbsp = flase RawOut = false ShowWarnings = true Slidestyle = null SmartIndent = false Spaces = 2 Tabsize = 4 TidyMark = true UpperCaseAttrs = false UpperCaseTags = false Word2000 = false WrapAsp = true WrapAttVals = false WrapJste = true Wraplen = 68 WrapPhp = true WrapScriptlets = false WrapSection = true Writeback = false XHTML = true XmlOut = false XmlPi = false XmlPIs = false XmlTags = false Most of these are standard. The output worked (attached) but because of the char set, JTidy really messed up. The formatting is hairy and the chars displayed are intelligible. I have come across this problem with all char sets except the 4 supported by JTidy. This is the main problem facing my use of JTidy to pull apart web pages for translation and summarization. To be honest it's absolutely key to the project I'm working on. Any suggestions by yourself or anybody on the list will be much appreciated. p.s. to get file output from JTidy: File f_html = new File("anyfilename" + ".html"); //creats file to output to FileOutputStream FOS_html = new FileOutputStream(f_html); //creats output stream associated with file InputStream input_html = new URL("http://" + URLtoParse).openStream(); //grabs web page and pipes through an inputstream Tidy t = new Tidy(); //instantiates jtidy t.parse(input_html, FOS_html); //calls jtidy Using java.io and java.net Again any help with the char set problem would defiantly help. David Hinshelwood CRL NMSU Tel: (505) 646 3342 (office) (505) 645 5537 (home) -----Original Message----- From: html-tidy-request@w3.org [mailto:html-tidy-request@w3.org]On Behalf Of Georgopoulos N Sent: Wednesday, July 19, 2000 12:55 PM To: html-tidy@w3.org Subject: JTidy and java.lang.NullPointerException Hi, I have tried to parse the following site and create a DOM tree: http://www.in.gr/ with just the following settings: tidy.setMakeClean(true); tidy.setXmlTags(false); tidy.setBreakBeforeBR(true); and I get the following message: Tidy (vers 30th April 2000) Parsing "InputStream" line 11 column 1 - Warning: <link> lacks "type" attribute line 13 column 1 - Warning: <script> lacks "type" attribute line 205 column 140 - Warning: <img> lacks "alt" attribute line 209 column 2 - Warning: <table> lacks "summary" attribute line 215 column 9 - Warning: <table> lacks "summary" attribute line 221 column 64 - Warning: <img> lacks "alt" attribute line 231 column 45 - Warning: <img> lacks "alt" attribute line 235 column 94 - Warning: <img> lacks "alt" attribute line 251 column 7 - Warning: <table> lacks "summary" attribute line 255 column 38 - Warning: <img> lacks "alt" attribute line 257 column 93 - Warning: <img> lacks "alt" attribute line 267 column 13 - Warning: <table> lacks "summary" attribute line 289 column 29 - Warning: <img> lacks "alt" attribute line 299 column 20 - Warning: missing <td> line 317 column 7 - Warning: <table> lacks "summary" attribute line 323 column 13 - Warning: <table> lacks "summary" attribute line 327 column 56 - Warning: <img> lacks "alt" attribute line 331 column 15 - Warning: missing <tr> line 331 column 58 - Warning: <img> lacks "alt" attribute line 333 column 60 - Warning: <img> lacks "alt" attribute line 355 column 19 - Warning: <img> lacks "alt" attribute java.lang.NullPointerException at java.text.MessageFormat.format(Compiled Code) at java.text.MessageFormat.format(Compiled Code) at java.text.Format.format(Compiled Code) at java.text.MessageFormat.format(Compiled Code) at org.w3c.tidy.Report.attrError(Compiled Code) at org.w3c.tidy.Lexer.parseAttrs(Compiled Code) at org.w3c.tidy.Lexer.getToken(Compiled Code) at org.w3c.tidy.ParserImpl$ParseInline.parse(Compiled Code) at org.w3c.tidy.ParserImpl.parseTag(Compiled Code) at org.w3c.tidy.ParserImpl.access$071(Compiled Code) at org.w3c.tidy.ParserImpl$ParseBlock.parse(Compiled Code) at org.w3c.tidy.ParserImpl.parseTag(Compiled Code) at org.w3c.tidy.ParserImpl.access$071(Compiled Code) at org.w3c.tidy.ParserImpl$ParseRow.parse(Compiled Code) at org.w3c.tidy.ParserImpl.parseTag(Compiled Code) at org.w3c.tidy.ParserImpl.access$071(Compiled Code) at org.w3c.tidy.ParserImpl$ParseTableTag.parse(Compiled Code) at org.w3c.tidy.ParserImpl.parseTag(Compiled Code) at org.w3c.tidy.ParserImpl.access$071(Compiled Code) at org.w3c.tidy.ParserImpl$ParseBlock.parse(Compiled Code) at org.w3c.tidy.ParserImpl.parseTag(Compiled Code) at org.w3c.tidy.ParserImpl.access$071(Compiled Code) at org.w3c.tidy.ParserImpl$ParseRow.parse(Compiled Code) at org.w3c.tidy.ParserImpl.parseTag(Compiled Code) at org.w3c.tidy.ParserImpl.access$071(Compiled Code) at org.w3c.tidy.ParserImpl$ParseTableTag.parse(Compiled Code) at org.w3c.tidy.ParserImpl.parseTag(Compiled Code) at org.w3c.tidy.ParserImpl.access$071(Compiled Code) at org.w3c.tidy.ParserImpl$ParseBlock.parse(Compiled Code) at org.w3c.tidy.ParserImpl.parseTag(Compiled Code) at org.w3c.tidy.ParserImpl.access$071(Compiled Code) at org.w3c.tidy.ParserImpl$ParseRow.parse(Compiled Code) at org.w3c.tidy.ParserImpl.parseTag(Compiled Code) at org.w3c.tidy.ParserImpl.access$071(Compiled Code) at org.w3c.tidy.ParserImpl$ParseTableTag.parse(Compiled Code) at org.w3c.tidy.ParserImpl.parseTag(Compiled Code) at org.w3c.tidy.ParserImpl.access$071(Compiled Code) at org.w3c.tidy.ParserImpl$ParseBody.parse(Compiled Code) at org.w3c.tidy.ParserImpl.parseTag(Compiled Code) at org.w3c.tidy.ParserImpl.access$071(Compiled Code) at org.w3c.tidy.ParserImpl$ParseHTML.parse(Compiled Code) at org.w3c.tidy.ParserImpl.parseDocument(Compiled Code) at org.w3c.tidy.Tidy.parse(Compiled Code) at org.w3c.tidy.Tidy.parseDOM(Tidy.java:1152) The same message java.lang.NullPointerException appeared after parsing other sites too. Is there anyway to avoid it in general (and not only to catch it)? Should I change any setting? And a second question that I would be grateful if someone could help: How can I redirect the output of JTidy to save it for example in a StringBuffer? Is there any particular I/O stream that I can use? Thank you very much Nikos
Attachments
- text/html attachment: www.in.gr.html
Received on Thursday, 20 July 2000 01:15:16 UTC