I'm trying to use JTidy to convert HTML pages to
XML. The HTML has several 'un-tagged' entries. For
example:
<P><A
name=Hit3><B>3.</B></A> <A href="http://www.matrixscience.com/cgi/protein_view.pl?file=../data/20020130/FaioSfs.dat&hit=4">gi|11528046</A>
<B>Mass:</B> 74711 <B>Score:</B>
43
(AF197556) coat protein [Beet necrotic yellow
vein virus]
<B> Observed Mr(expt)
Mr(calc) Delta Start
End Miss Peptide</B>
564.70 564.70
565.25 -0.55 168 -
171 0 FEDR
828.00 828.00
828.51 -0.51 44
- 51 0 AANLSIIK
1032.30 1032.30
1032.56 -0.26 509 -
519 0 AAVAMTALASK
2271.60 2271.60
2271.16 0.44 556
- 578 0
YVHTGIQGGAQLAGAMAVGAMLR
<B>No match to:</B> 1021.10,
3511.70
Is there an easy way to get JTidy to 'tag' the
un-tagged text? For example, the text between the <B>'s? I'd
rather not right a java program to tag these lines prior to sending it to
JTidy.
I'm setting the following params on
JTidy:
tidy.setMakeClean(true);
tidy.setBreakBeforeBR(true);
tidy.setShowWarnings(false);
tidy.setOnlyErrors(true);
Thanks.