On Pre-processing and Other Issues from basel shishani on 2001-04-11 (html-tidy@w3.org from April to June 2001)

From: basel shishani <bshishani@yahoo.com>
Date: Wed, 11 Apr 2001 06:52:26 -0400 (EDT)
To: html-tidy@w3.org
Message-ID: <20010411105217.34510.qmail@web9603.mail.yahoo.com>
Hi,

I have a couple of points that can be added to the
wish list, and probably won't cost a lot of coding
effort.

- issue: preprocessing

I use a preprocessor with my html files - ePerl
embedded perl tool, which also has preprocessing
directives similar to CPP. I also use JSP and server
side java.

Since tidy cannot handle files that contain
preprocessing directives and special tags, I would
like to suggest the following two approaches to handle
these cases in a generic way.

one approach is to introduce stream based translations
that are applied before the text is passed to the
parser. these will be regex based:


eg: in the config file
    note: regexes are perl style

1: begin translations 
2:   "^#include\s+body\.txt\s*$"            "<body>"
3:   "^#include\s+.*$"                      ""
4:   "<!--#include\s+\"body\.txt\"\s*-->"   "<body>"
5:   "<:.*?:>"					""
6:   "<eperl>.*?</eperl>"			""
7: end translations


some explanations on the above:

2: the author has defined a file for a lengthy body
tag, just insert a body tag to keep parser happy. the
preprocessor uses a CPP style directive at line
beginnings.

3: the rest of the includes are just for general well
balanced html fragments, so they can be safely blocked
from the parser.

4: similar to 2:, but for a tag based directive

5: eperl style delimiters and what they enclose should
be ignored by parser

6: the <eperl> tags and the contents to be ignored.
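
To make the rule table concrete, here is a rough Python
sketch of how such translations could be applied in
order. It is only an illustration, not tidy code: the
names RULES and apply_translations are made up, and it
assumes the anchors are meant as Perl's "^" (line start)
and "$" (line end).

```python
import re

# Hypothetical rule table mirroring the config sketch: (regex, replacement).
# Assumption: "^" / "$" anchor line start / line end, as in Perl with /m.
RULES = [
    (r'^#include\s+body\.txt\s*$', '<body>'),
    (r'^#include\s+.*$', ''),
    (r'<!--#include\s+"body\.txt"\s*-->', '<body>'),
    (r'(?s)<:.*?:>', ''),              # (?s) lets the block span several lines
    (r'(?s)<eperl>.*?</eperl>', ''),
]


def apply_translations(text, rules=RULES):
    """Apply every rule in table order; earlier rules see the text first."""
    for pattern, replacement in rules:
        # MULTILINE makes ^ and $ match at each line boundary.
        text = re.sub(pattern, replacement, text, flags=re.MULTILINE)
    return text
```

Because rule 2 is listed before rule 3, the body.txt
include gets its <body> stand-in before the catch-all
include rule can blank it out.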



some rules and restrictions:

  order is significant, rules that come first are
  applied first.

  minimal matching: quantifiers are not greedy - this
  can be made the default or left to user to specify
  (?).

  no overlap: matched stream segments must not
  overlap. the implementation has to prevent this.
 
  the first string is a regex while the second is just
  an ordinary string. Or maybe the second string can
  use values from the regex string as in perl back
  references, in case needed.

implementation suggestion: 

  tri-tree structure: each regex match divides stream
  into three nodes: pre / match / post. we start with
  the whole stream in one node.

  translations are run in order, a pass for each rule.
  
  translations do not cross node boundaries, to
  prevent overlap of matched segments - as mentioned
  above.

  generated segments are wrapped in special tags that
  are acceptable to the parser or maybe html comments,
  and contain refs to original segments, such as to
  tree nodes. eg:

  
  <!--#tidy begin-trans id=some-ref --> 

  generated segment . . .

  <!--#tidy end-trans --> 

  the tree is traversed to generate the input stream
  to the parser

  on output and after the parser is done, the original
  content is inserted instead of the generated
  segments, using the refs in the wrapper tags, and
  the wrapper tags are removed. any style changes or
  indentation within the special tags are discarded, or
  not applied to start with.
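
The scheme above could be sketched roughly as follows in
Python. This is an assumption-laden illustration, not a
proposal for the actual tidy code: it uses a flat list
of (start, end, replacement) nodes over the unchanged
stream instead of a real tri-tree, and the names
translate and restore are made up.

```python
import re

WRAP = "<!--#tidy begin-trans id=%d -->%s<!--#tidy end-trans -->"
WRAPPED = re.compile(
    r"<!--#tidy begin-trans id=(\d+) -->.*?<!--#tidy end-trans -->",
    re.DOTALL,
)


def translate(stream, rules):
    """Return (parser_input, originals); originals[i] belongs to id=i."""
    # replacement is None for raw text that is still open to matching.
    nodes = [(0, len(stream), None)]
    for pattern, replacement in rules:          # one pass per rule, in order
        regex = re.compile(pattern, re.MULTILINE)
        split = []
        for start, end, repl in nodes:
            if repl is not None:                # translated node: never rematch
                split.append((start, end, repl))
                continue
            pos = start
            for m in regex.finditer(stream, start, end):
                if m.start() == m.end():
                    continue                    # skip empty matches
                split.append((pos, m.start(), None))          # pre
                split.append((m.start(), m.end(), replacement))  # match
                pos = m.end()
            split.append((pos, end, None))                    # post
        nodes = split
    out, originals = [], []
    for start, end, repl in nodes:
        if repl is None:
            out.append(stream[start:end])
        else:
            out.append(WRAP % (len(originals), repl))
            originals.append(stream[start:end])
    return "".join(out), originals


def restore(parser_output, originals):
    """Swap each wrapped segment back for the original text it stands for."""
    return WRAPPED.sub(lambda m: originals[int(m.group(1))], parser_output)
```

Searching only inside raw nodes is what keeps a later
rule from matching text that an earlier rule generated.
The restore step assumes the parser leaves the wrapper
comments byte-for-byte intact.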


for debugging purposes, a log file can be generated of
the translations that occurred, or of the pre-parser
stream.
 
Also, to give users flexibility, we can have command
line options and named sections to allow switching
certain rules on or off.

eg


begin translations jsp
  . . . 
end translations    jsp 


begin translations eperl
  . . .
end translations eperl


then on the command line (or in the config file
itself)

> tidy -translate jsp . . .
> tidy -translate jsp eperl . . .

order of rules will be significant across sections.
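
A small Python sketch of how the named sections might be
selected. Everything here is hypothetical: the SECTIONS
structure, the select_rules name and the JSP pattern are
only for illustration, and it assumes that "order across
sections" means config-file order rather than the order
of names on the command line.

```python
# Hypothetical in-memory form of the named sections, in config-file order.
SECTIONS = [
    ("jsp", [(r"<%.*?%>", "")]),                      # illustrative pattern
    ("eperl", [(r"<:.*?:>", ""), (r"<eperl>.*?</eperl>", "")]),
]


def select_rules(enabled, sections=SECTIONS):
    """Collect the rules of the enabled sections, keeping config-file order."""
    wanted = set(enabled)
    return [rule
            for name, rules in sections if name in wanted
            for rule in rules]
```

With this reading, "-translate eperl jsp" and
"-translate jsp eperl" would yield the same rule order.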



A second approach, which can also be used in addition
to the one above:

control directives for tidy, where everything in
between is affected by the directives. Authors will
insert these tags to direct tidy to apply some action
to a given section: eg:

<!--#tidy translate="" //ignores everything in between
-->


<?php >
. . . 

<!--#tidy end-translate -->

this will replace everything in between with null.
 


<!--#tidy translate="<body>" //replace with a body tag
for 
                        the purposes of validation -->
. . .
<!--#tidy end-translate -->



in implementation, translation can be performed
without interference with the parser.  the above can be
entered into the parser as for example:


<!--#tidy id="23"  translate="" //this ignores everything
in between  -->
<!--#tidy end-translate -->

and

<!--#tidy id="24"  translate="<body>" //replace with a
body tag for 
                        the purposes of validation -->
<body>
<!--#tidy end-translate -->

where the id's are references to original content
between tags saved in some structure.

on output, we remove the id tags, and re-translate
back to original segment content.

no modifications are made on content or style of the
original segments in between tags.

of course this approach can be too verbose if we are
talking wrapper tags around tags containing
preprocessor style attributes.
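
As a rough illustration of this second approach, a
Python sketch follows. It is not how tidy works today.
The names hide and reveal are made up, the directive
syntax is taken from the examples above, and the sketch
assumes the whole original block (directives included)
is saved and put back on output.

```python
import re

# Assumed directive syntax; any note after the quoted value is free text.
BLOCK = re.compile(
    r'<!--#tidy\s+translate="([^"]*)".*?-->'   # opening directive, 1 = stand-in
    r'.*?'                                     # original content to hide
    r'<!--#tidy\s+end-translate\s*-->',
    re.DOTALL,
)
MARKED = re.compile(
    r'<!--#tidy id="(\d+)".*?<!--#tidy end-translate -->', re.DOTALL)


def hide(text):
    """Replace each directive block by an id marker plus its stand-in text."""
    saved = []

    def swap(m):
        saved.append(m.group(0))               # keep the whole original block
        return ('<!--#tidy id="%d" -->%s<!--#tidy end-translate -->'
                % (len(saved) - 1, m.group(1)))

    return BLOCK.sub(swap, text), saved


def reveal(text, saved):
    """Put the saved original blocks back in place of the id markers."""
    return MARKED.sub(lambda m: saved[int(m.group(1))], text)
```

The parser only ever sees the stand-in (nothing, or a
<body> tag), and reveal relies on the markers coming
back from the parser unchanged.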


I understand that the above two approaches can have
many interactions with existing functionality, but I
wish that others take a look, and see if such
techniques would handle situations that they are
facing.


- issue: compact style

a compaction option: some browsers have weird
behavior based on the physical layout of tags, eg
white spaces.  rather than having that affect
readability, we can use two versions of a file: a
fully compact version is used for testing in a browser
or for final deployment, while a fully indented
version can be used for editing.

by fully compact we mean no line breaks at all and no
white spaces outside tags. so the whole stream is a
single line.

not sure if this single line code will be safe for all
browsers, maybe there can be another option to
generate a compact version, but with all line breaks
within tags and no breaks on tag boundaries.
 
eg

 
  tidy --indent compact/yes/no/auto  --nocomments  . . .
    
I find this much easier than trying to keep track of
which <br> or <img> has spaces or newlines and what
causes blue whiskers or ugly tables and the like.
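
To show what "fully compact" could mean, here is a naive
Python sketch. It is regex based and does no parsing, so
it is only an illustration: it would damage <pre> and
script content, and dropping the space between inline
tags can change rendering. The name compact is made up.

```python
import re


def compact(html):
    """Drop whitespace between tags and fold remaining line breaks."""
    html = re.sub(r">\s+<", "><", html)     # whitespace between adjacent tags
    html = re.sub(r"\s*\n\s*", " ", html)   # other line breaks become a space
    return html.strip()
```

A real option would have to work on the parse tree so
that it can leave whitespace-sensitive elements alone.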
  

this can be further enhanced by an option for removing
comments.

eg 
 
  tidy --indent compact/yes/no/auto  --nocomments  . . .


there exists another tool called htmltidy
[http://people.itu.int/~lindner/] which implements
something close, but it also tries to trim some
attributes and code deemed redundant. it seems to rely
on patterns, so the output is not completely clean.


- issue: files containing html fragments: 
  
usually, a file containing shared sections of html is
maintained separately, and included within other files
using preprocessors or SSI. it would be nice to have
these validated and pretty-printed separately.

the fragments can be restricted to a tree or a forest,
as a reasonable design style is to have the fragments
well balanced.

since tidy is using a parser, I think implementing
this won't be a problem. so a fragment will be
validated as a sub-tree of a full html document. eg

tidy --html-fragment  . . .   frag1.txt

this command will enforce well balanced tree/forest
and valid html tags.

some options that apply to full html files might be
meaningless for this case.
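
The "well balanced forest" part of the check could look
roughly like this in Python, using the standard
html.parser module. It is only a sketch: the names
BalanceChecker and is_balanced_fragment are made up, it
does not check that tag names are valid html, and it is
stricter than html itself because it also flags omitted
optional end tags such as </p> or </li>.

```python
from html.parser import HTMLParser

# Elements that have no end tag; they never go on the open-element stack.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta",
        "param"}


class BalanceChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []      # currently open elements
        self.errors = []     # end tags that did not match the innermost open tag

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.errors.append("unexpected </%s>" % tag)


def is_balanced_fragment(text):
    """True if every non-void element opened in text is closed in order."""
    checker = BalanceChecker()
    checker.feed(text)
    checker.close()
    return not checker.errors and not checker.stack
```

A fragment with several top-level elements (a forest)
passes as long as each of them is closed properly.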


---------------
Basel Shishani



Received on Thursday, 12 April 2001 13:23:58 UTC