Re: hand authoring web pages (was Re: Exploring new vocabularies for HTML)

Neil,

I agree with almost everything you say, namely that a) facts are hard 
to come by, b) hand-authoring is probably a small share, and c) a lot 
of authoring is hybrid (partly robotic, partly by hand).

What I don't agree with are just two things:

1) >Virtually all of the pages on commercial sites (amazon, ebay, 
craigslist, facebook, youtube, ...) are generated by 
software.  That's almost certainly the majority of web pages right there.

While I agree that most of those sites are no doubt generated by 
code, I'm not so sure that they constitute the majority of web pages. 
Maybe they do, and maybe there is data to substantiate that claim, 
since on the surface, yes, those sites have zillions of pages. But I'm 
thinking of Zipf's law: while "the", "of", and "a" each have a higher 
frequency of usage in English than any infrequent word like 
"defenestrate", the infrequently used words, en masse, can make up a 
higher proportion of the corpus. By similar rationale, all the 
little oddball pages authored by non-mainstream authors may represent 
a higher proportion than is represented by the low-hanging fruit 
of the amazons and ebays.
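
To make the Zipf argument a bit more concrete, here is a minimal 
sketch, assuming an idealized Zipf distribution in which the item of 
rank r has weight 1/r. The function name, the population size, and 
the head cutoffs are all made up for illustration; none of this is 
measured data about English or about the web.

```python
def zipf_shares(population, head_size, exponent=1.0):
    """Return (head_share, tail_share) under an idealized Zipf law.

    Assumption: the item of rank r has weight 1 / r**exponent.
    'population' and 'head_size' are illustrative inputs, not data.
    """
    weights = [1.0 / rank ** exponent for rank in range(1, population + 1)]
    total = sum(weights)
    head_share = sum(weights[:head_size]) / total
    return head_share, 1.0 - head_share


if __name__ == "__main__":
    # Compare a very small head with a much larger one.
    for head_size in (3, 1000):
        head, tail = zipf_shares(100_000, head_size)
        print(f"top {head_size}: head={head:.3f} tail={tail:.3f}")
```

Under these assumptions the top three items hold well under a quarter 
of the total, so the tail dominates, but a head of 1000 items holds 
more than half. Whether the tail wins depends on where one draws the 
line between "mainstream" and "oddball", which is itself a reason to 
want data.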

2)> Some (many?) of the software that generates web pages leave marks 
that identify its source.  Being at google, you or a colleague could 
do a little work and determine an upper bound on the number of web 
sites that are hand authored.

Well, I would have a couple of methodological problems with that: a) 
the hybrid pages. Sometimes folks use a thing like FrontPage or 
Dreamweaver to mock up a prototype of the page in WYSIWYG mode, and 
then do the bulk of their serious authoring (script, styles, and 
server-side HTTP requests) manually. The marks that the mock-up 
software leaves may never be removed, hence a bot which attempts to 
inventory them could underestimate the hand-cranking that has been 
done, and possibly do so significantly. b) the exhaustiveness of 
Google's robotic search, which, over the years, seems to crawl less 
deeply into, for example, academic web sites. One might question the 
results of studies that do not, in truth, sample representatively 
with equal probability.
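
For what it's worth, one common kind of mark is a "generator" meta 
tag in the page head. Below is a minimal sketch of how a bot might 
look for it, using only Python's standard html.parser module. The 
class and function names are hypothetical, and the sample page is 
invented. Note that it illustrates problem a): the page carries a tool 
mark even though the script in it could well have been typed by hand.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Collect the content of <meta name="generator" ...> tags."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "generator":
            self.generators.append(attributes.get("content") or "")


def find_generators(html):
    """Return the generator marks found in one page (hypothetical helper)."""
    finder = GeneratorFinder()
    finder.feed(html)
    finder.close()
    return finder.generators


if __name__ == "__main__":
    # Invented hybrid page: tool mark in the head, hand-written script.
    page = (
        '<html><head>'
        '<meta name="GENERATOR" content="Microsoft FrontPage 4.0">'
        '<script>function init() { /* typed by hand */ }</script>'
        '</head><body onload="init()"></body></html>'
    )
    print(find_generators(page))
```

A count built this way says only that a tool touched the page at some 
point, not how much of the page was later written by hand.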

Other than these quibbles, I generally agree. But one could, in the 
absence of data, just as easily conclude (just to be contrary) the 
opposite: that it is hard to see how anything other than 
hand-authoring could be seen as a priority.

There are also telltale signs of hand-authoring, like mismatched 
start/end tags or improper nesting, and isn't there data that 
suggests the vast majority of pages are ill-formed in some way? 
Perhaps some of those ways might be uniquely biological (rather like 
a Turing test of sorts)?
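
As a rough illustration of what such a check could look like, here is 
a minimal sketch of a stack-based nesting checker built on Python's 
standard html.parser module. The names are hypothetical, and it is 
deliberately naive: HTML allows some end tags (such as </p> or </li>) 
to be omitted, so this would over-report on perfectly valid pages, and 
it says nothing about whether a human or a buggy program made the 
mistake.

```python
from html.parser import HTMLParser

# Elements that never take an end tag in HTML.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class NestingChecker(HTMLParser):
    """Track open elements on a stack and record mismatches."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing syntax such as <br/> opens nothing

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        elif tag in self.stack:
            # Something opened later is still open: improper nesting.
            self.problems.append(f"improper nesting at </{tag}>")
            while self.stack.pop() != tag:
                pass
        else:
            self.problems.append(f"stray </{tag}>")


def nesting_problems(html):
    """Return a list of nesting problems found in one page."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    unclosed = [f"unclosed <{tag}>" for tag in checker.stack]
    return checker.problems + unclosed


if __name__ == "__main__":
    print(nesting_problems("<p><b>fine</b></p>"))
    print(nesting_problems("<b><i>crossed</b></i>"))
```

Turning a tally like this into evidence of hand-authoring would still 
need some ground truth about which errors tools do and do not make.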

Maybe we could have Google sponsor a summer of data-sharing, and 
allow a few thousand of our favorite queries about authoring 
practices to be empirically analyzed -- (we might have to ask them to 
crawl the trees a little deeper for our purposes though). That would be fun.

cheers,
David

At 10:05 AM 4/1/2008, you wrote:
>I suspect that this topic will generate more flames than data, but...
>
>On Mon, Mar 31, 2008 at 11:25 PM, Ian Hickson <ian@hixie.ch> wrote:
>
> > One unfortunate thing about the discussion on hand authoring is that it
> > has mostly been devoid of facts.  Some *facts* on percentages of
> > hand-authored vs machine-authored HTML should be part of a reasoned
> > discussion, but sadly neither side has produced any such facts.
>
>Indeed. Unfortunately it isn't clear how to collect such information.
>
>My experience has been that many pages are in fact hand-authored, either
>directly in a text editor, or through CMS systems that provide raw HTML
>editors, or through templates that are hand edited. I do not think we can
>forgo addressing the needs of hand-authoring content creators.
>
>
>
>As researchers and designers, it is important for us to realize our 
>own biases and the uniqueness of the world that immediately 
>surrounds us.  Eg, if we ask our colleagues, we might deduce that 
>20% - 30% of computer users use emacs.  Of course, that is complete 
>nonsense, but that is the kind of false statistic/impression we 
>would get from our own immediate environment.
>
>I suspect that a similar phenomenon leads some people to conclude 
>that hand authoring of web pages is common practice, and hence 
>should be a priority in the design of HTML5.  It would be great to 
>have some statistics on hand authoring.
>
>There are at least two possible ways to measure the importance of 
>hand authoring:
>1.  The number of authors who hand-author web pages.  Since most 
>people both hand author and use tools, this number probably needs to 
>be broken down in some way.
>
>2.  The total number of web pages that are authored by hand.
>
>
>On the surface, it would appear to me that hand authoring accounts 
>for a tiny fraction of the total number of pages.  Maybe less than 
>0.001%.  Here's why I think that:
>
>Virtually all of the pages on commercial sites (amazon, ebay, 
>craigslist, facebook, youtube, ...) are generated by 
>software.  That's almost certainly the majority of web pages right there.
>
>Another large group of web pages consist of wikis and blogs.  Again, 
>the web pages are generated by software.  Some (probably small) 
>group of people occasionally might edit the raw HTML to fix a 
>problem, but editing to fix a problem isn't really helped much by 
>terseness, and complex tag minimization rules might actually make it 
>harder to edit if the software generating the page tried to take 
>advantage of them.
>
>Another group of pages is generated by what we typically think of as 
>web page editing tools (Dreamweaver, FrontPage, GoLive, XMLSpy, 
>...).  Adding up the sales of those can give some indication of the 
>number of users of those products.  Or for the open source version, 
>looking into the number of downloads.  I don't have numbers for 
>these, but again, this is probably a substantial number.
>
>Similarly, in some organizations (schools or business or 
>...),  content management systems are used for web 
>authoring.  Again, software would mainly be used for web page 
>development although some hand editing might be done as you noted.
>
>Yet another chunk of web pages come from programs such as Word or 
>OpenOffice, etc., which offer a "Save as HTML" option.
>
>Against all of these sources, it seems like hand authoring would 
>account for a tiny, tiny fraction of web pages.  I suspect that even 
>if you limit the population to .edu sites, the numbers are 
>small.  Some (many?) of the software that generates web pages leave 
>marks that identify its source.  Being at google, you or a colleague 
>could do a little work and determine an upper bound on the number of 
>web sites that are hand authored.  That could be used to justify the 
>high priority you have placed on hand authorability.
>
>However, without real facts, it is hard to see how hand authoring 
>can be considered a priority.  Our personal experiences are just not 
>informative of what the majority of users do.
>
>Neil Soiffer
>Senior Scientist
>Design Science, Inc.
><http://www.dessci.com>
>~ Makers of Equation Editor, MathType, MathPlayer and MathFlow ~

Received on Tuesday, 1 April 2008 15:11:01 UTC