Webpage Corpus

Hi!

It’s been 5 months since the original Analysis Framework proposal, which included an action item to share a set of page view sequences to use as a testing corpus. However, that corpus has yet to appear. I think we need to consider the possibility that we will have to construct a corpus ourselves, without this preexisting data set.

I’ve been working on the range-request model for streamable fonts, and in my research I’ve had to gather my own corpus, because I can’t let my work be blocked on lawyers at another company. I’m not suggesting that we use my corpus, but I am suggesting that the Working Group come up with a plan (and eventually execute it) for gathering our own, non-proprietary corpus.
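For context, the range-request model fetches only the byte ranges of a font that a page actually needs, via the standard HTTP Range header. Here is a minimal sketch of a single such request, just to make the mechanism concrete; the font URL and byte range are placeholders:

```python
import urllib.request

# Placeholder font URL; any server that honors HTTP range requests will do.
FONT_URL = "https://example.com/fonts/SomeFont.otf"

# Ask for only the first 4 KiB (e.g., enough for the table directory)
# instead of downloading the whole font file.
req = urllib.request.Request(FONT_URL, headers={"Range": "bytes=0-4095"})
with urllib.request.urlopen(req) as resp:
    # A server that honors the range replies with 206 Partial Content
    # and a Content-Range header describing the slice it returned.
    print(resp.status, resp.headers.get("Content-Range"))
    print(f"received {len(resp.read())} bytes")
```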

Ideas off the top of my head:
- Web crawler. There are many off-the-shelf crawlers; I’ve been using http://nutch.apache.org/ and it seems to work pretty well. (A minimal sketch of the idea follows this list.)
- Wikipedia. They offer dumps <https://dumps.wikimedia.org/backup-index.html> of their entire database.
- Your idea here?
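If we do go the crawler route, the core fetch loop is simple enough that even a standard-library-only sketch conveys it. This is not a substitute for Nutch, just a minimal breadth-first fetcher to show the shape of the thing; the seed URL, page limit, and politeness delay below are all placeholder values:

```python
import time
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=50, delay=1.0):
    """Breadth-first crawl from a seed URL; returns {url: html}."""
    corpus, queue, seen = {}, deque([seed]), {seed}
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    continue
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip unreachable or malformed pages
        corpus[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            # Stay on the seed's host so the crawl stays bounded.
            if (urlparse(absolute).netloc == urlparse(seed).netloc
                    and absolute not in seen):
                seen.add(absolute)
                queue.append(absolute)
        time.sleep(delay)  # politeness delay between fetches
    return corpus

if __name__ == "__main__":
    pages = crawl("https://example.com/")  # placeholder seed URL
    print(f"fetched {len(pages)} pages")
```

Nutch of course handles robots.txt, retries, deduplication, and scale far better than this; the point is only that a Working-Group-run crawl is feasible without any proprietary data.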
Importantly, none of these ideas includes page view sequences; they only provide static content. If we use one of them, we will probably have to either a) try to create artificial page view sequences out of the corpus, or b) modify the analysis framework to not require page view sequences.
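To make option a) concrete: one plausible way to manufacture sequences from static content is a random walk over the corpus’s link graph, treating each walk as one synthetic browsing session. A rough sketch; the graph shape, walk lengths, and parameter values here are assumptions, not a proposal for the real methodology:

```python
import random

def synthetic_sequences(link_graph, n_sequences=100, max_hops=8, seed=0):
    """link_graph maps each URL to its outgoing links. Returns a list of
    synthetic page view sequences generated by random walks."""
    rng = random.Random(seed)
    pages = list(link_graph)
    sequences = []
    for _ in range(n_sequences):
        page = rng.choice(pages)
        walk = [page]
        for _ in range(rng.randint(1, max_hops)):
            # Only follow links that stay inside the corpus.
            out = [u for u in link_graph.get(page, []) if u in link_graph]
            if not out:
                break  # dead end: end this "session" early
            page = rng.choice(out)
            walk.append(page)
        sequences.append(walk)
    return sequences
```

The link graph could come straight out of whatever crawl we run, so option a) and the crawler idea above compose naturally. Whether such walks resemble real browsing behavior is exactly the kind of question the Working Group would need to answer.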

Thanks,
Myles
