Re: Webpage Corpus

Agreed: we should pursue the creation of an external corpus independently
of our current efforts to get one created internally that we can publish.

As you suggested, we'll probably want to start with a crawl of the web (this
could be gathered by the WG, or we could use something like the HTTP
Archive). Then I think there are a few different options for generating the
page view sequences:
  - If we have some sort of ranking of the popularity of the pages in the
index, we could use that to weight the generation of random sequences of
pages from the index.
  - If the index includes information on which pages link to which other
pages, we could generate traversal sequences by starting from a random page
and then randomly traversing the link graph.
  - Ideally, if we have both of the above, we could combine the popularity
weighting with the graph traversal (a rough sketch of what that might look
like follows this list).
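
To make that third option concrete, here is a minimal sketch of what a
popularity-weighted graph traversal could look like. Everything in it is
illustrative: the link map, the popularity scores, and the restart
probability are stand-ins for whatever the real crawl index would provide.

    # Illustrative sketch only: "links" and "popularity" stand in for whatever
    # the crawl index gives us; the restart probability is a guess at how
    # often a surfer jumps to a new page instead of following a link.
    import random

    links = {
        "a.example/": ["b.example/", "c.example/"],
        "b.example/": ["a.example/", "c.example/"],
        "c.example/": ["a.example/"],
    }
    popularity = {"a.example/": 100.0, "b.example/": 10.0, "c.example/": 1.0}

    def weighted_choice(candidates):
        # Pick a page with probability proportional to its popularity score.
        weights = [popularity.get(url, 1.0) for url in candidates]
        return random.choices(candidates, weights=weights, k=1)[0]

    def generate_sequence(length=10, restart_prob=0.15):
        # Walk the link graph, biasing every step toward popular pages and
        # occasionally restarting at a fresh page, like a surfer typing a URL.
        sequence = [weighted_choice(list(links))]
        while len(sequence) < length:
            out = links.get(sequence[-1], [])
            if not out or random.random() < restart_prob:
                sequence.append(weighted_choice(list(links)))
            else:
                sequence.append(weighted_choice(out))
        return sequence

    print(generate_sequence())

Dropping the popularity weights gives the pure graph-traversal option, and
dropping the link-following gives the pure popularity-sampling option, so
one script could probably cover all three.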

Wikipedia could be a good starting point to validate an approach, but I
found that, when grabbing pages in languages other than English, the
majority of articles were extremely short. So I think the Wikipedia data may
not be a representative sample of the web at large for non-Latin scripts.
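
In case anyone wants to reproduce that check, here is a rough sketch of one
way to measure article lengths in a pages-articles dump using only the
standard library; the dump filename and the 500-character cut-off are
placeholders, not the values behind my observation above.

    # Sketch: tally article text lengths in a Wikipedia pages-articles dump
    # to see how many are near-empty stubs. Dumps come from
    # https://dumps.wikimedia.org/backup-index.html (decompress the .bz2
    # first). Note this counts raw wikitext for every page, redirects
    # included, so it slightly overstates the stub count.
    import xml.etree.ElementTree as ET

    DUMP = "kowiki-latest-pages-articles.xml"  # placeholder filename

    lengths = []
    for _, elem in ET.iterparse(DUMP, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # ignore the schema-version namespace
        if tag == "text" and elem.text is not None:
            lengths.append(len(elem.text))
        elif tag == "page":
            elem.clear()  # free finished pages so multi-GB dumps fit in memory

    lengths.sort()
    print("articles:", len(lengths))
    print("median length (chars):", lengths[len(lengths) // 2] if lengths else 0)
    print("under 500 chars:", sum(1 for n in lengths if n < 500))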


On Thu, Oct 31, 2019 at 10:39 AM Myles C. Maxfield <mmaxfield@apple.com>
wrote:

>
>
> On Oct 31, 2019, at 10:34 AM, Myles C. Maxfield <mmaxfield@apple.com>
> wrote:
>
> Hi!
>
> It’s been 5 months since the original Analysis Framework proposal, which
> included an action item to share a set of page view sequences to use as a
> testing corpus. However, this corpus has yet to appear. I think we need to
> consider the possibility that we may need to construct a corpus ourselves,
> without this preexisting data set.
>
> I’ve been working on the range-request model for streamable fonts, and in
> my research, I’ve had to gather my own corpus because I can’t let my work
> be blocked on lawyers at another company. I’m not suggesting that I use my
> corpus
>
>
> Whoops: this should be “that we use my corpus” above
>
> , but I am suggesting that the Working Group comes up with a plan (and
> eventually executes it) for gathering our own, non-proprietary corpus.
>
> Ideas off the top of my head:
>
>    - Web crawler. There are many off-the-shelf crawlers. I’ve been using
>    http://nutch.apache.org and it seems to work pretty well.
>    - Wikipedia. They offer dumps
>    <https://dumps.wikimedia.org/backup-index.html> of their entire
>    database.
>    - Your idea here?
>
> Importantly, none of these ideas include page view sequences; they only
> include static content. If we use one of these ideas, then we probably will
> have to either a) try to create artificial page view sequences out of the
> corpus, or b) modify the analysis framework to not require page view
> sequences.
>
> Thanks,
> Myles
>
>
>

Received on Friday, 1 November 2019 23:37:20 UTC