Re: Webpage Corpus

Some notes from today’s call on this topic:

- Internet Archive (“HTTP Archive”?) is a good source too
- try to simulate dynamic content by chopping up a single page into multiple chunks (see the sketch after this list)
- could also simulate dynamic content by using the comment sections of websites
- jpamental volunteers to help (thank you!!!!!)
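
For the chunking idea, here is a rough sketch of what that could look like (purely illustrative; the chunk count and the text extraction strategy are placeholders, not a proposal for the actual tooling):

    # Sketch only: split one static HTML page into N chunks of text so that
    # loading chunk 1..N approximates a sequence of dynamic page views.
    # The chunk count and extraction strategy are made up for the example.
    import sys
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.parts = []

        def handle_data(self, data):
            if data.strip():
                self.parts.append(data.strip())

    def chunk_page(html, n_chunks=5):
        parser = TextExtractor()
        parser.feed(html)
        size = max(1, len(parser.parts) // n_chunks)
        return [parser.parts[i:i + size] for i in range(0, len(parser.parts), size)]

    if __name__ == "__main__":
        with open(sys.argv[1], encoding="utf-8") as f:
            chunks = chunk_page(f.read())
        # Each chunk stands in for one "view" of progressively revealed content.
        for i, chunk in enumerate(chunks, 1):
            print(f"view {i}: {sum(len(t) for t in chunk)} characters")

Each chunk would then be treated as one page view in the sequence, so a single static page yields several views that share most of their content, roughly like a page whose comment section loads incrementally.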


> On Nov 1, 2019, at 4:37 PM, Garret Rieger <grieger@google.com> wrote:
> 
> 
> Agree that we should pursue the creation of an external corpus independently from our current efforts to get one created internally that we can publish.
> 
> As you suggested, we'll probably want to start with a crawl of the web (this could be gathered by the WG, or could use something like HTTP Archive). Then I think there are a few different options for generating the page view sequences:
>   - If we have some sort of ranking of the popularity of the pages in the index, we could use that to weight the generation of random sequences of pages from the index.
>   - If the index includes information on which pages link to other pages, we could generate traversal sequences by starting from a random page and then randomly traversing the graph.
>   - Or ideally, if we have both of the above, we could combine popularity weighting with the graph traversal.
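
A rough sketch of what combining those two options might look like (the page names, popularity weights, and link structure below are made up, and this is not a proposal for the actual index format):

    # Sketch only: combine page popularity with link-graph traversal to
    # generate synthetic page view sequences. The inputs below are fabricated
    # stand-ins for whatever the real crawl index would provide.
    import random

    popularity = {"a.html": 100, "b.html": 40, "c.html": 10, "d.html": 5}
    links = {
        "a.html": ["b.html", "c.html"],
        "b.html": ["a.html", "d.html"],
        "c.html": ["a.html"],
        "d.html": [],
    }

    def generate_sequence(length=10):
        pages = list(popularity)
        weights = [popularity[p] for p in pages]
        # Start at a page chosen in proportion to its popularity.
        current = random.choices(pages, weights=weights, k=1)[0]
        sequence = [current]
        for _ in range(length - 1):
            out = links.get(current, [])
            if out:
                # Follow an outgoing link, again weighted by popularity.
                current = random.choices(out, weights=[popularity.get(p, 1) for p in out], k=1)[0]
            else:
                # Dead end: jump to a fresh popularity-weighted random page.
                current = random.choices(pages, weights=weights, k=1)[0]
            sequence.append(current)
        return sequence

    print(generate_sequence())

Running something like this many times over a real crawl index would give a pool of synthetic sequences for the analysis framework to consume.
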
> 
> Wikipedia could be a good starting point to validate an approach, but I found that when grabbing pages in languages other than English, the majority of articles were extremely short. So I think the Wikipedia data may not be a representative sample of the web at large for non-Latin scripts.
> 
> 
>> On Thu, Oct 31, 2019 at 10:39 AM Myles C. Maxfield <mmaxfield@apple.com> wrote:
>> 
>> 
>>>> On Oct 31, 2019, at 10:34 AM, Myles C. Maxfield <mmaxfield@apple.com> wrote:
>>>> 
>>>> Hi!
>>>> 
>>>> It’s been 5 months since the original Analysis Framework proposal, which included an action item to share a set of page view sequences to use as a testing corpus. However, this corpus has yet to appear. I think we need to consider the possibility that we may need to construct a corpus ourselves, without this preexisting data set.
>>>> 
>>>> I’ve been working on the range-request model for streamable fonts, and in my research, I’ve had to gather my own corpus because I can’t let my work be blocked on lawyers at another company. I’m not suggesting that I use my corpus
>>> 
>>> Whoops: this should be “that we use my corpus” above
>>> 
>>> , but I am suggesting that the Working Group comes up with a plan (and eventually executes it) for gathering our own, non-proprietary corpus.
>>> 
>>> Ideas off the top of my head:
>>> - Web crawler. There are many off-the-shelf crawlers. I’ve been using http://nutch.apache.org and it seems to work pretty well.
>>> - Wikipedia. They offer dumps of their entire database.
>>> - Your idea here?
>>> Importantly, none of these ideas include page view sequences; they only include static content. If we use one of these ideas, then we will probably have to either a) try to create artificial page view sequences out of the corpus, or b) modify the analysis framework to not require page view sequences.
>>> 
>>> Thanks,
>>> Myles
>> 

Received on Monday, 4 November 2019 17:41:42 UTC