- From: Marcel Ferrante <marcelf@gmail.com>
- Date: Mon, 9 Jun 2014 10:12:55 -0300
- To: semantic-web@w3.org
- Message-ID: <CAEFXnnhBc9kSrsRz27mFP01bHOzJJT17SSkgJk==EfihqSfkDQ@mail.gmail.com>
Maybe this is a newbie question, but I would like to hear your opinions. The Google search engine has a repository that contains the full HTML of every web page on the internet:

http://infolab.stanford.edu/~backrub/google.html
http://infolab.stanford.edu/~backrub/over.gif

My question is: assuming it is possible to convert text to an ontology (ontology learning, semantic extraction, a.k.a. text-to-onto), is it possible to create a semantic repository of ALL internet pages?

The proposal is:

1. Crawl and convert each web page into a semantic version (RDF or OWL).
2. Store it in a semantic repository. This could be a NoSQL database, where each document is the semantic version of a web resource.
3. Create a semantic index, like YARS or a triplestore.
4. Create a RESTful API that allows SPARQL queries.
5. Do semantic retrieval / semantic matching.
6. Import existing semantic databases (like DBpedia and Linked Data) and gradually integrate the ontologies.

(A rough sketch of steps 1-4 and the storage arithmetic is included below, after my signature.)

Typically the raw text of a page is about 2 KB. Assuming the internet has about 25 billion pages, and that the semantic version of a page doubles the document size, we will need about 100 TB for the semantic repository, with another 100 TB for the semantic index. So nowadays this project is quite viable.

This could be used by programs, agents, and apps for mashup services and everything else. The main objective is to launch a start for Semantic Web 1.0, a parallel, semantic version of the web. Waiting for users to adopt RDFa/microformats in their web pages will never happen!

Does anyone know of a project or research effort like this? What are the biggest challenges or barriers to doing it?

Thanks

--
Marcel Ferrante Silva
+55 62 8108-1277
"The Power of Ideas"
skype: marcelferrante
msn/gtalk: marcelf@gmail.com
lattes.cnpq.br/6034149800479841
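
In case it helps the discussion, here is a minimal sketch of steps 1-4 under stated assumptions: it uses the rdflib Python library, the vocabulary namespace is made up for illustration, and extract_triples() is a hypothetical placeholder for the text-to-onto step, which is the genuinely hard research part. The crawled URL and the size figures are the ones from the message above; this is an illustration, not a working extractor.

# Sketch of steps 1-4 of the proposal, assuming rdflib.
# extract_triples() is a hypothetical stand-in for ontology learning /
# semantic extraction; a real system would run NLP over the page text.

import urllib.request

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

EX = Namespace("http://example.org/vocab/")  # hypothetical vocabulary


def extract_triples(url, html):
    """Placeholder text-to-onto step: yield (subject, predicate, object) triples."""
    page = URIRef(url)
    yield (page, RDF.type, EX.WebPage)
    yield (page, EX.rawSize, Literal(len(html)))


def semanticize(url):
    """Step 1: crawl one page and convert it into a small RDF graph."""
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    g = Graph()
    for triple in extract_triples(url, html):
        g.add(triple)
    return g


if __name__ == "__main__":
    g = semanticize("http://infolab.stanford.edu/~backrub/google.html")

    # Step 2: store the semantic version, e.g. one Turtle document per web resource.
    print(g.serialize(format="turtle"))

    # Step 4: the repository would answer SPARQL queries over the merged graphs.
    q = "SELECT ?page WHERE { ?page a <http://example.org/vocab/WebPage> }"
    for row in g.query(q):
        print(row.page)

    # Back-of-envelope storage estimate from the message:
    # 25e9 pages * 2 KB raw text * 2 (semantic version doubles the size) ~= 100 TB.
    pages, raw_kb, growth = 25e9, 2, 2
    print(f"~{pages * raw_kb * growth / 1e9:.0f} TB for the semantic repository")
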
Received on Monday, 9 June 2014 13:13:22 UTC