Is It Possible to Create a Semantic Repository of ALL Webpages on the Internet (Semantic Web 1.0)?

Maybe this is a newbie question, but I would like to hear your opinions.

The Google search engine has a repository that contains the full HTML of
every web page on the internet:
http://infolab.stanford.edu/~backrub/google.html
http://infolab.stanford.edu/~backrub/over.gif

My question is: assuming it is possible to convert text to an ontology
(ontology learning, semantic extraction, aka text-to-onto), is it possible
to create a semantic repository of ALL internet pages?

The proposal is:

1. crawl and convert each webpage into a semantic version (RDF or OWL);
see the sketch after this list
2. store it in a semantic repository. This could be a NoSQL database,
where each document is the semantic version of a web resource
3. create a semantic index, e.g. YARS or a triplestore
4. create a RESTful API that allows SPARQL queries
5. do semantic retrieval / semantic matching
6. import existing semantic databases (like DBpedia, Linked Data) and
gradually integrate their ontologies
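
To make steps 1 and 2 concrete, here is a minimal sketch in Python
(requests, beautifulsoup4, rdflib). It only does shallow extraction
(title and outgoing links) rather than real ontology learning, and the
"semrep" vocabulary, the in-memory repository dict and the function
names are just illustrations I made up, not an existing system:

# Minimal sketch: crawl one page, emit an RDF description of it,
# and store the serialized triples keyed by URL (document-per-resource).
# The "repository" dict is a stand-in for the NoSQL document store.
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import RDF, DCTERMS

SEMREP = Namespace("http://example.org/semrep#")  # hypothetical vocabulary
repository = {}  # stand-in for the NoSQL document store

def page_to_rdf(url):
    """Fetch a page and build a (very shallow) RDF version of it."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    g = Graph()
    page = URIRef(url)
    g.add((page, RDF.type, SEMREP.WebPage))
    if soup.title and soup.title.string:
        g.add((page, DCTERMS.title, Literal(soup.title.string.strip())))
    for a in soup.find_all("a", href=True):  # outgoing links as triples
        g.add((page, SEMREP.linksTo, URIRef(urljoin(url, a["href"]))))
    return g

def store(url):
    """Step 2: one document per web resource, serialized as N-Triples."""
    repository[url] = page_to_rdf(url).serialize(format="nt")

store("https://www.w3.org/")
print(repository["https://www.w3.org/"][:300])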

Typically the raw page text is about 2 KB. Assuming the internet has
about 25 billion pages, and that the semantic version of a webpage
doubles the document size (a 100% increase), we would need about 100 TB
for the semantic repository, plus another 100 TB for the semantic index,
so this project is quite viable nowadays.
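
Just as a sanity check, the arithmetic behind that 100 TB figure, using
the rough assumptions above (25 billion pages, 2 KB of raw text per page,
semantic version doubling the size):

# Back-of-envelope check of the storage estimate above.
PAGES = 25e9      # ~25 billion pages (assumption)
RAW_KB = 2        # ~2 KB of raw text per page (assumption)
GROWTH = 2.0      # semantic version adds 100%, i.e. doubles the size

repo_tb = PAGES * RAW_KB * GROWTH / 1e9  # KB -> TB (decimal units)
index_tb = repo_tb                       # assume the index is about the same size
print(f"repository: {repo_tb:.0f} TB, index: {index_tb:.0f} TB, "
      f"total: {repo_tb + index_tb:.0f} TB")
# -> repository: 100 TB, index: 100 TB, total: 200 TB

So roughly 200 TB in total, which is indeed manageable today.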

This could be used by programs, agents, and apps for mashup services and
everything else.
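
For example, an agent could consume the repository through the SPARQL
endpoint of step 4. A sketch with SPARQLWrapper, assuming a hypothetical
endpoint URL and the same made-up semrep vocabulary as in the crawling
sketch above:

# Sketch of a client/agent querying the (hypothetical) SPARQL endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://semrep.example.org/sparql")  # hypothetical
sparql.setQuery("""
    PREFIX semrep: <http://example.org/semrep#>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?page ?title WHERE {
        ?page a semrep:WebPage ;
              dcterms:title ?title ;
              semrep:linksTo <https://www.w3.org/TR/rdf-sparql-query/> .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["page"]["value"], "-", row["title"]["value"])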

The main objective is to launch a starting point for Semantic Web 1.0, a
parallel, semantic version of the web. Waiting for users to adopt
RDFa/microformats in their webpages will never happen!

Does anyone know if there is a project or research effort like this? What
are the biggest challenges or barriers to doing it?

Thanks
-- 
Marcel Ferrante Silva
+55 62 8108-1277
"The Power of Ideas"
skype: marcelferrante
msn/gtalk: marcelf@gmail.com
lattes.cnpq.br/6034149800479841
