- From: <w3t-archive+esw-wiki@w3.org>
- Date: Sat, 05 Nov 2005 03:00:46 -0000
- To: w3t-archive+esw-wiki@w3.org
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "ESW Wiki" for change notification.

The following page has been changed by GoutamSaha:
http://esw.w3.org/topic/its0908LinguisticMarkup

------------------------------------------------------------------------------
  =='''Understanding Content Domain Level Markups:-'''==
- To find the content domain for a paragraph of text, we normally find that the content domain is simply the '''most frequently occurring word''' (e.g. a noun) in that paragraph. For example, if the word frequency of a word, say "football", is the highest among all the words in a paragraph, then the content domain is "football".
+ To find the content domain for a paragraph of text, we normally find that the content domain is simply the '''most frequently occurring word''' (e.g. a noun) in that paragraph. For example, if the word frequency of a word or its synonym, say "football", is the highest among all the words in a paragraph, then the content domain is "football".

- Again, the word with the maximum '''word-density''' may often be the Content Domain. The word density is the ratio of the number of times a word appears in a document to the size (total word count) of the document. It is a measure of how important a word is to the overall content of the document. A higher word density results in a higher relevance ranking.
+ Again, the word with the maximum '''word-density''' may often be the Content Domain. The word density is the ratio of the number of times a word (or its synonyms) appears in a document to the size (total word count) of the document. It is a measure of how important a word is to the overall content of the document. A higher word density results in a higher relevance ranking.

- We should not count prepositions or interjections (e.g., "Hallo", "Sir", etc.) when computing the word density. In many speeches and communications, "Sir" has the highest word density, which can be misleading when finding the Content Domain. Instead, we should consider only nouns when looking for the most frequently occurring word as the Content Domain.
+ We should not count prepositions or interjections (e.g., "Hallo", "Sir", etc.) when computing the word density. In many speeches and communications, "Sir" has the highest word density, which can be misleading when finding the Content Domain. Instead, we should consider only nouns when looking for the most frequently occurring word as the Content Domain. Depending on the text, the content domain may be, for example, music (for a musical text), doctor (for his/her prescriptions), mathematics, defence, sports, and so on.

  == Challenges ==
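
The word-density measure described in the changed text above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the page author's method: the names `EXCLUDED`, `SYNONYMS`, `word_densities` and `content_domain` are hypothetical, the small exclusion list stands in for real noun detection (which would need a part-of-speech tagger), and the synonym table is a toy example.

```python
import re
from collections import Counter

# Hypothetical, illustrative lists (not taken from the wiki page).
# EXCLUDED approximates "do not count prepositions and interjections".
EXCLUDED = {"sir", "hallo", "in", "on", "of", "at", "to", "the", "a", "is", "and"}
# SYNONYMS maps a word to the canonical word it should be counted under.
SYNONYMS = {"soccer": "football"}


def word_densities(text):
    """Return {word: occurrences / total word count} for non-excluded words."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)  # document size: every word counts toward the total
    if total == 0:
        return {}
    # Fold synonyms into one canonical word and skip excluded words.
    counts = Counter(SYNONYMS.get(w, w) for w in words if w not in EXCLUDED)
    return {word: count / total for word, count in counts.items()}


def content_domain(text):
    """Return the word with the highest density, or None for empty input."""
    densities = word_densities(text)
    if not densities:
        return None
    return max(densities, key=densities.get)
```

For instance, calling `content_domain("Sir, football is fun. Sir, soccer fans love football. Sir!")` picks "football" even though "Sir" occurs just as often, because "Sir" is excluded and "soccer" is counted as a synonym of "football".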
Received on Saturday, 5 November 2005 15:24:02 UTC