connectibyte
The size of the World Wide Web (The Internet)
The Indexed Web contains at least 4.72 billion pages (Saturday, 30 May, 2015).
The Dutch Indexed Web contains at least 241.9 million pages (Saturday, 30 May, 2015).
The Indexed Web
GB = Sorted on Google and Bing
BG = Sorted on Bing and Google
The size of the World Wide Web:
Estimated size of Google's index
Estimated size of Bing's index
Estimated size of Yahoo Search's index: no longer measured, because Yahoo has stopped showing the total number of results.
Estimated size of Ask's index: no longer measured, because Ask has stopped showing the total number of results.
How is the size of the World Wide Web (The Internet) estimated?
The estimated minimal size of the indexed World Wide Web is based on estimates of the number of pages indexed by Google, Bing, and Yahoo Search. From the sum of these estimates, an estimated overlap between the search engines is subtracted. The overlap is an overestimate; hence, the total estimated size of the indexed World Wide Web is an underestimate.
Since the overlap is subtracted in sequence, starting from one of the four search engines, several orderings (and total estimations) are possible. We present two total estimates, one starting with Bing (BG) and one starting with Google (GB). The figure reported at the top of the page refers to the GB estimation.
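To see why the order matters, here is a minimal sketch of sequential overlap subtraction. It is not the site's actual code. The function name, the index sizes, and the overlap fractions are all made up for illustration, and the sketch assumes each engine's overlap is taken only against the engine counted just before it.

```python
def sequential_total(order, sizes, overlap_fraction):
    """Add up index sizes in the given order, discounting each engine
    by the share of its index assumed to be covered by the previous one.

    overlap_fraction[(engine, previous)] is a hypothetical input: the
    fraction of `engine`'s pages that `previous` is assumed to index too.
    """
    total = 0.0
    previous = None
    for engine in order:
        size = sizes[engine]
        if previous is not None:
            # Keep only the part of this index not already counted.
            size *= 1.0 - overlap_fraction[(engine, previous)]
        total += size
        previous = engine
    return total


# Hypothetical numbers, in arbitrary units (not real measurements).
sizes = {"Google": 100, "Bing": 40}
overlap_fraction = {("Bing", "Google"): 0.75, ("Google", "Bing"): 0.25}

gb_total = sequential_total(["Google", "Bing"], sizes, overlap_fraction)
bg_total = sequential_total(["Bing", "Google"], sizes, overlap_fraction)
print(gb_total, bg_total)
```

Because the two overlap fractions are estimated separately, they need not describe the same number of shared pages, so the GB and BG totals can differ.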
The size of a search engine's index is estimated with a method that combines word frequencies obtained from a large offline text collection (corpus) with the search counts returned by the engines. Each day, 50 words are sent to all four search engines, and the number of webpages found for each word is recorded. Combined with the words' relative frequencies in the background corpus, these counts yield multiple extrapolated estimates of the size of the engine's index, which are then averaged. The 50 words were selected evenly across logarithmic frequency intervals. The background corpus contains more than 1 million webpages from DMOZ and can be considered a representative sample of the World Wide Web.
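One way to read "selected evenly across logarithmic frequency intervals" is sketched below. The post does not give the actual selection procedure, so the function name, the nearest-target rule, and the toy frequencies are assumptions for illustration only.

```python
import math


def pick_words_by_log_frequency(doc_freq: dict[str, float], n_words: int) -> list[str]:
    """Pick words whose log10 document frequencies lie closest to
    n_words evenly spaced targets between the rarest and the most
    common word. Illustrative only; not the original procedure."""
    logs = {w: math.log10(f) for w, f in doc_freq.items() if f > 0}
    if not logs or n_words < 1:
        return []
    lo, hi = min(logs.values()), max(logs.values())
    if n_words == 1:
        targets = [hi]
    else:
        step = (hi - lo) / (n_words - 1)
        targets = [lo + i * step for i in range(n_words)]

    remaining = dict(logs)
    chosen = []
    for target in targets:
        if not remaining:
            break
        # Take the unused word nearest to this log-frequency target.
        word = min(remaining, key=lambda w: abs(remaining[w] - target))
        chosen.append(word)
        del remaining[word]
    return chosen


# Toy corpus frequencies (fraction of documents containing each word).
toy_freq = {"a": 1.0, "b": 0.1, "c": 0.01, "d": 0.5, "e": 0.05}
picked = pick_words_by_log_frequency(toy_freq, 3)
print(picked)
```

Spacing the targets on a log scale keeps rare and common words equally represented, instead of letting the few very frequent words dominate the sample.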
When you know, for example, that the word 'the' is present in 67.61% of all documents within the corpus, you can extrapolate the total size of the engine's index from the document count it reports for 'the'. If Google says that it found 'the' in 14,100,000,000 webpages, the estimated size of Google's total index would be 14,100,000,000 / 0.6761, or about 20.85 billion pages.
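The extrapolation for a single word, and the averaging over several words, can be sketched as follows. This is a rough illustration rather than the original implementation. The function names are made up, and it assumes "extrapolate" means dividing the reported count by the word's document fraction in the corpus.

```python
from statistics import mean


def estimate_index_size(reported_count: int, corpus_doc_fraction: float) -> float:
    """Extrapolate an index size from one word.

    reported_count: number of pages the engine reports for the word.
    corpus_doc_fraction: share of corpus documents containing the word (0..1].
    """
    if not 0 < corpus_doc_fraction <= 1:
        raise ValueError("corpus_doc_fraction must be in (0, 1]")
    return reported_count / corpus_doc_fraction


def average_estimate(observations) -> float:
    """Average the per-word estimates; observations is an iterable of
    (reported_count, corpus_doc_fraction) pairs."""
    return mean(estimate_index_size(count, frac) for count, frac in observations)


# The 'the' example from the post: 14.1 billion hits, 67.61% of corpus documents.
the_estimate = estimate_index_size(14_100_000_000, 0.6761)
print(f"{the_estimate:,.0f}")
```

With the figures quoted for 'the', this division comes out at roughly 20.85 billion pages. A single word gives a noisy estimate, which is presumably why the per-word results are averaged.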
The overlap between the indices of two search engines is estimated from daily counts of the URLs that both engines return in their top-10 results, taken over a sufficiently large number of random word queries. The words were randomly drawn from the DMOZ background corpus.
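A minimal sketch of such an overlap count is shown below. The post does not spell out the exact formula, so the function name, the toy URLs, and the choice to express overlap as the share of engine A's top-10 URLs also returned by engine B are assumptions. URL normalization is ignored here.

```python
def overlap_fraction(results_a, results_b) -> float:
    """Share of engine A's returned URLs that engine B also returned
    for the same query.

    results_a, results_b: lists of per-query URL lists (e.g. top-10
    results), aligned so that index i is the same query in both.
    """
    shared = 0
    total = 0
    for urls_a, urls_b in zip(results_a, results_b):
        unique_a = set(urls_a)
        shared += len(unique_a & set(urls_b))
        total += len(unique_a)
    return shared / total if total else 0.0


# Toy results for two random-word queries (made-up URLs).
engine_a = [["u1", "u2", "u3", "u4"], ["v1", "v2"]]
engine_b = [["u2", "u4", "x"], ["y"]]

fraction = overlap_fraction(engine_a, engine_b)
print(round(fraction, 3))
```

Swapping the two arguments gives the overlap from engine B's point of view, which generally is a different number.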
My paper (written in Dutch) contains detailed information about the method. This work was carried out as a Master's thesis project at the Faculty of Arts.
Note
No counts were taken on the following dates:
7th of July till the 7th of August 2006.
3rd of October till the 16th of October 2007 (only for the Indexed Web).
19th of January till the 30th of January 2008 (only for the Dutch Web).
20th of March till the 1st of April 2008.
5th of May till the 14th of May 2010.
3rd of April till the 13th of April 2012.
11th of July till the 29th of July 2012.
8th of October till the 12th of October 2012.
10th of May till the 30th of May 2013.
22nd of January till the 27th of January 2014.
10th of July till the 17th of September 2014.