
D-Lib Magazine
July/August 2008

Volume 14 Number 7/8

ISSN 1082-9873

Google Still Not Indexing Hidden Web URLs

 

Kat Hagedorn
Metadata Harvesting Librarian
Digital Library Production Service, University of Michigan Libraries
Ann Arbor, MI
<khage@umich.edu>

Joshua Santelli
Applications Programmer
Digital Library Production Service, University of Michigan Libraries
Ann Arbor, MI
<santelli@umich.edu>


Introduction

This report is a follow-up to the McCown et al. article in IEEE Internet Computing two years ago, in which the researchers investigated the percentage of URLs from OAI records in Google, Yahoo and MSN search indexes [1]. We were interested in whether Google in particular had increased the number of OAI-based resources in its search index.

To this end, we applied a slightly different methodology to the OAIster [2] metadata corpus, measuring what percentage of the corpus was found in the Google search index alone. OAIster harvests and aggregates OAI metadata with links to digital resources; records without links to digital objects are removed during our transformation and indexing process.

Methodology

On June 6, 2008, a snapshot was taken of the harvested content in OAIster. The snapshot contained 978 repositories comprising 16,276,756 records. Each repository was placed into one of four groups based on the number of Dublin Core records indexed in OAIster.

Group A is made up of repositories with 100 or fewer records; Group B has 101 to 1,000 records; Group C has 1,001 to 10,000 records; and repositories with more than 10,000 records were put into Group D. (See Table 1.)

| Group | Number of Repositories in Group | Number of Records Sampled in Each Group |
|---|---|---|
| Group A: ≤100 Records | 139 | 6,582 |
| Group B: 101 - 1,000 Records | 334 | 14,670 |
| Group C: 1,001 - 10,000 Records | 363 | 64,150 |
| Group D: >10,000 Records | 142 | 61,893 |

Table 1. Randomly sampled records for each group.
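
The grouping rule described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the study: the function name `assign_group` and the repository "example-small-repo" are hypothetical, while the three other record counts are the totals reported in Table 4.

```python
def assign_group(record_count: int) -> str:
    """Return the group label (A-D) for a repository with record_count records."""
    if record_count < 0:
        raise ValueError("record_count must be non-negative")
    if record_count <= 100:
        return "A"   # 100 or fewer records
    if record_count <= 1_000:
        return "B"   # 101 to 1,000 records
    if record_count <= 10_000:
        return "C"   # 1,001 to 10,000 records
    return "D"       # more than 10,000 records


# Record counts for the three spot-checked repositories (Table 4),
# plus one made-up small repository for illustration.
repositories = {
    "example-small-repo": 42,  # hypothetical
    "UVicDSpace": 796,
    "Universität Frankfurt am Main Hochschulschriften OPUS": 5_516,
    "Digitala Vetenskapliga Arkivet (DiVA)": 25_104,
}

for name, count in repositories.items():
    print(f"{name}: Group {assign_group(count)}")
```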

Since OAIster only indexes records with URLs, each record has at least one URL. For records containing more than one URL, a single URL was selected at random from within the record.

Because Group A was very small (6,582 records), we selected all of its records to run against the Google search index. From Group B, we randomly selected 10% of the records from each repository; from Group C, we selected 5% from each repository; and from Group D, we selected 1% from each repository. With this method, we selected and tested a total of 147,295 URLs (the sum of the group samples in Table 1). Sample sizes were chosen to maintain at least a 95% confidence level (±1%).
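
A minimal sketch of this per-repository sampling, combined with the one-URL-per-record selection, might look like the following. The function names, the example URLs and the fixed random seed are hypothetical. Rounding the sample size down is an assumption: the article does not state a rounding rule, although rounding down reproduces the sample totals shown in Table 3.

```python
import math
import random

# Fraction of each repository's records to sample, by group (from the text above).
SAMPLING_FRACTION = {"A": 1.0, "B": 0.10, "C": 0.05, "D": 0.01}


def sample_size(record_count: int, group: str) -> int:
    """Number of records to draw from one repository (rounded down: an assumption)."""
    return min(record_count, math.floor(record_count * SAMPLING_FRACTION[group]))


def sample_urls(records, group, rng):
    """Randomly pick records from one repository, then one URL per picked record.

    records is a list of records; each record is a non-empty list of URL strings.
    """
    chosen = rng.sample(records, sample_size(len(records), group))
    return [rng.choice(urls) for urls in chosen]


# Hypothetical Group B repository: 796 records with two made-up URLs each.
records = [
    [f"http://example.org/item/{i}", f"http://example.org/alt/{i}"]
    for i in range(796)
]
urls = sample_urls(records, "B", random.Random(2008))
print(len(urls))  # 79
```

Passing an explicit `random.Random` instance keeps a run repeatable, which helps when a spot check needs to revisit the same sample.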

This method differs from that of McCown et al. They grouped the records using a different method, and they randomly selected 1,000 records from each group. They also searched MSN, Yahoo and Google while we searched for the records only in the Google search index.

To determine if a record was indexed by Google, we made an "info" request (e.g., info:http://oaister.org/) for each sampled URL against the Google Research API using the University Research Program for Google Search [3]. Either zero or one result was returned from the API. If a result was returned we marked that record as "found"; if no results were returned, we marked that record as "not found".
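
As a rough illustration, the query string for an "info" request and the found/not-found rule can be written as below. The sketch does not call the Google Research API, whose interface is not described here. It only builds the query and the browser-style search URL in the form used in references [4] and [5]. All function names are hypothetical.

```python
from urllib.parse import quote


def info_query(url: str) -> str:
    """Build the "info" query for one sampled URL, e.g. info:http://oaister.org/."""
    return f"info:{url}"


def info_search_url(url: str) -> str:
    """Build a Google web search URL for the info query (form used in refs [4], [5])."""
    # ':' and '/' are left unescaped so the result matches the cited URLs.
    return "http://www.google.com/search?q=" + quote(info_query(url), safe=":/")


def classify(result_count: int) -> str:
    """Map the number of results (zero or one) to the label used in the study."""
    if result_count not in (0, 1):
        raise ValueError("an info request is expected to return zero or one result")
    return "found" if result_count == 1 else "not found"


print(info_query("http://oaister.org/"))
print(info_search_url("http://projecteuclid.org/euclid.bams/1183524923"))
print(classify(0), "/", classify(1))
```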

Results and Caveats

Of the sampled records, 44.35% were found in Google. (See Table 2.)

| Group | Number of Records Found | Number of Records Not Found | % Records Found | % Records Not Found |
|---|---|---|---|---|
| Group A | 4,908 | 1,674 | 74.57% | 25.43% |
| Group B | 8,462 | 6,208 | 57.68% | 42.32% |
| Group C | 32,775 | 31,375 | 51.09% | 48.91% |
| Group D | 19,182 | 42,711 | 30.99% | 69.01% |
| All | 65,327 | 81,968 | 44.35% | 55.65% |

Table 2. Records found and not found in the Google search index.
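
The percentages in Table 2 can be recomputed from the raw counts, and the "95% confidence level (±1%)" target from the Methodology section can be checked with a normal-approximation margin of error. The sketch below is only a rough check under a stated assumption: it treats each group as a simple random sample, which simplifies the per-repository sampling actually used, and the article does not say which formula was applied. Group A was tested in full, so a sampling margin does not really apply to it.

```python
import math

# (found, not found) counts per group, taken from Table 2.
COUNTS = {
    "A": (4_908, 1_674),
    "B": (8_462, 6_208),
    "C": (32_775, 31_375),
    "D": (19_182, 42_711),
}


def percent_found(found: int, not_found: int) -> float:
    """Share of tested records that were found, in percent."""
    return 100.0 * found / (found + not_found)


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a 95% normal-approximation interval, in percentage points.

    p is the observed proportion (0..1) and n the sample size; assumes
    simple random sampling, which is a simplification.
    """
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n)


for group, (found, not_found) in COUNTS.items():
    n = found + not_found
    pct = percent_found(found, not_found)
    moe = margin_of_error(pct / 100.0, n)
    print(f"Group {group}: {pct:.2f}% found, {100 - pct:.2f}% not found (n={n}, +/-{moe:.2f})")

total_found = sum(found for found, _ in COUNTS.values())
total_not_found = sum(not_found for _, not_found in COUNTS.values())
print(f"All: {percent_found(total_found, total_not_found):.2f}% found")
```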

We spot-checked the sampling by choosing a few repositories and querying the Google search index for every URL in each of them. We chose one repository from each of Groups B, C and D. (Group A was already fully represented.) We chose these particular repositories because of the mostly even split between "found" and "not found" records in their samples, and we wanted to test the assumption that the same split would hold for all the records in those repositories. We found that our assumption was correct. (See Tables 3 and 4.)

| Repository Chosen Per Group | Number of Records Found | Number of Records Not Found | % Records Found | % Records Not Found | Total Records Sampled |
|---|---|---|---|---|---|
| Group B: UVicDSpace | 12 | 67 | 15.19% | 84.81% | 79 |
| Group C: Universität Frankfurt am Main Hochschulschriften OPUS | 105 | 170 | 38.18% | 61.82% | 275 |
| Group D: Digitala Vetenskapliga Arkivet (DiVA) | 148 | 103 | 58.96% | 41.04% | 251 |

Table 3. Original requests for three repositories to the Google search index.

 

| Repository Chosen Per Group | Number of Records Found | Number of Records Not Found | % Records Found | % Records Not Found | Total Records Sampled |
|---|---|---|---|---|---|
| Group B: UVicDSpace | 131 | 665 | 16.46% | 83.54% | 796 |
| Group C: Universität Frankfurt am Main Hochschulschriften OPUS | 1,961 | 3,555 | 35.55% | 64.45% | 5,516 |
| Group D: Digitala Vetenskapliga Arkivet (DiVA) | 14,813 | 10,291 | 59.01% | 40.99% | 25,104 |

Table 4. Requests for all records in the three repositories to the Google search index.
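
One simple way to read Tables 3 and 4 side by side is to compare the found rate in the sample with the found rate across all records for each repository. The sketch below does that with the counts from the two tables. The variable and function names are hypothetical, and the comparison is a plain difference in percentage points, not a formal statistical test.

```python
# (found, total) for the sampled requests (Table 3) and for all records (Table 4).
SPOT_CHECK = {
    "UVicDSpace": ((12, 79), (131, 796)),
    "Universität Frankfurt am Main Hochschulschriften OPUS": ((105, 275), (1_961, 5_516)),
    "Digitala Vetenskapliga Arkivet (DiVA)": ((148, 251), (14_813, 25_104)),
}


def found_rate(found: int, total: int) -> float:
    """Share of requested records that were found, in percent."""
    return 100.0 * found / total


def rate_difference(sample, full) -> float:
    """Sample rate minus full-repository rate, in percentage points."""
    return found_rate(*sample) - found_rate(*full)


for name, (sample, full) in SPOT_CHECK.items():
    print(
        f"{name}: sample {found_rate(*sample):.2f}%, "
        f"all records {found_rate(*full):.2f}%, "
        f"difference {rate_difference(sample, full):+.2f} points"
    )
```

For all three repositories the sample rate and the full rate differ by less than three percentage points.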

We are aware that URLs in OAI records can be constructed differently from URLs accessed by Google for its index in its normal course of operations. For instance, a record from the Project Euclid repository accessed via OAI has the URL http://projecteuclid.org/euclid.bams/1183524923, with the title "Every planar graph with nine points has a nonplanar complement". If you perform an info request for this resource in the Google search index, the article is not found [4]. If you look for the URL in Google with the addition of a "/handle" element (see Note 1) – http://projecteuclid.org/handle/euclid.bams/1183524923 – the article is found [5]. Both types of URLs resolve correctly on the Project Euclid site. We are not able to determine how widespread this case is across repositories.
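
To probe this caveat, one could test an alternative form of each URL alongside the one taken from the OAI record. The sketch below generates the "/handle" variant for the Project Euclid example. The function names are hypothetical, and the rewrite rule is known to apply only to this one example: other repositories may use entirely different URL patterns.

```python
from urllib.parse import urlsplit, urlunsplit


def with_path_prefix(url: str, prefix: str = "handle") -> str:
    """Insert a leading path element, e.g. /euclid.bams/1 -> /handle/euclid.bams/1."""
    parts = urlsplit(url)
    new_path = "/" + prefix + parts.path
    return urlunsplit((parts.scheme, parts.netloc, new_path, parts.query, parts.fragment))


def candidate_urls(url: str) -> list:
    """Return the OAI record URL followed by its "/handle" variant."""
    return [url, with_path_prefix(url)]


for candidate in candidate_urls("http://projecteuclid.org/euclid.bams/1183524923"):
    print(candidate)
```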

There is the potential that running all the records for the small repositories skewed the results by over-representing them. Conversely, choosing 1% of the records in each Group D repository could also have skewed the results by including too many records from a single, large repository.

Conclusions

Google's indexing does not seem to have retrieved more of the hidden web since the publication of the McCown et al. article in 2006. We would venture to conclude that Google has not endeavoured to increase its support for, and access to, OAI materials. Even taking into account the caveats, we would also conclude that aggregations of OAI records are at least as valuable for user research purposes as they were two years ago.

From our own experience, we know that providing the OAIster records in bulk to Google proved problematic for them, and eventually they requested only the OAIster URLs instead of the complete metadata. We are not, at this point, certain that Google is using these URLs (crawling them) for addition to their search index.

It is also interesting to note that Google has recently dropped support of OAI for website indexing [6]. Given the resulting numbers from our investigation, it seems that Google needs to do much more to gather hidden resources, not less. (Granted, the OAI for Sitemaps feature may not have been an appropriate approach for Google.)

We are very interested in others' evaluation of our data crunching. We would also like to encourage other OAI aggregators to run their metadata against the Google index, to prove or disprove our conclusions. Our source code and raw data are available upon request.

Acknowledgements

This research draws on data provided by the University Research Program for Google Search, a service provided by Google to promote a greater common understanding of the web.

Bibliography

[1] McCown, F., Liu, X., Nelson, M. L., and Zubair, M. "Search engine coverage of the OAI-PMH corpus." IEEE Internet Computing 10:2 (March/April 2006) pp. 66-73. <http://doi.ieeecomputersociety.org/10.1109/MIC.2006.41>.

[2] OAIster website. Accessed June 20, 2008. <http://www.oaister.org/>.

[3] University Research Program for Google Search website. Accessed June 19, 2008. <http://research.google.com/university/search/>.

[4] Info request for "http://projecteuclid.org/euclid.bams/1183524923" in Google search index. Accessed June 19, 2008. <http://www.google.com/search?q=info:http://projecteuclid.org/euclid.bams/1183524923>.

[5] Info request for "http://projecteuclid.org/handle/euclid.bams/1183524923" in Google search index. Accessed June 19, 2008. <http://www.google.com/search?q=info:http://projecteuclid.org/handle/euclid.bams/1183524923>.

[6] Mueller, J. "Retiring support for OAI-PMH in Sitemaps." Google Webmaster Central Blog (April 23, 2008). <http://googlewebmastercentral.blogspot.com/2008/04/retiring-support-for-oai-pmh-in.html>.

Note

1. The use of the term "handle" here does not refer to the Handle System®.
Copyright © 2008 Kat Hagedorn and Joshua Santelli

doi:10.1045/july2008-hagedorn