Google Still Not Indexing Hidden Web URLs

D-Lib Magazine
July/August 2008

Volume 14 Number 7/8

ISSN 1082-9873

Kat Hagedorn
Metadata Harvesting Librarian
Digital Library Production Service, University of Michigan Libraries
Ann Arbor, MI
<khage@umich.edu>

Joshua Santelli
Applications Programmer
Digital Library Production Service, University of Michigan Libraries
Ann Arbor, MI
<santelli@umich.edu>

Introduction

This report is a follow-up to the McCown et al. article in IEEE Internet Computing two years ago, in which the researchers investigated the percentage of URLs from OAI records found in the Google, Yahoo and MSN search indexes [1]. We were interested in whether Google in particular had increased the number of OAI-based resources in its search index.

To this end, we applied a slightly different methodology to the OAIster [2] metadata corpus to see what percentage of the corpus was found in the Google search index alone. OAIster harvests and aggregates OAI metadata with links to digital resources; records without links to digital objects are removed during our transformation and indexing process.

Methodology

On June 6, 2008, a snapshot was taken of the harvested content in OAIster. The snapshot contained 978 repositories comprising 16,276,756 records. Each repository was placed into one of four groups based on the number of Dublin Core records indexed in OAIster.

Group A is made up of repositories with 100 or fewer records; Group B has 101 to 1,000 records; Group C has 1,001 to 10,000 records; and repositories with more than 10,000 records were put into Group D. (See Table 1.)

	Number of Repositories in Group	Number of Records Sampled in Each Group
Group A: ≤100 Records	139	6,582
Group B: 101 - 1,000 Records	334	14,670
Group C: 1,001 - 10,000 Records	363	64,150
Group D: >10,000 Records	142	61,893

Table 1. Randomly sampled records for each group.
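The grouping rule can be written as a small function. This is a minimal sketch in Python; the function name `assign_group` is ours and is not taken from the authors' source code.

```python
def assign_group(record_count: int) -> str:
    """Return the size group (A to D) for a repository.

    Thresholds follow the grouping described above:
    A: 100 or fewer, B: 101 to 1,000, C: 1,001 to 10,000, D: more than 10,000.
    """
    if record_count < 0:
        raise ValueError("record_count must be non-negative")
    if record_count <= 100:
        return "A"
    if record_count <= 1_000:
        return "B"
    if record_count <= 10_000:
        return "C"
    return "D"


if __name__ == "__main__":
    # Illustrative repository sizes only.
    for size in (100, 101, 10_000, 10_001):
        print(size, assign_group(size))
```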

Since OAIster only indexes records with URLs, each record has at least one URL. For records containing more than one URL, a single URL was selected at random from within the record.

Because Group A was very small (6,582 records), we selected all of its records to run against the Google search index. From Group B, we randomly selected 10% of the records from each repository; from Group C, we selected 5% from each repository; and from Group D, we selected 1% from each repository. With this method, we selected and tested a total of 147,295 URLs. Sample sizes were chosen to maintain at least a 95% confidence level (±1%).
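The per-repository sampling step might look like the sketch below. It is an illustration under stated assumptions, not the authors' code: the names `SAMPLE_PERCENT`, `sample_size` and `sample_repository` are hypothetical, and the choice to round sample sizes down is our assumption (the article does not state a rounding rule, although rounding down is consistent with the repository sample sizes reported later in the article).

```python
import random

# Percentage of records sampled from each repository, by group.
SAMPLE_PERCENT = {"A": 100, "B": 10, "C": 5, "D": 1}


def sample_size(record_count: int, group: str) -> int:
    """Number of records to sample; integer arithmetic rounds down (assumption)."""
    return record_count * SAMPLE_PERCENT[group] // 100


def sample_repository(records, group, rng):
    """Pick sampled records, then one URL at random from each.

    `records` is a list of records, each given as a non-empty list of URLs.
    """
    chosen = rng.sample(records, sample_size(len(records), group))
    return [rng.choice(urls) for urls in chosen]


if __name__ == "__main__":
    rng = random.Random(2008)  # fixed seed so the sketch is repeatable
    # Hypothetical repository: 250 records, two URLs per record.
    repo = [[f"http://example.org/{i}", f"http://example.org/{i}/pdf"]
            for i in range(250)]
    print(len(sample_repository(repo, "B", rng)))  # 10% of 250 records
```

Passing the random number generator in explicitly keeps a run reproducible, which matters if others are to re-run the sampling.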

This method differs from that of McCown et al. They grouped the records using a different method, and they randomly selected 1,000 records from each group. They also searched MSN, Yahoo and Google while we searched for the records only in the Google search index.

To determine if a record was indexed by Google, we made an "info" request (e.g., info:http://oaister.org/) for each sampled URL against the Google Research API using the University Research Program for Google Search [3]. Either zero or one result was returned from the API. If a result was returned we marked that record as "found"; if no results were returned, we marked that record as "not found".
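The found/not-found decision can be sketched as follows. The interface of the Google Research API is not described in this report, so the `search` argument below is a hypothetical stand-in: any callable that takes a query string and returns a sequence of results. Only the `info:` prefix comes from the text above.

```python
from typing import Callable, Sequence


def info_query(url: str) -> str:
    """Build an "info" query for a URL, e.g. info:http://oaister.org/."""
    return "info:" + url


def classify(url: str, search: Callable[[str], Sequence[object]]) -> str:
    """Mark a URL as "found" if the search returns any result."""
    results = search(info_query(url))
    return "found" if len(results) > 0 else "not found"


if __name__ == "__main__":
    # Fake index used only to exercise the logic; not real Google data.
    fake_index = {"info:http://oaister.org/"}

    def fake_search(query: str) -> list:
        return [query] if query in fake_index else []

    print(classify("http://oaister.org/", fake_search))
    print(classify("http://example.org/missing", fake_search))
```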

Results and Caveats

Of the sampled records, 44.35% were found in Google. (See Table 2.)

	Number of Records Found	Number of Records Not Found	% Records Found	% Records Not Found
Group A	4,908	1,674	74.57%	25.43%
Group B	8,462	6,208	57.68%	42.32%
Group C	32,775	31,375	51.09%	48.91%
Group D	19,182	42,711	30.99%	69.01%
All	65,327	81,968	44.35%	55.65%

Table 2. Records found and not found in the Google search index.
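The percentages can be recomputed from the raw counts, which is a useful check that the two percentages in each row sum to 100%. The sketch below also estimates a margin of error for the overall proportion. That part is our assumption: it uses the standard normal approximation for a simple random sample at 95% confidence (z = 1.96) and ignores the stratified design, and the article does not say how its own confidence figure was calculated.

```python
import math

# (found, not found) counts per group, copied from Table 2.
COUNTS = {
    "A": (4908, 1674),
    "B": (8462, 6208),
    "C": (32775, 31375),
    "D": (19182, 42711),
}


def percent_found(found: int, not_found: int) -> float:
    """Percentage of tested records that were found."""
    return 100.0 * found / (found + not_found)


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation margin of error for a proportion p over n trials."""
    return z * math.sqrt(p * (1.0 - p) / n)


total_found = sum(found for found, _ in COUNTS.values())
total_tested = sum(found + not_found for found, not_found in COUNTS.values())
overall = total_found / total_tested

for group, (found, not_found) in COUNTS.items():
    pct = percent_found(found, not_found)
    print(f"Group {group}: {pct:.2f}% found, {100.0 - pct:.2f}% not found")
print(f"All: {100.0 * overall:.2f}% found of {total_tested} tested, "
      f"margin {100.0 * margin_of_error(overall, total_tested):.2f} points")
```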

We spot-checked the sampling by choosing a few repositories and requesting every URL in each of them from the Google search index. We chose one repository from each of Groups B, C and D. (Group A was already fully represented.) We chose these particular repositories because of the mostly even split between "found" and "not found" records in their samples, and we wanted to test the assumption that the same split would hold for all the records in the repositories. We found that our assumption was correct. (See Tables 3 and 4.)

Repository Chosen Per Group	Number of Records Found	Number of Records Not Found	% Records Found	% Records Not Found	Total Records Sampled
Group B: UVicDSpace	12	67	15.19%	84.81%	79
Group C: Universität Frankfurt am Main Hochschulschriften OPUS	105	170	38.18%	61.82%	275
Group D: Digitala Vetenskapliga Arkivet (DiVA)	148	103	58.96%	41.04%	251

Table 3. Original requests for three repositories to the Google search index.

Repository Chosen Per Group	Number of Records Found	Number of Records Not Found	% Records Found	% Records Not Found	Total Records Requested
Group B: UVicDSpace	131	665	16.46%	83.54%	796
Group C: Universität Frankfurt am Main Hochschulschriften OPUS	1,961	3,555	35.55%	64.45%	5,516
Group D: Digitala Vetenskapliga Arkivet (DiVA)	14,813	10,291	59.01%	40.99%	25,104

Table 4. Requests for all records in the three repositories to the Google search index.
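One way to quantify the spot check is to compare the found rate in each sample with the found rate over the whole repository. The sketch below does this with the counts from Tables 3 and 4; the variable names are ours. For all three repositories, the sampled rate is within about three percentage points of the full rate.

```python
# (records found, total records) per repository, copied from Tables 3 and 4.
SAMPLED = {
    "UVicDSpace": (12, 79),
    "Universität Frankfurt am Main Hochschulschriften OPUS": (105, 275),
    "Digitala Vetenskapliga Arkivet (DiVA)": (148, 251),
}
FULL = {
    "UVicDSpace": (131, 796),
    "Universität Frankfurt am Main Hochschulschriften OPUS": (1961, 5516),
    "Digitala Vetenskapliga Arkivet (DiVA)": (14813, 25104),
}


def found_rate(found: int, total: int) -> float:
    """Percentage of records found."""
    return 100.0 * found / total


# Difference in percentage points: sampled rate minus full-repository rate.
differences = {
    name: found_rate(*SAMPLED[name]) - found_rate(*FULL[name])
    for name in SAMPLED
}

for name, diff in differences.items():
    print(f"{name}: sample differs from full repository by {diff:+.2f} points")
```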

We are aware that URLs in OAI records can be constructed differently from URLs accessed by Google for its index in its normal course of operations. For instance, a record from the Project Euclid repository accessed via OAI has the URL http://projecteuclid.org/euclid.bams/1183524923, with the title "Every planar graph with nine points has a nonplanar complement". If you perform an info request for this resource in the Google search index, the article is not found [4]. If you look for the URL in Google with the addition of a "/handle" element¹ – http://projecteuclid.org/handle/euclid.bams/1183524923 – the article is found [5]. Both types of URLs resolve correctly on the Project Euclid site. We are not able to determine how widespread this case is across repositories.
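A follow-up study could test alternative forms of a URL before marking a record as "not found". The sketch below generates only the one variant described above, the "/handle" prefix seen on Project Euclid. The function names are hypothetical, and nothing here implies that the same rewrite applies to other repositories.

```python
from urllib.parse import urlsplit, urlunsplit


def handle_variant(url: str) -> str:
    """Insert "/handle" at the start of the URL path (Project Euclid pattern)."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path="/handle" + parts.path))


def candidate_urls(url: str) -> list:
    """URL forms to test, with the original OAI record URL first."""
    return [url, handle_variant(url)]


if __name__ == "__main__":
    for candidate in candidate_urls(
            "http://projecteuclid.org/euclid.bams/1183524923"):
        print(candidate)
```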

Running all the records for the small repositories may have skewed the results by over-representing those repositories. Alternatively, choosing 1% of the records in Group D repositories could also have skewed the results by including too many records from a single, large repository.

Conclusions

Google's indexing does not seem to have retrieved more of the hidden web since the publication of the McCown et al. article in 2006. We would venture to conclude that Google has not endeavoured to increase its support for, and access to, OAI materials. Even taking into account the caveats, we would also conclude that aggregations of OAI records are at least as valuable for user research purposes as they were two years ago.

From our own experience, we know that providing the OAIster records in bulk to Google proved problematic for them, and eventually they requested only the OAIster URLs instead of the complete metadata. We are not, at this point, certain that Google is using these URLs (crawling them) for addition to their search index.

It is also interesting to note that Google has recently dropped support for OAI in website indexing [6]. Given the resulting numbers from our investigation, it seems that Google needs to do much more to gather hidden resources, not less. (Granted, the OAI for Sitemaps feature may not have been an appropriate approach for Google.)

We are very interested in others' evaluation of our data crunching. We would also like to encourage other OAI aggregators to run their metadata against the Google index, to prove or disprove our conclusions. Our source code and raw data are available upon request.

Acknowledgements

This research draws on data provided by the University Research Program for Google Search, a service provided by Google to promote a greater common understanding of the web.

Bibliography

[1] McCown, F., Liu, X., Nelson, M. L., and Zubair, M. "Search engine coverage of the OAI-PMH corpus." IEEE Internet Computing 10:2 (March/April 2006) pp. 66-73. <http://doi.ieeecomputersociety.org/10.1109/MIC.2006.41>.

[2] OAIster website. Accessed June 20, 2008. <http://www.oaister.org/>.

[3] University Research Program for Google Search website. Accessed June 19, 2008. <http://research.google.com/university/search/>.

[4] Info request for "http://projecteuclid.org/euclid.bams/1183524923" in Google search index. Accessed June 19, 2008. <http://www.google.com/search?q=info:http://projecteuclid.org/euclid.bams/1183524923>.

[5] Info request for "http://projecteuclid.org/handle/euclid.bams/1183524923" in Google search index. Accessed June 19, 2008. <http://www.google.com/search?q=info:http://projecteuclid.org/handle/euclid.bams/1183524923>.

[6] Mueller, J. "Retiring support for OAI-PMH in Sitemaps." Google Webmaster Central Blog (April 23, 2008). <http://googlewebmastercentral.blogspot.com/2008/04/retiring-support-for-oai-pmh-in.html>.

Note

1. The use of the term "handle" here does not refer to the Handle System®.

doi:10.1045/july2008-hagedorn

Copyright © 2008 Kat Hagedorn and Joshua Santelli