Wikipedia:Google Books and Metadata Errors

This is an essay.

It contains the advice or opinions of one or more Wikipedia contributors. This page is not an encyclopedia article, nor is it one of Wikipedia's policies or guidelines, as it has not been thoroughly vetted by the community. Some essays represent widespread norms; others only represent minority viewpoints.

Shortcut

WP:GBT

This page in a nutshell: Google Books is a useful tool. But sometimes, it's a trap.

Google Books is an extremely valuable resource for research and for identifying references. It must be used with caution, however. Google Books is prone to errors, some of them substantial. For some sources, especially those available only in Snippet View, it is difficult or impossible to verify that the data Google provides is accurate. When the source being consulted is actually a magazine or journal article that has been treated like a book, errors are far more likely.

In these cases, Google Books can be a trap.

Tiers of access

There are four tiers of Google Books access, depending on the rights Google has available. Works in the public domain can be displayed in full. These are (generally) faithful scans of the originals, and for Wikipedia's purposes are a convenient archive of works that are often difficult to access in their physical form. The second tier is the one most familiar to casual users of Google Books: copyrighted material from a publisher participating in Google's Partner Program is displayed as a Preview. Previews include full pages (sometimes with a placeholder graphic in place of photographs or illustrations). The amount of the book that can be accessed depends on the terms of the partnership, but it is often substantial, allowing claims to be verified and evaluated in context. Below the Preview tier is Snippet View, used where Google does not have rights to display the work. Snippet View relies on the concept of fair use for educational purposes to make small fragments of a page visible in response to searches. There are substantial restrictions on what can be seen via Snippet View. The final tier is for works which Google has indexed but not scanned (as well as a few works whose publishers specifically opt out of Google Books); these provide only the metadata, and do not permit the actual content to be viewed at all.

Metadata and errors

Google Books' metadata (its determination of title, author, publisher, publication date, and so forth) is often flawed. Google uses an algorithmic process to identify this data. Problems with the reported metadata are common, but the general public rarely encounters them. Wikipedia editors performing in-depth research using Google Books are more likely to encounter these errors, and the errors they encounter are more likely to be serious.

In general, older sources are probably more likely to have problems, but it is always worth checking the copyright page of anything accessed exclusively through Google Books. Copyright dates are frequently incorrect, especially when dealing with reprinted or re-versioned books; in the terms used by Wikipedia's cite templates, Google often confuses year and origyear. Its publisher credits are also often close, but not quite correct. Especially when a publisher has changed names over the years, been acquired by a parent company, or published some books under different imprints, Google Books may fail to identify the correct publisher credit. This is perhaps more frequent for those publishers participating in Google's Partner Program, as the companies are known to Google, understandably, under their current names. The "Pages displayed by permission of..." credit that appears at the bottom of the left sidebar for many sources is particularly prone to this problem.

Snippet View, needless to say, makes everything worse. It is almost never possible to view the copyright page via Snippet View, at least not in its entirety. This means that errors in the metadata can escape scrutiny. Snippet View also makes it difficult to examine claims in context; it is very easy for material accessed this way to be misinterpreted, even when viewed with the best intentions.

Journal articles

Far and away the most serious problems occur when Google digitizes something that isn't a book, but makes it available through Google Books. Google has digitized a nontrivial number of magazines and scholarly journals. Unfortunately, its metadata algorithms often fail catastrophically while processing these works. Even when they succeed, they generally discard information considered mandatory for bibliographic entries on Wikipedia.

In many cases, working from bound volumes or collections, Google will treat all the issues in a single volume of a journal as though they were a single book, or one volume of a multi-book set such as an encyclopedia. Invariably, Google will produce a malformed title. Often, it will identify someone (or some organizational entity) as the "author", making it appear as though the entire work was written by a single hand.
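The warning signs described above (a generic serial-style title, or an organization credited as "author") can be checked for mechanically before a record is trusted. The following is only a rough sketch: the function name `serial_warnings` and both word lists are hypothetical illustrations, not part of any Google or Wikipedia tool, and a clean result does not mean the metadata is correct.

```python
# Hypothetical heuristic: the word lists below are illustrative assumptions,
# not an official or exhaustive vocabulary.
SERIAL_TITLE_WORDS = {"proceedings", "transactions", "bulletin", "journal", "papers"}
ORG_AUTHOR_WORDS = {"society", "club", "center", "centre", "institute",
                    "association", "sentā"}


def _words(text):
    """Lower-case the words of a string and strip trailing punctuation."""
    return {word.strip(".,:;").lower() for word in text.split()}


def serial_warnings(title, author=None):
    """Return reasons a record may be a journal volume rather than a book."""
    warnings = []
    if _words(title) & SERIAL_TITLE_WORDS:
        warnings.append("title looks like a serial name")
    if author and _words(author) & ORG_AUTHOR_WORDS:
        warnings.append("author looks like an organization")
    return warnings


# Records modelled on the two examples discussed in this essay.
print(serial_warnings("Proceedings"))
print(serial_warnings("Space Archaeology", "Shiruku Rōdo-gaku Kenkyū Sentā"))
```

Any warning is a prompt to look at the scan itself (cover, contents page, running heads) rather than a verdict; the check cannot recover the article-level author, title, or page range that Google discarded.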

Examples

These mistakes are easy to make. These examples were taken from content that was submitted to the Featured Article Candidates process.

Example 1

Source

A statement from page 1 was used in the article Bentworth. This work is entirely in the public domain, so it can be viewed in full. But that doesn't mean Google got it right, nor is it immediately evident what the problems are. If Google were trusted, our reference might be:

Proceedings. Vol. 4. Hampshire Field Club and Archaeological Survey. 1905.

But that's incomplete at best. This is actually a scan of a bound volume of the collected issues of a journal. There are probably a couple of ways to cite this, depending on whether you want to format the citation to the journal issue or the bound volume. Because whoever scanned this obscured much of the copyright page of the bound work itself (by scanning their hand), I would opt to cite the issue. Needless to say, this doesn't look much like the previous reference:

Shore, T. W. (1899). "Bentworth and its historical associations". Papers and Proceedings of the Hampshire Field Club and Archaeological Society. 4 (1): 1–15.
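In wikitext, a journal reference of this kind is normally written with the {{cite journal}} template. As a minimal sketch, the short Python helper below (the name `render_cite_journal` is hypothetical, introduced only for illustration) assembles the template call from the article-level details given above; the parameter names used (`last`, `first`, `year`, `title`, `journal`, `volume`, `issue`, `pages`) are standard {{cite journal}} parameters.

```python
def render_cite_journal(**fields):
    """Build a {{cite journal}} call from keyword fields, keeping their order."""
    # Skip empty values so that unknown details are simply left out.
    params = " ".join(f"|{name}={value}" for name, value in fields.items() if value)
    return "{{cite journal " + params + "}}"


citation = render_cite_journal(
    last="Shore",
    first="T. W.",
    year="1899",
    title="Bentworth and its historical associations",
    journal="Papers and Proceedings of the Hampshire Field Club and Archaeological Society",
    volume="4",
    issue="1",
    pages="1–15",
)
print(citation)
```

None of the author, article title, issue, or page range comes from Google's metadata; each has to be read from the scanned pages themselves.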

Example 2

Source

A statement in the article Palmyra was cited to content in an English-language work published in Japan, and available via Google Books' Snippet View. The claim itself isn't relevant here, and Snippet View search was sufficient to justify use of the material, which appeared on page 19. Based on what Google has to say about that work, a Wikipedia reference for it, using the cite family of templates, might look like this:

Sentā, Shiruku Rōdo-gaku Kenkyū (1995). Space Archaeology. Research Center for Silk Roadology.

Unfortunately, that's entirely incorrect. The limitations of Snippet View make it almost impossible to notice the problems. The work is represented not by its cover or title page but by its Table of Contents, downsampled so far that it's impossible to make out meaningful text. This material is actually a scholarly journal: the 1995 volume of Silk Roadology, the bulletin of the Research Center for Silk Roadology, in Japan. "Space Archaeology" is the primary topic of this issue, and appears in large letters on the cover; Google's algorithm misidentified this as the title of the "book". "Shiruku Rōdo-gaku Kenkyū Sentā" might look like a person's name, but in fact it's a romanization of the Japanese rendering of "Research Center for Silk Roadology": "Shiruku" is "Silk", "Rōdo" is "Road", "Sentā" is "Center", "Kenkyū" is "Research", and presumably "-gaku" is "-ology".

Finding this information in the first place is nontrivial. In any case, once you know the actual paper being cited, it's possible to dig it out of the contents via Snippet View. A correct citation for this work would look more like:

Izumi, Takura (1995). "The remains of Palmyra, the city of caravans, and an estimation of the city's ancient environment". Silk Roadology. 1: 19–26.