Comparing citation searching: Google, Bing, Google Scholar, Web of Science, L’Annee
May 10, 2010
This is a rather lengthy (in web terms) essay I wrote in October 2009 for a library school class. There’s a bit of library jargon; IR is “information retrieval.” Anything confusing, just ask in comments. I was surprised at the results, myself. Google Scholar is a pain and woefully poorly structured for the librarian, but it found more than the scholarly databases. Read on.
To evaluate web-based information retrieval systems, I decided to “vanity search” myself. The fact that I had a previous career as a PhD student in classical archaeology means I have academic publications to my credit, allowing me to evaluate IR systems for academic citations as well as general web search engines. By searching myself, I can compare the retrieved information with my own knowledge about my past publications, education, employment, and project participation. The “vanity search” is also made easier by the fact that I am the only living person with my unusual name: my late aunt shared my first and last names before her marriage, and a distant relative who was born in the late 19th century also shared my name as her maiden name, but these two are fairly easy to separate out and are relatively rare in IR searches in any case.
For this test I searched my name in several formats in two general web search engines, Microsoft’s new search engine Bing, and Google; and in three scholarly citation indexes: Google Scholar, a general index by Google in partnership with unnamed academic publishers which allows cited reference searching as well as providing citations; Thomson ISI’s Web of Science, a general index that is also the standard for cited reference searching; and L’Annee Philologique, the premier index for citations in the field of Classics. I searched both “Phoebe Acheson” and “Phoebe E. Acheson” (the name I consistently published under as an academic).
Web Search Engines
Google is the most widely used general internet search engine. First developed in the late 1990s, it remains the most commonly used product of the now wide-ranging Google firm. The search engine relies on crawlers which index web sites, and a search algorithm to rank results. The algorithm is proprietary and constantly updated, but includes weight for pages that are heavily linked to. Most Google searches are keyword searches of the full text of the web pages indexed, but relevance weighting seems to be given for keywords that appear in page titles or metadata, and keywords that appear close to one another. Spelling suggestions are possible (presumably through the use of ‘fuzzy searching’) and phrases can be searched using quotation marks. Google makes further search customization options available (see http://www.google.com/intl/en/help/features.html). The Google search interface is notoriously simple, but more complex searches are available using an Advanced Search page or the tips listed above.
A search for ‘Phoebe Acheson’ on Google returned 289,000 results; adding quotation marks reduced the number of results to 350. Included in the first three pages of results are my profile on Facebook (the top hit) and LinkedIn; pages reflecting my current job at the University of Georgia, my graduate student life at the University of Cincinnati (1995-2001), and my employment in the Duke Libraries (2002-2008); pages reflecting several archaeological field projects I participated in (NVAP 1997; KIP 2001; Durres Survey in Albania 2001; Lums Pond in Delaware 1994); and an incredibly cute photo of my then 9-month old son from my college alumni bulletin. There are several results for web resources that purport to find people that simply have long lists of names, including mine. There are two results that seem to refer to my early 20th century relative from Washington PA. There is also a citation of one of my articles (Acheson 1999) in a Google Books result that did not turn up when I searched in Google Scholar (see below).
Searching instead for ‘Phoebe E. Acheson’ returns 3110 results; adding quotation marks limits the results to “about 38” (but when I actually attempt to look at them, there are only 14). These results are heavily focused on my academic career, as this was the form of my name that I published under. The top two results are an academic article (Acheson 1997b), result 5 is related to another article (Acheson 1999), and results 6, 7, 10, 11, and 12 reference a third (Davis, Hoti et al. 2003). Two more hits are from my participation in the PRAP project in 1997, and another my work on the Durres Survey in 2001. One result references my current employment at the University of Georgia. The last two results are from a ‘people-finder’ and from a web page of classical mythology, referring to the compliment I gave this site by linking to it from my online syllabus for Classical Studies in 1999.
Microsoft Bing (www.bing.com)
Bing is a new (debuted June 2009) web search engine developed by the Microsoft corporation. Its basic search algorithm, while proprietary, would seem to operate similarly to that of Google, although as my test described below shows, the two search engines rank results quite differently. Bing offers several features that Google does not, including tracking search history and suggesting related searches on a left-hand sidebar. Bing will also recognize when a search is asking for, for example, airfares or football scores, and highlight this information in the results. Bing has an advanced search that allows the limiting of searches by domain or language.
A search for ‘Phoebe Acheson’ on Bing returns 11,000 results; adding quotation marks reduced this to 38. Although Bing finds many of the results that Google did, it also finds some new ones (my Twitter feed, local white pages listing, and a raffle winner at an Australian primary school who may be another living person with my name!), and does not find some of the ones that Google did (my relatives do not show up). The order of results is also strikingly different when compared to Google (the KIP project is on the first page of results), and in many cases a different page from the same domain is being found (e.g. Bing finds my name in the overall staff listing for the UGA libraries; Google found my name from the library’s domain but only as a representative to the library staff association). Bing also picks up a couple of spam blogs that are using my name, apparently repurposing a news release about the Durres field project that the University of Cincinnati sent out in 2001.
Changing the Bing search to Phoebe E. Acheson results in 10,500 hits; adding quotation marks reduces the results to only 8. As at Google, these were focused on my academic career as an archaeologist, with the addition of one reference to my current job. The first and 4th results are my article Acheson 1997b, the 5th and 7th are Davis, Hoti et al. 2003, and the 6th is Acheson 1999. The remaining two are references to my participation on the PRAP field season in 1997.
I was struck by the fact that both non-scholar Google and Bing retrieved many web sites related to my archaeological career, including a reference to an article (Acheson 1999) that did not turn up when I searched any of the scholarly indexes I tested (see below). They both also discovered associations between me and many field projects that no scholarly index uncovered. Since I did not publish articles related to my work on most of the field projects, it is not exactly a surprise that no scholarly index revealed my connections with them, but this information is useful for someone investigating an individual’s academic work, even if that work does not result in publications.
As for a comparison between Google and Bing, they seemed remarkably similar in the general outlines of what they retrieved, although the specifics were strikingly different in terms of relevance ranking, and some omissions (Twitter from Google) seemed inexplicable. The casual searcher would do equally well searching either, and the searcher looking for exhaustive coverage of a topic should search both.
Academic Article Databases and Search Engines
In these, I searched for myself as author, with the goal of finding the five publications I have to my name: Acheson 1997a; Acheson 1997b; Acheson 1999; Davis, Hoti et al. 2003; and Acheson and Davis 2005. The first is my Master’s Thesis, which I would not expect to be widely indexed. The second is an article I published as the sole author in a widely available journal. The third is an article of which I am the sole author, published in a book as the proceedings of a conference that took place in Belgium. The fourth is an article of which I am the 6th of 7 authors, published in one of the largest American classical studies journals. The fifth is a chapter I co-authored in English, which was translated into Greek by the editor of the Greek-published edited volume in which it appears. Full citations of all are available at the end of this essay.
L’Annee Philologique (http://www.annee-philologique.com/aph/, by subscription)
L’Annee (as it is commonly known) is a subject-specific database for Classical studies – languages, history, art, and archaeology. It originated as a print index in the 1920s and has been published annually since then. The index became available on CD-Rom in the 1990s, and a web version is now available. Entries from the print indexes covering 1924-2007 are now searchable through the online L’Annee, and new volumes are added annually; 2008 is expected to be available online in September 2010. The indexing work of L’Annee is supported by national research funds in France and the United States, as well as several academic institutions. It has offices in France, the US, Germany, Switzerland, Italy, and Spain, generally attached to academic institutions. Each office has a specific scope of materials to index, based on country of publication. L’Annee’s goal is to provide a comprehensive index of the international research literature in Classical Studies, and to that end it indexes about 1500 journals as well as books, festschriften, dissertations, and book reviews.
L’Annee’s online user interface [author’s note: the pre-April 2010 interface is discussed here] has long been a trial to researchers in Classics; the database is useful because of its content, and in spite of its interface (which is available in English). One can search by Modern Author (there seems to be some authority control), Full Text (which is a keyword search of the citation; the database does not contain full text of articles), Ancient Author (authority control is also in effect), Subjects and Disciplines (subject headings, which are nested although very broad – “archaeology” is one; also they were unfortunately changed with v. 67 (1997) so one can either search before-1997 headings or 1997-on headings, but not both), Date, and Other Criteria. Generally, to conduct an effective search on a topic requires the building of a search: for example, if one were looking for articles about the treatment of guests in the works of Homer one could search for the ancient author Homer, search for “guest” in the Full Text (making sure to search for the word meaning “guest” in at least German and French in addition to English), and then combine the result sets using AND in the search builder. L’Annee does allow citations to be emailed, downloaded, or exported to a bibliographic management software (directly to Refworks, through the use of a filter with EndNote.)
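The search-building step described above (one result set per field, then a Boolean AND) can be sketched in a few lines of Python. This is only an illustration of the logic, not L’Annee’s actual interface or data: the records, field names, and function names below are all invented, and simple substring matching stands in for whatever keyword matching the database really does.

```python
# Toy records standing in for index citations. IDs, titles, and field
# names are invented for illustration; they are not real L'Annee entries.
RECORDS = [
    {"id": 1, "ancient_author": "Homer", "text": "Guest and host in the Odyssey"},
    {"id": 2, "ancient_author": "Homer", "text": "Der Gast bei Homer"},
    {"id": 3, "ancient_author": "Homer", "text": "L'hôte et l'étranger dans l'Iliade"},
    {"id": 4, "ancient_author": "Homer", "text": "Ship catalogues in the Iliad"},
    {"id": 5, "ancient_author": "Herodotus", "text": "Guest-friendship in the Histories"},
]


def search_ancient_author(records, name):
    """Return the set of record IDs filed under the given ancient author."""
    return {r["id"] for r in records if r["ancient_author"] == name}


def search_full_text(records, keyword):
    """Return the set of record IDs whose citation text contains the keyword.

    Case-insensitive substring matching is a simplification.
    """
    needle = keyword.lower()
    return {r["id"] for r in records if needle in r["text"].lower()}


# Step 1: one result set for the ancient author.
homer = search_ancient_author(RECORDS, "Homer")

# Step 2: one result set for "guest", searched in English, German, and
# French and merged with OR (set union).
guest = set()
for word in ("guest", "Gast", "hôte"):
    guest |= search_full_text(RECORDS, word)

# Step 3: combine the two result sets with AND (set intersection).
combined = homer & guest
print(sorted(combined))
```

Dropping the German and French terms from step 2 would lose records 2 and 3, which is the reason for searching the keyword in several languages.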
The modern author search asks that one enter a surname alone, and a list of all modern authors with that surname is provided; the searcher can pick one or more. There are two modern authors with the surname Acheson in L’Annee: a G. J. Acheson who was a classical philologist in South Africa in the 1930s-1950s, and myself, listed as Phoebe E. Acheson. Only one work is listed for me, Acheson 1997b. A full and accurate citation is provided, with an abstract (not the abstract which was published with the paper, but an abstract written by the indexer, presumably). Links (opaquely labeled “SFX”) to the journal and the article directly in the University of Georgia’s holdings are provided, and they work.
Given L’Annee’s goal of comprehensiveness in Classical Studies, I was dismayed that only one of my publications appeared associated with me in its index. Davis, Hoti, et al. 2003 appears in L’Annee if I search for the author Davis, Jack L., but no other authors are listed for this article (et al. is used for all the co-authors), which has seven. I had hoped that because of L’Annee’s excellent coverage of non-English language sources, Acheson 1999 and Acheson and Davis 2005 might have been indexed by it. The conference proceedings in which Acheson 1999 appeared is indexed as a book, but apparently the individual articles were not separately included. While 35 citations of the editor of the volume in which Acheson and Davis 2005 appeared are indexed, this book is not.
Web of Science (http://thomsonreuters.com/products_services/science/science_products/scholarly_research_analysis/research_discovery/web_of_science, by subscription)
Web of Science is a widely used index of academic journal articles, produced by ISI, a subsidiary of Thomson Reuters. Web of Science indexes conference proceedings (120,000+), symposia, seminars, and other ephemera as well as journal articles (10,000+ journals) and books. Its major focus is the sciences, where Science Citation Index covers 6650 major journals, but there is relatively extensive coverage of the social sciences and humanities as well. All materials included in the database have their cited references captured at the time of indexing, allowing for cited reference searching. Web of Science has backfiles available to 1900, although there are various packages that institutions may buy and some choose a more limited backfile. For materials published since 1991, about 70% of records include full English-language abstracts. Included in the index are author(s) (with contact addresses, affiliations, and email), source information, author keywords, “KeyWords Plus” (apparently supplied by the indexers), subject category, and standard numbers including DOI. Direct links to full text of articles are available for some entries, and all support open url link resolvers.
The search interface for Web of Science has improved a great deal since I first used it in 2002. The default search is by “topic” (keyword for citation and abstract, apparently), but other options include title, author, group author, publisher, date, and even such unusual features as funding agency and grant number. Three boxes are set up which allow Boolean searching (defaulting to ‘and’.) Authors must be entered in a specific format, but an Author Finder link allows the user to better find all articles by a single author and is very helpful in distinguishing authors with common names, like John Smith. The initial search allows limiting by date. A separate Cited Reference Search is available, allowing one to see which articles have cited a given publication. Search results are ranked by date, and can be refined by faceting using the left sidebar, by subject area, document type, author, source, and other aspects. Web of Knowledge, the ‘parent’ database of Web of Science, also includes ISI’s Journal Impact Factor rankings, which are searchable by discipline. The database allows customization upon logging in, alerts, saved searches, and has a close connection to EndNote web.
Using the Author Finder feature, I searched for all articles in the Web of Science database authored by a P. Acheson. There were two: Davis, Hoti, et al. 2003, and a book review in a linguistics journal by another author with my initial. Web of Science reported that Davis, Hoti, et al. 2003 has been cited 4 times by other articles, in 2002, 2007 (twice) and 2008. It was not very surprising to me that this was the only citation of mine to appear in Web of Science. Its coverage is most comprehensive for the sciences. Davis, Hoti, et al. 2003 appears in one of the largest American journals covering classical studies; the rest of my citations are more obscure.
Google Scholar (scholar.google.com)
Google Scholar is a free online search engine that partners with a number of scholarly publishers (e.g. Highwire, CSA) to crawl through their databases for citations. It also includes books digitized by Google Book Search in its results, and some government web sites. There is no explicit statement of what Google Scholar is searching, and its scope could change without being announced. Google Scholar does not index materials in its search in the traditional by-hand way that librarians mean when they refer to indexing; rather, a computer algorithm creates the index and ranks the results, based on unspecified characteristics that seem to include prevalence of keywords in citations and possibly full text, and possibly citation levels for scholarly articles. Date does not seem to be a major factor in ranking the results, although there is a link to show “Recent Articles” as opposed to “All Articles.” “Recent” is not defined. The results list includes the title, which is hotlinked and allows the user to click through to the source (i.e. the web page of the scholarly publisher offering the full-text article for sale, in many cases). The source page may include an abstract. The result list also includes author, a source, a date, and the publisher. Beneath this information is a snippet that appears to be pulled from the full-text of the article or resource, set off by ellipses, and including the keyword or phrase searched. At the bottom of the entry in the results list are links allowing the user to see what other articles have cited this one, “related articles,” and if one is at a participating institution, a link resolver to find the content at one’s institution. Some articles are available as .html or .pdf, and links are included if this is the case.
The search interface is familiar to users of the main Google search engine: a single box. An Advanced Search is available, and allows the user to limit the search to certain disciplines, and to search by author, publication, and date (range). There is no way to re-rank or sort the results, and an existing result set cannot be narrowed down except by adding keywords to the search and re-searching.
When I searched ‘Phoebe E. Acheson’ in Google Scholar, the top result was Acheson 1997b. However, the citation is misdated to 1998, causing the automatically generated “Find It @ UGA” search (to which Google Scholar provides a link) to fail to find the article. According to Google Scholar, this article has been cited 11 times, in books and articles from 1999 to 2008; all of those I followed up did indeed cite Acheson 1997b. The second result was Davis, Hoti, et al. 2003, which Google Scholar states is cited by 6 other publications, including three not found by Web of Science; all appear to be genuine. The 4th result is Acheson 1997a, and the 8th result finds me thanked in the preface of a book published in 1998, which I helped put together as part of a graduate assistantship. Results 3, 5, 6, 7, 9 and 10 are not related to me. Adding quotation marks around the search “Phoebe E. Acheson” limits the results to Acheson 1997b, Davis, Hoti et al. 2003, and Acheson 1997a.
Changing the search to ‘Phoebe Acheson’ still finds Acheson 1997b, Davis, Hoti et al. 2003, and Acheson 1997a as the top three results. Some new interesting results turn up here, however: a mention of me on a faculty member’s curriculum vitae as a student he supervised in a directed reading; thanks in a friend’s and my husband’s dissertations; and thanks in the book Banking on Baghdad, whose author it would seem I helped with a reference question when I worked at Duke University. The results not immediately relevant to me include my late aunt’s master’s thesis and a mention of my distant relative in a historical work about Washington, PA. The second page of results has a few more acknowledgements of my assistance from books in Google Books, before tapering off into other matters. Adding quotation marks to make the search “Phoebe Acheson” removes the most irrelevant results but leaves in my aunt and distant cousin.
The results of this test were surprising to me. While it was the least sophisticated in allowing me to structure a search, and provided by far the largest number of completely irrelevant citations, Google Scholar also was the only academic index to find more than one of my five publications (it found three). I was disappointed that L’Annee, with its attention to international publications, inclusion of books in addition to scholarly articles, and hand-indexing by academics working in the discipline it covers, did not manage to index more than one of my publications. This result adds to the argument that Google Scholar is a worthwhile (if sometimes frustrating) resource for academic work, and also informs my work as a subject liaison – I will begin to encourage the faculty and students I work with to search multiple scholarly indexes when they want a comprehensive search, and not limit themselves to L’Annee, “gold standard” though it may be.
Google Scholar also performed well in the area of cited reference searching, when compared to Web of Science. Partly this is because Google Scholar had records for more of my publications than Web of Science did – I was gratified to learn that at least 11 publications have cited my article (Acheson 1997b), which was not indexed in Web of Science. Google Scholar also found more citations of Davis, Hoti et al. 2003 than Web of Science did, although Web of Science found one that Google Scholar did not. Again, I would recommend that academics doing cited reference searches (as for the purpose of proving the impact of their scholarship for tenure review) use both Web of Science and Google Scholar (while being careful to hand-check the results in Google Scholar especially, as I have found errors there).
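The case for searching both can be made concrete with the counts reported earlier for Davis, Hoti et al. 2003: Web of Science listed 4 citing works, and Google Scholar listed 6, three of which Web of Science lacked. A short Python sketch of the arithmetic follows. The labels are placeholders, since only the counts are given here and not the citing works themselves, and the function name is my own.

```python
# Placeholder labels only: the essay reports counts, not the identities
# of the citing works.
shared = {"shared-1", "shared-2", "shared-3"}
google_scholar = shared | {"gs-only-1", "gs-only-2", "gs-only-3"}  # 6 in total
web_of_science = shared | {"wos-only-1"}                           # 4 in total


def compare_citing_sets(a, b):
    """Summarize the overlap between two sets of citing works."""
    return {
        "both": len(a & b),
        "only_first": len(a - b),
        "only_second": len(b - a),
        "combined": len(a | b),
    }


summary = compare_citing_sets(google_scholar, web_of_science)
print(summary)
```

On these counts the two indexes together yield 7 distinct citing works, more than either one reports alone.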
My vanity search seems to have worked well as a test of these various information retrieval systems. Of the five academic publications I hoped to find citations to, four were uncovered by at least one of the systems tested. Acheson 1997a was only indexed by Google Scholar. Acheson 1997b was found by Google, Bing, Google Scholar, and L’Annee. Acheson 1999 was not found by any of the scholarly indexes, but was found by Bing and Google. Davis, Hoti et al. 2003 was found while searching my name in Web of Science, Google Scholar, Google, and Bing, but not in L’Annee, although the article was indexed by L’Annee. Acheson and Davis 2005 was not found by any of the information retrieval systems tested; since my contribution was a book chapter in a book written in Greek, published in Greece, and only four years old, this is not surprising, although it is disappointing. A check of WorldCat revealed that only two libraries have holdings for this book: the Museum of Fine Arts, Boston, and an academic library in Germany. (Although I donated a copy I received to the Duke Library in 2006; presumably it has not yet been processed or the record has not made its way to WorldCat.) Thus, the limited availability of this publication accounts for its lack of indexing.
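The tally in the paragraph above is easier to check when laid out as a small table of which system found which publication under my name. Here is a minimal Python sketch; the data are transcribed from that paragraph, while the variable and function names are my own.

```python
SYSTEMS = ["Google", "Bing", "Google Scholar", "Web of Science", "L'Annee"]

# Which systems retrieved each publication when searching my name.
FOUND_BY = {
    "Acheson 1997a": {"Google Scholar"},
    "Acheson 1997b": {"Google", "Bing", "Google Scholar", "L'Annee"},
    "Acheson 1999": {"Google", "Bing"},
    "Davis, Hoti et al. 2003": {"Google", "Bing", "Google Scholar", "Web of Science"},
    "Acheson and Davis 2005": set(),
}


def coverage_by_system(found_by):
    """Count how many publications each system retrieved."""
    return {
        system: sum(1 for systems in found_by.values() if system in systems)
        for system in SYSTEMS
    }


def found_anywhere(found_by):
    """List the publications retrieved by at least one system."""
    return [pub for pub, systems in found_by.items() if systems]


for system, count in coverage_by_system(FOUND_BY).items():
    print(f"{system}: {count} of {len(FOUND_BY)}")
print("Found by at least one system:", len(found_anywhere(FOUND_BY)))
```

Counted this way, Google, Bing, and Google Scholar each retrieved three of the five publications, while Web of Science and L’Annee each retrieved one.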
For my academic and professional (as well as personal) affiliations, both Google and Bing provided good information. The scholarly indexes could only give my professional affiliation at the time a given article was published, now well out of date, and could not associate me with field projects in which I participated but was not listed as an author on their publications. There were some field projects that Google and Bing did not associate me with (Aidone 1994, Megiddo 1996, Corinth 1999, Iklaina 1999) and some of my academic affiliations did not turn up in the search engines (my undergraduate degree, except through my alumnae bulletin, and my years abroad at College Year in Athens and the American School of Classical Studies, Athens). It is possible that Bing and Google were not able to retrieve this information about me because it is not available on a web page, but it is also possible that it is available on web pages not indexed by their systems. Unfortunately it is impossible to tell which is the case.
Acheson, Phoebe E., 1997a. Regional Studies and the Agricultural Economy: Case Studies from the Southern Argolid, Greece, and Kommos, Crete. Master’s Thesis, Department of Classics, University of Cincinnati.
Acheson, Phoebe E., 1997b. “Does the ‘Economic Explanation’ Work? Settlement, Agriculture and Erosion in the Territory of Halieis in the Late Classical-Early Hellenistic Period,” Journal of Mediterranean Archaeology, 10: 2, 165-190.
Acheson, Phoebe E., 1999. “The Role of Force in the Development of Early Mycenaean Polities,” in R. Laffineur, ed., POLEMOS. Le contexte guerrier en Égée à l’Âge du Bronze [Aegaeum 19] (Liège/Austin) 97-104.
Acheson, Phoebe E. and Jack L. Davis, 2005. “Periphereiakes meletes, archaiologiki epiphaneiaki erevma kai archaiologia tou topio stin Ellada,” in P. Doukellis, ed., To ellēniko topio : meletes istorikēs geōgraphias kai proslēpsēs tu topou. (Athens: Estias) 33-58.
Davis, Jack L., Afrim Hoti, Iris Pojani, Sharon R. Stocker, Aaron D. Wolpert, Phoebe E. Acheson, and John W. Hayes, 2003. “The Durrës Regional Archaeological Project: Archaeological Survey in the Territory of Epidamnus/Dyrrachium in Albania,” Hesperia 72:1 (41-119). doi: 10.2972/hesp.2003.72.1.41