Google Test
From Wikipedia, the free encyclopedia, by MultiMedia
Here are some ways to use Google, Alexa, Yahoo! and
Clusty to check articles and other information.
Types of Google tests
On Wikipedia, a Google Test is any use of Google or other
search engines as references. Several very
distinct kinds of information can be gleaned by this method. It should be
stressed that none of these applications is conclusive evidence, but simply
a first-pass heuristic or rule of thumb.
- Unencyclopedic or spurious topics. Some topics introduced to
Wikipedia articles don't belong here. Some of these can be detected by
running a Google search on a relevant phrase and counting the number of
search results. This technique works reasonably well for weeding out
hoaxes, fictions, and personal theories
and hypotheses. It can also be used to ascertain whether a topic is of
sufficiently broad interest to merit inclusion in the wiki, though this
application is highly subject to bias (see below). See Wikipedia:What
Wikipedia is not for a more comprehensive list of unencyclopedic topics.
- Copyrighted material. Large pieces of poorly wikified text,
submitted to the wiki all at once, particularly by a new or anonymous
user, are often copy-and-pasted from outside sources. Some of these are
submitted in violation of copyright. (See also Wikipedia:Spotting
possible copyright violations, Wikipedia:Copyrights.) A copy-and-paste
operation from an online source can often be detected by running
searches for excerpts.
- Idiosyncratic usage. The English language often has multiple
terms for a single concept, particularly given regional dialects. A
series of searches for different forms of a name reveals some
approximation of their relative popularity. For a quick comparison of
relative usage, try googlefight, e.g. comparing deoxyribose nucleic acid
and deoxyribonucleic acid. Note that there are cases where this Google test
can be overruled, such as when an international standard has been set, as
in the case of aluminium.
- Related sites. If an article is of high quality (see
Wikipedia:Featured articles), Google may be used to look for sites that
might take an interest in it and be convinced to link to it.
- Research. Of course, search
engines are good for finding sources of further information.
Techniques
The Google Web search is not the only Google search. In performing a
Google test, consider searching groups (USENET newsgroups). This is a
significantly different sample and represents, for the most part,
conversations in English conducted by people who are not deliberately trying
to sell products or reach a mass audience. Other things being equal, a
"groups" search will typically return very roughly 1/5 as many hits as a
"Web" search. Because Groups and Web searches have very different "systemic
biases," hit numbers are not directly comparable. Nevertheless, Groups
searches are particularly helpful in identifying entities whose Web presence
may have been artificially inflated by promotional techniques; it is
suspicious if a phrase gets, say, 100,000 Web hits but only 20 Groups hits.
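The comparison above can be sketched as a small calculation. This is only an illustration: the function names and the cutoff of 100 Web hits per Groups hit are assumptions chosen for the example, not thresholds stated anywhere in this guide, and the hit counts must be read off the search result pages by hand.

```python
def web_to_groups_ratio(web_hits: int, groups_hits: int) -> float:
    """Return the number of Web hits per Groups hit."""
    if groups_hits <= 0:
        # No Groups presence at all: treat the ratio as unbounded.
        return float("inf")
    return web_hits / groups_hits


def looks_inflated(web_hits: int, groups_hits: int, max_ratio: float = 100.0) -> bool:
    """Flag a phrase whose Web presence far outstrips its Groups presence.

    max_ratio is an assumed cutoff for this sketch, not a documented rule.
    A typical phrase is expected to have a ratio of very roughly 5.
    """
    return web_to_groups_ratio(web_hits, groups_hits) > max_ratio


# The suspicious case from the text: 100,000 Web hits but only 20 Groups hits.
print(web_to_groups_ratio(100_000, 20))  # 5000.0
print(looks_inflated(100_000, 20))       # True
# A phrase close to the typical 5:1 ratio is not flagged.
print(looks_inflated(100_000, 20_000))   # False
```

A flagged phrase is only a prompt for a closer look; as stressed above, none of these numbers is conclusive evidence.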
USENET postings are date-stamped and have been archived for over twenty
years, making them more useful than Web searches as a record of recent
history. Using a Groups "advanced search," it is possible to restrict a
search by date, which can help in identifying how recent the widespread use
of a term is.
Google News searches can assess whether
something is currently newsworthy. One characteristic of
Google News is that whereas it is easy and
inexpensive to create websites or post to USENET, it is harder to convince a
Google News source to run a story. Thus
Google News, in comparison to Web or Groups,
is less susceptible to manipulation by self-promoters. Note that
Google News indexes many "news" sources that
reflect specific points of view, and many news sources that are only of
local interest.
Depending on the subject, advanced search functions may be useful. For
example, adding "site:gov" or "site:edu" will restrict your search to U.S.
government sites or U.S. college and university sites.
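As a minimal sketch of the restriction described above, the following builds an exact-phrase query with an optional "site:" operator and turns it into a search URL. The helper names are hypothetical, and the URL pattern (a "q" parameter on google.com/search) is an assumption about how the query is submitted.

```python
from typing import Optional
from urllib.parse import urlencode


def restricted_query(phrase: str, site: Optional[str] = None) -> str:
    """Build an exact-phrase query, optionally limited with the site: operator."""
    query = f'"{phrase}"'
    if site:
        query += f" site:{site}"
    return query


def search_url(query: str) -> str:
    """Encode the query into a search URL (assumed URL pattern)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = restricted_query("deoxyribonucleic acid", site="edu")
print(query)              # "deoxyribonucleic acid" site:edu
print(search_url(query))
```

Passing "gov" instead of "edu" restricts the same phrase to U.S. government sites.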
Other tools that
may be useful for research include Google Scholar, which searches academic
literature.
Google Book Search can be valuable. As part of the world of print, Google
Book Search has a pattern of coverage that is in closer accord with
traditional encyclopedia content than the Web, taken as a whole, is; if it
has systemic bias, it is a very different systemic bias from Google Web
searches. Multiple hits on an exact phrase in Google Book Search provide
convincing evidence for the real use of the phrase or concept. Google Book
Search can locate print-published testimony to the importance of a person,
event, or concept. It can also be used to replace an unsourced "common
knowledge" fact with a print-sourced version of the same fact. Amazon.com's
"Search Inside The Book" also can be used.
Alexa test
Although Wikipedia is not a web directory, we can have articles about web
sites if they meet the same criteria for encyclopedic interest as other
articles.
If you're interested in writing a Wikipedia article about a particular web
site, just go to Alexa (http://www.alexa.com), and type in the URL. The
traffic rank may help you decide whether a site is important enough. Most
would agree that we should certainly have articles on top 100 sites, and
possibly on top 1,000 sites. For a site not in the top 100,000, most would
agree that popularity alone would not suffice to justify its inclusion in
Wikipedia. The intermediate area is a grey area where opinions differ.
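The rule of thumb above can be written down as a simple lookup. This is a sketch only: the function name and the wording of the returned labels are made up for the example, and the rank itself has to be read from Alexa by hand.

```python
from typing import Optional


def alexa_guidance(rank: Optional[int]) -> str:
    """Map an Alexa traffic rank to the rough guidance given in the text.

    rank is None when the site has no rank at all.
    """
    if rank is None or rank > 100_000:
        return "popularity alone does not justify an article"
    if rank <= 100:
        return "certainly merits an article"
    if rank <= 1_000:
        return "possibly merits an article"
    # Ranks from 1,001 to 100,000 fall in the intermediate range.
    return "grey area, opinions differ"


for rank in (42, 750, 25_000, 250_000, None):
    print(rank, "->", alexa_guidance(rank))
```

The result is a starting point for discussion, not a verdict, since the rank itself is subject to the biases described in this section.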
For some websites (e.g., microsoft.com) in the top thousand, a redirect to a
broader article may be appropriate: in that case, Microsoft. (This is
somewhat controversial.)
Also note that the Alexa rating includes significant bias, due to various
factors. For example, the Alexa software is only available for Microsoft
Windows and Microsoft Internet Explorer, and requires installation. So, for
instance, a website exclusively devoted to an Apple Macintosh-related topic
might not have an Alexa ranking that accurately represents its true traffic
activity. At the opposite extreme, some webmasters install the Alexa toolbar
for the sole purpose of improving their own rankings by visiting their own
web site with it. The Alexa toolbar's user base is small enough that one
frequent visitor can have a noticeable effect on overall results.
Google bias
When using Google to test for importance or existence, bear in mind that
this will be biased in favor of modern subjects of interest to people from
developed countries with Internet access, so it should be used with some
judgment. For example, a current popular-music group from the United States
will probably need many thousands of Google hits before most Wikipedians
consider it worthy of inclusion. A similarly important group in a country
with less Internet presence will have many fewer hits, if any. An important
musician of the 14th century might not show up on Google at all.
Q. What is the minimum number of matches you should see if a term is not
made up? (3? 27? 81?)
A. Perhaps a few hundred, but this depends on several things:
- The article's point of view: If narrow, fewer references are
required. Try to categorize the point of view (whether it is NPOV or
otherwise); e.g., notice the difference between Ontology and Ontology
(computer science).
- The subject: If it's about some historical person, one or two
mentions in reliable texts might be enough; if it's some Internet
neologism, it may be on 100 pages and might still not be considered
'existing' for Wikipedia's purposes.
- The type of sites you find: Pay attention to how open the sites are
about accepting submissions. The Urban Dictionary, for example, accepts
submissions freely. This is especially important if you suspect an
author is self-promoting, or is promoting an idiosyncratic viewpoint. A
single Internet user can submit the same ideas to message boards and
open-submission sites all over the Internet.
Further judgment: the Google test checks popular usage, not correctness.
For example, a search for the incorrect Charles Windsor gives 10 times more
results than the correct Charles Mountbatten-Windsor.
Also, some topics may not be on the Web because of low Internet use in
certain areas and cultures of the world.
Search results from Google are highly biased towards popular culture.
The article Scientists Use Google To Measure Fame vs. Merit, for example,
points out that Barry Williams ("Greg Brady" from The Brady Bunch) has 45%
more Google hits than Albert Einstein (2,400,000 vs. 1,660,000).
Especially when trying to determine the frequency of use of diacritic vs.
non-diacritic versions of a word, the internet (and therefore Google) is
extremely biased towards the non-diacritic versions. This is often more an
example of laziness and cluelessness of those who created the webpages than
a real test of usage. For example, spelling the weather phenomenon El Niño
as 'El Nino' is just plain wrong (it doesn't rhyme with keno, vino, or
Zeno). When Spanish words containing the letter ñ are naturalized into
English, the ñ is often converted to "ny" (as when cañon became canyon),
but "El Niño" is rarely spelled "El Ninyo" (and where that spelling does
appear, it is more likely to be on a non-English-language website). Yet
despite the fact that the spelling
should be El Niño, a Google test shows that there are more web pages with
"El Nino" than "El Niño" (8,830,000 vs. 7,970,000 as of September 2005).
Much better criteria for deciding upon the use of the diacritic vs.
non-diacritic versions of a word would be the entries in dictionaries, other
encyclopedias, and style guides.
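The hit counts quoted in this section can be compared with a few lines of arithmetic. The sketch below uses only the figures given in the text; the function names are made up for the example.

```python
from typing import Dict


def percent_more(a: int, b: int) -> float:
    """Return by what percentage a exceeds b."""
    return (a - b) / b * 100


def shares(hits: Dict[str, int]) -> Dict[str, float]:
    """Return each variant's share of the combined hit count, in percent."""
    total = sum(hits.values())
    return {term: count / total * 100 for term, count in hits.items()}


# Barry Williams vs. Albert Einstein, as quoted in the text.
print(round(percent_more(2_400_000, 1_660_000), 1))  # 44.6

# "El Nino" vs. "El Niño" (September 2005 figures from the text).
for term, share in shares({"El Nino": 8_830_000, "El Niño": 7_970_000}).items():
    print(term, round(share, 1))  # El Nino 52.6, then El Niño 47.4
```

The shares describe what Web authors typed and nothing more; they do not show which spelling is correct.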
Note that other Google searches, particularly Google Book Search, have a different
systemic bias from Google Web searches and give an interesting cross-check
and a somewhat independent view.
Non-applicable in case of pornography
The simple Google test by number of hits is not applicable to people or
titles within a number of internet-based businesses, most notably
pornography. This is because an entire sub-industry has appeared with the
sole purpose of increasing the number of Google hits certain subjects
receive. They achieve this by use of a number of techniques, including
multiple mirror sites, and spamming of notice boards and Wikipedia. Also,
pornographic actors tend to appear in production-line quantities of entirely
non-notable films. It is therefore necessary, as per Wikipedia:criteria for
inclusion of biographies, for the researcher to prove that the actor or
actress has established notability. This usually requires finding
journalistic coverage, independent biographies or extensive fan clubs.
Validity of the Google test
Given that the results of a Google test are interpreted subjectively, its
implementation is not always consistent. This reflects the fact that the
test is applied on a case-by-case basis.
In some cases, articles have been kept with Google hit counts as low as 15
and some claim that this undermines the validity of the Google test in its
entirety. However, in fact, this reflects on the rather uneven and
subjective nature of the Wikipedia:Articles for deletion process more than
on the usefulness of the Google test. The Google test has always been and
very likely always will remain an imperfect tool used to produce a general
gauge of notability. It is not and should never be considered definitive.
Major factors which may affect Google hit count include subjects from
countries where the Internet is not prevalent or topics which are of a
historical nature but have not yet been well documented on the Internet. In
other cases, it is completely speculative as to why a subject merits
inclusion with a hit count below 100 while other such articles are frequently
deleted.
Also note that the number of hits that Google reports is (sometimes or
perhaps always; the details are secret) an estimate, not an exact figure.
The number of hits reported by Google has little meaning until one navigates
to the last page of the results, since it is only then that Google applies
all criteria to a query (such as eliminating duplicates and applying spam
control). Often the hit count is cut by a factor of 10 (or much more) after
doing this. Jumping to the end of the results (or as far as is practical)
also reveals whether the hit count is actually related to the intended
meaning of the search term. Queries are further improved by setting the
results per page to the maximum value (which reduces duplicate results) and
excluding any domain of a biased party. For instance, "JoesRockBand.com"
should be excluded when searching for references to "Joe's Rock Band". For
longer-lasting articles, excluding the term "wikipedia" itself may be
needed to avoid counting all the mirrors and language versions of a
Wikipedia article. In fact, the AFD discussion itself, once archived and
indexed by Google, may add to the Google hit count used the next time the
item is discussed. Finally, some human labor has to be involved: a
manageable sample of the sites found must be opened individually to verify
the relevance of the hit count.
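The exclusions described above can be expressed with the search operators "-site:" (drop a domain) and "-term" (drop pages containing a word). The following is a minimal sketch with a hypothetical helper name; it only assembles the query string and does not perform any search.

```python
from typing import Iterable


def refined_query(
    phrase: str,
    exclude_sites: Iterable[str] = (),
    exclude_terms: Iterable[str] = (),
) -> str:
    """Build an exact-phrase query that leaves out given domains and terms."""
    parts = [f'"{phrase}"']
    parts += [f"-site:{site}" for site in exclude_sites]
    parts += [f"-{term}" for term in exclude_terms]
    return " ".join(parts)


# The example from the text: leave out the band's own domain and Wikipedia mirrors.
print(refined_query("Joe's Rock Band", ["JoesRockBand.com"], ["wikipedia"]))
# "Joe's Rock Band" -site:JoesRockBand.com -wikipedia
```

Even with a refined query, a sample of the remaining results still has to be opened and read by hand.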
Much, probably most, of the publicly available web pages in existence are
not indexed. Each search engine captures a
different percentage of the total. Nobody can tell exactly what portion is
captured.
The estimated size of the World Wide Web is at least 2 billion pages, but a
much deeper (and larger) Web, estimated at over 500 billion pages, exists
within databases whose contents the search
engines do not index. These dynamic web pages are formatted by a Web
server when a user requests them and as such cannot be indexed by
conventional search engines. The United States Patent and Trademark Office
website is an example; although a search engine
can find its main page, one can only search its database of individual
patents by entering queries into the site itself.
Foreign languages and non-Latin scripts
Claims for the non-notability of a topic are occasionally made based on
few Google hits, where a considerably larger number of hits would have
resulted from searching in the correct script or for various transcriptions.
An Arabic name, for instance, needs to be searched for in the original
script, which is easily done with Google, provided one knows what to search
for, but one also has to take into account that e.g. English, French and
German webpages will likely transcribe the name using different conventions.
In addition, different forms of a name used in the original language must be
searched for. A Russian personal name has to be searched for both including
and excluding the patronymic, and any search for names and other words in
strongly inflected languages should take into account that arriving at the
total number of hits may require searching for forms with varying
case-endings or other grammatical variations not obvious for someone who
does not know the language.
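As a rough illustration of how quickly the number of necessary searches grows, the sketch below lists exact-phrase queries for every combination of transcription, with and without the patronymic. The function name and the sample name are chosen for the example only. The forms must be supplied by someone who knows the language, and the original-script and inflected forms would have to be added to the lists in the same way.

```python
from itertools import product
from typing import Iterable, List


def name_queries(
    given_forms: Iterable[str],
    surname_forms: Iterable[str],
    patronymic_forms: Iterable[str] = (),
) -> List[str]:
    """List exact-phrase queries for each name form, with and without patronymic."""
    patronymics = list(patronymic_forms)
    queries = []
    for given, surname in product(given_forms, surname_forms):
        queries.append(f'"{given} {surname}"')
        for patronymic in patronymics:
            queries.append(f'"{given} {patronymic} {surname}"')
    return queries


# Two common transcriptions of one surname already yield four queries.
for q in name_queries(["Fyodor"], ["Dostoevsky", "Dostoyevsky"], ["Mikhailovich"]):
    print(q)
```

Each query still has to be run and judged separately; adding the resulting hit counts together would count pages that use more than one form twice.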
Doing a search like this requires a certain linguistic competence which not
every individual Wikipedian possesses, but the Wikipedia community as a
whole includes many bilingual and multilingual people. It is important
for nominators and voters on AFD at least to be aware of their own
limitations and not to present as conclusive a small number of Google hits
for, say, a Serbian poet without pointing out the limited validity of a
preliminary search using only one particular transcribed form of the name.
This guide is licensed under the GNU
Free Documentation License. It uses material from the Wikipedia.