PageRank
From Wikipedia, the free encyclopedia, by MultiMedia
PageRank, sometimes abbreviated to PR, is a family of algorithms for
assigning numerical weightings to hyperlinked documents (or web pages)
indexed by a search engine. The name is a play on words: the algorithm ranks
pages and was originally developed by Larry Page. Its properties are much
discussed by search engine optimization (SEO) experts. The PageRank system is
used by the popular search engine Google to help determine a page's relevance
or importance. It was developed by Google's founders Larry Page and Sergey
Brin while at Stanford University in 1998. As Google puts it:
PageRank relies on the uniquely democratic nature of
the web by using its vast link structure as an indicator of an
individual page's value. Google interprets a link from page A to page B
as a vote, by page A, for page B. But Google looks at more than the
sheer volume of votes, or links a page receives; it also analyzes the
page that casts the vote. Votes cast by pages that are themselves
"important" weigh more heavily and help to make other pages "important."
PageRank uses links as "votes"
In other words, a page's PageRank results from a "ballot" among all the
other pages on the World Wide Web about how important that page is. A
hyperlink to a page counts as a vote of support. The PageRank of a page is
defined recursively and depends on the number and PageRank of all pages that
link to it ("incoming links"). A page that is linked to by many pages with
high rank receives a high rank itself. If no pages link to a web page, that
page receives no support.
In early 2005, Google implemented a new value, "nofollow", for the rel
attribute of HTML link and anchor elements (written rel="nofollow"), so that
website builders and bloggers can make links that Google will not follow for
the purposes of PageRank; such links no longer constitute a "vote" in the
PageRank system. The nofollow value was added in an attempt to help combat
comment spam.
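As an illustration only, a crawler could separate ordinary links from nofollow links roughly as follows. This is a minimal sketch using Python's standard html.parser module. The class LinkCollector, the helper split_links and the example.com URLs are hypothetical names chosen for this example, and nothing here reflects how Google actually processes links.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href targets of <a> tags, separating rel="nofollow" links."""

    def __init__(self):
        super().__init__()
        self.followed = []  # links that would count as votes
        self.nofollow = []  # links excluded from the vote count

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return
        # rel may hold several space-separated tokens, e.g. "nofollow external"
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followed.append(href)


def split_links(html):
    """Return (followed, nofollow) lists of link targets found in html."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.followed, parser.nofollow


# Hypothetical page fragment: one editorial link and one comment link.
sample = (
    '<p><a href="http://example.com/a">editorial link</a> '
    '<a rel="nofollow" href="http://example.com/spam">comment link</a></p>'
)
followed, nofollow = split_links(sample)
print("votes:", followed)
print("ignored:", nofollow)
```

Only the links in the first list would be passed on to a PageRank computation under this sketch.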
The Google Toolbar displays PageRank as a value from 0 to 10. The scale
appears to be logarithmic, but its exact details are not public knowledge.
The name PageRank, a pun on the name Larry Page, is a trademark of Google.
The PageRank process has been patented (U.S. Patent 6,285,999). The patent
is assigned not to Google but to Stanford University.
An alternative to the PageRank algorithm is the HITS algorithm proposed by
Jon Kleinberg and the CLEVER project at IBM. Many HITS concepts are now
incorporated into Teoma and Ask Jeeves.
PageRank algorithm
Simplified
Suppose a small universe of four web pages: A, B, C and D. If pages B, C
and D each link only to A, then the PR (PageRank) of page A would be the sum
of the PR of pages B, C and D.
- $PR(A) = PR(B) + PR(C) + PR(D)$
But then suppose page B also has a link to page C, and page D has links to
all three pages. A page cannot vote twice; instead, it splits its vote over
the pages it links to. Thus, page B gives half a vote to page A and half a
vote to page C. By the same logic, page D divides its vote over three pages,
and only one third of D's vote is counted for A's PageRank.
- $PR(A) = \frac{PR(B)}{2} + \frac{PR(C)}{1} + \frac{PR(D)}{3}$
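In other words, divide the PR of each page by the total number of links
that come from that page. Writing L(X) for the number of links coming from
page X:
- $PR(A) = \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)}$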
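The vote splitting described above can be checked with a few lines of Python. This is a minimal sketch: the dictionary links and the function received_votes are names invented for this example, every page is assumed to hold a PR of 1 purely for illustration, and page A's own outgoing links are left empty because the text does not specify them.

```python
from fractions import Fraction

# Outgoing links in the four-page example.
# A's own links are not specified in the text, so they are left empty here.
links = {
    "A": [],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}


def received_votes(page, links, pr):
    """Sum PR(source) / L(source) over every source that links to page."""
    total = Fraction(0)
    for source, targets in links.items():
        if page in targets:
            total += Fraction(pr[source]) / len(targets)
    return total


# Assumption for illustration: every page currently holds a PR of 1.
pr = {page: 1 for page in links}

print(received_votes("A", links, pr))  # 11/6, i.e. 1/2 + 1/1 + 1/3
```

Exact fractions are used so that the half and third votes are visible in the result rather than hidden behind rounding.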
The actual PageRank formula incorporates two more considerations.
First, indirect votes are trusted less than direct votes. Suppose a new page
N links to page B, increasing the authority of B by one unit. By the equation
above, the authority of pages A and C would then each increase by half a
unit, exactly as much as if N had linked directly to A and C instead of B.
This is too much: N chose to link to B, and so considers B more
authoritative than A and C. The problem is resolved by scaling down the
votes by a factor q, which is usually 0.85.
Second, every page receives a small base authority of 1 - q = 0.15. This
choice has the convenient property that, provided every page has at least
one outgoing link, the average PageRank of all pages is one.
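With these two modifications, the equation turns into the real PageRank
equation, where L(X) is the number of links coming from page X:
- $PR(A) = (1 - q) + q \left( \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)} \right)$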
So one page's PageRank is calculated from the PageRank of other pages.
Google is always recalculating the PageRanks. If all pages are given an
arbitrary starting PageRank and the equation is applied repeatedly, the
values change on each pass and tend to stabilize. It is these stabilized
values that are used by the search engine.
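The repeated recalculation can be sketched as follows, using the simplified equation with q = 0.85 and a base authority of 1 - q. The function iterate_pagerank and its parameters are hypothetical names for this example. The link from A to B is an assumption added so that every page has an outgoing link, which the "average of one" property requires.

```python
def iterate_pagerank(links, q=0.85, start=1.0, tol=1e-10, max_rounds=1000):
    """Apply PR(p) = (1 - q) + q * sum(PR(j) / L(j)) until values settle."""
    pr = {page: start for page in links}
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        new = {}
        for page in links:
            # Votes from every source that links to this page.
            votes = sum(
                pr[src] / len(out) for src, out in links.items() if page in out
            )
            new[page] = (1 - q) + q * votes
        change = max(abs(new[p] - pr[p]) for p in links)
        pr = new
        if change < tol:
            break
    return pr, rounds


# Four-page example; the A -> B link is an assumption (not given in the text).
links = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

from_one, rounds_one = iterate_pagerank(links, start=1.0)
from_five, rounds_five = iterate_pagerank(links, start=5.0)

for page in sorted(links):
    print(page, round(from_one[page], 4), round(from_five[page], 4))
print("average:", sum(from_one.values()) / len(from_one))
```

Both starting values settle on the same ranks, and page D, which nobody links to, keeps only the base authority of 0.15.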
Complex
The formula uses a model of a random surfer who gets bored after several
clicks and switches to a random page. The PageRank value of a page reflects
the frequency of hits on that page by the random surfer. It can be
understood as a Markov process in which the states are pages and the
transitions are the links between pages, all links on a page being equally
probable. If a page has no links to other pages, it becomes a sink and makes
the model unusable, because sink pages would trap the random surfer forever.
The solution is quite simple: a random surfer who arrives at a sink page
picks another URL at random and continues surfing. To be fair to pages that
are not sinks, these random transitions are added to all nodes in the Web,
with a residual probability of usually q = 0.15, estimated from the
frequency with which an average surfer uses his or her browser's bookmark
feature. Note that q here denotes the probability of a random jump, whereas
the simplified description above used q = 0.85 as the factor applied to
votes.
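The random surfer can be simulated directly. The following is a rough sketch: the function simulate_surfer, the step count and the seed are arbitrary choices for this example, and page A is assumed to be a sink because the text gives it no outgoing links.

```python
import random


def simulate_surfer(links, steps=100_000, q=0.15, seed=0):
    """Return the fraction of steps the random surfer spends on each page."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    pages = sorted(links)
    visits = dict.fromkeys(pages, 0)
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        out = links[current]
        if not out or rng.random() < q:
            # Stuck on a sink, or bored: jump to a page chosen at random.
            current = rng.choice(pages)
        else:
            # Otherwise follow one of the page's links, all equally likely.
            current = rng.choice(out)
    return {page: visits[page] / steps for page in pages}


# Four-page example; A has no outgoing links here and so acts as a sink.
links = {
    "A": [],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

freq = simulate_surfer(links)
for page in sorted(freq, key=freq.get, reverse=True):
    print(page, round(freq[page], 3))
```

The visit frequencies are only estimates and wobble with the seed and the number of steps, but the ordering of the pages is stable for this small graph.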
So, the equation is as follows:
- $PR(p_i) = \frac{q}{N} + (1 - q) \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}$
where $p_1, p_2, \ldots, p_N$ are the pages under consideration, $M(p_i)$ is
the set of pages that link to $p_i$, $L(p_j)$ is the number of links coming
from page $p_j$, and $N$ is the total number of pages.
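A direct transcription of this equation into Python might look like the sketch below. The function name pagerank, the tolerance and the round limit are assumptions of this example. Following the random surfer description, a sink is treated as if it linked to every page, so the sets M and counts L are built after that substitution.

```python
def pagerank(links, q=0.15, tol=1e-12, max_rounds=1000):
    """Iterate PR(p_i) = q/N + (1 - q) * sum over M(p_i) of PR(p_j)/L(p_j)."""
    pages = sorted(links)
    n = len(pages)
    # A sink is treated as if it linked to every page (random jump).
    out = {p: (links[p] if links[p] else pages) for p in pages}
    # M[p]: pages that link to p.  L[p]: number of links coming from p.
    M = {p: [j for j in pages if p in out[j]] for p in pages}
    L = {p: len(out[p]) for p in pages}
    pr = dict.fromkeys(pages, 1.0 / n)
    for _ in range(max_rounds):
        new = {
            p: q / n + (1 - q) * sum(pr[j] / L[j] for j in M[p]) for p in pages
        }
        done = max(abs(new[p] - pr[p]) for p in pages) < tol
        pr = new
        if done:
            break
    return pr


# Four-page example; A has no outgoing links and is therefore a sink.
links = {
    "A": [],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
print("total:", round(sum(ranks.values()), 6))
```

With this normalisation the values sum to one, so each can be read as the probability of finding the surfer on that page.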
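The PageRank values are the entries of the dominant eigenvector of the
modified adjacency matrix. This makes PageRank a particularly elegant
metric: the eigenvector is
- $R = \begin{bmatrix} PR(p_1) \\ PR(p_2) \\ \vdots \\ PR(p_N) \end{bmatrix}$
where R is the solution of the equation
- $R = \begin{bmatrix} q/N \\ q/N \\ \vdots \\ q/N \end{bmatrix} + (1 - q) \begin{bmatrix} \ell(p_1, p_1) & \ell(p_1, p_2) & \cdots & \ell(p_1, p_N) \\ \ell(p_2, p_1) & \ddots & & \vdots \\ \vdots & & \ell(p_i, p_j) & \\ \ell(p_N, p_1) & \cdots & & \ell(p_N, p_N) \end{bmatrix} R$
where the adjacency function $\ell(p_i, p_j)$ is 0 if page $p_j$ does not
link to $p_i$, and normalised such that, for each j,
- $\sum_{i=1}^{N} \ell(p_i, p_j) = 1$
i.e. the elements of each column sum up to 1.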
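The matrix form can be checked numerically on a small graph. The sketch below uses plain Python lists, and the helper names link_matrix, modified_matrix and dominant_eigenvector are invented for this example. It assumes that sinks are spread evenly over all pages and that the "modified adjacency matrix" is the normalised link matrix scaled by 1 - q with q/N added to every entry. That reading follows from the equation for R when the entries of R sum to one.

```python
def link_matrix(links):
    """Column-normalised adjacency: entry [i][j] is the share of j's vote to i."""
    pages = sorted(links)
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    ell = [[0.0] * n for _ in range(n)]
    for j, page in enumerate(pages):
        targets = links[page] or pages  # sink: spread evenly over all pages
        for target in targets:
            ell[index[target]][j] = 1.0 / len(targets)
    return pages, ell


def modified_matrix(ell, q=0.15):
    """Add the random-jump term q/N to every entry of (1 - q) * ell."""
    n = len(ell)
    return [[q / n + (1 - q) * ell[i][j] for j in range(n)] for i in range(n)]


def dominant_eigenvector(matrix, rounds=200):
    """Power iteration, rescaled each round so that the entries sum to one."""
    n = len(matrix)
    r = [1.0 / n] * n
    for _ in range(rounds):
        r = [sum(matrix[i][j] * r[j] for j in range(n)) for i in range(n)]
        total = sum(r)
        r = [x / total for x in r]
    return r


links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
q = 0.15
pages, ell = link_matrix(links)
G = modified_matrix(ell, q)
R = dominant_eigenvector(G)
n = len(pages)

column_sums = [sum(ell[i][j] for i in range(n)) for j in range(n)]
# Right-hand side of the equation for R, evaluated at the computed R.
rhs = [q / n + (1 - q) * sum(ell[i][j] * R[j] for j in range(n)) for i in range(n)]

print("column sums:", column_sums)
print("R:", [round(x, 4) for x in R])
print("largest residual:", max(abs(R[i] - rhs[i]) for i in range(n)))
```

The residual printed at the end measures how closely the computed vector satisfies the equation for R.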
This is a variant of the eigenvector centrality measure used commonly in
network analysis.
The values of the PageRank eigenvector are fast to approximate (only a
few iterations are needed) and in practice it gives good results.
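One way to see why few iterations suffice is to track how much the values move on each pass. In the sketch below (the function l1_changes is a hypothetical helper, and sinks are again treated as linking to every page), the total absolute change shrinks by at least a factor of 1 - q = 0.85 per round, because each round multiplies the previous change by 1 - q times a matrix whose columns sum to 1.

```python
def l1_changes(links, q=0.15, rounds=20):
    """Return the summed absolute change in PageRank for each round."""
    pages = sorted(links)
    n = len(pages)
    out = {p: (links[p] or pages) for p in pages}  # sinks link everywhere
    pr = dict.fromkeys(pages, 1.0 / n)
    changes = []
    for _ in range(rounds):
        new = {
            p: q / n
            + (1 - q) * sum(pr[j] / len(out[j]) for j in pages if p in out[j])
            for p in pages
        }
        changes.append(sum(abs(new[p] - pr[p]) for p in pages))
        pr = new
    return changes


links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
changes = l1_changes(links)
for round_number, change in enumerate(changes, start=1):
    print(round_number, f"{change:.2e}")
```

The factor 0.85 is only an upper bound on the shrinkage per round, so the changes may fall faster than that on a particular graph.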
As a result of Markov theory, it can be shown that the PageRank of a page
is the probability of being at that page after a large number of clicks.
This happens to equal $t^{-1}$, where t is the expectation of the number of
clicks (or random jumps) required to get from the page back to itself.
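The relation between PageRank and the expected return time can be verified on a small example without simulation. In this sketch the helper names transition_matrix, stationary and mean_return_time are invented for the example, sinks jump to a random page, and the expected hitting times are obtained by repeatedly applying their defining equation until the values settle.

```python
def transition_matrix(links, q=0.15):
    """T[j][k]: probability that the surfer moves from page j to page k."""
    pages = sorted(links)
    n = len(pages)
    T = {}
    for j in pages:
        out = links[j] or pages  # a sink jumps to a random page
        T[j] = {
            k: q / n + (1 - q) * (1.0 / len(out) if k in out else 0.0)
            for k in pages
        }
    return pages, T


def stationary(pages, T, rounds=500):
    """Long-run probability of being on each page (the PageRank)."""
    pr = dict.fromkeys(pages, 1.0 / len(pages))
    for _ in range(rounds):
        pr = {k: sum(pr[j] * T[j][k] for j in pages) for k in pages}
    return pr


def mean_return_time(pages, T, target, rounds=2000):
    """Expected number of clicks to get from target back to target."""
    others = [p for p in pages if p != target]
    # h[j]: expected clicks to reach target from j, via h = 1 + T_others * h.
    h = dict.fromkeys(others, 0.0)
    for _ in range(rounds):
        h = {j: 1.0 + sum(T[j][k] * h[k] for k in others) for j in others}
    return 1.0 + sum(T[target][k] * h[k] for k in others)


links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
pages, T = transition_matrix(links)
pr = stationary(pages, T)
for page in pages:
    t = mean_return_time(pages, T, page)
    print(page, "PR =", round(pr[page], 4), " 1/t =", round(1.0 / t, 4))
```

For each page the two printed columns should agree, which is the statement that PageRank equals the reciprocal of the expected return time.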
The main disadvantage of PageRank is that it favors older pages, because a new page,
even a very good one, will not have many links unless it is part of an
existing site (a site being a densely connected set of pages).
That's why PageRank should be combined with textual analysis or other
ranking methods. PageRank seems to favor Wikipedia pages, often putting them
high or at the top of searches for several encyclopedic topics. A common
theory is that this is because Wikipedia is very interconnected, with each
article having many internal links from other articles, which in turn have
links from many other sites on the Web pointing to them. Compared to
Wikipedia, and similar high quality content-rich sites, the rest of the
World Wide Web is relatively loosely connected.
Several strategies have been proposed to accelerate the computation of
PageRank.
Although densely interlinked sites tend to rank well, Google is known to
actively penalize link farms and other schemes that artificially inflate
PageRank. How Google tells the difference between highly inter-linked web
sites and link farms is one of Google's trade secrets.
False or spoofed PageRank
While the PR shown is usually accurate for most sites, it is also easily
manipulated. A current flaw is that any low-PageRank page that is
redirected, via an HTTP 302 response or a "Refresh" meta tag, to a high-PR
page acquires the PR of the destination page. In theory, a new PR 0 page
with no incoming links can be redirected to the Google home page, which is a
PR 10, and by the next PageRank update the PR of the new page will be
upgraded to PR 10. This is called spoofing and is a known failing or bug in
the system. Any page's displayed PR can be spoofed to a higher or lower
number of the webmaster's choice, and only Google has access to the real PR
of the page.
Buying Text Links
For SEO purposes, webmasters often buy links for their sites. As links from
higher-PR pages are believed to be more valuable, they tend to be more
expensive. Buying link advertisements on content pages of quality, relevant
sites can be an effective and viable marketing strategy to drive traffic and
increase a webmaster's link popularity.
Google Guide made by MultiMedia | Free content and software
This guide is licensed under the GNU
Free Documentation License. It uses material from the Wikipedia.