
PageRank

From Wikipedia, the free encyclopedia, by MultiMedia


PageRank, sometimes abbreviated to PR, is a family of algorithms for assigning numerical weightings to hyperlinked documents (or web pages) indexed by a search engine. It was originally developed by Larry Page (hence the play on words in the name PageRank), and its properties are much discussed by search engine optimization (SEO) experts. The PageRank system is used by the popular search engine Google to help determine a page's relevance or importance. It was developed by Google's founders Larry Page and Sergey Brin while at Stanford University in 1998. As Google puts it:

PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. Google interprets a link from page A to page B as a vote, by page A, for page B. But Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."

PageRank uses links as "votes"

In other words, a PageRank results from a "ballot" among all the other pages on the World Wide Web about how important a page is. A hyperlink to a page counts as a vote of support. The PageRank of a page is defined recursively and depends on the number and PageRank of all pages that link to it ("incoming links"). A page that is linked to by many pages with high rank receives a high rank itself. If there are no links to a web page, there is no support for that page.

In early 2005, Google implemented a new value, "nofollow", for the rel attribute of HTML link and anchor elements, so that website builders and bloggers can make links that Google will not follow for the purposes of PageRank. Such links no longer constitute a "vote" in the PageRank system. The nofollow value was added in an attempt to help combat comment spam.
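As an illustration of how a crawler might separate links that count as votes from nofollow links, here is a minimal sketch using Python's standard html.parser module. The class name VoteLinkParser, the helper split_links and the sample markup are hypothetical, and nothing here is claimed to reflect how Google actually processes pages.

```python
from html.parser import HTMLParser


class VoteLinkParser(HTMLParser):
    """Collect href targets, separating followed links from nofollow ones."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return
        # rel may hold several space-separated tokens, e.g. "nofollow external".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followed.append(href)


def split_links(html):
    """Return (followed, nofollow) lists of link targets found in html."""
    parser = VoteLinkParser()
    parser.feed(html)
    parser.close()
    return parser.followed, parser.nofollow


# Hypothetical page fragment: one editorial link, one comment link.
sample = (
    '<p><a href="http://example.com/a">editorial link</a> '
    '<a rel="nofollow" href="http://example.com/spam">comment link</a></p>'
)
followed, ignored = split_links(sample)
```

Only the targets in followed would be treated as votes under this sketch; the targets in ignored would be left out of the link count.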

The Google Toolbar PageRank goes from 0 to 10. It appears to be a logarithmic scale, but the exact details of this scale are not public knowledge. The name PageRank, a pun on the name of Larry Page, is a trademark of Google. The PageRank process has been patented (U.S. Patent 6,285,999). The patent is not assigned to Google but to Stanford University.

An alternative to the PageRank algorithm is the HITS algorithm proposed by Jon Kleinberg and the CLEVER project at IBM. Many HITS concepts are now incorporated into Teoma and Ask Jeeves.

PageRank algorithm

Simplified

Suppose a small universe of four web pages: A, B, C and D. If B, C and D each link only to A, then the PR (PageRank) of page A would be the sum of the PR of pages B, C and D.

PR(A) = PR(B) + PR(C) + PR(D)

But then suppose page B also has a link to page C, and page D has links to all three pages. A page cannot vote twice; instead it splits its vote over the pages it links to. Thus, page B gives half a vote to page A and half a vote to page C. By the same logic, page D divides its vote over three pages, and only one third of D's vote is counted for A's PageRank.

PR(A)= \frac{PR(B)}{2}+ \frac{PR(C)}{1}+ \frac{PR(D)}{3}

In other words, divide each page's PR by the total number of links that come from that page, written L(B), L(C) and L(D):

PR(A)= \frac{PR(B)}{L(B)}+ \frac{PR(C)}{L(C)}+ \frac{PR(D)}{L(D)}
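The vote splitting can be sketched in a few lines of Python. The function name simple_pagerank_share and the starting value of 1.0 for every page are assumptions made for illustration, and page A is assumed to have no outgoing links because the example does not mention any.

```python
def simple_pagerank_share(pr, outlinks, target):
    """Sum PR(p) / L(p) over every page p that links to target."""
    return sum(
        pr[page] / len(links)
        for page, links in outlinks.items()
        if target in links
    )


# The four-page example: B links to A and C, C links to A, D links to A, B and C.
outlinks = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
pr = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0}  # assumed starting values

pr_a = simple_pagerank_share(pr, outlinks, "A")  # 1/2 + 1/1 + 1/3
```

With these assumed values, page A collects half of B's vote, all of C's vote and a third of D's vote.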

The actual PageRank formula incorporates two more considerations:

First of all, we trust indirect votes less than direct votes. Suppose a new page N links to page B, thus increasing the authority of B by one unit. As a consequence of the above equation, the authority of pages A and C would each increase by half a unit (exactly as much as if the new page N had linked directly to A and C instead of B). This is too much: after all, N links to B and so considers it more authoritative than A and C. This problem is resolved by scaling down the votes by a factor q, which is usually 0.85.

Finally, all pages get a small authority of 1-q=0.15 to start off. This choice results in the nice property that the average PageRank of all pages will be one.

With these two modifications, our equation turns into the real PageRank equation:

PR(A)=\left( \frac{PR(B)}{L(B)}+ \frac{PR(C)}{L(C)}+ \frac{PR(D)}{L(D)}+\,\cdots \right) q + 1 -  q
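The property that the average PageRank is one can be checked by summing this equation over all pages. The following is a sketch under two assumptions not stated in the text: the pages are labelled p_1, ..., p_N, and every page has at least one outgoing link. Each page p_j then hands out L(p_j) shares of size PR(p_j)/L(p_j), so its shares add back up to PR(p_j):

\sum_{i=1}^{N} PR(p_i) = N(1-q) + q \sum_{j=1}^{N} L(p_j) \frac{PR(p_j)}{L(p_j)} = N(1-q) + q \sum_{j=1}^{N} PR(p_j)

(1-q) \sum_{i=1}^{N} PR(p_i) = N(1-q) \quad \Longrightarrow \quad \frac{1}{N} \sum_{i=1}^{N} PR(p_i) = 1

If some pages have no outgoing links, their votes are lost and the average falls below one.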

So one page's PageRank is calculated from the PageRank of other pages. Google is always recalculating the PageRanks. If you give all pages a PageRank of any number and constantly recalculate everything, all PageRanks will change and tend to stabilize at some point. It is at this point where the PageRank is used by the search engine.
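The repeated recalculation can be sketched as a simple loop over the equation PR = (sum of votes) q + 1 - q. This is only an illustration on the four-page example, not a description of Google's implementation. The function name iterate_pagerank, the parameters start, tol and max_iter, and the assumption that page A has no outgoing links are all choices made for this sketch.

```python
def iterate_pagerank(outlinks, q=0.85, start=1.0, tol=1e-12, max_iter=1000):
    """Recalculate every PageRank until the values stop changing."""
    pages = list(outlinks)
    pr = {page: start for page in pages}  # any starting number will do
    for _ in range(max_iter):
        new = {}
        for page in pages:
            # Each linking page s contributes PR(s) / L(s).
            votes = sum(
                pr[s] / len(targets)
                for s, targets in outlinks.items()
                if page in targets
            )
            new[page] = votes * q + 1 - q
        change = max(abs(new[page] - pr[page]) for page in pages)
        pr = new
        if change < tol:
            break
    return pr


# B links to A and C, C links to A, D links to A, B and C; A links nowhere.
outlinks = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
ranks = iterate_pagerank(outlinks)
ranks_from_ten = iterate_pagerank(outlinks, start=10.0)
```

Starting every page at 1.0 or at 10.0 leads to the same stable values, with A ranked highest and D, which nobody links to, left at 1 - q.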

Complex

The formula uses a model of a random surfer who gets bored after several clicks and switches to a random page. The PageRank value of a page reflects the frequency of hits on that page by the random surfer. It can be understood as a Markov process in which the states are pages, and the transitions, which are all equally probable, are the links between pages. If a page has no links to other pages, it becomes a sink and would make the model unusable, because sink pages would trap the random surfer forever. However, the solution is quite simple: if the random surfer arrives at a sink page, it picks another URL at random and continues surfing.

To be fair to pages that are not sinks, these random transitions are added to all nodes in the Web, with a residual probability of usually q=0.15, estimated from the frequency with which an average surfer uses his or her browser's bookmark feature. Note that q here is the probability of a random jump, so it corresponds to 1-q in the simplified equation above, where q=0.85 scaled the votes.

So, the equation is as follows:

{\rm PageRank}(p_i) = \frac{q}{N} + (1 -q) \sum_{p_j \in M(p_i)} \frac{{\rm PageRank} (p_j)}{L(p_j)}

where p_1, p_2, ..., p_N are the pages under consideration, M(p_i) is the set of pages that link to p_i, L(p_j) is the number of links coming from page p_j, and N is the total number of pages.
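A minimal sketch of this equation in Python follows. The function name pagerank and the parameters tol and max_iter are assumptions. Sink pages are handled as described for the random surfer, by spreading their rank uniformly over all N pages, which is one possible reading of that description and not a confirmed detail of any production system.

```python
def pagerank(outlinks, q=0.15, tol=1e-12, max_iter=1000):
    """Iterate PageRank(p_i) = q/N + (1 - q) * sum of PageRank(p_j) / L(p_j)."""
    pages = list(outlinks)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(max_iter):
        # A surfer on a sink page jumps to a page chosen uniformly at random.
        sink_mass = sum(rank[page] for page in pages if not outlinks[page])
        new = {page: q / n + (1 - q) * sink_mass / n for page in pages}
        for page, targets in outlinks.items():
            for target in targets:
                new[target] += (1 - q) * rank[page] / len(targets)
        change = sum(abs(new[page] - rank[page]) for page in pages)
        rank = new
        if change < tol:
            break
    return rank


# Hypothetical four-page web; A has no outgoing links and is therefore a sink.
outlinks = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
rank = pagerank(outlinks)
```

Because the values are probabilities of finding the surfer on each page, they sum to one here, unlike in the simplified equation, where they average to one.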

The PageRank values are the entries of the dominant eigenvector of the modified adjacency matrix. This makes PageRank a particularly elegant metric: the eigenvector is

\mathbf{R} = \begin{bmatrix} {\rm PageRank}(p_1) \\ {\rm PageRank}(p_2) \\ \vdots \\ {\rm PageRank}(p_N) \end{bmatrix}

where R is the solution of the equation

\mathbf{R} =  \begin{bmatrix} {q / N} \\ {q / N} \\ \vdots \\ {q / N} \end{bmatrix}  + (1-q)  \begin{bmatrix} \ell(p_1,p_1) & \ell(p_1,p_2) & \cdots & \ell(p_1,p_N) \\ \ell(p_2,p_1) & \ddots &  &  \\ \vdots &  & \ell(p_i,p_j) & \\ \ell(p_N,p_1) &  &  & \ell(p_N,p_N) \end{bmatrix}  \mathbf{R}

where the adjacency function \ell(p_i,p_j) is 0 if page p_j does not link to p_i, and normalised such that, for each j

\sum_{i = 1}^N \ell(p_i,p_j) = 1,

i.e. the elements of each column sum up to 1.
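The matrix form can be sketched with plain Python lists. The names link_matrix and step are hypothetical, and the example graph is an assumption: page A is given a link to B so that no column of the matrix is empty and every column can sum to 1.

```python
def link_matrix(outlinks, pages):
    """Return m with m[i][j] = 1/L(p_j) if p_j links to p_i, else 0."""
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    m = [[0.0] * n for _ in range(n)]
    for j, page in enumerate(pages):
        for target in outlinks[page]:
            m[index[target]][j] = 1.0 / len(outlinks[page])
    return m


def step(m, r, q=0.15):
    """Apply R -> q/N + (1 - q) * M R once."""
    n = len(r)
    return [
        q / n + (1 - q) * sum(m[i][j] * r[j] for j in range(n))
        for i in range(n)
    ]


pages = ["A", "B", "C", "D"]
# Hypothetical graph in which every page has at least one outgoing link.
outlinks = {"A": ["B"], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
m = link_matrix(outlinks, pages)
column_sums = [sum(m[i][j] for i in range(len(pages))) for j in range(len(pages))]

r = [1.0 / len(pages)] * len(pages)
for _ in range(200):
    r = step(m, r)
```

After enough iterations, applying step again leaves r essentially unchanged, which is what it means for R to solve the equation.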

This is a variant of the eigenvector centrality measure used commonly in network analysis.

The values of the PageRank eigenvector are fast to approximate (only a few iterations are needed) and in practice it gives good results.

As a result of Markov theory, it can be shown that the PageRank of a page is the probability of being at that page after lots of clicks. This happens to equal t^{-1}, where t is the expectation of the number of clicks (or random jumps) required to get from the page back to itself.
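The link between visit frequency and return time can be illustrated by simulating one long random walk and comparing how often a page is visited with the average number of clicks between visits. This is a rough sketch: the function name simulate, the fixed seed, the step count and the example graph are all assumptions, and a finite simulation only approximates the expectation t.

```python
import random


def simulate(outlinks, start, steps, q=0.15, seed=0):
    """Random surfer walk; return (visit frequency, mean return time) for start."""
    rng = random.Random(seed)
    pages = list(outlinks)
    visits = 0
    first = last = None
    page = start
    for step in range(steps):
        if page == start:
            visits += 1
            if first is None:
                first = step
            last = step
        links = outlinks[page]
        # Jump to a random page with probability q, or always from a sink.
        if not links or rng.random() < q:
            page = rng.choice(pages)
        else:
            page = rng.choice(links)
    frequency = visits / steps
    # Average gap between consecutive visits to the start page.
    mean_return = (last - first) / (visits - 1)
    return frequency, mean_return


# Hypothetical graph in which every page has at least one outgoing link.
outlinks = {"A": ["B"], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
frequency, mean_return = simulate(outlinks, "A", steps=200000)
product = frequency * mean_return
```

In a long walk, product comes out close to 1, which is the relation frequency = 1/t in empirical form.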

The main disadvantage is that it favors older pages, because a new page, even a very good one, will not have many links unless it is part of an existing site (a site being a densely connected set of pages).

That's why PageRank should be combined with textual analysis or other ranking methods. PageRank seems to favor Wikipedia pages, often putting them high or at the top of searches for several encyclopedic topics. A common theory is that this is because Wikipedia is very interconnected, with each article having many internal links from other articles, which in turn have links from many other sites on the Web pointing to them. Compared to Wikipedia, and similar high quality content-rich sites, the rest of the World Wide Web is relatively loosely connected.

Several strategies have been proposed to accelerate the computation of PageRank.

Google is known to actively penalize link farms and other schemes to artificially inflate PageRank. How Google tells the difference between highly inter-linked web sites and link farms is one of Google's trade secrets.

False or spoofed PageRank

While the PR shown in the Google Toolbar is usually accurate for most sites, it must be noted that it is also easily manipulated. A current flaw is that any low-PageRank page that is redirected, via a 302 server header or a "Refresh" meta tag, to a high-PR page causes the lower-PR page to acquire the PR of the destination page. In theory, a new PR0 page with no incoming links can be redirected to the Google home page, which is a PR10, and by the next PageRank update the PR of the new page will be upgraded to PR10. This is called spoofing and is a known failing or bug in the system. Any page's PR can be spoofed to a higher or lower number of the webmaster's choice, and only Google has access to the real PR of the page.

Buying Text Links

For SEO purposes, webmasters often buy links for their sites. As links from higher-PR pages are believed to be more valuable, they tend to be more expensive. It can be an effective and viable marketing strategy to buy link advertisements on the content pages of relevant, high-quality sites to drive traffic and increase a webmaster's link popularity.


This guide is licensed under the GNU Free Documentation License. It uses material from the Wikipedia.
