
neverendingbooks Posts

google spammers


In the GoogleMatrix post I tried to understand the concept
behind the PageRank algorithm that Google uses to list pages according to
their 'importance'. So, if you want your webpage to come out first in
a certain search, you have to increase its PageRank-value (which
normally is a measure of the webpages linking to your page) artificially. A
method to achieve this is link spamming: if page A is
the webpage whose PageRank value you want to increase, take a page
B (either under your control or that of a friendly webmaster) and add a
dummy link page B -> page A. To find out the effect of this on the
PageRank, and how the second eigenvalue of the GoogleMatrix is able to
detect such constructs, let us set up a micro-web consisting of
just 3 pages with links 1->2 and 1->3 (counting each page as also linking
to itself, so that pages 2 and 3, which have no further outlinks, are not
dangling nodes). The corresponding GoogleMatrix
(with c=0.85 and v=(1/3,1/3,1/3)) is

1/3    1/20   1/20
1/3    9/10   1/20
1/3    1/20   9/10

which has eigenvalues 1, 0.85 and 0.28.
The eigenvector with eigenvalue 1 (the PageRank) is equal to (0.15,1,1),
so page 2 and page 3 are equally important to Google, and if we scale the
PageRank such that it adds up to 100% over all pages, the relative
importance values are 6.9%, 46.5% and 46.5%. In this case the eigenvector
corresponding to the second eigenvalue 0.85 is (0,-1,1) and hence
detects the two leaf-nodes.
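
If you want to check these numbers yourself, a few lines of Python (with
numpy) will do; the sketch below simply hard-codes the 3x3 matrix above and
reads off its eigenvalues and eigenvectors.

import numpy as np

# the GoogleMatrix of the micro-web above (c = 0.85, v = (1/3,1/3,1/3))
A = np.array([
    [1/3, 1/20, 1/20],
    [1/3, 9/10, 1/20],
    [1/3, 1/20, 9/10],
])

vals, vecs = np.linalg.eig(A)
order = np.argsort(-vals.real)            # sort eigenvalues, largest first
vals, vecs = vals[order].real, vecs[:, order].real

print(vals)                               # roughly [1.0, 0.85, 0.283]

pagerank = vecs[:, 0] / vecs[:, 0].sum()  # eigenvector for eigenvalue 1, rescaled to sum to 1
print(pagerank)                           # roughly [0.07, 0.465, 0.465]

leaf = vecs[:, 1] / np.abs(vecs[:, 1]).max()
print(leaf)                               # (0, -1, 1) up to sign: the two leaf-nodes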
Now, assume the owner of page 2 sets up a link spam by creating a page 4
and linking 4->2, then the corresponding
GoogleMatrix (with c=0.85 and v=(1/4,1/4,1/4,1/4)) is

77/240   3/80    3/80    3/80
77/240   71/80   3/80    37/80
77/240   3/80    71/80   3/80
3/80     3/80    3/80    37/80

which has eigenvalues
1, 0.85, 0.425 and 0.283. The PageRank eigenvector with eigenvalue 1 is
in this case (0.8,8.18,5.35,1), or in relative importance
(5.2%, 53.4%, 34.9%, 6.5%), and we see that the spammer achieved his/her
goal: page 2's share went up from 46.5% to well over half. The eigenvector
corresponding to the second eigenvalue is (0,-1,1,0), which again picks out
the leaf-nodes, while the eigenvector of the
third eigenvalue 0.425 is (0,-1,0,1) and detects the spam-construct.
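
The same quick check works for the spammed micro-web; this sketch (again
Python with numpy) hard-codes the 4x4 matrix above and prints the
eigenvectors that expose the leaf-nodes and the spam-construct.

import numpy as np

# the GoogleMatrix of the spammed micro-web (c = 0.85, v = (1/4,1/4,1/4,1/4))
A = np.array([
    [77/240, 3/80,  3/80,  3/80],
    [77/240, 71/80, 3/80,  37/80],
    [77/240, 3/80,  71/80, 3/80],
    [3/80,   3/80,  3/80,  37/80],
])

vals, vecs = np.linalg.eig(A)
order = np.argsort(-vals.real)             # sort eigenvalues, largest first
vals, vecs = vals[order].real, vecs[:, order].real

print(vals)                                # roughly [1.0, 0.85, 0.425, 0.283]

pagerank = vecs[:, 0] / vecs[:, 0].sum()
print(pagerank)                            # page 2 now gets more than half of the total rank

for k in (1, 2):                           # 2nd and 3rd eigenvectors, rescaled for readability
    w = vecs[:, k]
    print(np.round(w / np.abs(w).max(), 3))   # (0,-1,1,0) and (0,-1,0,1) up to sign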


an even better LaTeX system

A previous post, the best LaTeX system, was a commercial for Gerben
Wierda's i-Installer as the way to get a working tetex
distribution. I've been working happily with this TeX-system for two
years now but recently ran into a few (minor) problems. In the process
of solving these problems I created myself a second tetex-system,
more or less by accident. This is what happened. On the computer at the
university I once got fun packages running, such as a chess-, go-
and Feynman-diagrams package, but somehow I cannot reproduce this
on my home-machine: I get lots of errors about missing fonts etc. As I
really wanted to TeX some chess-diagrams, I went surfing for the most
recent version of the chess-package and found one for Linux and
one under the Fink-project: the chess-tex package. So, I did a

sudo fink install chess-tex

forgetting that, in good Fink-tradition, you can
only install a package by installing at the same time all packages
needed to run it, so I was given a whole list of packages that Fink
wanted to install, including a full tetex-system. Did I want to
continue? Well, I had to think about that for a moment, but realized that
the iTeX-tree lives under /usr/local whereas Fink
creates its trees under /sw, so there should not really be a problem;
so yes, let's see what happens. It took quite a while (well over an hour
and a half) but when it was done I had a second full TeX-system. But how
could I get it running? Of course I could try to check it via the
command line, but then I remembered that there is an alternative
front-end to TeXShop, namely iTeXMac,
which advertises that it can run either iTeX or the Fink-distribution of
tetex. So I downloaded it and looked in the preferences, which indeed
contain a pane where you can choose between the standard
tetex-distribution and the Fink-distribution; iTeXMac
automatically finds the relevant folders. So I wrote a quick chess diagram,
ran it through iTeXMac, and indeed it produced the graphics I expected! This
little test gave me some confidence in the Fink-distribution, so I
fired up FinkCommander and looked under the category 'text'
for other TeX packages I could install, and there were plenty:
Latex2HTML, Latex2rtf, Feynman, tex4ht and so on. Installing them with
the commander is fun: just click on the package you want and click
'Install' under the Source dropdown; in the lower part of the window you
can follow the installation process, while the upper part tells you which
packages are already installed and what their version-numbers are. I still
have to figure out how to add new style files to this fink-tree, and I have
to get used to the iTeXMac-editor, but so far I like the robustness of
the system and the easy install procedure, so try it out!


the google matrix

This morning there was an intriguing post on arXiv/math.RA
entitled A Note on the Eigenvalues of the Google Matrix. At first I thought
it was a joke, but a quick Google revealed that the PageRank algorithm really
is at the heart of Google technology, so I simply had to find out more
about it. An extremely readable account of it can be found in The PageRank
Citation Ranking: Bringing Order to the Web, which is really the
start of Google. It is coauthored by the two founders: Larry Page and
Sergey Brin. A quote from the introduction:

“To test the utility of PageRank for search, we built a web
search engine called Google (Section
5)”

Here is an intuitive idea of
_PageRank_: a page has high rank if the sum of the ranks of its
_backlinks_ (that is, the pages linking to the page in question) is
high, and it is computed by the _Random Surfer Model_ (see
sections 2.5 and 2.6 of the paper). More formally (at least from my
quick browsing of some papers; maybe the following account is slightly
erroneous and I'll have to spend some more time reading), let
N be the number of webpages (estimated between 3 and 4
billion) and consider the N x N matrix
A, the so-called GoogleMatrix, where

A = cP + (1-c)(v x vec(1))

where P is the
column-stochastic matrix (meaning: all entries are zero or positive and
the sum of all entries in each column adds up to 1) with
entries

P(i,j) = 1/N(j) if j->i and 0 otherwise

where i and j are webpages, j->i
denotes that page j has a link to page i, and N(j) is the total
number of pages linked to from page j (all this information is available
once we download page j); with this convention every column of P indeed
adds up to 1. c is a constant, 0 < c < 1, and
corresponds to the fraction of webpages containing an _outlink_ (that
is, a link to another page) among all webpages (it seems that Google uses
c=0.85 as an estimate). Finally, v is a column vector with zero or
positive entries adding up to 1 and vec(1) is the constant row vector
(1,…,1). The idea behind this second term is that in the _Random Surfer
Model_ used to compute the PageRank, the Googlebot (which normally follows
links randomly on the pages it enters) jumps, at a fraction (1-c) of its
steps, to an entirely different webpage, where the chance that it ends up at
page i is given by the i-th entry of v (this is to avoid being trapped
in a web-loop). So, in Google's model the bot _teleports_ itself
randomly every 6th link or so. Now, the PageRank is a
column-eigenvector of the GoogleMatrix A with eigenvalue 1, which can be
approximated by the Random Surfer model, and the rate of convergence of
this process depends on the _second_ largest eigenvalue of A
(the largest being 1).
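
To see what this looks like in practice, here is a small sketch in Python
with numpy (the google_matrix helper and the toy link structure below are
made up for illustration, they are not Google's actual code): it builds the
GoogleMatrix from a list of links and approximates the PageRank by
repeatedly applying A, which is exactly the iteration whose speed of
convergence is governed by that second eigenvalue.

import numpy as np

def google_matrix(links, n, c=0.85):
    # links[j] lists the pages that page j links to; pages are numbered 0..n-1.
    # The columns of P hold the outlink distributions, so A is column-stochastic;
    # pages without outlinks are sent uniformly to all pages.
    P = np.zeros((n, n))
    for j in range(n):
        targets = links.get(j, [])
        if targets:
            P[targets, j] = 1.0 / len(targets)
        else:
            P[:, j] = 1.0 / n
    v = np.full((n, 1), 1.0 / n)            # uniform teleportation vector v
    return c * P + (1 - c) * v @ np.ones((1, n))

# a made-up toy web: 0 -> 1,2 ; 1 -> 2 ; 2 -> 0 ; 3 -> 2
A = google_matrix({0: [1, 2], 1: [2], 2: [0], 3: [2]}, n=4)

x = np.full(4, 0.25)                        # start the surfer uniformly
for _ in range(50):                         # power iteration = iterated random surfing
    x = A @ x
print("PageRank:", x / x.sum())

lams = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
print("largest eigenvalues:", lams[:2])     # 1 and the second one, which is at most c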
Now, in the paper posted this morning a simple
proof is given that this second eigenvalue is c (because the matrix P has
multiple eigenvalues equal to 1). According to a previous paper on the
subject, The Second Eigenvalue of the Google Matrix, this statement has
implications for the convergence rate of the standard PageRank algorithm
as the web scales, for the stability of PageRank to perturbations of the
link structure of the web, for the detection of Google spammers, and for
the design of algorithms to speed up PageRank. But I'll have to
read more to understand the Google spammers bit…
