INFORMATION RETRIEVAL SYSTEM AND
THE PAGERANK ALGORITHM
OUTLINE
 Information retrieval system
 Data retrieval versus information retrieval
 Basic concepts of information retrieval
 Retrieval process
 Classical models of information retrieval
 Boolean model
 Vector model
 Probabilistic model
 Web information retrieval
 Features of Google’s search system
 Google’s architecture
 A brief analysis of the PageRank algorithm
 PageRank versus HITS algorithm
WHAT IS INFORMATION RETRIEVAL?
 Information retrieval (IR) deals with the representation, storage,
organization of, and access to information items[1].
 The user must first translate his or her information need into a query which
can be processed by the IR system.
 The key goal of an IR system is to retrieve information which might be
useful or relevant to the user.
DATA VERSUS INFORMATION RETRIEVAL
DATA RETRIEVAL: Determines which documents of a collection contain the
keywords in the user query.
INFORMATION RETRIEVAL: Retrieves information about a subject rather than
data which satisfies a given query.

DATA RETRIEVAL: All objects which satisfy clearly defined conditions are
retrieved.
INFORMATION RETRIEVAL: The IR system somehow 'interprets' the contents of
the documents in a collection and ranks them according to a degree of
relevance to the user query.

DATA RETRIEVAL: A single erroneous object means total failure.
INFORMATION RETRIEVAL: The retrieved objects might be inaccurate, and small
errors are ignored.

DATA RETRIEVAL: The data has a well-defined structure and semantics.
INFORMATION RETRIEVAL: The data is natural language text which is not always
well structured and could be semantically ambiguous.
BASIC CONCEPTS OF IR
The effective retrieval of relevant information is directly affected by:
 User task – the task of the user might be:
 Information or data retrieval
 Browsing
 Filtering
Figure 1: User tasks in an IR system [1]
 Logical view – the index terms (words) that represent a document can be
extracted in two ways:
 Full text
 Index terms
Figure 2: Text operations for Index Term Logical View [1]
RETRIEVAL PROCESS
Step 1: Before the retrieval process can even be initiated, it is necessary
to define the text database. This is usually done by the manager of the
database, who specifies the following:
(a) the documents to be used
(b) the text operations
(c) the text model
Step 2: Once the logical view of the documents is defined, the database
manager builds an index of the text. An index is a critical data structure
because it allows fast searching over large volumes of data (e.g., an
inverted file; a minimal sketch of such an index is given after Step 5).
Figure 3 : Retrieval Process[1]
Step 3: The user then specifies an information need, which is parsed and
transformed by the same text operations applied to the text. Next, query
operations are applied to produce the actual query, which is processed
against the index to obtain the retrieved documents. Fast query processing
is made possible by the index structure previously built.
Step 4: Before being sent to the user, the retrieved documents are ranked
according to a likelihood of relevance.
Step 5: The user then examines the set of ranked documents in the
search for useful information. At this point, he might pinpoint a subset
of the documents seen as definitely of interest and initiate a user
feedback cycle[1].
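To make Step 2 concrete, here is a minimal sketch of a toy inverted index in Python. The document collection, stop-word list, and tokenizer are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict

# Hypothetical toy collection; any iterable of (doc_id, text) pairs would do.
DOCS = {
    1: "Information retrieval deals with the representation of information items",
    2: "The PageRank algorithm ranks web pages by link structure",
    3: "An inverted index allows fast searching over large volumes of text",
}

STOP_WORDS = {"the", "of", "an", "with", "by", "over"}  # assumed stop-word list

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (a crude text operation)."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def build_inverted_index(docs):
    """Map each index term to the sorted list of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

if __name__ == "__main__":
    index = build_inverted_index(DOCS)
    print(index["information"])   # -> [1]
    print(index["pagerank"])      # -> [2]
```

Looking up a term is then a dictionary access rather than a scan of every document, which is why the index makes query processing fast.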
IR MODELS
 The central problem for IR systems is predicting which documents are
relevant and which are not.
 A ranking algorithm operates according to basic premises regarding the
notion of document relevance.
 The IR model adopted determines the predictions of what is relevant and
what is not.
Figure 4 : Classification of the various IR models[1]
FORMAL DEFINITION OF IR
An information retrieval model is a quadruple
[D, Q, F, R(qi, dj)]
where:
 D is a set composed of logical views (or representations) for the
documents in the collection.
 Q is a set composed of logical views (or representations) for the user
information needs (called queries).
 F is a framework for modeling document representations, queries, and
their relationships.
 R(qi, dj) is a ranking function which associates a real number with a
query qi ∈ Q and a document representation dj ∈ D. Such a ranking
defines an ordering among the documents with regard to the query qi.
CLASSICAL MODEL
 The classic IR models consider that each document is described by a set
of representative keywords called index terms, which are used to index
and summarize the document contents.
 Distinct index terms have varying relevance when used to describe the
document contents.
 In these models, this effect is captured through the assignment of
numerical weights to each index term of a document.
 The main classical models are:
 Boolean Model
 Vector Model
 Probabilistic Model
STRUCTURED MODEL
 Retrieval models which combine information on text content with
information on the document structure are called structured text retrieval
models [1].
 There are two models for structured text retrieval:-
 Non-overlapping lists model
 Proximal nodes model
Figure 5: List structure for (a) Non-overlapping lists model (b) Proximal nodes model [1]
BROWSING MODEL
 Browsing is a process of retrieving information whose main objectives
are not clearly defined in the beginning and whose purpose might
change during the interaction with the system.
 For browsing, there are 3 models :-
 Flat model
 Structure guided model
 Hypertext model
BOOLEAN MODEL
 The Boolean model is a simple retrieval model based on set theory and
Boolean algebra.
 The queries are specified as Boolean expressions which have precise
semantics.
 The Boolean model considers that index terms are present or absent in
a document. As a result, the index term weights are assumed to be all
binary, i.e., wi,j ∈ {0, 1}.
 A query q is composed of index terms linked by three connectives: not,
and, or.
 A query is essentially a conventional Boolean expression which can be
represented as a disjunction of conjunctive vectors, i.e., in disjunctive
normal form (DNF). The binary weighted vectors are called the conjunctive
components of qdnf. (A minimal set-based sketch of Boolean retrieval
follows this list.)
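As a minimal sketch (not from the slides), Boolean retrieval can be implemented directly with set operations over postings lists; the toy postings below are assumed for illustration.

```python
# Toy postings lists: term -> set of documents containing it (assumed data).
POSTINGS = {
    "information": {1, 2, 4},
    "retrieval":   {1, 4},
    "pagerank":    {3, 4},
}
ALL_DOCS = {1, 2, 3, 4}

def AND(a, b):  return a & b            # set intersection
def OR(a, b):   return a | b            # set union
def NOT(a):     return ALL_DOCS - a     # set complement

# Query: information AND retrieval AND (NOT pagerank)
result = AND(AND(POSTINGS["information"], POSTINGS["retrieval"]),
             NOT(POSTINGS["pagerank"]))
print(sorted(result))  # -> [1]
```

Note how the answer is an unranked set: a document either satisfies the expression or it does not, which is exactly the binary decision criterion criticized below.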
ADVANTAGES:-
1. The clean formalism behind the model
2. Its simplicity
DISADVANTAGES:-
1. Its retrieval strategy is based on a binary decision criterion and
behaves more as a data retrieval model.
2. Exact matching may lead to retrieval of too few or too many
documents.
3. It is not simple to translate an information need into a Boolean
expression.
4. The Boolean expressions actually formulated by users are often
quite simple.
APPLICATIONS:-
Commercial document database systems
VECTOR MODEL
 The vector model was proposed by Salton and McGill.
 This model applies a partial matching strategy by assigning
non-binary weights to index terms in queries and in documents.
 These term weights are ultimately used to compute the degree of
similarity between each document stored in the system and the user
query.
 In the vector model,
 The weight wi,j associated with a pair (index term ki, document dj)
is positive and non-binary.
 The index terms in the query are also weighted.
(A minimal tf-idf sketch of this model is given after the advantages and
disadvantage below.)
 ADVANTAGES:
 Its term-weighting scheme improves retrieval performance.
 Its partial matching strategy allows retrieval of documents that
approximate the query conditions.
 Its cosine ranking formula sorts the documents according to their
degree of similarity to the query.
 It is a simple and resilient ranking strategy.
 DISADVANTAGE:
 Index terms are assumed to be mutually independent.
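The sketch below illustrates the vector model with a simple tf-idf weighting and the cosine measure. The toy documents and the particular weighting variant are assumptions for illustration, not the exact scheme in the slides.

```python
import math
from collections import Counter

# Hypothetical toy collection (not from the slides).
DOCS = {
    "d1": "information retrieval system",
    "d2": "pagerank ranks web pages",
    "d3": "web information retrieval",
}

def tf_idf_vectors(docs):
    """w[t] = tf(t, d) * log(N / df(t)): a simple tf-idf weighting."""
    N = len(docs)
    tokenized = {d: text.split() for d, text in docs.items()}
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))                 # document frequency of each term
    vectors = {}
    for d, terms in tokenized.items():
        tf = Counter(terms)
        vectors[d] = {t: tf[t] * math.log(N / df[t]) for t in tf}
    return vectors, df, N

def cosine(v, w):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(v[t] * w.get(t, 0.0) for t in v)
    nv = math.sqrt(sum(x * x for x in v.values()))
    nw = math.sqrt(sum(x * x for x in w.values()))
    return dot / (nv * nw) if nv and nw else 0.0

vectors, df, N = tf_idf_vectors(DOCS)
query = {t: math.log(N / df[t]) for t in "web retrieval".split() if t in df}
ranking = sorted(((cosine(query, vec), d) for d, vec in vectors.items()), reverse=True)
print(ranking)   # documents in decreasing order of similarity to the query
```

Because the match is partial, a document that shares only some query terms still receives a non-zero score and is ranked rather than discarded.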
PROBABILISTIC MODEL
 The classic probabilistic model was introduced in 1976 by Robertson and
Sparck Jones.
 The probabilistic model attempts to capture the IR problem within a
probabilistic framework.
 BASIC IDEA: Given a user query, there is a set of documents which
contains exactly the relevant documents, referred to as the ideal answer
set. Given a description of this ideal answer set, we retrieve the
documents that satisfy it.
 Thus the querying process is a process of specifying the properties of
an ideal answer set.
Assumption (Probabilistic Principle) -
‘Given a user query q and a document dj in the collection, the
probabilistic model tries to estimate the probability that the user will
find the document dj relevant.
— The model assumes that this probability of relevance depends on the
query and the document representations only.
— Further, the model assumes that there is a subset of all documents
which the user prefers as the answer set for the query q, called the
ideal answer set and labeled R, which should maximize the overall
probability of relevance to the user.
— Documents in the set R are predicted to be relevant to the query.
Documents not in this set are predicted to be non-relevant.'
 This assumption does not state explicitly:-
 How to compute the probabilities of relevance
 Which sample space to use
(A standard formulation of the resulting ranking formula is sketched
after the disadvantages below.)
ADVANTAGES:-
 The documents are ranked in decreasing order of their probability of
being relevant.
DISADVANTAGES:-
 There is a need to guess the initial separation of documents into
relevant and non-relevant sets.
 It does not take into account the frequency with which an index term
occurs inside a document.
 It adopts the independence assumption for index terms.
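For reference, and following the classic treatment in [1] rather than anything shown in the original slides, the ranking formula this model leads to (the binary independence retrieval, or BIR, formula) can be written with binary weights as

$$
\mathrm{sim}(d_j, q) \;\propto\; \sum_{k_i \in q \,\cap\, d_j}
\left( \log\frac{P(k_i \mid R)}{1 - P(k_i \mid R)}
     + \log\frac{1 - P(k_i \mid \overline{R})}{P(k_i \mid \overline{R})} \right)
$$

where P(ki | R) is the probability that index term ki occurs in a document of the relevant set R, and R̄ is the set of non-relevant documents. In the absence of relevance information, P(ki | R) is typically initialized to 0.5 and P(ki | R̄) to the fraction of documents containing ki, and both are re-estimated from the top-retrieved documents in later iterations; this is the "initial guess" listed among the disadvantages above.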
COMPARISON OF THE CLASSICAL
MODELS
Query evaluation:
 Boolean model: evaluates queries as Boolean expressions.
 Vector model: uses index term weights and partial matching to match a
document to a query.
 Probabilistic model: evaluates queries using the probabilities that index
terms describe the ideal answer set.

Term weights:
 Boolean model: weights are binary; a document is either relevant or
irrelevant.
 Vector model: index terms are weighted, so a ranking is created based on
these weights (using similarity).
 Probabilistic model: weights are binary; initially a document either
belongs to the ideal set or is considered irrelevant.

Complexity:
 Boolean model: simple to evaluate from the query and the document.
 Vector model: more complex, since index term weighting has to be done.
 Probabilistic model: the most complex, since neither the weights nor the
ideal set is initially defined.

Performance:
 Boolean model: performance is not that good.
 Vector model: performance is considered to be optimal.
 Probabilistic model: performance is proved to be optimal in theory;
however, in practice it may become impractical.
WEB IR VERSUS TRADITIONAL IR
Modeling the Web differs from modeling traditional document collections
for the following reasons:
o The Web is huge
o The Web is dynamic
o The Web is self-organized
o Web growth is fast
o The Web is hyperlinked
GOOGLE SEARCH ENGINE
 Google, the most popular search engine, came into existence in 1998.
 It was developed by Sergey Brin and Lawrence Page as a solution for
the problem of Web information retrieval.
 DESIGN GOALS OF GOOGLE
 Improved search quality
 Academic search engine
 Usage
 Architecture
HOW GOOGLE SEARCH WORKS
 STEP 1: CRAWLING
 STEP 2: COMPRESSING
 STEP 3: INDEXING
 STEP 4: PAGERANK CALCULATION
 STEP 5: SORTING
 STEP 6: SEARCHING
GOOGLE SYSTEM FEATURES
1. ANCHOR TEXT-Google associates the text of the link with 2 things:
 The page that the link is on
 The page the link points to
2. THE PAGERANK ALGORITHM- PageRank extends the idea of
citations by not counting links from all pages equally and by
normalizing by the number of links on a page.
 We assume page A has pages T1, T2, ..., Tn which point to it (i.e., are
citations). The parameter d is a damping factor which can be set
between 0 and 1; it is usually set to 0.85. C(A) is defined as the
number of links going out of page A. The PageRank of page A is then
given by [3]:
PR(A) = (1 - d) + d (PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tn)/C(Tn))
(A toy computation with this formula is sketched at the end of this
section.)
 ADVANTAGES OF USING PAGERANK ALGORITHM-
 Random Surfer Model is used as the intuitive justification of
PageRank.
 Pages that are well cited from many places around the Web are
worth looking at. Also, pages that have perhaps only one citation
from a well known site are also generally worth looking at.
 ADVANTAGES OF USING ANCHOR TEXT-
 Anchors often provide more accurate descriptions of Web pages than the
pages themselves.
 Anchors may exist for documents which cannot be indexed by a
text-based search engine, such as images, programs, and
databases.
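As a toy illustration of the PageRank formula above, suppose page A is pointed to by two pages T1 and T2. The PageRank values and outlink counts below are invented for the example.

```python
# Hypothetical inputs: current PageRanks and outlink counts of A's citing pages.
d = 0.85                       # damping factor, as in the formula above
PR = {"T1": 1.2, "T2": 0.6}    # assumed current PageRank values
C  = {"T1": 3,   "T2": 2}      # number of outlinks on each citing page

# PR(A) = (1 - d) + d * (PR(T1)/C(T1) + PR(T2)/C(T2))
PR_A = (1 - d) + d * sum(PR[t] / C[t] for t in PR)
print(round(PR_A, 3))          # 0.15 + 0.85 * (0.4 + 0.3) = 0.745
```

In practice these values are unknown in advance, which is why the iterative procedure described in the next section is needed.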
MATHEMATICS OF PAGERANK
 The PageRank Thesis: A page is important if it is pointed to by other
important pages.
 ORIGINAL FORMULA - The PageRank of a page Pi, denoted r(Pi), is the sum
of the PageRanks of its inlinking pages, each divided by that page's
number of outlinks:
r(Pi) = Σ_{Pj ∈ BPi} r(Pj) / |Pj|
where BPi is the set of pages pointing to Pi and |Pj| is the number of
outlinks from page Pj [2].
The problem is that the PageRanks of the pages inlinking to page Pi are
unknown. So, an iterative procedure is used.
Figure 6: Example of PageRank calculation on web pages
 INITIAL ASSUMPTION: In the beginning, all pages have an equal
PageRank of 1/n, where n is the number of pages in Google's index
of the Web. The iterative formula is:
r(k+1)(Pi) = Σ_{Pj ∈ BPi} r(k)(Pj) / |Pj|
 This can also be written in matrix form, with π(k)T denoting the
PageRank row vector at the kth iteration, as:
π(k+1)T = π(k)T H
where H is the row-normalized hyperlink matrix such that Hij = 1/|Pi| if
page Pi links to page Pj, and Hij = 0 otherwise [2].
 OBSERVATIONS:
 Each iteration of the equation involves one vector-matrix
multiplication, which generally requires O(n²) computation, where
n is the order of the n×n matrix H.
 H is very sparse because most web pages link to only a handful
of other pages. Exploiting the sparsity, an iteration requires only
O(nnz(H)) computation, where nnz(H) is the number of non-zeros
in H, which reduces the effort to roughly O(n).
 The iterative method applied to H is the classical power method.
 H looks a lot like a stochastic transition probability matrix for a
Markov chain. The dangling nodes of the network, those nodes
with no outlinks, create rows of zeros in the matrix. All the other
rows, which correspond to the non-dangling nodes, are stochastic.
Thus, H is called sub-stochastic [2]. (A toy construction of H is
sketched below.)
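To ground these observations, the sketch below builds H for a hypothetical four-page web in which page 4 has no outlinks. The link structure is invented for illustration.

```python
import numpy as np

# Hypothetical link structure: page i -> list of pages it links to (0-indexed).
links = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}   # page 3 is a dangling node
n = len(links)

H = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        H[i, j] = 1.0 / len(outs)      # row-normalized hyperlink matrix

print(H)
print(H.sum(axis=1))   # rows 0-2 sum to 1 (stochastic); row 3 sums to 0 (sub-stochastic)
```

The zero row produced by the dangling node is exactly what the stochasticity adjustment described later has to repair.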
 PROBLEMS WITH THE ITERATIVE PROCESS-
1. Problem of Rank Sinks-
 Rank sinks are pages that accumulate more and more PageRank at
each iteration, while other pages are left with none.
 This behaviour is exploited by SEO schemes and link farms.
 Ranking nodes by their PageRank values is tough when a majority of
the nodes are tied with PageRank 0.
 It is preferable to have all PageRanks positive.
2. Problem of Cycles-
 In a cycle, page 1 points only to page 2 and vice versa, which creates
an infinite loop.
 The iterates will not converge no matter how long the process is run,
since π(k)T will flip-flop indefinitely.
Figure 7: (a) Rank Sink (b) Cycle
ADJUSTMENTS TO THE MODEL-
So, to counter the problems, Brin and Page made use of the Random
Surfer Model.
 Imagine a web surfer who bounces along randomly following the
hyperlink structure of the Web & when he arrives at a page with several
outlinks, he chooses one at random, hyperlinks to this new page, and
continues this random decision process indefinitely.
 In the long run, the proportion of time the random surfer spends on a
given page is a measure of the relative importance of that page.
 Unfortunately, this random surfer encounters some problems. He gets
stuck whenever he enters a dangling node, e.g., PDF files, image files,
data tables, etc. [3]
 To fix this, Brin and Page define their first adjustment, called the
stochasticity adjustment, because the zero rows (0T) of H are replaced
with (1/n)eT, thereby making the matrix stochastic. Now, the random
surfer can hyperlink to any page at random after reaching a dangling
node. The resulting stochastic matrix is called S.
So,
S = H + a(1/n eT)
where a is the dangling node vector (ai = 1 if page i is a dangling node,
and 0 otherwise).
 This adjustment guarantees that S is stochastic, but it alone cannot
guarantee the convergence results desired. So a primitivity
adjustment was made to make the matrix irreducible and aperiodic (so
that a unique PageRank vector exists).
 When the random surfer abandons the hyperlink method by
entering a new destination, the random surfer, "teleports" to the
new page, where he begins hyperlink surfing again, until the next
teleportation, and so on.
 To model this activity mathematically, Brin and Page invented a new
matrix G, such that-
G = αS + (1 - α)(1/n)eeT
where,
α is the teleportation factor (damping factor) and α ∈ (0, 1)
G is called the Google matrix
E = (1/n)eeT is the teleportation matrix
 The teleporting is random because E is uniform, meaning the surfer is
equally likely, when teleporting, to jump to any page.
 So, Google's adjusted PageRank method is
π(k+1)T = π(k)T G
which is simply the power method applied to G. (A small numerical
sketch follows.)
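Here is a minimal numerical sketch of the adjusted method, reusing the same hypothetical four-page graph as before and α = 0.85. It is a sketch of the power method on G, not production code.

```python
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}   # page 3 is dangling (assumed graph)
n, alpha = len(links), 0.85

# Row-normalized hyperlink matrix H.
H = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        H[i, j] = 1.0 / len(outs)

a = np.array([1.0 if not links[i] else 0.0 for i in range(n)])  # dangling-node vector
e = np.ones(n)
S = H + np.outer(a, e / n)                          # stochasticity adjustment
G = alpha * S + (1 - alpha) * np.outer(e, e) / n    # primitivity (teleportation) adjustment

pi = e / n                       # start from the uniform vector, PageRank 1/n each
for _ in range(100):             # power method: pi^(k+1)T = pi^(k)T G
    pi = pi @ G
print(pi / pi.sum())             # PageRank vector (normalized to sum to 1)
```

Because G is stochastic, irreducible, and aperiodic, this iteration converges to a unique positive vector regardless of the starting point.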
HITS ALGORITHM
 HITS (Hypertext Induced Topic Search) was invented by Jon Kleinberg
in 1998 and uses the Web's hyperlink structure to create popularity
scores for web pages.
 HITS produces two popularity scores and is query-dependent. HITS
thinks of web pages as authorities and hubs.
 An authority is a page with many inlinks, and a hub is a page with
many outlinks.
 The main criterion of HITS is: good authorities are pointed to by good
hubs, and good hubs point to good authorities.
 Every page i has both an authority score xi and a hub score yi. If E is
the set of all directed edges in the web graph, and each page has
somehow been assigned an initial authority score xi(0) and hub score
yi(0), then for k = 1, 2, 3, ...
xi(k) = Σ of yj(k-1) over all pages j with (j, i) ∈ E
yi(k) = Σ of xj(k) over all pages j with (i, j) ∈ E
(A minimal sketch of this iteration is given below.)
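The sketch below runs the HITS iteration on a hypothetical small graph; the edges are invented for illustration, and the scores are normalized each step so the iteration settles down.

```python
import numpy as np

# Hypothetical directed edges (i, j): page i links to page j.
edges = [(0, 2), (1, 2), (2, 0), (1, 0), (3, 2)]
n = 4
L = np.zeros((n, n))
for i, j in edges:
    L[i, j] = 1.0                      # adjacency matrix of the web graph

x = np.ones(n)                         # initial authority scores x^(0)
y = np.ones(n)                         # initial hub scores y^(0)
for _ in range(50):
    x = L.T @ y                        # authority of i = sum of hub scores of pages linking to i
    y = L @ x                          # hub of i = sum of authority scores of pages i links to
    x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)

print(np.round(x, 3), np.round(y, 3))  # authority and hub vectors
```

In a real HITS run this computation is performed only on the neighbourhood graph built for the query, which is what makes the scores query-dependent.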
HITS VERSUS PAGERANK
Scoring criterion:
 HITS: good authorities are pointed to by good hubs and good hubs point
to good authorities.
 PageRank: a webpage is important if it is pointed to by other important
pages.

Number of scores:
 HITS: dual rankings, (a) one with the most authoritative documents
related to the query and (b) another with the best "hub" documents.
 PageRank: presents only one score.

Query independence:
 HITS: the scores are calculated on a neighbourhood graph built for each
query, so they are query-dependent.
 PageRank: the score is query-independent.

Resilience to spamming:
 HITS: susceptible to spamming, since the addition of a few pages or
links can affect the rankings.
 PageRank: more resilient to spamming, since it is better able to isolate
the effect of spam pages.
FUTURE WORK
 Creating spam-resistant ranking algorithms-
 The first proposal considers each page one at a time and asks,
"What proportion of this page's outlinking pages point back to it?" If
this value exceeds a threshold, we can suspect the presence of a link
farm. (A minimal sketch of this check is given after these proposals.)
 The second proposal is to build a score that is the "opposite" of
PageRank, called BadRank, for each page. The actual ranking would then
be based on the difference of these two quantities.
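Here is a minimal sketch of the first proposal. The link graph, the threshold value, and the helper name `reciprocity` are assumptions for illustration only.

```python
# Hypothetical link graph: page -> set of pages it links to.
outlinks = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a"},
    "d": {"a"},
}

THRESHOLD = 0.8   # assumed cutoff above which a page looks suspicious

def reciprocity(page):
    """Proportion of this page's outlinked pages that point back to it."""
    outs = outlinks.get(page, set())
    if not outs:
        return 0.0
    back = sum(1 for q in outs if page in outlinks.get(q, set()))
    return back / len(outs)

for page in outlinks:
    score = reciprocity(page)
    if score > THRESHOLD:
        print(f"{page}: possible link-farm member (reciprocity {score:.2f})")
```

Pages a, b, and c form a tightly reciprocal cluster and are flagged, while page d, which only links outward, is not.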
 Intelligent Agent-
 An intelligent agent is a software robot designed to retrieve specific
information automatically. Future IR systems need to accommodate such
crawlers without causing privacy issues.
Editor's Notes

  1. The representation and organization of the information items should provide the user with easy access to the information in which he is interested. This translation yields a set of keywords (or index terms) which summarizes the description of the user information need.
  2. This interpretation involves extracting syntactic and semantic information from the text and using this information to match the user information need.
  3. Information retrieval - When a user searches for useful information, he executes a retrieval task. Example: classic information retrieval systems.
Browsing - It is still a process of retrieving information, but one whose main objectives are not clearly defined in the beginning and whose purpose might change during the interaction with the system. Examples: hypertext systems, which are usually tuned for quick browsing; modern digital library and Web interfaces.
Filtering - Both retrieval and browsing are, in the language of the World Wide Web, 'pulling' actions: the user requests the information in an interactive manner. In a filtering task, instead, the IR system filters relevant information from an incoming stream. A user profile describing the user's preferences is constructed, and this profile is compared to the incoming documents in an attempt to determine those which might be of interest to this particular user. The task of determining which ones are really relevant is fully reserved to the user.
Documents in a collection are frequently represented through a set of index terms or keywords. Such keywords might be extracted directly from the text of the document or might be specified by a human subject. They provide a logical view of the document. There are basically two logical views of a document:
a. Full text logical view - Modern computers are making it possible to represent a document by its full set of words. The full text is clearly the most complete logical view of a document, but its usage usually implies higher computational costs.
b. Index term logical view - With very large collections, however, even modern computers might have to reduce the set of representative keywords. This can be accomplished through text operations such as the elimination of stop words (articles and connectives), the use of stemming (which reduces distinct words to their common grammatical root), and the identification of noun groups (which eliminates adjectives, adverbs, and verbs). Text operations reduce the complexity of the document representation and allow moving the logical view from that of a full text to that of a set of index terms. A small set of categories provides the most concise logical view of a document, but its usage might lead to retrieval of poor quality. The retrieval system might also recognize the internal structure normally present in a document.
  4. ISSUES RELATED TO IR - Despite the high interactivity, people still find it difficult (if not impossible) to retrieve information relevant to their information needs. Thus, in the dynamic world of the Web and of large digital libraries, which techniques will allow retrieval of higher quality? With the ever increasing demand for access, quick response is becoming more and more a pressing factor. Thus, which techniques will yield faster indexes and smaller query response times? The quality of the retrieval task is greatly affected by the user interaction with the system. Thus, how will a better understanding of user behavior affect the design and deployment of new information retrieval strategies?
Practical issues: security; privacy (frequently, people are willing to exchange information as long as it does not become public, most commonly to protect themselves against misuse of private information by third parties); copyright and patent rights (it is far from clear how the widespread availability of data on the Web affects copyright and patent laws in the various countries, which matters because it affects the business of building and deploying large digital libraries). Additionally, other practical issues of interest include scanning, optical character recognition (OCR), and cross-language retrieval (in which the query is in one language but the documents retrieved are in another language).
  5. Such a decision is usually dependent on a ranking algorithm which attempts to establish a simple ordering of the documents retrieved. Documents appearing at the top of this ordering are considered to be more likely to be relevant [1]. Distinct sets of premises yield distinct information retrieval models.
  6. In this situation, we can strictly assume that the user is browsing the space instead of searching. The various models for browsing are presented next.
  7. Given its inherent simplicity and neat formalism, the Boolean model received great attention in past years and was adopted by many of the early commercial bibliographic systems.
  8. The probabilistic model is also known as the binary independence retrieval (BIR) model.
  9. 1. The Web is so big that it is hard to get an accurate count of its size. Deep web pages cannot be found by casual, routine surfing: surfers must request information from a particular database, at which point the relevant pages are served dynamically. As a result, search engines cannot easily find these dynamic pages, since they do not exist before or after the query.
2. Once a document is added to a traditional information collection, it does not change, but web pages change very frequently. For the most part, the size of a traditional document collection is relatively static, while billions of pages are added to the Web each year. The dynamics of the Web make it tough to compute relevancy scores for queries when the collection is a moving, evolving target.
3. Self-organized - Traditional document collections are usually collected and categorized by trained specialists. On the Web, however, anyone can post a webpage and link away at will. There are no standards and no gatekeepers policing content, structure, and format. The data is volatile (rapid updates, broken links, and file disappearances) and heterogeneous, existing in multiple formats, languages, and alphabets. Duplication also causes problems, and since there is no editorial review process, the data may contain many errors, falsehoods, and invalid statements. Further, this self-organization attracts spammers who capitalize on the mercantile potential offered by the Web.
4. Fast growth - An additional information retrieval challenge for any document collection concerns precision. Although the amount of accessible information continues to grow, a user's ability to look at documents does not. This user impatience means that search engine precision must increase just as rapidly as the number of documents is increasing.
5. While traditional search engines are compared by running tests on familiar, well studied, controlled collections, this is not realistic for web engines. Even small web collections are too large for researchers to catalog, count, and create estimates of the precision and recall numerators and denominators for dozens of queries. Comparing two search engines is usually done with user satisfaction studies and market share measures, in addition to the baseline comparison measures of speed and storage requirements.
In addition, both web and traditional models face two problems: synonymy and polysemy. Synonymy refers to multiple words having the same meaning, such as car and automobile. Polysemy refers to words with multiple meanings; it can cause many documents that are irrelevant to the user's actual intended query meaning to be retrieved.
  10. 1. Search quality - The main goal of Google is to improve the quality of Web search engines. In 1994, some people believed that a complete search index would make it possible to find anything easily. Anyone who has used a search engine recently can readily testify that the completeness of the index is not the only factor in the quality of search results [3]. "Junk results" often wash out any results that a user is interested in. The problem is that as the collection size grows, tools with very high precision are required. Indeed, the notion of "relevant" should include only the very best documents, since there may be many slightly relevant documents. This very high precision is important even at the expense of recall.
2. Research - Aside from tremendous growth, the Web has also become increasingly commercial over time. At the same time, search engines have migrated from the academic domain to the commercial. This caused search engine technology to remain largely a black art and to be advertising oriented. With Google, the goal was to push more development and understanding into the academic realm [3].
3. Usage - Another important design goal was to build systems that reasonable numbers of people can actually use. Usage was important because some of the most interesting research would involve leveraging the vast amount of usage data that is available from modern Web systems.
4. Architecture - The final design goal was to build an architecture that can support novel research activities on large-scale Web data. To support novel research uses, Google stores all of the actual documents it crawls in compressed form.
  11. This makes it possible to return Web pages which have not actually been crawled. Anchor propagation was used mostly because anchor text can help provide better quality results. Using anchor text efficiently is technically difficult because of the large amounts of data which must be processed.[5]
  12. PageRank can be thought of as a model of user behavior. We assume there is a "random surfer" who is given a Web page at random and keeps clicking on links, never hitting "back", but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank, and the damping factor d is the probability at each page that the "random surfer" will get bored and request another random page [3]. One important variation is to add the damping factor d only to a single page, or a group of pages. This allows for personalization and can make it nearly impossible to deliberately mislead the system in order to get a higher ranking [3]. A page can have a high PageRank if there are many pages that point to it, or if there are some pages that point to it which themselves have a high PageRank. If a page is not of high quality, or is a broken link, it is quite likely that a well-known site's page would not link to it. PageRank handles both these cases and everything in between by recursively propagating weights through the link structure of the Web [5].
  13. The nonzero elements of row i correspond to the outlinking pages of page i, whereas the nonzero elements of column i correspond to the inlinking pages of page i. We now introduce a row vector π(k)T, the PageRank vector at the kth iteration; using it, we obtain the matrix form of the iteration shown on the slide.
  14. IMPORTANT QUESTIONS: Will this iterative process continue indefinitely or will it converge? Under what circumstances or properties of H is it guaranteed to converge? Will it converge to something that makes sense in the context of the PageRank problem? Will it converge to just one vector or to multiple vectors? Does the convergence depend on the starting vector π(0)T? If it converges eventually, how long is "eventually", i.e., how many iterations can we expect until convergence?