INFORMATION RETRIEVAL SYSTEM AND
THE PAGERANK ALGORITHM
OUTLINE
 Information retrieval system
 Data retrieval versus information retrieval
 Basic concepts of information retrieval
 Retrieval process
 Classical models of information retrieval
 Boolean model
 Vector model
 Probabilistic model
 Web information retrieval
 Features of Google’s search system
 Google’s architecture
 A brief analysis of the PageRank algorithm
 PageRank versus HITS algorithm
WHAT IS INFORMATION RETRIEVAL?
 Information retrieval (IR) deals with the representation, storage,
organization of, and access to information items[1].
 The user must first translate his or her information need into a query which
can be processed by the IR system.
 The key goal of an IR system is to retrieve information which might be
useful or relevant to the user.
DATA VERSUS INFORMATION RETRIEVAL
DATA RETRIEVAL: Determines which documents of a collection contain the
keywords in the user query.
INFORMATION RETRIEVAL: Retrieves information about a subject rather than
data which satisfies a given query.

DATA RETRIEVAL: All objects which satisfy clearly defined conditions are
retrieved.
INFORMATION RETRIEVAL: The IR system somehow 'interprets' the contents of
the documents in a collection and ranks them according to a degree of
relevance to the user query.

DATA RETRIEVAL: A single erroneous object means total failure.
INFORMATION RETRIEVAL: The retrieved objects might be inaccurate, and small
errors are ignored.

DATA RETRIEVAL: The data has a well-defined structure and semantics.
INFORMATION RETRIEVAL: The data is natural language text which is not always
well structured and could be semantically ambiguous.
BASIC CONCEPTS OF IR
The effective retrieval of relevant information is directly affected by:
 User task – the task of the user might be:
 Information or data retrieval
 Browsing
 Filtering
Figure 1: User tasks in an IR system [1]
 Logical view – the index terms (words) that represent a document can be
extracted in two ways:
 Full text
 Index terms
Figure 2: Text operations for Index Term Logical View [1]
RETRIEVAL PROCESS
Step 1: Before the retrieval process can even be initiated, it is necessary
to define the text database. This is usually done by the manager of the
database, who specifies the following:
(a) the documents to be used
(b) the text operations
(c) the text model
Step 2: Once the logical view of the documents is defined, the database
manager builds an index of the text. An index is a critical data structure
because it allows fast searching over large volumes of data (e.g., an
inverted file; a minimal sketch of such an index is given after Step 5).
Figure 3 : Retrieval Process[1]
Step 3: The user then specifies an information need, which is parsed and
transformed by the same text operations applied to the text. Next, query
operations are applied to produce the actual query, which is processed
against the index to obtain the retrieved documents. Fast query processing
is made possible by the index structure previously built.
Step 4: Before being sent to the user, the retrieved documents are ranked
according to a likelihood of relevance.
Step 5: The user then examines the set of ranked documents in the
search for useful information. At this point, he might pinpoint a subset
of the documents seen as definitely of interest and initiate a user
feedback cycle[1].
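To make Step 2 concrete, here is a minimal sketch of a toy inverted index in Python. The document collection, stop-word list, and tokenizer are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict

# Hypothetical toy collection; any iterable of (doc_id, text) pairs would do.
DOCS = {
    1: "Information retrieval deals with the representation of information items",
    2: "The PageRank algorithm ranks web pages by link structure",
    3: "An inverted index allows fast searching over large volumes of text",
}

STOP_WORDS = {"the", "of", "an", "with", "by", "over"}  # assumed stop-word list

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (a crude text operation)."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def build_inverted_index(docs):
    """Map each index term to the sorted list of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

if __name__ == "__main__":
    index = build_inverted_index(DOCS)
    print(index["information"])   # -> [1]
    print(index["pagerank"])      # -> [2]
```

Looking up a term is then a dictionary access rather than a scan of every document, which is why the index makes query processing fast.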
IR MODELS
 The central problem for IR systems is predicting which documents are
relevant and which are not.
 A ranking algorithm operates according to basic premises regarding the
notion of document relevance.
 The IR model adopted determines the predictions of what is relevant and
what is not.
Figure 4 : Classification of the various IR models[1]
FORMAL DEFINITION OF IR
An information retrieval model is a quadruple
[D, Q, F, R(qi, dj)]
where:
 D is a set composed of logical views (or representations) for the
documents in the collection.
 Q is a set composed of logical views (or representations) for the user
information needs (called queries).
 F is a framework for modeling document representations, queries, and
their relationships.
 R(qi, dj) is a ranking function which associates a real number with a
query qi ∈ Q and a document representation dj ∈ D. Such a ranking
defines an ordering among the documents with regard to the query qi.
CLASSICAL MODEL
 The classic IR models consider that each document is described by a set
of representative keywords called index terms, which are used to index
and summarize the document contents.
 Distinct index terms have varying relevance when used to describe the
document contents.
 In these models, this effect is captured through the assignment of
numerical weights to each index term of a document.
 The main classical models are:
 Boolean Model
 Vector Model
 Probabilistic Model
STRUCTURED MODEL
 Retrieval models which combine information on text content with
information on the document structure are called structured text retrieval
models [1].
 There are two models for structured text retrieval:-
 Non-overlapping lists model
 Proximal nodes model
Figure 5: List structure for (a) Non-overlapping lists model (b) Proximal nodes model [1]
BROWSING MODEL
 Browsing is a process of retrieving information whose main objectives
are not clearly defined in the beginning and whose purpose might
change during the interaction with the system.
 For browsing, there are 3 models :-
 Flat model
 Structure guided model
 Hypertext model
BOOLEAN MODEL
 The Boolean model is a simple retrieval model based on set theory and
Boolean algebra.
 The queries are specified as Boolean expressions which have precise
semantics.
 The Boolean model considers that index terms are present or absent in
a document. As a result, the index term weights are assumed to be all
binary, i.e., wi,j ∈ {0, 1}.
 A query q is composed of index terms linked by three connectives: not,
and, or.
 A query is essentially a conventional Boolean expression which can be
represented as a disjunction of conjunctive vectors, i.e., in disjunctive
normal form (DNF). The binary weighted vectors are called the conjunctive
components of qdnf. (A minimal set-based sketch of Boolean retrieval
follows this list.)
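As a minimal sketch (not from the slides), Boolean retrieval can be implemented directly with set operations over postings lists; the toy postings below are assumed for illustration.

```python
# Toy postings lists: term -> set of documents containing it (assumed data).
POSTINGS = {
    "information": {1, 2, 4},
    "retrieval":   {1, 4},
    "pagerank":    {3, 4},
}
ALL_DOCS = {1, 2, 3, 4}

def AND(a, b):  return a & b            # set intersection
def OR(a, b):   return a | b            # set union
def NOT(a):     return ALL_DOCS - a     # set complement

# Query: information AND retrieval AND (NOT pagerank)
result = AND(AND(POSTINGS["information"], POSTINGS["retrieval"]),
             NOT(POSTINGS["pagerank"]))
print(sorted(result))  # -> [1]
```

Note how the answer is an unranked set: a document either satisfies the expression or it does not, which is exactly the binary decision criterion criticized below.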
ADVANTAGES:-
1. The clean formalism behind the model
2. Its simplicity
DISADVANTAGES:-
1. Its retrieval strategy is based on a binary decision criterion and
behaves more as a data retrieval model.
2. Exact matching may lead to retrieval of too few or too many
documents.
3. It is not simple to translate an information need into a Boolean
expression.
4. The Boolean expressions actually formulated by users are often
quite simple.
APPLICATIONS:-
Commercial document database systems
VECTOR MODEL
 The vector model was proposed by Salton and McGill.
 This model applies a partial matching strategy by assigning
non-binary weights to index terms in queries and in documents.
 These term weights are ultimately used to compute the degree of
similarity between each document stored in the system and the user
query.
 In the vector model,
 The weight wi,j associated with a pair (index term ki, document dj)
is positive and non-binary.
 The index terms in the query are also weighted.
(A minimal tf-idf sketch of this model is given after the advantages and
disadvantage below.)
 ADVANTAGES:
 Its term-weighting scheme improves retrieval performance.
 Its partial matching strategy allows retrieval of documents that
approximate the query conditions.
 Its cosine ranking formula sorts the documents according to their
degree of similarity to the query.
 It is a simple and resilient ranking strategy.
 DISADVANTAGE:
 Index terms are assumed to be mutually independent.
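The sketch below illustrates the vector model with a simple tf-idf weighting and the cosine measure. The toy documents and the particular weighting variant are assumptions for illustration, not the exact scheme in the slides.

```python
import math
from collections import Counter

# Hypothetical toy collection (not from the slides).
DOCS = {
    "d1": "information retrieval system",
    "d2": "pagerank ranks web pages",
    "d3": "web information retrieval",
}

def tf_idf_vectors(docs):
    """w[t] = tf(t, d) * log(N / df(t)): a simple tf-idf weighting."""
    N = len(docs)
    tokenized = {d: text.split() for d, text in docs.items()}
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))                 # document frequency of each term
    vectors = {}
    for d, terms in tokenized.items():
        tf = Counter(terms)
        vectors[d] = {t: tf[t] * math.log(N / df[t]) for t in tf}
    return vectors, df, N

def cosine(v, w):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(v[t] * w.get(t, 0.0) for t in v)
    nv = math.sqrt(sum(x * x for x in v.values()))
    nw = math.sqrt(sum(x * x for x in w.values()))
    return dot / (nv * nw) if nv and nw else 0.0

vectors, df, N = tf_idf_vectors(DOCS)
query = {t: math.log(N / df[t]) for t in "web retrieval".split() if t in df}
ranking = sorted(((cosine(query, vec), d) for d, vec in vectors.items()), reverse=True)
print(ranking)   # documents in decreasing order of similarity to the query
```

Because the match is partial, a document that shares only some query terms still receives a non-zero score and is ranked rather than discarded.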
PROBABILISTIC MODEL
 The classic probabilistic model was introduced in 1976 by Robertson and
Sparck Jones.
 The probabilistic model attempts to capture the IR problem within a
probabilistic framework.
 BASIC IDEA: Given a user query, there is a set of documents which
contains exactly the relevant documents, referred to as the ideal answer
set. Given a description of this ideal answer set, we retrieve the
documents that satisfy it.
 Thus the querying process is a process of specifying the properties of
an ideal answer set.
Assumption (Probabilistic Principle) -
‘Given a user query q and a document dj in the collection, the
probabilistic model tries to estimate the probability that the user will
find the document dj relevant.
— The model assumes that this probability of relevance depends on the
query and the document representations only.
— Further, the model assumes that there is a subset of all documents
which the user prefers as the answer set for the query q, called the
ideal answer set and labeled R, which should maximize the overall
probability of relevance to the user.
— Documents in the set R are predicted to be relevant to the query.
Documents not in this set are predicted to be non-relevant.'
 This assumption does not state explicitly:-
 How to compute the probabilities of relevance
 Which sample space to use
(A standard formulation of the resulting ranking formula is sketched
after the disadvantages below.)
ADVANTAGES:-
 The documents are ranked in decreasing order of their probability of
being relevant.
DISADVANTAGES:-
 There is a need to guess the initial separation of documents into
relevant and non-relevant sets.
 It does not take into account the frequency with which an index term
occurs inside a document.
 It adopts the independence assumption for index terms.
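For reference, and following the classic treatment in [1] rather than anything shown in the original slides, the ranking formula this model leads to (the binary independence retrieval, or BIR, formula) can be written with binary weights as

$$
\mathrm{sim}(d_j, q) \;\propto\; \sum_{k_i \in q \,\cap\, d_j}
\left( \log\frac{P(k_i \mid R)}{1 - P(k_i \mid R)}
     + \log\frac{1 - P(k_i \mid \overline{R})}{P(k_i \mid \overline{R})} \right)
$$

where P(ki | R) is the probability that index term ki occurs in a document of the relevant set R, and R̄ is the set of non-relevant documents. In the absence of relevance information, P(ki | R) is typically initialized to 0.5 and P(ki | R̄) to the fraction of documents containing ki, and both are re-estimated from the top-retrieved documents in later iterations; this is the "initial guess" listed among the disadvantages above.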
COMPARISON OF THE CLASSICAL
MODELS
Query evaluation:
 Boolean model: evaluates queries as Boolean expressions.
 Vector model: uses index term weights and partial matching to match a
document to a query.
 Probabilistic model: evaluates queries using the probabilities that index
terms describe the ideal answer set.

Term weights:
 Boolean model: weights are binary; a document is either relevant or
irrelevant.
 Vector model: index terms are weighted, so a ranking is created based on
these weights (using similarity).
 Probabilistic model: weights are binary; initially a document either
belongs to the ideal set or is considered irrelevant.

Complexity:
 Boolean model: simple to evaluate from the query and the document.
 Vector model: more complex, since index term weighting has to be done.
 Probabilistic model: the most complex, since neither the weights nor the
ideal set is initially defined.

Performance:
 Boolean model: performance is not that good.
 Vector model: performance is considered to be optimal.
 Probabilistic model: performance is proved to be optimal in theory;
however, in practice it may become impractical.
WEB IR VERSUS TRADITIONAL IR
Modeling the Web differs from modeling traditional document collections
for the following reasons:
o The Web is huge
o The Web is dynamic
o The Web is self-organized
o Web growth is fast
o The Web is hyperlinked
GOOGLE SEARCH ENGINE
 Google, the most popular search engine, came into existence in 1998.
 It was developed by Sergey Brin and Lawrence Page as a solution for
the problem of Web information retrieval.
 DESIGN GOALS OF GOOGLE
 Improved search quality
 Academic search engine
 Usage
 Architecture
HOW GOOGLE SEARCH WORKS
 STEP 1: CRAWLING
 STEP 2: COMPRESSING
 STEP 3: INDEXING
 STEP 4: PAGERANK CALCULATION
 STEP 5: SORTING
 STEP 6: SEARCHING
GOOGLE SYSTEM FEATURES
1. ANCHOR TEXT-Google associates the text of the link with 2 things:
 The page that the link is on
 The page the link points to
2. THE PAGERANK ALGORITHM- PageRank extends the idea of
citations by not counting links from all pages equally and by
normalizing by the number of links on a page.
 We assume page A has pages T1, T2, ..., Tn which point to it (i.e., are
citations). The parameter d is a damping factor which can be set
between 0 and 1; it is usually set to 0.85. C(A) is defined as the
number of links going out of page A. The PageRank of page A is then
given by [3]:
PR(A) = (1 - d) + d (PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tn)/C(Tn))
(A toy computation with this formula is sketched at the end of this
section.)
 ADVANTAGES OF USING PAGERANK ALGORITHM-
 Random Surfer Model is used as the intuitive justification of
PageRank.
 Pages that are well cited from many places around the Web are
worth looking at. Also, pages that have perhaps only one citation
from a well known site are also generally worth looking at.
 ADVANTAGES OF USING ANCHOR TEXT-
 Anchors often provide more accurate descriptions of Web pages than the
pages themselves.
 Anchors may exist for documents which cannot be indexed by a
text-based search engine, such as images, programs, and
databases.
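As a toy illustration of the PageRank formula above, suppose page A is pointed to by two pages T1 and T2. The PageRank values and outlink counts below are invented for the example.

```python
# Hypothetical inputs: current PageRanks and outlink counts of A's citing pages.
d = 0.85                       # damping factor, as in the formula above
PR = {"T1": 1.2, "T2": 0.6}    # assumed current PageRank values
C  = {"T1": 3,   "T2": 2}      # number of outlinks on each citing page

# PR(A) = (1 - d) + d * (PR(T1)/C(T1) + PR(T2)/C(T2))
PR_A = (1 - d) + d * sum(PR[t] / C[t] for t in PR)
print(round(PR_A, 3))          # 0.15 + 0.85 * (0.4 + 0.3) = 0.745
```

In practice these values are unknown in advance, which is why the iterative procedure described in the next section is needed.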
MATHEMATICS OF PAGERANK
 The PageRank Thesis: A page is important if it is pointed to by other
important pages.
 ORIGINAL FORMULA - The PageRank of a page Pi, denoted r(Pi), is the sum
of the PageRanks of its inlinking pages, each divided by that page's
number of outlinks:
r(Pi) = Σ_{Pj ∈ BPi} r(Pj) / |Pj|
where BPi is the set of pages pointing to Pi and |Pj| is the number of
outlinks from page Pj [2].
The problem is that the PageRanks of the pages inlinking to page Pi are
unknown. So, an iterative procedure is used.
Figure 6: Example of PageRank calculation on web pages
 INITIAL ASSUMPTION: In the beginning, all pages have an equal
PageRank of 1/n, where n is the number of pages in Google's index
of the Web. The iterative formula is:
r(k+1)(Pi) = Σ_{Pj ∈ BPi} r(k)(Pj) / |Pj|
 This can also be written in matrix form, with π(k)T denoting the
PageRank row vector at the kth iteration, as:
π(k+1)T = π(k)T H
where H is the row-normalized hyperlink matrix such that Hij = 1/|Pi| if
page Pi links to page Pj, and Hij = 0 otherwise [2].
 OBSERVATIONS:
 Each iteration of the equation involves one vector-matrix
multiplication, which generally requires O(n²) computation, where
n is the order of the n×n matrix H.
 H is very sparse because most web pages link to only a handful
of other pages. Exploiting the sparsity, an iteration requires only
O(nnz(H)) computation, where nnz(H) is the number of non-zeros
in H, which reduces the effort to roughly O(n).
 The iterative method applied to H is the classical power method.
 H looks a lot like a stochastic transition probability matrix for a
Markov chain. The dangling nodes of the network, those nodes
with no outlinks, create rows of zeros in the matrix. All the other
rows, which correspond to the non-dangling nodes, are stochastic.
Thus, H is called sub-stochastic [2]. (A toy construction of H is
sketched below.)
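To ground these observations, the sketch below builds H for a hypothetical four-page web in which page 4 has no outlinks. The link structure is invented for illustration.

```python
import numpy as np

# Hypothetical link structure: page i -> list of pages it links to (0-indexed).
links = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}   # page 3 is a dangling node
n = len(links)

H = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        H[i, j] = 1.0 / len(outs)      # row-normalized hyperlink matrix

print(H)
print(H.sum(axis=1))   # rows 0-2 sum to 1 (stochastic); row 3 sums to 0 (sub-stochastic)
```

The zero row produced by the dangling node is exactly what the stochasticity adjustment described later has to repair.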
 PROBLEMS WITH THE ITERATIVE PROCESS-
1. Problem of Rank Sinks-
 Rank sinks are pages that accumulate more and more PageRank at
each iteration, while other pages are left with none.
 This behaviour is exploited by SEO schemes and link farms.
 Ranking nodes by their PageRank values is tough when a majority of
the nodes are tied with PageRank 0.
 It is preferable to have all PageRanks positive.
2. Problem of Cycles-
 In a cycle, page 1 points only to page 2 and vice versa, which creates
an infinite loop.
 The iterates will not converge no matter how long the process is run,
since π(k)T will flip-flop indefinitely.
Figure 7: (a) Rank Sink (b) Cycle
ADJUSTMENTS TO THE MODEL-
So, to counter the problems, Brin and Page made use of the Random
Surfer Model.
 Imagine a web surfer who bounces along randomly following the
hyperlink structure of the Web & when he arrives at a page with several
outlinks, he chooses one at random, hyperlinks to this new page, and
continues this random decision process indefinitely.
 In the long run, the proportion of time the random surfer spends on a
given page is a measure of the relative importance of that page.
 Unfortunately, this random surfer encounters some problems. He gets
stuck whenever he enters a dangling node, e.g., PDF files, image files,
data tables, etc. [3]
 To fix this, Brin and Page define their first adjustment, called the
stochasticity adjustment, because the zero rows (0T) of H are replaced
with (1/n)eT, thereby making the matrix stochastic. Now, the random
surfer can hyperlink to any page at random after reaching a dangling
node. The resulting stochastic matrix is called S.
So,
S = H + a(1/n eT)
where a is the dangling node vector (ai = 1 if page i is a dangling node,
and 0 otherwise).
 This adjustment guarantees that S is stochastic, but it alone cannot
guarantee the convergence results desired. So a primitivity
adjustment was made to make the matrix irreducible and aperiodic (so
that a unique PageRank vector exists).
 When the random surfer abandons the hyperlink method by
entering a new destination, the random surfer, "teleports" to the
new page, where he begins hyperlink surfing again, until the next
teleportation, and so on.
 To model this activity mathematically, Brin and Page invented a new
matrix G, such that-
G = αS + (1 - α)(1/n)eeT
where,
α is the teleportation factor (damping factor) and α ∈ (0, 1)
G is called the Google matrix
E = (1/n)eeT is the teleportation matrix
 The teleporting is random because E is uniform, meaning the surfer is
equally likely, when teleporting, to jump to any page.
 So, Google's adjusted PageRank method is
π(k+1)T = π(k)T G
which is simply the power method applied to G. (A small numerical
sketch follows.)
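Here is a minimal numerical sketch of the adjusted method, reusing the same hypothetical four-page graph as before and α = 0.85. It is a sketch of the power method on G, not production code.

```python
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}   # page 3 is dangling (assumed graph)
n, alpha = len(links), 0.85

# Row-normalized hyperlink matrix H.
H = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        H[i, j] = 1.0 / len(outs)

a = np.array([1.0 if not links[i] else 0.0 for i in range(n)])  # dangling-node vector
e = np.ones(n)
S = H + np.outer(a, e / n)                          # stochasticity adjustment
G = alpha * S + (1 - alpha) * np.outer(e, e) / n    # primitivity (teleportation) adjustment

pi = e / n                       # start from the uniform vector, PageRank 1/n each
for _ in range(100):             # power method: pi^(k+1)T = pi^(k)T G
    pi = pi @ G
print(pi / pi.sum())             # PageRank vector (normalized to sum to 1)
```

Because G is stochastic, irreducible, and aperiodic, this iteration converges to a unique positive vector regardless of the starting point.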
HITS ALGORITHM
 HITS (Hypertext Induced Topic Search) was invented by Jon Kleinberg
in 1998 and uses the Web's hyperlink structure to create popularity
scores for web pages.
 HITS produces two popularity scores and is query-dependent. HITS
thinks of web pages as authorities and hubs.
 An authority is a page with many inlinks, and a hub is a page with
many outlinks.
 The main criterion of HITS is: good authorities are pointed to by good
hubs, and good hubs point to good authorities.
 Every page i has both an authority score xi and a hub score yi. If E is
the set of all directed edges in the web graph, and each page has
somehow been assigned an initial authority score xi(0) and hub score
yi(0), then for k = 1, 2, 3, ...
xi(k) = Σ of yj(k-1) over all pages j with (j, i) ∈ E
yi(k) = Σ of xj(k) over all pages j with (i, j) ∈ E
(A minimal sketch of this iteration is given below.)
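The sketch below runs the HITS iteration on a hypothetical small graph; the edges are invented for illustration, and the scores are normalized each step so the iteration settles down.

```python
import numpy as np

# Hypothetical directed edges (i, j): page i links to page j.
edges = [(0, 2), (1, 2), (2, 0), (1, 0), (3, 2)]
n = 4
L = np.zeros((n, n))
for i, j in edges:
    L[i, j] = 1.0                      # adjacency matrix of the web graph

x = np.ones(n)                         # initial authority scores x^(0)
y = np.ones(n)                         # initial hub scores y^(0)
for _ in range(50):
    x = L.T @ y                        # authority of i = sum of hub scores of pages linking to i
    y = L @ x                          # hub of i = sum of authority scores of pages i links to
    x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)

print(np.round(x, 3), np.round(y, 3))  # authority and hub vectors
```

In a real HITS run this computation is performed only on the neighbourhood graph built for the query, which is what makes the scores query-dependent.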
HITS VERSUS PAGERANK
Scoring criterion:
 HITS: good authorities are pointed to by good hubs and good hubs point
to good authorities.
 PageRank: a webpage is important if it is pointed to by other important
pages.

Number of scores:
 HITS: dual rankings, (a) one with the most authoritative documents
related to the query and (b) another with the best "hub" documents.
 PageRank: presents only one score.

Query independence:
 HITS: the scores are calculated on a neighbourhood graph built for each
query, so they are query-dependent.
 PageRank: the score is query-independent.

Resilience to spamming:
 HITS: susceptible to spamming, since the addition of a few pages or
links can affect the rankings.
 PageRank: more resilient to spamming, since it is better able to isolate
the effect of spam pages.
FUTURE WORK
 Creating spam-resistant ranking algorithms-
 The first proposal considers each page one at a time and asks,
"What proportion of this page's outlinking pages point back to it?" If
this value exceeds a threshold, we can suspect the presence of a link
farm. (A minimal sketch of this check is given after these proposals.)
 The second proposal is to build a score that is the "opposite" of
PageRank, called BadRank, for each page. The actual ranking would then
be based on the difference of these two quantities.
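Here is a minimal sketch of the first proposal. The link graph, the threshold value, and the helper name `reciprocity` are assumptions for illustration only.

```python
# Hypothetical link graph: page -> set of pages it links to.
outlinks = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a"},
    "d": {"a"},
}

THRESHOLD = 0.8   # assumed cutoff above which a page looks suspicious

def reciprocity(page):
    """Proportion of this page's outlinked pages that point back to it."""
    outs = outlinks.get(page, set())
    if not outs:
        return 0.0
    back = sum(1 for q in outs if page in outlinks.get(q, set()))
    return back / len(outs)

for page in outlinks:
    score = reciprocity(page)
    if score > THRESHOLD:
        print(f"{page}: possible link-farm member (reciprocity {score:.2f})")
```

Pages a, b, and c form a tightly reciprocal cluster and are flagged, while page d, which only links outward, is not.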
 Intelligent Agent-
 An intelligent agent is a software robot designed to retrieve specific
information automatically. Future IR systems need to accommodate such
crawlers without causing privacy issues.
Editor's Notes

  1. The representation and organization of the information items should provide the user with easy access to the information in which he is interested. This translation yields a set of keywords (or index terms) which summarizes the description of the user information need.
  2. This interpretation involves extracting syntactic and semantic information from the text and using this information to match the user information need.
  3. Information retrieval - When a user searches for useful information, he executes a retrieval task. Example: classic information retrieval systems.
Browsing - It is still a process of retrieving information, but one whose main objectives are not clearly defined in the beginning and whose purpose might change during the interaction with the system. Examples: hypertext systems, which are usually tuned for quick browsing; modern digital library and Web interfaces.
Filtering - Both retrieval and browsing are, in the language of the World Wide Web, 'pulling' actions: the user requests the information in an interactive manner. In a filtering task, instead, the IR system filters relevant information from an incoming stream. A user profile describing the user's preferences is constructed, and this profile is compared to the incoming documents in an attempt to determine those which might be of interest to this particular user. The task of determining which ones are really relevant is fully reserved to the user.
Documents in a collection are frequently represented through a set of index terms or keywords. Such keywords might be extracted directly from the text of the document or might be specified by a human subject. They provide a logical view of the document. There are basically two logical views of a document:
a. Full text logical view - Modern computers are making it possible to represent a document by its full set of words. The full text is clearly the most complete logical view of a document, but its usage usually implies higher computational costs.
b. Index term logical view - With very large collections, however, even modern computers might have to reduce the set of representative keywords. This can be accomplished through text operations such as the elimination of stop words (articles and connectives), the use of stemming (which reduces distinct words to their common grammatical root), and the identification of noun groups (which eliminates adjectives, adverbs, and verbs). Text operations reduce the complexity of the document representation and allow moving the logical view from that of a full text to that of a set of index terms. A small set of categories provides the most concise logical view of a document, but its usage might lead to retrieval of poor quality. The retrieval system might also recognize the internal structure normally present in a document.
  4. ISSUES RELATED TO IR - Despite the high interactivity, people still find it difficult (if not impossible) to retrieve information relevant to their information needs. Thus, in the dynamic world of the Web and of large digital libraries, which techniques will allow retrieval of higher quality? With the ever increasing demand for access, quick response is becoming more and more a pressing factor. Thus, which techniques will yield faster indexes and smaller query response times? The quality of the retrieval task is greatly affected by the user interaction with the system. Thus, how will a better understanding of user behavior affect the design and deployment of new information retrieval strategies?
Practical issues: security; privacy (frequently, people are willing to exchange information as long as it does not become public, most commonly to protect themselves against misuse of private information by third parties); copyright and patent rights (it is far from clear how the widespread availability of data on the Web affects copyright and patent laws in the various countries, which matters because it affects the business of building and deploying large digital libraries). Additionally, other practical issues of interest include scanning, optical character recognition (OCR), and cross-language retrieval (in which the query is in one language but the documents retrieved are in another language).
  5. Such a decision is usually dependent on a ranking algorithm which attempts to establish a simple ordering of the documents retrieved. Documents appearing at the top of this ordering are considered to be more likely to be relevant [1]. Distinct sets of premises yield distinct information retrieval models.
  6. In this situation, we can strictly assume that the user is browsing the space instead of searching. The various models for browsing are presented next.
  7. Given its inherent simplicity and neat formalism, the Boolean model received great attention in past years and was adopted by many of the early commercial bibliographic systems.
  8. The probabilistic model is also known as the binary independence retrieval (BIR) model.
  9. 1. The Web is so big that it is hard to get an accurate count of its size. Deep web pages cannot be found by casual, routine surfing: surfers must request information from a particular database, at which point the relevant pages are served dynamically. As a result, search engines cannot easily find these dynamic pages, since they do not exist before or after the query.
2. Once a document is added to a traditional information collection, it does not change, but web pages change very frequently. For the most part, the size of a traditional document collection is relatively static, while billions of pages are added to the Web each year. The dynamics of the Web make it tough to compute relevancy scores for queries when the collection is a moving, evolving target.
3. Self-organized - Traditional document collections are usually collected and categorized by trained specialists. On the Web, however, anyone can post a webpage and link away at will. There are no standards and no gatekeepers policing content, structure, and format. The data is volatile (rapid updates, broken links, and file disappearances) and heterogeneous, existing in multiple formats, languages, and alphabets. Duplication also causes problems, and since there is no editorial review process, the data may contain many errors, falsehoods, and invalid statements. Further, this self-organization attracts spammers who capitalize on the mercantile potential offered by the Web.
4. Fast growth - An additional information retrieval challenge for any document collection concerns precision. Although the amount of accessible information continues to grow, a user's ability to look at documents does not. This user impatience means that search engine precision must increase just as rapidly as the number of documents is increasing.
5. While traditional search engines are compared by running tests on familiar, well studied, controlled collections, this is not realistic for web engines. Even small web collections are too large for researchers to catalog, count, and create estimates of the precision and recall numerators and denominators for dozens of queries. Comparing two search engines is usually done with user satisfaction studies and market share measures, in addition to the baseline comparison measures of speed and storage requirements.
In addition, both web and traditional models face two problems: synonymy and polysemy. Synonymy refers to multiple words having the same meaning, such as car and automobile. Polysemy refers to words with multiple meanings; it can cause many documents that are irrelevant to the user's actual intended query meaning to be retrieved.
  10. 1. Search quality - The main goal of Google is to improve the quality of Web search engines. In 1994, some people believed that a complete search index would make it possible to find anything easily. Anyone who has used a search engine recently can readily testify that the completeness of the index is not the only factor in the quality of search results [3]. "Junk results" often wash out any results that a user is interested in. The problem is that as the collection size grows, tools with very high precision are required. Indeed, the notion of "relevant" should include only the very best documents, since there may be many slightly relevant documents. This very high precision is important even at the expense of recall.
2. Research - Aside from tremendous growth, the Web has also become increasingly commercial over time. At the same time, search engines have migrated from the academic domain to the commercial. This caused search engine technology to remain largely a black art and to be advertising oriented. With Google, the goal was to push more development and understanding into the academic realm [3].
3. Usage - Another important design goal was to build systems that reasonable numbers of people can actually use. Usage was important because some of the most interesting research would involve leveraging the vast amount of usage data that is available from modern Web systems.
4. Architecture - The final design goal was to build an architecture that can support novel research activities on large-scale Web data. To support novel research uses, Google stores all of the actual documents it crawls in compressed form.
  11. This makes it possible to return Web pages which have not actually been crawled. Anchor propagation was used mostly because anchor text can help provide better quality results. Using anchor text efficiently is technically difficult because of the large amounts of data which must be processed.[5]
  12. PageRank can be thought of as a model of user behavior. We assume there is a "random surfer" who is given a Web page at random and keeps clicking on links, never hitting "back", but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank, and the damping factor d is the probability at each page that the "random surfer" will get bored and request another random page [3]. One important variation is to add the damping factor d only to a single page, or a group of pages. This allows for personalization and can make it nearly impossible to deliberately mislead the system in order to get a higher ranking [3]. A page can have a high PageRank if there are many pages that point to it, or if there are some pages that point to it which themselves have a high PageRank. If a page is not of high quality, or is a broken link, it is quite likely that a well-known site's page would not link to it. PageRank handles both these cases and everything in between by recursively propagating weights through the link structure of the Web [5].
  13. The nonzero elements of row i correspond to the outlinking pages of page i, whereas the nonzero elements of column i correspond to the inlinking pages of page i. We now introduce a row vector π(k)T, the PageRank vector at the kth iteration; using it, we obtain the matrix form of the iteration shown on the slide.
  14. IMPORTANT QUESTIONS: Will this iterative process continue indefinitely or will it converge? Under what circumstances or properties of H is it guaranteed to converge? Will it converge to something that makes sense in the context of the PageRank problem? Will it converge to just one vector or to multiple vectors? Does the convergence depend on the starting vector π(0)T? If it converges eventually, how long is "eventually", i.e., how many iterations can we expect until convergence?