PRESENTATION ON PROJECT
TOPIC:- “DOCUMENT RANKING
USING QPRP WITH CONCEPT
OF MULTI-DIMENSIONAL
SUBSPACE”
1
Presented By:-
• Prakash Kumar Dubey (08)
Guided By:-
• Mr. Sourish Dhar (Dept. of IT)
• Mr. Bhagaban Swain (Dept. of IT)
Overview
2
 Introduction.
 Architecture of IR.
 Classical models of IR.
 Quantum probability.
 Document ranking using qPRP.
 Proposed solution.
 Implementation and Data collection.
 Conclusion.
 Future work.
Information Retrieval
3
• Information Retrieval (IR) is the task of searching for relevant
information in large collections of data.
• Examples of IR
Q- Give me articles about Laloo Prasad Yadav and the fodder scam.
R- Evidence regarding Laloo Prasad Yadav's involvement in the
fodder scam. - text retrieval.
Q- What does a brain tumor look like on a CT-scan?
R- A picture of a brain tumor - image retrieval.
• Not to be confused with Data Retrieval.
Main Components
4
There are five main components of the basic information retrieval
system.
i. Crawling.
ii. Indexing.
iii. User’s Query.
iv. Ranking.
v. Relevance Feedback.
Basic Architecture of IR
5
Cont…..
6
Crawling:- The system browses the document collection and
fetches documents.
i. Selection Policy.
ii. Revisit Policy.
iii. Politeness Policy.
Indexing:- System builds an index of the documents.
i. Tokenization.
ii. Stop-word Eliminator.
iii. Stemmer.
iv. Inverted Index.
Cont….
7
Ranking:- When the user issues a query, the index is consulted to retrieve
the most relevant documents, which are then ranked by their
importance.
Relevance Feedback:- A classical way of refining search engine
rankings. e.g.:- "matrix" (mathematics or the movie).
Three Types of relevance feedback:-
* Explicit.
* Implicit.
* Pseudo.
Theoretical Models in IR
8
 Theoretical models give us different ways of solving IR-related
problems.
 An IR model is defined as a 4-tuple [D, Q, F, R(qi, dj)]. Here,
 D- It represents the document collection.
 Q- Query collection collected from the users.
 F- Framework for modeling document representation,
queries and their relationships.
 R(qi,dj)- Ranking function which associates a score with
the pair (qi,dj).
Classical Models Of IR
9
The main three classical models of Information Retrieval
are:-
 Boolean Model
 Vector Space Model
 Probabilistic Model.
Boolean Model
10
 The model is based on the set theory and boolean algebra.
 Each document is considered as a bag of index terms(words or
phrases from the documents important to establish its meaning).
 A query here is an expression using Boolean connectives such as
AND, OR, and NOT.
 Retrieved documents must completely match the given query, and the
result set is not ordered.
Boolean Query Example..
11
Suppose we have 3 documents:-
Doc1:- Cricket is the most popular game of India.
Doc2:- Ricky Ponting is the most successful captain of cricket
Australia.
Doc3:- India is ranked 5th in the latest ICC test cricket ranking.
If a user wants to know about Indian cricket then a simple query is:
India AND Cricket AND NOT Australia.
An inverted index is formed. India is present in documents {1,3}, Cricket is
present in documents {1,2,3} and Australia is present in document {2}, so
finally {1,3} is selected.
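A minimal sketch of how such a Boolean query can be answered with an inverted index (illustrative code, not from the project; the query is taken to be India AND Cricket AND NOT Australia):

```java
import java.util.*;

public class BooleanRetrieval {
    public static void main(String[] args) {
        // Inverted index: term -> set of document IDs containing it.
        Map<String, Set<Integer>> index = new HashMap<>();
        index.put("india", new TreeSet<>(Arrays.asList(1, 3)));
        index.put("cricket", new TreeSet<>(Arrays.asList(1, 2, 3)));
        index.put("australia", new TreeSet<>(Arrays.asList(2)));

        // India AND Cricket AND NOT Australia
        Set<Integer> result = new TreeSet<>(index.get("india"));
        result.retainAll(index.get("cricket"));   // intersection: {1, 3}
        result.removeAll(index.get("australia")); // set difference: {1, 3}
        System.out.println(result);               // prints [1, 3]
    }
}
```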
Pros and Cons…
12
Advantages:-
i. Simple, efficient and easy to implement.
ii. Very precise in nature: the user gets exactly what was asked for.
Disadvantages:-
i. Partial matches are not retrieved, which in many cases
is not suitable. Retrieved documents are not ranked.
ii. Given a large set of documents, it retrieves either too many or
very few documents.
iii. The query does not capture synonymous terms.
iv. The model does not use term weights.
Vector Space Model
13
 In this model the documents are represented as a vector of index
terms. It has the ability to fetch partial matches.
 Here we consider more than the mere presence or absence of terms;
in the vector model the term weights are not binary.
 Queries are also represented as vectors.
 The similarity between the two vectors is actually calculated as the
cosine similarity between them using which we find the relevance of
the document.
$\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$
Some Important Terms
14
 Modelling as a Clustering Method.
 Fixing the Term weights.
i. Term Frequency (tf):
$tf_{i,j} = \dfrac{freq_{i,j}}{\max_l(freq_{l,j})}$
ii. Inverse Document Frequency (idf):
$idf_i = \log\dfrac{N}{n_i}$
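A small sketch of these two weighting formulas in plain Java (method and parameter names are illustrative, not from the project code):

```java
public class TermWeights {
    // tf(i,j) = freq(i,j) / max_l freq(l,j): the raw count of term i in
    // document j, normalized by the most frequent term in that document.
    static double tf(int freq, int maxFreqInDoc) {
        return (double) freq / maxFreqInDoc;
    }

    // idf(i) = log(N / n_i): N documents in the collection, n_i of which
    // contain term i. Rare terms get higher weight.
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    // The combined tf-idf weight for term i in document j.
    static double tfIdf(int freq, int maxFreqInDoc, int numDocs, int docFreq) {
        return tf(freq, maxFreqInDoc) * idf(numDocs, docFreq);
    }
}
```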
Similarity Measure Between Two
Vectors
 The most widely used method to measure the similarity between two
vectors is Cosine Similarity.
 The cosine similarity of a query q and a document dj is given by:-
$similarity(d_j, q) = \cos\theta = \dfrac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|}$
Here,
Ɵ = Angle between two vectors.
w(i,j)= Term weight of ith term of jth document.
w(i,q)= Term weight assigned to ith term of the query.
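A sketch of this measure over dense term-weight vectors (illustrative code; real systems typically compute it over sparse vectors):

```java
public class CosineSimilarity {
    // cos(theta) = (d . q) / (|d| * |q|)
    static double cosine(double[] d, double[] q) {
        double dot = 0, normD = 0, normQ = 0;
        for (int i = 0; i < d.length; i++) {
            dot += d[i] * q[i];
            normD += d[i] * d[i];
            normQ += q[i] * q[i];
        }
        if (normD == 0 || normQ == 0) return 0; // guard against empty vectors
        return dot / (Math.sqrt(normD) * Math.sqrt(normQ));
    }
}
```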
Cont..
16
 The retrieved set of documents are those dk for which similarity(dk, q) is
greater than a threshold value.
 The threshold can be lowered if, for some query, the highest
similarity is low, thereby allowing partial matches to be
retrieved.
[Figure: vectors dj and qj; the value of cos θ increases as the angle between them decreases.]
Pros and Cons…
17
Advantages:-
i. Partial matching possible.
ii. Ranking of retrieved results according to cosine
similarity is possible.
Disadvantages:-
i. Index terms are considered mutually independent, which prevents
the model from capturing the semantics of the query or document.
ii. It cannot denote the “clear logic view” like Boolean
model.
Probabilistic Model
18
 We try to capture the information retrieval process from a
probabilistic framework.
 Idea is to retrieve the documents according to the probability of the
document being relevant.
 Several versions of the probabilistic model are available.
 We will use the Robertson–Spärck Jones version.
Probabilistic Model (Why)
19
 Other models are mostly empirical
 success measured by experimental results
 few properties provable
 Probabilistic Ranking Principle
 provable “minimization of risk”
 Information Retrieval deals with Uncertain Information
 An IR system makes an uncertain guess about whether a document satisfies
the query.
 Probability theory provides a principled foundation for such reasoning
under uncertainty.
 Vector space model: rank documents according to similarity to query.
Probability Ranking Principle
 Collection of Documents
 User issues a query
 A Set of documents needs to be returned
 Question: In what order to present documents to user ?
20
Probability Ranking Principle
 Question: In what order to present documents to user ?
 Intuitively, want the “best” document to be first, second best -
second, etc…
 Need a formal way to judge the “goodness” of documents w.r.t.
queries.
 Idea: Probability of relevance of the document w.r.t. query
21
The Probabilistic Ranking Principle
22
If a reference retrieval system's response to each request is a ranking
of the documents in the collection in order of decreasing probability
of relevance to the user who submitted the request, where the
probabilities are estimated as accurately as possible on the basis of
whatever data have been made available to the system for this purpose,
the overall effectiveness of the system to its user will be the best that
is obtainable on the basis of that data.
What is the probability of this document being
relevant given this query?
Probabilistic Ranking Principle
 Definition
 All index term weights are binary, i.e., $w_{i,j} \in \{0,1\}$
 Let R be the set of documents known to be relevant to query q
 Let $\bar{R}$ be the set of non-relevant documents
 Let $P(R|d_j)$ be the probability that document dj is relevant to
the query q
 Let $P(\bar{R}|d_j)$ be the probability that document dj is non-relevant
to query q
23
Cont…
24
 Here we want to rank the documents (d w.r.t. query q) according to
the probability of the document to be relevant.
 Mathematically scoring function is given by:-
P(R = 1| d,q)
 R is an indicator variable: it takes value 1 if document d is relevant
w.r.t. query q, and 0 if d is non-relevant w.r.t. q.
Probability Ranking Principle
Let x be a document in the collection.
Let R represent relevance of a document w.r.t. given (fixed)
query and let NR represent non-relevance.
$p(R|x) = \dfrac{p(x|R)\,p(R)}{p(x)} \qquad p(NR|x) = \dfrac{p(x|NR)\,p(NR)}{p(x)}$
p(x|R), p(x|NR) - probability that if a relevant (non-relevant)
document is retrieved, it is x.
Need to find p(R|x) - probability that a retrieved document x
is relevant.
p(R),p(NR) - prior probability
of retrieving a (non) relevant
document
25
Probability Ranking Principle
$p(R|x) = \dfrac{p(x|R)\,p(R)}{p(x)} \qquad p(NR|x) = \dfrac{p(x|NR)\,p(NR)}{p(x)}$
Ranking Principle (Bayes’ Decision Rule):
If p(R|x) > p(NR|x) then x is relevant,
otherwise x is not relevant
 The similarity sim(dj,q) of the document dj to the query q is defined
as the ratio.
 Using Bayes’ rule,
$sim(d_j, q) = \dfrac{P(R|d_j)}{P(\bar{R}|d_j)}$
26
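As a quick worked example with made-up numbers: suppose $p(x|R) = 0.5$, $p(R) = 0.3$, $p(x|NR) = 0.1$ and $p(NR) = 0.7$. Since $p(x)$ appears in both posteriors, it cancels in the comparison:

$p(R|x) \propto p(x|R)\,p(R) = 0.5 \times 0.3 = 0.15$
$p(NR|x) \propto p(x|NR)\,p(NR) = 0.1 \times 0.7 = 0.07$

Because $0.15 > 0.07$, the decision rule judges $x$ relevant.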
Binary Independence Model
27
 Binary Independence Model for calculating the probability of
relevance.
 Name is binary because the documents and queries are represented
as binary (Boolean) term incidence vectors.
$\vec{x} = (x_1, x_2, \ldots, x_n)$, where $x_i = 1$ iff term $i$ is present in document $x$.
 Independence means terms are independent of each other.
Cont…
28
Three assumptions are made by the Binary Independence
Model (BIM):
1. The documents are independent of each other.
2. The terms in a document are independent of
each other.
3. The terms not present in query are equally
likely to occur in any document i.e. do not
affect the retrieval process.
Okapi BM25 Ranking Function
29
 The probabilistic IR model is very generic in nature.
 Many versions of probabilistic IR exist and are used in practice.
 The Okapi BM25 algorithm is based on the probabilistic IR model.
 It pays attention to term frequency (tf) and document length.
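For concreteness, a hedged sketch of one common BM25 scoring variant (the exact idf form and the defaults k1 = 1.2, b = 0.75 vary between implementations and are assumptions here, not taken from the library used later):

```java
public class Bm25 {
    static final double K1 = 1.2, B = 0.75; // typical default parameters

    // Score contribution of one query term for one document.
    // tf: term frequency in the document, docLen: document length,
    // avgDocLen: average document length in the collection,
    // n: collection size, df: documents containing the term.
    static double score(int tf, double docLen, double avgDocLen, int n, int df) {
        double idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
        // Term-frequency saturation plus document-length normalization.
        double norm = tf * (K1 + 1) / (tf + K1 * (1 - B + B * docLen / avgDocLen));
        return idf * norm;
    }
}
```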
Disadvantages of PRP
30
The PRP model does not hold when its assumptions fail.
Calibration:- The probability estimated by the IR system may not
match the user's assessment of relevance.
Independent Relevance:- The relevance of documents is assumed to be
independent of each other.
Certainty in Estimation:- The probability of relevance of a document is
reported as a single scalar by the IR system.
Quantum Probability
31
 Quantum probability theory naturally includes interference effects
between events.
 We assume that this interference shows the inter-dependency of
relevance of the documents.
 The outcome is a more sophisticated principle, the Quantum
Probability Ranking Principle(qPRP).
 To understand the difference between Kolmogorovian and Quantum
probability theory on the basis of relevance of documents we will
use Double Slit Experiment.
Double Slit Experiment
32
Settings of Double Slit Experiment
Cont..
33
[Figure panels: (a) distribution of pA and pB in the double slit experiment; (b) distribution of p^K_AB in the double slit experiment as estimated by Kolmogorovian probability; (c) distribution of p̂_AB as measured in the double slit experiment.]
Cont…
34
 Kolmogorovian Probability Theory:-
$p^K_{AB} = p(x|A) + p(x|B) = p_A + p_B$
 Quantum Probability Theory:-
$p^Q_{AB} = p_A + p_B + 2\sqrt{p_A}\sqrt{p_B}\cos(\theta_{AB}) = p_A + p_B + I_{AB}$
where $\theta_{AB} = \theta_A - \theta_B$ and $I_{AB}$ is the quantum interference term.
In reality:-
$p_{AB} \neq p_A + p_B$, i.e., $p_{AB} \neq p^K_{AB}$
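A tiny numeric sketch of the quantum combination rule (values are illustrative only):

```java
public class QuantumCombination {
    // p^Q_AB = pA + pB + 2*sqrt(pA)*sqrt(pB)*cos(thetaA - thetaB)
    static double pQ(double pA, double pB, double thetaA, double thetaB) {
        double interference =
                2 * Math.sqrt(pA) * Math.sqrt(pB) * Math.cos(thetaA - thetaB);
        return pA + pB + interference;
    }

    public static void main(String[] args) {
        // Opposite phases: destructive interference, 0.2 + 0.2 - 0.4 = 0.0
        System.out.println(pQ(0.2, 0.2, 0.0, Math.PI));
        // Equal phases: constructive interference, 0.2 + 0.2 + 0.4 = 0.8
        System.out.println(pQ(0.2, 0.2, 0.0, 0.0));
    }
}
```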
An Analogy with Document Ranking
35
 Here Particle corresponds to the user who is characterized
by an information need.
 Each slit corresponds to a document, e.g. 2 slits means 2 documents.
 The event of a particle passing from left of the screen to the
right is comparable with the user examining the set of doc.
 p(x|A,B) is analogous to p(S|dA, dB), where S is the event that the
user stops the search satisfied.
Cont…
36
Fig:- Analogy between Double Slit Experiment and Document
Ranking Process in IR.
Ranking Documents Within the Analogy
37
Fig:- The IR analogue of the previous figure
Cont…
38
 Kolmogorovian Probability:-
$p^K_{AB} = p_A + p_B$
The following equalities can be derived:-
$\arg\max_{B \in \mathcal{B}}(p_{AB}) = \arg\max_{B \in \mathcal{B}}(p^K_{AB}) = \arg\max_{B \in \mathcal{B}}(p_A + p_B) = \arg\max_{B \in \mathcal{B}}(p_B)$
Cont…
39
 Quantum Probability:-
$p^Q_{AB} = p_A + p_B + I_{AB}$
The following equalities can be derived:-
$\arg\max_{B \in \mathcal{B}}(p_{AB}) = \arg\max_{B \in \mathcal{B}}(p^Q_{AB}) = \arg\max_{B \in \mathcal{B}}(p_A + p_B + I_{AB}) = \arg\max_{B \in \mathcal{B}}(p_B + I_{AB})$
Ranking The First Document
40
When ranking the first document, Kolmogorovian and quantum probability
theory give the same estimate, i.e. $p^K = p^Q = p_A$
Ranking Subsequent Documents
41
Slits A and B are kept fixed and the third slit is varied among the slits of set Ƈ
Cont…
42
 Kolmogorovian Probability:-
$p^K_{ABC} = p_A + p_B + p_C$
The following equalities can be derived:-
$\arg\max_{C \in \mathcal{C}}(p_{ABC}) = \arg\max_{C \in \mathcal{C}}(p^K_{ABC}) = \arg\max_{C \in \mathcal{C}}(p_A + p_B + p_C) = \arg\max_{C \in \mathcal{C}}(p_C)$
Cont…
43
 Quantum Probability:-
$p^Q_{ABC} = p_A + p_B + p_C + 2\sqrt{p_A}\sqrt{p_B}\cos(\theta_A - \theta_B) + 2\sqrt{p_A}\sqrt{p_C}\cos(\theta_A - \theta_C) + 2\sqrt{p_B}\sqrt{p_C}\cos(\theta_B - \theta_C)$
$p^Q_{ABC} = p_A + p_B + p_C + I_{AB} + I_{AC} + I_{BC}$
The following equalities can be derived:-
$\arg\max_{C \in \mathcal{C}}(p_{ABC}) = \arg\max_{C \in \mathcal{C}}(p^Q_{ABC}) = \arg\max_{C \in \mathcal{C}}(p_A + p_B + p_C + I_{AB} + I_{AC} + I_{BC}) = \arg\max_{C \in \mathcal{C}}(p_C + I_{AC} + I_{BC})$
Quantum Probability Ranking
Principle(qPRP)
44
Assumptions:-
I. Ranking is Performed Sequentially.
II. Empirical data is best described using Quantum Probabilities.
III. It is assumed that the documents that have been ranked before
may influence further relevance assessments.
Interpretation of Interference in
qPRP
45
 Quantum interference is central in the formalization of qPRP.
 Once interference is expressed in terms of IR, these questions may
arise:-
1. What does quantum interference mean in qPRP and in IR?
2. How does the quantum interference term influence document
ranking?
Estimating Interference in qPRP for Information Retrieval
46
$I_{d_A d_B} = 2\sqrt{P(R|q,d_A)}\sqrt{P(R|q,d_B)}\cos\theta_{d_A d_B} \approx 2\sqrt{P(R|q,d_A)}\sqrt{P(R|q,d_B)}\,\beta\, f_{sim}(d_A, d_B)$
The $\cos\theta$ in the interference term is approximated using a similarity function $f_{sim}(d_A, d_B)$,
where,
$f_{sim}$* is a function used to compute the similarity between dA and dB,
β is a real-valued parameter.
Note(*):- Different similarity functions can be used, viz. Cosine
Similarity, Jaccard Similarity, etc.
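A minimal sketch of this approximation (names are illustrative):

```java
public class Interference {
    // I_AB ≈ 2 * sqrt(P(R|q,dA)) * sqrt(P(R|q,dB)) * beta * fsim(dA, dB):
    // cos(theta_AB) is replaced by beta times a document similarity.
    static double interference(double pRelA, double pRelB,
                               double beta, double similarity) {
        return 2 * Math.sqrt(pRelA) * Math.sqrt(pRelB) * beta * similarity;
    }
}
```

In the qPRP literature β is often chosen negative, so that high similarity between dA and dB yields destructive interference and the resulting ranking is diversified.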
Constructing Document Representation
47
 We associate each document with a vector.
 Vector is defined on the vector space made up by the terms present
in the documents.
 Each term in a collection is considered as a dimension of the vector
space.
 Different strategies can be employed to compute the components of
the term-vector for a document.
example:-
Binary Schema, TF-IDF, BM25 etc.
Proposed Solution
48
 We do not find any major drawbacks in the qPRP approach.
 qPRP can be thought of as a new model for IR.
 The existing qPRP approach treats terms from different sections of a
document equally.
 Our belief is that representing the document as a multidimensional
subspace will give better results.
 We should not give equal weight to a term in the title and a term in the body.
Reasons for Considering Documents as Multidimensional Objects
49
 Writers write the different parts of a document with different intents.
 Title:- Gives an idea of the content of the document in 3-7 words.
 1st Paragraph or Abstract:- An overview of the whole document.
 Body:- The content of the document.
 Conclusion:- Summarizes the document.
 Writers set terms in different fonts and sizes, e.g. key terms in italics.
 Considering documents as multidimensional will allow building
"truly" interactive IR systems.
Reasons for Considering Documents as Multidimensional Objects
50
 Complex aspects of the retrieval process benefit from more
sophisticated representations of documents and queries.
 It reduces the length of the subspaces.
 Hence, if a word appears in any segment, the document is more likely to
satisfy the user.
How is a document represented as a multidimensional subspace?
51
 In the previous representation of a document:
[Figure: a sample document (Title: "School of Tech.", an abstract/first paragraph, a body mentioning "School of" and "Technology", and a conclusion) collapsed into a single binary term vector, Doc 1 = (0, 1, 1, 1, 0, 0, 1, 1, 1, 1).]
Document Fragments
52
To represent a document as a multidimensional subspace, we need to divide the
document into different fragments.
 Choice 1: Use a single fragment: the document itself.
 Choice 2: Use the different sections of the document (i.e. title, abstract,
etc.) as fragments.
 Choice 3: Use paragraphs as fragments, as they seem to be an
appropriate size to correspond to an Information Need (IN).
 Choice 4: Use sentences as fragments.
Fragments as Document Sections
53
[Figure: the same sample document, now represented fragment by fragment; Doc 1 becomes a binary term-by-fragment matrix with one column each for Title, Abstract, Body, and Conclusion.]
Fragments as Paragraphs & Sentences
54
 Document can be represented as a set of information needs (IN),
each being represented as a vector.
 We can decompose paragraph or sentence into text excerpts that are
associated with one or more INs.
 In the same way, a query can be broken into INs.
Representation for each Segmentation
55
 Three weighting schemes are used:-
1. Term Frequency-Inverse Document Frequency (TF-IDF)
2. Term Frequency(TF)
3. Binary(Term presence/absence)
 TF-IDF causes substantial overhead.
 We can use TF and binary weighting instead.
Implementing Multidimensional
Subspace with qPRP
56
 To decide the rank between two documents, from qPRP we know
that,
$p^Q_{AB} = p_A + p_B + 2\sqrt{p_A}\sqrt{p_B}\cos(\theta_{AB})$
 Different parts of a document have different weights.
 There are two approaches for implementing MD subspace with
qPRP:-
1. Implementing with whole formula
2. Implementing only with similarity function
Implementing with the Whole Formula
57
 The qPRP formula is applied to each section of the document
independently.
 After calculating the score for each part and multiplying it by the
respective weight, we sum over the fragments of the document.
 The same similarity function can be used for every section.
 Suppose we assign the following weights for two documents A and B:
 Title = 0.2, Abstract = 0.3
 Body = 0.3, Conclusion = 0.2
$p^Q_{AB} = w_{title}(p^Q_{AB})_{title} + w_{abstract}(p^Q_{AB})_{abstract} + w_{body}(p^Q_{AB})_{body} + w_{conclusion}(p^Q_{AB})_{conclusion}$
Implementing only with the Similarity Function
58
 Only the similarity function is computed per document fragment,
rather than the whole formula.
 Calculate the similarity between the respective fragments of the two
documents and add them all up, as in the sketch below:
$\theta_{AB} = w_{title}(\theta_{AB})_{title} + w_{abstract}(\theta_{AB})_{abstract} + w_{body}(\theta_{AB})_{body} + w_{conclusion}(\theta_{AB})_{conclusion}$
 Different formulas can be used for calculating similarity between
multidimensional subspaces.
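A small sketch of this weighted per-fragment combination (the weights are the illustrative values from the previous slide; per-fragment angles would come from a similarity function such as cosine over the fragment term vectors):

```java
public class FragmentSimilarity {
    // Illustrative fragment weights from the previous slide.
    static final double W_TITLE = 0.2, W_ABSTRACT = 0.3,
                        W_BODY = 0.3, W_CONCLUSION = 0.2;

    // Combine per-fragment angles into the single theta_AB used in the
    // interference term.
    static double combinedTheta(double title, double abstr,
                                double body, double conclusion) {
        return W_TITLE * title + W_ABSTRACT * abstr
             + W_BODY * body + W_CONCLUSION * conclusion;
    }
}
```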
Metrics for Measuring the Extent of Interference
59
 The subspace similarity $sim_s(S_a, S_b)$ between the p-dimensional subspace
$S_a$ and the r-dimensional subspace $S_b$ is defined as:-
$sim_s(S_a, S_b) = 1 - \dfrac{\max(p,r) - \sum_{i=1}^{p}\sum_{j=1}^{r}(u_i^T v_j)^2}{\max(p,r)}$
 This formula can also be used to calculate the similarity between two
semantic spaces.
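A sketch of this metric over explicit basis vectors (plain arrays rather than the Apache Commons Math types used in the project; it assumes the rows of each matrix are orthonormal basis vectors of the subspace):

```java
public class SubspaceSimilarity {
    // sims(Sa, Sb) = 1 - (max(p,r) - sum_i sum_j (u_i^T v_j)^2) / max(p,r)
    // ua: p basis vectors of Sa; vb: r basis vectors of Sb.
    static double sims(double[][] ua, double[][] vb) {
        int p = ua.length, r = vb.length;
        double sum = 0;
        for (double[] u : ua)
            for (double[] v : vb)
                sum += Math.pow(dot(u, v), 2);
        int max = Math.max(p, r);
        return 1 - (max - sum) / max;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }
}
```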
Implementation and Results
 For the implementation of the project and efficient evaluation of the
results, the prerequisites we have used are:-
 Software requirements:-
 Windows 7
 Microsoft Office 2010 (For project report)
 JDK 1.6.0 (Compiler) or higher version
 Notepad++ (with WebEdit)
 Data Set requirement:-
 Ad-Hoc standard Dataset
60
Cont...
 Hardware requirements:-
 3 GB RAM.
 5 GB Hard Disk Free Space.
 Intel Core i5 Processor or higher version.
Package requirements:-
 Lucene 2.4.0
 BM25 Implementation.
 Apache Commons Math 2.2.0
61
Data Collection
 The FIRE Ad-Hoc collection of the year 2010 has been used.
 The queries have also been taken from the same source.
 The data set obtained contains around 130,000 documents, comprising
news articles from the leading newspaper
"The Telegraph" for the period 2004-07.
 We have divided the documents into 3 fragments, i.e. <title></title>,
<fp></fp> and <sp></sp>.
62
Why fragments???
The fragments are made so as to bring in the concept of multi-subspace.
In our case the number of subspaces is 3. The reasons for choosing
these three fragments in this order are:-
 Titles are the most important part of any document.
 The inverted pyramid is the standard model for news writing.
So the title is kept at the top, and the main content of the document has
been divided into two parts:
 First paragraph.
 Second paragraph.
63
Implementation of the proposed solution
We have divided our implementation process into 3 modules:-
 Indexing of the data set.
 Searching the indexed document using cosine similarity.
 Search using Quantum based similarity measure.
For implementing the proposed solution we have chosen certain libraries.
They are:-
 DOM Parser (inbuilt in Java).
 Apache Lucene 2.4.0.
 Apache Commons Math Library 2.2.0.
 BM25 Implemented Library.
64
UML Diagram
Class diagram used for indexing:-
65
Indexer
  -IndexWriter
  -Document
  +getIndexWriter(boolean)
  +closeIndexWriter()
  +indexDocument(TryDOM)
  +recursion(File)
  +rebuildIndexes(String)

TryDOM
  -Document
  -NodeList
  +buildDocument(File)
  +String getName()
  +String getDocNum()
  +String getTitle()
  +String getFirstPara()
  +String getSecondPara()
  +String getWholeDocument()

Main
  +public static void main(String[])
Class diagram used for searching
66
SearchFrame
  +String
  +JPanel
  +JTextField
  +JButton
  -actionPerformed(ActionEvent)

Main
  +public static void main(String[])

Class Diagram for Searcher (Part 1)
Cont….
67
DocVector
  +SparseRealVector
  +Map
  +DocVector(Map<Str,Int> terms)
  +setEntry(String term, int freq)
  +normalize()

MySearcher
  #HashMap
  #ArrayList
  #IndexSearcher
  #Document
  #double tempScore
  #int tempDoc
  #int num
  #IndexReader
  +MySearcher()
  +ScoreDoc[] getProbableRelvDoc(String, String)
  +HashMap sortQPRP(String, String)
  +double getSimilarity(int, int)
  +double testSimilarityUsingCosine(int, int, str)

Class Diagram for Searcher (Part 2)
Explanation(Indexer)
 Indexing starts from the Main class, which takes as input the
directory path where the documents to be indexed are kept.
 Main instantiates the Indexer class and calls its method
rebuildIndexes(String), passing the given directory path, which
in turn calls recursion(File). All the files in the
directory get indexed recursively by this function.
 Each file is then parsed by the TryDOM class and passed to
indexDocument(TryDOM) to be indexed.
68
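A hedged condensation of what indexDocument(TryDOM) plausibly does with the Lucene 2.4 API (the field names mirror the data-set fragments; the exact Field options are assumptions):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexerSketch {
    // One Lucene Document per file, one Field per fragment.
    static void indexDocument(IndexWriter writer, TryDOM parsed) throws Exception {
        Document doc = new Document();
        doc.add(new Field("docnum", parsed.getDocNum(),
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("title", parsed.getTitle(),
                Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("fp", parsed.getFirstPara(),
                Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("sp", parsed.getSecondPara(),
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }
}
```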
Explanation(Searcher)
 Once indexing is done, the next step is to search the indexed
documents for a given query, which is done using the
MySearcher class.
 The program starts from the Main class, which instantiates SearchFrame
and pops up a GUI.
 The GUI takes two inputs: the query and a file name (where the result is stored).
 On clicking the search button, the MySearcher class is instantiated and the
method sortQPRP() is called. It calls getProbableRelvDoc() to
get the top-k results using the BM25 model; sortQPRP() then rearranges
the results according to the qPRP model.
69
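A hypothetical sketch of the sequential re-ranking inside sortQPRP(), under the assumption that the normalized BM25 scores stand in for the relevance probabilities in the interference term:

```java
import java.util.*;

public class QprpReranker {
    interface SimilarityFn { double apply(int docA, int docB); }

    // Greedy sequential ranking: at each step pick the candidate C that
    // maximizes p_C plus its interference with every already-ranked document.
    static List<Integer> rerank(Map<Integer, Double> scores,
                                SimilarityFn sim, double beta) {
        List<Integer> ranked = new ArrayList<>();
        Set<Integer> candidates = new HashSet<>(scores.keySet());
        while (!candidates.isEmpty()) {
            Integer best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Integer c : candidates) {
                double total = scores.get(c); // p_C
                for (Integer r : ranked)      // sum of interference terms
                    total += 2 * Math.sqrt(scores.get(r)) * Math.sqrt(scores.get(c))
                               * beta * sim.apply(r, c);
                if (total > bestScore) { bestScore = total; best = c; }
            }
            ranked.add(best);
            candidates.remove(best);
        }
        return ranked;
    }
}
```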
Collaboration diagram for indexer
70
Main → Indexer: rebuildIndexes()
Indexer → TryDOM: buildDocument(), getName(), getDocNum(), getTitle(), getFirstPara(), getSecondPara(), getWholeDocument()
Collaboration diagram for searcher
71
Main → SearchFrame → MySearcher: sortQPRP(), result()
MySearcher → DocVector: setEntry(), normalize()
Lucene Index Structure
Documents in Lucene are stored
as objects in an index.
We need to convert the data into
Document objects and store them
in the index.
We break the data into different
parts and store them in the Document
object as Field objects.
72
Fields stored per document: Doc ID, Title, First Paragraph, Second Paragraph, DocNum, Document Name.
Evaluation Measures
In order to evaluate the result we need to consider two dimensions.
Recall:- Measure of ability of system to present all relevant documents.
Mathematically,
recall=
𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑟𝑒𝑙𝑒𝑣𝑎𝑛𝑡 𝑖𝑡𝑒𝑚𝑠 𝑟𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑
𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑟𝑒𝑙𝑒𝑣𝑎𝑛𝑡 𝑖𝑡𝑒𝑚𝑠 𝑖𝑛 𝑐𝑜𝑙𝑙𝑒𝑐𝑡𝑖𝑜𝑛
Precision:- Measure of ability of system to present only relevant
documents.
𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 =
𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑟𝑒𝑙𝑒𝑣𝑒𝑛𝑡 𝑖𝑡𝑒𝑚𝑠 𝑟𝑒𝑡𝑟𝑖𝑣𝑒𝑑
𝑡𝑜𝑡𝑎𝑙 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑖𝑡𝑒𝑚𝑠 𝑟𝑒𝑡𝑟𝑖𝑣𝑒𝑑
 Recall and Precision are set based Measures.
73
Cont..
74
 To evaluate a ranked list, precision is plotted against recall.
 Whenever a new non-relevant document is retrieved, the recall value stays
the same but precision decreases.
Mean Average Precision(MAP):-
 In recent years the TREC community has been using MAP.
 It provides a single figure across recall levels.
 To calculate Mean Average Precision the following formula is used.
$MAP(Q) = \dfrac{1}{|Q|}\sum_{j=1}^{|Q|}\dfrac{1}{m_j}\sum_{k=1}^{m_j} Precision(R_{jk})$
 $R_{jk}$ is the set of ranked retrieval results from the top result until you
get to document $d_k$.
 For $q_j \in Q$, the set of relevant documents is $\{d_1, \ldots, d_{m_j}\}$.
75
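A small sketch of average precision and MAP as defined above (illustrative):

```java
import java.util.*;

public class MapMetric {
    // Average precision for one query: walk the ranked list and average
    // the precision observed at the rank of each relevant document.
    static double averagePrecision(List<Integer> ranking, Set<Integer> relevant) {
        double sum = 0;
        int hits = 0;
        for (int i = 0; i < ranking.size(); i++) {
            if (relevant.contains(ranking.get(i))) {
                hits++;
                sum += (double) hits / (i + 1); // precision at rank i+1
            }
        }
        return relevant.isEmpty() ? 0 : sum / relevant.size();
    }

    // MAP: mean of the per-query average precisions.
    static double map(List<Double> perQueryAp) {
        double total = 0;
        for (double ap : perQueryAp) total += ap;
        return total / perQueryAp.size();
    }
}
```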
76
 First we retrieved the top 150 results using the BM25 model.
 Then we sorted the results according to qPRP using cosine similarity.
 We noted down the ranked lists given by both models.
 We calculated the recall and precision of both lists whenever a new
relevant document appeared in the list.
 We plotted histograms of the recall-precision values.
Ranking of relevant document
77
Query No. | Relevant Document Ranking (PRP) | Relevant Document Ranking (qPRP)
77  | 78, 98, 41, 69, 16, 132, 47, 134, 135 | 60, 48, 47, 46, 42, 52, 44, 49, 54
79  | 18 | 20
85  | 26, 38, 27, 1, 44, 6, 42, 2, 22, 104 | 52, 30, 25, 1, 45, 26, 87, 36, 42, 48
88  | 27, 6, 49, 44, 7, 59, 13, 20, 12, 23, 21, 43, 56 | 35, 11, 82, 54, 2, 21, 29, 3, 39, 24, 26, 18, 28
100 | 5, 19, 13, 12, 14, 33, 06, 03, 04, 29, 1, 22, 9, 18, 15, 23 | 4, 21, 12, 11, 31, 33, 10, 03, 06, 20, 1, 27, 14, 15, 7, 8
102 | 2, 37, 129, 84, 62, 147 | 2, 13, 18, 16, 17, 20
103 | 1, 17, 30, 12, 130 | 1, 6, 7, 5, 143
112 | 18, 3, 37, 14, 6, 8, 1, 73, 24 | 7, 5, 16, 3, 6, 2, 1, 13, 9
121 | 11, 16, 5, 20, 4, 1, 3, 19, 6, 13 | 11, 10, 13, 7, 1, 8, 15, 14, 16
122 | 6, 1, 4, 10, 23, 9 | 3, 1, 16, 7, 21, 14
Comparison of Precision for PRP and
qPRP(cosine) on same recall value(Query:100)
78
[Histogram: Precision(PRP) vs. Precision(Cosine) at matching recall levels from 0.062r to 1r for Query 100; y-axis precision ranges from 0 to 1.2.]
Comparison of Precision for PRP and
qPRP(cosine) on same recall value(Query:112)
79
[Histogram: Precision(PRP) vs. Precision(Cosine) at matching recall levels from 0.111r to 1r for Query 112; y-axis precision ranges from 0 to 1.2.]
MAP comparison with respect to the
queries
80
[Bar chart: MAP(PRP) vs. MAP(COSINE) for queries Q77, Q79, Q85, Q88, Q100, Q102, Q103, Q112, Q121, Q122; y-axis MAP ranges from 0 to 0.9.]
Ranking of relevant documents using qPRP (using quantum-based similarity)
81
Model Name | Document Ranking | Average Precision
PRP | 11, 16, 5, 20, 4, 1, 3, 19, 6, 13 | 0.66
qPRP (using cosine similarity) | 11, 10, 13, 7, 1, 8, 15, 14, 16 | 0.508
qPRP (using quantum-based similarity) | 11, 15, 19, 3, 1, 2, 18, 5, 12 | 0.68
Conclusion:-
 We have calculated the Mean Average Precision (MAP) for both
models using the set of queries.
 We obtained the following MAP values:
Model Name | MAP
PRP | 0.347060
qPRP (using cosine similarity) | 0.396237
 The difference between them comes to 0.049177.
 The result obtained for qPRP is 14.1% more precise than that of PRP.
 The results we obtained are better in most cases, but for a few
queries the PRP result is better than qPRP.
82
Future Work:-
 After observing the above results, we deduce that qPRP can be used to
rank the Ad Hoc data set. The following directions can be pursued to
get even better results:-
 Alternative document representations can be used. For
example:- we may divide subspaces on the basis of the most
informative terms. The most informative terms can be deduced from
font, or from terms appearing near query terms in the document.
 Different similarity measures can be used; for example, one
may use the similarity measure proposed in the literature.
83
Cont..
 One may find similarity by capturing the meaning of the
document. For capturing the meaning of a document we may use the
HAL representation.
 (Azzopardi, Leif: Probabilistic Hyperspace Analogue to Language.)
 One can also test the solution we proposed
under the section "Implementing with the Whole Formula".
84
Thank You..
85