Lucene Revolution 2011: MontySolr presentation
- 2. Why should I care?
- Our challenge is to connect Python and Java
- Without compromises
- We created MontySolr extension
- Robust, tested (will be used by our system)
- But works for any Python application (eg. Django)
- And for any C/C++ app that Python understands!
- Open source (GPL v2)
- Try it out!
- https://github.com/romanchyla/montysolr
- 3. Outline
‣ Context
- The Challenge
- Key components
- Available technologies
- Our approach
- Problems solved
- Evaluation
- Wrap-up
- 4. CERN
- European Organization for Nuclear Research
- Switzerland, Geneva
- The largest laboratory for High Energy Physics
- Home to the Large Hadron Collider
- 40-50K HEP scientists worldwide
- 12. SPIRES
- Stanford Linear Accelerator Center - SLAC
- High-Energy Physics Literature Database
- Started December 1991
- The first web outside Europe/CERN
- The first database on web
- 16. Invenio
- Integrated digital library software behind INSPIRE
- Used by very large institutional repositories
- http://repositories.webometrics.info/toprep_inst.asp
- Customizable virtual collections
- Flexible management of metadata
- 3 000 authors per article
- Powerful search engine
- Incl. citation map analysis
- Written in Python (since 2001)
- 290 000 lines of code
- 17. Outline
- Context
‣ The Challenge
- Key components
- Available technologies
- Our approach
- Problems solved
- Evaluation
- Wrap-up
- 18. The Challenge
- HEP scientific community
- Searches are metadata-oriented
- However, full texts are changing the situation
- And we want to provide even better service
- Bigger volumes of data
- NLP processing
- Semantic search
- 29. The Challenge
Query: supersymmetry AND author:ellis
Invenio fulltext:supersymmetry
IDs: 1;2;3;9.... (1-6M IDs)
1. only IDs, no score = no ranking
2. score merging difficult (if available)
3. push IDs? (eg. faceting)
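The ranking problem on this slide can be illustrated with a minimal Python sketch. It assumes the full-text engine returns (ID, score) pairs while the metadata search returns bare IDs; the function names and data shapes are hypothetical, not Invenio or Solr APIs.

```python
def merge_hits(scored_hits, plain_ids):
    """Intersect scored full-text hits with unscored metadata IDs.

    scored_hits: iterable of (record_id, score) pairs (assumed shape).
    plain_ids: iterable of record IDs from the metadata search.
    Returns the common hits, best score first.
    """
    allowed = set(plain_ids)
    common = [(rid, score) for rid, score in scored_hits if rid in allowed]
    return sorted(common, key=lambda hit: hit[1], reverse=True)


def merge_ids_only(ids_a, ids_b):
    """With bare IDs on both sides only set membership survives: no ranking."""
    return sorted(set(ids_a) & set(ids_b))


scored = [(1, 0.2), (2, 0.9), (3, 0.5), (9, 0.7)]
print(merge_hits(scored, [2, 3, 9, 12]))        # ranked by score
print(merge_ids_only([1, 2, 3, 9], [2, 3, 9, 12]))  # ordered by ID only
```

Once scores are dropped at the boundary between the two systems, the merged list can only be ordered by ID, which is the "no ranking" case above.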
- 30. What is the “best” solution?
- We love Python...
- ...and our applications are written in Python...
- But what if Solr is the master search engine?
- Merge results inside Solr?
- Typical size: 1-10 mil. IDs
- Expected latency: 1-2 s.
- What we want to achieve:
- Fast transfer of hits from Invenio to Solr
- Leverage the power of both (no compromises)
- Developer-friendly integration, simplicity
- Additional concerns:
- 31. Outline
- Context
- The Challenge
‣ Key components
- Available technologies
- Our approach
- Evaluation
- Demonstration
- Wrap-up
- 32. To embed Solr (in Java app)
- Your app simulates Java web container?
- use EmbeddedSolrServer
- It knows nothing about Java servlets?
- use DirectSolrConnection class
- Maybe we are too lazy?
- Embed the web container (in my case Jetty)
- Seemed strange (webserver inside webserver)
- ... but it worked well
- 37. To use Solr in non-Java app
- Solr is already usable via HTTP requests, but we need something else here...
- Remote objects/calls?
- Pyro, execnet, CORBA, SOAP...
- or simply pipes?
- Access Python from Java?
- Jython
- JEPP
- Access Java from Python?
- JPype
- JCC
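For comparison, the plain HTTP route mentioned first on this slide might look like the sketch below. The host, port, and the `id` field name are assumptions for a local test instance; `q`, `fl`, `rows` and `wt` are standard Solr request parameters. The sketch only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode

# Hypothetical local Solr instance; adjust host, port and path to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_select_url(query, rows=10):
    """Build a Solr select URL that asks only for the `id` field, as JSON."""
    params = {"q": query, "fl": "id", "rows": rows, "wt": "json"}
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(build_select_url("fulltext:supersymmetry", rows=5))
```

With millions of IDs per query, every hit list has to be serialized, sent and parsed again on the other side, which is why the slide looks for something other than HTTP.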
- 38. Jython?
- Implementation of Python in 100% Java
- Both Java and Python code
- Truly multithreaded
- C modules will not work
- but see http://bit.ly/iTRYbb
- Slower than CPython
- 41. JEPP - Java Embedded Python
- Python code runs inside Python interpreter
- Embeds CPython interpreter via Java Native Interface (JNI) in Java
- http://jepp.sourceforge.net/
- recently updated (27-Jan)
- but JCC is more active
- 43. JCC
- Embeds JVM in Python
- C++ code generator
- C++ object interface wraps a Java library
- C++ wrappers conform to Python's C type system
- result: complete Python extension module
- 47. To use Solr in non-Java app
Jython JCC JEPP
Python C modules ✓ ✓
Speed ✓ ?
No code changes ✓ ✓
Access from Python ✓ ✓
Access from Java ✓ ... ✓
- 50. GIL - Global Interpreter Lock
Unfortunately, a Python webapp is not like Java...
- 51. GIL - Global Interpreter Lock
We can have 200 threads, but only 4 will run at a time...
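The effect of the GIL can be observed with a small sketch, assuming a standard CPython build with the GIL enabled. It runs the same CPU-bound loop sequentially and in four threads and prints both timings; the workload size is arbitrary and the timings vary by machine, so no particular numbers are claimed.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_down(n):
    """CPU-bound busy loop; returns the number of iterations performed."""
    done = 0
    while n > 0:
        n -= 1
        done += 1
    return done


WORK = [200000] * 4  # four equal CPU-bound tasks (arbitrary size)

start = time.perf_counter()
sequential = [count_down(n) for n in WORK]
t_seq = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(count_down, WORK))
t_thr = time.perf_counter() - start

print("sequential: %.3fs, 4 threads: %.3fs" % (t_seq, t_thr))
```

On a GIL build the threaded run is typically no faster, because only one thread executes Python bytecode at a time within a process.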
- 53. Fortunately solution exists
- JCC can embed Python inside Java
- Special thanks to Andi Vajda! (JCC creator)
- We write ‘empty’ classes in Java ...
- ... and implement them in Python
Python /w Java inside
Java /w Python inside
- 54. The second try
[Diagram: Invenio frontend, XML, Solr /w Invenio (backend), JCC]
- 55. Implementing the bridge
- Special Java class
- With method pythonExtension()
- Native method pythonDecRef()
- JCC provides its implementation
- And a number of other native methods
- These will be implemented using Python
- Like writing JNI Java/C code but without compilation...
- 56. MontySolr extension
- JCC has great potential, but also added complexity...
- So the MontySolr project was born
- Modules must be built in shared mode
- JCC dynamic library loaded and started from the main thread
- Simple mechanism of the Python bridge and message
- Configurable handlers on the Python side
- Secured dereferencing of the native objects
- Threading on the Java side
- Multiprocessing on the Python side
- Easy ant targets (compilation) ...
- 57. Hello World - Java part
public class MontySolrBridge extends BasicBridge implements PythonBridge {

    // Handle to the Python object that implements the native methods
    private long pythonObject;

    public void pythonExtension(long pythonObject) {
        this.pythonObject = pythonObject;
    }

    public long pythonExtension() {
        return this.pythonObject;
    }

    public void finalize() throws Throwable {
        pythonDecRef();
    }

    public native void pythonDecRef();

    public void sendMessage(PythonMessage message) {
        PythonVM vm = PythonVM.get();
        vm.acquireThreadState();
        try {
            receive_message(message);
        } finally {
            // always release the thread state, even if the Python side throws
            vm.releaseThreadState();
        }
    }

    // Implemented on the Python side
    public native void receive_message(PythonMessage message);
}
- 58. Hello World - Python part
from montysolr import MontySolrBridge

class SimpleBridge(MontySolrBridge):

    def __init__(self):
        super(SimpleBridge, self).__init__()

    def receive_message(self, message):
        query = message.getParam('query')
        message.setResults('Hello world!')
        print('Python received from Java: %s' % query)
- 59. Example - running MontySolr
- Java side
- JRE (32/64 bit)
- Standard Solr/Lucene jars
- JCC dynamic library
- Python side
- Python interpreter (32/64 bit)
- 4 Python modules (jcc, solr, lucene, montysolr)
- In the main thread
- First we load JCC
- Then start Python interpreter ...
- ... load Python handlers
- 60. Solr as search service
[Diagram: Invenio frontend, XML, Solr /w Invenio (backend), JCC]
- 61. Example
[Diagram: Solr with MyCustom Handler]
- 63. Example - Solr custom handler
PythonMessage msg = MontySolrVM.INSTANCE
    .createMessage("perform_search")
    .setSender("Invenio")
    .setParam("query", "refersto:author:ellis");
// hand the message over to the Python side
MontySolrVM.INSTANCE.sendMessage(msg);
Object result = msg.getResults();
if (result != null) {
    int[] hits = (int[]) result;
}
- 65. Example - JNI connection
refersto:author:ellis
[Diagram: Solr (MyCustom Handler), Python Bridge, Invenio wrappers]
- 66. Example - Python side
# search time - called from Java
def perform_search(message):
    query = message.getParam("query")
    hits = call_real_search(query)
    # cast Python list into Java array
    message.setResults(JArray_ints(hits))

# handler is made 'visible' at startup
SolrpieTarget('Invenio:perform_search', perform_search)
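The handler registration above can be pictured as a lookup table keyed by 'sender:recipient'. The sketch below is a pure-Python simulation of that idea. The `Message` and `Dispatcher` classes are hypothetical stand-ins written for illustration, not MontySolr classes, and no Java is involved.

```python
class Message(object):
    """Stand-in for the message passed across the bridge (hypothetical)."""

    def __init__(self, sender, recipient, **params):
        self.sender = sender
        self.recipient = recipient
        self._params = dict(params)
        self._results = None

    def getParam(self, name):
        return self._params.get(name)

    def setResults(self, results):
        self._results = results

    def getResults(self):
        return self._results


class Dispatcher(object):
    """Maps 'sender:recipient' keys to Python callables."""

    def __init__(self):
        self._handlers = {}

    def register(self, key, func):
        self._handlers[key] = func

    def send(self, message):
        key = "%s:%s" % (message.sender, message.recipient)
        handler = self._handlers.get(key)
        if handler is None:
            raise KeyError("no handler registered for %r" % key)
        handler(message)
        return message


def perform_search(message):
    # Toy search: pretend every query matches the same record IDs.
    message.setResults([1, 2, 3, 9])


dispatcher = Dispatcher()
dispatcher.register("Invenio:perform_search", perform_search)
msg = dispatcher.send(
    Message("Invenio", "perform_search", query="refersto:author:ellis"))
print(msg.getResults())
```

The caller only needs to know the message name and sender; which Python function answers is decided by the registration done at startup.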
- 69. Solr as search service
[Diagram: Apache webserver with Invenio, XML, Solr /w Invenio (backend), JCC]
- 70. Outline
- Context
- The Challenge
- Key components
- Available technologies
- Our approach
- Problems solved
‣ Evaluation
- Wrap-up
- 74. Robust?
- Extensive siege tests show very good performance and stability under high load
- 100-200 users, complex searches
- 50 concurrent users, citation analysis
- JCC incurs small overhead
- We detected no memory leaks
- The same as dbpedia.org
- But watch out for errors in C
- An error in a C module brings down the whole JVM
- (errors in a pure Python module can be handled)
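The last point, that errors in pure Python can be handled, might be sketched as a wrapper around each handler. This is an illustrative pattern only: the decorator name and the dict used as a message are assumptions, not MontySolr code. A crash inside a C extension cannot be caught this way.

```python
import traceback


def safe_handler(func):
    """Wrap a handler so Python exceptions are recorded, not propagated."""
    def wrapper(message):
        try:
            func(message)
        except Exception:
            # Keep the traceback so the calling side can log or report it.
            message["error"] = traceback.format_exc()
            message["results"] = None
        return message
    return wrapper


@safe_handler
def broken_search(message):
    raise ValueError("cannot parse query: %r" % message["query"])


msg = broken_search({"query": "author:"})
print(msg["error"].splitlines()[-1])
```

The caller always gets the message back and can check for an error entry instead of dealing with an exception crossing the language boundary.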
- 75. Easy to develop/maintain?
- Added complexity
- Java in the toolbox
- Need to compile C++ extensions
- Python/OS version dependencies
- For this we get
- Easy integration with Invenio
- The best of two applications
- A lot of features for free
- And we can control Solr from Python!
- 76. Outline
- Context
- The Challenge
- Key components
- Available technologies
- Our approach
- Problems solved
- Evaluation
‣ Wrap-up
- 77. Wrap-up
- Our challenge was to connect two different languages/systems
- And we wanted to get the best of the two...
- So we had to plug Python into Solr
- And now our Solr knows citation analysis!
- We created MontySolr extension
- Robust, tested (will be used by INSPIRE)
- Works for any Python application (eg. Django)
- And for any C/C++ app that Python understands!
- Free software license
- Try it out! Help us make it better!
- https://github.com/romanchyla/montysolr
- 78. Questions?
- MontySolr
- https://github.com/romanchyla/montysolr
- Roman Chyla
- Fellow, CERN Scientific Information Service
- roman.chyla@cern.ch
- @rchyla
- https://svnweb.cern.ch/trac/rcarepo
- 80. Links
- Invenio platform
- http://invenio-software.org/
- INSPIRE Digital library
- http://inspirebeta.net/
- Diagrams of JCC and JEPP
- Andreas Schreiber : Mixing Java and Python
- http://www.slideshare.net/onyame/mixing-python-and-java
- On Jython C Extension API
- http://stackoverflow.com/questions/3097466/using-numpy-and-cpython-with-jython
- Demo of a running service:
- http://insdev01.cern.ch
- 81. #1 - How to embed Solr (standard)
- solr.client.solrj.embedded.EmbeddedSolrServer
- 82. #2 - How to embed Solr (simplified)
- solr.servlet.DirectSolrConnection
- like previous, but simpler
- all the queries are sent as strings, everything is just a string
- very flexible and probably suitable for quick integration
Editor's Notes
- mention the transition/collaboration: cern-desy-fermilab-slac
- paradigm of a full result set
- Python: fast-prototyping, easy for students (who write a lot of the code)
- X - not spend time on the code
- “I was waiting for the point ‘this is the solution’” :)
- ad #1
- solr.client.solrj.embedded.EmbeddedSolrServer
- Solr is running as an embedded process, not inside a servlet container
- the default/recommended way
- ad #2
- solr.servlet.DirectSolrConnection
- like previous, but simpler
- all the queries are sent as strings, everything is just a string
- very flexible and probably suitable for quick integration
- I don’t mention some options like writing JNI ourselves or using intermediaries other than remote objects (eg. shared memory, if that would be possible)
- everybody thinks Jython, right? No!
- These are only some important features, omitted is simplicity and beauty (JEPP eval is just ugly way of doing things), documentation, community, support etc.
- Make sure that it is clear that processes can have threads - here it is not clear what is process and what is thread (it is not visible)
- Truly bi-directional
- We can call Python functions and pass Java objects
- From inside Python we can call Java object/methods
- the real-code example is in appendix #3
- the real code is in appendix #4
- note: don’t forget to mention how multiprocessing saves memory on Linux systems (due to copy-on-write and forking). This is effectively an alternative to Python WSGI that cannot run multiprocessing. We show that it is possible to use multiprocessing effectively.
- the real code is in appendix #3
- more precise - montysolr intro (include)
- TODO:
- Invenio is the same as Django
- Today, Solr can now do 2nd order operations