EVOLVE'13 | Enhance | External Search | Matthias Wermund

1
ADOBE EXPERIENCE MANAGER
& EXTERNAL SEARCH PLATFORMS
Matthias Wermund, Senior Application Architect

2
WHY EXTERNAL SEARCH
Search is part of most implementation projects
• Most of today’s web sites offer some type of search feature
• Search exists in various flavors and mixtures
• Site search
• Typed search
• Search as navigation
• Relevance search
• Location based search
• AEM comes with its own search implementation
• “External search” in the context of AEM means leveraging another platform,
hosted outside of the AEM Author/Publish environments

3
WHY EXTERNAL SEARCH
Publish search vs. Author search
• Publish search
• End user accessible
• Indexed content is in published state
• High frequency access
• Author search
• Internal AEM author search
• Index must include unpublished content
• Criteria can include additional content metadata
• The two are fundamentally different use cases with different index lifecycles

4
WHY EXTERNAL SEARCH
AEM standard search
• Part of AEM Java Content Repository (JCR) implementation
• AEM adds Predicate API layer
• Features
• Automatic index generation in all environments
• Full-text, faceted search
• Access restrictions based on repository access control lists (ACL)
• Used behind the scenes for many AEM features
So, why not always use JCR search? Here are some reasons.

5
WHY EXTERNAL SEARCH
Performance issues of standard search with growing result size
[Chart: AEM query performance*. Time (s), 0 to 14, against Results (x1000), 1 to 28. Series: JCR Query, Facet Generation]
• Facet generation time increases linearly with growing result size
• JCR query time is not affected to the same degree, but the increase is noticeable
• Search results are often impossible to cache due to the high number of variations
* Synthetic content, full-text search, one facet, single requests

6
WHY EXTERNAL SEARCH
Scale independently of AEM
• External search platforms
decouple the search
infrastructure from AEM
• Search platform can scale
independently from AEM, both
horizontally and vertically
• Some platforms support cloud
deployments, e.g. Apache Solr in SolrCloud
mode, coordinated by ZooKeeper
• Client-side integration can fully
eliminate query impact on AEM

7
WHY EXTERNAL SEARCH
Extended feature offering
• External platforms provide functionality that AEM currently does not offer
out of the box
• A few examples:
• Geospatial search
• Dynamic relevance control
• Index-based type-ahead
• Index maintenance UI
• More about various possible uses of external index data later.

8
WHY EXTERNAL SEARCH
Search multiple data sources at once
• Search can span multiple
data sources besides AEM
• External platforms can join
data from any number and
type of source systems
• Users can query all data at
once, with combined
pagination, filters and
relevance calculation

9
APACHE SOLR
Lucene-powered Open Source Search Platform
• Created initially by CNET, since 2006 open source
• Incorporates and extends Apache Lucene
• Supports distributed indexing and searching
• Rich search capabilities
• HTTP interface, JSON/XML/BIN formats
• Integration clients
• Standalone Java web application
• Administration UI
• Widely used on prominent sites
wiki.apache.org/solr/PublicServers
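
The HTTP interface mentioned on this slide can be illustrated with a minimal sketch that builds a select request URL. The host, port and core name ("aem") are assumptions made for illustration; `q`, `rows` and `wt` are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Assumed for illustration: a local Solr instance with a core named "aem".
SOLR_SELECT_URL = "http://localhost:8983/solr/aem/select"


def build_select_url(query, rows=10, response_format="json"):
    """Return a Solr select URL for a simple query."""
    params = {
        "q": query,             # main query string
        "rows": rows,           # number of results to return
        "wt": response_format,  # response writer, e.g. "json" or "xml"
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_select_url("title:search")
print(url)
```

The resulting URL can be requested from any HTTP client, which is what makes both server-side and client-side integration possible.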

10
APACHE SOLR
Index schema configuration
• Schema fields
• Dynamic fields
• Custom field types
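
As a rough illustration of how schema fields and dynamic fields relate, the sketch below resolves a field name to a type: explicitly declared fields are checked first, then wildcard patterns. All field names, patterns and type names are hypothetical, and the precedence rule (longer patterns first) is a simplification of Solr's actual schema handling.

```python
from fnmatch import fnmatchcase

# Hypothetical schema for illustration; names and types are assumptions.
EXPLICIT_FIELDS = {"id": "string", "title": "text_general"}
DYNAMIC_FIELDS = {"*_s": "string", "*_txt": "text_general", "*_dt": "date"}


def resolve_field_type(name):
    """Return the type for a field name, or None if nothing matches."""
    # Explicitly declared schema fields take precedence.
    if name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[name]
    # Otherwise try the dynamic field patterns, longer patterns first.
    for pattern in sorted(DYNAMIC_FIELDS, key=len, reverse=True):
        if fnmatchcase(name, pattern):
            return DYNAMIC_FIELDS[pattern]
    return None


print(resolve_field_type("author_s"))
```

Dynamic fields let the indexing side introduce new properties by naming convention, without a schema change for each one.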

11
INDEXING AEM CONTENT
Steps to create an external index
• The main challenge is data extraction
• When should the extraction process get initiated?
• How to convert the AEM content tree structure into the index format?
• And how to transfer the converted data to the external platform?
• Once the external index is generated, data querying is a relatively easy step
• Query generation is highly specific to the use case
• Most search platforms offer standard interfaces or Java libraries to integrate
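
The conversion question above (content tree to index format) can be sketched as flattening a nested page structure into one flat document. This is a simplified model, not AEM or EASE code: the nested dictionary stands in for a page's resource tree, and the index field names (`id`, `title`, `text`) are assumptions.

```python
def flatten_page(path, node):
    """Flatten a nested content node into a single flat index document."""
    doc = {"id": path}
    texts = []

    def visit(current):
        for key, value in current.items():
            if isinstance(value, dict):
                visit(value)              # descend into child nodes
            elif key == "jcr:title" and "title" not in doc:
                doc["title"] = value      # first title found wins
            elif key == "text":
                texts.append(value)       # collect all text properties

    visit(node)
    doc["text"] = " ".join(texts)
    return doc


# Hypothetical page content, loosely modeled on a page with a paragraph system.
page = {
    "jcr:content": {
        "jcr:title": "External Search",
        "par": {
            "text_1": {"text": "Why external search"},
            "text_2": {"text": "Scale independently"},
        },
    }
}
doc = flatten_page("/content/site/en/search", page)
print(doc)
```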

12
Integration patterns: Pull vs. Push
• Pull
• Content downloaded by external search platform
• Platform needs trigger, e.g. scheduler
• Data generation can use same rendering as for user requests
• Push
• Data uploaded from AEM to external platform
• Can happen immediately on modification
• Requires generating the data standalone
• Combination is possible – Example:
• On modification, AEM notifies search platform (Push)
• Platform loads the modified content from AEM (Pull)
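
The combined pattern from the example above can be sketched with two in-memory stand-ins. Both classes are hypothetical and only model the message flow: the content source pushes a notification that carries the path, and the search platform then pulls the content for that path.

```python
class ContentSource:
    """Stands in for AEM: holds content and notifies listeners on change."""

    def __init__(self):
        self.pages = {}
        self.listeners = []

    def modify(self, path, text):
        self.pages[path] = text
        for notify in self.listeners:
            notify(path)             # push: only the path is sent

    def render(self, path):
        return self.pages[path]      # pull: content is loaded on request


class SearchPlatform:
    """Stands in for the external platform with a trivial index."""

    def __init__(self, source):
        self.source = source
        self.index = {}

    def on_modified(self, path):
        # Triggered by the push notification, then pulls the content.
        self.index[path] = self.source.render(path)


source = ContentSource()
platform = SearchPlatform(source)
source.listeners.append(platform.on_modified)
source.modify("/content/site/en/home", "Welcome")
print(platform.index)
```

The notification keeps the index current without a scheduler, while the pull step lets the platform reuse whatever rendering the source already offers.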

13
Integration patterns (II): Unstructured vs. Structured
• Unstructured
• No index-specific format
• Metadata is extracted after loading
• Least effort, end user rendering can be used
• Structured
• Source data formatted to match search index structure
• Leaner
• Can carry different data than end user view
• Requires structure generation in AEM
• The typical unstructured pull data extraction is crawling

14
Introducing the EASE framework
• EASE = External AEM Search Extension
• Primary goal of the framework is to reduce the complexity of integrating search
platforms with AEM
• The indexing approach is structured push triggered by content replication
• Open source, available starting today
• For documentation, API, Maven dependencies & more
see github.com/mwmd/ease

15
(Current) EASE framework features
• Supports generation of structured index data
• Binary asset indexing
• Integrates in AEM Author environment
• Incremental index updates triggered by AEM replication (push)
• Indexing of versioned content for scheduled replication
• Full index generation
• Generic integration with search platforms
• Apache Solr integration

16
EASE Maven modules

17
EASE index generation approach

18
Basic sample project: ease/example
• Prerequisites
• AEM 5.6 Author
• Apache Solr 4.4
• Demonstrates use of
EASE framework
• No configuration
needed
• Uses Facets, full text
search, relevance
• Available on GitHub

19
Steps to integrate EASE and Solr into project
1. Include ease-core and ease-scr as Maven dependencies
2. Implement indexers matching your content
3. Create OSGi configurations:
• IndexService
• SolrIndexServer
4. Deploy to AEM Author:
• ease-core
• ease-solr & dependencies
When this is done, activated content will get automatically indexed.

20
Generation of index data with EASE

21
Resolving content structure with EASE

22
Encapsulate handling of proprietary requests
• IndexServer handles all communication with
search platform
• ease-core bundle doesn’t provide platform
specific implementation
• Implementation of IndexServer for Apache Solr in
ease-solr bundle
• New connectors to additional platforms are only
required to implement this interface

23
USING THE EXTERNAL INDEX
• When the data is indexed, it can get queried from custom components
• Leveraging platform specific features with proprietary clients
• While EASE currently focuses on simplifying the indexing, it helps with queries too
• EASE connector bundle per external search platform
• Proprietary clients are provided by the bundle (SolrJ for Apache Solr)
• In the following, an example implementation will walk through some use cases

24
Example implementation: AEM Know-How Database
• Central search for AEM
related information
• Uses EASE framework
• Server- and client-side
queries
• 50,000 pages in AEM
• stackoverflow
• Adobe offices
• 3,000 external pages
• Adobe AEM doc
• Adobe CRX doc
• Marketing Cloud doc

25
Example implementation: AEM Know-How Database

26
Full-text search
• User-input text query
• Query-based ranking
• Generation of extracts and term highlighting
• Sorting on different fields
[Screenshot callouts: Sorting options, Pagination, Extract & Highlighting]
27
Boost manipulation / Personalization
• Manipulation of relevance calculation
• Boosting possible on
• Terms
• Fields
• Implementation can leverage user data to generate personalized results (Client Context)
[Screenshot callouts: Personalization; same result count, but different ranking]
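
One way to get the "same result count, different ranking" effect with Solr is the extended DisMax parser: `qf` weights fields, and `bq` adds boost queries that influence scoring without filtering. The sketch below is an assumption about how such a query could be built, not the example implementation's code; the field names (`title`, `text`, `tags`) and the boost factors are hypothetical, and interest values are assumed to contain no quote characters.

```python
def boosted_params(user_query, interests):
    """Build edismax parameters with field boosts and per-user boost queries."""
    return {
        "defType": "edismax",
        "q": user_query,
        "qf": "title^3 text",   # matches in title count three times as much
        # One boost query per user interest, e.g. taken from a user profile.
        "bq": ['tags:"{}"^2'.format(tag) for tag in interests],
    }


anonymous = boosted_params("dispatcher", [])
personalized = boosted_params("dispatcher", ["workflow", "security"])
print(personalized["bq"])
```

Both requests match the same documents because `q` is identical; only the scores differ.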

28
Faceted search
• Navigation via facets
• AND combination of
multiple facet values
• Facet hit counts
calculated based on
current search result
[Screenshot callout: Facet hits]
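
A faceted query along these lines can be sketched with Solr's facet parameters. Each selected facet value becomes its own filter query (`fq`), and Solr intersects multiple filter queries, which gives the AND combination. The field names are assumptions, and values are assumed to contain no quote characters.

```python
def facet_params(user_query, facet_fields, selected):
    """Build Solr parameters for a faceted query.

    selected is a list of (field, value) pairs chosen by the user.
    """
    return {
        "q": user_query,
        "facet": "true",
        "facet.field": list(facet_fields),  # fields to compute hit counts for
        "facet.mincount": 1,                # hide values without hits
        # Multiple fq parameters are combined with AND.
        "fq": ['{}:"{}"'.format(field, value) for field, value in selected],
    }


params = facet_params(
    "replication",
    ["source", "topic"],
    [("source", "stackoverflow"), ("topic", "dispatcher")],
)
print(params["fq"])
```

Because the counts are computed on the filtered result, they shrink as more facet values are selected.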

29
Type-ahead / Auto-complete / Auto-suggest
• Offer search term suggestions based on user input
• Highly configurable
• Index data
• Dictionary
• Query parsing
• Client-side call to Apache Solr
• Can use a standard query or dedicated feature
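
As an example of the "standard query" variant, suggestions can be derived from facet values that start with the user's input, using Solr's `facet.prefix` parameter. This is only one possible approach; the field name `title_terms` is hypothetical, and lowercasing the input assumes the field is lowercased at index time.

```python
def suggest_params(user_input, field="title_terms", limit=5):
    """Build Solr parameters that return index terms starting with the input."""
    return {
        "q": "*:*",
        "rows": 0,                            # no documents, only facet values
        "facet": "true",
        "facet.field": field,
        "facet.prefix": user_input.lower(),   # terms must start with the input
        "facet.limit": limit,                 # number of suggestions
        "facet.mincount": 1,
    }


params = suggest_params("Repl")
print(params)
```

A request like this is small enough to be sent directly from the browser on each keystroke, in line with the client-side call noted on the slide.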

30
Geospatial search
• Proximity search
• Distance calculation
• Sorting by distance
• Center of search is dynamic, can be based on user’s location (Client Context)
[Screenshot callouts: Distance calculation, Distance sorting, Location-based query]
31
REAL WORLD SEARCH IMPLEMENTATIONS
Handling aggregated page content
• Component rendering can include content maintained on other pages
• Aggregation logic could be easily mirrored in ResourceIndexer
• But: If the page-external content is modified, its activation won’t trigger re-indexing
of the aggregating page
• Use case:
• Inherited paragraph system
• Reference component
• Mitigation options:
• Content strategy: Index only standalone, unique page content
• Use WCM ReferenceSearch to find and re-index references
• Dangerous: reference loops, cascading re-indexing
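
The two dangers named above, reference loops and cascading re-indexing, can be contained with a visited set and a depth limit. The sketch below is a generic graph traversal, not AEM code: the `referenced_by` mapping is a hypothetical stand-in for the results a reference lookup would return.

```python
from collections import deque


def pages_to_reindex(modified, referenced_by, max_depth=2):
    """Collect the modified page and the pages that aggregate it."""
    seen = {modified}                    # guards against reference loops
    queue = deque([(modified, 0)])
    while queue:
        path, depth = queue.popleft()
        if depth == max_depth:           # limits cascading re-indexing
            continue
        for referrer in referenced_by.get(path, ()):
            if referrer not in seen:
                seen.add(referrer)
                queue.append((referrer, depth + 1))
    return seen


# Hypothetical reference data: /a and /b reference each other (a loop).
referenced_by = {"/a": ["/b"], "/b": ["/a", "/c"], "/c": ["/d"]}
result = pages_to_reindex("/a", referenced_by)
print(sorted(result))
```

The depth limit trades completeness for predictable indexing cost, so pages beyond it may keep stale aggregated content until their next activation.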

32
Permission Sensitive Search
• External platforms have no information about AEM roles and permissions
• All index items are visible to everyone by default
• In some use cases, access to parts of the index must be restricted
• Use case:
• Closed User Groups
• Mitigation options:
• Check access rights for all items on the current result page at runtime
• Will break pagination information, type-ahead suggestions
• Performance hit
• Export effective role permissions as part of index item metadata
• Add filter for current user’s role to all queries or into search platform
• Requires re-indexing of content on ACL changes
• Only practical with a limited number of roles
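
The second mitigation option can be sketched as a filter query built from the current user's roles. The index field `allowed_roles` and the always-included `everyone` role are assumptions for illustration; role names are assumed to contain no quote characters.

```python
def permission_filter(user_roles, field="allowed_roles"):
    """Build a Solr filter query matching items visible to the given roles."""
    # Every user is assumed to hold the public "everyone" role.
    roles = sorted(set(user_roles) | {"everyone"})
    quoted = " OR ".join('"{}"'.format(role) for role in roles)
    return "{}:({})".format(field, quoted)


fq = permission_filter(["members"])
print(fq)
```

Added as an `fq` parameter to every query, the filter keeps hit counts, pagination and facet counts consistent with what the user is allowed to see. The filter grows with the number of roles, which is one reason the approach is only practical for a limited role set.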

33
Index tuning
• Interpretation of raw index data depends on
search technology and configuration
• Powerful platforms offer a deep level of index
configuration
• Tuning search behavior takes significant effort
• Use case:
• Full text query and content parsing
• Type-ahead suggestions
• Result relevance calculation
List of available text
processors for Solr
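
To make the idea of text processors concrete, the sketch below models a tiny analysis chain: tokenize, lowercase, remove stop words. It is a simplified stand-in written in plain Python, not Solr's implementation, and the stop word list is an arbitrary assumption.

```python
import re

# Arbitrary sample list; real stop word lists are part of the configuration.
STOP_WORDS = {"the", "a", "of", "and"}


def analyze(text):
    """Run text through a tokenizer, a lowercase step and a stop word step."""
    tokens = re.findall(r"\w+", text)                   # tokenizer
    tokens = [token.lower() for token in tokens]        # lowercase step
    return [t for t in tokens if t not in STOP_WORDS]   # stop word step


indexed = analyze("The Basics of AEM Replication")
queried = analyze("replication BASICS")
print(indexed, queried)
```

A query only matches when its processed terms equal the processed terms in the index, which is why changes to the chain affect full-text matching, suggestions and relevance alike.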

34
Questions?
matthias.wermund@acquitygroup.com
Thank you!
ADOBE EXPERIENCE MANAGER
& EXTERNAL SEARCH PLATFORMS
