Solving Data Discovery in the Enterprise:
Building an Enterprise Data Catalog
Contents
Overview
The Business Challenges of Data Discovery in the Enterprise
Why MDM is not the Answer
Introducing the Enterprise Data Catalog
Getting Technical: Building an Enterprise Data Catalog
  Catalog Portal
  Catalog Mobile
  Catalog Store
  Data Source Publishing API
  Data Source Discovery API
  Data Source Notifications API
  Data Source Search API
  Data Governance API
  Metadata Connectors
  Data Collaboration System and APIs
Putting It All Together
Summary
Overview
Data discovery, understanding and governance are becoming key elements of data
architectures in the enterprise. The explosion in the volume of data produced and
consumed by organizations has sharply increased the complexity of discovering and
understanding data efficiently.
Despite their relevance, data discovery and governance tend to be overlooked in
enterprise big data solutions, which focus more on attention-grabbing areas such as
analytics and machine learning. However, more and more organizations are
realizing that data discovery is essential to enabling analytics, visualizations and
general data consumption in the enterprise.
The road to enabling data discovery in the enterprise is nonetheless full of
challenges, as we will explore in the next section.
The Business Challenges of Data Discovery in the Enterprise
As data grows in the enterprise, so do the initiatives to gather intelligence about
that data. Efforts around big data, analytics and visualization have grown rapidly
over the last few years, and data discovery has become a foundational block of any
enterprise data initiative. However, to enable efficient data discovery, enterprises
need to address challenges such as the following:
• Increasing Data Volume: The increasing volume of data produced in the
enterprise has drastically degraded the ability of information workers to
quickly find and consume data sources from enterprise applications.
• Lack of Metadata Management: Even when data can be found,
information workers struggle to understand the specific semantics of
enterprise data sources. This is due to the lack of metadata management
solutions implemented in enterprise environments.
• Different Data Access Interfaces: One of the biggest challenges in
accessing data in the enterprise is the proliferation of heterogeneous data
access protocols and APIs introduced by new line-of-business solutions.
Organizations struggle with the lack of consistent protocols and models to
access data from different business applications.
• Lack of Established Data Stewardship: Complementing the previous
point, the lack of mainstream data stewardship models makes it challenging
for applications to access enterprise data sources.
• Limited Collaboration Interfaces: Top-down data stewardship is just one
mechanism for establishing contextual information about enterprise data
sources. A lot of the knowledge about business data lives with the business
users who actively interact with it. However, enterprises rarely implement
collaboration interfaces that capture the knowledge of those domain experts
and add it as contextual information to corporate data sources.
Why MDM is not the Answer
Master data management (MDM) platforms have traditionally been seen as a
mechanism to keep a record of data sources in an enterprise environment.
However, over the years MDM solutions have become heavy, complicated and
poorly suited to some of the mainstream scenarios of data discovery in the
enterprise. Additionally, MDM solutions struggle to integrate quickly with modern
SaaS, cloud and mobile platforms, which are becoming a significant source of data
in the enterprise.
As a result of these limitations, organizations have started to adopt lighter,
simpler and more modern data discovery models that are optimized for the
modern technology ecosystem. Of the different models used to enable data
discovery in the enterprise, there is one we have seen be successful in
organizations of all sizes: the enterprise data catalog.
Introducing the Enterprise Data Catalog
A data catalog is a simple but effective and robust model to enable data
discovery in the enterprise. From a functional standpoint, an enterprise data catalog
should provide a global repository that registers data sources from different
line-of-business systems, together with the metadata and contextual information
associated with them.
Conceptually, a data catalog borrows elements from popular repositories such as
mobile app stores or ecommerce marketplaces. In that sense, an enterprise data
catalog goes beyond the classification and organization of enterprise data sources
and enables capabilities such as search, collaboration, alerting and other features
that can be combined to provide a fresh, modern experience for discovering data
sources in an enterprise environment.
From the functional standpoint, an enterprise data catalog should enable some of
the following capabilities:
• Data Source Discovery: An enterprise data catalog should allow
information workers to browse and discover the business data sources
linked to line-of-business systems. Additionally, the catalog should allow
simple registration of new data sources.
• Data Source Publishing: Complementing the previous point, an enterprise
data catalog should allow information workers to register new data sources
using simple interfaces, both visually and programmatically.
• Metadata Management: Enterprise data catalog solutions should allow data
stewards to provide adequate metadata for business data sources.
Simple metadata such as field descriptions or other contextual information
can be highly relevant to understanding business data sources correctly.
• Tagging and Classification: An enterprise data catalog should allow users
to classify the different data sources using tags or simple hierarchical
categories.
• Search: Finding data using simple keyword and faceted search should be one
of the key capabilities of an enterprise data catalog solution.
• Testability: An enterprise data catalog should allow users to test and
validate the different data sources exposed in the catalog.
• Collaboration: An enterprise data catalog should facilitate collaboration
between information workers working on specific data sources.
• Governance: Access control, SLAs and exception management are just some
of the key governance and data stewardship capabilities that should be
enabled by enterprise data catalogs.
• Alerts: Throughout the lifetime of a data source, information workers might
want to receive alerts about relevant events such as schema changes or
performance degradations. An enterprise data catalog should provide a
simple interface for power users to configure alert conditions on specific data
sources.
Getting Technical: Building an Enterprise Data Catalog
As explained in the previous sections, enterprise data catalogs have become one of
the most popular solutions to enable data discovery in the enterprise. In the last
couple of years, we have implemented enterprise data catalogs for dozens of
organizations. As a result, there are a few reference architectures that you can
implement with today’s technology. The following diagram illustrates a reference
architecture model for an enterprise data catalog solution.
The diagram highlights the following functional components:
Catalog Portal
The catalog portal is the main user interface to register, browse and discover data
sources in an enterprise environment. From the architecture standpoint, the catalog
portal interacts with the different APIs of the solution to perform operations on
data sources. The portal could be implemented using any web development
platform, such as Node.js with Express, ASP.NET or Django (Python).
Catalog Mobile
As with the portal interface, users will be able to interact with data sources from
smartphones or tablets using the catalog mobile interface. This component of the
platform provides simple, mobile-first functionality to enable data discovery from
mobile devices.
Catalog Store
The catalog store is the main data repository for maintaining the metadata
associated with different data sources. Considering the arbitrary nature of the
information related to business data sources, we have typically preferred to use
NoSQL databases such as MongoDB or Couchbase when implementing this type of
solution.
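To illustrate why a document-oriented store fits this role, the following minimal sketch shows two catalog entries whose metadata differs in shape. All field names and values are assumptions made up for this example; they are not a schema prescribed by the architecture, and no database driver is used.

```python
import json

# Hypothetical catalog entries: field names are illustrative, not a fixed schema.
crm_accounts = {
    "_id": "crm-accounts",
    "name": "CRM Accounts",
    "system": "crm",
    "tags": ["sales", "customers"],
    "fields": [
        {"name": "account_id", "type": "string", "description": "Unique account key"},
        {"name": "owner", "type": "string", "description": "Assigned sales rep"},
    ],
}

clickstream = {
    "_id": "web-clickstream",
    "name": "Web Clickstream",
    "system": "web-analytics",
    "tags": ["marketing"],
    # Metadata that only makes sense for this kind of source.
    "retention_days": 90,
    "sample_event": {"page": "/pricing", "referrer": "search"},
}


def common_keys(*docs):
    """Return the sorted keys shared by every document."""
    shared = set(docs[0])
    for doc in docs[1:]:
        shared &= set(doc)
    return sorted(shared)


if __name__ == "__main__":
    # Only a small core is shared; the rest varies per data source.
    print(common_keys(crm_accounts, clickstream))
    print(json.dumps(clickstream, indent=2))
```

Both entries share a small core of keys and carry source-specific metadata beyond it, which is the kind of variation a document store accepts without schema migrations.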
Data Source Publishing API
The data source publishing API provides the interfaces for publishing and managing
business data sources from different applications, including the catalog portal. This
API should handle all aspects of data source management, such as
categorization, tagging and metadata management.
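The responsibilities above can be sketched as a small in-memory service. This is only an illustration under stated assumptions: the class and method names (`PublishingService`, `publish`, `tag`, `set_metadata`) are hypothetical, and a real implementation would sit behind an HTTP API and persist to the catalog store.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    source_id: str
    name: str
    system: str
    category: str = "uncategorized"
    tags: set = field(default_factory=set)
    metadata: dict = field(default_factory=dict)


class PublishingService:
    """Hypothetical in-memory stand-in for the publishing API."""

    def __init__(self):
        self._sources = {}

    def publish(self, source_id, name, system, category="uncategorized"):
        # Reject duplicates so an existing entry is never silently replaced.
        if source_id in self._sources:
            raise ValueError(f"data source already registered: {source_id}")
        source = DataSource(source_id, name, system, category)
        self._sources[source_id] = source
        return source

    def tag(self, source_id, *tags):
        # Tags are normalized to lower case to avoid near-duplicates.
        self._sources[source_id].tags.update(t.lower() for t in tags)

    def set_metadata(self, source_id, **metadata):
        self._sources[source_id].metadata.update(metadata)

    def get(self, source_id):
        return self._sources[source_id]


if __name__ == "__main__":
    service = PublishingService()
    service.publish("crm-accounts", "CRM Accounts", "crm", category="sales")
    service.tag("crm-accounts", "Customers", "sales")
    service.set_metadata("crm-accounts", owner="sales-ops")
    print(service.get("crm-accounts"))
```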
Data Source Discovery API
The data source discovery API provides the interfaces required to dynamically query
and discover data sources registered on the platform. Typically, we have used
established query technologies such as OData or GraphQL as the main protocols
for these interfaces.
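As a rough sketch of what an OData-style discovery request could look like, the helper below builds a query URL using the standard OData system query options `$filter`, `$select` and `$top`. The endpoint URL, the `DataSources` entity set and the `system` property are assumptions for illustration only; the sketch builds the URL and does not send a request.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit


def odata_string(value):
    """Quote a string literal for $filter; OData escapes ' by doubling it."""
    return "'" + value.replace("'", "''") + "'"


def build_discovery_url(base_url, system=None, select=None, top=None):
    """Build a query URL against a hypothetical DataSources entity set."""
    options = {}
    if system is not None:
        options["$filter"] = f"system eq {odata_string(system)}"
    if select:
        options["$select"] = ",".join(select)
    if top is not None:
        options["$top"] = top
    if not options:
        return base_url
    # Keep "$", "," and "'" readable; everything else is percent-encoded.
    return base_url + "?" + urlencode(options, quote_via=quote, safe="$,'")


if __name__ == "__main__":
    url = build_discovery_url(
        "https://catalog.example.com/odata/DataSources",  # hypothetical endpoint
        system="crm",
        select=["name", "system"],
        top=10,
    )
    print(url)
    print(parse_qs(urlsplit(url).query))
```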
Data Source Notifications API
The data source notifications API provides the mechanisms for third-party
applications to dynamically subscribe to changes on specific data sources. The API
should be able to deliver notifications via traditional channels such as email, SMS or
push notifications, as well as via programmatic interfaces.
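The subscription model described above can be sketched with delivery channels modeled as plain callables. The names here (`NotificationService`, `subscribe`, `publish_change`) and the event fields are hypothetical; a real email, SMS or push channel would wrap the corresponding delivery service, which this sketch replaces with a list.

```python
from collections import defaultdict


class NotificationService:
    """Hypothetical sketch: subscribers register a channel per data source."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, source_id, channel):
        # A channel is any callable that accepts the event dictionary.
        self._subscribers[source_id].append(channel)

    def publish_change(self, source_id, event_type, detail):
        """Deliver an event to every subscriber and return the delivery count."""
        event = {"source_id": source_id, "type": event_type, "detail": detail}
        channels = self._subscribers.get(source_id, [])
        for channel in channels:
            channel(event)
        return len(channels)


if __name__ == "__main__":
    outbox = []  # stand-in for an email or push delivery channel
    service = NotificationService()
    service.subscribe("crm-accounts", outbox.append)
    service.publish_change("crm-accounts", "schema_changed", {"added": ["region"]})
    print(outbox)
```

Modeling channels as callables keeps the delivery mechanism separate from the subscription bookkeeping, so a programmatic interface such as a callback can be registered the same way as a messaging channel.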
Data Source Search API
The data source search API is responsible for providing traditional search
capabilities over the enterprise data sources registered in the catalog. The search
capabilities should focus on the data source metadata and not on the data itself.
Search techniques like faceted search and proximity matching are very relevant
for this API. Typically, we rely on search platforms like Elasticsearch to implement
this capability.
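To make keyword and faceted search over metadata concrete, here is a minimal in-memory sketch. It is a stand-in for a real search platform, not an Elasticsearch client, and the catalog entries and field names are invented for the example.

```python
from collections import Counter

# Hypothetical catalog metadata; only metadata is searched, never the data itself.
CATALOG = [
    {"id": "crm-accounts", "name": "CRM Accounts",
     "description": "Customer accounts and owners",
     "system": "crm", "tags": ["sales", "customers"]},
    {"id": "erp-invoices", "name": "ERP Invoices",
     "description": "Issued customer invoices",
     "system": "erp", "tags": ["finance", "customers"]},
    {"id": "web-clickstream", "name": "Web Clickstream",
     "description": "Page view events",
     "system": "web", "tags": ["marketing"]},
]


def search(entries, keyword="", system=None, tag=None):
    """Return (hits, facets): keyword match plus optional facet filters."""
    keyword = keyword.lower()
    hits = []
    for entry in entries:
        text = f"{entry['name']} {entry['description']}".lower()
        if keyword and keyword not in text:
            continue
        if system and entry["system"] != system:
            continue
        if tag and tag not in entry["tags"]:
            continue
        hits.append(entry)
    # Facet counts summarize the hits so a UI can offer further narrowing.
    facets = {
        "system": Counter(e["system"] for e in hits),
        "tags": Counter(t for e in hits for t in e["tags"]),
    }
    return hits, facets


if __name__ == "__main__":
    hits, facets = search(CATALOG, "customer")
    print([h["id"] for h in hits])
    print(dict(facets["tags"]))
```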
Data Governance API
The data governance API is responsible for enabling data governance and
stewardship capabilities such as access control, data privacy, data ownership and
SLA monitoring. These APIs can be integrated with existing security and access
control platforms in the enterprise.
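As one possible illustration of how ownership, access control and privacy might combine in a single check, consider the sketch below. The `Policy` fields and the `can_read` rules are assumptions chosen for the example; in practice these decisions would usually be delegated to the existing enterprise security platform.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical governance policy attached to one data source."""
    owner: str
    reader_roles: set = field(default_factory=set)
    contains_personal_data: bool = False


def can_read(policy, user, roles, privacy_cleared=False):
    """Decide read access from ownership, privacy clearance and roles."""
    # The owner can always read their own data source.
    if user == policy.owner:
        return True
    # Personal data requires an explicit privacy clearance.
    if policy.contains_personal_data and not privacy_cleared:
        return False
    # Otherwise, at least one of the user's roles must be allowed.
    return bool(policy.reader_roles & set(roles))


if __name__ == "__main__":
    policy = Policy(owner="alice", reader_roles={"analyst"},
                    contains_personal_data=True)
    print(can_read(policy, "bob", ["analyst"]))
    print(can_read(policy, "bob", ["analyst"], privacy_cleared=True))
```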
Metadata Connectors
The connectors are responsible for abstracting the integration with the different
line-of-business systems hosting the data sources that will be discovered via the
catalog. From the functional standpoint, the connectors should provide the
authentication and data querying capabilities required to register a data source in
the enterprise data catalog.
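A connector abstraction along these lines could be sketched as an abstract base class with one concrete implementation. The interface (`connect`, `list_sources`, `describe`) is a hypothetical design for this example; SQLite is used only because it ships with Python and makes the sketch runnable, not because the architecture calls for it.

```python
import sqlite3
from abc import ABC, abstractmethod


class MetadataConnector(ABC):
    """Hypothetical interface each line-of-business connector implements."""

    @abstractmethod
    def connect(self, credentials=None):
        """Authenticate against the source system."""

    @abstractmethod
    def list_sources(self):
        """Return the names of the data sources the system exposes."""

    @abstractmethod
    def describe(self, source_name):
        """Return field-level metadata for one data source."""


class SQLiteConnector(MetadataConnector):
    def __init__(self, path):
        self._path = path
        self._conn = None

    def connect(self, credentials=None):
        # SQLite has no login step; a real connector would authenticate here.
        self._conn = sqlite3.connect(self._path)

    def list_sources(self):
        rows = self._conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        return [name for (name,) in rows]

    def describe(self, source_name):
        # Only accept names the database itself reported, since PRAGMA
        # statements cannot take bound parameters.
        if source_name not in self.list_sources():
            raise KeyError(source_name)
        rows = self._conn.execute(f'PRAGMA table_info("{source_name}")')
        return [{"name": row[1], "type": row[2]} for row in rows]


if __name__ == "__main__":
    connector = SQLiteConnector(":memory:")
    connector.connect()
    connector._conn.execute("CREATE TABLE accounts (account_id TEXT, seats INTEGER)")
    print(connector.list_sources())
    print(connector.describe("accounts"))
```

The catalog only depends on the abstract interface, so supporting a new line-of-business system means adding one connector class rather than changing the publishing or discovery code.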
Data Collaboration System and APIs
The data collaboration system and APIs provide the interfaces for teams to
collaborate around specific data sources stored in the data catalog. This interface
can be the main gateway for capturing contextual information related to data
sources, such as comments and documents.
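A minimal sketch of how comments and document links might be attached to a data source follows. The `Annotation` record and the `CollaborationLog` class are hypothetical names for illustration, and storage is an in-memory list rather than the catalog store.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

ALLOWED_KINDS = {"comment", "document"}


@dataclass
class Annotation:
    """Hypothetical piece of context a user attaches to a data source."""
    source_id: str
    author: str
    kind: str  # one of ALLOWED_KINDS
    body: str
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class CollaborationLog:
    def __init__(self):
        self._items = []

    def add(self, source_id, author, kind, body):
        if kind not in ALLOWED_KINDS:
            raise ValueError(f"unsupported annotation kind: {kind}")
        item = Annotation(source_id, author, kind, body)
        self._items.append(item)
        return item

    def for_source(self, source_id):
        """Return annotations for one data source in insertion order."""
        return [i for i in self._items if i.source_id == source_id]


if __name__ == "__main__":
    log = CollaborationLog()
    log.add("crm-accounts", "dana", "comment", "owner is empty before 2015")
    log.add("crm-accounts", "lee", "document", "https://wiki.example.com/crm")
    for item in log.for_source("crm-accounts"):
        print(item.author, item.kind, item.body)
```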
Putting It All Together
As simple as the previous architecture model seems, it contains the fundamental
building blocks to enable robust data discovery scenarios in enterprise
environments. This architecture model is based on our experience implementing
dozens of similar solutions and can be easily extended with other relevant aspects
such as data quality rules and data access optimization.
Summary
Data discovery is one of the most important elements of enterprise data solutions
and one that is frequently ignored. This paper has provided a reference architecture
to enable data discovery in enterprise environments. The reference architecture
covers relevant aspects of data discovery solutions such as metadata management,
governance, alerting and discovery. The reference architecture described in this
paper has been implemented dozens of times using commodity technology stacks
available to any organization in the world.
