Who manages data lakes and what skills are needed?

Data engineers, data scientists and chief data officers are just some of the people who have the skills to manage data lakes.

A data lake, a central repository into which data flows from many sources, is among the most common components of modern data architecture.

The concept of the data lake has evolved from being just a location for data collection to a more organized approach known as a data lakehouse. Whether it's called a data lake or a data lakehouse, there is a need for certain skills and IT professionals to effectively manage the technology.

What is a data lake?

A data lake is a large, open storage location that typically uses object storage as a unified repository for raw data, much of it unstructured, coming from multiple sources. Those sources can include event streams, operational and transactional systems, and databases.

While data lakes can be in on-premises environments, they are more commonly created with cloud object storage services that enable large scalable data capacity, such as Amazon Simple Storage Service (S3), Google Cloud Storage or Microsoft Azure Data Lake Storage. Data lakes first emerged to help enable big data workloads with the Apache Hadoop big data platform.

A data lake architecture differs from a data warehouse in that warehouse data is transformed into a format that provides structured data and organization. A data warehouse enables users to more easily query the data and use it for data analytics and business intelligence use cases. Data warehouses also provide data governance and data management capabilities.
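The difference can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample events, the `orders` table and the function names are hypothetical, a local directory stands in for object storage and an in-memory SQLite database stands in for a warehouse.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical raw events, as they might arrive from different sources.
RAW_EVENTS = [
    '{"order_id": 1, "amount": "19.99", "source": "web"}',
    '{"order_id": 2, "amount": "5.00", "note": "gift"}',
]


def land_in_lake(lake_dir: Path, events: list[str]) -> Path:
    """Data lake style: store records as-is; no schema is enforced on write."""
    path = lake_dir / "orders.jsonl"
    path.write_text("\n".join(events) + "\n", encoding="utf-8")
    return path


def load_into_warehouse(conn: sqlite3.Connection, events: list[str]) -> None:
    """Warehouse style: transform records into a fixed, typed table on write."""
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
    rows = [(e["order_id"], float(e["amount"])) for e in map(json.loads, events)]
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)


def total_from_lake(path: Path) -> float:
    """Reading from the lake means parsing and typing the raw data at query time."""
    with path.open(encoding="utf-8") as f:
        return sum(float(json.loads(line)["amount"]) for line in f if line.strip())


def total_from_warehouse(conn: sqlite3.Connection) -> float:
    """Reading from the warehouse is a plain SQL query over structured data."""
    return conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]


def demo() -> tuple[float, float]:
    with tempfile.TemporaryDirectory() as tmp:
        lake_total = total_from_lake(land_in_lake(Path(tmp), RAW_EVENTS))
    conn = sqlite3.connect(":memory:")
    load_into_warehouse(conn, RAW_EVENTS)
    warehouse_total = total_from_warehouse(conn)
    conn.close()
    return lake_total, warehouse_total
```

Both paths arrive at the same total. What differs is where the structuring work happens: at read time for the lake, at load time for the warehouse.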

The concept of the data lakehouse -- a term coined by Databricks -- is an attempt to bring together the best of data lake and data warehouse technologies. A data lakehouse aims to combine the ease of use and open nature of a data lake with the data warehouse's ability to easily execute queries against data. A data lakehouse provides additional structure on top of a data lake -- often with the use of a data lake table format technology, such as Delta Lake, Apache Iceberg or Apache Hudi. It also uses a query engine technology, such as Apache Spark, Presto or Trino.
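To make "additional structure on top of a data lake" concrete, here is a toy Python sketch of the general idea: a schema plus a manifest of committed files layered over plain data files. It is not how Delta Lake, Apache Iceberg or Apache Hudi are implemented. The `_manifest.json` file name, the `SCHEMA` dictionary and the `commit` and `scan` functions are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

SCHEMA = {"order_id": int, "amount": float}  # hypothetical table schema


def commit(table_dir: Path, file_name: str, rows: list[dict]) -> None:
    """Validate rows against the schema, write a data file, then list it in the manifest."""
    for row in rows:
        if set(row) != set(SCHEMA) or not all(
            isinstance(row[key], typ) for key, typ in SCHEMA.items()
        ):
            raise ValueError(f"row does not match schema: {row}")
    data = "\n".join(json.dumps(r) for r in rows) + "\n"
    (table_dir / file_name).write_text(data, encoding="utf-8")
    manifest_path = table_dir / "_manifest.json"
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    else:
        manifest = {"files": []}
    manifest["files"].append(file_name)
    manifest_path.write_text(json.dumps(manifest), encoding="utf-8")


def scan(table_dir: Path) -> list[dict]:
    """Read only files listed in the manifest; stray files in the directory are ignored."""
    manifest = json.loads((table_dir / "_manifest.json").read_text(encoding="utf-8"))
    rows = []
    for name in manifest["files"]:
        with (table_dir / name).open(encoding="utf-8") as f:
            rows.extend(json.loads(line) for line in f if line.strip())
    return rows


def demo() -> list[dict]:
    with tempfile.TemporaryDirectory() as tmp:
        table = Path(tmp)
        commit(table, "part-0.jsonl", [{"order_id": 1, "amount": 19.99}])
        commit(table, "part-1.jsonl", [{"order_id": 2, "amount": 5.0}])
        # A file dropped into the directory without a commit is not part of the table.
        (table / "stray.jsonl").write_text('{"junk": true}\n', encoding="utf-8")
        return scan(table)
```

In this sketch, the manifest is what turns a directory of files into a table: a reader trusts the metadata rather than whatever happens to be in storage. Real table formats add much more, such as transactions and versioning.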

Who manages data lakes?

Managing data within an organization can be a multi-stakeholder effort. It can involve different job roles depending on the particular use case.

Data warehouses are often managed by data warehouse managers and data warehouse analysts. Those two roles involve data management and data analytics skills, which are typically tied to a specific data warehouse vendor technology.

Data lake management is often the domain of data engineers, who help design, build and maintain the data pipelines that bring data into data lakes. With data lakehouses, management often involves multiple stakeholders in addition to data engineers, including data scientists. Business analysts also fit into the management mix. They are responsible for ensuring that data quality and metadata are properly managed to support business objectives.

As organizations begin to shift from data warehouse to data lake architectures, there is some overlap between the people who manage data warehouses and those who manage data lakes. Data still needs to come from multiple sources and be governed, and analytics are still needed so the data can be used effectively.

At the executive level -- whether it's a data warehouse, data lake or data lakehouse -- the chief data officer is often the role with top-level responsibility for all data use.

What skills are necessary to manage data lakes?

There are a variety of skills that are necessary to effectively manage data lakes:

  • Data engineering. These skills include the design, development, deployment and ongoing operation of data pipelines that bring data from source systems into a data lake. Data engineering and data pipelines often involve skills and tools for extract, transform and load (ETL) operations.
  • Data validation. Ensuring data is accurate and timely goes hand in hand with the data engineering skill set. Data validation is a core skill that ensures the data being ingested by a data lake is high quality and usable.
  • Data science. To be sure the right data lands in a data lake, there is also a need for data science skills. Data science skills can help align data sources to generate the insights an organization is looking for.
  • Business analysis and data analytics. Determining what insights an organization wants is often a skill that involves business analysis and data analytics. These skills help define what metrics an organization is looking to measure. These metrics often help with a business goal or operational trend that needs to be monitored and analyzed.
  • Cloud management. As data lakes are increasingly deployed in the cloud, there is a need for cloud management skills -- including the ability to provision and manage cloud resources. Cost management is a fundamental part of cloud management for data lakes. It helps organizations understand and budget for data lake usage and operations.
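The data engineering and data validation skills above can be illustrated with a minimal extract, transform and load sketch in Python. Everything specific here is an assumption for illustration: the sample CSV, the column names, the `order_date=` partition folders and the function names. A temporary local directory stands in for cloud object storage.

```python
import csv
import io
import json
import tempfile
from datetime import date
from pathlib import Path

# Hypothetical source extract; in practice this would come from a source system.
SOURCE_CSV = """order_id,order_date,amount
1,2024-01-05,19.99
2,2024-01-05,not-a-number
3,,7.50
4,2024-01-06,12.00
"""


def extract(text: str) -> list[dict]:
    """Extract: parse the source CSV into one dictionary per row."""
    return list(csv.DictReader(io.StringIO(text)))


def validate_and_transform(row: dict):
    """Transform: return a typed record, or None if the row fails validation."""
    try:
        return {
            "order_id": int(row["order_id"]),
            "order_date": date.fromisoformat(row["order_date"]).isoformat(),
            "amount": float(row["amount"]),
        }
    except (KeyError, TypeError, ValueError):
        return None


def load(lake_dir: Path, records: list[dict]) -> list[Path]:
    """Load: append records as JSON lines, partitioned by order date."""
    written = []
    for rec in records:
        part = lake_dir / "orders" / f"order_date={rec['order_date']}"
        part.mkdir(parents=True, exist_ok=True)
        path = part / "part-0000.jsonl"
        with path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(rec) + "\n")
        if path not in written:
            written.append(path)
    return written


def run_pipeline(lake_dir: Path, text: str = SOURCE_CSV) -> tuple[int, int, list[Path]]:
    """Run all three steps; return (accepted rows, rejected rows, files written)."""
    rows = extract(text)
    good = [r for r in map(validate_and_transform, rows) if r is not None]
    files = load(lake_dir, good)
    return len(good), len(rows) - len(good), files
```

In this sketch, rows with an unparseable amount or a missing date are rejected before they reach the lake. A production pipeline would more likely route such rows to a quarantine location and report on them rather than silently drop them.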

Available certifications

There are a variety of paths toward certifications for those looking to verify their skills for managing data lakes. A modern data lake or lakehouse deployment often uses cloud resources and tools from a specific vendor to enable data management and data queries.

Managing a data lake is not an abstract idea. It's a hands-on effort that can benefit from specific certifications. The leading cloud and data lake vendors all have some form of training and available certification.

AWS Certified Data Analytics -- Specialty

Amazon Web Services and its S3 cloud object storage service are commonly used to enable data lakes.

This certification is geared toward people with experience working with AWS services. It validates skills in using AWS data lakes and analytics services.

Website: https://aws.amazon.com/certification/certified-data-analytics-specialty

Google Professional Data Engineer

This Google certification provides an examination that verifies skills to build, deploy and use data models that benefit from data lake and analytics services running in Google Cloud.

Website: https://cloud.google.com/certification/data-engineer

Microsoft Certified: Azure Data Engineer Associate

Microsoft's Azure Data Lake Storage Gen2 is a popular option for building data lakes. With this certification, Microsoft provides validation for those looking to use Microsoft services for data lakes.

Website: https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer

Databricks Lakehouse Platform Essentials

This certification helps users learn and validate data lakehouse skills on the Databricks platform. The platform integrates multiple open source technologies, including Apache Spark and Delta Lake.

Website: https://credentials.databricks.com/group/296467

Cloudera CCP Data Engineer

Cloudera is often associated with the open source Hadoop big data technology, which is one of the originators of the data lake concept. This certification will validate skills required to ingest, transform, store and analyze data in the Cloudera environment.

Website: https://www.cloudera.com/about/training/certification/cdhhdp-certification/ccp-data-engineer.html

Informatica Cloud Data Warehouse & Data Lake Modernization Foundation Level

This certification is designed to help organizations that are moving from a data warehouse to a cloud data lake or data lakehouse model.

Website: https://now.informatica.com/cdwdl-foundation-series-certification.html

Dremio data lake training

Dremio, a company that builds data lakehouse technology, has expanded its training options with Dremio University, which provides certificates of completion.

Website: https://university.dremio.com
