As technology evolves, the amount of data generated worldwide (mainly through smartphones, social media and IoT) is expected to reach 181 zettabytes by 2025, according to the international study Data Never Sleeps 10.0. Against this backdrop, the concept of the data lake is catching on among businesses that want to make the most of their data because of its many benefits.
The term data lake was first coined by James Dixon, CTO of Pentaho, a data integration and analytics platform, in his blog post “Union of the State – A Data Lake Use Case”. A data lake is a storage repository that natively supports big data analytics over data drawn from multiple sources. It supports decision making by enabling various types of analytics, such as dashboards, visualisations, big data processing, real-time analytics and machine learning. There is no limit on size, and data of many different types can be stored.
Unlike data warehouses, where large amounts of data are stored in structured form, data lakes collect raw, unprocessed data in a variety of formats for later analysis. Structured, semi-structured and unstructured data can all be stored, and attaching identifiers and metadata tags at storage time speeds up later searches. The typical users of data lakes are data scientists and developers, while those of a data warehouse are specialists and business analysts.
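The idea of speeding up search by linking identifiers and metadata tags can be sketched as follows. This is a minimal illustration, not a real data lake API: the on-disk layout (a sidecar `.meta.json` file per object) and the `put`/`find` helpers are invented for the example.

```python
import json
from pathlib import Path

# Illustrative layout: each raw object gets a sidecar .meta.json file
# holding its identifier and tags, so objects can be found later
# without opening or parsing the raw data itself.
lake = Path("lake")
lake.mkdir(exist_ok=True)

def put(name: str, payload: bytes, tags: list[str]) -> None:
    """Store a raw object as-is and record searchable metadata beside it."""
    (lake / name).write_bytes(payload)
    (lake / f"{name}.meta.json").write_text(json.dumps({"id": name, "tags": tags}))

def find(tag: str) -> list[str]:
    """Return identifiers of all objects whose metadata carries the tag."""
    hits = []
    for meta_file in lake.glob("*.meta.json"):
        meta = json.loads(meta_file.read_text())
        if tag in meta["tags"]:
            hits.append(meta["id"])
    return hits

# Heterogeneous formats are stored untouched, but tagged for retrieval.
put("visits.csv", b"patient,date\n1,2023-01-02\n", ["health", "structured"])
put("note.txt", b"free-text clinical note", ["health", "unstructured"])

print(sorted(find("health")))  # both objects, regardless of format
```

The point of the sketch is that the search touches only the small metadata files, never the (potentially huge) raw objects.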
The data warehouse is an advantageous model for reporting because it holds structured data organised for a specific purpose, but it is costly and slow when it comes to collecting and exploiting the large volumes of unstructured data that big data technologies require. Currently, most data lakes are implemented in the cloud.
With a data lake, all data is retained rather than purged or filtered before storage, and it remains in its raw state until it is queried. Data in a data lake is transformed only when it is needed for analysis, at which point a schema is applied to make it analysable (an approach known as schema-on-read). While data in a data lake is accumulated without a predefined purpose, data in a data warehouse is defined in advance.
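Schema-on-read can be sketched in a few lines. The field names, types and sample records below are assumptions made up for the example; the essential move is that raw records are stored untyped and the schema is imposed only at query time.

```python
import json

# Raw events land in the lake exactly as produced: untyped JSON lines,
# with no schema enforced at write time.
raw_lines = [
    '{"temp": "21.5", "ts": "2024-05-01"}',
    '{"temp": "22.0", "ts": "2024-05-02", "unit": "C"}',
]

# Schema-on-read: structure is declared by the consumer, at query time.
schema = {"temp": float, "ts": str}

def read_with_schema(lines, schema):
    """Parse raw lines, keeping only declared fields, coerced to declared types."""
    for line in lines:
        record = json.loads(line)
        yield {field: cast(record[field]) for field, cast in schema.items()}

rows = list(read_with_schema(raw_lines, schema))
print(rows[0]["temp"] + rows[1]["temp"])  # now numeric: 43.5
```

Note that the second raw record carries an extra `unit` field; schema-on-read simply ignores what the current analysis does not declare, without rejecting the record at ingestion.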
This type of repository, applied to the health domain, is known as a Health Data Lake. The Plan for Recovery, Transformation and Economic Resilience (PRTR) foresees funds to develop a huge health data lake, called the National Health Data Space, which “will make it possible to improve diagnoses and treatments based on the massive analysis of information collected from the autonomous health systems”, according to the Ministry of Health.
Advantages of data lakes
- They provide easier collection and indefinite storage of all types of data.
- They allow companies to transform raw data into structured data suitable for SQL-based analytics, data science and machine learning, all with lower latency.
- They can be kept up to date more easily, since they support multiple file formats and provide a safe place for new data.
- They offer flexibility for big data and machine learning applications.
- Different tools can be applied to gain insight into what the data means.
- They are cheaper than data warehouses.
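The transformation of raw data into structured form for SQL-based analytics, mentioned above, can be illustrated with the standard library alone. The CSV payload and table layout are invented for the sketch; in practice this role is played by ETL/ELT tools and engines such as Spark or cloud warehouses.

```python
import csv
import io
import sqlite3

# Raw CSV text, as it might sit untyped in the lake.
raw = "region,amount\nnorth,10\nnorth,5\nsouth,7\n"

# Transform on demand into a structured, typed table for SQL analytics.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
reader = csv.DictReader(io.StringIO(raw))
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(r["region"], int(r["amount"])) for r in reader],
)

# Once structured, ordinary SQL aggregation applies.
totals = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
print(totals)  # {'north': 15, 'south': 7}
```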
Disadvantages of data lakes
- Holding all kinds of data can be complex to manage.
- If not managed properly, they can become disorganised and difficult to connect to analytics and business intelligence tools.
- They tend to be more vulnerable to the development of data silos (data that is not accessible to all departments or teams in the company), which can then become data swamps (no metadata, unorganised).
- Storing sensitive data can raise security concerns.
- Initial investment and maintenance can be costly, especially when dealing with large volumes of data.
Data Lake House, the new trend
Given the differences between data lakes and data warehouses, most companies choose to operate both systems side by side in a complementary way. However, a new trend is also emerging that combines the advantages of both types of repository: the Data Lake House. Roughly speaking, Data Lake Houses provide the data structuring and data management capabilities of a data warehouse, but with the flexibility and low cost of a data lake.
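The lake house idea of cheap raw files plus a warehouse-style management layer can be sketched with a toy metadata manifest. Everything here is invented for illustration (the manifest format, file names and `load_table` helper); real lake houses rely on table formats such as Delta Lake, Iceberg or Hudi rather than hand-rolled JSON.

```python
import csv
import json
import sqlite3
from pathlib import Path

# Cheap raw storage: a plain CSV file on disk, as in a data lake.
store = Path("lakehouse")
store.mkdir(exist_ok=True)
(store / "sales.csv").write_text("region,amount\nnorth,10\nsouth,7\n")

# The "warehouse" layer: a small manifest declaring the file's schema,
# so query engines can treat the raw file as a typed table.
manifest = {"table": "sales", "file": "sales.csv",
            "columns": {"region": "TEXT", "amount": "INTEGER"}}
(store / "sales.manifest.json").write_text(json.dumps(manifest))

def load_table(manifest_path: Path, conn: sqlite3.Connection) -> None:
    """Use the metadata layer to expose a raw file as a typed SQL table.
    (String-built SQL is fine for this toy example, not for production.)"""
    m = json.loads(manifest_path.read_text())
    cols = ", ".join(f"{c} {t}" for c, t in m["columns"].items())
    conn.execute(f"CREATE TABLE {m['table']} ({cols})")
    with open(store / m["file"], newline="") as f:
        rows = [tuple(r[c] for c in m["columns"]) for r in csv.DictReader(f)]
    placeholders = ", ".join("?" for _ in m["columns"])
    conn.executemany(f"INSERT INTO {m['table']} VALUES ({placeholders})", rows)

conn = sqlite3.connect(":memory:")
load_table(store / "sales.manifest.json", conn)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 17
```

The design point: the raw file stays cheap and format-flexible, while the manifest restores the structure and governance a warehouse would normally provide.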
A report by Adroit Market Research forecasts that the global data lake market will reach $25.49 billion by 2029, growing at a compound annual growth rate (CAGR) of 24.0%. Rising demand for data governance and security, the growing trend of cloud-based deployments, and the increasing need for analytics and big data solutions are factors contributing to the growth of the data lake market.