Ready to deep dive into the data lake?

Think data lakes are just a new incarnation of data warehouses? Our resident expert Ingo Steins compares the two.

Data lakes and data warehouses have only one thing in common: both are designed to store data. Apart from that, the systems have fundamentally different applications and offer different options to users.

A data lake is a storage system or repository that gathers together enormous volumes of raw data, much of it unstructured. Like a lake, the system is fed by many different sources and data flows. Data lakes allow you to store vast quantities of highly diverse data and use it for big data analysis.

A data warehouse is quite different: it is a central repository that serves company management. Its primary role is as a component of business intelligence: it stores figures for use in process optimization and planning, or for determining the strategic direction of the company. It also supports business reporting, so the data it contains must all be structured and in the same format.
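The difference between the two approaches can be illustrated with a minimal sketch. It assumes a handful of invented JSON events, uses a plain Python list as a stand-in for the lake, and uses an in-memory SQLite table as a stand-in for a warehouse; the field names and the `sales` table are hypothetical and only serve the illustration.

```python
import json
import sqlite3

# Hypothetical raw events from different sources; all field names are invented.
raw_events = [
    '{"source": "webshop", "customer": "C-1", "amount": 19.99}',
    '{"source": "sensor", "device": "D-7", "temperature": 21.5}',
    '{"source": "webshop", "customer": "C-2", "amount": 5.00}',
]

# Data lake style: keep every record exactly as it arrived.
lake = list(raw_events)


def read_from_lake(lake, source):
    """Apply structure only at read time, for the question being asked."""
    records = (json.loads(line) for line in lake)
    return [r for r in records if r.get("source") == source]


# Data warehouse style: a fixed table that data must fit at load time.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales (customer TEXT NOT NULL, amount REAL NOT NULL)"
)


def load_into_warehouse(conn, lines):
    """Insert records that match the sales schema; return those that do not."""
    rejected = []
    for line in lines:
        record = json.loads(line)
        if "customer" in record and "amount" in record:
            conn.execute(
                "INSERT INTO sales VALUES (?, ?)",
                (record["customer"], record["amount"]),
            )
        else:
            rejected.append(record)
    conn.commit()
    return rejected


rejected = load_into_warehouse(warehouse, raw_events)
total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
sensor_rows = read_from_lake(lake, "sensor")
```

In this sketch the sensor reading never reaches the warehouse table because it does not fit the reporting schema, while the lake keeps it untouched for later analysis.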

Challenges with data warehouses

Data warehouses aren’t actually designed for large-scale data analysis, and when used in this way these systems reach their structural and capacity limits very quickly. That matters because we now generate enormous volumes of unstructured data that need to be processed quickly.

Another limitation is the fact that high-quality analyses now draw on a variety of different data sources in different formats, including social media, weblogs, sensors and mobile technology.

A data warehouse can be very expensive. Large providers such as SAP, Microsoft and Oracle offer various data warehouse models, but you generally need relatively new hardware and people with the expertise to manage the systems.

Data warehouses also suffer from performance weaknesses. Their loading processes are complex and take hours, the implementation of changes is a slow and laborious process, and there are several steps to go through before you can generate even a simple analysis or report.

Virtually limitless data lakes

Data lakes, on the other hand, are virtually limitless. They aren’t products in the same way that data warehouses are; they are more of a concept that each organization assembles to suit its own needs and can keep expanding.

Data lakes can store any number of different data formats in very high volumes for indefinite periods of time. Because they are built using standard software, the storage is comparatively cost-effective too.

Data lakes can store huge volumes of data, but need no complex formatting or maintenance. The system doesn’t impose any limits on processes or processing speeds – in fact, it actually opens up new ways to exploit the data you have, and can therefore help companies more generally in the process of digitalization.
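As a rough illustration of storing data without prior formatting, the following sketch writes payloads unchanged into a directory tree organized by source and date. A temporary local directory stands in for the lake's storage, and the `land_raw` helper, the source names and the sample payloads are all assumptions made for this example.

```python
import tempfile
from datetime import date
from pathlib import Path


def land_raw(lake_root, source, day, name, payload):
    """Write a payload unchanged under <root>/<source>/<YYYY-MM-DD>/<name>."""
    target_dir = Path(lake_root) / source / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(payload)  # no parsing, no conversion
    return target


# A temporary local directory stands in for the lake's storage layer.
lake_root = tempfile.mkdtemp(prefix="lake_")
day = date(2018, 1, 15)

# Three invented payloads in three different formats.
land_raw(lake_root, "weblogs", day, "access.log",
         b'127.0.0.1 - - "GET / HTTP/1.1" 200\n')
land_raw(lake_root, "sensors", day, "readings.csv",
         b"device,temperature\nD-7,21.5\n")
land_raw(lake_root, "social", day, "posts.json",
         b'[{"user": "a", "text": "hello"}]')

# List everything that has landed, relative to the lake root.
landed = sorted(
    p.relative_to(lake_root).as_posix()
    for p in Path(lake_root).rglob("*")
    if p.is_file()
)
```

The files keep their original formats; any structure is applied later, by whichever analysis reads them.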

Put on your swim suit

All you really need to start a data lake is a suitable storage platform. This is relatively easy to set up with a solution like Hadoop, which is a framework for distributed storage and processing rather than a database in the traditional sense. Companies that want to access a wide range of data and process it effectively in real time to answer highly specialized and complex questions will find that the data lake is the perfect infrastructure for this goal.

Ingo Steins

Ingo Steins is Unbelievable Machine’s Deputy Director of Operations, heading up the applications division from our base in Berlin. He has years of experience in software, data development and managing large teams, and now runs three such teams distributed across our sites. Ingo joined The Unbelievable Machine Company in January 2016.

Ingo Steins, Deputy Director of Operations, The Unbelievable Machine Company, part of Basefarm Group since June 2017