00:00:08 Data lakes and their importance.
00:00:39 Data lakes defined and their purpose in business.
00:02:13 Evolution of data lakes from data warehouses.
00:04:15 Shift in mindset and philosophy around data lakes.
00:07:43 Ensuring data accuracy in data lakes.
00:10:06 How technology has improved since the data warehouses of 20 years ago.
00:12:14 The benefits of on-demand systems in data lakes.
00:13:31 Limitations of business intelligence and its outdated approach.
00:15:22 Comparing business intelligence to data lakes and their ability to inform decision making.
00:16:49 Implementation complexity: Accessing data sources and the impact on multinational companies.
00:18:32 Adoption of data lakes: Benefits for tech-driven companies and their use in cross-functional optimization.
00:20:08 The future of data lakes: Increasing accessibility and implementation, and next steps with APIs.
00:22:45 Closing remarks and conclusion.

Summary

In this interview, Kieran Chandler and Joannes Vermorel, founder of Lokad, discuss data lakes and their role in supply chain optimization. Data lakes are centralized repositories of raw data that enable machine learning-driven apps to make smart decisions. Vermorel highlights the limitations of traditional business intelligence tools, emphasizing that data lakes offer more efficient and automated data analysis. He believes that tech-driven companies have already adopted data lakes and moved towards implementing application programming interfaces (APIs) for their subsystems, allowing for end-to-end automation. Vermorel predicts that large companies will increasingly adopt data lakes and APIs in the next five years for better data-driven decision-making.

Extended Summary

In this interview, Kieran Chandler discusses data lakes with Joannes Vermorel, the founder of Lokad, a software company specializing in supply chain optimization. They begin by defining data lakes and their origins. Data lakes are a type of database designed to consolidate all the core transactional data of a company, such as sales, purchases, and stock levels. These databases are intended for use by applications, rather than humans, enabling data-driven, domain-specific apps to make smart decisions for marketing, supply chain, human resources, and more.

Data lakes have a history dating back to data warehousing and data marts, trends from over 20 years ago. Vermorel explains that the main difference between data lakes and data warehouses lies in the technology and the philosophy behind them. Data lakes are more efficient at storing and serving large amounts of data, while cloud computing has made them more accessible and affordable.

Twenty years ago, a company would need to purchase an expensive appliance, such as one from Oracle, to house their data warehouse. Now, with cloud computing platforms, companies can have pay-as-you-go data lakes that are scalable and aggressively priced. This flexibility allows businesses to easily adjust their data storage approach if needed.

The philosophy behind data lakes has also evolved compared to data warehouses. The older approach put a lot of pressure on IT departments to properly organize and manage data. Data warehouses were designed with data marts for different divisions, such as marketing, supply chain, and finance. This created challenges in managing and accessing data across different departments.

Data lakes aim to consolidate data in a more centralized and accessible way, making it easier for applications to process and make smart decisions. This shift in mindset has allowed for greater efficiency and flexibility in data management and usage.

Twenty years ago, data warehousing was a popular method for managing and organizing data. This approach involved a high level of technical effort to connect various data tables and required a unified model of the company’s data. However, this method often led to IT divisions being overwhelmed by the sheer amount of work and resulted in many failed projects.

Today, data lakes have emerged as a leaner, more efficient approach to data management. Data lakes act as a repository for raw data extracted from various systems such as CRM, ERP, and web platforms. Instead of attempting to organize or combine the data, it is simply dumped into the data lake, which can handle large amounts of data without issue.
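
To make this concrete, here is a minimal sketch of what such a "dump, don't transform" extraction might look like, assuming a cloud object store such as Amazon S3; the bucket name and export files are hypothetical:

```python
import boto3  # AWS SDK for Python; any cloud object store works similarly

# Hypothetical raw exports produced by the source systems (CRM, ERP, web).
# Note: no joins, no cleanup, no unified schema; files are dumped as-is.
raw_exports = [
    "exports/crm/customers_2024-01-15.csv",
    "exports/erp/purchase_orders_2024-01-15.csv",
    "exports/web/page_views_2024-01-15.csv",
]

s3 = boto3.client("s3")
for path in raw_exports:
    # Mirror the source layout inside the lake: one prefix per source system.
    s3.upload_file(path, "acme-data-lake", f"raw/{path}")
```

The point is organizational as much as technical: IT only guarantees a faithful copy, and all interpretation happens downstream.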

One of the challenges in using data lakes is ensuring that the data is accurate and up-to-date. IT divisions are responsible for ensuring that the data lake contains an accurate reflection of the original systems, but they do not need to understand the business implications of the data. The responsibility of understanding the data within the CRM, for example, falls on the divisions that use it, such as sales or marketing. This approach allows for a more problem-specific interpretation of the data, as different divisions may have different needs and perspectives on the data.
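
As an illustration of this problem-specific interpretation, here is a small sketch in which two divisions read the same raw sales extract from the lake through different lenses; the table and its columns are invented for the example:

```python
import pandas as pd

# The same raw sales extract, exactly as IT dumped it into the lake.
sales = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "sku":      ["A", "A", "B", "B"],
    "channel":  ["web", "store", "web", "web"],
    "quantity": [2, 1, 5, 3],
    "returned": [False, True, False, False],
})

# Marketing lens: net demand per acquisition channel, returns excluded.
marketing_view = sales[~sales["returned"]].groupby("channel")["quantity"].sum()

# Supply chain lens: physical units shipped per SKU, returns included,
# because a returned unit still had to be picked, packed, and shipped.
supply_chain_view = sales.groupby("sku")["quantity"].sum()

print(marketing_view)
print(supply_chain_view)
```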

The technology landscape has changed significantly since the days of data warehouses, making data lakes a more viable option. For one, the quality of tools for moving data across the internet has improved, making it easier to consolidate data from distributed systems, such as supply chains. Additionally, internet infrastructure has improved, making it possible for even smaller businesses to move large amounts of data without difficulty.

Furthermore, cloud computing platforms have made data lakes more accessible and cost-effective. These platforms allow for rapid iteration and on-demand usage, enabling companies to experiment with data lakes without significant financial risk.

While business intelligence tools have been useful for companies to gain insights from their data, they are fundamentally intended for human consumption. This means that companies must pay employees to analyze the data instead of automating the process. Data lakes, in contrast, allow for more efficient and automated data analysis, making them an attractive option for multinational companies looking to improve their data management.

Vermorel explains the limitations of traditional business intelligence (BI) tools, the advantages of data lakes, and the future of data management in supply chain optimization.

Vermorel describes BI as a dated technology that provides only basic data analysis in a somewhat real-time manner. This technology was revolutionary 30 years ago, allowing companies to access and aggregate their data, but it doesn’t offer actionable insights or decisions. In contrast, data lakes are part of a bigger picture, serving as a storage repository for raw data from various sources. Machine learning-driven apps can then efficiently process this data to generate actionable decisions that impact the company and create tangible value.
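
As a toy illustration of the gap between a BI aggregate and an actual decision, the sketch below first builds the kind of cube a BI tool would display (units sold per day, per product), then applies a deliberately naive reorder rule; the data and the rule are illustrative only, not Lokad's actual method:

```python
import pandas as pd

# Toy sales history; a real data lake would serve years of transactions.
sales = pd.DataFrame({
    "date":    ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "product": ["A", "B", "A", "B"],
    "units":   [10, 4, 12, 6],
})
stock_on_hand = {"A": 15, "B": 40}

# BI-style output: an aggregation a human still has to interpret.
cube = sales.groupby(["date", "product"])["units"].sum().unstack()
print(cube)

# Decision-style output: a naive rule that reorders when stock covers
# fewer than 3 days of average demand. A real system would use
# probabilistic forecasts, lead times, MOQs, and so on.
avg_daily = sales.groupby("product")["units"].mean()
for product, demand in avg_daily.items():
    if stock_on_hand[product] < 3 * demand:
        print(f"Reorder {product}: target {int(7 * demand)} units")
```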

Implementing a data lake is dependent on the complexity of accessing a company’s data sources. For large multinational companies, this can be a difficult process, as each country might have its own system. However, there are no alternatives if a company wants to gain insights and make data-driven decisions. Vermorel believes that small, tech-driven companies have already adopted data lakes, and even moved beyond them by implementing application programming interfaces (APIs) for their subsystems. This enables cross-functional optimization and smart decision-making.

Vermorel sees large companies increasingly adopting data lakes in the next five years, as they become more accessible and affordable. Companies that fail to implement data lakes risk being outcompeted by those who have already done so. However, data lakes are not the final step in data management. Vermorel suggests that APIs are the future, allowing companies to not only read and analyze data, but also act upon it. APIs can enable end-to-end automation, generating decisions automatically and implementing them within the system.

Joannes Vermorel emphasizes the importance of moving beyond traditional BI tools and adopting data lakes for more efficient data-driven decision-making in supply chain optimization. He envisions a future where large companies implement data lakes and APIs to automate their processes and make smarter decisions.

Full Transcript

Kieran Chandler: Today on Lokad TV, we're going to discuss a little bit more about the concept of data lakes and understand why companies should be taking more interest in them. So Joannes, as always, perhaps we should just start by defining a little bit more about what data lakes are and where they've come from.

Joannes Vermorel: A data lake is typically a kind of database with some particularities, which is intended to consolidate pretty much all the core data of your company, especially all the transactional data like what you’ve sold, what you’ve purchased, your stock levels, and so on. The intent and end usage of the data lake is that it’s supposed to be for apps, not humans. The idea is that you put a data lake in place so that you can have domain-specific apps that are very data-driven and can use tons of data from the data lake to generate smart decisions for marketing, supply chains, human resources, or whatnot. Fundamentally, it’s a place where you can consolidate all the data to serve it in batch to smart apps. As for the second part of your question, data lakes have a long history, dating back to the idea of data warehousing and data marts.

Kieran Chandler: Data warehouses were a trend we saw probably over 20 years ago. So, what’s changed between then and now, and what are the key differences?

Joannes Vermorel: That’s interesting. The buzzwords nowadays are “data lake” and “data scientist,” while twenty years ago they were “data warehouse” and “data mining”: basically the same ideas, revisited twenty years later. Quite a few things have changed, though. First, the technology of data lakes has changed, so they are much more efficient at storing and serving large amounts of data. Then, we had cloud computing in between, which means that nowadays you can have completely on-demand data lakes with pay-as-you-go, per-terabyte pricing. This is quite different from 20 years ago, when you had to buy a very expensive appliance, like one from Oracle, to store all your data in. Nowadays, with cloud computing platforms, you can have pay-as-you-go terabytes at extremely aggressive prices.

Kieran Chandler: That’s sort of the technical side of things. How about the philosophy? What’s changed in the mindset and how we’re using data lakes compared to data warehouses?

Joannes Vermorel: There has indeed been quite an evolution. The problem with data warehouses as they were thought of 20 years ago is that they put a lot of pressure on IT to properly organize the data. You even had a data warehouse that was supposed to be organized into data marts, with one data mart intended for every kind of division, like marketing, supply chain, finance, and so on. The data marts were like subsets within your data warehouse. The problem with this approach, which was otherwise kind of similar in spirit to the data lakes we have today, is that it required a lot of organization and management from the IT side.

Joannes Vermorel: For business intelligence, there was a high degree of expectation that the data would already be prepared and organized, with customers attached to sales, sales attached to returns, and so on. Gluing together all the things that belong together is a lot of effort. Technically, it's about joining tables, connecting all those tables with the proper joins, et cetera. So, 20 years ago, the philosophy was to do a lot up front, quite similar to what was being done in BI and what was done naturally for relational systems. The problem with this approach was that the amount of work it required was completely enormous, so you ended up with IT divisions that were just completely overwhelmed by the sheer volume of requirements falling on them because of these data warehousing projects. As a result, these projects frequently failed, simply because IT failed to deliver.

Kieran Chandler: But what about today? I mean, surely things are going to get a little bit messy now that you've got these data lakes?

Joannes Vermorel: Data lakes, in terms of philosophy, are much leaner, because the philosophy is that the data lake is just a receptacle for a clean dump extraction of all the data that lies in other systems. You do not try to do any fancy recombination of the data that comes from the CRM, plus the data that comes from the ERP, plus the data that comes from your web platform. You just extract those data sources and dump them into the data lake. And the data lake is well behaved thanks to the technology, meaning that you can dump a huge amount of data and it will handle the load without complaining. If you're on the cloud, you will simply be charged for it.

Kieran Chandler: How do you know that the data you’re actually using is good data? I mean, how are you keeping track of which data is up to date? I mean, if you’re just dumping it all in this lake, how do you keep track?

Joannes Vermorel: The responsibility of IT with a data lake is to make sure that the data lake contains an accurate reflection of what is in the original systems. That does not require any understanding of what's going on business-wise. You have a CRM with 200 relational tables, you mirror them into the data lake, and that's it. You don't need to understand whatever is happening inside the CRM.
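
A minimal sketch of this mirroring responsibility, assuming the CRM is reachable as a SQL database (SQLite here, purely to keep the example self-contained) and the lake is just a folder of flat files:

```python
import sqlite3
from pathlib import Path

import pandas as pd

# Hypothetical CRM backend; in practice, the CRM's own SQL database.
crm = sqlite3.connect("crm.db")
lake = Path("data-lake/raw/crm")
lake.mkdir(parents=True, exist_ok=True)

# Enumerate every table and copy it verbatim. IT never needs to know
# what a "lead" or an "opportunity" means to the business.
tables = pd.read_sql_query(
    "SELECT name FROM sqlite_master WHERE type = 'table'", crm)
for name in tables["name"]:
    df = pd.read_sql_query(f"SELECT * FROM {name}", crm)
    df.to_csv(lake / f"{name}.csv", index=False)
```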

Kieran Chandler: So, who needs to understand what is going on within the CRM?

Joannes Vermorel: It turns out that it's the divisions themselves who want to exploit the data, and the problem is that the interpretation of the data is highly problem-specific. For example, the way you look at the sales data is different depending on whether you want to solve a marketing problem or a supply chain problem. That was also one of the prime reasons why, twenty years ago, many of those data warehousing initiatives failed: the vision was to produce a unified model of the company, but it turned out to be highly frustrating for every division, because marketing said, “Oh, it doesn't exactly fit the vision I have of my domain,” and supply chain said the same, and finance would say the same. So, in contrast, the idea now is that it's the divisions themselves, like supply chain, marketing, finance, and human resources, that take responsibility for interpreting the data.

Kieran Chandler: So does that mean they're not going to fail today?

Joannes Vermorel: Again, there are tons of things that have changed. A particular challenge, especially in supply chain, is that we are pretty much by design dealing with distributed systems. What do I mean by distributed? Not everything is in one place: by definition, if you have multiple warehouses, they are not in the same place, your suppliers are not in the same place as your warehouses, and your clients are not either. So, by definition, we are looking at systems that are dispersed, and you want to consolidate all of that data in one place, your data lake, which technically needs to happen over the network.

Obviously, twenty years ago, the internet had already been invented. It did exist, but the quality of the tools for moving data around across the internet was completely different from what we have today. And the quality of the network itself was also completely different. Consider a not-so-large company, say a 1,000-employee company: sizeable, but not a mega-corporation. Twenty years ago, if you wanted to move one gigabyte of data per day across the internet, it was complicated.

You needed to have access to fiber. In Paris, for example, there was only one place twenty years ago where you could easily get access to fiber, roughly one square kilometer near the stock exchange. Anywhere else, you had to lay down your own fiber if you wanted it. Mega-corporations could do that, but even a sizeable business with 1,000 employees could not. This has changed. Now it's very straightforward: the tooling is better, and you can move around literally gigabytes without too much fuss.

The fact that you have on-demand systems also matters. Those data lakes are not only very cheap, thanks to the economies of scale of the cloud computing platforms, but because they are on-demand, you can do trial and error. If you set up a data lake and it's a complete failure, you can just delete it and retry, and you only pay for what you use. So, you can iterate rapidly. It's not like twenty years ago, when you had to commit to buying a very expensive appliance, and if you got it wrong, that was a big problem.

Kieran Chandler: And I bet those finance areas probably still have the quickest internet. What would you say to a big multinational company that’s already got a good grip on their data, they’re already understanding things using business intelligence tools? I mean, why should they be interested in a data lake?

Joannes Vermorel: The problem with business intelligence is that, fundamentally, it's intended for humans. It's good, but it means that every single minute people spend looking at those numbers is a minute where you are actually paying an employee to look at numbers instead of doing something else. You can very easily produce millions of numbers, which would require thousands of man-hours to be processed, which is extremely expensive.

So, the problem is that business intelligence, the way I see it, is a type of technology that is fairly dated. It was a way to get a basic analysis of your data in a relatively real-time manner. It was very interesting because, if we go back 30 years, to the time when Business Objects was founded, they were the company that pioneered this kind of tooling. Before that, you simply could not run the queries that would give you this information: how many units are sold per day, per product, and so on. Suddenly, it was possible to have this cube, even hypercubes, and to slice the data very nicely. But in the end, you're just looking at a super basic aggregation of your data, and this aggregation is not a decision. It doesn't tell you if you should raise or lower your price, it doesn't tell you if you should produce more or less, it doesn't tell you if, out of a production batch of 1,000 units, you should put 100 units on an airplane for faster delivery. Fundamentally, it's just about getting quantitative insights. So the big difference between BI and the data lake is that the data lake comes with the insight that it's fundamentally a cog in a bigger picture where, sitting in front of the data lake, you will typically have a machine learning-driven app that crunches the data, served super efficiently by the data lake, in order to generate decisions automatically. And those decisions are something that has a physical impact on your company and that will create tangible value.

Kieran Chandler: Okay, so if we agree that business intelligence tools have their limitations, when it comes to implementing a data lake, how easy is that actually to do? Is it just a case of uploading all this data to the cloud and then you're good to go?

Joannes Vermorel: The complexity of implementing a data lake is strictly proportional to the complexity of accessing your data sources, literally accessing them, not doing anything smart with them. For large multinationals, that means that if every single country in your company has its own system, well, guess what, you will have as many extraction pipelines to lay down so that you can bring the data from every single country into the data lake. It's unfortunate, but you have no alternative, because the only other option is to have direct integrations with each country, and that's even more costly: if you have two divisions, say marketing and supply chain, wanting to access the sales data, you will pay for this integration twice. The idea with a data lake is that you do it once, and then the data is in the data lake, which makes it very cheap for the rest of the company to access. So the complexity is completely dependent on what you have. But also, to go back to your initial quote, if you don't have data, you're just a man with an opinion. You have no alternative: you must retrieve this data if you want to do any kind of measurement.

Kieran Chandler: Let’s sort of bring things together now. If there are so many positives to data lakes and it seems fairly simplistic, it’s just a large receptacle of data at the end of the day, why is it something that’s not being readily adopted by industry at the moment?

Joannes Vermorel: It turns out that very small, tech-driven companies adopted data lakes quite a while ago, and they even went beyond that with what I would call the API-fication of their company, meaning that you put an API (Application Programming Interface) on every subsystem, which is the next step that happens after the data lake. Smart e-commerces, for example, have already consolidated their data: what comes from the website, what they pay for search engine marketing, the Google AdWords and whatnot, and the orders. They are able to take smart decisions in terms of direct marketing actions and whatnot. Pure tech-driven companies like Microsoft or Google have been doing similar things for literally decades. Well, Google has only been around for two decades, but the other tech companies have been doing that for quite a while now.

Kieran Chandler: So, if they've been doing that for decades, how about the future? What's next? Are we going to be taking a dip in a data ocean sometime?

Joannes Vermorel: Yes, what I can see next is that companies that are very supply chain-oriented, now that data lakes have become very accessible and very cheap, will implement these data lakes. We see that many of our customers who didn't have a data lake one year ago now have one. I would say there has been a turning point in the last two years on data lake matters. So, I suspect that most large companies will, within the next five years or so, have implemented their own data lakes, because otherwise they would just be completely outcompeted by the large companies that did it before them.

But there are also limits. In particular, a data lake is just a read-only copy of all the data that lies in other subsystems. That's why I was saying that the next step is to have all the subsystems expose APIs, application programming interfaces, because that's what Amazon has done. These APIs let you do even more: suddenly you're not just read-only, you can also act. The idea is that you can consolidate all the data, read it, crunch it, take all those decisions, and then, what do you do with the decisions that have been computed? One answer is to send an Excel spreadsheet to the right division so that they implement your decisions, such as purchasing. But if there's an API, you can directly call this API to inject the purchase order for this product, of this quantity, from this supplier, with this transport specified, and whatnot. So, if you have APIs, you can have end-to-end automation, where you not only generate the decision automatically, but you also implement the decision physically and automatically, because it's reinjected into one of the systems.
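
To illustrate, here is a sketch of injecting one such decision through a hypothetical REST endpoint on the purchasing subsystem; the URL, fields, and token are invented and do not refer to any real Amazon or Lokad API:

```python
import requests

# A decision computed upstream by the data-crunching app.
purchase_order = {
    "supplier_id": "SUP-042",
    "sku": "A",
    "quantity": 350,
    "transport": "air",  # e.g. expediting part of a production batch
}

# Instead of emailing an Excel sheet to the purchasing division,
# inject the decision directly into the purchasing subsystem.
response = requests.post(
    "https://erp.example.com/api/v1/purchase-orders",
    json=purchase_order,
    headers={"Authorization": "Bearer <token>"},
    timeout=30,
)
response.raise_for_status()
```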

Kieran Chandler: Okay, we’re going to have to leave it there, but thanks for your time today. So that’s everything for this week. Thanks very much for tuning in, and we’ll be back again next time. Bye for now.