00:00:07 Scapegoating bad data in supply chain projects.
00:00:33 Why bad data is an easy excuse for project failure.
00:01:42 Challenges with bad data and misconceptions about its quality.
00:03:16 Difficulties in accessing data from ERP systems and challenges with vendors.
00:06:32 Issues arising during migration between ERP systems and data corruption.
00:08:01 Addressing incorrect data entries and their impact on ERP systems.
00:09:48 Forecasting and spotting issues in historical data.
00:11:37 How evolving semantics and definition changes can lead to data issues.
00:12:20 Scalability issues and optimizing data retrieval as companies grow.
00:14:45 Challenges in creating clean daily extractions and potential for data errors.
00:16:02 The impact of longer processing times on problem-solving in IT departments.
00:17:15 The issue of data semantics and misunderstandings in data interpretation.
00:19:22 The importance of documentation for each data field to ensure proper understanding.
00:21:01 Supply chain practitioners and IT department’s roles in understanding data semantics.
00:23:59 The range of problems under the umbrella of bad data and identifying the root causes.

Summary

In this interview, Kieran Chandler and Joannes Vermorel discuss data’s role in supply chain optimization and challenges faced by software vendors and practitioners. Vermorel highlights that the main issue is not “bad data,” but rather accessing and utilizing it effectively. Challenges include outdated systems, inadequate documentation, and responsibility for data access. Conflicts of interest with integrators, system migration issues, forecasting, and scalability also pose problems. To optimize supply chain management, companies must understand and address data issues, invest in proper documentation, clarify data semantics, and maintain realistic expectations, rather than blaming data for failures.

Extended Summary

In this interview, Kieran Chandler and Joannes Vermorel, the founder of Lokad, discuss the role of data in supply chain optimization projects and the challenges faced by software vendors and supply chain practitioners. They begin by addressing the notion that “bad data” is often used as a scapegoat for the failure of supply chain projects. Vermorel points out that blaming data is a convenient way to avoid placing blame on people who might take it personally and fight back. However, he also emphasizes that understanding the root cause of a problem is crucial.

Vermorel asserts that data-related problems are probably the number one cause of failure for supply chain optimization projects, but the perception of “bad data” is often misguided. He argues that most Western companies have had accurate data for decades, thanks to the use of barcodes, barcode scanners, and other technologies. The real issue, according to Vermorel, is not the quality of the data itself but rather the challenges in accessing and utilizing it.

One of the key challenges in using data effectively is gaining access to it. Many companies have been using various enterprise resource planning (ERP) systems, warehouse management systems (WMS), transportation management systems (TMS), and other software solutions for years, but these systems can be difficult to work with when it comes to exporting data. Vermorel identifies a few scenarios where accessing data can be particularly problematic:

1. Ancient systems: Some companies still use systems that are 40 years old, with obsolete and proprietary backends that make extracting data extremely difficult.
2. Lack of documentation: Software vendors may not provide adequate documentation for their systems, making it hard to understand and navigate the numerous tables and fields found in databases.
3. Responsibility and access: Determining who is responsible for granting access to data can be a challenge, as it involves multiple stakeholders within a company, including the software vendor, IT department, and supply chain practitioners.

The interview highlights the importance of understanding and addressing data-related challenges in supply chain optimization projects. While the quality of the data itself is not typically the issue, difficulties in accessing and using it can contribute to the failure of these projects. Identifying and addressing the root causes of these challenges is essential to ensure the success of supply chain optimization initiatives.

They delve into data issues that can arise from vendor relationships, system integrations, and scalability as companies grow.

One key issue they discuss is the potential for conflicts of interest with integrators, who may be more interested in selling their own supply chain optimization solutions rather than cooperating with a company’s chosen vendor. This can lead to companies being taken hostage by their integrators, making it difficult to access or utilize their data effectively.

Another challenge arises during the migration from one Enterprise Resource Planning (ERP) system to another, which can lead to poor data quality or “garbage integration.” While individual data entries may be accurate, the process of migrating historical data between systems can introduce errors, as there is often no direct one-to-one mapping between data in the old and new systems. This can lead to data corruption, which might not have a significant impact on daily operations but can re-emerge as a problem when attempting supply chain optimization or data crunching projects.
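
There is no code in the original discussion, but the migration problem described here can be made concrete with a small, purely hypothetical sketch. The table names, the `legacy_code` field, and the assumption that the new system records which old entry each record came from are all invented for illustration; the point is simply that historical records with no clean counterpart can be flagged mechanically after a migration.

```python
# Hypothetical sketch: auditing reference data after an ERP migration.
# Names (old_records, new_records, legacy_code) are invented for illustration.

def audit_migration(old_records: dict, new_records: dict,
                    legacy_key: str = "legacy_code") -> dict:
    """Flag old entries with no counterpart in the new system, and new entries
    that do not declare which old entry they originate from."""
    # Index new records by the legacy code they claim to come from.
    by_legacy = {r.get(legacy_key): new_id for new_id, r in new_records.items()}
    unmapped_old = [old_id for old_id in old_records if old_id not in by_legacy]
    orphan_new = [new_id for new_id, r in new_records.items()
                  if r.get(legacy_key) is None]
    return {"old_without_match": unmapped_old, "new_without_origin": orphan_new}

report = audit_migration(
    {"SUP-001": {"name": "Acme"}, "SUP-002": {"name": "Globex"}},
    {"V-10": {"name": "Acme", "legacy_code": "SUP-001"}, "V-11": {"name": "Initech"}},
)
print(report)  # {'old_without_match': ['SUP-002'], 'new_without_origin': ['V-11']}
```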

The interview also touches on forecasting based on historical data, which can be difficult due to the inherent uncertainty of the future. Spotting issues within historical data is easier when the problems are visible, such as gaps or sudden changes in data. However, subtle changes in semantics or definitions over time can lead to more difficult-to-detect issues, particularly when migrating between systems.
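
As a rough illustration of the difference between visible and subtle issues, the hedged sketch below flags gaps and sudden jumps in a made-up monthly history. A silent change of definition, such as returns no longer being deducted from sales, would pass this kind of check unnoticed, which is exactly the harder problem raised in the interview.

```python
# Minimal sketch with made-up numbers: gaps and large jumps are easy to flag;
# silent changes in what "sales" means are not.

monthly_sales = {
    "2023-01": 980, "2023-02": 1010, "2023-03": 995,
    # 2023-04 missing, e.g. a failed extraction
    "2023-05": 1020, "2023-06": 2150,  # sudden jump, e.g. returns no longer deducted
}

def flag_visible_issues(series: dict, jump_ratio: float = 1.5) -> list:
    months = sorted(series)
    issues = []
    for prev, cur in zip(months, months[1:]):
        prev_y, prev_m = map(int, prev.split("-"))
        cur_y, cur_m = map(int, cur.split("-"))
        if (cur_y * 12 + cur_m) - (prev_y * 12 + prev_m) > 1:
            issues.append(f"gap between {prev} and {cur}")
        if series[cur] > jump_ratio * series[prev]:
            issues.append(f"suspicious jump at {cur}: {series[prev]} -> {series[cur]}")
    return issues

print(flag_visible_issues(monthly_sales))
# ['gap between 2023-03 and 2023-05', 'suspicious jump at 2023-06: 1020 -> 2150']
```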

As companies grow, scalability can also introduce data issues. For smaller companies, the entire historical dataset can often fit on a smartphone, making optimization less of a concern. However, as companies grow larger, the sheer volume of data can become an issue. The discussion emphasizes the importance of understanding and addressing these data problems in order to optimize supply chain management effectively.

Vermorel explains that companies often struggle with extracting data from their Enterprise Resource Planning (ERP) systems, as these systems are not designed to provide clean daily increments of data. This results in complex processes, which may lead to incorrect data extraction and introduce bugs. Debugging and fixing these issues can become time-consuming, taking weeks instead of hours, due to the amount of data involved and the slow processing times.
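
The "clean daily increment" idea can be sketched as follows. This is an illustrative example only, not how any particular ERP works: the orders table and its last_modified column are assumed for the sake of the example, and the difficulty described in the interview is precisely that many systems offer no such reliable change marker.

```python
# Hypothetical sketch of incremental extraction using a watermark.
# Table and column names are assumed; many ERPs expose nothing this convenient.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, last_modified TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, "2024-03-01"), (2, "2024-03-02"), (3, "2024-03-03")])

def extract_increment(con, watermark: str) -> list:
    """Return only rows changed strictly after the previous run's watermark."""
    cur = con.execute(
        "SELECT id, last_modified FROM orders WHERE last_modified > ? ORDER BY id",
        (watermark,),
    )
    return cur.fetchall()

# Yesterday's extraction stopped at 2024-03-01, so today only rows 2 and 3 come back.
print(extract_increment(con, "2024-03-01"))  # [(2, '2024-03-02'), (3, '2024-03-03')]
```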

Many companies believe they have good data, but in reality, the semantics of the data are often unclear. This can lead to misunderstandings and misleading calculations. For example, an “order date” can have multiple interpretations, such as the time the order was placed by the client, the time it was registered in the system, or the time the payment was confirmed. To avoid misinterpretations, Vermorel suggests that companies should have detailed documentation for each field and table in their data systems, reflecting the complexity of their supply chain.
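
One lightweight way to record such semantics, sketched below with entirely invented field names and interpretations, is a per-field data-dictionary entry that states the meaning that was chosen, the interpretations that were ruled out, and any known quirks of the field's history.

```python
# Illustrative data-dictionary entry only; the table, field, chosen meaning and
# quirks are invented examples of the kind of documentation discussed above.

order_date_doc = {
    "table": "sales_orders",
    "field": "order_date",
    "chosen_meaning": "timestamp when the order was registered in the ERP (UTC)",
    "rejected_interpretations": [
        "moment the client placed the order on the website",
        "moment the payment was confirmed as valid",
        "date printed on the invoice",
    ],
    "known_quirks": "hypothetical example: held the invoice date before a past migration",
    "owner": "supply chain practitioner responsible for order intake",
}

# One such entry per field is what adds up to roughly a page of documentation
# per field per table for a realistic supply chain system.
for key, value in order_date_doc.items():
    print(f"{key}: {value}")
```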

A common problem in supply chain optimization is that practitioners may not spend enough time qualifying their data, leading to vendors working with incomplete or unclear information. This can result in a “garbage in, garbage out” situation, where the data is not necessarily incorrect but is misunderstood due to poor documentation.

To address these issues, Vermorel emphasizes the importance of identifying the root cause of the problem, which usually involves people and organizational structures. Companies should understand who has ownership of the failure and work to fix the underlying issues, rather than simply blaming the data. Vendors should also be honest about the challenges and time required to clarify data semantics, instead of being overly optimistic in order to close deals.

Companies need to invest in proper documentation, clear data semantics, and realistic expectations to optimize their supply chain and prevent failures stemming from data issues.

Full Transcript

Kieran Chandler: Today on Lokad TV, we’re going to understand why this is such an imprecise diagnostic and also look at some of the data challenges that can be encountered by both software vendors and supply chain practitioners alike. So Joannes, why is bad data such an easy excuse?

Joannes Vermorel: First, because data can’t complain. Nobody is going to defend it, so you’re blaming something inert, which is better than blaming a colleague who is going to take it personally and fight back. But the reality is that when you go to the root cause, it’s always people who are responsible for the problem. Blaming data is kind of skipping the step of identifying the root cause of the problem.

Kieran Chandler: It’s definitely easy to have a go at something that’s not going to fight back. So how can we be more precise? What are some of the challenges?

Joannes Vermorel: Data-related problems are probably the number one cause for failure in supply chain optimization projects. But there are some misconceptions. When people say “bad data,” they mean corrupt or incorrect numbers. However, for most Western companies, they have had very accurate data for decades. Almost nobody is entering incorrect part numbers or making typos. They use barcodes, barcode scanners, and other tricks like RFID. So the amount of truly bad data is usually a very thin fraction, and it’s not sufficient to explain why most of the initiatives that fail due to data-related problems actually fail.

Kieran Chandler: If the vast majority of Western companies are collecting pretty good data, what are some of the challenges that we can encounter which actually make us think that data is not so good?

Joannes Vermorel: The first problem is getting access to the data. You’d be surprised. Companies have been running with various flavors of ERPs, WMS, TMS, or other enterprise software to run and operate their companies on a daily basis for decades. But most of those systems are not very user-friendly when it comes to exporting data. In some cases, you have systems that are so ancient that you don’t even have a proper relational SQL database backing the system. In this sort of situation, it’s really difficult to extract the data because the backend is typically completely obsolete and proprietary.

Kieran Chandler: So who is in charge of doing that?

Joannes Vermorel: There are multiple responsibilities here. First, you can have the software vendor who did not provide any meaningful documentation about the system. In worst-case scenarios, you open your database and realize that your ERP contains 2,000 tables, each with 20 to 200 fields, and that’s a nightmare. It’s completely huge, and you don’t even know where to start looking. So beyond the problem with the vendor, you can have a problem with the integrator. The problem with the integrator is that you might have a conflict of interest. Some integrators have a vivid interest in selling you their own recipe for supply chain optimization, this module or that module. And when you ask them to do a data extraction for you, for your internal teams or for an initiative you want to carry out with another vendor, the integrator can be – it does happen, we’ve seen it many times – just plainly uncooperative. Because, again, it’s in their strategic interest not to cooperate. And here you have a hostage situation, where the company is, as a matter of fact, taken hostage by the integrator – the IT company responsible for configuring, sometimes hosting, and overall maintaining the ERP or the other computer systems of the company. So that’s another type of data problem. But you see, it has very little to do with the data itself.

Kieran Chandler: Yes, definitely. Not being able to access your data sounds like a fairly huge blocker. What about some of the other challenges that can occur? A big headache that a lot of our clients have is when they migrate from one ERP system to another ERP system. So what can that do to the data?

Joannes Vermorel: It can cause all sorts of problems. That’s a situation where you can have another type of bad data, and here it really is the data that is bad – when you have a garbage integration. Typically the data entries themselves are correct, but when you move from one ERP to another, whoever handles it – maybe the vendor, maybe the integrator, maybe your internal IT department – is going to try to migrate the historical data from the old system to the new system. The problem is that you don’t have a one-to-one matching between, say, what a sales order was in the old system and what a sales order is in the new system. Maybe things are just organized differently, so there is no clear way to port the history from the old system to the new system. You end up with tentative integrations, and what can lead to data corruption is that an improper re-integration of your history will not prevent your company from operating day-to-day. If the historical data is incorrectly imported into the new system, for most of the daily operations it will not have any impact. And even if it has an impact, usually someone will just do a quick fix for whatever is incorrect and proceed. So it might be a source of ongoing friction, but that friction disappears fast. For example, let’s say you have supplier codes that have been incorrectly imported. Chances are you don’t have a million suppliers, so your top 100 most frequent suppliers are probably going to be fixed – in terms of correcting the incorrect data entries – within two weeks of the date you start using the new system. And maybe three months down the road you have virtually fixed every single incorrect supplier entry. But the problem is the historical data: people are not going to go back and fix the historical records. So let’s say you had five years’ worth of history, maybe three

Kieran Chandler: In the future, how easy is it to spot these issues that might have occurred in the past?

Joannes Vermorel: It’s easy to spot those issues when you have visible problems, such as missing data for a few months. However, there can be subtle changes that are harder to spot, like differences in how sales are counted, whether or not fraud or returns are included. This can lead to a lot of problems that are hard to spot in historical data because the very definition of the data you’re looking at has changed over time, and it’s not obvious unless there’s a noticeable spike or bump.

Kieran Chandler: Another issue that’s common with the customers we speak to is scalability. As a company grows, their data starts to get messier. What are the issues that scalability can introduce?

Joannes Vermorel: When you don’t have any scalability problems, you can just copy all the data from the company every single day. For smaller companies, this may be manageable as their entire history might be less than 10 gigabytes. However, as you grow to larger companies, you end up with much more data, and you need to go for incremental data retrieval. This means extracting a portion of the data every day, and some systems are not designed to handle this efficiently or accurately. So, you need to do complicated things to build a clean daily extraction, and in the process, you expose yourself to potential issues.

Kieran Chandler: So, in the end, you end up with bad data just because you want to do data extraction in an incremental way, and it’s tricky because the system might not have been engineered for this task. When you think about debugging, you just want to copy data from one place to another, and it can be a very mundane problem. If the process takes a minute, someone in your IT department can spend five minutes, trigger the process, and be confident that it works. However, if the process takes six hours to complete, it becomes a more tedious process. Can you explain the challenges in this situation?

Joannes Vermorel: Sure. Imagine you have a system where the process takes six hours to complete. In your IT department, someone is going to start the process, wait for 10 minutes, realize it’s taking too long, and do something else. They might even forget about it. The next day, they might notice a small bug that caused a crash after six hours. To reproduce the problem, it takes another six hours of delay. As a result, you end up with problems that should be fixable in just a few hours, but due to more complexity and longer processing times, it turns into a very tedious process where the total delays become weeks. Not because it’s weeks of effort, but because people launch the process, forget about it, and come back the next day. This makes for very slow iterations.

Kieran Chandler: How widespread would you say these problems are? Are there a lot of companies out there that actually believe they’ve got very good data, but in reality, when you look at it under the surface, it’s not that great?

Joannes Vermorel: Yes, there’s another problem we haven’t discussed, which is the semantics itself. Many companies believe they have good data, but in reality, the data has unknown semantics. What I mean by that is, for example, when we talk about an order date, there are many potential interpretations. It could be the client’s order date, the time the order was placed on the website, registered in the system, or even when the payment was confirmed as valid. There could be 20 different interpretations of what this order date means.

When we start working with clients, we typically encounter tables and columns with little documentation. But when we’re done preparing the data, we have nearly one page of documentation per field per table. A typical supply chain situation has about 20 tables with 20 fields, so we’re talking about 400 pages worth of documentation just to clarify what the data means. People are usually very surprised by this, but it’s necessary to understand the data properly.

Kieran Chandler: Joannes, can you talk about the importance of properly understanding the data in supply chain optimization?

Joannes Vermorel: Yes, it’s the complexity of your supply chain that is reflected in this data, and if you don’t do this work, you end up with data that you don’t understand properly. Thus, it’s garbage in, garbage out. It’s not that the data is garbage in the sense that the numbers are wrong; it’s that you don’t know what the data means. So, if you have a date that you don’t understand properly, whatever calculation or modeling you’re going to do will end up being misleading. So the semantics of the data are the key point, and the documentation has to be in place before you can even start a project.

Kieran Chandler: So, who is to blame when it comes to semantics?

Joannes Vermorel: I would say that supply chain practitioners should be in charge. Most of them would say it’s an IT problem. But how you see the semantics of the data really depends on the process you have. If you’re scanning products at the entrance of the warehouse, the data is just the result of that process, and the IT department is not on the ground in the warehouse, so they don’t know exactly how your process is set up. The only people who know exactly what the data means are the people who operate the process that generates the data in the first place. So my point is: don’t expect IT, who are just managing the machines and making sure the software has enough computing power, memory, bandwidth, and disk, to have the insights, skills, and understanding to know what the data means. What the data means is typically a very business-specific problem; it’s not an IT problem at all. So the blame also frequently lies on the practitioner’s side. Practitioners have not spent enough time to properly qualify the data with their own words and their own understanding. Thus, when the supply chain optimization happens, you end up with a vendor treating this data half-blind, and that ends up as garbage in, garbage out.

Kieran Chandler: So, can the vendor be at fault as well?

Joannes Vermorel: Yes, obviously, the vendor can be at fault as well – companies like Lokad who are doing supply chain optimization. Typically, when the vendor is to blame, it’s because the vendor is trying to be sleek. Usually they’re trying to minimize the challenge because they’re trying to close a deal. They’re saying things like, “Trust us. It’s going to be a piece of cake. We’re going to do that in a matter of weeks. Boom, we’re going to do it like that, and it’s going to work.” The reality is that if you tell a supply chain director, “I’m afraid that just qualifying your data is going to take six months, and sorry, you should have done it, but you did not, so we will have to do it for you,” obviously it’s hard to close this sort of deal. So it’s much easier to be overly optimistic, but that’s a recipe for failure. Then the vendor has to take the blame, because they should know better. Maybe the client doesn’t know better, because it’s the first time they’re trying to do a predictive, quantitative supply chain optimization project. But the vendor, for whom by definition it’s probably not their first time doing that, should know better. Thus, if the diagnosis is that this sort of data qualification does not exist, then they should

Kieran Chandler: Then they should basically warn the client that they are facing maybe multiple months of effort just to clarify the semantics of the data so that the data can be qualified as good. But it wasn’t that the data was really bad in the first place. So, good is not the opposite of bad in this situation; good is more like the opposite of dark data, unqualified data, or messy data.

Joannes Vermorel: Okay, and to conclude today, there’s a wide range of different problems that actually come under that umbrella of bad data. I would say try to make sure to identify the root cause of the problem, and usually, it’s people. I mean, obviously, when I say it’s people, you don’t want to blame James from the IT department for being responsible for the mess. But when I say the problem is people, you need to understand exactly who has the ownership of the failure, and maybe this person was actually put in a situation where they could not do anything but fail.

You see, you can have the conclusion that James from the IT department has failed, but also that the organization itself has put this poor James in a position where he had no other option but to fail realistically speaking. So, it’s interesting that you start to see the problem from an angle that at least gives you clues on how you’re going to fix it as opposed to saying the data was bad, too bad, bad data. And then, if you were to do another initiative, you would just repeat the very same problem, the very same errors, and thus end up with the same failure at the end of the day.

Kieran Chandler: Okay, well, if James’s boss is watching, I hope he’s being sympathetic. Anyway, that’s everything for this week. Thanks very much for tuning in, and we’ll see you again next time. Bye for now.