If you want to manage the quality of your data, it’s important to notice and act on any troubling changes in your data’s fingerprint, a new concept we introduced last week.
In response, we received a lot of feedback from data professionals—architects, scientists, IT—who said they were increasingly challenged by data quality as the complexity and volume of their data grew, and were looking for solutions built on new technology that could actually keep pace.
So we wanted to share a customer story that illustrates another pervasive data quality problem that can be solved elegantly with new automated machine learning technology: redundant data.
The customer is an international bank; let’s call it Giant International Bank (GIB). The enterprise data team at GIB is responsible for the quality, integrity, and risk management of the bank's data assets. They are the keepers of the data and, because they provide it for strategic decision-making, quality issues at the source can create a harmful ripple effect across the entire organization.
GIB—like so many other data teams—receives data from multiple internal and external sources that they then funnel into a data repository. As a result, they are challenged by data redundancy—or “fatty” data—where the same information appears in multiple systems and in multiple formats across the company and simply does not tally from system to system. Making matters worse, the GIB team mixes and matches data across sources to create data products for downstream use in analytical tools and reporting, and this work is fertile ground for the redundant data problem to spread exponentially.
Identifying redundant data can be tricky, because it very often comes from multiple sources and may carry different headers for the same content due to inconsistent nomenclature. Near-redundant data is an even more challenging problem, and much harder to identify. It frequently occurs when the same data is collected or stored in different units. For example, two columns that store the same value in different units, like dollars and pounds, are actually the same data, even though the content looks very different. Copies of data are also created during data transformations, which occur continually. For example, customer annual income is divided by 12 to compute monthly income. The two columns are replicas, since knowing one lets you compute the other.
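Both of these near-redundancy cases, unit conversions and derived columns like annual-to-monthly income, are linear transforms of the same underlying values, so even a simple pairwise correlation check can surface them. Here is a minimal pandas sketch of that idea (this is our own illustration, not Emcien's method; all column names and the threshold are made up for the example):

```python
import pandas as pd

def find_redundant_columns(df, threshold=0.999):
    """Flag numeric column pairs whose absolute Pearson correlation
    exceeds `threshold` -- a signal that one column is a copy, a unit
    conversion, or some other linear transform of the other."""
    corr = df.corr(numeric_only=True).abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j]))
    return pairs

# Toy data: income_gbp is a currency conversion of income_usd,
# and income_monthly is income_usd divided by 12.
df = pd.DataFrame({
    "income_usd":     [60000, 84000, 120000],
    "income_gbp":     [48000, 67200, 96000],   # fixed exchange rate
    "income_monthly": [5000, 7000, 10000],
    "age":            [34, 51, 29],
})

print(find_redundant_columns(df))
# The three income columns are flagged as pairwise redundant; age is not.
```

This catches exact copies and linear transforms even when the headers differ; it would not catch nonlinear relationships or redundancy across string-typed columns, which is where more sophisticated machine learning approaches earn their keep.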
For GIB, this fatty, redundant data is not just inconvenient. It undermines reporting initiatives, impedes efforts to make sound strategic decisions, and poses a serious risk to the business. In most companies, data redundancy ranges from 10% on the low side to 30% on the high side. The impact of redundant data shows up across the entire organization, frequently adding as much as a 30% cost surcharge on all data-related activities, with a much higher cost on the risk side of the business.
So it’s critically important that GIB monitors, flags, and removes redundant data. And for GIB, this is not a one-time job. Addressing redundancy to manage data quality is a continuous effort. Unfortunately, the process of identifying redundant data has been manual, resource-intensive, and error-prone.
Here’s where new technology can make a significant impact. Emcien’s machine learning-powered data discovery tool automatically and instantly identifies and flags all of GIB’s redundant data for them, on a continuous basis. It easily identifies redundant data with different headers, as well as near-redundant data. As a result, GIB is able to isolate or eliminate redundancy more quickly, efficiently, and effectively, creating a leaner data repository that consistently feeds high-quality data to the enterprise for high-quality decision-making. It also decreases risk and cost by minimizing the burden on data storage, computational processes, and the team's time. And it frees the data team to work on other, more interesting projects.
How fatty is your data?