Strong data quality key to success with machine learning, AI or blockchain
Poor data quality could cost organizations an average of $15 million per year in losses, according to Gartner. Organizations know data is a key factor in their success, and for enterprises that rely on it to make strategic business decisions, bad data can have a direct impact on the bottom line.
In some industries, such as financial services, banking or insurance, biased data can also result in hefty fines or penalties. But for organizations experimenting with or exploring emerging technologies, bad data presents an even bigger risk.
As organizations harness value from an increasing variety of data sources and formats, ranging from data residing on legacy platforms to new data sources such as the Internet of Things (IoT) and streaming data, there’s an increased focus on analytics platforms, compute frameworks and new methods of deriving insights. This can make data quality an afterthought.
With access to more data, businesses are trying to predict things like customer behavior, customer churn and demand as well as analyze risk, detect fraud and determine deficiencies around service or product quality. All these initiatives require the ability to analyze exploding volumes and varieties of data.
These use cases call for new ways of analyzing all that data, which is a challenge in itself. A SQL query alone, for example, is no longer enough to derive insights from petabytes of unstructured data. This is where machine learning and predictive analytics come into play.
As enterprises adopt machine learning to garner insights on complex datasets, they often miss a vital step – ensuring the quality of their data – as artificial intelligence (AI) and machine learning algorithms are only as effective as the data they use.
Enterprises must be skeptical of their data because it essentially determines how the AI will work, and bias may be inherent in it because of past customers, business practices and sales. Historical data used to train a model shapes how its algorithms behave, and new data fed to that model shapes future decisions. Bad data at either stage can have a significant impact on the business insights derived from the predictive model or on the automated actions taken by the system.
For supply chain use cases in blockchain, bad data can be even more costly. While blockchain has the promise of providing transparency, traceability and visibility to processes and operations, bad data populated in the distributed ledger is challenging to address as the transactions are immutable.
For example, the geocode or location data in a food supply chain must be correctly populated to guarantee traceability back to the origin. If garbage is going in, garbage is coming out, which is why enterprises must have a handle on data quality before investing in these technologies. Otherwise, initiatives that promise big business benefits not only waste time for IT and users but also potentially put organizations at risk.
As emerging technologies such as machine learning, AI and blockchain become mainstream and organizations have more production use cases, data quality becomes more important. Data scientists need to make sure the data is comprehensive and drawn from a diverse set of sources, not just high in volume.
The data has to be integrated, de-duplicated, cleansed and matched across a variety of data sources and prepared for downstream analytics. Data quality needs to be part of the business plan. If insights are garnered from poor, dirty or inaccurate data, machine learning or AI technology may learn bad habits, pick up bias or pollute outcomes overall. Bad results and biased data pose a great risk to the enterprise, as they may have dangerous implications for brand reputation, revenue growth or compliance requirements.
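As a rough illustration of what integrating, cleansing, matching and de-duplicating can mean in practice, the following is a minimal Python sketch. The record layout, the field names (name, email, source) and the choice of a cleansed email address as the match key are all assumptions made for this example; real matching rules are usually far more involved.

```python
import re

def normalize(record):
    """Cleanse one record: collapse whitespace in the name, lowercase the email."""
    return {
        "name": re.sub(r"\s+", " ", record.get("name", "")).strip().title(),
        "email": record.get("email", "").strip().lower(),
        "source": record.get("source", "unknown"),
    }

def integrate(*sources):
    """Merge several sources, matching records on the cleansed email (assumed key)."""
    merged = {}
    for source in sources:
        for raw in source:
            rec = normalize(raw)
            key = rec["email"]
            if not key:
                continue  # no match key; a real pipeline would quarantine this record
            entry = merged.setdefault(
                key, {"name": rec["name"], "email": key, "sources": []}
            )
            if rec["source"] not in entry["sources"]:
                entry["sources"].append(rec["source"])
    return list(merged.values())

# Hypothetical sample data from two systems describing the same customer.
crm = [{"name": "  ada   lovelace ", "email": "Ada@Example.com ", "source": "crm"}]
billing = [
    {"name": "Ada Lovelace", "email": "ada@example.com", "source": "billing"},
    {"name": "Alan Turing", "email": "alan@example.com", "source": "billing"},
]
customers = integrate(crm, billing)
```

Here the two spellings of the same customer collapse into a single record that remembers both of its sources, which is the kind of outcome downstream analytics depends on.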
To improve their data resources, data output and strategic decision making, companies must make an ongoing commitment to data quality, and this begins by creating an overarching strategy put in place before developing projects. A strategy must examine compliance requirements and considerations, as different data quality measures are needed for different purposes and results.
Customer data for marketing purposes requires different processes than fraud detection or anti-money laundering, for example. There’s also an elevated regulatory component for industries handling sensitive customer data like financial services and healthcare, as businesses could unwittingly break the law by discriminating against someone because of biased data.
For accurate insights with machine learning or AI, companies must use all types of data, whether streaming, IoT or legacy data from mainframe environments. Enterprises may be using third-party data sets or large volumes of data with duplicates, as a key necessity of AI use cases is a large volume of data. For either method, dirty data may cloud the output. No matter the type or volume of data, an effective policy will consider missing, outdated or redundant data to create verified, clean data for analytics.
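One way to picture a policy that accounts for missing, outdated and redundant data is a simple profiling pass that sorts records into those three buckets before anything reaches analytics. The sketch below is only an assumption about how such a check might look: the field names (customer_id, updated), the one-year freshness threshold and the sample records are invented for illustration.

```python
from datetime import date, timedelta

def profile(records, key_field, as_of, max_age_days=365):
    """Classify records as missing, outdated or redundant; keep the rest as clean."""
    cutoff = as_of - timedelta(days=max_age_days)
    seen = set()
    report = {"missing": 0, "outdated": 0, "redundant": 0, "clean": []}
    for rec in records:
        key, updated = rec.get(key_field), rec.get("updated")
        if not key or updated is None:
            report["missing"] += 1      # required value absent
        elif updated < cutoff:
            report["outdated"] += 1     # older than the assumed freshness threshold
        elif key in seen:
            report["redundant"] += 1    # same key already accepted
        else:
            seen.add(key)
            report["clean"].append(rec)
    return report

# Hypothetical sample records.
records = [
    {"customer_id": "c1", "updated": date(2024, 5, 1)},
    {"customer_id": "c1", "updated": date(2024, 5, 2)},   # redundant
    {"customer_id": "", "updated": date(2024, 5, 1)},     # missing key
    {"customer_id": "c2", "updated": date(2020, 1, 1)},   # outdated
]
report = profile(records, "customer_id", as_of=date(2024, 6, 1))
```

The counts in the report give a quick measure of how dirty a data set is, and only the records in the clean list would move on to training or scoring.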
Data quality for emerging use cases and advanced analytics also requires flexibility. Accurate, complete and validated data requires continuous examination of quality, both for data at rest and data in motion. Priority should be placed on checking data as it is ingested into a centralized data hub, such as a data lake or data repository, rather than treating quality as an afterthought once the hub has turned into a data swamp. The same holds true for data in motion that is populating blockchain.
One unique differentiator for blockchain is that data cannot be changed, and data quality corrections cannot be applied, once the data is on the blockchain. This makes it all the more important that data quality processes take place before data is populated on the distributed ledger, which reinforces the need for strong policies with agreed-upon rules, automation with IoT sensors and standards for data. Data quality is typically measured as data moves both into and out of its repository; with blockchain, data quality upfront matters most.
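The idea of enforcing quality before data reaches an immutable ledger can be sketched as a validation gate in front of an append-only store. The Python below is a toy stand-in, not a real blockchain client: the ToyLedger class, the validate_shipment rules and the field names (lot_id, lat, lon) are hypothetical and chosen only to echo the food supply chain geocode example.

```python
import hashlib
import json

def validate_shipment(tx):
    """Return a list of rule violations; an empty list means the record may be written."""
    errors = []
    if not tx.get("lot_id"):
        errors.append("lot_id is required")
    lat, lon = tx.get("lat"), tx.get("lon")
    if not isinstance(lat, (int, float)) or not -90 <= lat <= 90:
        errors.append("lat must be a number between -90 and 90")
    if not isinstance(lon, (int, float)) or not -180 <= lon <= 180:
        errors.append("lon must be a number between -180 and 180")
    return errors

class ToyLedger:
    """Append-only, hash-chained list standing in for a distributed ledger."""

    def __init__(self):
        self.blocks = []

    def append(self, tx):
        errors = validate_shipment(tx)
        if errors:
            # Reject before writing: nothing can be corrected after the append.
            raise ValueError("; ".join(errors))
        prev_hash = self.blocks[-1]["hash"] if self.blocks else "0" * 64
        payload = json.dumps(tx, sort_keys=True)
        block_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        self.blocks.append({"tx": tx, "prev": prev_hash, "hash": block_hash})
        return block_hash

ledger = ToyLedger()
ledger.append({"lot_id": "LOT-1", "lat": 36.7, "lon": -119.8})
try:
    ledger.append({"lot_id": "LOT-2", "lat": 367.0, "lon": -119.8})  # bad geocode
    rejected = False
except ValueError:
    rejected = True
```

Because each block's hash depends on the one before it, a bad record cannot be quietly fixed later, which is why the check sits in front of the write rather than after it.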
As companies leverage new technologies and analytics methods for real-world applications and emerging use cases, data quality continues to grow in importance. By developing a strong strategy that takes into consideration flexibility and a variety of data sources and is platform agnostic, enterprises can set themselves up for successful analytics and can feel confident in using insights for strategic business decisions.