If you put garbage in, do you get machine learning out?
Machine learning and artificial intelligence promise to be transformative technologies, but many of the businesses rushing to integrate machine learning still struggle to set the proper foundation for these technologies: controlling the quality and accuracy of their data.
In fact, a recent report found that nearly half of businesses do not have the technology in place to leverage their data effectively. That same report noted that obtaining accurate data was one of the largest challenges businesses face when it comes to data management.
New open source technologies now enable companies of any size to implement advanced analytics, but most companies fail at the basics of collecting and storing their data. It is the old “garbage-in, garbage-out” problem, but now poor data is driving machine learning or artificial intelligence projects.
The relevance and timeliness of data are critical to the effective application of machine learning for business outcomes, both in training the model and in using it. That said, the timeliness needed depends on the use case. It could be measured in seconds, minutes, hours or days.
Not all data needs to be refreshed in real time. Historically, data collection and curation have been batch-oriented. The increasing corporate appetite for real-time analytics is changing that, and the abundance of elastic computing and storage is making the change possible.
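The idea that acceptable data age depends on the use case can be made concrete with a small freshness check. The following is a minimal sketch in Python; the use-case names and time budgets in MAX_AGE are hypothetical values chosen for illustration, not recommendations.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness budgets per use case (illustrative values only).
MAX_AGE = {
    "fraud_scoring": timedelta(seconds=5),
    "inventory_dashboard": timedelta(hours=1),
    "quarterly_forecast": timedelta(days=1),
}


def is_fresh(use_case, last_updated, now=None):
    """Return True if data last updated at `last_updated` is recent enough."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated <= MAX_AGE[use_case]
```

Data that is ten minutes old would pass this check for the dashboard but fail it for fraud scoring, which is the point: the same record can be timely for one purpose and stale for another.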
Real-time data handling was once the sole province of companies such as Amazon, Citibank or PayPal, but various proprietary and open source technologies are now available to help organizations of any size tackle these challenges. Data pipelines, asynchronous messaging, micro-batches, stream processing, time series and concurrent model iterations are representative techniques that are being deployed successfully.
StreamSets, Apache Kafka, Apache Spark, time series databases, and TensorFlow are some of the foundational open source tools and technologies at the forefront of this shift to real-time data collection and curation.
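Micro-batching, one of the techniques named above, sits between pure batch and pure streaming: incoming events are grouped into small chunks and each chunk is processed as a unit. The sketch below illustrates the idea in plain Python. It is not how any of the tools above implement it; the function name micro_batches is hypothetical, and batches here are sized by event count, whereas production systems commonly cut batches by time window.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Yield lists of up to `batch_size` events from any iterable."""
    iterator = iter(events)
    while True:
        # Pull the next chunk; the final chunk may be smaller than batch_size.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Each yielded batch could then be validated, enriched and written to storage in one step, which keeps latency low without paying per-event overhead.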
But no matter how sophisticated the technology, it still comes down to the relevance and timeliness of the data. Good data is the foundation of any digital transformation effort, and companies must take a disciplined and structured approach to managing their data if they want to properly leverage machine learning and AI. This involves:
An understanding of the business case. What are the business goals and objectives? What data are relevant to understanding whether those goals are being met? What level of timeliness is needed? Without answers to these questions, any effort to leverage data will likely fail to reach its full potential.
A full inventory of data sources. This includes structured data from internal transactional databases; external sources, such as credit scores from TransUnion or Experian, to augment the internal data; and then open source and internal, unstructured data on user behavior and social media. Many companies think their internal structured data is enough, but the unstructured and third-party data can be just as critical.
A strategy for storing the data properly. For many companies, important data is distributed in silos across the enterprise. For example, the customer onboarding system is disconnected from the website shopping cart, while the sales team is working with the CRM system to manage cross-selling. Implementing a data lake will help pool these different data sources into a single view across the enterprise. In addition, groups across the enterprise will make decisions based on the same source of data, eliminating redundant and inconsistent actions.
Leveraging the data for visualization. Once the basics of data collection are established, then companies can move towards using the data for visualization, where reports and dashboards enable people to make decisions and take actions based on the data. This is the first step in providing meaningful interpretation of data in a form that is actionable.
A move to automated decision making and machine learning. With clean, timely, and relevant data – and a solid understanding of how the data can be used to make decisions – it now becomes possible to forecast and predict in real time. Rather than having to interpret the data manually, companies can let machines use the data to automate some of the decision making. Additionally, unsupervised machine learning can uncover insights that had not previously been hypothesized.
A commitment to ongoing data governance. It’s important to establish policies and processes that maintain a high level of data consistency and cleanliness. Otherwise, companies will find that the quality of their analytics degrades as the quality of their data degrades. When that happens, it opens the door to sub-optimal decision making and has an adverse impact on clients.
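As one illustration of what a governance policy might check on a recurring basis, the sketch below counts records with missing required fields and repeated keys. It is a minimal example in plain Python under stated assumptions: the function name quality_report, the field names in the usage example, and the choice of these two checks are all hypothetical, and a real governance program would track many more dimensions of quality.

```python
def quality_report(records, required_fields, key_field):
    """Summarize missing required fields and duplicate keys in a list of dicts."""
    missing = 0
    duplicates = 0
    seen = set()
    for record in records:
        # A field counts as missing if it is absent, None, or an empty string.
        if any(record.get(field) in (None, "") for field in required_fields):
            missing += 1
        key = record.get(key_field)
        if key is not None:
            if key in seen:
                duplicates += 1
            seen.add(key)
    return {
        "total": len(records),
        "missing_required": missing,
        "duplicate_keys": duplicates,
    }


# Hypothetical customer records used only for illustration.
customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 1, "email": "b@example.com"},
]
report = quality_report(customers, required_fields=["email"], key_field="id")
```

Tracking counts like these over time gives an early signal that data quality is drifting before it shows up in degraded analytics.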
There is no silver bullet for this, nor should companies expect to implement a comprehensive data strategy in one fell swoop. Rather, this is a long, slow process of assessing where the company is today on its maturity curve and what it needs to do to get to the next step.
If there are 50 to 100 data sources that ultimately need to be integrated, don’t try to incorporate all of them at the same time. Instead, focus on the two or three that will have the greatest impact on the business outcomes and work those through the full end-to-end process of data assessment, enrichment, visualization, and ultimately machine learning.