Importance of Data Modeling and Subject Matter Experts In Machine Learning
A friend of mine recently reminded me of the notorious quote from Frederick Jelinek (the father of modern speech recognition), "Anytime a linguist leaves the group the recognition rate goes up."
I remember being quite upset about it, back during my linguistics studies. Is it really so that if we replace the domain experts (i.e. phonologists in his case) with pure engineers, the performance of the system will improve?
This led me to think about my own domain. Given a system that heavily utilizes machine learning (ML), what makes its performance go up: the presence of domain experts or their absence? For me it is clearly the former, and here I will defend why.
Just to set the scene, I work in the legal arena, a highly specialized domain with well-defined tasks, where we provide technology to support, augment and increase the productivity of legal teams and departments. We use both supervised ML techniques (i.e. we have access to labeled data, so we know what is true and what is false) and unsupervised ML (i.e. all we have is raw data).
Let us focus on the first case only: you have a task to solve and you have labeled data to train an ML system. That applies to many highly specialized domains where the data is unique.
So here is my list of personal favorites, influenced by my experience, and I will highlight the two that are really crucial for the success of such a product.
A system to scale up testing and training
It is not only Google, Microsoft or Facebook that can afford such a system. It pays off to have an internal platform at hand where engineers can quickly test new hypotheses, try out or implement new algorithms, and run anything from simple Bayesian classifiers to more time-consuming deep learning. And the focus on narrow domains matters: as Andrew Ng, chief scientist at Baidu, recently said, "Most of the value of deep learning today is in narrow domains where you can get a lot of data."
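To make "simple Bayesian classifiers" concrete, here is a minimal multinomial Naive Bayes sketch in pure Python. The function names and the toy clause snippets are my own illustrations, not anything from a real product; the point is only that a baseline like this takes minutes to run on such a platform.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train multinomial Naive Bayes on (tokens, label) pairs."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab, len(docs)

def classify(model, tokens):
    label_counts, word_counts, vocab, n_docs = model
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        # log prior + log likelihoods with add-one smoothing
        score = math.log(count / n_docs)
        total = sum(word_counts[label].values())
        for t in tokens:
            score += math.log((word_counts[label][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: contract-clause snippets labeled by type
docs = [
    (["terminate", "notice", "days"], "termination"),
    (["terminate", "breach"], "termination"),
    (["payment", "invoice", "days"], "payment"),
    (["payment", "fee"], "payment"),
]
model = train_nb(docs)
print(classify(model, ["terminate", "notice"]))  # → termination
```

A baseline like this also gives you a sanity check against which any heavier model must justify its extra cost.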
Data model and ontology
You need an accurate model and a holistic view of the data with which you work. For me, this is the most crucial part of a successful ML system. It will, to a great extent, steer the choice of algorithms, the performance metrics of the system, and how users perceive it.
Knowing the nature of your data will help you avoid mistakes like those behind the chat-bot Tay, which had to be shut down 24 hours after launch. A good ontology of the domain in which you work will therefore immensely improve your chances of solving the task.
In the world of supervised ML, "garbage in, garbage out" is a well-known notion. You cannot expect great results if your data is messy, incoherently marked up or highly noisy. If you have a way to handle those errors and mistakes in the pool of data, it will quickly pay off in terms of improved performance. By far the best approach is to put your data in front of subject matter experts from the very beginning, and rely on them throughout the process of preparing, correcting and improving it.
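One cheap way to surface labeling noise early, assuming you can get two subject matter experts to label an overlapping set of documents, is simply to list the documents where they disagree and send those back for review. The expert and document names below are purely illustrative:

```python
def disagreements(ann_a, ann_b):
    """Return ids of documents the two annotators labeled differently."""
    return [doc_id for doc_id in ann_a if ann_a[doc_id] != ann_b.get(doc_id)]

# Hypothetical clause labels from two subject matter experts
expert_1 = {"doc1": "confidentiality", "doc2": "liability", "doc3": "liability"}
expert_2 = {"doc1": "confidentiality", "doc2": "indemnity", "doc3": "liability"}

print(disagreements(expert_1, expert_2))  # → ['doc2']
```

Even this trivial check tells you where the markup is incoherent before a single model is trained; a fuller treatment would use an agreement statistic such as Cohen's kappa.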
Ask questions. Do you have a framework to evaluate your models? What is the ground truth? How do you ensure that the ML system is actually solving the problem? These are all questions you need to ask. Most of this remains internal to the engineers, but some of these questions and their answers will surface to the real users.
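An evaluation framework need not start out elaborate. As a sketch of what "comparing predictions against the ground truth" means, here is a pure-Python computation of precision, recall and F1 for one class; the gold and predicted labels are invented for illustration:

```python
def precision_recall_f1(gold, predicted, positive):
    """Score predictions against ground-truth labels for one class."""
    tp = sum(1 for g, p in zip(gold, predicted) if p == positive and g == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if p == positive and g != positive)
    fn = sum(1 for g, p in zip(gold, predicted) if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical ground truth (from experts) vs. system output
gold      = ["clause", "other", "clause", "clause", "other"]
predicted = ["clause", "clause", "clause", "other", "other"]

# precision, recall and F1 are each 2/3 on this toy data
print(precision_recall_f1(gold, predicted, "clause"))
```

The crucial part is the `gold` list: it only means something if subject matter experts produced it, which is exactly where "what is the ground truth?" gets answered.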
Visualization and UI
Last but not least, create levels of visualization for different types of users. Not everyone needs to know everything about the performance and results of the system; get to know the different target groups and adapt the presentation to their needs. Each group will speak its own language and have its own concerns. The success of your product will depend on how well you anticipate, understand and address them.
"Know thyself, know thy enemy," says Sun Tzu in 'The Art of War'. Creating a domain/data model and relying on subject matter experts throughout the development of the ML system will go a long way toward guaranteeing success. And finally, your product, albeit highly specialized, will not be a threat to the domain experts, but rather a natural extension of their abilities. True experts cannot be substituted.
(About the author: Svetoslav Marinov is head of the Gothenburg Machine Learning Team at Seal Software)