Business leaders are increasingly employing data-driven strategies for success. The problem is that most data still fails to meet basic quality standards - in fact, only three percent of companies produce data that is error-free and legitimate.

Any data analyst or scientist can attest that quality data doesn’t happen by accident. Arduous processes must be followed to generate data that meets the standards needed to properly train a predictive model. Organizations are also recognizing that the skills needed to build a proof of concept are very different from the skills needed to scale an idea to production.

For machine learning (ML) projects, there is a massive difference between generating the input data a model needs to prove out a concept once and supplying the data required to run it in production. To move ML from the desktop ‘proof of concept’ level to 24/7/365 production, it’s important to understand where the challenges lie.

“Ludic Fallacy” means your model can never be fully accurate

The Ludic Fallacy is the mistaken assumption that “flawless statistical models apply to situations where they actually don’t.” For a computer system to handle data programmatically, it needs a simplified, ‘gamified’ view of the world. This means gravitating toward mathematical purity and failing to account for real world scenarios that are hard to handle. This approach creates risks for ML, as real world data may be more prone to modeling issues than the training data used for the Proof of Concept (POC).

Adding more detail to the model - more fields, tables, relationships, etc. - is a solution, but the more detailed the model, the harder it is to work with and understand. This also assumes that data is accessible – in a lot of scenarios ML is expected to extract value out of existing, legacy streams of data, which is no easy task.

The data in your model will always be slightly inaccurate

It’s extremely rare for a large set of real world data to deliver a completely accurate reflection of reality. Real world data streams are always imperfect. In addition to human error, the systems that produce the data introduce problems of their own.

For example, a hardware reboot might erroneously send data from Jan 1, 1970 until its clock is reset, causing any number of problems. Specifications are interpreted creatively, the wrong zip code is used, unique records aren’t, unique identifiers change, a data item sometimes has trailing spaces – the list of ways data can be ‘off’ is more or less infinite.
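A few of these defects can be caught with simple sanity checks before the data reaches a model. The following is a minimal sketch, not a complete validation layer: the record layout (`id`, `epoch_seconds`), the function name `clean_records` and the year-2000 cutoff are all assumptions made for illustration.

```python
from datetime import datetime, timezone

# Assumption: anything stamped before this cutoff is treated as a
# clock-reset artifact (for example, a device reporting Jan 1, 1970).
MIN_PLAUSIBLE = datetime(2000, 1, 1, tzinfo=timezone.utc)


def clean_records(records):
    """Return (kept, rejected) after basic sanity checks.

    Each record is a dict with hypothetical fields "id" and "epoch_seconds".
    """
    kept, rejected, seen_ids = [], [], set()
    for rec in records:
        rec_id = rec["id"].strip()  # trailing spaces would defeat the duplicate check
        ts = datetime.fromtimestamp(rec["epoch_seconds"], tz=timezone.utc)
        if ts < MIN_PLAUSIBLE:
            rejected.append((rec, "implausible timestamp"))
        elif rec_id in seen_ids:
            rejected.append((rec, "duplicate id"))
        else:
            seen_ids.add(rec_id)
            kept.append({"id": rec_id, "timestamp": ts})
    return kept, rejected


sample = [
    {"id": "A1", "epoch_seconds": 1_600_000_000},
    {"id": "A1 ", "epoch_seconds": 1_600_000_060},  # same id with a trailing space
    {"id": "B2", "epoch_seconds": 12},              # clock not yet reset
]
kept, rejected = clean_records(sample)
```

Rejected records are returned with a reason rather than silently dropped, so that someone can later decide whether a "duplicate" was really a changed identifier.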

Merging multiple data streams is difficult and prone to error

Sometimes models require that the analyst merge multiple data streams from different sources, a task that is merely awkward at a desktop level but can become overwhelming when real world volumes are involved. The single most commonly overlooked challenge is one stream running ahead of or behind another, creating all sorts of opportunities for chaos.

To see is to believe: it can be hard to understand just how difficult it is to merge three streams, one of which is 30 minutes behind, and another that stops and starts randomly. Add to that the fact that sometimes the same data is represented differently in different systems. Spellings (and misspellings), codes and accented characters can all differ between streams.
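To make the problem concrete, here is a minimal sketch of joining two small batches of events whose keys are spelled differently and whose clocks are skewed. The event layout (`key`, `ts` in seconds), the function names and the 30-minute tolerance are assumptions for illustration; a production stream join would also need buffering, late-arrival handling and more careful key matching.

```python
import unicodedata


def normalize_key(text):
    """Strip surrounding whitespace, remove accents and fold case."""
    decomposed = unicodedata.normalize("NFKD", text.strip())
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.casefold()


def join_streams(left, right, max_skew_seconds):
    """Pair each left event with the closest-in-time right event that
    shares its normalized key, if one lies within max_skew_seconds."""
    by_key = {}
    for event in right:
        by_key.setdefault(normalize_key(event["key"]), []).append(event)

    matched, unmatched = [], []
    for event in left:
        candidates = [
            other
            for other in by_key.get(normalize_key(event["key"]), [])
            if abs(other["ts"] - event["ts"]) <= max_skew_seconds
        ]
        if candidates:
            closest = min(candidates, key=lambda other: abs(other["ts"] - event["ts"]))
            matched.append((event, closest))
        else:
            unmatched.append(event)
    return matched, unmatched


left = [{"key": "Montréal", "ts": 1000}, {"key": "Zurich", "ts": 1000}]
right = [{"key": " montreal", "ts": 2500}, {"key": "ZÜRICH", "ts": 9000}]
matched, unmatched = join_streams(left, right, max_skew_seconds=30 * 60)
```

In this sample the Montréal events pair up despite the accent, the casing and the stray space, while the Zurich event stays unmatched because its counterpart falls outside the tolerance. Accent stripping is a blunt instrument: it can merge keys that are genuinely different, so it suits only cases where such collisions are acceptable.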

The second biggest challenge is joining streams of data that were never intended to be joined. The causes of failure are many and varied, but subtle differences in the models used to define the streams are usually the prime reason.

Say, for example, you’re working with shipments, but the definition of ‘shipment’ is loose - in one instance a ‘shipment’ is a collection of physical items from a warehouse that is put in a box and sent to a customer. In another area of the same company, a ‘shipment’ is a contractual relationship with a customer which might even use custom part numbers created by adding a contract number to a base part number. Joining the two streams can prove a nightmare.
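As a rough illustration of what reconciling the two definitions involves, the sketch below maps a custom part number back to its base part number. Everything about the format is assumed: the article does not say how the contract number is attached, so the `-` separator, the sample part numbers and the `known_contracts` lookup are hypothetical.

```python
def split_custom_part(part_number, known_contracts, separator="-"):
    """Return (base_part, contract); contract is None for a plain part number.

    Assumes a custom part number looks like "<base><separator><contract>".
    """
    base, sep, suffix = part_number.rpartition(separator)
    # Only treat the suffix as a contract if it is one we actually know about;
    # otherwise a base part number that contains the separator would be split.
    if sep and suffix in known_contracts:
        return base, suffix
    return part_number, None


contracts = {"C042"}  # hypothetical contract numbers
custom = split_custom_part("PN-100-C042", contracts)
plain = split_custom_part("PN-100", contracts)
```

Even this small example needs a reference list of contracts to avoid mangling ordinary part numbers, which hints at why joining streams that were never meant to be joined is so error-prone.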

As data volume increases, so does complexity

While people generally understand the direct implications of higher volumes, there are indirect side effects that can be deeply problematic. The first is that development and testing cycles become much, much longer, simply because of the time required to harness all of the data.

Many performance problems are identifiable only when working with very large subsets of the data, so fitting everything onto a laptop will no longer be an option. The economics of these very large subsets may become a gating factor, as there are fewer test environments than developers. It’s like offering valet parking at a restaurant versus offering valet parking for oil tankers – the task is the same, but increasing scale makes it much harder.

Taking an ML model from desktop POC to production implies a massive, continuous effort

Let’s assume we’ve managed to cope with the previous challenges - what’s next? The reality may be that we spend far more time and energy harnessing the data so it’s usable by our ML engine than we did on the engine itself. In a scenario where a model needs data points from many sources to generate one output, you suddenly find yourself on an endless treadmill, managing and wrangling these data feeds not just once but in real time, 24/7/365.

When time is of the essence

Everything preceding this section is based on an unstated assumption that we can tolerate lags of anywhere from thirty minutes to two hours between when an event happens in the outside world and when it becomes usable in a feed. But what if we need it in less than a second?

An internal service that consumes data continuously can have an importer load staging and production database instances within a few seconds. But below that you’ll need technology capable of absorbing hundreds of thousands of messages per second while still providing the sub-millisecond responses that may be required to turn an ML engine into a source of income, and ultimately profit.

While the conceptual foundations of ML are fairly solid, without tools to support a move from the desktop ‘proof of concept’ level to 24/7/365 production in a millisecond time range, the likelihood that the ML project will fail is high. To achieve success, it’s critical that high volume processing technologies shift from status quo to next-level speed - complete with the power to work in timescales of under a second, and ingest, transact and provide event-based decisions quickly on high velocity data feeds.
