Making sense of unstructured data: Skills for success

One of the biggest challenges data scientists face in the business world today is making sense of unstructured data. The volume, variety and velocity of data continue to explode, and it is up to data scientists to figure out how to transform that information into something meaningful and useful. But how do you begin to make sense of data when such vast quantities are being generated in different formats, often without existing labels? There’s so much information being gathered that it is difficult to keep track of it all, let alone determine which data to keep and analyze.

Part of the problem is that many organizations are collecting data when it isn’t necessarily clear what insights they might glean from it. Take the NFL, for example. The league is now using mobile devices to track the X/Y/Z coordinates of the ball and of every player on the field, to a resolution of inches. Multiply that by 18 stadiums, 32 teams and multiple games, and you can start to see the challenge facing the data scientists who have been hired to figure out how to put all that new data to good use.

It is evident to me that, in order to keep pace, data scientists will need to invest in ongoing education or risk falling short of expectations. What skills will it take to succeed? Strong computer science and programming fundamentals are a must; that has always been the case. But tackling larger, more complex data sets and taking advantage of the full power modern computers have to offer will also require knowing how to program in modern languages such as Python, R and MATLAB.

These languages have been used successfully in scientific computing and highly quantitative domains such as physics for years. For example, Python was used to improve the Space Shuttle mission design and has powered much of Google's internal infrastructure. Now data scientists can leverage them to provide insights to companies of all sizes in every industry.

But the work doesn’t end there. In addition to learning modern tools and techniques, professionals in the data science field must also move quickly to expand and solidify foundational math and statistics skills, as these provide the underlying theories for many of the new methods being used today.

Recommendation systems, for example, have become a primary way to discover relevant information in vast amounts of data. Leading organizations are building powerful systems on algorithms that sift through this data and discover patterns and latent structures. Examples include media recommendations by YouTube and Spotify and online dating suggestions by Tinder.

What professionals need to understand is that there are certain principles and algorithms for designing and developing recommendation systems, and they can be learned by examining these success stories. How is it possible, for instance, for Facebook to suggest, very accurately, whom to tag in any given photo based on people's photos?

You might be surprised to know that some of the seemingly complex methods behind these personalized systems have a somewhat simple, or at least very reasonable, explanation for how and why they work. If data scientists learn the principles and algorithms behind them, they can apply the same models and techniques to solve new and similar challenges.
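
To make that concrete, here is a minimal sketch of one well-known idea behind recommendations: user-based collaborative filtering with cosine similarity. It is not a description of how any of the companies named above build their systems. The users, items, ratings and function names are all invented for illustration.

```python
from math import sqrt

# Hypothetical data: user -> {item: rating}. Every name and number is made up.
ratings = {
    "ana":  {"song_a": 5, "song_b": 4, "song_c": 1},
    "ben":  {"song_a": 4, "song_b": 5},
    "cara": {"song_b": 1, "song_c": 5, "song_d": 4},
    "dev":  {"song_c": 4, "song_d": 5},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, ratings, top_n=2):
    """Rank items the user has not rated by the similarity-weighted
    average rating given by other users."""
    seen = ratings[user]
    scores, weights = {}, {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(seen, theirs)
        if sim <= 0:
            continue  # users with nothing in common contribute nothing
        for item, rating in theirs.items():
            if item in seen:
                continue
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda i: scores[i] / weights[i], reverse=True)
    return ranked[:top_n]


print(recommend("ben", ratings))
```

The whole method rests on one reasonable assumption: people who rated the same things similarly in the past are likely to agree in the future. Production systems add far more, but the core reasoning is no harder than this.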

What’s more, probability and statistics are also the basis for machine learning, including newer methods such as deep learning. Machine learning methods can be applied to almost any prediction problem, but doing so properly requires learning about regression and about how different modeling approaches relate, such as how graphical models and network models differ. Having a solid understanding of these things at a core level will no doubt give professionals a leg up in the future.
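
As a small illustration of how much of prediction comes down to basic statistics, the sketch below fits a straight line by ordinary least squares using nothing but means and sums. The function name `fit_line` and the toy data are assumptions made for this example; real problems involve noisy data and usually a statistics or machine learning library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalized,
    # since the 1/n factors cancel in the ratio).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data that follows y = 2x + 1 exactly (invented for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6.0 + intercept  # predict y for an unseen x
print(slope, intercept, prediction)
```

The slope is simply the covariance of the two variables divided by the variance of the input, which is why a firm grasp of those two concepts carries over directly to more elaborate models.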

Prediction will only become more important as big data and its supporting technologies evolve. Familiar approaches may become outdated, even obsolete, as people change the way they make decisions, exhibit preferences or take actions. Data scientists must move quickly to acquire the skills and knowledge necessary to keep pace. By gaining a greater understanding of data science fundamentals, professionals will be well prepared to address their company’s most complicated data analytics challenges in years to come.
