Harvard Business Review proclaimed the data scientist “The Sexiest Job of the 21st Century,” and for good reason: McKinsey & Company projected that, in the U.S. alone, we will face an excess demand of “1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”
From the moment the term “data scientist” was coined in 2008 by D.J. Patil and Jeff Hammerbacher – then the big-data leads at LinkedIn and Facebook, respectively – the business world has attempted to capitalize on this game-changing global trend.
But dig deeper and it’s not just about “big data.” It’s about doing something useful with data, however big. Turning raw data into useful data depends a lot on hiring skilled data scientists, but it also depends on giving them the right environment to put those skills to use. That environment is what many organizations have neglected.
Within the datasets available to organizations lie answers to some of the most pertinent questions and ways to drive and validate important decisions. But how do you get there quickly and consistently when there are anywhere from megabytes to petabytes of data between you and the answers you need?
The challenge with data is that businesses don’t know what they don’t know. Data scientists won’t be able to find answers if they aren’t empowered to be scientists – to observe, hypothesize, experiment and develop theories. Put simply: organizations need a data laboratory.
How to Create a Playground in the Cloud
IT is a resource-constrained organization, which means that dedicating physical servers and support to a laboratory with uncertain ROI may be challenging. In addition, experimentation by its nature ebbs and flows, which reduces the need for permanent resources. Against that backdrop, the cloud just might be the perfect place for your data science experimentation.
A data lab designed for the cloud can work with large quantities of data, integrate well with existing technologies and handle complex machine-learning tasks. It can also be quickly stopped or repurposed. It is flexible enough to allow the frequent adjustments required by research, but it is also designed for production workloads. With a click of a button, the entire technology stack can be replicated, scaled and rolled into production.
As I see it, you cannot build a powerful cloud data lab without three main components:
- Data lake – a Hadoop-based storage repository where structured and unstructured data can be stored side by side.
- Apache Spark cluster – a scalable compute cluster capable of machine learning, graph processing and statistics. Preferably, it should also integrate a familiar notebook interface for the Scala, R and Python programming languages.
- Docker-based containers – used to multiplex machines and run hundreds of independent workloads and applications on the same physical hosts without compromising performance or touching operating systems.
These components can help a data scientist conduct data research at scale without the need to involve IT, but within an environment IT can fully control.
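To make the idea of a replicable stack a little more concrete, here is a minimal sketch in Python of how such a lab might be described as a single specification that can be copied, scaled and flagged for production. It is an illustration only: the `LabSpec` class, its field names, the storage path and the image tag are all hypothetical and do not come from any particular cloud product or API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LabSpec:
    """Hypothetical description of one data lab environment."""
    name: str
    data_lake_path: str    # where structured and unstructured data sit side by side
    spark_workers: int     # size of the Spark cluster
    container_image: str   # Docker image carrying the notebook and libraries
    production: bool = False

    def scaled(self, workers: int) -> "LabSpec":
        """Return a copy of the same stack with a different cluster size."""
        if workers < 1:
            raise ValueError("a Spark cluster needs at least one worker")
        return replace(self, spark_workers=workers)

    def promoted(self, name: str) -> "LabSpec":
        """Return a copy of the same stack flagged for production use."""
        return replace(self, name=name, production=True)


# An experiment starts small (path and image tag are placeholders).
lab = LabSpec(
    name="churn-experiment",
    data_lake_path="hdfs:///lake/events",
    spark_workers=2,
    container_image="example/notebook:latest",
)

# The identical stack is later scaled up and rolled into production.
prod = lab.scaled(20).promoted("churn-scoring")
print(prod)
```

The specification is immutable, so the experimental lab and its production copy never drift apart by accident: scaling or promoting always yields a new description and leaves the original untouched.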
A 2016 McKinsey study shows that big-data investments yield a multiple of 1.4 to 2.0 on the level of investment, increasing profits 6 percent on average. The study also shows that rapid experimentation and learning are critical to effectively leverage big data, and that early adopters have a clear advantage, especially since less than 1 percent of all data collected today is ever analyzed and used.
If a company is fortunate enough to find the data scientists required to leverage its data, it must make sure they have the tools necessary to do the best possible work.
Regardless of industry, the data scientist belongs in a data laboratory.
(About the author: Lucas Roh is the chief executive officer at Bigstep. Prior to Bigstep, Lucas founded Hostway, a global hosting company.)