The work of interpreting data to help decision-makers goes back some 5,000 years to the bureaucrats and businessmen of ancient Sumer. But dealing with the astronomical size and complexity of modern data sets requires a new, multifaceted set of computational, statistical, communication, and people skills, known collectively as “data science.”
Data scientists are among the most sought-after professionals today, and the need for them will only increase as our world grows ever more complex and interconnected.
As Mark A. Smith of Ventana Research blogged in January, a key challenge for organizations in 2014 is acquiring expertise in extracting the best possible insights from big data. That means finding people with a uniquely balanced set of data science skills that enable them to take on today’s challenge of making sense of petascale and larger data sets from the enterprise, the cloud, and the Web.
But what do these well-rounded data scientists look like, exactly, and where can they be found?
The function of the data scientist is, in a word, sensemaking — providing a clear understanding of an organization’s universe through data analysis, helping improve decision-making and supporting leadership.
The best data scientists also will be intensely curious and interested in discovering new insights. They will be creative in their approach to identifying and solving problems.
Their professional expertise rests on three pillars: (i) deep theoretical knowledge of statistics and computability, (ii) practical knowledge of diverse data science tools (and the ability to create them when needed), and (iii) an ability to communicate very complex technical material effectively to people with no technical background.
Specifically, well-rounded data scientists have the following skills:
- They can use a variety of existing data analysis tools (such as R, SPSS, or MATLAB) but also have the statistical knowledge and programming skills needed to build their own. Many data analysis problems can be solved by applying existing tools; data scientists should have a large toolbox that they can bring to bear. They also, however, need to know how to build new tools for novel problems; this knowledge will also help them use existing tools more creatively.
- They will be familiar with the many varieties of useful data — they will know how to collect relevant data of the right type for a given problem and how to structure and clean the data so it can be properly analyzed. Data in the real world is never provided on a silver platter, and data scientists must know how to hunt down the right data and prepare it for use.
- They will possess strong communication skills, including active listening, storytelling, and visualization abilities. Communicating analysis methods and results can be tricky, as there is not just a bottom-line answer, but also important caveats and limitations based on complex statistical assumptions and the nature of the data used. Data scientists must be prepared to aid decision-makers in refining and clarifying their thinking about a problem, helping them turn large and woolly business problems into specific and meaningful questions that data can give insight into.
- They will be comfortable working with diverse individuals from different backgrounds and with varied skill sets. A typical data science project brings together the decision-maker with statisticians, software engineers, strategy analysts, subject matter experts and more. Data scientists must be able to effectively work with all of these stakeholders and coordinate their work to reach an effective solution.
- They can quickly get up to speed on a new application area and determine how to properly approach data collection, analysis, and interpretation. Even data scientists working for a single organization will come across a wide variety of problems needing their help. They must be able to get a clear understanding of the unique nature of each new problem and not just rely on an abstract conception of how the data “should” look.
The first generation of data scientists was largely self-taught. They started from backgrounds in physics or other sciences, statistics, mathematics, or computer science, and learned the other necessary skills and knowledge along the way. But universities (including Illinois Institute of Technology) are now offering new multidisciplinary degree programs to teach students data science, which should help to take the guesswork out of finding data scientists.
These programs go beyond traditional degrees in statistics, mathematics, computer science, and business intelligence by teaching a broad set of both technical and soft skills to prepare students for careers in data science.
But not all such programs are created equal, and it is important to be aware of the differences among them.
Some programs focus on teaching students specific tool-based skills and application areas. These programs can produce graduates who can start work on well-defined projects fairly quickly. Other programs with deeper theoretical content produce graduates who will be able to more easily work outside of their initial comfort zone, and who can learn, grow and adapt as the field changes.
Similarly, programs that focus deeply on mathematical and computational content may produce more technically knowledgeable graduates. But unless those graduates entered the program with excellent communication skills already, they may not be well suited to real-world data science jobs, where communicating with nontechnical people is an essential part of the work. Programs that more fully integrate soft skills into the curriculum will produce better-rounded data scientists who can take on leadership roles.
Data science brings a distinct set of challenges to business communication. How does one explain statistical evidence and analytical results without oversimplifying or creating confusion? Students need to learn how to weave results into a coherent story, how to explain statistical assumptions and caveats clearly, and how to create data visualizations that give insight and are not just pretty pictures. The only way to learn these skills is by practicing them at the same time as learning the related technical material.
Finally, a critical component of any quality data science education program is practical experience, whether it is a student project, an internship program, or a guided practicum. Until they have worked on real-world data science problems, students will not fully understand how to perform central data science activities that cannot be taught in a classroom setting: struggling to define the analytical problem correctly, dealing with real data complexities and inconsistencies, and communicating results in a clear, enlightening, and satisfying fashion to nontechnical, nonacademic clients.
Our model is to place students into teams of two to four individuals who work on projects for industrial partners with academic guidance. In such team-based work, students must work together, exercising interpersonal communication skills. They can learn from and teach each other, seeing in the process how people with different talents and knowledge can together achieve more than they could as individuals. And by working with real clients from industry, students get direct experience and feedback on both their technical performance and their communication skills.
Understanding what makes for a good data scientist and how to evaluate different educational programs is essential for effective recruiting. There is no question that recruiting will become easier as the field matures and general standards for data science education start to emerge.
The field of data science will of course continue to evolve and change, as the nature and complexity of data continue to evolve and change. But one constant will remain: well-rounded data scientists will be needed to help us all make sense of our changing world.