Every organization, from Internet behemoths to consumer packaged goods companies and, most importantly, governmental agencies, needs to make data-driven decisions. But the bigger challenge is finding a way to scale those data-driven decisions. That's where self-service data preparation solutions enter the picture.

In the corporate world, the big priority is achieving a competitive advantage. Companies in the top third of their industry in the use of data-driven decision making were, on average, 5 percent more productive and 6 percent more profitable than their competitors, according to a Harvard Business Review article ("Big Data: The Management Revolution").

In government, data-driven decisions are also taking hold. While the National Security Agency has always been a data-driven operation, other government agencies now realize they too need their own skilled people to sort and make sense of the flood of information. Failure to do so will ultimately overwhelm them and, worse, put citizens at greater risk, the New York Times reported, based on the recently released F.B.I. 9/11 Review Commission's report.

As our world began to transition heavily to an information-based economy, a term emerged that has been popular in organizations of every size: “information workers.” Information workers are a type of employee that relies heavily on having access to the right information to do their work, in contrast to task workers who are skilled at operating equipment such as in a manufacturing setting. As with all groupings, a stratification of skills emerged and today, sitting at the apex is the “data scientist”.

Hot Title

Harvard Business Review recently declared data scientist the sexiest job of the 21st century, and for good reason: scarcity means higher paychecks. There are only 200,000 legitimate data scientists worldwide, according to Kaggle, a platform provider for data science competitions. Moreover, data scientists with just two years' experience can earn between $200,000 and $300,000 a year, according to The Wall Street Journal.

The paychecks have been well-deserved as these specialists have accumulated a unique set of credentials, combining deep technical expertise in statistics and modeling with very specific business domain knowledge.

Limited Talent Pool

However, even when organizations can afford to hire a data scientist, candidates are few and far between. Why? Let’s start with how a data scientist is born: more often than not, they have a background in statistical modeling and imperative programming in languages like Python or R. While in school, they developed expertise in different types of regression modeling techniques, deep learning, classification and data management technologies, which have all become important parts of the data scientist’s tool kit.

As they entered the professional world, their role grew rapidly from predictive analysis and code writing to include data management skills such as data modeling. As a result, the data scientist became someone who could put tools and data together to create highly valuable business information. Suddenly, everyone was turning to these data scientists with the burning questions no one else could answer. Given enough time and resources, the organization would get its answers and business decisions could be made.

That Won't Scale

However, new pressures posed by big data make this a completely unscalable approach. The sheer volume of data being collected, and the ever-changing ways the data has to be transformed to meet different analytic requirements, make it impossible for any data scientist to keep up. At the same time, non-technical business teams have discovered the power of visualization tools and want to manage their own question-answer cycles without requiring assistance from model-builders, coders or scripters.

Beyond data scientists, the lines are blurring when it comes to who is responsible for managing the data and deriving its meaning. There are no fewer than six new and emerging roles within any organization, with data developers/engineers and business analysts being two of those, according to a recent Forrester webcast ("Information-Related Roles and Processes Are Changing: What Are You Doing About It?").

The pool of data developers and engineers is roughly three million worldwide. These individuals count data modeling as a core skill: data is in their DNA and the IT department is their home. Data developers have Excel, SQL, Microsoft Access and declarative dataflow diagrams down cold. They can work in declarative programming metaphors and draw dataflow mapping diagrams of what they want the system to do, but they don’t necessarily do a lot of coding. The challenges this group faces are similar to those of the data scientist. They understand the growing number of new data types, scale issues and new system requirements, but they don’t have the luxury of time since most of their work revolves around operational analytics. Data developers face another distinct disadvantage: they have to decipher the meaning of the data, but don’t necessarily have the business knowledge to interpret it accurately.

Self-service tools are a great solution here. They allow data developers to give business partners within their organization secure access to data that those partners can manipulate themselves, while retaining full governance capabilities so that usage can be tracked completely.

Business analysts, or Excel power users as they are often called, number 15 million to 30 million strong worldwide and are emerging as the new voice of the business. While they are not technical, they understand the data and the business issues better than anyone else. Data analysts started their journey by preparing reports and dashboards either as part of IT dedicated to a business unit, or within a business team. In both cases, they have outstanding domain knowledge since they are the ones getting asked data-dependent questions by their leaders. Because of this, they’ve spent hours handcrafting datasets, mastering VLOOKUPs and pivot tables and developing SQL skills, such as writing SELECT statements to help find answers.

Over the past 10 years, the role of the data analyst has also evolved. They fueled the success of the analytics and visualization vendor community, such as Qlik and Tableau, which automate and simplify manual analytic processes and allow them to quickly manipulate data without being highly technical.

Time to Adjust

So what exactly does it take to shift this army of millions to something closer to data scientists?

First, we need to remove the areas of friction remaining in the analytic process. Across the data-driven organization, there is still a portion of every data exercise that requires both technical expertise and time to complete: data preparation. Regardless of skillset, anyone who works with data will admit that the majority of their time, up to 80 percent, is spent cleaning, aggregating and organizing it prior to performing any visualization or analytics.

This has given rise to a new class of self-service data preparation solutions, which simplify, automate and reduce the manual, error-prone steps that used to slow down the analytic process. This new self-service data preparation toolset enables analysts, developers and data scientists to collaborate and dynamically govern the data integration, data quality and enrichment processes at scale. With the power of machine learning and sophisticated algorithms, users can be proactively guided through a process that helps them aggregate and enrich their data sets, identify patterns and relationships, and find and fix quality issues. Suddenly, all the time once spent manually preparing data can be spent on rich visualizations, discovery and storytelling.

Next, we must think about self-service data prep as part of the agile BI stack. We must isolate siloed activities that create lags between getting data, asking questions and finding answers. For most organizations, that lag occurs between the traditional ETL process and sharing data with the business. In a recent report, Gartner warned that “Data Preparation Is Not an Afterthought” and advised organizations to “use self-service interactive data preparation tools to enhance analyst productivity.”

Finally, we must empower those who actually work with the data. The data-driven organization now has access to better tools and automation across the entire analytic process. Data developers and data analysts now need to unleash new skillsets, combining their expertise in reporting and visualization with the bandwidth and motivation to take on the complexities of newer big data platforms such as Hadoop and NoSQL databases.

Self-Service: A Closer Look

Organizations should also examine how much time is spent on coding, drawing mapping diagrams, scripting and sampling. It is wise to consider adaptive data preparation solutions to help cut that time in half or more. A self-service data preparation tool should:

  • Provide an interactive and visual experience for non-technical users yet scale to big data volumes of billions of rows without needing to sample.
  • Allow users to work within an Excel-like interface with single-click manipulation.
  • Support model-free exploration – no schema definition up front.
  • Have probabilistic algorithms for clustering data and merging datasets.
  • Capture steps through end-user manipulation of data – rules are created transparently in the background into scripts that can be replayed, reordered or refined.
  • Offer visual representations of semantic and syntactic data quality.
  • Manage pervasive versioning and audit throughout the system.
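To make the idea of captured steps more concrete, here is a minimal Python sketch of how a tool might record each user manipulation as a named step and later replay or reorder those steps on new data. It is an illustration only, not how any particular product works; the names Step, Recipe, trim_names, drop_blank_names and upper_state are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, str]


@dataclass
class Step:
    """One recorded manipulation (hypothetical structure)."""
    name: str
    apply: Callable[[List[Row]], List[Row]]


@dataclass
class Recipe:
    """An ordered list of recorded steps that can be replayed later."""
    steps: List[Step] = field(default_factory=list)

    def record(self, name: str, fn: Callable[[List[Row]], List[Row]],
               rows: List[Row]) -> List[Row]:
        # Apply the manipulation now and remember it for later replay.
        self.steps.append(Step(name, fn))
        return fn(rows)

    def replay(self, rows: List[Row]) -> List[Row]:
        # Run every recorded step, in order, against a new dataset.
        for step in self.steps:
            rows = step.apply(rows)
        return rows


# Three illustrative manipulations a user might perform with single clicks.
def trim_names(rows: List[Row]) -> List[Row]:
    return [{**r, "name": r["name"].strip()} for r in rows]


def drop_blank_names(rows: List[Row]) -> List[Row]:
    return [r for r in rows if r["name"].strip()]


def upper_state(rows: List[Row]) -> List[Row]:
    return [{**r, "state": r["state"].upper()} for r in rows]


# The user works interactively on a sample; each action is captured.
sample = [{"name": " Ada ", "state": "ny"}, {"name": "  ", "state": "ca"}]
recipe = Recipe()
step1 = recipe.record("trim names", trim_names, sample)
step2 = recipe.record("drop blank names", drop_blank_names, step1)
step3 = recipe.record("uppercase state", upper_state, step2)

# Later, the same recipe is replayed on a fresh batch of data.
new_batch = [{"name": "Grace ", "state": "va"}, {"name": "", "state": "tx"}]
cleaned = recipe.replay(new_batch)

# Steps are plain data, so they can be reordered or refined afterward.
recipe.steps[0], recipe.steps[1] = recipe.steps[1], recipe.steps[0]
reordered = recipe.replay(new_batch)
```

Because the recipe is just an ordered list of named steps, it could also be versioned and audited alongside the data, which is one way to read the last bullet above.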

So while we may be some years off from realizing the true Information On-Demand dream state, self-service data preparation is taking its spot in the agile BI stack by removing the largest remaining area of friction in the analytic process. In doing so, millions of data analysts and developers will have the power to transform into a hybrid data scientist-analyst – with more time in their day to unleash expertise, imagination and business value.
