Organizations achieve a number of tangible benefits by automating aspects of data engineering, the lengthy process of preparing data for consumption. They accelerate time to action, democratize data-driven processes, and free business users from having to wait on others before they can access data.

Historically, the necessity of data engineering was matched only by its tediousness. Preparing data for analytics and application use involved extensive wrangling that produced two undesirable side effects.

First, wrangling measures like cleansing, transforming, integrating and curating raw data traditionally monopolized data scientists’ time. Second, the complexity and lengthy duration of these tasks often alienated the business from using data.

However, a number of advancements in data engineering have now decreased data preparation time while increasing time for exploration and applications. By automating aspects of the wrangling process, expediting data quality measures, and making these functions both repeatable and easily shared with other users, alternative solutions to this problem are “empowering your more business type users with functionality that maybe would have only been available to a database administrator or DB doers,” explains Noah Kays, director of content subscriptions at Unilog, which offers a product information management platform.

A number of use cases for alternatives to manual data preparation support this view. By adopting alternative data preparation measures, The Associated Press doubled its production of data, more than doubled its customer reach, and reduced the cost of providing members with localized data drawn from national data sets.

Meanwhile, Unilog drastically accelerated the speed at which it extracted, integrated and compared data on hundreds of thousands of items for e-commerce sites without manual data engineering.

In both cases, users who were previously dependent on IT teams for data are now readily exploring data in a self-service manner to further their business objectives.

Data Integrity

Data integrity is essential to AP’s business model, which is largely based on locating national or regional public data sets, vetting them and making them available for AP members to identify local points of interest. AP has teams scattered throughout 100 countries in over 250 locations and provides this service to more than 3,000 publishers.

Data sets are typically structured spreadsheets that might have any assortment of quality problems. However, once AP’s data journalism team parses these data sets for what Troy Thibodeaux, AP’s interactive newsroom technology editor, termed “data integrity issues,” it can publish results to members on Data.world, a collaborative, integrated hub for distributing data. “Then the journalists from our member organizations can easily hop in and get to work analyzing it without having to do quite as much of the labor of the upfront costs of vetting the data,” Thibodeaux explained.
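AP has not published its vetting tooling, but the kind of checks Thibodeaux describes can be illustrated with a minimal Python sketch. The file and column names below, including the FIPS county code field, are hypothetical.

```python
import pandas as pd

# Minimal sketch of pre-publication vetting; file and column names are hypothetical.
df = pd.read_csv("national_dataset.csv")

issues = {
    "duplicate_rows": int(df.duplicated().sum()),
    "missing_values": {col: int(n) for col, n in df.isna().sum().items() if n > 0},
    # Flag malformed county identifiers (assumed to be 5-digit FIPS codes).
    "bad_fips_codes": int((~df["fips"].astype(str).str.fullmatch(r"\d{5}")).sum()),
}

print(issues)
```

A real vetting pass would go further, but the point is the same: catch integrity issues once, centrally, before members ever touch the data.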

Data Quality

Despite multiple means for retroactively assessing data quality, this data management domain remains a vital output of the engineering process. Ensuring data quality is an integral aspect of Unilog’s business, which is focused on building e-commerce websites for B2B customers and providing data to run them.

Unilog’s master catalog data for a sizable buying group contains 4 million items, a scale that complicates data quality tasks such as identifying duplicates. The company’s Excel deduplication procedure “worked OK when we had a couple hundred thousand items, but once we started getting a million, it became a bigger problem,” Kays recalled.

The primary problem was extracting data for quality measures, which sometimes consumed up to eight hours. But by leveraging Paxata’s Adaptive Information Platform, which is centered on multidimensional self-service data preparation, “that same extract took five minutes,” Kays said.

The automation capabilities of this approach include automatically formatting spreadsheets and the master catalog as well as using numerous comparison methods for deduplication. The solution also employs algorithms that normalize attributes according to Unilog’s specifications, as well as lookup functions for determining, for example, if a batch of new items is already in the master catalog under different descriptions.
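Paxata’s matching logic is proprietary, but the deduplication and lookup steps Kays describes can be sketched in a few lines of Python. The feed files and column names here are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical item feeds; the vendor's own matching logic is not shown here.
new_items = pd.read_csv("new_items.csv")       # columns: sku, description
catalog = pd.read_csv("master_catalog.csv")    # columns: sku, description

def normalize(text: pd.Series) -> pd.Series:
    # Normalize attributes before comparison: lowercase, strip punctuation, collapse whitespace.
    return (text.str.lower()
                .str.replace(r"[^a-z0-9 ]", "", regex=True)
                .str.split().str.join(" "))

new_items["norm_desc"] = normalize(new_items["description"])
catalog["norm_desc"] = normalize(catalog["description"])

# Deduplicate the incoming batch on the normalized description.
new_items = new_items.drop_duplicates(subset="norm_desc")

# Lookup: which new items already exist in the master catalog under a different description?
matches = new_items.merge(catalog, on="norm_desc", suffixes=("_new", "_catalog"))
print(matches[["sku_new", "sku_catalog", "norm_desc"]])
```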

Previously, these processes would necessitate extensive data engineering involving code or other methods. But with the new platform, “I just upload my file into Paxata, do one of my lookup functions in there and I have all the data at my fingertips—and I didn’t even have to open Excel,” Kays added.

Data Integration

Traditional data integration is frequently laborious and requires either building a common data model between sources or constructing joins between tables. Contemporary data engineering methods automate much of this process for greater speed and flexibility. Kays noted that intelligent algorithms can suggest ways to join tables through visual approaches, so “it’s just really extrapolating a lot of the DBA functionality to the level of a business analyst.”
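The vendor’s join-suggestion algorithm is not documented publicly, but a toy heuristic gives a feel for the idea: propose join keys wherever two columns’ values overlap heavily. Everything in this sketch, including the sample tables, is hypothetical.

```python
import pandas as pd

def suggest_join_keys(left: pd.DataFrame, right: pd.DataFrame, min_overlap: float = 0.5):
    """Naive join-key suggestion: pair up columns whose distinct values overlap heavily."""
    suggestions = []
    for lcol in left.columns:
        for rcol in right.columns:
            lvals, rvals = set(left[lcol].dropna()), set(right[rcol].dropna())
            if not lvals or not rvals:
                continue
            overlap = len(lvals & rvals) / min(len(lvals), len(rvals))
            if overlap >= min_overlap:
                suggestions.append((lcol, rcol, round(overlap, 2)))
    return sorted(suggestions, key=lambda s: -s[2])

# Example: two small tables that share customer identifiers under different column names.
orders = pd.DataFrame({"cust_id": [1, 2, 3], "total": [10, 20, 30]})
customers = pd.DataFrame({"customer": [1, 2, 4], "name": ["Ann", "Bo", "Cy"]})
print(suggest_join_keys(orders, customers))   # [('cust_id', 'customer', 0.67)]
```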

Scalable solutions that let users load all their data and see it in a single view aid integration efforts. Even better, once users “finalize their process they can automate it,” Kays revealed.

Modern integration methods also relate to flexible access. According to Thibodeaux, AP benefited from the large number of tools users have for manipulating data, including options as diverse as Excel, Python and Tableau.

“When we were doing this on our own, we had to make decisions about the technologies members would use [and] about the way we cast the data and presented it,” Thibodeaux said. “Data.world having all these [tool] integrations takes that question away.”

Data Discovery

Successful data engineering also aids data discovery, a key precursor to loading applications or analytics. Centralizing data in a single place significantly improves data discovery, and is the foremost way in which modern data engineering platforms impact this domain. AP members maximize this benefit by availing themselves of prewritten SQL queries designed for understanding local implications of national data sets.

“It gives them the power of SQL without having to know SQL,” commented Thibodeaux. The reusability of these prepared queries, in which journalists simply insert the name of their city or county, illustrates one of the most compelling facets of modern data engineering: work done once can be reused again and again.
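AP’s actual queries on Data.world are not reproduced here, but the pattern is straightforward: a query written once, with the locality supplied as the only variable. The database, table and column names in this Python sketch are assumptions.

```python
import sqlite3

# Hypothetical local copy of a national data set; table and column names are assumptions.
conn = sqlite3.connect("national_data.db")

# A reusable query: journalists change only the county parameter, never the SQL itself.
LOCAL_SUMMARY = """
    SELECT year, SUM(amount) AS total
    FROM grants
    WHERE county = ?
    GROUP BY year
    ORDER BY year
"""

for row in conn.execute(LOCAL_SUMMARY, ("Travis County",)):
    print(row)
```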

Both the Unilog and AP use cases show that once someone has done the difficult work of cleansing, qualifying and integrating data, both those data sets and that preparation work are reusable without having to be repeated manually. As Kays observed, modern platforms for data engineering automate aspects of integrating and preparing data. As a result, even business users reap the benefits of data-driven processes without lengthy, manual data engineering obstacles.
