Data has been the U.S. Bureau of Labor Statistics' stock in trade since 1884. And as analog gave way to digital, the BLS evolved from in-person surveys recorded on paper to collecting information by telephone, fax and email. Now, in an era of big data, the bureau is relying more on the Internet, databases and analytics to produce the reports that government officials and business leaders use to gauge the health of the economy and labor force.

“We’ve really moved away from doing a lot of face-to-face collection to much more electronic collection ... and getting data from alternative sources,” says deputy commissioner Bill Wiatrowski. “Where we’re able to get data directly off the web or directly from employers in electronic form, that really helps us to speed up the process of getting the data in the door.”

All in, the BLS handles some 720 terabytes of information—and counting.


A unit of the U.S. Department of Labor, the BLS gathers statistical data on the labor market, working conditions and price changes in the economy. It collects and disseminates data to government agencies and the general public. The BLS has eight offices across the country and about 2,400 employees (actually, being the BLS, the bureau provides very precise data: 2,004 full-timers and 401 part-timers as of May 15). That includes 393 IT specialists.

The bureau’s reports cover everything from employment trends to workplace injuries to the costs of consumer goods and include titles such as the Consumer Price Index, Producer Price Index, Employment Situation and Employment Cost Index. In sum, the BLS website lists more than 30 reports that it issues annually, quarterly or monthly.

With its vast mission, the BLS needs to be able to accommodate various data formats and feeds—making data collection one of its most complex challenges.

Data Collection

Data comes to the BLS from a wide variety of largely voluntary sources, from individuals to businesses to state governments. For instance, employment data comes from individual employers and state and local agencies. That data includes employment numbers, salary levels, job listings, workplace injuries and more.

Employers provide numbers through phone calls, fax, a BLS Electronic Data Interchange (EDI) connection or the BLS Internet Data Collection Facility website.

Data is increasingly collected electronically, resulting in faster data collection and better data quality. Jay Mousa, associate commissioner for field operations, points to the Survey of Occupational Injuries and Illnesses as a particular example.

“We have almost 85 percent electronic collection,” Mousa says. The annual report, based on data most employers are legally mandated to provide, is available about six months earlier than in the past, an improvement the bureau attributes in large part to Internet data collection methods that eliminated a considerable amount of data entry and associated data editing.

BLS is also embracing new data sources, says David Friedman, associate commissioner for prices and living conditions. For the CPI, for example, retail pricing data is now widely available online and, in some cases, the digitization of data has improved collaboration with businesses.

“We’ve looked at alternative sources of data for quite some time, but it’s only been in recent years that we’ve really been exploring the really large datasets,” Friedman says, such as those coming in from some retailers.

The bureau is also now collecting new, supplemental data, such as baggage fees from the airlines. These extra charges have become common in recent years, but weren't showing up through the BLS's collection methods. “We tend to get [fares] off the internet, but there’s a dataset that is available from the Department of Transportation that has baggage fees for all airlines, and so we get that dataset, too.” Economists incorporate the baggage fees alongside data about airline ticket prices to produce a more accurate measure of travel costs.
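As a rough illustration of that last step, the sketch below adds a per-airline baggage fee to each collected fare before averaging. It is not the bureau's methodology: the function name, the airline labels, the dollar amounts and the `bags_per_passenger` weight are all hypothetical, chosen only to show how a supplemental dataset can be folded into an existing price measure.

```python
from statistics import mean


def average_travel_cost(fares, baggage_fees, bags_per_passenger=1.0):
    """Average fare plus baggage fee across collected price quotes.

    fares: list of (airline, ticket_price) pairs, e.g. collected online.
    baggage_fees: dict mapping airline -> fee per checked bag, e.g. from a
        separate dataset. Airlines missing from the dict are treated as
        charging no fee.
    bags_per_passenger: hypothetical weight for how many bags a typical
        traveler checks.
    """
    totals = [
        price + bags_per_passenger * baggage_fees.get(airline, 0.0)
        for airline, price in fares
    ]
    return mean(totals)


# Made-up numbers for illustration only.
example_fares = [("Airline A", 200.0), ("Airline B", 300.0)]
example_fees = {"Airline A": 25.0, "Airline B": 35.0}

if __name__ == "__main__":
    print(average_travel_cost(example_fares, example_fees))  # 280.0
```

With fares alone, the average in this toy example would be 250.0; including the fees raises it to 280.0, which is the kind of gap the supplemental data is meant to close.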

The Data Engine

The BLS does not have a single, centralized database, but uses a collection of mostly Oracle databases, with a few legacy Sybase databases, for data review and processing. The BLS’s infrastructure also uses blade servers and enterprise storage systems. Data collection review interfaces are typically developed using Java or .NET.

Once collected, data is ready for estimation or aggregation. Java- or SAS-based processing solutions extract and process the data, then store the results securely back in the Oracle databases. Once the results have been reviewed, applications developed in Java or .NET, sometimes combined with SAS analytics software, are used to create the various publication outputs.

But, Friedman says, there are many other technologies and tools used throughout the various reporting programs, and across research and production work.

“To a certain extent, it depends on each survey and what the measurement objectives are of a particular survey,” he says. “When you go into the individual programs, the kinds of software languages that are being used varies between programs, but for estimation, we’re a pretty big SAS shop in general.”

Looking to the future, Wiatrowski says the BLS will continue to adopt alternative data sources and computer-assisted text analysis. “For example,” he says, “we are actively exploring the use of web scraping, APIs, and other techniques to capture prices from retailers.”

The BLS makes all of its data available to the public. Most is provided through Labstat, the bureau's central data repository and its database of record, which generates HTML tables; the data can also be downloaded into spreadsheets. Large volumes of data can be more readily obtained via the BLS API.
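For readers who want to try the API, here is a minimal Python sketch of building a request and reading a response. It assumes the version 2 endpoint and JSON layout described in the BLS's public API documentation (a `seriesid` list with `startyear` and `endyear`, and a `Results` object holding `series`); check the current documentation before relying on it. The helper names are hypothetical, the series identifier is only an example, and the sample response is hand-written with placeholder values, so the block runs without a network connection.

```python
import json

# Assumed endpoint for version 2 of the BLS Public Data API.
BLS_API_URL = "https://api.bls.gov/publicAPI/v2/timeseries/data/"


def build_payload(series_ids, start_year, end_year):
    """Return the JSON body for a multi-series request (assumed field names)."""
    return json.dumps({
        "seriesid": list(series_ids),
        "startyear": str(start_year),
        "endyear": str(end_year),
    })


def parse_series(response_text):
    """Flatten a response into (series_id, year, period, value) tuples."""
    body = json.loads(response_text)
    if body.get("status") != "REQUEST_SUCCEEDED":
        raise ValueError(f"request failed: {body.get('message')}")
    rows = []
    for series in body["Results"]["series"]:
        for obs in series["data"]:
            # Values arrive as strings; this sketch assumes they are numeric.
            rows.append(
                (series["seriesID"], obs["year"], obs["period"], float(obs["value"]))
            )
    return rows


# Hand-written sample in the assumed layout. The values are placeholders,
# not real statistics.
SAMPLE_RESPONSE = json.dumps({
    "status": "REQUEST_SUCCEEDED",
    "message": [],
    "Results": {
        "series": [
            {
                "seriesID": "CUUR0000SA0",
                "data": [
                    {"year": "2016", "period": "M01", "value": "100.0"},
                    {"year": "2015", "period": "M12", "value": "99.5"},
                ],
            }
        ]
    },
})

if __name__ == "__main__":
    print(build_payload(["CUUR0000SA0"], 2015, 2016))
    for row in parse_series(SAMPLE_RESPONSE):
        print(row)
```

In practice the payload would be sent as an HTTP POST to `BLS_API_URL` with a JSON content type, and the response text passed to `parse_series`.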

Increasingly, the BLS is using visualizations. The Economics Daily (TED) and Spotlight on Statistics stand out as two of the most visually oriented publications, according to Jay Meisenheimer, chief of the BLS's Division of New Media.

“As the technology became more widespread,” says Meisenheimer, who also serves as TED's managing editor, “the user expectations changed in terms of having not just an image, but an image with some interactive features that would give you a little bit more data than you otherwise could get in just a regular static image.”

Interactive charts and maps are created with a JavaScript-based tool called Highcharts, while static map images usually are made with ArcGIS and static charts are made with Excel. Deputy commissioner Wiatrowski notes that the BLS is increasingly using interactive charts because they help tell more complete and interesting stories with data. “The interactive features also give readers more power to focus on data that interests them,” he says, “and not just what the authors focused on.”

Mousa calls the bureau's combined approach to data collection “high tech, high touch.”

“The high tech is providing [respondents] with options to supply us data electronically, to make it easier for them, reduce their burden,” Mousa says. “And then the high touch is explaining the survey to them, reducing their burden, bringing them on board with the participation, and it saves them and us a lot of work down the road.”
