Back in 2010, Riley Newman was one of the first dozen employees at Airbnb, which refers to itself as “a trusted community marketplace for people to list, discover and book unique accommodations around the world.” Today, that online community has grown to 80 million guests and a globally dispersed staff of more than 2,000. In the process, Newman says, his role as Airbnb's chief of data science has evolved to help the online hospitality company maintain that sense of community by understanding its users en masse, while still serving them individually.

“Data helps you scale those connections,” Newman says, “helping the employee base understand the top concerns of guests and hosts. It keeps the whole community connected.”

Airbnb wrangles a lot of data: roughly 11 petabytes. Much of it, such as a guest’s lodging preferences and whether a host likes to be continuously booked or prefers having a few days free between visits, helps the online marketplace’s search algorithm determine the most likely match between guest and host.

Preferences of this sort fall into one of four data categories, Newman says:

  • Behavioral, which describes user behavior as they interact with the Airbnb website;
  • Dimensional, which covers user attributes including access device used, language and location;
  • Sentiment, which reflects lodging reviews, ratings and survey results;
  • Imputed, which infers user behaviors, such as “this guest always travels to big cities, whereas this other guest always travels to small coastal towns.”
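As a rough illustration of the four categories, the sketch below tags a few example signals with the category each would fall under. The `DataCategory` enum, the `Signal` record and the sample signal names are hypothetical; they are not drawn from Airbnb's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class DataCategory(Enum):
    """The four categories Newman describes (identifiers are illustrative)."""
    BEHAVIORAL = "behavioral"    # how users interact with the website
    DIMENSIONAL = "dimensional"  # attributes such as device, language, location
    SENTIMENT = "sentiment"      # reviews, ratings, survey results
    IMPUTED = "imputed"          # behaviors inferred from other data


@dataclass(frozen=True)
class Signal:
    """One hypothetical data point about a user."""
    name: str
    value: str
    category: DataCategory


def group_by_category(signals):
    """Return a dict mapping each category to the signal names it contains."""
    grouped = {category: [] for category in DataCategory}
    for signal in signals:
        grouped[signal.category].append(signal.name)
    return grouped


# Made-up sample signals for a single guest.
SAMPLE_SIGNALS = [
    Signal("search_clicks", "12", DataCategory.BEHAVIORAL),
    Signal("device", "mobile", DataCategory.DIMENSIONAL),
    Signal("language", "en", DataCategory.DIMENSIONAL),
    Signal("last_review_rating", "5", DataCategory.SENTIMENT),
    Signal("prefers_big_cities", "true", DataCategory.IMPUTED),
]
```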

The Hybrid Team Model

To collect, process and analyze all this data, Airbnb relies on a team of about 100 people. These include around 20 engineers who support the computing infrastructure and Newman's 80-person data science team. He compares the division of labor to a software stack: At the base is the infrastructure and the software engineers who focus on the maintenance of Hadoop and other core data processing applications.

The next layer consists of the engineers who build tools for manipulating all that data, followed by a layer of data engineers responsible for data warehouse and ETL (extract, transform, load) pipeline design. A level up from there are the data scientists who build data models and perform predictive analytics. And at the apex of the stack are data analysts who communicate the results of all these efforts across the Airbnb organization.

Rather than either centralizing its analytics team or embedding its data scientists within various business groups, Airbnb organizes data teams around “outcomes” or goals. Each outcome is pursued by a cross-functional team with a data scientist as its lead. But these team leads also work with more centralized groups of data scientists and engineers that support the goal-driven teams.

The cross-functional teams bring a long-term focus to specific business challenges. “That’s where some of the best ideas originate,” Newman says. “But some companies take it too far and go completely embedded, and then the data scientists have no career path and no opportunity to learn from each other. So we’ve tried to strike a balance [between] centralized and embedded teams.”

Data at Scale

Airbnb's data lives in Amazon's public cloud, on an infrastructure based on the Hadoop open-source programming framework for working with big data in distributed environments. The lodging company prefers working with open source tools and relies on a variety of Apache Project software.

Airbnb's HDFS (Hadoop Distributed File System) data is divided into two independent clusters, dubbed Gold and Silver. Gold is the “source of truth,” where the most important data lives, according to Newman. Silver is a replica, where data scientists and analysts can query without interfering with mission-critical jobs. It's also where most of Airbnb's machine-learning models run.
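The Gold and Silver split can be pictured as a simple routing rule. The sketch below is only an illustration of that idea: the workload labels and the `choose_cluster` function are hypothetical, and it simplifies by sending all machine-learning work to Silver, whereas the article says only that most of it runs there.

```python
# Hypothetical workload labels, invented for this sketch.
MISSION_CRITICAL = "mission_critical"
AD_HOC = "ad_hoc"
MACHINE_LEARNING = "machine_learning"


def choose_cluster(workload):
    """Pick a cluster name for a workload type under the Gold/Silver split."""
    if workload == MISSION_CRITICAL:
        # Gold is the source of truth and is reserved for critical jobs.
        return "gold"
    if workload in (AD_HOC, MACHINE_LEARNING):
        # Silver is the replica where queries cannot disturb critical jobs.
        return "silver"
    raise ValueError(f"unknown workload type: {workload!r}")
```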

Most ETL runs through the Python programming language and Hive, a data warehouse infrastructure built on Hadoop. Last year, Newman’s team also built an open-source product called Airflow, a workflow management tool that manages critical ETL jobs.
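A workflow manager of this kind treats a pipeline as a set of tasks with dependencies and runs each task only after the ones it depends on. The sketch below shows that idea using only Python's standard library; it does not use Airflow's own API, and the `run_pipeline` function and the three task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks, dependencies):
    """Run tasks in dependency order and return the order used.

    tasks: dict mapping task name -> zero-argument callable
    dependencies: dict mapping task name -> set of task names it depends on
    """
    # Give every task an entry so tasks without dependencies still run.
    graph = {name: dependencies.get(name, set()) for name in tasks}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        tasks[name]()
    return order


# A made-up extract -> transform -> load chain that records what ran.
run_log = []
EXAMPLE_TASKS = {
    "extract": lambda: run_log.append("extract"),
    "transform": lambda: run_log.append("transform"),
    "load": lambda: run_log.append("load"),
}
EXAMPLE_DEPENDENCIES = {
    "transform": {"extract"},
    "load": {"transform"},
}
```

Calling `run_pipeline(EXAMPLE_TASKS, EXAMPLE_DEPENDENCIES)` runs the three steps in extract, transform, load order, because each step waits for the one it depends on.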

“We use Presto [an open source SQL query engine created by Facebook] for the majority of our ad hoc analysis,” Newman says. “Presto then feeds a lot of our analytical tools.”

Spark, an Apache big data processing framework, is used in the Silver cluster for high-speed analytics related to machine learning. Two other important analytics tools are Airpal, Airbnb’s own open-source creation, and Presto, the open source SQL query engine created by Facebook, which allows users to run queries against the Silver cluster.

“We also use a number of dashboard technologies for custom visualizations,” Newman continues. These include industry-standard dashboards built in Tableau and custom visualizations built using the R, Python and JavaScript languages.

Another recent in-house creation is Knowledge Repo, described by Newman as one of his favorite projects. The application is a GitHub repo that provides a process and tool to make data research easier to find, reproduce, refine and learn from. The repo is accessed through Jupyter notebooks, a web app for creating and sharing documents that contain live code, equations and visualizations, and through files in R Markdown or plain Markdown format. A Flask web app then organizes these files as an internal blog.

“Data scientists and analysts are publishing analysis that product managers, engineers and others can review and comment on,” Newman explains. “This creates a dialogue around the state of knowledge relative to [a given] topic.”

That's critical, Newman adds, because finding code for a project that has been lying idle for months (or whose original engineer has left the company) creates confusion for any data science shop. But with Knowledge Repo, “You can check out a notebook and instantly rerun all the code,” he says. “You can figure out the point where you want to go a little bit deeper, and then expand upon what was done.”

On the horizon is a new tool called Caravel, which Newman describes as a complement to Airpal that will pull data into Druid (a Java-based, open source data store) to quickly provide rich visualizations. “It will help us slice and dice data [and] construct charts and graphs pretty quickly,” he notes.

A Stable Base for Experimentation

The outputs from Airbnb's big data machine have been considerable. While innovations are often driven by machine learning, Newman says they frequently originate from a human inspiration. Among the better-publicized examples:

  • Host preferences: An insight from a data scientist, who had himself been an Airbnb host, led to an improvement in the way that guests and hosts are matched. Experimentation with this approach led to a 4 percent rise in bookings once Airbnb's search algorithm took into account a prospective host's calendar and past preferences for being tightly booked or for buffering stays with a few days between guests.
  • Reviews and ratings: A data scientist had a hunch, later proven through experimentation, that giving guests and hosts a two-week time limit to write reviews, and keeping them unpublished until both sides had submitted, overcame hesitation about posting first or providing an honest but non-glowing review. The new process resulted in more and better-quality reviews.
  • Pricing tips: Driven by machine learning and more than 5 billion data points on regional demand, location and other lodging characteristics, Airbnb's pricing tips have led to a 17 percent greater likelihood of a listing being booked.
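To make the host-preference example concrete, the sketch below checks whether a requested stay leaves a host the buffer they like between guests. It is a simplified illustration under stated assumptions, not Airbnb's search algorithm: the `respects_buffer` function, its parameters and the sample dates are all hypothetical.

```python
from datetime import date


def respects_buffer(bookings, check_in, check_out, min_gap_days):
    """Return True if the requested stay avoids every existing booking and
    leaves at least min_gap_days free days between it and each of them.

    bookings: list of (start, end) date pairs, where end is the check-out day
    min_gap_days: 0 for a host who likes back-to-back stays
    """
    for start, end in bookings:
        if check_in < end and start < check_out:
            return False  # the request overlaps an existing booking
        if end <= check_in:
            gap = (check_in - end).days      # existing booking comes first
        else:
            gap = (start - check_out).days   # existing booking comes later
        if gap < min_gap_days:
            return False
    return True


# Made-up calendar: one existing booking from June 1 to June 5.
EXISTING = [(date(2016, 6, 1), date(2016, 6, 5))]
```

With this calendar, a stay starting June 5 suits a host who likes to be tightly booked (`min_gap_days=0`) but not one who wants two free days between guests; a stay starting June 7 suits both. A search ranker could use a check like this as one signal among many.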

Human hunches that lead to better business practices are common on his team, a phenomenon that Newman attributes to Airbnb’s outcome-oriented data teams. Since it’s never obvious how a team will reach its goal, it works to come up with a hypothesis. Then, he says, “You do as much analysis as you can with the data to estimate the impact, doing predictive analytics and that sort of thing, and then you give it a shot.”

So far, those kinds of data-driven insights have helped Airbnb connect more than 60 million guests with hosts in 34,000 cities and 190 countries. Like the underlying data points, those numbers are still climbing.