Web content: The overlooked external data source

With the gold rush to mine external data from social media posts, open data sources, and syndicated data providers for improved diagnostic, predictive, and prescriptive analytics, many organizations are turning to yet another source: harvested web content.

No doubt there is untapped value in the billions of social media posts published each day, in the 10 million or so open datasets published by governments and other organizations, and in the hundreds of thousands of datasets aggregated and sold by thousands of data product companies. However, when it comes to understanding what’s happening in your industry, or any industry, one of the best and most up-to-date sources of information can be others’ websites.

Harvesting (or “scraping”) content from websites can yield invaluable insights into the activities of competitors, partners, and suppliers, such as product pricing, product specs and descriptions, job openings, newly introduced offerings, and so forth.

For example:

  • An investor in hotel properties generates and updates a reference dataset of daily hotel pricing for every hotel in over two dozen European cities over a 30-day horizon.
  • A major rental car company optimizes its pricing from harvested market data including base price, fees, insurance, and additional information across thousands of rental date and duration scenarios. It receives four billion data points weekly and can analyze the market across countries, destination cities, prices, vendors, and car types.
  • A travel marketing company that views attractions as a competitive differentiator extracts attraction data from competitor websites across 1000 cities to provide specialized recommendations for unique and niche attractions and events.
  • A ride-sharing company looking to broaden its offerings and expand into other markets harvests content and produces a weekly scorecard to track the rental car space. It collects 200,000 data points daily from routes in 25 key markets.
  • A major cruise line harvests content from online cruise booking websites to ensure ongoing competitive pricing across all dates for the upcoming calendar year.

Not all web content harvesting solutions are about pricing, however. The not-for-profit Thorn harvests suspicious online ads and uses text analytics to identify instances of child sex trafficking and exploitation. It has already identified hundreds of victims and dozens of sex traffickers.

Also, JPMorgan Chase has deployed a solution to automate the monitoring, extraction, and review of bank documents such as commercial loan agreements, reducing its time to analyze documents from 360,000 hours per year to just a few seconds. And SJV Criminal Data Specialists automates web data extraction to scour court records across thousands of US jurisdictions, automating and expediting criminal background research. In doing so, SJV has eliminated tens of millions of hours of manual records searching per year, saving over a million dollars, growing its business by more than 50 percent, and achieving near-perfect accuracy.

Until recently, however, the challenge with web content harvesting has been the manual effort behind accessing and gathering mostly unstructured information and converting it into something useful. Today, tools and services such as those from import.io and others exist for automating this effort, and most of the steps involved in web content harvesting are straightforward enough for any non-technical business professional to manage.

They include the following steps:

  • Identifying what content to retrieve. This involves selecting the URL or URLs of the websites you want harvested, and pointing-and-clicking on the specific data you want collected.
  • Extracting the content. Extraction processes run based on a schedule you set, pulling data from public-facing websites, from behind logins, and from pages requiring interactions such as drop-down menu selections.
  • Preparing the content. Extracted content typically is organized into a structured format, enabling it to be explored, refined, cleansed, and enriched with other data.

Then the content can be integrated into business applications and workflows via APIs, loaded into a data warehouse or data lake, or simply analyzed standalone.
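The extract-and-prepare steps above can be sketched in a few lines of code. This is a minimal illustration, not how any particular tool works: it assumes a hypothetical competitor page that lists offerings in repeated `<div class="product">` elements (real sites vary, which is why point-and-click tools record the equivalent selectors for you), and it uses only the Python standard library.

```python
from html.parser import HTMLParser

# Hypothetical page content; in practice this would be fetched on a schedule.
SAMPLE_PAGE = """
<html><body>
  <div class="product"><span class="name">Standard Room</span><span class="price">€129</span></div>
  <div class="product"><span class="name">Deluxe Room</span><span class="price">€189</span></div>
</body></html>
"""

class ProductExtractor(HTMLParser):
    """Collects one dict per product div (the 'extracting' step)."""
    def __init__(self):
        super().__init__()
        self.records = []   # structured output: list of dicts
        self._field = None  # which field the next text node belongs to

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if tag == "div" and cls == "product":
            self.records.append({})
        elif tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field and self.records:
            self.records[-1][self._field] = data.strip()
            self._field = None

parser = ProductExtractor()
parser.feed(SAMPLE_PAGE)

# The 'preparing' step: normalize prices into a numeric column so the
# rows can be loaded into a warehouse or joined with other data.
rows = [
    {"name": r["name"], "price_eur": float(r["price"].lstrip("€"))}
    for r in parser.records
]
print(rows)
```

The key design point is the separation of concerns the article describes: extraction produces raw strings keyed by field, and preparation turns them into typed, analysis-ready records.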

Automated web content harvesting may not be right for all situations. For low numbers of websites, low volumes of content, or harvesting scenarios that cannot be adequately mechanized, contracting individuals from a “mechanical turk” or freelance gig-worker site like Fiverr might be a more cost-effective and expedient solution. But automated harvesting is becoming a mainstream tool for companies looking to keep tabs on what’s happening outside their own four walls, and for all manner of data monetization opportunities.
