A few years ago, I argued that Web-based information resources should be incorporated into the data warehouse.1 In fast-paced business environments, data from external sources can be more relevant to managing your business than data from internal systems. The challenge is to refine that information by discovering, acquiring, structuring and disseminating it in a systematic manner. This activity was labeled Web farming.

Although surfing the Web has become a popular (and occasionally useful) pastime, systematic use of Web-based resources for the data warehouse has not materialized. It is not that such resources are irrelevant to the business. The problem is that data delivery from Web-based resources has been neither reliable nor stable.

To obtain data, a program called a Web scraper or spider must "scrape" the HTML from a Web site and extract the data with handcrafted pattern matching. Moreover, such extraction is usually performed without the consent of the content provider, because the site is intended for browsing, not for application processing or database storage.
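To illustrate why such scrapers are fragile, here is a minimal sketch of handcrafted pattern matching against a hypothetical product-page fragment (the markup, class names and values are invented for illustration; a real site can change its layout without notice and silently break every pattern):

```python
import re

# Hypothetical fragment of a product page. Real sites change this
# markup without warning, which is what makes scraping brittle.
html = """
<div class="product">
  <span class="title">Web Farming for the Data Warehouse</span>
  <span class="price">$44.95</span>
</div>
"""

# Handcrafted patterns tied to the current markup.
title = re.search(r'<span class="title">(.*?)</span>', html).group(1)
price = float(re.search(r'<span class="price">\$([\d.]+)</span>', html).group(1))

print(title, price)
```

Any cosmetic redesign of the page, such as renaming the `price` class, would cause these patterns to fail, which is the reliability problem described above.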

Tim O'Reilly, founder and president of O'Reilly & Associates, remarked that "Web spiders are becoming ubiquitous, as hackers realize they can build unauthorized interfaces to the huge Web-facing databases behind large sites."2 O'Reilly's book publishing company has evolved a spider program called amaBooks that has become a valuable marketing tool. Every three hours, it scrapes megabytes of HTML from the Amazon Web site to extract a few data items about books by O'Reilly and its competitors. Over time, the resulting data warehouse can reveal trends in pricing, popularity and market share by publisher and topic. New trends appear within days, rather than months. This is definitely a competitive advantage. As O'Reilly states, this "far outstrips what's available from traditional market research firms." Unfortunately, this valuable tool rips data from the Amazon Web site in a way not intended by its designers.

With the adoption of Web services (WS) by content providers, this situation is rapidly changing. On April 11, Google, a major Web search service, began to offer WS access via a published API into its main database of more than 2 billion Web documents.3 It is a beta-test experiment at the moment, with free access limited to 1,000 inquiries per day and no promises for the future.

The services supported are typical search queries, retrieval of archived Web pages cached by Google and spelling correction of phrases. For instance, the phrase "Bill Clinten" is corrected to "Bill Clinton." A discussion forum foams with techie jargon among developers exploiting Google's riches. This initial enthusiasm must mature into proven business applications that tap new revenue streams for Google.

The Google Web service experiment is a good model for other content providers. The modus operandi is: publish the service definition in Web Services Description Language (WSDL); require registration that generates a unique access key; limit usage while making no service promises; establish a discussion forum to build community; and watch carefully for innovative applications. This is a low-cost, low-risk approach to researching the business development opportunities of a commercial WS offering.
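The registration, key-generation and usage-governor pieces of that modus operandi can be sketched in a few lines. This is a toy illustration, not Google's actual implementation; the class and method names are assumptions, and the limit simply mirrors the 1,000-inquiries-per-day cap mentioned above:

```python
import secrets
from collections import defaultdict

DAILY_LIMIT = 1000  # mirrors the 1,000-inquiries-per-day cap

class AccessRegistry:
    """Toy registration, key generation and usage governor."""

    def __init__(self):
        self.keys = set()
        self.usage = defaultdict(int)  # access key -> inquiries today

    def register(self, email):
        # A real service would verify the e-mail address before
        # issuing a key; here we just mint a random token.
        key = secrets.token_hex(16)
        self.keys.add(key)
        return key

    def allow(self, key):
        # Refuse unknown keys and enforce the daily governor.
        if key not in self.keys or self.usage[key] >= DAILY_LIMIT:
            return False
        self.usage[key] += 1
        return True

registry = AccessRegistry()
key = registry.register("developer@example.com")
print(registry.allow(key))      # True: a registered key is admitted
print(registry.allow("bogus"))  # False: an unregistered key is refused
```

Activity monitoring falls out of the same structure: the `usage` counts are exactly the per-key data a provider would watch for innovative (or abusive) applications.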

What are the implications? If you are a content provider, the implications are clear. Try it (using the Google model); you'll like it!

If you are a major corporation with a stable Web presence, consider opening a portion of your data warehouse to public WS access (using the Google model). You might be pleasantly surprised, discovering new revenue sources and greater brand recognition. Remember the lesson of American Airlines and the Sabre Reservation System: The information business created greater wealth than the core business. Currently, creating extranets with known business partners is very popular. However, by opening WS access to your data warehouse, you may cultivate partners in ways you could never have imagined.

If you are a major BI vendor with strengths in ETL middleware and DW infrastructure, consider augmenting your offerings with WS access capabilities. Note the special requirements of the Google model, such as usage governor, e-mail registration, key generation and activity monitoring. Provide the full toolkit to your DW clients.

If you are a BI systems integrator, survey your existing clientele, particularly those with a rich DW environment and stable Web presence. Suggest a business development program based on the Google model, supplying both the technical and marketing know-how.

These developments will be exciting. It is clear that Web farming for the data warehouse is coming! This could be the year.


  1. See http://www.webfarming.com/ and Hackathorn, R. Web Farming for the Data Warehouse, San Francisco: Morgan Kaufmann, 1999.
  2. See http://www.oreillynet.com/lpt/a//network/2002/04/09/future.html.
  3. See http://www.google.com/apis/.
