Many companies ask me how to integrate, analyze and act upon online data efficiently given its massive size: where to start, which solutions are viable and what skills are needed. The skills needed are typically tied to the solution of choice. Existing skills often drive the choice of solution, but in this case, do not let that constrain your organization. You need to be nimble and evolve quickly in an online environment.

Online data is powerful. It contains customer and prospect behaviors and interests that may be used to increase communication relevance within all channels. Unfortunately, most online data environments are extremely large and not well integrated with other data sources. In this article, I focus on the environment options for tackling the three steps that enable true online analytics. First, the online data must be integrated. Second, it must be analyzed. Finally, the insights must be made actionable within all channels.

Where to Start

There are four initial requirements:

  1. Determine the online data that is available and develop links (user IDs, cookies and customer IDs) to offline sources, so that an integrated environment is possible. Also, ensure that your overarching strategic goals (e.g., minimize churn, increase customer profit) drive this effort, as the goals will determine which data is needed and which supporting analytics are required.
  2. Online data is huge, but as with any other source, some of it is useful and some is not. Determine the attributes that are critical in phase one so that the initial data set is small and you can get started as soon as possible. Attributes can be added as you evolve. While database administrators might be cringing, remember: you should be nimble, and data models must evolve. In many environments, trying to digest all the initial online data will turn into a cyclical data nightmare with minimal business impact for a long period of time. Start small, learn from the initial set of data, develop quick-win goals, gain buy-in internally and then evolve.
  3. Do not discount anonymous user data. Anonymous users generate some of the most valuable data available, such as behavioral, interest and timing data. This may be used to develop robust segmentations and models that in turn will drive more relevant online communications to those prospects.
  4. Determine an effective and efficient way to capture and load the online data within one integrated environment. Options include receiving the data manually from your online analytic vendor, Web log processing, packet sniffing and newly available hybrid solutions like PION and Speed-Trap. The options will vary based on your environment and your online analytic provider. Regardless, consider carefully how the data will be sourced and loaded, because sourcing online data is cumbersome.
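
The first three requirements above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a prescribed implementation: the attribute names, the cookie-to-customer lookup and the sample records are all hypothetical, and a real environment would source them from its own online analytic provider and offline systems.

```python
# Hypothetical phase-one attributes; choose yours from your strategic goals.
PHASE_ONE_ATTRIBUTES = ("cookie_id", "timestamp", "page", "referrer")


def trim_event(raw_event):
    """Keep only the phase-one attributes of a raw online event."""
    return {key: raw_event.get(key) for key in PHASE_ONE_ATTRIBUTES}


def link_events(raw_events, cookie_to_customer):
    """Attach an offline customer ID to each event where a link exists.

    Events with no known link are kept and flagged as anonymous rather
    than discarded, so their behavioral data stays available.
    """
    linked = []
    for raw_event in raw_events:
        event = trim_event(raw_event)
        customer_id = cookie_to_customer.get(event["cookie_id"])
        event["customer_id"] = customer_id
        event["is_anonymous"] = customer_id is None
        linked.append(event)
    return linked


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    cookie_to_customer = {"ck-001": "cust-42"}
    raw_events = [
        {"cookie_id": "ck-001", "timestamp": "2010-03-01T10:15:00",
         "page": "/pricing", "referrer": "search", "user_agent": "n/a"},
        {"cookie_id": "ck-999", "timestamp": "2010-03-01T10:16:30",
         "page": "/products", "referrer": "email", "user_agent": "n/a"},
    ]
    for event in link_events(raw_events, cookie_to_customer):
        print(event)
```

Adding an attribute in a later phase means extending the hypothetical PHASE_ONE_ATTRIBUTES tuple, which is the kind of incremental evolution the second requirement describes.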

What Solutions are Viable

Three predominant models exist:

1. Separate relational database and analytic environments. For example, an Oracle or SQL Server database paired with an SPSS or SAS analytic solution.

  • Advantages: These solutions are well known in the database and analytic communities, and the skills are widely available, so the learning curve is typically small and the development timeline is shorter.
  • Disadvantages: This approach is difficult to scale to huge analytic data environments and could be costly depending on the database and analytic solutions chosen. In most cases, separate environments/servers will be needed for the database and analytics.

2. Database appliances, possibly coupled with an analytic solution. For example, a database appliance such as Netezza, Teradata, Neoview or Kickfire, coupled with either its own internal analytics, if available, or a separate analytic solution like SAS or SPSS.

  • Advantages: Database appliances are optimized for large data environments and the processing of large data for analytic efforts. These solutions are also typically preconfigured to a certain degree, which allows for a shorter development time and less maintenance compared to relational databases. Appliances are also starting to become better coupled with analytic environments, so the database may take on some of the production analytic tasks.
  • Disadvantages: Database appliances are less well known than RDBMS databases. However, they are becoming mainstream, so this is less of a concern. These solutions are typically the most costly, but sometimes justifiably so in very large data environments given the power.

3. Open source. Open source solutions are popular and cutting-edge. For example, Hadoop, an open source solution for scalable and distributed data storage and data processing, is growing in popularity as it has shown the ability to handle massive data while using cheap commodity hardware (computers) similar to cloud environments. R, an open source analytic solution, is widely considered one of the most robust analytic tools available. Because both solutions are open source, the two communities work together readily on integration. Other open source solutions that may help with additional tasks include the data integration toolkit Jitterbit and traditional open source databases such as MySQL.

  • Advantages: These solutions scale to massive data environments more efficiently than any other option to date and have very good analytic capabilities. The solutions are evolving quickly and integrate easily with cloud solutions, and the software is free.
  • Disadvantages: Open source may require a steep learning curve if companies do not currently have the skills in-house, and depending on the environment, this solution may require a farm of commodity hardware to maintain. Integration options with reporting/BI tools and analytic solutions other than R are limited at this point, but development is in the works.
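
To give a feel for the distributed processing model behind Hadoop, here is a minimal single-machine sketch of the map, shuffle and reduce pattern applied to page-view counting. It is plain Python, not the Hadoop API, and the record layout (a dictionary with a "page" key) is an assumption made for illustration. In a real cluster the framework would run the map and reduce steps in parallel across many machines and handle the grouping in between.

```python
from collections import defaultdict


def map_phase(log_records):
    """Emit a (page, 1) pair for every page-view record."""
    for record in log_records:
        yield record["page"], 1


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key to get the view count per page."""
    return {key: sum(values) for key, values in grouped.items()}


def count_page_views(log_records):
    """Run the three phases in sequence on one machine."""
    return reduce_phase(shuffle_phase(map_phase(log_records)))
```

For example, calling count_page_views on three records for "/a", "/b" and "/a" yields a count of 2 for "/a" and 1 for "/b". Because each map call looks at one record and each reduce call looks at one key, the work can be split across commodity machines without changing the logic.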

The proper solution varies based on a company’s environment, but some rules of thumb apply for small and very large environments. For small online data environments (fewer than 10 million page views per month), it may be best to go with a relational database and traditional analytic solution if the skills are already available, because the cost of a database appliance or the time to ramp up on Hadoop and R may not be justified. For large online data environments (more than one billion page views per month), it may be best to go with a database appliance or Hadoop solution given the size of the data and the need to embed analytics directly within the data environment.

Google, Yahoo! and Facebook, along with other online companies, are leading the charge in the use of open source online analytics. This is still a relatively new world with few off-the-shelf solutions available that can scale to organizations’ needs. Storing online data for basic aggregations and reports has been around for a while, but using the online data to affect customer and anonymous user interactions within channels is still evolving. This requires massive data processing capabilities with strong embedded analytic capabilities. Your environment is likely not as big as Google’s or Yahoo!’s, but you may benefit from their ongoing development and learning.
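
The page-view rules of thumb given above for small and very large environments can be written down as a simple starting-point check. This is a rough sketch only: the function name and return strings are hypothetical, the two thresholds are the ones stated in this article, and anything between them (or a small environment without the relevant skills) is deliberately left as a full evaluation.

```python
# Thresholds taken from the rules of thumb in this article.
SMALL_MAX_MONTHLY_PAGE_VIEWS = 10_000_000
LARGE_MIN_MONTHLY_PAGE_VIEWS = 1_000_000_000


def suggest_starting_point(monthly_page_views, has_rdbms_skills):
    """Return a starting suggestion, not a final recommendation."""
    if monthly_page_views < SMALL_MAX_MONTHLY_PAGE_VIEWS and has_rdbms_skills:
        return "relational database with a traditional analytic solution"
    if monthly_page_views > LARGE_MIN_MONTHLY_PAGE_VIEWS:
        return "database appliance or Hadoop"
    # The rules of thumb do not cover this range; evaluate every option.
    return "evaluate all three options"
```

The middle branch is intentionally unopinionated, since cost, skills and integration needs matter more than raw volume for mid-sized environments.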

The solution you choose may be one of these three options or a hybrid of them. Be sure to evaluate each option carefully. Insist on keeping open source in the evaluation mix even if your company does not have the skills in-house. Beyond the obvious benefit that open source solutions are free, they are powerful and evolving quickly. In the end, you may not choose open source, but the evaluation and learning process will better equip you to source and integrate online data, regardless of your solution of choice.
