SEP 1, 2005 1:00am ET

Solving the Challenges of Exponential Data Growth

What are some of the challenges facing IT departments today? One way to answer that question is to look at what IT has to deliver: mission-critical applications. Examined in depth, the architecture underlying these applications can be described using Bill Inmon's Corporate Information Factory (CIF). Designed to deliver business intelligence and business management capabilities, the CIF is a technical architecture driven by data from business operations. It has proven to be a stable architecture for enterprises of any size building strategic and tactical decision support systems (Imhoff, 1999). Within the factory, large volumes of data often have to be moved from application to application or from repository to repository, creating a flow of data across the enterprise.

Figure 1: Bill Inmon's Corporate Information Factory

Figure 1 shows how data volumes can grow even as demands on the data increase. This happens, for example, whenever data is processed and put into another form or database. For an online data store, an efficient delta processing capability is important. For additional business intelligence data marts, aggregating the warehouse data into summary tables can easily take many hours. Managing all of this data movement, and the time it takes to process all of the information, creates a new set of IT challenges.
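To make the idea of delta processing concrete, here is a minimal sketch in Python. It assumes two snapshots of the same data set held in memory and keyed by a record ID; the function name, the record layout and the sample values are all hypothetical, and a production system would more likely work from change logs or sorted files than from in-memory dictionaries.

```python
def compute_delta(previous, current):
    """Compare two snapshots keyed by record ID and return only the changes."""
    # Records present now but not before.
    inserts = {k: v for k, v in current.items() if k not in previous}
    # Records present in both snapshots whose values differ.
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    # Records that disappeared since the previous snapshot.
    deletes = [k for k in previous if k not in current]
    return inserts, updates, deletes


# Hypothetical snapshots: record ID -> (customer, balance)
previous = {1: ("Alice", 100), 2: ("Bob", 250), 3: ("Carol", 75)}
current = {1: ("Alice", 100), 2: ("Bob", 300), 4: ("Dave", 50)}

inserts, updates, deletes = compute_delta(previous, current)
```

Only the changed records need to be sent to the target store, so the volume moved depends on how much changed rather than on the total size of the data set.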

Increasing Demands on IT Departments

Due to the growing data volumes and business need for more information and analysis, there has also been an increase in the number of applications being developed and supported in most organizations. This has placed significant demands on IT professionals since there hasn't always been a corresponding increase in staff. Companies are also increasingly focused on controlling costs. Management wants to solve the challenges caused by data volumes and demand for real-time information with the minimum possible expenditure of money and resources.

In order to find a solution, it's first important to have a better understanding of where the volumes of data are coming from and why they are increasing. Here are some examples of areas that generate large volumes of data:

  • Banking, insurance, financial transactions - checks, ATMs, credit cards
  • Consumer buying behavior - supermarket check-out scanner data
  • Healthcare - pharmaceutical records
  • Communications - call detail records
  • Internet and e-commerce - Web logs, clickstream

How have these data volumes grown? From ATM withdrawals to deposited checks, consider how many banking transactions are conducted in a month. Or think about how many items are purchased at the supermarket using a bonus card. Then look at the number of visitors who browse the company Web site in a given day. Each time there is more data to aggregate, filter, reformat, analyze and so on. Of course, as the value of the information increases, it also becomes necessary to utilize more and more historical data.

If a business inherently involves large data volumes, whether the data consists of call detail records or Web clicks, there is the potential for performance bottlenecks in the CIF. Below is a list of some of the places where bottlenecks most frequently occur today:

  • Impact on I/O time and network resources of large volumes of data. To reduce this impact, a company needs optimized file and database access methods.
  • The aggregation of data to produce summaries at user query time, which is typically too slow for the immediate responses that users want. Because of this, many data marts are designed to precalculate and pre-store the summaries. This can take huge amounts of processing and data movement. Optimized aggregation algorithms are needed to enhance performance.
  • The amount of raw data that comes from Web servers. This data is typically very verbose; for example, URLs and CGI parameter strings contain large amounts of text, with only a few characters being of relevance to a specific application. Web data parsers can consume huge amounts of CPU processing. Optimized pattern matching is needed.
  • Databases that do not have highly optimized access methods for extracting data. Database vendors are more motivated to get data in than out, but the organizational data flow typically isn't that simple: usually there are numerous different databases, and the data needs to be moved among them. Optimized database access methods are needed to extract the data efficiently.
  • The amount of data for database loads. These loads involve index creation, which requires sorting. Because of the amount of data used in typical loads, the fastest possible sorting algorithms are necessary.
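Two of the bottlenecks above, verbose Web data and summaries that must be precalculated, can be illustrated together with a small Python sketch. It assumes a hypothetical log format in which only the `product` CGI parameter matters to the application; the pattern, the function name and the sample lines are invented for illustration and are not taken from any particular Web server.

```python
import re
from collections import Counter

# Assumption: only the "product" parameter is relevant; everything else
# in the URL is ignored. Compiling the pattern once avoids repeated setup.
PRODUCT_RE = re.compile(r"[?&]product=([^&\s]+)")


def summarize_hits(log_lines):
    """Scan each line once and pre-aggregate hit counts per product."""
    counts = Counter()
    for line in log_lines:
        match = PRODUCT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample of verbose request lines.
sample_log = [
    "GET /cgi-bin/shop?session=9f3a&product=A17&ref=home HTTP/1.0",
    "GET /cgi-bin/shop?product=B22&session=77c1 HTTP/1.0",
    "GET /cgi-bin/shop?session=12ab&product=A17 HTTP/1.0",
    "GET /index.html HTTP/1.0",
]

summary = summarize_hits(sample_log)
```

The summary is built in a single pass over the raw data and can then be stored, so a user query reads a handful of precalculated counts instead of rescanning every log line.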

Figure 2: Hardware Can't Solve Elapsed-Time Problems

IT departments continually encounter systems that were not designed for performance. Frequently this is because the system was implemented at an early stage of the business, when there wasn't much data volume. In other cases, the designers assumed that volume growth could be handled by simply upgrading the system or adding systems, or that hardware capacity would keep increasing the way it has for the last 20 years. There are several reasons why this typically doesn't work.

  • Runtimes are almost never linear with data volume growth - when the amount of data is doubled, the processing time more than doubles.
  • Hardware increases are non-linear in the other direction: doubling the number of CPUs doesn't double the throughput. Combined, these two effects mean that hardware upgrades deliver far less improvement than expected.
  • Adding processors only improves performance if the application was well parallelized to start with.
  • Increases in hardware capacity (faster CPUs, bigger disks, faster networks, grids) allow new applications to be automated. These new applications typically generate higher volumes of data than the old ones, and then the data needs to be processed into usable information.
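The first three points can be illustrated with a rough calculation. The sketch below rests on two simplifying models that the article itself does not name: an n log n cost for sorting, and Amdahl's law for the speedup from adding CPUs. The 80 percent parallel fraction and the record counts are hypothetical values chosen only for illustration.

```python
import math


def sort_cost(n):
    """Modeled cost of sorting n records, proportional to n * log2(n)."""
    return n * math.log2(n)


def amdahl_speedup(parallel_fraction, cpus):
    """Speedup predicted by Amdahl's law for a partly parallel workload."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cpus)


# Doubling the data from 1 million to 2 million records.
growth = sort_cost(2_000_000) / sort_cost(1_000_000)

# Doubling the CPUs from 4 to 8 for a job assumed to be 80% parallel.
speedup = amdahl_speedup(0.8, 8) / amdahl_speedup(0.8, 4)

print(f"Doubling the data multiplies modeled sort cost by {growth:.2f}")
print(f"Doubling the CPUs multiplies modeled throughput by {speedup:.2f}")
```

Under these assumptions the work grows by slightly more than a factor of two while the hardware gain is well under a factor of two, so the upgrade does not keep pace with the data. A less parallel application would see an even smaller gain.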