Deloitte Director Greg Szwartz outlines the following six things you need to consider when approaching a big data integration project.
Data Architecture
There are many decisions to be made when designing an information architecture for big data storage and analysis. These include choosing commodity or special-purpose hardware; relational or non-relational data stores; virtualized on-premise servers or external clouds; in-memory or disk-based processing; and uncompressed data formats (quicker access) or compressed ones (cheaper storage). Companies also need to decide whether to shard – split tables by row and distribute them across multiple servers – to improve performance. Other choices include whether column-oriented or row-oriented processing will dominate, and whether to take a hybrid-platform or greenfield approach.
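To make the sharding decision concrete, here is a minimal sketch of hash-based sharding in Python. The server names and key format are illustrative assumptions, not part of any particular product:

```python
import hashlib

# Hypothetical shard servers; names are illustrative only.
SHARDS = ["db-server-0", "db-server-1", "db-server-2"]

def shard_for(customer_id: str) -> str:
    """Route a row to a shard deterministically by hashing its key."""
    digest = hashlib.md5(customer_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same key always maps to the same shard, so a point lookup
# touches only one server instead of the whole table.
home = shard_for("cust-42")
```

Because rows are split across servers, each machine holds only a fraction of the table, and queries keyed on the shard key can be answered by a single node.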
Column-oriented Databases
As opposed to relational, row-based databases, column-oriented databases store together values that share the same attribute, e.g. one record contains the zip codes of every customer. This type of data organization is conducive to performing many selective queries rapidly, a hallmark of big data analytics.
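The contrast between the two layouts can be sketched with plain Python data structures (the customer data here is invented for illustration):

```python
# Row-oriented layout: each record holds all attributes of one customer.
rows = [
    {"name": "Ada", "zip": "02139", "spend": 120.0},
    {"name": "Lin", "zip": "94103", "spend": 75.5},
    {"name": "Sam", "zip": "02139", "spend": 210.0},
]

# Column-oriented layout: values sharing an attribute are stored together,
# e.g. one array holds the zip codes of every customer.
columns = {
    "name":  ["Ada", "Lin", "Sam"],
    "zip":   ["02139", "94103", "02139"],
    "spend": [120.0, 75.5, 210.0],
}

# A selective query ("total spend in zip 02139") scans only the two
# columns it needs, skipping every other attribute of each record.
total = sum(s for z, s in zip(columns["zip"], columns["spend"]) if z == "02139")
```

In a real columnar engine the benefit is larger still, since each column compresses well and unneeded columns are never read from storage at all.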
In-memory Databases
Another way to speed up processing is to turn to database platforms that store data in main memory instead of on physical disks. This cuts down the number of I/O cycles required for data retrieval, aggregation and processing, enabling complex queries to be executed much faster.
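The idea can be demonstrated with SQLite's in-memory mode, available in Python's standard library; the sales table and its figures are invented for illustration:

```python
import sqlite3

# ":memory:" keeps the entire database in RAM, so retrieval and
# aggregation never wait on disk I/O.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)

# The aggregation below runs entirely against memory-resident pages.
total_east = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'east'"
).fetchone()[0]
conn.close()
```

Dedicated in-memory platforms add memory-optimized indexing and persistence guarantees on top of this basic idea.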
NoSQL Databases
In cases of semi-structured, inconsistent or sparse data, “Not Only SQL” (NoSQL) databases provide a foundation. They do not require fixed-table schemas, avoid join operations and can scale horizontally across nodes (locally or in the cloud). NoSQL offerings also come in many shapes and sizes, with open-source and licensed options designed with the needs of various social and Web platforms in mind.
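A minimal sketch of the document model many NoSQL stores share, using plain dictionaries: records need not carry the same fields, and lookups match fields directly rather than joining tables. The documents and the `find` helper are hypothetical, not any particular product's API:

```python
# Schema-less "collection": each document is a free-form dictionary,
# so sparse or inconsistent fields are simply absent rather than NULL.
documents = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"_id": 2, "name": "Lin", "twitter": "@lin"},           # no email field
    {"_id": 3, "name": "Sam", "email": "sam@example.com",
     "tags": ["vip", "beta"]},                              # extra fields are fine
]

def find(collection, **criteria):
    """Return documents whose fields match all criteria (no joins)."""
    return [d for d in collection if all(d.get(k) == v for k, v in criteria.items())]

matches = find(documents, name="Lin")
```

Scaling such a collection horizontally is straightforward precisely because documents are self-contained: any node can answer a query about the documents it holds without consulting other tables.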
Database Appliances
These are self-contained combinations of hardware and software designed either to extend the storage capabilities of relational systems or to provide an engineered platform for new big data capabilities such as columnar, in-memory databases.
Hadoop/MapReduce
This technique is used to distribute the computation of large data sets across a cluster of commodity processing nodes. Because the workload is reduced into discrete, independent operations, processing can be performed in parallel, allowing some workloads to be delivered most effectively via a cloud-based infrastructure.
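The map/reduce split described above can be sketched with the canonical word-count example, here run single-process in Python for illustration; on a real cluster each map call would execute on a separate commodity node:

```python
from collections import defaultdict
from itertools import chain

def map_phase(record: str):
    # The map step is independent per record, so records can be
    # processed in parallel across many nodes.
    return [(word, 1) for word in record.split()]

def reduce_phase(pairs):
    # The reduce step aggregates the emitted pairs by key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

records = ["big data", "big cluster", "data data"]
result = reduce_phase(chain.from_iterable(map_phase(r) for r in records))
```

Because no map call depends on any other, the framework is free to schedule them on whichever nodes are available, which is what makes the model such a natural fit for elastic, cloud-based infrastructure.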