Across industry, enterprises are struggling to make sense of all the information they have. In recent years their focus on data has grown manifold, and data analytics has become more effective as enterprises have gained access to Big Data.
Big Data is a collection of data management applications that support multiple analytics uses. Analytics is not effective without the ability to manage large volumes of data, and so enterprises are exploring Big Data technologies such as Hadoop.
Hadoop is an open-source framework for storing data across distributed machines and processing it in parallel on clusters of commodity hardware. Analytics is relevant only if there is data to work on, and in today's world that data is huge.
This data, stored in various database management systems, can run to several petabytes. The scale is a moving target, and no fixed threshold defines Big Data as such.
The architecture of Big Data consists of several racks of storage nodes with many components. The architecture may vary from user to user depending on requirements. Commercial interest in this area can be gauged by the fact that private equity and various venture capital funds are investing in Big Data related initiatives.
Big Data investments are expected to grow at a CAGR of 17 percent over the next three years, eventually accounting for $76 billion by 2021.
Another interesting development is the acquisition of pure-play Big Data start-ups. Competition in this area among large ITES enterprises is growing, which explains the rush to found and acquire start-ups.
At this stage, hardware and infrastructure sales account for nearly 70 percent of total Big Data investment, driven by the requirements for large-scale servers, routers, gateways and a variety of network components.
In the following we outline important developments in Big Data, including technologies in data storage and processing, which have had a remarkable imprint on Big Data technology decision-making. Our observations are indicative in nature and shed light on the trajectory these developments are taking, but are not meant to be comprehensive and do not encompass all the facets of Big Data technologies.
As data keeps growing, so does the need for cost-effective solutions to store it and use it when necessary. In this context, the data lake has emerged as a practical and useful solution.
A data lake is a store that holds raw data in its native format using a flat architecture. Going forward, a growing share of software vendors will build products that read raw data from data lakes more cost-effectively and efficiently.
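The idea of holding raw data in its native format, with the schema applied only when the data is read, can be sketched in a few lines. The directory name and file contents below are hypothetical, and a real data lake would sit on distributed storage rather than a local folder:

```python
import csv
import json
from pathlib import Path

# Hypothetical data lake: a flat directory holding files in their
# native formats (CSV, JSON, plain text), with no schema imposed on write.
lake = Path("lake")
lake.mkdir(exist_ok=True)
(lake / "sales.csv").write_text("sku,qty\nA1,3\n")
(lake / "clicks.json").write_text('{"page": "/home", "n": 12}')

def read_raw(path):
    """Interpret a file on demand, in its native format (schema-on-read)."""
    if path.suffix == ".csv":
        return list(csv.DictReader(path.open()))
    if path.suffix == ".json":
        return json.loads(path.read_text())
    return path.read_text()  # anything else: return the raw text as-is

# Each file keeps its own shape until the moment it is read.
records = {p.name: read_raw(p) for p in sorted(lake.iterdir())}
```

The point of the sketch is that no table structure exists until `read_raw` runs; writers simply deposit files, which keeps storage cheap and flexible.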
Traditionally, SQL (Structured Query Language) has been the query language of Relational Database Management Systems (RDBMS). Relational databases rely on tables and columns to store and extract data.
More recently, NoSQL database management systems have evolved that do not rely on these structures and use more flexible data models. NoSQL is useful for storing unstructured data (images, video and audio recordings, text), which is growing more rapidly than structured (numeric) data and does not fit the relational form of RDBMS.
Common types of unstructured data include weblog files, chat files, cellphone messages, data from Internet of Things (IoT) devices, and video and images.
Let us look at a specific example where the capabilities of NoSQL database are used. A customer profile today has a variety of information that includes data from different touch points of the customer. Customer data could include images, recordings, text data, locations, web browsing history, customer links (friends & family) and customer service data.
This data is crucial to the enterprise in serving customers more effectively. NoSQL databases can extract and load specific portions of customer data much faster, because their distributed data management enables handling of large volumes. Data is stored not on one server but across multiple servers, providing a highly fault-tolerant system that keeps data available in real time.
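The flexible data model described above can be illustrated with a minimal, in-memory sketch. The customer fields and the naive `find` helper are invented for illustration; a real document store such as MongoDB would index and distribute these documents:

```python
# Hypothetical customer profiles: in a document-oriented NoSQL store,
# each record is a self-describing document and need not share a schema.
customers = [
    {
        "id": 1,
        "name": "A. Shah",
        "locations": ["Mumbai", "Pune"],
        "browsing_history": ["/home", "/offers"],
    },
    {
        "id": 2,
        "name": "B. Wong",
        "links": {"family": [1]},  # fields can differ from document to document
        "service_tickets": [{"ticket": 101, "status": "open"}],
    },
]

def find(collection, **criteria):
    """Naive query: return documents matching all given key/value pairs."""
    return [doc for doc in collection
            if all(doc.get(key) == value for key, value in criteria.items())]
```

Note that the second customer carries service tickets and family links the first lacks; a relational schema would force both records into the same set of columns, while the document model simply stores what each profile has.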
Challenges and Opportunities
No doubt, Big Data is a disruption to business and management with widespread impact. Its disruptive effects pose significant opportunities, challenges, and emerging implications for theory in information systems. It will also leave a remarkable imprint on research methods, since data abundance reduces the need for sampling.
There is a well-established tradition of depositing data into public repositories and of creating public databases. Executives find that the major issue is harvesting the value Big Data brings to businesses and enterprises.
While enterprises have understood that Big Data is not just one software or application, they wish to evaluate the effort and investments required in this initiative and the return on investment that it would bring. Its success would depend on business outcomes, talent availability, data and infrastructure.
Challenges in Big Data relate to parallel computing, clarity in understanding the systems and processes in the Big Data ecosystem (Hadoop), and data quality. Executives also encounter heterogeneity, inconsistency, and incompleteness of data, and have to prepare data in a usable form before insights can be drawn from it.
The demand for advanced features is forcing technology vendors to mature faster and traditional technology providers to close functional gaps quickly. In order to face constant technology innovation, many enterprises are hoping to lower operating costs while also integrating innovations through the use of Big Data.
Parallel Computing Framework
Another development that has really pushed Big Data into commercial use is parallel computing. We now have more powerful processors that are affordable and carry more storage capability, allowing analysts to hold massive datasets in memory for faster processing.
Exascale computing refers to computing systems capable of at least one billion-billion calculations per second. Parallel computing enables a cluster of computers to carry out many calculations simultaneously.
Large problems can often be split into smaller ones, which are then solved at the same time. This saves time by distributing tasks and executing them simultaneously, and it solves Big Data problems by distributing data and processing it across multiple computers.
The biggest advantage is that it harnesses the resources of a desktop computer and can scale up to clusters and cloud computing. Big Data ecosystems such as Hadoop provide parallel-computing features and are very useful tools for enterprises.
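The split-and-combine pattern described above can be sketched with Python's standard `multiprocessing` module. The dataset and chunk size are arbitrary; Hadoop's MapReduce applies the same idea across whole clusters rather than the cores of one machine:

```python
from multiprocessing import Pool

def partial_sum(chunk):
    """Worker task: compute the sum of one slice of the data."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Split the large problem into smaller ones...
    chunks = [data[i:i + 250_000] for i in range(0, len(data), 250_000)]
    with Pool(processes=4) as pool:
        # ...solve the pieces at the same time on separate processes...
        results = pool.map(partial_sum, chunks)
    # ...and combine the partial results into the final answer.
    total = sum(results)
    print(total)  # 499999500000, the same answer as sum(data)
```

The final combination step is cheap because each worker has already reduced its slice to a single number; this is the essence of why distributing the work saves time.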
Diversity of Applications
A variety of applications constitute the Hadoop ecosystem, and they have to work in close coordination. A major challenge is understanding how these individual applications work within the ecosystem and how Hadoop is implemented.
Understanding the logistics of moving data in and out of Hadoop efficiently is another issue executives face. These areas haven't received adequate attention, and enterprises hope to go beyond basic use of Hadoop and choose the right solution provider for their needs. A number of firms (such as Cloudera, Hortonworks, and Oracle) provide Hadoop implementations, and it is difficult for clients to choose among them.
Data is usually extracted from different sources – ERP, point of sale, radio frequency identification, biometric data capture systems, telecommunication data systems, etc. In such scenarios, data arrives in a variety of formats and levels of complexity.
Before any meaningful data analysis can happen, the challenge is to bring it all together in a uniform format. Data also needs to be consolidated, since it may arrive in pieces.
Software vendors have identified this as a rich business opportunity and created solutions around it. Data may fail to be 'ready' for business needs in several ways. For example, sales service representatives might misplace values among various fields.
Consumers are also likely to submit incomplete records via the web services of e-commerce platforms. In such cases, customer data needs to be edited and standardized for names and addresses. Likewise, product catalogue data needs to be standardized for product codes, brands, model numbers, catalogue numbers, etc.
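A minimal sketch of the kind of cleansing described above might look as follows. The raw records and the specific rules (name casing, five-digit zip validation, brand normalization) are invented for illustration; commercial tools apply far richer rule sets:

```python
import re

# Hypothetical raw records: incomplete and inconsistently formatted,
# as might arrive from web forms or point-of-sale systems.
raw_records = [
    {"name": "  john SMITH ", "zip": "10001-", "brand": "acme corp"},
    {"name": "Mary  o'brien", "zip": None, "brand": "ACME Corp."},
]

def standardize(record):
    """Return a cleaned copy: normalize names, validate zips, unify brands."""
    clean = dict(record)
    # Collapse stray whitespace and apply consistent title case to names.
    clean["name"] = " ".join(record["name"].split()).title()
    # Accept only a valid five-digit zip; flag everything else.
    zip_code = record.get("zip") or ""
    clean["zip"] = zip_code if re.fullmatch(r"\d{5}", zip_code) else "UNKNOWN"
    # Standardize brand strings so variants match during consolidation.
    clean["brand"] = record["brand"].rstrip(".").upper()
    return clean

cleaned = [standardize(r) for r in raw_records]
```

After this pass, both records carry the brand "ACME CORP", so the two variants would consolidate correctly in downstream analysis.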
To address these challenges, many software vendors, such as Trillium Software and WinPure, have created solutions for data cleansing and standardization. It is a rich opportunity for companies in the data-cleansing space.
Several enterprises among Fortune 500 state that their enterprises will reap immense benefits if they are able to utilize Big Data. This data driven world has the potential to improve the efficiencies of enterprises and improve the quality of our lives.
Enterprises seeking to harvest the value that Big Data brings to their business do face certain challenges, concerning data quality, redundancy, talent scarcity, and information usage at the corporate level.
Talent is scarce, as Big Data demands skills in newer technologies like NoSQL and parallel computing. Executives are generally unaware of how each application within the Big Data ecosystem interacts with the others and what value the applications bring to the business. A major barrier to implementing Big Data technology is therefore identifying which solution provider to choose.
(About the authors: Nitin Singh serves on the board of Intellution LLP as vice president for business analytics. Dr. Kee-hung Lai (Mike Lai) is associate professor at the Department of Logistics and Maritime Studies, Faculty of Business, The Hong Kong Polytechnic University (PolyU). T.C. Edwin Cheng is dean of the Faculty of Business, Fung Yiu King - Wing Hang Bank Professor in Business Administration, and Chair Professor of Management at The Hong Kong Polytechnic University.)