Many organizations have deployed Hadoop to gain additional value from big data. Most envisioned they could use Hadoop to easily combine and analyze large data sets to identify new market opportunities or better detect fraud. However, combining data sets within Hadoop is easier said than done. So far, most organizations have faced obstacles and failed to achieve the expected return on investment from their Hadoop deployments.

Organizations can easily use Hadoop to process or model larger volumes of data from single sources. However, companies run into trouble when they attempt to combine and analyze different data sources, because the sources' varied structures make connecting and mapping them together extremely difficult. Even data from a company's internal accounts and systems may include multiple account numbers for the same customers, and the problems become even more complex when an organization makes acquisitions or merges with other companies.

Challenges of Combining Multiple Data Sources Within Hadoop

Attempts to use custom scripts to combine and analyze data from multiple internal sources stored in Hadoop have failed because programmers must guess and experiment with logic to determine which fields to use to connect sources. Their guesses are often incorrect and result in faulty analysis. As an alternative, some organizations have attempted to leverage existing data integration or data management applications to handle a wide variety and volume of sources in Hadoop. This approach has proven ineffective because these applications cannot be applied at scale; they were designed for much smaller, trusted and governed data sources. When used with less trusted, lower-quality data, they produce improperly merged records that should never have been combined.

Combining multiple data sets becomes even more problematic if organizations try to add data from external sources. Although these sources can significantly increase visibility into leading market indicators or threats, external data presents even greater technical challenges than internal data. The information can be fragmented or dirty, and it comes from multiple types of sources -- structured, semi-structured and unstructured. External data also contains information about irrelevant entities, which makes it difficult to determine which records pertain to an organization's own customers so they can be correlated with internal sources.

To realize the promise of Hadoop analytics, organizations need a better way to gain new insights by connecting the dots across many internal and external data sources. They need to be able to combine data from multiple sources and easily resolve entity issues. Organizations can do this by using statistically sound methods that do not rely on foreign keys or specific internal identifiers, ultimately leading to a better understanding of the relationships between entities. Important entity-related data, such as name, address and date of birth, in unstructured and semi-structured sources must be correlated with structured sources. This type of correlation and analysis must deliver performance at any scale without impacting an organization's systems of record or other critical enterprise applications and workflows.
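To give a rough sense of what key-free, statistically grounded matching can look like, the Python sketch below scores candidate record pairs on name, address and date-of-birth similarity rather than on shared identifiers. The field names, weights and threshold are illustrative assumptions, not drawn from any particular product.

```python
# Minimal sketch of probabilistic record matching that avoids foreign keys.
# Field names, weights and the 0.85 threshold are illustrative assumptions.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score for two normalized strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted evidence that two records describe the same person."""
    weights = {"name": 0.4, "address": 0.35, "dob": 0.25}   # assumed weights
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in weights.items())

crm_record = {"name": "Jon A. Smith", "address": "12 Main St, Springfield", "dob": "1980-04-02"}
web_record = {"name": "Jonathan Smith", "address": "12 Main Street, Springfield", "dob": "1980-04-02"}

if match_score(crm_record, web_record) >= 0.85:   # assumed decision threshold
    print("Likely the same customer despite differing account identifiers")
```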

How Companies Benefit From Hadoop Analytics

Companies can use Hadoop analytics to increase the value of sales and marketing programs, and improve outcomes of fraud detection, risk analysis efforts and other activities. For example, a consumer products manufacturer that sells direct to consumers or through retail channel partners can use Hadoop analytics to better target customers who would otherwise be difficult to identify. Or, a financial services company can use Hadoop analytics to improve its fraud detection efforts by better identifying perpetrators masking their identities and transactional patterns.

The manufacturer may want to offer a special promotion to customers who regularly buy online but have recently abandoned several online transactions. If the company can easily identify those customers, and which retail locations near them have the abandoned items in stock, it could send those customers an email or text message with a special offer to visit the store and get a promotional price on those items. Without Hadoop analytics, creating that type of offer would be an extremely complex and expensive process: the company would have to connect data stored in its enterprise data warehouse, master data management and log analysis systems, and then create a complex query to build the list of customers, products and retailers it wants to include. With Hadoop analytics, the company can quickly and easily create a complete view of customer, channel, location and product entities, identify the relationships between them, connect these entities to transaction data, and create a list of customers and the offers they should receive at specific locations. This aggregated data set lends itself perfectly to a targeted and personalized marketing campaign, increasing sales and proving business value.
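One plausible way to express that combination step on a Hadoop cluster is with Spark DataFrames, joining already-resolved customer entities to abandoned-cart transactions and nearby store inventory. The table paths, column names and join keys below are assumptions for illustration, not a prescribed implementation.

```python
# Hypothetical PySpark sketch of the abandoned-cart promotion described above.
# Table paths, columns and join keys are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("abandoned-cart-offers").getOrCreate()

customers = spark.read.parquet("/data/resolved_customers")   # customer_id, email, customer_zip
carts     = spark.read.parquet("/data/abandoned_carts")      # customer_id, product_id, abandoned_at
inventory = spark.read.parquet("/data/store_inventory")      # store_id, product_id, qty, zip_code

# Carts abandoned in the last 30 days (assumed recency window)
recent = carts.filter(F.col("abandoned_at") >= F.date_sub(F.current_date(), 30))

offers = (recent
          .join(customers, "customer_id")
          .join(inventory, "product_id")
          .filter(F.col("qty") > 0)
          .filter(F.col("zip_code") == F.col("customer_zip"))   # "nearby" approximated by ZIP match
          .select("customer_id", "email", "product_id", "store_id"))

offers.write.mode("overwrite").parquet("/data/promo_offer_list")
```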

The financial services company can create complete views of people, organizations, locations, products and events by combining internal data from multiple lines of business. In addition, the organization can compare these views against external data sources, such as watch lists, across geographies, and enrich them with information about the relationships between entities. Business fraud rules can then be applied to the data to look for behavioral patterns that may indicate fraud or compliance violations, and to produce a subset of entities that match those patterns. Without Hadoop analytics, this process would be cost- and time-prohibitive. Using Hadoop analytics, companies can quickly integrate and analyze data from transactions, watch lists, customers, and new accounts, and produce the information they need to enable faster investigations and better results.
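A simple way to picture the "business fraud rules" step is as filters applied over a combined table of resolved entities and their transaction activity. The rules, thresholds and column names in this sketch are invented for illustration.

```python
# Hypothetical sketch of rule-based screening over combined entity views.
# The rules, thresholds and column names are illustrative assumptions only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("fraud-screening").getOrCreate()

entities = spark.read.parquet("/data/resolved_entities")   # entity_id, name, on_watch_list
txns     = spark.read.parquet("/data/transactions")        # entity_id, amount, txn_date

# Summarize transactional behavior per resolved entity
activity = (txns.groupBy("entity_id")
                .agg(F.sum("amount").alias("total_amount"),
                     F.count("*").alias("txn_count")))

# Flag entities that hit a watch list or exceed assumed velocity/value thresholds
flagged = (entities.join(activity, "entity_id", "left")
                   .filter((F.col("on_watch_list") == True) |
                           ((F.col("txn_count") > 50) &
                            (F.col("total_amount") > 250000))))

flagged.select("entity_id", "name", "txn_count", "total_amount").show()
```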

Requirements for Successful Analytics Using Hadoop

Realizing significant value from Hadoop analytics requires the resolution of entities across disparate data sources. This is an extremely complex process, especially given the volume and variety of internal and external sources. To combine unstructured and semi-structured data with structured data, unstructured text must be converted into a structured form that enables it to be correlated with the structured data used by MDM, CRM and many other enterprise applications and systems. During the entity-extraction process, the characteristics of annotated data values must be determined, such as whether a string of numbers is a phone number or an identification number. Entity-record fragments of data attributes must then be created from the data and correlated to build real-world views of persons, locations, organizations or events and their relationships. This information can then be analyzed to detect behavior patterns or trends, target subsets of entities, or distill data for other uses.
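The step of deciding what an extracted value represents, for instance whether a string of digits is a phone number or an identification number, can be sketched with a couple of heuristic checks. The patterns below are simplified assumptions, not a production-grade extractor.

```python
# Minimal sketch of classifying an extracted digit string during entity extraction.
# The regular expressions are simplified assumptions, not production-grade rules.
import re

def classify_number(token: str) -> str:
    digits = re.sub(r"\D", "", token)
    if re.fullmatch(r"\+?1?[-. (]*\d{3}[-. )]*\d{3}[-. ]*\d{4}", token):
        return "phone_number"            # matches a North American phone pattern
    if len(digits) == 9:
        return "identification_number"   # e.g., a 9-digit national ID (assumed convention)
    return "unknown"

print(classify_number("(415) 555-0134"))   # phone_number
print(classify_number("123-45-6789"))      # identification_number
```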

Understanding relationships between entities, including transactional relationships, yields more accurate entity resolution and better context about the relationships and hierarchies within formal or informal groups. Since two individuals can share the same name, date of birth and city of birth, it is important that additional information, such as mother, father, spouse or employer relationships, be used to clearly determine precise identities.
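To make the disambiguation point concrete, the toy example below uses relationship attributes such as spouse and employer to separate two people who share a name, date of birth and city of birth. All records, names and the scoring rule are invented for illustration.

```python
# Toy sketch: relationship attributes separate two otherwise identical identities.
# All records, names and the scoring rule are invented for illustration.
def relationship_overlap(a: dict, b: dict) -> int:
    """Count relationship attributes (spouse, employer, ...) with matching values."""
    keys = {"spouse", "employer", "mother", "father"}
    return sum(1 for k in keys if a.get(k) and a.get(k) == b.get(k))

person_1 = {"name": "Maria Garcia", "dob": "1975-06-11", "birth_city": "Madrid",
            "spouse": "L. Ortega", "employer": "Acme Corp"}
person_2 = {"name": "Maria Garcia", "dob": "1975-06-11", "birth_city": "Madrid",
            "spouse": "P. Ruiz", "employer": "Globex"}
incoming = {"name": "Maria Garcia", "dob": "1975-06-11", "birth_city": "Madrid",
            "spouse": "L. Ortega", "employer": "Acme Corp"}

# Core attributes tie; relationship overlap breaks the tie.
best = max((person_1, person_2), key=lambda p: relationship_overlap(incoming, p))
print(best["spouse"])   # 'L. Ortega' -> the incoming record resolves to person_1
```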

In addition to providing a clear understanding of entities and their relationships to one another, Hadoop analytics must be able to rapidly combine and analyze data from a wide variety of sources and serve customers with a wide range of requirements. Some customers will want to use Hadoop analytics on a small cluster with two or three data sources, while others will have thousand-node clusters, 10 or more data sources and trillions of records.

Hadoop analytics can help companies realize greater value and a much larger return on their Hadoop investments. Indeed, Hadoop can provide better clarity about customers and enterprise operations, and actionable insights that improve sales and marketing, fraud detection and risk reduction efforts.
