Big data is often characterized by three S’s:
- The size or quantity of the data in question (generally huge);
- The speed at which new data is gathered (typically extremely fast);
- And how the data is structured (which it often isn’t, meaning it’s completely unstructured or semi-structured data).
Big, fast, unstructured data is a fact of modern life. Data is captured in all walks of life: emails, texts, video surveillance footage, phone calls, point-of-sale transactions and medical records. In other words, anything and everything that records information through one mechanism or another.
How Does Big Unstructured Data Affect Businesses?
What does this mean for businesses? Companies can now capture data from every aspect of their operations. For example, technology companies can track every single keystroke and mouse click to better understand how their products are used. Auto insurers have the ability to track driving speeds, stops and brake frequencies for every single car they insure. And retail outlets have video recordings of which products their customers examine for every single aisle in every single store.
What can anyone possibly do with so much data? It's not even a question of quantity anymore - it's more a question of feasibility. One can put up a thousand powerful computers in parallel and crunch huge data sets to derive results. But what if the data is also unstructured? What if the problem is not in finding the solution but in finding the correct questions to be asked in the first place? Everybody can obtain a huge data set, and almost anybody can acquire the right set of tools to analyze that data, but very few “somebodies” possess the right mindset to use the data to begin solving business problems.
How Can Unstructured Data Problems be Easily Solved?
Solving a big unstructured data problem is like trying to navigate a maze. To negotiate a maze, the first thing required is a flare or light to determine direction. In the world of big data, direction is determined not by numbers but by visuals. Visualizing the data is the first step to making unstructured data useful. Visualization might not yield accurate answers, but it is the most powerful way of understanding where to look for those answers or, more precisely, what questions to frame.
Why is Visualizing Data so Helpful?
Information visualization is the branch of art and science that tries to visually represent the structure of a data set via graphical mechanisms. Visualization has served people since prehistoric times, when the earliest human beings used tally marks (four bars and a strike) to keep track of time. It has since evolved to include the representation of numerical data (line charts, bar charts, pie charts), geographic data (heat maps) and the relationships among different data items (graphs).
Computers and technology have added animation and interactive visualization to the tool set, and systems have evolved that generate the most fitting display automatically, using algorithms that consider the preferred drawing style and aesthetic criteria such as area minimization and symmetry maximization. One example is Microsoft Excel suggesting relevant graphs for selected data.
In a discovery-driven, fail-fast approach to problem solving, the following visualization techniques are the usual suspects for analyzing unstructured data and better comprehending business problems:
Word Clouds
Use: Determining the topic relevance of voluminous texts and commentary
Association: Lengthy texts, social media discussions, customer feedback
Modus Operandi: Word clouds express the occurrence of words in text form, with the size of the text corresponding to the frequency with which the word or phrase is used. An example is word clouds used as tag clouds on the Web to highlight the most prominent tag or topic of discussion on a Web page. Color and size can be used to distinguish different word clouds.
Decoding: The larger a word appears, the more frequently it occurs in the text. Figure 1 is an actual word cloud of the entire text of this article. One can clearly see that the words “visualization” and “unstructured” steal the show.
Incidence: Word clouds can be used to very quickly analyze the main focus or topics of discussion for any social media content or online feedback mechanism such as a survey.
Figure 1: Example of a word cloud for determining word frequencies in large amounts of text
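The sizing rule behind a word cloud can be sketched in a few lines of Python. This is a minimal illustration, not the tool used to produce Figure 1: the stop-word list, the point-size range and the sample text are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "for", "it"}

def word_cloud_sizes(text, min_pt=10, max_pt=48, top_n=5):
    """Map the top_n most frequent words to font sizes in [min_pt, max_pt]."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOP_WORDS]
    counts = Counter(words).most_common(top_n)
    if not counts:
        return {}
    hi = counts[0][1]           # frequency of the most common word
    lo = counts[-1][1]          # frequency of the least common word kept
    span = max(hi - lo, 1)      # avoid dividing by zero when all counts match
    return {w: round(min_pt + (c - lo) * (max_pt - min_pt) / span)
            for w, c in counts}

# Sample text invented for illustration.
sample = ("Visualization makes unstructured data useful. "
          "Unstructured data needs visualization, and visualization "
          "guides the questions.")
sizes = word_cloud_sizes(sample)
for word, pt in sizes.items():
    print(f"{word}: {pt}pt")
```

The linear scaling is the simplest choice; tools often use a logarithmic scale instead so that one very frequent word does not dwarf everything else.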
Association Trees
Use: Understanding word associations in large quantities of text
Association: Social media text, customer feedback, news articles
Modus Operandi: Latent semantic analysis (LSA) is applied to extract and represent the contextual meaning of words using statistical computations. The fundamental idea behind LSA is that the full set of contexts in which a given word does and does not appear largely determines how similar its meaning is to that of other words. Association trees are a way of defining and understanding these similarities.
Decoding: In Figure 2, certain words seem to have a positive association with other words (depicted in green) and a negative association with certain other words (depicted in red). The lines indicate that a relationship exists between these words and the thickness of these lines quantifies the extent of the relationship.
Incidence: A brand’s claim to fame and infamy along with the reasons for both can be derived from customer commentary. Positive and negative comments can be sorted through and the word associations or sentiments behind each of these can be identified.
Figure 2: Example of an association tree for understanding word relationships within large quantities of text
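The idea that words are related when they appear in the same contexts can be sketched as follows. This is a simplification rather than full LSA: it builds the term-document counts that LSA starts from and compares words by cosine similarity, but it omits the singular value decomposition that LSA uses to reduce dimensions. The customer comments are hypothetical.

```python
import math
import re

def term_document_vectors(docs):
    """Return {word: [count in doc 0, count in doc 1, ...]}."""
    vectors = {}
    for i, doc in enumerate(docs):
        for word in re.findall(r"[a-z]+", doc.lower()):
            vectors.setdefault(word, [0] * len(docs))[i] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity of two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical customer comments, invented for illustration.
comments = [
    "battery drains fast",
    "battery life is short",
    "screen is bright and sharp",
    "sharp screen great colors",
]
vecs = term_document_vectors(comments)
sim_screen_sharp = cosine(vecs["screen"], vecs["sharp"])      # same comments
sim_screen_battery = cosine(vecs["screen"], vecs["battery"])  # never together
print(sim_screen_sharp, sim_screen_battery)
```

In an association tree, a similarity score like this would set the thickness of the line between two words.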
Horizongraphs
Use: Analyzing video and audio file content
Association: Video and audio files
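Modus Operandi: A horizongraph compresses a time series into very little vertical space: the value range is sliced into equal bands, and the bands are overlaid on a shared baseline in progressively darker shades, so that many series can be stacked and compared on one screen. (The related streamgraph is a generalization of the stacked area graph in which the baseline is free; shifting the baseline minimizes the change in slope, or “wiggle,” in individual series, making it easier to perceive the thickness of any given layer across the data.) Either form can help analyze the content extracted from streaming video or audio data.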
Decoding: In Figure 3, the time series is from an audio output file. The graph tracks the occurrence of specific word sets over time and this can be used in an interactive fashion to follow a discussion across a time period, such as a day.
Incidence: By analyzing call center audio recordings, a company can determine the types of complaints that are most widely reported, the intensity of the customer response (via semantic intonations), and how these vary at different times of the day or month, all without having to transcribe the recordings into text.
Figure 3: Example of a cubism horizongraph used to track and analyze audio and video files
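Cubism-style horizon graphs are commonly built by slicing each series into bands of equal height and drawing the bands on top of one another. The sketch below shows only that slicing step, under the assumption that the input is a list of per-minute keyword counts; the function name and the sample numbers are hypothetical, and no drawing is done.

```python
def horizon_bands(values, n_bands=3):
    """Split a series into n_bands layers of equal height.

    Layer 0 holds the lowest slice of every value, layer 1 the next slice,
    and so on. Negative values are sliced by magnitude; their sign is
    returned separately so a renderer can mirror or recolor them.
    """
    peak = max(abs(v) for v in values) or 1   # avoid a zero band height
    height = peak / n_bands
    layers = [
        [min(max(abs(v) - i * height, 0), height) for v in values]
        for i in range(n_bands)
    ]
    signs = [1 if v >= 0 else -1 for v in values]
    return layers, signs

# Hypothetical counts of a keyword per minute of a call recording.
mentions = [0, 3, 6, 9, 6]
layers, signs = horizon_bands(mentions)
for i, layer in enumerate(layers):
    print(f"band {i}: {layer}")
```

Adding the layers back together column by column recovers the original magnitudes, which is why the overlay loses no information while using a third of the height.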
Self-Organizing Maps/Topological Data Analysis
Use: Visualizing relationships and gaining insights from multidimensional data from multiple sources
Association: High-dimensional, multisourced unstructured data
Modus Operandi: Topological data analysis works in two steps. First, it applies a shape or commonly understood geometric structure to the data points. Next, it analyzes the interaction between these shapes via algorithmic constructs to unearth connections and insights. Self-organizing maps are trained by an unsupervised learning algorithm, so they can pick up structure from any large chunk of data without requiring labels.
Decoding: In Figure 4, the self-organizing map shows the voting patterns for both chambers of the U.S. Congress. Overall, the color red means a yes vote while blue means a no vote. Multidimensional data is broken down at the level of each individual variable, and the voting patterns are analyzed via shapes to understand the ideological commonality within each party.
Incidence: To completely comprehend the customer, a 360-degree view of the customer is required. Apart from traditional transaction data, there also exists social media, customer profile, survey and feedback data from which a holistic view of the customer can be obtained. Self-organizing maps and topological data analysis can be used to understand customers’ buying behavior based on these multiple dimensions.
Figure 4: Example of a self-organizing map that tracks voting patterns by party for the U.S. Congress
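How a self-organizing map learns can be sketched with a tiny one-dimensional map. This is a minimal illustration under stated assumptions, not the method behind Figure 4: the two voting records are invented (1 = yes, 0 = no on three hypothetical bills), the map has only four nodes, and the starting weights, learning rate and neighborhood schedule are arbitrary choices made to keep the example deterministic.

```python
def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to x."""
    dists = [sum((w - v) ** 2 for w, v in zip(node, x)) for node in weights]
    return dists.index(min(dists))

def train_som(data, n_nodes=4, epochs=10, lr=0.5):
    """Train a 1-D self-organizing map and return its node weights."""
    dim = len(data[0])
    # Deterministic start: nodes spread evenly along the cube's diagonal.
    weights = [[k / (n_nodes - 1)] * dim for k in range(n_nodes)]
    for epoch in range(epochs):
        # The neighborhood shrinks over time: neighbors of the winning
        # node are pulled along early on, later only the winner moves.
        radius = 2 if epoch < epochs // 2 else 1
        for x in data:
            bmu = best_matching_unit(weights, x)
            for k, node in enumerate(weights):
                influence = max(0.0, 1 - abs(k - bmu) / radius)
                for j in range(dim):
                    node[j] += lr * influence * (x[j] - node[j])
    return weights

# Hypothetical voting records: two blocs that vote in opposite ways.
bloc_a = [1, 1, 0]
bloc_b = [0, 0, 1]
weights = train_som([bloc_a, bloc_b])
bmu_a = best_matching_unit(weights, bloc_a)
bmu_b = best_matching_unit(weights, bloc_b)
print(bmu_a, bmu_b)
```

After training, the two blocs land on different nodes of the map. With real data, coloring each node by the votes of the records mapped to it gives the kind of picture the figure describes.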
Network Graphs
Use: Unearthing semantics and relationships from a large contextual data set
Association: Contextual, relationship-rich, network data
Modus Operandi: Network graphs are designed to measure and quantify the relationships between different vertices or nodes on a graph. These network graphs can be directed or undirected, depending on the business requirement. Graphing tools along with quantification algorithms are used to obtain the graphs.
Decoding: In Figure 5, the entire text of the novel Les Misérables is analyzed to quantify the relationships among its various characters. It is an undirected graph, but the size of the bubble signifies the importance of the character, while the width of the network line represents the extent of the relationship between two characters.
Incidence: Physician prescription and practice data can be analyzed together to identify key opinion leaders among physicians and also to understand the degree of influence each physician wields over another. This can come in quite handy for refining physician-targeting marketing methods.
Figure 5: Example of a network graph that quantifies the relationships among the characters in the novel Les Misérables
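The quantification behind such a graph can be sketched with co-occurrence counts. In this minimal example, an edge's weight is the number of scenes two characters share, and a character's importance is the sum of the weights of its edges. Both measures are assumptions chosen for simplicity, and the scene list is invented for illustration rather than taken from the book.

```python
from collections import Counter
from itertools import combinations

# Hypothetical scenes: each lists the characters who appear together.
scenes = [
    ["Valjean", "Javert"],
    ["Valjean", "Cosette"],
    ["Valjean", "Cosette", "Marius"],
    ["Valjean", "Javert"],
]

# Undirected edges: sort each pair so (a, b) and (b, a) are the same key.
edge_weights = Counter()
for scene in scenes:
    for a, b in combinations(sorted(set(scene)), 2):
        edge_weights[(a, b)] += 1

# Node importance as weighted degree: the sum of a node's edge weights.
strength = Counter()
for (a, b), w in edge_weights.items():
    strength[a] += w
    strength[b] += w

for name, score in strength.most_common():
    print(name, score)
```

In a drawing, `edge_weights` would set the line widths and `strength` the bubble sizes. For a directed graph, the pairs would be kept in their original order instead of being sorted.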
According to market researcher IDC, the volume of business data will double every 18 months. As more companies start realizing the value of unstructured data, they will turn to data visualization methods to make the insights captured by that data accessible to a bigger audience than ever before.
At the outset, analyzing unstructured data can seem an intimidating task, but having the right skill set and tools in place makes the whole undertaking far more manageable. The next time you hear the words “unstructured data analysis,” remember to go beyond traditional bar graphs and pie charts and use these visualization techniques on your social data, customer feedback and audio calls to solve the mysteries therein.