Last September, two computer science students from the University of St. Andrews in the U.K. attempted to pin down a definition of Big Data, publishing “Undefined by Data: A Survey of Big Data Definitions” on arXiv.org, an open-access preprint repository. Their round-up included:
- Gartner Group: The “Four V’s” definition: volume, velocity, variety, and veracity.
- Oracle: The derivation of value from traditional relational database-driven business decision-making, augmented with new sources of unstructured data such as blogs, social media, sensor networks, and image data.
- Intel: The generation of a median of 300 terabytes of data weekly, including business transactions stored in relational databases, documents, e-mail, sensor data, blogs, and social media.
- Microsoft: The process of applying serious computing power, the latest in machine learning and artificial intelligence, to seriously massive and often highly complex sets of information.
- The application definition (arrived at by analyzing the Google Trends results for “big data”): Large volumes of unstructured and/or highly variable data that require the use of several different analysis tools and methods, including text mining, natural language processing, statistical programming, machine learning, and information visualization.
- The Method for an Integrated Knowledge Environment (MIKE2.0) definition: A high degree of permutation and interaction within a dataset, rather than the size of the dataset. “Big Data can be very small, and not all large datasets are Big.”
- NIST: Data that exceeds the capacity or capability of current or conventional [analytic] methods and systems.
Doug Fridsma, M.D., chief science officer for the ONC, has a definition that will resonate with almost everyone: “More data than you're used to--some people deal with petabytes and it's easy, but if you're a small practice, just your own data is more data than you're used to,” he says.
This piece was originally published by Health Data Management. Published with permission.