Last September, two computer science students from the University of St. Andrews in the U.K. attempted to pin down a definition of Big Data, publishing “Undefined by Data: A Survey of Big Data Definitions” on the open-access preprint server arXiv.org. Their round-up included:

  • Gartner Group: The “Four V’s” definition: volume, velocity, variety, and veracity.
  • Oracle: The derivation of value from traditional relational database-driven business decision-making, augmented with new sources of unstructured data such as blogs, social media, sensor networks, and image data.
  • Intel: Generating a median of 300 terabytes of data weekly. Includes business transactions stored in relational databases, as well as documents, e-mail, sensor data, blogs, and social media.
  • Microsoft: The process of applying serious computing power, the latest in machine learning and artificial intelligence, to seriously massive and often highly complex sets of information.
  • The application definition (arrived at by analyzing the Google Trends results for “big data”): Large volumes of unstructured and/or highly variable data that require the use of several different analysis tools and methods, including text mining, natural language processing, statistical programming, machine learning, and information visualization.
  • The Method for an Integrated Knowledge Environment (MIKE2.0) definition: A high degree of permutation and interaction within a dataset, rather than the size of the dataset. “Big Data can be very small, and not all large datasets are Big.”
  • NIST: Data that exceeds the capacity or capability of current or conventional [analytic] methods and systems.

Doug Fridsma, M.D., chief science officer for the ONC, has a definition that will resonate with almost everyone. “More data than you're used to. Some people deal with petabytes and it's easy, but if you're a small practice, just your own data is more data than you're used to,” he says.

This piece was originally published by Health Data Management. Published with permission.