Apache Spark is the go-to tool for Data Science at scale: an open source, distributed compute platform, and the first tool in the Data Science toolbox built specifically with Data Science in mind. In this blog, I want to talk about why I think Spark is the future of Data Science at scale, and why Capgemini is supporting the Spark London Meetup Group.

We all know that data volumes are growing at an alarming rate, and to get the best value from these datasets, businesses need to be able to analyse their full breadth and depth. Traditionally, storing data at this scale has been handled by distributed storage and NoSQL technologies such as Hadoop, MongoDB, Elasticsearch and countless others. What has been lacking is the ability to process that data for analytics: it has meant either writing complex MapReduce jobs or picking out particular slices of the data to analyse with Python or R. This works well in many use cases; often a machine learning model only needs to be trained on a sample of the data, or the feature engineering and aggregation work reduces the volume naturally. However, when the need does arise to work with the full dataset (and that need is only likely to grow), data science has been at a bit of a loss. That is no longer true with Apache Spark.
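To give a flavour of what this looks like in practice, here is a minimal sketch in Scala of the classic word count, the example that once required a full MapReduce job with separate mapper and reducer classes, expressed as a few chained Spark transformations. The input path is a placeholder, and the job is configured to run locally for illustration; on a real cluster the master URL would point at the cluster manager.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Run locally on all available cores; in production the master
    // would be a cluster URL (e.g. YARN or a standalone Spark master).
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // The whole MapReduce-style pipeline in a handful of lines:
    // read text, split into words, count occurrences, sort by frequency.
    val counts = sc.textFile("data/input.txt") // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .sortBy({ case (_, n) => n }, ascending = false)

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```

The same transformations run unchanged whether the input is a local file or terabytes spread across a cluster, which is precisely why Spark closes the gap between small-sample analysis in Python or R and full-scale distributed processing.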
