Data scientists pride themselves on knowing every programming language under the sun, every library available to man and the ability to work in only code like the Zion operators from the Matrix. They are also control freaks who like to tweak and fine tune their models like F1 racing cars. So, with this in mind, how do “Drag and Drop” tools fit within the Data Scientist's toolbox?
Recent “self-service” technologies now cover the majority of tasks that a Data Scientist will typically encounter, and they can significantly speed up how a Data Scientist “wrangles” data, experiments with models and shares insight. Combine this with how easily some of these technologies can be deployed and accessed remotely over a web-based interface, and significant user exposure and efficiency gains are potentially within “easy” reach.
A number of vendors providing “self-service” drag and drop tools / services have partnered with public cloud service providers such as Azure and AWS. In addition, some cloud service providers offer their own pay-as-you-go Software as a Service options, which adds the flexibility of using a range of different tools whilst staying in control of costs (e.g. by turning services on and off) and compute resources.
Anyway, back to the question at hand: how do these fit into a Data Scientist's toolbox? Well, that depends on the type of Data Scientist you are, whether a seasoned pro, a budding novice or anything in between, and on the type of task you are exploring. Below are a few examples of where these “self-service” drag and drop solutions could fit in and help drive efficiency.
Data “Wrangling” and / or preparation
A major drain on a data scientist's time is that spent on preparing data, from re-formatting fields to mapping datasets and conducting data quality checks. Traditionalists will generally dive right into programming these tasks in languages such as Python and feed the scripts into an End-to-End (E2E) automated workflow. The trouble with doing this programmatically is that coding errors can be made that consume time to debug, data processing / transformation steps are less transparent and implementing fundamental changes increases the risk of code re-writes (amongst other potential issues).
In addition, when working programmatically you only see the data in its raw format, which does not lend itself to spotting data quality errors or to quickly identifying how the data is distributed. Doing either means running further code to visualise the data or to automate the identification and logging / alerting of potential errors / outliers. This additional code also brings along the potential issues mentioned above.
Many of the “self-service” tools geared for Data Wrangling (Trifacta and Dataiku, to name a couple) automatically let you visualise a sample of the data to provide an early indication of data quality (e.g. how complete the data is), auto-assign data types to fields where schemas are not available and provide quick views of how the data is distributed. All of this happens without writing a single line of code, which shortens the time before you can start making decisions about how to handle the data or which approach will best achieve the desired outcomes.
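To make concrete what such a profile replaces, here is a minimal sketch of the same three checks (completeness, inferred type and a quick view of the distribution) written by hand in Python. The sample data, the column names and the `infer_type` and `profile` helpers are all made up for illustration; this is not how any of the tools named above work internally.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; the empty cells stand in for missing values.
SAMPLE_CSV = """customer_id,age,city
1,34,Leeds
2,,London
3,51,London
4,29,
"""


def infer_type(values):
    """Guess a column type by trying progressively looser casts."""
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "text"


def profile(csv_text):
    """Return completeness, inferred type and most common values per column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    report = {}
    for column in rows[0].keys():
        present = [row[column] for row in rows if row[column] != ""]
        report[column] = {
            "completeness": len(present) / len(rows),  # share of non-empty cells
            "type": infer_type(present),
            "top_values": Counter(present).most_common(2),
        }
    return report


report = profile(SAMPLE_CSV)
for column, stats in report.items():
    print(column, stats)
```

Even this small version needs its own testing and maintenance, which is the overhead a visual profiling view takes away.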
They also provide “drag and drop” functionality to implement typical transform tasks such as joining, aggregating, text manipulation and many others. For more complex data, they also offer some clever algorithms to detect how the data should be structured and automatically suggest transformations. Trifacta even employs machine learning to improve its suggestions.
The development of the steps and process to prepare the data is recorded as a visual workflow, which provides a nice and easy visual way to check that the steps are logically ordered and to quickly get an overview of what is happening to the data. This can make identifying where problem manipulations occur that little bit quicker.
For complex data wrangling tasks, being able to break down the solution into visible chunks can make the development process slicker and make these tasks less daunting for those developing their Data Science skills.
This “workflow” will then auto-generate SQL code specific to the selected backend database connection. Depending on the chosen solution, this code can be run in the solution itself or exported (with auto-generated annotations / comments) and orchestrated within a more complex environment using other tools such as ActiveEON.
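The idea of a recorded workflow being turned into annotated SQL can be sketched roughly as follows. The step format, the `to_sql` helper and the `orders` table are assumptions invented for this example, and the generated statement is run against an in-memory SQLite database rather than a real backend; the actual output of any given tool will look different.

```python
import sqlite3

# A hypothetical recorded workflow: the ordered steps a visual tool might store.
WORKFLOW = [
    {"step": "source", "table": "orders"},
    {"step": "filter", "condition": "amount > 0"},
    {"step": "aggregate", "group_by": "region", "measure": "SUM(amount) AS total"},
]


def to_sql(workflow):
    """Render the recorded steps as a single commented SQL statement."""
    steps = {item["step"]: item for item in workflow}
    group_by = steps["aggregate"]["group_by"]
    return "\n".join([
        f"SELECT {group_by}, {steps['aggregate']['measure']}  -- aggregate step",
        f"FROM {steps['source']['table']}  -- source step",
        f"WHERE {steps['filter']['condition']}  -- filter step",
        f"GROUP BY {group_by}",
        f"ORDER BY {group_by}",
    ])


# Illustrative data only; the negative amount should be dropped by the filter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("North", 120.0), ("North", -5.0), ("South", 80.0), ("South", 40.0)],
)

sql = to_sql(WORKFLOW)
print(sql)
result = conn.execute(sql).fetchall()
print(result)
conn.close()
```

Because the steps are held as data rather than as hand-written SQL, reordering or swapping a step means changing the workflow and regenerating the statement, not rewriting the query.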
Rapid Model Prototyping and Selection
Within some solutions (Dataiku and Azure Machine Learning, for example) the data flow can be extended to feed into various predefined machine learning engines without any need to write a single line of code. This makes it possible to quickly test out various algorithms and compare their performance in different engines and languages (for both accuracy and speed), helping to home in on the best-suited approach to different problems.
Again, the auto-generated code can be exported and / or amended / customised to fine tune the model's performance or tailor it to a specific solution. This also provides a nice starting point for budding data scientists or those looking to implement methods they have not used before.
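The comparison these tools automate amounts to pushing several candidate models through one shared evaluation harness. A minimal sketch of that pattern, using only the Python standard library, is shown below. The synthetic data and the two hand-written models (`MajorityClass` and `NearestCentroid`) are deliberately simple stand-ins chosen for illustration, not the engines any of these products ship with.

```python
import random
import time
from collections import Counter

random.seed(0)


def make_point(label):
    """Synthetic two-class data: an assumption made purely for illustration."""
    centre = 0.0 if label == 0 else 3.0
    return ([random.gauss(centre, 1.0), random.gauss(centre, 1.0)], label)


data = [make_point(i % 2) for i in range(200)]
random.shuffle(data)
train, test = data[:150], data[150:]


class MajorityClass:
    """Baseline: always predict the most common training label."""

    def fit(self, rows):
        self.label = Counter(label for _, label in rows).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.label


class NearestCentroid:
    """Predict the label whose training mean is closest to the point."""

    def fit(self, rows):
        self.centroids = {}
        for label in {lab for _, lab in rows}:
            points = [x for x, lab in rows if lab == label]
            self.centroids[label] = [sum(col) / len(points) for col in zip(*points)]
        return self

    def predict(self, x):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(x, centroid))

        return min(self.centroids, key=lambda lab: sq_dist(self.centroids[lab]))


def evaluate(model, train_rows, test_rows):
    """Fit on the training rows, then report hold-out accuracy and elapsed time."""
    start = time.perf_counter()
    model.fit(train_rows)
    correct = sum(model.predict(x) == label for x, label in test_rows)
    elapsed = time.perf_counter() - start
    return {"accuracy": correct / len(test_rows), "seconds": elapsed}


results = {
    type(model).__name__: evaluate(model, train, test)
    for model in (MajorityClass(), NearestCentroid())
}
for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.2f}, time={scores['seconds']:.4f}s")
```

Any model exposing the same `fit` and `predict` methods can be added to the tuple, which is roughly the flexibility a drag and drop tool offers when you swap one algorithm block for another.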
For the traditionalists who like to write customised models, customised code written in a variety of programming languages can be imported and included in existing workflows making the process of switching between out of the box and custom models a simpler affair. This helps promote and de-risk the ability to experiment with different techniques and support rapidly prototyping different solutions without the need for significant re-engineering of code and workflow.
Whilst creating the workflow, you are also inherently orchestrating the tasks, which takes away the need to do this when it comes to putting your solution into production. As previously mentioned, a nicely formatted and commented copy of the code for your entire workflow can be exported and included as part of an orchestration tool such as ActiveEON, or the code can be taken apart if there is a need to orchestrate each element separately.
Data and Insight Driven Visualisations
Ok so you now have a way of classifying your data or have a nice set of insightful data. How do you go about surfacing this in a meaningful way for decision makers? One of the quickest ways to get across the message from your insights is through user driven visualisations. Writing lines of code to correctly visualise ad-hoc charts and plots in R or Python comes with its challenges and ultimately consumes time.
This is where self-service data visualisation / visual analytic tools come into play (e.g. Power BI, Tableau, Qlik Sense etc.). These tools can connect to nearly all mainstream data sources, or any data sources that are ODBC / JDBC compatible, and quickly visualise the data through a “drag and drop” interface. The majority have server or cloud hosted versions that you can easily publish to, so that others can access your creations via a web browser.
These tools make chopping and changing between chart types, formats etc. a breeze making it a simple task to cycle through a range of typical charts and plot types. In addition, they inherently create interactive visualisations that can be all connected to quickly enable end users to drill down into the specifics they are interested in.
With these tools, prototypes can be developed rapidly, which in turn helps to de-risk, guide and refine user requirements for how best to communicate visually to the business the outputs of all the clever backend data cleansing, wrangling and modelling.
So has the age of “drag and drop”, “code free”, “self-serve” data science arrived? Will data scientists never need to touch or learn Python again?
Fear not, fellow data scientists: whilst these new tools offer significant time saving and efficiency benefits, they will never replace the toolset of a true data scientist, only complement it.
However, they do offer a nice and attractive way into data science for those who want to tread down that path. For seasoned Data Scientists they provide a capability to rapidly prototype approaches to data science related problems. They also add a bit more transparency to data science projects by laying out the end-to-end process in a clear and consistent way, exposing how each step feeds into other steps (that is, assuming there is a transparent process and method to the madness being deployed!).
(About the author: Andrew Wang is a senior data scientist working in the Capgemini UK Data Science team. This post originally appeared on his Capgemini blog, which can be viewed here)