Data Science Platforms
With the availability of high compute and storage, and statistical advances leading to powerful algorithms, Data Science (Machine Learning / Deep Learning / Artificial Intelligence) has been one of the key technological advances of the past few years. So far, computers have extended human capabilities around memory, computation, and communication (the internet); Data Science goes further by extending human intelligence. It's no wonder that Data Science has sat at the ‘Peak of Inflated Expectations’ in Gartner’s technology hype cycle for the last few years, with multiple related themes (Embedded AI, Explainable AI, etc.) finding their own place in the hype cycle.
With so much industry attention, it is no surprise that there is a plethora of Data Science platforms in the market today. The majority of them are low-code platforms that aim to ease the journey of pre-processing data, training models, and deploying them. Deciding which one to use for your enterprise or project can be overwhelming. Hence, in this article I cover a few leading Data Science platforms (as per analyst reports from Gartner and Forrester). I have picked platforms that offer some sort of community edition, so that we can try them out before purchasing a license, and that can work independently with multiple data sources.
Citizen Data Scientist: Before we start analyzing the various Data Science platforms, let's revalidate our understanding of the term Citizen Data Scientist. A Data Scientist is a person who knows data science concepts, statistics, and algorithms, is proficient in a programming language (Python, R, Scala, etc.), and can use libraries such as scikit-learn, PyTorch, TensorFlow, and Spark MLlib. A Citizen Data Scientist, by contrast, is strong in the domain and in data science concepts but not in coding, and is not well versed in the programming languages or the libraries that come with them. They are essentially a domain expert who has mastered data science concepts and wants to apply that knowledge to create machine learning models using a no-code, GUI-based platform.
The majority of the no-code Data Science platforms available in the market today focus on the modelling part and fall short on feature engineering, exploratory data analysis, and data preparation. Drag-and-drop interfaces work well for simple problems, but they tend to become overcomplicated when faced with enterprise-level, real-life problems. Overall, they are good for prototyping but questionable when it comes to solving enterprise-grade problems.
KNIME, RapidMiner, and Dataiku DSS are key no-code/low-code Data Science platforms that have part of their software open sourced or freely available and that have secured mentions in analyst reports (Gartner/Forrester) over the last few years. They stand out from the others in their support for data manipulation techniques (wrangling, transformation, cleansing, imputation).
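To make two of the data manipulation terms above concrete, here is a minimal sketch of cleansing and mean imputation in plain Python. It is not how any of these platforms implement the steps internally; the column name `age`, the sample records, and the helpers `clean_value` and `impute_mean` are hypothetical and chosen only for illustration.

```python
import math
from statistics import mean


def clean_value(raw):
    """Cleansing: turn blank, non-numeric, or NaN entries into None."""
    if raw is None:
        return None
    try:
        value = float(str(raw).strip())
    except ValueError:
        return None
    return None if math.isnan(value) else value


def impute_mean(rows, column):
    """Imputation: fill missing values in `column` with the mean of the observed ones."""
    cleaned = [clean_value(row.get(column)) for row in rows]
    observed = [v for v in cleaned if v is not None]
    if not observed:
        raise ValueError(f"no usable values in column {column!r}")
    fill = mean(observed)
    # Return new rows so the caller's original records are left untouched.
    return [
        {**row, column: fill if v is None else v}
        for row, v in zip(rows, cleaned)
    ]


# Hypothetical sample records: two valid ages, one blank, one placeholder string.
records = [{"age": "30"}, {"age": ""}, {"age": "n/a"}, {"age": " 50 "}]
imputed = impute_mean(records, "age")
ages = [row["age"] for row in imputed]
```

In a no-code platform the same two steps would typically be two nodes or recipe steps configured through the GUI; the logic they apply is of this kind.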
Dataiku: Dataiku has been named a Leader in the Gartner Magic Quadrant for the past couple of years. It is a low-code/no-code Data Science platform that is very suitable for Citizen Data Scientists, who can upload any data set and perform Exploratory Data Analysis, Feature Extraction, Algorithm Selection, Model Creation, Visualization, and Model Training using a GUI-based drag-and-drop interface. It has a low learning curve and good community support. Dataiku is very feature rich and keeps adding new features. While its Enterprise edition is a paid license model, there is also a limited-feature community edition that is good for basic prototyping or for evaluating the platform. Dataiku is accessed through a web browser over HTTP (no desktop installation required).
KNIME: KNIME is another no-code/low-code Data Science platform that is very easy to use. It can also be used by Citizen Data Scientists who do not know how to code or do not want to. We simply drag and drop widgets to create ML/AI flows. It has a low learning curve, and good online help is available from the community. There are also widgets contributed by the community itself, which are available in the KNIME marketplace. However, the visual approach starts to become a bottleneck for complicated enterprise-grade ML/AI problems, and one ends up writing code snippets in Python. Also, at the time I evaluated the platform, integration with big data software such as Cassandra and HBase was limited.
KNIME has two parts:
a) Desktop client: This is used to create data pipelines and train models. It is freely available, and anyone can download it to create their own machine learning models.
b) KNIME Server: This is where the models are deployed and managed. There is a license cost for KNIME Server.
RapidMiner is another low-code/no-code Data Science platform (with some part of it freely available) that has found a mention in Gartner’s Magic Quadrant. It comes in two flavours:
RapidMiner Go: ‘RapidMiner Go’ is a cloud-hosted SaaS offering of the no-code/low-code Data Science platform with a web interface. Although it is not open source, it has a very low pay-per-use cost of roughly $10 per user per month. The learning curve is low for Citizen Data Scientists as well. Its functionality is medium to rich, and it allows Citizen Data Scientists to train and deploy models.
RapidMiner Studio: This is also a no-code/low-code Data Science platform, with an open-source version available (individual users need to sign in). However, to deploy a model, one needs a RapidMiner Server license. As with other no-code/low-code platforms, the learning curve is low and there is good online support. Its functionality is medium to rich, especially in its support for exploratory data analysis and feature engineering.
Overall, among Dataiku, KNIME, and RapidMiner, Dataiku is functionally the richest, with new features being actively added. It is followed by KNIME and then RapidMiner. All three have a basic community edition for prototyping and evaluation and need a license for enterprise-level deployment.
The big players (Google, Amazon, Microsoft) have introduced Data Science services as part of their cloud offerings, with AutoML that tends to automate the overall pipeline. This gives the best of both worlds: the power of coding as well as automation. However, these are paid services and are best suited to cases where the data is already available within the cloud platform.
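The core idea behind AutoML, trying several candidate models and keeping the one that scores best on held-out data, can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how Google, Amazon, or Microsoft implement their services: the two candidate models, the tiny data set, and every function name here are hypothetical.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline candidate: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Candidate: least-squares straight line y = a * x + b."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = cov / var if var else 0.0
    b = my - a * mx
    return lambda x: a * x + b


def mse(model, xs, ys):
    """Mean squared error of `model` on the given points."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))


def auto_select(candidates, train, valid):
    """Fit every candidate on `train`; return (name, model, score) with the lowest validation error."""
    scored = []
    for name, fit in candidates.items():
        model = fit(*train)
        scored.append((mse(model, *valid), name, model))
    score, name, model = min(scored, key=lambda item: item[0])
    return name, model, score


# Hypothetical data that follows y = 2x + 1.
train = ([0, 1, 2, 3], [1, 3, 5, 7])
valid = ([4, 5], [9, 11])
candidates = {"mean": fit_mean, "line": fit_line}

best_name, best_model, best_score = auto_select(candidates, train, valid)
```

Real AutoML services search over far more model families, hyperparameters, and preprocessing steps, but the select-by-validation-score loop is the part they automate for you.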
Any of the above platforms may serve you well, but what if you need Data Scientists to work on complex models involving many complex data transformations at Big Data scale? This brings us to the last platform covered in this article.
Databricks: Databricks is especially suited to cases where large, complex data is in play. It provides Jupyter-style notebooks and lets one code in multiple languages (Python, R, Scala, SQL) on top of Spark. It is offered as a PaaS service hosted on Azure, AWS, and GCP, and the trained model can be deployed on the cloud. It covers the complete MLOps life cycle, allowing us to track a model right from the data used for training to its performance and how it operates once put into production. Because it is a cloud-based solution, we can use the compute (CPUs, GPUs) required for training whenever we need it. This means we can scale up or down depending on the use case and pay only for what we use. That is especially beneficial because the compute required to train a model varies a lot from use case to use case, and buying all the hardware upfront increases the Total Cost of Ownership. The Databricks platform comes from the originators of Spark and MLflow, so new features tend to be available on the platform before they are released as open source. We can train models on Big Data (petabyte scale) and track experiments against the created models, and both Spark and MLflow integrate very well in Databricks. So, for complicated Data Science problems, Databricks is the way to go.
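The pay-for-what-you-use argument above comes down to simple arithmetic. The sketch below compares the two cost models; all figures (hourly rate, monthly training hours, hardware price, running cost) are hypothetical placeholders rather than actual Databricks or cloud prices, and the function names are my own.

```python
def on_demand_cost(hours_per_month, hourly_rate, months):
    """Total cost when compute is rented only for the hours actually used."""
    return hours_per_month * hourly_rate * months


def upfront_cost(hardware_price, monthly_running_cost, months):
    """Total cost when hardware is bought upfront and then operated in-house."""
    return hardware_price + monthly_running_cost * months


def break_even_hours(hardware_price, monthly_running_cost, hourly_rate, months):
    """Monthly usage above which owning the hardware becomes cheaper than renting."""
    total_owned = upfront_cost(hardware_price, monthly_running_cost, months)
    return total_owned / (hourly_rate * months)


# Hypothetical inputs, for illustration only.
MONTHS = 36                  # planning horizon
HOURLY_RATE = 4.0            # cost of one rented training hour
HARDWARE_PRICE = 36_000      # one-off purchase of comparable hardware
RUNNING_COST = 500           # power, hosting, maintenance per month

rented = on_demand_cost(hours_per_month=100, hourly_rate=HOURLY_RATE, months=MONTHS)
owned = upfront_cost(HARDWARE_PRICE, RUNNING_COST, MONTHS)
threshold = break_even_hours(HARDWARE_PRICE, RUNNING_COST, HOURLY_RATE, MONTHS)
```

With these made-up numbers, renting stays cheaper until training exceeds the break-even threshold in hours per month; plugging in your own quotes shows where that line falls for a given use case.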
Abhinav Ajmera
Senior Data Scientist, Cloud and Data Architect
The opinions of the author are personal, and the author is in no way associated with any of the Data Science platforms.