June 6, 2018 at 11:59AM
In May, we reported initial results on the 19th annual KDnuggets Software Poll:
Python eats away at R: Top Software for Analytics, Data Science, Machine Learning in 2018: Trends and Analysis.
Here we take a more detailed look at which tools go well together.
The emerging ecosystem of open-source, Python-friendly Data Science tools
we identified last year has received a new entry - see below.
We provide a link to the anonymized dataset at the end of the post - let me know what else you find in the data, and please publish or email me the results.
First, we look at which tools go together, and to make the charts understandable,
we selected the tools with at least 400 votes. There were 11 such tools, and this selection also makes sense because there was a big gap between n. 11 (Apache Spark, with 442 votes) and n. 12 (Java, 309 votes).
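The selection step can be sketched in a few lines of Python. This is only an illustration, not the script used for the post: the function name `top_tools` and the toy vote lists are assumptions, and each respondent is assumed to be represented as a list of tool names.

```python
from collections import Counter

def top_tools(vote_lists, min_votes=400):
    """Return (tool, votes) pairs with at least `min_votes`, most voted first.

    `vote_lists` holds one list of tool names per respondent; a tool is
    counted at most once per respondent.
    """
    counts = Counter(tool for votes in vote_lists for tool in set(votes))
    return [(tool, n) for tool, n in counts.most_common() if n >= min_votes]

# Toy data, invented for illustration only (not poll results).
vote_lists = [
    ["Python", "Keras"],
    ["Python", "R Language"],
    ["Python"],
]

# With a toy threshold of 2, only Python (3 votes) qualifies.
print(top_tools(vote_lists, min_votes=2))
```

With the real data, the threshold of 400 would be applied to the full list of votes.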
There are many ways to measure the significance of associations between two binary features, like chi-square or T-test, but we used the same Lift measure as in our earlier analysis (defined below).
We then grouped together the tools with the strongest association, starting with Tensorflow and Keras, until we arrived at Fig. 1 below.
To reduce clutter, we also filtered it to show only associations with abs(Lift1) > 15%.
Fig. 1: Data Science, Machine Learning Top Tools Associations, 2018
The bar length corresponds to the absolute value of Lift1, and the color shows the sign of Lift1 (green for positive, red for negative).
We note a group of 6 primary tools that together make the modern open source data science ecosystem:
Python, Anaconda, scikit-learn, Tensorflow, Keras, and Apache Spark.
RapidMiner has a small negative association with all of the tools above and does not
go strongly with any other tools.
R has small positive associations with Apache Spark, SQL, and Tableau.
The second group that emerges consists of the 3 supporting tools for Data Science and Machine Learning, which are frequently used together:
SQL, Excel, and Tableau
We note that although the chart is symmetrical about the diagonal (the top right triangle is equal to the bottom left), the patterns are easier to see in the full chart than in half of it.
Lift (X & Y) = pct (X & Y) / ( pct (X) * pct (Y) )
where pct(X) is the percent of users who selected X.
Lift (X & Y) > 1 indicates that X & Y appear together more than expected if they were independent,
Lift = 1 if X & Y appear together with the frequency expected if they were independent, and
Lift < 1 if X & Y appear together less than expected (negatively correlated).
To make the differences from one easier to see we define
Lift1 (X & Y) = Lift (X & Y) - 1
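For readers who want to reproduce the measure, here is a minimal Python sketch of Lift and Lift1 as defined above, plus the abs(Lift1) > 15% filter used for the chart. The function names and the toy records are assumptions made for illustration; each respondent is assumed to be a dict of 0/1 flags, one per tool.

```python
from itertools import combinations

def pct(rows, tool):
    """Fraction of respondents who selected `tool`."""
    return sum(r[tool] for r in rows) / len(rows)

def lift(rows, x, y):
    """Lift(X & Y) = pct(X & Y) / (pct(X) * pct(Y)).

    Raises ZeroDivisionError if either tool has no votes at all.
    """
    both = sum(1 for r in rows if r[x] and r[y]) / len(rows)
    return both / (pct(rows, x) * pct(rows, y))

def lift1(rows, x, y):
    """Lift1(X & Y) = Lift(X & Y) - 1, so 0 means 'as expected if independent'."""
    return lift(rows, x, y) - 1

def strong_pairs(rows, tools, threshold=0.15):
    """Keep only the tool pairs with abs(Lift1) above `threshold`."""
    result = {}
    for x, y in combinations(tools, 2):
        value = lift1(rows, x, y)
        if abs(value) > threshold:
            result[(x, y)] = value
    return result

# Toy data, invented for illustration only (not poll results).
rows = [
    {"Python": 1, "Keras": 1, "Excel": 0},
    {"Python": 1, "Keras": 1, "Excel": 0},
    {"Python": 1, "Keras": 0, "Excel": 1},
    {"Python": 0, "Keras": 0, "Excel": 1},
]

for pair, value in strong_pairs(rows, ["Python", "Keras", "Excel"]).items():
    print(pair, round(value, 2))
```

In the toy data, Python and Keras appear together more often than independence would predict (positive Lift1), while Keras and Excel never appear together (Lift1 = -1).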
Python vs R
Next we examine Python vs R.
For each tool X, we compute
with_Py(X) = % of tool X usage with Python, and
with_R(X) = % of tool X usage with R.
To visualize how close each tool is to Python or R, we used a very simple measure
Bias_Py_R(X) = with_Py(X) - with_R(X),
which is positive if the tool is used more with Python and negative if it is used more with R.
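The bias measure is easy to compute from binary vote flags. The sketch below is an illustration under stated assumptions: the helper names `used_with` and `bias_py_r`, the dict keys "Python" and "R", and the toy records are all hypothetical, and "usage with Python" is read as the share of a tool's users who also selected Python.

```python
def used_with(rows, tool, other):
    """Share of `tool` users who also selected `other`."""
    users = [r for r in rows if r[tool]]
    return sum(r[other] for r in users) / len(users)

def bias_py_r(rows, tool):
    """Bias_Py_R(X) = with_Py(X) - with_R(X); positive means closer to Python."""
    return used_with(rows, tool, "Python") - used_with(rows, tool, "R")

# Toy data, invented for illustration only (not poll results).
rows = [
    {"Python": 1, "R": 0, "Tableau": 1},
    {"Python": 1, "R": 1, "Tableau": 1},
    {"Python": 0, "R": 1, "Tableau": 0},
    {"Python": 1, "R": 0, "Tableau": 1},
]

# All 3 Tableau users also use Python and 1 of 3 uses R, so the bias is 2/3.
print(round(bias_py_r(rows, "Tableau"), 3))
```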
In Fig. 2, we charted the bias of the most popular tools (those with at least 100 votes),
and as we can see, almost every tool is biased towards Python. The only 2 exceptions are IBM SPSS Statistics and SAS Base. For comparison,
last year there were 10 such tools, including SAS Base, Microsoft tools, Weka, RapidMiner, Tableau, and KNIME, and almost all have since become more used along with Python.
Did Python declare victory over R?
I don't think so, because R is an excellent platform with tremendous depth and breadth, which is widely used for data analysis and visualization, and it still has about 50% share. I expect R to be used by many data scientists for a long time, but going forward, I expect more development and energy around the Python ecosystem.
Fig. 2: KDnuggets 2018 Data Science, Machine Learning Poll: Python vs R bias
Big Data and Deep Learning
Big Data tools (Spark / Hadoop) were used by 33% of respondents in the KDnuggets 2018 Software Poll, exactly the same fraction as in 2017.
This suggests that most Data Scientists work with medium / small data that does not require Hadoop / Spark, or they use other data-in-the-cloud solutions.
However, the share of respondents using Deep Learning tools grew to 43% from 32%.
For each tool X, we compute how frequently it is used with Spark/Hadoop tools (vertical axis), and how frequently it is used with Deep Learning tools (horizontal axis).
Here is a chart with top tools (with over 100 votes), excluding Deep Learning
and Big Data tools themselves.
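The two coordinates for each tool can be computed as in the sketch below. It assumes each respondent is a dict with a 0/1 flag per tool plus "With DL" and "With BD" flags (named after the dataset columns described at the end of this post); the function name `affinity` and the toy records are hypothetical.

```python
def affinity(rows, tool):
    """Return (share of `tool` users with Deep Learning tools,
               share of `tool` users with Big Data tools).

    The first value is the horizontal coordinate, the second the vertical one.
    """
    users = [r for r in rows if r[tool]]
    n = len(users)
    with_dl = sum(r["With DL"] for r in users) / n
    with_bd = sum(r["With BD"] for r in users) / n
    return with_dl, with_bd

# Toy data, invented for illustration only (not poll results).
rows = [
    {"SQL": 1, "With DL": 1, "With BD": 0},
    {"SQL": 1, "With DL": 1, "With BD": 1},
    {"SQL": 1, "With DL": 0, "With BD": 0},
    {"SQL": 0, "With DL": 1, "With BD": 1},
]

x, y = affinity(rows, "SQL")
print(round(x, 3), round(y, 3))
```

A tool whose first value exceeds its second is used more with Deep Learning tools than with Big Data tools, so it falls below the diagonal of the chart.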
Fig. 3: KDnuggets 2018 Data Science, Machine Learning Poll: Deep Learning vs Spark/Hadoop affinity
We note that Scala is the most used language with both Deep Learning and Big Data. The chart is heavy on the lower right side, with almost every tool being used more with Deep Learning than with Big Data tools.
Here is the link to the anonymized dataset in CSV format, with columns:
- Nrand: record id (randomized, records not in order of voting)
- region: usca: US/Canada, euro: Europe, asia: Asia, ltam: Latin America, afme: Africa/Middle East, aunz: Australia/New Zealand
- Python: 1 if Votes (last column) includes Python, 0 otherwise
- RapidMiner: 1 if Votes includes RapidMiner, 0 otherwise.
- R language: 1 if Votes includes "R Language", 0 otherwise. We used "R Language" instead of R for ease of regex matching.
- SQL Language: 1 if Votes includes "SQL Language", 0 otherwise.
- Excel: 1 if Votes includes Excel, 0 otherwise.
- Anaconda: 1 if Votes includes Anaconda, 0 otherwise.
- Tensorflow: 1 if Votes includes Tensorflow, 0 otherwise.
- Tableau: 1 if Votes includes Tableau, 0 otherwise.
- scikit-learn: 1 if Votes includes scikit-learn, 0 otherwise.
- Keras: 1 if Votes includes Keras, 0 otherwise.
- Apache Spark: 1 if Votes includes Apache Spark, 0 otherwise.
- With DL: 1 if Votes includes Deep Learning tools, 0 otherwise.
- With BD: 1 if Votes includes Big Data tools, 0 otherwise.
- ntools: number of tools in Votes
- Votes: list of votes, separated by a semicolon ";"
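As a starting point for your own analysis, here is a small Python sketch that reads a file with this layout and rebuilds 0/1 flags from the Votes column. The sample rows are invented and contain only a few of the columns, the function names are hypothetical, and matching a tool by exact name is an assumption (the flags in the file may have been derived differently, for example by regex).

```python
import csv
import io

# Invented sample in the described layout (only some columns shown).
SAMPLE = """Nrand,region,ntools,Votes
1,euro,3,Python;Keras;Tensorflow
2,usca,2,R Language;SQL Language
"""

def parse_votes(cell):
    """Split the semicolon-separated Votes cell into a list of tool names."""
    return [v.strip() for v in cell.split(";") if v.strip()]

def load(fileobj, tools):
    """Read records and add a 0/1 flag for each name in `tools`."""
    records = []
    for row in csv.DictReader(fileobj):
        votes = parse_votes(row["Votes"])
        record = {
            "region": row["region"],
            "ntools": int(row["ntools"]),
            "votes": votes,
        }
        for tool in tools:
            # Assumption: exact name match against the parsed votes.
            record[tool] = int(tool in votes)
        records.append(record)
    return records

records = load(io.StringIO(SAMPLE), ["Python", "R Language"])
for rec in records:
    print(rec["region"], rec["Python"], rec["R Language"], rec["votes"])
```

To use the real file, pass an opened file object (for example `open(path, newline="")`, with `path` pointing at your downloaded copy) instead of the `io.StringIO` sample.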
Let me know what you find!