Machine learning versus data science – demystifying the scene

When it comes to machine learning versus data science, it’s helpful to know the differences and where to find common ground.

12 December 2022

James Tyrrell

@JT_bluebird1

james.tyrrell@hybrid.co

Analytical duo: statistical machine learning algorithms highlight patterns in data, but it still takes human intuition to join the dots. Image credit: Shutterstock.

Who better to ask about artificial intelligence (AI) than the current darling of the scene, ChatGPT? Its answer (‘Machine learning and data science are closely related fields, but they are not the same thing’) is a useful starting point. But unpacking the differences between machine learning and data science requires human effort, for the time being at least. Until the machines take over.

Business machines play checkers

If you are new to machine learning, it’s worth skipping back to the late 1950s to gain an understanding of its origins. During this time, Arthur Samuel – a computer scientist working for IBM in the US – popularized the idea of machines that had ‘the ability to learn without being explicitly programmed to do so’. A good illustration of this was the checkers-playing program devised by Samuel. At the heart of the program was a search tree algorithm capable of exploring possible future states of the game and selecting the move that led to the greatest reward.
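
To make the idea of a search tree concrete, here is a minimal Python sketch of a game-tree search (minimax) on a made-up toy tree. It is not Samuel's program: the tree, the reward numbers, and the function names are all hypothetical, and a real checkers player would generate positions from the rules of the game rather than from a hand-written list.

```python
def minimax(node, maximizing):
    """Return the best reward reachable from `node`.

    A node is either a number (a final reward) or a list of child nodes.
    The two players alternate: one maximizes the reward, the other minimizes it.
    """
    if isinstance(node, (int, float)):
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)


def best_move(children):
    """Score each available move, assuming the opponent replies as well as possible."""
    scores = [minimax(child, maximizing=False) for child in children]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Hypothetical toy game: three possible moves, each followed by two opponent replies.
toy_tree = [[3, 5], [2, 9], [4, 6]]
move, scores = best_move(toy_tree)
print(f"Scores per move: {scores}, chosen move: {move}")
```

The second move looks tempting because it could lead to a reward of 9, but the search assumes the opponent will steer towards 2 instead, so the third move (a guaranteed 4) is chosen.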

There are a whole host of different machine learning algorithms, each with its own strengths and weaknesses, that have been deployed to make sense of available data. Examples include K-nearest neighbours, which classifies a data point according to the labelled points closest to it, and ensemble methods, which include random forests. The latter – a collection of decision trees – can be used for regression tasks, such as forecasting future values based on past data.
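
As a rough illustration of the ensemble idea, the Python sketch below averages the predictions of many one-split regression trees ('stumps'), each fitted to a random resample of the data. This is a much simplified stand-in for a random forest, not a full implementation, and the monthly sales figures and function names are invented for the example.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree; returns (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # Every x in this resample was identical, so predict a constant.
        constant = mean(ys)
        return (float("inf"), constant, constant)
    return best[1:]


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Fit each stump on a bootstrap resample (sampling rows with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in rows], [ys[i] for i in rows]))
    return forest


def predict(forest, x):
    """Average the predictions of all stumps in the ensemble."""
    return mean(left if x < threshold else right for threshold, left, right in forest)


# Hypothetical data: sales figures for months 1 to 12.
months = list(range(1, 13))
sales = [10, 12, 11, 13, 20, 22, 21, 23, 30, 31, 29, 32]

forest = fit_forest(months, sales)
forecast = predict(forest, 13)
print(f"Forecast for month 13: {forecast:.1f}")
```

Because every tree predicts an average of values it has already seen, the forecast can never fall outside the range of past sales, which is one of the known limitations of tree-based methods when a trend continues upwards.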

Typically, machine learning algorithms learn from a portion of the data (the so-called ‘training set’) and then apply what they have learned to previously unseen data. And while getting a close match between predicted and actual values during training could be something to celebrate, there is a risk that the algorithm has ‘overfitted’ the data. In this case, when shown new data, the machine learning model may struggle to accurately describe the behaviour, relying too heavily on patterns seen in the training set.
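
The Python sketch below illustrates overfitting on invented data. Two hypothetical models are compared: one that simply memorizes the training set (returning the value of the nearest training point) and one that fits a straight line. The data follow the trend y = 2x with a fixed, made-up noise pattern, so the numbers are illustrative only.

```python
from statistics import mean


def noise(k):
    """Deterministic stand-in for measurement noise: +1, -1, +1, ..."""
    return 1.0 if k % 2 == 0 else -1.0


# Invented data following y = 2x plus noise, split into training and test sets.
train = [(3 * k, 2 * (3 * k) + noise(k)) for k in range(10)]
test = [(3 * k + 1, 2 * (3 * k + 1) - noise(k)) for k in range(10)]


def fit_memorizer(points):
    """Overfitting model: return the y of the nearest training x."""
    def predict(x):
        return min(points, key=lambda p: abs(p[0] - x))[1]
    return predict


def fit_line(points):
    """Simpler model: ordinary least-squares straight line."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = mean(xs), mean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mse(model, points):
    """Mean squared error between predictions and actual values."""
    return mean((model(x) - y) ** 2 for x, y in points)


memorizer = fit_memorizer(train)
line = fit_line(train)

for name, model in [("memorizer", memorizer), ("line", line)]:
    print(f"{name}: train error {mse(model, train):.2f}, "
          f"test error {mse(model, test):.2f}")
```

The memorizer scores a perfect zero error on the training set because it has learned the noise as well as the trend, yet it does noticeably worse than the straight line on the test set.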

These classical machine learning approaches form part of a data scientist’s toolkit. However, analytical roles don’t necessarily have to feature any machine learning whatsoever – one of the distinctions between machine learning and data science. For example, data scientists may spend their days – at organizations such as The Economist, The Guardian, the World Bank, or other information-heavy institutions – preparing data visualizations.

Tools such as R (a statistical computing project) and data science and machine learning libraries available for Python have made it possible to summarize vast data sets graphically. Excel’s maximum worksheet size is 1,048,576 rows by 16,384 columns. But, thanks to efficient array management and information-holding structures known as data frames, data scientists running R or using Python libraries can easily work with data sets containing millions more rows than a spreadsheet can hold.

Cleaning in progress

A common task for data scientists, when faced with a new data set, is to carry out data cleaning. Large data sets can have missing or mislabelled data, which needs to be corrected – either to make the visualization accurate or to improve the performance of machine learning algorithms fed with the data. Another valuable skill, and one that shows some of the common ground between machine learning and data science, is so-called feature engineering.
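
A minimal Python sketch of what data cleaning can look like is shown below. The records, the label mapping, and the choice to fill missing ages with the median are all assumptions made for illustration; in practice the right fix depends on the data set.

```python
from statistics import median

# Hypothetical raw records with a missing age and inconsistent labels.
raw_rows = [
    {"name": "A", "age": 22, "sex": "male"},
    {"name": "B", "age": None, "sex": "F"},
    {"name": "C", "age": 38, "sex": "female"},
    {"name": "D", "age": 30, "sex": " Male "},
]

# Map the label variants seen in the data onto one consistent spelling.
SEX_LABELS = {"m": "male", "male": "male", "f": "female", "female": "female"}


def clean(rows):
    """Fill missing ages with the median and standardize the sex labels."""
    known_ages = [row["age"] for row in rows if row["age"] is not None]
    fill_value = median(known_ages)
    cleaned = []
    for row in rows:
        age = row["age"] if row["age"] is not None else fill_value
        # Unrecognized labels become None so they can be reviewed by hand.
        sex = SEX_LABELS.get(row["sex"].strip().lower())
        cleaned.append({**row, "age": age, "sex": sex})
    return cleaned


cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```

Filling gaps with the median keeps every record usable, at the cost of understating how much ages vary, which is the kind of trade-off that calls for human judgement.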

This activity also brings human creativity into play – identifying ways of enriching existing data so that machine learning algorithms are better able to pick up on key features. An illustrative example can be found in ship boarding data from the Titanic, which features in a popular Kaggle competition (predicting survival on the Titanic) – one of many contests on the site that help learners get to grips with, and explore the power of, machine learning.

In this example, the accuracy of the machine learning algorithm can be improved by feature engineering the cabin numbers to show on which side of the ship they are located. On its maiden voyage in 1912, the Titanic struck an iceberg on its starboard (right) side while travelling in the North Atlantic Ocean, so the side on which a passenger’s cabin was located may carry information that helps to predict survival.
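
A sketch of this kind of feature engineering in Python might look as follows. It assumes that odd-numbered cabins were on the starboard side and even-numbered cabins on the port side, and that cabin values are strings such as ‘C85’, as in the Kaggle data; the function name and both assumptions are illustrative and worth checking before use.

```python
import re


def cabin_side(cabin):
    """Derive a 'side of ship' feature from a cabin string such as 'C85'.

    Assumption: odd cabin numbers were starboard, even numbers port.
    Missing or unnumbered cabins are labelled 'unknown'.
    """
    if not cabin:
        return "unknown"
    match = re.search(r"\d+", cabin)
    if match is None:
        return "unknown"
    # Where several cabins are listed, only the first number is used.
    return "starboard" if int(match.group()) % 2 == 1 else "port"


for value in ["C85", "E46", "C23 C25 C27", None]:
    print(value, "->", cabin_side(value))
```

The new column can then be fed to a machine learning algorithm alongside the original fields, leaving it to the model to establish whether the side of the ship actually improves its predictions.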

Smart-thinking data scientists are in demand for their skills across a wide range of applications. Areas include helping companies boost profits by making sense of marketing data and customer analytics. When talking about machine learning versus data science, having humans in the loop can be a big value-add. Machine learning can be deployed to automate tasks – for example, using object recognition or other classifiers. But many decisions require human experience and insight. Computers are great at digesting billions of data points, but they lack creativity. However, pair the two and you have a powerful combination – something that can be seen in the use of AI to fight financial crime.

There’s much more to be said on the topic of machine learning versus data science. But, in a nutshell, the former tends to refer to the use of a set of statistical algorithms to extract features from data and make predictions, whereas data science is the broad skill set required to make the most of machine learning, bringing human curiosity and creativity to bear.