EVALUATION OF UNSUPERVISED LEARNING
MODELLING WITH PARALLEL PROCESSES
Quynh T Nguyen (1,2), Satyam Vatts (1), Avneet Kaur (1),
Tatjana Jancic-Turner (1), Raouf N.G. Naguib (3)
(1) Mathematics & Statistics Department, Langara College, Vancouver, Canada
(2) Department of Business Administration and Management, Dai Nam University, Hanoi, Vietnam
(3) School of Mathematics, Computer Science Engineering, Liverpool Hope University, Liverpool, UK

UNSUPERVISED MACHINE LEARNING
Clustering Based Approach

CLUSTERING
• Technique to form segments of observations based on variations and similarities among them
• Heavily utilized to unravel hidden patterns & trends

APPLICATIONS
• Healthcare: Identifying subgroups of diseases or patients for better diagnosis and treatment
• Market Research: Customer segmentation to discover groups of similar customers

TECHNIQUES
• K-Means
• K-Means++
• K-Means Parallel


OPTIMIZED TECHNIQUES

SUSTAINABLE FUTURE

Working Towards A Reliable, Optimal & Sustainable Approach
PROBLEM
Clustering is a CPU-intensive task

MAIN GOAL
To utilize the available compute power optimally for clustering processes

PURPOSE
Aimed at reducing the carbon footprint, lowering expenditure on infrastructure & making the right decision when selecting hardware

MOTIVATION
• Help individuals and small enterprises with limited compute infrastructure to develop data-driven products
• Help in identifying an appropriate approach for expanding or upgrading existing technological infrastructure


EXPERIMENTATION IN DIVERSE ENVIRONMENTS
A Comparative Study To Research A Generic Solution

ON-PREMISE
• Ubuntu OS 22.04.1 LTS
• Intel i5 10th Gen Processor, 8-Core CPU
• 16 GB RAM
• 256 GB SSD Storage

AZURE CLOUD
• Ubuntu 20.04.4 LTS
• Intel Xeon Platinum Processor, 8-Core CPU
• 16 GB RAM
• 64 GB Storage


WORKING WITH DIFFERENT DATA NEEDS
Considering Large & Small Datasets from different domains

DATASET                     OBSERVATIONS   ATTRIBUTES   DOMAIN
Glass Identification                 214           10   Chemistry
Wine Origins                         178           13   Chemistry
Water Treatment Plant              1,382           19   Environment
ISOLET                             7,797          618   Technology
Nevus Skin Lesion Images           4,692          784   Public Health
Accelerometer                     40,209            5   Technology

[Figure: DATASET SIZE COMPARISON. Bar chart of observations per dataset, axis from 0 to 40,000: Accelerometer 40,209; Nevus Skin Lesion Images 4,692; ISOLET 7,797; Water Treatment Plant 1,382; Wine Origins 178; Glass Identification 214.]

PARALLELIZED EXECUTION
Maximizing Core Utilization

PARALLEL EXECUTION APPROACHES
• Python built-in multiprocessing module to enable process-parallelism
• Scikit-Learn K-Means implementation provides an OpenMP-based mechanism for shared-memory multiprocessing

EVALUATION METRICS
• CPU Cores v/s Execution Time
• CPU Cores v/s CPU Utilization

Comparing performance for different combinations of:
• Number of Clusters (k): 2, 4, 6, 8
• Datasets
• Parallelization Approach
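
The process-parallel approach listed above can be illustrated with a minimal sketch. It is not the study's code. It assumes that only the assignment step of K-Means (labelling each point with its nearest centroid) is split across worker processes, and every function name here (`nearest_centroid`, `assign_chunk`, `split`, `assign_labels`) is hypothetical. It uses only the Python standard library.

```python
import math
import random
from multiprocessing import Pool


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def assign_chunk(args):
    """Worker: label every point in one chunk.

    Takes a single (chunk, centroids) tuple so it can be passed to Pool.map.
    """
    chunk, centroids = args
    return [nearest_centroid(p, centroids) for p in chunk]


def split(points, n_chunks):
    """Split `points` into at most `n_chunks` contiguous, near-equal chunks."""
    size = max(1, math.ceil(len(points) / n_chunks))
    return [points[i:i + size] for i in range(0, len(points), size)]


def assign_labels(points, centroids, n_procs=1):
    """Assignment step of K-Means, spread over `n_procs` worker processes."""
    tasks = [(chunk, centroids) for chunk in split(points, n_procs)]
    if n_procs == 1:
        results = map(assign_chunk, tasks)  # serial baseline, no processes spawned
    else:
        with Pool(processes=n_procs) as pool:
            results = pool.map(assign_chunk, tasks)  # map preserves chunk order
    return [label for part in results for label in part]


if __name__ == "__main__":
    # Synthetic 2-D points, purely illustrative (not one of the study's datasets).
    random.seed(0)
    pts = [(random.random(), random.random()) for _ in range(10_000)]
    cents = [(0.25, 0.25), (0.75, 0.75)]
    serial = assign_labels(pts, cents, n_procs=1)
    parallel = assign_labels(pts, cents, n_procs=4)
    print(serial == parallel)  # both paths should produce identical labels
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes are started by re-importing the main module. Each call pays for process start-up and for pickling the chunks, so on small datasets the parallel path can be slower than the serial one.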


EXISTING HARDWARE OVER CLOUD SPENDINGS
Outcome of Comparison between Azure & On-Premise Systems

UTILIZING THE EXISTING INFRASTRUCTURE
The results of 4536 trials in each environment indicated that a
moderately strong existing on-premise infrastructure provides fairly
good performance relative to the cloud

COST SAVINGS
Better performance on the cloud requires extra spending, a
potential deal breaker for small enterprises, students & researchers
who already have workable hardware


UNEXPECTED TRENDS IN PROCESSING TIME
Execution time did not decrease with an increase in the
number of CPU cores

• Intermittent patterns of execution duration appeared in both cloud-based and on-premise environments, regardless of dataset size
• CPU usage indicates under-utilization of the available computation power, since it does not increase dramatically
• No common thread explains the circumstances in which an increased number of cores resulted in prolonged processing times or decreased CPU usage
• Lack of control over the embedded implementation of K-Means and its variants obscures the cause of the unexpected trends

[Figure panels: Azure Cloud; On-Premise]
Abstracted Implementation Of Complexity Resulting in Uncontrolled Core Utilization
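
The CPU-utilization metric can be approximated without external tools. The sketch below is an assumption about how such a measurement could be taken, not the method used in the study. It compares the CPU time consumed by a workload against the wall-clock time multiplied by the number of cores. The names `measure` and `busy` are hypothetical, and `os.times()` typically has coarse resolution (about 10 ms), so very short runs give noisy values.

```python
import os
import time


def measure(workload, *args, n_cores=None):
    """Run `workload(*args)` once; return (wall_seconds, cpu_seconds, utilization).

    utilization = CPU time / (wall time * n_cores), so a value near 1.0 means
    every core was busy for the whole run.
    """
    n_cores = n_cores or os.cpu_count() or 1
    t0 = os.times()
    wall0 = time.perf_counter()
    workload(*args)
    wall = time.perf_counter() - wall0
    t1 = os.times()
    # User + system time of this process (all threads) plus any child
    # processes that have already been waited for.
    fields = ("user", "system", "children_user", "children_system")
    cpu = sum(getattr(t1, f) - getattr(t0, f) for f in fields)
    utilization = cpu / (wall * n_cores) if wall > 0 else 0.0
    return wall, cpu, utilization


def busy(n):
    """Stand-in single-threaded workload (placeholder for a clustering run)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    wall, cpu, util = measure(busy, 2_000_000)
    print(f"wall={wall:.3f}s cpu={cpu:.3f}s utilization={util:.1%}")
```

A single-threaded workload such as `busy` can keep at most one core occupied, so on an 8-core machine its utilization would be expected to stay near or below 1/8. A utilization that stays flat as more cores are made available is the kind of under-utilization described in the bullets above.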


EXISTING METHODS RELIABILITY
Conclusions…

SCIKIT-LEARN
Although Scikit-Learn provides out-of-the-box parallelism that
should reduce processing times on a high number of CPU cores,
the research outcomes indicate otherwise
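
One way to probe this behaviour is to cap the number of OpenMP threads explicitly and time the same fit at each setting. The sketch below assumes that `threadpoolctl` (installed as a scikit-learn dependency) is used for the cap. The function name `time_kmeans`, the synthetic data and the chosen parameters are illustrative assumptions, not the study's configuration.

```python
import time

import numpy as np
from sklearn.cluster import KMeans
from threadpoolctl import threadpool_limits


def time_kmeans(X, k, n_threads, seed=0):
    """Fit K-Means with OpenMP capped at `n_threads`; return elapsed seconds."""
    model = KMeans(n_clusters=k, n_init=3, random_state=seed)
    # Limit only the OpenMP thread pool, which scikit-learn's K-Means uses.
    with threadpool_limits(limits=n_threads, user_api="openmp"):
        start = time.perf_counter()
        model.fit(X)
        return time.perf_counter() - start


if __name__ == "__main__":
    # Synthetic stand-in data (not one of the study's datasets).
    X = np.random.default_rng(0).normal(size=(20_000, 20))
    for n_threads in (1, 2, 4, 8):
        elapsed = time_kmeans(X, k=4, n_threads=n_threads)
        print(f"{n_threads} threads: {elapsed:.3f} s")
```

Fixing `random_state` keeps the initialization identical across runs, so differences in timing come from the thread cap rather than from a different number of iterations.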

DATA SCIENCE V/S COMPUTER SCIENCE
Investigating this irregularity requires the perspective of a
computer scientist rather than a data scientist, in order to make the
best use of available hardware through optimal software

REINVENTING THE WHEEL V/S ACCESSIBILITY
People working with data need to focus on analysis & insights, and
thus need accessible, reliable & optimized software to work with, rather
than worrying about software optimization itself


INNOVATING OPTIMIZED
APPROACHES

MORE EXPERIMENTATION
& INVESTIGATION

MOTIVATION TO INNOVATE
This research motivates us to develop more streamlined clustering
implementations that are optimized for varying needs & infrastructure

EXPERIMENTATION IS THE KEY
Challenging existing approaches through experimentation in varying
environments is the key to developing better ones; hence the need for rigorous
investigation through trials.

IMPLEMENTING MODERN TECHNIQUES
Advancements in computing over the years, such as quantum computing,
can be put to use in implementing modern solutions to clustering problems


THANK YOU!