New submission from ARC Award Final Report
webteam@langara.ca <webteam@langara.ca>
Sun 12/4/2022 9:55 AM

To: Scholarly Activity <scholarlyactivity@langara.ca>
Name of Researcher
Thi Quynh Nguyen
Department/Faculty
Mathematics and Statistics
Position in Department/Faculty
Instructor
Project Title
CLUSTERING PERFORMANCE EVALUATION IN CHRONIC CARE MANAGEMENT
Term of Project
Fall 2021 to Fall 2022
Please introduce yourself – include pertinent background information relating to the topic of your research project.
I am an instructor teaching Data Analytics courses and a researcher applying machine learning models in the
Environment and Health domains. The research project focused on optimising computational resources to potentially
reduce running time when using unsupervised learning, especially K-means and its variants.
Please discuss your educational background and your work experience that led you to taking on this research
project. If possible, include a quote that helps define your interest in this project.
In the last 2.5 years, I have encountered a few customers who wish to develop data-driven products. To do that, the
modelling process has to be optimised. The concept of resource optimisation has not yet been introduced to the
PPD Data Analytics Program. I therefore believe this research project lays the groundwork for drawing attention from
instructors, researchers, and students alike.
Please summarize your project in plain language that others not in your field could understand.
Various small and medium enterprises in the healthcare industry, especially in the US and Canada, foresee a
growing demand for accurately classifying healthcare customers by integrating geographic, demographic, and
psychographic data with authoritative data from various sources in order to reduce costs for stakeholders (patients,
healthcare service providers, and healthcare insurance companies). While current classification algorithms can
mostly detect and suggest meaningful patterns, they consume significant resources (time and physical computational
resources such as processing, display, and memory). This project therefore conducts various experiments to classify
skin cancer images while adjusting the algorithms so that they consume a balanced amount of resources.
Identify the project goals and objectives. Explain how the results may be used to solve a problem or inform further
research in the field.
Based on the aforementioned need, the main aim of this research is to compare the clustering performance of
parallel k-means and k-means++ in two different computing environments, namely an Azure Machine Learning
cloud-based instance and a local personal computer, in order to suggest methods of optimization. The outputs
and findings of this experiment are intended to help individuals and small enterprises for whom large investments in
expanding or upgrading technological infrastructure can be an obstacle to developing data-driven products. The key
questions are below:
- Is there any difference in running time between the two environments for different types of datasets?
- Would the existing parallelism infrastructure available to the interpreted Python language improve computation
time?
Briefly explain the steps taken (methods used) to conduct the research, and describe the key findings.
Over the last couple of decades, personal computers have come with multiple cores built into their processors. To
determine whether we can utilize these multi-core capabilities in unsupervised machine learning, we used OpenMP, the
multiprocessing package, and the joblib library, all available to Python, to execute k-means, k-means++, and
k-means++multi in parallel. The multiprocessing package provides so-called process-based parallelism: there is “a
process pool object which controls a pool of worker processes to which jobs can be submitted. It supports
asynchronous results with timeouts and callbacks and has a parallel map implementation.” [13]. With joblib-based
parallelism, “the number of processes or threads that are spawned in parallel can be controlled via the n_jobs
parameter.” [14]. OpenMP is another design, one that supports shared-memory multiprocessing programming. The use of
these libraries is warranted because Python has a built-in Global Interpreter Lock (GIL), which allows only one thread
to execute Python code at a time.
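As an illustration of the process-based approach, the following is a minimal sketch of the k-means assignment step spread over a multiprocessing pool. It is not the project's code: the function names (nearest_centroid, assign_chunk, split, parallel_assign) and the toy points are hypothetical, and only the standard-library Pool and its map method are assumed.

```python
import math
from multiprocessing import Pool


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (squared Euclidean distance)."""
    best_index, best_dist = 0, math.inf
    for index, centroid in enumerate(centroids):
        dist = sum((p - c) ** 2 for p, c in zip(point, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def assign_chunk(task):
    """Label every point in one chunk; this is the job run by a worker process."""
    chunk, centroids = task
    return [nearest_centroid(point, centroids) for point in chunk]


def split(points, n_chunks):
    """Split `points` into at most `n_chunks` contiguous, roughly equal chunks."""
    size = max(1, math.ceil(len(points) / n_chunks))
    return [points[i:i + size] for i in range(0, len(points), size)]


def parallel_assign(points, centroids, n_workers=2):
    """Assignment step of k-means, distributed over `n_workers` processes."""
    tasks = [(chunk, centroids) for chunk in split(points, n_workers)]
    if n_workers == 1:
        # Serial fallback: no pool is created for a single worker.
        results = [assign_chunk(task) for task in tasks]
    else:
        with Pool(processes=n_workers) as pool:
            results = pool.map(assign_chunk, tasks)  # the pool's parallel map
    # Flatten the per-chunk label lists back into one list, in input order.
    return [label for chunk_labels in results for label in chunk_labels]


if __name__ == "__main__":
    # Hypothetical toy data: two obvious groups of 2-D points.
    points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
    centroids = [(0.0, 0.0), (5.0, 5.0)]
    print(parallel_assign(points, centroids, n_workers=2))
```

Because each worker is a separate process with its own interpreter, the GIL does not serialize the chunks. The cost is that every chunk and the centroids must be pickled and sent to the workers, so a speed-up is only likely when each chunk carries enough work to outweigh that overhead.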
In the cloud-based environment, an Azure Machine Learning instance, we use the same three packages with a multi-core
configuration similar to that of the personal computer. Since the target architecture we use for this experiment
is an 8-core general-purpose computer, the computation task can be distributed across a number of threads chosen
according to the number of independent processing elements in that system [15].
The Azure Machine Learning instance runs Ubuntu 20.04.4 LTS and Python 3.8.5. Personal computers and laptops, on the
other hand, are commonly used to process data and build models. Hence a laptop with a configuration similar to the
Azure instance (8-core Intel CPU, 16 GB RAM) is used for this experiment. The code is written and run
in Jupyter Notebook.
To test computation running time, each dataset is run through three different k-means algorithms with the
number of clusters ranging from 2 to 8. The same tests are repeated with the number of cores used for processing set
to 2, 4, 6, or 8. Each experiment is repeated three times for each combination of k-means algorithm, number of
clusters, and number of cores. The aim is to determine whether computation running time differs within an environment.
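The experimental grid described above can be sketched as a small timing harness. This is only an illustration under stated assumptions: time_grid and placeholder_run are hypothetical names, the placeholder stands in for fitting one k-means variant on one dataset, and averaging the three repeats is an assumption, since the report does not say how repeats were aggregated.

```python
import time
from itertools import product
from statistics import mean


def time_grid(run, algorithms, cluster_counts, core_counts, repeats=3):
    """Time run(algorithm, k, cores) for every combination and average the repeats."""
    timings = {}
    for algorithm, k, cores in product(algorithms, cluster_counts, core_counts):
        elapsed = []
        for _ in range(repeats):
            start = time.perf_counter()
            run(algorithm, k, cores)
            elapsed.append(time.perf_counter() - start)
        # Assumption: repeats are summarized by their mean.
        timings[(algorithm, k, cores)] = mean(elapsed)
    return timings


def placeholder_run(algorithm, k, cores):
    """Hypothetical stand-in for fitting one k-means variant on one dataset."""
    sum(i * i for i in range(1000 * k))


ALGORITHMS = ["k-means", "k-means++", "k-means++multi"]
CLUSTER_COUNTS = range(2, 9)  # 2 to 8 clusters
CORE_COUNTS = [2, 4, 6, 8]

timings = time_grid(placeholder_run, ALGORITHMS, CLUSTER_COUNTS, CORE_COUNTS)
print(len(timings), "configurations timed per dataset")
```

With 3 algorithms, 7 cluster counts, and 4 core counts, the grid has 84 configurations per dataset, or 252 individual runs once each is repeated three times.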
Datasets used for the clustering algorithms are obtained from the UC Irvine Machine Learning Repository [16] and the
International Skin Imaging Collaboration [17]. Of the five selected datasets, one is a subset of the 2018 cancer
images that includes only the melanocytic nevus (NV) class. This class is selected because of its size. The images are
converted to gray-scale and vectorized. The number of dimensions and the number of observations/records were among
the main selection criteria for this experiment. Other requirements were popularity and coverage of different fields.
As a result, the data types are diverse, which helps to investigate possible variance in computation times
across the three variants of K-means.
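The gray-scale conversion and vectorization step can be sketched as follows. The report does not state which conversion was used, so the ITU-R BT.601 luma weights here are an assumption, and to_gray_vector and the 2x2 sample image are hypothetical.

```python
def to_gray_vector(image):
    """Flatten an RGB image (rows of (r, g, b) tuples) into one list of gray values."""
    return [
        0.299 * r + 0.587 * g + 0.114 * b  # assumed BT.601 luma weights
        for row in image
        for (r, g, b) in row
    ]


# Hypothetical 2x2 image; each image becomes one row of the clustering matrix.
image = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 0, 255)],
]
vector = to_gray_vector(image)
print(len(vector))  # one feature per pixel
```

Vectorizing this way gives every image the same number of features (height times width), which is what lets K-means treat each image as a single observation.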
Who was involved in this project (eg. faculty, students, community partners)? How did their involvement contribute
to the project’s success? Were there any challenges to overcome?
A total of four students were involved in this project. One student was hired through the Work on Campus route. Three
students joined the project voluntarily because of the nature of the project.
The DANA instructor team was a great help. I could not have completed the project successfully without their support.

Please share any personal stories that made this research experience memorable/valuable.
It is tough to conduct research while teaching and studying. But together, we were able to present our work at an
international conference.
What are the next steps for this project and for you as a researcher?
My students and I would like to continue with the second phase of the project, which is to design the infrastructure to
create a number of workers simultaneously in a real-time setting and to draw lessons learned on using unsupervised
techniques in the health-care domain.
Please upload any images that will help to showcase your project.
Unsupervised-learning-–-Evaluation-of-modeling-with-parallel.pptx
Langara Institutional Repository Consent
By submitting, I consent to uploading my ARC Fund final report to the Langara Institutional Repository (The LaIR).