New submission from ARC Award Final Report
webteam@langara.ca
Sun 12/4/2022 9:55 AM
To: Scholarly Activity

Name of Researcher
Thi Quynh Nguyen

Department/Faculty
Mathematics and Statistics

Position in Department/Faculty
Instructor

Project Title
CLUSTERING PERFORMANCE EVALUATION IN CHRONIC CARE MANAGEMENT

Term of Project
Fall 2021 to Fall 2022

Please introduce yourself – include pertinent background information relating to the topic of your research project.
I am an instructor teaching Data Analytics courses and a researcher applying machine learning models in the environment and health domains. The research project focused on optimising computational resources, with the aim of reducing running time when using unsupervised learning, especially K-means and its variants.

Please discuss your educational background and your work experience that led you to taking on this research project. If possible, include a quote that helps define your interest in this project.
In the last 2.5 years, I have encountered a few customers who wish to develop data-driven products. To do that, the modelling process has to be optimised. The concept of resource optimisation has not yet been introduced to the PPD Data Analytics Program. Therefore, I believe this research project lays the groundwork for drawing the attention of both instructors/researchers and students.

Please summarize your project in plain language that others not in your field could understand.
Many small and medium enterprises in the healthcare industry, especially in the US and Canada, foresee a growing demand for accurately classifying healthcare customers by integrating geographic, demographic, and psychographic data with authoritative data from various sources, in order to reduce costs for stakeholders (patients, healthcare service providers, and healthcare insurance companies).
While current classification algorithms can mostly detect and suggest meaningful patterns, they consume significant resources (time and physical computational resources: processing, display, and memory). Therefore, this project aims to run experiments that classify skin cancer images while adjusting the algorithms so that they use resources in a balanced and effective way.

Identify the project goals and objectives. Explain how the results may be used to solve a problem or inform further research in the field.
Based on the aforementioned need, the main aim of this research is to compare the clustering performance of parallel k-means and k-means++ in two different programming environments, namely an Azure Machine Learning cloud-based instance and a local personal computer, in order to suggest methods of optimization. The outputs and findings of this experiment are intended to help individuals and small enterprises for which the large investments needed to expand or upgrade technological infrastructure can be an obstacle to developing data-driven products. The key questions are:
- Is there any difference in running time between the two environments for different types of datasets?
- Would the parallelism infrastructure available to Python, an interpreted language, improve computation time?

Briefly explain the steps taken (methods used) to conduct the research, and describe the key findings.
Over the last couple of decades, personal computers have come with multiple cores built into their processors. To determine whether we can utilize these multi-core capabilities in unsupervised machine learning, we used OpenMP, the multiprocessing package, and the joblib library, all available to Python, to execute k-means, k-means++, and k-mean++multi in parallel. The multiprocessing package, which provides so-called process-based parallelism, offers “a process pool object which controls a pool of worker processes to which jobs can be submitted.
It supports asynchronous results with timeouts and callbacks and has a parallel map implementation.” [13]. With joblib-based parallelism, “the number of processes or threads that are spawned in parallel can be controlled via the n_jobs parameter.” [14]. OpenMP is another interface that supports shared-memory multiprocessing programming. The use of these libraries is warranted because Python has a built-in Global Interpreter Lock (GIL), which allows only one thread to execute Python bytecode at a time.

In the cloud-based environment, namely an Azure Machine Learning instance, we use the same three packages with a multi-core configuration similar to that of the personal computer. Since the target architecture for this experiment is an 8-core general-purpose computer, it is possible to distribute the computation across a number of threads that depends on the number of independent processing elements in that system [15]. The Azure Machine Learning instance runs Ubuntu 20.04.4 LTS and Python 3.8.5. On the other hand, personal computers or laptops are commonly used to process data and build models. Hence, a laptop with a configuration similar to the Azure instance (an 8-core Intel CPU and 16 GB RAM) is used for this experiment. The code is written and run in Jupyter Notebook.

To test the computation running time, each dataset is run through three different k-means algorithms with the number of clusters ranging from 2 to 8. The same tests are repeated with the number of cores used for processing set to 2, 4, 6, or 8. Each experiment is repeated three times for each combination of k-means algorithm, number of clusters, and number of cores. This aims to determine whether computation running time differs within an environment.

Datasets used for the clustering algorithms are obtained from the UC Irvine Machine Learning Repository [16] and the International Skin Imaging Collaboration [17].
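The timing protocol described above (clusters from 2 to 8, 2/4/6/8 worker processes, three repetitions) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: the report does not say how the work was split across cores, so the sketch assumes that independent k-means++ restarts are farmed out to a `multiprocessing.Pool`. The helper names (`make_points`, `time_restarts`, `run_grid`), the synthetic data, and the pure-Python k-means are all hypothetical stand-ins.

```python
import random
import time
from multiprocessing import Pool


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: draw each new centre with probability
    proportional to its squared distance from the nearest chosen centre."""
    centres = [rng.choice(points)]
    while len(centres) < k:
        weights = [min(sq_dist(p, c) for c in centres) for p in points]
        if sum(weights) == 0:  # every point already coincides with a centre
            centres.append(rng.choice(points))
        else:
            centres.append(rng.choices(points, weights=weights, k=1)[0])
    return centres


def kmeans(points, k, seed=0, max_iter=20):
    """Plain Lloyd iterations after k-means++ seeding; returns (centres, inertia)."""
    rng = random.Random(seed)
    centres = kmeans_pp_init(points, k, rng)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centres[i]))
            clusters[nearest].append(p)
        new_centres = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centres[i]  # keep the old centre if a cluster is empty
            for i, members in enumerate(clusters)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    inertia = sum(min(sq_dist(p, c) for c in centres) for p in points)
    return centres, inertia


def _fit_one(job):
    """Top-level worker function so that it can be pickled for the pool."""
    points, k, seed = job
    return kmeans(points, k, seed)[1]


def time_restarts(points, k, n_workers, n_restarts=8):
    """Time n_restarts independent runs spread over n_workers processes.
    ASSUMPTION: parallelising restarts is our choice, not the report's.
    The measured time includes pool start-up overhead."""
    jobs = [(points, k, seed) for seed in range(n_restarts)]
    start = time.perf_counter()
    if n_workers == 1:  # serial baseline, no pool
        inertias = [_fit_one(job) for job in jobs]
    else:
        with Pool(processes=n_workers) as pool:
            inertias = pool.map(_fit_one, jobs)
    return time.perf_counter() - start, min(inertias)


def run_grid(points, ks=range(2, 9), workers=(2, 4, 6, 8), repeats=3, n_restarts=8):
    """Mean wall-clock time for every (clusters, workers) combination."""
    results = {}
    for k in ks:
        for w in workers:
            times = [time_restarts(points, k, w, n_restarts)[0] for _ in range(repeats)]
            results[(k, w)] = sum(times) / repeats
    return results


def make_points(n=300, seed=1):
    """Hypothetical synthetic 2-D data standing in for the real datasets."""
    rng = random.Random(seed)
    blobs = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
    return [tuple(rng.gauss(c, 1.0) for c in rng.choice(blobs)) for _ in range(n)]
```

When run as a script, `run_grid(make_points())` should be called under an `if __name__ == "__main__":` guard, which process-based parallelism requires on platforms that spawn rather than fork worker processes. On small inputs the pool start-up cost can outweigh any gain from extra workers, which is one reason such measurements are usually repeated and averaged.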
Of the five selected datasets, one is a subset of the 2018 cancer images that includes only the melanocytic nevus (NV) class. This class is selected due to its size. The images are converted to grayscale and vectorized. The number of dimensions and the number of observations/records were among the main selection criteria for this experiment. Other requirements were popularity and coverage of different fields. As a result, the types of data are diverse, which helps to investigate possible variance in computation time when using the three variants of K-means.

Who was involved in this project (e.g. faculty, students, community partners)? How did their involvement contribute to the project’s success? Were there any challenges to overcome?
A total of 4 students were involved in this project. One student was hired through the Work on Campus route. Three students joined the project voluntarily due to the nature of the project. The DANA instructor team was a great help. I could not have completed the project so successfully without their support.

Please share any personal stories that made this research experience memorable/valuable.
It is tough to conduct research while teaching and studying. But together, we were able to present our work at an international conference.

What are the next steps for this project and for you as a researcher?
My students and I would like to continue with the second phase of the project, which is to design infrastructure that creates a number of workers simultaneously in a real-time setting, in order to draw lessons on using unsupervised techniques in the healthcare domain.

Please upload any images that will help to showcase your project.
Unsupervised-learning-–-Evaluation-of-modeling-with-parallel.pptx

Langara Institutional Repository Consent
By submitting, I consent to uploading my ARC Fund final report to the Langara Institutional Repository (The LaIR).