EVALUATION OF UNSUPERVISED LEARNING MODELLING WITH PARALLEL PROCESSES

Quynh T Nguyen (1,2), Satyam Vatts (1), Avneet Kaur (1), Tatjana Jancic-Turner (1), Raouf N.G. Naguib (3)

(1) Mathematics & Statistics Department, Langara College, Vancouver, Canada
(2) Department of Business Administration and Management, Dai Nam University, Hanoi, Vietnam
(3) School of Mathematics, Computer Science and Engineering, Liverpool Hope University, Liverpool, UK


UNSUPERVISED MACHINE LEARNING
Clustering Based Approach

CLUSTERING
• Technique to form segments of observations based on the variations and similarities among them
• Heavily utilized to unravel hidden patterns & trends

APPLICATIONS
• Healthcare: identifying subgroups of diseases or patients for better diagnosis and treatment
• Market Research: customer segmentation to discover groups of similar customers

TECHNIQUES
• K-Means
• K-Means++
• K-Means Parallel


OPTIMIZED TECHNIQUES, SUSTAINABLE FUTURE
Working Towards A Reliable, Optimal & Sustainable Approach

PROBLEM
• Clustering is a CPU-intensive task

MAIN GOAL
• To utilize the available compute power optimally for clustering processes

PURPOSE
• Reduce the carbon footprint, lower expenditure on infrastructure & support the right decision when selecting hardware

MOTIVATION
• Help individuals and small enterprises with limited compute infrastructure to develop data-driven products
• Help in identifying an appropriate approach for expanding or upgrading existing technological infrastructure


EXPERIMENTATION IN DIVERSE ENVIRONMENTS
A Comparative Study To Research A Generic Solution

               ON-PREMISE             AZURE CLOUD
OS             Ubuntu 22.04.1 LTS     Ubuntu 20.04.4 LTS
Processor      Intel i5 10th Gen      Intel Xeon Platinum
CPU cores      8                      8
RAM            16 GB                  16 GB
Storage        256 GB SSD             64 GB


WORKING WITH DIFFERENT DATA NEEDS
Considering Large & Small Datasets from different domains

DATASET                     OBSERVATIONS   ATTRIBUTES   DOMAIN
Glass Identification                 214           10   Chemistry
Wine Origins                         178           13   Chemistry
Water Treatment Plant               1382           19   Environment
ISOLET                              7797          618   Technology
Nevus Skin Lesion Images            4692          784   Public Health
Accelerometer                      40209            5   Technology

[Figure: DATASET SIZE COMPARISON, bar chart of the number of observations per dataset, axis from 0 to 40000]


PARALLELIZED EXECUTION
Maximizing Core Utilization

PARALLEL EXECUTION APPROACHES
• Python built-in multiprocessing module to enable process-level parallelism
• Scikit-Learn K-Means implementation, which provides an OpenMP-based mechanism for shared-memory (thread-level) parallelism

EVALUATION METRICS
• CPU Cores v/s Execution Time
• CPU Cores v/s CPU Utilization

Comparing performance for different combinations of:
• Number of Clusters (k): 2, 4, 6, 8
• Datasets
• Parallelization Approach


EXISTING HARDWARE OVER CLOUD SPENDINGS
Outcome of Comparison between Azure & On-Premise Systems

UTILIZING THE EXISTING INFRASTRUCTURE
• The results of 4536 trials in each environment indicated that a moderately strong existing on-premise infrastructure provides fairly good performance relative to the cloud

COST SAVINGS
• Better performance on the cloud requires extra spending, a potential deal breaker for small enterprises, students & researchers who already have feasible hardware


UNEXPECTED TRENDS IN PROCESSING TIME
Execution time did not decrease as the number of CPU cores increased

• Intermittent patterns of execution duration appeared in both the cloud-based and on-premise environments, regardless of dataset size
• CPU usage indicates under-utilization of the available computation power, since it does not increase substantially with more cores
• No common thread explains the circumstances in which an increased number of cores resulted in prolonged processing times or decreased CPU usage
• Lack of control over the embedded implementation of K-Means and its variants obscures the cause of the unexpected trends

[Figure panels: Azure Cloud, On-Premise]

Abstracted Implementation Of Complexity Resulting in Uncontrolled Core Utilization


EXISTING METHODS RELIABILITY
Conclusions…

SCIKIT-LEARN
• Although Scikit-Learn provides out-of-the-box parallelism that should reduce processing times on a high number of CPU cores, the research outcomes indicate otherwise

DATA SCIENCE V/S COMPUTER SCIENCE
• Investigating this irregularity requires the perspective of a computer scientist rather than a data scientist, in order to make the best use of available hardware through optimal software

REINVENTING THE WHEEL V/S ACCESSIBILITY
• People working with data need to focus on analysis & insights, and thus need accessible, reliable & optimized software to work with, rather than worrying about software optimization itself


INNOVATING OPTIMIZED APPROACHES
MORE EXPERIMENTATION & INVESTIGATION

MOTIVATION TO INNOVATE
• This research motivates us to develop more streamlined clustering implementations that are optimized for varying needs & infrastructure

EXPERIMENTATION IS THE KEY
• Challenging existing approaches through experimentation in varying environments is the key to innovating better approaches; hence the need for rigorous investigation through trials

IMPLEMENTING MODERN TECHNIQUES
• Advancements in computing over the years, such as quantum computing, can be put to use in implementing modern solutions to clustering problems


THANK YOU!
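
The process-parallel measurement described under PARALLELIZED EXECUTION (CPU cores v/s execution time and CPU utilization, for k = 2, 4, 6, 8) could be sketched roughly as follows. This is an illustration under stated assumptions, not the code used in the study: a plain Lloyd's k-means written in pure Python stands in for the Scikit-Learn implementation so that the sketch runs with the standard library only, the data are synthetic, the utilization figure is a rough proxy (CPU seconds per available core-second), and all function names (`make_blobs`, `kmeans`, `run_trial`, `time_trials`) are hypothetical.

```python
import random
import time
from multiprocessing import Pool


def make_blobs(n_points, n_dims, n_centers, seed=0):
    """Generate synthetic points around random centers (stand-in for a real dataset)."""
    rng = random.Random(seed)
    centers = [[rng.uniform(-10, 10) for _ in range(n_dims)] for _ in range(n_centers)]
    return [
        [c + rng.gauss(0, 1) for c in centers[i % n_centers]]
        for i in range(n_points)
    ]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=10, seed=0):
    """Plain Lloyd's k-means (assumed stand-in); returns (centroids, inertia)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, inertia


def run_trial(args):
    """Worker: fit one model and report (k, inertia, wall seconds, CPU seconds)."""
    points, k, seed = args
    wall0, cpu0 = time.perf_counter(), time.process_time()
    _, inertia = kmeans(points, k, seed=seed)
    return k, inertia, time.perf_counter() - wall0, time.process_time() - cpu0


def time_trials(points, ks, n_procs):
    """Run one trial per k across n_procs worker processes and time the batch."""
    jobs = [(points, k, 0) for k in ks]
    start = time.perf_counter()
    with Pool(processes=n_procs) as pool:
        results = pool.map(run_trial, jobs)
    elapsed = time.perf_counter() - start
    # Rough utilization proxy: worker CPU seconds per available core-second.
    utilization = sum(r[3] for r in results) / (elapsed * n_procs)
    return elapsed, utilization, results


if __name__ == "__main__":
    data = make_blobs(n_points=2000, n_dims=5, n_centers=4)
    for n_procs in (1, 2, 4, 8):
        elapsed, util, _ = time_trials(data, ks=(2, 4, 6, 8), n_procs=n_procs)
        print(f"processes={n_procs}  time={elapsed:.2f}s  utilization~{util:.0%}")
```

With Scikit-Learn installed, `kmeans` could be swapped for `sklearn.cluster.KMeans(n_clusters=k).fit(X)`. In that case the OpenMP thread count is controlled separately from the process count, for example through the `OMP_NUM_THREADS` environment variable, so the two levels of parallelism may need to be limited together to avoid oversubscribing the cores.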