JOIV : Int. Inform. Visualization, 9. - January 2025 231-240
INTERNATIONAL JOURNAL ON INFORMATICS VISUALIZATION
journal homepage : w. org/index. php/joiv

Optimizing iCadet Assignment through User Profiling
Peak-Fei Yap a,*, Choo-Yee Ting a, Hairul A. Abdul-Rashid b
a Faculty of Computing and Informatics, Multimedia University, Persiaran Multimedia, Cyberjaya, Selangor, Malaysia
b Faculty of Engineering, Multimedia University, Persiaran Multimedia, Cyberjaya, Selangor, Malaysia
Corresponding author: *yappeakfei@gmail.

Abstract: The Industry Cadetship program assigns penultimate-year students to companies matching their profiles, bridging academic learning and industry skills. Manual data analysis for assignments is time-intensive, prompting this study's objectives: (1) propose an algorithm to optimize student-company assignment by using the student and company profiles, (2) propose a method for the assignment of lecturers to companies, and (3) use similarity measure techniques to recommend companies with similar characteristics. Data was collected from a university's student, company, and lecturer datasets. To assign students to companies, the Haversine formula, OpenStreetMap, and NetworkX were used to calculate the shortest geographical distance between the students and the companies, and the assignments were evaluated based on mean, variance, standard deviation, and utilization rate. For the lecturer assignment, cosine similarity was applied to measure the similarity between domain descriptions and company or lecturer information after performing Voyage AI embedding. Lecturers are assigned to companies based on the highest domain similarity scores. The performance was evaluated using accuracy, precision, recall, and F1-score. Findings showed that embedding techniques significantly enhanced the matching process, with accuracy improving from 0.464 to 0.6071 and precision from 0.417 to 0.5058; recall saw an equal rise from 0.464, and the F1-score advanced from 0.417. Longer descriptive inputs further improved performance, with accuracy rising from 0.6154 to 0.7692, precision from 0.5744 to 0.7751, recall remaining steady at 0.7692, and F1-score increasing from 0.5807. This work can be extended to explore job portal datasets by aligning profiles with geography and specialization.

Keywords: iCadet; user profile; company profile; similarity measure; matching algorithm.

Manuscript received 14 Jan.; revised 19 Aug.; accepted 27 Nov. Date of publication 31 Jan.
International Journal on Informatics Visualization is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.

I. INTRODUCTION
The Industry Cadetship (iCadet) program is a new initiative that aims to help undergraduate students of all faculties. This program helps narrow the gap between what students learn in school and what is required in the workplace. Those involved in the iCadet program engage in a variety of activities, including industry visits, corporate social responsibility (CSR) initiatives, onboarding events, and company gatherings, to immerse themselves in the corporate culture.

Historically, matching students to companies was a lengthy process that involved evaluating many factors, such as student skills and job requirements. It often relied on human input, which could be subjective and inefficient. Like the job matching problem, iCadet placement was formerly a difficult and time-consuming process that required taking into account several variables, including the qualifications of the student, the demands of the job, the required skills, and many more. To make the process faster and more effective, we used Artificial Intelligence (AI).

According to recent findings, workplace location is a significant yet unexplored factor in job matching. Geographical factors are also crucial to students' ability to find jobs, as some regions have a higher demand for specific professions or industries. Firstly, there was a critical need for an efficient algorithmic framework to facilitate the assignment of students and supervisors to companies. Workplace location and geographical factors are significant yet underexplored aspects influencing students' ability to find jobs, especially in regions with higher demand for specific professions. Addressing this, the primary goal of this study was to design and implement a system that leverages student profiles, company profiles, and lecturers' profiles to ensure optimal matches between students, lecturers, and companies.

Another critical aspect is assigning appropriate lecturers to supervise iCadet placements, matching their expertise with the needs of the companies hosting students. Assigning supervisors in this way ensures that students receive guidance and mentorship, enhancing their educational experience.

Furthermore, the current placement system frequently falls short of offering customized advice based on each student's distinct profile and goals. Without this customized approach, students found it difficult to locate and establish connections with businesses that closely matched their interests and career goals. To solve this problem, this study focused on using similarity measure techniques. The objectives of this project are as follows:
1) To propose an algorithm for student-company assignments through student profiles and company profiles.
2) To propose a method for the assignment of the supervisor to a company.
3) To use similarity measure techniques to recommend companies with similar characteristics.

Task assignment problems involve allocating a set of tasks to a set of agents in a way that optimizes one or more objectives, such as minimizing total cost, maximizing efficiency, or achieving a fair distribution of work. Assigning workers tasks that interest them is critical to ensuring continuous worker performance. If workers are assigned tasks they are not interested in, they may complete them with poor quality, which can even harm the business. Consideration of worker preferences is thus a significant challenge. Existing work shows that many studies look at where workers are and at what they want to do, but not at how the two connect, and ignoring this connection can lead to poor job assignments. To fix this, a new method combines location and preferences to assign tasks. It aims to pick workers who are nearby and willing to do the job. This approach, called MAJA, helps to get the most tasks done while following certain rules. The researchers noted that ways to avoid starvation of low-gain tasks should be considered in future work. An LSTM-based model for extracting workers' latent feelings from historical data has also been proposed. The researchers then developed an efficient greedy algorithm and a Kuhn-Munkres (KM)-based algorithm to achieve optimal task assignment, taking the workers' feelings into consideration.

One study highlighted that graduate employability extends beyond academic credentials to include unique market contributions. Moreover, the study revealed an early gender wage gap, where female interns earned more than non-interns but less than their male counterparts. Notably, the significant salary disparities between genders were attributed solely to the age of the respondents, with older individuals favoring male candidates. Odlin et al. found that internships located far from the home institution or in inherently riskier settings, such as factories or politically unstable regions, posed increased risks. Such locations, crucial for specific fields like engineering or hospitality, often involve higher costs and potential isolation for students, reducing oversight and elevating risk. Other findings showed that internship experiences help students understand the world of work better; before these internships, many did not really know what to expect. Employers have identified graduates' poor performance in the workplace as a result of their inability to apply classroom theories in practice. The Students Industrial Work Experience Scheme (SIWES) helps students link what they learn in school with real jobs. It gives them a chance to work with real machines and tools. This hands-on experience helps students learn in ways that they can't in the classroom.

User Features in Student-Company Matching
According to Table I, the majority of the researchers used gender as a factor in internship placement. Gender distribution in internship placements can reveal patterns that help guide diversity strategies in traditionally male-dominated or female-dominated fields. Age is another factor that researchers often consider. It can indicate how much experience someone might have. Some internship programs are set up for students in certain years, such as those in their second-to-last year. Analyzing age distribution helps ensure that internship opportunities are appropriate for the target audience.

TABLE I
USER FEATURES IN STUDENT-COMPANY MATCHING
[Table columns: Author, Gender, Race, Age, Work Experience, Marital Status, Education Level, SSE, Soft Skills, Year of Graduation, Major / Academic Program, Total]

Furthermore, majors reflect a student's likely skill set and area of expertise. Internship placements that are relevant to a student's major allow them to apply and improve the skills learned during their academic coursework, resulting in a more meaningful and productive internship experience. As a result, a few researchers use the major or academic program to assess the alignment between a student's academic background and internship requirements.
User Profiling in Job Matching
Previously, finding the right job mostly relied on people making decisions, with little help from technology. Digital platforms introduced a basic system that used keywords for job matching, but these systems often missed the finer details of job descriptions and candidate backgrounds. They also struggled with the fast-changing job market, where new skills and roles appear all the time. This is where machine learning and AI come in: they can update job profiles in real time, keeping matches accurate and relevant.

Previous research suggested that user profiling is vital for helping candidates find appropriate jobs. By understanding user backgrounds, job systems can give more tailored recommendations. The main goal is to investigate what job candidates prefer based on their past job interviews. There is also growing interest in using natural language processing (NLP) to improve accuracy. Document ranking and document similarity comparison have been identified as major NLP tasks. Additionally, NLP is used to extract user profiles, such as skills, education, and experience, from unstructured resumes, which results in a summary of each application.

User profiling is critical to addressing the challenges that candidates face when navigating the complex job landscape. By thoroughly analyzing individual user profiles, job recommendation systems can help candidates identify and secure positions that are closely related to their field of interest and expertise. This tailored approach helps reduce the frustration and uncertainty often associated with the job search process. Candidates receive recommendations that are personalized to their skills, experience, and career aspirations.

For job matching platforms, access to user profiles enables systems to incorporate a wide range of data, such as age, country, past learning activities, and educational background. This information helps identify users with similar learning or professional preferences. These user profiles can be important features for a job recommendation model. For expert recommendation, BERTERS, a recommendation system, has been applied to identify patterns in candidates' expertise and preferences. Additionally, the skills2job recommendation system begins with a set of user preferences for skills and identifies the most suitable jobs as they emerge from a large dataset of Online Job Vacancies (OJVs). The researchers use the European Skills, Competences, Qualifications, and Occupations (ESCO) taxonomy to assess the similarity between occupations and users' skills.

Cui and colleagues discovered a gap in existing recommendation systems, which ignored the inherent relationship between the user's preference and time. In reality, the user's interest changes over time. To address the gap, the researchers proposed a novel recommendation model based on the time correlation coefficient and an improved K-means with cuckoo search (CSK-means). Systematic experimental results show that their model is effective.

Current recommendation systems rely on past interaction history to estimate user preferences, which limits their ability to capture fine-grained and dynamic user preferences. Thus, researchers proposed Conversational Path Reasoning (CPR). It walks through the attribute vertices based on user feedback, explicitly using the user-preferred attributes. CPR reduces irrelevant candidate attributes, increasing the likelihood of identifying user-preferred attributes. The researchers discovered user preferences such as rating information, tag information, the number of users, and the number of products on the books. Furthermore, a Conversational Recommender System (CRS) called Estimation-Action-Reflection (EAR) was proposed. This system estimates user preferences for both items and item attributes, and then uses learned dialogue policies to decide whether to inquire about attributes or recommend items based on the ongoing conversation and user preferences. The underlying model consists of factorization machines that have been trained on user profiles and item attributes. It was found that many existing CRSs have a fixed set of user intents, which means they rely heavily on background knowledge that is built by hand.

In NLP, figuring out what the user wants and picking the best response is very important. A new way to recommend jobs uses different machine learning models and language processing techniques. Researchers looked at user skills and job requirements to make better suggestions. They combined features to fix the problems with older recommendation systems. The researchers found that the Random Forest classifier worked best for their main model. For their language processing needs, the spaCy PhraseMatcher performed well. Additionally, a Collaborative Filtering method using the K-NN algorithm helped find job fits for Informatics Engineering students by checking how close their skills were to tech jobs. Finally, researchers also used cosine similarity along with K-NN to match CVs with job descriptions.

II. MATERIALS AND METHOD
Fig. 1 indicates the flow of methods applied in this project. Data preprocessing and missing data handling are performed to transform the raw data into a form that is useful for the assignment algorithms. The assignment algorithms are constructed after the missing data handling has been done.

Fig. 1 Flowchart of methods

Fig. 2 shows a framework overview of student-company assignments, from data collection to data preprocessing, including feature engineering and missing value handling. Next, the algorithms are constructed by setting up preferences using the latitude, longitude, and domains. The margin of error of latitude and longitude is set to 2 km.
The threshold of the distance for the assignment is set to 10 km, 20 km, 30 km, and 40 km.

Fig. 2 Framework overview of student-company assignments.

The algorithm first checks whether the company domain is identical to the student domain and, if it is, continues by checking the company's available positions. All the matching data are saved into a data frame and written to a Matching Companies CSV file. The algorithm then calculates the straight-line distance between the company and the student using the Haversine formula. Then, OSMnx is applied to calculate the drivable route on the map. Finally, the student is assigned to the nearest company.

The Student dataset contains student information such as demographic data, and the Lecturer dataset contains each lecturer's name, email address, expertise, and other information.

TABLE II
FEATURES IN THE DATA COLLECTED
Dataset          Features
ITP Historical   Faculty, Company Name, Company Address, Academic Program, Major Code
Student          Student ID, Latitude, Longitude, Program Name
Lecturer         Name, Email Address, Expertise, Related Subjects, Best Domain

Data Extraction & Data Preprocessing
Before the assignment algorithms are used, the datasets are preprocessed to remove duplicate rows, missing values, noisy data, and outliers. Real-world data is rarely clean or complete, so data preprocessing is an important step in delivering processed data that improves assignment accuracy. The ITP dataset consists of basic information about companies. Artifacts such as double backslashes and extra spaces have been removed, and company names and addresses have been standardized. Then, each company's major, such as Software Engineering, is converted into a specific major code, making data processing and analysis more programmatic. This removes the complexities of dealing with long, descriptive names that may contain special characters or spaces, which can be inconvenient in coding, database management, and reporting. Furthermore, the Lecturer dataset includes detailed information about each supervisor. All text has been converted to lowercase, and stop words and extra spaces have been removed to ensure data consistency. Additionally, special characters have been removed.

Missing Data Handling
For the ITP dataset, missing company addresses were filled in by navigating Waze with Selenium and saving the company's location address. Missing longitude and latitude values were filled in with a combination of Nominatim and ArcGIS geocoding. This project focuses mainly on Malaysia. Therefore, 78 rows of students who live outside Malaysia were removed from the Student dataset. Next, a missing value in the Lecturer dataset indicated that the lecturer did not have that information. Initially, rows with missing values were dropped. However, to retain valuable information and improve the robustness of the analysis, the missing values were instead replaced with "None". This ensured that no data was discarded, maintaining the dataset's completeness.

Fig. 3 shows a framework overview of lecturer-company allocation. The entire process passes through preprocessing pipelines for both lecturers and companies, in which the domain descriptions, related subjects, and company descriptions are cleaned and tokenized. Several preparatory steps are taken, such as removal of stop words, stemming, and lemmatization.

Fig. 3 Framework overview of lecturer-company assignments.

The cleaned and normalized texts are sent to the embedding stage, which uses Voyage AI to convert them into numerical vectors. The similarity between texts is then computed using cosine similarity, and the most relevant domain is assigned to each company and each lecturer. The assignment of lecturers to a company uses the domains of both companies and lecturers to perform the matching.
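The domain-matching step described above can be illustrated with a minimal sketch. It assumes the embedding vectors have already been produced (no embedding service is called here), and the three-dimensional vectors, the domain names, and the helper `best_domain` are hypothetical values chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def best_domain(text_vec, domain_vecs):
    """Return the domain name whose vector is most similar to text_vec."""
    return max(
        domain_vecs,
        key=lambda name: cosine_similarity(text_vec, domain_vecs[name]),
    )


# Hypothetical 3-dimensional "embeddings"; real embedding vectors are much longer.
domain_vecs = {
    "software engineering": [0.9, 0.1, 0.0],
    "accounting": [0.0, 0.2, 0.9],
}
lecturer_vec = [0.8, 0.3, 0.1]
company_vec = [0.7, 0.2, 0.0]

lecturer_domain = best_domain(lecturer_vec, domain_vecs)
company_domain = best_domain(company_vec, domain_vecs)
is_match = lecturer_domain == company_domain  # lecturer and company share a domain
```

In this sketch a lecturer and a company are considered a match when both are mapped to the same best domain, which mirrors the domain-based matching described in the text.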
Data Source
In this work, four datasets are involved. Let ITP be the dataset that contains the company information used for matching students, Student the dataset of student records, and Lecturer the dataset of lecturer records.

Algorithm Design & Construction
The preferences-matching algorithm constructed in this project assigns students to companies based on major alignment and proximity, while considering company capacity. First, it filters the company data to find companies that match the student's major code and checks each company's available space. Then, it calculates the distance from each company to the student from their longitude and latitude using the Haversine formula and sorts the companies by distance. The unit of distance is kilometers. The Haversine formula is shown below:

$$ d = 2R \arcsin\left( \sqrt{ \sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right) } \right) $$

where $\varphi_1, \varphi_2$ are the latitudes and $\lambda_1, \lambda_2$ the longitudes of the two points in radians, and $R$ is the Earth's radius.

After sorting the companies, the algorithm calculates the driving distance using OSMnx. If OSMnx cannot find a driving path, Taxicab is used as the alternative way of calculating the driving distance. Then, the algorithm attempts to assign the student to the nearest company that matches their preferences within the distance threshold. The threshold of the distance is set to 10 km, 20 km, 30 km, and 40 km. The algorithm updates the company's available space, removes the student from the pool, and saves the assignment details.

After students with matching major codes have been assigned, another function assigns the remaining students to any company with available space, regardless of the major code. It updates the company's space and records each assignment, just like the first function. This ensures that all students are assigned to a company.

The distance between locations is the most important factor in determining how quickly someone can find or arrive at their destination. Latitude and longitude coordinates facilitate the calculation of the distance between two locations on Earth.
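As an illustration of the distance step, the following is a minimal sketch of the Haversine calculation and the distance-based sort. The Earth radius of 6367.45 km is the value the paper quotes for this formula; the coordinates, company names, and the helper `sort_by_distance` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6367.45  # radius value quoted in the paper


def haversine_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))


def sort_by_distance(student, companies):
    """Return (distance_km, company) pairs sorted from nearest to farthest."""
    pairs = [
        (haversine_km(student["lat"], student["lon"], c["lat"], c["lon"]), c)
        for c in companies
    ]
    return sorted(pairs, key=lambda pair: pair[0])


# Hypothetical coordinates, for illustration only.
student = {"lat": 2.93, "lon": 101.64}
companies = [
    {"name": "Company A", "lat": 3.15, "lon": 101.71},
    {"name": "Company B", "lat": 2.95, "lon": 101.66},
]
ranked = sort_by_distance(student, companies)
nearest_name = ranked[0][1]["name"]
```

The sorted list can then be scanned from the top, so that only the nearest candidates need the more expensive road-based distance calculation.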
The best way to measure how far apart two places are on Earth is the great-circle distance, which gives the shortest path across the globe. A popular formula for this is the Haversine formula, which is commonly used in navigation systems. It calculates the straight-line distance between two points on Earth from their longitude and latitude.

OSMnx is a Python package built on top of GeoPandas, NetworkX, and matplotlib to retrieve, model, analyze, and visualize street networks from OpenStreetMap. OSMnx has been used in prior work for map building and visualization, and to determine geographic node distances from real road courses in a network area.

In the iCadet assignment, the location profile was a critical consideration, necessitating the shortest possible distance between points. To address this, a novel method combining the Haversine formula and OSMnx was developed. Initially, the Haversine formula was used to filter candidates based on geographic proximity, providing a quick estimation of distances using latitude and longitude coordinates. This step aimed to reduce the time consumed in the initial screening. Subsequently, OSMnx was used for the more accurate road-based distance calculations needed for final assignments.

A Python function titled "Calculate driving distance" was created to determine the driving distances between companies and students. This function receives two tuples containing the geographic coordinates of both entities. The algorithm first calculates a bounding box around these points by determining the minimum and maximum longitude and latitude values, incorporating a margin of error equivalent to about two kilometers. This margin ensures that both locations and a surrounding buffer zone are included in the generated graph, facilitating accurate path calculation. The 2 km margin is converted into degrees of latitude and longitude, with the longitude margin scaled by the cosine of the latitude.

Then, graph G is created within the bounding box around the two locations by extracting a driving network graph from OpenStreetMap data with the OSMnx library. This graph represents the network of drivable roads within the bounding box. Once the graph is obtained, the function locates the nearest nodes on this graph to the student's and company's coordinates. These nodes represent the closest points on the driving network to the specified locations. Using the NetworkX library, the function attempts to find the shortest path between the student's node and the company's node on the graph, weighted by the physical length of the roads. If a path exists, the function returns the length of this path in kilometers (the length is initially calculated in meters). However, if no path can be found between the two nodes (NetworkX raises a NetworkXNoPath exception), the function falls back to an alternate method of calculating the shortest path. Taxicab was used as the alternative way to find the shortest path.

In summary, this function is an integral part of the geographic information system (GIS) analysis, providing a practical tool for measuring the accessibility of iCadet locations for students based on real-world road networks.

Algorithm 1 CalculateDrivingDistance
Input: student coordinates (Lat_s, Lon_s), company coordinates (Lat_c, Lon_c)
Output: driving distance D
Begin
  Δlat ← |Lat_s − Lat_c|
  Δlon ← |Lon_s − Lon_c|
  Lat_margin ← margin of error in degrees of latitude
  Lon_margin ← margin of error in degrees of longitude (scaled by the cosine of the latitude)
  Lat_min ← min(Lat_s, Lat_c) − Lat_margin
  Lat_max ← max(Lat_s, Lat_c) + Lat_margin
  Lon_min ← min(Lon_s, Lon_c) − Lon_margin
  Lon_max ← max(Lon_s, Lon_c) + Lon_margin
  G ← driving network graph within (Lat_min, Lat_max, Lon_min, Lon_max)
  Node_s ← NearestNode(G, Lat_s, Lon_s)
  Node_c ← NearestNode(G, Lat_c, Lon_c)
  Try
    D ← ShortestPathLength(G, Node_s, Node_c)
  Catch
    Return none
  EndTry
  Return D
End
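The bounding-box and fallback logic of the driving-distance function can be sketched without any mapping library. This is a minimal illustration under stated assumptions, not the paper's implementation: one degree of latitude is taken as roughly 111.32 km, the longitude margin is scaled by the cosine of the midpoint latitude, `NoPathError` stands in for the NetworkX no-path exception, and `route_length_m` and `fallback_km` are hypothetical callables representing the road-network routine and the alternative method.

```python
import math

KM_PER_DEG_LAT = 111.32  # assumption: approximate length of one degree of latitude


class NoPathError(Exception):
    """Stand-in for networkx.NetworkXNoPath so the sketch runs without NetworkX."""


def bounding_box(student, company, margin_km=2.0):
    """Return (north, south, east, west) around two (lat, lon) points plus a margin."""
    (lat_s, lon_s), (lat_c, lon_c) = student, company
    lat_margin = margin_km / KM_PER_DEG_LAT
    mid_lat = math.radians((lat_s + lat_c) / 2)
    # A degree of longitude shrinks with latitude, so divide by cos(latitude).
    lon_margin = margin_km / (KM_PER_DEG_LAT * math.cos(mid_lat))
    north = max(lat_s, lat_c) + lat_margin
    south = min(lat_s, lat_c) - lat_margin
    east = max(lon_s, lon_c) + lon_margin
    west = min(lon_s, lon_c) - lon_margin
    return north, south, east, west


def driving_distance_km(student, company, route_length_m, fallback_km):
    """Road distance in km; use the fallback when no path exists in the graph.

    route_length_m(bbox, student, company) -> meters, or raises NoPathError.
    fallback_km(student, company) -> kilometers.
    """
    bbox = bounding_box(student, company)
    try:
        return route_length_m(bbox, student, company) / 1000.0  # meters to km
    except NoPathError:
        return fallback_km(student, company)
```

In a real pipeline, `route_length_m` would wrap the graph download, nearest-node lookup, and shortest-path call described above; keeping it as a parameter makes the margin and fallback behavior testable in isolation.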
The proposed algorithm assigns students to companies based on their profiles by considering the students' majors, their geographical proximity to the companies, and the available spaces at the companies. The algorithm StudentCompanyAssignment automates this process and takes three arguments: the student data, containing information about students; the company data, containing information about companies; and a distance threshold within which companies are considered for assignment.

The function begins by initializing an empty array called assignments to store details of the assigned students. It then iterates over each student in the student data, retrieving their unique ID, major code, and geographic coordinates (latitude and longitude). Using the student's major code, the function filters the company data to find companies that match the student's major and have available capacity (space > 0). If no matching companies are found, the function continues to the next student.

For students with matching companies, the function calculates the distance between the student's location and each company using the Haversine formula. The resulting distances are added to the matching_companies DataFrame as a new column. The companies are then sorted by their distance from the student in ascending order.

The function iterates through the sorted list of companies and identifies those within the specified distance threshold. For each company within the threshold, the function attempts to calculate the driving distance using the previously defined CalculateDrivingDistance algorithm. If successful, it normalizes this distance to kilometers; if not, it sets the driving distance to None. Once a suitable company is found, the function updates the space column in the company DataFrame to account for the filled internship position and removes the student from the student DataFrame to prevent them from being assigned again.
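The matching loop described so far can be sketched with plain Python records. This is a simplified illustration rather than the paper's DataFrame-based function: it assumes the straight-line Haversine distance stands in for the driving distance, and the field names (`id`, `major`, `lat`, `lon`, `space`) and example records are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6367.45):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))


def assign_students(students, companies, threshold_km):
    """Greedy pass: each student goes to the nearest same-major company with space.

    `companies` is mutated (its "space" counters are decremented).
    Returns (assignments, unassigned_students).
    """
    assignments, unassigned = [], []
    for s in students:
        # Keep companies with the same major code and at least one free position.
        matching = [c for c in companies if c["major"] == s["major"] and c["space"] > 0]
        # Pair each candidate with its distance and sort from nearest to farthest.
        ranked = sorted(
            ((haversine_km(s["lat"], s["lon"], c["lat"], c["lon"]), c) for c in matching),
            key=lambda pair: pair[0],
        )
        if ranked and ranked[0][0] <= threshold_km:
            dist, company = ranked[0]
            company["space"] -= 1  # one internship position is now filled
            assignments.append({
                "student_id": s["id"],
                "company": company["name"],
                "distance_km": round(dist, 3),
                "remaining_space": company["space"],
            })
        else:
            unassigned.append(s)  # left for a later pass with a larger threshold
    return assignments, unassigned


# Hypothetical records, for illustration only.
students = [
    {"id": "S1", "major": "SE", "lat": 3.0, "lon": 101.6},
    {"id": "S2", "major": "SE", "lat": 3.0, "lon": 101.6},
]
companies = [
    {"name": "Near Co", "major": "SE", "lat": 3.01, "lon": 101.61, "space": 1},
    {"name": "Far Co", "major": "SE", "lat": 4.0, "lon": 102.6, "space": 5},
]
assigned, left = assign_students(students, companies, threshold_km=10)
```

Returning the unassigned students separately makes it straightforward to rerun the pass with a larger threshold or to hand the remainder to a fallback assignment step.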
assignment record is created, capturing the student ID, the company's major code, company name, driving distance, count of internships, and remaining space. This record is added to the assignments array. Finally, the function creates a new DataFrame from the assignments array with appropriate column names and returns it along with the updated student_data, reflecting the students who have yet to be In conclusion, this algorithm demonstrates an efficient and structured approach to resolving the complex task of internship placements, ensuring students are matched with appropriate companies based on academic alignment and geographic accessibility. This method balances the needs of both students and companies by optimizing the placement process and ensuring that students are placed in relevant and accessible internships. :,#%m0 . Ia 9 _ M: HjH'G:H Catch: :,#%m0 . Ia N 'M P MIa . Ia 4% b } AIa oOb } Break Return End = 2 arcsin MIab OO 9% | . F a MIa g kK . For each OO g: If c. distance O : Try: M Oe1 Algorithm 3 PreprocessText Input: q% Output: q750"& Begin =F a - && M >. End If |M| = 0: Continue to next student For each c OO g: , . 7 Ia . Ia i jM H'M H ' MIa To find the driving distance between two places, we use Open Street Maps (OSM) and Python tool called OSMNX. OSM is a great source for detailed map info, including roads, buildings, rivers, and mountains. Many people help keep OSM updated. These include hobbyists, mappers, disaster risk experts, and GIS professionals. Since OSM is open to everyone, anyone can use its data. OSMNX takes this data and uses it to create network for different uses. For lecturer-company assignments, the framework is divided into several major components: preprocessing pipelines for both lecturers and companies, embedding processes, and a domain matching mechanism. WXY Z[\] 4 OO 4% do _ M' %. Ia . F a - Ia . F a , . - Ia The Haversine formula helps us find the distance between two points on the Earth using their latitude and longitude. 
The Haversine formula in navigation calculates the distance of a circle between latitude and longitude points, assuming the earth's radius R is 6367. 45 km. The Haversine formula's assumption ignores the earth's surface structure . alley depth and hill heigh. , which is quite accurate in most calculations because the ellipsoidal effect is eliminated. In this project, the Haversine method is used to calculate the straight-line distance between the student's location and each company, and the result is stored in a new column called After the calculation is performed, companies are sorted in ascending order based on the distance. This prioritizes companies closer to the student, making them more likely candidates for assignment. Here is the Haversine formula . written in equation . as follows: Algorithm 2 StudentCompanyAssignment Input: 4% , 9% HG'FM' Output: Begin Initialize AIa OI HG'FM' ' M ' M H' The area of expertise of lecturers and company description have been undergo preprocessing step where all numeric characters are removed from text to focus on textual data, nonalphanumeric characters are stripped out to standardize the The text is tokenized using spaCy, a powerful NLP tool, breaking it down into individual words or tokens. Common words that are irrelevant such as AutheAy. AuisAy and AuatAy are M' H'G q% Ia rMF jM4 PS q% Ia q sM'HtM q% q% Ia 4 MFFH'G q% q% Ia uMFF Ht H ' q% company assignments through encompassing variance and standard deviation analysis, utilization rate measurement, and spatial analysis. Besides, accuracy, precision, recall and F1score are used to evaluate the performance of the lecturercompany assignment. Algorithm 4 EmbedText Input: q750"&, g!6,05 , `. Output: x. Begin Ia MFkM yzq750"& {. F Return x. End M = g!6,05 . H'P_ = `. Variance Variance is used as an evaluation metric in this study to evaluate the goodness of an algorithm in the context of driving Variance measures the spread or dispersion of driving distance. 
A lower variance indicated that the driving distances were more consistent or stable. The equation of variance is written as where n is the number of data points. C% is each data point, and CI is the mean of the data. After preprocessing, the text data is converted into numerical form. The cleaned and tokenized text is transformed into vectors using Voyage AI embedding techniques using Auvoyage-large-2Ay. The embedding process captures semantic meaning and contextual relationships between words in the text. Algorithm 5 CalculateSimilarity Input: x. , x. Output: 4-%!%5"# Oc&%A C% Oe CI Standard Deviation Standard deviation is the square root of the variance and provides another measure of the amount of variation or dispersion in a set of values, which is driving distances. lower standard deviation could contribute to more reliable and stable outcomes as a low standard deviation implies that the values are tightly clustered around the mean. Begin WXY Z[\] }. A CA A AX 4-%!%5"# Ia 1 Oe H'M x. , x. EndFor Return AoeCACA[Y End There is the core component where the processed data is used to match lecturers to companies. The similarity between the lecturerAos and companyAos domain is calculated using cosine similarity. This measure helps in identifying how close or related two sets of text data are, based on their vector Based on similarity scores, the best matching domain for each lecturer and company are identified. This involves selecting the company domain that has the highest similarity score with the lecturerAos domain description. H'M HFH H ' M, AyI H K = A|A|A||I|| :MjH H ', = Oc&%A C% Oe CI Utilization Rate Measure The utilization rate of the company's resources was This rate was calculated by subtracting the remaining capacity from the initial capacity for each item in the company data, summing these values, and then dividing by the total capacity to find the percentage. 
The resulting utilization rate, expressed as a percentage, reflects the extent to which the company's resources were effectively employed during the period under review.

$$\text{Utilization Rate} = \frac{\sum\left(\text{capacity}_{\text{initial}} - \text{capacity}_{\text{remaining}}\right)}{\sum \text{capacity}_{\text{total}}} \times 100\%$$

Statistics
Variance depicts the spread or dispersion of values within each faculty's dataset, providing insight into how significantly the utilization rates deviate from the mean.

[Algorithm 6: LecturerCompanyMatching. Input: $L_i$, $C_i$. Output: Assignments. For each lecturer $L \in L_i$ and each company $C \in C_i$, if the domain assigned to $L$ equals the domain assigned to $C$, the pair ($L_i$, $C_i$) is added to Assignments and the inner loop breaks; Assignments is returned.]

Lecturers are assigned to companies where their expertise best matches the company's needs. The final matches are compiled in a structured format, such as a data frame, showing which lecturer is assigned to which company, along with the similarity scores and other relevant details.

Algorithm Evaluation
As the complexity of algorithms increases across various domains, evaluating their performance becomes critical to ensure efficiency and effectiveness. This section presents the evaluation of the student-company and lecturer-company assignment algorithms.

Accuracy
Accuracy measures the proportion of true results (both true positives and true negatives) among the total number of cases examined.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives.

Precision
Precision measures the accuracy of positive predictions. It is the ratio of true positives to all positives predicted by the classifier.

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall
Recall measures the algorithm's ability to identify all relevant instances. It is crucial in cases where missing a positive is significantly worse than falsely identifying a negative as positive.

$$\text{Recall} = \frac{TP}{TP + FN}$$

Fig. 6 displayed a dense concentration of red markers in the state of Selangor, indicating a high density of companies in a central location. This clustering of markers represented a significant number of opportunities for iCadets in that area.
Given the economic vibrancy and the abundance of corporate entities, students assigned to this region were likely to have access to a diverse range of professional experiences.

Fig. 6 Dense concentration of assigned students and companies in Selangor

F1-Score
The F1-score is the harmonic mean of precision and recall, and it provides a balance between the two metrics. It is particularly useful when the classes are imbalanced.

$$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

RESULTS AND DISCUSSION
In assessing driving distances and utilization patterns across faculties, significant variations were observed. Most faculties exhibited high variance and standard deviation in driving distances, reflecting the diverse locations of students and companies. Faculty 9 recorded the highest average driving distance at 0.3605 km, with a variance of 4.5699 and a standard deviation of 2.1377, indicating a broad range of travel distances. Conversely, Faculty 6 experienced a notably low utilization rate, primarily because the majority of students majored in accounting while the dataset contained no companies specializing in this area. This spatial mismatch, especially where companies were clustered in specific geographical areas, led to some underutilization.

After the algorithms were implemented, students and companies were matched by prioritizing their preferences. Five CSV files were generated: four based on distance thresholds of 10 km, 20 km, 30 km, and 40 km, and a final assignments file. The final assignments CSV file included all assignments for students who were not matched according to their preferences. This file contained the student ID, student and company major codes, company name, driving distance in kilometers, company capacity, and company current capacity. All students and companies had been successfully assigned at this stage.
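The tiered matching described above can be sketched as a greedy procedure. This sketch is not the authors' algorithm. It assumes that a student's preference means a company with the same major code, that driving distances have already been computed, and that students left over after the 40 km tier are placed in the nearest company with spare capacity. The function name and data layouts are hypothetical.

```python
THRESHOLDS_KM = (10, 20, 30, 40)  # distance tiers named in the text


def assign_students(students, companies, distances):
    """Greedy tiered assignment of students to companies.

    students:  {student_id: major_code}
    companies: {company_name: {"major": major_code, "capacity": int}}
    distances: {(student_id, company_name): driving distance in km}
    Returns {student_id: (company_name, tier)}; tier is a threshold or "final".
    """
    remaining = {name: info["capacity"] for name, info in companies.items()}
    assignments = {}

    def nearest_open(student_id, candidates):
        # Nearest candidate that still has capacity, or None if all are full.
        open_candidates = [c for c in candidates if remaining[c] > 0]
        return min(open_candidates,
                   key=lambda c: distances[(student_id, c)],
                   default=None)

    # Tiered passes: same major code (assumed preference) within each threshold.
    for limit in THRESHOLDS_KM:
        for sid, major in students.items():
            if sid in assignments:
                continue
            candidates = [
                name for name, info in companies.items()
                if info["major"] == major and distances[(sid, name)] <= limit
            ]
            choice = nearest_open(sid, candidates)
            if choice is not None:
                assignments[sid] = (choice, limit)
                remaining[choice] -= 1

    # Final pass: place anyone still unassigned wherever capacity remains.
    for sid in students:
        if sid not in assignments:
            choice = nearest_open(sid, list(companies))
            if choice is not None:
                assignments[sid] = (choice, "final")
                remaining[choice] -= 1
    return assignments
```

Grouping the returned pairs by tier gives one list per threshold plus a final list, which corresponds to the five output files described in the text. A greedy pass of this kind depends on the order in which students are visited, so it does not guarantee a globally minimal total distance.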
TABLE I FACULTY STATISTICS
Columns: Faculty, Mean, Variance, Standard Deviation, Utilization Rate (%)

From Fig. 5, markers were sparsely scattered across different states in Malaysia. Blue markers represented iCadets, whereas red markers with a briefcase icon likely signified companies. The presence of markers in both East and West Malaysia suggested that the assignment encompassed a national scope.

Fig. 5 Geographical distribution of assigned students and companies

For the lecturer-company assignment, Tables IV and V compare the performance metrics without and with embedding techniques in the domain assignment of lecturers. The results demonstrate that incorporating embeddings significantly enhanced the matching between domain descriptions and related subjects. Specifically, accuracy improved from 0.464 to 0.6071 and precision increased from 0.417 to 0.5058; recall saw an equal rise from 0.464, and the F1-score advanced from 0.417.

TABLE IV LECTURER DOMAIN ASSIGNMENT WITHOUT EMBEDDINGS
Row: Without Embedding (spaCy Method). Columns: Accuracy, Precision, Recall, F1-score

TABLE V LECTURER DOMAIN ASSIGNMENT WITH EMBEDDINGS
Row: With Embedding (Voyage AI). Columns: Accuracy, Precision, Recall, F1-score

Table VI shows that embeddings from 100-word descriptions perform better in domain assignment than those from 50-word descriptions. The superiority of the longer descriptions can be attributed to their ability to provide a more comprehensive and detailed representation of a company's profile. Accuracy improved significantly, from 0.6154 to 0.7692, when the description length was increased from 50 words to 100 words. This suggests that longer descriptions provide more detailed information, leading to better matches between lecturers and companies. Precision increased from 0.5744 to 0.7751 with longer descriptions. Higher precision indicates that a greater proportion of the matches identified were correct. Recall also improved with longer descriptions.
Higher recall means that a greater proportion of relevant matches were successfully identified. The F1-score, which is the harmonic mean of precision and recall, also increased, from a 50-word value of 0.5807. This comprehensive metric shows a balanced improvement in both precision and recall, demonstrating that longer descriptions significantly enhance the overall quality of the matching process.

TABLE VI COMPARISON OF DOMAIN ASSIGNMENT PERFORMANCE BY DESCRIPTION LENGTH

This study underscores the significant benefits of employing embedding techniques in lecturer-company matching. Through the application of machine learning methods, including Voyage AI embeddings and cosine similarity for domain matching, the use of embeddings has markedly increased the accuracy of matching lecturers to companies. The study documented improvements in all performance metrics, including accuracy, precision, recall, and F1-score, when embeddings were employed. However, the algorithms developed and used in this study, such as those for matching and distance calculation, might be tailored to specific conditions or datasets, which could limit their applicability in different contexts or with different data structures. For further improvement, a job portal dataset will be used to address the remaining unassigned students, who are currently assigned to any company with available space regardless of their profile, that is, their geographical location and specialization.

REFERENCES