Measuring Students' Academic Performance through Educational Data Mining

Using a simulated dataset for predicting students' academic performance, we study and compare various decision tree (DT) based algorithms (ID3, C4.5 and CART) with different choices of information entropy metrics (Shannon, Quadratic, Havrda and Charvát, Rényi, Taneja, Trigonometric and R-norm entropies) to build a decision tree, in order to provide appropriate counseling/advice at an earlier stage. The DT is an important technique in educational data mining (EDM) which creates hierarchical structures of classification rules "If ⋯, Then ⋯", building a tree by incrementally breaking down the dataset into smaller subsets. The results suggest that the basic training of the students has no significant predictive power on performance, while information about their abilities, diligence, motivation and activity in the learning process can predict their grades. As such, the resulting forecasts can be used by the instructor in optimizing the learning process and designing the course content and schedule.

Index Terms-Decision tree algorithms, educational data mining, entropy metrics and students' academic performance.

I. BACKGROUND
The prediction of students' academic performance (e.g., [1]), one of the well-studied educational data mining problems, can be accomplished in the three stages below.
Descriptive analytics - we collect students' historical data through an on-line survey and introduce seven attributes (or features) (e.g., [2]-[4]) to help us predict and make decisions related to their academic status.
Predictive analytics - we use various DT algorithms (e.g., [5], [6]) to obtain a predicted final grade for a student, namely, a student's learning outcome and pattern based on his/her past data.
Prescriptive analytics - we show how we use different decision tree algorithms (ID3, C4.5 and CART) with various feature selection methods for deeper analysis, e.g., we display the If-Then figure and report the precision and accuracy of each method. Based on the final learning outcomes, we can create a better strategy for designing the content and schedules of various MATH-related courses at the Chinese University of Hong Kong (CUHK).
There is a significant number of articles (e.g., [1], [7]-[9]) relating to the different choices of information entropy methods used in this study. Despite the comparative study of various information entropy methods on eight real-world datasets in [1], the analysis and prediction of students' learning performance from an algorithmic perspective has not yet been fully investigated.

II. PURPOSE OF STUDY
In this paper, we address the following issues:

A. Handling Large Data Samples
We test our homemade codes on a set of students' learning data samples and show how effective the DT rule-based algorithms are. As the size of the required data increases, will the obtained results on students' learning patterns/features be more relevant, and will different DT algorithms and entropy measures give the same result?

B. Understanding Data Mining Actionable Trends
To generate multiple sets of student learning data, the random nested sampling for evolving data streams, inspired by [10], is used. Hence, many of the original data are repeated in the resulting simulated data set. As the size of the required data increases, at each evolution stage we added 10% more noisy data to the previous data samples. A random selection procedure was used to obtain the noisy data from the normal distribution function; that is, we randomly generated seven feature values that are significantly related to students' end-of-semester marks. In other words, these data are drawn by random sampling and are not obtained from the students' survey; hence, we call them noisy data. How will the noisy data in the synthetic dataset affect the students' prediction results? This process acts like real-time instances of overfitting. We used the so-called back-track pruned (BT-pruned) algorithm [11] to reorganize the nodes of the constructed tree in order to overcome this drawback when large data samples were used.
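The noise-injection stage described above can be sketched as follows. This is a minimal Python illustration, not the paper's own scripts: the function name `add_noise_stage` and the choice of perturbing randomly chosen existing records with Gaussian noise are our own assumptions.

```python
import random

def add_noise_stage(samples, noise_frac=0.10, mu=0.0, sigma=1.0, seed=0):
    """Append `noise_frac` noisy records drawn via a normal distribution.

    Hypothetical stand-in for one evolution stage: each new record perturbs
    a randomly chosen existing record's seven feature values.
    """
    rng = random.Random(seed)
    n_noisy = int(len(samples) * noise_frac)
    noisy = []
    for _ in range(n_noisy):
        base = rng.choice(samples)
        noisy.append([x + rng.gauss(mu, sigma) for x in base])
    return samples + noisy

# usage: 75 seed records with 7 features each -> 82 after one 10% stage
seed_data = [[float(i + j) for j in range(7)] for i in range(75)]
evolved = add_noise_stage(seed_data)
print(len(evolved))  # 82
```

The noisy records keep the same seven-feature shape as the survey data, so the downstream DT algorithms can consume the evolved set unchanged.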

C. Gaining Perceptive Knowledge for Teaching Purposes
In order to examine the training and testing datasets, the ten-fold cross validation model will be used. Will our predictions match the testing dataset?
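The ten-fold cross validation split mentioned above can be sketched as follows; this is an illustrative Python version, and the helper names are ours, not the paper's.

```python
import random

def ten_fold_indices(n, seed=0):
    """Shuffle indices 0..n-1 and deal them into 10 near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

def cross_validate(n):
    """Yield (train, test) index lists, holding out one fold at a time."""
    folds = ten_fold_indices(n)
    for k in range(10):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

splits = list(cross_validate(200))
print(len(splits))                           # 10
print(len(splits[0][0]), len(splits[0][1]))  # 180 20
```

Each record appears in exactly one test fold, so the ten accuracy values can be averaged into a single cross-validated estimate.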

A. Problem Description: Studying Students' Learning Activities
The dataset used here is collected from three sources:

Sample A
We used an observed dataset from [12] as a seed; the size of this data set is 50 (see Table I). We then combined this seed data set with our real data set of size 25 (see Table II). Next, we added 10% of the noisy data mentioned earlier, generated through the normal distribution function, to the mix of the two combined data sets (75 records). Using the concept of random nested sampling, we generated different sets of repeated data for designed data set sizes of 200, 400, 800, 1600, 3200 and 6400.
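The repeated resampling toward the designed sizes can be sketched as below. This is a hypothetical Python rendering of the procedure, with `nested_resample` our own name, assuming simple sampling with replacement from the previous stage so that many original records recur.

```python
import random

def nested_resample(base, sizes=(200, 400, 800, 1600, 3200, 6400), seed=0):
    """Grow successive datasets by resampling (with replacement) from the
    previous stage, so many original records are repeated."""
    rng = random.Random(seed)
    datasets = []
    current = list(base)
    for n in sizes:
        current = [rng.choice(current) for _ in range(n)]
        datasets.append(current)
    return datasets

base = list(range(75))  # stand-in for the 75 combined seed + real records
stages = nested_resample(base)
print([len(d) for d in stages])  # [200, 400, 800, 1600, 3200, 6400]
```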

Sample B
As a real pilot study, we collected a dataset from the SAYT1510 course offered by the Chinese University of Hong Kong. We conducted an on-line survey via Studying Students' Learning Activities. Students who took SAYT1510 were international and local high school students.

Sample C
To validate our models, we combined the two data samples, Sample A and Sample B, into a set called Sample C, and repeated the same random nested sampling procedure as above.

B. Training Dataset
Our goal here is to study and predict students' learning activities based on a set of attributes described in [12]; similar works are found in [13] and [14]. The training datasets combined from Sample A are used to build the model, as shown in Table I (samples one to fifty). We set the sizes of the data samples to 200, 400, 800, 1600, 3200 and 6400. The training set for our example is defined by the target class "students' learning activities" using the end-of-semester marks with four modalities: First, Second, Third, or Fail. The seven attributes describing the observations are: "Overall Semester Marks (OSM)", "Class Test (CT)", "Seminar Performance (SP)", "Assignment Performance (AP)", "Paper Presentations (PP)", "Attendance (ATT)", and "Laboratory Classes (LC)".

C. Testing Dataset
The testing dataset was composed of the data from the 25 students in the SAYT1510 course, as shown in Table II.

D. Methodology: Decision Tree Algorithms
Decision tree classifiers are used to predict students' final marks. A decision tree is a flowchart-like structure in which the leaf nodes represent class labels and the non-leaf nodes represent attributes.
1) ID3
ID3, developed by Quinlan in 1979 [15], constructs a decision tree by employing a top-down, greedy search through the given training data, testing each attribute at every node; the greedy search follows the heuristic of making the locally optimal choice at each node. By making these locally optimal choices, we approximate the globally optimal solution.
The ID3 algorithm can be summarized as follows:
a) At each level (or stage, or node), select the best feature as the test condition (in this paper, seven features are considered).
b) Split the node into the possible outcomes (internal nodes).
c) Repeat the above steps until all the test conditions have been exhausted into leaf nodes.
In (a), to make that decision, we need to have some knowledge about entropy and information gain. Based on the computed values of entropy and information gain, we choose the best attribute at any particular step.
To be more precise, the ID3 algorithm selects the attribute to be split based on two metrics:
a) Measuring impurity: an entropy metric measures the amount of information in an attribute. The entropy is calculated for all the remaining attributes, and the split occurs at the attribute that has the smallest entropy. Given the class probabilities p_i, the Shannon entropy of a set S is Entropy(S) = − Σ_i p_i log2 p_i, where Value(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v (i.e., S_v = {s ∈ S | A(s) = v}).
b) Information gain: Gain(S, A) = Entropy(S) − Σ_{v ∈ Value(A)} (|S_v|/|S|) Entropy(S_v), the expected reduction in entropy caused by knowing the value of attribute A.
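As a concrete illustration of the entropy and information gain computations just described, here is a small self-contained Python sketch; the toy data and function names are ours, not the paper's.

```python
from collections import Counter
from math import log2

def shannon_entropy(labels):
    """Entropy(S) = -sum p_i log2 p_i over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(labels)
    gain = shannon_entropy(labels)
    for v in set(row[attr] for row in rows):
        sub = [lab for row, lab in zip(rows, labels) if row[attr] == v]
        gain -= (len(sub) / n) * shannon_entropy(sub)
    return gain

# toy data: attribute 0 perfectly predicts the grade, attribute 1 does not
rows   = [{0: "hi", 1: "a"}, {0: "hi", 1: "b"}, {0: "lo", 1: "a"}, {0: "lo", 1: "b"}]
labels = ["First", "First", "Fail", "Fail"]
print(round(information_gain(rows, labels, 0), 3))  # 1.0
print(round(information_gain(rows, labels, 1), 3))  # 0.0
```

The perfectly predictive attribute yields the full one bit of gain, so ID3 would choose it as the test condition at this node.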
2) C4.5
C4.5, known as J48 in WEKA (Waikato Environment for Knowledge Analysis), is a successor of ID3 developed by Quinlan in 1992 [15] that is also based on Hunt's algorithm. C4.5 not only handles both categorical and continuous attributes to build a decision tree, but also makes use of the Gain Ratio(S, A), which is computed as
Gain Ratio(S, A) = Gain(S, A) / Split Information(S, A),
where Split Information(S, A) represents the information generated by splitting the training set S into the partitions that correspond to the outcomes of a test on attribute A. The attribute with the highest Gain Ratio is selected as the splitting attribute. As in the ID3 algorithm, the expected entropy in the second term of Gain(S, A) is simply the sum of the entropies of each subset S_v, weighted by the fraction |S_v|/|S| of examples that belong to S_v; Gain(S, A) is therefore the expected reduction in entropy caused by knowing the value of attribute A:
Gain(S, A) = Entropy(S) − Σ_{v ∈ Value(A)} (|S_v|/|S|) Entropy(S_v).
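The Gain Ratio computation can be illustrated with a short Python sketch (again with our own toy data); note how an attribute with many distinct values is penalized by the Split Information term.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """Gain Ratio(S, A) = Gain(S, A) / Split Information(S, A)."""
    n = len(labels)
    gain, split_info = entropy(labels), 0.0
    for v in set(row[attr] for row in rows):
        sub = [lab for row, lab in zip(rows, labels) if row[attr] == v]
        p = len(sub) / n
        gain -= p * entropy(sub)
        split_info -= p * log2(p)
    return gain / split_info if split_info else 0.0

# an attribute with four distinct values: full gain (1 bit) but large
# Split Information (2 bits), so the ratio is halved
rows   = [{"CT": "A"}, {"CT": "B"}, {"CT": "C"}, {"CT": "D"}]
labels = ["First", "First", "Fail", "Fail"]
print(round(gain_ratio(rows, labels, "CT"), 3))  # 0.5
```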

3) Classification and regression tree (CART)
CART was introduced by Breiman et al. in 1984 [16] and is also based on Hunt's algorithm. CART handles both categorical and continuous attributes to build a decision tree, and it handles missing values. CART uses the Gini Index as an attribute selection measure to build a decision tree. Unlike the ID3 and C4.5 algorithms, CART constructs binary splits and hence binary trees. The Gini Index measure does not use probabilistic assumptions like ID3 and C4.5. CART uses cost-complexity pruning to remove unreliable branches from the decision tree to improve the accuracy. The Gini impurity is defined as 1 minus the sum of the squares of the class probabilities in a dataset:
Gini Impurity(S) = 1 − Σ_{i=1}^{k} p_i^2.
The Gini Index is then defined as the weighted sum of the Gini impurities of the different subsets after a split. The Gini Index of a pure table consisting of a single class is zero because the class probability is 1. Similar to the entropy, the Gini Index also reaches its maximum value when all classes in the table have equal probability.
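The Gini impurity and the Gini Index of a binary CART split can be sketched as follows (an illustrative Python version with our own toy labels):

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(left, right):
    """Weighted Gini impurity of the two subsets of a binary split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) \
         + (len(right) / n) * gini_impurity(right)

print(gini_impurity(["First"] * 4))                      # 0.0 (pure node)
print(gini_impurity(["First", "Fail"]))                  # 0.5 (equal classes)
print(gini_index(["First", "First"], ["Fail", "Fail"]))  # 0.0 (perfect split)
```

As the text notes, a pure node has zero impurity, and two equiprobable classes give the two-class maximum of 0.5.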

4) Entropy metrics
In what follows, we examine various entropy metrics in order to relax the complexity of the DT constructions with respect to the increasing number of nodes and leaves: the Quadratic entropy [17], the Havrda and Charvát entropy, the Rényi entropy, the Taneja entropy, the Trigonometric entropy and the R-norm entropy. Several of these metrics involve a parameter adjusted by the user and recover the Shannon entropy in the limit as that parameter tends to 1.
To the very best of our knowledge, using Trigonometric entropy and R − norm entropy for analyzing student performance has not been done yet.
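For illustration, two of the parametric entropies have well-known textbook forms, sketched below in Python. These are the standard Rényi and Havrda and Charvát definitions, not necessarily the exact variants used in the paper; both approach the Shannon value (1 bit for a fair two-class split) as the parameter alpha tends to 1.

```python
from math import log2

def renyi_entropy(probs, alpha):
    """Rényi entropy H_a = (1 / (1 - a)) * log2(sum p_i^a), a != 1."""
    return log2(sum(p ** alpha for p in probs)) / (1.0 - alpha)

def havrda_charvat_entropy(probs, alpha):
    """Havrda and Charvát entropy H_a = (sum p_i^a - 1) / (2^(1-a) - 1), a != 1."""
    return (sum(p ** alpha for p in probs) - 1.0) / (2.0 ** (1.0 - alpha) - 1.0)

probs = [0.5, 0.5]
# both metrics approach the Shannon value (1 bit here) as alpha -> 1
print(round(renyi_entropy(probs, 1.001), 3))           # 1.0
print(round(havrda_charvat_entropy(probs, 1.001), 3))  # 1.0
```

In a DT context, either function can be dropped in place of `shannon_entropy` in the information gain computation, with alpha tuned by the user.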

IV. MAIN ARGUMENT
We have established a list of the R scripts used for our predictions and for finding interesting patterns in different educational data mining models. First, we used the ID3 algorithm with the Shannon entropy and compared our results with the WEKA tool. Both results, in terms of the decision tree figures, show no significant differences; hence, the results obtained using our source codes are reported below. Fig. 1 provides the decision tree based on the ranking of the seven attributes using the C4.5 algorithm with the Shannon entropy and a 70%:30% split, where we took the 75 data samples from Table I and Table II. The knowledge represented by the decision tree can be extracted and represented in the form of If-Then rules; we list only some of the simpler ones.
• The confusion matrix shows the reliability of an algorithm, i.e., how accurate it is, by tabulating the actual values against the predicted classifications, as illustrated in Table III, where True Positive (TP) is the number of positive instances that are correctly classified, True Negative (TN) is the number of negative instances that are correctly classified, False Negative (FN) is the number of positive instances that are incorrectly classified as negative, and False Positive (FP) is the number of negative instances that are incorrectly classified as positive.
International Journal of Information and Education Technology, Vol. 10, No. 11, November 2020
By expressing the values as percentages, we have the following:
• Precision is the fraction of retrieved instances that are relevant: Precision = TP / (TP + FP) × 100%, i.e., the total number of true positives divided by the sum of the total numbers of true positives and false positives.
• Recall is the fraction of relevant instances that are retrieved: Recall = TP / (TP + FN) × 100%.
The results are summarized in Table IV, where the synthetic training and testing data ratio, using the Shannon entropy for each algorithm, is 70%:30%. Increasing the size of the synthetic samples shows the accuracy and the convergence of each algorithm using both the BT-pruned and unpruned versions. Table IV shows that, on the Sample A dataset, the BT-pruned algorithm not only has an accuracy trend similar to that of the unpruned algorithm, but also greatly reduces the number of nodes when large data samples are used. For the BT-pruned DT algorithms at a sample size of 6400, the performance ranking is CART > ID3 > C4.5, while for the unpruned DT algorithms it is CART > C4.5 > ID3.
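These percentage formulas can be checked with a short Python sketch, using, for example, the ID3 "First" row of Table V (TP = 433, FP = 87, FN = 132):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN), both as percentages."""
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    return precision, recall

# ID3 "First" counts from Table V
p, r = precision_recall(433, 87, 132)
print(round(p, 2), round(r, 2))  # 83.27 76.64
```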
For further validation testing, we compared our C4.5 decision tree with the one generated by WEKA, where both figures for each sample size are the same. Unless otherwise stated, we used our homemade codes to test the different entropy results using the BT-pruned algorithm. Concerning the choice of different parameters against different choices of entropy methods, the results of applying the BT-pruned ID3, C4.5 and CART DT algorithms, in terms of TP, TN, FP and FN, are summarized in Table V, where an asterisk marks the highest number of true positives from the testing data:

TABLE V: TP, TN, FP AND FN OF THE BT-PRUNED DT ALGORITHMS

Algorithm  Grade     TP    TN    FP   FN
ID3        First    433   1268    87  132
           Second   412   1291    97  120
           Third    380   1355   128   57
           Fail     266   1476   117    6
C4.5       First    430   1266    90  134
           Second   410   1277    99  134
           Third    393*  1335   115   77
           Fail     256   1473   127   64
CART       First    437*  1265    83  135
           Second   415*  1289    94  122
           Third    381   1360   127   52
           Fail     278*  1460   105   77

Here is a list of interesting findings:
• The CART DT algorithm with the Gini Index attribute split gives the best TP measure, as shown in Table V.
• Table VI summarizes the accuracy of each entropy method and reveals that, with a specified choice of parameters, the Havrda and Charvát entropy is the best one. We also used the Naïve Bayes classifier to study the same problem in WEKA, where its accuracy was 59.22%.
• Table VI shows that using the Quadratic, Havrda and Charvát, Rényi, Taneja, Trigonometric and R-norm entropies results in better accuracy rates than the Shannon entropy. Here are a few observations:
- We agree with [17] that, using the Quadratic entropy, the accuracy of the BT-pruned ID3 algorithm is better than that of the BT-pruned C4.5 one.
- We agree with [19] that using the Rényi entropy is better than using the Shannon one in terms of the accuracy measurement.
- We agree with [20] that if large values are selected for the pair of parameters, the accuracy of the Taneja entropy will be close to that of the Shannon entropy.
- Our findings show that using the Trigonometric and R-norm entropies with a specified choice of parameters maintains good accuracy.
For illustrative purposes, using the BT-pruned DT algorithms with the Rényi entropy, we set the sample size to 6400 and calculated the mean decreasing gain (or mean decreasing information gain), the mean decreasing gain ratio and the mean decreasing Gini for ID3, C4.5 and CART, respectively, against the set of attributes. We then extracted the ranking of the attributes based on the mean decreasing values, as shown in Fig. 2. Inspection of Table VII indicates that the attribute with the lowest mean decreasing value is the parent node, e.g., OSM for ID3.

2) Split validation tests
In what follows, we only report the results of using the BT-pruned C4.5 DT algorithm with the Shannon entropy, the information gain feature selection model and a sample size of 6400. In split validation, a training experiment is conducted based on a predetermined split ratio: for example, we used 10% of the Sample C dataset. For each experiment, we repeated the calculations 10 times and took the average of the 10 accuracy values, e.g., of their precision and recall values. The overall accuracy of each experiment is summarized in Table VIII, where the accuracy/precision/recall values of each modality are given. Since the Sample B testing dataset does not contain any Fail grades, the last row of Table VIII has no value. What we learned here is that the more data we collect, the more accurate the predicted outcome is likely to be. In other words, if the training dataset contains a large volume of students' learning performances, then the user can predict their learning outcomes with high precision when a suitable EDM algorithm is used.

V. CONCLUSIONS
This paper describes a study for predicting students' academic performance using only an online questionnaire survey. Although we only had a limited amount of real data (from 25 students), we used the random nested sampling method to generate a large class of data based on published results. We implemented different BT-pruned decision tree algorithms with different entropy methods. Using split validation, we showed that our homemade codes yield good prediction accuracy even when the training dataset is large and influenced by noisy data; hence, the If-Then decision rules provide more accurate results.
Properly used, the BT-pruned decision tree algorithms developed in this study could help predict students' learning performances, which could be used to identify students who would benefit from early intervention or to design students' activities according to their skills and knowledge. By following the procedures described in this paper, when facing noisy or contaminated data, practitioners may use the pruned decision tree algorithm to improve the generalization performance of decision tree induction and gain more insight into their students' performances.
Directions for further research emerged while this study was being conducted. The most significant would be to extract data via a qualitative approach; based on these data, we would have a better understanding of students' needs. In addition, when the size of the data samples increases, visualization of the decision tree becomes nearly impossible, so tracing the If-Then rules is also our next research objective. Based on our preliminary work, in order to obtain a big picture of students' academic performance in our courses, we will collect more data, examine our proposed algorithms and publish our results elsewhere.
Jeff Chak-Fu Wong received his B.S. and M.S. degrees in mathematics and geodesy from the University of New Brunswick, Canada in 1997 and 2001, respectively, and his Ph.D. degree in mathematics from the Chinese University of Hong Kong, Hong Kong in 2004. He is currently a senior lecturer at the Department of Mathematics, Chinese University of Hong Kong. His research interests include computational social networks, data mining and machine learning.
Tony Chun-Yin Yip received his B.S. in mathematics from University of Hong Kong, Hong Kong, in 2019. His present interests include data mining, machine learning and AI related problems.