[IEEE 2011 International Conference on Research and Innovation in Information Systems (ICRIIS) -...



Application Of Self Organizing Map For Knowledge Discovery Based In Higher Education Data

Robab Saadatdoost
Faculty of Computer Science and Information Systems
Universiti Teknologi Malaysia
Johor Bahru, Malaysia

Alex Tze Hiang Sim
Faculty of Computer Science and Information Systems
Universiti Teknologi Malaysia
Johor Bahru, Malaysia

Hosein Jafarkarimi
Faculty of Computer Science and Information Systems
Universiti Teknologi Malaysia
Johor Bahru, Malaysia

Abstract: This paper focuses on knowledge discovery among the attributes of Iran higher education institutes using the self organizing map (SOM). The key problem with a massive volume of data is extracting the knowledge and patterns hidden in it. Managers need to explore this data for decision making and strategy making, which makes the problem important; the results can also be useful to researchers who study higher education. Planning for higher education has a significant impact on the development of a society, and successful planning requires analysis of the large historical datasets available in higher education institutes. SOM is a particular type of neural network used in clustering that helps discover patterns and relations without prior knowledge about them. The approach has five steps: (i) data preparation, (ii) data loading, (iii) initializing, (iv) map training and (v) interpretation of the results. The target dataset contains data from five universities located in Tehran, Iran, affiliated to the Medical Ministry of Iran; the most important attributes are program of study, learning style, study mode and degree. Results show that the number of students enrolling at Tehran University of Medical Sciences decreased over the 17 years from 1988 to 2005. This study also finds that Tehran University of Medical Sciences covers the majority of high degrees such as MD (Doctor of Medicine) and PhD. The findings can be used to improve higher education decision making systems, and the results indicate the utility of the SOM Toolbox for visual knowledge discovery in similar institutes.

Key words: knowledge discovery, Iran Higher Education, Self Organizing Map, Clustering, Medical Ministry of Iran.

I. INTRODUCTION

The key problem with a massive volume of data is extracting the knowledge hidden in it; a dataset contains many patterns and relations between attributes that are worthwhile for management and helpful in planning and decision making. A considerable amount of literature has been published on the SOM (Self Organizing Map); the literature that we followed most closely concerns the usage and performance of the SOM Toolbox (Alvarez-Guerra, Molina et al. 2011) and (Vesanto, Himberg et al. 1999). SOM is a particular type of neural network used in clustering. It is a common data mining technique based on unsupervised learning that helps to extract patterns and relations without prior knowledge about them. SOM is one of the most important algorithms in data visualization (Alhoniemi, Himberg et al. 2002,2003). It is used for data pre-processing, initializing and training, and for visualizing and analysing features of the data and of the SOM itself, such as relations between variables, clusters and the quality of the SOM (Vesanto, Himberg et al. 1999). The SOM Toolbox is therefore suitable for data understanding and analysis.

In this project, we used higher education data as the input file of the SOM to discover relations and facts that are vital for the management of higher education and for researchers in this field. Any similar institute can follow our process to do knowledge discovery. In this work, we focused on the usage of SOM for clustering and on its results, which are shown as a U-Matrix (Unified Distance Matrix) and a component plane for each attribute. By visualizing and comparing these maps we can find relations between attributes and uncover facts hidden in the data.

The paper proceeds by reviewing the SOM and related case studies in section II, followed by the five-step methodology in section III. The results of the study are then given in section IV, followed by the conclusion in section V.

II. SELF ORGANIZING MAP

The self organizing map is an unsupervised neural network that reduces high dimensional vectors to a low dimensional, typically two dimensional, map. It is known as Kohonen’s self organizing map. A newer SOM architecture, WEBSOM, has been developed in Kohonen’s laboratory for textual data mining. The SOM clusters data into groups. The SOM algorithm contains three stages: competitive, cooperative and adaptive. One characteristic of SOM is that it develops an organized network in which similarities are shared between neighboring neurons and similar areas are activated by similar patterns (Bação, Lobo et al. 2008). All neurons compete with each other for each input pattern; the neuron selected for the input pattern wins the competition and is activated (Xiao, Dow et al. 2003). The winning neuron updates its own values and those of its neighborhood. After this adaptation, similar clusters lie near each other (Xiao, Dow et al. 2003).
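The three stages described above can be illustrated with a minimal Python sketch of a single training step. This is not the SOM Toolbox implementation used in this paper; the rectangular grid, the Gaussian neighborhood, and the names `make_map`, `best_matching_unit` and `train_step` are assumptions chosen for illustration only.

```python
import math
import random


def make_map(rows, cols, dim, seed=0):
    """Create a rows x cols map whose units hold random weight vectors in [0, 1)."""
    rng = random.Random(seed)
    return {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}


def best_matching_unit(weights, x):
    """Competitive stage: return the grid position whose weight vector is closest to x."""
    return min(weights, key=lambda pos: math.dist(weights[pos], x))


def train_step(weights, x, learning_rate, radius):
    """Apply one competitive, cooperative and adaptive update for input vector x."""
    winner = best_matching_unit(weights, x)
    for pos, w in weights.items():
        # Cooperative stage: Gaussian neighborhood centered on the winner
        # (an assumed neighborhood function, measured on the map grid).
        grid_dist = math.dist(pos, winner)
        influence = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
        # Adaptive stage: move the weight vector toward the input.
        weights[pos] = [wi + learning_rate * influence * (xi - wi)
                        for wi, xi in zip(w, x)]
    return winner


som = make_map(rows=4, cols=4, dim=2)
sample = [0.9, 0.1]  # hypothetical input vector
before = math.dist(som[best_matching_unit(som, sample)], sample)
winner = train_step(som, sample, learning_rate=0.5, radius=1.0)
after = math.dist(som[winner], sample)
```

Because the winner has a neighborhood influence of 1, its distance to the input shrinks by the learning rate factor, while units farther away on the grid move less. Repeating this step over many inputs, usually with a decreasing learning rate and radius, is what places similar inputs on nearby units.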

One study using the self organizing map is “Evaluating students computer based learning” (Durfee, Schneberger et al. 2007). Its input data contains the answers of 400 students to thirty six survey questions, and clustering was done on this input data. Using data mining techniques, the authors found four student clusters. This assessment of students can help to improve computer based learning and evaluation (Durfee, Schneberger et al. 2007).

SOM was used in another case study, on musical artist recommendation. The input to the SOM was a set of text documents, artist reviews extracted from the Amazon web service. These documents were organized so that similar documents were located near each other (Vembu and Baumann 2004). The task explored similarities between artist reviews in order to suggest similar artists in a recommendation service (Vembu and Baumann 2004). The authors presented results for 400 musical artists and validated them against another common recommendation service, Echocloud (Vembu and Baumann 2004).

III. METHODOLOGY

In this study we followed the project steps suggested by (Alvarez-Guerra, Molina et al. 2011) and (Vesanto, Himberg et al. 1999) to do knowledge discovery using SOM. These are (i) data preparation, (ii) data loading, (iii) initializing, (iv) map training and (v) interpretation of the results. In data preparation, the data is prepared according to the self organizing map specification, as described below.

a) Step (i): DATA PREPARATION

The variables in the SOM should be numerical, so the values of categorical attributes are labeled with numerical codes. Some examples are shown below:

TABLE I. UNIVERSITY CODE

UNIVERSITY
Item                                               Label
Blood Transfusion Institute                        1
Institute Pastor Iran                              2
Iran University of Medical Sciences                3
Shaheed Beheshti University of Medical Sciences    4
Tehran University of Medical Sciences              5

TABLE II. DEGREE CODE

DEGREE
Item                       Label
Certificate                0
Bachelor                   1
Master                     2
MD (Doctor of Medicine)    3
PhD                        4
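The labeling in Tables I and II can be sketched in Python as a pair of lookup tables. The codes are copied from the tables; the function names, the short key "md" for the Doctor of Medicine degree, and the case-insensitive matching are assumptions made for this illustration and are not part of the original Matlab workflow.

```python
# Codes copied from Table I (keys stored in lowercase for matching).
UNIVERSITY_CODES = {
    "blood transfusion institute": 1,
    "institute pastor iran": 2,
    "iran university of medical sciences": 3,
    "shaheed beheshti university of medical sciences": 4,
    "tehran university of medical sciences": 5,
}

# Codes copied from Table II; "md" is an assumed short key for Doctor of Medicine.
DEGREE_CODES = {
    "certificate": 0,
    "bachelor": 1,
    "master": 2,
    "md": 3,
    "phd": 4,
}


def encode(value, codes):
    """Map one categorical value to its numeric label, ignoring case and outer spaces."""
    return codes[value.strip().casefold()]


def encode_record(university, degree):
    """Turn one (university, degree) pair into a numeric vector for the SOM."""
    return [encode(university, UNIVERSITY_CODES), encode(degree, DEGREE_CODES)]


record = encode_record("Tehran University of Medical Sciences", "PhD")
```

One design point to keep in mind: integer codes give the SOM a notion of distance between categories. For the degree attribute the codes follow the academic order, but for the university attribute the order is arbitrary, so distances between university codes carry no meaning of their own.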

b) Step (ii - iv): Loading – Initializing – Training

Next, we used the Matlab command window and loaded the data into memory with the command “SD=som_read_data();”. There are separate commands for the remaining steps, but in this study we used the command “som_gui([SD]);”. Running this command opens the window shown in Figure 1. It contains parts for loading the dataset, initializing the map and training the map. Each part has default values and settings that can be changed. In this project we changed only the map size in the initializing part and tried various sizes to choose the best one.

Figure 1. Initialization and Training of Map

We selected [70, 50] as the map size and trained the map using the default values. After training, the map can be visualized through the utilities tab in this window. Figure 2 shows the U-Matrix and the component plane of each attribute, which are explained in section IV.
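The paper does not state which criterion was used to judge the best map size. One common measure, shown here only as an assumed example, is the average quantization error: the mean distance between each data vector and its closest map unit. The function name and the toy vectors below are hypothetical.

```python
import math


def quantization_error(codebook, data):
    """Mean distance from each data vector to its closest codebook (map unit) vector."""
    return sum(min(math.dist(w, x) for w in codebook) for x in data) / len(data)


# Hypothetical toy data, not taken from the higher education dataset.
data = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
coarse_map = [[0.5, 0.5]]                           # a single unit
fine_map = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]     # one unit per data vector

qe_coarse = quantization_error(coarse_map, data)
qe_fine = quantization_error(fine_map, data)
```

Larger maps tend to lower this error simply because they have more units, so it is usually weighed against how well the map preserves neighborhood structure rather than used as the only criterion.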

Figure 2. Visualizing of U-Matrix and component planes

c) Step (v): Interpretation

After the training phase, we can visualize the map, consisting of the U-Matrix and the component planes of the attributes, as shown in Figure 2. Clusters can be detected from the map. The U-Matrix needs to be interpreted to obtain meaningful facts.

The U-Matrix is a standard tool for presenting distances between input data. It computes the local distance structure of the data and is useful for identifying new and significant knowledge in a data set (Ultsch 2003).
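The idea of a local distance structure can be made concrete with a simplified Python sketch. It assigns each map unit the mean distance between its weight vector and those of its grid neighbors. The rectangular grid, the four-neighbor rule and the function name are assumptions; the SOM Toolbox computes its U-Matrix differently in detail, so this is an illustration of the concept rather than a reproduction of Figure 3.

```python
import math


def u_matrix(weights, rows, cols):
    """Mean distance from each unit's weight vector to its 4-connected grid neighbors.

    `weights[r][c]` is the weight vector of the unit at row r, column c.
    The map is assumed to contain at least two units.
    """
    heights = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbors = [(r + dr, c + dc)
                         for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                         if 0 <= r + dr < rows and 0 <= c + dc < cols]
            heights[r][c] = sum(math.dist(weights[r][c], weights[nr][nc])
                                for nr, nc in neighbors) / len(neighbors)
    return heights


# Hypothetical 1 x 4 map with one-dimensional weights forming two groups.
toy_weights = [[[0.0], [0.1], [5.0], [5.1]]]
heights = u_matrix(toy_weights, rows=1, cols=4)
```

In this toy map the two middle units receive high values because they sit on the border between the two groups, while the outer units receive low values. Regions of low values separated by ridges of high values are what is read as clusters.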

The U-Matrix appears in the top corner of Figure 2; Figure 3 shows an enlarged version, with two regions corresponding to low and high values. Based on this U-Matrix, the properties of each cluster are described in Table III:

Figure 3. U-Matrix (Unified Distance Matrix)

TABLE III. CLUSTERS IN THE SOM ANALYSIS

Attribute     Cluster 1                                                 Cluster 2
University    Iran University of Medical Sciences, Shaheed Beheshti     Tehran University of Medical Sciences
              University of Medical Sciences, Tehran University of
              Medical Sciences
Degree        Certificate, Bachelor, Master, MD (Doctor of Medicine),   MD (Doctor of Medicine)
              PhD
Type          Continuous and discontinuous                              Continuous
Study mode    Daily, Equivalent, and Overnight                          Daily
Year          Several years                                             Between 1370 and 1382

Table III describes the two clusters detected from the U-Matrix. Cluster 1 is a large cluster that includes Iran University of Medical Sciences, Shaheed Beheshti University of Medical Sciences and Tehran University of Medical Sciences, with Certificate, Bachelor, Master, MD (Doctor of Medicine) and PhD degrees and the other properties listed. Cluster 2 contains Tehran University of Medical Sciences with the MD (Doctor of Medicine) degree. The two groups differ substantially; cluster 2, with its small number of elements, points to a single university and can be selected by researchers for further analysis.

The attributes are described in Table IV; based on this table, the results are analyzed in the next section.

TABLE IV. ATTRIBUTES DEFINITIONS

Year
The first registration year (Gregorian year = solar year + 621).

Type
• Continuous: The program runs in consecutive years without any stop.
• Discontinuous: The program proceeds step by step; for example, a student first completes the certificate degree and then starts the bachelor degree.

Study mode
• Daily: The student attends class without a tuition fee, in a face to face learning style.
• Overnight: The student studies with a tuition fee, in a face to face learning style.
• Equivalent: It involves someone who cannot finish their course or who enters university without an exam.

Degree
• Certificate: The lowest academic degree in Iran.
• Bachelor: Students can start it after high school or after the certificate period.
• Master: It follows the bachelor degree.
• MD (Doctor of Medicine): It takes 6 years and relates to medical and agricultural programs.
• PhD: The highest degree, which starts after the master or MD degree.
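The year offset given in Table IV (add 621 to the solar year) can be checked against the year pairs quoted in the results with a small Python sketch. The function name is hypothetical.

```python
def solar_to_western_year(solar_year):
    """Convert a solar calendar year to the Western year using the offset from Table IV."""
    return solar_year + 621


# Year pairs as they appear in the results: 1367, 1375 and 1384.
converted = {year: solar_to_western_year(year) for year in (1367, 1375, 1384)}
```

The fixed offset is an approximation: a solar year begins in March, so it overlaps two Western calendar years, and the offset of 621 gives the year in which the solar year starts.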

IV. RESULTS AND DISCUSSION

The university and year component planes from Figure 2 are shown separately in Figure 4 so that they can be compared. Comparing the two planes shows that the small oval in the university plane contains a high density of red (Tehran University of Medical Sciences), while the red thins out in the large oval. The small oval in the year plane is almost filled with blue, corresponding to years around 1988 (1367), and the large oval in the year plane is mostly red, corresponding to years around 2005 (1384).

Overall, the high density of red for Tehran University of Medical Sciences around 1988 (small oval), compared with the low density of red around 2005 (large oval), reveals a decline in the number of students of Tehran University of Medical Sciences from 1988 to 2005 (1367 to 1384).
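The comparison of component planes in this section is visual. As a complement, and only as an assumed extension that this study did not perform, the agreement between two planes can be summarized with a Pearson correlation computed over matching map units. The function name and the toy planes below are hypothetical and carry no values from the actual maps.

```python
import math


def plane_correlation(plane_a, plane_b):
    """Pearson correlation between two component planes of the same map size."""
    a = [value for row in plane_a for value in row]
    b = [value for row in plane_b for value in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = sum((x - mean_a) ** 2 for x in a)
    spread_b = sum((y - mean_b) ** 2 for y in b)
    return covariance / math.sqrt(spread_a * spread_b)


# Hypothetical 2 x 2 planes: the second falls exactly as the first rises.
plane_rising = [[1.0, 2.0], [3.0, 4.0]]
plane_falling = [[8.0, 6.0], [4.0, 2.0]]
r = plane_correlation(plane_rising, plane_falling)
```

A value near -1 means that units with high values in one plane have low values in the other, which is the kind of opposite coloring described for the university and year planes. A single coefficient hides local structure, so it supports rather than replaces inspection of the planes.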

Figure 4. Relation between university and year attributes

Figure 5. Relation between degree and year attributes

Figure 5, extracted from Figure 2, shows that the high density of recent years corresponds with the MD (Doctor of Medicine) and PhD degrees in the degree component plane. The dark blue in the degree plane relates to the certificate degree and is distributed over several years, but the greatest number of certificate students is seen between 1988 and 1996 (1367 and 1375).

What is interesting in this figure is the large number of students in the MD (Doctor of Medicine) and PhD degrees in recent years (2005), whereas the number of certificate students peaked between 1988 and 1996 (1367 and 1375).

Figure 6. Relation between type of study and year attributes

The red color in the type plane corresponds with the dark red in the year plane. The relation between these two component planes of Figure 2, shown in Figure 6, indicates that the discontinuous type of study (flexible and non-continuing) was introduced in recent years (where the year plane takes the value 83.6). Before this, only the continuous type of study existed.

An implication of this finding is that the introduction of the discontinuous type of study in Iran higher education in recent years has increased the number of students. This type of entrance to a higher education degree is easier than the continuous type of study.

Figure 7. Relation between study mode and year attributes

The rectangle of dense red in the study mode plane, together with its relation to the year plane, shows that the overnight and equivalent modes mostly belong to the latest years. These two planes of Figure 2 are illustrated in Figure 7, which shows the emergence of two modes of study (overnight and equivalent, both defined in Table IV) in recent years (where the year plane takes the value 83.6). Daily study still has the largest number of students.

Since the overnight mode of study is a paid program and has become more popular, this discovery suggests that further development of similar programs could improve both popularity and profit.

Figure 8. Relation between university and degree attributes

Figure 8 shows two planes of Figure 2 and reveals that cluster 2, which contains Tehran University of Medical Sciences (see Table III), is associated with postgraduate degrees. Indeed, Tehran University of Medical Sciences appears to cover the majority of postgraduate degrees, as shown in Figure 8.

V. CONCLUSION

Knowledge discovery is essential for any organization that wants to benefit from the data it collects. Data is mined to find relations, patterns and interesting knowledge. Many techniques and software suites help with this; in this study we used SOM and found knowledge that is beneficial for decision making and planning in higher education. The following knowledge was extracted from the SOM analysis:

1. The number of students enrolling at Tehran University of Medical Sciences declined over the 17 years from 1988 to 2005.

2. There is a large number of students in postgraduate degrees in recent years (2005), and the greatest number of certificate students appeared between 1988 and 1996 (1367 and 1375).

3. The discontinuous type of study (flexible and non-continuing), introduced in recent years, has helped to increase the number of students in Iran higher education. This type of study makes entry to higher education degrees easier than the continuous type.

4. The overnight mode of study, which is fee based, has become more common recently; it is an appropriate way for universities to make a profit.

5. Tehran University of Medical Sciences covers the majority of postgraduate degrees.

The SOM Toolbox is suitable for visualizing these discoveries. All of them are useful for managerial decision making aimed at increasing the profit of a university or a group of universities.

REFERENCES

[1] Alhoniemi, E., J. Himberg, et al. (2002,2003). SOM in Data Mining.

[2] Alvarez-Guerra, E., A. Molina, et al. (2011). "A SOM-Based Methodology for Classifying Air Quality Monitoring Stations." Environmental Progress & Sustainable Energy.

[3] Bação, F., V. Lobo, et al. (2008). "Chapter 2. Applications of Different Self-Organizing Map Variants to Geographical Information Science Problems." In: Self-Organising Maps: Applications in Geographic Information Science.

[4] Durfee, A., S. Schneberger, et al. (2007). "Evaluating Students Computer-Based Learning Using a Visual Data Mining Approach." Journal of Informatics Education Research.

[5] Ultsch, A. (2003). "U*-Matrix: A Tool to Visualize Clusters in High Dimensional Data."

[6] Vembu, S. and S. Baumann (2004). "A Self-Organizing Map Based Knowledge Discovery for Music Recommendation Systems." Computer Music Modeling and Retrieval, Second International Symposium, CMMR 2004, Esbjerg, Denmark, May 26-29, 2004, 3310: 119-129.

[7] Vesanto, J., J. Himberg, et al. (1999). "Self-Organizing Map in Matlab: The SOM Toolbox." Proceedings of the Matlab DSP Conference 1999, Espoo, Finland, November 16–17, pp. 35–40, 1999.

[8] Xiao, X., E. R. Dow, et al. (2003). "Gene Clustering Using Self-Organizing Maps and Particle Swarm Optimization." Parallel and Distributed Processing Symposium, 2003. Proceedings. International.