
A Decision Support System for Degree Coordination

Ruben Anágua ([email protected])

Instituto Superior Técnico, Lisboa, Portugal

October 2018

Abstract

Over time, higher education institutions are increasing their emphasis on strategic planning of their activities and degrees, which raises demand for a computerized system to help them in this process. One of the tools currently used for strategic planning is the release, by degree coordinators, of summarized reports about the academic performance of the degrees offered by such institutions. These reports allow stakeholders to clearly pinpoint the factors leading to insufficient student performance. However, in many institutions, data for the summarized reports are compiled in a process that relies heavily on manual labor. This means that degree coordinators must ensure that, every semester, file names, file data, and formulas are consistent, in an ever-growing pile of files, leading to an inefficient process that is prone to human error. We propose a decision support system to periodically extract academic data from these files, transform them, load them into a data warehouse, and generate new Excel files with relevant results, increasing the automation and reliability levels of this process.
Keywords: Decision Support System, Data Warehouse, Higher Education, Academic Performance, Degree Coordination

1. Introduction

The use of Decision Support Systems (DSSs) is becoming widespread in areas connected to Business Intelligence (BI). Common DSSs are based on the analysis of data stored in previously set up databases, resulting in intuitive dashboards that display relevant information about the areas of an organization that need to be improved the most. These systems represent an efficient and valuable method to support decision making [4].

Higher education is one of the areas where DSSs can be used. The performance of the offered courses and the enrolled students depends on the conditions they are subjected to, such as the assigned professors and the balance of workload in the students' schedules. These systems allow the end user, who can be a degree coordinator, to find out which courses need to be revamped or improved, or which conditions should be provided to students to increase their rate of success.

Currently, coordinators of academic degrees collect and extract large amounts of data regarding the performance of their students as part of their activities at the end of each semester. Historical data, regarding previous semesters, must also be kept, to keep track of the evolution of students and of Key Performance Indicators (KPIs) related to them, such as the percentage of courses they have passed.

The process of collection and extraction of data is mostly manual, typically resorting to programs such as Microsoft Excel to organize and transform data. The reports for most degrees are built manually and rely on several worksheets to store data regarding grades, admission lists, and curricular plans, rather than having such data stored in a single data repository. These worksheets add up over time, as they must be kept as historical data. Furthermore, file names, file data, and formulas may become inconsistent, a scenario that degree coordinators wish to avoid, as it raises confusion and decreases efficiency. While this process does not require an underlying system, it is inflexible, inefficient, and prone to human error.

Even though several examples of DSSs for higher education have been presented, few were translated into actual development. Moreover, some are specific to certain universities or are not directly related to the scope of our project – an academic degree and its students. Thus, there is an open problem in this area, which our DSS aims to solve.

1.1. Objectives

This project aimed to create a DSS that automates the process of extracting data, manipulating it, and obtaining data visualizations in the form of Excel files, with the final goal of providing all the necessary data to create reports on student performance every semester.


This system contains a structured data repository with current and historical data, which degree coordinators can query via graphical tools, producing query results regarding indicators related to the evolution of academic performance in a given degree and its courses.

1.2. Document Outline

This paper is organized in five more sections. Section 2 presents the considered tools and the characteristics found most relevant in them. Section 3 presents two case studies of DSSs in education, with a final discussion of common shortcomings. The implementation of the prototype is discussed in Section 4, and its evaluation is presented in Section 5. Finally, we conclude and present work that can be performed in the future.

2. Business Intelligence Tools

An instrumental part of the development of this project is the set of tools used to develop our data warehouse-based system. This set of tools, commonly designated as a BI stack, was chosen for the implementation of the ETL (Extract, Transform, Load) process and the presentation of data stored in the data warehouse. Features slated for future prototypes, such as Online Analytical Processing (OLAP) analysis, must be compatible with the work already performed as part of this initial prototype.

Considered factors for the selection of the ecosystem include the availability of a free edition, flexibility, learning curve, whether the code is open-source or not, and cross-system compatibility.

Three BI stacks were considered: Pentaho, SQL Server Business Intelligence, and Qlik.

2.1. Pentaho

Pentaho's BI suite1 consists of a wide set of open-source pillars that work independently from each other, but can be combined to form a full ecosystem. Pentaho's free community edition includes a full BI suite of core tools, along with the ability to install several community-driven server plugins on top of the Pentaho platform. These tools were all created with Java, making the suite compatible with all popular operating systems. A popular example is Ctools2, a set of open-source tools mitigating some limitations of the free edition, such as the creation of dashboards.

The tool to use for the ETL process is Pentaho Data Integration (PDI), codenamed Kettle. This tool allows the developer to create the data transformation steps and jobs (sequences of transformation steps) necessary for the ETL process.

1 http://www.pentaho.com/
2 https://help.pentaho.com/Documentation/6.1/0R0/CTools/CTools Overview

The graphical interface, illustrated in Figure 1, is based on drag-and-drop interaction, not requiring any lines of code. This inevitably limits flexibility, as the developer is limited to a set of graphical elements rather than a full-fledged programming language; however, scripting and the creation of custom components are possible and intended for more advanced users.

Figure 1: PDI’s drag-and-drop interface

OLAP cubes can be created with Pentaho Schema Workbench3. For reporting purposes, the subset of dashboard-related tools in Ctools or Pentaho Reporting4 can be used.

2.2. SQL Server Business Intelligence

Microsoft SQL Server5 is a popular, closed-source Relational Database Management System (RDBMS) developed by Microsoft. Its Express edition is free, like the more complete Developer edition. The Enterprise edition, the one with the most features, is paid, but it can be obtained for free via the Microsoft Imagine6 program. As such, we will consider the features offered by the Enterprise edition. SQL Server 2017 is natively compatible with Windows and Linux, and is compatible with Mac OS via Docker. It is the first version of SQL Server to be compatible with Linux.

The tool used to model the ETL process is SQL Server Integration Services (SSIS). Much like PDI, the ETL process can be modelled without writing a single line of code. However, it is possible to write code to define custom tasks, transformations, log providers, and objects, to increase flexibility. OLAP is handled by SQL Server Analysis Services (SSAS).

3 https://mondrian.pentaho.com/documentation/workbench.php
4 https://community.hds.com/docs/DOC-1009856-pentaho-reporting
5 https://www.microsoft.com/en-us/sql-server/sql-server-2017
6 https://imagine.microsoft.com/


This tool supports relational OLAP, multidimensional OLAP, and hybrid OLAP, unlike Mondrian, Pentaho's analysis engine, which is based on relational OLAP. Reports can be created using SQL Server Reporting Services (SSRS) or Power BI7. However, all of these applications are Windows-only (excluding Power BI, which also has a cloud-based version).

2.3. Qlik

Qlik Sense and QlikView are the main products of Qlik8 and constitute an alternate approach to a BI suite. They have free editions, but the full stack of Qlik software can be obtained for free, through an application via their academic program9. However, this process is not automatic, and as such we will restrict ourselves to describing the free editions of Qlik Sense and QlikView. The free edition of QlikView has a very relevant limitation: it can only open applications developed on the same system. As such, it is not possible for unlicensed users to make changes to the application.

Unlike most BI stacks, Qlik does not use an underlying RDBMS nor database queries; instead, it loads all data into memory thanks to its associative engine, and persistent data can be stored using proprietary files. This makes Qlik a more intuitive program for business users when compared to more traditional BI stacks, with the drawbacks of not being able to handle large amounts of data or very complex ETL, due to limitations of the provided tools.

ETL processes, and OLAP-based star schemas and snowflake schemas, can be modelled with QlikView. We have seen before, however, that complex ETL may not be practical with a QlikView-based approach. For OLAP analysis, the tool used is Qlik Sense. Unlike the previously analysed stacks, Qlik does not natively support Multidimensional Expressions (MDX) queries. QlikView Reports, part of QlikView, is used to design reports. Its basic premise is to drag objects from the base QlikView application to another page, which has the layout desired by the user, at a better resolution. This page can then be exported to PDF, for instance.

2.4. Selecting the Right Ecosystem

The biggest limitation of Qlik is that its free edition can only open documents that were designed by the same installation (thus, free editions cannot open documents designed on other computers). It is possible to obtain, for free, an academic edition that can do so, but it requires manual approval. This fact alone excludes the possibility of using Qlik, as the future development of applications on this ecosystem could be threatened.

7 https://powerbi.microsoft.com/
8 https://www.qlik.com/
9 https://www.qlik.com/us/company/academic-program

Both Pentaho and SQL Server BI would be suitable for our domain. Pentaho has the advantages of being open-source and supporting Mac OS and Linux, while Microsoft's ecosystem is more popular and robust. Considering this, Pentaho seems to be the most suitable choice, unless a critical limitation with its tools is experienced; in that case, the more comprehensive Microsoft SQL Server stack would be the best solution.

To take full advantage of the fact that Pentaho is open-source, it would make the most sense to use an underlying open-source RDBMS with it. The selected RDBMS was MySQL.

3. Related Work

This section aims to analyze the current state of DSSs in education. We will describe two different implementations of such systems in this context, and discuss their shortcomings.

3.1. Case Study: The Croatian Higher Education Information System (ISVU)

The ISVU10 is a nation-wide information system, launched in 2000, to respond to the need of digitizing higher education institutions and of providing a seamless picture of the higher education panorama in Croatia. This project is funded and supervised by the Croatian Ministry of Science, Education, and Sports, and its data warehouse module stores information on students, professors, courses, curricular plans, student enrollments, and exams, with the ability to automatically generate reports on facts from these fields.

In 2003, Baranovic et al. [1] presented the data warehouse behind this system. The requirements were specified according to two main goals: having a quick and efficient tool for data analysis, and having an intuitive and constantly available service for coordinators and administrative staff to generate reports. With these goals in mind, a multidimensional model, composed of a set of star schemas, was created, supporting queries on degree enrollment, course enrollment, course attendance, and exam taking.

The only data source for this system was a relational database containing all required academic-related information at a nation-wide level. Even though all necessary information was stored in that database, querying it was a slow and inefficient process. As such, an ETL process was developed, copying the most relevant tables to a smaller replica of the relational database (the staging area of this architecture), transforming the data, and loading them into a database based on a multidimensional model. The database replica and the multidimensional database were implemented in SQL Server, using Windows 2000 Server as the operating system.

10 http://www.isvu.hr/javno/hr/index.shtml


This multidimensional database is the source for the OLAP server and its OLAP cubes; the server interacts with the user via a web-based application.

In 2009, Mekterovic et al. [3] presented some improvements to the data warehouse, considering changes in requirements due to the introduction of the Bologna process11, and an evolution in other business rules. A more robust data cleaning step was introduced to the ETL process, to ensure consistency and accuracy of data.

Regarding data analysis and presentation, the developed web-based application for the presentation of data supported predefined queries, detailed ad hoc queries (where the user could define constraints in a query via text boxes and combo boxes, which were then transformed into a parameterized SQL query), and summarized ad hoc queries (where users could perform queries with some OLAP operations over a drag-and-drop interface).
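
To make this mechanism concrete, the sketch below shows how such form constraints could be turned into a parameterized SQL query. This is our own Python illustration, not code from ISVU, and the table and column names (exam, academic_year, course_code) are assumed for the example.

def build_exam_query(filters):
    """Translate UI form values (text boxes and combo boxes) into a
    parameterized SQL statement; table and column names are illustrative."""
    sql = "SELECT student_id, course_code, grade FROM exam"
    clauses, params = [], []
    for column, value in filters.items():
        if value not in (None, ""):           # each filled form field becomes one constraint
            clauses.append(column + " = %s")  # placeholders, never string concatenation of values
            params.append(value)
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

# Example: two combo-box selections become "... WHERE academic_year = %s AND course_code = %s"
query, args = build_exam_query({"academic_year": "2008/2009", "course_code": "MAT101"})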

3.2. Case Study: Data Warehousing for University Accreditation

Sinaga et al. [5] implemented a data warehouse for university accreditation, where comprehensive reports on the activities of the university are provided to BAN-PT, an Indonesian accreditation agency for higher education. BAN-PT was created with the intent of enforcing quality standards in Indonesian higher education. Examples of information requested by this accreditation body are the average Grade Point Average (GPA) of students in the last five years, the average waiting time for alumni to find a job, and the yearly ratio of candidate students to the total number of open spots for new students in the available degrees. The aim of the data warehouse is to provide the means to quickly generate the necessary charts and tables with such information. The logical design for the data warehouse created to provide the requested information is a multidimensional model, consisting of a set of three star schemas: one for student registration, one for student enrollment in courses, and one for lecturers. The ETL process was implemented via PDI, with the extraction of data from multiple sources (relational databases, Excel files, and Microsoft Word files) being considered the biggest challenge.

3.3. Discussion

The introduced examples, and the remaining examples in the related work in this field, have many shortcomings. The most salient shortcoming is the lack of detail – while most of the provided examples give sufficient information regarding the logical design, with schematic diagrams being presented, knowing everything else is difficult.

11 http://ec.europa.eu/education/policy/higher-education/bologna-process en

Details on the tools used to build the data warehouses are often omitted, the validation stage is not performed or not documented, examples of usage are insufficient or non-existent, and information on ETL, analysis, and reporting is frequently vague. Furthermore, these examples do not come from particularly reputable sources, and most of them are only usable in particular use cases, such as the system introduced by Sinaga et al. That system is only useful for reporting information related to specific fields, of specific forms, for a specific accreditation body, not being useful for any other applications. A final shortcoming is the unavailability of source code for any of the discussed examples, which means that we cannot reuse the code or even look at implementation details.

4. Implementation

In this section, we will describe the first functional prototype of the decision support system, which aims to provide key performance indicators related to the students and courses of a degree, aiding the decisions and suggestions of degree coordinators.

4.1. Gathering Business Requirements

Kimball's approach for defining business requirements was followed, in a four-stage process: preplanning, collecting the business requirements, conducting data-centric interviews, and a debriefing stage, where requirements are documented [2].

As part of the preplanning stage, the approach of face-to-face interviews to gather business requirements was selected, ensuring an in-depth look at what is needed, why it is needed, and what might be needed in the future. It was also defined that the primary end user would be the coordinator of our degree of reference, LEIC-T (Bachelor of Science in Computer Science and Engineering at IST-Taguspark), and that the resulting system should be usable by other degree coordinators as well, whether they are from the same institution or not.

In the second stage, which regarded the collection of business requirements, the actual face-to-face interviews with the primary end user were conducted. At this stage, the input formats to support, the output formats, the collection of queries to support, and the metrics to evaluate the system were all defined. Additionally, past spreadsheets and reports were brought to the interviews by the primary end user, facilitating the communication of what is expected of this system.

The third stage, regarding the conduction of data-centric interviews, was done in cooperation with both the primary end user and a subject matter expert, to ensure that the required core data existed and was accessible, and to evaluate the potential drawbacks of the drafted multidimensional model, along with the feasibility of performing the required queries over a data warehouse built with this model in mind.

The last stage consisted of creating documentation for the business requirements and presenting such documentation, to ensure the alignment of everyone involved with the project.

As a result of the conclusion of this four-stage process, it was specified that the DSS would support four input Excel file types, process them and populate a data warehouse based on a multidimensional model designed as a result, and support a set of 19 predefined queries, with the results being written to a set of Excel files with the specified levels of granularity.

Additionally, it was also defined that adding new generations at once on an average machine should take, at most, 30 minutes, with 11 generations of LEIC-T used as input, and with the execution time increasing linearly with the number of input generations. The requirement for the execution of all queries (filtered to LEIC-T) was that execution time be under one hour for 11 generations of LEIC-T in the data warehouse, with the total execution time also increasing linearly with the number of input generations. To evaluate the system's correctness, it was stated that all 19 queries had to be implemented, and that, where comparable, the query results should match the results of past reports of the primary end user, who stated, however, that multiple minor errors had likely occurred in such reports.

4.2. Architecture

The architecture of the implemented DSS can be broken down into three components: the set of data sources, the data warehouse populated using the aforementioned sources, and an analysis component, which provides the desired data visualizations. This general architecture can be seen in Figure 2.

More specifically, the set of data sources for this system consists of four different types of Excel files, provided with varying periodicities to the data warehouse.

This data warehouse forms the central piece of this architecture, and data are loaded into it via an ETL process over the data sources.

The analysis component provides a set of Excel files as output, obtained by performing SQL queries over the data warehouse. These files can then be used to generate data visualizations.

4.3. Multidimensional Model

The multidimensional model of the data warehouse is a fact constellation, composed of five dimension tables and four fact tables. This model is presented graphically in Figure 3. Primary keys are represented in bold and foreign keys are italicized.

Figure 2: Architecture of the developed DSS

Associations between tables and their respective cardinalities are also represented.

Information in this data warehouse is mainly structured in five different dimensions, which serve as five different perspectives of interest to our use case.

The d_degree dimension table stores information regarding the degrees that are to be considered by the system. This dimension is needed because the analysis procedure involves filtering data by degree, as degree coordinators, due to their positions, are more interested in visualizing data from this perspective. Each degree has a name, in its abbreviated form and in its full form, the minimum number of ECTS12 credits required for completion of the degree, and the number of years for which the degree is designed (for example, LEIC-T is designed to have a duration of 3 years13).

The student dimension is represented by the d_student dimension table. Each student is identified by their institutional ID and, optionally, by the stage of admission to the degree.

The time dimension is represented by the d_time dimension table. The chosen time granularity is the semester, as a semester is the minimum amount of time needed by a student to finish a course and to have an associated grade.

12 https://ec.europa.eu/education/resources/european-credit-transfer-accumulation-system en

13 https://fenix.tecnico.ulisboa.pt/cursos/leic-t/curriculo

Figure 3: Multidimensional model represented in the form of a diagram

Curricular plans are not necessarily static over time; therefore, they are also stored in the database, thanks to the d_curricular_plan dimension table. Each curricular plan is associated with a degree and a minimum semester of validity, via surrogate keys identifying the degree and time dimensions, respectively.

The d_course table represents the course dimension. Each course is associated with a curricular plan, and it has a name, in its abbreviated form and in its full form, a year and semester number (for instance, a course can be taught in the first semester of the first year of the degree), a number of ECTS credits, and a Boolean is_new field. This field is set to true if the respective course has the lowest year and semester numbers for a given abbreviated course name, in a given curricular plan, and set to false otherwise. In the scope of our system, each course is associated with a single curricular plan (which is, in turn, associated with a single degree), and with a single year/semester pair of the degree, which means that different courses with the same name may exist.
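
The rule behind is_new can be restated in a few lines. The Python sketch below is only an illustration of that rule, not the system's actual ETL step; the field names mirror the description above.

def mark_is_new(courses):
    """Flag the occurrence of each course with the lowest (year, semester)
    within its curricular plan. Each course is a dict with keys
    plan_id, short_name, year, and semester."""
    earliest = {}
    for c in courses:
        key = (c["plan_id"], c["short_name"])
        pos = (c["year"], c["semester"])
        if key not in earliest or pos < earliest[key]:
            earliest[key] = pos
    for c in courses:
        c["is_new"] = (c["year"], c["semester"]) == earliest[(c["plan_id"], c["short_name"])]
    return courses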

Four fact tables are part of this data warehouse; they contain keys to the aforementioned dimensions, and facts or measures.

The data warehouse keeps track of a student's activity in a given degree, at a given semester, by using a fact table named f_student_activity. It contains a Boolean field named active, which states whether a student was evaluated in at least one course, considering the provided degree and semester. Thanks to this table, the system can filter out inactive students (students not evaluated in any course).

The f_student_admission fact table keeps a record of all admitted students, the degrees that they were admitted to, and when the admission of such students occurred. A field named enrolled_semester1 is available to filter out students who did not enroll in all available courses of their first semester.

In the f_student_degree fact table, each student and degree pair is associated with two fields: degree_all_courses_time_id and degree_all_ects_time_id. The former is set to the lowest time at which the student has completed all available courses in the specified degree; the latter is set to the lowest time at which the student has met or surpassed the minimum requirement of ECTS credits declared in the degree dimension.

The f_student_evaluation fact table contains data related to grades. Grades obtained in the regular academic period, in the special academic period, and final grades are all available for each course that a student has enrolled in, at a given semester. Additional facts are also available: state determines whether a student has passed a course, failed it, or has not been evaluated; num_enrollments is the number of times that a student had enrolled in the associated course up to the associated semester; and first_passed is set to true if that entry represents the first time that a student has passed the associated course, and set to false otherwise.
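
With this grain, semester-level indicators become simple aggregations. The query below is only an illustrative example of what the fact table supports, not one of the 19 predefined queries; the surrogate key names (course_id, time_id) and the value 'passed' are assumptions based on the description above.

# Illustrative pass rate and average number of enrollments per course and semester
# (MySQL syntax; the boolean comparison evaluates to 0/1 inside AVG).
PASS_RATE_SQL = """
SELECT c.short_name,
       t.time_id                AS semester_id,
       AVG(e.state = 'passed')  AS pass_rate,
       AVG(e.num_enrollments)   AS avg_enrollments
FROM f_student_evaluation e
JOIN d_course c ON c.course_id = e.course_id
JOIN d_time   t ON t.time_id   = e.time_id
GROUP BY c.short_name, t.time_id
ORDER BY c.short_name, t.time_id;
"""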

4.4. Populating the Data Warehouse

The process of populating the data warehouse from a set of data sources is composed of two stages: a verification stage and an ETL stage, as Figure 4 shows.

Figure 4: Diagram describing the populator process

The verification stage involves two steps: first, creating the database tables if they do not exist yet, which includes calling a stored procedure to populate the time dimension, and second, checking whether any valid input files exist.
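
In the implemented system these steps are PDI job entries; the Python sketch below merely restates them against a MySQL database, purely as an illustration. The stored procedure name, table columns, and input-file pattern are assumptions.

import glob
import mysql.connector  # assumption: the data warehouse resides in MySQL, as selected in Section 2.4

def verify(conn, input_dir):
    """Mirror of the verification stage: ensure the tables exist (including the
    stored-procedure call that fills the time dimension) and check for input files."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS d_time ("
                "time_id INT PRIMARY KEY, school_year CHAR(9), semester TINYINT)")  # columns illustrative
    cur.callproc("populate_time_dimension")  # hypothetical procedure name
    conn.commit()
    # The ETL stage only starts if at least one candidate Excel file is present.
    return len(glob.glob(input_dir + "/*.xlsx")) > 0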

After this first stage is successful, the more complex ETL stage begins. It is composed of two groups of subprocesses: the first group processes the different types of input Excel files, and the second group populates new tables and new attributes in existing tables of the data warehouse. This second group of ETL subprocesses was created with the main objective of facilitating common queries that may be performed over this database. For instance, we can verify whether a student is active in a given degree for a given semester by checking if the respective entry exists in f_student_activity and if active is set to true.

Some subprocesses depend on data from previous subprocesses, leading to the decision of running them sequentially. Data already present in the data warehouse are ignored, and not reinserted into the data warehouse.

4.5. Querying the Data Warehouse

The process of gathering output data from the warehouse can be split into two: one process manages course queries over the data warehouse, and another manages queries regarding generations, as seen in Figure 5. Each subprocess gathers all possible results for all queries in an individual degree, as provided to the DSS when running this process.

Figure 5: Diagram overview of the query executor process

Running course queries on this system's data warehouse is a two-stage process: first, the system gets all courses defined in any curricular plan of the degree, and then all queries run for each course. This architecture is represented in Figure 6.

Thus, in the initial stage, an SQL query responsible for getting all course short names ever defined in a curricular plan associated with the selected degree is executed. These course names are then retrieved by a higher-level process, which forwards the results to the second stage.

In the second stage, a lower-level process is called to run all associated subprocesses, each associated with a query, and write their output to Excel files. Each subprocess structure includes an extraction stage, where the data warehouse is queried, an optional transformation stage, where extracted data are joined with other queries or transformed into the desired formats, and a loading stage, where the results are written to a single Excel file. This lower-level process runs once for every course name retrieved in the first stage.
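
The actual subprocesses are PDI transformations; the Python sketch below only mirrors this extract/transform/load structure for a single course, with assumed column names and an assumed pass-rate transformation.

import pandas as pd
import mysql.connector  # assumption: MySQL-backed data warehouse

def run_course_query(conn, short_name, out_dir):
    """One course-level subprocess: extract from the data warehouse,
    optionally transform, and load the result into one Excel file."""
    cur = conn.cursor()
    cur.execute(
        "SELECT t.time_id, e.final_grade, e.state "
        "FROM f_student_evaluation e "
        "JOIN d_course c ON c.course_id = e.course_id "
        "JOIN d_time   t ON t.time_id   = e.time_id "
        "WHERE c.short_name = %s", (short_name,))
    df = pd.DataFrame(cur.fetchall(), columns=["semester", "final_grade", "state"])
    # Optional transformation: pass rate per semester for this course.
    summary = df.groupby("semester")["state"].apply(lambda s: (s == "passed").mean())
    # Loading: results for each course go to a single Excel file.
    summary.to_frame("pass_rate").to_excel(out_dir + "/" + short_name + ".xlsx")

The higher-level process would then call run_course_query once for every course short name retrieved in the first stage.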

The process of generating results for generational queries is more complex, as there are two different levels of granularity from the perspective of the output Excel files:

• Level 1: Single file, one sheet per generation;

• Level 2: One file per generation, one sheet per semester since admission of that generation.

For queries at the aforementioned granularity level 1, the two-stage process shown in Figure 6 also applies; however, rather than running each query once for every course name, the query runs once for every generation (year of admission of students to the degree under analysis).


Figure 6: Diagram overview of the query executor process

An example of an output file for one of these queries is displayed in Figure 7.

Figure 7: Output Excel file for the average number of enrollments needed to pass a course, by course

For queries at the deeper granularity level 2, a nested loop is required. For each admission year previously fetched, the system runs an inner loop to fetch a number of semesters since admission (the maximum number is the lowest of: the highest semester for which the data warehouse has any information, and the semester of admission plus the maximum number of semesters defined when running the top-level process). Then, each combination of admission year and semester is fed to another lower-level process, which executes all transformations associated with queries of this level of granularity, and writes a different Excel file for every generation, with a different sheet in each file for each semester considered in the same generation.
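
The Python sketch below restates this nested loop; results_for stands in for the lower-level process that runs all level 2 queries for one combination of admission year and semester, and all names are illustrative.

import pandas as pd

def export_level2(generations, max_semesters, results_for, out_dir):
    """Level 2 output: one Excel file per generation (admission year) and
    one sheet per semester since admission."""
    for year in generations:                       # outer loop: admission years
        path = out_dir + "/generation_" + str(year) + ".xlsx"
        with pd.ExcelWriter(path) as writer:
            for k in range(1, max_semesters + 1):  # inner loop: semesters since admission
                df = results_for(year, k)          # hypothetical query runner returning a DataFrame
                df.to_excel(writer, sheet_name="semester_" + str(k), index=False)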

5. Evaluation

The evaluation process of the implemented DSS was performed by validating the obtained results against previous reports and Excel files created by the coordinator of LEIC-T, and by ensuring that all results could be obtained in a reasonable amount of time.

As such, multiple experimental validation rounds were made until the implemented DSS yielded the expected results, and execution times for the desired processes with different quantities of input data were measured.

5.1. Validation

To ensure that the system worked as expected, the implemented DSS was loaded with all available information for our degree of reference, LEIC-T. Available information comprised student admissions, student grades, and curricular plans between the school years of 2007/2008 and 2017/2018. Then, all possible queries for this degree were executed, and results were compared with both past reports and private Excel result files created by the coordinator of this degree.

The validation process yielded the expected results, as, whenever comparable, the query results obtained from the implemented prototype matched exactly or had minor deviations (of 0.1 to 0.2% of average ECTS credits obtained, for instance), which can mostly be blamed on rounding errors in the files used for validation.

However, the validation process was relatively complex, for two distinct reasons: first, there were differences in conditions between the implemented queries and the analogous results in the degree coordinator's files and reports, and second, several errors existed in such files and reports.

Regarding the first difficulty, all queries were specified as part of the process of gathering business requirements, and some criteria were defined differently, making the comparison with existing results impossible. In particular, direct comparison of any course query is not possible, because the subset of considered students differs. In the files and reports created by the coordinator of LEIC-T, grades of all students in the degree (and even some of other degrees, due to inconsistencies with files generated by Fenix) were considered. However, the implemented system only considers grades of students whose admission information is available – this means that students admitted before 2007, students admitted via special methods, and the rare cases of students not in LEIC-T but considered in the degree coordinator's files, were excluded by the DSS during the loading process. Nevertheless, if the Excel files used for validation were altered to only consider the same subset of students, the query results matched.

Regarding the second difficulty, many errors were found in both the Excel files and the reports used for validation. As it was not possible to track the source of the inconsistencies between these results and the results provided by the implemented DSS, many hours were wasted investigating and attempting to debug the various components of the DSS unnecessarily. The coordinator of LEIC-T stated that errors in these files and reports were likely due to the complexity of the utilized formulas and the difficulty of adapting them to changes in curricular plans, which is why these files and reports were investigated as well. Unfortunately, these concerns were indeed justified.

These errors affected validation of all generational queries. For instance, the degree coordinator's global output Excel file included a compilation of grades in a given semester for each student, and this compilation was obtained via an Excel formula referencing the linked Excel files. However, issues with this formula led to grades being associated with the wrong students, and even to some students having the impossible passing grade of "0". This led to many validation inconsistencies, such as active students being incorrectly considered inactive, and students being associated with incorrect ECTS credit numbers.

The validation process did not, however, only surface issues that could not be directly blamed on the DSS. Two distinct sources of errors arose during the various iterations of this prototype: the first source of errors was the populator process, in which issues with the validation of the input Excel files led to some students and grades not being considered, even though they should have been; second, some queries in the query executor process were defined incorrectly in earlier stages of this prototype, leading to unexpected results.

5.2. Performance

The following business requirements were set for acceptance of this system: first, the time elapsed to populate the data warehouse must be 30 minutes or less, with all 11 available generations of LEIC-T serving as input; second, the time elapsed to execute all supported queries must be 60 minutes or less, with all 11 available generations of LEIC-T as input. Execution time should increase linearly with the number of generations in both cases.

To verify how the execution time of the processes increased with the number of generations, a script was developed to iterate over all data sources (excluding files listing the degrees to consider) related to the 11 generations of LEIC-T for which we had data, and multiply them by 2, 4, or 8 times. This was possible by changing the institutional IDs associated with the students, and by increasing the years associated with the existing curricular plans, student admissions, and student grades by 11, for each iteration.
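
The replication rule can be summarized as in the Python sketch below; this is a simplified restatement, not the actual script, and the record fields are assumptions.

def replicate(records, copies):
    """Multiply the 11 input generations by `copies` (2, 4, or 8): each extra
    copy remaps the institutional IDs and shifts every year by 11, so the rows
    look like later, distinct generations. Record fields are illustrative."""
    out = list(records)
    for i in range(1, copies):
        for r in records:
            out.append({**r,
                        "student_id": r["student_id"] + i * 1000000,  # keep IDs unique per copy
                        "year": r["year"] + 11 * i})                  # push into later school years
    return out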

For each of these four sets of input files, we populated the data warehouse of the implemented DSS using a newly created database, executed all queries on this populated data warehouse, and measured the execution times of the processes and their main components.

The results for the populator process can be seen in Figure 8.

Figure 8: Execution time of the populator process and its components (entire process, verification stage, ETL – Excel, ETL – populator), plotted against the number of input generations of LEIC-T (execution time in minutes)

By analyzing the plot corresponding to the execution time of the entire process (which is the sum of the execution times of its three subcomponents), we can see that the part of the business requirement regarding loading time for 11 generations of LEIC-T was met, as the entire process took less than two minutes to complete.

It can also be seen that the dark blue plot represents a near-linear increase of execution time. The 18.6 minutes required for loading 88 generations are satisfactory, as this represents an increase in execution time of 969% for an increase in the number of generations of 700%, showing that the process will scale well over time. The increase in execution time when a high number of generations is inserted at once can be blamed mostly on the ETL subprocesses that populate additional fields, which is expected, as the most complex calculations are made at this stage, and these are the processes responsible for populating the tables that increase exponentially in size with the number of generations.

Regarding the query executor process, the results are shown in Figure 9.

The dark blue plot corresponds to the execution time of the entire process, and the three subcomponents are: the executor of generational queries excluding one of the queries (in green), the executor of course queries (in red), and the excluded query (in orange).

Two conclusions can be quickly drawn from analyzing this chart: first, the execution time for 11 input generations was satisfactory, being well under the specified 60 minutes; second, the execution time increased exponentially, rather than linearly.

Regarding the first conclusion, the execution time for 11 input generations was approximately 175 seconds, less than 5% of the maximum allowed time.


Figure 9: Execution time of the query executor process and its components (entire process, generational queries excluding Q13, course queries, and Q13), plotted against the number of input generations of LEIC-T (execution time in minutes)

Execution times of earlier versions of the prototype were much higher, or the queries did not even finish, because many optimizations had not yet been implemented.

However, the execution time did not scale well for subsequent executions with higher numbers of input generations. The plot representing the execution of generational queries excluded one of the queries, to allow easy identification of the culprit. Attempts at optimizing it, such as the insertion of indexes on fields that serve as join keys, and splitting the query into smaller queries, did not decrease the execution time significantly. Therefore, this is a limitation that should be worked on in future prototypes. Considering the maximum number of input generations, 88, this query alone accounted for 91% of the execution time of the entire process, which includes 19 queries.

We have ensured that this query was the only culprit, as not considering it in the dark blue plot leads to a linear increase in execution time rather than an exponential one.

6. Conclusions

This document presented the first functional prototype of a DSS applied to degree coordination. This system was created with the intention of automating the currently manual process of extracting data related to students, courses, and student performance, manipulating them, and obtaining data visualizations, which can be used for the eventual creation of reports by degree coordinators.

As this project represents a first functional prototype, there is still a lot of work to be done. Some assumptions that facilitate the execution of some of the supported queries were made; they are reasonable in the context of LEIC-T and can be extrapolated to other degrees, but not to all. The most limiting assumption is that there is no distinction between mandatory and elective courses. Two methods are used to verify whether a student has completed a degree: obtaining the minimum required number of ECTS credits, and concluding all courses in the curricular plan valid at the time. Neither method makes a distinction between mandatory and elective courses, which makes some parts of this system not applicable to many degrees.

Besides the above, some features that were not explored at all in the context of this prototype could eventually be explored. One of those features is the introduction of an OLAP server, allowing dimensional analysis of data via a drag-and-drop interface or via MDX queries. Another unexplored feature, also related to the field of data analysis, is the implementation of dashboards, with more filtering options, allowing the visualization of more tailored KPIs in a simple graphical interface. A third unexplored feature is the automation of reports, ensuring that the process between inserting data into the DSS and creating a report does not require any human interaction.

Nevertheless, the ecosystem selected for this prototype allows any future developer to reproduce it in their favorite development environment, enabling reusability of the currently implemented DSS.

Acknowledgements

I am very grateful to my thesis advisors, Prof. Helena Galhardas and Prof. Nuno Mamede, for their continuous guidance and support during this year.

I would also like to thank my family and friends,as they all played important roles in my life.

References

[1] M. Baranovic, M. Madunic, and I. Mekterovic. Data Warehouse as a Part of the Higher Education Information System in Croatia. Technical report, Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia, 2003.

[2] R. Kimball and M. Ross. The Data Warehouse Toolkit. Wiley, third edition, 2013.

[3] I. Mekterovic, L. Brkic, and M. Baranovic. Improving the ETL process of Higher Education Information System Data Warehouse. In WSEAS International Conference on Applied Informatics and Communications, 2009.

[4] V. L. Sauter. Decision Support Systems for Business Intelligence. Wiley, second edition, 2010.

[5] A. S. Sinaga and A. S. Girsang. University Accreditation using Data Warehouse. Journal of Physics: Conference Series, (801), 2016.
