DATA MINING IN HIGHER EDUCATION
Marshall University Graduate College
Degree - MS in Technology Management with Emphasis in Information Security
Submitted on 10/27/2010
Advisor - Dr. Tracy Christofero
Capstone Project
By: Debra Elliotte
Abstract
There has always been a need for private monies to supplement state and federal funds in public education. Private monies fund scholarships, buildings and equipment, either partially or completely. State colleges such as Marshall University received less and less from the government and relied more and more on private donors to make up the difference. Advancement officers looked for new techniques to expand their reach and use their resources efficiently. One way to increase efficiency was to understand constituents: why they donated and what their connection to the University was. Data mining and donor modeling tools provided the means to understand constituents and how best to use that information in fundraising efforts. The analysis identified characteristics indicating that a non-donor was likely to donate and used the software program Rapid Insight to produce the statistical models.
Keywords: Data mining, algorithm, fundraising, philanthropic, donor modeling, pattern analysis.
Acknowledgements
I thoroughly enjoyed my graduate education; even the hard parts. I received a great
education. I met and worked with some wonderful individuals who deserve my gratitude.
My professors were always helpful and supportive regardless of the number of questions I
asked. Drs. Christofero and Logan in particular come to mind, but I learned from all. In Dr.
Larsen’s class, I improved not only my writing skills but also my approach to getting along in the
world, and Professor Biros sent me further along my path of understanding data
mining.
I met students along the way who were a joy. Some were just partners to
commiserate with; others were partners to work on projects. I learned from them all,
and developed great friendships.
I want to thank Dr. Lynne Mayer who started me down the path to obtain my degree,
Dr. Ron Area, CEO of The Marshall University Foundation, who gave me permission to use
the data from the alumni database for this project, and Rapid Insight Software Corporation
for making the software available free to students. I also want to thank Rebecca Samples and
Barbara Hicks for their support.
I want to thank Peter Wylie for his support and guidance and Kevin MacDonnell for
his informative blog about data mining; I learned a great deal from both.
Last, but not least, I want to thank my family for their patience and support.
Table of Contents
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Terms and Definitions
Introduction
Background
Problem Statement
Topic Selection
Literature Review
Research Methods
Results
Discussion and Evaluation
Conclusions
Future Work
References
Appendix A A&S Study Aggregated Results
Appendix B Marshall University Alumni Database Variables
Appendix C Rates for Greeks and non-Greeks Schools C – F
Appendix D Median Dollars for Greeks and non-Greeks Schools C – F
Appendix E Giving By Marital Status and Class Year Schools B – E
Appendix F Giving by Marital Status and Class Year Schools F-H
Appendix G Degree Breakdown
Appendix H Major Breakdown
Appendix I Lifetime Giving and Age
Appendix J Donor Indicator and Degrees
Appendix K Donor Response Model
Appendix L Likelihood Model
Appendix M Distribution of Alumni by State
List of Figures
Figure 1 - Understanding
Figure 2 - Deep Pockets
Figure 3 - Lifetime Greeks Bearing Gifts School A
Figure 4 - Lifetime Greeks Bearing Gifts School B
Figure 5 - Median Greeks Bearing Gifts School A
Figure 6 - Median Greeks Bearing Gifts School B
Figure 7 - Cluster Chart
Figure 8 - Giving and Age
Figure 9 - Giving and Years since Graduation
Figure 10 - Giving By Degree
Figure 11 - Donating and Student Activity
List of Tables
Table 1 - Donor Response Model
Table 2 - Donor Response Model Continued
Terms and Definitions
Algorithm – a set of rules for solving a problem
CAE – Council for Aid to Education
CASE – Council for Advancement and Support of Education
Dependent variable – value to predict
FICO – Fair Isaac Corporation, a credit-scoring model
FY – Fiscal Year
Greeks – alumni who were members of fraternities and sororities
Hard credit – total outright dollars given
Mining model – a combination of one or more algorithms and data
Multivariate – the multiple variables used in statistical analysis
Text mining – extracting values from free text data
Univariate – the single variable used in statistical analysis
Introduction
Monies from private sources have been of major importance to colleges and
universities. Unlike allocations from government, which were usually prescribed or at
least closely regulated, private monies frequently represented unrestricted spending and were often a
source of institutional discretionary funds. These funds were a source for innovations and
risk taking. These discretionary funds frequently provided the margin of excellence that
separated one institution from another. Voluntary support played a critical role in
balancing institutional budgets, and as the availability of those sources diminished,
institutions looked for ways to keep donors engaged (Leslie & Ramey, 1988).
The best institutions knew how to collaborate with alumni, friends and parents.
Such collaboration required knowing and understanding their interests and behaviors,
then tracking involvement such as giving, advocacy and relationships. Analytical tools
identified key characteristics indicating when people were ready to give and helped
fundraisers understand behavioral patterns critical to donor retention (Birkholz, 2008).
Donors gave to what they cared about, to what they valued. Nonprofit
organizations that understood this knew they had to bring together donor values and
corresponding institutional needs for the organization to be successful. If giving were
simply a matter of assets and income, fundraising would be easy, but that is not the case.
Fundraisers had to understand why people gave. Analytical tools helped fundraisers do this
(Birkholz, 2008).
Data mining was an analytical tool. It located patterns and relationships in data
that were useful to make valid predictions and draw meaningful conclusions about that
data. These patterns and relationships became the basis for building predictive and
descriptive models. Predictive models forecasted explicit values from known results.
Descriptive models described patterns and created meaningful subgroups (Larsen, 2009).
Most individuals have interacted with a predictive model when applying for
credit. The Fair Isaac Corporation (FICO) score is one such example. This score predicted
a statistical likelihood of loan repayment by analyzing characteristics of individuals who
did and did not pay back loans. When banks loaned money, an individual’s financial
ability and likelihood to pay were key criteria (Birkholz, 2008).
Fundraisers assessed donors according to financial ability and likelihood of
making a major gift. A donor’s connection to the institution and/or alignment of values
between the institution and the donor were key criteria (Birkholz, 2008).
Corporations such as Best Buy, J.P. Morgan and Volkswagen conducted analyses
of who purchased their products and services. Their goal was to produce key groups for
specific marketing campaigns. When fundraisers developed strategies for alumni, faculty
and community members, they used similar customizations (Birkholz, 2008).
Background
Data mining produced new information by building a real world model using
existing data. The result was a description of patterns and relationships in the data to use
for prediction (Two Crows, 2004).
Data mining is a technical process driven by a business goal (Khabaza, 2009).
The steps listed below encompass both the technical process and business components
involved in data mining (Han & Kamber, 2006; Larsen, 2009):
• Define the problem or goal
• Determine patterns to mine
• Data cleaning, preparation and selection
• Train the model
• Validate
• Deploy
Business knowledge and data knowledge drove the first step. Individuals who
developed the problem statement or goal had to understand the business, for instance the
business of fundraising. They had to understand the information being mined in order to
answer questions like, “What do the codes in this field mean?” Without this, the problem
statement would be ill-defined and the results incorrectly interpreted (Khabaza, 2009). A
focused statement describing the problem to solve was best (Two Crows, 2004). For
example, “Are constituents who attend University events repeatedly more likely to give
money than those who attend less often?”
The next step was to determine the kind of patterns to mine. This also required
that the data be understood, but from a different point of view. Data could have a
numerical value, e.g., number of gifts, or could be categorical, e.g., donor/non-donor.
Categorical data could be ordinal, i.e., having a meaningful order, such as high, medium,
low; or nominal, i.e., unordered, as in the case of zip codes. Having this information
influenced which algorithm(s) to use (Two Crows, 2004).
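The distinction between numerical, ordinal and nominal data can be illustrated with a minimal Python sketch. The field names, category ranks and records below are hypothetical and are not taken from the Marshall University database; the sketch only shows one common way each data type might be turned into numbers before an algorithm is chosen.

```python
# Hypothetical alumni records; field names and values are illustrative only.
records = [
    {"gift_count": 3, "capacity": "high", "zip": "25701", "donor": "donor"},
    {"gift_count": 0, "capacity": "low", "zip": "45601", "donor": "non-donor"},
    {"gift_count": 1, "capacity": "medium", "zip": "25701", "donor": "donor"},
]

# Ordinal: the categories have a meaningful order, so map them to ranks.
CAPACITY_RANK = {"low": 0, "medium": 1, "high": 2}


def encode(record, zip_levels):
    """Return a numeric feature list for one record."""
    features = [float(record["gift_count"])]  # numerical, used as-is
    features.append(float(CAPACITY_RANK[record["capacity"]]))  # ordinal rank
    # Nominal: no order, so use one 0/1 column per distinct value.
    features.extend(1.0 if record["zip"] == z else 0.0 for z in zip_levels)
    return features


zip_levels = sorted({r["zip"] for r in records})
encoded = [encode(r, zip_levels) for r in records]
# Categorical target (donor/non-donor) coded as 1/0.
labels = [1 if r["donor"] == "donor" else 0 for r in records]
```

Treating zip codes as one column per value, rather than as a number, avoids implying an order that nominal data does not have.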
Data cleansing, preparation and selection were next. At the time of this writing,
there was a vast amount of information captured in databases. Choosing useful fields
depended on the problem statement and required an understanding of the data captured and
what information it provided. If the business goal was to increase sales in a particular part
of the country, then data about current sales in that area would be needed (Larsen, 2009).
Data cleansing, preparation and selection were steps unique to the data. Initially,
the problem statement guided data selection. If the model was to predict event
attendance, data about past event attendance was needed. However, not all of the data
available was useful, and some might be incorrect or missing, e.g., a blank in a marital
code, or a gender code indicating male but a name prefix indicating female. Inspecting the
data uncovered these problems, and decisions had to be made about what to do with incorrect or
missing values. Discarding those records could leave a sample that gave an
inaccurate picture of the data. On the other hand, the fact that a value contained no data could
be significant; for example, perhaps the field captured information about a small subset of
individuals.
Missing values were particularly troublesome because not all mining methods
accommodated data that contained missing values (Cios, Kurgan, Pedrycz & Kurgan,
2007). One approach for fixing missing values was to calculate a substitute value, such as
a marital code of married for two linked records missing that information (Two Crows,
2004). Realistically, not all problems could be fixed, but being aware of those problems
allowed discrepancies to be corrected. Once cleansed and selected, the data were loaded
into a database and made available to data mining software algorithms. There were
numerous algorithms used in data mining; however, not all were applicable to every
situation.
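The substitute-value approach mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the `spouse_id` link field and the marital code "M" are hypothetical, and the rule (fill "married" only when two records point at each other) is one possible reading of the linked-records example.

```python
# Hypothetical records: "spouse_id" links two constituent records;
# None marks a missing value.
records = {
    101: {"marital": None, "spouse_id": 102},
    102: {"marital": None, "spouse_id": 101},
    103: {"marital": None, "spouse_id": None},
    104: {"marital": "S", "spouse_id": None},
}


def impute_marital(records):
    """Fill a missing marital code with 'M' when two records link to each other.

    Returns a cleaned copy plus the ids that still have no marital code,
    so unresolved gaps stay visible instead of being silently dropped.
    """
    cleaned = {rid: dict(rec) for rid, rec in records.items()}
    for rid, rec in cleaned.items():
        partner = cleaned.get(rec["spouse_id"])
        mutual = partner is not None and partner["spouse_id"] == rid
        if rec["marital"] is None and mutual:
            rec["marital"] = "M"
    still_missing = [rid for rid, rec in cleaned.items() if rec["marital"] is None]
    return cleaned, still_missing


cleaned, still_missing = impute_marital(records)
```

Returning the list of unresolved records keeps the analyst aware of the problems that could not be fixed, in line with the point that awareness matters even when a correction is not possible.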
Classification algorithms identified characteristics that determined where an item
fit, for instance, classifying a loan applicant as a good or bad credit risk, or identifying a
potential new donor. For the latter, the algorithm examined previous donor data to
discover attributes that distinguished donors from non-donors. Those distinguishing
attributes forecasted the value of a donor indicator attribute (dependent value) in future
cases (Larsen, 2009).
Regression algorithms predicted a continuous value. Regression looked at trends
in order to predict ones that might continue, e.g., donations over a span of fiscal years.
Looking at donations and gift dates over a span of years might reveal that donations were
seasonal (Larsen, 2009).
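As a rough illustration of predicting a continuous value from a trend, the sketch below fits a straight line to yearly donation totals by ordinary least squares. The yearly totals are invented for the example, and a single straight line is only the simplest possible regression; it is not the model used later in this project.

```python
# Hypothetical total donations per fiscal year (illustrative numbers only).
fiscal_years = [2005, 2006, 2007, 2008, 2009]
donations = [100.0, 110.0, 120.0, 130.0, 140.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(fiscal_years, donations)
# Extend the fitted trend one fiscal year beyond the data.
forecast_2010 = intercept + slope * 2010
```

A seasonal pattern of the kind mentioned above would need gift dates at a finer grain (for example, monthly totals) rather than one value per fiscal year.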
Segmentation algorithms divided data into groups having similar characteristics.
If donors were grouped into ranges of giving, the algorithm analyzed other known data
about the groups and looked for interesting similarities between them. Those similarities
became part of the process for dealing with non-donors (Larsen, 2009).
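The idea of grouping donors into giving ranges and then examining other known data about each group can be sketched in a few lines. The giving ranges, the `attended_event` attribute and the donor records are assumptions made for illustration. Note that this sketch uses fixed, hand-chosen ranges; a true segmentation algorithm would derive the groups from the data.

```python
from collections import defaultdict

# Hypothetical donors: lifetime giving in dollars and event attendance.
donors = [
    {"lifetime_giving": 50, "attended_event": False},
    {"lifetime_giving": 80, "attended_event": False},
    {"lifetime_giving": 500, "attended_event": True},
    {"lifetime_giving": 900, "attended_event": False},
    {"lifetime_giving": 5000, "attended_event": True},
    {"lifetime_giving": 12000, "attended_event": True},
]

# Assumed giving ranges: (label, lower bound inclusive), in ascending order.
RANGES = [("under $100", 0), ("$100-$999", 100), ("$1,000 and up", 1000)]


def segment_of(amount):
    """Return the label of the highest range whose lower bound the amount reaches."""
    label = RANGES[0][0]
    for name, lower in RANGES:
        if amount >= lower:
            label = name
    return label


def attendance_by_segment(donors):
    """Share of donors in each giving range who attended an event."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [attendees, members]
    for d in donors:
        seg = segment_of(d["lifetime_giving"])
        counts[seg][0] += int(d["attended_event"])
        counts[seg][1] += 1
    return {seg: attended / total for seg, (attended, total) in counts.items()}


profile = attendance_by_segment(donors)
```

A similarity that stands out in such a profile, such as higher event attendance among larger donors, is the kind of pattern that could then inform how non-donors are approached.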
Association algorithms required that the data already have some sort of grouping,
such as multiple classes taken by a student or donors who attended a yearly event over
consecutive years. As with segmentation, association looked for characteristics in other
data known about the group to use when dealing with prospective students or donors
(Larsen, 2009).
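Association analysis over pre-grouped data is often summarized with two measures, support and confidence. The sketch below computes both for pairs of events attended by the same donor. The event names and attendance sets are hypothetical, and pairwise rules are only the simplest case of association mining.

```python
from collections import Counter
from itertools import combinations

# Hypothetical groupings: the set of events each donor attended.
attendance = [
    {"homecoming", "gala"},
    {"homecoming", "gala", "golf"},
    {"homecoming"},
    {"gala", "golf"},
]


def pair_rules(baskets):
    """Return {(a, b): (support, confidence)} for the rule 'a implies b'.

    support    = share of all baskets containing both a and b
    confidence = share of baskets containing a that also contain b
    """
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = {}
    for (a, b), together in pair_counts.items():
        rules[(a, b)] = (together / n, together / item_counts[a])
        rules[(b, a)] = (together / n, together / item_counts[b])
    return rules


rules = pair_rules(attendance)
```

In this made-up data every donor who attended the golf event also attended the gala, so the rule "golf implies gala" has a confidence of 1.0 even though its support is only 0.5.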
After data selection came model training. A data-mining model is a combination
of one or more algorithms and data. The model applied the algorithms to the data to
create the classification, association and regression formulas to solve the business
problem. For example, data with attributes for customers from a chain of stores could be
used to learn who was a good or bad credit risk. The model needed two sets of data; one
set trained the model and the second set validated the model. During validation, the
second set of data, with the known values of the dependent variable withheld, was loaded into the
model, and the model’s predictions of the dependent variable were compared with those known values (Larsen, 2009).
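The use of two data sets, one to train and one to validate, can be sketched as a simple holdout split. Everything here is an assumption made for illustration: the records are invented, the "model" is a one-attribute donor-rate lookup rather than a real mining algorithm, and the 70/30 split and fixed seed are arbitrary choices. During validation the labels of the second set are held back from the model and used only to check its predictions.

```python
import random
from collections import defaultdict

# Hypothetical labelled records: (attended_event, is_donor). Values are made up.
records = [(1, 1)] * 30 + [(1, 0)] * 10 + [(0, 1)] * 10 + [(0, 0)] * 50


def split(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and split it into training and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train(training):
    """Learn the donor rate for each attribute value seen in training."""
    tallies = defaultdict(lambda: [0, 0])  # value -> [donors, total]
    for value, is_donor in training:
        tallies[value][0] += is_donor
        tallies[value][1] += 1
    return {value: donors / total for value, (donors, total) in tallies.items()}


def predict(model, value):
    """Predict donor (1) when the learned rate is at least 0.5; unseen values predict 0."""
    return 1 if model.get(value, 0.0) >= 0.5 else 0


def accuracy(model, validation):
    """Compare predictions with the held-back labels."""
    hits = sum(predict(model, value) == label for value, label in validation)
    return hits / len(validation)


training, validation = split(records)
model = train(training)
validation_accuracy = accuracy(model, validation)
```

Because the validation records played no part in training, the accuracy measured on them gives a fairer indication of how the model might behave on new data than accuracy measured on the training set would.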
The last step was to deploy the model. New data, for which the
dependent value was unknown, would be loaded into the model so that the model could apply
the patterns and relationships it had discovered.
Problem Statement
There was an ongoing need for additional monies in higher education. Private
monies funded scholarships, buildings and equipment either partially or completely.
Public education institutions such as Marshall University received less and less financial
support from state and federal governments. Between 2002 and 2005, state funding for
Marshall University dropped $7.4 million. From 2006, funding increased slightly
each year until 2008. However, when corrected for the Consumer Price Index (CPI)
relative to the base fiscal year 2002, the financial outlook was less appealing. In FY2011,
the University expected to receive funds at FY2008 gross levels with the purchasing power
of FY2005 dollars (Kopp, 2010). See Figure 1.
Figure 1 - Understanding (Kopp, 2010)
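The CPI correction described above follows a standard formula: an amount is expressed in base-year dollars by multiplying it by the ratio of the base-year index to the index for the year in question. The sketch below applies that formula. The appropriation amounts and CPI values are hypothetical placeholders chosen for easy arithmetic; they are not Marshall University's figures and are not taken from Kopp (2010).

```python
def to_base_year_dollars(nominal, cpi_year, cpi_base):
    """Express a nominal amount in base-year dollars: nominal * CPI_base / CPI_year."""
    return nominal * cpi_base / cpi_year


# Hypothetical appropriations (in $ millions) and CPI index values,
# for illustration only.
cpi = {2002: 100.0, 2005: 108.0, 2008: 120.0}
nominal = {2002: 50.0, 2005: 45.0, 2008: 54.0}

# Funding for each fiscal year restated in FY2002 dollars.
real = {fy: to_base_year_dollars(nominal[fy], cpi[fy], cpi[2002]) for fy in nominal}
```

With these made-up numbers, FY2008 funding exceeds FY2002 funding in nominal terms but falls below it once restated in FY2002 dollars, which is the kind of gap the corrected figures are meant to expose.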
A survey of 1,027 institutions conducted by the Council for Aid to Education
(CAE) showed that colleges brought in an estimated $27.85 billion in gifts in the 2009
fiscal year. The year before, they raised $31.6 billion. Although grim, the findings came
as no surprise given an economy where donors lost significant portions of their wealth or
were afraid that they would. The survey also found that the share of alumni of record who gave declined
in 2009, falling one percentage point to 10 percent, the lowest level recorded.
This was significant because alumni were the largest source of contributions (Masterson,
2010).
Fundraisers turned to other means as institutions relied on them to supplement
traditional sources. Advancement officers looked for new techniques to expand their
reach and use their resources efficiently. One way to increase this efficiency was to
understand why constituents gave. In order to accomplish this, fundraisers had to learn
from what they already knew about constituents.
Topic Selection
The Marshall University Foundation, Inc. was a non-profit, tax-exempt,
educational corporation. The Foundation collaborated with Marshall’s Office of
Development to secure private financial support for the University. The Office of
Development Services produced information for the gift officers in support of their
fundraising activities. The decrease in monies from state government, the lagging economy
and the drop in overall alumni support necessitated new methods for understanding donor
giving. The University’s alumni database appeared to be the best place to begin to gather
this information.
Literature Review
Data mining may have had its roots in industry, but its potential
also attracted the attention of fundraisers. Research done by Wylie (2005) demonstrated
the value of mining the data in an organization’s own database to find prospective
donors. He selected eight schools, private and public, differing in size and student
population. For each school he obtained a random sample of at least 5,000 records from
the alumni database. Each record contained the total giving amount, preferred year of
graduation and marital status. Wylie studied the relationship between giving and
preferred year of graduation, and between giving and marital status. Figure 2 shows the results for
School A.
Figure 2 - Deep Pockets (Wylie, 2005)
This figure highlights several facts (Wylie, 2005):
• The oldest 25 percent of alumni (graduated in 1963 or before) accounted for almost three quarters (73 percent) of the total alumni dollars given.
• Alumni listed in the database as married accounted for a large amount (85 percent) of the total alumni dollars given.
• The youngest 50 percent of alumni (graduated in 1979 or later) accounted for only 9 percent of the total alumni dollars given.
The figures for Schools B through H for this study are in Appendices E and F.
These figures revealed the following (Wylie, 2005):
• At least 90 percent of the money from any alumni population tended to come from people who had been out of school at least 30 years.
• Regardless of the actual marital status, alumni listed as married tended to give much more money than alumni with other marital codes did.
• Alumni out of school at least 30 years and listed as married often gave a huge amount of money compared to any other group classified by marital status and class year.
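A share-of-giving figure of the kind reported above (for example, the portion of total dollars given by the oldest quarter of alumni) can be computed with a short calculation. The sketch below is an assumption about how such a figure might be derived, not Wylie's actual procedure, and the sample amounts are invented.

```python
def share_of_giving_by_oldest(alumni, fraction=0.25):
    """Share of total dollars given by the oldest `fraction` of alumni (by class year)."""
    ranked = sorted(alumni, key=lambda a: a["class_year"])  # earliest class first
    cutoff = int(len(ranked) * fraction)
    total = sum(a["total_giving"] for a in ranked)
    oldest = sum(a["total_giving"] for a in ranked[:cutoff])
    return oldest / total if total else 0.0


# Hypothetical sample; amounts are illustrative, not Wylie's data.
alumni = [
    {"class_year": 1955, "total_giving": 7000},
    {"class_year": 1962, "total_giving": 1500},
    {"class_year": 1970, "total_giving": 800},
    {"class_year": 1975, "total_giving": 400},
    {"class_year": 1982, "total_giving": 200},
    {"class_year": 1990, "total_giving": 100},
    {"class_year": 1998, "total_giving": 0},
    {"class_year": 2004, "total_giving": 0},
]
oldest_quarter_share = share_of_giving_by_oldest(alumni)
oldest_half_share = share_of_giving_by_oldest(alumni, fraction=0.5)
```

The same function could be applied to any institution's own records by swapping in its class years and giving totals.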
Wylie (2005) recommended:
• If looking for major gifts, concentrate on people who have been out of school for at least 30 years (but nurture younger alumni).
• Prospect researchers (using a screening service and doing individual research) should focus on these older individuals.
In his article, Greeks Bearing Gifts, Wylie (2007) documented the importance of
capturing other types of information about alumni. In this study, giving information
captured for Greeks and non-Greeks covered a fifty-year span and came from six four-year
institutions across the country. Five of the institutions were private and one was a large
public school. All records of solicitable alumni were in the dataset from the smaller
schools, and random samples of at least ten thousand records were in the dataset from the
larger schools. Variables in the datasets included:
• Whether an alumnus was listed as having belonged to a Greek organization
• The preferred year of graduation for each alumnus
• The total lifetime hard credit giving from each alumnus
In this analysis for each school, Wylie (2007) answered four questions:
1. Was there a difference in the rate of lifetime giving between former Greeks and non-Greeks?
2. How did this rate change as a function of the length of time an alumnus was out of school?
3. For Greek and non-Greek donors, was there a difference in the median lifetime giving between the two?
4. How did this difference change as a function of how long they had been out of school?
The first two questions were answered by computing the percentage of Greeks
and non-Greeks who had ever donated at each of eleven five-year intervals since
graduation (5 years or less, 6-10 years, 11-15 years, and so on up to 50 years or more).
Figures 3 and 4 show results for Schools A and B, respectively.
Figure 3 - Lifetime Greeks Bearing Gifts School A (Wylie, 2007)
Figure 4 - Lifetime Greeks Bearing Gifts School B (Wylie, 2007)
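The participation-rate calculation behind figures of this kind can be sketched as follows. The record fields and sample alumni are hypothetical. The interval boundaries are also an assumption: the sketch uses 5 years or less, then 6-10 through 46-50, then over 50, which yields eleven intervals but may not match the exact cut points Wylie used.

```python
from collections import defaultdict


def interval_label(years_out):
    """Map years since graduation to a five-year interval label (assumed boundaries)."""
    if years_out <= 5:
        return "5 or less"
    if years_out > 50:
        return "over 50"
    lower = ((years_out - 1) // 5) * 5 + 1
    return f"{lower}-{lower + 4}"


def giving_rates(alumni):
    """Percent of alumni who ever gave, keyed by (is_greek, interval)."""
    tallies = defaultdict(lambda: [0, 0])  # key -> [donors, alumni]
    for a in alumni:
        key = (a["greek"], interval_label(a["years_out"]))
        tallies[key][0] += int(a["lifetime_giving"] > 0)
        tallies[key][1] += 1
    return {key: 100.0 * donors / total for key, (donors, total) in tallies.items()}


# Hypothetical alumni records, for illustration only.
alumni = [
    {"greek": True, "years_out": 3, "lifetime_giving": 25},
    {"greek": True, "years_out": 4, "lifetime_giving": 0},
    {"greek": False, "years_out": 2, "lifetime_giving": 0},
    {"greek": False, "years_out": 5, "lifetime_giving": 0},
    {"greek": True, "years_out": 32, "lifetime_giving": 1000},
    {"greek": False, "years_out": 34, "lifetime_giving": 0},
    {"greek": False, "years_out": 31, "lifetime_giving": 50},
]
rates = giving_rates(alumni)
```

Plotting the two series of rates against the intervals would give a chart comparable in form to the lifetime participation figures.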
The figures for Schools A and B showed a very large difference in giving rates
between Greeks and non-Greeks regardless of the number of years the alumnus had been
out of school. They further showed that the lifetime giving rate of non-Greeks out of school more
than 20 years dropped off slightly, i.e., by less than 1 percent, while the rate for Greeks
stayed about the same.
In none of the schools did the lifetime giving rates of non-Greeks ever exceed
those of the Greeks (Wylie, 2007). Only in one instance (Figure 6 for School F, alumni
out of school 5 years or less) were the participation rates the same. The difference in
lifetime participation between Greeks and non-Greeks generally widened as the time since
graduation increased, but not in every case. Sometimes the gap narrowed, but the Greeks
never fell behind non-Greeks in giving. The remaining figures, showing lifetime
participation rates for Greeks and non-Greeks since graduation for Schools C through F,
are in Appendix C.
Wylie (2007) answered questions 3 and 4 by calculating the median lifetime
giving by age group for Greeks and non-Greeks, but only for those alumni who had ever
made a gift.
3. For Greek and non-Greek donors, was there a difference in the median lifetime giving between the two?
4. How did this difference change as a function of how long they had been out of school?
The results for Schools A and B are in Figures 5 and 6, respectively.
Figure 5 - Median Greeks Bearing Gifts School A (Wylie, 2007)
Figure 6 - Median Greeks Bearing Gifts School B (Wylie, 2007)
These data indicated that Greeks gave more than non-Greeks and that the difference
tended to grow the longer alumni were out of school. Wylie (2007) believed this study
showed Greeks were definitely better givers than non-Greeks. He also pointed out how
important it was for institutions to look at the data in their own databases to learn about
their graduates, and put that information to use in their fundraising efforts. The results
for Schools C through F are in Appendix D.
In a study conducted jointly by the Council for Advancement and Support of
Education (CASE) and Statistical Package for the Social Sciences (SPSS), alumni records
from the Johns Hopkins Zanvyl Krieger School of Arts and Sciences (A&S) were used to
explore data mining (Krieger & Luperchio, 2009). The model tested successfully on
datasets from other educational institutions. Including A&S, the schools that volunteered
data covered a wide range of higher education and giving (Krieger & Luperchio, 2009):
• Seven universities, two colleges and one community college
• Six public and four private institutions
• Eight institutions across the United States, one in Canada, and one in England
• Five self-identified research institutions and two specializing in the liberal arts
• Alumni participation rates ranging from 3 percent to 92 percent
The analysis revealed four distinct patterns of giving by alumni, as indicated in
Figure 7 (Krieger & Luperchio, 2009).
Figure 7 - Cluster Chart (Krieger & Luperchio, 2009)
Although the cluster in Figure 7 consisted of schools with low lifetime
participation rates, between 5 and 15 percent, each institution still had a small subset of
major donors who provided most of the total giving. The high percentage of non-donors
limited the ability of the A&S model to reduce non-donor representation in the top
deciles. The few committed major donors created a strong ideal major donor profile that
captured nearly 90 percent of known major donors for schools in this cluster (Krieger &
Luperchio, 2009). The aggregated study results for the remaining clusters are contained
in Appendix A.
The A&S model was successful in developing a profile for the ideal major donor
at each institution and in identifying qualified prospects for major gifts and the variables most
closely related to lifetime giving. The analysis also revealed similarities and
differences. Predictors that were most influential for one institution were completely
insignificant for another, and some institutions produced stronger predictors than other
institutions despite having fewer data variables (Krieger & Luperchio, 2009).
In his blog, Kevin MacDonnell (2010) posted 15 top predictors for annual giving.
Among those were class year, home telephone, marital status, employment, events
attended and business telephone. As in the 2009 study done by Krieger and Luperchio,
MacDonnell pointed out there was no “magic list of predictors” that worked everywhere
and always. While some variables such as ‘class year’ and ‘home telephone’ were
important, institutions must explore their own data.
Even though research indicated a correlation between attributes known about
alumni and giving, it also indicated that every situation was unique and that understanding what
individual bits of information meant to a specific institution was most important.
According to McClintock (2004), “Data mining is not just about finding individual
wealthy prospects. This data mining is about truly understanding your prospect pool. It’s
about providing knowledge that informs strategic planning-knowledge that leads to
increased fund-raising results” (p. v).
Research Methods
The data used in this project came from the Marshall University Alumni database.
Alumni in this database are graduates and former attendees of Marshall University and
any constituent, person or non-person who donated to Marshall University. For this
project, only individuals were included. The dataset contained 113,405 individuals
including both donors (23.8 percent) and non-donors (76.2 percent) and 49 variables
including graduation year, degree, major, home telephone, email, employment, event
attendance and several variables capturing previous donation behavior. Appendix B
contains a complete description of the variables.
Several tools provided functionality needed to capture and prepare the data for
modeling. The tables containing alumni information were imported from the Marshall
University Foundation, Inc. (MUFI) Oracle database into Microsoft Access, where one
table containing all needed information was created using SQL queries. The modeling
software needed variables created to represent the existence of a value for categorical
fields, such as email, employment, home telephone number, student activity, direct mail
response and fiscal year donations. These variables contained a 1 if the information was
present in the database and a 0 if it was not.
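The presence indicators described above were created in this project with SQL queries in Microsoft Access. The Python sketch below shows the same idea in a different tool and is not the project's actual code. The field names, the `_ind` suffix and the sample row are hypothetical.

```python
# Fields the text names as needing presence indicators; the record layout
# and field names are hypothetical.
FLAG_FIELDS = ["email", "employment", "home_phone", "student_activity"]


def add_indicators(record):
    """Return a copy of the record with a <field>_ind column: 1 if present, else 0."""
    flagged = dict(record)
    for field in FLAG_FIELDS:
        value = record.get(field)
        # Treat None, a missing key and blank strings as "not present".
        present = value is not None and str(value).strip() != ""
        flagged[f"{field}_ind"] = 1 if present else 0
    return flagged


row = {"id": 1, "email": "a@example.org", "employment": "", "home_phone": None}
flagged = add_indicators(row)
```

Keeping the original fields alongside the new 1/0 columns allows the indicators to be checked against the source values during data inspection.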
Inspecting the data for appropriate content and missing values revealed missing
gender values. These records received a new value based on the name prefix field where
a name prefix existed. Records with unknown data, such as birth year or age, received a
zero or null value depending on the variable data type.
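Filling a missing gender value from the name prefix might look like the following minimal sketch. The prefix list and the gender codes "M" and "F" are assumptions for illustration; the codes actually used in the alumni database are not stated here. Prefixes that carry no gender information, such as "Dr.", leave the value missing.

```python
# Assumed prefix-to-gender mapping; the codes 'M'/'F' and the prefix list
# are illustrative.
PREFIX_GENDER = {"mr": "M", "mrs": "F", "ms": "F", "miss": "F"}


def fill_gender(record):
    """Fill a missing gender from the name prefix; leave it missing if the prefix is no help."""
    if record.get("gender"):
        return record["gender"]  # keep an existing value untouched
    prefix = (record.get("prefix") or "").strip().rstrip(".").lower()
    return PREFIX_GENDER.get(prefix)  # None for 'Dr.', blank, or unknown prefixes


filled = fill_gender({"gender": None, "prefix": "Mrs."})
unresolved = fill_gender({"gender": "", "prefix": "Dr."})
```

The function only fills gaps. It does not overwrite an existing gender code, so a conflict between code and prefix would still need to be found by separate inspection.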
Rapid Insight’s Analytics predictive modeling software provided the functionality
for producing summary descriptive statistics and data modeling. Descriptive statistics
included mean, min, max, number of records, and standard deviation. This information
provided a check of record counts for specific fields, overall counts of the dataset,
number of observations per field and variable type. A complete list of the variables
appears in Appendix B.
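The summary statistics named above can be reproduced for a single field with Python's standard `statistics` module, as in the sketch below. This is an illustration rather than a description of how Rapid Insight computes its output; in particular, the use of the sample standard deviation is an assumption, and the graduation years are invented.

```python
import statistics


def describe(values):
    """Summary statistics for one numeric field, ignoring missing (None) entries."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "min": min(present),
        "max": max(present),
        "std_dev": statistics.stdev(present),  # sample standard deviation
    }


# Hypothetical graduation years with one missing value.
summary = describe([1980, 1985, None, 1990])
```

If unknown values have been stored as zero rather than null, they would need to be excluded first; otherwise they would pull the mean and minimum toward zero.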
Results
Univariate analysis on several variables provided counts and segment information
about the dataset population. The majority of alumni currently reside in the Tri-State
area, consisting of West Virginia, Ohio and Kentucky. The largest share of
alumni reside in West Virginia (50 percent of the population), followed by Ohio with
almost 10 percent and Kentucky with almost 5 percent. Appendix M contains
percentages for the entire population. The average graduation year was 1985, and the
average age on record was forty-nine.
The majority of alumni hold one of the following degrees: Bachelor of Arts,
Bachelor of Arts in Business, Bachelor of Science or Bachelor of Science in Nursing.
Almost 64 percent of the alumni fell within less than 2 percent of the declared majors.
The most common major was elementary education at 9.24 percent, followed by accounting
at 4.49 percent and management at 4.39 percent. Appendices G and H list the complete
degree and major breakdown. Gender distribution indicates females make up 54 percent
and males 45 percent of the alumni dataset. Fifty-four percent are listed as married,
12 percent as single and, unfortunately, almost 30 percent as unknown, indicating lost
information and a missed opportunity. Wylie (2005) reported in his article, Where the
Alumni Money Is, a study of eight different higher education institutions in which alumni
listed as married accounted for 86 percent of the total alumni dollars given. In order to
present the most complete picture of alumni, it is important to capture this most basic
type of data.
Multivariate analysis between the donor indicator variable and other
variables underscores the impact of these variables on the potential for giving. This study
of variables identified segments of the dataset likely to become indicators for giving prior
to regression modeling. This analysis highlighted the relationship of age and years since
graduation to donating to the University. These results mirror the results found in the
study done by Wylie (2005), which illustrated a correlation between years since
graduation and donor giving. Figures 8 and 9 show the relationship between giving and
age, and giving and years since graduation.
Figure 8 - Giving and Age
Figure 9 – Giving and Years since Graduation
These results showing a steady increase in giving until age 50 may reflect this
group having increased financial security and increased capacity to donate. A slight
leveling occurs around the same time before increasing again until about age 70. This
offers the opportunity for two different types of solicitations using age groups as the
guiding factor. Appendix I shows a very similar relationship between lifetime giving of
$1,000 or more and age.
The relationship between degree of record and giving to the University indicates
alumni with a Bachelor of Science degree have a likelihood to donate that was not
immediately obvious in the multivariate analysis below.
Figure 10 - Giving By Degree
Figure 10 indicates graduates with a Bachelor of Arts and Bachelor of Arts in
Business have high donor counts. However, the underlying data shown in Appendix J
indicate graduates with a Bachelor of Science degree donate at a comparable rate.
Their donor count of 2,048 is 30 percent of their overall count of 6,724. The
Bachelor of Arts and Bachelor of Arts in Business have donor rates of 27 and 31 percent
respectively, but their overall counts of 29,066 and 11,001 are much higher. Here, too,
the analysis reveals information useful in framing solicitations.
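The comparison of donor counts with donor rates can be checked directly from the counts in Appendix J. The sketch below recomputes the rate (donors divided by total graduates) for the three degrees discussed; only the helper name `donor_rate` is introduced here.

```python
# (donor count, total graduates) copied from Appendix J.
degree_counts = {
    "Bachelor of Arts": (7819, 29066),
    "Bachelor of Arts in Business": (3428, 11001),
    "Bachelor of Science": (2048, 6724),
}


def donor_rate(donors, total):
    """Share of graduates with a given degree who have donated."""
    return donors / total


rates = {degree: donor_rate(d, n) for degree, (d, n) in degree_counts.items()}

# List degrees from highest to lowest donor rate.
for degree, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{degree}: {rate:.1%}")
```

Ranking by rate rather than by raw count is what brings the Bachelor of Science group forward, since its smaller population hides a rate close to that of the larger business degree group.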
A multivariate analysis of donor indicator and student activity revealed
disappointing but expected results. The student activity code indicates alumni
participation in an activity while attending Marshall. Such activities include, among
others, belonging to University-related groups or organizations (fraternities, sororities,
student government), playing sports or being a member of the band. Figure 11 shows the
alumni coded as having participated in a student activity and whether or not they donated
anytime during the past five years.
Figure 11 - Donating and Student Activity
The columns represent alumni donors. A 1 indicates a donor and a 0 indicates a
non-donor. This figure shows that, among alumni who participated in a student activity,
those who donated were outnumbered by those who did not donate.
Although this Marshall University student activity code includes Greek
membership as well as other student activities, these results do not reflect the results
described in Greeks Bearing Gifts (Wylie, 2007), which showed a strong connection between
Greek membership and giving. It is important to put this in the context of the overall
counts in the database. Of the 86,329 actual graduates in the dataset, only 12,567 (14
percent) have information in the database indicating they participated in a student
activity. This is another example of being unable to present a complete and accurate
picture of Marshall’s alumni, therefore missing the opportunity to use that information in
the University’s fundraising efforts.
After the variables were evaluated based on their relationship to the donor code, a
logistic regression model built with Rapid Insight’s Analytics predictive modeling
software provided a comprehensive picture of donor giving. The goal was to identify
characteristics useful in framing a donation request that would increase the chances of
receiving a donation from a non-donor. The target variable for this case was the donor
code indicator, referred to as a response rate, defined as a binary variable containing a
1 if an individual ever donated to the institution and a 0 if he or she did not. The Rapid
Insight mining tool identified 10 of the 49 dataset variables as related to the response
rate variable at a significance level of p = .01. Table 1 lists the model, including
variable coefficients and individual p-values. From this analysis, it is evident which
variables have a strong impact on whether an individual will ever provide a donation to
Marshall University. In addition to age, individuals who provide their telephone, email
and employment information have a higher propensity to donate than those individuals
who do not.
Variable      Coefficient   p-value
EMAIL         0.6788        0.0000
HPHONE        0.5386        0.0000
EMPLOYMENT    0.5712        0.0000
STUCODE       0.5549        0.0000
OTHCODE       0.3693        0.0000
ALUM          -1.103        0.0000
GRADYR        0.9473        0.0000
ATTEVNT10     0.7484        0.0000
ATTEVNT09     1.610         0.0000
ATTEVNT07     0.7666        0.0000
Table 1 - Donor Response Model
Also of interest are the indicators for event attendance. Event attendance data were
included for FY05 through FY10, but the analysis indicates only FY07 and FY10 were
significant. The odds ratios of 2.1525 and 2.1135, respectively, support this (see
Appendix K). This may be because the movie We Are Marshall, released in late December
2006, greatly increased Marshall’s visibility nationwide, and the momentum continued
through fiscal year 2007. In addition, during fiscal year 2010 the organization made a
concentrated effort to capture event attendee information and include it in the alumni
database.
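In logistic regression, the odds ratio for a binary indicator is the exponential of its coefficient. The short sketch below applies that standard relationship to the FY07 and FY10 event attendance coefficients from Table 1; the results come out at roughly 2.15 and 2.11, in line with the odds ratios quoted above. The helper name `odds_ratio` is introduced only for this example.

```python
import math

# Coefficients for the event attendance indicators, copied from Table 1.
coefficients = {"ATTEVNT07": 0.7666, "ATTEVNT10": 0.7484}


def odds_ratio(coefficient):
    """Convert a logistic regression coefficient into an odds ratio."""
    return math.exp(coefficient)


odds = {name: odds_ratio(beta) for name, beta in coefficients.items()}

for name, value in odds.items():
    print(f"{name}: odds ratio {value:.4f}")
```

An odds ratio a little above 2 means that, holding the other variables constant, the odds of being a donor are about twice as high for alumni recorded as attending an event in that fiscal year.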
The model also scored a Bachelor of Science degree as a strong indicator, as well as
majors in accounting and journalism (see Table 2). Not surprising is the low donation
likelihood score given to the residents of the state of West Virginia. However, the high
score for the state of Connecticut was unexpected, since the multivariate analysis
pointed first to the Tri-State area, then to Ohio, as home to individuals likely to make
a donation. A second model, using as the response indicator the likelihood to donate
$1,000 or more in one’s lifetime, mirrors the results seen in the donor response model
(see Appendix L). The low score for West Virginia is supported by the overall low score
(-1.103) of the alumni variable indicator as a characteristic of a donor. The state has
the largest population of alumni, yet has low donation numbers. Of the 40,979 alumni in
the state, only 10,877 or 26 percent are donors, versus 35 percent for both Ohio and
Kentucky. Connecticut has significantly fewer alumni in residence (162), but 92 of them
are donors, for an impressive 56 percent. When looking at specific fiscal years and
numbers of donors, West Virginia fared much better. Of the 4,708 donors in FY2008, 47
percent were West Virginia residents, and in FY2010, 45 percent of the donors were West
Virginia residents. The donor response model considered all the variables, which
included a great deal more information; nevertheless, these numbers indicate a need for
further research to understand the results.
Variable                             Coefficient   p-value
Binary (DEG1, Bachelor of Science)   0.3122        0.0000
Binary (MAJ1, BBA, Accounting)       0.5806        0.0000
Binary (MAJ1, Journalism)            0.9633        0.0000
Binary(STATE, WV)                    -0.2280       0.0000
Binary(STATE, CT)                    2.1699        0.0000
Table 2 - Donor Response Model Continued
Rapid Insight provides a utility to apply the logistic regression to the entire
dataset by scoring all individuals in the dataset and identifying those who currently do not
donate, but have a high likelihood of doing so. The scoring system ranked all individuals
in the dataset between 1 and 10 to indicate a propensity to donate. This model returned
4,201 current non-donors within the first decile, indicating a high propensity to donate.
The model was developed using fifty percent of the dataset and tested on the remaining
fifty percent for accuracy. This resulted in a 76.12 percent concordance rate. The
concordance rate measures model fit. Percentages close to 100 percent indicate a nearly
perfect model.
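The scoring and concordance steps can be illustrated with a small sketch. It assumes the common definition of concordance (the percentage of donor and non-donor pairs in which the donor received the higher predicted score) and a simple rank-based decile assignment in which decile 1 holds the highest scores; Rapid Insight's exact calculations are not documented here, and the scores and outcomes are invented.

```python
def concordance_rate(scores, outcomes):
    """Percent of (donor, non-donor) pairs in which the donor scored higher."""
    donors = [s for s, y in zip(scores, outcomes) if y == 1]
    non_donors = [s for s, y in zip(scores, outcomes) if y == 0]
    pairs = len(donors) * len(non_donors)
    if pairs == 0:
        raise ValueError("need at least one donor and one non-donor")
    # Tied scores are counted as not concordant in this sketch.
    concordant = sum(1 for d in donors for n in non_donors if d > n)
    return 100.0 * concordant / pairs


def decile_ranks(scores):
    """Assign deciles by rank: 1 = highest predicted propensity, 10 = lowest."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, index in enumerate(order):
        ranks[index] = position * 10 // len(scores) + 1
    return ranks


# Invented predicted probabilities and actual donor flags (1 = donor).
scores = [0.91, 0.80, 0.75, 0.62, 0.55, 0.40, 0.33, 0.21, 0.15, 0.05]
outcomes = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

rate = concordance_rate(scores, outcomes)
deciles = decile_ranks(scores)
```

Non-donors who land in decile 1 are the prospects of interest: the model scores them like donors even though they have not yet given.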
Of the 4,201 non-donor records, 633 were in a dataset sent out to an external
wealth screening service to obtain a score indicating their propensity to donate a major
gift to the University. The model also returned 7,136 current donors in the first decile,
and of those, 4,839 were in the dataset sent to the screening company. The information
from the screening service clearly supports the results returned from the scoring model,
indicating it is a successful model for predicting which non-donors are likely to donate.
Discussion and Evaluation
The use of predictive modeling offers fundraising organizations the possibility of
new donors. The studies done using information gathered about alumni to create a set of
characteristics that identify an individual or a group of individuals offer the opportunity
to streamline and focus campaigns and solicitations. Equally important, modeling can
provide an institution with the resources to target the best constituents. While studies
show that a valid predictor for one institution may not work at another institution, they
underscore the importance of using existing in-house information to make those
identifications, thus illustrating the importance of capturing that information. The
results of this project reinforce those findings and make a strong case for
capturing this information to use in analysis. Multivariate analysis in this study
supported results seen by Wylie (2005) and illustrates the relationship between age and
philanthropic contributions. Model results highlighted the connection between attending
events, such as galas, homecoming and alumni weekend, and donating. Individuals
attending these events likely have a strong relationship with Marshall University and are
more inclined to donate. Despite the limited amount of activity-related information
captured in the database, both the donor response model and the lifetime giving model
show a strong association between the propensity to donate money and attendance at the
institution’s activities. The results highlight not only different strategies for donation
requests using degrees, majors, graduation year and age, but also areas for further study
such as donations by state and donations over a span of years.
Even though records in the dataset are missing an accurate marital code, the
multivariate analysis of donor code and marital status reflects the results found by
Wylie (2005).
Of the individuals coded as married, 28.9 percent are donors, as compared to single
individuals at 19.9 percent, and those coded as unknown at 14 percent.
The database does not have information such as event attendance for a large
number of events or attendees. Nevertheless, there is still a promising correlation between
event attendance and giving. For example, the 149 individuals who attended an event in
fiscal year 2010 donated $878,832.64 in the same fiscal year. This warrants further
study as there is significant giving information (amounts, dates, areas of interest,
consistent giving) which could reveal giving trends and interests previously unknown.
Additionally, it clearly underscores the need to capture additional information about
alumni. Both the multivariate analysis and the models, which used the same variables,
illustrate the need to increase efforts to capture information about alumni such as event
attendance, marital codes and connections to the University.
The results of the multivariate analysis and models support research done by
others, indicating that using the data in an organization’s own database can yield results
useful to fundraising efforts. Further, there appear to be variables across numerous
studies that consistently associate themselves with the characteristics of donors.
This study uncovered associations between donors and information known about
donors that, when applied to non-donors, could yield beneficial results for the University’s
fundraising efforts. However, the data also indicate a need for further study so that the
information will be used in the most efficient and successful manner.
Conclusions
While not mainstream within the field of philanthropic giving, the abundance of
research and availability of modeling software indicate that predictive modeling is a
beneficial tool for fundraising organizations. It offers the opportunity to make
informed, statistically supported business and fundraising decisions. Research supports
the use of predictive modeling as a proven means to identify the best prospects, targeting
methods, and segmentation groups. Such a powerful tool is certain to become mainstream
in the near future. As more and more philanthropic organizations utilize modeling, it will
become the preferred tool to identify new donors and fundraise more efficiently. With the
ability to discover hidden patterns and build models to predict behavior, fundraising
organizations can address issues in solicitations, campaigns, marketing and prospect
research.
The analysis and logistic regressions developed throughout this project identified
several key characteristics about donor alumni as well as areas needing improvement in
the database. This information was positive because unknown connections and
relationships became known and their potentials revealed.
When speaking of connections and relationships, the most important are the
connections and relationships established and nurtured by fundraisers with current and
prospective donors. These relationships, measured over time, vary from person to person
and fundraiser to fundraiser and are difficult to measure statistically. However, their
importance cannot be overstated. Data mining cannot replace or duplicate this most
essential aspect of fundraising, but data mining can enhance and support that process in
such a way that both donor and institution benefit.
Future Work
The most common use of analytics is to identify characteristics useful in locating
major donors, but analytical tools are suited to other areas of fundraising. Some of these
include:
• Home in on variables that are strong indicators for current and future use
• Understand donations in relation to event attendance
• Understand how fundraisers spend their time and which activities translate into a gift
• Determine which tasks contribute to increased giving and which detract
• Predict top donors
• Predict event attendance
• Predict who will be top donors in ten years
• Discover groups using text mining
• Develop an integrated prospecting system
• Uncover giving patterns
• Model phonathon segmentation
References
Birkholz, B. (2008). Fundraising analytics. Using data to guide strategy. John Wiley &
Sons, Inc.: New Jersey.
Cios, J., Kurgan, L., Pedrycz, W. & Swiniarski, R. (2007). Data mining. A knowledge
discovery approach. Springer Science+Business Media, LLC: New York.
Han, J., & Kamber, M. (2006). Data mining concepts and techniques. (2nd ed.).
Morgan Kaufmann Publishers: New York.
Iwanyk, B., Nichol, J. (Producers), & Nichol, J. (Director). (2006). We Are Marshall
[Motion picture]. United States: Warner Brothers.
Khabaza, T. (2009). Hard hat area: Myths and pitfalls of data mining. Executive brief.
SPSS. Retrieved December 10, 2009, from
ftp://hqftp1.spss.com/pub/web/wp/HHAEB-0209.pdf
Kopp, S. (2010, February). Understanding our budgetary challenges. Communiqué
retrieved August 27, 2010 from
http://www.marshall.edu/president/comm/feb2010.pdf
Krieger, Z., & Luperchio, D. (2009). Data mining and predictive modeling in institutional
advancement: How ten schools found success. SPSS Technical report produced
jointly with the Council for the Advancement and Support of Education (CASE) and
SPSS Inc. Retrieved January 2, 2010, from
http://whitepapers.techrepublic.com.com/abstract.aspx?docid=1125993
Larsen, B. (2009). Delivering business intelligence with Microsoft SQL Server 2008.
New York: McGraw-Hill.
Leslie, L., & Ramey, G. (1988). Donor behavior and voluntary support for higher
education institutions. The Journal of Higher Education, Vol 59. No. 2 (Mar. –
Apr., 1988), pp. 115-132. Retrieved August 13, 2010 from
http://www.jstor.org/stable/1981689
McClintock, S. Foreword. (2004). Data mining for fund raisers, 2005. By Peter Wylie.
Council for Advancement and Support of Education: Washington, DC, V
MacDonnell, K. (2010, January 22). Four mistakes I have made. Retrieved
January 20, 2010, from message posted to
http://cooldata.wordpress.com/2010/01/22four-mistakes-i-have-made/
MacDonnell, K. (2010, January 6). The 15 top predictors for annual giving. Retrieved
January 20, 2010, from message posted to
http://cooldata.wordpress.com/2010/01/06/the-15-top-predictors-for-annual-giving/
Masterson, K. (2010). Private giving to colleges dropped sharply in 2009. The
Chronicle of Higher Education. Retrieved August 13, 2010, from
http://chronicle.com/article/Private-Giving-to-Colleges/63879/
Two Crows Corporation. (2005). Introduction to data mining and knowledge discovery.
(3rd ed.). [Electronic Booklet]. Potomac, MD. Retrieved February 26, 2008 from
http://www.twocrows.com/index.htm
Wylie, P. (2007). Greeks Bearing Gifts. Retrieved September 7, from
http://www.datadesk.com/products/mediadx/keydonor/Greeks_Bearing_Gifts.pdf
Wylie, P. (2005). Deep pockets. Where the alumni money is. Retrieved September 11,
2010 from http://www.datadesk.com/products/mediadx/keydonor/Deep%20Pockets.pdf
Wylie, P. (2004). Data mining for fund raisers. Council for Advancement and Support of
Education: Washington, DC.
Appendix A
A&S Study Aggregated Results (Krieger & Luperchio, 2009)
Appendix B
Marshall University Alumni Database Variables
Appendix C
Rates for Greeks and non-Greeks Schools C – F (Wylie, 2007)
Appendix D
Median Dollars for Greeks and non-Greeks Schools C – F (Wylie, 2007)
Appendix E
Giving By Marital Status and Class Year Schools B – E (Wylie, 2005)
Appendix F
Giving by Marital Status and Class Year Schools F-H (Wylie, 2005)
Appendix G
Degree Breakdown (Marshall University, 2009)
Appendix H
Major Breakdown (Marshall University, 2009)
Appendix I
Lifetime Giving and Age (Marshall University, 2009)
Appendix J
Donor Indicator and Degrees (Marshall University, 2009)
DEG1                             Y-variable Mean   Y-variable Sum   Count
Bachelor of Applied Science      0                 0                9
Bachelor of Arts                 0.26901           7819             29066
Bachelor of Arts in Business     0.31161           3428             11001
Bachelor of Engineering Scienc   0.5377            164              305
Bachelor of Fine Arts            0.12006           85               708
Bachelor of Science              0.30458           2048             6724
Bachelor of Science in Chemist   0.16129           5                31
Bachelor of Science in Cytotec   0.11765           2                17
Bachelor of Science in Enginee   0.28571           2                7
Bachelor of Science in MedTech   0                 0                25
Bachelor of Science in Nursing   0.20232           331              1636
Bachelor of Social Work          0.09365           28               299
Appendix K
Donor Response Model (Marshall University, 2009)
Appendix L
Likelihood Model (Marshall University, 2009)
Appendix M
Distribution of Alumni by State (Marshall University, 2009)