This notebook contains an exploration of DonorsChoose.org data and the development of a simple recommendation system to match donors with projects.
Using KMeans clustering to identify latent groupings, or profiles, of projects, new projects are classified into one of the identified project profiles. A web-based API running the classifier outputs lists of Donor IDs segmented by location (using Donor State as a proxy variable), by organization loyalty (using Donation Included Optional Donation as a proxy variable), and by donation timing (using the difference between Donation Received Date and Project Posted Date as a derived proxy variable). This segmentation by location, organization loyalty, and donation timing follows related literature on the relevance of preference based on proximity and personal background (Breeze, 2013), support for the crowdfunding host as an indicator of donor retention or long-term repeat donations (Althoff and Leskovec, 2015), and donation timing affecting when donors actually donate (Salomon et al., 2015). Following the methodology illustrated in the flowchart below, a working recommender system was developed with four identified latent profiles of projects and corresponding clusters of potential repeat donors. The original document containing the code can be found here: https://www.kaggle.com/tristanjoshuaalba/helping-donorschoose-org-eda-recommender-system
Key Insights from Data:
- The number of projects posted for the first time has grown rapidly since 2002, peaking at around 80,000 projects per year in 2016.
- Requested items are mostly cheap: most cost less than 20 dollars, with a few outliers costing almost 100,000 dollars.
- Cheaper resource items tend to be requested in larger quantities.
- Majority of projects are teacher-led.
- Top projects are those that deal with literacy and language.
- Top grade level category for projects is Grades Pre-K-2.
- Top projects requested either books, supplies or technology.
- Most projects are already fully funded.
- The average cost of a project is around 741 dollars, pulled up by a few high-value projects, the largest costing 255,737 dollars. 75% of the projects cost less than 868 dollars.
- On average, it only takes about a month (~32.07 days) for a project to get fully funded. 75% of projects get funded within 50 days.
- Most Teachers that create projects have prefixes
- We can infer from the above that most Teachers who create projects are female.
- Sunday is the most popular day for Teachers to post their project the first time (73,410 projects posted)
- September is the most popular month for Teachers to post their project the first time (59,495 projects posted)
- Most beneficiary schools come from
- Most schools that benefit from the projects provide free lunches: the median share of the student population receiving free lunch is 61%.
- California has the most beneficiary schools, while Wyoming has the fewest.
- A handful of donors are teachers, but most are not.
- California has the most donors, while Wyoming has the fewest.
- Chicago is the city with the most donors, followed by the eastern city of New York and western cities in California such as San Francisco and Los Angeles.
- Swarm plots of donations over time show that donors from the same city tend to donate in bursts within a short time period.
- Most donations are less than 20 dollars, with very few large individual ones at 400 or 500 dollars.
- Most donations come with an optional donation to DonorsChoose.org.
- Donations received this year peaked on January 26, 2018.
DonorsChoose.org is an online charity platform dedicated to supporting K-12 public education in the U.S. Briefly, it is a crowdfunding site where teachers can post project requests and where donors can donate to help fund the teachers’ educational causes. Since its founding in 2000, the platform has raised over $685 million for 1.1 million projects from over 3 million people and partners. Moreover, teachers from almost 75% of all public schools in the U.S. have sought the help of DonorsChoose.org in raising funds for their projects, making the platform the premier website for supporting education.
Currently, teachers still spend over a billion dollars out of pocket for their students’ needs. In order for students to get what they need to learn, DonorsChoose.org must be able to encourage its roster of first-time donors to donate again to projects that inspire them most.
Building the Recommendation System
This section presents a model for a hybrid recommendation system utilizing both content-based filtering and demographic recommendations. By definition, content-based filtering focuses on the properties of items, and the similarity between items is determined by measuring the similarity of their properties. A demographic recommender, on the other hand, provides recommendations based on a demographic profile of the user.
Borrowing terminology from the field of data mining, there are two components of the recommendation system in relation to the DonorsChoose.org challenge:
- Item Profiles (Projects) – given by the characteristics of the projects from which ‘profiles’ will be constructed
- User Profiles (Donors) – given by the demographics of the donors
By making inferences from users’ behaviors (donor donation and project preferences) and their demographic profiles, we develop a recommendation system potentially useful for creating targeted email campaigns that encourage repeat donations from first-time donors.
Note: For simplicity, we opted to omit Project Essay and other free-form answer features of the projects and resorted to using the several project ‘tags’ features to discover item profiles.
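As a rough illustration of how tag-style features could be turned into numeric item profiles, the sketch below one-hot encodes a few made-up projects with pandas. The column names follow the dataset, but the rows, the variable names, and the choice of `get_dummies` are assumptions for illustration, not necessarily the encoding used in the notebook.

```python
import pandas as pd

# Hypothetical toy projects; only the column names come from the dataset.
projects = pd.DataFrame({
    "Project Grade Level Category": ["Grades 3-5", "Grades 9-12", "Grades 6-8"],
    "Project Resource Category": ["Books", "Sports & Exercise Equipment", "Supplies"],
    "Project Subject Subcategory Tree": [
        "Literacy",
        "Gym & Fitness, Health & Wellness",
        "Mathematics",
    ],
})

# Single-valued tags: one indicator column per category value.
single_valued = ["Project Grade Level Category", "Project Resource Category"]
one_hot = pd.get_dummies(projects[single_valued]).astype(int)

# Multi-valued tags: split on ", " so each subcategory gets its own column.
multi_hot = projects["Project Subject Subcategory Tree"].str.get_dummies(sep=", ")

# Numeric item features that a clustering algorithm can consume.
item_features = pd.concat([one_hot, multi_hot], axis=1)
print(item_features.columns.tolist())
```

Each row of `item_features` is then a binary vector describing one project, which is the kind of input KMeans expects.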
Construction of the aforementioned Recommendation System involves the following steps:
- Determining Item Profiles (using KMeans Clustering / unsupervised Machine Learning)
- Classifying New Unobserved Items (using Logistic Regression / supervised Machine Learning)
- Filtering User Profiles (using heuristics)
Fed with information about the new projects, a web app containing the identified project clusters will churn out the corresponding cluster of potential donors, to be filtered by location (Donor State), explicit support to DonorsChoose.org (Donation Included Optional Donation), and donation timing (Donation Received Date).
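The donation timing proxy is described as the difference between Donation Received Date and Project Posted Date. A minimal sketch of deriving it with pandas is shown below; the sample rows and the derived column name `Days To Donation` are hypothetical.

```python
import pandas as pd

# Hypothetical donations already joined with their project's posting date.
donations = pd.DataFrame({
    "Donor ID": ["d1", "d2"],
    "Project Posted Date": ["2018-01-01", "2018-01-10"],
    "Donation Received Date": ["2018-01-04 13:30:00", "2018-02-19 09:00:00"],
})

posted = pd.to_datetime(donations["Project Posted Date"])
received = pd.to_datetime(donations["Donation Received Date"])

# Whole days elapsed between posting and donation (assumed column name).
donations["Days To Donation"] = (received - posted).dt.days
print(donations[["Donor ID", "Days To Donation"]])
```

A small value marks a donor who tends to give soon after a project is posted, which is one way the timing filter could be defined.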
Identifying Item Profiles (Projects) – KMeans Clustering
Finding the optimal number of clusters
We find that the optimal number of clusters is 4. We used two internal validation measures to choose it: the within-cluster sum of squares (SS), that is, the sum of squared distances of each data point from its cluster centroid, and the Silhouette value. A good clustering has a small SS and a Silhouette value close to 1. Plotting these two measures against the number of clusters k, we find that the best tradeoff between them is at k = 4.
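The scan over k can be sketched with scikit-learn as follows. The data here is synthetic (four well-separated blobs standing in for the encoded project features), and picking the k with the highest Silhouette value is a simplification of the visual tradeoff described above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the encoded project features: four tight blobs.
rng = np.random.default_rng(0)
centers = np.array([[-4, -4], [4, -4], [-4, 4], [4, 4]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scores = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squares (SS).
    scores[k] = (km.inertia_, silhouette_score(X, km.labels_))
    print(k, round(scores[k][0], 1), round(scores[k][1], 3))

# Simplified selection rule: highest Silhouette value.
best_k = max(scores, key=lambda k: scores[k][1])
print("best k:", best_k)
```

On real data the two curves rarely agree this cleanly, so plotting SS and Silhouette side by side, as the notebook does, remains the more careful approach.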
We plot the projects on the first two principal components and color-code them according to the clusters identified by KMeans.
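A sketch of computing that projection is given below, again on synthetic data with assumed variable names. The plotting call itself is left as a comment so the snippet has no plotting dependency.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for encoded project features: 4 groups in 6 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(4, 6))
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 6)) for c in centers])

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Project every project onto the first two principal components.
coords = PCA(n_components=2).fit_transform(X)

# With matplotlib available, the scatter plot could be drawn with:
#   plt.scatter(coords[:, 0], coords[:, 1], c=labels)
print(coords.shape)
```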
Classifying New Unobserved Items (New Projects) – Logistic Regression
Logistic regression was able to classify new projects into the identified clusters with 95-96% accuracy.
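The idea of training a supervised classifier on the KMeans labels can be sketched as below. The data is synthetic, and the solver, regularization strength, and train/test split are assumptions; only the use of logistic regression with L1 regularization comes from this write-up.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for encoded project features.
rng = np.random.default_rng(0)
centers = np.array([[-4, -4], [4, -4], [-4, 4], [4, 4]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Step 1: cluster labels become the classification targets.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Step 2: hold out some projects to estimate accuracy on unseen items.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels
)

# L1-regularized logistic regression; "saga" is one solver that supports L1.
clf = LogisticRegression(penalty="l1", solver="saga", C=1.0, max_iter=2000)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print("held-out accuracy:", accuracy)
```

The accuracy on this toy data says nothing about the 95-96% reported above; it only shows the mechanics of predicting a cluster for a project that was not part of the clustering step.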
Filtering User Profiles
Storing Donor ID and other Donor information in dictionaries, by clusters
After clustering, the Donor IDs of Donors who previously donated to projects in each of the identified clusters are stored in dictionaries, so that each cluster of Donor IDs can be easily retrieved later on. Furthermore, to take into account proximity, organizational loyalty, and donation timing, Donor IDs filtered on these features are also stored in individual dictionaries.
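- Proximity/Location – Donor State
- Organizational Loyalty – Donation Included Optional Donation
- Donation Timing – Donation Received Date (relative to Project Posted Date)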
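One possible way to organize these dictionaries is sketched below in plain Python. The records, the dictionary names, the "Yes"/"No" encoding of the optional donation flag, and the 7-day threshold for an early donation are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical donation records; "cluster" is the KMeans label of the
# project that received the donation.
donations = [
    {"Donor ID": "d1", "cluster": 0, "Donor State": "New York",
     "Donation Included Optional Donation": "Yes", "days_to_donation": 3},
    {"Donor ID": "d2", "cluster": 0, "Donor State": "California",
     "Donation Included Optional Donation": "No", "days_to_donation": 40},
    {"Donor ID": "d3", "cluster": 1, "Donor State": "New York",
     "Donation Included Optional Donation": "Yes", "days_to_donation": 10},
    {"Donor ID": "d1", "cluster": 1, "Donor State": "New York",
     "Donation Included Optional Donation": "Yes", "days_to_donation": 5},
]

EARLY_DAYS = 7  # assumed cutoff for an "early" donation

donors_by_cluster = defaultdict(set)  # cluster -> Donor IDs
donors_by_state = defaultdict(set)    # (cluster, state) -> Donor IDs
loyal_donors = defaultdict(set)       # cluster -> donors who also gave to the platform
early_donors = defaultdict(set)       # cluster -> donors who gave within EARLY_DAYS

for d in donations:
    cluster, donor = d["cluster"], d["Donor ID"]
    donors_by_cluster[cluster].add(donor)
    donors_by_state[(cluster, d["Donor State"])].add(donor)
    if d["Donation Included Optional Donation"] == "Yes":
        loyal_donors[cluster].add(donor)
    if d["days_to_donation"] <= EARLY_DAYS:
        early_donors[cluster].add(donor)

print(sorted(donors_by_cluster[0]))
print(sorted(donors_by_state[(0, "New York")]))
```

Using sets keeps each Donor ID unique within a cluster even when the same donor gave several times.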
Now, to demonstrate how this recommender system would work when presented with a new project, we have a function donors_to_recommend that churns out the Donor IDs of potential donors based on the predicted class or cluster of the new project. A new project to be recommended to donors has the features outlined below.
| Feature | Value |
|---|---|
| Resource Unit Price | 10 |
| Project Subject Category Tree | Health & Sports |
| Project Subject Subcategory Tree | Gym & Fitness, Health & Wellness |
| Project Grade Level Category | Grades 9-12 |
| Project Resource Category | Sports & Exercise Equipment |
| School Metro Type | suburban |
| School Percentage Free Lunch | 65 |
| School State | New York |
| School City | New York City |
| School District | New York Dept Of Education |
By encoding the categorical variables the same way as in the initial input dataset, the same optimized logistic regression model can reliably predict the project’s class, which determines the cluster of Donors the project will be recommended to.
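The write-up names the function donors_to_recommend but does not show its signature, so the version below is a hypothetical sketch of what such a lookup could look like. The parameters, the record layout, and the hard-coded `predicted_cluster` (standing in for the classifier's output) are all assumptions.

```python
def donors_to_recommend(cluster, records, state=None, loyal_only=False, max_days=None):
    """Return sorted Donor IDs for a predicted cluster, with optional filters.

    Hypothetical signature: filters by Donor State, by whether the donor
    included an optional donation, and by days between posting and donation.
    """
    matches = set()
    for r in records:
        if r["cluster"] != cluster:
            continue
        if state is not None and r["Donor State"] != state:
            continue
        if loyal_only and r["Donation Included Optional Donation"] != "Yes":
            continue
        if max_days is not None and r["days_to_donation"] > max_days:
            continue
        matches.add(r["Donor ID"])
    return sorted(matches)


# Hypothetical past donations, labeled with the cluster of the funded project.
records = [
    {"Donor ID": "d1", "cluster": 0, "Donor State": "New York",
     "Donation Included Optional Donation": "Yes", "days_to_donation": 3},
    {"Donor ID": "d2", "cluster": 0, "Donor State": "California",
     "Donation Included Optional Donation": "No", "days_to_donation": 40},
    {"Donor ID": "d3", "cluster": 1, "Donor State": "New York",
     "Donation Included Optional Donation": "Yes", "days_to_donation": 10},
]

predicted_cluster = 0  # stand-in for the logistic regression prediction
print(donors_to_recommend(predicted_cluster, records))
print(donors_to_recommend(predicted_cluster, records, state="New York"))
```

For the New York project in the table above, passing `state="New York"` would narrow the recommended list to donors in the same state as the school.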
Sample Web App
We developed an interactive website application which takes in information about new projects and runs the classifier to determine which cluster of donors it would recommend the project to. The app is accessible through this link: https://cyntwikip-choosedonors.herokuapp.com/
With the aim of increasing the volume of donations and encouraging first-time donors to donate again, DonorsChoose.org faces the challenge of building a recommendation engine that allows previous donors to easily find and support the new projects that inspire them the most. To build the recommender, a clustering algorithm (KMeans with k = 4) first partitioned the projects on their characteristic features to develop item profiles. Next, a classifier (logistic regression with L1 regularization) was developed to predict the clusters of new, unobserved projects. Finally, the recommendation system outputs the Donor IDs and information about the donors corresponding to the predicted cluster. In the experiment above, we have shown how the recommender identifies potential donors when presented with a never-before-seen project. This recommendation system can be used to run an email marketing campaign, making identification of the target segment more efficient.
Breeze, B. (2013). How donors choose charities: the role of personal taste and experiences in giving decisions. Voluntary Sector Review, 4(2), pp. 165-183.
Althoff, T. and Leskovec, J. (2015). Donor Retention in Online Crowdfunding Communities: A Case Study of DonorsChoose.org. ACM.
Salomon, J., et al. (2015). Don’t Wait! How Timing Affects Coordination of Crowdfunding Donations. ACM.
Rajaraman, Leskovec, and Ullman (2014). Chapter 9: Recommendation Systems, in Mining of Massive Datasets (pp. 307-340). Palo Alto, Calif., USA.
Traag, V. (2016). Complex Contagion of Campaign Donations. Public Library of Science.
Tristan Joshua Alba
Prince Joseph Erneszer Javier
Jude Michael Teves
We would like to acknowledge our mentors Dr. Christopher Monterola and Dr. Erika Fille Legara for their invaluable inputs on how to tackle this problem. We would also like to thank Dr. Christian Alis and Eduardo David for giving us access to computing hardware and also for giving their inputs. We thank Erwin Obias and Bryan Damasco for their support on this project.