Helping EDA and Recommender System

This notebook contains an exploration of data and the development of a simple recommendation system to match donors with projects.

Executive Summary:

Using KMeans clustering to identify latent groupings, or profiles, of projects, new projects are classified into one of the identified project profiles. A web-based API running the classifier outputs lists of Donor IDs segmented by location (using Donor State as a proxy variable), by organization loyalty (using Donation Included Optional Donation as a proxy variable), and by donation timing (using the difference between Donation Received Date and Project Posted Date as a derived proxy variable). This segmentation follows related literature on the relevance of preferences based on proximity and personal background [1], support for the crowdfunding host as an indicator of donor retention or long-term repeat donations [2], and the effect of timing on when donors actually donate [3]. Following the methodology illustrated in the flowchart below, a working recommender system was developed with four identified latent project profiles and corresponding clusters of potential repeat donors. The original document containing the code can be found here:



Key Insights from Data:


  • The number of projects posted for the first time has grown rapidly since 2002, peaking at around 80,000 projects per year in 2016.
  • Requested items are generally cheap: most cost less than 20 dollars, with a few outliers costing almost 100,000 dollars.
  • Among requested resources, cheaper items tend to be requested in larger quantities.
  • Majority of projects are teacher-led.
  • Top projects are those that deal with literacy and language.
  • Top grade level category for projects is Grades Pre-K-2.
  • Top projects requested either books, supplies or technology.
  • Most projects are already fully funded.
  • The average project cost is around 741 dollars, pulled up by a few high-value projects, the largest costing 255,737 dollars. 75% of projects cost less than 868 dollars.
  • On average, it only takes about a month (~32.07 days) for a project to get fully funded. 75% of projects get funded within 50 days.


  • Most Teachers that create projects have prefixes Mrs. or Ms.
  • From this we can infer that most Teachers who create projects are female.
  • Sunday is the most popular day for Teachers to post a project for the first time (73,410 projects posted).
  • September is the most popular month for Teachers to post a project for the first time (59,495 projects posted).


  • Most beneficiary schools come from suburban or urban communities
  • Most schools that benefit from the projects serve many students who receive free lunch: the median school has 61% of its student population receiving free lunch.
  • California is home to the largest number of beneficiary schools, while Wyoming has the fewest.


  • A handful of donors are teachers, but most are not.
  • California is home to the largest number of donors, while Wyoming has the fewest.
  • Chicago is the city with the most donors, followed by New York in the east and the California cities of San Francisco and Los Angeles in the west.
  • Swarm plots of donations over time show that donors from the same city tend to donate in bursts within a short time period.
  • Most donations are under 20 dollars, with very few large individual ones at 400 or 500 dollars.
  • Most donations include an optional donation to the platform.
  • Donations received in 2018 peaked on January 26.

About the platform: it is an online charity platform dedicated to supporting K-12 public education in the U.S. Briefly, it is a crowdfunding site where teachers can post project requests and where donors can donate and help raise funds to fulfill the teachers’ educational causes. Since its founding in 2000, the platform has raised over $685 million for 1.1 million projects from over 3 million people and partners. Moreover, teachers from almost 75% of all public schools in the U.S. have sought the platform's help in raising funds for their projects, making it the premier website for supporting education.

Currently, teachers still spend over a billion dollars out of pocket on their students’ needs. For students to get what they need to learn, the platform must be able to encourage its roster of first-time donors to donate again to the projects that inspire them most.

Building the Recommendation System

This section presents a model for a hybrid recommendation system that uses both content-based filtering and demographic recommendations. Content-based filtering focuses on the properties of items: the similarity between two items is determined by measuring the similarity of their properties. A demographic recommender, on the other hand, provides recommendations based on a demographic profile of the user.

Borrowing terminology from the field of data mining [4], there are two components of the recommendation system in relation to the challenge:

  1. Item Profiles (Projects) – given by the characteristics of the projects from which ‘profiles’ will be constructed
  2. User Profiles (Donors) – given by the demographics of the donors

By making inferences from users’ behaviors (donor donations and project preferences) and their demographic profiles, we develop a recommendation system that could be used to create targeted email campaigns encouraging repeat donations from first-time donors.

Note: For the purpose of simplicity, we opted to omit Project Essay and other free-form answer features of the projects and resorted to using the several project ‘tags’ features to discover item profiles.

Construction of the aforementioned Recommendation System involves the following steps:

  1. Determining Item Profiles (using KMeans Clustering / unsupervised Machine Learning)
  2. Classifying New Unobserved Items (using Logistic Regression / supervised Machine Learning)
  3. Filtering User Profiles (using heuristics)

Fed with information about a new project, a web app containing the identified project clusters returns the corresponding cluster of potential donors, filtered by location (Donor State), explicit support for the platform (Donation Included Optional Donation), and donation timing (Donation Received Date).
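The donation timing variable is the difference between Donation Received Date and Project Posted Date. A minimal sketch of how it might be derived is shown here. The function names and the 30-day cutoff between "early" and "late" donors are assumptions made for illustration; the post does not state the threshold that was actually used.

```python
from datetime import date


def days_to_donation(project_posted: date, donation_received: date) -> int:
    """Days elapsed between the project being posted and the donation."""
    return (donation_received - project_posted).days


def timing_segment(days: int, early_cutoff_days: int = 30) -> str:
    """Label a donation as 'early' or 'late'.

    early_cutoff_days is a hypothetical threshold, not taken from the post.
    """
    return "early" if days <= early_cutoff_days else "late"


# Hypothetical dates used only to exercise the functions.
posted = date(2018, 1, 2)
received = date(2018, 1, 26)
elapsed = days_to_donation(posted, received)
segment = timing_segment(elapsed)
```

Keeping the cutoff as a parameter makes it easy to try other thresholds, for example the roughly 32-day average time to full funding reported in the insights above.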


Identifying Item Profiles (Projects) – KMeans Clustering

Finding the optimal number of clusters

We find that the optimal number of clusters is 4. We used two internal validation measures: the within-cluster sum of squared distances of each data point from its cluster centroid (SS), and the silhouette value. A good clustering has a small SS and a silhouette value close to 1. Plotting both measures against the number of clusters k, we find that the best tradeoff between them is at k = 4.
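The selection procedure can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the notebook's code: the data is a synthetic stand-in generated with make_blobs (the real input would be the encoded project features), and score_k is a hypothetical helper name.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the encoded project features (assumption):
# four well-separated groups in two dimensions.
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)


def score_k(X, k_values, random_state=0):
    """Return {k: (SS, silhouette)} for each candidate number of clusters."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        # inertia_ is the within-cluster sum of squared distances (SS).
        results[k] = (model.inertia_, silhouette_score(X, model.labels_))
    return results


scores = score_k(X, range(2, 8))
# Pick the k with the highest silhouette value; SS is kept for the elbow plot.
best_k = max(scores, key=lambda k: scores[k][1])
```

SS always decreases as k grows, so it cannot be minimized on its own; reading it together with the silhouette value is what makes the tradeoff visible.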


We plot the projects on the first two principal components and color-code them according to the clusters identified by KMeans.
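A projection of this kind could be produced roughly as below. The data is again a synthetic stand-in (assumption), and plot_clusters is a hypothetical helper that imports matplotlib only when it is called.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for the encoded project features (assumption).
X, _ = make_blobs(n_samples=200, centers=4, n_features=6, random_state=0)

# Cluster in the full feature space, then project to 2D only for display.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
coords = PCA(n_components=2).fit_transform(X)


def plot_clusters(coords, labels):
    """Scatter plot of the first two principal components, colored by cluster."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots()
    ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="viridis", s=10)
    ax.set_xlabel("First principal component")
    ax.set_ylabel("Second principal component")
    return fig
```

Calling plot_clusters(coords, labels) returns a figure that can be shown or saved. The clustering itself is done before the projection, so the 2D view is only a visual check.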

[Figure: projects plotted on the first two principal components, colored by KMeans cluster]

Classifying New Unobserved Items (New Projects) – Logistic Regression

Logistic regression was able to classify new projects into the identified clusters with 95-96% accuracy.
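The idea of training a classifier on the cluster labels can be sketched as follows. The data is synthetic (assumption), and the default scikit-learn logistic regression settings are used here for simplicity; the conclusion of this post mentions L1 regularization, which would need a compatible solver and is not reproduced in this sketch.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded project features (assumption).
X, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [8, 0], [0, 8], [8, 8]],
    cluster_std=1.0,
    random_state=0,
)

# Step 1: unsupervised cluster labels become the classification target.
cluster_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Step 2: hold out part of the data to estimate accuracy on unseen projects.
X_train, X_test, y_train, y_test = train_test_split(
    X, cluster_labels, test_size=0.25, random_state=0, stratify=cluster_labels
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

The accuracy on this toy data says nothing about the 95-96% reported above; it only shows the mechanics of evaluating the classifier on held-out projects.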

[Figure: logistic regression results]

Filtering User Profiles

Storing Donor ID and other Donor information in dictionaries, by clusters

After clustering, the Donor IDs of donors who previously donated to projects in each of the identified clusters are stored in dictionaries, so that each cluster of Donor IDs can be looked up easily later on. Furthermore, to take into account proximity, organizational loyalty, and donation timing, Donor IDs filtered on these features are also stored in individual dictionaries.

  1. Proximity/Location – clusters_by_state
  2. Organizational Loyalty – clusters_by_org_loyalty
  3. Donation Timing – clusters_by_early_donors and clusters_by_late_donors
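One way such dictionaries might be built is sketched below. The dictionary names follow the list above, but the record layout, the sample rows, and the 30-day early/late cutoff are assumptions made for illustration.

```python
from collections import defaultdict

EARLY_CUTOFF_DAYS = 30  # hypothetical threshold separating early from late donors

# Hypothetical past donations, already joined with each project's cluster label.
donations = [
    {"donor_id": "donor-01", "cluster": 0, "state": "New York",
     "optional_donation": True, "days_to_donate": 3},
    {"donor_id": "donor-02", "cluster": 0, "state": "California",
     "optional_donation": False, "days_to_donate": 45},
    {"donor_id": "donor-03", "cluster": 1, "state": "New York",
     "optional_donation": True, "days_to_donate": 60},
    {"donor_id": "donor-01", "cluster": 1, "state": "New York",
     "optional_donation": True, "days_to_donate": 10},
]

clusters = defaultdict(set)                                # cluster -> donor IDs
clusters_by_state = defaultdict(lambda: defaultdict(set))  # cluster -> state -> donor IDs
clusters_by_org_loyalty = defaultdict(set)                 # donors who added the optional donation
clusters_by_early_donors = defaultdict(set)
clusters_by_late_donors = defaultdict(set)

for row in donations:
    cluster, donor = row["cluster"], row["donor_id"]
    clusters[cluster].add(donor)
    clusters_by_state[cluster][row["state"]].add(donor)
    if row["optional_donation"]:
        clusters_by_org_loyalty[cluster].add(donor)
    if row["days_to_donate"] <= EARLY_CUTOFF_DAYS:
        clusters_by_early_donors[cluster].add(donor)
    else:
        clusters_by_late_donors[cluster].add(donor)

# Filters combine with set intersection, e.g. loyal New York donors in cluster 0.
ny_loyal = clusters_by_state[0]["New York"] & clusters_by_org_loyalty[0]
```

Storing sets rather than lists removes duplicate Donor IDs for repeat donors and makes combining filters a single intersection.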


To demonstrate how the recommender system works when presented with a new project, we use a function, donors_to_recommend, that returns the Donor IDs of potential donors based on the predicted class, or cluster, of the new project. A new project to be recommended to donors has the features outlined below.

Resource Quantity: 10
Resource Unit Price: 10
Project Type: Teacher-Led
Project Subject Category Tree: Health & Sports
Project Subject Subcategory Tree: Gym & Fitness, Health & Wellness
Project Grade Level Category: Grades 9-12
Project Resource Category: Sports & Exercise Equipment
Project Cost: 53
Teacher Prefix: Mrs.
School Metro Type: suburban
School Percentage Free Lunch: 65
School State: New York
School City: New York City
School County: Queens
School District: New York Dept Of Education

By encoding the categorical variables the same way as in the initial input dataset, the same optimized logistic regression model can reliably predict the project's class, which determines the cluster of donors the project will be recommended to.
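A rough sketch of this step is given here. The function name donors_to_recommend comes from the post, but its signature, the tiny training set, the cluster labels, and the donor IDs are all hypothetical. Only three categorical features are used to keep the example short, and a scikit-learn pipeline is one possible way (an assumption, not necessarily the notebook's approach) to make sure the new project is encoded exactly like the training data.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Columns: Project Resource Category, Project Grade Level Category, School Metro Type.
# Rows and cluster labels are hypothetical.
train_projects = [
    ["Books", "Grades 3-5", "urban"],
    ["Books", "Grades 3-5", "suburban"],
    ["Technology", "Grades 6-8", "urban"],
    ["Technology", "Grades 9-12", "rural"],
    ["Sports & Exercise Equipment", "Grades 9-12", "suburban"],
    ["Sports & Exercise Equipment", "Grades 6-8", "urban"],
]
train_clusters = [0, 0, 1, 1, 2, 2]

# The encoder is fitted once, inside the pipeline, so new projects are
# one-hot encoded with the same columns; unseen categories are ignored.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000),
)
model.fit(train_projects, train_clusters)

# Hypothetical donor IDs per cluster, as stored after clustering.
donors_by_cluster = {
    0: ["donor-01", "donor-02"],
    1: ["donor-03", "donor-04"],
    2: ["donor-05", "donor-06"],
}


def donors_to_recommend(project_row):
    """Predict the project's cluster and return it with that cluster's donor IDs."""
    cluster = int(model.predict([project_row])[0])
    return cluster, donors_by_cluster.get(cluster, [])


new_project = ["Sports & Exercise Equipment", "Grades 9-12", "suburban"]
cluster, donor_ids = donors_to_recommend(new_project)
```

The returned donor list could then be narrowed further with the state, loyalty, and timing dictionaries described earlier.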


Sample Web App

We developed an interactive website application which takes in information about new projects and runs the classifier to determine which cluster of donors it would recommend the project to. The app is accessible through this link:


With the aim of increasing the volume of donations and encouraging first-time donors to donate again, the platform faces the challenge of building a recommendation engine that lets previous donors easily find and support the new projects that inspire them most. To build the recommender, a clustering algorithm (KMeans with k = 4) first partitioned the projects by their characteristic features to develop item profiles. Next, a classifier (logistic regression with L1 regularization) was developed to predict the clusters of new, unobserved projects. Finally, the recommendation system returns the Donor IDs and information about the donors corresponding to the predicted cluster. In the experiment above, we showed how the recommender identifies potential donors when presented with a never-before-seen project. This recommendation system can be used to run an email marketing campaign, making the identification of target segments more efficient.


[1] Breeze, B. (2013) How donors choose charities: the role of personal taste and experiences in giving decisions. Voluntary Sector Review, Vol. 4 (2), pp. 165-183
[2] Althoff, T. and Leskovec, J. (2015) Donor Retention in Online Crowdfunding Communities: A Case Study of ACM
[3] Solomon, J. (2015) Don’t Wait! How Timing Affects Coordination of Crowdfunding Donations. ACM
[4] Rajaraman, Leskovec, and Ullman (2014) Chapter 9, Recommendation Systems, in Mining of Massive Datasets (pp. 307-340). Palo Alto, Calif., USA
[5] Traag, Vincent (2016) Complex Contagion of Campaign Donations. Public Library of Science


Tristan Joshua Alba
Prince Joseph Erneszer Javier
Jude Michael Teves


We would like to acknowledge our mentors Dr. Christopher Monterola and Dr. Erika Fille Legara for their invaluable inputs on how to tackle this problem. We would also like to thank Dr. Christian Alis and Eduardo David for giving us access to computing hardware and also for giving their inputs. We thank Erwin Obias and Bryan Damasco for their support on this project.
