JustGiving’s RAVEN platform turns data into donations
Thu 21 Feb 2019 | Richard Freeman, PhD
Leveraging data science and machine learning at scale, JustGiving is providing personalised experiences and identifying the causes that people actually care about. Richard Freeman, PhD, lead data and machine learning engineer, talks through the tools and processes underpinning the tech-for-good company’s platform, and why you should think twice before relying on third-party services.
JustGiving has a long history of embracing data science. One of the earliest projects began in 2012 and was funded by NESTA. We successfully built a recommendation engine that could identify people who are likely to fundraise.
In 2013, I joined to lead and deliver machine learning (ML) into production, in what was called the PANDA platform. This was challenging and rare – at a time when most ML was done offline and not embedded directly in a high-profile consumer product with 26 million users.
Back in 2013, PANDA relied on Apache MapReduce jobs, model training pipelines and low-latency model scoring APIs. By 2014, we had managed to make JustGiving products ML-driven, using recommendations, predictions and suggestions to help users raise more for good causes. Back then we had no Apache Spark, TensorFlow, Docker containers or serverless computing; these were only introduced into the platform later.
In parallel, around 2014, we noticed that our data scientists spent a lot of time preparing data, that our queries were growing in complexity, and that we were ingesting big data sets like web analytics data.
In response, I led the delivery of an in-house data science platform on AWS, which I called RAVEN, centred on Amazon Redshift, a massively parallel processing data warehouse.
RAVEN allowed us to join transactional data with non-transactional data (giving insight into user journeys), run experiments, and prepare the data for machine learning training and scoring at scale in PANDA.
The ML deployment objectives are to streamline training, experimentation and the deployment of models into JustGiving products such as the fundraising page, feed and email campaigns.
RAVEN ingests and processes streaming and big data sets so that analysts and data scientists stay productive: they can run experiments and easily train and test new models.
JustGiving can now provide a personalised experience, suggest interesting content and identify the causes that the users care about.
In doing so, it has increased user engagement and retention, and ultimately raised more for charities and good causes.
Toolsets, tools, and processes
JustGiving today uses Amazon Redshift, Apache Spark on Amazon EMR, TensorFlow, Docker containers and many Python packages. In terms of languages, we use Python, R and SQL extensively, and query Redshift for email logs, web analytics and transactional data.
Only open source software is used. External proprietary vendor products generally come at an additional cost and introduce an unnecessary tie-in. Data pipelines in RAVEN also automate the shaping and preparation of the data used to train and score the models that are deployed into PANDA.
It is sometimes forgotten, but data preparation is critical for any machine learning process, so before any model training and testing, data is prepared using our data pipelines.
Once the data is cleaned, flattened and shaped, it is used in training and testing. Although there are many scenarios, generally speaking the data is split into training, validation and test sets to evaluate the trained and deployed model. Experimentation is done offline but also in production using A/B testing and, where appropriate, multi-armed or contextual bandit testing. Bandits are more complex to put into production as they optimise the variations dynamically, but they do lead to faster results.
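To make the idea of optimising variations dynamically more concrete, here is a minimal sketch of an epsilon-greedy multi-armed bandit. It is an illustration only, not JustGiving's production code: the class name, the `simulate` helper, the epsilon value and the conversion rates are all assumptions chosen for the example.

```python
import random


class EpsilonGreedyBandit:
    """Minimal epsilon-greedy bandit over a fixed set of variations (illustrative)."""

    def __init__(self, n_arms, epsilon=0.1, seed=None):
        self.epsilon = epsilon            # probability of exploring a random variation
        self.pulls = [0] * n_arms         # how many times each variation was shown
        self.values = [0.0] * n_arms      # running mean reward per variation
        self.rng = random.Random(seed)

    def select_arm(self):
        # Explore with probability epsilon, otherwise exploit the best estimate so far.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.pulls))
        return max(range(len(self.pulls)), key=lambda arm: self.values[arm])

    def update(self, arm, reward):
        # Incremental update of the mean reward for the variation that was shown.
        self.pulls[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.pulls[arm]


def simulate(true_rates, rounds=5000, epsilon=0.1, seed=42):
    """Simulate traffic against hypothetical per-variation conversion rates."""
    bandit = EpsilonGreedyBandit(len(true_rates), epsilon=epsilon, seed=seed)
    environment = random.Random(seed + 1)  # separate stream for simulated users
    for _ in range(rounds):
        arm = bandit.select_arm()
        reward = 1.0 if environment.random() < true_rates[arm] else 0.0
        bandit.update(arm, reward)
    return bandit


if __name__ == "__main__":
    result = simulate([0.02, 0.30, 0.05])  # made-up conversion rates
    print("pulls per variation:", result.pulls)
```

Unlike a fixed A/B split, traffic drifts towards the better-performing variation while the experiment is still running, which is why results tend to arrive faster. The trade-off is the extra state and feedback loop that must be kept up to date in production.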
Measuring performance is key when deploying a trained model into production. Many organisations tend to measure the accuracy of a model only during the testing and validation phases, which in my view can be incomplete. Ongoing performance is not typically measured, as it requires additional engineering effort in terms of tracking and analysis.
A lot of companies give their power away by using third parties for web analytics solutions, rather than building their own. That data is then siloed in marketing or sales departments, is difficult or impossible to get back in its raw form, and cannot be streamed back. This can, for example, prevent you from making real-time ML recommendations or predictions directly in your product.
JustGiving has built an in-house web analytics product called KOALA, running as an AWS serverless stack, and thus has this data available in real time. This provides a full suite of data pipelines for ML training and analytics in-house.
Generally speaking, if you deploy ML suggestions or recommendations into a product, you need to be able to attribute any product enhancement to them, and to measure user engagement and conversion for the ML-driven product versus the human hard-coded product. RAVEN, with the KOALA data, gives us these powers.
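As an illustration of what ongoing measurement can look like, the sketch below keeps a rolling window of prediction outcomes and flags when accuracy falls below a threshold. The class name, window size and threshold are hypothetical; a real setup would also need the tracking that joins each prediction to the outcome that arrives later.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Track accuracy of a deployed model over the most recent predictions (illustrative)."""

    def __init__(self, window=1000, alert_threshold=0.7):
        # deque with maxlen drops the oldest outcome automatically once the window is full
        self.outcomes = deque(maxlen=window)
        self.alert_threshold = alert_threshold

    def record(self, predicted, actual):
        # Store 1 for a correct prediction and 0 for an incorrect one.
        self.outcomes.append(1 if predicted == actual else 0)

    def accuracy(self):
        # None means no outcomes have been observed yet.
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def is_degraded(self):
        acc = self.accuracy()
        return acc is not None and acc < self.alert_threshold
```

The same pattern applies to other metrics such as click-through or conversion rate; accuracy is used here only because it is the simplest to show.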
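A comparison between an ML-driven variant and a hard-coded one can be sketched as below, assuming you already have visitor and conversion counts per variant. The function names and the counts in the usage example are made up for illustration, and the two-proportion z-test is one common choice rather than the method used at JustGiving.

```python
import math


def conversion_rate(conversions, visitors):
    """Share of visitors who converted; 0.0 when there is no traffic."""
    return conversions / visitors if visitors else 0.0


def relative_lift(control_rate, treatment_rate):
    """Relative improvement of the treatment over the control."""
    if control_rate == 0:
        raise ValueError("control rate must be non-zero to compute a relative lift")
    return (treatment_rate - control_rate) / control_rate


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic for the difference between two conversion rates (pooled variance)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / standard_error


if __name__ == "__main__":
    # Hypothetical counts: hard-coded product (control) vs ML-driven product (treatment)
    control = conversion_rate(200, 4000)
    treatment = conversion_rate(260, 4000)
    print("relative lift:", relative_lift(control, treatment))
    print("z statistic:", two_proportion_z(200, 4000, 260, 4000))
```

A z statistic well above roughly 1.96 suggests the difference is unlikely to be noise at the usual 5 percent level, although the counts only mean something if each conversion can be attributed to the variant the user actually saw.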
For example, in the JustGiving feed, each card is personalised to the user based on what is known about their donations and fundraising activities on the platform. A ranking algorithm then shows the card they might be most interested in first. This increased average user engagement by 15 percent, and with the ranking algorithm the card click-through rate went up by 20 percent.
To sum up, with data science, think about your ML pipelines but also the data pipelines that feed them. Having an experimental and measurement mindset will help you demonstrate the value and benefits to the business.
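The ranking step can be pictured with a deliberately simple sketch. It scores each card by how often the user has previously supported the card's cause and sorts the feed by that score. This is an assumption made for illustration, not the ranking algorithm used in the JustGiving feed; the `cause` field and the `rank_cards` function are hypothetical.

```python
from collections import Counter


def rank_cards(cards, user_cause_history):
    """Order feed cards so those matching the user's most frequent causes come first."""
    affinity = Counter(user_cause_history)   # cause -> number of past interactions
    total = sum(affinity.values()) or 1      # avoid division by zero for new users

    def score(card):
        # Counter returns 0 for causes the user has never interacted with.
        return affinity[card["cause"]] / total

    # sorted() is stable, so cards with equal scores keep their original order.
    return sorted(cards, key=score, reverse=True)


if __name__ == "__main__":
    cards = [
        {"id": "a", "cause": "animals"},
        {"id": "b", "cause": "health"},
        {"id": "c", "cause": "education"},
    ]
    history = ["health", "health", "education"]
    print([card["id"] for card in rank_cards(cards, history)])
```

A production ranker would typically use a trained model and many more signals; the point here is only the shape of the step, in which personalised cards go in and an ordered feed comes out.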