Understanding Cohort Analysis

What is Cohort Analysis and why use it?

Cohort Analysis may sound a bit tedious, but it is one of the simplest ways to compare different entities based on shared characteristics. We group these entities by the shared characteristic, each group is called a cohort, and we analyse them alongside other cohorts. This gives us a look at how different factors impact these cohorts, and for businesses who have cohorts of customers, it can help improve customer relationship management.

Who uses cohort analysis?

Since it gives focussed information on the growth or performance of different subsets, cohort analysis is pretty widely used by researchers and analysts in academic and business settings. Pretty much any data with a chronological component can be subject to cohort analysis – but it’s important to be able to identify variables that make sense as cohorts. You’ll have to identify variables that you know you can impact through some decision or action.

Why should SaaS and content organisations use it?

It is widely used by SaaS and content organisations to understand the how different aspects of their service impacts users and user retention. By understanding this, they are able to modify their service, or otherwise take action and improve the relationship and retention.

Analysing Cohorts

You can analyse cohorts in pretty much any statistical software, and there are ways to do it in just MS Excel as well, but we’re going to use R and MS Excel together to analyse some cohorts in our data.

We’ve created some sample data of subscribers of a video-on-demand service organisation, and we’re going to calculate their user retention rate. You can download the sample data used from here. When working with R, be sure to have the following packages installed and loaded:

· ‘lubridate’ – to deal with date and time variables

· ‘plyr’ – to manipulate data

· ‘dplyr’ – to manipulate data

· ‘xlsx’ – to save to an excel file

Preparing the Data

Our data has three columns which tell us about each of our users:

· Subscription.Id – the unique subscription IDs for the users

· Date.Subscribed – the date on which each user subscribed to the service

· Date.Cancelled – the date on which each user cancelled their subscription to the service (if they’ve cancelled)

If you’re working with real data, you must have the similar columns for your users, and any other columns you may wish to create cohorts from, e.g. lead source, subscription plan, or country.

First, we read our data and ensure the two columns with dates are the ‘Date’ type. We’re using lubridate to work with dates and side-step some of the less friendly built-in date functions.

We’ll look at the summary of our data to see if there are any inconsistencies.

From our summary we can see that there are many NAs in the Date.Cancelled column – we expected this because we have many users who are still using the service, so obviously no cancellation date exists.

However, for some of our calculations we cannot use NAs, so we’ll set all NAs to the furthest date we have ‘2018-07-01’. This date from the future (we’re in September 2017) exists because sometimes people subscribe to annual packages – so these would be people who subscribed some time in 2017, chose not to auto-renew subscriptions, and have their packages ending in 2018.

We use this future date as a cut-off because when we can’t have analyse users with unknown lifetimes.

Creating Key Variables

Now we get the data in chronological order of subscription, and create the cohorts we’ll be working with – the month of subscription. If you have another variable you’d like to use as cohort, ensure that it is categorical, e.g. countries should be 'India', 'China', 'Brazil', 'Singapore', etc.

If you have other cohorts, you’d add those to the above code, for example using countries:

Also find out the interval from subscription to cancellation, this is required irrespective of your cohort as it will help us calculate the retention. You cannot find these intervals if there are NAs in your data, which is why we removed them earlier.

Cohort-wise Lifetime Months

And we get to the most important variable we’re going to be working with the lifetime month.

Calculating the cohort-wise lifetime months tells us the number of subscribers we have for each cohort over the period we’re studying (i.e. subscription month to July 2018). The code here gets the sum of users with lifetime months over the selected months – it gives a cohort-wise breakdown.

If you want to analyse based on other cohorts, then simply add that column name along with Cohort.Month, e.g. .(Cohort.Month, Subscriber.Country).

Here’s the output:

Taking the first cohort, subscribers who joined in November 2015, we see that 78 people subscribed that month (M0), but then by the next month (M1) there were only 72 left. We can also see that 10 months down the line (M10), i.e. September 2016, this has fallen to 49.

It’s important to remember that for every column Mx, x represents the number of months since subscription and not any particular month of any year. So M10 for the first row represents September 2016, but for the second row represents October 2016, and so on. This is how cohort analysis works, it compares chronological change based on the number of units and not the units themselves. We’re going to be working with ’12 hours later’ or ‘in the following 5 years’.

Determining User Retention Rate

Now let’s get the ratio of customers retained each month after subscription.

We can now see that in the month after subscribing in November 2015, 92.31% of users remained. Ten months in (M10) we see that the percentage has dropped to 62.82%.

Due to the way the code works we have a lot of 0’s and false ratios along a diagonal, these represent months from July 2018 onwards, so let’s get rid of those as they will be inaccurate (since we manually set unknowns to that month). At the same time let’s round all the ratios to four decimal places so we can get percentages to two decimal places in Excel.

Let’s save this as an Excel file and format it so it’s easier to understand and present.

Formatting and Presenting in Excel

First thing to do is remove the ‘#N/A’ values that have appeared when we removed the 0’s and false ratios. We can just use Find and Replace (Ctrl + H), and keep the replace box empty.

Now convert it all to percentage (Alt + H + P) with two decimal places (Alt + H + 0, twice) and add a row at the bottom with the averages (=AVERAGE (B$2:B$21)). Also add some formatting if you like, we’ve added some borders.

Selecting just the retention rate and the averages, apply colour scales form the conditional formatting menu (Alt + H + L + S + Enter). You now have a users’ retention rate heat map to help you understand how well you retain users from each cohort.

Selecting just the retention rate and the averages, apply colour scales form the conditional formatting menu (Alt + H + L + S + Enter). You now have a users’ retention rate heat map to help you understand how well you retain users from each cohort.

You’ll notice that for our sample data, on average we’ve lost nearly 50% of users by the time they’ve been subscribed for 5 months – but that on average the largest drop in users tends to happen by the month after subscription. Maybe we have a 30-day trial or something which is sufficient for a lot of users to take a look at the service and decide it isn’t for them.

That’s a Wrap

So this is a pretty simple and quick way to do cohort analysis, and it’s just one of the uses for it. With insights from other useful variables we could probably make out more useful info, like country-wise cohorts could tell us whether campaigns in those countries are successful or need to be changed. Maybe subscriptions plans could tell us if users tend to stay when they have the flexibility or feel they’re getting better value for money. And once we have these numbers we can use them to calculate other key metrics like lifetime value, lifetime value vs. acquisition costs, and more.

This article was written by Kaushal Rao, RAP (Redwood Apprenticeship Programme) Intern, Redwood Algorithms

Using Cohort Analysis