How I analyze my Google Play Android App install stats with Python

Last week I wrote about the discrepancies I spotted between the number of the daily user installs and the differences between the total number of the user installs for two consecutive days. Those two values ought to match, but they don’t and it seems to be a very random process. Google have adjusted the stats for August 21, 2012 and August 22, 2012, but the smaller discrepancies are still there.

The stats I use are available from the Google Play Developer Console. You need to go there, find you app, click on Statistics and then click on all in the top right corner. Once the graphs refresh, click on Export as CSV and make sure all boxes are ticked. Download the ZIP file, unpack it and import into your favorite spreadsheet.

This article is also available on Amazon Kindle. You may consider buying it, if you would like to keep it for your reference.

You can monitor those stats yourself daily, but it would take up up to an hour of your time, if you were to import that data into a spreadsheet by hand.

I use a simple Python script to pre-process the data before pasting it into a spreadsheet. It saves me a lot of time and I thought it might be a good idea to share that tool with other Android developers, in case you might want to use it. It is also a good example of how using a few standard Python modules can help you save time processing data.

Prerequisites: Python 2.7.1 or later.

You can check which version of Python you have installed on your system with the following command (do not type $):

$ python --version

Here is how I begin my script:

#!/usr/bin/python

The first line gives the command-line interpreter a hint to about the location of the Python interpreter on your system. If Python is not located at /usr/bin/python adjust the first line to match your system’s configuration. (This line is an example of what is known as the shebang in Unix scripts.)

Next, we tell Python to import the following four modules:

import argparse
import csv
import re
import zipfile

Here’s what they do:

argparse is used to parse command-line options and arguments, i.e. anything that is listed after the name of the script. It also displays syntax and usage information when the user makes a mistake or uses the -h option. (For more information on argparse consult the official documentation.)

csv is used to read and write Comma-Separated Values (CSV) files, which is the lowest common-denominator file format used to exchange data between different spreadsheets. It is also the format that Google Play publishes your app stats. (For more information on csv consult the official documentation.)

re is the module that implements regular expressions, necessary for filtering text. (For more information on re consult the official documentation.)

zipfile is used to read and write ZIP archives, such as those served by Google Play Developer Console when you click on the Export to CSV link. It can access files inside archives without you having to explicitly unpack them. (For more information on zip file consult the official documentation.)

I will describe those modules in a little more detail later. Let’s have a look at the default values set in the next section:

# define defaults

redate = re.compile('^[0-9]{8,8}$')

redate is a regular expression, a pattern that matches any numeric string expressed using digits 0-9. That string must contain exactly eight digits, e.g. 20120801, but not 20120108a.

To be precise, redate is an SRE_Pattern object returned by the compile() function defined in the re module. Every time you want to use regular expressions, you must first define (compile) the pattern you will be using to match, search, delete, or replace strings with.

rows = []

rows is a list that will store the data extracted from the the APPID_overall_installs.csv file. (APPID is the ID of your Android application, e.g. com.example.myapp.)

rc = 0

rc is a helper variable. Its use will be explained later.

dui = 0

dui stores daily user installs, the numbers extracted from the daily_user_installs column from the APPID_overall_installs.csv file.

ddui = 0

ddui is described later.

tui = 0

tui stores total user installs, the numbers extracted from the total_user_installs column from the APPID_overall_installs.csv file.

dtui = 0

dtui is described later.

Once the defaults have been set, the script can begin parsing command-line options and arguments. To do that it needs to create an argument parser object:

# parse arguments

parser = argparse.ArgumentParser(description='Compare the total number of user app installs with the number of the daily user installs on Google Play.')

An argument parser is an object returned by the ArgumentParser() function of the argparse module. Right now it is just an empty object that doesn’t do much, although if you were to run your script with the -h option, it would display the helper text defined in the description argument of ArgumentParser().

Our script needs to know which ZIP file you wish to use data from and the app ID. We will pass them as arguments of the -f and -a options respectively. Definitions of those options are added to the parser object with calls to the add_argument() function:

parser.add_argument('-f', required=True, action='store', dest='fin', 
                    help='the name of the ZIP archive file downloaded from your Google Plus Developer Console')

parser.add_argument('-a', required=True, action='store', dest='appid', 
                    help='app ID, e.g. com.example.myapp')

The first argument of add_argument() is the option string, e.g. '-f' defines the -f; the second argument is required, which is set to True for every option that must be set for the script to functions properly.

Next, you need to tell the parser what it should do with the arguments that follow the options. This is specified in the value of the action argument of add_argument(). Since we need to process those values later on, the script needs to store them somewhere. Hence action is set to 'store'.

Once we tell the parser what we want to do with the arguments to the options defined, we need to tell it where it should store those values. The names passed as the values of the dest argument will become the names of the properties of the parser object. They are also displayed in uppercase in the syntax section printed when you run the script with the -h option or when you make a mistake.

The last argument, help defines a short description of the purpose of the option and its argument.

Once all options have been defined, we need to initialize the parser:

args = parser.parse_args()

If all goes well, we should be able to read the ZIP archive:

# process the zip file

zf = zipfile.ZipFile(args.fin, 'r')

If the argument of the -f option (stored in args.fin) is a valid path to the ZIP file downloaded from Google Play, we should be able to read it. That’s what we request when we pass the 'r' argument to the ZipFile() function.

Knowing the app ID we can splice together the name of the CSV file inside the ZIP archive represented by zf. The app ID is given as the argument of the -a option and stored in args.appid. That spliced name is passed as the first argument of reader() and the second argument is delimiter, which separates cells in rows. It is set to , for the CSV files generated by Google Play.

# process the CSV file

cf = csv.reader(zf.open("%s_overall_installs.csv" % args.appid), delimiter=',')

The cf object represents the CSV file stored inside the archive represented by zf

We will now read the data, row by row, skipping empty rows and those where the first cell does not mach the regular expression pattern stored in redate.

The rows that pass the tests, are inserted at the beginning of the rows list, which is a way to reverse their order. We do it, because it helps process data later on.

for r in cf:

    if r == []:
        continue

    if not re.match(redate, r[0]):
        continue

    rows.insert(0, r)

We now have a list of rows in reverse order, from the earliest to the latest stats. Now is the time to crunch data:

for r in rows:

    dtui = int(r[4]) - int(tui)
    ddui = int(r[3]) - int(dtui)

    if rc == 0:
        print r[0] + "," + r[3] + "," + r[4]
    else:
        print r[0] + "," + r[3] + "," + r[4] + "," + str(dtui) + "," + str(ddui)

    tui = r[4]
    rc = 1

Because we need data from the previous day to compute the difference between the total user installs of the app, the first row (the oldest entry) needs to be printed with some data missing. This is why we use the rc flag.

So, the first row of the output will contain just three cells:

date,daily_user_installs,total_user_installs

Starting with the second row, the output will consist of five cells per row:

date,daily_user_installs,total_user_installs,delta_total_user_installs,delta_daily_total_user_installs

where:

delta_total_user_installs = dtui computed as the difference between the total number of user app installs for today minus the total number of user app installs for yesterday;

delta_daily_total_user_installs = ddui computed as the difference between the daily number of user app installs minus delta_total_user_installs.

I called my script gpstats.py.

To make it executable, you need to run the following command:

$ chmod 0755 ./gpstats.py

When you run gpstats.py it should produce a five column output that you need to capture to a file, preferably with a .csv filename extension so you can later import it into your favorite spreadsheet. Remember to redirect it into a file of your choice using the > symbol, e.g.:

$ ./gpstats.py -f com.example.myapp.zip -a com.example.myapp  > mystats.csv

The output file can be opened in any spreadsheet application.

You can download the script if you would like to see if you notice similar discrepancies in the stats reported by Google Play.

Have fun!

PS. If you want to learn Python, have a look at these Python programming books.

Vim and Vi Tips: Essential Vim and Vi Editor Skills, 3rd ed.

No Unix-class system administrator or user will get far without learning the basics of Vim, the modern replacement for vi. Contrary to some misinformed opinions spread among users who are new to Unix-class systems, Vim is not difficult to learn. Granted, it is not very friendly to beginners, but once you grasp the basic concepts, you will never have to learn another text editor again, because Vim is available for all standard operating systems, including Microsoft Windows, Mac OS X, Linux, BSD, and many others. This book helps you shorten the time you need to spend learning Vim.

With over 50,000 copies in distribution to date, “Vim and Vi Tips” has become the Vim book to go to when you are learning Vim or just trying to remember a specific shortcut.

The latest, 3rd edition teaches you even more about Vim without sacrificing the concise style and easy to follow presentation of the subject.

I have good news for all readers of Vim and Vi Tips! The third edition is available in PDF!

This is a serious upgrade. There are 22 new sections, one new chapter, and 71 updated sections. No page left unturned. And now, with even more screenshots!

Once again, I went against the traditional way of teaching Vim. The comments I received from the readers of the first and the second editions of Vim and Vi Tips confirm that my decision to present the subject from the point of view of someone who does not have a lot of Unix experience was the right one. This stuff is not hard, if you shed some of the old ways and lingo!

And over 50,000 of you who have purchased that book since 2008 seem to agree!

If you want to find out what has changed, I posted the full Table of Contents and the detailed list of the changes from the Second Edition on this site.

Can you trust Google Play statistics?

Last week I did a promotion for my Vim book on the Amazon Kindle publishing platform. It went very well (more on that tomorrow) and it reminded me how well Amazon is prepared to handle both sales and sales reporting.

That Amazon employs some of the best software architects, developers, and admins can be seen not only in their extensive AWS catalogue, but also in the Amazon Kindle Direct Publishing control panel. When someone buys my book using their Kindle account, the sales count is updated within a few minutes allowing me to monitor the success of my promotions. That is exactly what I expect from a company that brought cloud computing to the startup masses and moved their own backend to their own cloud.

Similar levels of service can be reasonably expected from Google Play. After all, Google is another well-managed cloud. Or so we tend to think. Until a few months ago, I did not watch the application installations statistics very closely, as it was somebody else’s job and all I got was the total count (the clients had the raw data), but ever since I decided to finally embark on a serious Android-based project myself, I got to watch the stats more closely. And I noticed something I cannot understand. Let me explain.

First of all, I do not understand why Google cannot show me the number of installations in real time or in a slightly-delayed continuous fashion like Amazon can? This should be doable given the intellectual power and the infrastructure available to Google. So, why are those stats published once a day?

SIM Info deltas

Second, I do not understand the differences in the installations statistics reported by Google itself. Let me use my simple utility, SIM Info, as an example. The stats provided by Google contain, among other things, the total number of users who have installed my app. In theory, the difference between the total number of users who have installed my app by the end of August 20 and the total number of users who have installed my app by the end of August 21 ought to be equal to the number of the users who have installed my app on August 21.

SIM Info installations stats on Google Play

The math is simple, if the 8,732 users have installed my app by the end of Aug 20 and by the end of Aug 21 Google tells me that the total of 9,285 users have installed my app, I can assume that 553 users have installed my app on August 21. That does not seem to be true, as Google claims it was only 94 users. The data I get from Google shows differences of a few installs per day, with a whooping 459 on August 21!

I do not want to say that somebody is cheating here, but the data as it is delivered now is not worth much, which is a strange thing given the fact that Google has plenty of time to collect, sort, and check it. I do realize the scale of the data stream Google has to deal with and I am aware of issues like time drift, but these should not influence the validity of the data. SIM Info is just a simple app, but if I was to explain the performance of a VC-funded Android app to my investors, I would have trouble saying that my data can be trusted,

The numbers do not add up.

Update: August 24, 2012, 7:00 am GMT

Well, this is strange… the number of the total user installs has been corrected and today it is lower than yesterday. According to Google, by the end of August 22, 2012 my app was installed by only 8903 users which is lower than 9,285 users reposted for August 21, 2012 and that number never goes lower.

The number of the daily installs does not add up either.  If you subtract 9,285 from 8,903, you get -382, but Google reports 80 installs on August 24, 2012. The data is a mess.

I really do not know what to think of it. I wish someone would explain the algorithm used to compute those numbers. It does not look like the actual number of downloads, but more like some estimate, which is troubling.

I posted a question about this on Stackoverflow and got suggestions I should use some sort of external monitoring service, but this is excessive and potentially expensive for the user (network data access costs users money) so I’d rather avoid that in what is a simple utility. I might consider using such solution in an app that requires internet access anyway.

Update: August 26, 2012, 7:00 pm GMT

Not sure if it was my activity on the subject that has caused Google to notice the problem, but they posted a message to developers on the Google Play Developer Console.

And finally…

Update: August 29, 2012, 4:00 pm GMT

Google have adjusted their stats and fixed the numbers for August 21 & 22, 2012.

However, the discrepancies between the difference between the total number of user installs for two consecutive days and the daily number of installs remain.

Time will tell if Google fixes that. I think they should.

Update: September 12, 2012, 10:00 am GMT

Google has informed developers that their stats from September 6 are not correct. This seems to be a more serious problem that we originally thought.

PS. If you want to know what tool I use to make my analysis quicker, read the description of the script I wrote to parse Google Play stats.

PS. If you want to learn Android programming, have a look at these Android programming books.

Your users will tell you what you wrote

Ask anyone who wrote any type of a non-game application and they will tell you that users often invent ways of using your code that you have never dreamt of.

Case in point? Spreadsheets, which many people use not to do any sort of financial modelling, but to keep lists of things to do. I have a friend who plans conferences in OpenOffice.org Calc.

But you do not have to write huge pieces of code to experience that phenomenon. I recently published a free Android app that originally began life as a part of my test code. I wanted to know more about the SIM cards I was using to test my other apps.

When I published SIM Info I was convinced it would only be of interest to a small group of developers, but as it turns out, many users find it handy when they need to unlock their phones and tablets. Users tell me that operators ask them for information that is not displayed in the Android Settings menu and that when SIM Info becomes a very handy utility to have.

Based on that feedback, I added a way to share the information displayed by SIM Info via email, text message, or any other communication channel available on your Android device.

The moral of the story is you should just put your code out there and listen to your users. They will tell you what they think of your code and how it is useful to them. And that’s what counts.