Reviews, scores, sales and playtime – a Steam perspective

Will Metacritic affect your PC game’s success?

Recently EEDAR’s Head of Insights, Patrick Walker, released an update in the ongoing debate on whether review scores, notably on Metacritic.com, have any influence on sales. The question is pertinent because, as Patrick notes: “common criticisms of the metric argue that game quality can’t be summarized in a single score and that the metric has been misused by publishers as a measuring stick for incentives and business decisions.” The debate resurfaces from time to time; it was touched upon on GamesBeat back in 2009, for example, and more recently by Justin Baily from Double Fine, by Warren Spector and on Ars Technica.

As part of an ongoing project exploring player behavior via the Steam platform (see e.g. here and here), we decided to run a quick correlation analysis across games on Steam, focusing on sales (game ownership) and two different aggregate review/ranking scores. In addition, we wanted to explore whether review scores correlated with the actual amount of time people spend playing games on Steam.

Looking at playtime adds an interesting perspective to the debate about the potential relationship between review scores and sales: It is one thing to investigate sales and how these correlate with review scores – but do people actually play the games they buy? If our customers do not care to play the games we produce, there is at least a chance they will not be repeat customers, which is about as desirable as them not buying our games in the first instance …

“Looking at playtime adds an interesting perspective to the debate about the potential relationship between review scores and sales: It is one thing to investigate sales and how these correlate with review scores – but do people actually play the games they buy?”

The dataset we are using here consists of records from a bit over 3000 Steam titles (only full games were included, e.g. no demos) and over 6 million players (corresponding to about 3.5 per cent of all Steam profiles, or about 8 per cent of active accounts), covering a bit over 5 billion hours of playtime. The data are from Spring 2014, and thus do not reflect any developments in the Fall.
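For readers who want a concrete picture of this kind of aggregation, a minimal sketch is given below, assuming a flat table of per-player, per-game records (the file name and column names are hypothetical, not the actual format of our crawl):

```python
# Minimal sketch (hypothetical schema) of building per-game aggregates
# from raw ownership/playtime records using pandas.
import pandas as pd

# Assumed: one row per (player, game) pair from the Spring 2014 crawl.
records = pd.read_csv("steam_records.csv")  # player_id, app_id, is_full_game, hours_played

# Keep full games only (no demos or other non-game entries).
games = records[records["is_full_game"]]

# Aggregate to one row per title: ownership (~sales) and total playtime.
per_game = games.groupby("app_id").agg(
    owners=("player_id", "nunique"),
    total_hours=("hours_played", "sum"),
)

print(per_game.describe())
```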

There are some assumptions involved in using data from Steam – for example, some uncertainty about how Valve tracks playtime. For a breakdown, see here (scroll down to “limitations and caveats”). It should be noted that any analysis on the topic of game sales – as the debate on the topic clearly shows – needs to make some assumptions and estimates when it comes to sales figures. This adds some uncertainty, but it is hard to avoid given the confidentiality of sales figures.

The good thing about game ownership data from Steam is that we do not have to make sales estimates based on collecting information from a variety of sources, with all the potential sources of error and bias that risks imposing on the data. Working on a platform like Steam means that the sales data (ownership) are readily available, and we can account for free games, demos, etc. Note that how the game got there – for example whether the user in question bought it or received it as a gift from a friend – is not immediately obvious. On the negative side, we only have data from Steam sales, not e.g. mobile platforms.

“It is important to note that we are not specifically looking at correlations for high-ranking, low-ranking etc. games, but general correlations across the entire scale of reviews”

It is important to note that we are not specifically looking at correlations for high-ranking, low-ranking etc. games, but general correlations across the entire scale of reviews. We will publish some results on low vs. high-ranking games later.

Finally, it should be mentioned that the work presented here, and all previous analyses on the topic of review scores and sales, including EEDAR’s, is correlational in nature. This means that no causal relationships can be identified, only speculated about.

Causal relationships define why sets of variables covary, i.e. change values according to a specific relationship. The only way to establish such relationships is via experimental research, which is tricky to apply here, as a scientific approach in this case would demand control of confounding factors, i.e. factors that could influence game sales without being related to Metacritic scores/game reviews (e.g. Christmas sales spikes, to take a currently pertinent example!).
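As a toy illustration of why this matters, the short simulation below (not based on the Steam data; the “marketing budget” confound is purely made up) shows how a hidden third variable can produce a correlation between review scores and sales even when neither causes the other:

```python
# Toy simulation (illustrative only) of a confound creating a spurious correlation.
import numpy as np

rng = np.random.default_rng(0)
n = 2000

marketing_budget = rng.normal(size=n)                         # hidden confound
review_score = 0.5 * marketing_budget + rng.normal(size=n)    # confound lifts scores
sales = 0.5 * marketing_budget + rng.normal(size=n)           # confound also lifts sales

# Scores and sales correlate even though neither causes the other here.
print(np.corrcoef(review_score, sales)[0, 1])  # roughly 0.2 in this toy setup
```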

What this means in practice is that even if it were shown that Metacritic scores correlated with the number of units sold, revenue, playtime or a similar metric we are interested in, we cannot, from a correlational analysis, tell whether it is the Metacritic scores that caused the metric to behave as it does. It may be that Metacritic scores have no systematic relationship with sales at all, or conversely that sales impact Metacritic scores in a systematic fashion. This issue is something that is mentioned too rarely in these types of analyses, including those published earlier on the topic of review scores and sales.

“There are, unfortunately, many examples of correlational analysis being misinterpreted as causal experiments, with disastrous or, in some cases, hilarious results”

There are, unfortunately, many examples of correlational analysis being misinterpreted as causal experiments, with disastrous or, in some cases, hilarious results.

Rather than just correlating with scores from Metacritic.com, we wanted to see if we could get some more review-type measures for games. For this, we turned to Gaugepowered. Gaugepowered is a platform used by Steam players to rate games, follow game sales and observe basic game statistics such as median playtime, number of players playing the games or game community value. It provides a means of obtaining a review score (ranking) that is more directly influenced by the players, as compared to Metacritic’s aggregate review scores.

We harvested review scores from Metacritic.com for 1426 games in the dataset, and player ranking scores from SteamGauge.com for 1213 games. We then ran a simple Pearson correlation analysis with scores from the two sites against: a) game ownership on Steam (~sales) and b) aggregate playtime (i.e. how much time the games had actually been played).
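A minimal sketch of this correlation step, assuming the per-game aggregates and the harvested scores have been exported to two hypothetical CSV files, could look as follows:

```python
# Sketch of the Pearson correlation step (file and column names are hypothetical,
# not the actual data format used in the study).
import pandas as pd
from scipy.stats import pearsonr

per_game = pd.read_csv("per_game_aggregates.csv")  # app_id, owners, total_hours
scores = pd.read_csv("review_scores.csv")          # app_id, metacritic, gauge_rank

# Only games with a matching score enter each correlation.
merged = per_game.merge(scores, on="app_id", how="inner")

for score_col in ("metacritic", "gauge_rank"):
    for outcome in ("owners", "total_hours"):
        subset = merged[[score_col, outcome]].dropna()
        r, p = pearsonr(subset[score_col], subset[outcome])
        print(f"{score_col} vs {outcome}: r={r:.2f}, p={p:.3g}")
```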

For game ownership there is a statistically significant correlation of r=0.22 for Metacritic and r=0.25 for SteamGauge (r being the correlation coefficient), but neither of these explains much of the variance in the dataset: squaring the coefficients gives roughly 0.05–0.06, i.e. only about 5–6 per cent of the variance in ownership is accounted for. Just because a relationship between variables is statistically significant at some defined level of probability does not mean the relationship explains a lot of the variance in the dataset, especially at sample sizes like the ones used here.

We also analysed the correlation between the total playtime of the games, weighted by the total number of players, and the two sets of scores. For Steamgauge we obtained an r-coefficient of 0.22, for Metacritic 0.06, again indicating no strong relationships. It is interesting to note that Steamgauge, the more directly player-derived ranking of the two, correlates better than Metacritic scores in both cases. It may be that rankings on platforms such as Steamgauge are more important – for games on Steam – than Metacritic scores, in terms of using these ranks to predict or estimate sales. Additional research will be needed to validate this idea, however.
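The weighted-playtime version of the analysis could be sketched along the same lines (again with hypothetical file and column names), assuming the weighting means total hours divided by the number of owners, i.e. average playtime per player:

```python
# Sketch of correlating scores against player-weighted playtime (hours per owner).
import pandas as pd
from scipy.stats import pearsonr

per_game = pd.read_csv("per_game_aggregates.csv")  # app_id, owners, total_hours
scores = pd.read_csv("review_scores.csv")          # app_id, metacritic, gauge_rank
merged = per_game.merge(scores, on="app_id", how="inner")

# Weight total playtime by the number of players owning the game.
merged["hours_per_player"] = merged["total_hours"] / merged["owners"]

for score_col in ("metacritic", "gauge_rank"):
    subset = merged[[score_col, "hours_per_player"]].dropna()
    r, p = pearsonr(subset[score_col], subset["hours_per_player"])
    print(f"{score_col} vs hours per player: r={r:.2f}, p={p:.3g}")
```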

In other words, this analysis provides no strong evidence of a direct correlation between Metacritic scores or Steamgauge rankings and game ownership or playtime.

It would appear that such review scores have no or only a minimal relationship with whether or not people buy and play specific games, although this conclusion can only be drawn for the games investigated here, i.e. games on the Steam platform.

By Anders Drachen, Rafet Sifa and Christian Bauckhage. Reposted from GamesIndustry.biz, original posted Dec. 23rd 2014.

One thought on “Reviews, scores, sales and playtime – a Steam perspective”

  1. Metacritic CANNOT be trusted – pure and simple. Their algorithm for compiling and weighing different review scores differently is secret – and it is well known in the gaming community that review scores in general are a product of how much marketing money is being paid to the reviewer or the reviewer’s company. It is entirely unreliable in this sense.

    If it was a pure average of aggregated review scores then it might be slightly more reliable – but again: Review scores themselves aren’t that reliable.
