Wikipedia Banner Challenge: Results

Congratulations to Wikipedia for a successful fundraiser.  They raised 20 million dollars with donations from more than one million people.  Now that the fundraiser is complete, we have archived the Wikipedia Banner Challenge; you can still vote and upload new banners, but those contributions will not be recorded.  Below, I’ll present some analysis of the data and provide links to the raw data so that you can analyze it too.

Over the approximately two weeks that the site was active, there were about 100,000 votes cast and about 1,500 banners uploaded.  Of these 1,500, I activated about 1,000.  Basically, I activated every banner that was in English, did not have any grammatical errors, and was not obscene.  In general, the number of banners uploaded per day closely matched the number of votes per day and participation was from all over the world.

Votes over time

Banners uploaded over time

Map of votes

Map of banner uploads

First, let’s look at some broad patterns in the results.  We started the process with 300 banners based on Wikipedia’s previous research, and in that set of seed banners we included every possible combination of 12 images and 23 pieces of text.  The heat-map below shows the scores of those 276 banners.  Recall that the score ranges from 0 to 100 and represents the estimated chance that a banner will win against a randomly chosen banner.  Here’s the graph (blue is lower score and red is higher score):

You can click on the heat-map to see a larger version.  If the only thing that matters for a banner is the image, then we would see clear vertical bands of color.  If the only thing that matters is the text, on the other hand, we would see clear horizontal bands of color.  In fact, we see something in between.  A few of the clearest patterns are:

  • Across a range of messages, images of Jimmy and the Earth seem to do better than average and images of Susan seem to do worse than average.
  • Across a range of images, some text seems to do above average (“Imagine a world in which every person on the planet had free access to all human knowledge.” & “Let’s make a world in which every person on the planet has free access to all human knowledge.”) and some text seems to do worse than average (“Want to make the world a better place? What are you waiting for?” & “Let’s keep Wikipedia growing”). 

These results are about the seed banners, but what about the more than 1,000 uploaded banners?  Were any good banners uploaded?  It seems that the answer is yes: the top 10 scoring banners were all uploaded by users.  In other words, we seeded the Wikipedia Banner Challenge with 300 banners building on Wikipedia’s extensive earlier research, but not one of these 300 banners was in the top 10.  Unfortunately, some of these uploaded banners had very few votes for or against them because they were uploaded close to the end of the process (this is a pattern we see frequently and something we are working on solving).  Therefore, it is probably better to restrict our attention to banners that had more than 50 completed contests.  In this case, 9 of the top 10 banners were uploaded by users, and here are the three with the top scores:

There is also the question of whether the scores from the Wikipedia Banner Challenge can predict click rate during the fundraiser.  For example, you might wonder if these three high scoring banners had higher click rates and were able to raise more money than the banners Wikipedia was using.  Unfortunately, we don’t know.  The fundraiser reached its target, and thus ended, much faster than in previous years, so there was no time to run any banners from our site.
Even though we don’t know how these banners would have done, we do have some data about the relationship between score and click rate during the fundraiser.  Wikipedia had done some previous banner testing experiments, and they made some of this data publicly available.  We included these previously tested banners in our set of seed banners, so we can see if the score from our site can correctly “predict” these experiments that have already happened.  To summarize our findings, if anything the relationship seems to be the opposite of expected: banners with lower score seem to have higher click rates.
The Wikipedia fundraising page reports results from three banner tests that we were able to replicate. In the first test run by Wikipedia, three banners with the picture of Jimmy Wales had different texts:

  • [Jimmy] Please read: A personal appeal from Wikipedia founder Jimmy Wales
  • [Jimmy] Please read: Advertising isn’t evil but it doesn’t belong on Wikipedia
  • [Jimmy] Advertising isn’t evil but it doesn’t belong on Wikipedia

In the second test, Jimmy was compared to Wikipedia contributors Susan and James:

  • [Susan] Please read: A personal appeal from an author of 549 Wikipedia articles
  • [Jimmy] Please read: A personal appeal from Wikipedia founder Jimmy Wales
  • [James] Please read: A personal appeal from Wikipedia editor Dr. James Heilman

In the third test by Wikipedia, Jimmy was compared to Wikipedia contributor Sarah:

  • [Sarah] Please read: A personal appeal from an author of 159 Wikipedia articles
  • [Jimmy] Please read: A personal appeal from Wikipedia founder Jimmy Wales

We included all of these banners in our site, so we can compare the score on our site to the click rates during the fundraiser.  Since these outcomes are not measured on the same scale, we would not expect them to match exactly, but we would expect to see that the banners that had higher score also had higher click rate.  Here’s the graph:


In fact, the click rate seems to be negatively related to score (r=-0.77).  The numbers and colors also provide additional information.  Each number represents one specific banner, so this plot makes it clear that one of the banners — “[Jimmy] Please read: A personal appeal from Wikipedia founder Jimmy Wales” — appeared in three separate tests.  And, somewhat surprisingly, the click rate for this banner varied by almost a factor of 2, probably because the tests were run at different times (the exact time of all of the experiments is not publicly available as far as I can tell).  The color of each marker allows you to group results by test.  For example, all the banners marked in green were run in the same test and therefore at the same time.  By looking at the results for each test individually, we can see that the negative relationship between click rate and score holds true in two of the three experiments.
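The correlation reported here is an ordinary Pearson coefficient.  For readers who want to check numbers like this against the released data, here is a minimal sketch of the computation; the scores and click rates below are placeholder values for illustration, not the real experimental results:

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from its definition:
    covariance divided by the product of standard deviations."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Placeholder numbers for illustration only (not the real data):
scores = [70, 60, 55, 40, 30]                 # score from our site
click_rates = [0.10, 0.15, 0.20, 0.25, 0.30]  # fundraiser click rate
print(round(pearson_r(scores, click_rates), 2))  # -0.99
```

With real data you would also want to inspect each test separately, since banners from different tests ran at different times.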
There are lots of reasons why the score on our site might not be predictive of the click rate on Wikipedia, and here are three that I think are most likely.   First, I think that on our site, we were estimating the preferences of die-hard Wikipedians — our big sources of traffic were the Wikipedia blog and the thank you page that people saw after making a donation — and these Wikipedians might have different preferences from the more numerous casual Wikipedia users.  Second, our site and the actual fundraiser are different contexts so it might be the case that a banner that does well when being directly compared to other banners might not do well when embedded at the top of a real Wikipedia page.  Finally, I think that the specific banners that we have data about were pretty similar — 7 of the 9 were “Please read:  A personal appeal from X” — so it might be that we are unable to detect the relationship between score and click rate for this narrow set of banners.  Without more research, it will be hard to distinguish between these three and the many other possible explanations for this pattern.
However, carrying this negative relationship to its logical conclusion, one might want to run the lowest scoring banners.  Here they are (again with a minimum of 50 completed contests).  I don’t think these would raise a lot of money, but who knows?

The undoubtedly complex relationship between score on our site and click rate leads to some natural questions about how Wikipedia should use these results.  My recommendation would be to think about the Wikipedia Banner Challenge (and All Our Ideas more generally) as a decision-making guide, not a decision-making machine.  That is, All Our Ideas + Wisdom does better than just Wisdom or just All Our Ideas (a related point was made in this interesting blog post by Sharad Goel).  The ability of wisdom to supplement All Our Ideas seems self-evident, but the ability of All Our Ideas to supplement wisdom comes from providing a decision maker access to a wide range of inputs that have been sufficiently filtered to be useful.  Both components are key.  A wide range of inputs helps expose the decision maker to new ideas — what Donald Rumsfeld might call “unknown unknowns” — and filtering, even imperfect filtering, helps make those inputs useful.  Inputs without filtering are noise.  For example, compare your ability to make a decision based on the Wikipedia Banner Challenge results page and Wikipedia’s effort to crowdsource banners in 2010.  While the attempt from 2010 provides lots of information, it is hard to use for decision making because it has not been sufficiently filtered.
The analysis that we have done so far is just a first step; we know that there is much more to learn from the data so here it is:

We are placing this data in the public domain under the CC0 License (for more on licensing for open data, see Molloy (2011)).  Here’s some documentation explaining exactly what is in these files.  And, to help get you started, here’s the R code that I used.  Please let us know if you do anything with the data; we’d love to see it.
