Thursday, January 13, 2011

download your data

We created allourideas.org to improve the way that groups collect information, and now we are making it easy for you to download your data from our site.  I’m hoping that once more people have their hands on their data, we’ll start to discover cool new ways to learn from it. 

To further facilitate this development, we are happy to release all the data from one of our largest idea marketplaces: the Washington Post’s idea marketplace about who had the worst year in Washington (thank you Ryan Kellett and colleagues at the Post).  Please feel free to play with the data and send us your own analysis of it.  We’ll post the coolest things we get here on the blog.  It’s kind of like the Netflix Prize, but minus the million dollars.

Keep reading for detailed documentation about the data files and for links to the Washington Post’s data.

The website can generate files in comma-separated values (csv) format. These csv files can then be analyzed using spreadsheet programs (e.g., Excel or OpenOffice) or statistical programs (e.g., R).  You can request the csv files at the bottom of the admin page (http://www.allourideas.org/[yourmarketplace]/admin).  Then our server will create the files (this could take up to one hour) and email you a link to download them.  You can only request files from the idea marketplaces where you are an administrator.

There are three files that you can request: 1) a file where each record is an idea, 2) a file where each record is a vote, and 3) a file where each record is a non-vote. Here are the three files from the Washington Post’s Worst year in Washington idea marketplace:

Here is documentation for the information in each of these files.

1) Idea file (ideamarketplace_[ID#]_ideas.csv)

This file has one record for each idea.

  • Ideamarketplace ID: ID number for your idea marketplace.  Every row in your file will have the same number.  It is also the number that will be in the name of all of the csv files from this idea marketplace.
  • Idea ID: ID number for this idea (unique).
  • Idea Text: Text of the idea (e.g., “Free ice cream all the time”).
  • Wins: Number of times that this idea has won a pairwise comparison.
  • Losses: Number of times that this idea has lost a pairwise comparison.
  • Times involved in Cant Decide: Number of times this idea was in a contest where “I Can’t Decide” was selected.
  • Score: Estimated probability that this idea will beat a randomly chosen other idea from this idea marketplace.  This will always be between 0 and 100.
  • User Submitted: TRUE/FALSE to indicate whether the idea was uploaded by a user.  Ideas uploaded from the creation page (http://www.allourideas.org/questions/new) are not user submitted.  Ideas submitted from the voting page (e.g., http://www.allourideas.org/studentgovernment) are user submitted.
  • Session ID: ID number of the session in which this idea was added.
  • Created at: Indicates when this record was created in the database.  So, all seed ideas will have approximately the same “Created at” time.  All timestamps represent time UTC time.
  • Last Activity: Indicates the last time that this record was updated.
  • Active: TRUE/FALSE to indicate whether the idea is current active (meaning that it can be included in future pairs and shown in the results).
  • Appearances on Left: Number of times that this idea appeared on the left of the pair.  Note that appearance just refers to this idea appearing on the screen.  Not all appearances results in a vote.  This would occur, for example, if the voter closes her browser while this idea is on the screen.
  • Appearances on Right: Number of times that this idea appeared on the right of the pair.  See note about Appearances on Left for more information.
  • Info: records the “info” parameter passed into the url (more information) for the session where this idea was uploaded.  If no parameter was passed, this value will be NA. 

2) Vote file (ideamarketplace_[ID#]_votes.csv)

This file has one record per vote.

  • Vote ID: ID number for this vote (unique).
  • Session ID: ID number for the session in which this vote was cast.
  • Ideamarketplace ID: ID number for this idea marketplace.  This will be the same for each row in the file.
  • Winner ID: Idea ID for winner of this vote.
  • Winner Text: Text of winner of this vote.
  • Loser ID: Idea ID for loser of this vote.
  • Loser Text: Text of loser of this vote.
  • Prompt ID: ID number of prompt (e.g., “free beer” vs “free ice cream”).  Note that this is left/right sensitive: (“free beer” vs “free ice cream”) has a different prompt ID from (“free ice cream” vs “free beer”).
  • Left Choice ID: ID number of idea on the left of the prompt.
  • Right Choice ID: ID number of idea on the right of the prompt.
  • Created at: When this record was created in the database.  All timestamps represent UTC time.
  • Updated at: When this record was last updated in the database.
  • Response Time (s): Response time for this vote as measured on the client side (in seconds).  Beware that measuring response time using javascript is not exact, and we recommend that you treat this measurement with some caution. 
  • Missing Response Time Explanation: reason that response time is missing (if it is missing).
  • Valid: TRUE/FALSE indicating whether this record is valid.  This is part of our infrastructure for handling attempted gaming of the voting process.  For your analysis, we recommend only using records where valid==TRUE.  Please email if you have questions.
  • Hashed IP Address: a cryptographic hash of the IP address of the session.  From this value you can see if two sessions are from the same IP address, but not what that IP address actually is.
  • URL Alias: the string that appears after “www.allourideas.org/” for this idea marketplace.  For example, for “www.allourideas.org/studentgovernment” the url alias is “studentgovernment.”
  • User agent string: this string records the browser and operating system of the computer used for this vote.  For more information here is the Wikipedia page on user agent strings.
  • Referring url: this is the url from which this session originated.  For example, if the session originated after the person clicked on a link to your idea marketplace from http://www.nytimes.com/example, then the referring url would be http://www.nytimes.com/example. Fore more information here is the Wikipedia page on HTTP referrers.  Other possible values are “DIRECT_VISIT” if the url is typed in directly, and “REFERRER_NOT_FOUND” if we are not able to locate the referrer.
  • Widget: TRUE/FALSE value that records whether the vote was recorded at the widget interface (e.g., http://widget.allourideas.org/[url]).  If FALSE the vote was recorded at the main interface (e.g., http://www.allourideas.org/[url]).  
  • Info: records the “info” parameter passed into the url (more information).  If no parameter was passed, this value will be NA. 

3) Non-vote file (ideamarketplace_[ID#]_non_votes.csv)

This file has one record for each non-vote.  There are three kinds of non-votes: “Bounce,” “Stopped_Voting_Or_Clicking,” and “Skip”.  A “Bounce” is recorded when a session begins but no vote or skip occurs (i.e., someone visits the site, but does not participate).  A “Stopped_Voting_Or_Clicking” is recorded when a prompt (e.g., “Free ice cream” vs “free beer”) is shown to a voter and no response of any kind is returned.  This occurs most frequently when the voters closes her browser.  A “Skip” is recorded when the voter clicks “I can’t decide.” 

  • Record Type: “Bounce,” “Stopped_Voting_Or_Clicking,” and “Skip” are the only valid values.  These are defined above.
  • Record ID: ID number for this record. 
  • Session ID: ID number of the session in which this action occurred.
  • Ideamarketplace ID: ID number for this idea marketplace.  This should be the same for all rows in the file.
  • Left Choice ID: ID of idea appearing on left of the prompt.
  • Left Choice Text: text of idea appearing on the left.
  • Right Choice ID: ID of idea appearing on the right of the prompt.
  • Right Choice Text: text of idea appearing on the left.
  • Prompt ID: ID number of prompt (e.g., “free beer” vs “free ice cream”).  Note that this is left/right sensitive: (“free beer” vs “free ice cream”) has a different prompt ID from (“free ice cream” vs “free beer”).
  • Reason: If the Record Type is “Skip” this indicates why the voter could not decide.  For “Bounces” and “Stopped_Voting_Or_Clicking” this field should always be NA. 
  • Created at: When this record was created in the database.  All timestamps represent UTC time.
  • Updated at: When this record was last updated in the database.
  • Response Time (s): Response time for this vote as measured on the client side (in seconds).  Beware that measuring response time using javascript is not exact, and we recommend that you treat this measurement with some caution.   For “Bounces” and “Stopped_Voting_Or_Clicking” this field should always be NA. 
  • Missing Response Time Explanation: If response time is missing, an explanation of why.
  • Valid: TRUE/FALSE indicating whether this record is valid.  This is part of our infrastructure for handling attempted gaming of the voting process.  For your analysis, we recommend only using records where valid==TRUE.  Please email if you have questions.
  • Hashed IP Address: a cryptographic hash of the IP address of the session.  From this value you can see if two sessions are from the same IP address, but not what that IP address actually is.
  • URL Alias: the string that appears after “www.allourideas.org/” for this idea marketplace.  For example, for “www.allourideas.org/studentgovernment” the url alias is “studentgovernment.”
  • User agent string: this string records the browser and operating system of the computer used for this vote.  For more information here is the Wikipedia page on user agent strings.
  • Referring url: this is the url from which this session originated.  For example, if the session originated after the person clicked on a link to your idea marketplace from http://www.nytimes.com/example, then the referring url would be http://www.nytimes.com/example. Fore more information here is the Wikipedia page on HTTP referrers.  Other possible values are “DIRECT_VISIT” if the url is typed in directly, and “REFERRER_NOT_FOUND” if we are not able to locate the referrer.
  • Widget: TRUE/FALSE value that records whether the vote was recorded at the widget interface (e.g., http://widget.allourideas.org/[url]).  If FALSE the vote was recorded at the main interface (e.g., http://www.allourideas.org/[url]).  
  • Info: records the “info” parameter passed into the url (more information).  If no parameter was passed, this value will be NA. 

FAQ:

1) Is it easy to read these files into R?

Yup. We use R so you can be sure that these files play nicely with R.  To read the votes files into R, just use this syntax:

votes <- read.csv(“ideamarketplace_[ID#]_votes.csv”, header=TRUE, sep = “,”, dec=”.”);

For the ideas and non-votes, just adjust the above as appropriate.

2) How can we handle the timestamps in R?

The code below will turn the timestamps in the POSIXct data type.  Then you can do nice things like time2 - time1 (etc).

as.POSIXct(as.character(votes[,”Created.at”]), tz=”UTC”, format=”%Y-%m-%dT%H:%M:%S+00:00”);

3) Why is my data file different from the layout described above?

We are constantly improving the data files by adding more information.  Therefore, if you would like the most recent data file please return to your idea marketplace and re-generate it.  A full list of changes to the files is at the end of this page.

CHANGELOG:

Votes file:

  • User agent string and referring url added on March 29, 2011.
  • Changed “tracking” to “info” on July 24, 2011.
  • Converted times from Pacific Time (US) to UTC time on March 21, 2012.

Non-vote file:

  • Before March 29, 2011 “Bounce” and “Stopped_Voting_Or_Clicking” were both called “Orphaned Appearance”.
  • User agent string and referring url added on March 29, 2011.
  • Changed “tracking” to “info” on July 24, 2011.
  • Converted times from Pacific Time (US) to UTC time on March 21, 2012.

Ideas file:

  • Added info parameter on July 24, 2011.
  • Converted times from Pacific Time (US) to UTC time on March 21, 2012.

Notes

  1. allourideas posted this