As someone whose career in the 21st century has focused mainly on user contribution systems and user-created content, I leverage several crowdsourcing sites on the Web. One of my favorites is Kaggle.com, which, according to its Australian CEO, Anthony Goldbloom, whom I recently spoke to, enables people to outsource big data questions. Every predictive modeling problem is framed as a competition: the person who builds the most accurate model gives that model to the company, and in exchange the company gives them a prize. Kaggle is a powerful way to build predictive modeling algorithms. Why is this important? Imagine a bank being able to predict who will default on a loan. (Note: Predictive models are created or chosen to best predict the probability of an outcome. In many cases the model is chosen on the basis of detection theory to guess the probability of an outcome given a set amount of input data, for example, given an email, determining how likely it is to be spam. Definition from Wikipedia.)
Goldbloom came up with the idea for Kaggle while working at The Economist. He worked on an article on big data and data science, although as Anthony reminds me, “It wasn’t called that at the time.” While talking to CIOs who were struggling to get value from their data, he knew their problems could be solved: he could “put up those problems (on the web)” and people could prove their mettle by actually solving them.
During our discussion, Goldbloom mentioned two competitions:
- The William and Flora Hewlett Foundation (Hewlett) reached out to Kaggle’s data scientists and machine learning specialists to develop an affordable solution for automated grading of student-written essays. (Not sure my wife, who is a high school teacher, will like this.) The Foundation collected 24,000 graded essays written by high school students, sponsored the contest, and awarded $100,000 to the top three research teams. A British hedge fund trader (trained as a physicist), a software developer at the National Weather Service, and a German grad student created the winning solution, which can help schools assess students’ writing. In all, 250 teams participated and there were 2,500 submissions. (Note: None of the winners had a data science background.)
- The Wikipedia Challenge asked data-mining experts to build a model that predicts the number of edits an editor will make. Wikipedia wanted to understand what factors determine editing behavior. Contestants were expected to build a predictive model that the Wikimedia Foundation can reuse to forecast long-term trends in the number of edits it can expect. There were 94 teams with 115 players and 1,024 entries. Here’s a page describing the challenge:
Kaggle combines many of the popular current trends in the industry: gamification, crowdsourcing, virtual workforce, and, of course, Big Data. (Venture capitalists must love this company.)
Companies can build models in house or hire a consulting firm like Accenture. Kaggle’s crowdsourcing solution is a new third option. As Goldbloom points out, “Companies are beginning to see Kaggle as a leveraged arm of their own business.” How does it work? Companies and researchers post their data. Statisticians and data miners from all over the world compete to produce the best models. Companies identify a problem and then leverage Kaggle’s active community to solve it. This crowdsourcing approach relies on the fact that there are countless strategies that can be applied to any predictive modeling task, and it is impossible to know at the outset which technique or analyst will be most effective.
Kaggle’s secret sauce is that there’s lots and lots of data out there, and a strong desire to play with this data.
In particular, Kaggle is gaining the most traction in financial services, in the technology sector, and in life sciences. Competitions filter talent and also let the best data solutions float to the top of the pack while people are giving objective feedback along the way.
As Goldbloom points out, “The really nice thing about these predictive modeling tasks is you can back test people’s algorithms on historical data and get a sense for which algorithms perform well and which algorithms don’t perform so well.”
Most of the 45,000 members on Kaggle call themselves data scientists, one of the hottest professions in Silicon Valley. Most of them, however, have engineering or computer science degrees. Here’s a breakdown of their professions:
Kaggle has several public offerings:
Kaggle Prospect (in beta now): Practice Fusion (another favorite company of mine), a vendor of electronic records, used it by opening up their data to determine what types of problems could be solved, such as predicting who will develop diabetes.
Kaggle In-Class is another product. Whether the task is predicting the past or the future, it requires students to build models that are evaluated against past outcomes. For example, an instructor might host a predicting-the-past competition that requires students to build models to predict wine prices based on country of origin, vintage, and other factors. The winning model is the one that most accurately predicts actual prices from a set of historical price outcomes (hidden from the students).
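To make the predicting-the-past idea concrete, here is a minimal sketch of how such a competition could be scored. It is my own illustration, not Kaggle's actual scoring code: the error metric (root-mean-squared error), the function names `rmse` and `leaderboard`, and all the prices are assumptions made up for the example.

```python
import math

def rmse(predicted, actual):
    """Root-mean-squared error between two equal-length lists of prices."""
    if len(predicted) != len(actual):
        raise ValueError("prediction and outcome lists must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(actual))

def leaderboard(submissions, hidden_prices):
    """Rank submissions (name -> predicted prices) from lowest to highest error."""
    scores = {name: rmse(preds, hidden_prices) for name, preds in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1])

# Hypothetical data: the instructor keeps the real prices hidden from students.
hidden_prices = [20.0, 35.0, 50.0]
submissions = {
    "student_a": [22.0, 33.0, 50.0],
    "student_b": [30.0, 30.0, 40.0],
}

ranking = leaderboard(submissions, hidden_prices)
for name, score in ranking:
    print(f"{name}: RMSE {score:.2f}")
```

The student with the lowest error sits at the top of the ranking. A different metric could be swapped in without changing the overall shape of the process.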
Kaggle has a great business model, one that should be considered by other crowdsourcing companies. As Goldbloom explains:
“Competitions are open to everybody. The sole purpose of these competitions is to qualify talent. So if you finish in the top ten percent of two public competitions, we’ll label you as qualified talent.” Most of Kaggle’s commercial work, such as banks trying to predict who’s going to default on a loan, is conducted via private competitions. “For private competitions we basically invite 15 of our strongest members. Each of them compete behind the scenes and the prize money is consistently – it’s a six figure sum and we also take a large fee on those private competitions.”
The private competitions require large data sets and an invitation-only crowdsourcing process, both of which are kept private. All the participants receive some sort of monetary reward.
Here are some examples of potential ROI vs. realized ROI.
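The qualification rule Goldbloom describes can be sketched in a few lines. This is only my reading of his quote, not Kaggle's real logic: I am assuming "top ten percent" means a finishing rank within the first tenth of the teams entered, and the function name `is_qualified` and its parameters are hypothetical.

```python
def is_qualified(results, top_percent=10, required_finishes=2):
    """Return True if enough public-competition results land in the top slice.

    results: list of (rank, number_of_teams) tuples, where rank 1 is the winner.
    Integer arithmetic avoids rounding surprises at the cutoff.
    """
    top_finishes = sum(
        1 for rank, teams in results if rank * 100 <= top_percent * teams
    )
    return top_finishes >= required_finishes

# Hypothetical member histories, using field sizes mentioned in this post.
print(is_qualified([(5, 250), (9, 94)]))   # two top-10% finishes
print(is_qualified([(5, 250), (50, 94)]))  # only one top-10% finish
```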
Transactional Fraud: A large credit card issuer.
Assume the issuer has 50MM credit cards, with customers spending $500 per month on average. Based on current industry estimates, let’s assume the issuer experiences 10 basis points (1 basis point is 1/100th of 1%) in fraud losses. That puts total fraud losses in the neighborhood of $300MM / year (50MM * 500 * 12 * 10 basis points). A mere 5% reduction in fraud losses with a better model will generate an incremental return of $15MM / year. This can easily put the ROI in the double digits, especially when you think about how much time and how many people you would need to resolve these issues.
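For readers who want to check the fraud arithmetic or plug in their own numbers, here is a small sketch. The inputs are the assumptions from the example above; the function names `annual_fraud_losses` and `annual_savings` are mine.

```python
def annual_fraud_losses(cards, monthly_spend, fraud_bps):
    """Yearly fraud losses in dollars; 1 basis point is 1/10,000 of spend."""
    annual_spend = cards * monthly_spend * 12
    return annual_spend * fraud_bps // 10_000

def annual_savings(losses, reduction_pct):
    """Dollars saved per year if a better model cuts losses by reduction_pct."""
    return losses * reduction_pct // 100

# Assumed inputs: 50MM cards, $500 per month, 10 basis points, 5% reduction.
losses = annual_fraud_losses(50_000_000, 500, 10)
savings = annual_savings(losses, 5)

print(f"Fraud losses: ${losses:,} / year")    # $300,000,000
print(f"Model savings: ${savings:,} / year")  # $15,000,000
```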
Retail consumer marketing: A large retailer
A big box retailer with over 20MM customers sends product promotions to its customers on a monthly basis. Typically, fewer than 1% of customers respond to these offers. Assuming each responding customer spends $200 on average because of the marketing offer, the retailer probably sees about $40MM (20MM * 1% * 200) in incremental sales. A better predictive model through Kaggle could easily double or triple the response rates to these marketing offers, thereby leading to $80MM to $120MM in incremental sales!
Goldbloom’s team’s grand vision is to create a meritocracy, a labor market where the best people rise to the top, both in terms of skill and value. (A meritocracy is a system where appointments and responsibilities are objectively assigned to individuals based upon their “merits,” namely intelligence, credentials, and education.)
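The retail numbers work the same way. This sketch uses only the assumed figures from the example (20MM customers, $200 per responder, response rates of 1%, 2%, and 3%); the function name `incremental_sales` is mine.

```python
def incremental_sales(customers, response_pct, spend_per_responder):
    """Dollars of extra sales when response_pct percent of customers respond."""
    return customers * response_pct * spend_per_responder // 100

# Baseline 1% response, then the doubled and tripled rates from the example.
for pct in (1, 2, 3):
    sales = incremental_sales(20_000_000, pct, 200)
    print(f"{pct}% response rate: ${sales:,} in incremental sales")
```

This reproduces the $40MM baseline and the $80MM to $120MM range for doubled and tripled response rates.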
Goldbloom provides an example: “Roger Federer is ranked number one in the world because he wins more tennis matches than other tennis players. I would very much like to see us create the world’s first meritocratic valuable labor market. So, you know, I mount the argument that, Roger Federer is a phenomenal athlete, but he doesn’t generate, you know, a lot of value” (for most people in the audience, for example).
I highly recommend that you check out Kaggle.com!
ROI (Real Overall Impact!)
- Use a public area to identify potential leaders to participate in a private area
- Leverage a real time leaderboard which motivates people
- Enable the community to determine the content – what problem will be resolved.
- Check out Hacker News for a good implementation of the Thumbs up / Thumbs down process
- The platform for uniting free agents is important.
- People learn more by doing vs. sitting in a class or reading a user manual
Transcript of Interview:
What is Anthony reading?
The interview was recorded on June 20th, 2012 and written up at the Trident Bookstore in Boston, while watching another competition: Women’s Gymnastics at the London Olympics!
Thank you Nation!