This video, an informal presentation by Harvard professor Sendhil Mullainathan followed by a conversation with attendees (all distinguished in their own right, including Daniel Kahneman), is really interesting. In it, Professor Mullainathan talks about a new piece of work he is involved in that looks at the impact of big data on social science. He stresses the importance of starting by casting your net as wide as possible and using induction to see what comes out of the data, rather than deduction, where you go in head first with a specific goal in mind.
What do you go out and collect? The stuff that you think matters. That’s why deduction is so powerful. But once you collect all kinds of things, then you will have the ability to look at all these variables and see what matters, much like in word sense disambiguation. We’re no longer defining rules. We’re just throwing everything in.
It’s a lovely conversation, and you can see how his thought process evolves through it; his research is still a work in progress. I also really like the way he distinguishes between ‘long’ and ‘wide’ data when we refer to ‘big’ data, which more people should do:
We could break the word “big” into two parts: Long data and wide data. What do I mean by that? Long data is the number of data points you have. So if you picture the data set as sort of like a matrix, or written on a piece of paper, length is the length of that dataset. The width is the number of features that you have.
These two kinds of “big” work in exactly the opposite direction. That is, long is really, really good. Wide, some of it’s bad, and it poses a lot of problems. Why does wide pose a lot of problems? Picture the prediction function working as a search process. The search processes find the combinations of features that work well to predict y. You could see, with just a little back-of-the-envelope calculation, the mathematics are such that as the data gets even a little bit wider, this thing is growing exponentially, I mean, just crazy exponentially. As a result, when data gets wider, and wider, and wider, the problem gets harder, and harder, and harder, and algorithms do worse, and worse, and worse. As the data gets longer and longer, algorithms do better and better.
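To get a feel for that back-of-the-envelope calculation, here is a rough sketch of my own in Python. It is not taken from the talk: I am assuming the "search" means considering every possible combination of features, and the function names are made up for illustration. Under that assumption, a dataset with p features has 2^p - 1 non-empty feature combinations, so each extra column roughly doubles the search space, while extra rows add nothing to it.

```python
from math import comb


def count_feature_subsets(width: int) -> int:
    """Number of non-empty feature combinations for a dataset `width` columns wide.

    Assumption (mine, not the speaker's): the search considers every
    possible subset of features, so the count is 2**width - 1.
    """
    return 2 ** width - 1


def count_interactions(width: int, order: int) -> int:
    """Number of ways to pick exactly `order` features out of `width`."""
    return comb(width, order)


if __name__ == "__main__":
    # Widening the data: the search space explodes.
    for width in (10, 20, 30, 40):
        print(
            f"width={width:>2}  "
            f"all subsets={count_feature_subsets(width):>16,}  "
            f"3-way combinations={count_interactions(width, 3):>6,}"
        )
```

Going from 10 to 20 features takes the number of combinations from 1,023 to 1,048,575. Making the data longer does not change these numbers at all; it only gives the algorithm more evidence for telling the good combinations from the bad ones.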
Watch the whole thing. The way the screen is split into five parts is also rather neat, giving multiple perspectives at the same time.