Thursday, June 26, 2008

All models are wrong, and increasingly you can succeed without them.

So says Peter Norvig, Google's research director, updating George Box's maxim.

This is an awesome article about petabyte-scale datasets and why correlation in the data can be enough, without first finding a reason the datasets are related and building a model around it. Some excerpts follow; the full article is available via this link.

Excerpts from the original article:
Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database. Now Google and like-minded companies are sifting through the most measured age in history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age.

Google's founding philosophy is that we don't know why this page is better than that one: If the statistics of incoming links say it is, that's good enough. No semantic or causal analysis is required. That's why Google can translate languages without actually "knowing" them (given equal corpus data, Google can translate Klingon into Farsi as easily as it can translate French into German). And why it can match ads to content without any knowledge or assumptions about the ads or the content.

This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.

If the words "discover a new species" call to mind Darwin and drawings of finches, you may be stuck in the old way of doing science. Venter can tell you almost nothing about the species he found. He doesn't know what they look like, how they live, or much of anything else about their morphology. He doesn't even have their entire genome. All he has is a statistical blip — a unique sequence that, being unlike any other sequence in the database, must represent a new species.

Petabytes allow us to say: "Correlation is enough."

Sunday, March 09, 2008

CAPTCHA is Dead, Long Live CAPTCHA!

Interesting post on Coding Horror: three of the best-known CAPTCHAs (Google, Hotmail, and Yahoo!) are now broken.

Wisdom comes from Gunter Ollmann, who notes:

CAPTCHAs were a good idea, but frankly, in today's profit-motivated attack environment they have largely become irrelevant as a protection technology. Yes, the CAPTCHAs can be made stronger, but they are already too advanced for a large percentage of Internet users. Personally, I don't think it’s really worth strengthening the algorithms used to create more complex CAPTCHAs – instead, just deploy them as a small "speed-bump" to stop the script-kiddies and their unsophisticated automated attack tools. CAPTCHAs aren't the right tool for stopping today's commercially minded attackers.

Read more here.

Wednesday, November 14, 2007

algoGod update: Extending the submission date to December 14th 2007

November 14 2007: Here is an update on the algoGod contest.

First, we are extending the submission date to December 14th 2007.

We started the contest four weeks ago and got an overwhelming response: 299 contestants registered for the event and are actively solving the problem. Many contestants asked for more time, so we have extended the deadline.

We would also like to let you know that we are hiring machine learning experts to work at Komli on building a world-class ad-optimization engine. If you are interested, check out our web page at http://www.komli.com/careers under ‘Machine Learning Expert’.

The algoGod FAQ is now available at http://www.komli.com/algogod/faq.php, and if you have any questions, please don’t hesitate to send them to algogod@komli.com.
