Clustering Data -- How to Have Fun in n-Dimensions

Accepted Session
Short form
osb2009-0146
Scheduled: Wednesday, June 17, 2009 from 3:50 – 4:35pm in St. Johns

Excerpt

The amount of information freely available on the internet from sources like
Twitter and GitHub grows every day. This gives us new opportunities to leverage
the collective consciousness.

Clustering is a wonderful method for finding useful information in large
amounts of data. But it can be an intimidating topic for programmers without a
lot of academic background. In this talk I will introduce and explain some
practical techniques for clustering real-world data.

Description

I will introduce the theory and goals of clustering algorithms. The literature on statistical analysis is made up of dense mathematical equations, so I will translate those equations into pseudocode to make the topic more accessible to programmers.
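
As a small illustration of what translating an equation into code can look like, consider the Euclidean distance between two n-dimensional points, d(p, q) = sqrt(sum over i of (p_i - q_i)^2). The sketch below is not taken from the talk; the function name `euclideanDistance` and the choice of this particular formula are assumptions made here for the sake of example.

```javascript
// Euclidean distance between two points given as arrays of numbers.
// Hypothetical helper for illustration; not code from the talk.
function euclideanDistance(p, q) {
  if (p.length !== q.length) {
    throw new Error("points must have the same number of dimensions");
  }
  let sumOfSquares = 0;
  for (let i = 0; i < p.length; i++) {
    const diff = p[i] - q[i];
    sumOfSquares += diff * diff; // (p_i - q_i)^2
  }
  return Math.sqrt(sumOfSquares);
}
```

The same function works for any number of dimensions, which is the property that lets a distance-based clustering method move from map coordinates to, say, vectors of user ratings.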

I will expand on the theoretical discussion by demonstrating a simple example of a clustering problem: how to group volcanoes in Alaska by geographical proximity. I will then move on to algorithms with real-world applications, such as how to group users with similar tastes given a database of user ratings.
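
To make the proximity example concrete, here is a minimal sketch of one common approach, k-means clustering, applied to 2D points. The abstract does not say which algorithm the talk uses, so k-means is an assumption here, as are the function names, the made-up sample coordinates, and the simplification of treating coordinates as points on a flat plane rather than on the globe.

```javascript
// Squared straight-line distance between two [x, y] points.
function squaredDistance(a, b) {
  const dx = a[0] - b[0];
  const dy = a[1] - b[1];
  return dx * dx + dy * dy;
}

// Index of the centroid closest to the given point.
function nearestCentroid(point, centroids) {
  let best = 0;
  for (let i = 1; i < centroids.length; i++) {
    if (squaredDistance(point, centroids[i]) < squaredDistance(point, centroids[best])) {
      best = i;
    }
  }
  return best;
}

// Basic k-means: assign each point to its nearest centroid, move each
// centroid to the mean of its points, and repeat until the assignments
// stop changing or maxIterations is reached.
function kMeans(points, initialCentroids, maxIterations = 100) {
  let centroids = initialCentroids.map((c) => c.slice());
  let assignments = new Array(points.length).fill(-1);

  for (let iter = 0; iter < maxIterations; iter++) {
    const next = points.map((p) => nearestCentroid(p, centroids));
    const changed = next.some((a, i) => a !== assignments[i]);
    assignments = next;
    if (!changed) break;

    centroids = centroids.map((old, k) => {
      const members = points.filter((_, i) => assignments[i] === k);
      if (members.length === 0) return old; // leave an empty cluster's centroid in place
      const sumX = members.reduce((s, p) => s + p[0], 0);
      const sumY = members.reduce((s, p) => s + p[1], 0);
      return [sumX / members.length, sumY / members.length];
    });
  }
  return { centroids, assignments };
}

// Made-up coordinates for illustration only; not real volcano locations.
const samplePoints = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]];
const result = kMeans(samplePoints, [[0, 0], [10, 10]]);
console.log(result.assignments);
```

Passing the starting centroids in explicitly keeps the sketch deterministic. In practice they are often chosen at random, and the final clusters can depend on that choice.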

I may touch on more advanced techniques to improve the accuracy of resulting clusters. I will also discuss current limitations of statistical analysis. One example comes from Netflix's ongoing competition: the search for an algorithm that can predict whether or not a user will like the movie Napoleon Dynamite.

The examples from the talk will be implemented using JavaScript and CouchDB. My hope is that people from many different language and environment backgrounds will have some experience with JavaScript. The data-processing capabilities of CouchDB are also well suited to clustering algorithms.

Tags

cluster analysis, statistical analysis, data mining, pattern recognition, CouchDB, javascript

Speaker

  • Jesse Hallett

    Portland JavaScript Admirers, Portland Ruby Brigade

    Biography

    Jesse Hallett graduated from Reed College. While there he studied theoretical computer science and linguistics. For his undergraduate thesis he wrote on lexical semantics for natural language processing, which is the design of data structures to represent word meaning.

    Since graduating he has had more time to devote to his love of all things free and open source. Projects he has worked on include ZAML, buscatcher, and a little tinkering with Calagator.

    Professionally he has been designing web applications with Ruby on Rails. This has given him an excuse to keep up with the latest web technologies, from OpenID and OAuth to the HTTP and HTML standards themselves.

    Recently Jesse has become involved as an organizer of the Portland JavaScript Admirers. The state of JavaScript and client side programming is getting more interesting every day. Keeping up with it is a busy but rewarding task.

    In his spare spare time (i.e. when he’s not coding) he enjoys sci-fi westerns, competitive sheep herding, and long walks on the beach.

    Sessions

      • Title: Clustering Data -- How to Have Fun in n-Dimensions
      • Track: Cooking
      • Room: St. Johns
      • Time: 3:50 – 4:35pm
      • Speakers: Jesse Hallett