Joe Kington

I'm a geologist or a geophysicist, take your pick. I try to do interesting things (mostly in python) in my spare time. There's a hound dog and a banjo on my porch.


In honor of Stack Overflow reaching 10,000,000 questions, they’ve been collecting users’ “success stories” and giving away various swag. Originally, I shamelessly wrote this up in hopes of getting a free T-shirt. It’s a bit long for that, but I think it’s a story worth telling.

I’ve been participating on Stack Overflow for over 5 years now (since April of 2010, apparently). I don’t care much one way or another about the “gamification” part of things, but I’ve greatly enjoyed solving various puzzles and helping other people. It’s a very effective platform for that, despite the criticism you’ll often see about closed questions, etc.

However, the most important and rewarding thing that I’ve gotten from Stack Overflow is the chance to learn by answering other people’s questions. I’ve learned a lot by asking questions, but I’ve always found that I don’t fully understand something until I’ve explained it to someone else.

This is the story of the SO question I learned the most answering.

Learning Programming

I’m a geologist by background. I don’t have any formal training in computer science. The bulk of what little I know about programming and software development I owe to Stack Overflow.

I’ve putzed around with writing simple odds and ends since the first time I came across a school computer in 8th grade. I was even paid to do web development back in the early 00’s. (Tip: don’t let a 20-year-old who’s never owned a computer write his own PHP-based CMS for your website. Who needs https?? I’ll just write a bit of js to obfuscate the login form and give the admin page a funny name… I have no idea how that stuff was never hacked and defaced…) In grad school, I wrote abominations of stitched-together csh, awk, Fortran, and Matlab. Later on in grad school, I managed to make even Python completely unreadable.

However, I’d never learned much beyond the immediate bit I needed to get the job done. Stack Overflow was started about the time I was beginning to want to learn to do more than make “copy-pasta”.

It took a couple of years, but I eventually began participating instead of just lurking and reading other people’s questions and answers. Once I started answering questions, the rate at which I was learning skyrocketed. There were many questions I knew the immediate answer to, but didn’t know exactly why. I made a point of digging in until I could give a complete answer, and learned a ton every time.

Because I was more confident in what I was doing, I started to build larger and more complex side-projects. In the process, I learned how to be an at least occasionally-competent software engineer. Starting a few months ago, I’ve managed to wind up in a software development role at work. I would never have had the background or confidence to do that without Stack Overflow.

Learning More than Programming

I’ve learned to be a far better developer over the past five years on Stack Overflow. However, it’s not just about learning programming, or even proper software engineering. The best and most valuable things I’ve learned from SO are more general methods that can be applied to many problems.

One particular category stands out: I apply machine learning methods regularly now, but I was very intimidated and confused by them initially. That changed when I had some time one Christmas break to really dig in while answering a specific question. It’s by far the question I’ve learned the most from answering, and it started me down a path that’s been both useful and fun.

A specific example: Eigenpaws

Largest 9 Eigenpaws

Over a few months in the fall of 2010, Ivo asked a series of fantastic questions (bottom 10-15 questions in the list) relating to analyzing “puppy paws” on a pressure plate. His initial question about peak detection spiraled into a wonderful series of semi-related questions as he built his application.

Because Ivo’s questions were clearly-stated and exceptionally fun (puppies, anyone?), they received a lot of attention and very good answers. I answered a few of them. One of my answers wound up being very popular, mostly because it had animations. However, the focus of this story is on one of Ivo’s later questions. He asked about ways to identify individual pawprints (e.g. left front, right hind, etc.).

I felt very invested in answering this particular question. It was partly because it was a follow-up to a question where my answer had been very popular. Mostly, though, it was just a fun problem! Furthermore, Ivo happened to ask it right as I was leaving on a trip to visit my wife’s (girlfriend at the time) family. I had a long train ride and several days of “down-time” to dig into the problem in more detail at night.

Why Isn’t This Working?

In a nutshell, I had figured out how to correctly classify paws based on the temporal and spatial order in which they contacted the sensor. However, this method only worked for a subset of the measurements (dogs that were walking). I needed another method to classify the rest of the dataset.

I knew enough about classification problems to know I could use this subset as training data in a supervised classification problem. I thought it would be identical to the land-use coverage classification methods I was used to, where a simple distance is a good metric. Get a mean vector for each of the four types of paws and then classify things based on their distance to the nearest of the four mean vectors. The fact that the “vectors” are 20x20 images should be irrelevant. They’re just 400-dimensional vectors, right?
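The idea above is a classic nearest-mean (minimum-distance) classifier. Here's a minimal sketch of it, using random stand-in data with made-up labels rather than the actual paw dataset:

```python
import numpy as np

# Hypothetical stand-in data: 40 "paw" images, 20x20 pixels each,
# with made-up labels 0-3 (left front, right front, left hind, right hind).
rng = np.random.default_rng(0)
X = rng.random((40, 20 * 20))    # each row is a flattened 400-dim image
y = np.repeat(np.arange(4), 10)  # 10 examples per paw type

# One mean vector per class, computed from the "training" subset.
means = np.array([X[y == k].mean(axis=0) for k in range(4)])

def classify_nearest_mean(x):
    """Assign x to the class whose mean vector is closest (Euclidean)."""
    dists = np.linalg.norm(means - x, axis=1)
    return int(np.argmin(dists))

pred = classify_nearest_mean(X[0])
```

This is exactly the approach that works well for land-use classification — and, as the next section explains, exactly the approach that falls apart in 400 dimensions.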

I was completely flummoxed when it failed miserably.

I couldn’t understand why distance wouldn’t be a good comparison of similarity. In the past, I’d compared plenty of images by subtracting them and summing the differences. It worked pretty well. Why was this so different??

A bit of googling about image classification led me to the concept of “eigenfaces”. A bit more reading led me to the “curse of dimensionality”. I started to understand that I’d need to reduce the dimensionality of the problem to something more manageable.
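The "curse of dimensionality" is easy to demonstrate: as the number of dimensions grows, pairwise distances between random points concentrate around a single value, so "nearest" is barely nearer than "farthest" and distance stops being a useful measure of similarity. A quick illustration (random uniform points; the specific dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def distance_spread(dim, n_points=200):
    """Relative spread of distances from one random point to the others.

    Returns (max - min) / min over the distances. Large values mean
    "nearest" and "farthest" are genuinely different; values near zero
    mean all points look roughly equidistant.
    """
    pts = rng.random((n_points, dim))
    d = np.linalg.norm(pts[1:] - pts[0], axis=1)
    return (d.max() - d.min()) / d.min()

low_dim_spread = distance_spread(2)     # e.g. 2-D map coordinates
high_dim_spread = distance_spread(400)  # e.g. a flattened 20x20 image
```

In 2 dimensions the spread is large; in 400 dimensions it collapses, which is why the raw-pixel nearest-mean classifier failed.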

I tried to implement things using scikit-learn and a few other frameworks, but I couldn’t understand the terminology at all. Regardless, I wanted to understand what I was doing. The math behind eigenfaces looked pretty familiar and easier to understand (at the time) than a machine learning framework.

I decided to implement my own “eigenpaw” algorithm. I was amazed when it actually worked!
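The eigenfaces-style recipe boils down to: subtract the mean image, take the principal components of the image stack (the "eigenpaws"), project every image onto the top few components, and then do the nearest-mean classification in that small space instead of in 400 dimensions. A rough sketch, again with random stand-in data (the real pipeline differed in its details):

```python
import numpy as np

# Hypothetical stand-in for the paw data: 40 flattened 20x20 images
# with made-up labels 0-3 for the four paw types.
rng = np.random.default_rng(2)
X = rng.random((40, 400))
y = np.repeat(np.arange(4), 10)

# "Eigenpaws": principal components of the centered image stack, via SVD.
mean_image = X.mean(axis=0)
U, s, Vt = np.linalg.svd(X - mean_image, full_matrices=False)
n_components = 9
eigenpaws = Vt[:n_components]  # each row is a 400-dim eigenpaw

def project(x):
    """Coordinates of an image in the reduced eigenpaw space."""
    return eigenpaws @ (x - mean_image)

# Nearest-mean classification now happens in 9 dimensions, not 400.
scores = np.array([project(x) for x in X])
class_means = np.array([scores[y == k].mean(axis=0) for k in range(4)])

def classify(x):
    dists = np.linalg.norm(class_means - project(x), axis=1)
    return int(np.argmin(dists))
```

The nine rows of `eigenpaws` are what the "Largest 9 Eigenpaws" figure above shows: each is itself a 20x20 image, and every paw is approximated as the mean image plus a weighted sum of them.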

Long-term Result

What I learned when answering that question gave me the background and motivation to begin picking up machine learning methods. Not too long after, I became less intimidated by some of the frameworks like scikit-learn and began to use them regularly. Overall, it’s been an incredibly useful addition to my “toolkit”. I don’t think I would have picked up those methods without doing the research to answer that question. Thank you Ivo, and thank you SO!