Categories
Innovation

Programming Collective Intelligence

I’ve been reading a fantastic book by Toby Segaran called Programming Collective Intelligence: Building Smart Web 2.0 Applications. I’m about two-thirds of the way through, but it’s so good that I’m not going to wait until I finish it before blogging about it. Essentially, it’s a recipe book for the machine learning algorithms that you’re likely to find under the hood of many successful modern web sites: clustering, support vector machines, decision trees, simulated annealing, Bayesian classification and so on. The AI course at uni was a bit light on statistical machine learning techniques, but this book makes up for it. All the code in the book is written in Python and can be downloaded from the author’s website. The algorithms in the book may prove to be highly useful for my work in ubiquitous computing, too.

Coincidentally, according to the most recent entry in his blog, Toby will be giving a talk on a topic sort of related to one I’ve been thinking about as a possible project at NICTA: Creating Semantic mashups: Bridging Web 2.0 and the Semantic Web.

It turns out that Toby is also a fan of GTD, and he’s written his own web-based GTD tool. It doesn’t look like much, but it’s gained some favourable reviews.

Categories
Innovation

Startup: what you said

So it turns out that quite a few readers of this weblog use Bloglines. For some reason, Bloglines stopped sucking down the RSS feed for this weblog after March 19 until three days ago. Did other feed readers experience similar difficulties with my blog? I know Google Reader continued to work, as did the RSS screensaver on my Mac. Anyway, that partially explains why I had no feedback on my hypothetical question about startups.

Thanks to those who did end up responding. Here are some snippets of what you said, along with some feedback I got via e-mail, in no particular order:

  1. My 2c – I don’t believe there’s that much inherent difference between the major web platforms – in my opinion you’re best off going with what you have the most experience with (and what the people you can get have experience with). You could probably lose 2-3 months learning a new platform (primarily learning its idioms and gotchas) and it’s not clear that you’re ever going to get that back. Having said that, I would tend to recommend against the embedded scripting languages (PHP, ASP, etc) on any project of significant size – it’s not so much that they can’t scale, but they strongly encourage non-scalable design by their nature (and it can be harder to find developers who understand the difference). A ‘well-designed’ PHP application will often actually include its own sub-templating language (e.g. Smarty), and treat PHP as a pure programming language.
  2. Now you have the prototype you pretty much know the functionality – “recoding” is all about achieving the chrome and non-functional requirements. So, maybe Erlang – excellent for reliability/uptime (incl. for hot-upgrading and failover) and scalability – the two most important -ilities for webapps. Frontend – “yaws”, plus some javascript library, and/or maybe a web framework like “erlyweb”. Backend – hand-coded erlang apps. Database – “mnesia” or maybe “couchdb”.
  3. Thing is, there are big sites being run on all combinations of your options. If you just need *something*, write it in what you feel comfortable in, if it needs to scale later you can use your first iteration as a learning experience. Personally, I think you’d be crazy to write anything in java, you can’t be nimble in java. I also think php is a dead end, it’s just too bodgy. Python/Ruby is the only way to go imho. Funnily enough I saw no mention of javascript for your web 2.0 site :) I would suggest that none of the backend stuff matters at all, and that the only thing that matters is which javascript library you use in the front end.
  4. I assume that this is for a “friend” :-). But I shouldn’t assume. I’m afraid to say I can’t add that much. Just don’t have the knowledge. For this kind of thing, being able to easily tinker with and evolve the system seems an important criterion, if just one of the relevant criteria.
  5. it doesn’t matter what technology you use, because you’ll rewrite the whole thing several times anyway. pick what’s fastest to explore the idea and the market now. time to market, and reaction time once you’re there, cost far more than another 100 servers while you’re in the early stage. who are your partners (or your VC’s partners)? do they have an affinity with any particular technology? what are your friends best at? you’ll need a pool of expertise (and employees), so choose something to maximise that opportunity.
  6. I was going to answer but my initial answer seemed too stupid and it was going to take too long to come up with an intelligent well thought out answer for a hypothetical that did require me to stretch my imagination to the extreme. It’s sort of like what I am finding with some of my 1st years. In one of the subjects I am teaching they have to do a lot of hypothetical work and most of the time, the results are utter disasters because they simply, simply can’t stretch their brains into comprehending scenarios so outside their sphere of “being”. Since founding a startup is far, far outside my sphere of ‘being’ I decided I would much rather play pokemon. Now if you ever want to know which pokemon is best to use against a ground-type pokemon….

Mostly very useful feedback, and amusing otherwise. It’s interesting how similar most of the feedback was. Agility, ability to tinker and swiftness of development were common themes. For at least the alpha and beta, I think it would be best to go with tools/platforms where you can put something together fairly quickly, and make changes quickly if your users tell you they’re looking for something a bit different. I’m not sure Java fits that description (although I’ve always been a Java nut). There’s at least a compile step and possibly a deploy step, depending upon your development environment, between making a small change in the code and seeing the result in your browser. Ruby looks cool, but it is still lagging way behind the other serious contenders in terms of performance. PHP could be a contender, but if the system ever got really big and you had new graduates working on the application, I’d bet you’d soon end up with a mess, with business logic stuck in the presentation code and so forth – I really do agree with the first comment above on that point. So I’ve got to say that Python is looking good right now, despite its Makefile-like treatment of white space. Coupled with Django, it might be a winner. There’s also the fact that Google have provided a nice playpen for Python-based web applications.

Once again, thanks all for your input. Please keep the advice and opinions coming if you have more to add. It’s much appreciated.

Update (06:49 19/04/2008): I’m not sure my comment about Ruby performance is entirely fair. The performance difference between Ruby and Python is nothing (Python is a few times faster on most tests) compared to the difference between Python/Ruby and Java, for example (where Java is one and sometimes several orders of magnitude faster). By this reasoning, if one is happy to sacrifice some runtime performance and use Python instead of Java, one presumably wouldn’t be too worried that Ruby is slightly worse than Python. And Ruby doesn’t do too badly in terms of memory usage. Besides, if one was really worried about performance, one would use C.

Categories
Innovation

An underwhelming response

So, after waiting a few weeks, I still have no responses on this blog entry. Okay, I got one reply by e-mail, not including the advice I received from friends before posting the blog article. Was I silly to think people might actually respond? (Chorus: “Yes, Ricky, you’re very silly!”)

Categories
Innovation

Startup: a hypothetical scenario

Picture yourself in the following situation. You’ve come up with what you think is a cool idea for a so-called web 2.0 site. Furthermore, you’ve managed to convince some VC types to invest some (pre-)seed funding – enough to develop a public beta. You developed a quick and dirty proof-of-concept to show the VCs, but now it has to be thrown away. You have to start development on the real thing from scratch.

The question is, what technologies, programming languages, tools and platforms are you going to use to implement your idea? Language-wise, do you go for Python, Java, PHP, Ruby, or something else? If you take the PHP route, how do you ensure maintainability in the long term? If you decide on Java, do you use JSP, Velocity or Freemarker? Would you use Struts or Spring? Do you need any of these frameworks at all? Do you run on Linux, FreeBSD, Windows or Mac OS X Server? Why?

To make this question at least partly answerable, imagine for the moment we’re just considering the presentation tier, and not any of the back end magic. Also imagine that what you’re developing is similar to one of today’s social networking sites (Facebook, Bebo, MySpace or something), and that visualisations (e.g., of directed graphs) might need to be generated dynamically from data in the back end. You can assume that the beta version will have a small number of types of dynamically generated pages (fewer than 10, say) but later versions will end up with many more.

Answers along the lines of “It’s much of a muchness, so I would choose X, Y and Z because they’re what I know”, “I’d choose X, Y and Z because the newly graduated computer science students I’d have to hire are most likely to be comfortable with those” and “X, Y and Z are nice but too expensive for my startup, so I’d choose A, B and C instead” are completely acceptable.

I’ve already got some great input from my closest friends (at least the programmers among them), but I’d like to get some responses from a wider audience. I’m hoping some ex-DSTC engineers/researchers might have an opinion on this; you don’t need to have worked at a startup to give useful feedback!

I’m asking this question out of pure curiosity, nothing more, and I have my own feelings on this (represented by the sample answers above). Please leave your answer as a comment below.

Categories
Random observations

Python

Newsflash: Python would be okay if whitespace wasn’t meaningful beyond separating tokens. List comprehensions are kind of nice.
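For anyone who hasn't met either feature, here is a minimal sketch of both points. The word list and variable names are made up purely for illustration.

```python
# Hypothetical sample data, chosen only to illustrate the two styles.
words = ["clustering", "svm", "annealing", "bayes"]

# Loop version: indentation alone marks the bodies of the for and the if.
long_words_loop = []
for w in words:
    if len(w) > 5:
        long_words_loop.append(w.upper())

# List comprehension: the same filter-and-transform as a single expression.
long_words = [w.upper() for w in words if len(w) > 5]
```

Both versions build the same list; the comprehension just avoids the nested, whitespace-delimited blocks.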

Categories
Innovation

Death by bigness

Big companies will slowly suck the life out of you. That’s one way of summarising Paul Graham‘s latest essay. To maximise your freedom, he says, join a start-up or start one yourself. It’s a theory that I find very appealing.

Categories
Random observations

Unit testing

Okay, I have something to confess: my record on using testing frameworks to debug software is not good. In fact, my record might show that pretty much all the testing I’ve done in the past has been conducted on an ad hoc basis, using a combination of debugger and strategically placed “print” statements. The only time I can remember having used a proper testing framework with repeatable tests was at Sun Labs as an intern, and that was because it was already set up for me. Perhaps it is common for a researcher to have shoddy testing procedures in place – I don’t know. All I know is that mine have been bad.

For the first time, I’m using the JUnit framework to conduct repeatable tests, and I’m doing this from within the Eclipse IDE. On the first day of use, it’s already paid dividends, quickly homing in on problems in my code. Running JUnit in combo with the debugger has proved especially useful. The only reason I decided to look into testing frameworks was because I’ll probably be handing this code over to someone else to work on soon, and that provided an incentive to be a bit professional about the way I’m doing my coding work. I should mention that it took absolutely no time at all to set up my environment, though it can take a little bit of time to get each unit test just right.

Of course, none of this will come as any surprise to many of the readers of this weblog (i.e., that researchers might have questionable software engineering practices and that repeatable tests are good).

Categories
Innovation

MapReduce

Last week I read a 2004 paper called MapReduce: Simplified Data Processing on Large Clusters. It was written by a couple of Google researchers, and details a simple programming model and library for processing large datasets in parallel. MapReduce is used by Google under the hood for lots of different things, from indexing to machine learning to graph computation. Very handy indeed.
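To give a rough feel for the programming model, here is a single-process sketch of the classic word-count example in Python. It is not Google's library: the function names (`map_words`, `reduce_counts`, `map_reduce`) and the sample documents are my own assumptions, and a real implementation would distribute the map and reduce calls across many machines and handle failures.

```python
from collections import defaultdict


def map_words(doc_id, text):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: combine all the values emitted for one key."""
    return word, sum(counts)


def map_reduce(documents, mapper, reducer):
    """Run the mapper over every input, group values by key, reduce each group."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in mapper(doc_id, text):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# Hypothetical input: document name -> contents.
documents = {
    "a.txt": "the quick brown fox",
    "b.txt": "the lazy dog and the fox",
}
word_counts = map_reduce(documents, map_words, reduce_counts)
```

The appeal of the model is that the user only writes the two small functions; the grouping in the middle, which here is a dictionary, is where a real system does the heavy lifting of shuffling data between machines.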

So imagine my surprise to find in last Friday’s edition of ACM TechNews that this paper has been republished in Communications of the ACM this month, albeit in a slightly shorter form. Aside from a few cosmetic changes (updated figure and table), the content of the papers is the same. That is, you don’t gain any knowledge from reading one of the papers that you wouldn’t gain from reading the other. There is no indication in the more recent publication that so much content has been duplicated from an earlier paper, though there is a citation to the older paper. In short, this is not new material, having been first published more than three years ago. Communications of the ACM seems to be trialling a new model, whereby the best articles from conferences are modified and republished for the ACM audience. But seriously, the modifications in the republished MapReduce article are negligible. What gives?

Categories
Innovation

Android – the open platform for mobile apps

So Android has been released. As I suspected, Google has not actually released a phone of their own. Could be an interesting platform for researchers in the mobile/ubiquitous computing space who want to develop prototypes quickly. One of the creators of the platform hopes that someone develops an application that can help interpret his wife’s thoughts…

Categories
Innovation

My two bob’s worth on the Don Norman simplicity debate

Don Norman, respected usability guru, wrote an article on the demise of simplicity as a selling point, and it’s caused reverberations all around the world. In fact, his article has been so controversial that he’s found it necessary to write a clarifying addendum for the essay (added to the bottom of the article), fearing that many of his readers interpreted his article as concluding that simplicity should no longer be a design goal. Norman’s point is that a product with a greater number of features is more appealing than a similar product with fewer features. The “more complicated” product is therefore more likely to sell. In other words, feature creep is driven by the knowledge that consumers will be suckered into paying for a product that looks more complicated, even though, in many cases, they might complain about the difficulty of using the product when they get home.

I think there’s a difference between giving a user too many choices and too many features. Confusion and frustration arise when the user is presented with an array of subtly different choices. Joel Spolsky provides an excellent example: the Windows Vista shutdown menu. Windows Vista provides the user with umpteen slightly different ways of shutting down the computer. Why? On the other hand, providing lots of features that do different things need not result in frustrating the user, because, well, they are for accomplishing distinct tasks, and the user can clearly separate them in their mind. Take those Japanese toilets, for instance. These toilets have an integrated bidet, dryer, seat warmer, massage options, automatic flushing and so on and so forth. The existence of these features does not mean that the toilet isn’t simple to use, per se. If, however, each of those features had a confusing list of subtly different settings, then that could be a problem!

Norman’s essay could have been made easier to read and resulted in less confusion if it had been written more clearly and more carefully. The following is just the most confusing of a number of errors that can be found in his article:

Notice the question: “pay more money for a washing machine with less controls.” An early reviewer of this paper flagged the sentence as an error: “Didn’t you mean ‘more money’?” the reviewer asked?

But it already says “more money”. Somehow Norman and his reviewer have conspired to introduce an error that is similar to the one they were seeking to avoid. If that’s not irony, I don’t know what is.