A Visual Search Engine for iPhone

Today kooaba released their iPhone client. It’s a visual search engine – you take a picture of something, and get search results. The YouTube clip below shows it in action.  Since this is the kind of thing I work on all day long, I’ve got a strong professional interest. Haven’t had a chance to actually try it yet, but I’ll post an update once I can nab a friend with an iPhone this afternoon to give it a test run.

[Embedded YouTube video]

At the moment it only recognises movie posters. Basically, its current form is more of a technology demo than something really useful. Plans are to expand to recognise other things like books, DVDs, etc. I think there’s huge potential for this stuff. Snap a movie poster, see the trailer or get the soundtrack. Snap a book cover, see the reviews on Amazon. Snap an ad in a magazine, buy the product. Snap a restaurant, get reviews. Most of the real world becomes clickable. Everything is a link.

The technology is very scalable – the internals use an inverted index, just like a normal text search engine. In my own research I’m working with hundreds of thousands of images right now. It’s probably going to be possible to index a sizeable fraction of all the objects in the world – literally take a picture of anything and get search results. The technology is certainly fast enough, though how the recognition rate will hold up with such large databases is currently unknown.
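
To make the inverted-index point concrete, here’s a minimal sketch of how such an index might look, assuming each image has already been reduced to a bag of quantised local features (“visual words”). The class name, image IDs and codebook IDs are purely illustrative, not anything kooaba actually uses.

```python
# Minimal sketch of an inverted index over quantised local features
# ("visual words"), the same structure text search engines use.
# The feature extraction and codebook quantisation are assumed to happen elsewhere.
from collections import defaultdict, Counter

class VisualIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # visual word id -> set of image ids
        self.images = {}                   # image id -> Counter of visual words

    def add_image(self, image_id, visual_words):
        """visual_words: list of codebook ids for one image's local features."""
        self.images[image_id] = Counter(visual_words)
        for w in set(visual_words):
            self.postings[w].add(image_id)

    def query(self, visual_words, top_k=5):
        """Vote for database images that share visual words with the query."""
        votes = Counter()
        for w in set(visual_words):
            for image_id in self.postings[w]:
                votes[image_id] += 1
        return votes.most_common(top_k)

# Usage: the words would come from quantising SIFT-like descriptors
index = VisualIndex()
index.add_image("poster_0042", [3, 17, 17, 256, 900])
index.add_image("poster_0007", [3, 5, 256, 812])
print(index.query([3, 256, 900]))   # poster_0042 should rank first
```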

My only question is – where’s the buzz, and why has it taken them so long?

Update: I gave the app a spin today on a friend’s iPhone, and it basically works as advertised. It was rather slow though – maybe 5 seconds per search. I’m not sure whether this was a network issue (though the iPhone had a WiFi connection), or whether kooaba got more traffic today than they were expecting. The core algorithm is fast – easily less than 0.2 seconds (and even faster with the latest GPU-based feature detection). I am sure the speed issue will be fixed soon. Recognition seemed fine – my friend’s first choice of movie was located no problem. A little internet sleuthing shows that they currently have 5363 movie posters in their database. Recognition shouldn’t be an issue until the database gets much larger.

Mobile Manipulation Made Easy

GetRobo has an interesting interview with Brian Gerkey of Willow Garage. Willow Garage are a strange outfit – a not-for-profit company developing open source robotic hardware and software, closely linked to Stanford. They’re funded privately by a dot com millionaire. They started with several projects including autonomous cars and autonomous boats, but now concentrate on building a new robot called PR2.

The key thing PR2 is designed to support is mobile manipulation. Basically, research robots right now come in two varieties: sensors on wheels, which move about but can’t interact with anything, and fixed robotic arms, which manipulate objects but are rooted in place. A few research groups have built mobile manipulation systems where the robot can both move about and interact with objects, but the barrier to entry here is really high. There’s a massive amount of infrastructure you need to get a decent mobile manipulation platform going – navigation, obstacle avoidance, grasping, cross-calibration, etc. As a result, there are very few researchers in this area. This is a terrible shame, because there are all sorts of interesting possibilities opened up by having a robot that can both move and interact. Willow Garage’s PR2 is designed to fill the gap – an off-the-shelf system that provides basic mobile manipulation capabilities.

Brian: We have a set of demos that we hope that the robot can do out of the box. So things like basic navigation around the environment so that it doesn’t run into things and basic motion planning with the arms, basic identifying which is looking at an object and picking it out from sitting on the table and picking it up and moving it somewhere. So the idea is that it should have some basic mobile manipulation capabilities so that the researcher who’s interested in object recognition doesn’t have to touch the arm part in order to make the object recognizer better. The arm part is not to say that it can be improved but good enough.

If they can pull this off it’ll be great for robotics research. The pieces don’t all have to be perfect, just good enough that, say, a computer vision group could start exploring interactive visual learning without having to worry too much about arm kinematics, or a manipulation group could experiment on a mobile platform without having to write a SLAM engine.

Another interesting part of the interview was the discussion of software standards. Brian is one of the lead authors of Player/Stage, the most popular robot OS. Player is popular, but very far from universal – there are nearly as many robot OSes as there are robot research groups (e.g. CARMEN, Orca, MRPT, MOOS, Orocos, CLARAty, MS Robotics Studio, etc, etc). It seems PR2 will have yet another OS, for which there are no apologies:

I think it’s probably still too early in robotics to come up with a standard. I don’t think we have enough deployed systems that do real work to have a meaningful standard. Most of the complex robots we have are in research labs. A research lab is the first place we throw away a standard. They’re building the next thing. So in robotics labs, a standard will be of not much use. They are much more useful when you get to the commercialization side to build interoperable piece. And at that point we may want to talk about standards and I think it’s still a little early. Right now I’m much more interested in getting a large user community and large developer community. I’m less interested in whether it’s blessed as a standard by a standard’s body.

Anyone working in robotics will recognise the truth of this. Very much a sensible attitude for the moment.

Big Data to the Rescue?

Peter Norvig of Google likes to say that for machine learning, you should “worry about the data before you worry about the algorithm”.

Rather than argue about whether this algorithm is better than that algorithm, all you have to do is get ten times more training data. And now all of a sudden, the worst algorithm … is performing better than the best algorithm on less training data.

It’s a rallying cry taken up by many, and there’s a lot of truth to it.  Peter’s talk here has some nice examples (beginning at 4:30). The maxim about more data holds over several orders of magnitude. For some examples of the power of big-data-simple-algorithm for computer vision, check out the work of Alyosha Efros’ group at CMU.  This is all pretty convincing evidence that scale helps. The data tide lifts all boats.

What I find more interesting, though, is the fact that we already seem to have reached the limits of where data scale alone can take us. For example, as discussed in the talk, Google’s statistical machine translation system incorporates a language model consisting of length 7 N-grams trained from a 10^12 word dataset. This is an astonishingly large amount of data. To put that in perspective, a human will hear less than 10^9 words in an entire lifetime. It’s pretty clear that there must be huge gains to be made on the algorithmic side of the equation, and indeed some graphs in the talk show that, for machine translation at least, the performance gain from adding more data has already started to level off. The news from the frontiers of the Netflix Prize is the same – the top teams report that the Netflix dataset is so big that adding more data from sources like IMDB makes no difference at all! (Though this is more an indictment of ontologies than big data.)
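
For a rough feel of what a 7-gram language model over 10^12 words actually involves, here’s a toy sketch that just counts fixed-length word windows in a corpus and scores new text by how familiar its windows are. Real systems add smoothing, backoff and distributed counting, none of which is shown here.

```python
# Toy illustration of an N-gram language model: count fixed-length word
# windows in a training corpus, then score new text by how many of its
# windows were seen during training. Purely illustrative scale.
from collections import Counter

def ngrams(tokens, n):
    # Sliding windows of length n over the token sequence.
    return zip(*(tokens[i:] for i in range(n)))

def train(corpus_tokens, n=3):
    return Counter(ngrams(corpus_tokens, n))

def score(sentence_tokens, counts, n=3):
    # Higher = more of the sentence's n-grams appeared in the training data.
    return sum(counts[g] for g in ngrams(sentence_tokens, n))

corpus = "the cat sat on the mat the dog sat on the rug".split()
counts = train(corpus, n=3)
print(score("the cat sat on the rug".split(), counts, n=3))
```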

So, the future, like the past, will be about the algorithms. The sudden explosion of available data has given us a significant bump in performance, but has already begun to reach its limits. There’s still lots of easy progress to be made as the ability to handle massive data spreads beyond mega-players like Google to more average research groups, but fundamentally we know where the limits of the approach lie. The hard problems won’t be solved just by lots of data and nearest neighbour search. For researchers this is great news – still lots of fun to be had!

Google Street View – Soon in 3D?

Some Google Street View cars were spotted in Italy this morning. Anyone who works in robotics will immediately notice the SICK laser scanners. It looks like we can expect 3D city data from Google sometime soon. Very interesting!

Street View car spotted in Rome

More pictures of the car here, here and here.

The cars have two side-facing vertical scanners, and another forward-facing horizontal scanner. Presumably they will do scan matching with the horizontal laser, and use that to align the data from the side-facing lasers to get some 3D point clouds. Typical output will look like this (the video shows data collected from a similar system built by one of my labmates).
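
As a rough sketch of the geometry, suppose scan matching on the horizontal laser has already produced a 2D vehicle pose for each timestamp, and the side-facing scanner returns (range, elevation) pairs in a vertical plane. The sensor offset, frame conventions and numbers below are made up for illustration, not anything specific to Google’s setup.

```python
# Rough sketch of how vertical scan lines become a 3D point cloud once the
# vehicle pose is known from horizontal-laser scan matching (assumed given).
import math

def vertical_scan_to_points(pose, scan, sensor_offset=(0.0, 0.5, 2.0)):
    """pose: (x, y, heading) of the vehicle, from scan matching.
    scan: list of (range_m, elevation_rad) from a side-facing vertical scanner.
    Returns 3D points in the world frame."""
    x, y, theta = pose
    ox, oy, oz = sensor_offset          # scanner mounting position on the vehicle
    points = []
    for r, elev in scan:
        side = r * math.cos(elev)       # sideways distance in the scan plane
        up = r * math.sin(elev)         # height in the scan plane
        # Rotate the (forward, sideways) offset by the vehicle heading.
        px = x + ox * math.cos(theta) - (oy + side) * math.sin(theta)
        py = y + ox * math.sin(theta) + (oy + side) * math.cos(theta)
        pz = oz + up
        points.append((px, py, pz))
    return points

# Usage: accumulate points as the car drives along
cloud = []
for pose, scan in [((0.0, 0.0, 0.0), [(5.0, 0.1), (5.2, 0.3)]),
                   ((1.0, 0.0, 0.0), [(5.1, 0.1), (5.3, 0.3)])]:
    cloud.extend(vertical_scan_to_points(pose, scan))
print(len(cloud), "points")
```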

The other sensors on the pole seem to have been changed too. Gone are the Ladybug2 omnidirectional cameras used on the American and Australian vehicles, replaced by what looks like a custom camera array. This photo also shows a third sensor, which I can’t identify.

So, what is Google doing with 3D laser data? The obvious application is 3D reconstruction for Google Earth. Their current efforts to do this involve user-generated 3D models from Sketchup. They have quite a lot of contributed models, but there is only so far you can get with an approach like that. With an automated solution, they could go for blanket 3D coverage. For an idea of what the final output might look like, have a look at the work of Frueh and Zakhor at Berkeley. They combined aerial and ground-based laser with photo data to create full 3D city models. I am not sure Google will go to quite this length, but it certainly looks like they’ve made a start on collecting the street-level data. Valleywag claims Google are hiring 300 drivers for their European data gathering efforts, so they will soon be swimming in laser data.

Frueh and Zakhor 3D city model


Google aren’t alone in their 3D mapping efforts. Startup Earthmine has been working on this for a while, using a stereo-vision based approach (check out their slick video demonstrating the system). I also recently built a street-view car myself, to gather data for my PhD research. One way or another, it looks like online maps are headed to a new level in the near future.

Update:  Loads more sightings of these cars, all over the world. San Francisco, Oxford, all over Spain. Looks like this is a full-scale data gathering effort, rather than a small test project.

Clever Feet

Check out this great TED talk by UC Berkeley biologist Robert Full. His subject is feet – or rather, all the clever ways animals have evolved to turn leg power into forward motion.
It’s a short, fun talk, and rather nicely makes the point that the secret to success for many of nature’s creations resides not in sensing or intelligence, but in good mechanical design. The nice thing about this is that nature’s mechanical innovations are much easier to duplicate than her neurological ones. The talk ends with examples of robotic applications, such as Boston Dynamics’ cockroach-inspired RHex and Stanford’s gecko-inspired climbing robots.

Hat tip: Milan

OpenGL Invades the Real World

Augmented reality systems are beginning to look pretty good these days. The videos below show some recent results from an ISMAR paper by Georg Klein. The graphics shown are inserted directly into the live video stream, so that you can play with them as you wave the camera around. To do this, the system needs to know where the camera is, so that it can render the graphics with the right size and position. Figuring out the camera motion by tracking features in the video turns out to be not that easy, and people have been working on it for years. As you can see below, the current crop of solutions are pretty solid, and run at framerate too. More details on Georg’s website.
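
To see why the camera pose matters, here’s a minimal sketch of the rendering side: once a tracker (whatever its internals) has estimated the camera rotation and translation for the current frame, a virtual 3D point is drawn with an ordinary pinhole projection. The intrinsics and pose values below are invented for illustration and have nothing to do with Klein’s system specifically.

```python
# Why camera tracking matters for AR: given the camera pose (R, t) estimated
# by the tracker, a virtual 3D point is mapped to pixels with a pinhole model.
import numpy as np

K = np.array([[500.0,   0.0, 320.0],    # focal lengths and principal point
              [  0.0, 500.0, 240.0],    # (made-up intrinsics)
              [  0.0,   0.0,   1.0]])

def project(point_world, R, t):
    """Map a 3D world point into pixel coordinates for the current frame."""
    p_cam = R @ point_world + t          # world frame -> camera frame
    u, v, w = K @ p_cam                  # camera frame -> image plane
    return u / w, v / w

R = np.eye(3)                            # pose from the tracker (illustrative)
t = np.array([0.0, 0.0, 2.0])            # camera 2 m in front of the scene
print(project(np.array([0.1, 0.0, 0.0]), R, t))
```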

[Embedded YouTube video]

[Embedded YouTube video]

Back in 2005, Andy Davison’s original augmented reality system got me excited enough that I decided to do a PhD. The robustness of these systems has improved a lot since then, to the point where they’re a fairly short step from making good AR games possible. In fact, there are a few other cool computer-vision based game demos floating around the lab at the moment. It’s easy to see this starting a new gaming niche. Basic vision-based games have been around for a while, but the new systems really are a shift in gear.

There are still some problems to be ironed out – current systems don’t deal with occlusion at all, for example. You can see some other issues in the video involving moving objects and repetitive texture. Still, it looks like they’re beginning to work well enough to start migrating out of the lab. First applications will definitely be of the camera-and-screen variety. Head-mounted display style systems are still some way off; the reason being that decent displays just don’t seem to exist right now.

(For people who wonder what this has to do with robotics – the methods used for tracking the environment here are basically identical to those used for robot navigation over larger scales.)

Citation: “Parallel Tracking and Mapping for Small AR Workspaces”, Georg Klein and David Murray, ISMAR 2007.

Big Dog on Ice

Boston Dynamics just released a new video of Big Dog, their very impressive walking robot. This time it tackles snow, ice and jumping, as well as its old party trick of recovering after being kicked. Apparently it can carry 150 kg too. This is an extremely impressive demo – it seems light-years ahead of other walking robots I’ve seen.

[Embedded YouTube video]

I must admit to having almost no idea how the robot works. Apparently it uses joint sensors, foot pressure, gyroscope and stereo vision. Judging from the speed of the reactions, I doubt vision plays much of a role. It looks like the control is purely reactive – the robot internally generates a simple gait (ignoring the environment), and then responds to disturbances to try and keep itself stable. While they’ve obviously got a pretty awesome controller, even passive mechanical systems can be surprisingly stable with good design – have a look at this self-stabilizing bicycle.
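
Here’s a toy sketch of that reactive idea – an open-loop gait signal plus a PD correction on sensed body tilt. To be clear, this is only an illustration of the general principle, not Big Dog’s actual controller, and all the gains and numbers are invented.

```python
# Toy version of "reactive" balance: a fixed gait pattern generated blindly,
# plus a PD correction driven by sensed body tilt.
import math

def gait_phase(t, period=0.5):
    """Open-loop leg command, generated while ignoring the environment."""
    return math.sin(2.0 * math.pi * t / period)

def pd_correction(tilt, tilt_rate, kp=8.0, kd=1.5):
    """Push back against measured disturbances (a kick, a slip on ice)."""
    return -kp * tilt - kd * tilt_rate

def leg_command(t, tilt, tilt_rate):
    return gait_phase(t) + pd_correction(tilt, tilt_rate)

# A brief simulated disturbance at t = 0.2 s shows the correction kicking in.
for step in range(5):
    t = step * 0.1
    tilt = 0.3 if abs(t - 0.2) < 0.05 else 0.0
    print(round(t, 1), round(leg_command(t, tilt, 0.0), 3))
```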

The one part of the video where it looks like the control isn’t purely reactive is the sped-up sequence towards the end where it climbs over building rubble. There it does seem to be choosing its foot placement. I would guess they’re just beginning to integrate some vision information. Unsurprisingly, walking with planning is currently much slower than “walking by moving your legs”.

Either way, I guess DARPA will be suitably impressed.

Update: More details on how the robot works here.

Deep Learning

After you’ve worked in robotics for a while, it becomes apparent that despite all the recent progress, the underlying machine learning tools we have at our disposal are still quite primitive. Our standard techniques, like Support Vector Machines and boosting methods, are both more than ten years old, and while you can do some neat things with them, in practice they are limited in the kinds of things they can learn efficiently. There’s been lots of progress since the techniques were first published, particularly through careful design of features, but to get beyond the current plateau it feels like we’re going to need something really new.

For a glimmer of what “something new” might look like, I highly recommend this wonderful Google Tech Talk by Geoff Hinton: “The Next Generation of Neural Networks“, where he discusses restricted Boltzmann machines. There are some stunning results, and an entertaining history of learning algorithms, during which he amusingly dismisses SVMs as “a very clever type of Perceptron“. There’s a more technical version of the talk in this NIPS tutorial, along with a workshop on the topic. Clearly the approach scales beyond toy problems – they have an entry sitting high on the Netflix Prize leaderboard.
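
For the curious, here’s a minimal sketch of the basic building block Hinton describes – a restricted Boltzmann machine trained with one step of contrastive divergence (CD-1) on binary data. The sizes are toy, the data is random, and none of the practical tricks (minibatches, momentum, weight decay) used in real implementations appear here.

```python
# Minimal restricted Boltzmann machine trained with CD-1 on binary vectors.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden, lr = 6, 3, 0.1
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_vis = np.zeros(n_visible)
b_hid = np.zeros(n_hidden)

data = rng.integers(0, 2, size=(20, n_visible)).astype(float)  # toy dataset

for epoch in range(100):
    for v0 in data:
        # Positive phase: sample hidden units given the data.
        p_h0 = sigmoid(v0 @ W + b_hid)
        h0 = (rng.random(n_hidden) < p_h0).astype(float)
        # Negative phase: one reconstruction step (the "CD-1" shortcut).
        p_v1 = sigmoid(h0 @ W.T + b_vis)
        p_h1 = sigmoid(p_v1 @ W + b_hid)
        # Move towards the data statistics, away from reconstruction statistics.
        W += lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))
        b_vis += lr * (v0 - p_v1)
        b_hid += lr * (p_h0 - p_h1)
```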

These results with deep architectures are very exciting. Neural network research has effectively been abandoned by most of the machine learning community for years, partly because SVMs work so well, and partly because there was no good way to train multi-layer networks. SVMs were very pleasant to work with – no parameter tuning or black magic involved, you just throw data at them and press start. However, it seems clear that to make real progress we’re going to have to return to multi-layer learning architectures at some point. It’s good to see progress in that direction.

Hat tip: Greg Linden

More from ISRR

ISRR finished today. It’s been a good conference, low on detailed technical content, but high on interaction and good for an overview of parts of robotics I rarely get to see.

One of the highlights of the last two days was a demo from Japanese robotics legend Shigeo Hirose, who put on a show with his ACM R5 swimming snake robot in the hotel’s pool. Like many Japanese robots, it’s remote controlled rather than autonomous, but it’s a marvellous piece of mechanical design. Also on show was a hybrid roller-walker robot and some videos of a massive seven-ton climbing robot for highway construction.

[Embedded YouTube video]

Another very interesting talk with some neat visual results was given by Shree Nayar, on understanding illumination in photographs. If you take a picture of a scene, the light that reaches the camera can be thought of as having two components – direct and global. The “direct light” leaves the light source and arrives at the camera via a single reflection off the object. The “global light” takes more complicated paths, for example via multiple reflections, subsurface scatter, volumetric scatter, etc. What Nayar showed was that by controlling the illumination, it’s possible to separate the direct and global components of the lighting. Actually, this turns out to be almost embarrassingly simple to do – and it produces some very interesting results. Some examples are shown below, and many more here. It’s striking how much the direct-only photographs look like renderings from simple computer graphics systems like OpenGL. Most of the reason early computer graphics looked unrealistic was the difficulty of modelling the global illumination component. The full paper is here.
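
The trick, as I understand it from the paper, is to light the scene with a shifted high-frequency pattern in which roughly half the source pixels are on at any time. Per pixel, the maximum over the captured frames is then approximately the direct light plus half the global light, and the minimum is approximately half the global light alone. The sketch below assumes exactly that 50% lit fraction and ignores the small correction term for pattern blur that the paper discusses.

```python
# Sketch of direct/global separation under shifted high-frequency illumination,
# assuming about half the source pixels are lit in each pattern:
#   direct ~= per-pixel max - per-pixel min,   global ~= 2 * per-pixel min
import numpy as np

def separate(frames):
    """frames: array of shape (num_patterns, height, width), one captured
    image per shifted illumination pattern."""
    lmax = frames.max(axis=0)
    lmin = frames.min(axis=0)
    direct = lmax - lmin
    global_ = 2.0 * lmin
    return direct, global_

# Usage with dummy data standing in for real captures
frames = np.random.rand(16, 480, 640)
direct, global_ = separate(frames)
```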

Example separation: scene, direct component, global component

Lots of other great technical talks too, but obviously I’m biased towards posting about the ones with pretty pictures!

Citation: “Visual Chatter in the Real World”, S. Nayar et al., ISRR 2007.