This project page is outdated. For the most recent computer vision projects at Intel Labs Seattle, please visit our new website.
The ongoing computer vision research projects at Intel Labs Seattle are carried out in close collaboration with the Intel Everyday Sensing and Perception team and focus on solving vision problems in everyday life settings. Specifically, we have been studying several fundamental problems in egocentric vision, which employs a wearable camera to perceive objects, activities and scenes from a first-person perspective. We co-organized the First Workshop on Egocentric Vision with Carnegie Mellon University at CVPR 2009, where we presented empirical studies on a large-scale day-to-day object use dataset collected from an egocentric view.
Object and Activity Recognition
Studies have shown that recognizing handled objects, that is, the objects a person is manipulating and interacting with at the moment, can provide key contextual information about the person's activities and can enable many life assistance applications in areas such as health care and social networking. An egocentric camera enjoys many natural advantages for recognizing handled objects compared to an array of mounted environmental cameras. The problem is also challenging: the system has to deal with a range of difficulties including poor image quality, hand occlusion and varying illumination (see this preview video from our object use dataset).
We have been developing a system for the egocentric recognition problem that uses a combination of techniques, including boundary shape matching using 2D templates, texture matching using SIFT descriptors, color matching for object instances, motion-based figure-ground segmentation (details below) and video-based temporal integration. While object recognition in everyday life is a hard problem that eludes any of the individual techniques, the combination covers all the major challenges, and our current system achieves 91% recognition accuracy on the 42-object dataset with more than 50,000 video frames (chance is 2.4%).
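As a rough illustration of why combining cues can help, the sketch below fuses per-cue class scores with a weighted sum and also reproduces the chance figure (1 in 42 objects, about 2.4%). The fusion rule, the function names and the example scores and weights are all assumptions made for illustration; this page does not describe how the actual system combines its techniques.

```python
def chance_accuracy(num_classes):
    """Accuracy of uniform random guessing among equally likely classes."""
    return 1.0 / num_classes


def fuse_cue_scores(cue_scores, cue_weights):
    """Combine per-cue class scores with a weighted sum; return the best label.

    cue_scores:  {cue_name: {label: score}}, where a higher score is a better match.
    cue_weights: {cue_name: weight}. Hypothetical values, not taken from the system.
    """
    combined = {}
    for cue, scores in cue_scores.items():
        for label, score in scores.items():
            combined[label] = combined.get(label, 0.0) + cue_weights[cue] * score
    return max(combined, key=combined.get)


# Made-up scores: shape alone prefers "phone", but texture and color favor "cup".
example_scores = {
    "shape": {"cup": 0.6, "phone": 0.7},
    "texture": {"cup": 0.8, "phone": 0.3},
    "color": {"cup": 0.5, "phone": 0.4},
}
example_weights = {"shape": 1.0, "texture": 1.0, "color": 1.0}

print(f"chance for 42 objects: {chance_accuracy(42):.1%}")
print("fused label:", fuse_cue_scores(example_scores, example_weights))
```

In this toy case a single cue picks the wrong object while the weighted sum recovers the right one, which is the general intuition behind using several complementary techniques.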
The following are a few examples of the recognition system at work:
Complete recognition results can be found in this video, which shows four cases: our combined approach and standard SIFT-based recognition (12% accuracy), each with and without temporal integration (green: correct classification; red: incorrect).
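Temporal integration can take many forms. One simple possibility, shown here purely as an assumption and not as the method used in our system, is a sliding-window majority vote over per-frame labels. The function name and the window size are hypothetical.

```python
from collections import Counter, deque


def smooth_labels(frame_labels, window=5):
    """Replace each frame's label by the most frequent label in a trailing window.

    A minimal stand-in for temporal integration: isolated misclassifications
    are outvoted by neighboring frames. `window` is an illustrative parameter.
    """
    recent = deque(maxlen=window)  # holds at most `window` latest labels
    smoothed = []
    for label in frame_labels:
        recent.append(label)
        # most_common(1) returns [(label, count)] for the top label
        smoothed.append(Counter(recent).most_common(1)[0][0])
    return smoothed


per_frame = ["cup", "cup", "phone", "cup", "cup"]
print(smooth_labels(per_frame, window=3))
```

A trailing window keeps the filter causal, so it could run on live video, at the cost of a short delay when the handled object actually changes.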
Motion Segmentation and Tracking
One of the key building blocks of our recognition system is a figure-ground segmentation algorithm that separates moving hands and the object-in-hand from the largely "static" but constantly jiggling background. Our empirical studies have shown that the distraction from background clutter is one of the main sources of difficulty for object recognition in this setting. In this work we aim to develop a motion-based algorithm that reliably separates the foreground (hands and object) from the background across the full range of circumstances that arise in object manipulation.
Motion segmentation in general is a very hard problem, and egocentric video presents additional challenges such as fast hand motion, unpredictable camera motion and low frame rate. On the other hand, we do know a lot about this application domain, such as that hands and objects tend to appear near the center of the frame and move fast, or that body motion tends to be lateral and small. These domain-specific cues, when combined with generic motion analysis, lead to a robust algorithm that works in most cases and consistently improves object recognition accuracy. For example, it improves the performance of the SIFT-based recognition system (with temporal integration) from 25% to 60%.
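To make the idea of combining domain cues with motion analysis concrete, here is a toy sketch. It assumes a precomputed grid of optical-flow vectors, approximates camera motion by the median flow, and weights the residual motion by a Gaussian prior centered on the frame. The function name, parameters and thresholds are hypothetical, and this is not the segmentation algorithm described above.

```python
import math
from statistics import median


def segment_foreground(flow, sigma_frac=0.3, threshold=1.0):
    """Label grid cells as foreground using residual motion and a center prior.

    flow: 2D list (rows) of (dx, dy) flow vectors, one per pixel or block.
    Returns a 2D list of booleans with the same shape.
    """
    h, w = len(flow), len(flow[0])
    # Assumption: the background dominates the frame, so the median flow
    # approximates the (small, mostly lateral) camera motion.
    cam_dx = median(dx for row in flow for dx, _ in row)
    cam_dy = median(dy for row in flow for _, dy in row)
    cy, cx = (h - 1) / 2, (w - 1) / 2
    sigma = sigma_frac * max(h, w)
    mask = []
    for y, row in enumerate(flow):
        mask_row = []
        for x, (dx, dy) in enumerate(row):
            # Motion left over after removing the estimated camera motion.
            residual = math.hypot(dx - cam_dx, dy - cam_dy)
            # Center prior: hands and objects tend to appear mid-frame.
            prior = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
            mask_row.append(residual * prior > threshold)
        mask.append(mask_row)
    return mask


# 5x5 grid: uniform background drift plus one fast-moving cell at the center.
demo_flow = [[(1.0, 0.0)] * 5 for _ in range(5)]
demo_flow[2][2] = (6.0, 0.0)
demo_mask = segment_foreground(demo_flow)
```

Multiplying the two cues means that fast motion near the frame border is discounted, while slow motion at the center is not enough on its own to count as foreground.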
A few examples of the segmentation algorithm at work:
Complete segmentation results can be found in this video.
Indoor Localization
Another application of egocentric vision we are currently exploring is high-precision (self-)localization in indoor environments using a wearable camera. With the assistance of a pre-computed map (a map of 3D feature points reconstructed using Bundler from static photos), we are able to accurately localize the wearer in our lab using a low-quality wearable camera.
Here are preliminary results of our localization system showing the input video frame, the static photo in the database that the video frame matches, and the location estimated via a particle filter:
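A particle filter is one way to turn noisy per-frame matches into a stable position estimate. The sketch below is a generic 2D version under stated assumptions: each matched frame is taken to yield a noisy (x, y) observation, motion is modeled as a random walk, and the noise levels are made-up values. It illustrates the technique only and is not our implementation.

```python
import math
import random


def particle_filter_step(particles, observation, rng, motion_std=0.1, obs_std=0.5):
    """One predict / weight / resample cycle for 2D position particles.

    particles:   list of (x, y) hypotheses about the wearer's position.
    observation: noisy (x, y) position, e.g. derived from a matched photo (assumed).
    rng:         a random.Random instance, passed in for reproducibility.
    """
    # Predict: random-walk motion model (an assumption; no odometry is used).
    moved = [(x + rng.gauss(0, motion_std), y + rng.gauss(0, motion_std))
             for x, y in particles]
    # Weight: Gaussian likelihood of the observation given each particle.
    ox, oy = observation
    weights = [math.exp(-((x - ox) ** 2 + (y - oy) ** 2) / (2 * obs_std ** 2))
               for x, y in moved]
    if sum(weights) == 0:
        # All likelihoods underflowed; fall back to uniform weights.
        weights = [1.0] * len(moved)
    # Resample: draw particles in proportion to their weights.
    return rng.choices(moved, weights=weights, k=len(moved))


def estimate_position(particles):
    """Mean of the particle cloud as a point estimate."""
    n = len(particles)
    return (sum(x for x, _ in particles) / n, sum(y for _, y in particles) / n)


rng = random.Random(0)
# Start with no idea where the wearer is inside a 10 x 10 area.
cloud = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(1000)]
for _ in range(20):
    cloud = particle_filter_step(cloud, (2.0, 3.0), rng)
print(estimate_position(cloud))
```

Starting from a uniform cloud, repeated observations near the same spot pull the particles together, and the cloud mean serves as the location estimate.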


