It is an interesting problem because it involves three different things.
The first is visual perception and communication. People are the ones selecting objects, so what people see is important: the perceptual entities, the perceptual components of the image. Communication is important too, because people have to tell the system somehow which object they want to extract. So the human conveys to the machine which objects they are interested in, and the machine then has to transform that description into an exact part of the image. That is where the second component, computer vision, comes in.
You have to have a segmentation of the image that corresponds both to the perceptual segmentation and to the communication the human uses to select a part of the image, for example a sketch showing which part of the image the human wants. That sketch has to be analysed and understood by the machine, and mapped on to a specific subset of the pixels of the image.
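To make the idea of mapping a sketch on to pixels concrete, here is a minimal sketch in Python. It is only an illustration under simple assumptions, not the method discussed in this lecture: the image is assumed to be already segmented into labelled regions, the user's sketch is assumed to be a list of (row, column) pixel positions, and the hypothetical helper `select_segments` simply picks every region the sketch touches.

```python
def select_segments(labels, scribble):
    """Return a boolean mask marking every pixel whose segment the scribble touches.

    labels   -- 2D list of integer segment ids, one per pixel (assumed precomputed)
    scribble -- list of (row, col) pixel positions drawn by the user
    """
    # Segment ids found under the user's scribble.
    touched = {labels[r][c] for (r, c) in scribble}
    # A pixel is selected if its segment was touched anywhere.
    return [[lab in touched for lab in row] for row in labels]


# Toy 4x4 label map with three segments (ids 0, 1 and 2).
labels = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 1, 1],
    [2, 2, 2, 2],
]

# The user draws a short stroke over two pixels, both inside segment 1.
scribble = [(0, 2), (1, 3)]

selected = select_segments(labels, scribble)
selected_count = sum(cell for row in selected for cell in row)
```

The point of the toy example is that a two-pixel stroke is enough to select the whole region, which is why the quality of the underlying segmentation, and how well it agrees with what the person perceives, matters so much.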
Human-computer interaction is the third component of relevance here, because all of this, the visual communication and the computer analysis, has to be done in a way that is easy for the human. After all, it is a human-computer interface. We would like to make it very user-friendly, very efficient, and quick, so that the task does not take too long.