In Canadian author Margaret Atwood's book "The Blind Assassin," she writes that "touch comes before sight, before speech. It's the first language and the last, and it always tells the truth."
While our sense of touch gives us a channel to feel the physical world, our eyes help us immediately understand the full picture of these tactile signals.
Robots that have been programmed to see or feel can't use these signals quite as interchangeably. To better bridge this sensory gap, researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a predictive artificial intelligence (AI) that can learn to see by touching, and learn to feel by seeing.
The team's system can create realistic tactile signals from visual inputs, and predict which object, and what part of it, is being touched directly from those tactile inputs. They used a KUKA robot arm with a special tactile sensor called GelSight, designed by another group at MIT.
Using a simple web camera, the team recorded nearly 200 objects, including tools, household products, and fabrics, being touched more than 12,000 times. Breaking those 12,000 video clips down into static frames, the team compiled "VisGel," a dataset of more than 3 million visual/tactile paired images.
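The pairing step described above can be sketched in a few lines. This is a hypothetical illustration, not the team's actual pipeline: frame extraction from video (e.g. with OpenCV) is omitted, and the frame names are invented placeholders standing in for synchronized camera and GelSight images.

```python
# Hypothetical sketch of pairing synchronized camera and tactile
# frames into (visual, tactile) training pairs, VisGel-style.
# Frames here are placeholder strings, not real images.

def build_pairs(visual_frames, tactile_frames):
    """Pair each visual frame with the tactile frame recorded at the
    same timestep; the two sequences are assumed to be synchronized."""
    if len(visual_frames) != len(tactile_frames):
        raise ValueError("sequences must be the same length")
    return list(zip(visual_frames, tactile_frames))

# One short touch sequence: four synchronized frames from each sensor.
visual = [f"cam_frame_{i}" for i in range(4)]
tactile = [f"gel_frame_{i}" for i in range(4)]
pairs = build_pairs(visual, tactile)
print(pairs[0])  # ('cam_frame_0', 'gel_frame_0')
```

Repeating this over thousands of touch sequences is what turns raw recordings into a supervised dataset of cross-modal pairs.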
"Looking at the scene, our model can imagine the sensation of touching a flat surface or sharp edge," says Yunzhu Li, PhD student in CSAIL and lead author of a new paper on the system. "By blindly touching, our model can predict interaction with the environment only from tactile feelings. Bringing these two senses together could give the robot more power and reduce the data we might need for tasks involving manipulation and seizure of objects. "
Recent work to equip robots with more human-like physical senses, such as MIT's 2016 project that used deep learning to visually indicate sounds, or a model that predicts objects' responses to physical forces, both rely on large datasets that aren't available for understanding interactions between vision and touch.
The team's technique gets around this problem by using the VisGel dataset and a system called generative adversarial networks (GANs).
GANs use visual or tactile images to generate images in the other modality. They work by using a "generator" and a "discriminator" that compete with each other, where the generator aims to create realistic-looking images to fool the discriminator. Every time the discriminator "catches" the generator, it has to expose its internal reasoning for the decision, which allows the generator to repeatedly improve itself.
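The generator/discriminator game can be shown on a toy problem. The sketch below is only an illustration of the adversarial training loop, with every number and name invented: the real model works on visual/tactile image pairs with deep networks, while here both players are tiny one-parameter-per-weight functions on 1-D data.

```python
import numpy as np

# Toy adversarial training loop: the generator g(z) = a*z + b tries to
# produce samples near the "real" data (drawn around 4.0) that the
# discriminator d(x) = sigmoid(w*x + c) cannot tell apart from real.
rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

a, b = 1.0, 0.0          # generator parameters
w, c = 0.0, 0.0          # discriminator parameters
lr = 0.05

for step in range(2000):
    real = rng.normal(4.0, 0.5, size=32)
    z = rng.normal(0.0, 1.0, size=32)
    fake = a * z + b

    # Discriminator update: push d(real) toward 1 and d(fake) toward 0,
    # by gradient descent on -log d(real) - log(1 - d(fake)).
    d_real, d_fake = sigmoid(w * real + c), sigmoid(w * fake + c)
    grad_w = np.mean(-(1 - d_real) * real + d_fake * fake)
    grad_c = np.mean(-(1 - d_real) + d_fake)
    w -= lr * grad_w
    c -= lr * grad_c

    # Generator update: push d(fake) toward 1 (fool the discriminator),
    # by gradient descent on -log d(fake).
    d_fake = sigmoid(w * fake + c)
    upstream = -(1 - d_fake) * w       # dL/dfake
    a -= lr * np.mean(upstream * z)
    b -= lr * np.mean(upstream)

# After training, the generator's offset b has drifted toward the
# real data's mean of 4.0, even though it never saw a real sample
# directly -- only the discriminator's feedback.
print(round(b, 2))
```

The key point the article's description captures is that the generator improves purely through the discriminator's gradient signal, i.e. the "exposed reasoning" of each catch.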
Vision to touch
Humans can infer how an object feels just by seeing it. To better give machines this power, the system first had to locate the position of the touch, and then deduce information about the shape and feel of the region.
Reference images, captured without any robot-object interaction, helped the system encode details about the objects and the environment. Then, while the robot arm was operating, the model could simply compare the current frame with its reference image, and easily identify the location and scale of the touch.
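The idea of comparing a current frame against a no-contact reference can be illustrated with simple frame differencing. This is a hedged sketch only: the paper's system uses learned models, whereas `locate_touch` below is an invented helper that just thresholds the pixel-wise difference on synthetic grayscale frames.

```python
import numpy as np

# Illustrative reference-frame comparison: subtract a no-contact
# reference image from the current frame and threshold the difference
# to localize where the touch happened.

def locate_touch(reference, current, threshold=30):
    """Return the bounding box (row_min, row_max, col_min, col_max)
    of pixels that changed by more than `threshold`, or None."""
    diff = np.abs(current.astype(int) - reference.astype(int))
    rows, cols = np.nonzero(diff > threshold)
    if rows.size == 0:
        return None
    return (rows.min(), rows.max(), cols.min(), cols.max())

# Synthetic 8-bit grayscale frames: a flat scene, then the same scene
# with a bright patch standing in for a contact region.
reference = np.full((64, 64), 100, dtype=np.uint8)
current = reference.copy()
current[20:30, 40:50] = 200   # simulated touch region

print(locate_touch(reference, current))  # (20, 29, 40, 49)
```

The bounding box gives both the location and the scale of the change, which is exactly the information the article says the model extracts from the comparison.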
This could be like giving the system an image of a computer mouse, and then "seeing" the area where the model predicts the object should be touched for pickup, which could greatly help machines plan safer and more efficient actions.
Touch to vision
For touch to vision, the goal was for the model to produce a visual image based on tactile data. The model analyzed a tactile image, and then figured out the shape and material of the contact position. It then looked back to the reference image to "hallucinate" the interaction.
For example, if during testing the model was fed tactile data on a shoe, it could produce an image of where that shoe was most likely to be touched.
This kind of ability could be helpful for carrying out tasks in cases where there's no visual data, like when a light is off, or if a person is blindly reaching into an unknown box or area.
Looking ahead
The current dataset only has examples of interactions in a controlled environment. The team hopes to improve this by collecting data in more unstructured areas, or by using a new MIT-designed tactile glove, to better increase the size and diversity of the dataset.
There are still details that can be tricky to infer from switching modes, like telling the color of an object by just touching it, or telling how soft a sofa is without actually pressing on it. The researchers say this could be improved by creating more robust models for uncertainty, to expand the distribution of possible outcomes.
In the future, this type of model could help with a more harmonious relationship between vision and robotics, especially for object recognition, grasping, better scene understanding, and seamless human-robot integration in assistive or manufacturing settings.
"This is the first method to convincingly translate visual and tactile cues," said Andrew Owens, a postdoctoral fellow at the University of California, Berkeley. "Methods like this can be very useful for robotics, where you have to answer questions like," Is this object hard or soft? "Or" if I lift this cup by its handle, what will be my grip? "It's a very difficult problem because the signals are so different, and this model has shown great capacity."
Li wrote the paper alongside MIT professors Russ Tedrake and Antonio Torralba, and MIT postdoc Jun-Yan Zhu. The paper will be presented next week at the Conference on Computer Vision and Pattern Recognition in Long Beach, California.