Swapping giraffes for sheep. Switching skirts with jeans. It might sound far-fetched, but those are just a couple of the feats a machine learning algorithm designed by researchers at the Korea Advanced Institute of Science and Technology and the Pohang University of Science and Technology can accomplish after ingesting large datasets of images. It’s described in a new paper (“InstaGAN: Instance-Aware Image-to-Image Translation”) published on the preprint server arXiv.org this week.

Image-to-image translation systems (that is, systems that learn the mapping from an input image to an output image) aren’t anything new, to be clear. Just earlier this month, Google AI researchers developed a model that can realistically insert an object in a photo by predicting its scale, occlusions, pose, shape, and more. But as the creators of InstaGAN wrote in the paper, even state-of-the-art methods aren’t perfect.

“Unsupervised image-to-image translation has gained considerable attention due to the recent impressive progress based on generative adversarial networks (GANs),” they said. (For the uninitiated, GANs are two-part neural networks consisting of generators that produce samples and discriminators that attempt to distinguish between the generated samples and real-world samples.) “However, previous methods often fail in challenging cases, in particular, when an image has multiple target instances and a translation task involves significant changes in shape[.]”

Above: InstaGAN trained on the COCO dataset.

The researchers’ solution is a system, InstaGAN, that incorporates instance-level information about multiple target objects. In this case, that information takes the form of object segmentation masks (groups of pixels that belong to the same object), which capture the boundaries of objects while ignoring details such as color.
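To make the idea of a segmentation mask concrete, here is a minimal sketch in plain Python. It assumes a toy "instance-id map" in which 0 marks background and each positive integer marks one object; the grid, the helper names (`split_instances`, `bounding_box`, `area`), and the values are all hypothetical illustrations, not code from the paper.

```python
# Toy instance-id map: 0 = background, 1 and 2 = two separate objects.
# The layout is invented for illustration only.
label_map = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 0],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
]


def split_instances(label_map):
    """Turn an instance-id map into one binary mask per object id."""
    ids = sorted({v for row in label_map for v in row if v != 0})
    return {
        i: [[1 if v == i else 0 for v in row] for row in label_map]
        for i in ids
    }


def bounding_box(mask):
    """Smallest (top, left, bottom, right) box around the mask, or None if empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        return None
    return (min(rows), min(cols), max(rows), max(cols))


def area(mask):
    """Number of pixels that belong to the object."""
    return sum(sum(row) for row in mask)


masks = split_instances(label_map)
for obj_id, mask in masks.items():
    print(obj_id, bounding_box(mask), area(mask))
```

Note that the masks carry only shape and position. Nothing about the pixels' color is stored, which is the property the article describes.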

InstaGAN translates both an image and its corresponding set of instance attributes while aiming to preserve the background context. Combined with a technique the paper calls sequential mini-batch training, which lets it handle a large number of instance attributes on conventional hardware, it can generalize to images with many instances.
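One way to picture "preserving the background context" is to measure how much the pixels outside every object changed during translation. The sketch below is an assumption-laden illustration rather than the paper's actual loss: it treats images as small grayscale grids, counts a pixel as background only if it lies outside all masks both before and after translation, and averages the absolute change over those pixels. The function names `union` and `background_change` are hypothetical.

```python
def union(masks, height, width):
    """Combine several binary masks into one mask covering all objects."""
    out = [[0] * width for _ in range(height)]
    for mask in masks:
        for r in range(height):
            for c in range(width):
                if mask[r][c]:
                    out[r][c] = 1
    return out


def background_change(original, translated, masks_before, masks_after):
    """Mean absolute pixel change over pixels outside every object,
    both in the original and in the translated image."""
    h, w = len(original), len(original[0])
    fg_before = union(masks_before, h, w)
    fg_after = union(masks_after, h, w)
    total, count = 0.0, 0
    for r in range(h):
        for c in range(w):
            if fg_before[r][c] == 0 and fg_after[r][c] == 0:
                total += abs(original[r][c] - translated[r][c])
                count += 1
    return total / count if count else 0.0


# Toy example: a tall object (middle column) becomes a short, wide one.
original = [[0.2, 0.9, 0.2],
            [0.2, 0.9, 0.2]]
mask_before = [[0, 1, 0],
               [0, 1, 0]]
translated = [[0.2, 0.2, 0.2],
              [0.5, 0.5, 0.2]]
mask_after = [[0, 0, 0],
              [1, 1, 0]]

# Background pixels are untouched here, so the score is 0.0.
print(background_change(original, translated, [mask_before], [mask_after]))
```

A score near zero suggests the translation changed the objects but left the rest of the scene alone; a larger score would indicate that the background was altered along the way.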

“To the best of our knowledge, we are the first to report image-to-image translation results for multi-instance transfiguration tasks,” the researchers wrote. “Unlike the previous results in a simple setting, our focus is on the harmony of instances naturally rendered with the background.”

The researchers trained InstaGAN on pairs of object classes drawn from several datasets, including Multi-Human Parsing, MS COCO, and Clothing Co-Parsing. Compared to CycleGAN, an accepted baseline for unpaired image-to-image translation, InstaGAN was more successful at producing “reasonable shapes” of the target instances while keeping the original contexts.

In one example, InstaGAN took a photo of giraffes and convincingly replaced them with sheep. In other tests, it rendered bare legs on a runway model’s body in place of the original clothing, and swapped cups for bottles.

“The experiments on different datasets have shown successful image-to-image translation on the challenging tasks of multi-instance transfiguration, including new tasks, e.g., translating jeans to skirt in fashion images,” the researchers wrote. “Investigating new tasks and new information could be an interesting research direction in the future.”