Barbershop [Eng]

Zhu et al. / Barbershop: GAN-based Image Compositing using Segmentation Masks / SIGGRAPH Asia 2021

1. Problem definition

Image compositing creates a new image by combining parts of existing images, for example by collecting and mixing the features of multiple source images. Recent advances in GANs have produced strong results in image synthesis, but differences in lighting, geometry, and partial occlusion between images make blending difficult. Mixing or editing face images is especially hard because a face contains many image patches with very different characteristics, such as hair, eyes, and teeth.

Recent GAN-based face image editing methods largely fall into two categories. The first manipulates the latent space of a pre-trained network (usually StyleGAN[3,4]); this is effective for changing global characteristics of the image such as gender, facial expression, skin color, and pose. The second feeds the desired attribute changes into a conditional GAN. Among hairstyle-transfer methods, conditional GANs that take information about the region to be synthesized as input show excellent results. Such methods typically rely on a separately pre-trained inpainting network to fill regions that become empty, for example when the background is exposed after changing long hair to short hair. However, the quality gap between the synthesis network and the inpainting network often produces awkward seams or unwanted artifacts.

In this paper, the authors address this problem with GAN inversion, synthesizing better-quality composites with only a single network.

2. Motivation

Recently, generative models such as StyleGAN[3,4] have been able to generate diverse, high-quality face images, and research has explored synthesizing the characteristics of multiple images. GAN inversion has been the main tool: a reference face image is first mapped to a latent code of a pre-trained generative model, and an optimization then searches for a new latent code that represents the desired composite image. However, because of spatial correlations between features, it is difficult to blend the latent codes of images with different characteristics.

This paper aims to find, in a pre-trained generative model, the latent code that represents a better composite image by referring to a segmentation map.

GAN-inversion

StyleGAN[3,4] produces images of remarkable quality, and by manipulating its latent space, attributes of a generated image such as pose and gender can be changed naturally. However, everything StyleGAN generates is a synthetic image; to apply these manipulations to real images, we first have to make StyleGAN represent them. GAN-inversion research therefore looks for latent codes with which a pre-trained StyleGAN can reproduce real images. There are two main approaches. One is the embedding approach, which directly optimizes the latent code $z$ (or the input to the mapping network) using gradients computed against the target image. The other is the projection approach, which trains an encoder that converts a real image into a code in StyleGAN's latent space. Embedding methods such as I2S[5] and the StyleGAN2 authors' own projector reproduce the target image with high quality, but because they rely on per-image optimization, embedding a single image takes a long time. Projection methods that learn an encoder, such as pSp[6] and e4e[7], express an image quickly because only one feed-forward pass of the network is needed, but they produce images of lower quality than the embedding methods.
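To make the embedding (optimization) approach concrete, here is a minimal PyTorch sketch of per-image latent optimization. It is an illustration only: the `generator` object, its `mean_latent` attribute, and the W+ code shape (1, 18, 512) are assumptions, not the actual Barbershop or StyleGAN2 API.

```python
import torch
import lpips  # perceptual similarity package (pip install lpips)

def embed_image(generator, target_image, num_steps=400, lr=0.01):
    """Optimization-based GAN inversion sketch.

    generator     -- assumed callable mapping a W+ code (1, 18, 512) to an image in [-1, 1]
    target_image  -- real image tensor (1, 3, H, W) scaled to [-1, 1]
    """
    percept = lpips.LPIPS(net="vgg")
    # Start from the average latent code (assumed attribute), broadcast to W+.
    w = generator.mean_latent.clone().repeat(1, 18, 1).requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)

    for _ in range(num_steps):
        recon = generator(w)                        # image from the current code
        loss = percept(recon, target_image).mean()  # perceptual reconstruction loss
        loss = loss + torch.nn.functional.mse_loss(recon, target_image)  # pixel-wise loss
        opt.zero_grad()
        loss.backward()
        opt.step()

    return w.detach()
```

An encoder-based (projection) method would instead predict `w` with a single forward pass of a trained encoder, trading reconstruction quality for speed.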

How to control the output images of a GAN?

Recent studies have shown that the latent space produced through a mapping network, as in StyleGAN, carries rich information, and that the characteristics of generated images can be changed by manipulating it. Among methods that control the output image through the latent space, there are embedding and projection approaches. Embedding methods that adjust the latent space of a pre-trained StyleGAN include StyleFlow[8], which conditions the latent code on target attributes, and StyleCLIP[9], which manipulates the latent space using text input. Projection methods that control the input using an image-to-image translation structure include SPADE[10] and SEAN[11], which receive a segmentation map as input and translate it into an image, and StarGAN-v2[12], which transfers the style of the entire image given a reference image with the desired attributes.

Idea

Barbershop uses the embedding approach for image quality. It embeds real images into a pre-trained StyleGAN with the help of segmentation maps, using loss functions that target a different reference image in each region. It can therefore synthesize a single image with different styles in different regions.

3. Method

Using segmentation maps, this paper embeds real images into a pre-trained StyleGAN with loss functions that target a different reference image in each region.

Overview

In this paper, segmentation maps guide the image synthesis, so the layout of the synthesized image follows the target segmentation map. The core technique is GAN inversion, which projects real images into the StyleGAN latent space; the paper uses StyleGANv2[4] together with the II2S[13] embedding method. To preserve the details of the resulting image, they propose a new latent code $C = (F, S)$. StyleGANv2 normally uses 18 latent codes per image; here $F$ denotes the output feature map of StyleGANv2's eighth style block (the structure code), and the remaining ten latent codes form $S$, the appearance code. The II2S-based embedding then finds the $C$ value that represents a given real image. Image synthesis follows the steps below.

  1. Obtain the segmentation map of the reference images to be used for a style change.

  2. Create a target segmentation map by combining the obtained segmentation maps.

  3. Align the reference images to fit the target segmentation map.

  4. Embed the aligned images to find the corresponding $C = (F, S)$ value for each image.

  5. Using the masked-appearance loss function, which targets a different reference image in each region of the target segmentation map, find the $C$ value that mixes the appearance and structure of multiple images.

Target Segmentation

Creating a target segmentation map is simple. As shown in Figure 2, the target map is created by extracting the desired region from each reference image's segmentation map and combining them. If the target map ends up with empty areas, they are filled in a simple way, also shown in Figure 2.
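For illustration, composing such a target map amounts to a few array operations on integer label maps. The sketch below assumes two same-sized label maps and a hair label index of 10; the actual label values depend on the segmentation network used, and the hole-filling here (assigning the background label) is only a simple stand-in for the heuristic filling mentioned above.

```python
import numpy as np

HAIR = 10        # assumed label index for hair
BACKGROUND = 0   # assumed label index for background

def make_target_mask(face_seg: np.ndarray, hair_seg: np.ndarray) -> np.ndarray:
    """Combine two HxW integer segmentation maps: keep everything from the
    identity image except the hair region, which is taken from the hair image."""
    target = face_seg.copy()
    # Paste the hair region from the hair reference image.
    target[hair_seg == HAIR] = HAIR
    # Pixels that were hair in the identity image but are not covered by the
    # new hair become empty; fill them with a simple background label.
    empty = (face_seg == HAIR) & (hair_seg != HAIR)
    target[empty] = BACKGROUND
    return target
```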

Image alignment & embedding

First, to synthesize images, the reference images are aligned to the target segmentation map $M$. The alignment consists of two steps. The first step finds the latent code $C_k^{rec}$ that reconstructs each reference image $Z_k$ using the embedding method with a reconstruction loss.

The latent space they use is called $FS$ space; it consists of the feature map $F$ from the eighth block of StyleGAN and the 10 latent codes $S$ of the remaining blocks. Because the $F$ code carries most of the spatial detail, they do not fit the codes from the first step to the target segmentation $M$ directly; it is more efficient to first find a latent code $w_k^{align}$ that represents an aligned reference image and then derive $F_k^{align}$ from it. To find $w^{align}$, they leverage two losses: a masked style loss and a cross-entropy loss on the segmentation output.

$I_k(Z) = \mathbb{1}\{\mathrm{Segment}(Z) = k\}$ is a mask on each reference image $Z_k$ that corresponds only to the region used in $M$, so the loss is computed from that region alone. $K$ denotes the gram matrix used for the style loss, and $\mathrm{XEnt}$ is the cross-entropy loss that compares the segmentation map of the generated image with the target segmentation $M$. $F_k^{align}$ is then built from $w_k^{align}$: the $F$ values in the overlapping region $H$ are taken from $F_k^{rec}$, and the rest come from $w_k^{align}$, giving the final $F_k^{align}$.
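Below is a hedged PyTorch sketch of such an alignment objective: a gram-matrix style loss computed only inside the region mask, plus a cross-entropy term that pushes the segmentation of the generated image toward the target map $M$. The feature extractor, the segmentation network, and the weighting `lam` are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as nnf

def gram(feat: torch.Tensor) -> torch.Tensor:
    # feat: (B, C, H, W) -> (B, C, C) gram matrix, the K used in the style loss
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def masked_style_loss(gen_feat, ref_feat, mask):
    # Compare gram matrices computed only inside the region mask I_k.
    # mask: (B, 1, H, W) binary float mask, resized to the feature resolution.
    m = nnf.interpolate(mask, size=gen_feat.shape[-2:], mode="nearest")
    return nnf.mse_loss(gram(gen_feat * m), gram(ref_feat * m))

def align_loss(gen_feats, ref_feats, region_mask, seg_logits, target_map, lam=1.0):
    """gen_feats / ref_feats: lists of feature maps for the generated and
    reference images (from some pretrained feature extractor, assumed here).
    seg_logits: (B, num_classes, H, W) segmentation prediction of the generated image.
    target_map: (B, H, W) integer labels of the target segmentation M."""
    style = sum(masked_style_loss(g, r, region_mask)
                for g, r in zip(gen_feats, ref_feats))
    xent = nnf.cross_entropy(seg_logits, target_map)
    return style + lam * xent
```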

Image Blending

The codes $C_k^{align} = (F_k^{align}, S_k^{align})$ are used to synthesize the final image. In the final composite code $C^{blend}$, the structure code $F^{blend}$ is made by combining the codes $F_k^{align}$ over the corresponding regions $\alpha$ of the target mask $M$. The appearance code $S^{blend}$, which represents the remaining features, is found by computing an LPIPS loss on each region of the mask $M$ and solving for the mixing weights $u$ applied to each corresponding latent code.
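In other words, the structure blend is a masked combination of feature maps, and the appearance blend is a convex combination of the $S$ codes. The sketch below shows only that arithmetic; the tensor shapes, the softmax parameterization of $u$, and the omitted masked-LPIPS optimization of $u$ are assumptions for illustration.

```python
import torch

def blend_structure(F_aligned, alpha):
    """F_aligned: list of structure codes F_k^align, each (B, C, H, W).
    alpha: list of binary float region masks (B, 1, H, W) from the target map M,
    resized to the resolution of F and assumed to partition the map."""
    F_blend = torch.zeros_like(F_aligned[0])
    for F_k, a_k in zip(F_aligned, alpha):
        F_blend = F_blend + a_k * F_k
    return F_blend

def blend_appearance(S_aligned, u_logits):
    """S_aligned: list of appearance codes S_k^align, each (B, 10, 512).
    u_logits: (num_refs,) parameters; a softmax keeps the mixing weights convex.
    In the paper these weights are chosen by minimizing a masked LPIPS loss;
    that optimization loop is omitted here."""
    u = torch.softmax(u_logits, dim=0)
    return sum(u_k * S_k for u_k, S_k in zip(u, S_aligned))
```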

4. Experiment & Result

4.1. Experimental Setup

This paper uses 120 images of size 1024×1024 embedded through II2S, and conducts image-compositing experiments on 198 image pairs. The embedding process runs for 400 iterations to find $C_k^{rec}$ and 100 iterations for the alignment step.

4.2. Result

For quantitative evaluation, they compared against existing methods using RMSE, PSNR, SSIM, perceptual similarity (LPIPS), and FID. The baseline in the table embeds into the conventional $W+$ space instead of $FS$ space, without the alignment method. In addition, an image quality user study with 396 participants preferred Barbershop 378 to 18 over LOHO[15] and 381 to 14 over MichiGAN[16].

These results show that Barbershop composites images considerably better than existing methods. It also shows impressive results for hairstyle changes and face swapping.

5. Conclusion

In this paper, images are synthesized using a pre-trained generative model and segmentation masks. In particular, the authors propose a new latent space, $FS$ space, for the embedding method. By specifying a target mask, aligning all reference images to it, and using the masked style loss, they achieve high-quality compositing results.

Take home message

The latent space of a pre-trained GAN is a useful tool for editing and compositing real images.

Author / Reviewer information

Author

조영주 (Youngjoo Jo)

  • KAIST AI

  • [github](https://github.com/run-youngjoo)

Reviewer

  1. Korean name (English name): Affiliation / Contact information

  2. Korean name (English name): Affiliation / Contact information

  3. ...

Reference & Additional materials

  1. Zhu, Peihao, et al. "Barbershop: GAN-based image compositing using segmentation masks." ACM Transactions on Graphics (TOG) 40.6 (2021): 1-13.

  2. Official GitHub repository : https://github.com/ZPdesu/Barbershop

  3. Karras, Tero, Samuli Laine, and Timo Aila. "A style-based generator architecture for generative adversarial networks." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.

  4. Karras, Tero, et al. "Analyzing and improving the image quality of stylegan." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.

  5. Abdal, Rameen, Yipeng Qin, and Peter Wonka. "Image2stylegan: How to embed images into the stylegan latent space?." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.

  6. Richardson, Elad, et al. "Encoding in style: a stylegan encoder for image-to-image translation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

  7. Tov, Omer, et al. "Designing an encoder for stylegan image manipulation." ACM Transactions on Graphics (TOG) 40.4 (2021): 1-14.

  8. Abdal, Rameen, et al. "Styleflow: Attribute-conditioned exploration of stylegan-generated images using conditional continuous normalizing flows." ACM Transactions on Graphics (TOG) 40.3 (2021): 1-21.

  9. Patashnik, Or, et al. "Styleclip: Text-driven manipulation of stylegan imagery." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

  10. Park, Taesung, et al. "Semantic image synthesis with spatially-adaptive normalization." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.

  11. Zhu, Peihao, et al. "Sean: Image synthesis with semantic region-adaptive normalization." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

  12. Choi, Yunjey, et al. "Stargan v2: Diverse image synthesis for multiple domains." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.

  13. Zhu, Peihao, et al. "Improved stylegan embedding: Where are the good latents?." arXiv preprint arXiv:2012.09036 (2020).
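
  15. Saha, Rohit, et al. "LOHO: Latent optimization of hairstyles via orthogonalization." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

  16. Tan, Zhentao, et al. "MichiGAN: multi-input-conditioned hair image generation for portrait editing." ACM Transactions on Graphics (TOG) 39.4 (2020): 1-13.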
