Local Implicit Image Function [Eng]

Chen et al. / Learning Continuous Image Representation with Local Implicit Image Function / CVPR 2021


To read the review written in Korean, please click here.

📑 1. Problem Definition

Image as a Function

We usually think of an image as a set of pixel-RGB pairs. However, there is another viewpoint: an image as a function. A function is a mapping that takes an input and produces an output; the value $y$ changes according to $x$. Some functions have simple forms such as polynomials or exponentials (Figure 1), while others can be very complex (Figure 2).

Figure 1
Figure 2

A function with a simple shape can be estimated easily (Figure 1). When the outputs vary wildly, as with an image function, it is hard to estimate the implicit functional form mapping pixel positions to RGB values (Figure 2).

Image -> Function : an image can be considered as a function that takes a position $(x, y)$ as input and outputs RGB values. As in Figure 2, the image function is complex, and finding a suitable polynomial or trigonometric form is impossible. Therefore it is not easy to write down an image function analytically, and there are attempts to find the function using a neural network. This field is called Neural Implicit Representation (NIR).

Why do we need NIR?

There are two benefits to knowing the image function:

  1. If the number of parameters is smaller than the size of the image, the function acts as data compression.

  2. An image is inherently discrete (pixel 1, pixel 2, ...), but if we know a continuous representation of the image, we can recover the RGB value between two pixels. (⭐)

In this post, I introduce the LIIF paper, published in CVPR 2021. This paper tackles the second benefit (⭐), continuous representation. This post explains two contributions of the paper:

  • How to train a continuous representation of an image from discrete pixels

  • How to obtain higher resolutions from the continuous representation

📑 2. Local Implicit Image Function (LIIF)

Definition

A function which predicts an RGB value from a given position $x$ can be formulated as $s = f_\theta(x)$: the model predicts an RGB (or grayscale) value from the pixel position. The suggested Local Implicit Image Function (LIIF) uses latent codes $M \in \mathbb{R}^{H \times W \times D}$ extracted from the image and is trained to learn a continuous image $I$. The LIIF model considers not only position information but also the latent codes of the image.

$$s = f_\theta(z, x)$$

  • $s$ : RGB value at a pixel position

  • $x$ : position in continuous space

  • $z$ : latent code

  • $f, \theta$ : neural network and its parameters
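As a minimal sketch, $f_\theta$ can be implemented as a small MLP over the concatenated latent code and coordinate. The layer sizes below are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class LIIFDecoder(nn.Module):
    """Minimal sketch of f_theta(z, x): it maps a latent code z plus a
    relative coordinate x to an RGB value. Hidden sizes are illustrative
    assumptions, not the paper's exact configuration."""

    def __init__(self, latent_dim=64, coord_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + coord_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # RGB output s
        )

    def forward(self, z, x):
        # z: (N, latent_dim) latent codes, x: (N, coord_dim) coordinates
        return self.net(torch.cat([z, x], dim=-1))
```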

Latent Code for continuous position

For an image of size $[0, H] \times [0, W]$, there are $H \times W$ latent codes, as in Figure 3. Given a position $x$, we choose the latent code closest to that position. In Figure 4, we choose 4 latent codes instead of 1 (which is called a local ensemble) for better performance; this is explained in the Local Ensemble part of Section 4.

Figure 3
Figure 4

There are 4×4 latent codes in a 4×4-pixel image. These codes are distributed evenly.

🧐 Few remarks on the latent codes

Q1. What is the value of a latent code?

A1. It is the feature vector of the image produced by a pretrained encoder (EDSR or RDN).

Q2. Are latent codes shared when there are several images?

A2. (No.) Each image gets its own latent codes, because each image is fed through the pretrained encoder.

Q3. Do the latent codes change during LIIF training?

A3. (Yes.) We don't freeze the encoder.

Continuous Representation using Latent Code

We compute the RGB value at a position $x$ in the continuous image representation based on the positions of the latent codes. The difference between the latent code's position $v^*$ and $x$ is used as input to the LIIF model. The continuous representation using latent codes and the relative position of $x$ is

$$I(x) = \sum_{t \in \{00, 01, 10, 11\}} \frac{S_t}{S} \cdot f_\theta(z_t^*, x - v_t^*)$$

  • $z_t^*$ : one of the 4 latent codes surrounding $x$ (indexed by quadrant $t$)

  • $v_t^*$ : the position of latent code $z_t^*$

  • $S_t$ : the area of the rectangle generated by $x$ and $z_t^*$

  • $S = \sum_{t \in \{00, 01, 10, 11\}} S_t$

Because we use the relative distance from the latent code, we can obtain a continuous image representation by feeding continuous positions. As described in Figure 5, we can choose any continuous $x$ in the image domain, and the continuous relative position $x - v_t^*$ is computed.

Figure 5: for a continuous position $x$, $z^*$ denotes the 4 closest latent codes at position $x$.
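Below is a simplified, loop-based sketch of this query step (not the official implementation). One detail worth noting: in the paper, each code $z_t^*$ is weighted by the area diagonal to it, so the nearest code receives the largest weight; the `flip` below implements that pairing.

```python
import torch

def liif_query(decoder, latents, coords, x):
    """Sketch of I(x) = sum_t (S_t / S) * f_theta(z_t*, x - v_t*).

    latents: (H, W, D) latent codes; coords: (H, W, 2) their center
    positions in [0, 1]^2; x: (2,) a continuous query position.
    """
    H, W, _ = latents.shape
    # indices of the top-left code of the 2x2 neighborhood around x
    j = min(max(int(x[0] * H - 0.5), 0), H - 2)
    i = min(max(int(x[1] * W - 0.5), 0), W - 2)
    preds, areas = [], []
    for dj in (0, 1):              # loop over the 4 nearest latent codes
        for di in (0, 1):
            z = latents[j + dj, i + di]
            v = coords[j + dj, i + di]
            rel = x - v            # relative coordinate x - v_t*
            preds.append(decoder(z[None], rel[None])[0])
            # S_t: area of the rectangle spanned by x and v_t*
            areas.append((rel[0].abs() * rel[1].abs()).clamp(min=1e-9))
    preds, areas = torch.stack(preds), torch.stack(areas)
    # Each code is weighted by the area diagonal to it, so the nearest
    # code gets the largest weight; reversing the order pairs
    # (00 <-> 11) and (01 <-> 10).
    weights = areas.flip(0) / areas.sum()
    return (preds * weights[:, None]).sum(0)   # weighted RGB value I(x)
```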

📑 3. Pipeline

In the section above, we covered the meaning of the latent codes and the LIIF function. The authors suggest a self-supervised training scheme for the LIIF model. Now we will see how to prepare the data and train the model.

  1. ✔️ Data Preparation

  2. ✔️ Training

Data Preparation

In data preparation, we prepare a down-sampled image (with a reduced number of pixels) together with the original positions $x_{hr}$ and RGB values $s_{hr}$. As described in Figure 6, we predict the RGB values of the original image from the down-sampled image. Note that the input is simply the same image at a lower resolution, so the task is effectively super-resolution.

Training

In training, we feed the down-sampled image ($48 \times 48$) to the pretrained encoder and obtain a feature map. This feature map provides the latent codes of the image, and the encoder preserves the spatial size of the image. As shown in Figure 7, the LIIF model predicts the RGB value $s_{hr}$ from $x_{hr}$ and the latent codes. The authors used the $L_1$ loss; see the training sketch below.

🚨 The role of the encoder is to generate separate latent codes for each image, so we do not have to train a separate model per image. This is different from NIR methods that train one model on a single image.
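A minimal sketch of one training step follows; the encoder/decoder objects, batch layout, and the per-pixel Python loop are simplifying assumptions, not the official code. `liif_query` is the helper sketched in Section 2.

```python
import torch
import torch.nn.functional as F

def training_step(encoder, decoder, lr_img, hr_coords, hr_rgb, optimizer):
    """One self-supervised step: encode the 48x48 down-sampled image,
    query RGB values at the original high-resolution coordinates with
    liif_query (sketched in Section 2), and minimize the L1 loss."""
    latents = encoder(lr_img)               # (1, D, 48, 48) feature map
    latents = latents[0].permute(1, 2, 0)   # -> (48, 48, D) latent codes
    # center positions of the latent codes, normalized to [0, 1]
    ys = (torch.arange(48, dtype=torch.float32) + 0.5) / 48
    coords = torch.stack(torch.meshgrid(ys, ys, indexing="ij"), dim=-1)
    # looped for clarity; real implementations batch these queries
    pred = torch.stack([liif_query(decoder, latents, coords, x)
                        for x in hr_coords])
    loss = F.l1_loss(pred, hr_rgb)          # L1 loss, as in the paper
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```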

🧐 How can we get a 224×224 image from 48×48?

Even though the pixel counts differ, the 48×48 and the 224×224 grids represent the same image. We can therefore normalize positions by the pixel size and obtain a $[0,1] \times [0,1]$ image representation that is independent of the pixel resolution.

Therefore, the ground-truth coordinates in the data preparation step lie in the $[0,1]$ range, not $[0,224]$.
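A small sketch of this normalization, assuming pixel-center alignment (the exact convention, e.g. $[-1,1]$ vs. $[0,1]$, varies between implementations; here I follow the $[0,1]$ convention used above):

```python
import torch

def make_coord_grid(h, w):
    """Return (h*w, 2) pixel-center coordinates normalized to [0, 1]^2,
    independent of the pixel resolution."""
    ys = (torch.arange(h, dtype=torch.float32) + 0.5) / h
    xs = (torch.arange(w, dtype=torch.float32) + 0.5) / w
    grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)
    return grid.reshape(-1, 2)

# A 48x48 grid and a 224x224 grid cover the same [0,1]^2 domain,
# so both resolutions can query the same continuous function.
lr_coords = make_coord_grid(48, 48)       # (2304, 2)
hr_coords = make_coord_grid(224, 224)     # (50176, 2)
```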

📑 4. Additional Engineering

We can boost performance with additional engineering on top of the LIIF model. The authors proposed three methods, and the best performance is obtained when all three are applied.

  1. ✔️ Feature Unfolding : concatenating each latent code with its 3×3 neighboring latent codes

  2. ✔️ Local Ensemble : choosing 4 latent codes for a continuous position $x$, instead of 1

  3. ✔️ Cell Decoding : feeding the cell size as an additional input when decoding

Feature Unfolding

We obtain a feature map from the encoder. In feature unfolding, we concatenate the 3×3 neighboring features so that we get a richer representation of the input image. The feature dimension, however, becomes 9 times larger.

$$\hat{M}_{jk} = \mathrm{Concat}\left( \{ M_{j+l,\, k+m} \}_{l, m \in \{-1, 0, 1\}} \right)$$
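Feature unfolding can be sketched with `torch.nn.functional.unfold`; the (N, D, H, W) layout and zero padding at the borders are assumptions:

```python
import torch
import torch.nn.functional as F

def feature_unfold(feat):
    """Concatenate each latent code with its 3x3 neighbors.
    feat: (N, D, H, W) -> (N, 9*D, H, W), zero-padded at the borders."""
    n, d, h, w = feat.shape
    unfolded = F.unfold(feat, kernel_size=3, padding=1)  # (N, 9*D, H*W)
    return unfolded.view(n, d * 9, h, w)
```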

Local Ensemble

There is a problem with distance-based latent code selection: as described in Figure 8, two nearby positions can be assigned different latent codes, so the latent code changes suddenly as $x$ crosses a cell boundary. To fix this, the local ensemble chooses 4 latent codes, and only half of them change at a boundary, as described in Figure 9.

Figure 8
Figure 9

Figure 8: if we choose just a single latent code, the latent code changes suddenly. Figure 9: when we choose 4 latent codes (one per quadrant), only half of them change; for the left $x$, we choose the closest 4 latent codes $z_{12}, z_{13}, z_{22}, z_{23}$.

Cell Decoding

The LIIF model takes latent codes and position information, but there is no signal about the target resolution. For example, when we increase the resolution from $48 \times 48$ to $224 \times 224$, we provide coordinates and latent codes, but no information that we want ×4 up-scaling. Therefore, cell decoding feeds the cell size as an additional input to the decoder.

$$s = f_{cell}(z, [x, c])$$

  • $c = [c_h, c_w]$ : the cell size, i.e. the height and width of one pixel at the target resolution
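In code, cell decoding simply concatenates the cell size to the coordinate input. A minimal sketch, assuming the `LIIFDecoder` above was constructed with `coord_dim=4` (the helper name `cell_decode` is hypothetical):

```python
import torch

def cell_decode(decoder, z, x, target_h, target_w):
    """Append the cell size c = [c_h, c_w] (the height and width of one
    target pixel in the normalized [0,1] domain) to the coordinate."""
    c = torch.tensor([1.0 / target_h, 1.0 / target_w])
    c = c.expand(x.shape[0], 2)            # same cell size for every query
    return decoder(z, torch.cat([x, c], dim=-1))
```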

📑 5. Experiments

High Resolution Benchmark

Figure 10 shows the performance of LIIF on the DIV2K dataset, a high-resolution benchmark. The first row group uses the Enhanced Deep Residual Networks for Single Image Super-Resolution (EDSR) encoder, and the second row group uses the Residual Dense Network (RDN) encoder.

  • With the EDSR encoder, LIIF outperforms the other methods using the same encoder. It also performs better on out-of-distribution tasks that require higher up-scaling factors such as ×6 and ×30, even though the model is trained only on ×1~×4 scales. LIIF generalizes better because it works with distances from the latent codes, and the information in the latent codes remains useful at unseen scales.

  • With the RDN encoder, LIIF shows similar performance on in-distribution tasks and outperforms the other methods on out-of-distribution tasks.

💡 As a result, we conclude that the LIIF model outperforms other methods when higher resolution is required.

Continuous Representation

If the model has learned the continuous representation well, the image should remain continuous when zoomed in. The images generated by the LIIF model show cleaner and smoother patterns than other NIR methods (Figure 11): other models exhibit blur, while LIIF produces a very smooth image.

📑 6. Conclusion

In this paper, the Local Implicit Image Function ($f(z, x - v)$) is suggested for continuous image representation. The target position is expressed relative to the locations of the latent codes, which makes the continuous image representation possible. Also, a pretrained encoder is used, and a single model is trained on all images together.

An image has a different RGB value at each pixel, and compressing it while guaranteeing high resolution is a hard task. However, with LIIF we can compress an image into a neural network. If this property generalizes well, we may transfer neural networks instead of images in the future.

Take Home Message

Implicit neural representation usually learns from raw data, so a separate model must be trained for every new datum. Because deep learning lets us extract features efficiently, we can instead train a single generalized model on top of those feature representations. Also, modeling the continuous domain as distances from feature locations is a good approach.

📑 Author / Reviewer information

Author

  1. 박범진 (Bumjin Park): KAIST / bumjin@kaist.ac.kr

Reviewer

  • None

📰 Related Sites

  • LIIF official GitHub

  • Enhanced Deep Residual Networks for Single Image Super-Resolution (EDSR)

  • Residual Dense Network (RDN)