This paper reproduces paintings through strokes. It was especially impressive that the model was built without any dataset, using a self-training pipeline.
Abstract
Neural Painting
The procedure of producing a series of strokes for a given image, i.e., a non-photorealistic recreation of it using a neural network.
RL?
Can generate a stroke sequence step by step
But, training stable RL agents is not easy.
Iterative stroke optimization methods
Stroke optimization methods search for a set of stroke parameters "iteratively" in a large search space, making them less efficient.
A novel Transformer-based framework that predicts the parameters of a stroke set with a feed-forward network.
It can generate a set of strokes in parallel and obtain the final painting of size 512 × 512 in near real time.
Self-training pipeline
No dataset is available for training the Paint Transformer.
The researchers created a self-training pipeline so the model can be trained w/o any off-the-shelf dataset.
With cheaper training and inference costs, the method achieves better painting performance.
Introduction
The goal of this paper seems to be creating human-like paintings, since humans paint in a stroke-by-stroke procedure. Especially for oil paint or watercolor, the generated paintings can then look more like real human work.
Previous works
RNN
- Sequential process of generating strokes 1-by-1
- Referred to Sketch-RNN
Step-wise greedy search
- Sequential process of generating strokes 1-by-1
RL
- Sequential process of generating strokes 1-by-1
- Pros: Inference is fast
- Cons: Long training time, Unstable agents
Stroke parameter searching
- iterative optimization
- Pros: results are attractive
- Cons: insufficient efficiency and effectiveness.
Overview
Stroke set prediction instead of stroke sequence generation.
Given an initial canvas and a target natural image, the model predicts a set of strokes and renders them on the initial canvas to minimize the difference between the rendered image and the target one.
This is repeated at K coarse-to-fine scales (a coarse parameter set is refined into a finer one, close to the best set found at the coarser scale).
The initial canvas is the output of the previous scale.
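The coarse-to-fine loop can be sketched roughly as follows; the predictor and renderer here are hypothetical stubs (names and behavior are my assumptions), and the per-scale resizing and patch-wise processing are omitted:

```python
import numpy as np

# Hypothetical stubs standing in for the trained Stroke Predictor and the
# parameter-free Stroke Renderer; a dummy "stroke set" just nudges the
# canvas halfway toward the target so the loop's effect is visible.
def stroke_predictor(canvas, target):
    return {"delta": (target - canvas) * 0.5}

def stroke_renderer(strokes, canvas):
    return canvas + strokes["delta"]

def paint(target, K=3):
    canvas = np.zeros_like(target)               # start from a blank canvas
    for _ in range(K):                           # K coarse-to-fine scales
        strokes = stroke_predictor(canvas, target)
        canvas = stroke_renderer(strokes, canvas)  # output feeds the next scale
    return canvas

target = np.ones((3, 8, 8))
result = paint(target)
print(np.abs(result - target).mean())  # → 0.125, the gap shrinks each scale
```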
The paper suddenly introduces a new concept here.
Set prediction problem? => Object detection!
See the paper "End-to-End Object Detection with Transformers" (2020) by Facebook AI.
Lack of Data
Unlike object detection, annotated data is unavailable.
The authors propose a novel self-training pipeline, consisting of the following steps, which utilizes synthesized stroke images.
1. Synthesize a background canvas image with some randomly sampled strokes.
2. Randomly sample a foreground stroke set, and render them on the canvas image to derive a target image.
In this way, the predictor predicts the foreground stroke set, and the training objective becomes minimizing the difference between the re-rendered image and the target image.
The optimization is conducted on both stroke and pixel level.
Stroke Based Painting
RNN- and RL-based methods work in a sequential manner.
Object Detection
The recent DETR paper performs set prediction w/o post-processing (such as non-maximum suppression).
Methods
Overall Framework
Paint Transformer consists of two modules: Stroke Predictor and Stroke Renderer.
Their relation can be expressed like:
Ir=PaintTransformer(Ic, It)
Stroke Predictor
Input: It(target image) & Ic(intermediate canvas image)
Generate: Set of parameters to determine current stroke set Sr
Trainability: Contains trainable parameters
Stroke Renderer
Input: Sr & Ic
Output: Resulting image Ir (Sr drawn onto Ic).
Trainability: Parameter-free, differentiable module.
Self-training pipeline (stroke-image-stroke-image)
It uses randomly synthesized strokes, so that we can generate unlimited training data without relying on any off-the-shelf dataset.
Each training iteration is as follows.
1. Randomly sample Sf (foreground stroke set) and Sb (background stroke set)
2. Generate Ic with StrokeRenderer(Sb)
3. Produce It(target image) by rendering Sf onto Ic.
4. Sr=StrokePredictor(Ic,It)
5. Ir=StrokeRenderer(Sr,Ic)
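The five steps above can be sketched with a toy renderer (an axis-aligned rectangle fill standing in for the real brush-based renderer; all names here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_strokes(n):
    # Each stroke: (x, y, h, w, theta, r, g, b), sampled uniformly in [0, 1).
    return rng.random((n, 8))

def stroke_renderer(strokes, canvas):
    # Toy renderer: draws each stroke as an axis-aligned filled rectangle.
    # (The real renderer also rotates by theta and alpha-blends a brush map.)
    H, W, _ = canvas.shape
    out = canvas.copy()
    for x, y, h, w, theta, r, g, b in strokes:
        y0, y1 = int(y * H), int(y * H + h * H / 2) + 1
        x0, x1 = int(x * W), int(x * W + w * W / 2) + 1
        out[y0:y1, x0:x1] = (r, g, b)
    return out

# One self-training iteration (steps 1-5 above):
S_b = sample_strokes(8)                             # 1. background strokes
S_f = sample_strokes(8)                             # 1. foreground strokes
I_c = stroke_renderer(S_b, np.zeros((32, 32, 3)))   # 2. intermediate canvas
I_t = stroke_renderer(S_f, I_c)                     # 3. target image
# 4. S_r = StrokePredictor(I_c, I_t)   <- the trainable module
# 5. I_r = StrokeRenderer(S_r, I_c), then minimize Lstroke(Sr,Sf) + Lpixel(Ir,It)
```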
Training objective
L=Lstroke(Sr,Sf)+Lpixel(Ir,It)
Lstroke is stroke loss, and Lpixel is pixel loss.
Stroke definition and Renderer
A stroke s can be denoted as {x, y, h, w, θ, r, g, b}, and only straight strokes are considered.
Stroke Renderer
- Geometric transformation based (no NN)
- differentiable (enabling end-to-end learning of Stroke Predictor)
- whole process can be achieved by linear transformation
Iout=StrokeRenderer(Iin, S), where S={si}, i=1…n. With a primitive brush Ib and a stroke si, we can draw the stroke as in Fig. 3, obtaining Īb^i (the transformed, colored brush image).
αi is defined as
- the binary mask of si
- a generated single-channel alpha map
- same shape as Īb^i
Denoting Imid^0=Iin, the stroke rendering process is: Imid^i = αi·Īb^i + (1−αi)·Imid^(i−1). The output of the stroke renderer is Iout=Imid^n.
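A minimal sketch of one compositing step, directly following the formula above; the array shapes are my assumptions:

```python
import numpy as np

# One compositing step of the Stroke Renderer:
#   Imid^i = αi · Īb^i + (1 − αi) · Imid^(i−1)
# where Īb^i is the brush image transformed and colored for stroke si,
# and αi is its single-channel alpha mask.
def composite(I_mid_prev, I_b_bar, alpha):
    a = alpha[..., None]          # broadcast the mask over color channels
    return a * I_b_bar + (1.0 - a) * I_mid_prev

canvas = np.zeros((4, 4, 3))                          # Imid^(i−1)
brush = np.broadcast_to([1.0, 0.0, 0.0], (4, 4, 3))   # a red stroke, Īb^i
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                                  # stroke footprint αi
out = composite(canvas, brush, mask)
# inside the mask the canvas takes the stroke color; outside it is unchanged
```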
Stroke Predictor
The goal of stroke predictor is to predict a set of strokes that can cover the difference between Ic and It.
The authors aim to predict only a few strokes while covering most of the difference.
Sr=StrokePredictor(Ic,It)
- Input: Ic,It∈R3×P×P
- 2 CNNs: Extract feature maps as Fc,Ft∈R3×P/4×P/4
- Encoder: Fc, Ft, and a learnable positional encoding are concatenated and flattened as the input of the Transformer encoder.
- Decoder: Use N learnable stroke query vectors as input.
- 2 branches of Fully-connected layers to predict
- S̄r={si}, i=1…N (initial stroke parameters)
- Cr={ci}, i=1…N (stroke confidences)
- convert to a decision di=Sign(ci).
- Sign(x)=1 if x≥0, else 0
- di is used to determine whether a stroke should be plotted in canvas.
- Since Sign is non-differentiable, it takes a special form in the backward phase so that back-propagation is possible.
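A minimal numeric sketch of the forward decision and a straight-through style backward pass; the paper's exact backward form is not given here, so the identity pass-through below is an assumption:

```python
import numpy as np

# Decision di = Sign(ci) in the forward pass. Sign has zero gradient almost
# everywhere, so a straight-through surrogate is one common way to let
# gradients reach the confidence branch (assumption, not the paper's formula).
def sign_forward(c):
    return (c >= 0).astype(np.float64)   # Sign(x) = 1 if x >= 0 else 0

def sign_backward(grad_out):
    # Straight-through: pretend Sign is the identity and pass the gradient.
    return grad_out

c = np.array([-0.3, 0.0, 0.7])
d = sign_forward(c)                            # → [0., 1., 1.]
g = sign_backward(np.array([0.1, 0.2, 0.3]))   # gradients flow unchanged
```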
Loss Function
Loss consists of pixel loss and stroke loss.
Pixel Loss
Lpixel = ‖Ir − It‖1
Stroke Loss
Lstroke = (1/n) · Σ_{i=1}^{n} [ gYi · (λL1·DL1(Xi,Yi) + λW·DW(Xi,Yi)) + λbce·Dbce(Xi,Yi) ]
DL1: L1 distance between stroke parameters; it ignores the different scales of big and small strokes.
DW: Wasserstein distance (related to rotation).
Dbce: BCE of the decisions.
X, Y: the optimal permutation matching predicted strokes to target strokes.
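A toy sketch of the matched-pair part of this loss, assuming the optimal matching is already given; the Wasserstein term DW is omitted, and treating the confidence as a probability is my assumption:

```python
import numpy as np

# Toy stroke loss for already-matched pairs (Xi, Yi): an L1 term on stroke
# parameters gated by the target validity gYi, plus a BCE term on the
# confidences. The Wasserstein term DW and the optimal matching itself
# (a bipartite assignment, as in DETR) are omitted for brevity.
def stroke_loss(pred_params, pred_conf, tgt_params, tgt_valid,
                lam_l1=1.0, lam_bce=1.0):
    eps = 1e-8
    d_l1 = np.abs(pred_params - tgt_params).mean(axis=1)        # DL1 per pair
    d_bce = -(tgt_valid * np.log(pred_conf + eps)
              + (1 - tgt_valid) * np.log(1 - pred_conf + eps))  # Dbce per pair
    return float((tgt_valid * lam_l1 * d_l1 + lam_bce * d_bce).mean())

params = np.array([[0.1, 0.2, 0.3], [0.5, 0.5, 0.5]])
good = stroke_loss(params, np.array([0.99, 0.01]), params, np.array([1.0, 0.0]))
bad = stroke_loss(params, np.array([0.01, 0.99]), params, np.array([1.0, 0.0]))
# confident, correct predictions yield a much smaller loss than wrong ones
```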
Inference
Experiments
Implementation details
- Size P=32
- CNNs: 3 × [Conv-BatchNorm-ReLU] blocks
- Transformer: D=256, 3 layers each for the encoder and the decoder.
- Training time: 4 hours on a 2080 Ti.
Comparison