[ICLR 2022] Denoising Likelihood Score Matching for Conditional Score-Based Data Generation

Elsa Lab
8 min read · Mar 17, 2022

ICLR 2022 Full Paper

Keywords

Score-based generative model, conditional sampling, DLSM

Introduction

Score-based generative models are probabilistic generative models that estimate score functions, i.e., the gradients of the log density of a given data distribution. Following the terminology of the pioneering work, the process of training score-based generative models is called Score Matching (SM), in which a score-based generative model is iteratively updated to approximate the true score function. Recently, the authors in [1] proposed a unified framework based on Denoising Score Matching (DSM) [2] and achieved remarkable performance on CIFAR-10. Their success inspired several subsequent works, which together have made score-based generative models an attractive choice for contemporary image generation tasks.
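
For reference, the score function and the DSM objective of [2] can be written as follows (a minimal sketch in generic notation; the score model s_θ, perturbation kernel q_σ, and noise scale σ are shorthand introduced here):

```latex
% Score function: the gradient of the log density of the data distribution.
\[
  s(x) = \nabla_x \log p(x)
\]

% Denoising score matching (DSM): perturb a sample x into \tilde{x} with Gaussian
% noise of scale \sigma, then regress the score model s_\theta onto the score of
% the perturbation kernel, which is available in closed form.
\[
  \mathcal{L}_{\mathrm{DSM}}
  = \mathbb{E}_{x,\ \tilde{x} \sim q_\sigma(\tilde{x} \mid x)}
    \Big[ \tfrac{1}{2} \big\| s_\theta(\tilde{x}, \sigma)
      - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \big\|_2^2 \Big],
  \qquad
  \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = \frac{x - \tilde{x}}{\sigma^2}.
\]
```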

A favorable aspect of score-based generative models is their flexibility to be easily extended to conditional variants. This characteristic stems from a line of research that utilizes Bayes’ theorem to decompose a conditional score into a mixture of scores. In particular, recent researchers applied this decomposition to class-conditional image generation and proposed the classifier-guidance method. The authors in [3] showed that classifier guidance is able to achieve improved performance on large-scale image generation benchmarks.
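
Concretely, the decomposition referred to here follows Bayes’ theorem, and classifier guidance estimates the two resulting terms with a score model and a classifier (a minimal sketch; the guidance scale λ is a symbol introduced here):

```latex
% Bayes' theorem turns the conditional (posterior) score into the sum of the
% unconditional (prior) score and the likelihood score.
\[
  \nabla_x \log p(x \mid y) = \nabla_x \log p(x) + \nabla_x \log p(y \mid x)
\]

% Classifier guidance: approximate the prior score with a score model s_\theta and
% the likelihood score with the gradient of a classifier p_\phi, optionally scaled
% by a factor \lambda as in the scaling technique of [3].
\[
  \nabla_x \log p(x \mid y) \approx s_\theta(x) + \lambda\, \nabla_x \log p_\phi(y \mid x)
\]
```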

In spite of their success, our analysis indicates that conditional generation methods utilizing a score model and a classifier may suffer from a score mismatch issue, i.e., a situation in which the estimated posterior scores deviate from the true ones. This issue causes the samples to be guided by inaccurate scores during the diffusion process, and may consequently degrade the sampling quality.

To resolve this problem, we first analyze the potential causes of the score mismatch issue through a motivational low-dimensional example. Then, we theoretically formulate a new loss function called the Denoising Likelihood Score Matching (DLSM) loss, and explain how it can be integrated into the current training method.

Methodology

Analysis of the Score Mismatch Issue

To further investigate the score mismatch issue, we first conduct a motivational experiment on the inter-twining moon dataset to examine the extent of the discrepancy between the estimated and true posterior scores. In this experiment, we consider five different methods for calculating the posterior scores, denoted as (a)∼(e):

(a) Base Method: The posterior scores are estimated using a score model trained with the DSM loss [2] and a classifier trained with the cross-entropy loss.
(b) Scaling Method: The posterior scores are estimated in a similar fashion as method (a), except that the scaling technique [3] is adopted to scale the likelihood scores.
(c) Posterior SM Method: The posterior scores for different class conditions, in this case c1 and c2, are separately estimated using different score models trained with the DSM loss.
(d) Ours: The posterior scores are estimated in a similar fashion as method (a), except that the classifier is trained with our proposed loss, which is described below.
(e) Oracle: The oracle posterior scores are computed directly from the conditional variant of the perturbed data distribution, as sketched below.
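
As a concrete sketch of what the oracle computation involves on this toy dataset: assuming the standard Gaussian perturbation kernel q_σ, the perturbed conditional density is a Gaussian mixture centered at the training points of class y, so its score is available in closed form:

```latex
% Perturbed conditional density: the Gaussian kernel q_\sigma applied to the
% N_y data points x_i belonging to class y.
\[
  q_\sigma(\tilde{x} \mid y)
  = \frac{1}{N_y} \sum_{i:\, y_i = y} \mathcal{N}\!\big(\tilde{x};\ x_i,\ \sigma^2 I\big)
\]

% Oracle posterior score: the gradient of the log of this mixture, which can be
% evaluated exactly for a small two-dimensional dataset.
\[
  \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid y)
  = \frac{\sum_{i:\, y_i = y} \mathcal{N}(\tilde{x};\ x_i,\ \sigma^2 I)\,
          \frac{x_i - \tilde{x}}{\sigma^2}}
         {\sum_{i:\, y_i = y} \mathcal{N}(\tilde{x};\ x_i,\ \sigma^2 I)}
\]
```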

Figure 1: The visualized results on the inter-twining moon dataset. The plots presented in the first two rows correspond to the visualized vector fields for the posterior scores of the class c1 (upper crescent) and c2 (lower crescent), respectively. The plots in the third row are the sampled points. Different columns correspond to different experimental settings.

Fig. 1 visualizes the posterior scores and the sampled points based on the five methods. It is observed that the posterior scores estimated using methods (a) and (b) are significantly different from the true posterior scores computed by method (e). This causes the sampled points in methods (a) and (b) to deviate from those sampled using method (e). On the other hand, the estimated posterior scores and the sampled points in method (c) are relatively similar to those in method (e). The above results therefore suggest that the score mismatch issue is severe in the cases of methods (a) and (b), but is alleviated when method (c) is used.

Table 1: The experimental results on the inter-twining moon dataset. The quality of the sampled data for the different methods is measured quantitatively.

In order to inspect the potential causes of the differences between the results produced by methods (a), (b), and (c), we incorporate metrics for evaluating both the sampling quality and the errors between the estimated and oracle scores. The sampling quality is evaluated using the precision and recall metrics. The estimation errors of the score functions are measured by the expected values of D_P and D_L, which are formulated as the Euclidean distances between the estimated scores and the oracle scores. Here, the subscripts P and L denote ‘Posterior’ and ‘Likelihood’, while s_P and s_L denote the estimated posterior score and the estimated likelihood score for a given pair (x, y), respectively.
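
In this notation, the two distances can be written as follows (a sketch, assuming the oracle scores are those of the noise-perturbed distribution q_σ and x̃ denotes a perturbed sample):

```latex
% Posterior-score error: Euclidean distance between the estimated posterior score
% s_P and the oracle posterior score.
\[
  D_P(\tilde{x}, y)
  = \big\| s_P(\tilde{x}, y) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid y) \big\|_2
\]

% Likelihood-score error: Euclidean distance between the estimated likelihood score
% s_L (e.g., the classifier gradient) and the oracle likelihood score.
\[
  D_L(\tilde{x}, y)
  = \big\| s_L(\tilde{x}, y) - \nabla_{\tilde{x}} \log q_\sigma(y \mid \tilde{x}) \big\|_2
\]
```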

In Table 1, it can be seen that the numerical results are consistent with the observations revealed in Fig. 1, since the expected values of D_P are greater for methods (a) and (b) than for method (c), and the evaluated recall values of methods (a) and (b) are lower than those of method (c).

The above experimental clues therefore shed light on two essential issues to be further explored and dealt with. First, although employing a classifier trained with the cross-entropy loss to assist in estimating the oracle posterior score is theoretically feasible, this method may lead to considerable discrepancies in practice. This implies that the score mismatch issue may be the result of inaccurate likelihood scores produced by the classifier. Second, the comparisons between methods (a), (b), and (c) suggest that score matching may be a solution to the score mismatch issue.

Denoising Likelihood Score Matching

In this section, we introduce the proposed denoising likelihood score-matching (DLSM) loss, a new training objective that encourages the classifier to capture the true likelihood score.

As discussed above, a score model trained with a score-matching objective can potentially be beneficial in producing better posterior score estimates. In light of this, a classifier may likewise be enhanced if score matching is involved in its training procedure. An intuitive way to accomplish this is to minimize the explicit likelihood score-matching (ELSM) loss L(ELSM), sketched below.
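
A minimal sketch of this objective (writing the classifier as p_φ, the noise-perturbed sample as x̃, and the perturbed data distribution as q_σ):

```latex
% Explicit likelihood score matching (ELSM): regress the gradient of the classifier's
% log-likelihood onto the true likelihood score of the perturbed distribution.
\[
  \mathcal{L}_{\mathrm{ELSM}}
  = \mathbb{E}_{\tilde{x}, y}
    \Big[ \tfrac{1}{2} \big\| \nabla_{\tilde{x}} \log p_\phi(y \mid \tilde{x})
      - \nabla_{\tilde{x}} \log q_\sigma(y \mid \tilde{x}) \big\|_2^2 \Big]
\]
```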

This loss term, however, requires the true likelihood score, which is computationally unaffordable to obtain during classifier training. To avoid this cost, we follow the derivation of DSM together with Bayes’ theorem and formulate an alternative objective, the DLSM loss L(DLSM).
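
A sketch of the derivation and the resulting objective (assuming the Gaussian perturbation kernel q_σ(x̃ | x) used in DSM): Bayes’ theorem rewrites the likelihood score as the difference between the conditional and unconditional perturbed scores, and the DSM trick replaces the conditional term with the perturbation-kernel score:

```latex
% Bayes' theorem applied to the perturbed likelihood score.
\[
  \nabla_{\tilde{x}} \log q_\sigma(y \mid \tilde{x})
  = \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid y)
  - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})
\]

% DLSM loss: the classifier's likelihood score plus the prior score should match
% the score of the perturbation kernel, as in DSM.
\[
  \mathcal{L}_{\mathrm{DLSM}}
  = \mathbb{E}_{x, \tilde{x}, y}
    \Big[ \tfrac{1}{2} \big\| \nabla_{\tilde{x}} \log p_\phi(y \mid \tilde{x})
      + \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})
      - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \big\|_2^2 \Big]
\]
```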

Since the second term (the prior score) can be approximated by the pre-trained score model and the third term (the perturbation-kernel score) can be computed in closed form, the classifier can be updated by minimizing the approximated variant of the DLSM loss, L(DLSM′), sketched below.
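
A sketch of the approximated objective, where the intractable prior score is replaced by the pre-trained score model s_θ and the kernel score serves as the usual denoising target:

```latex
% Approximated DLSM loss: substitute the pre-trained score model s_\theta for the
% intractable prior score; the remaining target is the closed-form kernel score.
\[
  \mathcal{L}_{\mathrm{DLSM'}}
  = \mathbb{E}_{x, \tilde{x}, y}
    \Big[ \tfrac{1}{2} \big\| \nabla_{\tilde{x}} \log p_\phi(y \mid \tilde{x})
      + s_\theta(\tilde{x})
      - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \big\|_2^2 \Big],
  \qquad
  \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = \frac{x - \tilde{x}}{\sigma^2}.
\]
```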

Following the theoretical derivation above, we next discuss the practical aspects of training, and propose to train the classifier by jointly minimizing the approximated denoising likelihood score-matching loss (DLSM′ loss) L(DLSM′) and the cross-entropy loss L(CE). In practice, the total training objective of the classifier can be written as follows.
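
In its simplest form, the total objective is the sum of the two terms (a sketch; a weighting coefficient between them is a natural variant):

```latex
% Total classifier objective: approximated DLSM loss plus cross-entropy loss.
\[
  \mathcal{L}_{\mathrm{Total}} = \mathcal{L}_{\mathrm{DLSM'}} + \mathcal{L}_{\mathrm{CE}}
\]
```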

Figure 2: The training procedure of the proposed methodology.

Fig. 2 depicts the two-stage training procedure adopted in this work. In stage 1, a score model is updated using the DSM loss L(DSM). In stage 2, the weights of the trained score model are fixed, and a classifier is updated using the total loss L(Total).
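
To make the procedure concrete, below is a minimal PyTorch-style sketch of the two training stages (the tiny networks, the single fixed noise level σ, and all helper names are hypothetical placeholders introduced here, not code from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DATA_DIM, NUM_CLASSES = 2, 2   # e.g., a 2D toy setting such as the moon dataset

# Minimal stand-in networks (hypothetical placeholders, not the paper's architectures).
score_model = nn.Sequential(nn.Linear(DATA_DIM + 1, 128), nn.SiLU(), nn.Linear(128, DATA_DIM))
classifier = nn.Sequential(nn.Linear(DATA_DIM + 1, 128), nn.SiLU(), nn.Linear(128, NUM_CLASSES))

def cond(x, sigma):
    """Append the noise level to the input as a crude noise-conditioning signal."""
    return torch.cat([x, torch.full((x.size(0), 1), sigma)], dim=1)

# ---- Stage 1: update the score model with the DSM loss L(DSM) ----
def dsm_loss(x, sigma):
    x_tilde = x + torch.randn_like(x) * sigma
    target = (x - x_tilde) / sigma ** 2                   # score of q_sigma(x_tilde | x)
    pred = score_model(cond(x_tilde, sigma))
    return 0.5 * ((pred - target) ** 2).flatten(1).sum(1).mean()

# ---- Stage 2: freeze the score model, update the classifier with L(Total) ----
def total_loss(x, y, sigma):
    x_tilde = (x + torch.randn_like(x) * sigma).requires_grad_(True)
    logits = classifier(cond(x_tilde, sigma))

    # Likelihood score: gradient of log p_phi(y | x_tilde) with respect to the input.
    log_lik = F.log_softmax(logits, dim=1).gather(1, y[:, None]).squeeze(1)
    lik_score = torch.autograd.grad(log_lik.sum(), x_tilde, create_graph=True)[0]

    with torch.no_grad():                                 # the score model stays frozen here
        prior_score = score_model(cond(x_tilde, sigma))
    target = ((x - x_tilde) / sigma ** 2).detach()        # denoising target, treated as a constant

    dlsm = 0.5 * ((lik_score + prior_score - target) ** 2).flatten(1).sum(1).mean()
    ce = F.cross_entropy(logits, y)
    return dlsm + ce                                      # L(Total) = L(DLSM') + L(CE)
```

In a full implementation, σ would be sampled per batch from a schedule of noise scales and the networks would use a proper noise-conditioning mechanism; the single fixed σ above is kept purely for brevity.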

Experimental Results and Analysis

Results on CIFAR-10 and CIFAR-100

In this section, we examine the effectiveness of the base method, the scaling method, and our proposed method on the CIFAR-10 and CIFAR-100 benchmarks using several key evaluation metrics. We adopt the Inception Score (IS) and the Fréchet Inception Distance (FID) as metrics for evaluating the overall sampling quality by comparing the similarity between the distributions of the generated images and the real images. We also evaluate the methods using the precision (P), recall (R), density (D), and coverage (C) metrics to further examine the fidelity and diversity of the generated images. In addition, we report the classification accuracy score (CAS) to measure whether the generated samples bear representative class information.

Table 2: The evaluation results on the CIFAR-10 and CIFAR-100 datasets. The arrow symbols ↑ / ↓ indicate that a higher / lower value corresponds to better performance.

Table 2 reports the quantitative results of the above methods. It is observed that the proposed method outperforms the other two methods by substantial margins in terms of FID and IS, indicating that the generated samples bear a closer resemblance to the real data.

Figure 3: The sorted differences between the proposed method and the base method evaluated on the CIFAR-10 and CIFAR-100 datasets for the class-wise P / R / D / C metrics. Each colored bar in the plots represents the difference between our method and the base method evaluated using one of the P / R / D / C metrics for a certain class. A positive difference indicates that our method outperforms the base method for that class.

Another insight is that the base method achieves relatively better performance on the precision and density metrics in comparison to our method. However, it fails to deliver an analogous tendency on the CAS metric. This behavior indicates that the base method may be susceptible to generating false-positive samples, since the evaluation of the P / R / D / C metrics does not involve the class information and thus may fail to penalize samples generated with the wrong class. Such a phenomenon motivates us to further introduce a set of class-wise (CW) metrics, which take the class information into account by evaluating the P / R / D / C metrics on a per-class basis, as sketched below. The evaluation results of these metrics, shown in Fig. 3, reveal that our method outperforms the base method for a majority of classes.
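
For reference, a minimal sketch of how such class-wise metrics can be computed by simply grouping feature embeddings per class. It assumes the compute_prdc helper from the publicly available prdc package of Naeem et al. (an assumption on our part; the exact evaluation protocol may differ from the paper’s):

```python
import numpy as np
from prdc import compute_prdc   # pip install prdc (Naeem et al., 2020); assumed available

def class_wise_prdc(real_feats, real_labels, fake_feats, fake_labels, nearest_k=5):
    """Evaluate precision / recall / density / coverage separately for each class."""
    results = {}
    for c in np.unique(real_labels):
        # Restrict both the real and the generated features to a single class c.
        metrics = compute_prdc(real_features=real_feats[real_labels == c],
                               fake_features=fake_feats[fake_labels == c],
                               nearest_k=nearest_k)
        results[int(c)] = metrics   # keys: 'precision', 'recall', 'density', 'coverage'
    return results
```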

Conclusion

In this paper, we highlighted the score mismatch issue in existing conditional score-based data generation methods, and theoretically derived a new denoising likelihood score-matching (DLSM) loss, a training objective that encourages the classifier to match the true likelihood score. We demonstrated that, by adopting the proposed DLSM loss, the likelihood scores can be better estimated and the negative impact of the score mismatch issue can be alleviated. Our experimental results validated that the proposed method offers benefits in producing higher-quality conditional sampling results on both the CIFAR-10 and CIFAR-100 datasets.

Paper Download

[OpenReview]

[arXiv]

Please cite this paper as follows:

C.-H. Chao, W.-F. Sun, B.-W. Cheng, Y.-C. Lo, C.-C. Chang, Y.-L. Liu, Y.-L. Chang, C.-P. Chen, and C.-Y. Lee, “Denoising likelihood score matching for conditional score-based data generation,” in Int. Conf. on Learning Representations (ICLR), Apr. 2022.

References

[1] Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. In Proc. of Conf. on Neural Information Processing Systems (NeurIPS), 2019.
[2] P. Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
[3] P. Dhariwal and A. Nichol. Diffusion models beat GANs on image synthesis. arXiv preprint arXiv:2105.05233, 2021.

