1 Introduction

Tropical cyclones (TCs) are among the most catastrophic weather events, causing injuries and deaths as well as huge economic losses. TC monitoring and forecasting are therefore among the missions of greatest concern for meteorologists and weather service centers worldwide. The first and critical step in making TC forecasts is intensity estimation, where intensity is defined as the maximum sustained surface wind speed near the TC center (measured in knots, 1 kt \(\approx \) 0.51 ms\(^{-1}\)). Since tropical cyclones usually occur over the open ocean, satellite images are the primary data source for estimating intensity. Traditional methods, such as Dvorak [8], DAV [22], and ADT [21], are based on cloud patterns recognized from satellite images. Recently, many efforts have been made in developing Neural Network (NN) based models [1,2,3, 6, 10, 24, 29, 31] for TC intensity estimation, which has become a promising direction toward more accurate estimates. All of these models are discriminative; they automatically learn useful features from satellite images through various network architectures, data pre-processing methods, or physics-guided feature extraction. Their backbones are often CNN-based regression models that predict numerical TC intensities directly.

In contrast to discriminative models, generative models are trained on a harder task: they must learn a deeper and more sophisticated comprehension of the data in order to synthesize new samples, which improves their potential on discriminative tasks, especially under limited data [11, 20]. Inspired by recent advances in diffusion models, which show a promising ability to synthesize high-quality images following class or text conditions [7, 12,13,14, 26,27,28], a number of studies [4, 5, 17, 18, 23] aim to unleash the potential of diffusion models on discriminative tasks. Among them, Diffusion-TTA [23] uses a pre-trained conditional diffusion model to tune an image classifier at test time and observes accuracy improvements over the original classifier. The two models are attached such that the classifier output serves as the condition to the diffusion model, so the classifier can be adapted by back-propagating the diffusion loss.

It is natural to utilize diffusion models in a similar way on regression tasks to achieve improved TC intensity estimates. Unlike classification tasks, where the predicted attributes are categorical, regression tasks predict ordinal, numerical attributes and face additional challenges. To successfully tune a regressor by gradient descent, the following should hold: given a biased prediction, the gradient of the diffusion loss with respect to the condition points toward the ground truth, reflecting the ordinal nature of the regressed attributes. Existing studies like Diffusion-TTA focus on classification tasks and have not yet examined this problem. Furthermore, the level of the diffusion loss should be tied to the degree to which the condition is biased, so as to encourage the expected gradient. However, conditional diffusion models are typically trained with only correct conditions; they lack penalties for incorrect ones, let alone such “distance awareness”, which can lead to sub-optimal results.

In this paper, we propose a method driven by a contrastive learning enhanced diffusion model that meets the aforementioned challenges and better resolves the tropical cyclone intensity estimation task. The main contributions of this paper are as follows:

1. We propose a test-time adaptation method, Diff-RTTA, to improve the performance of regression models using diffusion models, and observe favorable loss characteristics that lead adaptations toward more accurate predictions.

2. We propose the ConDiff-RTTA method, which enhances the diffusion model with contrastive learning so that it is aware of the distance between true and false conditions, further aligning the model with the regression task.

3. We conduct experiments on a benchmark dataset for TC intensity estimation and observe performance gains with our method, especially on high-intensity tropical cyclones.

The rest of this paper is organized as follows. We first give a brief overview of related work in Sect. 2. Then, we introduce preliminary knowledge on diffusion models and test-time adaptation with diffusion models, and propose our contrastive learning enhanced diffusion model, in Sect. 3. Experimental results of our proposed method on the TC benchmark dataset TCIR are presented in Sect. 4. Finally, we make concluding remarks in Sect. 5.

2 Related Work

Neural Network Models for TC Intensity Estimation. Neural network based models for TC intensity estimation fall into two categories in terms of their outputs: classification models and regression models. Classification models, e.g. [10, 24], output TC categories or intensity ranges instead of numerical intensity values; their performance is inferior to that of regression models [1,2,3, 6, 29, 31] in terms of estimation accuracy measured by RMSE or MAE. For regression models, recent works mostly focus on physics-guided methods, using extra data or features as inputs [2, 31, 32] or designing loss functions informed by TC knowledge [29, 31]. Some works also focus on network design [1, 2], suggesting that the network should not be too deep and should exclude dropout layers.

Diffusion Generative Models for Discriminative Tasks. There have been continuing attempts to unleash the potential of generative models on discriminative tasks, dating back to early studies [11, 20, 25]. With recent advancements in diffusion models, a number of works [4, 5, 17] approach this challenge with a shared idea: a mildly noised image should be denoised best by a diffusion model when given the correct condition. In this light, they transform either class-conditional or text-to-image diffusion models into image classifiers by enumerating classes and converting the corresponding diffusion losses into class probabilities. Diffusion models can also serve as teacher models that optimize dedicated discriminative student models. DreamTeacher [18] distills knowledge from generative models pre-trained on large datasets into a discriminative backbone, which is later trained on small downstream datasets. Diffusion-TTA [23] back-propagates the diffusion loss to a classifier, allowing test-time adaptation to improve the classification accuracy of the discriminative model. Our work is closer to the latter, as we target TC intensity estimation, a domain where data are limited: satellite images of TCs have only been available for the past few decades and must be fully used by both the generative and discriminative models, leaving no room for an upstream/downstream split.

Contrastive Learning to Capture Data Divergence. By contrasting positive samples with negative ones, contrastive learning serves as a powerful tool to capture various forms of divergence in data, such as data mismatches, label differences, or even precise distances between values. For instance, SupCon [15] projects data to positions in the embedding space according to their class labels, and Rank-N-Contrast [30] extends the idea to continuous label values, making embeddings repel each other in proportion to their label distances. CoDi [16] generates tabular data entries consisting of both continuous and discrete parts with two co-evolving diffusion models, and penalizes mismatches between the two parts via contrastive learning. Inspired by these works, we use contrastive learning to make our diffusion model not only capture image-condition mismatches but also be “distance aware” of correct and biased conditions.

3 Methodology

3.1 Preliminaries

Diffusion Models. For an image x sampled from the real data distribution \(x\sim q(x)\), a diffusion model learns to approximate the data distribution by gradually adding noise to x in the diffusion process and predicting the noise in the reverse process. Conditional diffusion models further learn the distribution q(x|c), where c is the condition input corresponding to the image x. The diffusion process, in which noise is added step by step to the original input image x (now denoted \(x_{0}\)) to generate a noised image sequence \(x_{1},x_{2},...,x_{T}\), is formally defined [12] as:

$$\begin{aligned} \begin{aligned} &q(x_{1:T}|x_{0}):=\prod _{t=1}^{T}q(x_{t}|x_{t-1}),\\ &q(x_{t}|x_{t-1}):=\mathcal {N}(x_{t};\sqrt{1-\beta _{t}}x_{t-1},\beta _{t} {\textbf {I}}), \end{aligned} \end{aligned}$$
(1)

where \(\beta _1,\ldots ,\beta _T\) is a variance schedule that controls the noise level. We can further sample \(x_{t}\) directly from \(x_{0}\) using

$$\begin{aligned} x_{t} = \sqrt{\bar{\alpha }_{t}}x_{0} + \sqrt{1-\bar{\alpha }_{t}}\epsilon , \epsilon \sim \mathcal {N}(0,{\textbf {I}}), \end{aligned}$$
(2)

where \(\alpha _t:=1-\beta _t\) and \(\bar{\alpha }_t:=\prod _{s=1}^{t}\alpha _s\).
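For concreteness, Eq. (2) translates directly into code. The minimal sketch below uses a DDPM-style linear \(\beta \) schedule purely for illustration; as noted at the end of this subsection, our actual implementation follows the EDM parameterization instead.

```python
import torch

def sample_xt(x0: torch.Tensor, t: int, betas: torch.Tensor):
    """Closed-form sampling of x_t from x_0 (Eq. 2)."""
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]  # \bar{alpha}_t
    eps = torch.randn_like(x0)                        # eps ~ N(0, I)
    xt = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * eps
    return xt, eps

# Example: linear beta schedule with T = 1000 and a dummy 65x65 image.
betas = torch.linspace(1e-4, 0.02, 1000)
xt, eps = sample_xt(torch.randn(1, 1, 65, 65), t=500, betas=betas)
```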

A denoising network \(\epsilon _{\phi }(x_{t},t)\) learns to predict the noise, taking the noisy image \(x_t\) and the noise level t as inputs. For conditional diffusion models that take c as an additional input during the reverse process, the training diffusion loss is defined as:

$$\begin{aligned} \mathcal {L}_{\text {diff}}(\phi ;\mathcal {D}) = \frac{1}{|\mathcal {D}|}\sum _{(x^{i},c^{i})\in \mathcal {D}} \parallel \epsilon _{\phi }(\sqrt{\bar{\alpha }_{t}}x^{i} + \sqrt{1-\bar{\alpha }_{t}}\epsilon , c^i, t) - \epsilon \parallel ^{2}, \end{aligned}$$
(3)

where \(\mathcal {D}=\{(x^i,c^i)\}_{i=1}^{N}\) is a training batch of N images with their corresponding conditions (labels).

Note that for simplicity, the above formulations follow DDPM [12], a foundational formulation of diffusion models. In our work we follow the EDM framework [14], which includes altered design choices that boost generative ability.

Test-Time Adaptation with Diffusion Models. Test-time adaptation (TTA) refers to a procedure in which a pre-trained model is adapted on unlabeled test data [19]. In the absence of labels, the adaptation can be guided by another model that carries better knowledge of the test data. Diffusion-TTA [23] does this with a pre-trained diffusion model, tuning an image classifier iteratively. First, the classifier infers an initial guess of class probabilities for an image, from which a class condition is synthesized as a weighted mix of class embeddings. Then, a batch of noise at different strengths is added to the image, and the noisy copies are fed into the diffusion model along with the synthesized condition to compute the conditional diffusion loss. Last, the loss gradients are back-propagated to the classifier, updating it to produce new class probabilities for the next iteration. After a specified number of iterations, the classifier has been optimized on the image sample to produce a more accurate classification with the help of the diffusion model, yielding better performance on the test set.
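A minimal sketch of one such adaptation iteration is given below. It is our reading of the recipe just described, not the reference implementation of [23]; `classifier`, `denoiser`, the class-embedding table, and the learning rate are assumed interfaces.

```python
import torch

def diffusion_tta_step(x, classifier, denoiser, class_emb, alphas_bar,
                       n_noise=8, lr=1e-4):
    """One Diffusion-TTA iteration: the probability-weighted mix of class
    embeddings conditions the diffusion model, and the diffusion loss is
    back-propagated to the classifier only (the diffusion model is frozen)."""
    opt = torch.optim.Adam(classifier.parameters(), lr=lr)
    probs = classifier(x).softmax(dim=-1)             # (1, K) class guess
    cond = probs @ class_emb                          # weighted embedding mix
    t = torch.randint(0, len(alphas_bar), (n_noise,))
    a = alphas_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn(n_noise, *x.shape[1:])
    xt = a.sqrt() * x + (1 - a).sqrt() * eps          # noise batch
    loss = ((denoiser(xt, cond.expand(n_noise, -1), t) - eps) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()      # update classifier
    return loss.item()
```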

3.2 Conditional Diffusion Model for a Regression Task

Existing works that use diffusion models for discriminative tasks are limited to categorical conditions, such as one-hot class labels or text embeddings, during the training and inference of diffusion models. This raises the question of whether regression tasks can benefit from conditional diffusion models as well. In our TC intensity estimation task, the intensity is a numerical value ranging from 10 kt to 180 kt. Given the continuous nature of labels in regression tasks, it is impractical to build a generative regressor by enumerating labels as conditions and inferring the target from the corresponding conditional diffusion losses as in [4, 17]. We therefore build our method on top of Diffusion-TTA, which is gradient based.

Toward this goal, we migrate Diffusion-TTA to TC intensity estimation in a simple yet effective fashion. We follow the process of Diffusion-TTA but take the numerical TC intensity as the condition, instead of class text embeddings. First, we train a conditional diffusion model on an open dataset of TCs (described in Sect. 4.1), where the intensity condition is passed through a linear layer, projected to an embedding vector, and consumed by the diffusion model; a sketch of this procedure follows Algorithm 1 below. Then we take a CNN-based TC intensity regression model [31] and conduct TTA on it instance-wise. We denote this method Diff-RTTA; its overall architecture and pseudo code are shown in Fig. 1 and Algorithm 1, respectively.

Fig. 1. Overall Architecture for Test-time Adaptation

Algorithm 1. Test-time Adaptation
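The sketch below is one plausible PyTorch rendering of the procedure above: a linear layer embeds the scalar intensity, and the conditional diffusion loss is back-propagated to the regressor only, with the diffusion model frozen. Module and parameter names are our own illustrative assumptions; the hyper-parameter defaults follow Sect. 4.3.

```python
import torch
import torch.nn as nn

class IntensityCondition(nn.Module):
    """Linear projection of a scalar intensity (kt) to the condition
    embedding consumed by the diffusion U-Net; the width is illustrative."""
    def __init__(self, emb_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(1, emb_dim)

    def forward(self, intensity_kt):
        return self.proj(intensity_kt.float().unsqueeze(-1))  # (B, emb_dim)

def diff_rtta(x, regressor, denoiser, embed, alphas_bar,
              steps=10, n_noise=200, lr=5e-5):
    """Instance-wise Diff-RTTA: the regressor's intensity estimate serves
    as the condition, and the conditional diffusion loss updates the
    regressor while the diffusion model stays frozen."""
    opt = torch.optim.Adam(regressor.parameters(), lr=lr)
    for _ in range(steps):
        c = regressor(x)                                  # (1, 1) intensity
        t = torch.randint(0, len(alphas_bar), (n_noise,))
        a = alphas_bar[t].view(-1, 1, 1, 1)
        eps = torch.randn(n_noise, *x.shape[1:])
        xt = a.sqrt() * x + (1 - a).sqrt() * eps          # noise batch
        cond = embed(c.squeeze(-1)).expand(n_noise, -1)
        loss = ((denoiser(xt, cond, t) - eps) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        return regressor(x)                               # adapted estimate
```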

We see improvements of Diff-RTTA over the regression model (reported in Sect. 4.4), but in this phase of the study what we mainly want to inspect is why a diffusion model can benefit regression tasks at all. To demonstrate it, for every TC image we enumerate the intensity condition as an integer from 10 kt to 180 kt and collect the diffusion losses over the enumeration. The diffusion loss is expected to be minimal at the correct intensity of the TC image. Figure 2 shows the loss enumerations on the test set for TCs of categories CAT1-CAT5 (defined in Sect. 4.1) and for the entire set. The loss curves tend to be U-shaped with the valley near the correct condition (denoted c). With U-shaped loss curves, a biased intensity proposal can be optimized toward the ground truth intensity by gradient descent steps, whereas enumerating possible conditions is far more costly for continuous values.

Fig. 2. Diffusion loss enumerations over conditions by Diff-RTTA: For each TC image, diffusion losses are calculated on each condition enumerated from the range [c-40, c+40], where c is the true condition of the corresponding TC image. The averages of the diffusion losses at each condition offset, over all TC images of categories CAT1-CAT5 and over the entire test set, are shown in (a) and (b), respectively.
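The enumeration used to produce Fig. 2 can be sketched as follows, reusing the interfaces assumed in the earlier sketches; for each candidate intensity the conditional diffusion loss is averaged over a batch of noise draws.

```python
import torch

@torch.no_grad()
def enumerate_losses(x, denoiser, embed, alphas_bar, n_noise=16):
    """Sweep integer conditions 10..180 kt and record the mean conditional
    diffusion loss for each; per Fig. 2 the valley should sit near the true
    intensity. Interfaces are assumptions, not the paper's exact code."""
    losses = []
    for c in range(10, 181):
        t = torch.randint(0, len(alphas_bar), (n_noise,))
        a = alphas_bar[t].view(-1, 1, 1, 1)
        eps = torch.randn(n_noise, *x.shape[1:])
        xt = a.sqrt() * x + (1 - a).sqrt() * eps
        cond = embed(torch.full((n_noise,), float(c)))
        losses.append(((denoiser(xt, cond, t) - eps) ** 2).mean().item())
    return losses  # index 0 corresponds to 10 kt
```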

3.3 Contrastive Learning Enhanced Diffusion Model

Observations on Diff-RTTA indicate that, following the vanilla training procedure, a diffusion model conditioned on numerical values can exhibit our expected characteristic: the loss enumeration curve is U-shaped around the true condition. In other words, the diffusion model denoises noisy images best near the true condition and behaves worse as the proposed condition moves farther away. Nevertheless, we expect this favorable characteristic can be strengthened further, since vanilla training of a conditional diffusion model assumes the conditions are always correct, paying no attention to the relation between true and false conditions or their distances. This is reasonable because such knowledge is hardly useful for pure generation, but it becomes important in the context of our study. We expect that explicitly relating the diffusion loss to condition distances will point the gradient more toward the correct direction and reduce the bias between the loss minimum and the correct condition.

Similar ideas appear in the contrastive learning literature, e.g. [30], which motivates us to explore supervised contrastive learning for this enhancement. Contrastive learning works by contrasting similar (positive) samples with dissimilar (negative) ones. In the TC estimation scenario, for a TC image x with true condition c, a positive-negative pair is defined as:

$$\begin{aligned} \begin{aligned} \text {Positive} : [\text {aug}(x),c_{\text {pos}}], \\ \text {Negative} : [\text {aug}(x),c_{\text {neg}}], \end{aligned} \end{aligned}$$
(4)

where \(\text {aug}(\cdot )\) is a data augmentation function, \(c_{\text {pos}} := c\), and \(c_{\text {neg}}\) is a false condition not equal to \(c_{\text {pos}}\). Given the ordinal nature of our conditions and the local gradients we pursue, we sample the negative condition in a neighborhood of the positive, which also yields a harder negative than an arbitrarily positioned one. It is also commonly observed that regression models exhibit larger estimation errors on high-intensity TCs [1, 2, 29, 31, 32]; we therefore enlarge the sampling neighborhood at high intensities so that negative samples cover the potential error bar. The sampling strategy is defined as

$$\begin{aligned} c_{\text {neg}} = c_{\text {pos}} + \text {rand}(-(\log {c_{\text {pos}}})^2,(\log {c_{\text {pos}}})^2), \end{aligned}$$
(5)

where \(\text {rand}(a,b)\) draws a random number from a uniform distribution on the interval \((a,b)\).
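Eq. (5) renders directly into code. Note that the base of the logarithm is not specified above; the sketch assumes the natural logarithm.

```python
import math
import random

def sample_negative(c_pos: float) -> float:
    """Negative-condition sampling per Eq. 5, assuming the natural log.
    The window widens with intensity: roughly +/- 15.3 kt at c_pos = 50 kt
    but +/- 25.1 kt at c_pos = 150 kt."""
    w = math.log(c_pos) ** 2
    return c_pos + random.uniform(-w, w)
```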

With the defined positive-negative pair, we propose a contrastive loss term in the form of a triplet loss, which is formally defined as

$$\begin{aligned} \mathcal {L}_\text {con} = \max {(\mathcal {L}_{\text {diffpos}} - \mathcal {L}_\text {diffneg} + \text {Margin}(c_{\text {pos}},c_{\text {neg}}), 0)}, \end{aligned}$$
(6)

where \(\mathcal {L}_{\text {diffpos}}\) and \(\mathcal {L}_{\text {diffneg}}\) are the diffusion losses for the positive and negative samples, respectively. In the standard triplet loss, the margin is a constant that keeps the positive away from the negative to a fixed degree. Here, we instead define the margin as a distance-aware function that adjusts the gap between positive and negative losses according to the distance between the corresponding conditions: with a larger distance, the negative loss should exceed the positive loss to a greater extent. The Margin function is defined as

$$\begin{aligned} \text {Margin}(c_{\text {pos}},c_{\text {neg}})=\log {(1+D(c_{\text {pos}},c_{\text {neg}}))}\cdot \mathcal {L}_{\text {diffpos}}, \end{aligned}$$
(7)

where \(D(c_{\text {pos}},c_{\text {neg}})\) is the distance between the true and false conditions, and \(\mathcal {L}_{\text {diffpos}}\) here only provides a value without contributing a gradient. We choose this form so the margin shrinks as \(c_{\text {neg}}\) approaches \(c_{\text {pos}}\). The margin is also proportional to \(\mathcal {L}_{\text {diffpos}}\) because the loss scale differs across conditions, so the margin should be adjusted in a relative manner. With this distance-aware margin, the diffusion model learns to increase the diffusion loss under a false condition in a way that adapts to the condition distance and the loss scale.
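Eqs. (6) and (7) combine into a few lines, with the positive loss detached inside the margin so it supplies a value but no gradient. Taking \(D\) as the absolute condition difference is our assumption, consistent with the ordinal conditions.

```python
import torch

def contrastive_term(l_pos, l_neg, c_pos, c_neg):
    """Triplet-style term of Eqs. (6)-(7): the margin grows with the
    condition distance and is scaled by the detached positive loss."""
    margin = torch.log1p((c_pos - c_neg).abs()) * l_pos.detach()
    return torch.clamp(l_pos - l_neg + margin, min=0.0)
```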

We propose the following contrastive learning enhanced diffusion loss for continued training of the previously trained diffusion model,

$$\begin{aligned} \mathcal {L}_{\text {ConDiff}} = \mathcal {L}_{\text {diff}} + \lambda \mathcal {L}_{\text {con}}, \end{aligned}$$
(8)

where \(\lambda \) is the weight of the contrastive loss, a hyper-parameter. The training procedure modifies the standard conditional diffusion training: in each iteration the batch is doubled to construct a negative half whose conditions are sampled according to Eq. 5, and the doubled batch is fed into the model, which is updated via \(\mathcal {L}_{\text {ConDiff}}\). The contrastive learning enhanced diffusion model is then used in the TTA stage. We denote this improved method ConDiff-RTTA.
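A sketch of the resulting training objective, reusing `sample_negative` and `contrastive_term` from the sketches above; `per_sample_diff_loss` is an assumed helper returning one diffusion loss per sample, and \(\lambda = 0.5\) follows the parameter study in Sect. 4.6.

```python
import torch

def condiff_loss(x, c, per_sample_diff_loss, lam=0.5):
    """ConDiff objective (Eq. 8): double the batch with negative conditions
    (Eq. 5) and add the distance-aware contrastive term."""
    c_neg = torch.tensor([sample_negative(float(v)) for v in c])
    l_pos = per_sample_diff_loss(x, c)        # positive half of the batch
    l_neg = per_sample_diff_loss(x, c_neg)    # negative half
    l_con = contrastive_term(l_pos, l_neg, c, c_neg).mean()
    return l_pos.mean() + lam * l_con
```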

The pipeline for the contrastive learning enhanced diffusion model training phase is shown in Fig. 3. The overall pseudo code for training is shown in Algorithm 2.

Fig. 3. Overall Architecture for Contrastive Enhanced Diffusion Model Training

Algorithm 2. Training of Contrastive Learning Enhanced Diffusion Model

4 Experiments

4.1 Dataset

We use a publicly available benchmark dataset, the Tropical Cyclone Dataset for Image Intensity Regression (TCIR) [1]. TCIR contains TCs in the North Eastern Pacific, the North Western Pacific, and the Atlantic Ocean. The satellite observations in TCIR are derived from two open datasets, GridSat and CMORPH. The best track intensities (IBTrACS) are derived from the Joint Typhoon Warning Center (JTWC) and the Atlantic Hurricane Database (HURDAT2).

As shown in Table 1, we classify TCs according to the Saffir-Simpson Hurricane Wind Scale, which consists of 7 classes, with higher classes representing higher maximum sustained winds. We use a total of 36566 image frames from TCs in 2003-2013 as training data, 3245 frames from TCs in 2014 as validation data, and 7570 frames from TCs in 2015-2016 as test data. Each frame has \(201 \times 201\) pixels with 4 channels per pixel: infrared (IR), water vapor (WV), visible (VIS), and passive microwave rain rate (PMW). In our experiments, we use the IR channel, normalize it to zero mean and unit standard deviation, and resize it to \(65\times 65\) pixels as input.
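A minimal preprocessing sketch consistent with the description above; the TCIR channel order and the per-frame standardization are assumptions for illustration (the original statistics may be computed over the whole training set).

```python
import torch
import torch.nn.functional as F

def preprocess_ir(frame: torch.Tensor) -> torch.Tensor:
    """Prepare one TCIR frame of shape (4, 201, 201): keep the IR channel,
    standardize it, and resize to 65x65 (Sect. 4.1)."""
    ir = frame[0:1]                                    # IR assumed first
    ir = (ir - ir.mean()) / (ir.std() + 1e-8)          # zero mean, unit std
    return F.interpolate(ir.unsqueeze(0), size=(65, 65), mode="bilinear",
                         align_corners=False).squeeze(0)
```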

Table 1. Number of Frames in TCIR from [1]

4.2 Models and Metrics

Regression Model. Achieving SOTA performance on TC intensity estimation requires physics-guided features in the regression network and special post-processing such as sliding windows and rotation ensembles [1, 2, 29, 31, 32]. These techniques are orthogonal to our method and would require substantial extra effort and computational resources. We therefore set our goal as exploring the ability of diffusion models to improve the intensity estimation performance of CNN-based backbone models. We use ResNet-18 [9] as the feature extraction backbone and train it on the TCIR training set with an L2 loss. This regression model achieves performance comparable to the backbone models in [1, 31] on the TCIR validation and test sets. We refer to it as the Regression or Reg Model in the experiments.

Diffusion Models. We use the EDM implementation framework [14] and the U-Net backbone from [28]. Diff-RTTA: a diffusion model trained with our modifications for the regression task as discussed in Sect. 3.2; the trained diffusion model is then used to adapt the Reg Model at test time. ConDiff-RTTA: we fine-tune the above diffusion model with our proposed \(\mathcal {L}_{\text {ConDiff}}\) loss. As in Diff-RTTA, \(\mathcal {L}_{\text {diff}}\) is used during the TTA stage.

Evaluation Metrics. We report the TC intensity estimation accuracy of various models in terms of Root Mean Square Error (RMSE) and Mean Absolute Error (MAE).

4.3 Implementation Details

Models are trained on 8 RTX 4090 GPUs. We train Diff-RTTA on a total of 21M TC images randomly sampled from the training dataset with batch size 256, and continue training on 3M TC images with batch size 128 for ConDiff-RTTA, with the rest of the training settings following EDM defaults. For test-time adaptation, the noise batch size is 200 with 10 adaptation steps, and the Adam optimizer is used with a learning rate of \(5\times 10^{-5}\).

4.4 Overall Performance

Diff-RTTA as a Regression Model. To better understand using the diffusion model alone as a regressor, we test Diff-RTTA without the pre-trained Reg Model, instead using 50 kt as the initial condition for all test TC images. 50 kt is the mean TC intensity of the training set and yields an overall RMSE of 30.39 on the entire test set. The performance of Diff-RTTA with this initialization is shown in Table 2 as Diff-RTTA (50). Even starting from 50 kt, the overall RMSE improves to 14.83, demonstrating the diffusion model's ability as a regressor. In Fig. 4, Diff-RTTA (50) results for each TC image are ordered by the true (IBTrACS) intensity from left to right. The predicted intensities spread along the true conditions, indicating the important fact that correct adaptation directions can likely be found using the diffusion loss as feedback.

Fig. 4. Diff-RTTA (50) results for each TC image

Comparisons to Baselines. The performance of the Reg Model, Diff-RTTA, and ConDiff-RTTA is shown in Table 2. Diff-RTTA improves RMSE by 0.33 over the Reg Model, from 11.22 to 10.89, and ConDiff-RTTA further improves it to 10.76. Although ConDiff-RTTA achieves only mildly better overall results than Diff-RTTA, a detailed inspection reveals that the improvements differ across TC categories, as shown in Fig. 5. ConDiff-RTTA shows more significant improvements over Diff-RTTA as the TC intensity grows, roughly 0.6 to 1.0 over the Reg Model on CAT1-5, where the most destructive TCs reside. We attribute this to the stronger contrastive effect at high intensities due to larger contrastive margins and wider negative sampling windows, which we deliberately designed to enhance the regression model's performance on strong TCs.

Table 2. RMSE and MAE results on TCIR test set
Fig. 5. Improvements on different categories over the baseline Reg Model

4.5 Diffusion Loss Analysis

We show the training curves of \(\mathcal {L}_{\text {diffpos}}\) and \(\mathcal {L}_{\text {diffneg}}\) in Fig. 6 (a). \(\mathcal {L}_{\text {diffpos}}\), the diffusion loss given true conditions, remains low, while \(\mathcal {L}_{\text {diffneg}}\), the diffusion loss given false conditions, increases significantly during training. Diffusion loss enumerations for the diffusion models trained in Diff-RTTA and ConDiff-RTTA are shown in Fig. 6 (b) and (c), for CAT1-CAT5 TCs and the entire test set, respectively. In both figures, the enumeration curves with ConDiff-RTTA (yellow) are sharper than those with Diff-RTTA (blue), and the valley of the ConDiff-RTTA curve shifts closer to the center (the true condition c). These results match our intention to still learn p(x|c) while imposing stronger constraints on false-condition scenarios.

Fig. 6. (a) Training loss curves and diffusion loss enumerations over conditions on (b) CAT1-CAT5 and (c) the entire test set

4.6 Parameter Study

A parameter study is conducted on the validation set for the hyper-parameter \(\lambda \), the weight of our proposed contrastive loss. Figure 7 (a) shows the overall improvement over the baseline Reg Model with \(\lambda \) values of 0.1, 0.5, 1.0, and 2.0; the overall performance improves even with a small \(\lambda \). We select \(\lambda =0.5\) for reporting ConDiff-RTTA results. We also study the number of adaptation steps during test-time adaptation, with results shown in Fig. 7 (b): extending the adaptation steps keeps decreasing the overall RMSE. Since more adaptation steps cost more running time, we stop at 10 steps.

Fig. 7. Parameter study on (a) hyper-parameter \(\lambda \) and (b) adaptation steps

4.7 Case Study

From our test set we select Super Typhoon Meranti, one of the most disastrous typhoons of this century, for our case study. Meranti impacted Southeast Asia and southern China in September 2016, causing numerous deaths and injuries along with massive economic losses. It was recognized as a CAT5 typhoon at its peak.

Figure 8 shows the best track intensities (from IBTrACS) and the models' intensity estimates for Meranti throughout its lifetime. The regression model underestimates the peak intensities, likely due to the rarity of violent typhoons in nature and therefore in the TCIR dataset. In comparison, our proposed ConDiff-RTTA revises the estimates upward, bringing them closer to the IBTrACS values. This case demonstrates that, with the assistance of our contrastive learning enhanced diffusion model, over-fitting in the regression model can be mitigated, resulting in more accurate discriminative estimates on rare data.

Fig. 8. The intensities of Super Typhoon Meranti over its lifetime

5 Conclusion

In this paper, we propose ConDiff-RTTA, a new method to improve TC intensity estimation performance. We find that a TC regression network can be optimized at test time by a diffusion model conditioned on ordinal intensity values instead of the categorical labels used in previous works. Furthermore, we enhance the diffusion model through contrastive training to improve the alignment between diffusion losses and the regression model's prediction errors. Experimental results show that the diffusion model pre-trained on TC satellite images improves TC estimation performance, and that ConDiff-RTTA achieves further overall gains, especially significant on high-intensity TCs.