Dynamically Instance-Guided Adaptation: A Backward-free Approach for Test-Time Domain Adaptive Semantic Segmentation

In this paper, we study test-time domain adaptation for semantic segmentation (TTDA-Seg), where both efficiency and effectiveness are crucial. Existing methods either have low efficiency (e.g., backward optimization) or ignore semantic adaptation (e.g., distribution alignment). Besides, they suffer from accumulated errors caused by unstable optimization and abnormal distributions. To solve these problems, we propose a novel backward-free approach for TTDA-Seg, called Dynamically Instance-Guided Adaptation (DIGA). Our principle is to utilize each instance to dynamically guide its own adaptation in a non-parametric way, which avoids the error accumulation issue and expensive optimization cost. Specifically, DIGA is composed of a distribution adaptation module (DAM) and a semantic adaptation module (SAM), enabling us to jointly adapt the model in two indispensable aspects. DAM mixes the instance and source BN statistics to encourage the model to capture robust representations. SAM combines the historical prototypes with instance-level prototypes to adjust semantic predictions, and can be associated with the parametric classifier to mutually benefit the final results. Extensive experiments evaluated on five target domains demonstrate the effectiveness and efficiency of the proposed method. Our DIGA establishes new state-of-the-art performance in TTDA-Seg. Source code is available at: https://github.com/Waybaba/DIGA.

Figure 1. Top: Illustration of test-time domain adaptive semantic segmentation (TTDA-Seg). Bottom: Comparison of different TTDA methods. The proposed DIGA is a holistic method that has the properties of effectiveness (distribution & semantic adaptation; avoiding unstable training & error accumulation) and efficiency (backward-free).


Introduction
Semantic segmentation (Seg) [3,40,45,46,49] is a fundamental task in computer vision and an important step in vision-based robotics, autonomous driving, and related applications. Modern deep-learning techniques have achieved impressive success in segmentation. However, one serious drawback is that segmentation models trained on one dataset (source domain) may undergo catastrophic performance degradation when applied to another dataset sampled from a different distribution. This phenomenon is even more serious in complex and ever-changing contexts, e.g., autonomous driving. (This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.)
To solve this well-known problem caused by domain shifts, researchers have devoted great effort to domain generalization (DG) [6,11,18,19,21,30] and domain adaptation (DA) [23,47,50]. Specifically, DG aims to learn generalized models with only labeled source data. Traditional DA attempts to adapt the model to the target domain by using both labeled source data and unlabeled target data. However, both learning paradigms have their own disadvantages. The performance of DG is limited, especially when evaluated on a domain with a large gap from the source, since it does not leverage target data [11]. DA assumes that the unlabeled target data are available in advance and can be chronically exploited to improve target performance. This assumption, however, cannot always be satisfied in real-world applications. For example, when driving in a new city, the data arrive sequentially and we expect the system to dynamically adapt to the ever-changing scenario.
To meet real-world applications, [41] introduces test-time domain adaptation (TTDA), which aims at adapting the model during the testing phase in an online fashion (see Fig. 1, Top). Generally, existing methods can be divided into two categories: backward-based methods [1,22,27,37,41] and backward-free methods [15,25,28,33]. The former category (see Fig. 1 (a)) focuses on optimizing the parameters of models with self-supervision losses, such as entropy loss [27,41]. In this way, both distribution adaptation and semantic adaptation can be achieved, which however has the following drawbacks. (1) Low efficiency: due to the requirement of back-propagation, the computation cost is multiplied. (2) Unstable optimization & error accumulation: since the gradient is calculated from a single sample with weak supervision, the randomness can be high, leading to unstable optimization. Although this problem can be mitigated to some extent by increasing the testing batch size, it still cannot be solved well. In such cases, the accumulated errors may lead the model to forget the original well-learned knowledge and thus cause performance degradation.
The second category aims to adapt the model at the distribution level by updating the statistics in batch normalization (BN) [25] layers, which is very efficient as it is directly implemented in forward propagation with a light computation cost. Instance normalization [28] (see Fig. 1 (b)) directly replaces the source statistics with those from each instance, which is sensitive to target variations due to discarding the basic source knowledge and is thus unstable. Mirza et al. [25] (see Fig. 1 (c)) study the impact of updating the historical statistics with instance statistics using a fixed or dynamically fluctuating momentum. However, these methods also suffer from the error accumulation issue caused by abnormal target distributions, as well as the neglect of semantic adaptation, both of which result in inferior adaptation performance.
To this end, we propose a holistic approach (see Fig. 1 (d)), called Dynamically Instance-Guided Adaptation (DIGA), for TTDA-Seg, which takes into account both effectiveness and efficiency. The main idea of DIGA is to leverage each instance to dynamically guide its own adaptation in a non-parametric manner, which is efficient and largely avoids the error accumulation issue. In addition, DIGA is designed holistically, consisting of a distribution adaptation module (DAM) and a semantic adaptation module (SAM). Specifically, in DAM, we compute the weighted sum of the source and current statistics in BN layers to adapt to the target distribution, which enables the model to obtain a more robust representation. In SAM, we build a dynamic non-parametric classifier by mixing the historical prototypes with instance-level prototypes, enabling us to adjust the semantic prediction. In addition, the non-parametric classifier can be associated with the parametric one, which further benefits the adaptation results. Our contributions can be summarized as follows:
• Efficiency. We propose a backward-free approach for TTDA-Seg, which can be implemented within one forward propagation with a light computation cost.
• Effectiveness. We introduce a considerate approach to adapt the model in both distribution and semantic aspects. In addition, our method takes the mutual advantage of two types of classifiers to achieve further improvements.
• Usability. Our method is easy to implement and is model-agnostic, which can be readily injected into existing models (see Fig.2).
• Promising Results. We conduct experiments on three source domains and five target domains based on driving benchmarks and show that our method produces new state-of-the-art performance for TTDA-Seg. We also study the continual TTDA-Seg and verify the superiority of our method in this challenging task.

Related Work
Test-time Domain Adaptation (TTDA) aims to adapt models to the target domain only at test time. It was first proposed in TTT [37] and has been applied to many fields such as instance tracking [9], object detection [16], and reinforcement learning [12]. The early works (TTT [37] and its extensions [20,22]) require an extra training process on source data, making them inapplicable when only the source model is available. In this paper, we focus on the more practical fully test-time domain adaptation setting proposed in [41]. Current fully TTDA methods can be categorized into two main branches: backward-based adaptation and backward-free adaptation. As the pioneer of the self-supervision adaptation methods, TENT [41] proposes to minimize the entropy by updating BN affine parameters during test time. EATA [27] shows that skipping unreliable high-entropy samples achieves higher efficiency and performance; it also utilizes a regularization term on the updates to alleviate the forgetting problem. [1] introduces an efficient framework based on contrastive learning. The problem with these methods is that they cost a long time and large GPU memory due to backpropagation, which largely limits their application to real-time inference. As for the backward-free branch, most approaches focus on BN statistics adaptation. IN [28] directly uses instance statistics, while Momentum [33] and DUA [25] use a running average to update the statistics. This branch is much more efficient, but it only adapts the distribution and ignores semantic adaptation, leading to discounted adaptation power. Besides the above two branches, T3A [14] proposes to denoise the classification results in the post-processing stage, where adaptation of the model itself is not well exploited.

Domain Adaptation for Semantic Segmentation (DASS) aims to bridge the domain gap between the training and testing data.
The early works in DASS mainly focus on building adversarial training architectures to learn domain-invariant features [24,38,43]. Complementary modules have been introduced to facilitate the training [24,38,40,43]. Another category exploits self-training techniques such as entropy minimization [40] and pseudo-labeling [45,46,49,51]. However, these approaches require the co-existence of source and target data. Source-free DA (SFDA) is more practical and closer to our setting, as it assumes the source data is not available during adaptation. [40] proposes to recover source information by utilizing the BN statistics. MAS³ [17] proposes to store source distribution information as prototypes and then use them during adaptation.
[35] uses a multi-head structure to increase the reliability of pseudo-labeling for self-supervised training. Zhao et al. [48] present a special augmentation module to diversify samples with various patch styles at the feature level and then use them to improve generalization ability. However, these methods cannot handle TTDA well. First, they do not consider the efficiency problem [17,35,40,48]. Moreover, they often require visiting samples repeatedly in large batch sizes [17,35,48].

Methodology
Problem Definition. In test-time domain adaptation for semantic segmentation (TTDA-Seg), we are given a segmentation model f_θ: x → y pretrained on a source domain D_S, which will be directly deployed to unseen domains for evaluation. Due to domain shifts, the model f_θ would normally produce poor performance on unseen testing domains. The goal of TTDA-Seg is to adapt the model by utilizing continuously incoming testing data in an online fashion (see Fig. 1). For example, at each testing step t, the model f_θ receives an instance x_t and simultaneously performs adaptation and produces the segmentation prediction ŷ_t. At the next step t+1, the model f_θ performs adaptation and prediction on instance x_{t+1} without access to the previous data x_{1→t}.

Overview
In this section, we propose the Dynamically Instance-Guided Adaptation (DIGA) method for TTDA-Seg, which is backward-free and non-parametric. As shown in Fig. 3, DIGA includes two adaptation modules, the distribution adaptation module (DAM) and the semantic adaptation module (SAM), which are both guided by instance-aware information. Specifically, given a testing sample, we first input it into the source pretrained model and perform distribution alignment with DAM in each BN layer. The distribution alignment is implemented by a weighted sum of the source statistics and the instance statistics. After this, we apply semantic adaptation at the last feature level with SAM, in which we build a dynamic non-parametric classifier by mixing the historical prototypes with instance-aware prototypes in a weighted manner. This allows us to adjust the semantic prediction. Lastly, we obtain the final prediction by taking the mutual advantage of the original parametric classifier and the dynamic non-parametric classifier.
In Fig. 3 (a-g), we illustrate how DIGA helps to adapt the model under the guidance of instance-aware information. (a-d) Due to large domain shifts (e.g., light variations), the segmentation results on the target sample might be poor. After distribution alignment by DAM, the segmentation results are improved, especially for the instances that are similar to the source (e). However, there might still exist poorly recognized pixels that are very different from the source. Our SAM further leverages the reliable pixels to guide the predictions of the other pixels in a non-parametric way (f), enabling us to achieve more accurate results (g).

Distribution Adaptation Module (DAM)
The most common ways to adapt the distribution are based on adversarial training [10,19,23] and minimization of distribution gap metrics [4,36]. However, these methods are not suitable for TTDA due to the limited available training data and the high cost of backpropagation. Recent works [25,28,33] show that the statistics mismatch between domains in the Batch Normalization (BN) layers is a major cause of the performance degradation in cross-domain testing. We thus first revisit the mechanism of BN. Specifically, for each BN layer, given the input feature representation F, the corresponding output is given by:

\hat{F} = \gamma \frac{F - E[F]}{\sqrt{Var[F] + \epsilon}} + \beta, (1)

where γ and β are trainable parameters for scaling and shifting, and E[F] and Var[F] are the expected value and variance of the input feature F. In practice, due to the batch-wise training process, their values are estimated by a running mean [42] during training as follows:

\hat{\mu}^s_t = (1 - \rho) \hat{\mu}^s_{t-1} + \rho \mu_t, \quad (\hat{\sigma}^s_t)^2 = (1 - \rho) (\hat{\sigma}^s_{t-1})^2 + \rho \sigma_t^2, (2)

where \hat{\mu}^s_t and \hat{\sigma}^s_t serve as estimates of E[F] and Var[F] of the source domain, respectively.
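For concreteness, BN inference with frozen per-channel source statistics can be sketched in NumPy as follows; the shapes and names are our assumptions, not the paper's code:

```python
import numpy as np

def bn_forward(F, mu_s, var_s, gamma, beta, eps=1e-5):
    """BN inference: normalize the feature map F with the frozen source
    statistics (mu_s, var_s), then scale/shift with the affine parameters.
    F: (N, C, H, W); mu_s, var_s, gamma, beta: per-channel, shape (C,)."""
    F_hat = (F - mu_s[None, :, None, None]) / np.sqrt(var_s[None, :, None, None] + eps)
    return gamma[None, :, None, None] * F_hat + beta[None, :, None, None]
```

At test time, only `mu_s` and `var_s` encode the source distribution; this is exactly the quantity the next module replaces with a mixed estimate.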
Motivated by the fact that the number of training samples is usually much larger than the testing batch, and the resulting estimates are thus more stable, the final values \hat{\mu}^s_t and \hat{\sigma}^s_t are frozen and serve as the estimates of E[F] and Var[F] for the test data during the test phase.
However, it has been shown that, when the model is applied to a different environment, the source statistics can hamper performance significantly. To solve this problem, DUA [25] proposes to adapt the BN statistics to the target domain with a dynamic momentum scheme. Despite its efficiency, its performance is still not satisfactory. One possible reason is that the updating rate is usually very small, so that instance-level information is not fully considered when evaluating each instance.
Different from [25,33], instead of momentum-updating the running statistics, the proposed Distribution Adaptation Module (DAM) dynamically merges the source and instance BN statistics to constitute the estimates \hat{\mu}^T_t and (\hat{\sigma}^T_t)^2:

\hat{\mu}^T_t = \lambda_{BN} \hat{\mu}^s + (1 - \lambda_{BN}) \mu^T_t, \quad (\hat{\sigma}^T_t)^2 = \lambda_{BN} (\hat{\sigma}^s)^2 + (1 - \lambda_{BN}) (\sigma^T_t)^2, (3)

where \mu^T_t and (\sigma^T_t)^2 are the mean and variance calculated from the t-th instance during testing, and \lambda_{BN} controls the mixing weight.
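A minimal sketch of this statistic mixing; we assume λ_BN weights the frozen source side (our reading of the paper), with `F_t` the feature map of the current instance:

```python
import numpy as np

def dam_statistics(mu_s, var_s, F_t, lam_bn=0.8):
    """Sketch of DAM: blend the frozen source BN statistics with the
    current instance's statistics. lam_bn (lambda_BN) is assumed to
    weight the source side; F_t is the (N, C, H, W) feature of instance t."""
    mu_t = F_t.mean(axis=(0, 2, 3))                 # per-channel instance mean
    var_t = F_t.var(axis=(0, 2, 3))                 # per-channel instance variance
    mu_mix = lam_bn * mu_s + (1.0 - lam_bn) * mu_t
    var_mix = lam_bn * var_s + (1.0 - lam_bn) * var_t
    return mu_mix, var_mix
```

Because the source statistics stay frozen, an abnormal instance perturbs only its own normalization and cannot corrupt later steps, which is how the error accumulation issue is avoided.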

Semantic Adaptation Module (SAM)
The proposed DAM is category-agnostic, since it only aligns the distribution of the feature maps globally. However, category-specific information is also important for segmentation adaptation, because the distribution of each category varies greatly even within the same image. Hence, we argue that it is also important to implement semantic adaptation in TTDA-Seg. To achieve this, two straightforward methods are entropy minimization [41] and pseudo-labeling [22]. However, both of them require gradient-based backpropagation and thus limit testing efficiency. Inspired by the prototype-based methods in few-shot learning [34] and domain adaptation [29,45], we introduce the semantic adaptation module (SAM) for category-specific adaptation.
As shown in Sec. 3.1, even with the distribution alignment of DAM, the model still produces wrong predictions for pixels that are very different from the source. Fortunately, we observe that pixels of the same object share many of the same properties, e.g., the appearance within a car, the texture color within a road, the outfit within a person, the light intensity within an image, etc. Motivated by this, we propose to leverage the similarities between pixels to further guide the recognition of wrongly recognized pixels. To this end, we propose the semantic adaptation module (SAM) to adjust the semantic predictions with dynamic instance-aware prototypes.
The segmentation model f_θ can be separated into two learnable parts: i) an encoder h_φ for dense visual feature extraction, which maps each pixel x^{(h,w)} to a feature z^{(h,w)} ∈ R^D, and ii) a classifier g_ψ for the subsequent prediction, which maps z^{(h,w)} to a distribution \hat{p}^{(h,w)}(c|x) over C classes. Formally, this can be denoted as:

z^{(h,w)} = h_\phi(x)^{(h,w)}, \quad \hat{p}^{(h,w)}(c|x) = g_\psi(z^{(h,w)}).

The value of a logit indicates the confidence of the corresponding class. Thus, the largest value max_c \hat{p}^{(h,w)}_{t,c} of the prediction distribution can be considered as the confidence of the prediction for that pixel. For an input image x_t, we select the pixels whose confidences are larger than P_0 to calculate the centroid of each class in feature space. These centroids are called the instance-aware prototypes q_t and are computed as follows:

q_{t,c} = \frac{1}{|\Omega_{t,c}|} \sum_{(h,w) \in \Omega_{t,c}} z_t^{(h,w)}, \quad \Omega_{t,c} = \{(h,w) \mid \max_{c'} \hat{p}^{(h,w)}_{t,c'} > P_0,\ \arg\max_{c'} \hat{p}^{(h,w)}_{t,c'} = c\}.

Using instance-aware prototypes only may produce unstable predictions due to instance variance. To make the prediction more stable, we additionally maintain a moving average of the prototypes of different instances for each category, called the historical prototypes:

\bar{q}_{t,c} = (1 - \rho_P) \bar{q}_{t-1,c} + \rho_P q_{t,c},

where \rho_P is the momentum updating rate.
Since the historical prototypes are calculated by averaging prototypes from a large number of target instances, they are more stable than instance-aware prototypes.
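The two prototype computations can be sketched as follows (a NumPy sketch with hypothetical helper names; `p0` and `rho_p` correspond to P_0 and ρ_P):

```python
import numpy as np

def instance_prototypes(z, probs, p0=0.9):
    """Average the features of confidently predicted pixels per class.
    z: (H, W, D) pixel features; probs: (H, W, C) class probabilities."""
    C = probs.shape[-1]
    conf = probs.max(axis=-1)                 # per-pixel confidence
    pred = probs.argmax(axis=-1)              # per-pixel predicted class
    protos = np.zeros((C, z.shape[-1]))
    for c in range(C):
        mask = (pred == c) & (conf > p0)      # reliable pixels of class c
        if mask.any():
            protos[c] = z[mask].mean(axis=0)  # class centroid in feature space
    return protos

def update_historical(hist, protos, rho_p=0.1):
    """Momentum update of the historical prototypes."""
    return (1.0 - rho_p) * hist + rho_p * protos
```

In practice, a class with no reliable pixels in the current image would simply keep its historical prototype; the zero fallback above is only for brevity.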
Given the instance-aware prototypes, we obtain the instance-aware prediction p^{(h,w)}(c|x_t, q) for each class by measuring the similarity between each pixel feature and the prototypes:

p^{(h,w)}(c|x_t, q) = \frac{\exp(\mathrm{sim}(z_t^{(h,w)}, q_{t,c}))}{\sum_{c'} \exp(\mathrm{sim}(z_t^{(h,w)}, q_{t,c'}))},

where sim(·,·) denotes the cosine similarity.
The historical prediction p^{(h,w)}(c|x_t, \bar{q}) is obtained in a similar way. By combining the predictions of the two types of prototypes, we form a dynamic non-parametric classifier whose prediction is formulated as:

\tilde{p}^{(h,w)}(c|x_t) = \lambda_P\, p^{(h,w)}(c|x_t, \bar{q}) + (1 - \lambda_P)\, p^{(h,w)}(c|x_t, q_t),

where \lambda_P controls the importance of the two types of prototypes.

Algorithm 1 DIGA (Testing Phase)
Input: Model f_θ, target testing sample x_t. Output: Prediction of x_t.
1. Produce feature z_t and prediction \hat{p}_t(x_t) with the distribution alignment of DAM (Eq. 3).

Classifier Association
At this point, we have two types of predictions: one from the original parametric classifier (\hat{p}) and one from the introduced non-parametric prototype classifier (\tilde{p}). To leverage the mutual benefit between them, we obtain the final prediction as their weighted sum:

p^{(h,w)}_{final}(c|x_t) = \lambda_F\, \hat{p}^{(h,w)}(c|x_t) + (1 - \lambda_F)\, \tilde{p}^{(h,w)}(c|x_t), (10)

where \lambda_F balances the importance of the two classifiers. The overall process of DIGA is shown in Alg. 1.
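Putting the pieces together, a minimal sketch of the non-parametric classifier and the association step; we assume cosine similarity for the prototype classifier, which may differ from the paper's exact choice:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prototype_prediction(z, protos):
    """Non-parametric prediction: cosine similarity between each pixel
    feature and the class prototypes, normalized into a distribution.
    z: (H, W, D); protos: (C, D) -> returns (H, W, C)."""
    zn = z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-8)
    pn = protos / (np.linalg.norm(protos, axis=-1, keepdims=True) + 1e-8)
    return softmax(zn @ pn.T, axis=-1)

def final_prediction(p_param, p_inst, p_hist, lam_p=0.8, lam_f=0.8):
    """Mix the historical and instance prototype predictions (lambda_P),
    then associate with the parametric prediction (lambda_F)."""
    p_proto = lam_p * p_hist + (1.0 - lam_p) * p_inst
    return lam_f * p_param + (1.0 - lam_f) * p_proto
```

Since both inputs are probability distributions and the weights sum to one, the output remains a valid per-pixel distribution.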
Experiments

Evaluation. The mean intersection-over-union (mIoU) is used as the evaluation metric. As in [3,40], for source models pretrained on GTA5 and GTA5+Synthia, we report the mIoU of the 19 shared semantic categories. Due to missing annotations for some classes, we report the mIoU of the 16 shared semantic classes for the model pretrained on the Synthia dataset.

Implementation Details. [8] is used as the backbone. It is worth mentioning that a single, globally consistent parameter set is used for DIGA, which achieves consistently good performance in all experiments. Specifically, we set the momentum updating rates (ρ_P and ρ_BN) both to 0.1. The weights of DAM, SAM, and the classifier association (λ_BN, λ_P, and λ_F) are all set to 0.8. The confidence bar for prototype selection, P_0, is set to 0.9. All experiments are conducted on one RTX3090 GPU.

Comparison with State of the Art
We first compare our method with the state-of-the-art approaches. Generally, the compared methods can be divided into two categories: backward-based methods and backward-free methods.
Backward-based methods: TENT [41] performs adaptation by minimizing the output entropy and updating the learnable parameters of the BN layers. As an extension of TENT [41], EATA [27] proposes to skip the high-entropy samples and only leverage reliable samples during model optimization, which effectively increases testing efficiency. Both of them are initially designed for image classification; we implement them for TTDA-Seg by minimizing the entropy of the pixel-level output. For EATA [27], we skip the high-entropy pixels during optimization. Backward-free methods: IN [28] uses instance statistics to replace the source ones in BN at each testing step. Momentum [33] utilizes the instance statistics to update BN in a momentum-based manner. DUA [25] proposes a decaying strategy to adaptively control the momentum of the BN updates. SITA [15] leverages extra augmented samples to obtain stable instance statistics, which are then mixed with the source statistics.
To make a fair comparison, we implement all the methods with the same source models. Note that, we report the results of the compared methods by selecting the best parameters for each source-target pair. In contrast, in our method, we only use one parameter setting for all experiments to better meet the real-world applications.
The following observations can be made from the results reported in Tab. 1. First, backward-based methods can consistently improve the performance when evaluated on CityScapes. However, the improvements on other target domains are limited or even negative. For example, when using Synthia as the source domain, TENT [41] increases the mIoU from 30.87% to 34.89% on CityScapes while largely reducing the mIoU from 21.01% to 16.99% on BDD100K. This indicates that using self-supervision alone may not be a good choice for TTDA-Seg. Second, except for IN [28], the backward-free methods are generally effective on CityScapes and BDD100K while failing to achieve consistent improvements on the other datasets, even though we have well tuned them for each target domain. On the other hand, IN [28] largely reduces the average mIoU due to ignoring the source statistics. Third, the proposed DIGA consistently improves the mIoUs of the source models in all settings and outperforms all the compared methods by a large margin in most cases. Specifically, DIGA is higher than the best competitor (Momentum [33]) by 4.41%, 7.07%, and 3.4% in average mIoU for the GTA5, Synthia, and GTA5+Synthia settings, respectively. In Fig. 4, we provide a qualitative comparison of the different methods. It is clear that DIGA consistently improves the segmentation results of the source model and outperforms the other state-of-the-art methods. The above observations demonstrate the effectiveness and universality of the proposed method for solving TTDA-Seg.

Ablation Study
We conduct an ablation study to investigate the effectiveness of the components of the proposed DIGA, i.e., the distribution adaptation module, the semantic adaptation module, and the classifier association. Experiments are evaluated on five target domains with the source model pretrained on GTA5. Results are reported in Tab. 2.
Effectiveness of DAM. In the BN branch of Tab. 2, "Historical" indicates directly using the source BN statistics for normalization, which can be regarded as the baseline or source model. "Instance" represents using instance statistics for normalization. Two observations can be made. First, the "Instance" model produces worse performance than the "Historical" model on all target domains, especially on IDD and Cross-City. The average mIoU of "Instance" is 4.49% lower than that of "Historical". This indicates that using instance statistics alone is not suitable for TTDA-Seg. Second, DAM improves the results in most cases and obtains an improvement of 1.48% in average mIoU over the "Historical" model. Specifically, DAM boosts the mIoU by 3.39% and 3.5% on CityScapes and BDD100K, respectively. Even when the gap between "Historical" and "Instance" is large (e.g., 9.25% on IDD), our DAM is not overly deteriorated by the negative impact of "Instance" and still produces results competitive with "Historical", with a marginal gap of 0.82%. These two observations suggest that DAM can effectively merge the guidance of instance knowledge into historical statistics to achieve an effective and stable adaptation process.
Effectiveness of SAM. In the semantic branch, "Instance" indicates the instance-specific prototypes calculated from reliable pixels in the current testing image. "Historical" represents the historical prototypes. Notice that the semantic branch is conducted on top of DAM, where the features for calculating prototypes are obtained after distribution alignment. We can make the following conclusions. First, the "Historical" classifier and the parametric classifier (DAM) achieve very similar average mIoUs. Second, the "Instance" classifier obtains a lower average mIoU than the "Historical" classifier. However, taking a closer look at the results on the five target domains, we find that the "Instance" classifier outperforms the "Historical" classifier on three datasets (CityScapes, BDD and Mapillary). This indicates that the two non-parametric classifiers have particular merits on particular datasets. Third, SAM clearly outperforms both non-parametric classifiers in average mIoU. Specifically, SAM surpasses the "Historical" classifier by 2.36% in average mIoU. Fourth, similar to the BN branch, when the gap between the "Historical" and "Instance" classifiers is large, SAM may not bring improvement, e.g., in the IDD case. However, SAM still maintains high performance without being influenced by the inferior classifier. The above observations verify the appropriateness of using the prototype classifiers and the effectiveness of the proposed SAM across different target domains.

Effectiveness of Classifier Association. With the classifier association, the average mIoU is increased by 4.44% for DAM and 2.09% for SAM. This validates the effectiveness of leveraging the mutual benefit between parametric and non-parametric classifiers.

Continual TTDA-Seg
In real-world applications, such as autonomous driving, the environments are ever-changing and complex. To better simulate such practical scenarios, we design a continual TTDA-Seg experiment. Specifically, the dynamic environment is built from sequentially incoming target domains. The domain stream is "BDD→CC→CS→IDD→MA", which is sorted in alphabetical order for simplicity and run for two rounds. We use the Synthia-pretrained model as the source model and report the mIoU after meeting each target domain. In Fig. 5, we compare our method with TENT [41]. We implement two versions of TENT [41]. TENT-ContinualAdapt: continually adapt the model on the incoming target domains. TENT-SourceAdapt: directly adapt the source-pretrained model on the given domain.
We can observe that "TENT-ContinualAdapt" suffers from significant performance degradation compared to "TENT-SourceAdapt". For example, when testing on the CityScapes dataset, "TENT-ContinualAdapt" is 19% and 67% lower than "TENT-SourceAdapt" in mIoU in the first and second rounds, respectively. This phenomenon can also be observed on other domains. This is mainly because TENT accumulates errors during adaptation, which leads to a worse model. Instead, our DIGA does not have the error accumulation problem and consistently performs well on all domains. This further validates the effectiveness of our method in real-world TTDA-Seg.

Computational Cost
In TTDA-Seg, efficiency is also very important. In Tab. 3, we investigate the computational costs of the different methods on the "GTA5→CityScapes" setting. For the inference time, we report the average time (T_Avg/ms) and the maximum time (T_Max/ms) for each testing sample. In addition, the GPU memory cost is also estimated. We find that the backward-based methods significantly increase the inference time and GPU memory cost. For example, TENT [41] increases the average time from 134ms to 411ms and the memory cost from 3.5GB to 14.5GB. Even though EATA skips the unreliable pixels during optimization and achieves a lower average inference time than TENT, it still introduces a large extra computational cost over the source model. Since SITA [15] uses extra augmented images during testing, its computational cost is doubled. Our DIGA and the other two backward-free methods (Momentum [33] and DUA [25]) incur very limited extra computational cost thanks to their lightweight designs. Moreover, DIGA significantly surpasses Momentum [33] and DUA [25] in mIoU. This experiment suggests that DIGA is an effective and efficient TTDA-Seg method.