## Abstract

Resistance spot welding (RSW) is a widely adopted joining technique in the automotive industry. Recent advances in sensing technology make it possible to collect thermal videos of the weld nugget during RSW using an infrared (IR) camera. The effective and timely analysis of such thermal videos has the potential to enable in situ nondestructive evaluation (NDE) of the weld nugget by predicting nugget thickness and diameter. Deep learning (DL) has been demonstrated to be effective in analyzing imaging data in many applications. However, the thermal videos in RSW present unique data-level challenges that compromise the effectiveness of most pre-trained DL models. We propose a novel image segmentation method for handling the RSW thermal videos to improve the prediction performance of DL models in RSW. The proposed method transforms raw thermal videos into spatial-temporal instances in four steps: video-wise normalization, removal of uninformative images, watershed segmentation, and spatial-temporal instance construction. The extracted spatial-temporal instances serve as the input data for training a DL-based NDE model. The proposed method is able to extract high-quality data with spatial-temporal correlations in the thermal videos, while being robust to the impact of unknown surface emissivity. Our case studies demonstrate that the proposed method achieves better prediction of nugget thickness and diameter than predicting without the transformation.

## 1 Introduction

Resistance spot welding (RSW) is a widely used technique for joining metal sheets. It has the advantages of low cost, high speed, reliability, and simple operation [1]. These merits have led to a wide adoption of this technique in the automotive industry for joining lightweight metals such as aluminum (Al) alloys and lightweight steels [2]. As shown in Fig. 1(a), during the RSW process, two or more metal sheets are clamped together and placed between two water-cooled electrodes. Electrical current passes through the metal sheets, generating heat and creating a molten *nugget* (i.e., spot of welding) at the faying surface. After a specified holding time, the electrical current is shut off to let the nugget solidify [1]. A welding spot is thereby formed.

Lightweight materials are increasingly used in cars and trucks to decrease weight while preserving strength. However, there is a lack of understanding of RSW for joining lightweight alloys due to the mechanical aspects and imperfect operation [1]. Defects commonly occur in the nuggets, compromising the utilization of lightweight parts in industry. As shown in Fig. 1(b), the major defects from RSW include insufficient/no fusion, porosity, expulsion (or excessive indentation, i.e., the ejection of molten metal), and cracks [3]. These defects are usually caused by variations in the electrical current, extra/insufficient holding time, and other uncertainties during manufacturing. The quality of the weld can be implied by the *size* of the weld nugget—*thickness* and *diameter* in particular. Therefore, there is a pressing need for the nondestructive evaluation (NDE) of nugget thickness and diameter, and such evaluation can be used to provide information about possible defects.

The traditional evaluation of RSW nuggets uses destructive methods such as the chisel test and peel test. These methods are time-consuming and costly, and they can only be performed after welding [4]. There is an imperative need to develop in situ NDE methods for RSW. Recent developments in inline sensing technology have enabled real-time acquisition of thermal images of RSW nuggets. An infrared (IR) camera is aimed at the welding spot at a tilted angle and captures thermal images at a high frequency of 100 fps. Figure 2(a) shows the setup of data acquisition and Fig. 2(b) shows selected thermal images in a video. The pixel values in each thermal image reflect the IR radiation of the nugget at the time of data recording. The resulting data form a thermal video that conveys precise, real-time information about the nugget formation process. These data allow us to predict nugget size without destroying the part.

Deep learning (DL) has been demonstrated to be effective in analyzing imaging data in many applications, including manufacturing. A deep neural network was adopted in Imani et al. [5] to learn the geometric variation of additive manufacturing (AM) parts from layer-wise imaging profiles; a convolutional neural network (CNN) was used in Zhang et al. [6] for predicting AM part porosity from the thermal images of the melt pool; a convolutional-artificial NN model was developed in Francis and Bian [7] for predicting AM part distortion. Janssens et al. [8] deployed a CNN on lab-generated thermal videos for machine condition monitoring. These studies revealed the promise of exploiting DL’s learning ability, high accuracy, and real-time prediction for thermal image-based process monitoring and quality prediction.

In this study, we apply DL to thermal videos of the weld nugget for in situ NDE. Specifically, a CNN regression model is developed to predict the *thickness* and *diameter* of the nugget. However, the thermal videos in RSW present unique data-level challenges that compromise the effectiveness of most existing DL models. First, although each thermal video captures the entire RSW process, the useful information about the weld nugget is only available after the electrodes lift to expose the surface of the welded region. Hence, the starting time (i.e., frame index in a video) from which useful information can be extracted needs to be determined. The frames before this starting time are *uninformative* of the nugget and thus should be discarded. Second, the nugget profile can be “blurry.” A nugget has no clear contrast of dents and spikes and no sharp edges or vertices, so it is naturally a “blurry” object. Meanwhile, the IR images may have a limited resolution that further compromises the nugget clarity. Third, there is spatial-temporal correlation in thermal images. Within an image, the pixel values are related to their position in the nugget, implying spatial correlation among pixels; on the other hand, the nugget profile in a thermal video evolves with the timestamp of recording (or frame index), indicating temporal correlation across images. Such spatial-temporal correlation should be preserved in data processing.

Existing studies emphasize the development of new DL models or customization of DL architectures for learning from thermal images, but rarely address these data-level challenges. If not resolved during data processing (before model training), these issues will significantly compromise the learning outcome. In this work, we propose an innovative data processing approach based on image normalization and segmentation, which effectively removes uninformative images and enhances the clarity of patterns in nuggets. The processed thermal images are used to build a CNN regression model that achieves improved performance in nugget quality prediction.

The rest of this paper is organized as follows. Section 2 will introduce the thermal video data from RSW of Boron steel that motivates this study. Section 3 will provide the technical detail of our method. A case study will be presented in Sec. 4 to demonstrate the performance improvement in nugget quality prediction using the proposed method. Section 5 will end the paper with concluding remarks and future research directions.

## 2 Data Description

We obtain in situ thermal videos from a lab implementation of RSW for joining two sets of boron steels: (i) a 2T stack of bare boron steel sheets, 1 mm thickness each; (ii) a 3T stack of Al-coated boron steel sheets, 1 mm thickness for the top and bottom sheets and 2 mm thickness for the middle sheet. A thermal video in (i) consists of 600∼602 frames and one in (ii) consists of 500∼504 frames. Each frame (in both datasets) is a grayscale thermal image of size 61 × 81. Depending on the recording time, the pixel values in the image may have different ranges. For an early frame such as Fig. 2(b1) and (b2), the nugget is blocked by the weld head and thus does not fully appear in the image. Consequently, the pixel values (for IR intensity) are all small (less than 20) in the early frame. For a later frame captured when the nugget is fully formed and stabilized, such as those shown in Fig. 2(b3)–(b5), the pixel values are higher and typically range between 20 and 100. There are 25 videos in dataset (i) and 22 videos in dataset (ii).

Nugget thickness and diameter were measured in the lab using destructive testing methods. Table 1 shows the measurements for the welds corresponding to ten selected videos, where “Dmin” and “Dmax” are the minimal diameter and maximal diameter of the weld nugget, respectively. Each row in Table 1 corresponds to one nugget, whose formation is shown in one video. It then follows that all the thermal images in a video correspond to the same measurements of nugget thickness and diameter.

| Video | Thickness (mm) | Dmin (mm) | Dmax (mm) |
|---|---|---|---|
| 1 | 1.899 | 3.135 | 3.311 |
| 2 | 1.905 | 3.135 | 3.289 |
| 3 | 1.871 | 4.923 | 4.923 |
| 4 | 1.875 | 4.875 | 4.945 |
| 5 | 1.861 | 5.740 | 5.762 |
| 6 | 1.863 | 5.740 | 5.784 |
| 7 | 1.857 | 5.673 | 5.828 |
| 8 | 1.853 | 5.717 | 5.784 |
| 9 | 1.811 | 6.336 | 6.424 |
| 10 | 1.787 | 6.336 | 6.446 |

## 3 Method

This section presents the technical details of the proposed data processing method. It consists of four steps: video-wise normalization (Sec. 3.1), identification of uninformative images (Sec. 3.2), image segmentation (Sec. 3.3), and spatial-temporal instance construction (Sec. 3.4). The processed data are used to train a CNN regression model for predicting the nugget thickness and diameter (Sec. 3.5).

### 3.1 Video-Wise Normalization.

The IR signal values in a thermal image can be noisy due to environmental uncertainties, emissivity (i.e., the effectiveness in emitting energy as thermal radiation [9]) fluctuation, and recording errors. Such noise may distort the nugget profile. Within a thermal video, all the images are associated with the same nugget. The images record the nugget’s formation in temporal changes. By normalizing the images along the timeline, noise and errors in individual frames should be substantially reduced. The true patterns of nugget can be better revealed.

Denote the *n*th thermal video in a dataset by *P*_{n}, *n* = 1, 2, …, and the *t*th image (pixel matrix) in it by $p_n^{(t)} \in \mathbb{R}^{r \times c}$, where *r* and *c* are the number of rows and columns in the pixel matrix, respectively. We propose *video-wise normalization* to normalize all the frames in the video along the timeline. Specifically, we flatten each pixel matrix to a vector, $\tilde{p}_n^{(t)}$, of length *r* × *c* and concatenate all the vectors (of the video) to a matrix:

$$\tilde{P}_n = \left[\tilde{p}_n^{(1)}, \tilde{p}_n^{(2)}, \ldots, \tilde{p}_n^{(T)}\right]^{\top} \in \mathbb{R}^{T \times rc}$$

where *T* is the total number of frames in the video. Next, we normalize each column of pixels in $\tilde{P}_n$, which is equivalent to normalizing the pixel value of a fixed position in the image across all the frames. Let $\tilde{\tilde{P}}_n$ denote the normalized $\tilde{P}_n$. Each row in $\tilde{\tilde{P}}_n$ is converted back to a pixel matrix, i.e., a normalized thermal image of the nugget, expressed as $\tilde{\tilde{p}}_n^{(t)} \in \mathbb{R}^{r \times c}$, *t* = 1, 2, …, *T*. The video-wise normalization procedure is illustrated in Fig. 3.
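As a minimal sketch of the flatten, normalize, and reshape steps described above, the following Python/NumPy snippet assumes a video stored as an array of shape (*T*, *r*, *c*). The text does not state which normalization statistic is used; a per-pixel z-score along the timeline is assumed here purely for illustration, and the function name `videowise_normalize` and the `eps` guard are hypothetical.

```python
import numpy as np

def videowise_normalize(video, eps=1e-8):
    """Normalize every pixel position along the time axis.

    video: array of shape (T, r, c), one thermal image per frame.
    Returns an array of the same shape.
    """
    T, r, c = video.shape
    # Flatten each frame: rows are frames, columns are pixel positions.
    flat = video.reshape(T, r * c).astype(np.float64)
    # ASSUMPTION: z-score per column (the paper does not name the statistic).
    mean = flat.mean(axis=0, keepdims=True)
    std = flat.std(axis=0, keepdims=True)
    normalized = (flat - mean) / (std + eps)  # eps guards constant pixels
    # Convert each row back to an (r, c) pixel matrix.
    return normalized.reshape(T, r, c)

# Synthetic stand-in for one video of dataset (i): 600 frames of size 61 x 81.
rng = np.random.default_rng(0)
video = rng.uniform(0, 100, size=(600, 61, 81))
normalized = videowise_normalize(video)
```

Any column-wise normalization (for example min-max scaling) could be substituted in the marked line without changing the surrounding reshape logic.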

### 3.2 Identification of Uninformative Images.

The normalized thermal videos need to be screened for uninformative images. Any frames containing insufficient information about the nugget should be removed from the analysis. Preliminary inspection (Sec. 2) shows that early frames in a thermal video tend to have low IR radiation intensity because the nugget is blocked by the electrodes. After the electrodes lift, the welded area is exposed, resulting in a sudden increase in the IR intensity captured by the camera. Therefore, the sufficiency of nugget information can be evaluated by thresholding the pixel magnitude in an image: the image is considered informative only if none of its pixels is smaller than a threshold, *q*_{n}.

Let *q*_{n} be the *Q*th percentile of all pixel values in the normalized thermal video $\tilde{\tilde{P}}_n$, i.e., *q*_{n} is the $\lceil Q/100 \cdot rcT \rceil$th smallest pixel value in $\tilde{\tilde{P}}_n$, where *rcT* is the total number of pixels in $\tilde{\tilde{P}}_n$. We then compare the pixels in each normalized image, $\tilde{\tilde{p}}_n^{(t)}$, with *q*_{n}, and preserve the image only if

$$\min_{i,j} \left[\tilde{\tilde{p}}_n^{(t)}\right]_{ij} \ge q_n$$

As mentioned earlier, the stabilized nugget has higher pixel values in thermal images. If all pixels in $\tilde{\tilde{P}}_n$ are sorted from the smallest to the largest, then a recommended value of *Q* is the percentile corresponding to the smallest (normalized) pixel value in the frames of the stabilized nugget. With such parameter selection, the images with insufficient nugget information should be discarded while those with the stabilized nugget should be preserved. We define a set $\Omega_n$ for the indices of preserved frames in thermal video $\tilde{\tilde{P}}_n$.
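The screening rule can be sketched as follows, assuming the normalized video is a NumPy array of shape (*T*, *r*, *c*). The function name `informative_frame_indices` is hypothetical, and frame indices are 0-based in the code whereas the text counts frames from 1.

```python
import numpy as np

def informative_frame_indices(normalized, Q=50):
    """Return the indices of preserved frames and the threshold q_n.

    normalized: array of shape (T, r, c) after video-wise normalization.
    Q: percentile in (0, 100].
    """
    T = normalized.shape[0]
    flat = np.sort(normalized.reshape(-1))
    # q_n is the ceil(Q/100 * rcT)-th smallest pixel value (1-based rank).
    k = int(np.ceil(Q * flat.size / 100))
    q_n = flat[max(k, 1) - 1]
    # A frame is kept only if all of its pixels are >= q_n.
    frame_min = normalized.reshape(T, -1).min(axis=1)
    omega = np.flatnonzero(frame_min >= q_n)
    return omega, q_n

# Toy video: 6 frames of size 2 x 2, frame t filled with the value t.
toy = np.arange(6, dtype=float)[:, None, None] * np.ones((6, 2, 2))
omega, q_n = informative_frame_indices(toy, Q=50)
print(omega.tolist(), q_n)  # [2, 3, 4, 5] 2.0
```

In the toy video, the 12th smallest of the 24 pixel values is 2, so frames 0 and 1 are discarded and the remaining four frames are preserved.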

### 3.3 Image Segmentation.

A major obstacle for improving the learning outcome of DL from RSW thermal images is the blurry nugget profile. In this section, we propose a watershed-like image segmentation method [10] to characterize the nugget profile and elucidate potential defects such as porosity, cracks, and irregular nugget size from nonsegmented images.

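For a threshold level $l_n^{(t)}$, let $X_n(l_n^{(t)})$ denote the set of pixels in the normalized image $\tilde{\tilde{p}}_n^{(t)}$ that lie above the watershed contour at that level. The complementary set of $X_n(l_n^{(t)})$ is denoted by $\bar{X}_n(l_n^{(t)})$, which contains all the remaining pixels in $\tilde{\tilde{p}}_n^{(t)}$. We let $l_n^{(t)}$ be the *L*th percentile of $\tilde{\tilde{p}}_n^{(t)}$, i.e., $l_n^{(t)}$ is the $\lceil L/100 \cdot rc \rceil$th smallest pixel value in $\tilde{\tilde{p}}_n^{(t)}$. We further define a binary segment of the image, in which a pixel takes the value 1 if it belongs to $X_n(l_n^{(t)})$ and the value 0 if it belongs to $\bar{X}_n(l_n^{(t)})$.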

The pixels belonging to set $X_n(l_n^{(t)})$ are then segmented from the rest of the image. Figure 5 displays a sample image and its segments after the thresholding. In the resulting segments, the dark areas (value 1) lie above the watershed contour and the white regions (value 0) lie below it.

There are other methods for drawing watersheds [11,12]. However, in this study, the classic way is simple yet effective in contouring the nugget profile. To clarify interesting patterns in the nugget, we define multiple levels, $l_{n,1}^{(t)}, l_{n,2}^{(t)}, \ldots, l_{n,M}^{(t)}$, for segmenting one image. For each level, we produce an image segment. Eventually, a single thermal image is transformed into *M* segments, each describing the nugget profile at a different altitude. Figure 5 demonstrates the segmentation with *M* = 5 levels (with the superscript (*t*) and subscript *n* omitted from $l_{n,1}^{(t)}, \ldots, l_{n,5}^{(t)}$ for simplicity). If a weld defect (e.g., porosity) or an irregular shape arises in the nugget, the proposed image segmentation method will capture the irregularity in certain segments, given properly selected levels. With the clearly contoured morphology in these image segments, DL models can better learn the regular and irregular nugget profiles, thus making more accurate predictions of nugget thickness and diameter.
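The multi-level thresholding can be sketched as below for a single normalized image. This is an illustration under stated assumptions rather than the exact implementation: the function name `segment_image` is hypothetical, and pixels equal to the level are treated as lying above the contour (a `>=` comparison), which the text does not specify.

```python
import numpy as np

def segment_image(image, levels=(50, 60, 70, 80, 90)):
    """Turn one normalized image into M binary segments.

    image: array of shape (r, c).
    levels: percentiles L_1, ..., L_M used as watershed levels.
    Returns a uint8 array of shape (r, c, M).
    """
    flat = np.sort(image.reshape(-1))
    segments = []
    for L in levels:
        # The level is the ceil(L/100 * rc)-th smallest pixel value.
        k = int(np.ceil(L * flat.size / 100))
        level = flat[max(k, 1) - 1]
        # ASSUMPTION: pixels equal to the level count as "above" (value 1).
        segments.append((image >= level).astype(np.uint8))
    return np.stack(segments, axis=-1)

# Toy 10 x 10 image with pixel values 0, 1, ..., 99.
image = np.arange(100, dtype=float).reshape(10, 10)
segments = segment_image(image)
print(segments.shape)                      # (10, 10, 5)
print(segments.sum(axis=(0, 1)).tolist())  # [51, 41, 31, 21, 11]
```

Higher levels keep fewer pixels, so the stack of segments describes the profile at successively higher altitudes.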

### 3.4 Construction of Spatial-Temporal Instances.

Now, the only remaining challenge in DL-based quality prediction for RSW is the spatial-temporal correlation in thermal videos. In Sec. 3.3, a single image is transformed into *M* segments. These segments together reflect the spatial patterns in the nugget and thus should be considered as one sample. By adopting a CNN regression model, the spatial correlation in the sample should be automatically learned. The remaining question is how to incorporate the temporal correlation, which arises from the evolution of the nugget profile with time. In other words, consecutive frames in a video form a time series of thermal images. If we take a sequence of frames spaced *δ* timestamps apart in the video as one sample, then the temporal correlation is incorporated. Denote the sequence length by *S*, 1 < *S* < *T*, and let the time increment *δ* ≥ 1. Since the IR camera has a high speed, adjacent frames may have similar nugget profiles, and choosing *δ* > 1 can avoid information duplication in the image sequence. The best *δ* value depends on the IR camera frequency: a very high frame rate implies a relatively large *δ*. To simultaneously accommodate the spatial and temporal correlation, we construct a spatial-temporal instance by concatenating the *M* segments of *S* frames, which forms an instance of shape (*r*, *c*, *M* · *S*). Figure 6 demonstrates a spatial-temporal instance for (*r*, *c*, *M* · *S*) = (61, 81, 15) with *S* = 3, *M* = 5, *δ* = 5. The image segments of each thermal image make clear the pattern variations in space; concatenating the 15 segments in a sequence preserves the temporal evolution of the nugget across the three frames. After building one spatial-temporal instance, we move one frame forward and use the next *S* thermal images (spaced *δ* frames apart) in the video to build the next instance.
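A minimal sketch of this sliding construction is given below. It assumes the preserved frames of one video have already been segmented and stacked in temporal order into an array of shape (*N*, *r*, *c*, *M*), and that preserved frames are treated as one consecutive sequence; the function name `build_instances` is hypothetical.

```python
import numpy as np

def build_instances(segmented, S=3, delta=5):
    """Build spatial-temporal instances with a one-frame stride.

    segmented: array of shape (N, r, c, M), preserved frames in time order.
    Returns an array of shape (N - (S - 1) * delta, r, c, M * S).
    """
    N, r, c, M = segmented.shape
    span = (S - 1) * delta  # offset between the first and last frame used
    instances = []
    for start in range(N - span):
        # Pick S frames spaced delta apart and join their segments as channels.
        frames = [segmented[start + s * delta] for s in range(S)]
        instances.append(np.concatenate(frames, axis=-1))
    if not instances:
        return np.empty((0, r, c, M * S), dtype=segmented.dtype)
    return np.stack(instances)

# Toy input: 20 frames of size 61 x 81 with M = 5 segments; frame t holds value t.
N, r, c, M = 20, 61, 81, 5
segmented = np.broadcast_to(np.arange(N)[:, None, None, None], (N, r, c, M)).copy()
instances = build_instances(segmented, S=3, delta=5)
print(instances.shape)                  # (10, 61, 81, 15)
print(instances[0, 0, 0, ::5].tolist())  # [0, 5, 10]
```

Under these assumptions each video contributes *N* − (*S* − 1)*δ* instances, which is why the instance count drops relative to using every raw frame.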

### 3.5 Convolutional Neural Network Regression for Nugget Quality Prediction.

The spatial-temporal instances require customizing a conventional CNN regression model to make it compatible with the input data. For CNN input, a single image (segment) is typically reshaped to a square pixel matrix. In our case, a single image segment has an original shape of (61, 81), which can be readily reshaped to (64, 64) without severe distortion of the information. A spatial-temporal instance then has shape (64, 64, *M* · *S*). Such 3-dimensional (3D) instances would ordinarily be handled by a 3D CNN model. However, to reduce the computational burden, we treat each image segment in the 3D instance as a *channel*, i.e., a source of information to DL models, and use a 2D CNN model to learn from the 3D input. The *filters* of the first convolutional layer in this model are customized to be 3D with a depth of *M* · *S* in order to learn from all the *M* · *S* input channels. Figure 7 provides the architecture of our spatial-temporal CNN regression. It consists of three convolutional layers and two dense layers. Eventually, the input is mapped to the response, **y** = [*Thickness*, *Dmin*, *Dmax*]. The spatial-temporal correlation is considered in the input layer and the first convolutional layer, and the rest of the model structure follows conventional design. The model parameters, e.g., filter size and dropout rate, are determined after fine tuning. For model training, the loss function is mean squared error (MSE) per the convention of regression [13]; the optimizer is chosen to be “Adam” (adaptive moment estimation) for its superior efficiency [14].
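To illustrate how a single first-layer filter of depth *M* · *S* mixes all input channels into one 2D feature map, the following NumPy sketch applies one such filter to a (64, 64, 15) instance. It is not the model of Fig. 7: the 3 × 3 kernel size, the all-ones weights, and the function name `conv2d_single_filter` are placeholders chosen for illustration, since the tuned filter size is not given in the text.

```python
import numpy as np

def conv2d_single_filter(x, w, b=0.0):
    """Valid cross-correlation of a multi-channel input with one filter.

    x: input of shape (H, W, C), here C = M * S channels.
    w: filter of shape (k, k, C); its depth matches the channel count.
    Returns a 2D feature map of shape (H - k + 1, W - k + 1).
    """
    H, W, C = x.shape
    k = w.shape[0]
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sum over the k x k window AND over all C channels.
            out[i, j] = np.sum(x[i:i + k, j:j + k, :] * w) + b
    return out

M, S = 5, 3
x = np.ones((64, 64, M * S))  # placeholder spatial-temporal instance
w = np.ones((3, 3, M * S))    # placeholder filter, depth M * S
feature_map = conv2d_single_filter(x, w)
print(feature_map.shape)      # (62, 62)
```

Because the filter spans every channel, each output value combines spatial information (the window) with temporal information (segments from all *S* frames), while the output itself stays 2D.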

## 4 Case Study

In this case study, we apply the proposed data processing method to the RSW datasets (i) and (ii). For each dataset, we compare the performance of CNN regression (in both training and test phases) when using the processed data to that when using the original data. By “original,” we mean the raw thermal videos, where each frame is reshaped to (64, 64) as an instance for CNN input. When building the spatial-temporal instances, the parameters are *Q* = 50, *M* = 5, *S* = 3, where the five levels are the $(l_1, l_2, l_3, l_4, l_5) = (50\%, 60\%, 70\%, 80\%, 90\%)$ percentiles of a normalized thermal image. We experiment with different time increments, *δ* ∈ {1, 3, 5}, to produce the results.

With the original data, dataset (i) has around 13,750 instances and dataset (ii) has around 11,000. These sizes decrease when constructing spatial-temporal instances. For example, the total number of instances is 1566 for (i) and 1416 for (ii) if *δ* = 3 (other parameters are as given above). To avoid overfitting due to the small data size, we use sixfold cross-validation (CV) in model training. The instances of a dataset are randomly shuffled and assigned to six equal-sized folds without replacement. In one run of model training, one of the six folds is held out as testing data; 80% of the remaining five folds are training data and the other 20% are training-phase validation data. The model is trained for 100 epochs without batching. MSE loss is adopted as the performance metric for both training and prediction.
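The splitting scheme can be sketched as below. The text does not say how the 80%/20% division of the remaining five folds is drawn; here the first 20% of the already shuffled indices is taken as validation data, which is an assumption, as are the function name `sixfold_splits` and the fixed seed.

```python
import numpy as np

def sixfold_splits(n_instances, n_folds=6, val_fraction=0.2, seed=0):
    """Yield (train_idx, val_idx, test_idx) for each cross-validation run."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_instances)   # random shuffle of instance indices
    folds = np.array_split(order, n_folds)  # disjoint folds, no replacement
    for k in range(n_folds):
        test_idx = folds[k]
        rest = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # ASSUMPTION: validation data are the first 20% of the shuffled rest.
        n_val = int(round(val_fraction * rest.size))
        yield rest[n_val:], rest[:n_val], test_idx

# Dataset (i) with delta = 3 has 1566 instances, i.e., 261 per fold.
splits = list(sixfold_splits(1566))
train_idx, val_idx, test_idx = splits[0]
print(len(train_idx), len(val_idx), len(test_idx))  # 1044 261 261
```

For 1566 instances each run therefore trains on 1044 instances, validates on 261, and tests on 261.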

### 4.1 Training Performance.

Figures 8 and 9 show the training performance for datasets (i) and (ii), respectively. In each subplot, the horizontal axis is the number of epochs and the vertical axis is MSE loss. The (blue) curve with dot markers represents training loss, and the (red) curve with triangle markers represents training-phase validation loss. From left to right, the columns of three plots are for conventional CNN regression without data processing, spatial-temporal CNN regression with *δ* = 1, spatial-temporal CNN regression with *δ* = 3, and spatial-temporal CNN regression with *δ* = 5; from top to bottom, the rows of plots are for the first, third, and sixth runs of CV. Note that, even though they share the title “CV1” (or “CV3”/“CV6”), the training/testing instances have different shapes and orders across Figs. 8(a)–8(d) because of the way we process the data (the same holds for Figs. 9(a)–9(d)). The comparison nevertheless remains valid, as it shows the typical training performance across different CV runs.

The column (a) subplots in both Figs. 8 and 9 show volatile training/validation loss. The training loss surges suddenly in Figs. 8(a1) and 9(a1), indicating that the model did not sufficiently learn from the data, resulting in underfitting. The validation loss can also increase and remain high for a couple of epochs, as in Fig. 9(a1–3), implying that the model over-characterized the training set, resulting in overfitting. Such underfitting/overfitting phenomena during model training show that unprocessed thermal videos can cause difficulty for CNN model convergence. By contrast, after processing the data with our proposed method, the model training becomes efficient and smooth: columns (*b*)–(*d*) in both Figs. 8 and 9 show fast model convergence, as demonstrated by the stable, low training/validation loss after epoch 10. With the processed data, dataset (i) has better training performance: the training and validation losses are close after model convergence, indicating no serious overfitting. Spatial-temporal instances with *δ* = 1, 3, or 5 lead to similar performance, so any of them is a satisfactory choice for this dataset. For dataset (ii), certain plots for the processed data, e.g., Figs. 9(b) and 9(c), have larger validation loss, implying some overfitting with *δ* = 1 and *δ* = 3. With *δ* = 5, the validation loss gets closer to the training loss (after model convergence), so spatial-temporal instances built with *δ* = 5 (and all other aforementioned parameter values) are recommended for this dataset.

### 4.2 Prediction Performance.

Evaluation of the prediction (testing) performance is even more crucial. We consider the prediction MSE loss (the smaller the better), averaged over the sixfold CV, to evaluate the overall prediction accuracy. The minimal, mean, median, and maximal prediction MSE are calculated for each run of CV and then averaged. Table 2 shows the values of minimal, mean, median, and maximal average prediction MSE for datasets (i) and (ii). The “Parameter” column shows the parameter in CNN.

| Dataset | Parameter | Min MSE | Mean MSE | Median MSE | Max MSE |
|---|---|---|---|---|---|
| (i) | Conventional | **0.0000** | **0.0836** | 0.0295 | 7.3843 |
| | $\delta = 1$ | 0.0001 | 0.1208 | 0.0221 | 2.9996 |
| | $\delta = 3$ | 0.0001 | 0.1015 | **0.0149** | **1.9122** |
| | $\delta = 5$ | 0.0001 | 0.1089 | 0.0188 | 2.2179 |
| (ii) | Conventional | 0.0004 | 4595.19 | 0.3842 | 4,461,273.40 |
| | $\delta = 1$ | 0.0004 | 0.7198 | 0.0699 | 14.0236 |
| | $\delta = 3$ | **0.0003** | 0.5125 | 0.0505 | 11.7648 |
| | $\delta = 5$ | **0.0003** | **0.3307** | **0.0392** | **8.1566** |

Note: Bold values mark the best performance for each metric.

For dataset (i), as shown in the top half of Table 2, the minimal, mean, and median MSE values are close for the conventional CNN and our spatial-temporal CNN with different *δ* values. All these MSEs are rather low (the largest is a mean MSE of 0.1208 with *δ* = 1), but the maximal average MSE is significantly larger for the conventional CNN (7.3843), which is consistent with its underfitting/overfitting in model training: some predictions were too far from their true values. For our spatial-temporal CNN, the maximal average MSE remains small, at most about 3. Our model with *δ* = 3 achieves the lowest maximal average MSE (1.9122) and is the best option for dataset (i). For dataset (ii), as shown in the bottom half of Table 2, the prediction performance of the conventional CNN is much worse than that of the spatial-temporal CNN: its mean and maximal average MSEs both exceed 4000. With our spatial-temporal CNN, the average MSEs are maintained at a low level (mean MSE below 1), much closer to those in dataset (i). Among the three *δ* values, *δ* = 5 leads to the lowest average MSEs. The desirable prediction performance of *δ* = 5 is consistent with its outstanding training performance.

To supplement the average results, Table 3 further provides the standard deviation (std) of minimal, mean, median, and maximal prediction MSE values across the sixfold CV. The standard deviation measures the variability of a performance metric across the six runs of prediction. If a model is robust, then the model trained with different training sets can achieve similarly good prediction performance, hence a small std value for each prediction performance metric. For dataset (i), all std values are rather small. Our spatial-temporal CNN, trained on instances constructed with *δ* = 3, leads to the lowest std for mean, median, and maximal MSEs, indicating the best model robustness. Dataset (ii), however, shows rather large std when using conventional CNN regression: the std values for minimal and median MSEs are small, but those for the mean and maximal MSEs are extremely large. This phenomenon implies that the model is not robust against extreme instances, i.e., a couple of severe outliers have led to skewed mean and maximal MSEs. The prediction performance of the conventional CNN on dataset (ii) is therefore rather unstable, which is also expected from the severe underfitting/overfitting in training shown in Fig. 9(a). With the proposed spatial-temporal CNN regression, the std values for prediction MSE are reduced to a low level. The best robustness is achieved by the spatial-temporal CNN with instances constructed with *δ* = 3 (or *δ* = 5 if focusing on the min and mean MSEs). Our data processing has effectively improved the training data quality and built a more robust CNN regression model for NDE.

| Dataset | Parameter | Min MSE | Mean MSE | Median MSE | Max MSE |
|---|---|---|---|---|---|
| (i) | Conventional | **0.00005** | 0.05496 | 0.03360 | 0.66647 |
| | $\delta = 1$ | 0.00009 | 0.03106 | 0.01153 | 2.25708 |
| | $\delta = 3$ | 0.00017 | **0.02905** | **0.00393** | **0.57477** |
| | $\delta = 5$ | 0.00006 | 0.04838 | 0.00988 | 0.97715 |
| (ii) | Conventional | 0.00048 | 9757.38 | 0.49556 | 10,168,225.32 |
| | $\delta = 1$ | 0.00036 | 0.11326 | 0.03292 | 2.40647 |
| | $\delta = 3$ | 0.00052 | 0.13074 | **0.01277** | **1.95331** |
| | $\delta = 5$ | **0.00014** | **0.10799** | 0.01497 | 3.31112 |

Note: Bold values mark the best performance for each metric.
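The summaries reported in Tables 2 and 3 can be sketched as follows. The sketch assumes that a prediction error is available for every test instance in every CV run (for example, the squared error averaged over the three responses), and it uses the population standard deviation across runs; both are assumptions, since the text does not state these details. The function name `summarize_cv` is hypothetical and the numbers are toy values, not results from the case study.

```python
import numpy as np

def summarize_cv(per_fold_errors):
    """Average and std of the min/mean/median/max error across CV runs.

    per_fold_errors: list of 1D arrays, one array of per-instance
    prediction errors for each CV run.
    """
    names = ("min", "mean", "median", "max")
    # One row per CV run: [min, mean, median, max] of that run's errors.
    stats = np.array([[e.min(), e.mean(), np.median(e), e.max()]
                      for e in per_fold_errors])
    average = dict(zip(names, stats.mean(axis=0)))
    # ASSUMPTION: population std (ddof=0) across the runs.
    std = dict(zip(names, stats.std(axis=0)))
    return average, std

# Toy errors for two runs (illustrative only).
folds = [np.array([1.0, 2.0, 3.0]), np.array([3.0, 4.0, 5.0])]
average, std = summarize_cv(folds)
print(average["max"], std["max"])  # 4.0 1.0
```

A single extreme instance in one run raises both the average and the std of the maximal error, which is the pattern described above for the conventional CNN on dataset (ii).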

### 4.3 Discussion and Recommendation.

Both overfitting and underfitting can lead to poor model performance. To limit overfitting/underfitting, we recommend that the number of weld nuggets (equivalently, their in situ thermal videos) used for model training be no fewer than in the case study presented here, i.e., 20∼25. If an inline sensor with lower speed (e.g., <100 fps) is used, more nugget videos should be collected to form the training set.

In our case study, the proposed method is applied on the two RSW datasets separately, resulting in two models although they follow the same framework. Since the two datasets come from two different experimental conditions, as explained in Sec. 2, the underlying physics differ. Therefore, we recommend developing one spatial-temporal CNN model for each type of experiment.

Another thing worth mentioning is the “in situ” manner in NDE. When the proposed data processing method is adopted, incoming new thermal videos are first processed with video-wise normalization, uninformative image removal, image segmentation, and spatial-temporal instance construction. The processed new data are fed to the spatial-temporal CNN for NDE of nugget thickness and diameter. The processing time is short (typically less than 1 min for a video) and will not increase the computational burden or compromise the timeliness of NDE.

In online prediction with a trained spatial-temporal CNN, a plausible way to further improve NDE efficiency is drawing *S* raw thermal images with *δ* increment from a new video to construct a single spatial-temporal instance. Since a video is for only one nugget, with a robust spatial-temporal CNN regression model, one instance suffices for predicting the nugget thickness and diameter accurately.

## 5 Conclusion

In this study, we proposed an innovative data processing method to improve the prediction performance of CNN regression with thermal videos of RSW nugget. Normalization and watershed image segmentation were explored for resolving the data-level challenges posed by thermal videos, i.e., uninformative images, blurry nugget profile, and spatial-temporal correlation. Spatial-temporal instances were constructed using the proposed method and fed to a spatial-temporal CNN regression model, which was demonstrated to result in significantly more accurate prediction for the nugget thickness and diameter.

This work has multiple technical contributions. First, it establishes an effective, systematic way of improving noisy, blurry thermal imaging data for better learning outcomes in DL-based NDE, a topic that was previously underexplored. Second, the work provides a reference and performance benchmark for subsequent studies about NDE with RSW thermal videos. The case study data had limited quality, but our method achieved satisfactory NDE performance on it, indicating a promising direction for enhancing DL-based NDE performance. Third, the proposed method can be extended to weld defect detection by incorporating defect information such as cracks and porosity in model training. Fourth, the proposed data processing method is readily generalizable to various RSW applications and can guide existing DL-based NDE practice.

## Acknowledgment

This article was supported in part by the US Department of Energy, in part by the Office of Nuclear Energy (Advanced Methods for Manufacturing Program), and in part by the AI Initiative at Oak Ridge National Laboratory.

## Conflict of Interest

There are no conflicts of interest.

## Data Availability Statement

The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request. Data provided by a third party listed in Acknowledgment.

## Nomenclature

*c* = number of columns in a pixel matrix (image)

*r* = number of rows in a pixel matrix (image)

*t* = image index, *t* = 1, …, *T*

*M* = number of levels in watershed segmentation

*S* = number of raw images in a spatial-temporal instance

*T* = the number of images (or image frames) in a thermal video

**y** = response variable (label) in DL model

*q*_{n} = threshold for removing uninformative images

*P*_{n} = the *n*th thermal video in a dataset

$l_n^{(t)}$ = threshold for watershed segmentation for an image

$p_n^{(t)}$ = the *t*th pixel matrix (image) in the *n*th video

$\tilde{P}_n$ = the *n*th thermal video after pixel matrix flattening

$\tilde{\tilde{P}}_n$ = the *n*th thermal video after normalization

$\tilde{p}_n^{(t)}$ = the *t*th flattened pixel matrix in $\tilde{P}_n$

$\tilde{\tilde{p}}_n^{(t)}$ = the *t*th normalized image in $\tilde{\tilde{P}}_n$

*X*(·) = set of pixels in an image above the watershed contour

*δ* = timestamp increment between each of the *S* images

$\Omega_n$ = set of images in the *n*th video containing the nugget