Target Detection of Diamond Nanostructures Based on Improved YOLOv8 Modeling (2024)

1. Introduction

The relationship between microstructure and performance of electrode materials, such as boron-doped diamond (BDD), has been a foundational question in instrument science and sensor technology. BDD films have been widely used in electrochemical sensing because of their excellent mechanochemical stability, wide electrochemical windows, low background currents, and excellent resistance to biological fouling [1]. Recent studies have shown that the surface structure and morphology of BDD film electrodes are key factors in determining their macroscopic sensing performance, for example in seawater salinity sensing. Hence, the nano-structuring of electrode materials has drawn extensive attention from both scientific and technological points of view. However, previous researchers have mainly interpreted morphology and microstructure by analyzing SEM photographs, a process strongly influenced by the subjective judgment of the analyst. With the development of deep learning, target detection and feature extraction over large-area and complex surface structures can be carried out under more objective, data-based evaluation criteria [2]. Therefore, target detection by deep learning, as the foundation and bridge for understanding, optimizing, and applying diamond nanostructures, plays an important role in promoting automatic analysis and subsequent parameter acquisition and is thus of great research significance.

The effectiveness of deep learning has been verified in various fields, such as computer vision, speech recognition, and natural language processing [3]. Further, deep convolutional neural networks (DCNNs) have pushed the performance of computer vision systems to a higher level in tasks including image classification [4] and object detection.

For the target detection of nanostructures by deep learning, several technological difficulties must be overcome. Firstly, diamond nanostructures must be photographed with a scanning electron microscope because of their small size, which yields grayscale images with unclear object boundaries. Secondly, during fabrication by physical and chemical etching, the nanostructures occlude each other owing to their high number density and high length-to-diameter ratio. Thirdly, images acquired at an oblique angle are noisy, which further increases the difficulty of target detection.

Deep learning algorithms can overcome all of these obstacles, as well as image noise and structural defects, after training on suitable datasets [5]. The majority of current target detection algorithms fall into two distinct categories: two-stage and one-stage methods. Two-stage methods, exemplified by the RCNN (regions with CNN features) family [6,7], involve the extraction of candidate frames followed by classification and non-maximal suppression, which refines the prediction. While two-stage detectors exhibit certain advantages in detection accuracy, they also face challenges in training, detection speed, and optimization. In contrast, single-stage methods, exemplified by the SSD (single-shot multi-box detector) [8,9] and the YOLO (you only look once) [10,11] series, are becoming increasingly prevalent, particularly in real-time object detection. Following the proposal of YOLOv1 by Redmon [12] to address the computational complexity of two-stage algorithms, subsequent researchers have developed YOLOv2, YOLOv3, YOLOv4 [13], YOLOv5 [14], YOLOv7 [15], and YOLOX [16], achieving promising results. It must be mentioned that targets in these natural scenes are multi-scale with prominent color features.

However, diamond nanostructures in SEM images lack color features and complete information due to occlusion. Therefore, additional mechanisms should be incorporated into conventional detection algorithms to accurately detect diamond nanostructures in SEM images.

Recent research on target detection in SEM images of microstructures provides important theoretical guidance and algorithm optimization for the detection of diamond nanostructures. Okuyama et al. proposed an automated analysis method for SEM images of semiconductor device cross-sections using deep learning algorithms [17]. This method comprises two image recognition tasks: target detection to determine the coordinates of each cell of the pattern, and semantic segmentation to obtain the boundaries of each region (mask, substrate, and background). By combining the results of these two tasks, typical feature lengths, such as width and depth, can be measured accurately and immediately, with an extraction speed 240 times faster than manual measurement. However, the proposed algorithm only detects individual semiconductor devices and is not applicable to targets that occlude one another. Dengiz et al. used a fuzzy logic and neural network approach to automatically detect grain boundaries in microstructure images (noisy data obtained by optical microscopy) of high-temperature alloy steels during the sintering process [18]. Albuquerque et al. [19] efficiently segmented and quantified microstructures in nodular, gray, and malleable cast iron images based on multilayer perceptron neural networks. Mulewicz et al. [20] classified different types of microstructural images of metals obtained by optical microscopy using deep convolutional neural network (DCNN) image analysis techniques. Adachi et al. [21] used three typical convolutional neural networks, LeNet5, AlexNet, and GoogleNet, to recognize the microstructure of steel, verified their recognition accuracy, and investigated the effect of learning rate, dropout rate, and average image subtraction on recognition accuracy, demonstrating the effectiveness of deep learning for classifying microstructures. However, all of these algorithms were based on samples imaged at a flat overhead angle and do not apply to SEM images taken at an oblique angle. Moreover, most target detection methods for SEM images of common materials suffer from low accuracy, and there is still no precedent for applying deep learning to the target detection of diamond nanostructures with low image quality.

To solve the above problems, a target detection model for SEM images of diamond nanostructures, called DWS-YOLOv8, is proposed using YOLOv8n as the backbone network. By replacing modules in the backbone network to fuse multiscale feature information, and by adding an attention mechanism and a noise-robust loss function to improve target localization, the target detection performance for diamond nanostructures is enhanced.

2. Methods

Since there is no public dataset of diamond nanostructures, this paper starts by preparing diamond nanostructures and gradually building the dataset used in this research. The proposed DWS-YOLOv8 model is then optimized on the basis of the original YOLOv8n model in three respects: network structure, attention mechanism, and loss function.

2.1. Producing Diamond Nanostructure Datasets

Materials and Experimental Methods

The BDD films (10 μm) were prepared using a hot filament chemical vapor deposition (HFCVD) system [22]. The deposition conditions were as follows: chamber pressure of 3–5 kPa, hot wire temperature of 2400–2500 °C, substrate temperature of 850 °C, doping concentration of 8000 ppm (ppm B/C in the gas phase), and growth rate of 2.5 μm/h. Combined with a porous anodized aluminum oxide template, a 200 nm hard mask layer was prepared by e-beam evaporation of 4N (99.99%) nickel metal pillars. Self-organized Ni nanoparticles were obtained after a 4 h vacuum anneal at 700 °C. Next, inductively coupled plasma (ICP) etching was performed on the BDD layer for 60 min [23] using the RF plasma equipment ICP601 (Beijing Chuangshiweina Technology, Beijing, China). The plasma was ignited in an O2 + Ar atmosphere at a pressure of 3 Pa, with the gases introduced into the chamber in a 10:1 ratio, and an RF power of 600 W was used to excite the plasma. After the plasma structuring process, the experimental results revealed that the diamond layer was covered by a re-deposited Ni nanoparticle layer, which served as an etching mask. The as-etched samples were analyzed by scanning electron microscopy (SEM, Hitachi 8100, 45° viewing angle). A dataset for the improved target detection algorithm was constructed from the photographs obtained by SEM.

2.2. Improvements to the YOLOv8 Model

The YOLO model has had tremendous success in computer vision. Building on this, researchers have improved the method by introducing new modules, leading to the development of many classic models. Compared to outstanding earlier models of the YOLO series, such as YOLOv5 and YOLOv7, YOLOv8 is characterized by higher detection accuracy and speed [24]. The core of the YOLOv8 detection algorithm is the combination of feature extraction, feature fusion, and various other feature processing methods to enhance its object detection capability.

2.2.1. Deformable Convolutional DCN_C2f Network Architecture

The design of the YOLOv8 backbone network aims to effectively extract and fuse multiscale feature information. With the development of network architectures, classical backbone networks, such as ResNet, EfficientNet [25], MobileNet [26] and ShuffleNet [27], have been used in YOLOv8 to extract richer feature representations for tasks such as image classification, object detection, and segmentation.

In the backbone network of YOLOv8, the C2f module is responsible for feature extraction from input images [28]. It employs a deep residual network structure and deep convolutional neural networks (CNNs) to extract feature representations. By stacking multiple convolutional layers, the module extracts image features while utilizing residual and skip connections to fully exploit feature information at different levels, thereby enhancing the multiscale representation of features. However, stacking this module leads to excessive redundancy in the channel information. In addition, because the receptive field of the network is restricted by the fixed geometry of the standard convolutional kernels used in the C2f module, only local object information is captured [29]. As a result, missed detections are common when dealing with multiscale, multitarget, or occluded diamond nanostructures in SEM images. Moreover, for small-target detection tasks such as diamond nanostructures, the C2f module is overly large, computationally intensive, and insufficiently lightweight.

To overcome these challenges, experiments were conducted to improve the detection accuracy. The second, third, and fourth C2f modules in the backbone network of the original YOLOv8 were replaced with DCN_C2f modules. The deformable convolution module (DCN_C2f) expands the receptive field of the feature map and concentrates on the region of interest, which makes the features extracted from diamond-nanostructured targets more discriminative. Its application to remote sensing images [30] demonstrated the effectiveness of DCN_C2f in enhancing the target localization ability of the algorithm, and it is therefore expected to detect nanostructures in SEM images effectively and accurately. Through this improvement, the receptive field of the network is adjusted adaptively during sampling, which effectively increases the sensitivity of the network to irregular shapes and improves the detection performance for targets of different scales and irregular shapes.

Deformable convolution introduces an offset for each sampling point of the convolution kernel, which allows adaptive sampling beyond the constraints of a regular grid. The structure of the deformable convolution layer is shown in Figure 1, and a comparison between deformable and standard convolution is shown in Figure 2. The deformable convolution layer in the model has two tasks: first, to obtain offset values from the input feature map using a convolution, and second, to derive the output feature map from the input feature map and the obtained offsets [31]. The deformable convolution is given by:

$$y(p_0) = \sum_{p_n \in \Omega} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \quad (1)$$

In Equation (1), $p_0$ represents the pixel position in the feature map, and $\Omega$ corresponds to the regular grid of the convolution kernel, illustrated here with a 3 × 3 convolution kernel as an example:

$$\Omega = \{(-1, -1), (-1, 0), \ldots, (0, 1), (1, 1)\}$$

$p_n$ enumerates the positions in $\Omega$, and $\Delta p_n$ ($n = 1, \ldots, N$, with $N = |\Omega|$) is the learned offset, predicted as a feature map of shape $H \times W \times 2N$, where the factor 2 accounts for the two-dimensional offset components $(\Delta x, \Delta y)$. Here, $x$, $y$, and $w$ denote the input feature map, the output feature map, and the sampling position weight, respectively. However, since the offset $\Delta p_n$ is usually fractional, the pixel value of the input feature map $x$ cannot be read directly. In general, it is obtained by bilinear interpolation, expressed as follows:

$$x(p) = \sum_{q} G(q, p) \cdot x(q)$$

In this equation, $p$ represents an arbitrary (fractional) sampling position, $p = p_0 + p_n + \Delta p_n$; $q$ enumerates the integral spatial positions in the feature map $x$; and $G(\cdot, \cdot)$ is the bilinear interpolation kernel, expressed as follows:

$$G(q, p) = g(q_x, p_x) \cdot g(q_y, p_y)$$

where $g(q, p) = \max(0, 1 - |q - p|)$.
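
For illustration, the sketch below builds a deformable bottleneck of this kind on torchvision's DeformConv2d, which carries out the offset sampling and bilinear interpolation described above internally. The module name, channel sizes, and residual wiring are our assumptions for a minimal example, not the authors' exact DCN_C2f implementation.

```python
# Illustrative sketch (not the authors' code): a C2f-style bottleneck whose
# 3x3 convolution is replaced by a deformable convolution.
# Assumes torchvision >= 0.8 for torchvision.ops.DeformConv2d.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBottleneck(nn.Module):
    """3x3 deformable convolution with a learned per-pixel offset field."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Offset branch: predicts 2 * K * K values (dx, dy for every kernel tap).
        self.offset_conv = nn.Conv2d(
            channels, 2 * kernel_size * kernel_size, kernel_size, padding=pad
        )
        nn.init.zeros_(self.offset_conv.weight)  # start as a regular convolution
        nn.init.zeros_(self.offset_conv.bias)
        self.deform_conv = DeformConv2d(channels, channels, kernel_size, padding=pad)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offset = self.offset_conv(x)        # Delta p_n for every position p_0
        y = self.deform_conv(x, offset)     # Eq. (1): sampling at p_0 + p_n + Delta p_n
        return x + self.act(self.bn(y))     # residual connection, as in C2f

x = torch.randn(1, 64, 80, 80)
print(DeformBottleneck(64)(x).shape)  # torch.Size([1, 64, 80, 80])
```

Initializing the offset branch to zero makes the layer start out as a standard convolution, so training only deforms the sampling grid where the data demand it.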

2.2.2. Shuffle Attention Mechanism

Attention mechanisms are techniques that mimic the allocation of human attention by allowing a model to focus on the parts of the input data that are relevant to the task at hand while ignoring the rest. Based on where they are applied, attention mechanisms fall into three main categories: spatial domain, channel domain, and mixed domain [32]. In computer vision, attention mechanisms are used in tasks such as image classification, target detection, and image generation. An attention mechanism allows the model to allocate different amounts of attention to different regions of the image during classification, thus improving classification accuracy [33].

Due to the limited contrast between background and target regions in images acquired by SEM, coupled with the presence of various types of noise, many detection models are limited by poor background suppression and noise resistance. To improve the detection model's focus on the critical information in the input features while minimizing attention to the background, an attention mechanism called shuffle attention (SA) was introduced into the model's backbone network. When applied to the detection of coal shearer drum teeth [34], shuffle attention improved the ability to localize the teeth in dark and noisy environments. Its noise suppression capability has also been demonstrated, which should reduce the impact of noise on the diamond nanostructure target detection results.

In the shuffle attention structure, the shuffle unit is used to integrate spatial and channel attention mechanisms. The structure diagram of the shuffle attention mechanism is shown in Figure 3.

For a given diamond nanostructure feature map $X \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ represent the number of channels, spatial height, and width, respectively, SA first divides $X$ along the channel dimension into $G$ groups, i.e., $X = [X_1, \ldots, X_G]$, where $X_k \in \mathbb{R}^{C/G \times H \times W}$. Each sub-feature $X_k$ gradually captures specific semantic responses during training. Then, the attention module generates importance coefficients for each sub-feature. Specifically, at the beginning of each attention unit, the input $X_k$ is split into two branches along the channel dimension, i.e., $X_{k1}, X_{k2} \in \mathbb{R}^{C/2G \times H \times W}$. One branch uses the inter-channel relationships to output channel attention maps, while the other uses the spatial relationships of the features to generate spatial attention maps.

Channel attention first generates channel-wise statistics through global average pooling, i.e., $s \in \mathbb{R}^{C/2G \times 1 \times 1}$, which embeds global information. These statistics are computed by contracting $X_{k1}$ along the spatial dimensions $H \times W$:

$$s = \mathcal{F}_{gp}(X_{k1}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{k1}(i, j)$$

where $s$ represents the global information of the channel attention, $H$ and $W$ denote the spatial height and width, respectively, and $X_{k1}(i, j)$ denotes the value of the pixel at the $i$-th row and $j$-th column of the feature map partitioned by the channel attention.

Additionally, a compact feature is created to guide the precise and adaptive selection. The final output of the channel attention is as follows:

$$X_{k1}' = \sigma(\mathcal{F}_c(s)) \cdot X_{k1} = \sigma(W_1 s + b_1) \cdot X_{k1}$$

Here, $W_1 \in \mathbb{R}^{C/2G \times 1 \times 1}$ and $b_1 \in \mathbb{R}^{C/2G \times 1 \times 1}$ are the parameters used to scale and shift $s$, and $\sigma$ is the sigmoid function, which maps real-valued inputs to values in the range (0, 1). Its mathematical expression is:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

In contrast to channel attention, spatial attention focuses on "where" the information is, providing a complement to channel attention. First, spatial statistics are obtained by applying group norm to $X_{k2}$. Then, $\mathcal{F}_c(\cdot)$ is used to improve the feature representation. The final output of the spatial attention is as follows:

$$X_{k2}' = \sigma(W_2 \cdot GN(X_{k2}) + b_2) \cdot X_{k2}$$

Here, $W_2$ and $b_2$ are parameters of shape $\mathbb{R}^{C/2G \times 1 \times 1}$.

Group norm is a normalization method that divides the channels into multiple groups, with the channels within each group sharing statistical information. Group norm calculates the mean and variance of the channels within each group and then uses these statistics to normalize them. The mathematical expression for group norm is as follows:

$$GN(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $x$ is the input tensor, $\mu = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x^g(i, j)$ represents the mean of the channels within each group, $\sigma^2 = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} (x^g(i, j) - \mu)^2$ denotes their variance, and $x^g$ denotes the set of channels within the $g$-th group. The scaling coefficient $\gamma$ and the offset coefficient $\beta$ are learned during training, and the constant $\epsilon$ is introduced to prevent division by zero. $H$ and $W$ represent the height and width of the input tensor, respectively, and $x^g(i, j)$ denotes the channel value at position $(i, j)$ within the $g$-th group.
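
As a quick sanity check of the expression above, the snippet below compares a manual group-wise normalization against PyTorch's nn.GroupNorm; the shapes are illustrative, and note that the statistics are taken over all channels of a group together with the spatial dimensions.

```python
# A minimal numeric check (illustrative): the group norm expression above
# matches PyTorch's nn.GroupNorm for a (C=8, G=4) feature map.
import torch
import torch.nn as nn

x = torch.randn(1, 8, 16, 16)
gn = nn.GroupNorm(num_groups=4, num_channels=8)  # gamma=1, beta=0 at init

# Manual computation: split the channels into G groups, then normalize each
# group by its mean and variance over (channels_in_group, H, W).
xg = x.view(1, 4, 2, 16, 16)
mu = xg.mean(dim=(2, 3, 4), keepdim=True)
var = xg.var(dim=(2, 3, 4), unbiased=False, keepdim=True)
manual = ((xg - mu) / torch.sqrt(var + gn.eps)).view(1, 8, 16, 16)

print(torch.allclose(manual, gn(x), atol=1e-5))  # True
```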

Finally, all the channel attention and spatial attention obtained from the diamond nanostructure images are aggregated:

$$X_k' = [X_{k1}', X_{k2}'] \in \mathbb{R}^{C/G \times H \times W}$$

At this point, the output feature map has the same dimensions as the input.

In this paper, the SA mechanism is incorporated into the model. On the one hand, shuffle attention helps reduce the model's attention to background information. On the other hand, this mixed-domain attention mechanism enhances the model's focus on crucial target information, thereby optimizing the model's detection performance.
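
A minimal sketch of such an SA unit, written directly from the equations above, is given below. It follows the published shuffle attention design (channel split, the two gated branches, concatenation, and a channel shuffle), but the tensor names and the group count are illustrative choices rather than values taken from the paper.

```python
# Illustrative sketch of a shuffle attention (SA) unit, not the authors'
# exact implementation.
import torch
import torch.nn as nn

class ShuffleAttention(nn.Module):
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        self.groups = groups
        c = channels // (2 * groups)  # channels per branch, C/2G
        # Per-channel scale/shift parameters: W1, b1 (channel branch), W2, b2 (spatial branch).
        self.w1 = nn.Parameter(torch.ones(1, c, 1, 1))
        self.b1 = nn.Parameter(torch.zeros(1, c, 1, 1))
        self.w2 = nn.Parameter(torch.ones(1, c, 1, 1))
        self.b2 = nn.Parameter(torch.zeros(1, c, 1, 1))
        self.gn = nn.GroupNorm(c, c)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x = x.view(b * self.groups, c // self.groups, h, w)
        x1, x2 = x.chunk(2, dim=1)  # split each group into the two branches
        # Channel attention: global average pooling -> scale/shift -> sigmoid gate.
        s = x1.mean(dim=(2, 3), keepdim=True)
        x1 = x1 * self.sigmoid(self.w1 * s + self.b1)
        # Spatial attention: group norm -> scale/shift -> sigmoid gate.
        x2 = x2 * self.sigmoid(self.w2 * self.gn(x2) + self.b2)
        out = torch.cat([x1, x2], dim=1).view(b, c, h, w)
        # Channel shuffle so information flows across groups.
        out = out.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
        return out

print(ShuffleAttention(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```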

2.2.3. WISE-IoUv3 Loss Function

In target detection tasks on SEM images with unclear target boundaries, mutually occluded objects, and poor image quality, the model's detection performance can be significantly improved by designing an appropriate loss function. To optimize the loss function, the Wise-IoU (WIoU) [35] method was adopted. Applying the WIoUv3 loss function to a small-target detection model [36] improved the model's focus on small samples in UAV aerial photography scenarios. It is therefore expected that the WIoUv3 loss function will improve the model's detection ability, accuracy, and stability for diamond nanostructure targets.

Wise-IoU introduces category weighting on top of the traditional IoU to minimize differences between categories and reduce their impact on the results [37]. A weight is assigned to each category, and these weights are used to weight the overlap between different categories when calculating the IoU, thereby achieving a more accurate evaluation. Among the three versions of WIoU, WIoUv3 introduces the focusing coefficients.

The parameter scheme of Wise-IoU is shown in Figure 4. A dynamic non-monotonic focusing mechanism is used to evaluate the anchor box quality of diamond nanostructures. Gradient amplification reduces the influence of harmful gradients while preserving the contribution of high-quality anchor boxes of diamond nanostructures, thereby improving the overall performance of the algorithm.

Wise-IoUv3, building upon Wise-IoUv1, introduces a non-monotonic focusing coefficient $r$ constructed from the outlier degree $\beta$. The formulas are as follows:

$$L_{WIoUv3} = r \cdot L_{WIoUv1}, \qquad L_{WIoUv1} = R_{WIoU} \cdot L_{IoU}$$

$$R_{WIoU} = \exp\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{(W_g^2 + H_g^2)^*}\right)$$

$$r = \frac{\beta}{\delta \alpha^{\beta - \delta}}$$

$$\beta = \frac{L_{IoU}^*}{\overline{L_{IoU}}}$$

In these equations, the superscript $*$ denotes that $W_g$ and $H_g$ (the width and height of the smallest enclosing box) are detached from the computational graph, which effectively improves convergence. $\beta$ is defined as the outlier degree: a smaller outlier degree implies a higher-quality anchor box, which is assigned a small gradient gain so that the bounding box regression focuses on anchor boxes of normal quality. Anchor boxes with a larger outlier degree are also assigned a small gradient gain, which effectively prevents low-quality data from generating large harmful gradients. $\alpha$ and $\delta$ are hyperparameters, and when $\beta = \delta$, $r = 1$. If the outlier degree of an anchor box satisfies $\beta = C$ (where $C$ is a constant), the anchor box receives the highest gradient gain. Since both $L_{IoU}$ and the quality criterion for anchor boxes are dynamic, WIoUv3 can at any time assign the gradient gain strategy that best fits the current situation.

$\overline{L_{IoU}}$ is the moving average of $L_{IoU}$ with momentum $m$, which is dynamically updated to keep $\beta$ at an overall high level. This effectively addresses the issue of slow convergence in the later stages of training.
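
A minimal sketch of this loss, assembled from the equations above, is shown below. The hyperparameter values for $\alpha$, $\delta$, and the momentum $m$ are assumptions for illustration, as the paper does not list them here, and the class is not the authors' implementation.

```python
# Illustrative sketch of the WIoUv3 focusing mechanism (alpha, delta, and
# momentum are assumed values, not taken from the paper).
import torch

class WIoUv3Loss:
    def __init__(self, alpha: float = 1.9, delta: float = 3.0, momentum: float = 0.01):
        self.alpha, self.delta, self.m = alpha, delta, momentum
        self.iou_mean = 1.0  # running mean of L_IoU, updated on each call

    def __call__(self, pred, target):
        """pred, target: (N, 4) boxes in (x1, y1, x2, y2) format."""
        # Plain IoU loss, L_IoU = 1 - IoU.
        lt = torch.max(pred[:, :2], target[:, :2])
        rb = torch.min(pred[:, 2:], target[:, 2:])
        inter = (rb - lt).clamp(min=0).prod(dim=1)
        area_p = (pred[:, 2:] - pred[:, :2]).prod(dim=1)
        area_t = (target[:, 2:] - target[:, :2]).prod(dim=1)
        l_iou = 1 - inter / (area_p + area_t - inter + 1e-7)

        # R_WIoU: distance between box centers, normalized by the size of the
        # smallest enclosing box; '*' = detached from the computational graph.
        cp = (pred[:, :2] + pred[:, 2:]) / 2
        ct = (target[:, :2] + target[:, 2:]) / 2
        enclose = torch.max(pred[:, 2:], target[:, 2:]) - torch.min(pred[:, :2], target[:, :2])
        wg2_hg2 = (enclose ** 2).sum(dim=1).detach()
        r_wiou = torch.exp(((cp - ct) ** 2).sum(dim=1) / (wg2_hg2 + 1e-7))

        # Outlier degree beta and non-monotonic focusing coefficient r.
        beta = l_iou.detach() / self.iou_mean
        r = beta / (self.delta * self.alpha ** (beta - self.delta))

        # Update the moving average of L_IoU with momentum m.
        self.iou_mean = (1 - self.m) * self.iou_mean + self.m * l_iou.mean().item()
        return (r * r_wiou * l_iou).mean()

loss = WIoUv3Loss()
pred = torch.tensor([[0.0, 0.0, 2.0, 2.0]], requires_grad=True)
target = torch.tensor([[0.5, 0.5, 2.5, 2.5]])
print(loss(pred, target))  # scalar loss; beta and r modulate the gradient gain
```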

In this paper, WIoUv3 is introduced into the regression loss. On the one hand, WIoUv3 inherits the strengths of the WIoU loss family. On the other hand, during the target detection of diamond nanostructures, occluded samples contribute little under traditional training, leading to underfitting of the model. With the dynamic gradient gain strategy, these samples are assigned higher gradient weights, so occluded scenes receive more consideration when the model parameters are updated. This enhances detection in occluded scenes, balances the influence of the different loss terms, and accelerates model convergence, which reduces the risk of overfitting and enables the model to better focus on and optimize occluded samples and key features, thus improving the overall detection accuracy. For the mutual occlusion encountered in SEM target detection tasks, the loss weights of the targets are dynamically optimized to improve the model's detection performance.

The final improved network model is called DWS-YOLOv8; the overall framework of the model is shown in Figure 5.

The schematic workflow of the DWS-YOLOv8 training process is shown in Figure 6. As shown, the training process includes data preparation, model initialization, forward propagation, loss computation, backpropagation, optimizer updates, training iteration, model validation, model saving, and test evaluation.

3. Experiments and Results Analysis

3.1. Datasets

The original image, captured by a scanning electron microscope, was divided into six smaller images of equal dimensions. These images were then integrated to create a dataset of diamond nanostructures, designated the NanoData dataset. The original SEM image is depicted in Figure 7, and a portion of the diamond nanostructures present in the NanoData dataset is illustrated in Figure 8.
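
As an illustration of this preprocessing step, the sketch below divides an image into a 2 × 3 grid of equal tiles. The file names and the exact grid layout are assumptions, since the paper does not specify them.

```python
# A minimal sketch (assuming OpenCV and a 2x3 grid) of dividing a raw SEM
# frame into six equal tiles; names and layout are illustrative.
import cv2

def split_into_tiles(path: str, rows: int = 2, cols: int = 3):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)  # SEM images are grayscale
    h, w = img.shape
    th, tw = h // rows, w // cols
    tiles = []
    for r in range(rows):
        for c in range(cols):
            tiles.append(img[r * th:(r + 1) * th, c * tw:(c + 1) * tw])
    return tiles

for i, tile in enumerate(split_into_tiles("sem_raw.png")):  # hypothetical file name
    cv2.imwrite(f"nanodata_tile_{i}.png", tile)
```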

The NanoData dataset utilized in this study comprises two annotated object classes: nanopillars and nanocones. Figure 9 depicts the pertinent information about the manually annotated objects in this dataset; its subfigures are presented in order from left to right and top to bottom. Figure 9a depicts the number of objects in each category in the dataset. Figure 9b depicts the sizes of the object bounding boxes, with the center coordinates of all boxes fixed at a single point. Figure 9c illustrates the distribution of the center coordinates of the bounding boxes. Figure 9d shows a scatter plot of the corresponding widths and heights of the bounding boxes.

3.2. Evaluation Metrics

To quantitatively analyze the training results of target detection, researchers have established various evaluation metrics, each designed to reflect a specific aspect of a detection algorithm's performance. In this paper, the precision-related metrics primarily include mean average precision (mAP), precision (P), and recall (R) [38].

(1) Precision

Precision (P) is the proportion of all samples predicted as positive by the model that are, in fact, positive. It is calculated using the following formula:

$$P = \frac{TP}{TP + FP}$$

TP represents the number of instances in which the classifier correctly predicts positive samples, that is, actual positive samples that are correctly identified. Conversely, FP signifies the instances in which the classifier predicts a positive sample that is, in fact, negative; this is the number of negative samples falsely reported as positive.

(2) Recall

Recall (R) is the proportion of all actual positive samples that are correctly identified by the model. It is calculated using the following formula:

$$R = \frac{TP}{TP + FN}$$

where TP represents true positives, and FN represents false negatives.

(3) mAP

The metric average precision (AP) is utilized in the context of target detection to assess the performance of a model in detecting targets. The calculation formula is as follows:

$$AP = \sum_{n} (R_n - R_{n-1}) \cdot P_n$$

where $R_n$ and $P_n$ denote the recall and precision at the $n$-th point on the precision–recall curve, and the sum is taken over all precision–recall pairs.

The mean average precision (mAP) is the average of the average precision (AP) values across all classes. The calculation formula is as follows:

$$mAP = \frac{1}{N} \sum_{n=1}^{N} AP_n$$

where $N$ represents the total number of classes and $AP_n$ denotes the average precision for the $n$-th class, corresponding to the area under its precision–recall curve.

mAP@0.5 signifies the mean average precision when the IoU threshold is set to 0.5. mAP@0.5:0.95 represents the mAP averaged over IoU thresholds in the range [0.5:0.05:0.95].
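
To make the formulas above concrete, a small worked example follows. The counts and the three precision–recall points are invented for illustration and are not results from this paper.

```python
# Worked example (illustrative numbers): computing P, R, and AP from the
# formulas above for a single class.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def average_precision(recalls, precisions) -> float:
    """AP = sum_n (R_n - R_{n-1}) * P_n over the precision-recall curve."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

print(precision(tp=90, fp=10))  # 0.9
print(recall(tp=90, fn=30))     # 0.75
print(average_precision([0.25, 0.5, 0.75], [1.0, 0.9, 0.8]))  # 0.675
# mAP is the mean of the per-class AP values; mAP@0.5 fixes the IoU
# threshold at 0.5, while mAP@0.5:0.95 averages over 0.5, 0.55, ..., 0.95.
```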

3.3. Experimental Environment and Hyperparameter Settings

The experimental environment is shown in Table 1.

To meet the needs of different scenarios and applications and to balance the speed, accuracy, and resource consumption of the model, YOLOv8 provides five model versions obtained by adjusting the depth and width of the network structure: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. The parameter count and computation of the five models increase sequentially, and the detection accuracy increases gradually. The corresponding channel width, depth, and maximum number of channels of these five models are shown in Table 2.

To better adapt to the training environment and to better improve the model, YOLOv8n was chosen as the baseline model in this research; the hyperparameter settings used in model training are shown in Table 3.

The training process incorporates a learning rate decay method, whereby the initial learning rate (lr0) controls the rate at which the model parameters are updated. The learning rate factor (lrf) adjusts the decay of the learning rate during training: the final learning rate is obtained by multiplying the initial learning rate by this coefficient. An epoch represents one complete pass over the entire image dataset in YOLO. During each epoch, the YOLO model traverses all the bounding boxes in the dataset and updates its parameters according to the aforementioned loss function and optimization algorithm; the higher the number of epochs, the more times the model traverses the dataset. To ensure the stability of the training process, training was performed for 180 epochs, during which the learning rate gradually decreased. This configuration permits the model to converge smoothly to the optimal solution, thereby avoiding oscillations.
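
The sketch below illustrates a decay schedule of this form, where the final learning rate equals lr0 × lrf. The lr0 and lrf values and the linear decay shape are assumptions for illustration (Table 3 holds the actual settings), not the paper's exact schedule.

```python
# A minimal sketch of learning-rate decay from lr0 down to lr0 * lrf over
# the training epochs; the linear shape and the values are assumptions.
lr0, lrf, epochs = 0.01, 0.01, 180

def lr_at(epoch: int) -> float:
    frac = epoch / epochs
    return lr0 * ((1 - frac) * (1 - lrf) + lrf)  # lr0 at epoch 0 -> lr0 * lrf at the end

for e in (0, 90, 180):
    print(e, round(lr_at(e), 6))  # 0.01, 0.00505, 0.0001
```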

3.4. Results

On the basis of the NanoData dataset, the present model was compared with the YOLOv8 model, with other well-performing target detection models, and with YOLOv8 equipped with each single improvement strategy. A visualization analysis was then carried out to illustrate the superiority of the present model for the target detection of diamond nanostructures.

3.4.1. Comparison with YOLOv8

To evaluate the improvement of the enhanced model in detection accuracy, comparative experiments were conducted between the YOLOv8 model and the DWS-YOLOv8 model on both the training and validation sets. Figure 10 illustrates the changes in several key evaluation metrics throughout the training process for DWS-YOLOv8 and YOLOv8n. Precision, recall, and mAP@0.5 improve rapidly over the iterations, gradually approaching stable values. Figure 11 shows the regression loss of DWS-YOLOv8 and YOLOv8 during training and validation. Table 4 compares the computational complexity and resource requirements of the DWS-YOLOv8 model with those of the YOLOv8 base model.

As illustrated in Figure 10, DWS-YOLOv8 outperforms YOLOv8n in the three detection metrics of precision, recall, and mAP@0.5 after approximately 80 epochs of training. This demonstrates that DWS-YOLOv8 recognizes targets more accurately and has higher detection performance in the diamond nanostructure target detection task. In the early stages of training, it is normal for the model to show signs of overfitting or underfitting to the training data, resulting in some variation and fluctuation. DWS-YOLOv8 stabilizes after about 120 epochs, and its performance no longer changes significantly, remaining at a high level, proving that the model has converged to a good state and no longer needs substantial optimization.

As illustrated in Figure 11, as the number of iterations increases, the average value of the loss function decreases significantly. The loss function of DWS-YOLOv8 converges faster and improves prediction accuracy, and its average value converges as the number of training epochs approaches 180.

As shown in Table 4, the computational complexity and resource requirements of DWS-YOLOv8 are significantly increased compared to YOLOv8n due to the introduction of DCN_C2f and SA. This is because deformable convolution requires the computation of additional offsets and weights, while the attention mechanism requires a global weighting computation over the feature map; these operations increase the number of parameters and the computational cost.

Since YOLOv8n is used as the baseline, to further illustrate the effectiveness of this paper's method, the proposed model is compared on the dataset with YOLOv8 variants of different sizes (YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x). The experimental results are shown in Table 5.

Recall measures the model's ability to correctly identify targets, mAP@0.5 is the average precision of the model at an IoU threshold of 0.5, and mAP@0.5:0.95 is the average precision over stricter IoU thresholds. As illustrated in Table 5, the enhanced model exhibits the best detection performance across the four evaluation metrics of recall, mAP@0.5, mAP@0.5:0.95, and F1-score in comparison to the other models, particularly those of larger scale. While this model may exhibit lower detection accuracy than larger-scale models in some respects, its smaller size results in significantly faster training speeds; thus, sacrificing a small portion of detection accuracy enables a swift detection process. These results indicate the excellent detection performance of the present model while keeping a relatively small scale.

3.4.2. Ablation Experiment

To evaluate the performance gain of the three strategies on the baseline model (YOLOv8n), namely the DCN_C2f module, the SA module, and the Wise-IoUv3 loss function, the corresponding models were trained and tested on the NanoData dataset. The experimental results are shown in Table 6, which demonstrates the effects of combining the improvements on the dataset. In Table 6, the symbol √ indicates that the corresponding improvement strategy was used.

The experimental results in Table 6 indicate that each improvement strategy, when applied to the base model, enhances the detection performance to varying degrees. Applying DCN_C2f to YOLOv8n enlarges the network's receptive field through deformable convolutions and brings the sampling closer to the objects, increasing the test set precision (P) by 4.3%; correspondingly, recall (R), mAP@0.5, and mAP@0.5:0.95 increase by 4.9%, 3.2%, and 0.5%, respectively. When SA is applied to YOLOv8n, its efficient attention mechanism enhances the focus on key information in the feature map, resulting in a 1.3% increase in precision (P), and a 2.1% improvement when it is applied together with DCN_C2f. When WIoUv3 and its smarter sample allocation strategy are introduced into the regression loss to enhance the localization ability of the model, precision (P) and recall (R) improve by 4.2% and 2.9%, respectively. With all three strategies applied to YOLOv8n simultaneously, the model attains the best overall performance.

Notably, applying DCN_C2f alone achieves the optimal values for recall, mAP@0.5, and mAP@0.5:0.95. Considering the impact of the attention mechanism and the loss function on the accuracy of diamond nanostructure detection, a portion of recall and mAP@0.5 is sacrificed to achieve higher detection precision with fewer missed detections and false alarms.

3.4.3. Comparative Experiment

To demonstrate the superiority and effectiveness of the improved algorithm proposed in this paper, comparative experiments were conducted between DWS-YOLOv8 and other well-performing models (SSD, Faster R-CNN, CenterNet, and YOLOv7). The experimental results are shown in Table 7.

The models selected are all mainstream algorithms in the field of target detection, representing different stages of development and technical routes. Among them, SSD and YOLOv7 are single-stage detectors, Faster R-CNN is a two-stage detector, and CenterNet is based on center-point detection. To comprehensively understand the performance of the optimized YOLOv8n model, the accuracy, F1-score, and computational resource consumption were assessed and compared.

The results in the table show that, compared to the other mainstream models, the proposed DWS-YOLOv8 has better detection performance. Its mAP@0.5 and F1-score are higher than those of the other models, which shows that it recognizes the correct objects more accurately and attains higher target detection accuracy. This is attributed to the DCN_C2f module, which adaptively adjusts the receptive field of the network and expands the perceptual range of the feature map, making the features extracted from diamond-nanostructured targets more discriminative.

As for mAP@0.5:0.95, the model proposed in this paper is slightly lower than the SSD and Faster R-CNN models. For Faster R-CNN, this is because its two-stage approach generates candidate boxes in a more fine-grained first stage, yielding a higher mAP@0.5:0.95. In this paper, however, a single-stage target detection model is adopted: at the cost of a slightly lower mAP@0.5:0.95, excellent detection speed, a higher mAP@0.5, and a higher F1-score are obtained. Overall, the present model outperforms the SSD and Faster R-CNN models in terms of comprehensive detection capability.

3.4.4. Visualization Analysis

To visually and conveniently demonstrate the detection performance of DWS-YOLOv8, the confusion matrix, model inference results, and heatmaps are used to analyze the model's detection performance in the comparative experiments.

To visualize the performance of the DWS-YOLOv8 model on the dataset, confusion matrices were generated for both YOLOv8n and DWS-YOLOv8 to compare their target detection performance on diamond nanostructures, as shown in Figure 12. The rows and columns of the confusion matrix represent the actual and predicted categories, respectively. There are two categories in the dataset: nanocone and nanopillar. The diagonal values represent the percentage of correct predictions for each category, while the off-diagonal values indicate the proportion of incorrectly predicted categories.

As shown in Figure 12, the diagonal region of the confusion matrix of DWS-YOLOv8 is darker, indicating that the model's ability to correctly predict object categories has been enhanced. In addition, the values on the diagonal are larger, indicating that the prediction accuracy of this model is higher than that of the YOLOv8n model.

To visually demonstrate the detection performance of this model, inference experiments were conducted using YOLOv8n and DWS-YOLOv8 on three SEM images from outside the dataset, each containing a large number of diamond nanostructures. The visualization results of target detection for YOLOv8n and the improved YOLOv8 model are shown in Figure 13.

As shown in Figure 13, the improved YOLOv8 model shows significant improvements in detecting multiple targets, occluded targets, and in overall target detection performance. The experimental results confirm that the DWS-YOLOv8 model significantly reduces missed and false detections, enabling more accurate detection of diamond nanostructures.

To visualize the detection performance of the model for diamond nanostructures of different scales, two different scales of diamond nanostructures were selected in this study. The corresponding detection results are depicted in Figure 14.

As shown in Figure 14, the model detects diamond nanostructures with diameters ranging from 80 nm to 500 nm well [39]. Generally, the etching process determines the size and shape of the nanostructures; thus, the dataset constructed accordingly is efficient and repeatable for recognizing similar structures. It should be mentioned that, for diamond nanostructures of different scales, the detection accuracy of the model is about 90%. The results show that the model has excellent detection ability for diamond nanostructures of different scales.

Gradient-weighted Class Activation Mapping (Grad-CAM) is used to generate heatmaps for YOLOv8n and DWS-YOLOv8. Heatmaps intuitively and conveniently reflect which areas of the feature maps the model focuses on. Grad-CAM propagates the model's gradient information backward to the last convolutional layer and then combines it with the weights of the feature maps to generate spatial region features related to the model's decisions. Pixels with higher gradients in the heatmap are shaded red, while those with lower gradients are shaded blue. The experimental results are shown in Figure 15.
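
As an illustration of this procedure, the following sketch implements a minimal Grad-CAM on a toy CNN. It is not the authors' tooling; for DWS-YOLOv8, the hook would instead be registered on the model's last convolutional layer, and the score would come from a detection output.

```python
# Illustrative Grad-CAM sketch: gradients of a score are backpropagated to
# a convolutional feature map, channel-wise mean gradients weight that map,
# and the ReLU-ed, upsampled result is the heatmap.
import torch
import torch.nn.functional as F

def grad_cam(model, layer, x: torch.Tensor) -> torch.Tensor:
    feats, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    score = model(x).sum()  # stand-in "score"; a detector would use a class score
    score.backward()
    h1.remove(); h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)  # global-average gradients
    cam = F.relu((weights * feats[0]).sum(dim=1))      # weighted feature map
    cam = F.interpolate(cam.unsqueeze(1), size=x.shape[2:],
                        mode="bilinear", align_corners=False)
    return cam / (cam.max() + 1e-7)  # normalized heatmap: red = high, blue = low

# Usage with a toy CNN (hypothetical architecture).
net = torch.nn.Sequential(torch.nn.Conv2d(1, 8, 3, padding=1), torch.nn.ReLU(),
                          torch.nn.Conv2d(8, 4, 3, padding=1))
print(grad_cam(net, net[2], torch.randn(1, 1, 64, 64)).shape)  # (1, 1, 64, 64)
```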

For the model proposed in this paper, the heatmap shows that the model's attention is more focused on the center point of the object, demonstrating high accuracy in target localization. This indicates that the model pays more attention to the main part of the target, which helps it identify and classify the diamond nanostructures better.

4. Conclusions and Future Direction

Diamond nanostructures were successfully prepared at the nanoscale by employing plasma etching techniques. To comprehensively analyze and examine the characteristics of these nanostructures, high-quality morphology images were obtained using a high-resolution scanning electron microscope (SEM). These images clearly show the surface morphology of the nanostructures as well as details of their periodic arrangement. Based on the acquired high-precision topographic images, this research further constructed an exhaustive dataset. This dataset systematically organizes and classifies the image information of the various nanostructures, providing valuable data support for subsequent structural design optimization, performance prediction, and an in-depth understanding of the plasma etching process.

When performing target detection of diamond nanostructures in SEM scenarios, problems such as noise, unclear object boundaries, and object occlusion are encountered. Existing models achieve low detection accuracy on such data, which makes it difficult to detect the fabricated diamond nanostructures. To optimize detection performance, this paper proposes a diamond nanostructure target detection model, DWS-YOLOv8, based on YOLOv8 for SEM scenes. To enhance the feature representation of multiscale, occluded diamond nanostructures, the network's receptive field was adaptively adjusted by introducing the lightweight DCN_C2f module. Then, a dynamically weighted attention mechanism was incorporated during feature fusion, which addresses the loss of detail during convolutional iterations, facilitates feature self-calibration, and reduces the impact of noise and background.

To further improve the training effectiveness and detection accuracy of diamond nanostructures, the dynamic non-monotonic focusing mechanism and gradient gain method were introduced to reduce the influence of harmful gradients.

As verified by the comparative performance of the three improvement strategies, the present DWS-YOLOv8 model achieved improvements in precision (P), recall (R), mAP@0.5, and mAP@0.5:0.95, demonstrating that these strategies enhance the effectiveness of object detection. Comparative experimental results against other mainstream object detection models, such as Faster R-CNN, SSD, CenterNet, and YOLOv7, also indicate that DWS-YOLOv8 outperforms them in terms of mAP@0.5 and F1-score. This suggests that the diamond nanostructure detection accuracy of this model surpasses that of mainstream models.

Since the improved model adds attention mechanisms, the model structure has become more complex, resulting in an overall increase of 0.2 M in model size and varying degrees of increase in computation and inference time. For some very small diamond nanostructures, the detection accuracy of the improved model is still not high enough. The primary focus of the next step of this research is to further optimize the detection accuracy of the model while considering resource consumption. In future work, more SEM images of diverse diamond nanostructures will be collected to support the subsequent feature extraction of diamond nanostructures.

Author Contributions

Conceptualization, Z.G.; data curation, X.G., Y.W., S.L., M.Z. and L.Z.; formal analysis, S.L. and Z.G.; funding acquisition, F.G. and Z.G.; investigation, Q.W. and L.Z.; methodology, X.G.; project administration, F.G.; resources, M.Z. and L.Z.; software, X.G.; supervision, F.G. and Z.G.; validation, X.G.; visualization, X.G. and L.G.; writing—original draft, X.G.; writing—review and editing, F.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Nos. 41806112 and 51802179) and the National Key R&D Program of China (2022YFC3104203). We also acknowledge the financial support of the Special Wenhai Plan of Qingdao National Laboratory for Marine Science and Technology (Nos. LSKJ202204601 and LSKJ202204702) and the Shandong Provincial Natural Science Foundation (Nos. ZR2020ME037, ZR2022MF300, ZR2022QB142, ZR2022QB229, and ZR2023QB017).

Data Availability Statement

The derived data generated in this study will be shared by the respective authors upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Ren, S.; Zeng, J.; Zheng, Z.; Shi, H. Perspective and application of modified electrode material technology in electrochemical voltammetric sensors for analysis and detection of illicit drugs. Sens. Actuators A Phys. 2021, 329, 112821.
2. Li, J.; Cai, Y.; Li, Q.; Kou, M.; Zhang, T. A review of remote sensing image segmentation by deep learning methods. Int. J. Digit. Earth 2024, 17, 2328827.
3. Li, S.; Huang, H.; Meng, X.; Wang, M.; Li, Y.; Xie, L. A Glove-Wearing Detection Algorithm Based on Improved YOLOv8. Sensors 2023, 23, 9906.
4. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
5. Ma, N.; Su, Y.; Yang, L.; Li, Z.; Yan, H. Wheat Seed Detection and Counting Method Based on Improved YOLOv8 Model. Sensors 2024, 24, 1654.
6. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
7. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
8. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part I; pp. 21–37.
9. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
10. Wang, H.; Liu, C.; Cai, Y.; Chen, L.; Li, Y. YOLOv8-QSD: An Improved Small Object Detection Algorithm for Autonomous Vehicles Based on YOLOv8. IEEE Trans. Instrum. Meas. 2024, 10, 119.
11. Soylu, E.; Soylu, T. A performance comparison of YOLOv8 models for traffic sign detection in the Robotaxi-full scale autonomous vehicle competition. Multimed. Tools Appl. 2024, 83, 25005–25035.
12. Zhang, J.; Hua, Y.; Chen, L.; Li, L.; Shen, X.; Shi, W.; Wu, X.; Fu, Y.; Lv, C.; Zhu, J. EMR-YOLO: A Study of Efficient Maritime Rescue Identification Algorithms. J. Mar. Sci. Eng. 2024, 12, 1048.
13. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
14. Kim, J.H.; Kim, N.; Park, Y.W.; Won, C.S. Object detection and classification based on YOLO-V5 with improved maritime dataset. J. Mar. Sci. Eng. 2022, 10, 377.
15. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
16. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430.
17. Okuyama, Y.; Ohmori, T. Automated measurement method based on deep learning for cross-sectional SEM images of semiconductor devices. Jpn. J. Appl. Phys. 2022, 62, SA1016.
18. Dengiz, O.; Smith, A.E.; Nettleship, I. Grain boundary detection in microstructure images using computational intelligence. Comput. Ind. 2005, 56, 854–866.
19. de Albuquerque, V.H.C.; Cortez, P.C.; de Alexandria, A.R.; Tavares, J.M.R. A new solution for automatic microstructures analysis from images based on a backpropagation artificial neural network. Nondestruct. Test. Eval. 2008, 23, 273–283.
20. Mulewicz, B.; Korpala, G.; Kusiak, J.; Prahl, U. Autonomous interpretation of the microstructure of steels and special alloys. Mater. Sci. Forum 2019, 949, 24–31.
21. Adachi, Y.; Taguchi, M.; Hirokawa, S. Microstructure recognition by deep learning. Tetsu-to-Hagane 2016, 102, 722–729.
22. Song, C.W.; Cho, D.S.; Lee, J.M.; Song, P.K. Effect of boron doping on diamond film and electrochemical properties of BDD according to thickness and morphology. Coatings 2020, 10, 331.
23. Lu, J.; Xu, D.; Huang, N.; Jiang, X.; Yang, B. One-dimensional diamond nanostructures: Fabrication, properties and applications. Carbon 2024, 223, 119020.
24. Wang, Y.; Zhang, K.; Wang, L.; Wu, L. An Improved YOLOv8 Algorithm for Rail Surface Defect Detection. IEEE Access 2024, 12, 44984–44997.
25. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 6105–6114.
26. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
27. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856.
28. Zhu, J.; Hu, T.; Zheng, L.; Zhou, N.; Ge, H.; Hong, Z. YOLOv8-C2f-Faster-EMA: An Improved Underwater Trash Detection Model Based on YOLOv8. Sensors 2024, 24, 2483.
29. Shao, Y.; Zhang, R.; Lv, C.; Luo, Z.; Che, M. TL-YOLO: Foreign-Object Detection on Power Transmission Line Based on Improved YOLOv8. Electronics 2024, 13, 1543.
30. Shen, L.; Lang, B.; Song, Z. DS-YOLOv8-Based Object Detection Method for Remote Sensing Images. IEEE Access 2023, 11, 125122–125137.
31. Wen, C.; Yin, L.; Liu, S. DGNet: A handwritten mathematical formula recognition network based on deformable convolution and global context attention. Mob. Netw. Appl. 2024, 10, 1–14.
32. Zhang, F.; Jin, X.; Lin, G.; Jiang, J.; Wang, M.; An, S.; Hu, J.; Lyu, Q. Hybrid attention network for citrus disease identification. Comput. Electron. Agric. 2024, 220, 108907.
33. Zhang, H.; Li, S.; Miao, Q.; Fang, R.; Xue, S.; Hu, Q.; Hu, J.; Chan, S. Surface defect detection of hot rolled steel based on multi-scale feature fusion and attention mechanism residual block. Sci. Rep. 2024, 14, 7671.
34. Hu, T.; Zhuang, D.; Qiu, J.; Zheng, L. Improved YOLOv8 Algorithm with C2f-DCNv3 and Shuffle Attention for Detection of Coal Shearer Drum Teeth. In Proceedings of the 2024 4th International Conference on Neural Networks, Information and Communication (NNICE), Guangzhou, China, 19–21 January 2024; pp. 1019–1022.
35. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051.
36. Wang, G.; Chen, Y.; An, P.; Hong, H.; Hu, J.; Huang, T. UAV-YOLOv8: A small-object-detection model based on improved YOLOv8 for UAV aerial photography scenarios. Sensors 2023, 23, 7190.
37. Jiang, T.; Chen, S. A Lightweight Forest Pest Image Recognition Model Based on Improved YOLOv8. Appl. Sci. 2024, 14, 1941.
38. Wan, X.; Wang, M. Gear Fault Detection Method Based on the Improved YOLOv5. In Proceedings of the 2023 IEEE International Conference on Mechatronics and Automation (ICMA), Harbin, China, 6–9 August 2023; pp. 2348–2353.
39. Hu, X.; Yang, Y.; Wang, W.; Wang, Y.; Gong, X.; Geng, C.; Tang, J. Hollow Fe3+-doped anatase titanium dioxide nanosphere for photocatalytic degradation of organic dyes. ACS Appl. Nano Mater. 2023, 6, 18999–19009.

Figure 1. Structure of the deformable convolutional layer.

Figure 2. Comparison between deformable convolution and standard convolution.

Figure 3. Structural diagram of the shuffle attention mechanism.

Figure 4. Schematic diagram of the Wise-IoU parameters.

Figure 5. The network structure of the enhanced YOLOv8 (DWS-YOLOv8).

Figure 6. Training process of DWS-YOLOv8.

Figure 7. Raw images of diamond nanostructures taken by SEM.

Figure 8. Images of some of the diamond nanostructures in the NanoData dataset. (a–f) represent images of the diamond nanostructures used in the dataset.

Target Detection of Diamond Nanostructures Based on Improved YOLOv8 Modeling (16)

Target Detection of Diamond Nanostructures Based on Improved YOLOv8 Modeling (17)

Figure 9. The provided information about manually labeled objects in the NanoData dataset. (a) depicts the number of objects in each category in the dataset. (b) depicts the size of the object enclosure boxes in the dataset. (c) illustrates the distribution of the center coordinates of the object enclosure boxes. (d) illustrates a scatter plot of the corresponding widths and heights of the object enclosure boxes.


Figure 10. Precision, Recall, and mAP@0.5 for DWS-YOLOv8 and YOLOv8n. (a) Comparison of precision; (b) comparison of recall; (c) comparison of mAP@0.5.


Figure 11. Comparison of training and validation loss curves between DWS-YOLOv8 and YOLOv8n.


Figure 12. (a) Confusion matrix for YOLOv8n; (b) Confusion matrix for DWS-YOLOv8.


Figure 13. Comparison of target detection results between YOLOv8n and DWS-YOLOv8 on the diamond dataset. (a) Original image; (b) detection result of YOLOv8n; (c) detection result of DWS-YOLOv8.


Figure 14. Detection of diamond nanostructures at different scales.


Figure 15. Heat-map comparison of YOLOv8n and DWS-YOLOv8. (a) Original image; (b) heat map of YOLOv8n; (c) heat map of DWS-YOLOv8.


Table 1. Experimental environment.


Options            Configuration
Operating System   Windows 10
CPU                AMD R7 5700X
GPU                NVIDIA RTX 3060
GPU Memory Size    12 GB
DL Framework       PyTorch 1.13.1
Language           Python 3.10.11


Table 2. Corresponding parameters for different versions of the YOLOv8 model.


Model      Width   Depth   GFLOPs   Max Channels
YOLOv8n    0.25    0.33    8.7      1024
YOLOv8s    0.50    0.33    28.6     1024
YOLOv8m    0.75    0.67    78.9     768
YOLOv8l    1.00    1.00    165.2    512
YOLOv8x    1.25    1.00    257.8    512
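
The multiples in Table 2 scale one base architecture: each layer's channel count is capped at the variant's maximum and multiplied by the width factor, while each block's repeat count is multiplied by the depth factor. A rough sketch of the arithmetic (the rounding rules approximate, but are not guaranteed to match, the Ultralytics implementation):

```python
import math

def scale_channels(base_ch: int, width: float, max_ch: int) -> int:
    """Cap at the variant's max channels, apply the width multiple,
    then round up to a multiple of 8."""
    return math.ceil(min(base_ch, max_ch) * width / 8) * 8

def scale_depth(base_repeats: int, depth: float) -> int:
    """Apply the depth multiple to a block's repeat count (at least 1)."""
    return max(round(base_repeats * depth), 1) if base_repeats > 1 else base_repeats

# YOLOv8n (width 0.25, max 1024): a 1024-channel stage -> 256 channels,
# and a block repeated 3 times -> max(round(3 * 0.33), 1) = 1 repeat.
```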


Table 3. Hyperparameter settings.


Hyperparameter Option              Setting
Input Resolution                   640 × 640
Initial Learning Rate (lr0)        0.01
Final Learning Rate Factor (lrf)   0.01
Momentum                           0.937
Weight Decay                       0.0005
Batch Size                         2
Epochs                             180
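
These settings map directly onto Ultralytics training arguments; a sketch of an equivalent training call (the dataset configuration name nanodata.yaml is hypothetical, not taken from the paper):

```python
from ultralytics import YOLO

# Table 3 hyperparameters as Ultralytics arguments.
model = YOLO("yolov8n.pt")
model.train(
    data="nanodata.yaml",   # hypothetical dataset-config name
    imgsz=640,              # input resolution 640 x 640
    lr0=0.01,               # initial learning rate
    lrf=0.01,               # final learning-rate factor
    momentum=0.937,
    weight_decay=0.0005,
    batch=2,
    epochs=180,
)
```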


Table 4. Computational complexity and resource requirements of DWS-YOLOv8 vs. YOLOv8n.


Models       Model Size/MB   Detection Time/ms   Parameters/10⁶   GFLOPs
YOLOv8n      6.3             8.0                 3.0              8.7
DWS-YOLOv8   21.4            15.0                11.1             28.6
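
Numbers like those in Table 4 (parameter count and per-image detection time) can be measured with a few lines of PyTorch; a generic sketch, with arbitrary warm-up and repeat counts:

```python
import time
import torch

def profile(model: torch.nn.Module, imgsz: int = 640, runs: int = 100):
    """Report parameter count (millions) and mean CPU forward time (ms).
    For GPU timing, move model and input to CUDA and call
    torch.cuda.synchronize() before reading the clock."""
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    x = torch.randn(1, 3, imgsz, imgsz)
    model.eval()
    with torch.no_grad():
        for _ in range(10):                    # warm-up iterations
            model(x)
        t0 = time.perf_counter()
        for _ in range(runs):
            model(x)
        ms = (time.perf_counter() - t0) / runs * 1e3
    return params_m, ms
```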


Table 5. Comparison of the proposed model with YOLOv8 variants of different sizes.


Models       Precision/%   Recall/%   mAP0.5/%   mAP0.5:0.95/%   F1-Score
YOLOv8s      78.3          68.0       70.6       54.0            0.72
YOLOv8m      83.1          71.4       74.7       55.6            0.76
YOLOv8l      84.7          73.3       76.9       58.2            0.78
YOLOv8x      85.3          74.9       78.1       60.3            0.79
DWS-YOLOv8   82.4          75.7       81.5       60.7            0.80
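
The F1-Score column is the harmonic mean of precision and recall. Note that applying the formula to the tabulated precision and recall need not reproduce the tabulated F1 to the last digit, since each metric may be reported at its own operating point:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# e.g. f1_score(0.824, 0.757) ~= 0.789, close to the 0.80 reported
# for DWS-YOLOv8 up to rounding and operating point.
```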


Table 6. Experimental results of model combinations on the dataset.


Model 1      Precision/%   Recall/%   mAP0.5/%   mAP0.5:0.95/%
B            73.0          75.1       78.9       60.1
B+…          77.3          80.2       82.1       60.7
B+…          74.3          77.8       80.2       59.4
B+…          77.2          78.0       80.3       59.6
B+…          79.4          75.8       81.4       60.5
B+D+S+W      82.4          75.7       81.5       60.7

1 B: Base (YOLOv8n); D: DCN_C2f; S: Shuffle attention; W: Wise-IoU v3.


Table 7. Comparison with other representative detection models.


Models         mAP0.5/%   mAP0.5:0.95/%   F1-Score   GFLOPs
SSD            77.3       61.8            0.53       68.3
Faster R-CNN   79.8       66.2            0.57       932.5
CenterNet      73.3       56.5            0.53       133.7
YOLOv7         80.7       59.6            0.77       105.1
DWS-YOLOv8     81.5       60.7            0.80       79.1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).