
Citation: Jiangjian Xie, Zhulin Hao, Chunhe Hu, Changchun Zhang, Junguo Zhang. 2025: Beyond amplitude: Phase integration in bird vocalization recognition with MHAResNet. Avian Research, 16(1): 100229. DOI: 10.1016/j.avrs.2025.100229
Bird vocalizations are pivotal for ecological monitoring, providing insights into biodiversity and ecosystem health. Traditional recognition methods often neglect phase information, resulting in incomplete feature representation. In this paper, we introduce a novel approach to bird vocalization recognition (BVR) that integrates both amplitude and phase information, leading to enhanced species identification. We propose MHAResNet, a deep learning (DL) model that employs residual blocks and a multi-head attention mechanism to capture salient features from logarithmic power (POW), Instantaneous Frequency (IF), and Group Delay (GD) extracted from bird vocalizations. Experiments on three bird vocalization datasets demonstrate our method’s superior performance, achieving accuracy rates of 94%, 98.9%, and 87.1% respectively. These results indicate that our approach provides a more effective representation of bird vocalizations, outperforming existing methods. This integration of phase information in BVR is innovative and significantly advances the field of automatic bird monitoring technology, offering valuable tools for ecological research and conservation efforts.
Birds are a vital component of biodiversity and play a significant role in maintaining the balance of ecosystems (Lu et al., 2023). Bird communities are good indicators of environmental changes that shape biodiversity at the landscape scale (Dvořáková et al., 2023). Bird vocalizations, as the primary means of communication, reproduction, and territorial declaration among birds, carry rich and valuable information. The advent of passive acoustic monitoring (PAM) has revolutionized bird monitoring, offering a non-invasive, efficient approach to long-term ecological research (Sedlácek et al., 2015; Wheeldon et al., 2019). It not only conserves substantial manpower and material resources but also offers minimal interference, extensive monitoring coverage, and high efficiency, making it highly promising for application (Ma, 2016). Despite these advantages, the vast volume of data generated by PAM systems necessitates automated analysis to conserve resources and improve efficiency (Kasten et al., 2012). Automatic bird vocalization recognition (BVR) has emerged as a critical solution to this challenge, harnessing the power of artificial intelligence and deep learning (DL) to surpass traditional methods (Xie et al., 2023).
The performance of BVR is directly related to the quality of the selected features. To achieve excellent performance, optimal features should be chosen as inputs to the model. In BVR, numerous features, including short-time Fourier transform (STFT) spectrograms, log-Mel spectrograms, and Mel-frequency cepstral coefficients (MFCCs), are widely utilized as input features (Xie et al., 2023). However, relying on a single feature type may not yield the best results. Hence, the integration of diverse features is imperative for comprehensive representation. Yan et al. (2021) evaluated the fusion of log-Mel spectrogram (LM), MFCC, and chroma features using four bird audio datasets. Experimental results demonstrated that the combination of LM, MFCC, and chroma features achieved the best performance, with a mean average precision (mAP) of 97.9%. Xie et al. (2022) introduced the minimal-redundancy-maximal-relevance (mRMR) method to select optimal multi-view features for birdsong classification. The candidate features included four handcrafted features (wavelet transform (WT) spectrum, Hilbert-Huang transform (HHT) spectrum, STFT spectrum, and MFCC) in addition to the deep features extracted from the WT, HHT, and STFT spectrums. The best performance was achieved when the mRMR feature subset contained 800 features. Yang et al. (2022) incorporated a multi-scale feature fusion structure and a Pyramid Split Attention (PSA) module to enhance the extraction of spatial and channel information. By adjusting the depth-wise separable convolution and introducing the Bnecks module with a channel attention mechanism, the model achieved a top-1 accuracy of 95.12% and a top-5 accuracy of 100%. Xie and Zhu (2023) investigated an early fusion method of deep features for birdsong classification. They used five pretrained models (VGG16, ResNet50, EfficientB0, MobileNetV2, and Xception) to extract deep cascaded features from multi-view spectrograms. The combination of VGG16 and MobileNetV2 achieved the best performance, with a balanced accuracy of 94.89%. These studies highlight the superior performance of multi-feature integration compared to using a single feature alone. However, the integration of phase information remains unexplored.
Phase information refers to the periodic variation of sound wave vibration, which denotes the spatial relationship between the waveform's peaks and troughs (Kinsler et al., 1999). In the Fourier transform, a signal's spectrum comprises amplitude and phase spectra. The amplitude spectrum describes the intensity of the frequency components, while the phase spectrum describes the relative positional relationships between these components. Complete phase information aids in accurately reconstructing the original audio signal. When the audio signal varies in the time domain, phase information provides additional contextual and temporal cues, which enhance the accuracy of distinguishing different speech units, such as phonemes or speech events (Hidaka et al., 2022). The phase information of audio signals thus plays a multifaceted role in feature representation and audio recognition tasks: it contains rich acoustic cues that aid the representation and recognition of bird vocalizations. Here, we fuse the logarithmic power (POW) spectrum with group delay (GD) and instantaneous frequency (IF) features extracted from the phase information. This fusion strategy is expected to further enhance the richness of feature representations, providing a more comprehensive view for BVR.
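To make the distinction concrete, the following minimal NumPy sketch (using a synthetic two-tone signal rather than a real recording) splits the spectrum of one windowed frame into its amplitude and phase components:

```python
import numpy as np

# Synthetic 2-s frame at 16 kHz: two tones with different phase offsets (hypothetical example).
sr = 16000
t = np.arange(2 * sr) / sr
x = 0.8 * np.sin(2 * np.pi * 1000 * t) + 0.3 * np.sin(2 * np.pi * 3000 * t + np.pi / 4)

spectrum = np.fft.rfft(x * np.hanning(len(x)))   # complex spectrum of one windowed frame
amplitude = np.abs(spectrum)                     # amplitude spectrum: intensity of each component
phase = np.angle(spectrum)                       # phase spectrum: relative position of each component

print(amplitude.shape, phase.shape)              # both (len(x) // 2 + 1,)
```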
To fully leverage the potential of these enriched feature representations, the choice of a robust recognition model becomes crucial. DL models, such as convolutional neural networks (CNNs), have demonstrated remarkable performance in BVR compared to traditional methods (Lin et al., 2016; Sinha et al., 2020). Additionally, studies have found that deeper network architectures generally perform better in BVR (Ruff et al., 2019; Xie et al., 2019; Florentin et al., 2020). Deep network architectures, while powerful, are prone to gradient-related issues that can diminish model performance. To counteract these challenges, we employ residual networks (ResNets) in our approach. ResNets have proven effective for BVR by enabling the training of deeper models and alleviating the impact of vanishing and exploding gradients (He et al., 2020; Manna et al., 2023). Furthermore, recognizing bird vocalizations in the wild presents additional challenges, as environmental noise can substantially hinder recognition performance. Studies have shown that attention mechanisms can help mitigate these issues by focusing on the most relevant parts of the audio signal, such as distinctive bird vocalization features, thus enhancing recognition accuracy even in noisy conditions (Xu et al., 2020; Jiang et al., 2021; Xiao et al., 2022c). By integrating attention mechanisms into DL models, the performance of BVR systems can be further improved, making them more robust to noise and better at capturing important features. Noumida and Rajan (2022) used a hierarchical attention mechanism to focus on the information output by the hidden layer of a bidirectional gated recurrent unit model; the model achieved considerable performance on the Xeno-Canto dataset, with an F1-score of 84%. Xiao et al. (2022b) proposed AMResNet, an automatic bird sound recognition model that integrates an attention mechanism and a ResNet. The attention mechanism enhances recognition by focusing on relevant bird sound features, while the ResNet addresses gradient vanishing issues. Tested on 12,651 bird sound samples, the model achieved a classification accuracy of 92% and an F1-score of 97.1%, outperforming several other models. Hu et al. (2023a) embedded a shuffle attention module between the double-layer residual module connected by base block and down block, which transfers effective information and enhances the ripple characteristics of spectrograms, thereby improving the model's accuracy and efficiency. Additionally, they introduced the ScSEnet attention module into the ResNet18 backbone network to reduce the impact of noise in Mel and Sinc spectrograms on classification. Their model achieved high accuracy across three bird sound datasets (Hu et al., 2023b). In this paper, we further investigate attention mechanisms and the structure of ResNets. The ResNet addresses the problems of gradient vanishing and gradient explosion in deep network training by introducing residual connections, allowing the network to learn deeper feature representations. The multi-head attention mechanism enables the model to focus on multiple positions in the sequence simultaneously, capturing different contextual information, which enhances the model's performance.
The main contributions of this paper are as follows:
(1) We propose MHAResNet, a novel model that combines a residual network with a multi-head attention mechanism. By developing a local feature extraction module, our approach reduces the number of parameters while enhancing the feature learning capability for noisy bird vocalizations.
(2) We extract POW, GD, and IF features from bird vocalizations. These features are fused to serve as inputs to the recognition model, providing a comprehensive representation of bird vocalizations.
(3) We conduct experiments on various datasets and compare the performance of our model with other DL-based BVR models. The results demonstrate that our network exhibits strong generalization ability and outperforms existing methods.
In this paper, we propose a BVR method based on feature fusion and MHAResNet (shown in Fig. 1). Firstly, we down-sample the bird audio samples and split them into equal-length segments. We then extract POW, GD, and IF features from each segment and fuse them into three-channel features of size (3, 224, 224) as inputs. The MHAResNet model incorporates the MHA mechanism into ResNet34 to enhance feature extraction capabilities. Finally, we use Softmax as a classifier to map the input multi-features to probability distributions across different classes.
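The preprocessing step can be sketched as follows; the file path is hypothetical, and librosa.load is used here because it resamples on read (the 16 kHz rate and 2-s clip length match those used in this study):

```python
import librosa
import numpy as np

TARGET_SR = 16000
CLIP_SECONDS = 2

def load_clip(path: str) -> np.ndarray:
    """Load a mono clip resampled to 16 kHz and pad/trim it to exactly 2 s."""
    y, _ = librosa.load(path, sr=TARGET_SR, mono=True)   # librosa resamples while loading
    target_len = TARGET_SR * CLIP_SECONDS
    if len(y) < target_len:
        y = np.pad(y, (0, target_len - len(y)))
    return y[:target_len]

clip = load_clip("Birdsdata/AnserAnser/sample_0001.wav")  # hypothetical path
```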
We choose the Birdsdata dataset, derived from the Beijing Academy of Artificial Intelligence (BAAI)'s bird call library (https://www.aminer.cn/research_report/5f3394d73c99ce0ab7bc771f), to evaluate our method. The original dataset contains 14,311 labeled sound clips from 20 categories, each with an audio length of 2 s. We resampled all the audio samples to a sampling frequency of 16 kHz using Librosa 0.7.2. To avoid data imbalance, we followed the approach used by Xiao et al. (2022b) and excluded the species Perdix perdix due to its insufficient representation in the dataset. After processing, the dataset consists of 14,282 sound clips from 19 species of birds, spanning 9 orders, 12 families, and 16 genera. The detailed information of each bird species is presented in Table 1.
Genus | Family | Order | Scientific name | Number of samples | Abbreviation |
Anser | Anatidae | Anseriformes | Anser anser | 759 | AA |
Ardea | Ardeidae | Pelecaniformes | Ardea cinerea | 850 | ACi |
Anas | Anatidae | Anseriformes | Anas crecca | 602 | ACr |
Accipiter | Accipitridae | Accipitriformes | Accipiter gentilis | 733 | AG |
Anas | Anatidae | Anseriformes | Anas platyrhynchos | 766 | AP |
Buteo | Accipitridae | Accipitriformes | Buteo buteo | 290 | BB |
Coturnix | Phasianidae | Galliformes | Coturnix coturnix | 738 | CCo |
Cygnus | Anatidae | Anseriformes | Cygnus cygnus | 800 | CCy |
Fulica | Rallidae | Gruiformes | Fulica atra | 460 | FA |
Gavia | Gaviidae | Gaviiformes | Gavia stellata | 835 | GS |
Himantopus | Recurvirostridae | Charadriiformes | Himantopus himantopus | 786 | HH |
Phalacrocorax | Phalacrocoracidae | Suliformes | Phalacrocorax carbo | 852 | PCa |
Phasianus | Phasianidae | Galliformes | Phasianus colchicus | 797 | PCo |
Passer | Passeridae | Passeriformes | Passer domesticus | 1195 | PD |
Rallus | Rallidae | Gruiformes | Rallus aquaticus | 680 | RA |
Tringa | Scolopacidae | Charadriiformes | Tringa ochropus | 710 | TO |
Tringa | Scolopacidae | Charadriiformes | Tringa glareola | 825 | TG |
Tringa | Scolopacidae | Charadriiformes | Tringa totanus | 790 | TT |
Vanellus | Charadriidae | Charadriiformes | Vanellus vanellus | 814 | VV |
We fuse the amplitude information and phase information of the audio to construct the fused feature. The feature extraction process is shown in Fig. 2 (Hidaka et al., 2022).
Following the short-time Fourier transform, the audio signal is divided into frames along the time axis. The Gabor function then transforms each frame of the audio segment from the time domain to the frequency domain (as shown in Eq. (1)).
$$\widehat{\mathrm{Gabor}}(f,t)=k\,e^{i\theta}\int_{-\infty}^{+\infty}e^{-i2\pi (f-f_{0})t}\,\omega(at)\,dt=\frac{k}{a}\,e^{i\theta}\,\omega\!\left(\frac{f-f_{0}}{a}\right) \tag{1}$$
where $k$ is a constant amplitude coefficient that scales the Gabor function.
The above equation can be viewed as a complex wave whose amplitude is given by Eq. (2):
$$A=\left\|\widehat{\mathrm{Gabor}}(f)\right\|=\frac{k}{a}\,\omega\!\left(\frac{f-f_{0}}{a}\right) \tag{2}$$
The amplitude derived from Eq. (2) is squared, down-sampled, and converted to a logarithmic scale to obtain POW, as expressed in Eq. (3):
$$\mathrm{POW}=\log\!\left(\mathrm{downsample}\!\left(A^{2}\right)\right) \tag{3}$$
In Eq. (1), the complex exponential term $e^{i\theta}$ can be expanded by Euler's formula into $\cos\theta + i\sin\theta$, so the Gabor function can be rewritten as Eq. (4):
$$\widehat{\mathrm{Gabor}}(f)=\frac{k}{a}\left(\cos\theta+i\sin\theta\right)\omega\!\left(\frac{f-f_{0}}{a}\right) \tag{4}$$
The real and imaginary parts of the Gabor function can then be represented as Eq. (5) and Eq. (6), respectively:
$$\operatorname{Re}\!\left(\widehat{\mathrm{Gabor}}(f)\right)=\frac{k}{a}\cos\theta\,\omega\!\left(\frac{f-f_{0}}{a}\right) \tag{5}$$
$$\operatorname{Im}\!\left(\widehat{\mathrm{Gabor}}(f)\right)=\frac{k}{a}\sin\theta\,\omega\!\left(\frac{f-f_{0}}{a}\right) \tag{6}$$
The cosine and sine terms in Eq. (5) and Eq. (6) represent the real and imaginary parts of the signal, respectively. The phase (in radians) of the signal is shown as Eq. (7):
$$\phi(f)=\tan^{-1}\frac{\operatorname{Im}\!\left(\widehat{\mathrm{Gabor}}(f)\right)}{\operatorname{Re}\!\left(\widehat{\mathrm{Gabor}}(f)\right)} \tag{7}$$
Frequency-domain differentiation denotes the partial derivative of the phase with respect to frequency, i.e., the rate of change of the phase in the frequency domain. This is the GD, whose formula is given in Eq. (8):
$$\mathrm{GD}=\frac{\partial}{\partial f}\phi(f) \tag{8}$$
Correspondingly, unfolding the phase in the time domain decomposes it into segments along the time axis. For time frames $t(i) \in T$, $i = 1, 2, \ldots, F$, where $F$ is the total number of frames, the phase difference between frame $t(i)$ and frame $t(i+1)$ represents the rate of change of the phase over time; this is the IF.
The POW, GD, and IF described in this paper are all derived per time frame. Finally, as in the short-time Fourier transform, the features obtained are stacked and concatenated along the time axis to obtain two-dimensional time-frequency feature maps. The resulting spectrograms consist of the POW spectrum, GD spectrum, and IF spectrum, all standardized to dimensions of (224, 224). These individual spectrograms are concatenated to form fused features with a size of (3, 224, 224), serving as input data for the MHAResNet network.
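As an illustration, the three spectrograms could be computed with an ordinary STFT standing in for the Gabor analysis described above; the FFT size, hop length, and the use of cv2.resize for standardization to 224 × 224 are assumptions, since the exact parameters are not specified in the text:

```python
import numpy as np
import librosa
import cv2

def pow_if_gd(y: np.ndarray, n_fft: int = 1024, hop: int = 128) -> np.ndarray:
    """Return a (3, 224, 224) stack of POW, IF, and GD maps (parameter values are assumptions)."""
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop)              # complex spectrogram (freq, frames)
    pow_spec = np.log(np.abs(S) ** 2 + 1e-10)                     # logarithmic power, cf. Eq. (3)

    phase = np.angle(S)
    unwrapped_t = np.unwrap(phase, axis=1)                        # unwrap along time
    inst_freq = np.diff(unwrapped_t, axis=1, append=unwrapped_t[:, -1:])    # IF: frame-to-frame phase change
    unwrapped_f = np.unwrap(phase, axis=0)                        # unwrap along frequency
    group_delay = np.diff(unwrapped_f, axis=0, append=unwrapped_f[-1:, :])  # GD: phase change over frequency, cf. Eq. (8)

    maps = [cv2.resize(m.astype(np.float32), (224, 224)) for m in (pow_spec, inst_freq, group_delay)]
    return np.stack(maps, axis=0)                                 # fused feature of size (3, 224, 224)

fused = pow_if_gd(np.random.randn(32000))                         # a random stand-in for a 2-s, 16 kHz clip
print(fused.shape)                                                # (3, 224, 224)
```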
The MHAResNet model integrates residual blocks and a multi-head attention mechanism to enhance feature extraction capabilities. It consists of convolutional layers, batch normalization, ReLU activations, max pooling, and a sequence of residual blocks (MRBlocks) that progressively decrease in size while increasing in depth. A multi-head attention block is introduced to capture salient features within the bird vocalizations, followed by average pooling and a fully connected layer for classification.
MRBlocks is a sequence of four residual block groups and one multi-head attention block. From left to right, the residual block groups contain 3, 4, 6, and 3 residual layers, respectively. Fig. 1 labels the input and output feature-map dimensions of each block group. As the data flow through the residual blocks, the feature maps decrease in spatial size but increase in depth. Each residual layer consists of two 3 × 3 convolutional layers and two normalization layers, with a ReLU activation function added between the first and second convolutional layers. These two successive convolution operations define the residual function F. The output of the residual function is summed with the input X through skip connections. Eventually, the result of the summation is nonlinearly transformed by the ReLU function to obtain the output of the residual block. The specific formula is shown in Eq. (9):
$$Y_{i}=h(X_{i})+F(X_{i},W_{i}) \tag{9}$$
where $Y_i$ is the output of the residual block, $X_i$ is the input, $h(X_i)$ is the identity mapping, and $F(X_i, W_i)$ is the residual function with weights $W_i$.
The architecture utilizes skip connections (indicated by dashed arrows) that allow gradients to flow through the network without disappearing or exploding (He et al., 2016).
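A minimal PyTorch sketch of such a residual layer is shown below (two 3 × 3 convolutions with batch normalization, a ReLU in between, and a skip connection implementing Eq. (9)); the channel counts and the 1 × 1 projection used when shapes differ are standard ResNet conventions rather than details taken from the text:

```python
import torch
import torch.nn as nn

class ResidualLayer(nn.Module):
    """Basic residual layer: Y = h(X) + F(X, W), followed by ReLU (Eq. 9)."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.f = nn.Sequential(                                   # residual function F(X, W)
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # h(X): identity when shapes match, otherwise a 1x1 projection on the skip path
        self.h = nn.Identity() if (in_ch == out_ch and stride == 1) else nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False), nn.BatchNorm2d(out_ch))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.h(x) + self.f(x))

block = ResidualLayer(64, 64)
print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])
```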
The Multi-head Attention (MHA) Block (see Fig. 1) receives the output from the first residual block group and further processes the feature information without changing the spatial dimensions. The sequence X with output dimension (N, C, H, W) from the previous residual block group is first reshaped to (N × H × W, 1, C) to match the (L, N, E) input format of the multi-head attention, where L is the sequence length, N is the batch size, and E is the embedding dimension; here the sequence length is N × H × W, the batch size is 1, and the embedding dimension is C. The reshaped sequence X′ then enters the multi-head attention block. Q, K, and V are the mapping matrices of the input sequence: the input embedding I is multiplied by the corresponding weight matrices to initialize the three mapping matrices, as shown in Eq. (10).
$$Q=W^{Q}I,\qquad K=W^{K}I,\qquad V=W^{V}I \tag{10}$$
where $W^Q$, $W^K$, and $W^V$ are learnable parameters; $Q$ and $K$ have dimension $d_k$, and $V$ has dimension $d_v$.
Attention weights are then obtained by calculating the similarity between Q and K through the scaled dot-product attention operation (see Fig. 1), as shown in Eq. (11):
$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V \tag{11}$$
In this mechanism, each "head" performs the same operation, but on a different representation subspace. In this way, the model is able to capture information from multiple levels of abstraction, improving its expressive power and capacity to handle complexity. The MHA output is the concatenation of the individual head outputs. The information is subsequently integrated by another linear transformation to produce the final feature representation. This multi-angle feature integration allows the model greater flexibility and adaptability in processing sequence data. The specific formula is shown in Eq. (12):
$$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_{1},\ldots,\mathrm{head}_{h})W^{O},\qquad \mathrm{head}_{i}=\mathrm{Attention}\!\left(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V}\right) \tag{12}$$
where $W_i^Q$, $W_i^K$, and $W_i^V$ are the projection matrices specific to head $i$, $W^O$ is the output projection matrix, and $h$ is the number of heads. Each head corresponds to a different representation subspace, allowing the model to capture different features in multiple subspaces.
Finally, the 3D feature maps are converted into 1D vectors of size 512 by average pooling (AvgPool). Since AvgPool is executed after MRBlocks, the feature maps being flattened are of size 1 × 1 and are fed to the fully connected (FC) layer with 512 hidden units. Based on the number of classes in the bird call dataset, the output of the FC layer is a 1 × 19 tensor.
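The reshaping and attention step described above can be sketched with PyTorch's nn.MultiheadAttention; the number of heads is an assumption, as it is not stated in the text:

```python
import torch
import torch.nn as nn

class MHABlock(nn.Module):
    """Treat every spatial position of a (N, C, H, W) feature map as one token of dimension C."""
    def __init__(self, channels: int, num_heads: int = 4):   # num_heads is an assumed value
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=num_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        seq = x.permute(0, 2, 3, 1).reshape(n * h * w, 1, c)   # (L, batch, E) = (N*H*W, 1, C)
        out, _ = self.attn(seq, seq, seq)                      # self-attention: Q = K = V = seq
        return out.reshape(n, h, w, c).permute(0, 3, 1, 2)     # back to (N, C, H, W)

x = torch.randn(1, 64, 28, 28)        # small spatial size for a quick demo
print(MHABlock(64)(x).shape)          # torch.Size([1, 64, 28, 28])
```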
In the experiments, all the data are randomly divided according to the ratio of 6:2:2, which are used as the training set, validation set and test set respectively. The network model is built based on PyTorch 2.0.1 DL framework. The programming language is Python 3.8.0. The hardware environment is CPU Intel(R) i5-9300H and GPU Nvidia GeForce GTX 1050.
The model was trained for 200 epochs using the Adam optimizer, with an initial learning rate of 0.0001. We employed the step decay learning rate schedule (Ge et al., 2019), in which the learning rate is updated according to Eq. (13).
$$lr_{\mathrm{new}}=lr_{\mathrm{initial}}\times E^{\left\lfloor M/N\right\rfloor} \tag{13}$$
where $lr_{\mathrm{initial}}$ is the initial learning rate, $E$ is the decay factor, $M$ is the current epoch number, and $N$ is the number of epochs between successive decays.
We set the batch size to 16. An early stopping method was used during the training phase to prevent the network from overfitting. In addition, we perform L2 regularization for all weight parameters. The loss function is updated as Eq. (14).
$$L'(\theta)=L(\theta)+\lambda\,\frac{1}{2}\left\|\theta\right\|^{2} \tag{14}$$
where $L'(\theta)$ and $L(\theta)$ denote the loss function after and before regularization, respectively, $\theta$ denotes all the weight parameters, and $\lambda$ regulates the influence of the regularization term on the loss. In addition, cross-entropy is used as the loss function because it can balance class similarity and speed up convergence in recognition tasks. The specific formula is shown in Eq. (15).
$$L(\theta)=-\frac{1}{n}\sum_{x}\left[Y_{r}\ln Y_{p}+(1-Y_{r})\ln\left(1-Y_{p}\right)\right] \tag{15}$$
where $n$ is the number of samples, $Y_r$ is the true label, and $Y_p$ is the predicted probability.
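The training configuration described in this section can be sketched as follows; the step size, decay factor, weight-decay coefficient, and early-stopping patience are placeholders, since their exact values are not reported:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 19)                      # stand-in for MHAResNet's classifier head
criterion = nn.CrossEntropyLoss()               # cross-entropy loss, cf. Eq. (15)

# Adam with initial learning rate 1e-4; weight_decay applies an L2 penalty analogous to Eq. (14).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)  # lambda assumed

# Step decay, cf. Eq. (13): multiply the learning rate by gamma every step_size epochs (values assumed).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.5)

best_val, patience, bad_epochs = float("inf"), 10, 0   # simple early stopping (patience assumed)
for epoch in range(200):
    # ... run one training epoch and compute val_loss on the validation split ...
    val_loss = 0.0                               # placeholder value for this sketch
    scheduler.step()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```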
To evaluate the model performance, accuracy, precision, recall, and F1-score are used as the evaluation metrics in this paper. Accuracy is the standard metric for recognition models and applies to both binary and multi-class tasks, as shown in Eq. (16).
$$\mathrm{Accuracy}=\frac{TP+TN}{TP+FP+TN+FN} \tag{16}$$
Precision denotes the proportion of all bird vocalization samples within the predicted label set that were correctly classified, as shown in Eq. (17). Recall is the proportion of all bird vocalization samples within the true label set that were correctly classified, as shown in Eq. (18). F1-score is obtained by weighting the two metrics Precision and Recall, as shown in Eq. (19).
$$\mathrm{Precision}=\frac{TP}{TP+FP} \tag{17}$$

$$\mathrm{Recall}=\frac{TP}{TP+FN} \tag{18}$$

$$F1=\frac{2\,TP}{2\,TP+FN+FP} \tag{19}$$
where TP is the number of samples correctly predicted as positive, TN is the number correctly predicted as negative, FP is the number incorrectly predicted as positive, and FN is the number incorrectly predicted as negative.
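These metrics can be computed from the predicted and true labels, for example with scikit-learn; macro averaging over the 19 classes is an assumption about how the per-class scores are aggregated:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 2, 1, 1, 0, 2, 1]   # toy labels standing in for the 19 bird classes
y_pred = [0, 2, 1, 0, 0, 2, 2]

accuracy = accuracy_score(y_true, y_pred)                                   # Eq. (16)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)                       # Eqs. (17)-(19)
print(f"Acc {accuracy:.3f}  P {precision:.3f}  R {recall:.3f}  F1 {f1:.3f}")
```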
The experiment is divided into three parts. First, we investigate the impact of incorporating attention mechanisms at different positions within the ResNet to identify the optimal model configuration. Next, we conduct ablation experiments to analyze the contribution of each feature to the model’s recognition performance. Finally, we compare our method’s performance with other existing methods on the same dataset and evaluate our model’s performance on different datasets to demonstrate its robustness and generalization capability.
In this section, we compare the effects of adding the mechanism at different positions. To ensure the fairness of the experiments, factors such as learning rate, batch size, and training period were kept constant across all experimental setups, with the only variation being the position of the multi-head attention mechanism. The MHA1 model places the multi-head attention mechanism between groups of residual blocks (between layer1 and layer2). This positioning is intended to facilitate the exchange of information between features at different levels. The MHA2 model introduces the attention module between the Maxpool layer and layer1, aiming to enhance feature representation prior to feature down-sampling. The MHA3 model inserts the multi-head attention mechanism between the Avgpool layer and the fully connected (Fc) layers, with the goal of increasing the discriminative power of the model at the feature decision level.
The recognition accuracies, F1-scores, recalls, and precision values of the models with the attention module at different positions are given in Table 2.
Model | Position | Accuracy (%) | Recall (%) | Precision (%) | F1-score (%) |
MHA1(MHAResNet) | Between Layer1 and Layer2 | 94.0 | 93.9 | 94.1 | 93.8 |
MHA2 | Between Maxpool and Layer1 | 91.9 | 91.6 | 91.8 | 91.5 |
MHA3 | Between Avgpool and Fc | 90.2 | 90.1 | 90.2 | 90.2 |
The experimental results show that the MHA1 model exhibits the best recognition performance, with an accuracy of 94.0% and an F1-score of 93.8%. The effectiveness of the MHA mechanism lies in its ability to allow the model to focus on different subsets of information in parallel while processing sequential data. This enables the model to comprehensively understand the data and consider various combinations of features during decision-making. Inserting the MHA between Layer1 and Layer2 places it at an earlier stage of feature extraction, where the model retains more of the original information and local features, facilitating the exchange and fusion of different features. The MHA2 model achieved an accuracy of 91.9%. By placing the attention module before feature down-sampling, the model has the opportunity to process the original input features in more detail. However, the MHA mechanism at this stage may lack sufficient contextual information to enhance feature representation effectively, resulting in a lower accuracy rate. The accuracy of the MHA3 model was lower than that of the other two models. This may be due to its placement deeper in the network, where some detailed information may be lost before attentional focusing can occur. In summary, inserting the MHA between Layer1 and Layer2 is most effective. This position enhances the model's understanding and integration of features while retaining sufficient detail. It not only provides effective channels for information flow but also helps the model capture complex relationships among features more comprehensively, thereby improving the performance of the recognition task.
We further use Grad-CAM (Selvaraju et al., 2017) to visualize the regions of interest of the models in order to investigate the benefits of the multi-head attention mechanism. Grad-CAM uses the gradients from network back-propagation to compute the weight of each channel of the feature map and obtain a class activation map, which is used as a mask. The input spectrogram is superimposed on this mask at a certain scale to form the final heat map. We use the POW spectrogram as the input to the models ResNet34, MHA1, MHA2, and MHA3. In the heat maps, brighter red areas indicate a higher level of model attention.
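A minimal Grad-CAM sketch using forward and backward hooks on one convolutional stage is shown below; the backbone, target layer, and random input are placeholders rather than the exact setup used in the paper:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet34

model = resnet34(num_classes=19).eval()
target_layer = model.layer4                      # last residual stage (choice of layer is an assumption)
acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 224, 224)                        # stand-in for a fused POW/IF/GD input
score = model(x)[0].max()                              # score of the predicted class
score.backward()

weights = grads["v"].mean(dim=(2, 3), keepdim=True)    # channel weights = spatially averaged gradients
cam = F.relu((weights * acts["v"]).sum(dim=1))         # weighted sum of activations, then ReLU
cam = F.interpolate(cam.unsqueeze(1), size=(224, 224), mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalized heat map in [0, 1]
```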
From Fig. 3, it is evident that ResNet34, MHA2, and MHA3 do not fully focus on the bird vocalization information. For instance, when recognizing GS using ResNet34, the model primarily focuses on high-frequency information, neglecting low-frequency features. Similarly, when recognizing PD, some chirping features are lost. The MHA2 model fails to focus on the bird vocalization region when recognizing GS and overemphasizes noise in the high-frequency region when recognizing ACi. In contrast, the results demonstrate that MHA1 significantly enhances focus on high-frequency details and highly concentrates on the bird vocalization information, leading to the best recognition performance. All the aforementioned results aptly demonstrate that the inclusion of the attention mechanism at the appropriate position (MHA1) allows for focused attention on the comprehensive features present in bird vocalization, resulting in enhanced recognition performance.
In order to analyze the recognition performance of different species, this paper evaluates the performance of our proposed method using the confusion matrix (Fig. 4) and the precision, recall, F1-scores, and accuracy (Table 3) of each species.
AA | ACi | ACr | AG | AP | BB | CCo | CCy | FA | GS | HH | PCa | PCo | PD | RA | TO | TG | TT | VV | |
Precision (%) | 89.7 | 94.7 | 93.3 | 94.8 | 91.9 | 89.5 | 98.7 | 95.8 | 90.7 | 96.9 | 96.1 | 89.8 | 97.1 | 94.9 | 93.1 | 96.4 | 86.6 | 95.3 | 94.9 |
Recall (%) | 89.1 | 95.3 | 87.5 | 94.1 | 94.4 | 91.1 | 97.4 | 95.1 | 92.6 | 96.9 | 94.9 | 91.7 | 95.4 | 91.1 | 88.1 | 97.8 | 94.9 | 92.8 | 97.8 |
Accuracy (%) | 89.4 | 94.9 | 90.2 | 93.9 | 93.8 | 90.5 | 98.7 | 94.9 | 92.1 | 96.3 | 95.1 | 90.5 | 95.9 | 92.1 | 91.1 | 97.6 | 91.6 | 94.5 | 96.2 |
F1-score (%) | 89.3 | 95.1 | 90.5 | 92.7 | 93.1 | 90.3 | 98.1 | 95.4 | 91.7 | 96.9 | 95.5 | 90.8 | 96.2 | 92.9 | 90.5 | 97.1 | 90.6 | 94.1 | 95.9 |
As can be seen from Table 3, MHAResNet achieves a recognition accuracy above 90% for every species except AA. The relatively high F1-scores for CCo (98.1%) and TO (97.1%) indicate that the model performs best at recognizing these two species. This could be due to the larger sample sizes for CCo and TO, which allow the model to learn more feature information and thus improve recognition accuracy; the consistency of their vocalizations also makes it easier for the model to learn distinguishing features. However, AA and BB have relatively low F1-scores of 89.3% and 90.3%, respectively, making them the most challenging species to distinguish. The relatively small number of samples for BB prevents the model from fully learning its features, which impacts the model's generalization ability.
Fig. 5 shows a typical time-frequency diagram for the species AA. It suggests that the complexity of AA's vocal structure, the significant variation in its vocalizations, or the presence of more background noise may explain the increased difficulty in recognition.
In this section, we assess the contribution of three distinct features, POW, IF, and GD, to the model’s recognition performance. The experiments first evaluated the effect of each feature on model performance individually, and then evaluated the effect of different combinations of features. Each experiment was run under the same training conditions to ensure the validity of the results. The experimental results for single-channel features, two-channel features and three-channel features are shown in Table 4.
Feature | Accuracy (%) | Recall (%) | Precision (%) | F1-Score (%) |
POW | 93.1 | 92.9 | 93.1 | 93.1 |
IF | 85.4 | 80.5 | 80.6 | 80.9 |
GD | 81.2 | 81.1 | 81.3 | 81.4 |
POW + IF | 90.2 | 90.0 | 90.1 | 90.1 |
POW + GD | 90.9 | 90.5 | 90.9 | 90.9 |
IF + GD | 84.1 | 83.7 | 83.8 | 83.9 |
POW + IF + GD | 94.0 | 93.9 | 94.1 | 93.8 |
Our ablation study reveals that utilizing the POW feature alone yields a high accuracy of 93.1%, underscoring its efficacy in capturing the essential features of bird vocalizations. The IF, when used alone, demonstrated relatively high accuracy (85.4%), but the F1-score was significantly lower than that of POW. This implies that IF contributes less to the recognition of certain species, making it more likely to produce false positives or false negatives. The GD has the poorest performance when used alone, with accuracy and F1-scores of 81.2% and 81.4%, respectively, indicating that the GD feature alone is insufficient for distinguishing bird species. From the results of the combined features, POW + IF and POW + GD, the accuracy of both combinations is slightly lower than when POW is used alone but higher than when IF or GD are used alone. This indicates that the information provided by POW dominates the combination, and there is overlapping information between the combined features, resulting in no significant improvement in performance. The IF + GD combination is the worst-performing feature combination, with lower accuracy and F1-scores than POW alone. This suggests that phase information alone may be inherently limiting, reflecting only a portion of the characteristics of the bird song signal and therefore being less effective. When IF and GD are used in combination, redundant information may be introduced, which does not provide additional useful information to the model and may lead to model confusion and affect its generalization ability.
POW represents the energy intensity (i.e., tone strength) of a bird’s vocalization. Due to the differences in the energy distribution of different bird species when they emit their chirps, POW becomes a powerful feature for distinguishing bird vocalizations. The IF and GD extracted from the phase information reflect the speed of change of the phase feature of bird vocalization in the time and frequency domain dimensions, respectively. IF represents the frequency change (i.e., pitch) of the bird vocalization, and pitch change is crucial for distinguishing bird vocalizations with similar volume but different pitches (Feng et al., 2018). GD represents the time-varying properties of the amplitude of bird vocalizations. Frequency Modulation (FM) and Amplitude Modulation (AM) are very common in bird vocalizations and are ways for birds to convey complex information (Zollinger et al., 2012; Nemeth et al., 2013). Experimental results show that IF or GD alone is weak in BVR, indicating that phase information alone is not as discriminative as tone intensity information. Changes in tone intensity carry important information such as rhythm and emphasis, whereas phase information tends to vary less significantly than amplitude information over short periods. The highest accuracy and F1-scores (94.0% and 93.8%) are achieved when all features are used in combination. This suggests that although IF and GD perform poorly on their own, they can provide complementary information when combined with POW, carrying more comprehensive information that enhances the model’s recognition ability. Thus, the joint input of magnitude and phase features to the classifier helps to increase robustness (Eisele et al., 2024).
(1) Comparison experiments with other methods on the same dataset
To evaluate the effectiveness of our proposed models, we performed comparative experiments using the Birdsdata dataset. The baseline and state-of-the-art models included Gaussian Mixture Models (GMM) (Ptacek et al., 2016) and Hidden Markov Models (HMM) (de Oliveira et al., 2020), which are traditional statistical models widely utilized in audio signal processing. Additionally, we compared our model with the classical artificial neural network (ANN) (Pahuja and Kumar, 2021; Xiao et al., 2022a) and three variants of ResNet: ResNet18 (Koh et al., 2019), ResNet34 (Koh et al., 2019), and ResNet50 (Schwab et al., 2023), known for their strong performance in image and sound recognition tasks. Emerging models such as Vision Transformer (ViT) (Tanzi et al., 2022) and BirdNET (Kahl et al., 2021) were also included, representing recent trends in feature learning and optimization for bird vocalization recognition (BVR). Finally, AMResNet (Xiao et al., 2022b), an enhanced ResNet architecture with an attention module, was also included. Table 5 presents the recognition accuracies of these methods.
Model | Accuracy (%) |
GMM | 61.1 |
HMM | 63.2 |
ANN | 69.3 |
ResNet18 | 88.3 |
ResNet34 | 89.5 |
ResNet50 | 86.6 |
ViT | 82.8 |
BirdNET | 82.5 |
AMResNet | 92.6 |
MHAResNet (proposed) | 94.0 |
Table 5 indicates that MHAResNet improves accuracy by 1.4% over the previously best-performing model, AMResNet. Notably, aside from AMResNet, none of the other methods achieve an accuracy above 90%, further highlighting the performance advantages of MHAResNet. Traditional statistical models such as GMM and HMM demonstrate limited performance due to their reliance on fixed assumptions about data patterns and temporal relationships. While effective for simple data, their capacity to model the non-linear and dynamic nature of bird vocalizations is constrained. ANN, with its shallow architecture, struggles to extract complex features, leading to lower recognition accuracy compared to deeper models. ResNet-based architectures (ResNet18, ResNet34, and ResNet50) show superior deep feature learning capabilities but lack attention mechanisms, making them more susceptible to noise interference and less effective in capturing discriminative audio features. ViT, a transformer-based model, excels in capturing global dependencies due to its large parameter count but faces challenges of computational inefficiency and overfitting, particularly when dealing with limited datasets. BirdNET employs down-sampling before each residual structure, which adversely impacts its performance on short audio clips by discarding critical high-frequency features. AMResNet, despite incorporating attention mechanisms, does not integrate phase information, limiting its ability to fully capture the intricacies of bird vocalizations.
MHAResNet addresses these limitations by incorporating phase features for enhanced feature fusion and employing an MHA mechanism. This design achieves an optimal balance between performance and parameter efficiency, utilizing approximately 21.8 million parameters. By integrating MHA into the ResNet34 backbone, MHAResNet surpasses traditional models such as GMM and HMM, as well as shallow networks like ANN. Compared to models like ResNet50, ViT, and AMResNet, it achieves higher accuracy with lower computational complexity. Furthermore, it improves upon BirdNET, which compromises high-frequency feature retention due to down-sampling. By integrating phase information and MHA, MHAResNet enhances feature representation while maintaining a lightweight architecture, offering a robust and efficient solution for BVR.
(2) Comparison experiments using MHAResNet on other datasets
We tested our method on two other datasets: a 16-class bird dataset and a 264-class bird dataset based on the Cornell Bird Challenge (CBC) 2020. For both datasets, the training parameters were kept consistent with the previous experiments. As shown in Table 6, our method achieves 98.9% accuracy on the 16-class dataset, 2.6 percentage points higher than the best previously reported performance (Xie et al., 2022). Although Xie et al. (2022) also used feature fusion, they fused only the magnitudes of different time-frequency transforms, which are highly redundant. In contrast, we fuse magnitude and phase, which achieves better results. For the CBC dataset, in order to compare with the latest method (Gupta et al., 2021), we chose the same 100 bird species as they did, with the same dataset division. Our method's accuracy is substantially higher than theirs. They used only common MFCC features, whereas our approach incorporates the phase features of bird vocalizations, providing more comprehensive information. Our model also combines the advantages of the attention mechanism and the residual network, outperforming traditional DL models. These factors contribute to the superior performance of our method.
Study | Number of classes | Feature | Method | Accuracy (comparison, %) | Accuracy (ours, %) | Recall (ours, %) | Precision (ours, %) | F1-Score (ours, %)
Xie et al. (2022) | 16 | MFCC and deep spectral features of STFT, WT, HHT | Multi-view features with RF, SVM, MLP | 96.3 | 98.9 | 97.9 | 98.1 | 98.6
Gupta et al. (2021) | 100 | MFCC and GTCC | Temporal correlation with RNN | 67.0 | 87.1 | 87.2 | 87.5 | 86.8
Our study pioneers the integration of phase information into BVR, addressing the limitations of traditional approaches that focus solely on amplitude features. We proposed a BVR method based on feature fusion, wherein the combined POW, IF, and GD features of bird vocalizations serve as input. The MHAResNet, integrating residual blocks and multi-head attention mechanisms, was designed for accurate bird species recognition. Experiments were conducted on three different bird vocalization datasets with 19, 16, and 100 classes, achieving accuracy rates of 94%, 98.9%, and 87.1%, respectively. These results surpass those of existing methods, demonstrating that the feature set constructed by our method better represents the information inherent in bird vocalizations. Our research paves the way for potential applications in automatic bird diversity monitoring, providing scientists and environmental managers with a powerful tool to better understand and conserve bird species. While the proposed method offers enhanced accuracy and robustness through the integration of phase features, it is important to note that phase information may be susceptible to specific noise types or variations in vocalization patterns, which could impact its effectiveness in certain scenarios. Future work will focus on developing adaptive preprocessing techniques to mitigate these challenges and further enhance the model's resilience to diverse acoustic environments.
Jiangjian Xie: Writing – review & editing, Writing – original draft, Visualization, Methodology, Conceptualization. Zhulin Hao: Writing – review & editing, Writing – original draft, Visualization, Software, Methodology. Chunhe Hu: Writing – review & editing. Changchun Zhang: Writing – review & editing. Junguo Zhang: Writing – review & editing.
The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.
Not applicable.
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ge, R., Kakade, S.M., Kidambi, R., Netrapalli, P., 2019. The step decay schedule: a near optimal, geometrically decaying learning rate procedure for least squares. Adv. Neural Inf. Process. Syst. 32.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, pp. 770–778.
Hidaka, S., Wakamiya, K., Kaburagi, T., 2022. An investigation of the effectiveness of phase for audio classification. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
Jiang, D., Hu, Y., Dai, L., Peng, J., 2021. Facial expression recognition based on attention mechanism. Sci. Program. 2021.
Kinsler, L.E., Frey, A.R., Coppens, A.B., Sanders, J.V., 1999. Fundamentals of Acoustics, fourth ed. Wiley, New York.
Manna, A., Upasani, N., Jadhav, S., Mane, R., Chaudhari, R., Chatre, V., 2023. Bird image classification using convolutional neural network transfer learning architectures. Int. J. Adv. Comput. Sci. Appl. 14.
Ruff, J., Lesmeister, D., Duchac, L., Padmaraju, B., Sullivan, C., 2019. Automated identification of avian vocalizations with deep convolutional neural networks. Remote Sens. Ecol. Conserv. 6, 79–92.
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D., 2017. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE, pp. 618–626.
Xiao, H., Ran, Z., Mabu, S., Li, Y., Li, L., 2022c. SAUNet++: an automatic segmentation model of COVID-19 lesion from CT slices. Vis. Comput. 39, 2291–2304.
Xie, S., Lu, J., Liu, J., Zhang, Y., Lv, D., Chen, X., et al., 2022. Multi-view features fusion for birdsong classification. Ecol. Inform. 72, 101893.
Xu, Y., Li, L., Gao, H., 2020. Sentiment classification with adversarial learning and attention mechanism. Comput. Intell. 37, 774–798.
Zollinger, S.A., Podos, J., Nemeth, E., Goller, F., Brumm, H., 2012. On the relationship between, and measurement of, amplitude and frequency in birdsong. Anim. Behav. 84, e1–e9.