
Citation: Jiangjian Xie, Zhulin Hao, Chunhe Hu, Changchun Zhang, Junguo Zhang. 2025: Beyond amplitude: Phase integration in bird vocalization recognition with MHAResNet. Avian Research, 16(1): 100229. DOI: 10.1016/j.avrs.2025.100229
Bird vocalizations are pivotal for ecological monitoring, providing insights into biodiversity and ecosystem health. Traditional recognition methods often neglect phase information, resulting in incomplete feature representation. In this paper, we introduce a novel approach to bird vocalization recognition (BVR) that integrates both amplitude and phase information, leading to enhanced species identification. We propose MHAResNet, a deep learning (DL) model that employs residual blocks and a multi-head attention mechanism to capture salient features from logarithmic power (POW), Instantaneous Frequency (IF), and Group Delay (GD) extracted from bird vocalizations. Experiments on three bird vocalization datasets demonstrate our method’s superior performance, achieving accuracy rates of 94%, 98.9%, and 87.1% respectively. These results indicate that our approach provides a more effective representation of bird vocalizations, outperforming existing methods. This integration of phase information in BVR is innovative and significantly advances the field of automatic bird monitoring technology, offering valuable tools for ecological research and conservation efforts.
Birds are a vital component of biodiversity and play a significant role in maintaining the balance of ecosystems (Lu et al., 2023). Bird communities are good indicators of environmental changes that shape biodiversity at the landscape scale (Dvořáková et al., 2023). Bird vocalizations, as the primary means of communication, reproduction, and territorial declaration among birds, carry rich and valuable information. The advent of passive acoustic monitoring (PAM) has revolutionized bird monitoring, offering a non-invasive, efficient approach to long-term ecological research (Sedlácek et al., 2015; Wheeldon et al., 2019). It not only conserves substantial manpower and material resources but also offers minimal interference, extensive monitoring coverage, and high efficiency, making it highly promising for application (Ma, 2016). Despite these advantages, the vast volume of data generated by PAM systems necessitates automated analysis to conserve resources and improve efficiency (Kasten et al., 2012). Automatic bird vocalization recognition (BVR) has emerged as a critical solution to this challenge, harnessing the power of artificial intelligence and deep learning (DL) to surpass traditional methods (Xie et al., 2023).
The performance of BVR is directly related to the quality of the selected features. To achieve excellent performance, optimal features should be chosen as inputs to the model. In BVR, numerous features, including short-time Fourier transform (STFT) spectrograms, log-Mel spectrograms, and Mel-frequency cepstral coefficients (MFCCs), are widely utilized as input features (Xie et al., 2023). However, relying on a single feature type may not yield the best results. Hence, the integration of diverse features is imperative for comprehensive representation. Yan et al. (2021) evaluated the fusion of log-Mel spectrogram (LM), MFCC, and chroma features using four bird audio datasets. Experimental results demonstrated that the combination of LM, MFCC, and chroma features achieved the best performance, with a mean average precision (mAP) of 97.9%. Xie et al. (2022) introduced the minimal-redundancy-maximal-relevance (mRMR) method to select optimal multi-view features for birdsong classification. The candidate features included four handcrafted features (wavelet transform (WT) spectrum, Hilbert-Huang transform (HHT) spectrum, STFT spectrum, and MFCC) in addition to the deep features extracted from the WT, HHT, and STFT spectrums. The best performance was achieved when the mRMR feature subset contained 800 features. Yang et al. (2022) incorporated a multi-scale feature fusion structure and a Pyramid Split Attention (PSA) module to enhance the extraction of spatial and channel information. By adjusting the depth-wise separable convolution and introducing the Bnecks module with a channel attention mechanism, the model achieved a top-1 accuracy of 95.12% and a top-5 accuracy of 100%. Xie and Zhu (2023) investigated an early fusion method of deep features for birdsong classification. They used five pretrained models (VGG16, ResNet50, EfficientB0, MobileNetV2, and Xception) to extract deep cascaded features from multi-view spectrograms. The combination of VGG16 and MobileNetV2 achieved the best performance, with a balanced accuracy of 94.89%. These studies highlight the superior performance of multi-feature integration compared to using a single feature alone. However, the integration of phase information remains unexplored.
Phase information refers to the periodic variation of sound wave vibration, which denotes the spatial relationship between the waveform's peaks and troughs (Kinsler et al., 1999). In the Fourier transform, a signal's spectrum comprises amplitude and phase spectra. The amplitude spectrum describes the intensity of the frequency components, while the phase spectrum describes the relative positional relationships between these components. Complete phase information aids in accurately reconstructing the original audio signal. When the audio signal varies in the time domain, phase information provides additional contextual and temporal cues, which enhance the accuracy of distinguishing different speech units, such as phonemes or speech events (Hidaka et al., 2022). The phase information of audio signals thus plays a multifaceted role in feature representation and audio recognition tasks: it contains rich acoustic cues that aid the representation and recognition of bird vocalizations. Here, we fuse the logarithmic power (POW) spectrum with group delay (GD) and instantaneous frequency (IF) features extracted from the phase information. This fusion strategy is expected to further enhance the richness of feature representations, providing a more comprehensive view for BVR.
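To make the distinction concrete, the following minimal NumPy sketch (using a synthetic two-tone signal rather than a real recording) splits the spectrum of one windowed frame into its amplitude and phase components:

```python
import numpy as np

# Synthetic 2-s frame at 16 kHz: two tones with different phase offsets (hypothetical example).
sr = 16000
t = np.arange(2 * sr) / sr
x = 0.8 * np.sin(2 * np.pi * 1000 * t) + 0.3 * np.sin(2 * np.pi * 3000 * t + np.pi / 4)

spectrum = np.fft.rfft(x * np.hanning(len(x)))   # complex spectrum of one windowed frame
amplitude = np.abs(spectrum)                     # amplitude spectrum: intensity of each component
phase = np.angle(spectrum)                       # phase spectrum: relative position of each component

print(amplitude.shape, phase.shape)              # both (len(x) // 2 + 1,)
```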
To fully leverage the potential of these enriched feature representations, the choice of a robust recognition model becomes crucial. DL models, such as convolutional neural networks (CNNs), have demonstrated remarkable performance in BVR compared to traditional methods (Lin et al., 2016; Sinha et al., 2020). Additionally, studies have found that deeper network architectures generally perform better in BVR (Ruff et al., 2019; Xie et al., 2019; Florentin et al., 2020). Deep network architectures, while powerful, are prone to gradient-related issues that can diminish model performance. To counteract these challenges, we employ residual networks (ResNets) in our approach. ResNets have proven effective for BVR by enabling the training of deeper models and alleviating the impact of vanishing and exploding gradients (He et al., 2020; Manna et al., 2023). Furthermore, recognizing bird vocalizations in the wild presents additional challenges, as environmental noise can substantially hinder recognition performance. Studies have shown that attention mechanisms can help mitigate these issues by focusing on the most relevant parts of the audio signal, such as distinctive bird vocalization features, thus enhancing recognition accuracy even in noisy conditions (Xu et al., 2020; Jiang et al., 2021; Xiao et al., 2022c). By integrating attention mechanisms into DL models, the performance of BVR systems can be further improved, making them more robust to noise and better at capturing important features. Noumida and Rajan (2022) used a hierarchical attention mechanism to focus on the information output by the hidden layer of a bidirectional gated recurrent unit model; the model achieved considerable performance on the Xeno-Canto dataset, with an F1-score of 84%. Xiao et al. (2022b) proposed AMResNet, an automatic bird sound recognition model that integrates an attention mechanism and a ResNet. The attention mechanism enhances recognition by focusing on relevant bird sound features, while the ResNet addresses gradient vanishing issues. Tested on 12,651 bird sound samples, the model achieved a classification accuracy of 92% and an F1-score of 97.1%, outperforming several other models. Hu et al. (2023a) embedded a shuffle attention module between the double-layer residual module connected by base block and down block, which transfers effective information and enhances the ripple characteristics of spectrograms, thereby improving the model's accuracy and efficiency. Additionally, they introduced the ScSEnet attention module into the ResNet18 backbone network to reduce the impact of noise in Mel and Sinc spectrograms on classification. Their model achieved high accuracy across three bird sound datasets (Hu et al., 2023b). In this paper, we further investigate attention mechanisms and the structure of ResNets. The ResNet addresses the problems of gradient vanishing and gradient explosion in deep network training by introducing residual connections, allowing the network to learn deeper feature representations. The multi-head attention mechanism enables the model to focus on multiple positions in the sequence simultaneously, capturing different contextual information, which enhances the model's performance.
The main contributions of this paper are as follows:
(1) We propose MHAResNet, a novel model that combines a residual network with a multi-head attention mechanism. By developing a local feature extraction module, our approach reduces the number of parameters while enhancing the feature learning capability for noisy bird vocalizations.
(2) We extract POW, GD, and IF features from bird vocalizations. These features are fused to serve as inputs to the recognition model, providing a comprehensive representation of bird vocalizations.
(3) We conduct experiments on various datasets and compare the performance of our model with other DL-based BVR models. The results demonstrate that our network exhibits strong generalization ability and outperforms existing methods.
In this paper, we propose a BVR method based on feature fusion and MHAResNet (shown in Fig. 1). Firstly, we down-sample the bird audio samples and split them into equal-length segments. We then extract POW, GD, and IF features from each segment and fuse them into three-channel features of size (3, 224, 224) as inputs. The MHAResNet model incorporates the MHA mechanism into ResNet34 to enhance feature extraction capabilities. Finally, we use Softmax as a classifier to map the input multi-features to probability distributions across different classes.
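The preprocessing step can be sketched as follows; the file path is hypothetical, and librosa.load is used here because it resamples on read (the 16 kHz rate and 2-s clip length match those used in this study):

```python
import librosa
import numpy as np

TARGET_SR = 16000
CLIP_SECONDS = 2

def load_clip(path: str) -> np.ndarray:
    """Load a mono clip resampled to 16 kHz and pad/trim it to exactly 2 s."""
    y, _ = librosa.load(path, sr=TARGET_SR, mono=True)   # librosa resamples while loading
    target_len = TARGET_SR * CLIP_SECONDS
    if len(y) < target_len:
        y = np.pad(y, (0, target_len - len(y)))
    return y[:target_len]

clip = load_clip("Birdsdata/AnserAnser/sample_0001.wav")  # hypothetical path
```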
We choose the Birdsdata dataset, derived from the Beijing Academy of Artificial Intelligence (BAAI)'s bird call library (https://www.aminer.cn/research_report/5f3394d73c99ce0ab7bc771f), to evaluate our method. The original dataset contains 14,311 labeled sound clips from 20 categories, each with an audio length of 2 s. We resampled all the audio samples to a sampling frequency of 16 kHz using Librosa 0.7.2. To avoid data imbalance, we followed the approach used by Xiao et al. (2022b) and excluded the species Perdix perdix due to its insufficient representation in the dataset. After processing, the dataset consists of 14,282 sound clips from 19 species of birds, spanning 9 orders, 12 families, and 16 genera. The detailed information of each bird species is presented in Table 1.
Genus | Family | Order | Scientific name | Number of samples | Abbreviation |
Anser | Anatidae | Anseriformes | Anser anser | 759 | AA |
Ardea | Ardeidae | Pelecaniformes | Ardea cinerea | 850 | ACi |
Anas | Anatidae | Anseriformes | Anas crecca | 602 | ACr |
Accipiter | Accipitridae | Accipitriformes | Accipiter gentilis | 733 | AG |
Anas | Anatidae | Anseriformes | Anas platyrhynchos | 766 | AP |
Buteo | Accipitridae | Accipitriformes | Buteo buteo | 290 | BB |
Coturnix | Phasianidae | Galliformes | Coturnix coturnix | 738 | CCo |
Cygnus | Anatidae | Anseriformes | Cygnus cygnus | 800 | CCy |
Fulica | Rallidae | Gruiformes | Fulica atra | 460 | FA |
Gavia | Gaviidae | Gaviiformes | Gavia stellata | 835 | GS |
Himantopus | Recurvirostridae | Charadriiformes | Himantopus himantopus | 786 | HH |
Phalacrocorax | Phalacrocoracidae | Suliformes | Phalacrocorax carbo | 852 | PCa |
Phasianus | Phasianidae | Galliformes | Phasianus colchicus | 797 | PCo |
Passer | Passeridae | Passeriformes | Passer domesticus | 1195 | PD |
Rallus | Rallidae | Gruiformes | Rallus aquaticus | 680 | RA |
Tringa | Scolopacidae | Charadriiformes | Tringa ochropus | 710 | TO |
Tringa | Scolopacidae | Charadriiformes | Tringa glareola | 825 | TG |
Tringa | Scolopacidae | Charadriiformes | Tringa totanus | 790 | TT |
Vanellus | Charadriidae | Charadriiformes | Vanellus vanellus | 814 | VV |
We fuse the amplitude information and phase information of the audio to construct the fused feature. The feature extraction process is shown in Fig. 2 (Hidaka et al., 2022).
Following the short-time Fourier transform, the audio signal is divided into frames along the time axis. The Gabor function then transforms each frame of the audio segment from the time domain to the frequency domain (as shown in Eq. (1)).
$$\widehat{\mathrm{Gabor}}(f,t)=k\,e^{i\theta}\int_{-\infty}^{+\infty}e^{-i2\pi (f-f_{0})t}\,\omega(at)\,dt=\frac{k}{a}\,e^{i\theta}\,\omega\!\left(\frac{f-f_{0}}{a}\right) \tag{1}$$
where $k$ is a constant amplitude coefficient that scales the Gabor function.
The above equation can be viewed as a complex wave whose amplitude is given by Eq. (2):
$$A=\left\|\widehat{\mathrm{Gabor}}(f)\right\|=\frac{k}{a}\,\omega\!\left(\frac{f-f_{0}}{a}\right) \tag{2}$$
The amplitude derived from Eq. (2) is squared, down-sampled, and converted to a logarithmic scale to obtain POW, as expressed in Eq. (3):
$$\mathrm{POW}=\log\!\left(\mathrm{downsample}\!\left(A^{2}\right)\right) \tag{3}$$
In Eq. (1), the complex exponential term $e^{i\theta}$ can be expanded by Euler's formula into $\cos\theta + i\sin\theta$, so the Gabor function can be rewritten as Eq. (4):
$$\widehat{\mathrm{Gabor}}(f)=\frac{k}{a}\left(\cos\theta+i\sin\theta\right)\omega\!\left(\frac{f-f_{0}}{a}\right) \tag{4}$$
The real and imaginary parts of the Gabor function can then be represented as Eq. (5) and Eq. (6), respectively:
$$\operatorname{Re}\!\left(\widehat{\mathrm{Gabor}}(f)\right)=\frac{k}{a}\cos\theta\,\omega\!\left(\frac{f-f_{0}}{a}\right) \tag{5}$$
$$\operatorname{Im}\!\left(\widehat{\mathrm{Gabor}}(f)\right)=\frac{k}{a}\sin\theta\,\omega\!\left(\frac{f-f_{0}}{a}\right) \tag{6}$$
The cosine and sine terms in Eq. (5) and Eq. (6) represent the real and imaginary parts of the signal, respectively. The phase (in radians) of the signal is shown as Eq. (7):
$$\phi(f)=\tan^{-1}\frac{\operatorname{Im}\!\left(\widehat{\mathrm{Gabor}}(f)\right)}{\operatorname{Re}\!\left(\widehat{\mathrm{Gabor}}(f)\right)} \tag{7}$$
Frequency-domain differentiation denotes the partial derivative of the phase with respect to frequency, i.e., the rate of change of the phase in the frequency domain. This is the GD, whose formula is given in Eq. (8):
$$\mathrm{GD}=\frac{\partial}{\partial f}\phi(f) \tag{8}$$
Correspondingly, unfolding the phase in the time domain decomposes it into segments along the time axis. For time frames $t(i) \in T$, $i = 1, 2, \ldots, F$, where $F$ is the total number of frames, the phase difference between frame $t(i)$ and frame $t(i+1)$ represents the rate of change of the phase over time; this is the IF.
The POW, GD, and IF described in this paper are all derived per time frame. Finally, as in the short-time Fourier transform, the features obtained are stacked and concatenated along the time axis to obtain two-dimensional time-frequency feature maps. The resulting spectrograms consist of the POW spectrum, GD spectrum, and IF spectrum, all standardized to dimensions of (224, 224). These individual spectrograms are concatenated to form fused features with a size of (3, 224, 224), serving as input data for the MHAResNet network.
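As an illustration, the three spectrograms could be computed with an ordinary STFT standing in for the Gabor analysis described above; the FFT size, hop length, and the use of cv2.resize for standardization to 224 × 224 are assumptions, since the exact parameters are not specified in the text:

```python
import numpy as np
import librosa
import cv2

def pow_if_gd(y: np.ndarray, n_fft: int = 1024, hop: int = 128) -> np.ndarray:
    """Return a (3, 224, 224) stack of POW, IF, and GD maps (parameter values are assumptions)."""
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop)              # complex spectrogram (freq, frames)
    pow_spec = np.log(np.abs(S) ** 2 + 1e-10)                     # logarithmic power, cf. Eq. (3)

    phase = np.angle(S)
    unwrapped_t = np.unwrap(phase, axis=1)                        # unwrap along time
    inst_freq = np.diff(unwrapped_t, axis=1, append=unwrapped_t[:, -1:])    # IF: frame-to-frame phase change
    unwrapped_f = np.unwrap(phase, axis=0)                        # unwrap along frequency
    group_delay = np.diff(unwrapped_f, axis=0, append=unwrapped_f[-1:, :])  # GD: phase change over frequency, cf. Eq. (8)

    maps = [cv2.resize(m.astype(np.float32), (224, 224)) for m in (pow_spec, inst_freq, group_delay)]
    return np.stack(maps, axis=0)                                 # fused feature of size (3, 224, 224)

fused = pow_if_gd(np.random.randn(32000))                         # a random stand-in for a 2-s, 16 kHz clip
print(fused.shape)                                                # (3, 224, 224)
```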
The MHAResNet model integrates residual blocks and a multi-head attention mechanism to enhance feature extraction capabilities. It consists of convolutional layers, batch normalization, ReLU activations, max pooling, and a sequence of residual blocks (MRBlocks) that progressively decrease in size while increasing in depth. A multi-head attention block is introduced to capture salient features within the bird vocalizations, followed by average pooling and a fully connected layer for classification.
MRBlocks is a sequence of four residual block groups and one multi-head attention block. From left to right, the residual block groups contain 3, 4, 6, and 3 residual layers, respectively. Fig. 1 labels the input and output feature-map dimensions of each block group. As the data flow through the residual blocks, the feature maps decrease in spatial size but increase in depth. Each residual layer consists of two 3 × 3 convolutional layers and two normalization layers, with a ReLU activation function added between the first and second convolutional layers. These two successive convolution operations define the residual function F. The output of the residual function is summed with the input X through skip connections. Eventually, the result of the summation is nonlinearly transformed by the ReLU function to obtain the output of the residual block. The specific formula is shown in Eq. (9):
$$Y_{i}=h(X_{i})+F(X_{i},W_{i}) \tag{9}$$
where $Y_i$ is the output of the residual block, $X_i$ is the input, $h(X_i)$ is the identity mapping, and $F(X_i, W_i)$ is the residual function with weights $W_i$.
The architecture utilizes skip connections (indicated by dashed arrows) that allow gradients to flow through the network without disappearing or exploding (He et al., 2016).
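A minimal PyTorch sketch of such a residual layer is shown below (two 3 × 3 convolutions with batch normalization, a ReLU in between, and a skip connection implementing Eq. (9)); the channel counts and the 1 × 1 projection used when shapes differ are standard ResNet conventions rather than details taken from the text:

```python
import torch
import torch.nn as nn

class ResidualLayer(nn.Module):
    """Basic residual layer: Y = h(X) + F(X, W), followed by ReLU (Eq. 9)."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.f = nn.Sequential(                                   # residual function F(X, W)
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # h(X): identity when shapes match, otherwise a 1x1 projection on the skip path
        self.h = nn.Identity() if (in_ch == out_ch and stride == 1) else nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False), nn.BatchNorm2d(out_ch))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.h(x) + self.f(x))

block = ResidualLayer(64, 64)
print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])
```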
The Multi-head Attention (MHA) Block (see Fig. 1) receives the output from the first residual block group and further processes the feature information without changing the spatial dimensions. The sequence X with output dimension (N, C, H, W) from the previous residual block group is first reshaped to (N × H × W, 1, C) to match the (L, N, E) input format of the multi-head attention, where L is the sequence length, N is the batch size, and E is the embedding dimension; here the sequence length is N × H × W, the batch size is 1, and the embedding dimension is C. The reshaped sequence X′ then enters the multi-head attention block. Q, K, and V are the mapping matrices of the input sequence: the input embedding I is multiplied by the corresponding weight matrices to initialize the three mapping matrices, as shown in Eq. (10).
$$Q=W^{Q}I,\qquad K=W^{K}I,\qquad V=W^{V}I \tag{10}$$
where $W^Q$, $W^K$, and $W^V$ are learnable parameters; $Q$ and $K$ have dimension $d_k$, and $V$ has dimension $d_v$.
Attention weights are then obtained by calculating the similarity between Q and K through the scaled dot-product attention operation (see Fig. 1), as shown in Eq. (11):
$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V \tag{11}$$
In this mechanism, each "head" performs the same operation, but on a different representation subspace. In this way, the model is able to capture information from multiple levels of abstraction, improving its expressive power and capacity to handle complexity. The MHA output is the concatenation of the individual head outputs. The information is subsequently integrated by another linear transformation to produce the final feature representation. This multi-angle feature integration allows the model greater flexibility and adaptability in processing sequence data. The specific formula is shown in Eq. (12):
$$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_{1},\ldots,\mathrm{head}_{h})W^{O},\qquad \mathrm{head}_{i}=\mathrm{Attention}\!\left(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V}\right) \tag{12}$$
where $W_i^Q$, $W_i^K$, and $W_i^V$ are the projection matrices specific to head $i$, $W^O$ is the output projection matrix, and $h$ is the number of heads. Each head corresponds to a different representation subspace, allowing the model to capture different features in multiple subspaces.
Finally, the 3D feature maps are converted into 1D vectors of size 512 by average pooling (AvgPool). Since AvgPool is executed after MRBlocks, the feature maps being flattened are of size 1 × 1 and are fed to the fully connected (FC) layer with 512 hidden units. Based on the number of classes in the bird call dataset, the output of the FC layer is a 1 × 19 tensor.
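The reshaping and attention step described above can be sketched with PyTorch's nn.MultiheadAttention; the number of heads is an assumption, as it is not stated in the text:

```python
import torch
import torch.nn as nn

class MHABlock(nn.Module):
    """Treat every spatial position of a (N, C, H, W) feature map as one token of dimension C."""
    def __init__(self, channels: int, num_heads: int = 4):   # num_heads is an assumed value
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=num_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        seq = x.permute(0, 2, 3, 1).reshape(n * h * w, 1, c)   # (L, batch, E) = (N*H*W, 1, C)
        out, _ = self.attn(seq, seq, seq)                      # self-attention: Q = K = V = seq
        return out.reshape(n, h, w, c).permute(0, 3, 1, 2)     # back to (N, C, H, W)

x = torch.randn(1, 64, 28, 28)        # small spatial size for a quick demo
print(MHABlock(64)(x).shape)          # torch.Size([1, 64, 28, 28])
```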
In the experiments, all the data are randomly divided according to the ratio of 6:2:2, which are used as the training set, validation set and test set respectively. The network model is built based on PyTorch 2.0.1 DL framework. The programming language is Python 3.8.0. The hardware environment is CPU Intel(R) i5-9300H and GPU Nvidia GeForce GTX 1050.
The model was trained for 200 epochs using the Adam optimizer, with an initial learning rate of 0.0001. We employed the step decay learning rate schedule (Ge et al., 2019), in which the learning rate is updated according to Eq. (13).
$$lr_{\mathrm{new}}=lr_{\mathrm{initial}}\times E^{\left\lfloor M/N\right\rfloor} \tag{13}$$
where $lr_{\mathrm{initial}}$ is the initial learning rate, $E$ is the decay factor, $M$ is the current epoch number, and $N$ is the number of epochs between successive decays.
We set the batch size to 16. An early stopping method was used during the training phase to prevent the network from overfitting. In addition, we perform L2 regularization for all weight parameters. The loss function is updated as Eq. (14).
$$L'(\theta)=L(\theta)+\lambda\,\frac{1}{2}\left\|\theta\right\|^{2} \tag{14}$$
where $L'(\theta)$ and $L(\theta)$ denote the loss function after and before regularization, respectively, $\theta$ denotes all the weight parameters, and $\lambda$ regulates the influence of the regularization term on the loss. In addition, cross-entropy is used as the loss function because it can balance class similarity and speed up convergence in recognition tasks. The specific formula is shown in Eq. (15).
$$L(\theta)=-\frac{1}{n}\sum_{x}\left[Y_{r}\ln Y_{p}+(1-Y_{r})\ln\left(1-Y_{p}\right)\right] \tag{15}$$
where $n$ is the number of samples, $Y_r$ is the true label, and $Y_p$ is the predicted probability.
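The training configuration described in this section can be sketched as follows; the step size, decay factor, weight-decay coefficient, and early-stopping patience are placeholders, since their exact values are not reported:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 19)                      # stand-in for MHAResNet's classifier head
criterion = nn.CrossEntropyLoss()               # cross-entropy loss, cf. Eq. (15)

# Adam with initial learning rate 1e-4; weight_decay applies an L2 penalty analogous to Eq. (14).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)  # lambda assumed

# Step decay, cf. Eq. (13): multiply the learning rate by gamma every step_size epochs (values assumed).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.5)

best_val, patience, bad_epochs = float("inf"), 10, 0   # simple early stopping (patience assumed)
for epoch in range(200):
    # ... run one training epoch and compute val_loss on the validation split ...
    val_loss = 0.0                               # placeholder value for this sketch
    scheduler.step()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```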
To evaluate the model performance, accuracy, precision, recall, and F1-score are used as the evaluation metrics in this paper. Accuracy is the standard metric for recognition models and applies to both binary and multi-class tasks, as shown in Eq. (16).
$$\mathrm{Accuracy}=\frac{TP+TN}{TP+FP+TN+FN} \tag{16}$$
Precision denotes the proportion of all bird vocalization samples within the predicted label set that were correctly classified, as shown in Eq. (17). Recall is the proportion of all bird vocalization samples within the true label set that were correctly classified, as shown in Eq. (18). F1-score is obtained by weighting the two metrics Precision and Recall, as shown in Eq. (19).
$$\mathrm{Precision}=\frac{TP}{TP+FP} \tag{17}$$

$$\mathrm{Recall}=\frac{TP}{TP+FN} \tag{18}$$

$$F1=\frac{2\,TP}{2\,TP+FN+FP} \tag{19}$$
where TP is the number of samples correctly predicted as positive, TN is the number correctly predicted as negative, FP is the number incorrectly predicted as positive, and FN is the number incorrectly predicted as negative.
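These metrics can be computed from the predicted and true labels, for example with scikit-learn; macro averaging over the 19 classes is an assumption about how the per-class scores are aggregated:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 2, 1, 1, 0, 2, 1]   # toy labels standing in for the 19 bird classes
y_pred = [0, 2, 1, 0, 0, 2, 2]

accuracy = accuracy_score(y_true, y_pred)                                   # Eq. (16)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)                       # Eqs. (17)-(19)
print(f"Acc {accuracy:.3f}  P {precision:.3f}  R {recall:.3f}  F1 {f1:.3f}")
```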
The experiment is divided into three parts. First, we investigate the impact of incorporating attention mechanisms at different positions within the ResNet to identify the optimal model configuration. Next, we conduct ablation experiments to analyze the contribution of each feature to the model’s recognition performance. Finally, we compare our method’s performance with other existing methods on the same dataset and evaluate our model’s performance on different datasets to demonstrate its robustness and generalization capability.
In this section, we compare the effects of adding the mechanism at different positions. To ensure the fairness of the experiments, factors such as learning rate, batch size, and training period were kept constant across all experimental setups, with the only variation being the position of the multi-head attention mechanism. The MHA1 model places the multi-head attention mechanism between groups of residual blocks (between layer1 and layer2). This positioning is intended to facilitate the exchange of information between features at different levels. The MHA2 model introduces the attention module between the Maxpool layer and layer1, aiming to enhance feature representation prior to feature down-sampling. The MHA3 model inserts the multi-head attention mechanism between the Avgpool layer and the fully connected (Fc) layers, with the goal of increasing the discriminative power of the model at the feature decision level.
The recognition accuracies, F1-scores, recalls, and precision values of the models with the attention module at different positions are given in Table 2.
Model | Position | Accuracy (%) | Recall (%) | Precision (%) | F1-score (%) |
MHA1(MHAResNet) | Between Layer1 and Layer2 | 94.0 | 93.9 | 94.1 | 93.8 |
MHA2 | Between Maxpool and Layer1 | 91.9 | 91.6 | 91.8 | 91.5 |
MHA3 | Between Avgpool and Fc | 90.2 | 90.1 | 90.2 | 90.2 |
The experimental results show that the MHA1 model exhibits the best recognition performance, with an accuracy of 94.0% and an F1-score of 93.8%. The effectiveness of the MHA mechanism lies in its ability to allow the model to focus on different subsets of information in parallel while processing sequential data. This enables the model to comprehensively understand the data and consider various combinations of features during decision-making. Inserting the MHA between Layer1 and Layer2 places it at an earlier stage of feature extraction, where the model retains more of the original information and local features, facilitating the exchange and fusion of different features. The MHA2 model achieved an accuracy of 91.9%. By placing the attention module before feature down-sampling, the model has the opportunity to process the original input features in more detail. However, the MHA mechanism at this stage may lack sufficient contextual information to enhance feature representation effectively, resulting in a lower accuracy rate. The accuracy of the MHA3 model was lower than that of the other two models. This may be due to its placement deeper in the network, where some detailed information may be lost before attentional focusing can occur. In summary, inserting the MHA between Layer1 and Layer2 is most effective. This position enhances the model's understanding and integration of features while retaining sufficient detail. It not only provides effective channels for information flow but also helps the model capture complex relationships among features more comprehensively, thereby improving the performance of the recognition task.
We further use Grad-CAM (Selvaraju et al., 2017) to visualize the regions of interest of the models in order to investigate the benefits of the multi-head attention mechanism. Grad-CAM uses the gradients from network back-propagation to compute the weight of each channel of the feature map and obtain a class activation map, which is used as a mask. The input spectrogram is superimposed on this mask at a certain scale to form the final heat map. We use the POW spectrogram as the input to the models ResNet34, MHA1, MHA2, and MHA3. In the heat maps, brighter red areas indicate a higher level of model attention.
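A minimal Grad-CAM sketch using forward and backward hooks on one convolutional stage is shown below; the backbone, target layer, and random input are placeholders rather than the exact setup used in the paper:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet34

model = resnet34(num_classes=19).eval()
target_layer = model.layer4                      # last residual stage (choice of layer is an assumption)
acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 224, 224)                        # stand-in for a fused POW/IF/GD input
score = model(x)[0].max()                              # score of the predicted class
score.backward()

weights = grads["v"].mean(dim=(2, 3), keepdim=True)    # channel weights = spatially averaged gradients
cam = F.relu((weights * acts["v"]).sum(dim=1))         # weighted sum of activations, then ReLU
cam = F.interpolate(cam.unsqueeze(1), size=(224, 224), mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalized heat map in [0, 1]
```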
From Fig. 3, it is evident that ResNet34, MHA2, and MHA3 do not fully focus on the bird vocalization information. For instance, when recognizing GS using ResNet34, the model primarily focuses on high-frequency information, neglecting low-frequency features. Similarly, when recognizing PD, some chirping features are lost. The MHA2 model fails to focus on the bird vocalization region when recognizing GS and overemphasizes noise in the high-frequency region when recognizing ACi. In contrast, the results demonstrate that MHA1 significantly enhances focus on high-frequency details and highly concentrates on the bird vocalization information, leading to the best recognition performance. All the aforementioned results aptly demonstrate that the inclusion of the attention mechanism at the appropriate position (MHA1) allows for focused attention on the comprehensive features present in bird vocalization, resulting in enhanced recognition performance.
In order to analyze the recognition performance of different species, this paper evaluates the performance of our proposed method using the confusion matrix (Fig. 4) and the precision, recall, F1-scores, and accuracy (Table 3) of each species.
AA | ACi | ACr | AG | AP | BB | CCo | CCy | FA | GS | HH | PCa | PCo | PD | RA | TO | TG | TT | VV | |
Precision (%) | 89.7 | 94.7 | 93.3 | 94.8 | 91.9 | 89.5 | 98.7 | 95.8 | 90.7 | 96.9 | 96.1 | 89.8 | 97.1 | 94.9 | 93.1 | 96.4 | 86.6 | 95.3 | 94.9 |
Recall (%) | 89.1 | 95.3 | 87.5 | 94.1 | 94.4 | 91.1 | 97.4 | 95.1 | 92.6 | 96.9 | 94.9 | 91.7 | 95.4 | 91.1 | 88.1 | 97.8 | 94.9 | 92.8 | 97.8 |
Accuracy (%) | 89.4 | 94.9 | 90.2 | 93.9 | 93.8 | 90.5 | 98.7 | 94.9 | 92.1 | 96.3 | 95.1 | 90.5 | 95.9 | 92.1 | 91.1 | 97.6 | 91.6 | 94.5 | 96.2 |
F1-score (%) | 89.3 | 95.1 | 90.5 | 92.7 | 93.1 | 90.3 | 98.1 | 95.4 | 91.7 | 96.9 | 95.5 | 90.8 | 96.2 | 92.9 | 90.5 | 97.1 | 90.6 | 94.1 | 95.9 |
As can be seen from Table 3, MHAResNet achieves a recognition accuracy above 90% for every species except AA. The relatively high F1-scores for CCo (98.1%) and TO (97.1%) indicate that the model performs best at recognizing these two species. This could be due to the larger sample sizes for CCo and TO, which allow the model to learn more feature information and thus improve recognition accuracy; the consistency of their vocalizations also makes it easier for the model to learn distinguishing features. However, AA and BB have relatively low F1-scores of 89.3% and 90.3%, respectively, making them the most challenging species to distinguish. The relatively small number of samples for BB prevents the model from fully learning its features, which impacts the model's generalization ability.
Fig. 5 shows a typical time-frequency diagram for the species AA. It suggests that the complexity of AA's vocal structure, the significant variation in its vocalizations, or the presence of more background noise may explain the increased difficulty in recognition.
In this section, we assess the contribution of three distinct features, POW, IF, and GD, to the model’s recognition performance. The experiments first evaluated the effect of each feature on model performance individually, and then evaluated the effect of different combinations of features. Each experiment was run under the same training conditions to ensure the validity of the results. The experimental results for single-channel features, two-channel features and three-channel features are shown in Table 4.
Feature | Accuracy (%) | Recall (%) | Precision (%) | F1-Score (%) |
POW | 93.1 | 92.9 | 93.1 | 93.1 |
IF | 85.4 | 80.5 | 80.6 | 80.9 |
GD | 81.2 | 81.1 | 81.3 | 81.4 |
POW + IF | 90.2 | 90.0 | 90.1 | 90.1 |
POW + GD | 90.9 | 90.5 | 90.9 | 90.9 |
IF + GD | 84.1 | 83.7 | 83.8 | 83.9 |
POW + IF + GD | 94.0 | 93.9 | 94.1 | 93.8 |
Our ablation study reveals that utilizing the POW feature alone yields a high accuracy of 93.1%, underscoring its efficacy in capturing the essential features of bird vocalizations. The IF, when used alone, demonstrated relatively high accuracy (85.4%), but the F1-score was significantly lower than that of POW. This implies that IF contributes less to the recognition of certain species, making it more likely to produce false positives or false negatives. The GD has the poorest performance when used alone, with accuracy and F1-scores of 81.2% and 81.4%, respectively, indicating that the GD feature alone is insufficient for distinguishing bird species. From the results of the combined features, POW + IF and POW + GD, the accuracy of both combinations is slightly lower than when POW is used alone but higher than when IF or GD are used alone. This indicates that the information provided by POW dominates the combination, and there is overlapping information between the combined features, resulting in no significant improvement in performance. The IF + GD combination is the worst-performing feature combination, with lower accuracy and F1-scores than POW alone. This suggests that phase information alone may be inherently limiting, reflecting only a portion of the characteristics of the bird song signal and therefore being less effective. When IF and GD are used in combination, redundant information may be introduced, which does not provide additional useful information to the model and may lead to model confusion and affect its generalization ability.
POW represents the energy intensity (i.e., tone strength) of a bird’s vocalization. Due to the differences in the energy distribution of different bird species when they emit their chirps, POW becomes a powerful feature for distinguishing bird vocalizations. The IF and GD extracted from the phase information reflect the speed of change of the phase feature of bird vocalization in the time and frequency domain dimensions, respectively. IF represents the frequency change (i.e., pitch) of the bird vocalization, and pitch change is crucial for distinguishing bird vocalizations with similar volume but different pitches (Feng et al., 2018). GD represents the time-varying properties of the amplitude of bird vocalizations. Frequency Modulation (FM) and Amplitude Modulation (AM) are very common in bird vocalizations and are ways for birds to convey complex information (Zollinger et al., 2012; Nemeth et al., 2013). Experimental results show that IF or GD alone is weak in BVR, indicating that phase information alone is not as discriminative as tone intensity information. Changes in tone intensity carry important information such as rhythm and emphasis, whereas phase information tends to vary less significantly than amplitude information over short periods. The highest accuracy and F1-scores (94.0% and 93.8%) are achieved when all features are used in combination. This suggests that although IF and GD perform poorly on their own, they can provide complementary information when combined with POW, carrying more comprehensive information that enhances the model’s recognition ability. Thus, the joint input of magnitude and phase features to the classifier helps to increase robustness (Eisele et al., 2024).
(1) Comparison experiments with other methods on the same dataset
To evaluate the effectiveness of our proposed models, we performed comparative experiments using the Birdsdata dataset. The baseline and state-of-the-art models included Gaussian Mixture Models (GMM) (Ptacek et al., 2016) and Hidden Markov Models (HMM) (de Oliveira et al., 2020), which are traditional statistical models widely utilized in audio signal processing. Additionally, we compared our model with the classical artificial neural network (ANN) (Pahuja and Kumar, 2021; Xiao et al., 2022a) and three variants of ResNet: ResNet18 (Koh et al., 2019), ResNet34 (Koh et al., 2019), and ResNet50 (Schwab et al., 2023), known for their strong performance in image and sound recognition tasks. Emerging models such as Vision Transformer (ViT) (Tanzi et al., 2022) and BirdNET (Kahl et al., 2021) were also included, representing recent trends in feature learning and optimization for bird vocalization recognition (BVR). Finally, AMResNet (Xiao et al., 2022b), an enhanced ResNet architecture with an attention module, was also included. Table 5 presents the recognition accuracies of these methods.
Model | Accuracy (%) |
GMM | 61.1 |
HMM | 63.2 |
ANN | 69.3 |
ResNet18 | 88.3 |
ResNet34 | 89.5 |
ResNet50 | 86.6 |
ViT | 82.8 |
BirdNET | 82.5 |
AMResNet | 92.6 |
MHAResNet (proposed) | 94.0 |
Table 5 indicates that MHAResNet improves accuracy by 1.4% over the previously best-performing model, AMResNet. Notably, aside from AMResNet, none of the other methods achieve an accuracy above 90%, further highlighting the performance advantages of MHAResNet. Traditional statistical models such as GMM and HMM demonstrate limited performance due to their reliance on fixed assumptions about data patterns and temporal relationships. While effective for simple data, their capacity to model the non-linear and dynamic nature of bird vocalizations is constrained. ANN, with its shallow architecture, struggles to extract complex features, leading to lower recognition accuracy compared to deeper models. ResNet-based architectures (ResNet18, ResNet34, and ResNet50) show superior deep feature learning capabilities but lack attention mechanisms, making them more susceptible to noise interference and less effective in capturing discriminative audio features. ViT, a transformer-based model, excels in capturing global dependencies due to its large parameter count but faces challenges of computational inefficiency and overfitting, particularly when dealing with limited datasets. BirdNET employs down-sampling before each residual structure, which adversely impacts its performance on short audio clips by discarding critical high-frequency features. AMResNet, despite incorporating attention mechanisms, does not integrate phase information, limiting its ability to fully capture the intricacies of bird vocalizations.
MHAResNet addresses these limitations by incorporating phase features for enhanced feature fusion and employing an MHA mechanism. This design achieves an optimal balance between performance and parameter efficiency, utilizing approximately 21.8 million parameters. By integrating MHA into the ResNet34 backbone, MHAResNet surpasses traditional models such as GMM and HMM, as well as shallow networks like ANN. Compared to models like ResNet50, ViT, and AMResNet, it achieves higher accuracy with lower computational complexity. Furthermore, it improves upon BirdNET, which compromises high-frequency feature retention due to down-sampling. By integrating phase information and MHA, MHAResNet enhances feature representation while maintaining a lightweight architecture, offering a robust and efficient solution for BVR.
(2) Comparison experiments using MHAResNet on other datasets
We tested our method on two other datasets: a 16-class bird dataset and a 264-class bird dataset based on the Cornell Bird Challenge (CBC) 2020. For both datasets, the training parameters were kept consistent with the previous experiments. As shown in Table 6, our method achieves 98.9% accuracy on the 16-class dataset, 2.6 percentage points higher than the best previously reported performance (Xie et al., 2022). Although Xie et al. (2022) also used feature fusion, they fused only the magnitudes of different time-frequency transforms, which are highly redundant. In contrast, we fuse magnitude and phase, which achieves better results. For the CBC dataset, in order to compare with the latest method (Gupta et al., 2021), we chose the same 100 bird species as they did, with the same dataset division. Our method's accuracy is substantially higher than theirs. They used only common MFCC features, whereas our approach incorporates the phase features of bird vocalizations, providing more comprehensive information. Our model also combines the advantages of the attention mechanism and the residual network, outperforming traditional DL models. These factors contribute to the superior performance of our method.
Study | Number of classes | Feature | Method | Accuracy (comparison, %) | Accuracy (ours, %) | Recall (ours, %) | Precision (ours, %) | F1-Score (ours, %)
Xie et al. (2022) | 16 | MFCC and deep spectral features of STFT, WT, HHT | Multi-view features with RF, SVM, MLP | 96.3 | 98.9 | 97.9 | 98.1 | 98.6
Gupta et al. (2021) | 100 | MFCC and GTCC | Temporal correlation with RNN | 67.0 | 87.1 | 87.2 | 87.5 | 86.8
Our study pioneers the integration of phase information into BVR, addressing the limitations of traditional approaches that focus solely on amplitude features. We proposed a BVR method based on feature fusion, wherein the combined POW, IF, and GD features of bird vocalizations serve as input. The MHAResNet, integrating residual blocks and multi-head attention mechanisms, was designed for accurate bird species recognition. Experiments were conducted on three different bird vocalization datasets with 19, 16, and 100 classes, achieving accuracy rates of 94%, 98.9%, and 87.1%, respectively. These results surpass those of existing methods, demonstrating that the feature set constructed by our method better represents the information inherent in bird vocalizations. Our research paves the way for potential applications in automatic bird diversity monitoring, providing scientists and environmental managers with a powerful tool to better understand and conserve bird species. While the proposed method offers enhanced accuracy and robustness through the integration of phase features, it is important to note that phase information may be susceptible to specific noise types or variations in vocalization patterns, which could impact its effectiveness in certain scenarios. Future work will focus on developing adaptive preprocessing techniques to mitigate these challenges and further enhance the model's resilience to diverse acoustic environments.
Jiangjian Xie: Writing – review & editing, Writing – original draft, Visualization, Methodology, Conceptualization. Zhulin Hao: Writing – review & editing, Writing – original draft, Visualization, Software, Methodology. Chunhe Hu: Writing – review & editing. Changchun Zhang: Writing – review & editing. Junguo Zhang: Writing – review & editing.
The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.
Not applicable.
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ge, R., Kakade, S.M., Kidambi, R., Netrapalli, P., 2019. The step decay schedule: a near optimal, geometrically decaying learning rate procedure for least squares. Adv. Neural Inf. Process. Syst. 32.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, pp. 770–778.
Hidaka, S., Wakamiya, K., Kaburagi, T., 2022. An investigation of the effectiveness of phase for audio classification. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
Jiang, D., Hu, Y., Dai, L., Peng, J., 2021. Facial expression recognition based on attention mechanism. Sci. Program. 2021.
Kinsler, L.E., Frey, A.R., Coppens, A.B., Sanders, J.V., 1999. Fundamentals of Acoustics, fourth ed. Wiley, New York.
Manna, A., Upasani, N., Jadhav, S., Mane, R., Chaudhari, R., Chatre, V., 2023. Bird image classification using convolutional neural network transfer learning architectures. Int. J. Adv. Comput. Sci. Appl. 14.
Ruff, J., Lesmeister, D., Duchac, L., Padmaraju, B., Sullivan, C., 2019. Automated identification of avian vocalizations with deep convolutional neural networks. Remote Sens. Ecol. Conserv. 6, 79–92.
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D., 2017. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE, pp. 618–626.
Xiao, H., Ran, Z., Mabu, S., Li, Y., Li, L., 2022c. SAUNet++: an automatic segmentation model of COVID-19 lesion from CT slices. Vis. Comput. 39, 2291–2304.
Xie, S., Lu, J., Liu, J., Zhang, Y., Lv, D., Chen, X., et al., 2022. Multi-view features fusion for birdsong classification. Ecol. Inform. 72, 101893.
Xu, Y., Li, L., Gao, H., 2020. Sentiment classification with adversarial learning and attention mechanism. Comput. Intell. 37, 774–798.
Zollinger, S.A., Podos, J., Nemeth, E., Goller, F., Brumm, H., 2012. On the relationship between, and measurement of, amplitude and frequency in birdsong. Anim. Behav. 84, e1–e9.