National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

Detailed Record

Author: 劉至峻
Author (English): Chih-Chun Liu
Thesis Title: 使用多模型合併之深度學習應用於音樂片段人聲辨識
Thesis Title (English): Deep Learning Algorithm Using Multi-model Combination Applied to Singing Voice Detection
Advisors: 劉建宏、尤信程
Advisors (English): Chien-Hung Liu; Shing-Chern You
Oral Defense Committee: 蔡偉和、張寶基、劉建宏、尤信程
Oral Defense Date: 2018-07-12
Degree: Master's
Institution: National Taipei University of Technology
Department: Department of Computer Science and Information Engineering
Discipline: Engineering
Academic Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Publication Year: 2018
Graduation Academic Year: 106 (2017-2018)
Language: Chinese
Number of Pages: 45
Keywords (Chinese): 整體式學習 (Ensemble Learning)、遞歸神經網絡 (Recurrent Neural Network)、卷積神經網路 (Convolutional Neural Network)、音樂特徵 (Music Features)、類神經網路 (Neural Network)、深度學習 (Deep Learning)
Keywords (English): Ensemble Learning; Recurrent Neural Network; Convolutional Neural Network; Vocal Detection; Neural Network; Deep Learning
Usage statistics:
  • Cited by: 5
  • Views: 605
  • Rating:
  • Downloads: 105
  • Saved to bibliography lists: 0
Using a machine to classify whether a piece of music contains singing voice has long been an important problem. A previous study showed that feeding the spectral magnitude values obtained from the fast Fourier transform (FFT) directly into a convolutional neural network (CNN) achieves an accuracy of about 92%. To explore ways of further improving the accuracy, this thesis applies ensemble learning to combine the CNN with other neural network architectures, such as Long Short-Term Memory (LSTM), Convolutional LSTM, and Capsule Networks, in order to exploit the strengths of each architecture and see whether the accuracy can be improved. The combination methods used in this thesis include voting, fusion, and post classification, and their accuracies are compared individually. In addition to the Jamendo music corpus, this thesis also uses self-built datasets to verify the effectiveness of the methods. On the Jamendo dataset, combining multiple architectures with voting or post classification reaches a highest average accuracy of 94.2%, which is higher than that of any single architecture; on the self-built datasets, the combined accuracy is also generally better than the best accuracy of any single model. This thesis therefore confirms that ensemble learning is effective for singing voice detection.
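As a concrete illustration of the baseline described in the abstract, the sketch below shows how the short-time spectral magnitude of an audio excerpt can be computed and passed to a small CNN that outputs a vocal / non-vocal probability. This is only a minimal sketch, assuming librosa and TensorFlow/Keras are available; the file name, frame sizes, and network layout are illustrative assumptions, not the architecture used in the thesis.

```python
# Minimal sketch: short-time spectral magnitude features fed to a small CNN.
# The file "example.mp3", the frame sizes, and the network layout are illustrative
# assumptions, not the configuration reported in the thesis.
import numpy as np
import librosa
import tensorflow as tf

def stsa_excerpts(path, sr=16000, n_fft=1024, hop=512, frames=64):
    """Load audio and cut its magnitude spectrogram into fixed-size excerpts."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))   # (freq_bins, time)
    n = mag.shape[1] // frames
    excerpts = [mag[:, i * frames:(i + 1) * frames] for i in range(n)]
    return np.stack(excerpts)[..., np.newaxis]                   # (n, freq, time, 1)

def build_cnn(input_shape):
    """Small CNN that maps one spectrogram excerpt to P(vocal)."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

if __name__ == "__main__":
    x = stsa_excerpts("example.mp3")                 # hypothetical input file
    model = build_cnn(x.shape[1:])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    print(model.predict(x[:4]))                      # untrained probabilities, shape (4, 1)
```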
Detecting the vocal sound in a piece of audio is a fundamental step in many advanced audio processing techniques. A previous study showed that an accuracy of about 92% is achievable for this problem with convolutional neural networks (CNN) using the spectrogram as the input feature. To explore further performance improvements, in this thesis we attempted to incorporate the CNN and other neural network architectures, such as Long Short-Term Memory (LSTM), Convolutional LSTM, and Capsule Networks, into ensemble learning. The ensemble learning approaches studied in this thesis include voting, fusion, and post classification, and the accuracy of each approach is reported. Regarding the training/testing data, in addition to the well-known Jamendo dataset, we also built in-house datasets to validate the studied approaches. When using the Jamendo dataset, the average accuracy reached 94.2% with the voting or post classification approach, which is higher than that of any single architecture. When tested on the in-house datasets, the voting and post classification approaches also yielded better accuracy than any single model. Overall, this thesis confirms that ensemble learning is effective in improving accuracy for the vocal detection problem.
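The two combination schemes that perform best in the abstract, voting and post classification, can be sketched as follows. This is a minimal sketch, not the thesis code: the three base models are represented only by their per-segment vocal probabilities, and the logistic-regression post classifier is an illustrative choice of stacking classifier.

```python
# Minimal sketch of two combination schemes named in the abstract:
# majority voting over per-segment decisions, and post classification (stacking),
# where a small classifier is trained on the stacked base-model outputs.
# The random "probabilities" stand in for real model predictions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def majority_vote(prob_matrix, threshold=0.5):
    """prob_matrix: (n_models, n_segments) vocal probabilities from the base models."""
    votes = (prob_matrix >= threshold).astype(int)                     # hard decision per model
    return (votes.sum(axis=0) > prob_matrix.shape[0] / 2).astype(int)  # majority rule

def train_post_classifier(prob_matrix, labels):
    """Fit a post classifier on the base-model outputs (one feature per model)."""
    clf = LogisticRegression()
    clf.fit(prob_matrix.T, labels)
    return clf

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    probs = rng.random((3, 1000))                    # fake outputs of 3 base models
    labels = (probs.mean(axis=0) > 0.5).astype(int)  # toy ground truth for the demo
    voted = majority_vote(probs)
    post = train_post_classifier(probs, labels).predict(probs.T)
    print("voting accuracy:", (voted == labels).mean())
    print("post-classification accuracy:", (post == labels).mean())
```

A realistic setup would train the post classifier on held-out validation predictions rather than on the data it is evaluated on; the toy labels here only make the example runnable.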
Abstract i
Acknowledgments iii
Table of Contents iv
List of Tables vii
List of Figures viii
Chapter 1 Introduction 1
1.1. Research Motivation and Objectives 1
1.2. Definitions of Terms 1
1.3. Organization of the Thesis 2
Chapter 2 Related Work and Background 3
2.1. Artificial Neural Networks 3
2.1.1. Single-Layer Neural Networks 3
2.1.2. Multi-Layer Neural Networks 4
2.2. Deep Learning 5
2.2.1. Convolutional Neural Network (CNN) 5
2.2.2. Recurrent Neural Network (RNN) 7
2.3. The TensorFlow Deep Learning Framework 9
2.4. Keras 9
2.5. The Jamendo Music Corpus 10
2.5.1. Corpus Sets 10
2.5.2. Ground Truth 10
2.6. The FMA Music Dataset 10
2.7. Review of Related Literature 11
Chapter 3 Deep Learning for Singing Voice Detection 13
3.1. MCNN: CNN Architecture with MFCC Features 13
3.2. CapsNet (Capsule Networks) Architecture 14
3.3. STSA (Short-Time Spectral Amplitude) 15
3.4. SCNN: CNN Architecture with STSA Features 16
3.5. SLSTM: LSTM Architecture with STSA Features 16
3.6. SConvLSTM: ConvLSTM Architecture with STSA Features 17
3.7. Combination and Fusion 17
3.7.1. Combination: Voting 18
3.7.2. Fusion 18
3.7.3. Combination: Post Classification 19
Chapter 4 System Implementation and Experiments 20
4.1. Deep Learning System Environment and Architecture 21
4.2. Music Data Preprocessing System 21
4.2.1. Jamendo Training Data 21
4.2.2. FMA-C-1 Dataset 22
4.2.3. Test Hard Dataset 22
4.3. Data Processing for Voting and Post Classification 22
4.3.1. Validating Model Accuracy 23
4.3.2. Voting Data Processing 23
4.3.3. Post Classification Data Processing 23
4.4. Test Accuracy of Individual Model Architectures 24
4.4.1. Experiment 1: Training and Testing on the Jamendo Dataset 24
4.4.2. Experiment 2: Training and Testing on the FMA-C-1 Dataset 26
4.4.3. Experiment 3: Testing the Test Hard Dataset with Jamendo and FMA-C-1 Weights 28
4.5. Accuracy of Combination by Voting 28
4.5.1. Experiment 4: Voting on the Jamendo Test Set with Different Weights and Model Architectures 28
4.5.2. Experiment 5: Voting on the FMA-C-1 Test Set with Different Weights and Model Architectures 30
4.5.3. Experiment 6: Voting on the Test Hard Dataset with Different Weights and Model Architectures 31
4.6. Accuracy of the Fusion Approach 32
4.6.1. Experiment 7: Fusion Training of Different Model Architectures on the Jamendo Dataset 32
4.6.2. Experiment 8: Fusion Training of Different Model Architectures on the FMA-C-1 Dataset 34
4.7. Accuracy of Combination by Post Classification 36
4.7.1. Experiment 9: Post Classification on the Jamendo Dataset 36
4.7.2. Experiment 10: Post Classification on the FMA-C-1 Dataset 39
4.7.3. Experiment 11: Post Classification on the Test Hard Dataset 39
4.8. Cross-Validation with Jamendo and FMA-C-1 Weights 40
4.8.1. Experiment 12: Cross-Validation with Jamendo and FMA-C-1 Weights 40
4.9. Results and Discussion 41
Chapter 5 Conclusions and Future Work 42
5.1. Conclusions 42
5.2. Future Work 42
References 44