Author: 蕭文逸
Author (English): Hsiao, Wen-Yi
Title (Chinese): 基於捲積式生成對抗網路之自動作曲系統之探討
Title (English): Automatic Symbolic Music Generation Based on Convolutional Generative Adversarial Networks
Advisors: 黃婷婷, 楊奕軒
Advisors (English): Hwang, Ting-Ting; Yang, Yi-Hsuan
Committee Members: 陳煥宗, 劉奕汶
Committee Members (English): Chen, Hwann-Tzong; Liu, Yi-Wen
Oral Defense Date: 2018-07-02
Degree: Master's
Institution: National Tsing Hua University (國立清華大學)
Department: Department of Computer Science (資訊工程學系所)
Discipline: Engineering
Field: Electrical and Computer Engineering
Document Type: Academic thesis
Publication Year: 2018
Graduation Academic Year: 106
Language: English
Number of Pages: 68
Keywords (Chinese): 音樂自動作曲, 音樂資訊檢索, 深度學習, 生成對抗模型
Keywords (English): automatic music generation, music information retrieval, deep learning, generative adversarial nets
Abstract (Chinese, translated): Generating music differs from generating images and videos in several notable ways. First, music is an art of time, so a temporal model is needed. Second, music is usually composed of multiple instruments/tracks, each with its own texture and playing patterns, yet when played together they must echo one another harmoniously. Finally, notes are not related merely in time: neighboring groups of notes form various musical constructs such as chords, arpeggios, and scales. In this thesis, under the framework of convolutional generative adversarial networks, we investigate several issues in multi-track, polyphonic music generation, including track controllability, automatic accompaniment, network design, and temporal modeling. We also train the models on two common music formats, lead sheets and band scores. For analysis, we propose several metrics to measure the quality of the generated music and the harmony between tracks. The thesis offers a comprehensive treatment, from music representation and pre-processing to quantitative comparison of the models, in the hope of gaining deeper insights and understanding the effectiveness and limitations of deep learning techniques.
Abstract (English): Generating music has a few notable differences from generating images and videos. First, music is an art of time, necessitating a temporal model. Second, music is usually composed of multiple instruments/tracks with their own temporal dynamics, but collectively they unfold over time interdependently. Lastly, musical notes are often grouped into chords, arpeggios, or melodies in polyphonic music, so imposing a chronological ordering on the notes is not naturally suitable. In this thesis, we investigate several topics in symbolic multi-track polyphonic music generation under the framework of convolutional generative adversarial networks (GANs), including controllability, accompaniment, network architecture, and temporal modeling. We train and compare the models on two common formats: lead sheet and band score. To evaluate the generated results, a few intra-track and inter-track objective metrics are also proposed. This integrated survey, from data representation and pre-processing to quantitative evaluation of the various architectures, offers insights into composing music and re-examines the effectiveness and limitations of the deep learning models.
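To make the ideas of a symbolic data representation and of intra-/inter-track objective metrics concrete, below is a minimal sketch in Python. It assumes a binary piano-roll tensor of shape (tracks, time steps, pitches) with the pitch axis starting at C; the two functions, empty_bar_ratio and pitch_class_overlap, are illustrative stand-ins and not the exact metrics defined in the thesis.

```python
# Illustrative sketch only (assumed representation, not the thesis's code):
# a song segment is a binary piano-roll tensor of shape
# (n_tracks, n_time_steps, n_pitches), with the pitch axis starting at C.
import numpy as np

def empty_bar_ratio(pianoroll, steps_per_bar=48):
    """Intra-track metric: fraction of bars in which a track plays no note."""
    n_tracks, n_steps, n_pitches = pianoroll.shape
    bars = pianoroll.reshape(n_tracks, n_steps // steps_per_bar,
                             steps_per_bar, n_pitches)
    bar_is_empty = ~bars.any(axis=(2, 3))   # shape: (n_tracks, n_bars)
    return bar_is_empty.mean(axis=1)        # one ratio per track

def pitch_class_overlap(pianoroll, track_a, track_b):
    """Inter-track metric: cosine similarity of two tracks' pitch-class
    histograms, a rough proxy for how harmonically related the tracks are."""
    def pc_histogram(track):
        counts = pianoroll[track].sum(axis=0).astype(float)  # note count per pitch
        return np.array([counts[pc::12].sum() for pc in range(12)])
    a, b = pc_histogram(track_a), pc_histogram(track_b)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm > 0 else 0.0

# Toy usage: 5 tracks, 4 bars of 48 time steps, 84 pitch bins, sparse random notes.
roll = np.random.rand(5, 4 * 48, 84) > 0.97
print(empty_bar_ratio(roll))            # per-track ratio of empty bars
print(pitch_class_overlap(roll, 0, 1))  # value in [0, 1]
```

For random data these numbers are uninformative, but applied to real training data versus generated samples they give a simple way to compare distributions, which is the spirit of the objective evaluation described in the abstract.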
Table of Contents
1 Introduction
  1.1 Motivation
  1.2 Problem Statement
    1.2.1 Multi-track and Polyphony
    1.2.2 Multi-track Interdependency
    1.2.3 Temporal Structure
    1.2.4 Design of Networks
    1.2.5 Temporal Networks
  1.3 Contributions
  1.4 Thesis Organization
2 Related Work
  2.1 Generative Adversarial Networks
  2.2 Video Generation using GANs
  2.3 Symbolic Music Generation
3 Proposed System
  3.1 Data Representation
  3.2 Multi-track Interdependency
    3.2.1 Jamming Model
    3.2.2 Composer Model
    3.2.3 Hybrid Model
  3.3 Modeling the Temporal Structure
    3.3.1 Generation from Scratch
    3.3.2 Track-conditional Generation
  3.4 Integrated System
4 Implementation
  4.1 Dataset
  4.2 Data Preprocessing
    4.2.1 LPD dataset
    4.2.2 Lead Sheet dataset
  4.3 Model Settings
    4.3.1 Random Vectors
    4.3.2 Generator
    4.3.3 Discriminator
    4.3.4 Encoder
    4.3.5 Training
5 Experiments
  5.1 Objective Metrics for Evaluation
  5.2 Experiments
    5.2.1 Analysis of Training Data
    5.2.2 Example Results
    5.2.3 Quantitative Evaluation
    5.2.4 Training Process
  5.3 User Study
  5.4 Interpolation
    5.4.1 Interpolation on inter-track random vectors
    5.4.2 Interpolation on intra-track random vectors
    5.4.3 Bilinear interpolation
6 Design of Networks
  6.1 Networks
  6.2 Experiments
  6.3 Discussion
    6.3.1 Filter and Rhythm
    6.3.2 Revisit the Tonal Distance
7 Temporal Networks
  7.1 Networks
  7.2 Evaluation and Discussion
    7.2.1 CNN
    7.2.2 RNN
8 Conclusions
References