Graduate student (English): Huang, Yu-Min
Thesis title (English): Stochastic Convolutional Recurrent Networks
Advisor (English): Chien, Jen-Tzung
Oral defense committee (English): Wu, Jwo-Yuh; Huang, Szu-Hao; Tsai, Shang-Ho; Tseng, Huan-Hsin
Keywords (English): deep learning; sequential learning; autoregressive generation; variational inference; convolutional neural network; recurrent neural network
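Deep learning has achieved great success in both computer vision and natural language processing. Basically, deep neural networks can handle complicated mappings between high-dimensional inputs and output targets, and they perform well on classification and regression problems. However, deep neural networks still cannot deliver the desired results on generation tasks for high-dimensional data. At the same time, sequence data are everywhere: speech, text, and video are all forms of sequence data. When we deal with sequence learning and generation, it is important to predict or generate future targets from past samples because of the causality present in the signal. We call such prediction, based on all past observations, autoregressive generation. In this thesis, we propose a new stochastic representation learning method for autoregressive generation, in which the generation and inference procedures are extended from the convolutional neural network (CNN) and the recurrent neural network (RNN).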
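Next, we propose two further extensions of the convolutional recurrent network: the stochastic convolutional recurrent network (SCRN) and the multi-resolution convolutional recurrent network (multi-resolution CRN). The stochastic CRN aims to improve the robustness of the CRN by introducing randomness. The stochastic latent variables are optimized by variational inference, in which the evidence lower bound is maximized. On the other hand, we question whether feeding only the information from the last convolutional layer into the recurrent layer encodes sufficiently rich information; the final prediction may be limited as a result. Accordingly, we propose the multi-resolution CRN, which passes the multi-resolution information learned in different convolutional layers to different recurrent layers for decoding. In this way, relationships over different time spans can all be learned in the multi-resolution CRN.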
Deep learning has achieved great success in computer vision and natural language processing. Deep neural networks (DNNs) can learn complicated mappings between high-dimensional inputs and output targets, and they perform well on a range of classification and regression tasks. Nevertheless, generating high-dimensional data remains challenging. Meanwhile, sequence data are everywhere in the real world, ranging from speech signals to natural sentences and video streams, to name a few. In sequential learning and generation, it is important to predict or generate future targets from all previous samples because of the causal property of signals. Such prediction is called autoregressive generation: the prediction at each time step is conditioned on all previous observations. This thesis presents a new stochastic representation learning method for autoregressive generation, in which the inference and generative procedures are built on the convolutional neural network (CNN) and the recurrent neural network (RNN).
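As a brief illustration of the autoregressive setting described above, the chain rule of probability factorizes the joint distribution of a sequence so that each step is conditioned on all previous observations. The symbols $x_t$, $T$, and $\theta$ are notation assumed here for illustration and are not taken from the thesis:

\[
p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta\left(x_t \mid x_{<t}\right), \qquad x_{<t} = (x_1, \dots, x_{t-1}).
\]

Under this factorization, generation proceeds one step at a time by sampling $x_t$ from $p_\theta(x_t \mid x_{<t})$ and appending it to the history.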
An RNN is specialized to characterize sequential patterns and extract temporal information by evolving its dynamic states through time as an internal memory. RNNs have been a popular sequential learning solution for autoregressive modeling for many years. Although CNNs were first developed successfully for spatial modeling in computer vision, the temporal convolutional network (TCN) has recently been proposed for sequential learning. A TCN lends itself to parallel computation, which provides rapid generation. A multilayer TCN can capture a temporal hierarchy in which different layers have receptive fields of different sizes. Both RNNs and TCNs are feasible for sequential learning. This work combines the advantages of the TCN and the RNN to construct the so-called convolutional recurrent network (CRN). Basically, a TCN is powerful at extracting local features, while an RNN can capture long-term temporal dependencies in sequence data. The proposed CRN infers or encodes local information via convolutional layers and then predicts, decodes, or generates each individual time sample via recurrent layers. In other words, the CRN implements a TCN as the encoder and an RNN as the decoder, yielding a hybrid model that characterizes complementary local and global features. Importantly, the recurrent layers in the CRN relax a limitation of the TCN, whose receptive field size is constrained by the number of layers. The CRN places recurrent layers on top of the convolutional layers to compensate for the TCN's insufficient long-term temporal characteristics.
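The encoder-decoder idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the architecture of the thesis: the layer sizes, the ReLU and tanh nonlinearities, the dilation schedule (doubling per layer), and all function and parameter names (`causal_conv1d`, `tcn_encode`, `rnn_decode`, `crn_forward`, `init_params`, `receptive_field`) are hypothetical choices made for the example.

```python
import numpy as np


def causal_conv1d(x, w, dilation):
    """Causal dilated convolution (illustrative).

    x: (T, C_in) input sequence; w: (K, C_in, C_out) kernel.
    The output at time t depends only on x[t], x[t - d], ..., x[t - (K-1)d].
    """
    T, c_in = x.shape
    K, _, c_out = w.shape
    pad = (K - 1) * dilation
    # Left-pad with zeros so that no future sample is ever read.
    xp = np.concatenate([np.zeros((pad, c_in)), x], axis=0)
    y = np.zeros((T, c_out))
    for k in range(K):
        start = k * dilation  # tap k looks back (K - 1 - k) * dilation steps
        y += xp[start:start + T] @ w[k]
    return y


def tcn_encode(x, conv_weights):
    """Stack of causal convolutions with dilation 1, 2, 4, ... (assumed schedule)."""
    h = x
    for i, w in enumerate(conv_weights):
        h = np.maximum(causal_conv1d(h, w, dilation=2 ** i), 0.0)  # ReLU
    return h


def rnn_decode(z, w_in, w_rec, w_out):
    """Plain tanh RNN run over the encoded sequence z of shape (T, H)."""
    h = np.zeros(w_rec.shape[0])
    outputs = []
    for z_t in z:
        h = np.tanh(z_t @ w_in + h @ w_rec)  # the state carries long-term memory
        outputs.append(h @ w_out)
    return np.stack(outputs)


def crn_forward(x, params):
    """Convolutional layers encode local features; recurrent layer decodes."""
    z = tcn_encode(x, params["conv"])
    return rnn_decode(z, params["w_in"], params["w_rec"], params["w_out"])


def receptive_field(kernel_size, num_layers):
    """Receptive field of the conv stack alone, for dilations 1, 2, ..., 2^(L-1)."""
    return 1 + (kernel_size - 1) * (2 ** num_layers - 1)


def init_params(rng, in_dim, hidden, out_dim, num_layers, kernel_size):
    """Small random weights; the sizes are arbitrary illustrative choices."""
    dims = [in_dim] + [hidden] * num_layers
    conv = [0.1 * rng.normal(size=(kernel_size, dims[i], dims[i + 1]))
            for i in range(num_layers)]
    return {
        "conv": conv,
        "w_in": 0.1 * rng.normal(size=(hidden, hidden)),
        "w_rec": 0.1 * rng.normal(size=(hidden, hidden)),
        "w_out": 0.1 * rng.normal(size=(hidden, out_dim)),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_params(rng, in_dim=3, hidden=4, out_dim=5,
                         num_layers=3, kernel_size=2)
    x = rng.normal(size=(20, 3))
    print(crn_forward(x, params).shape)
    print(receptive_field(kernel_size=2, num_layers=3))
```

In this sketch the convolutional stack alone sees a window whose length is fixed by `receptive_field`, whereas the recurrent state `h` can in principle carry information from arbitrarily far back, which is the complementarity the paragraph above appeals to.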
The proposed CRN is further improved by two extensions: the stochastic CRN (SCRN) and the multi-resolution CRN. The stochastic CRN aims to improve the robustness of the CRN by incorporating stochasticity. The randomness of the latent variables is handled in the optimization procedure via variational inference, in which the variational lower bound of the log likelihood, marginalized over the latent variables, is maximized. On the other hand, we question the richness of the encoded information in the CRN, where only the latent variable in the last convolutional layer is fed into the recurrent layer, which may constrain prediction performance. Accordingly, the multi-resolution CRN captures the multi-resolution encodings from different convolutional layers and feeds them into different recurrent layers for decoding and prediction at each time step. Temporal dependencies of different lengths are thereby learned in the multi-resolution CRN.
Experiments on language modeling and action recognition are conducted to investigate the performance of the different variants of convolutional recurrent networks. We show the merits of the proposed methods by comparing them with RNN and TCN under different experimental settings.
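For reference, the variational lower bound mentioned above has the following generic form for a latent-variable model with observations $x$ and latent variables $z$. This is the standard evidence lower bound, not the specific objective derived in the thesis; the approximate posterior $q_\phi$, the prior $p_\theta(z)$, and the parameter symbols $\theta$ and $\phi$ are assumed notation, and the sequential factorization used by SCRN may differ:

\[
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z)\right).
\]

Maximizing the right-hand side over $\theta$ and $\phi$ trades off reconstruction of the data against keeping the approximate posterior close to the prior.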
1 Introduction
1.1 Background
1.2 Motivation
1.3 Outline
2 Deep Learning
2.1 Deep Neural Network
2.1.1 Error backpropagation
2.1.2 Stochastic gradient descent
2.2 Recurrent Neural Network
2.2.1 Backpropagation through time
2.2.2 Long short-term memory
2.3 Convolutional Neural Network
2.3.1 Convolutional layer
2.3.2 Pooling layer
2.3.3 Temporal Convolutional Network
2.4 Variational Autoencoder
2.4.1 Autoencoder
2.4.2 Variational inference
2.4.3 Variational autoencoder
2.4.4 Ladder variational autoencoder
3 Deterministic Autoregressive Models
3.1 Dilated RNN
3.2 CLDNN
4 Stochastic Autoregressive Models
4.1 Variational Recurrent Neural Network
4.1.1 Network architecture
4.1.2 Optimization procedure
4.2 Stochastic Wavenet
4.2.1 Network architecture
4.2.2 Optimization procedure
4.3 Stochastic Temporal Convolutional Network
4.3.1 Network architecture
4.3.2 Optimization procedure
5 Stochastic Convolutional Recurrent Network
5.1 Convolutional Recurrent Network
5.2 Stochastic Convolutional Recurrent Network
5.3 Multi-resolution Convolutional Recurrent Network
5.4 Generalization
5.5 Comparison
6 Experiments
6.1 Language Modeling
6.2 Action Recognition
7 Conclusion and Future Works
7.1 Conclusions
7.2 Future Works
