Graduate Student: Christopher Chuang
Thesis Title: Deep Learning Based Sign Language Recognition and Scoring Systems
Advisors: Chih-Yung Chang, Chin-Hwa Kuo
Keywords: ConvLSTM, Deep Learning, Siamese LSTM, Sign language recognition
According to the World Health Organization (WHO) [1], over 5% of the global population requires rehabilitation and assistance for hearing loss. Sign language is the primary means of communication for the deaf and hard-of-hearing community, but existing sign language recognition and teaching products are of limited effectiveness. Existing models struggle to capture the complex contextual relationships among sign language gestures and to recognize subtle finger movements. This dissertation proposes a Sign Language Teaching and Scoring System (STSS) that combines coarse-grained classification using a Siamese Long Short-Term Memory (LSTM) network with fine-grained classification using a Convolutional LSTM (ConvLSTM) model. The Siamese LSTM analyzes key point data that has been temporally and spatially normalized, and quickly calculates the similarity between a sample video and the videos in a standardized sign language dataset. It uses an adaptive contrastive loss function that adjusts dynamically according to the similarity measure, which helps the model focus on challenging gestures that are very similar yet distinct. The ConvLSTM then further analyzes the data points whose similarity results exceed a given threshold. Compared with other mechanisms, the proposed STSS achieves better precision, recall, and F1-score.
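The coarse-grained stage described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: a plain key point distance stands in for the learned Siamese LSTM embedding, the loss shown is the standard (non-adaptive) contrastive loss, and every function name, the similarity formula, and the default parameters are assumptions made for illustration only.

```python
import math


def resample(frames, target_len):
    """Temporal normalization: keep target_len frames at evenly spaced indices."""
    if not frames:
        raise ValueError("frames must not be empty")
    if target_len == 1:
        return [frames[0]]
    last = len(frames) - 1
    return [frames[round(i * last / (target_len - 1))] for i in range(target_len)]


def normalize_frame(points):
    """Spatial normalization: center on the centroid and scale to unit radius."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centered = [(x - cx, y - cy) for x, y in points]
    scale = max(math.hypot(x, y) for x, y in centered) or 1.0
    return [(x / scale, y / scale) for x, y in centered]


def sequence_distance(seq_a, seq_b):
    """Mean Euclidean distance between corresponding key points.

    Assumption: this replaces the distance between learned LSTM embeddings.
    """
    total, count = 0.0, 0
    for frame_a, frame_b in zip(seq_a, seq_b):
        for (xa, ya), (xb, yb) in zip(frame_a, frame_b):
            total += math.hypot(xa - xb, ya - yb)
            count += 1
    return total / count


def contrastive_loss(distance, same_class, margin=1.0):
    """Standard contrastive loss, not the adaptive variant of the dissertation."""
    if same_class:
        return distance ** 2  # pull matching pairs together
    return max(0.0, margin - distance) ** 2  # push non-matching pairs apart


def coarse_filter(sample, references, threshold, target_len=4):
    """Return (label, similarity) pairs whose similarity exceeds the threshold.

    In the described system, only these candidates would be passed on to the
    fine-grained ConvLSTM stage.
    """
    prepared = [normalize_frame(f) for f in resample(sample, target_len)]
    candidates = []
    for label, ref in references.items():
        ref_prepared = [normalize_frame(f) for f in resample(ref, target_len)]
        # Hypothetical mapping from distance to a similarity score in (0, 1].
        similarity = 1.0 / (1.0 + sequence_distance(prepared, ref_prepared))
        if similarity > threshold:
            candidates.append((label, similarity))
    return sorted(candidates, key=lambda c: c[1], reverse=True)
```

Because both normalizations are applied before comparison, a sample recorded at a different frame rate, position, or scale than the reference can still match it, which is the purpose of the preprocessing steps the abstract mentions.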
Outline
List of Figures
Chapter 1. Introduction
1.1 Research Goals
1.2 Organization of the Dissertation
Chapter 2. Related Work
2.1 Machine Learning
2.1.1 Hidden Markov Model (HMM)
2.1.2 K-Nearest Neighbor (KNN)
2.1.3 Support Vector Machine (SVM)
2.2 Deep Learning
2.2.1 Convolutional Neural Network (CNN)
2.2.2 Graph Convolutional Network (GCN)
2.2.3 Long Short-Term Memory (LSTM)
2.2.4 Hybrid Networks
2.2.5 Principal Component Analysis Network (PCANet)
Chapter 3. Preliminary
3.1 MediaPipe for Key Point Recognition
3.2 Siamese Neural Network Architecture
3.3 Long Short-Term Memory (LSTM) Network
3.4 Convolutional LSTM (ConvLSTM) Network
Chapter 4. Notations, Assumptions, Problem Description
4.1 Notations and Assumptions
4.2 Problem Description
4.3 Objective
Chapter 5. The Proposed Sign Language Teaching and Scoring System (STSS)
5.1 Data Preprocessing
5.1.1 Input Video Segmentation
5.1.2 Key Point Extraction
5.1.3 Temporal Normalization
5.1.4 Spatial Normalization
5.2 Coarse-Grained Classification using Siamese LSTM Model
5.3 Fine-Grained Classification using ConvLSTM Model
5.4 Summary
Chapter 6. Performance Evaluation
6.1 Dataset
6.2 Simulation Results
6.3 Summary
Chapter 7. Conclusion and Future Work
References

List of Figures
Fig. 3.1. Key point coordinates extracted for each hand with MediaPipe
Fig. 3.2. Key point coordinates extracted for the face and upper body
Fig. 3.3. LSTM cell structure
Fig. 3.4. Convolutional kernel operations over an image
Fig. 5.1. The architecture of the proposed STSS mechanism
Fig. 5.2. Input video segmentation process
Fig. 5.3. Architecture of Siamese LSTM
Fig. 5.4. Architecture of ConvLSTM
Fig. 6.1. Training set distribution
Fig. 6.2. Testing set distribution
Fig. 6.3. Impact of sampling frame interval on accuracy and average classification time
Fig. 6.4. Confusion matrix of each sign language category
Fig. 6.5. Varying threshold and layer counts in relation to recall, precision, and F1-Score
Fig. 6.6. ROC curves for the proposed STSS and TAM models
Fig. 6.7. Comparison of the proposed STSS, ML-CNN, and TAM in terms of precision, recall, and F1-Score
