Graduate student: Tsai, Hsin-Ying
Thesis title: Applying Deep Learning to Develop a Mobile Application of Oral Function Training
Advisor: Yang, Tzyy-Ching
Oral examination committee: Hsiao, Wen-Feng; Wang, Yao-Te
Keywords: Oral Function Training; Deep Learning; MobileNetV3; Feature Landmark
In this research, we interviewed speech therapists and found that many people with physical or mental disabilities need oral function training. Because speech therapists are in short supply and the caregivers and parents who assist them lack professional training, the demand for oral function training cannot be met. We therefore used deep learning techniques to train an oral motor recognition model and then developed an oral health app that helps check whether each oral motor exercise is actually performed.
Face detection approaches can be divided into image-based and feature-based. The former trains the recognition model directly on face images, while the latter trains it on facial landmarks. We compare the two approaches and investigate which is more suitable for an oral motor training system.
This study draws on three areas: oral function training, deep learning techniques, and feature extraction. In oral function training, common practices include sensory abnormality training and motor abnormality training. For deep learning, model architectures are constantly evolving; this study mainly reviews classical models such as AlexNet and VGG16 and finally adopts MobileNetV3 because the model must run on mobile devices. For feature extraction, we reviewed the literature on color, texture, and shape features.
The dataset consists of self-recorded videos. Each video was converted into frames, and three frames per action were sampled at equal intervals and passed to face detection. Faces were detected in 508 images, and data augmentation increased the total to 7112 images. For model training, the full dataset was split into an 80% training set and a 20% test set.
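The sampling and split described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the midpoint rule for choosing "equal interval" frames, the 30-frame example, and the random seed are all assumptions. The only figures taken from the abstract are 508, 7112, and the 80/20 ratio; note that 7112 = 508 × 14, which suggests each detected image contributes 14 images after augmentation.

```python
import random


def equally_spaced_indices(n_frames: int, k: int = 3) -> list[int]:
    """Pick k frame indices at equal intervals.

    Assumption: the clip is cut into k equal segments and the frame at the
    midpoint of each segment is taken. The abstract does not state the rule.
    """
    if n_frames < k:
        raise ValueError("need at least k frames")
    return [int((i + 0.5) * n_frames / k) for i in range(k)]


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and return (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Counts reported in the abstract.
n_detected, n_total = 508, 7112

# Images per detected image after augmentation, and the remainder.
print(n_total // n_detected, n_total % n_detected)

# Sizes of an 80/20 split over image indices.
train, test = train_test_split(range(n_total))
print(len(train), len(test))

# Frame indices sampled from a hypothetical 30-frame action clip.
print(equally_spaced_indices(30))
```

The abstract does not say whether the split was made before or after augmentation. Shuffling augmented images as done here can place variants of the same source frame in both sets, so splitting the 508 source images first would be the more conservative choice.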
MobileNetV3 comes in two architectures, Large and Small, and the width multiplier affects the model's complexity and number of parameters. Based on experiments, we chose MobileNetV3-Small with a width multiplier of 0.75, which keeps both the complexity and the number of parameters low. We then used this model to compare the two face detection approaches. The results show that training directly on the original oral images is more effective than training on images annotated with oral feature landmarks.
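To illustrate why the width multiplier changes model size, the sketch below scales a list of layer widths by 0.75 and rounds each to a multiple of 8, the rounding convention used in common MobileNet reference implementations. The channel list is illustrative and is not taken from the thesis; the helper names are hypothetical.

```python
def make_divisible(value: float, divisor: int = 8) -> int:
    """Round a channel count to a multiple of divisor.

    The result is bumped up by one divisor if rounding would drop
    more than 10% of the requested width.
    """
    rounded = max(divisor, int(value + divisor / 2) // divisor * divisor)
    if rounded < 0.9 * value:
        rounded += divisor
    return rounded


def scaled_channels(base_channels, width_multiplier):
    """Apply the width multiplier to every layer width."""
    return [make_divisible(c * width_multiplier) for c in base_channels]


def pointwise_params(c_in: int, c_out: int) -> int:
    """Number of weights in a 1x1 convolution, ignoring bias."""
    return c_in * c_out


base = [16, 24, 40, 48, 96]  # illustrative layer widths (assumption)
scaled = scaled_channels(base, 0.75)
print(scaled)

# Weights of a 1x1 convolution between the last two widths, before and after.
print(pointwise_params(base[-2], base[-1]), pointwise_params(scaled[-2], scaled[-1]))
```

Because a 1x1 convolution's weight count is the product of its input and output widths, shrinking both by roughly 0.75 reduces that layer's parameters by more than 25%, which is why a smaller multiplier suits mobile deployment.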
The most effective model was then used to develop the oral motor recognition app. The app helps clients carry out oral motor abnormality training on their own, and it also lets speech therapists, caregivers, and parents carry out oral sensory abnormality training.
