
National Digital Library of Theses and Dissertations in Taiwan


Detailed Record

Author: 翁英傑 (Weng, Ying-Cieh)
Title (Chinese): 從人類視覺轉移基於Transformer的圖像壓縮到機器感知
Title (English): Transferring Human Visualization to Machine Perception based on Transformer-based Image Compression
Advisors: 彭文孝 (Peng, Wen-Hsiao), 盧鴻興 (Lu, Horng-Shing)
Committee members: 江瑞秋, 杭學鳴, 邱維辰, 彭文孝, 盧鴻興
Degree: Master's
Institution: National Yang Ming Chiao Tung University
Department: Institute of Data Science and Engineering
Discipline: Computing
Academic field: Software Development
Document type: Academic thesis
Year of publication: 2023
Graduation academic year: 111 (2022-2023)
Language: English
Number of pages: 45
Keywords (Chinese): Transformer-based image compression; image compression for machine perception; prompt tuning
Keywords (English): Transformer-based image compression; coding for machine; prompt tuning
Usage statistics:
  • Cited by: 0
  • Views: 94
  • Rating: none
  • Downloads: 0
  • Bookmarked: 0
Abstract (translated from Chinese): The goal of this work is to transfer a Transformer-based image compression model from human vision to machine perception without re-training or fine-tuning the entire model. We propose a transferable Transformer-based image compression framework named TransTIC. Inspired by visual prompt tuning, we design (1) a task-specific prompt generator that produces per-image prompts injected into the encoder, and (2) task-specific prompts injected into the decoder. Extensive experiments confirm that the proposed method transfers an image compression model trained for human vision to various machine perception tasks and significantly outperforms competing methods. To the best of our knowledge, this is the first work to apply the prompt-tuning mechanism to the low-level task of image compression. Finally, through a study of multi-task training, we obtain a better understanding of how the base codec affects performance on machine vision tasks.
Abstract (English): This work aims to transfer a Transformer-based image compression codec from human vision to machine perception without fine-tuning the codec. We propose a transferable Transformer-based image compression framework, termed TransTIC. Inspired by visual prompt tuning, we introduce an instance-specific prompt generator that injects instance-specific prompts into the encoder, along with task-specific prompts injected into the decoder. Extensive experiments demonstrate that the proposed method can transfer the codec to various machine tasks. To the best of our knowledge, this is the first attempt to leverage prompting for the low-level task of image compression. Through an investigation of multi-task training, we gain a better understanding of how the pre-trained base codec affects machine-task performance.
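To make the prompting idea concrete, the PyTorch sketch below illustrates the general mechanism the abstracts describe: small learnable prompt tokens are concatenated with the image tokens inside a frozen Transformer attention block, so that only the prompts are trained for the downstream machine task. This is a minimal sketch, not the TransTIC implementation; all names here (PromptedAttentionBlock, num_prompts, and so on) are hypothetical.

```python
# Minimal sketch (assumed, not the authors' code) of prompting a frozen
# Transformer attention block: the pre-trained weights stay fixed and
# only the learnable prompt tokens receive gradients.
import torch
import torch.nn as nn

class PromptedAttentionBlock(nn.Module):
    """Frozen self-attention block that attends over [prompts; tokens]."""
    def __init__(self, dim: int, num_heads: int, num_prompts: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Task-specific prompts: learned once per downstream task.
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)
        # Freeze the pre-trained attention weights; only prompts are tuned.
        for p in self.attn.parameters():
            p.requires_grad = False

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim), e.g., flattened window tokens of a Swin block.
        b = tokens.size(0)
        prompts = self.prompts.expand(b, -1, -1)
        kv = torch.cat([prompts, tokens], dim=1)
        # Image tokens query both the prompts and the other image tokens;
        # the prompts steer the frozen codec toward the machine task.
        out, _ = self.attn(tokens, kv, kv, need_weights=False)
        return out

# Usage: only `prompts` receives gradients during task transfer.
block = PromptedAttentionBlock(dim=96, num_heads=4, num_prompts=8)
x = torch.randn(2, 64, 96)   # two images, an 8x8 token window, 96 channels
y = block(x)                 # (2, 64, 96)
```

Note that this sketch covers only the task-specific case; in the setting the abstracts describe, the encoder-side prompts would additionally be generated per input image by a prompt generator (instance-specific), while the decoder-side prompts stay fixed per task.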
摘要 (Chinese abstract) ... i
Abstract ... ii
Contents ... iii
List of Figures ... v
List of Tables ... vii
1 Research Overview ... 1
1.1 Introduction ... 1
1.2 Contribution ... 3
1.3 Organization ... 4
2 Related Work ... 5
2.1 Learned Image Compression ... 5
2.2 Compression for Machine Perception ... 5
2.3 Prompt Tuning ... 7
3 Proposed Method ... 8
3.1 Preliminary ... 9
3.2 System Overview ... 9
3.3 Prompting Swin-Transformer Blocks ... 11
3.4 Extractor in Prompt Generator ... 13
3.5 Training Objective ... 14
3.6 Perceptual Loss ... 14
4 Experiments ... 16
4.1 Settings ... 16
4.1.1 Training Details and Datasets ... 16
4.1.2 Evaluation ... 17
4.2 Baseline Methods ... 18
4.3 Rate-Accuracy Comparison ... 20
4.4 Qualitative Results ... 20
4.5 Complexity Comparison ... 23
4.6 More Qualitative Results ... 24
5 Ablation Experiments ... 28
5.1 Model Design Choice ... 28
5.1.1 IP-type vs. TP-type STBs ... 28
5.1.2 Prompt Depth ... 29
5.2 Prompting Encoder vs. Decoder ... 31
5.3 Comparison with MPEG CFP ... 32
6 Multi-task Training ... 34
6.1 Training Details ... 34
6.2 Rate-Accuracy Comparison ... 36
6.3 Unseen Test Situations ... 36
6.4 Prompt with Multi-task ... 37
7 Conclusion ... 39
References ... 40
Electronic full text (publicly available online from 2025-07-16)
Link to the thesis page at the author's graduating institution