National Digital Library of Theses and Dissertations in Taiwan


Detailed Record

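Author: 林昂賢
Author (English): Lin, Ang-Sheng
Thesis title (Chinese): 一個高效率的XML資料壓縮演算法
Thesis title (English): A High Performance Compression Scheme for General XML Data
Advisor: 陳文進
Advisor (English): Chen, Wen-Chin
Degree: Master's
Institution: National Taiwan University
Department: Graduate Institute of Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2000
Academic year of graduation: 88 (ROC calendar)
Language: English
Number of pages: 49
Keywords (Chinese): 資料壓縮 (data compression); 壓縮器 (compressor)
Keywords (English): Data Compression; XML; Text Mining; DTD; gzip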
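In this thesis, we propose a high-performance compression algorithm for XML data. The algorithm exploits the properties of semistructured data and text mining techniques, and combines them with the gzip compression algorithm provided by the existing zlib library. Throughout compression, the algorithm dynamically analyzes the structure and characteristics of the data. The user does not need to supply any additional grammar information, such as an XML Schema or a DTD, although such information can be used to achieve better compression. The thesis describes the idea and workflow of the algorithm in detail. We also implement a compressor and a decompressor, which we use to verify the algorithm and as a tool for performance testing.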
In this thesis, we propose a high-performance compression scheme for general XML data. The scheme takes advantage of the characteristics of semistructured data, text mining methods, and existing compression algorithms. To compress heterogeneous XML data, we incorporate and combine existing compressors such as zlib, the library behind gzip, as well as a collection of datatype-specific compressors. Our scheme does not need schema information (such as a DTD or an XML Schema), but it can exploit such hints to further improve the compression ratio. Following the proposed approach, we implement a compressor and a decompressor and use them to test and verify our compression scheme.
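The general idea in the abstract (separate the tag structure from the text content, group similar values together, and hand each group to zlib) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the class `SplitHandler`, the function `compress_xml`, the `__structure__` key, and the token format are all hypothetical names chosen for illustration, attributes are ignored, whitespace-only text is dropped, and no decompressor or text mining step is shown.

```python
import xml.sax
import zlib
from collections import defaultdict


class SplitHandler(xml.sax.ContentHandler):
    """Hypothetical SAX client: records the tag structure and routes
    text values into one container per element path."""

    def __init__(self):
        super().__init__()
        self.path = []                       # stack of currently open element names
        self.structure = []                  # token stream describing the tree shape
        self.containers = defaultdict(list)  # element path -> list of text values
        self._buf = []                       # text chunks seen since the last tag

    def _flush(self):
        # SAX may deliver one text node in several chunks, so join them first.
        text = "".join(self._buf).strip()
        self._buf = []
        if text:
            self.containers["/".join(self.path)].append(text)
            self.structure.append("#")       # placeholder: a text value belongs here

    def startElement(self, name, attrs):     # attributes are ignored in this sketch
        self._flush()
        self.path.append(name)
        self.structure.append("<" + name)

    def characters(self, content):
        self._buf.append(content)

    def endElement(self, name):
        self._flush()
        self.structure.append(">")
        self.path.pop()


def compress_xml(xml_text: str) -> dict:
    """Return one zlib-compressed byte string per container, plus the structure."""
    handler = SplitHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    streams = {"__structure__": "\n".join(handler.structure)}
    for path, values in handler.containers.items():
        # Assumes values contain no newlines; a real format would need
        # length prefixes or escaping.
        streams[path] = "\n".join(values)
    return {key: zlib.compress(val.encode("utf-8"), 9) for key, val in streams.items()}


if __name__ == "__main__":
    doc = "<books>" + "".join(
        f"<book><title>Title {i}</title><year>{1990 + i % 10}</year></book>"
        for i in range(200)
    ) + "</books>"
    packed = compress_xml(doc)
    whole = len(zlib.compress(doc.encode("utf-8"), 9))
    split = sum(len(blob) for blob in packed.values())
    print("whole-document zlib:", whole, "bytes; per-container zlib:", split, "bytes")
```

Grouping values by element path places similar data (for example, all years) next to each other, which is the kind of regularity a dictionary-based compressor such as deflate can exploit. Whether this beats compressing the whole document at once depends on the data, so the script prints both sizes rather than assuming a result.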
Contents
1 Introduction
1.1 Motivation
1.2 Survey of Related Research
1.2.1 Semistructured Data
1.2.2 Pattern Discovery and Text Mining
1.3 Overview of Our Proposed Approach
1.4 Thesis Organization
2 Extensible Markup Language and Data Compression
2.1 Extensible Markup Language
2.1.1 What Is XML
2.1.2 XML Document and Document Type Definition (DTD)
2.2 Data Compression
2.2.1 Huffman Coding
2.2.2 Lempel-Ziv LZ77
2.2.3 Deflate Compression
3 A Compression Framework for XML Data
3.1 A Motivating Example
3.2 Separate Structure from Content
3.3 Rearrange the Heterogeneous Data
3.4 Apply Text Mining to Tag Content
4 The Implementation Architecture
4.1 SAX Parser and SAX Client
4.2 Data Item Processing Model
4.3 Container and Gzip
5 Experimental Evaluation
5.1 The Test Data Sources
5.2 The Compression Ratio
5.2.1 Performance Under Different Characteristic Files
5.2.2 Performance Under Different File Sizes
6 Conclusion and Future Work
6.1 Discussion
6.2 Future Work