
Research and Implementation of Chinese Web Page Multi-class Classification Based on SVM

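Author: Wang Xufeng (王绪峰)
Advisor: Tao Yuehua (陶跃华)
University: Yunnan Normal University
Major: Computer Software and Theory
Keywords: web page classification; feature selection; support vector machine; multi-class classifier
Classification number: TP393.092
Type: Master's thesis
Year: 2007
Downloads: 46
Citations: 3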

Abstract

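With the rapid development of Internet technology, society has moved from an era of information scarcity into a digital era of information abundance. Ever more digital information is available, but most of it is semi-structured or unstructured, which makes it very difficult to find the needed information quickly and effectively. For this reason, researchers have proposed automatic classification of Chinese web pages and studied its application; the topic has both theoretical significance and practical value. Automatic classification can organize web pages into per-category databases, which improves the recall and precision of Chinese search engines. It can also build classified information resources automatically and offer users a category directory, and good classification benefits the subsequent relevance-ranking stage.

This thesis studies the traditional support vector machine (SVM) classifier model and, drawing on existing web page classification techniques, systematically studies the construction of SVM multi-class classifier models and proposes an SVM-based algorithm for constructing a multi-class classifier model. On this basis it offers observations on classification-oriented content extraction from Chinese web pages, Chinese word segmentation, feature selection for Chinese web pages, and SVM classifiers for Chinese web pages.

(1) Based on the structure and characteristics of Chinese web pages, the thesis analyzes which information components of a page contribute to classification, uses the text inside the title and body tags to approximate the page's topical content, and designs an algorithm for extracting that text.

(2) It studies Chinese word segmentation and feature extraction in depth, systematically analyzes Chinese word segmentation methods, introduces the word segmentation system of the Information Retrieval Laboratory at Harbin Institute of Technology, adopts an improved χ² statistic (CHI) as its feature selection method, and describes the feature selection algorithm.

(3) It studies SVM multi-class classification in theoretical depth and analyzes earlier methods for constructing SVM multi-class classifiers. Using the kernel-induced distance formula in the high-dimensional feature space, it computes the shortest distance between classes and introduces a weighted complete undirected graph to describe the distance structure among classes in that space. On the principle that the class or set of classes that is easiest to separate should be separated first, it proposes a method for constructing an SVM-based multi-class classifier.

(4) Building on this work, a complete classification system, CWPMCS, is implemented and tested, and the experimental results are analyzed and evaluated. The results show that the system achieves high classification accuracy, higher than that of the K-nearest-neighbor (KNN) classification method.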
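Point (3) of the abstract relies on measuring distances in the kernel-induced feature space, where d(x, y)² = K(x, x) − 2K(x, y) + K(y, y), and on a weighted complete undirected graph whose edges carry the shortest distance between each pair of classes. The abstract does not give the kernel, its parameters, or the exact rule for choosing which class to split off first, so the following is only a minimal sketch under assumptions: a Gaussian (RBF) kernel with an arbitrary gamma, and a hypothetical "farthest nearest neighbour" rule for the easiest class. All function names and the toy data are illustrative and are not taken from the thesis.

```python
import math
from itertools import combinations


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel. ASSUMPTION: kernel choice and gamma are illustrative."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)


def kernel_distance(x, y, kernel=rbf_kernel):
    """Distance between phi(x) and phi(y) in the kernel-induced feature space."""
    sq = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(sq, 0.0))  # clamp tiny negative rounding errors


def class_distance(samples_a, samples_b, kernel=rbf_kernel):
    """Shortest kernel-space distance over all sample pairs of two classes."""
    return min(kernel_distance(a, b, kernel) for a in samples_a for b in samples_b)


def build_class_graph(classes, kernel=rbf_kernel):
    """Weighted complete undirected graph: one weighted edge per unordered class pair."""
    return {
        frozenset((u, v)): class_distance(classes[u], classes[v], kernel)
        for u, v in combinations(sorted(classes), 2)
    }


def easiest_class(graph, labels):
    """ASSUMED heuristic: pick the class whose nearest neighbouring class is farthest away."""
    def nearest(label):
        return min(w for edge, w in graph.items() if label in edge)
    return max(sorted(labels), key=nearest)


if __name__ == "__main__":
    # Toy 2-D feature vectors standing in for web page term vectors.
    toy = {
        "sports": [(0.0, 0.0), (0.1, 0.2)],
        "finance": [(0.3, 0.1), (0.4, 0.0)],
        "travel": [(3.0, 3.0), (3.2, 2.9)],
    }
    graph = build_class_graph(toy)
    print("first class to separate:", easiest_class(graph, toy))
```

In a binary-tree style construction, the selected class would be separated from the rest by one binary SVM, after which the same selection would be repeated on the remaining classes. Whether the thesis proceeds exactly this way cannot be determined from the abstract alone.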

Table of Contents

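Research and Implementation of Chinese Web Page Multi-class Classification Based on SVM (Chinese text)  4-50
  Abstract  5-8
  Chapter 1 Introduction  8-10
    1.1 Research background  8
    1.2 Research approach and main work  8-9
    1.3 Organization of the thesis  9-10
  Chapter 2 Chinese Web Page Preprocessing and Chinese Word Segmentation  10-17
    2.1 Basic structure and characteristics of Chinese web pages  10-11
      2.1.1 Basic structure of Chinese web pages  10
      2.1.2 Analysis of the characteristics of Chinese web pages  10-11
    2.2 Algorithm for extracting the topical content of Chinese web pages  11-13
      2.2.1 Main idea of the algorithm  12
      2.2.2 Algorithm for extracting topical text from web pages  12-13
    2.3 Chinese word segmentation  13-16
      2.3.1 Chinese word segmentation methods  13-15
      2.3.2 Achievements in Chinese word segmentation  15
      2.3.3 Chinese word segmentation in this system (CWPMCS)  15-16
    2.4 Chapter summary  16-17
  Chapter 3 Feature Selection for Chinese Web Pages  17-23
    3.1 Web page representation  17-19
    3.2 Feature selection  19-22
      3.2.1 Common feature selection methods  19-21
      3.2.2 The feature selection method used in this thesis and its algorithm  21-22
    3.3 Chapter summary  22-23
  Chapter 4 Support Vector Machine Theory and Its Application to Chinese Web Page Classification  23-36
    4.1 Core content of statistical learning theory  23
    4.2 Binary classification with SVM  23-27
      4.2.1 The linearly separable case  24-25
      4.2.2 The linearly non-separable case  25-26
      4.2.3 The nonlinearly separable case  26-27
    4.3 Advantages of support vector machines  27-28
    4.4 SVM multi-class classification methods  28-31
      4.4.1 One-against-rest method  28-29
      4.4.2 One-against-one method  29-30
      4.4.3 Decision directed acyclic graph (DAG) method  30
      4.4.4 SVM-based binary tree method  30-31
    4.5 Algorithm for constructing the multi-class classification model  31-35
      4.5.1 Main idea of the algorithm  32-33
      4.5.2 Construction algorithm  33
      4.5.3 Analysis of the algorithm  33-35
    4.6 Chapter summary  35-36
  Chapter 5 System Design and Analysis of Experimental Results  36-41
    5.1 Overall design of CWPMCS (Chinese Webpage Multiclass Classifier System)  36
    5.2 Development environment  36-37
    5.3 Chinese web page data set  37
    5.4 Implementation of CWPMCS functions  37-40
    5.5 Experimental results on Chinese web pages: analysis and evaluation  40
    5.6 Chapter summary  40-41
  Chapter 6 Conclusions and Outlook  41-43
    6.1 Summary of work  41
    6.2 Further research  41-43
  References  43-50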
Research And Implementation Of Chinese Web Page Multi-class Classification Based On SVM  50-91
  Abstract  51-55
  Chapter One Introduction  55-57
    1.1 Background  55
    1.2 Research approach and main work of the thesis  55-56
    1.3 Organization of the thesis  56-57
  Chapter Two Chinese webpage preprocessing technology and Chinese word segmentation  57-64
    2.1 Basic structure and characteristic of Chinese webpages  57-59
      2.1.1 Basic structure of Chinese webpages  57-58
      2.1.2 Chinese webpage characteristic analysis  58-59
    2.2 Extraction algorithm for Chinese webpage theme content  59-60
      2.2.1 Main thought of the algorithm  59-60
      2.2.2 Extraction algorithms in theme text content of the webpage  60
    2.3 Chinese word segmentation  60-63
      2.3.1 The method of Chinese word segmentation  60-62
      2.3.2 Chinese word segmentation achievement  62-63
      2.3.3 Chinese word segmentation in CWPMCS  63
    2.4 Brief summary of this chapter  63-64
  Chapter Three Feature Selection of Chinese webpage  64-70
    3.1 Webpage Expression  64-65
    3.2 Feature Selection  65-69
      3.2.1 Common methods of feature selection  66-68
      3.2.2 The feature selection method and algorithm description in this paper  68-69
    3.3 Brief summary of this chapter  69-70
  Chapter Four Support Vector Machine theory and application in Chinese webpage classification  70-84
    4.1 The key content of statistical learning theory  70-71
    4.2 Binary Classification of SVM Theory  71-75
      4.2.1 Linear separable case  72-73
      4.2.2 Linear inseparable case  73
      4.2.3 Non-linear separable case  73-75
    4.3 The Advantage of Support Vector Machine  75
    4.4 Multi-class Classification of Support Vector Machine  75-79
      4.4.1 One-Against-the-Rest Method  76-77
      4.4.2 One-Against-One Method  77
      4.4.3 Directed Acyclic Graph Method  77-78
      4.4.4 Binary Tree Method Based On SVM  78-79
    4.5 The algorithms of constructing multi-class classifier models  79-83
      4.5.1 Main thought of the algorithm  80
      4.5.2 Algorithm  80-82
      4.5.3 Analysis of the algorithm  82-83
    4.6 Brief summary of this chapter  83-84
  Chapter Five The System Design and Experimental Result Analysis  84-89
    5.1 CWPMCS (Chinese WebPage Multi-Class Classifier System) design  84-85
    5.2 Development environment  85
    5.3 The Chinese webpage data collecting  85
    5.4 CWPMCS function realization  85-88
    5.5 The evaluation and analysis of Chinese webpage experimental result  88
    5.6 Brief summary of this chapter  88-89
  Chapter Six Summary and Prospects  89-91
    6.1 Summary  89
    6.2 Further Research  89-91
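Research on Chinese Web Page Classifiers and Related Technologies (Chinese text)  91-140
  Abstract  92-95
  Chapter 1 Introduction  95-99
    1.1 Background and significance  95-96
    1.2 Current state of text classification research  96-97
    1.3 Current state of web page classification research  97-98
    1.4 The web page classification process  98-99
  Chapter 2 Chinese Word Segmentation  99-103
    2.1 Segmentation methods  99-100
      2.1.1 String-matching-based segmentation  99-100
      2.1.2 Understanding-based segmentation  100
      2.1.3 Statistics-based segmentation  100
    2.2 Recognition of out-of-vocabulary words  100-101
    2.3 Segmentation ambiguity  101
    2.4 Achievements in Chinese word segmentation  101-102
    2.5 Limitations of existing segmentation methods  102-103
  Chapter 3 Dimensionality Reduction  103-109
    3.1 Feature selection methods  103-106
      3.1.1 Document frequency (DF)  103-104
      3.1.2 Information gain (IG)  104
      3.1.3 Mutual information (MI)  104-105
      3.1.4 χ² statistic (χ²-test, CHI)  105-106
      3.1.5 Weight of evidence for text  106
    3.2 Feature extraction methods  106-109
      3.2.1 Principal component analysis (PCA)  107
      3.2.2 Latent semantic indexing (LSI)  107-108
      3.2.3 Non-negative matrix factorization (NMF)  108-109
  Chapter 4 Web Page Classification Methods  109-133
    4.1 Simple vector distance classification  109
    4.2 Decision tree classification  109-111
    4.3 K-nearest-neighbor classification (K-NN)  111-112
    4.4 Rough set classification  112
    4.5 Bayesian classification  112-115
      4.5.1 Naive Bayes classification  113-114
      4.5.2 Bayesian network classification  114-115
    4.6 Neural network classification  115-120
      4.6.1 Basic properties of neural networks  115-116
      4.6.2 Feedforward networks with error backpropagation (BP networks)  116-119
      4.6.3 RBF networks  119-120
    4.7 Support vector machine (SVM) classification  120-132
      4.7.1 Current state of SVM research  120-121
      4.7.2 Core content of statistical learning theory  121-124
      4.7.3 Binary classification based on SVM theory  124-127
      4.7.4 SVM training algorithms  127-128
      4.7.5 SVM multi-class classification  128-130
      4.7.6 Multiple category membership of web pages  130
      4.7.7 Evaluating classifier performance  130-132
  Chapter 5 Concluding Remarks  132-133
    5.1 Summary  132
    5.2 Outlook for Chinese web page classification technology  132-133
  References  133-140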
Chinese Webpage Classifier And Relevant Technology Research  140-189
  Abstract  141-144
  Chapter one Foreword  144-149
    1.1 Background and Meaning  144-146
    1.2 Research state of text classification at present  146
    1.3 The current situation of the webpage classification  146-147
    1.4 The Course of the Webpage Classification  147-149
  Chapter two Chinese Word Segmentation  149-154
    2.1 Segmentation methods  149-151
      2.1.1 The Segmentation Method Based On String Matching  149-150
      2.1.2 The Segmentation Method Based On Understanding  150
      2.1.3 The Segmentation Method Based On Statistics  150-151
    2.2 The Recognition Problem of Out-of-Vocabulary Words  151
    2.3 Ambiguous Segmentation Questions  151-152
    2.4 Chinese Word Segmentation Achievements  152
    2.5 Limitations of Segmentation Method  152-154
  Chapter three Reduction Dimension Technology  154-161
    3.1 Feature Selection  154-159
      3.1.1 Document Frequency  154-155
      3.1.2 Information Gain (IG)  155-156
      3.1.3 Mutual Information (MI)  156-157
      3.1.4 χ² Estimation (CHI)  157-158
      3.1.5 Weight Of Evidence Text  158-159
    3.2 Feature Extraction Method  159-161
      3.2.1 Principal Component Analysis (PCA)  159
      3.2.2 Latent Semantic Indexing (LSI)  159-160
      3.2.3 Non-negative Matrix Factorization (NMF)  160-161
  Chapter four Webpage Classification Method  161-188
    4.1 Simple Vector Distance Classification Method  161
    4.2 Decision Tree Classification method  161-163
    4.3 K-Nearest Neighbour Classification Method (K-NN)  163-164
    4.4 Rough Sets Classification Method  164-165
    4.5 Bayes Classification Method  165-168
      4.5.1 Naive Bayes Classification method  165-167
      4.5.2 Bayes Network Classification Method  167-168
    4.6 Neural Network Classification Method  168-174
      4.6.1 Basic attribute of the Neural Network  168-169
      4.6.2 Feedforward Network (BP network) of error backpropagation  169-173
      4.6.3 RBF network  173-174
    4.7 Support Vector Machine Classification Method  174-188
      4.7.1 The Current Research Situation of Support Vector Machine  174-175
      4.7.2 The key content of statistical learning theory  175-178
      4.7.3 Binary Classification of SVM theory  178-181
      4.7.4 Train Algorithms Of Support Vector Machine  181-183
      4.7.5 Multi-class Classification of Support Vector Machine  183-185
      4.7.6 Multiple category membership of webpages  185-186
      4.7.7 The Performance Appraisal of Classifier  186-188
  Chapter five Conclusion  188-189
    5.1 Contents Summarizing  188
    5.2 Technological Prospect of Chinese Webpage Classifier  188-189
Acknowledgements  189
