NLP: Math with Words

We have collected some words (tokens), counted them, and reduced them to stems or lemmas, so now we can do something interesting with them. Analyzing words is useful for simple tasks, such as computing statistics about word usage or running a keyword search. But we would like to know which words matter more to a particular document and to the corpus as a whole. We can then use that "importance" value to find relevant documents in the corpus, based on how important the keywords are within each document.
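Doing so makes a spam filter less likely to be tripped up by a single rude word or a handful of mildly spammy words in an email. And because a broad range of words carry scores or labels for different degrees of positivity, we can measure how positive or friendly a tweet is. If we know how frequent certain words are in one document relative to the remaining documents, we can use that information to further refine the document's positivity score. In this chapter, we will learn a more nuanced, non-binary measure of words, one that captures how important a word and its usage are within a document. For decades, this approach has been the mainstream way for commercial search engines and spam filters to generate features from natural language.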
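To make the idea of a non-binary importance score concrete, here is a minimal sketch. The passage above does not name the measure it has in mind; this sketch assumes a basic TF-IDF weighting (term frequency within a document, scaled by the log of the inverse document frequency across the corpus), which is one common measure of this kind. The function name `tf_idf` and the toy corpus are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Assumed weighting (not taken from the text above):
    score = (count of term in doc / doc length) * log(N / docs containing term)
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy corpus, already tokenized and lowercased.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
]
scores = tf_idf(corpus)
```

In this toy corpus, "the" appears in every document, so its score is zero everywhere, while "mat" (found in only one document) scores higher in the first document than "cat" (found in two). That matches the intuition above: a word matters to a document when it is frequent there and rare in the rest of the corpus.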

