NLP: Math with Words (Part 4)


The lift that sustains the kite in flight is generated when air flows around the kite’s surface, producing low pressure above and high pressure below the wings. The interaction with the wind also generates horizontal drag along the direction of the wind. The resultant force vector from the lift and drag force components is opposed by the tension of one or more of the lines or tethers to which the kite is attached. The anchor point of the kite line may be static or moving (such as the towing of a kite by a running person, boat, free-falling anchors as in paragliders and fugitive parakites or vehicle).
The same principles of fluid flow apply in liquids and kites are also used under water. A hybrid tethered craft comprising both a lighter-than-air balloon as well as a kite lifting surface is called a kytoon.
Kites have a long and varied history and many different types are flown individually and at festivals worldwide. Kites may be flown for recreation, art or other practical uses. Sport kites can be flown in aerial ballet, sometimes as part of a competition. Power kites are multi-line steerable kites designed to generate large forces which can be used to power activities such as kite surfing, kite landboarding, kite fishing, kite buggying and a new trend snow kiting. Even Man-lifting kites have been made.
Source: Wikipedia
Then assign that text to a variable:
>>> from collections import Counter
>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> from nlpia.data.loaders import kite_text    # same as before: kite_text = "A kite is traditionally ..."
>>> tokens = tokenizer.tokenize(kite_text.lower())
>>> token_counts = Counter(tokens)
>>> token_counts
Counter({'the': 26, 'a': 20, 'kite': 16, ',': 15, ...})

NOTE
The TreebankWordTokenizer returns "kite." (with the trailing period) as a single token. The Treebank tokenizer assumes that your document has already been segmented into separate sentences, so it only strips punctuation from the very end of the string. Sentence segmentation is itself a tricky problem, and we cover it in chapter 11. Even so, the spaCy parser is both faster and more accurate, because it performs sentence segmentation and tokenization (plus a lot of other processing) in a single pass. So in production applications you may want to use spaCy rather than the NLTK components used in these simple examples.
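To see that trailing-punctuation behavior concretely, here is a minimal sketch (not part of the original listing): it feeds a two-sentence string to TreebankWordTokenizer directly, and then again after sentence-splitting with NLTK's sent_tokenize (which needs the "punkt" data). The spaCy lines at the end are optional and assume the en_core_web_sm model is installed; exact output may vary slightly between library versions:

>>> import nltk
>>> nltk.download('punkt', quiet=True)
True
>>> from nltk.tokenize import TreebankWordTokenizer, sent_tokenize
>>> text = "Kites fly high. A kite needs wind."
>>> TreebankWordTokenizer().tokenize(text)      # mid-string 'high.' keeps its period
['Kites', 'fly', 'high.', 'A', 'kite', 'needs', 'wind', '.']
>>> [TreebankWordTokenizer().tokenize(s) for s in sent_tokenize(text)]
[['Kites', 'fly', 'high', '.'], ['A', 'kite', 'needs', 'wind', '.']]
>>> import spacy                                # spaCy: sentences + tokens in one pass
>>> doc = spacy.load('en_core_web_sm')(text)
>>> [tok.text for tok in doc]
['Kites', 'fly', 'high', '.', 'A', 'kite', 'needs', 'wind', '.']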
Okay, back to our example. That output contains a lot of stop words. This Wikipedia article is not likely to be about "the", "a", the conjunction "and", or the other stop words, so let's get rid of them:
>>> import nltk
>>> nltk.download('stopwords', quiet=True)
True
>>> stopwords = nltk.corpus.stopwords.words('english')
>>> tokens = [x for x in tokens if x not in stopwords]
>>> kite_counts = Counter(tokens)
>>> kite_counts
Counter({'kite': 16,
         'traditionally': 1,
         'tethered': 2,
         'heavier-than-air': 1,
         'craft': 2,
         'wing': 5,
         'surfaces': 1,
         'react': 1,
         'air': 2,
         ...,
         'made': 1})

Purely by looking at how many times words occur in a document, you can learn something about it. The terms kite(s), wing, and lift are all important. And if you didn't know what this document was about, and you just happened across it in a vast, Google-like database of documents, you might "programmatically" infer that it has something to do with "flight" or "lift", or, in fact, with "kites".
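If you just want to eyeball the most prominent terms, Counter.most_common from the standard library returns the top of that ranking directly. This is a small convenience sketch reusing the kite_counts object from above; the ordering below "kite" depends on your tokenizer and stop word list:

>>> kite_counts.most_common(3)    # ranking beyond 'kite' may differ with other tokenizers
[('kite', 16), ('wing', 5), ...]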
Things get more interesting when you consider multiple documents in a corpus. Say you have a collection of documents, each about some aspect of kite flying. You would expect "string" and "wind" to occur frequently across all of those documents, so the term frequencies TF("string") and TF("wind") would be high in each of them. Next, let's express these numbers more elegantly, in a way that captures their mathematical intent.
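For a single document, the raw count is usually normalized by the document's length so that long and short documents become comparable. The following is a minimal sketch of that normalization (not the book's exact listing), reusing the tokens and kite_counts variables from above; the resulting values depend on how many tokens your tokenizer and stop word list leave behind:

>>> total_terms = len(tokens)                   # tokens remaining after stop word removal
>>> tf = {term: count / total_terms for term, count in kite_counts.items()}
>>> tf['kite']                                  # 16 / total_terms -- 'kite' dominates this document
>>> tf['wing']                                  # 5 / total_terms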

