NLP: Math with Words (Part 4)


The lift that sustains the kite in flight is generated when air flows around the kite’s surface, producing low pressure above and high pressure below the wings. The interaction with the wind also generates horizontal drag along the direction of the wind. The resultant force vector from the lift and drag force components is opposed by the tension of one or more of the lines or tethers to which the kite is attached. The anchor point of the kite line may be static or moving (such as the towing of a kite by a running person, boat, free-falling anchors as in paragliders and fugitive parakites or vehicle).
The same principles of fluid flow apply in liquids and kites are also used under water. A hybrid tethered craft comprising both a lighter-than-air balloon as well as a kite lifting surface is called a kytoon.
Kites have a long and varied history and many different types are flown individually and at festivals worldwide. Kites may be flown for recreation, art or other practical uses. Sport kites can be flown in aerial ballet, sometimes as part of a competition. Power kites are multi-line steerable kites designed to generate large forces which can be used to power activities such as kite surfing, kite landboarding, kite fishing, kite buggying and a new trend snow kiting. Even Man-lifting kites have been made.
Source: Wikipedia
Then assign that text to a variable:
>>> from collections import Counter
>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> from nlpia.data.loaders import kite_text    # same as before: kite_text = "A kite is traditionally ..."
>>> tokens = tokenizer.tokenize(kite_text.lower())
>>> token_counts = Counter(tokens)
>>> token_counts
Counter({'the': 26, 'a': 20, 'kite': 16, ',': 15, ...})

NOTE
The TreebankWordTokenizer returns "kite." (with the trailing period) as a single token. The Treebank tokenizer assumes that your document has already been segmented into separate sentences, so it only strips punctuation from the very end of the string. Sentence segmentation is itself a tricky problem, and we cover it in chapter 11. Even so, the spaCy parser is both faster and more accurate, because it performs sentence segmentation and tokenization (plus a lot of other processing) in a single pass. So in production applications you may want to use spaCy rather than the NLTK components used in these simple examples.
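To see that trailing-punctuation behavior concretely, here is a minimal sketch (not part of the original listing): it feeds a two-sentence string to TreebankWordTokenizer directly, and then again after sentence-splitting with NLTK's sent_tokenize (which needs the "punkt" data). The spaCy lines at the end are optional and assume the en_core_web_sm model is installed; exact output may vary slightly between library versions:

>>> import nltk
>>> nltk.download('punkt', quiet=True)
True
>>> from nltk.tokenize import TreebankWordTokenizer, sent_tokenize
>>> text = "Kites fly high. A kite needs wind."
>>> TreebankWordTokenizer().tokenize(text)      # mid-string 'high.' keeps its period
['Kites', 'fly', 'high.', 'A', 'kite', 'needs', 'wind', '.']
>>> [TreebankWordTokenizer().tokenize(s) for s in sent_tokenize(text)]
[['Kites', 'fly', 'high', '.'], ['A', 'kite', 'needs', 'wind', '.']]
>>> import spacy                                # spaCy: sentences + tokens in one pass
>>> doc = spacy.load('en_core_web_sm')(text)
>>> [tok.text for tok in doc]
['Kites', 'fly', 'high', '.', 'A', 'kite', 'needs', 'wind', '.']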
Okay, back to our example. That output contains a lot of stop words. This Wikipedia article is not likely to be about "the", "a", the conjunction "and", or the other stop words, so let's get rid of them:
>>> import nltk
>>> nltk.download('stopwords', quiet=True)
True
>>> stopwords = nltk.corpus.stopwords.words('english')
>>> tokens = [x for x in tokens if x not in stopwords]
>>> kite_counts = Counter(tokens)
>>> kite_counts
Counter({'kite': 16,
         'traditionally': 1,
         'tethered': 2,
         'heavier-than-air': 1,
         'craft': 2,
         'wing': 5,
         'surfaces': 1,
         'react': 1,
         'air': 2,
         ...,
         'made': 1})

Purely by looking at how many times words occur in a document, you can learn something about it. The terms kite(s), wing, and lift are all important. And if you didn't know what this document was about, and you just happened across it in a vast, Google-like database of documents, you might "programmatically" infer that it has something to do with "flight" or "lift", or, in fact, with "kites".
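If you just want to eyeball the most prominent terms, Counter.most_common from the standard library returns the top of that ranking directly. This is a small convenience sketch reusing the kite_counts object from above; the ordering below "kite" depends on your tokenizer and stop word list:

>>> kite_counts.most_common(3)    # ranking beyond 'kite' may differ with other tokenizers
[('kite', 16), ('wing', 5), ...]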
Things get more interesting when you consider multiple documents in a corpus. Say you have a collection of documents, each about some aspect of kite flying. You would expect "string" and "wind" to occur frequently across all of those documents, so the term frequencies TF("string") and TF("wind") would be high in each of them. Next, let's express these numbers more elegantly, in a way that captures their mathematical intent.
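For a single document, the raw count is usually normalized by the document's length so that long and short documents become comparable. The following is a minimal sketch of that normalization (not the book's exact listing), reusing the tokens and kite_counts variables from above; the resulting values depend on how many tokens your tokenizer and stop word list leave behind:

>>> total_terms = len(tokens)                   # tokens remaining after stop word removal
>>> tf = {term: count / total_terms for term, count in kite_counts.items()}
>>> tf['kite']                                  # 16 / total_terms -- 'kite' dominates this document
>>> tf['wing']                                  # 5 / total_terms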

