Natural Language 2: Common Functions


Reference book: http://nltk.googlecode.com/svn/trunk/doc/book/

http://blog.csdn.net/tanzhangwen/article/details/8469491

An NLP enthusiast's blog

http://blog.csdn.net/tanzhangwen/article/category/1297154

1. Downloading data through a proxy

import nltk

nltk.set_proxy("**.com:80") # replace with your proxy's host and port

nltk.download()

2. Calling sents(fileid) fails with: Resource 'tokenizers/punkt/english.pickle' not found. Please use the NLTK Downloader to obtain the resource:

import nltk

nltk.download()

In the downloader window, select the 'Models' tab, find 'punkt' in the 'Identifier' column, and click Download to install that data package.
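The same package can also be fetched without the GUI by passing its identifier to nltk.download(). The sketch below is only an illustration: ensure_resource is a hypothetical helper (not part of NLTK) that checks with nltk.data.find() first and downloads only when the resource is missing. Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt', so use whichever identifier the error message names.

```python
import nltk


def ensure_resource(resource_path, package, download=nltk.download):
    """Return True if the resource is usable, downloading it first if missing.

    Hypothetical helper for illustration; not part of NLTK itself.
    """
    try:
        nltk.data.find(resource_path)  # raises LookupError when absent
        return True
    except LookupError:
        # nltk.download returns True on success, False otherwise.
        return bool(download(package, quiet=True))


# Typical call (needs network access the first time it runs):
# ensure_resource("tokenizers/punkt", "punkt")
```

The download argument exists only so the function can be exercised without network access; in normal use the default is enough.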

3. Corpus access functions

from nltk.corpus import webtext

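webtext.fileids() # ids of all files in the corpus

webtext.raw(fileid) # raw text of the given file, as one string

webtext.words(fileid) # all words (tokens) of the given file

webtext.sents(fileid) # all sentences of the given file, each a list of words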

Example	Description
`fileids()`	the files of the corpus
`fileids([categories])`	the files of the corpus corresponding to these categories
`categories()`	the categories of the corpus
`categories([fileids])`	the categories of the corpus corresponding to these files
`raw()`	the raw content of the corpus
`raw(fileids=[f1,f2,f3])`	the raw content of the specified files
`raw(categories=[c1,c2])`	the raw content of the specified categories
`words()`	the words of the whole corpus
`words(fileids=[f1,f2,f3])`	the words of the specified fileids
`words(categories=[c1,c2])`	the words of the specified categories
`sents()`	the sentences of the whole corpus
`sents(fileids=[f1,f2,f3])`	the sentences of the specified fileids
`sents(categories=[c1,c2])`	the sentences of the specified categories
`abspath(fileid)`	the location of the given file on disk
`encoding(fileid)`	the encoding of the file (if known)
`open(fileid)`	open a stream for reading the given corpus file
`root`	the path to the root of the locally installed corpus (a property in NLTK 3, not a method call)
`readme()`	the contents of the README file of the corpus
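The accessors in the table above can be tried without downloading any corpus. What follows is a minimal sketch, assuming a throwaway directory of made-up text files read through PlaintextCorpusReader. LineTokenizer is swapped in as the sentence tokenizer so that the Punkt model is not needed; treating each line as one sentence is a simplification, not what webtext does.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader
from nltk.tokenize import LineTokenizer

# Build a tiny throwaway corpus on disk (file names and contents are made up).
corpus_dir = tempfile.mkdtemp()
samples = {
    "a.txt": "Hello world.\nHello again.\n",
    "b.txt": "A second file.\n",
}
for name, content in samples.items():
    with open(os.path.join(corpus_dir, name), "w", encoding="utf-8") as fh:
        fh.write(content)

# LineTokenizer treats each line as a sentence, so no Punkt model is required.
corpus = PlaintextCorpusReader(
    corpus_dir, r".*\.txt", sent_tokenizer=LineTokenizer()
)

file_ids = corpus.fileids()                             # all file ids
raw_text = corpus.raw("a.txt")                          # file content as one string
words = list(corpus.words("a.txt"))                     # tokens of one file
sentences = [list(s) for s in corpus.sents("a.txt")]    # list of token lists

print(file_ids)
print(words)
print(sentences)
```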

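4. Common text-processing functions

Assume text is a list of words. The methods from collocations() onward belong to nltk.Text, so for those text must be an nltk.Text object, e.g. nltk.Text(words).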

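from nltk import FreqDist, bigrams

len(text) # number of words

set(text) # remove duplicates

sorted(text) # sort

text.count('a') # number of occurrences of the given word

text.index('a') # position of the first occurrence of the given word

FreqDist(text) # words and their frequencies; keys() gives the words, indexing the result with [key] gives that word's count

FreqDist(text).plot(50,cumulative=True) # cumulative frequency plot of the 50 most common words

list(bigrams(text)) # all adjacent word pairs; bigrams() returns a generator in NLTK 3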

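text.collocations() # frequent adjacent word pairs (collocations) in the text; needs the 'stopwords' corpus

text.concordance("word") # every occurrence of the given word, with its context

text.similar("word") # words that appear in contexts similar to those of the given word

text.common_contexts(["a", "b"]) # contexts shared by the two words

text.dispersion_plot(['a','b','c',...]) # plot comparing where each word occurs across the text

text.generate() # generate a random piece of text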
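To see several of these calls working together, here is a small sketch on a made-up token list (the sentence is invented for illustration, not taken from any corpus). It sticks to functions that need no downloaded data, so collocations(), the plots, and generate() are left out.

```python
from nltk import FreqDist, Text, bigrams

# A made-up token list standing in for a real corpus.
words = "the cat sat on the mat and the dog sat on the rug".split()

# Plain-list operations.
total = len(words)                # number of tokens
vocabulary = sorted(set(words))   # distinct words, alphabetically
n_the = words.count("the")        # occurrences of one word
first_sat = words.index("sat")    # position of the first occurrence

# Frequency distribution: behaves like a dict from word to count.
fdist = FreqDist(words)
top_two = fdist.most_common(2)

# bigrams() returns a generator in NLTK 3, so wrap it in list().
pairs = list(bigrams(words))

# Context-based helpers are methods of nltk.Text and print their results.
text = Text(words)
text.concordance("sat")                # each occurrence with its context
text.similar("cat")                    # words sharing contexts with "cat"
text.common_contexts(["cat", "dog"])   # contexts shared by both words
```

Note that concordance(), similar(), and common_contexts() print their findings rather than returning them, which is why their results are not assigned to variables here.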

NLTK's Conditional Frequency Distributions: commonly used methods and idioms for defining, accessing, and visualizing a conditional frequency distribution of counters.

Example	Description
`cfdist = ConditionalFreqDist(pairs)`	create a conditional frequency distribution from a list of pairs
`cfdist.conditions()`	list of the conditions (wrap in `sorted()` if alphabetical order is needed)
`cfdist[condition]`	the frequency distribution for this condition
`cfdist[condition][sample]`	frequency for the given sample for this condition
`cfdist.tabulate()`	tabulate the conditional frequency distribution
`cfdist.tabulate(samples, conditions)`	tabulation limited to the specified samples and conditions (pass them as the keyword arguments `samples=` and `conditions=`)
`cfdist.plot()`	graphical plot of the conditional frequency distribution
`cfdist.plot(samples, conditions)`	graphical plot limited to the specified samples and conditions (pass them as the keyword arguments `samples=` and `conditions=`)
`cfdist1 < cfdist2`	test if samples in `cfdist1` occur less frequently than in `cfdist2`
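The table above can be exercised on a handful of hand-written pairs. This is a minimal sketch: the (condition, sample) pairs are invented for illustration, with a genre name as the condition and a word as the sample.

```python
from nltk import ConditionalFreqDist

# Hypothetical (condition, sample) pairs: a genre paired with a word.
pairs = [
    ("news", "the"), ("news", "market"), ("news", "the"),
    ("romance", "the"), ("romance", "love"), ("romance", "love"),
]

cfdist = ConditionalFreqDist(pairs)

conditions = sorted(cfdist.conditions())      # ['news', 'romance']
news_dist = cfdist["news"]                    # a FreqDist for one condition
love_in_romance = cfdist["romance"]["love"]   # count of one sample

# Print a table restricted to chosen conditions and samples.
cfdist.tabulate(
    conditions=["news", "romance"],
    samples=["the", "love", "market"],
)
```

Each cfdist[condition] is an ordinary FreqDist, so everything that works on a FreqDist (counts, most_common(), N()) works per condition as well.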


Original article: https://www.cnblogs.com/webRobot/p/6058205.html