A quick tour of traditional NLP
Corpora, Tokens, and Types
corpus : a text dataset
tokens : contiguous units of text, such as words and numeric sequences, separated by white-space characters or punctuation
Tokenizing text

import spacy
nlp = spacy.load('en_core_web_sm')  # the bare 'en' shortcut is deprecated in spaCy 3
text = "Mary, don't slap"
print([str(token) for token in nlp(text.lower())])  # e.g. ['mary', ',', 'do', "n't", 'slap']
Lemmas and Stems
Lemmas : Lemmas are root forms of words. Consider the verb fly. It can be inflected into
many different words (flew, flies, flown, flying, and so on), and fly is the
lemma for all of these seemingly different forms.
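As a minimal sketch of lemmatization with spaCy (assuming the en_core_web_sm model is installed; the exact lemmas depend on the model version):

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("he was running late")
for token in doc:
    # token.lemma_ holds the lemma, e.g. running --> run, was --> be
    print('{} --> {}'.format(token, token.lemma_))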
Stems : Stemming is the poor man's lemmatization. It involves the use of handcrafted
rules that strip the endings of words to reduce them to a common form called a stem.
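spaCy itself does not ship a stemmer, so a common choice for illustration is NLTK's Porter stemmer; a sketch (assuming NLTK is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['flew', 'flies', 'flown', 'flying']:
    # stems come from rule-based suffix stripping and need not be valid words,
    # e.g. flies --> fli
    print('{} --> {}'.format(word, stemmer.stem(word)))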
spaCy
spaCy is a natural language processing library whose main features include word tokenization, lemmatization, part-of-speech (POS) tagging, named entity recognition (NER), noun phrase extraction, and syntactic parsing.
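A short sketch touching several of these features at once (again assuming en_core_web_sm; the sentence is illustrative and attribute values vary by model):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Mary slapped the green witch in Boston.")

for token in doc:
    print(token.text, token.lemma_, token.pos_)       # token, lemma, part of speech
print([chunk.text for chunk in doc.noun_chunks])      # noun phrases
print([(ent.text, ent.label_) for ent in doc.ents])   # named entities, e.g. ('Boston', 'GPE')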