NLP basics

Blanc Media LTD

Extractive summarization methods work by selecting a subset of the source text: they extract phrases or sentences from an article and assemble them into a summary. LexRank and TextRank are well-known representatives of this approach; both use variations of Google's PageRank page-ranking algorithm.
2021-10-03

#Deep learning || #NLP || #Technology ||


Stemming is the process of reducing a word to its root, or stem.

It maps the different forms of a word (for example, "help", "helping", "helped", "helpful") to a common base form (here, "help") by stripping affixes (prefixes, suffixes, endings) and keeping only the stem.

The stem may or may not be an existing word in the language. For example, "mov" is the stem of "movie", and "emot" is the stem of "emotion".
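
The suffix stripping described above can be sketched in a few lines of Python. This is only a toy illustration, not a real stemming algorithm: the suffix list, the minimum stem length and the name toy_stem are assumptions made for the example, and production stemmers such as the Porter stemmer apply many more rules.

```python
# Toy suffix list, for illustration only (an assumption, not a real rule set).
SUFFIXES = ("ing", "ion", "ful", "ed", "s")
# Do not strip a suffix if fewer than this many letters would remain.
MIN_STEM_LEN = 3


def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LEN:
            return stem
    return word  # no rule applied


for w in ["help", "helping", "helped", "helpful", "emotion", "caring"]:
    print(w, "->", toy_stem(w))
```

Note that "caring" comes out as "car": the rule only looks at the letters, so nothing stops it from producing a different word or a non-word.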


Lemmatization is similar to stemming in that it reduces a word to a base form, but with one difference: the result (the lemma) is always a word that exists in the language. For example, "caring" is reduced to "care" rather than to "car", as it would be by stemming.

WordNet is a lexical database of words that exist in the English language. The NLTK lemmatizer, WordNetLemmatizer(), looks words up in WordNet.
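
The idea of checking candidates against a dictionary can be sketched without any external data. In the sketch below a tiny hand-written set of verbs stands in for WordNet; the names VOCABULARY, RULES and toy_lemmatize are hypothetical, and the rules are far simpler than what a real lemmatizer does.

```python
# Hypothetical mini-dictionary of base-form verbs, standing in for WordNet.
VOCABULARY = {"care", "help", "move", "walk"}

# (suffix to strip, text to append) pairs, tried in order.
RULES = (("ing", ""), ("ing", "e"), ("ed", ""), ("ed", "e"), ("s", ""))


def toy_lemmatize(word):
    """Return a base form only if it is a known dictionary word."""
    word = word.lower()
    if word in VOCABULARY:
        return word
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in VOCABULARY:
                return candidate
    return word  # unknown word: leave it unchanged


for w in ["caring", "helped", "moved", "walks"]:
    print(w, "->", toy_lemmatize(w))
```

Here "caring" becomes "care" because the candidate "car" is rejected by the dictionary lookup. With NLTK installed and the WordNet corpus downloaded, a call such as WordNetLemmatizer().lemmatize("caring", pos="v") plays the same role against the full WordNet database.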

N-grams are sequences of N words used together. N-grams with N = 1 are called unigrams; likewise, there are bigrams (N = 2), trigrams (N = 3), and so on.

N-grams are useful when we need to capture sequence information, for example which word most often follows a given word. Unigrams carry no sequence information, since each word is taken on its own.
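
A minimal sketch of both ideas, building N-grams and using bigrams to find the most frequent follower of a word, might look like this. The function names and the sample sentence are made up for the example, and tokenization is simplified to splitting on whitespace.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def most_common_follower(tokens, word):
    """Return the token that most often follows `word`, or None."""
    followers = Counter(b for a, b in ngrams(tokens, 2) if a == word)
    return followers.most_common(1)[0][0] if followers else None


tokens = "the cat sat on the mat and the cat slept".split()
print(ngrams(tokens, 1)[:3])   # unigrams: single words, no order information
print(ngrams(tokens, 2)[:3])   # bigrams: pairs of adjacent words
print(most_common_follower(tokens, "the"))
```

In the sample sentence "the" is followed by "cat" twice and by "mat" once, so the bigram counts pick "cat"; the unigram list alone could not answer that question.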