NLP course, ITMO University, Spring 2018

Slides

  1. Introduction + NLP for information retrieval
  2. String metrics, text representations, regular expressions
  3. Language models
  4. Language models, Markov chains
  5. Markov chains, Information Theory elements
  6. Text clustering
  7. Vector semantics-1
  8. Vector semantics-2
  9. Duplicate text detection: MinHash, LSH
  10. Topic modeling
  11. Text classification
  12. Sequence tagging
  13. Syntax (PSG)

Written exam questions

  1. Zipf's Law, its importance for NLP. Language processing in information retrieval: lemmatization, stemming, Boolean search, inverted indices, execution of Boolean queries on them, skip-lists.
  2. Language processing in information retrieval: vector space model, cosine distance, TF-IDF. Common ways of representing texts for machine learning tasks.
  3. String distances and the algorithms for their computation: the Hamming distance, the Jaro-Winkler distance, the Levenshtein distance, the longest common subsequence, the Jaccard distance for character N-grams. Indices for typo detection/correction in words.
  4. Edit distances (definitions only). Regular expressions: basic constructions, recommendations for use.
  5. Markov chains. Ergodic theorem. PageRank and Markov chains. Direct applications in text analysis.
  6. Elements of information theory: self-information, bit, pointwise mutual information, Kullback-Leibler divergence, Shannon entropy, its interpretations. Cross-entropy. Example of an application: collocation extraction.
  7. Language modeling. N-gram models. Perplexity. The reasons for doing smoothing. Additive (Laplace) smoothing. Interpolation and backoff. The ideas on which Kneser-Ney smoothing is based.
  8. Vector semantics: term-document matrices, term-context matrices, HAL. SVD, LSA, NMF. Methods for quality evaluation of vector semantics models.
  9. Vector semantics: what is word2vec (the core principles of the SGNS algorithm and its relationship with matrix factorization), word2vec as a neural network. Methods for quality evaluation of vector semantics models.
  10. Clustering: types of clustering algorithms. KMeans, agglomerative and divisive clustering (+ ways of estimating the distances between clusters), DBSCAN. Limitations and areas of applicability of all algorithms. Clustering quality evaluation methods and the shortcomings of each.
  11. Duplicates search: statement of the problem, description of the MinHash algorithm. "Permutations generation" in practice.
  12. Topic modeling. LSA, pLSA, LDA, ARTM. Advantages and disadvantages of each method. Topic modeling quality evaluation (perplexity, coherence and methods with experts involved).
  13. Classification. Binary classification quality evaluation. Metric classification methods. Logical methods of classification. Linear classification methods.
  14. Quality evaluation of machine learning models (why divide the data set into three parts). Classification. Multi-class classification quality evaluation. Naive Bayes Classifier. Ensembles of machine learning models.
  15. Sequence tagging. PoS tagging. Named entity recognition. Hidden Markov models. Estimation of the probability of a sequence of states. Estimation of the probability of a sequence of observations. Quality evaluation.
  16. Sequence tagging. PoS tagging. Named entity recognition. Hidden Markov models. Decoding of the most probable sequence of states. Quality evaluation.
  17. Syntax parsing. Syntax description approaches. Phrase structure grammar: the principles. Formal grammar. Chomsky Normal Form. Cocke-Kasami-Younger algorithm, its complexity. Parsing quality evaluation.
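For question 2, the TF-IDF weighting and cosine similarity from the vector space model can be sketched in a few lines. This is a minimal illustration (no sublinear TF scaling or normalization variants), not the exact scheme used in the lectures:

```python
import math
from collections import Counter

def tf_idf(docs):
    """One {term: weight} vector per tokenized document, weight = tf * log(N / df)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(d).items()} for d in docs]

def cosine(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Documents sharing no terms get similarity 0; a document compared with itself gets 1.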
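For question 3, the Levenshtein distance is the standard dynamic-programming exercise. A sketch keeping only two rows of the DP table:

```python
def levenshtein(a: str, b: str) -> int:
    # Minimum number of single-character insertions, deletions and
    # substitutions turning a into b; row i of the DP table only needs row i-1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution (free on match)
        prev = cur
    return prev[-1]

# levenshtein("kitten", "sitting") -> 3
```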
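For question 5, PageRank is power iteration on a Markov chain built from a link graph. A sketch that assumes every node has at least one outgoing link (real implementations also handle dangling nodes):

```python
def pagerank(links, damping=0.85, iters=50):
    """Power iteration on the random-surfer Markov chain.
    links: {node: [outgoing neighbours]}; assumes no dangling nodes."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Teleportation term plus damped redistribution along outgoing links.
        new = {u: (1.0 - damping) / n for u in nodes}
        for u, outs in links.items():
            share = rank[u] / len(outs)
            for v in outs:
                new[v] += damping * share
        rank = new
    return rank
```

The ranks stay a probability distribution (they sum to 1), i.e. the iteration converges toward the chain's stationary distribution.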
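For question 7, a bigram model with additive (Laplace) smoothing and its perplexity can be sketched as follows; this is a toy version without sentence boundary markers or held-out data:

```python
import math
from collections import Counter

def train_bigram(tokens):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = len(unigrams)
    # Laplace smoothing: pretend every bigram was seen one extra time.
    def prob(w1, w2):
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)
    return prob

def perplexity(prob, tokens):
    # 2 ** (average negative log2-probability per predicted token).
    logp = sum(math.log2(prob(w1, w2)) for w1, w2 in zip(tokens, tokens[1:]))
    return 2 ** (-logp / (len(tokens) - 1))
```

Thanks to smoothing, unseen bigrams get nonzero probability, so perplexity stays finite on new text.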
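For question 11, the "permutations generation" point is usually answered by noting that random permutations of the shingle universe are simulated with independent random hash functions h(x) = (a*x + b) mod p. A minimal sketch (parameter choices here, such as 64 hash functions and the Mersenne prime modulus, are illustrative):

```python
import random

def minhash_signature(shingles, num_hashes=64, seed=0):
    # Each (a, b) pair simulates one random permutation of the universe;
    # the signature keeps the minimum hash value per "permutation".
    rng = random.Random(seed)
    p = (1 << 61) - 1  # a large (Mersenne) prime modulus
    coeffs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [min((a * hash(s) + b) % p for s in shingles) for a, b in coeffs]

def estimated_jaccard(sig1, sig2):
    # Fraction of matching signature positions estimates Jaccard similarity.
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```

Within one process Python's `hash` is consistent, so signatures of different sets are comparable; the estimate concentrates around the true Jaccard similarity as the number of hash functions grows.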
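For questions 15 and 16, decoding the most probable state sequence of an HMM is the Viterbi algorithm. A sketch in log-space (all probabilities assumed nonzero); the state and emission names in the usage test below are a made-up toy weather model, not course data:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s]: log-probability of the best state path ending in state s
    # after emitting the first t+1 observations.
    V = [{s: math.log(start_p[s] * emit_p[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans_p[p][s]))
            row[s] = V[-1][prev] + math.log(trans_p[prev][s] * emit_p[s][o])
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    # Follow back-pointers from the best final state.
    state = max(states, key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

Complexity is O(T * |S|^2) for T observations and |S| states, versus the exponential cost of enumerating all state sequences.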