Exam questions
- Zipf's Law, its importance for NLP. Language processing in information retrieval: lemmatization,
stemming, Boolean search, inverted indices, execution of Boolean queries on them, skip-lists.
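A minimal sketch of an inverted index and conjunctive (AND) query execution by merging sorted posting lists; the documents are an invented toy collection, and skip-lists would accelerate exactly this intersection step:

```python
from collections import defaultdict

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}

# Build the inverted index: term -> sorted list of doc IDs (postings).
index = defaultdict(list)
for doc_id in sorted(docs):
    for term in set(docs[doc_id].split()):
        index[term].append(doc_id)

def intersect(p1, p2):
    """Merge-style intersection of two sorted posting lists, O(m + n)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1          # with skip pointers we could jump several steps here
        else:
            j += 1
    return out

print(intersect(index["home"], index["july"]))  # -> [2, 3]
```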
- Language processing in information retrieval: vector space model, cosine distance, TF-IDF.
Common ways of representing texts for machine learning tasks.
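A quick illustration of TF-IDF vectors and cosine similarity, using scikit-learn on an invented three-document corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs are pets"]

vec = TfidfVectorizer()            # tf-idf weighting with l2-normalized rows
X = vec.fit_transform(corpus)      # document-term matrix, shape (3, |V|)

# Cosine similarity between every pair of documents.
print(cosine_similarity(X).round(2))
```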
- Neural Networks: core principles, backpropagation, common optimizers, regularization techniques.
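A toy sketch of manual backpropagation with plain SGD, training a one-hidden-layer network on XOR in NumPy; the architecture and hyperparameters are chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)          # XOR targets

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
lr = 0.5

for step in range(2000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))                 # sigmoid output
    # Backward pass (chain rule); for binary cross-entropy, dL/dlogit = p - y.
    d_logit = (p - y) / len(X)
    dW2, db2 = h.T @ d_logit, d_logit.sum(0)
    d_h = d_logit @ W2.T * (1 - h ** 2)                  # tanh derivative
    dW1, db1 = X.T @ d_h, d_h.sum(0)
    # Plain SGD update; Adam or momentum would rescale these same gradients.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(p.round(2).ravel())   # approaches [0, 1, 1, 0]
```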
- String distances and the algorithms for their computation: the Hamming distance, the Jaro-Winkler distance,
the Levenshtein distance, the longest common subsequence, the Jaccard distance for character N-grams.
Indices for typo detection and correction in words.
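The Levenshtein distance via the classic dynamic program, kept to two rows of the DP table:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic O(len(a) * len(b)) dynamic program."""
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))   # 3
```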
- Markov chains. The ergodic theorem. PageRank and Markov chains. Direct applications in text analysis.
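A sketch of PageRank as the stationary distribution of an ergodic Markov chain, computed by power iteration on an invented four-page link graph:

```python
import numpy as np

# Column-stochastic link matrix of a made-up 4-page web:
# entry [i, j] is the probability of following a link from page j to page i.
P = np.array([[0,   0,   1, 0.5],
              [1/3, 0,   0, 0  ],
              [1/3, 0.5, 0, 0.5],
              [1/3, 0.5, 0, 0  ]])

d = 0.85                                   # damping factor
n = P.shape[0]
G = d * P + (1 - d) / n                    # Google matrix: an ergodic Markov chain

r = np.full(n, 1 / n)
for _ in range(100):                       # power iteration to the stationary dist.
    r = G @ r
print(r.round(3), r.sum())                 # PageRank vector, sums to 1
```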
- Elements of information theory: self-information, bit, pointwise mutual information, Kullback-Leibler divergence,
Shannon entropy, its interpretations. Cross-entropy. Example of an application: collocation extraction.
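A minimal example of PMI-based collocation extraction on a made-up toy corpus, with MLE probability estimates and no smoothing:

```python
import math
from collections import Counter

tokens = ("new york is a big city . i love new york . "
          "a big apple is a fruit .").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def pmi(w1, w2):
    """log2 of P(w1, w2) / (P(w1) P(w2)) under MLE estimates."""
    p_joint = bigrams[(w1, w2)] / (N - 1)
    return math.log2(p_joint / ((unigrams[w1] / N) * (unigrams[w2] / N)))

# Frequent bigrams with high PMI are collocation candidates.
print(round(pmi("new", "york"), 2), round(pmi("a", "big"), 2))
```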
- Language modeling. N-gram models. Perplexity. The reasons for doing smoothing. Additive (Laplace) smoothing.
Interpolation and backoff. The ideas on which the Kneser-Ney smoothing is based.
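A toy bigram model with add-one (Laplace) smoothing and perplexity, on an invented corpus; sentence boundaries and OOV handling are omitted for brevity:

```python
import math
from collections import Counter

train = "the cat sat . the dog sat . the cat ran .".split()
V = set(train)

unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))

def p_laplace(w, prev):
    """Add-one smoothed bigram probability P(w | prev)."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + len(V))

def perplexity(tokens):
    log_p = sum(math.log2(p_laplace(w, prev))
                for prev, w in zip(tokens, tokens[1:]))
    return 2 ** (-log_p / (len(tokens) - 1))

print(round(perplexity("the dog ran .".split()), 2))
```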
- Language modeling. Probabilistic Neural Language Model (2003). AWD-LSTM (2017). Perplexity.
- Vector semantics: term-document matrices, term-context matrices, HAL. SVD, LSA, NMF. Methods for quality evaluation
of vector semantics models.
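A sketch of LSA: factorize a document-term count matrix with truncated SVD, here via scikit-learn on an invented corpus:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["cats purr and meow", "dogs bark and fetch",
          "kittens meow softly", "puppies bark loudly"]

X = CountVectorizer().fit_transform(corpus)   # document-term count matrix
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = lsa.fit_transform(X)               # documents in a 2-dim latent space

print(doc_vecs.round(2))   # cat-docs and dog-docs separate along the components
```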
- Vector semantics: word2vec (the core principles of the SGNS algorithm and its relationship
with matrix factorization), word2vec as a neural network.
Methods for quality evaluation of vector semantics models.
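Levy and Goldberg (2014) showed that SGNS with k negative samples implicitly factorizes a word-context PMI matrix shifted by log k; a toy sketch of the explicit counterpart, SVD over the shifted positive PMI (SPPMI) matrix, with corpus and window size invented:

```python
import numpy as np
from collections import Counter

tokens = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}
k = 5                                         # plays the role of negative samples

# Symmetric window of 1: count word-context co-occurrences.
C = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(tokens):
    for j in (i - 1, i + 1):
        if 0 <= j < len(tokens):
            C[idx[w], idx[tokens[j]]] += 1

P = C / C.sum()
pw = P.sum(1, keepdims=True)                  # word marginals
pc = P.sum(0, keepdims=True)                  # context marginals
with np.errstate(divide="ignore"):
    pmi = np.log(P / (pw * pc))
sppmi = np.maximum(pmi - np.log(k), 0)        # shifted positive PMI

# Rank-2 SVD factorization: rows of W are the word embeddings.
U, S, Vt = np.linalg.svd(sppmi)
W = U[:, :2] * np.sqrt(S[:2])
print(dict(zip(vocab, W.round(2))))
```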
- Clustering: types of clustering algorithms. KMeans, agglomerative and divisive clustering
(+ ways of estimating the distances between clusters), DBSCAN.
Limitations and areas of applicability of all algorithms.
Methods for clustering quality evaluation and the shortcomings of each.
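A small demonstration of the applicability limits with scikit-learn: KMeans assumes convex clusters and fails on the two-moons data that density-based DBSCAN handles, while the silhouette score can still prefer the wrong partition:

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# KMeans assumes convex, roughly spherical clusters and splits the moons;
# DBSCAN recovers them. Silhouette rewards compact convex clusters, so it
# can still score the "wrong" KMeans partition higher.
print("kmeans:", silhouette_score(X, km.labels_).round(2))
print("dbscan:", silhouette_score(X, db.labels_).round(2))
```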
- Duplicate search: statement of the problem, description of the MinHash algorithm. The probability
that the MinHash values match equals the Jaccard similarity (with proof).
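A sketch of MinHash signatures over character 3-gram sets; the XOR-mask family over Python's built-in hash stands in for a proper universal hash family:

```python
import random

def char_ngrams(text, n=3):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingles, hashers):
    return [min(h(s) for s in shingles) for h in hashers]

random.seed(0)
masks = [random.getrandbits(64) for _ in range(200)]
# Python's str hash is salted per process but stable within one run,
# which is all this demo needs.
hashers = [lambda s, m=m: hash(s) ^ m for m in masks]   # 200 "hash functions"

a = char_ngrams("the quick brown fox jumps over the lazy dog")
b = char_ngrams("the quick brown fox jumped over a lazy dog")

sig_a = minhash_signature(a, hashers)
sig_b = minhash_signature(b, hashers)

# The fraction of matching minima estimates the Jaccard similarity of the sets.
est = sum(x == y for x, y in zip(sig_a, sig_b)) / len(masks)
true = len(a & b) / len(a | b)
print(round(est, 2), round(true, 2))
```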
- Topic modeling. LSA, pLSA, LDA, ARTM. Advantages and disadvantages of each method.
Topic modeling quality evaluation (perplexity, coherence, and methods involving human experts).
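A minimal LDA run with scikit-learn on an invented four-document corpus; the corpus is far too small for stable topics, but it shows the moving parts:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the striker scored a late goal",
          "the keeper saved the penalty",
          "the court upheld the appeal",
          "the judge dismissed the case"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(corpus)                      # bag-of-words counts

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for t, topic in enumerate(lda.components_):        # word weights per topic
    top = topic.argsort()[-3:][::-1]
    print(f"topic {t}:", [terms[i] for i in top])
```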
- Topic modeling + neural TM. pLSA, NTM, ABAE. Advantages and disadvantages of each method.
Topic modeling quality evaluation (perplexity, coherence, and methods involving human experts).
- Sequence tagging. PoS tagging. Named entity recognition. Hidden Markov models.
Estimation of the probability of a sequence of states.
Estimation of the probability of a sequence of observations.
Quality evaluation.
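The forward algorithm on a toy weather HMM (parameters invented): it computes the marginal probability of an observation sequence by summing over all state paths:

```python
import numpy as np

states = ["Rainy", "Sunny"]
pi = np.array([0.6, 0.4])                    # initial state distribution
A = np.array([[0.7, 0.3],                    # transition: P(state_t | state_{t-1})
              [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],               # emission: P(obs | state)
              [0.6, 0.3, 0.1]])
obs = [0, 1, 2]                              # e.g. walk, shop, clean

# Forward algorithm: alpha[s] = P(obs so far, current state = s).
alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]
print("P(observations) =", alpha.sum())      # likelihood of the sequence
```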
- Sequence tagging. PoS tagging. Named entity recognition. Hidden Markov models.
Decoding of the most probable sequence of states (Viterbi algorithm, without proof).
Quality evaluation.
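Viterbi decoding on the same toy HMM as above, in log space with backpointers:

```python
import numpy as np

pi = np.array([0.6, 0.4])                    # same invented HMM as above
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])
obs = [0, 1, 2]

# delta[s] = log prob of the best path ending in state s.
delta = np.log(pi) + np.log(B[:, obs[0]])
back = []
for o in obs[1:]:
    scores = delta[:, None] + np.log(A)      # scores[s, s'] for each transition
    back.append(scores.argmax(0))            # best predecessor of each state
    delta = scores.max(0) + np.log(B[:, o])

# Follow backpointers from the best final state.
path = [int(delta.argmax())]
for bp in reversed(back):
    path.append(int(bp[path[-1]]))
print(list(reversed(path)))                  # most probable state sequence
```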
- Sequence tagging. PoS tagging. Named entity recognition. Structured perceptron.
Structured perceptron training. Sequence tagging quality evaluation.
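A toy structured perceptron for tagging with emission and transition features; exhaustive search stands in for Viterbi decoding, and the two training sentences are invented:

```python
from itertools import product

train = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
         [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]]
tags = ["DET", "NOUN", "VERB"]
w = {}   # weights for features ('emit', word, tag) and ('trans', prev_tag, tag)

def feats(words, seq):
    prev, out = "<s>", []
    for word, tag in zip(words, seq):
        out += [("emit", word, tag), ("trans", prev, tag)]
        prev = tag
    return out

def score(words, seq):
    return sum(w.get(f, 0.0) for f in feats(words, seq))

def decode(words):
    """Exhaustive argmax over tag sequences (Viterbi in a real implementation)."""
    return max(product(tags, repeat=len(words)), key=lambda seq: score(words, seq))

for epoch in range(5):
    for sent in train:
        words, gold = zip(*sent)
        pred = decode(words)
        if pred != gold:                              # update only on mistakes:
            for f in feats(words, gold):
                w[f] = w.get(f, 0.0) + 1              # promote gold features
            for f in feats(words, pred):
                w[f] = w.get(f, 0.0) - 1              # demote predicted features

print(decode(("the", "dog", "sleeps")))               # ('DET', 'NOUN', 'VERB')
```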
- Neural sequence tagging. The simple RNN approach, bidirectional RNNs, biLSTM-CRF.
- Syntax parsing. Syntax description approaches.
Phrase structure grammar: the principles. Formal grammar.
Chomsky Normal Form. Cocke-Kasami-Younger algorithm, its complexity. Parsing quality evaluation.
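A CKY recognizer for a toy grammar in Chomsky Normal Form; the three nested loops over span length, start position, and split point give the overall O(n^3 * |G|) complexity:

```python
from collections import defaultdict

# A made-up grammar in CNF: rules are A -> B C or A -> terminal.
binary = {("NP", "VP"): {"S"}, ("DET", "N"): {"NP"}, ("V", "NP"): {"VP"}}
lexical = {"the": {"DET"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"}}

def cky(words):
    n = len(words)
    chart = defaultdict(set)                 # chart[i, j]: nonterminals over words[i:j]
    for i, word in enumerate(words):
        chart[i, i + 1] = set(lexical.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):        # split point
                for B in chart[i, k]:
                    for C in chart[k, j]:
                        chart[i, j] |= binary.get((B, C), set())
    return "S" in chart[0, n]

print(cky("the dog saw the cat".split()))   # True
```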
- Syntax parsing. Syntax description approaches. Phrase structure grammar: the principles.
Probabilistic context-free grammar. Cocke-Kasami-Younger algorithm for PCFG (without proof), its complexity.
Parsing quality evaluation.
- Syntax parsing. Syntax description approaches. Dependency grammar, core principles.
Parsing quality evaluation. Transition-based dependency parsing: how it works.
The algorithm (everything but the 'oracle').
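A sketch of the arc-standard transition system, replaying a hand-written transition sequence on an invented sentence, i.e. everything but the oracle that would choose the transitions:

```python
def arc_standard(words, transitions):
    """Replay a transition sequence; a trained oracle would choose each step."""
    stack, buffer, arcs = [], list(range(len(words))), []
    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))
        elif t == "LEFT-ARC":                 # top becomes head of second-from-top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif t == "RIGHT-ARC":                # second-from-top becomes head of top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return [(words[h], words[d]) for h, d in arcs]

words = ["she", "ate", "fish"]
seq = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC"]
print(arc_standard(words, seq))   # [('ate', 'she'), ('ate', 'fish')]
```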
- The encoder-decoder approach in NLP. OOV token processing. The Transformer architecture.
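A NumPy sketch of scaled dot-product self-attention, the core operation of the Transformer; random toy weights, and no multi-head split, masking, or positional encodings:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # pairwise query-key affinities
    return softmax(scores) @ V               # weighted average of the values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # 4 token embeddings, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(x @ Wq, x @ Wk, x @ Wv)      # one self-attention head
print(out.shape)                             # (4, 8)
```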
- Transfer learning. ELMo. BERT.