Exam questions
- Zipf's Law, its importance for NLP. Language processing in information retrieval: lemmatization,
stemming, Boolean search, inverted indices, execution of Boolean queries on them, skip-lists.
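A minimal sketch of an inverted index and conjunctive (AND) query execution by merging sorted posting lists; the documents are an invented toy collection, and skip-lists would accelerate exactly this intersection step:

```python
from collections import defaultdict

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}

# Build the inverted index: term -> sorted list of doc IDs (postings).
index = defaultdict(list)
for doc_id in sorted(docs):
    for term in set(docs[doc_id].split()):
        index[term].append(doc_id)

def intersect(p1, p2):
    """Merge-style intersection of two sorted posting lists, O(m + n)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1          # with skip pointers we could jump several steps here
        else:
            j += 1
    return out

print(intersect(index["home"], index["july"]))  # -> [2, 3]
```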
- Language processing in information retrieval: vector space model, cosine distance, TF-IDF.
Common ways of representing texts for machine learning tasks.
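A quick illustration of TF-IDF vectors and cosine similarity, using scikit-learn on an invented three-document corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs are pets"]

vec = TfidfVectorizer()            # tf-idf weighting with l2-normalized rows
X = vec.fit_transform(corpus)      # document-term matrix, shape (3, |V|)

# Cosine similarity between every pair of documents.
print(cosine_similarity(X).round(2))
```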
- Neural Networks: core principles, backpropagation, common optimizers, regularization techniques.
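A toy sketch of manual backpropagation with plain SGD, training a one-hidden-layer network on XOR in NumPy; the architecture and hyperparameters are chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)          # XOR targets

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
lr = 0.5

for step in range(2000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))                 # sigmoid output
    # Backward pass (chain rule); for binary cross-entropy, dL/dlogit = p - y.
    d_logit = (p - y) / len(X)
    dW2, db2 = h.T @ d_logit, d_logit.sum(0)
    d_h = d_logit @ W2.T * (1 - h ** 2)                  # tanh derivative
    dW1, db1 = X.T @ d_h, d_h.sum(0)
    # Plain SGD update; Adam or momentum would rescale these same gradients.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(p.round(2).ravel())   # approaches [0, 1, 1, 0]
```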
- String distances and the algorithms for their computation: the Hamming distance, the Jaro-Winkler distance,
the Levenshtein distance, the longest common subsequence, the Jaccard distance for character N-grams.
Indices for typo detection and correction in words.
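The Levenshtein distance via the classic dynamic program, kept to two rows of the DP table:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic O(len(a) * len(b)) dynamic program."""
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))   # 3
```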
- Markov chains. The ergodic theorem. PageRank and Markov chains. Direct applications in text analysis.
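A sketch of PageRank as the stationary distribution of an ergodic Markov chain, computed by power iteration on an invented four-page link graph:

```python
import numpy as np

# Column-stochastic link matrix of a made-up 4-page web:
# entry [i, j] is the probability of following a link from page j to page i.
P = np.array([[0,   0,   1, 0.5],
              [1/3, 0,   0, 0  ],
              [1/3, 0.5, 0, 0.5],
              [1/3, 0.5, 0, 0  ]])

d = 0.85                                   # damping factor
n = P.shape[0]
G = d * P + (1 - d) / n                    # Google matrix: an ergodic Markov chain

r = np.full(n, 1 / n)
for _ in range(100):                       # power iteration to the stationary dist.
    r = G @ r
print(r.round(3), r.sum())                 # PageRank vector, sums to 1
```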
- Elements of information theory: self-information, bit, pointwise mutual information, Kullback-Leibler divergence,
Shannon entropy, its interpretations. Cross-entropy. Example of an application: collocation extraction.
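A minimal example of PMI-based collocation extraction on a made-up toy corpus, with MLE probability estimates and no smoothing:

```python
import math
from collections import Counter

tokens = ("new york is a big city . i love new york . "
          "a big apple is a fruit .").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def pmi(w1, w2):
    """log2 of P(w1, w2) / (P(w1) P(w2)) under MLE estimates."""
    p_joint = bigrams[(w1, w2)] / (N - 1)
    return math.log2(p_joint / ((unigrams[w1] / N) * (unigrams[w2] / N)))

# Frequent bigrams with high PMI are collocation candidates.
print(round(pmi("new", "york"), 2), round(pmi("a", "big"), 2))
```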
- Language modeling. N-gram models. Perplexity. The reasons for doing smoothing. Additive (Laplace) smoothing.
Interpolation and backoff. The ideas on which the Kneser-Ney smoothing is based.
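A toy bigram model with add-one (Laplace) smoothing and perplexity, on an invented corpus; sentence boundaries and OOV handling are omitted for brevity:

```python
import math
from collections import Counter

train = "the cat sat . the dog sat . the cat ran .".split()
V = set(train)

unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))

def p_laplace(w, prev):
    """Add-one smoothed bigram probability P(w | prev)."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + len(V))

def perplexity(tokens):
    log_p = sum(math.log2(p_laplace(w, prev))
                for prev, w in zip(tokens, tokens[1:]))
    return 2 ** (-log_p / (len(tokens) - 1))

print(round(perplexity("the dog ran .".split()), 2))
```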
- Language modeling. Probabilistic Neural Language Model (2003). AWD-LSTM (2017). Perplexity.
- Vector semantics: term-document matrices, term-context matrices, HAL. SVD, LSA, NMF. Methods for quality evaluation
of vector semantics models.
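A sketch of LSA: factorize a document-term count matrix with truncated SVD, here via scikit-learn on an invented corpus:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["cats purr and meow", "dogs bark and fetch",
          "kittens meow softly", "puppies bark loudly"]

X = CountVectorizer().fit_transform(corpus)   # document-term count matrix
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = lsa.fit_transform(X)               # documents in a 2-dim latent space

print(doc_vecs.round(2))   # cat-docs and dog-docs separate along the components
```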
- Vector semantics: word2vec (the core principles of the SGNS algorithm and its relationship
with matrix factorization), word2vec as a neural network.
Methods for quality evaluation of vector semantics models.
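Levy and Goldberg (2014) showed that SGNS with k negative samples implicitly factorizes a word-context PMI matrix shifted by log k; a toy sketch of the explicit counterpart, SVD over the shifted positive PMI (SPPMI) matrix, with corpus and window size invented:

```python
import numpy as np
from collections import Counter

tokens = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}
k = 5                                         # plays the role of negative samples

# Symmetric window of 1: count word-context co-occurrences.
C = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(tokens):
    for j in (i - 1, i + 1):
        if 0 <= j < len(tokens):
            C[idx[w], idx[tokens[j]]] += 1

P = C / C.sum()
pw = P.sum(1, keepdims=True)                  # word marginals
pc = P.sum(0, keepdims=True)                  # context marginals
with np.errstate(divide="ignore"):
    pmi = np.log(P / (pw * pc))
sppmi = np.maximum(pmi - np.log(k), 0)        # shifted positive PMI

# Rank-2 SVD factorization: rows of W are the word embeddings.
U, S, Vt = np.linalg.svd(sppmi)
W = U[:, :2] * np.sqrt(S[:2])
print(dict(zip(vocab, W.round(2))))
```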
- Clustering: types of clustering algorithms. KMeans, agglomerative and divisive clustering
(+ ways of estimating the distances between clusters), DBSCAN.
Limitations and areas of applicability of all algorithms.
Methods for clustering quality evaluation and the shortcomings of each.
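A small demonstration of the applicability limits with scikit-learn: KMeans assumes convex clusters and fails on the two-moons data that density-based DBSCAN handles, while the silhouette score can still prefer the wrong partition:

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# KMeans assumes convex, roughly spherical clusters and splits the moons;
# DBSCAN recovers them. Silhouette rewards compact convex clusters, so it
# can still score the "wrong" KMeans partition higher.
print("kmeans:", silhouette_score(X, km.labels_).round(2))
print("dbscan:", silhouette_score(X, db.labels_).round(2))
```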
- Duplicate search: statement of the problem, description of the MinHash algorithm. The probability
that the MinHash values match equals the Jaccard similarity (with proof).
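A sketch of MinHash signatures over character 3-gram sets; the XOR-mask family over Python's built-in hash stands in for a proper universal hash family:

```python
import random

def char_ngrams(text, n=3):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingles, hashers):
    return [min(h(s) for s in shingles) for h in hashers]

random.seed(0)
masks = [random.getrandbits(64) for _ in range(200)]
# Python's str hash is salted per process but stable within one run,
# which is all this demo needs.
hashers = [lambda s, m=m: hash(s) ^ m for m in masks]   # 200 "hash functions"

a = char_ngrams("the quick brown fox jumps over the lazy dog")
b = char_ngrams("the quick brown fox jumped over a lazy dog")

sig_a = minhash_signature(a, hashers)
sig_b = minhash_signature(b, hashers)

# The fraction of matching minima estimates the Jaccard similarity of the sets.
est = sum(x == y for x, y in zip(sig_a, sig_b)) / len(masks)
true = len(a & b) / len(a | b)
print(round(est, 2), round(true, 2))
```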
- Topic modeling. LSA, pLSA, LDA, ARTM. Advantages and disadvantages of each method.
Topic modeling quality evaluation (perplexity, coherence, and methods involving human experts).
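A minimal LDA run with scikit-learn on an invented four-document corpus; the corpus is far too small for stable topics, but it shows the moving parts:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the striker scored a late goal",
          "the keeper saved the penalty",
          "the court upheld the appeal",
          "the judge dismissed the case"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(corpus)                      # bag-of-words counts

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for t, topic in enumerate(lda.components_):        # word weights per topic
    top = topic.argsort()[-3:][::-1]
    print(f"topic {t}:", [terms[i] for i in top])
```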
- Topic modeling + neural TM. pLSA, NTM, ABAE. Advantages and disadvantages of each method.
Topic modeling quality evaluation (perplexity, coherence, and methods involving human experts).
- Sequence tagging. PoS tagging. Named entity recognition. Hidden Markov models.
Estimation of the probability of a sequence of states.
Estimation of the probability of a sequence of observations.
Quality evaluation.
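The forward algorithm on a toy weather HMM (parameters invented): it computes the marginal probability of an observation sequence by summing over all state paths:

```python
import numpy as np

states = ["Rainy", "Sunny"]
pi = np.array([0.6, 0.4])                    # initial state distribution
A = np.array([[0.7, 0.3],                    # transition: P(state_t | state_{t-1})
              [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],               # emission: P(obs | state)
              [0.6, 0.3, 0.1]])
obs = [0, 1, 2]                              # e.g. walk, shop, clean

# Forward algorithm: alpha[s] = P(obs so far, current state = s).
alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]
print("P(observations) =", alpha.sum())      # likelihood of the sequence
```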
- Sequence tagging. PoS tagging. Named entity recognition. Hidden Markov models.
Decoding of the most probable sequence of states (Viterbi algorithm, without proof).
Quality evaluation.
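Viterbi decoding on the same toy HMM as above, in log space with backpointers:

```python
import numpy as np

pi = np.array([0.6, 0.4])                    # same invented HMM as above
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])
obs = [0, 1, 2]

# delta[s] = log prob of the best path ending in state s.
delta = np.log(pi) + np.log(B[:, obs[0]])
back = []
for o in obs[1:]:
    scores = delta[:, None] + np.log(A)      # scores[s, s'] for each transition
    back.append(scores.argmax(0))            # best predecessor of each state
    delta = scores.max(0) + np.log(B[:, o])

# Follow backpointers from the best final state.
path = [int(delta.argmax())]
for bp in reversed(back):
    path.append(int(bp[path[-1]]))
print(list(reversed(path)))                  # most probable state sequence
```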
- Sequence tagging. PoS tagging. Named entity recognition. Structured perceptron.
Structured perceptron training. Sequence tagging quality evaluation.
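A toy structured perceptron for tagging with emission and transition features; exhaustive search stands in for Viterbi decoding, and the two training sentences are invented:

```python
from itertools import product

train = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
         [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]]
tags = ["DET", "NOUN", "VERB"]
w = {}   # weights for features ('emit', word, tag) and ('trans', prev_tag, tag)

def feats(words, seq):
    prev, out = "<s>", []
    for word, tag in zip(words, seq):
        out += [("emit", word, tag), ("trans", prev, tag)]
        prev = tag
    return out

def score(words, seq):
    return sum(w.get(f, 0.0) for f in feats(words, seq))

def decode(words):
    """Exhaustive argmax over tag sequences (Viterbi in a real implementation)."""
    return max(product(tags, repeat=len(words)), key=lambda seq: score(words, seq))

for epoch in range(5):
    for sent in train:
        words, gold = zip(*sent)
        pred = decode(words)
        if pred != gold:                              # update only on mistakes:
            for f in feats(words, gold):
                w[f] = w.get(f, 0.0) + 1              # promote gold features
            for f in feats(words, pred):
                w[f] = w.get(f, 0.0) - 1              # demote predicted features

print(decode(("the", "dog", "sleeps")))               # ('DET', 'NOUN', 'VERB')
```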
- Neural sequence tagging. The simple RNN approach, bidirectional RNNs, biLSTM-CRF.
- Syntax parsing. Syntax description approaches.
Phrase structure grammar: the principles. Formal grammar.
Chomsky Normal Form. Cocke-Kasami-Younger algorithm, its complexity. Parsing quality evaluation.
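A CKY recognizer for a toy grammar in Chomsky Normal Form; the three nested loops over span length, start position, and split point give the overall O(n^3 * |G|) complexity:

```python
from collections import defaultdict

# A made-up grammar in CNF: rules are A -> B C or A -> terminal.
binary = {("NP", "VP"): {"S"}, ("DET", "N"): {"NP"}, ("V", "NP"): {"VP"}}
lexical = {"the": {"DET"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"}}

def cky(words):
    n = len(words)
    chart = defaultdict(set)                 # chart[i, j]: nonterminals over words[i:j]
    for i, word in enumerate(words):
        chart[i, i + 1] = set(lexical.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):        # split point
                for B in chart[i, k]:
                    for C in chart[k, j]:
                        chart[i, j] |= binary.get((B, C), set())
    return "S" in chart[0, n]

print(cky("the dog saw the cat".split()))   # True
```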
- Syntax parsing. Syntax description approaches. Phrase structure grammar: the principles.
Probabilistic context-free grammar. Cocke-Kasami-Younger algorithm for PCFG (without proof), its complexity.
Parsing quality evaluation.
- Syntax parsing. Syntax description approaches. Dependency grammar, core principles.
Parsing quality evaluation. Transition-based dependency parsing: how it works.
The algorithm (everything but the 'oracle').
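A sketch of the arc-standard transition system, replaying a hand-written transition sequence on an invented sentence, i.e. everything but the oracle that would choose the transitions:

```python
def arc_standard(words, transitions):
    """Replay a transition sequence; a trained oracle would choose each step."""
    stack, buffer, arcs = [], list(range(len(words))), []
    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))
        elif t == "LEFT-ARC":                 # top becomes head of second-from-top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif t == "RIGHT-ARC":                # second-from-top becomes head of top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return [(words[h], words[d]) for h, d in arcs]

words = ["she", "ate", "fish"]
seq = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC"]
print(arc_standard(words, seq))   # [('ate', 'she'), ('ate', 'fish')]
```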
- The encoder-decoder approach in NLP. OOV token processing. The Transformer architecture.
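A NumPy sketch of scaled dot-product self-attention, the core operation of the Transformer; random toy weights, and no multi-head split, masking, or positional encodings:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # pairwise query-key affinities
    return softmax(scores) @ V               # weighted average of the values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # 4 token embeddings, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(x @ Wq, x @ Wk, x @ Wv)      # one self-attention head
print(out.shape)                             # (4, 8)
```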
- Transfer learning. ELMo. BERT.