Written exam questions
- Zipf's law and its importance for NLP. Language processing in information retrieval: lemmatization, stemming, Boolean search, inverted indices, execution of Boolean queries on them, skip lists. (Sketch below.)
- Language processing in information retrieval: vector space model, cosine distance, TF-IDF. Common ways of representing texts for machine learning tasks. (Sketch below.)
- String distances and the algorithms for their computation: the Hamming distance, the Jaro-Winkler distance, the Levenshtein distance, the longest common subsequence, the Jaccard distance over character N-grams. Indices for typo detection and correction in words. (Sketch below.)
- Edit distances (definitions only). Regular expressions: basic constructions, recommendations for use.
- Markov chains. The ergodic theorem. PageRank and Markov chains. Direct applications in text analysis. (Sketch below.)
- Elements of information theory: self-information, the bit, pointwise mutual information, Kullback-Leibler divergence, Shannon entropy and its interpretations. Cross-entropy. An example application: collocation extraction. (Sketch below.)
- Language modeling. N-gram models. Perplexity. Why smoothing is needed. Additive (Laplace) smoothing. Interpolation and backoff. The ideas behind Kneser-Ney smoothing. (Sketch below.)
- Vector semantics: term-document matrices, term-context matrices, HAL. SVD, LSA, NMF. Methods for quality evaluation of vector semantics models. (Sketch below.)
- Vector semantics: what word2vec is (the core principles of the SGNS algorithm and its relationship with matrix factorization), word2vec as a neural network. Methods for quality evaluation of vector semantics models.
- Clustering: types of clustering algorithms. KMeans, agglomerative and divisive clustering (including ways of estimating the distances between clusters), DBSCAN. Limitations and areas of applicability of each algorithm. Methods for clustering quality evaluation and the shortcomings of each. (Sketch below.)
- Duplicate search: statement of the problem, description of the MinHash algorithm. Proof that the probability of a hash match equals the Jaccard similarity. (Sketch below.)
- Topic modeling. LSA, pLSA, LDA, ARTM. Advantages and disadvantages of each method. Topic modeling quality evaluation (perplexity, coherence, and methods involving human experts).
- Classification. Binary classification quality evaluation. Metric (distance-based) classification methods. Logical (rule-based) classification methods. Linear classification methods. (Sketch below.)
- Quality evaluation of machine learning models (why the data set is divided into three parts). Classification. Multi-class classification quality evaluation. The Naive Bayes classifier. Ensembles of machine learning models. (Sketch below.)
- Sequence tagging. PoS tagging. Named entity recognition. Hidden Markov models. Estimation of the probability of a sequence of states. Estimation of the probability of a sequence of observations. Quality evaluation. (Sketch below.)
- Sequence tagging. PoS tagging. Named entity recognition. Hidden Markov models. Decoding the most probable sequence of states (the Viterbi algorithm, without proof). Quality evaluation. (Sketch below.)
- Sequence tagging. PoS tagging. Named entity recognition. The structured perceptron. Structured perceptron training. Sequence tagging quality evaluation.
- Syntactic parsing. Approaches to describing syntax. Phrase structure grammar: the principles. Formal grammars. Chomsky normal form. The Cocke-Kasami-Younger algorithm and its complexity. Parsing quality evaluation. (Sketch below.)
- Syntactic parsing. Approaches to describing syntax. Phrase structure grammar: the principles. Probabilistic context-free grammars. The Cocke-Kasami-Younger algorithm for PCFGs (without proof) and its complexity. Parsing quality evaluation.
- Syntactic parsing. Approaches to describing syntax. Dependency grammar: core principles. Parsing quality evaluation. Transition-based dependency parsing: how it works. The algorithm (everything except the 'oracle'). (Sketch below.)
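
Illustrative sketches

The sketches below are optional study aids, not part of the question list. Each is a minimal, self-contained Python illustration of one topic above; all corpora, grammars, names, and parameters are made up for the example, and none of the code is taken from a particular library.

For the inverted-index question: building a toy index and answering a Boolean AND query by merging two sorted postings lists. Skip lists speed up exactly this merge by letting the pointers jump ahead.

```python
def build_inverted_index(docs):
    """Map each term to a sorted list of IDs of the documents containing it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index  # postings are sorted because doc IDs are visited in order

def intersect(p1, p2):
    """Merge two sorted postings lists in O(len(p1) + len(p2))."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

docs = ["the cat sat", "the dog sat", "the cat ran"]
index = build_inverted_index(docs)
print(intersect(index["the"], index["cat"]))  # -> [0, 2]
```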
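For the vector space model question: raw-count TF times log(N/df) IDF (one common weighting variant; libraries differ in normalization and smoothing) and cosine similarity between the resulting vectors.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Raw-count TF times log(N/df) IDF, over a shared sorted vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    N = len(docs)
    idf = {t: math.log(N / sum(t in doc for doc in tokenized)) for t in vocab}
    return [[Counter(doc)[t] * idf[t] for t in vocab] for doc in tokenized]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]
vecs = tf_idf_vectors(docs)
print(round(cosine(vecs[0], vecs[1]), 3))
```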
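For the string distances question: the classic dynamic-programming computation of the Levenshtein distance, kept to two rows of the DP table.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))          # distances from a[:0] to every b[:j]
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution or match
        prev = cur
    return prev[n]

print(levenshtein("kitten", "sitting"))  # -> 3
```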
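For the Markov chains question: PageRank as power iteration on the transition matrix of a small directed graph, with the standard damping and dangling-node fixes.

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10):
    """Power iteration for PageRank.

    adj[i][j] = 1 if page i links to page j. Dangling rows get uniform
    links (the standard fix); damping mixes in uniform teleportation.
    """
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    A[A.sum(axis=1) == 0] = 1.0              # dangling nodes link everywhere
    P = A / A.sum(axis=1, keepdims=True)     # row-stochastic transition matrix
    r = np.full(n, 1.0 / n)
    while True:
        r_next = damping * (r @ P) + (1 - damping) / n
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next

print(pagerank([[0, 1, 1],
                [0, 0, 1],
                [1, 0, 0]]).round(3))
```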
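For the information theory question: scoring adjacent word pairs by pointwise mutual information, the textbook baseline for collocation extraction.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Score adjacent pairs by PMI(x, y) = log2(p(x, y) / (p(x) * p(y)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for (x, y), c in bigrams.items():
        if c < min_count:   # rare pairs make PMI estimates unreliable
            continue
        scores[(x, y)] = math.log2(
            (c / n_bi) / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))
    return sorted(scores.items(), key=lambda kv: -kv[1])

tokens = "new york is big and new york is busy and york is old".split()
print(pmi_bigrams(tokens))  # 'new york' should outrank 'york is'
```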
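For the language modeling question: a bigram model with additive (Laplace) smoothing and the corresponding perplexity computation.

```python
import math
from collections import Counter

class LaplaceBigramLM:
    """Bigram language model with additive (Laplace) smoothing."""

    def __init__(self, sentences):
        self.unigrams, self.bigrams = Counter(), Counter()
        for s in sentences:
            toks = ["<s>"] + s.split() + ["</s>"]
            self.unigrams.update(toks)
            self.bigrams.update(zip(toks, toks[1:]))
        self.vocab = set(self.unigrams)

    def prob(self, prev, word):
        # add-one smoothing: (c(prev, word) + 1) / (c(prev) + |V|)
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + len(self.vocab))

    def perplexity(self, sentence):
        toks = ["<s>"] + sentence.split() + ["</s>"]
        log_p = sum(math.log2(self.prob(p, w)) for p, w in zip(toks, toks[1:]))
        return 2 ** (-log_p / (len(toks) - 1))

lm = LaplaceBigramLM(["the cat sat", "the dog sat", "a cat ran"])
print(round(lm.perplexity("the cat ran"), 2))
```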
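For the SVD/LSA question: a truncated SVD of a tiny made-up term-document matrix; the scaled left singular vectors serve as latent term embeddings.

```python
import numpy as np

# tiny made-up term-document matrix: rows = terms, columns = documents
terms = ["cat", "dog", "pet", "stock", "market"]
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 2, 1],
              [0, 0, 1, 2]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                          # keep the top-k latent dimensions ("topics")
term_vecs = U[:, :k] * s[:k]   # LSA term embeddings
doc_vecs = Vt[:k].T * s[:k]    # LSA document embeddings
print(np.round(term_vecs, 2))  # animal terms and finance terms separate cleanly
```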
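For the clustering question: plain Lloyd's algorithm for KMeans, written out to make the alternating assignment and update steps explicit.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: assign points to nearest centroid, recompute means."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assignment step: index of the closest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid moves to the mean of its cluster
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):   # converged
            break
        centroids = new
    return labels, centroids

X = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]], dtype=float)
labels, _ = kmeans(X, 2)
print(labels)  # two clear clusters
```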
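For the duplicate search question: a MinHash signature built from a simplified XOR-mask hash family (a stand-in for a proper min-wise independent family), compared against the exact Jaccard similarity it estimates.

```python
import random

def shingles(text, k=3):
    """Character k-grams of a string, as a set."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def make_hash_funcs(n, seed=0):
    """Simplified family: XOR the built-in hash with random 64-bit masks."""
    rng = random.Random(seed)
    return [lambda s, m=rng.getrandbits(64): hash(s) ^ m for _ in range(n)]

def signature(shingle_set, hash_funcs):
    """MinHash signature: per hash function, the minimum over the set."""
    return [min(h(s) for s in shingle_set) for h in hash_funcs]

def jaccard(a, b):
    return len(a & b) / len(a | b)

a = shingles("minhash estimates jaccard similarity")
b = shingles("minhash approximates jaccard similarity")
hs = make_hash_funcs(256)
sig_a, sig_b = signature(a, hs), signature(b, hs)
estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / len(hs)
print(round(jaccard(a, b), 3), round(estimate, 3))  # the two should be close
```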
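For the binary classification evaluation question: precision, recall, and F1 computed directly from the confusion-matrix counts.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, and F1 from raw TP/FP/FN counts (positive class = 1)."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(p == 1 and t == 0 for t, p in zip(y_true, y_pred))
    fn = sum(p == 0 and t == 1 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # -> (2/3, 2/3, 2/3)
```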
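For the Naive Bayes question: a multinomial Naive Bayes text classifier with Laplace smoothing, trained on a four-document toy corpus.

```python
import math
from collections import Counter

class MultinomialNB:
    """Multinomial Naive Bayes with Laplace smoothing for text classification."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.priors = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for doc, c in zip(docs, labels):
            self.counts[c].update(doc.lower().split())
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        return self

    def predict(self, doc):
        def log_post(c):   # log prior + sum of smoothed log likelihoods
            total = sum(self.counts[c].values())
            return self.priors[c] + sum(
                math.log((self.counts[c][w] + 1) / (total + len(self.vocab)))
                for w in doc.lower().split() if w in self.vocab)
        return max(self.classes, key=log_post)

nb = MultinomialNB().fit(["great movie", "awful film", "great fun", "boring film"],
                         ["pos", "neg", "pos", "neg"])
print(nb.predict("great fun movie"))  # -> 'pos'
```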
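For the HMM observation-probability question: the forward algorithm, which sums over all state paths; the toy PoS-style HMM is invented for the example.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Total probability of an observation sequence, summed over state paths."""
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for o in obs[1:]:
        alpha.append({s: sum(alpha[-1][r] * trans_p[r][s] for r in states)
                         * emit_p[s][o]
                      for s in states})
    return sum(alpha[-1].values())

states = ("Noun", "Verb")
start_p = {"Noun": 0.6, "Verb": 0.4}
trans_p = {"Noun": {"Noun": 0.3, "Verb": 0.7}, "Verb": {"Noun": 0.8, "Verb": 0.2}}
emit_p = {"Noun": {"fish": 0.7, "sleep": 0.3}, "Verb": {"fish": 0.4, "sleep": 0.6}}
print(forward(["fish", "sleep"], states, start_p, trans_p, emit_p))
```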
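For the decoding question: the Viterbi algorithm over the same toy HMM, replacing the forward algorithm's sum with a max and carrying the best path along.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden state sequence for an observation sequence."""
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            # best predecessor state r for landing in s at this step
            p, path = max((V[-1][r][0] * trans_p[r][s], V[-1][r][1]) for r in states)
            layer[s] = (p * emit_p[s][o], path + [s])
        V.append(layer)
    return max(V[-1].values())

# reusing the toy HMM from the forward-algorithm sketch above
states = ("Noun", "Verb")
start_p = {"Noun": 0.6, "Verb": 0.4}
trans_p = {"Noun": {"Noun": 0.3, "Verb": 0.7}, "Verb": {"Noun": 0.8, "Verb": 0.2}}
emit_p = {"Noun": {"fish": 0.7, "sleep": 0.3}, "Verb": {"fish": 0.4, "sleep": 0.6}}
prob, path = viterbi(["fish", "sleep"], states, start_p, trans_p, emit_p)
print(path, prob)
```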
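For the CKY question: a recognizer for a grammar in Chomsky normal form; the triple loop over spans, split points, and rules gives the O(n^3 * |G|) complexity the question asks about.

```python
def cky_recognize(words, grammar, start="S"):
    """CKY recognition for a CNF grammar.

    grammar: rules as (lhs, terminal_word) or (lhs, (B, C)) for binary rules.
    """
    n = len(words)
    # table[i][j] = set of nonterminals deriving words[i:j]
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        table[i][i + 1] = {lhs for lhs, rhs in grammar if rhs == w}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                  # split point
                for lhs, rhs in grammar:
                    if (isinstance(rhs, tuple)
                            and rhs[0] in table[i][k] and rhs[1] in table[k][j]):
                        table[i][j].add(lhs)
    return start in table[0][n]

grammar = [("S", ("NP", "VP")), ("NP", ("Det", "N")), ("VP", ("V", "NP")),
           ("Det", "the"), ("N", "cat"), ("N", "dog"), ("V", "chased")]
print(cky_recognize("the cat chased the dog".split(), grammar))  # -> True
```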
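For the transition-based parsing question: the arc-standard transition system (SHIFT, LEFT-ARC, RIGHT-ARC). Since the question excludes the 'oracle', the gold tree is used here as a stand-in to choose transitions.

```python
def arc_standard(n_words, gold_heads):
    """Arc-standard transitions, with the gold tree standing in for the oracle.

    gold_heads maps token index (1-based) to its head (0 = artificial root).
    Works for projective trees; returns the (head, dependent) arcs produced.
    """
    stack, buffer, arcs = [0], list(range(1, n_words + 1)), []
    # dependents each token still has to collect; RIGHT-ARC must wait for them
    pending = {i: list(gold_heads.values()).count(i) for i in range(n_words + 1)}
    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            top, below = stack[-1], stack[-2]
            if gold_heads.get(below) == top:                        # LEFT-ARC
                arcs.append((top, below)); stack.pop(-2); pending[top] -= 1
                continue
            if gold_heads.get(top) == below and pending[top] == 0:  # RIGHT-ARC
                arcs.append((below, top)); stack.pop(); pending[below] -= 1
                continue
        if not buffer:      # no action applies: the tree was non-projective
            break
        stack.append(buffer.pop(0))                                 # SHIFT
    return arcs

# "economic news had effect": economic->news, news->had, effect->had
print(arc_standard(4, {1: 2, 2: 3, 3: 0, 4: 3}))
# -> [(2, 1), (3, 2), (3, 4), (0, 3)]
```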