B024276 - DATA AND DOCUMENT MINING

Main information

Academic Year 2019-20

Cohort 2019 - Second Cycle Degree in Computer Engineering

Course year

First year - First Semester

Department

Information Engineering (DINFO)

Course Type

Single education field course

Scientific Area

ING-INF/05 - INFORMATION PROCESSING SYSTEMS

Credits

Teaching Hours

Teaching Term

23/09/2019 ⇒ 20/12/2019

Attendance required

Type of Evaluation

Final Grade

Lectureship

MARINAI SIMONE

Mutuality

Course taught as:
B024275 - DATA AND DOCUMENT MINING
Second Cycle Degree in COMPUTER ENGINEERING

Teaching Language

Teaching in class is in Italian. Homework, projects, and exams can be in English. All the textbooks are in English.

Course Content

Data Mining
Document Engineering
Document Image Analysis
Information Retrieval

Learning Objectives

The course first aims at introducing the main Data Mining techniques that allow you to model large amounts of data and extract useful information.
Secondly, we consider the problems that arise when extracting information from, and indexing, both textual and non-textual documents. To this end we introduce the main models and algorithms in Information Retrieval and describe techniques for information extraction from born-digital and digitized documents that are represented as images.

Prerequisites

It is essential to know the topics typically taught in the Databases and the Algorithms and Data Structures classes. Some knowledge of Artificial Intelligence can be useful.

Teaching Methods

Classes, homework and project.

Further information

Oral exams are usually held after completion of the assigned project.

Type of Assessment

Study and presentation of one research paper to the class (15%). Group project (2 people, 65%). Oral exam on a subset of the topics (20%).
Alternatively, it is possible to take an oral exam on all the topics and do a smaller project.

Course program

Data Mining
Data warehouse. Hardware. Disk organization. Access times.
Distributed file system and the new software stack
MapReduce, Word count, Matrix-Vector and Matrix Multiplication with MapReduce
The market-basket model. Association rules. Implementation details. Algorithms for computing frequent itemsets and association rules.
Improving Apriori: Hash-based filtering. Bloom filters. PCY algorithm, Random sampling, SON algorithm, Apriori with MapReduce.
Finding similar items. Curse of dimensionality. Distance measures.
Document similarity, shingling, min-hashing
Locality sensitive hashing (LSH)
Families of hash functions. LSH for cosine distance. LSH for Euclidean distance.
Clustering, Hierarchical clustering, k-means clustering. SOM clustering
BFR algorithm, CURE algorithm. Dimensionality reduction

Document Image Analysis and Recognition
DIAR: preprocessing
Object segmentation
Layout analysis: RLSA, Docstrum, Area Voronoi diagram, XY tree, MXY tree, Reading order detection, classification in layout analysis, page classification/retrieval.
OCR
Artificial Neural Networks. Perceptron, Backpropagation
Convolutional neural networks
Document Image Retrieval

Information Retrieval
Introduction to Information Retrieval. Boolean Retrieval
Term vocabulary and postings lists, Inverted files
Vector Space Model
Tokenization, stop-word removal, stemming
Index construction
Index compression
Processing boolean queries
Computing scores in a complete search system: efficient scoring and ranking, components of an information retrieval system, vector space scoring and query operator interaction.
Phrase queries
Wildcard queries
Spelling correction
Performance Evaluation in IR systems
Web mining
