# Statistical methods in network science and data analysis

**Level:** Doctoral**Course Status: **Elective

## Course description

The increasing volume and nature of big datasets in business, economics, social and political sciences call for more complex and sophisticated data mining tools. The complex systems monitored by big databases are successfully described in terms of networks. In the present course we will present and discuss data mining tools used to characterize large empirical or model networks. Large data sets will be computationally investigated and the limits of the used algorithms will be discussed. The assessment of the statistical validity of the observed results will be analyzed and, when possible, quantitatively evaluated.

## Course schedule

Lecture Session title.

### 1 Introduction

Overview of the course. Basic concepts that will be treated in the course. Large data sets. Data mining approaches. Network science approach to large set of data. Computational tools.

### 2 Background on networks

Networks. Basic indicators of a network. Degree, betweennes, correlation, diameter. Random networks. Scale free networks. Small world. Core periphery networks.

### 3 Background on probability and statistics

Basic concepts of probability. Basic concepts of statistical inference. Methods of statistical inference.

### 4 Data mining basic concepts

Types of data. Different types of attributes. Asymmetric attributes. Dimensionality, sparsity and resolution. Data pre-processing. Measures of similarity and dissimilarity.

### 5 Network analysis with numerical tools

Basic introduction to R and RStudio. Assignment. Variables (objects). Vectors, matrices, arrays. List and data frames. Subsetting, indexing. Branching conditions and loops. Functions. Graphics. Packages. RStudio environment.

### 6 Classification in data mining

Classification and regression for predictive analysis. Classification model. Target function. Training set and test set. Decision trees. Sensitivity and specificity. Model overfitting.

### 7 Principal Component Analysis

Properties of the PCA. Selection of the most relevant principal components. PCA with covariance and correlation matrices. Selection of the principal components.

### 8 Random matrix theory and data mining

The concept of random matrix. Density function of eigenvalues. The Semi-circle law. Statistical uncertainty of a correlation matrix. The Marcenko-Pastur probability density function.

### 9 Cluster analysis

Similarity measures. Agglomerative and divisive clustering techniques. k-means clustering algorithm. Single linkage, average linkage and Complete linkage. Comparison of outputs of different methods.

### 10 Proximity based networks

Similarity based networks. Correlation networks. Partial correlation networks. Planar maximally graph networks.

### 11 Sampling and estimating in networks

Basic elements of statistical sampling theory. Induced and incident subgraph sampling. Star and snowball sampling. Estimation of totals network graphs

### 12 Estimating the degree distribution in sampled graphs

Mapping to the problems of species detection or population detection. The case of random and scale free graphs. The case of Internet. Estimating in subgraphs sampled by induced and incident subgraph sampling. Star and snowball sampling. Estimation of original network graphs.

### 13 Midterm test

### 14 Information networks

The World Wide Web. Information networks and associative memory. The Bow-Tie structure of the web.

### 15 Link analysis and web search

Searching the web. The ranking of web pages. Link analysis using hubs and authorities. PageRank. Link analysis in modern web search. Applications beyond the web.

### 16 Link prediction

Problem description and evaluation metrics. Local similarity indices. Global similarity indices. Applications of the problem of link prediction. Classification of partially labeled nodes.

### 17 Estimation of the weights of a network

Networks with information about marginal distributions of link weights. Maximum entropy estimation. Minimum density estimation.

### 18 Multiple hypothesis test correction and information filtering procedures

Total information awareness. Multiple hypothesis test corrections. The Bonferroni correction. The False Discovery Rate corrections. False positive, False Negative, True positive and True negative outcomes.

### 19 Association networks and statistically validated networks

Networks based on statistical tests. Statistically validated networks. Precision and accuracy of a statistically validated network. Statistically validated network in bipartite networks. Statistically validated networks in weighted networks.

### 20 Statistical aspects of community detection algorithms

Overview of community detection algorithms. Stochastic and deterministic algorithms. Indicators for the comparison of different partitions. Landscape of the fitness of a partition.

### 21 Motifs definition and detection. Comparison with a null hypothesis

Motifs in a directed graph. Basic motifs. Algorithms to find motifs. Optimality of the triangle-finding algorithm. Shuffling ,methods in directed networks.

### 22 Time dependent network models

Network evolution models. Nodal attribute models. Exponential random graph models. Stylized models of social network formation. Hysteresis in exponential random graphs and in other time dependent networks.

### 23 Network flow analysis: an example

Internet traffic analysis: a case study on two large backbone networks. Diagnosing volume anomalies. Traffic matrix estimation. Flow of funds data in economics. Input-output analysis and estimation. Transfer of fluctuations.

### 24 Projects' presentation

## Suggested reading

Eric D. Kolaczyk, Statistical Analysis of Network Data. Springer, 2009.

Anad Rajaraman, Jure Leskovec and Jeffrey D. Ullman, Mining of Massive Datasets, Book available online.

David Easley and Jon Kleinberg, Networks crowds and markets, Cambridge University Press 2010.

Jiawhei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques. Third edition (2012).

## Course E-Learning site

To be assigned

Office hours: Upon agreement

By successfully completing the course the students will be able to:

- Learn how to perform data analysis and use statistical methods in the investigation of networks.

- Evaluate the statistical reliability of empirical estimations against an appropriate null hypothesis.

- Learn how to make use of large sets of data for investigating networks observed in the fields of social sciences.

- Perform empirical analyses and statistical validation of large datasets obtained from the Internet or from other business and scientific sources.

(1) Assessment type 1 (30 % of the final grade). Students will get *home work *consisting of statistical analyses, simple problems or data processing, which they will have to complete individually and submit electronically.

(2) Assessment type 2 (30% of the final grade). The midterm test will be after week 10, consisting of questions related to the course.

(3) Assessment type 3 (40% of the final grade). In the final *project work* students will have to perform a research project involving the investigation of a large network individually. The investigation and characterization of the network might include one or more of the following aspects: (i) developing of a network model, (ii) running network simulations, (iii) comparing network metrics with null hypotheses, and (iii) critically analyzing results of a large scale empirical investigations. Students will have to prepare a project presentation and a written report.

Knowledge of a basic programming language.