Abstract
In this notebook we will explore how to generate and analyse a multi-omic network comprising metabolites quantifications and gene expression. We will compare these networks against randomly generated networks, and compute different network metrics. At the end we will also perform a community analysis and functional characterization at the gene level.
Contents
## Preamble
import sklearn, leidenalg, itertools, random
import pandas as pd
import numpy as np
import igraph as ig
import sklearn.neighbors
import scipy as sp
from statsmodels.stats.multitest import multipletests
import gseapy as gp
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Objective
In this notebook you will learn how to build and analyse a network built and analysed from a gene-metabolite association analysis. Other mixed networks may also be similarly analyzed, differring only in whether and how you can apply the final functional analysis.
Data
As a test case we will be using the file met_genes.tsv which contains abundances for 125 metabolites and 1992 genes, for 24 samples.
Software
This notebook relies on python's igraph for most of the analyses. Most, if not all, functions used here can also be applied with R's igraph. Other packages exist for network analysis including networkx and graph-tool. Snap.py is also a good alternative for large networks.
We will build our network through an association analysis, but there are other methods to do this including Graphical Lasso or linear SVR.
You will use a gene expression dataset (RNAseq, expressed as TPM) from a disease group with 24 samples to keep analysis memory and time requirements reasonable for this lesson. It is assumed that all batch effects or other possible technical artifacts are not present, and that all data is ready for analysis. However, there are several important considerations in preprocessing your data:
This will depend on the type of that that you have and what you want to do with it, and will severely affect downstream results. It is thus important to carefully think about this.
data=pd.read_csv('data/met_genes.tsv', sep="\t", index_col=0)
data.head()
Type | p10 | p11 | p12 | p14 | p15 | p16 | p18 | p22 | p23 | ... | p37 | p38 | p4 | p40 | p41 | p46 | p48 | p5 | p8 | p9 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
feature | |||||||||||||||||||||
C0_accarnitines | met | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.103152 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.049757 |
C2_accarnitines | met | 0.007356 | 0.038454 | 0.011738 | 0.011923 | 0.015104 | 0.012317 | 0.021905 | 0.017777 | 0.039760 | ... | 0.000000 | 0.012542 | 0.018896 | 0.020212 | 0.015263 | 0.012510 | 0.009677 | 0.015757 | 0.010706 | 0.018615 |
C3_accarnitines | met | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000888 | 0.001116 | 0.001682 | 0.001024 | 0.002179 | ... | 0.000000 | 0.000827 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000995 |
C3DC_C4OH_accarnitines | met | 0.000000 | 0.000000 | 0.000455 | 0.001163 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.001065 | ... | 0.000000 | 0.000480 | 0.000000 | 0.000717 | 0.000000 | 0.000782 | 0.000398 | 0.000000 | 0.000000 | 0.000312 |
C5DC_C6OH_accarnitines | met | 0.001035 | 0.001479 | 0.000430 | 0.001527 | 0.000485 | 0.000792 | 0.000954 | 0.000995 | 0.001356 | ... | 0.005792 | 0.000400 | 0.000987 | 0.001255 | 0.000885 | 0.001710 | 0.000597 | 0.001218 | 0.000848 | 0.000507 |
5 rows × 25 columns
No duplicated features are present.
any(data.index.duplicated())
False
data.groupby('Type').agg('count')[['p10']] #1992 genes, 125 metabolites
p10 | |
---|---|
Type | |
genes | 1992 |
met | 125 |
data.shape #2117 features, 25 samples
(2117, 25)
A very quick view shows that several gene clusters are found, including two major groups. However, the analysis below does not perform any statistical filtering.
Our initial network analysis will be performed on the association analysis using Spearman correlations. Because this network has a big chance of producing false positives we will consider Bonferroni correction to control for familywise error, as well as FDR.
A very quick view shows that several gene clusters are found, including two major groups. However, the analysis below does not perform any statistical filtering.
values=data.loc[:,data.columns!='Type']
meta=data[['Type']]
values.isna().any().any() #we have no rows with NA
False
We will perform a gene-gene, gene-metabolite, and metabolite-metabolite association analysis by computing pairwise Spearman correlations. Choosing other non-parametric (Kendall or Spearman) vs parametric (Pearson) methods depends on your data. Here, because we have a small sample size, and want to save on computational time we choose Spearman.
The following takes a few minutes to run.
#Correlation and P val matrices
Rmatrix, Pmatrix= sp.stats.spearmanr(values.T)
Rmatrix=pd.DataFrame(Rmatrix, index=values.index.copy(), columns=values.index.copy())
#resulting R matrix
sns.clustermap(Rmatrix, cmap="RdBu_r", center=0);
/Users/rui.benfeitas/opt/miniconda3/envs/mofa2_nets/lib/python3.6/site-packages/seaborn/matrix.py:649: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance. warnings.warn(msg)