idiffomix: integrated differential analysis of multi-omics data using a joint mixture model

Installation

To install this package, start R (version “4.3”) and enter:

library(plotly)
library(ggplot2)
library(foreach)
library(doParallel)

 # library(devtools)
 # install_github('koyelucd/idiffomix',force = TRUE)
 # library(idiffomix)

Introduction

Gene expression and DNA methylation are two interconnected biological processes and understanding their relationship is important in advancing understanding in diverse areas, including disease pathogenesis, environmental adaptation, developmental biology, and therapeutic responses. Differential analysis, including the identification of differentially methylated cytosine-guanine dinucleotide (CpG) sites (DMCs) and differentially expressed genes (DEGs) between two conditions, such as healthy and affected samples, can aid understanding of biological processes and disease progression. Typically, gene expression and DNA methylation data are analysed independently to identify DMCs and DEGs which are further analysed to explore relationships between them, or methylation data are aggregated to gene level for analysis. Such approaches ignore the inherent dependencies and biological structure within these related data.

The package idiffomix contains a joint mixture model is proposed that integrates information from the two data types at the modelling stage to capture their inherent dependency structure, enabling simultaneous identification of DMCs and DEGs. The model leverages a joint likelihood function that accounts for the nested structure in the data, with parameter estimation performed using an expectation-maximisation algorithm.

This document gives a quick tour of the functionalities in idiffomix. See help(package="idiffomix") for further details and references provided by citation("idiffomix").

Walk through

Prerequisites

Before starting the idiffomix walk through, the user should have a working R software environment installed on their machine. The idiffomix package has the following dependencies which, if not already installed on the machine will automatically be installed along with the package: foreach, doParallel, parallel, mclust, stats, utils, edgeR, magrittr, ggplot2, scales, tidyr, dplyr, reshape2, gridExtra, grid, tidyselect, cowplot.

Assuming that the user has the idiffomix package installed, the user first needs to load the package:

library(idiffomix)

Loading the data

The idiffomix package provides simulated gene expression data, methylation data and gene chromosome data. The data are simulated based on gene expression and methylation data of matched healthy and breast invasive carcinoma tumour samples from The Cancer Genome Atlas (TCGA) that are are publicly available in the Genomic Data Commons (GDC) data portal. The complete data are also available in the GitHub. Due to computational complexities of the package which needs parallel processing and package testing limitations, simulated data are used for this walk-through. The simulated data are assumed to be representing gene expression and methylation data from two biological conditions (benign and tumour) collected from N=4 patients. The genes and CpG sites are assumed to be located on 2 chromosomes (chromosome 1 and 2). Each gene has a number of CpG sites associated with it.

data(gene_expression_data)
data(methylation_data)
data(gene_chromosome_data)

Transforming the data

The RNA-Seq data consisted of raw counts depicting the gene expression levels. To ensure data quality, only genes whose sum of expression counts across both biological conditions  > 5 were retained. The data were normalized to account for differences in library sizes. The normalized count data were used to obtain CPM values which were further log-transformed to obtain log-CPM values. Given the paired design of the motivating setting, the log-fold changes between the tumour and benign samples were calculated for each gene in every patient and used in the subsequent analyses. For the methylation array data, the values at the CpG sites were logit transformed to M-values. Similar to the RNA-Seq data, given the paired design, the difference in M-values between tumour and benign samples were calculated for each CpG site in every patient and used in the subsequent analyses. The function data_transformation() filters and transforms the data and returns a list with two dataframes. The transformed gene expression data and the transformed methylation data are returned as dataframes.

N <- 4
data_transformed = data_transformation(seq_data=gene_expression_data,
                                       meth_data=methylation_data,
                                       gene_chr=gene_chromosome_data,
                                       N=N)

Applying the joint mixture model to identify DMCs and DEGs

A joint mixture model approach, termed idiffomix*, is proposed for the integrated differential analysis of gene expression data and methylation array data that accounts for their nested dependency structure. A key dependency matrix parameter is used in the joint mixture model to allow the methylation state of a CpG site to depend on the expression level status of the gene to which it is associated. The model parameters are estimated via an EM algorithm. In the E-step, as the expected values of the latent variables are intractable, approximate but tractable estimates are employed. Due to independence of chromosomes and to ease the computational burden, the model is fitted to each chromosome independently. When applying to real dataset the parallel_process argument should be set to TRUE for efficient execution. Gene expression levels are modeled to undergo three state changes between benign and tumor conditions: downregulated (E-, negative log-fold change), upregulated (E+, positive log-fold change), or non-differentially expressed (E0, log-fold change near zero). Similarly, methylation levels at CpG sites are classified as hypomethylated (M-, negative M-value difference), hypermethylated (M+, positive M-value difference), or non-differentially methylated (M0, M-value difference near zero). Under idiffomix, these changes are modeled using and component mixture models for gene expression and methylation, respectively.
The model parameters are: The function idiffomix() takes the transformed data as input and applies the joint mixture model to estimate the model parameters and the DMCs and DEGs.

N <- 4
data_output = idiffomix(seq_data=data_transformed$seq_transformed,
                        meth_data=data_transformed$meth_transformed,
                        gene_chr=gene_chromosome_data,
                        N=N, K=3, L=3, probs=c(0.25,0.75),
                        parallel_process = FALSE)

print(data_output$tau[[1]])
#> Cluster1 Cluster2 Cluster3 
#>      0.4      0.2      0.4
head(data_output$seq_classification)
#>            Gene  Patient1   Patient2   Patient3   Patient4 Final_Cluster
#> gene_1   gene_1  2.340567  1.3592557  0.2499695  2.0096230             3
#> gene_10 gene_10 -1.516838 -1.7266992 -2.0006156 -0.9142228             1
#> gene_2   gene_2 -1.810118 -2.4486927 -3.7578701 -3.6756716             1
#> gene_3   gene_3 -2.931101 -2.0544608 -0.6025684 -1.5649753             1
#> gene_4   gene_4  2.608504  1.9117714  0.7904971  3.3069637             3
#> gene_5   gene_5 -1.948577 -0.4773091 -2.1640730 -1.0853623             1

Plotting the conditional probabilities estimated for each chromosome

The conditional probabilities estimated when the joint mixture model is applied to data from a chromosome can be visualized. Panel A displays the probability of a gene in the chromosome belonging to each of the K clusters. Panel B details the estimated matrix pi of conditional probabilities of a CpG site’s cluster membership, given its gene’s cluster. Panel C details the conditional probabilities of a gene belonging to cluster k given a single CpG site associated with the gene belongs to cluster l, computed using Bayes’ theorem, given tau and pi.

plot(data_output,what="chromosome",CHR=1, Gene=NULL,K=3,L=3,
     gene_cluster_name=c( "E-","E0","E+"),
     cpg_cluster_name=c( "M-","M0","M+"))

#> TableGrob (2 x 1) "arrange": 2 grobs
#>   z     cells    name              grob
#> 1 1 (1-1,1-1) arrange   gtable[arrange]
#> 2 2 (2-2,1-1) arrange gtable[guide-box]

Plotting the estimated posterior probability of the gene belonging to the K clusters

The log-fold changes and differences in M-values and the estimated posterior probability of the gene belonging to the K clusters can be visualized. Panel A shows the log-fold change and difference in M-values for the given gene and its associated CpG sites while Panel B shows the posterior probabilities of cluster membership for the gene under idiffomix.

plot(data_output,what="gene",CHR=1, Gene="gene_1",K=3,L=3,
     gene_cluster_name=c( "E-","E0","E+"),
     cpg_cluster_name=c( "M-","M0","M+"))

#> TableGrob (2 x 1) "arrange": 2 grobs
#>   z     cells    name            grob
#> 1 1 (1-1,1-1) arrange gtable[arrange]
#> 2 2 (2-2,1-1) arrange gtable[arrange]