Title: | A Family of Beta Mixture Models for Clustering Beta-Valued DNA Methylation Data |
---|---|
Description: | A family of novel beta mixture models (BMMs) has been developed by Majumdar et al. (2022) <doi:10.48550/arXiv.2211.01938> to appositely model the beta-valued cytosine-guanine dinucleotide (CpG) sites, to objectively identify methylation state thresholds and to identify the differentially methylated CpG (DMC) sites using a model-based clustering approach. The family of beta mixture models employs different parameter constraints applicable to different study settings. The EM algorithm is used for parameter estimation, with a novel approximation during the M-step providing tractability and ensuring computational feasibility. |
Authors: | Koyel Majumdar [aut, cre] , Romina Silva [aut], Antoinette Sabrina Perry [aut], Ronald William Watson [aut], Andrea Rau [aut] , Florence Jaffrezic [aut], Thomas Brendan Murphy [aut] , Isobel Claire Gormley [aut] |
Maintainer: | Koyel Majumdar <[email protected]> |
License: | GPL-3 |
Version: | 1.0.4 |
Built: | 2024-11-23 05:50:58 UTC |
Source: | https://github.com/koyelucd/betaclust |
Function to find the level of similarities between the cumulative distributions estimated in each of the
clusters.
AUC_WD_metric(alpha, delta, K, R)
AUC_WD_metric(alpha, delta, K, R)
alpha |
The first shape parameter estimated for the beta mixture model. |
delta |
The second shape parameter estimated for the beta mixture model. |
K |
The number of clusters estimated. |
R |
The number of sample types in the dataset. |
Function to find the level of similarities between the cumulative distributions estimated in each of the
clusters.
The list with AUC and WD values.
Fit the K.. model from the betaclust
family of beta mixture models for DNA methylation data.
The K.. model analyses a single DNA sample type and identifies the thresholds between the different methylation states.
beta_k(data, M = 3, parallel_process = FALSE, seed = NULL)
beta_k(data, M = 3, parallel_process = FALSE, seed = NULL)
data |
A dataframe of dimension |
M |
Number of methylation states to be identified in a DNA sample type. |
parallel_process |
The "TRUE" option results in parallel processing of the models for increased computational efficiency. The default option has been set as "FALSE" due to package testing limitations. |
seed |
Seed to allow for reproducibility (default = NULL). |
The K.. model clusters each of the CpG sites into one of
methylation states, based on data from
patients for one DNA sample type (i.e.
).
As each CpG site can belong to any of the
methylation states (hypomethylated, hemimethylated or hypermethylated), the default value of
.
Under the K.. model the shape parameters of each cluster are constrained to be equal for each patient. The returned object from this function can be passed as an input parameter to the
threshold
function available in this package to calculate the thresholds between the methylation states.
A list containing:
cluster_size - The total number of CpG sites in each of the K clusters.
llk - A vector containing the log-likelihood value at each step of the EM algorithm.
alpha - The first shape parameter for the beta mixture model.
delta - The second shape parameter for the beta mixture model.
tau - The estimated mixing proportion for each cluster.
z - A matrix of dimension containing the posterior probability of each CpG site belonging to each of the
clusters.
classification - The classification corresponding to z, i.e. map(z).
uncertainty - The uncertainty of each CpG site's clustering.
my.seed <- 190 M <- 3 data_output <- beta_k(pca.methylation.data[1:30,2:5], M, parallel_process = FALSE, seed = my.seed) thresholds <- threshold(data_output, pca.methylation.data[1:30,2:5], "K..")
my.seed <- 190 M <- 3 data_output <- beta_k(pca.methylation.data[1:30,2:5], M, parallel_process = FALSE, seed = my.seed) thresholds <- threshold(data_output, pca.methylation.data[1:30,2:5], "K..")
Fit the KN. model from the betaclust
family of beta mixture models for DNA methylation data.
The KN. model analyses a single DNA sample type and identifies the thresholds between the different methylation states.
beta_kn(data, M = 3, parallel_process = FALSE, seed = NULL)
beta_kn(data, M = 3, parallel_process = FALSE, seed = NULL)
data |
A dataframe of dimension |
M |
Number of methylation states to be identified in a DNA sample type. |
parallel_process |
The "TRUE" option results in parallel processing of the models for increased computational efficiency. The default option has been set as "FALSE" due to package testing limitations. |
seed |
Seed to allow for reproducibility (default = NULL). |
The KN. model clusters each of the CpG sites into one of
methylation states, based on data from
patients for one DNA sample type (i.e.
).
As each CpG site can belong to any of the
methylation states (hypomethylated, hemimethylated or hypermethylated), the default value of
.
The KN. model differs from the K.. model as it is less parsimonious, allowing cluster and patient-specific shape parameters. The returned object can be passed as an input parameter to the
threshold
function available in this package to calculate the thresholds between the methylation states.
A list containing:
cluster_size - The total number of CpG sites in each of the K clusters.
llk - A vector containing the log-likelihood value at each step of the EM algorithm.
alpha - The first shape parameter for the beta mixture model.
delta - The second shape parameter for the mixture model.
tau - The estimated mixing proportion for each cluster.
z - A matrix of dimension containing the posterior probability of each CpG site belonging to each of the
clusters.
classification - The classification corresponding to z, i.e. map(z).
uncertainty - The uncertainty of each CpG site's clustering.
my.seed <- 190 M <- 3 data_output <- beta_kn(pca.methylation.data[1:30,2:5], M, parallel_process = FALSE, seed = my.seed) thresholds <- threshold(data_output, pca.methylation.data[1:30,2:5], "KN.")
my.seed <- 190 M <- 3 data_output <- beta_kn(pca.methylation.data[1:30,2:5], M, parallel_process = FALSE, seed = my.seed) thresholds <- threshold(data_output, pca.methylation.data[1:30,2:5], "KN.")
A beta mixture model for identifying differentially methylated CpG sites between DNA sample types collected from
patients.
beta_kr(data, M = 3, N, R, parallel_process = FALSE, seed = NULL)
beta_kr(data, M = 3, N, R, parallel_process = FALSE, seed = NULL)
data |
A dataframe of dimension |
M |
Number of methylation states to be identified. |
N |
Number of patients in the study. |
R |
Number of sample types collected from each patient for study. |
parallel_process |
The "TRUE" option results in parallel processing of the models for increased computational efficiency. The default option has been set as "FALSE" due to package testing limitations. |
seed |
Seed to allow for reproducibility (default = NULL). |
The K.R model allows identification of the differentially methylated CpG sites between the DNA sample types collected from each of
patients.
As each CpG site in a DNA sample can belong to one of
methylation states, there can be
methylation state changes between
DNA sample types.
The shape parameters vary for each DNA sample type but are constrained to be equal for each patient. An initial clustering using k-means is performed to identify
clusters. The resulting clustering solution is provided as
starting values to the Expectation-Maximisation algorithm. A digamma approximation is used to obtain the maximised
parameters in the M-step.
A list containing:
cluster_size - The total number of CpG sites in each of the K clusters.
llk - A vector containing the log-likelihood value at each step of the EM algorithm.
alpha - The first shape parameter for the beta mixture model.
delta - The second shape parameter for the beta mixture model.
tau - The estimated mixing proportion for each cluster.
z - A matrix of dimension containing the posterior probability of each CpG site belonging to each of the
clusters.
classification - The classification corresponding to z, i.e. map(z).
uncertainty - The uncertainty of each CpG site's clustering.
DM - The AUC and WD metric for distribution similarity in each cluster.
my.seed <- 190 M <- 3 N <- 4 R <- 2 data_output = beta_kr(pca.methylation.data[1:30,2:9], M, N, R, parallel_process = FALSE, seed = my.seed)
my.seed <- 190 M <- 3 N <- 4 R <- 2 data_output = beta_kr(pca.methylation.data[1:30,2:9], M, N, R, parallel_process = FALSE, seed = my.seed)
A family of model-based clustering techniques to identify methylation states in beta-valued DNA methylation data.
betaclust( data, M = 3, N, R, model_names = "K..", model_selection = "BIC", parallel_process = FALSE, seed = NULL )
betaclust( data, M = 3, N, R, model_names = "K..", model_selection = "BIC", parallel_process = FALSE, seed = NULL )
data |
A dataframe of dimension |
M |
Number of methylation states to be identified in a DNA sample type. |
N |
Number of patients in the study. |
R |
Number of sample types collected from each patient for the study. |
model_names |
Models to run from the set of models, K.., KN. and K.R, default = K.. . See details. |
model_selection |
Information criterion used for model selection. Options are AIC, BIC or ICL (default = BIC). |
parallel_process |
The "TRUE" option results in parallel processing of the models for increased computational efficiency. The default option has been set as "FALSE" due to package testing limitations. |
seed |
Seed to allow for reproducibility (default = NULL). |
This is a wrapper function which can be used to fit all three models (K.., KN., K.R) within a single function.
The K.. and KN. models are used to analyse a single DNA sample type () and cluster the
CpG sites into the
clusters which represent the different methylation states in a DNA sample type. As each CpG site can belong to any of the
methylation states (hypomethylation, hemimethylation and hypermethylation), the default value for
.
The thresholds between methylation states are objectively inferred from the clustering solution.
The K.R model is used to analyse independent sample types collected from
patients, where each sample contains
CpG sites, and cluster
the dataset into
clusters to identify the differentially methylated CpG (DMC) sites between the
DNA sample types.
The function returns an object of the betaclust
class which contains the following values:
information_criterion - The information criterion used to select the optimal model.
ic_output - The information criterion value calculated for each model.
optimal_model - The model selected as optimal.
function_call - The parameters passed as arguments to the function betaclust
.
K - The number of clusters identified using the beta mixture models.
C - The number of CpG sites analysed using the beta mixture models.
N - The number of patients analysed using the beta mixture models.
R - The number of sample types analysed using the beta mixture models.
optimal_model_results - Information from the optimal model. Specifically,
cluster_size - The total number of CpG sites in each of the K clusters.
llk - A vector containing the log-likelihood value at each step of the EM algorithm.
alpha - This contains the first shape parameter for the beta mixture model.
delta - This contains the second shape parameter for the beta mixture model.
tau - The proportion of CpG sites in each cluster.
z - A matrix of dimension containing the posterior probability of each CpG site belonging to each of the
clusters.
classification - The classification corresponding to z, i.e. map(z).
uncertainty - The uncertainty of each CpG site's clustering.
thresholds - Threshold points calculated under the K.. or the KN. model.
DM - The AUC and WD metric for distribution similarity in each cluster.
Silva, R., Moran, B., Russell, N.M., Fahey, C., Vlajnic, T., Manecksha, R.P., Finn, S.P., Brennan, D.J., Gallagher, W.M., Perry, A.S.: Evaluating liquid biopsies for methylomic profiling of prostate cancer. Epigenetics 15(6-7), 715-727 (2020). doi:10.1080/15592294.2020.1712876.
Majumdar, K., Silva, R., Perry, A.S., Watson, R.W., Murphy, T.B., Gormley, I.C.: betaclust: a family of mixture models for beta valued DNA methylation data. arXiv [stat.ME] (2022). doi:10.48550/ARXIV.2211.01938.
my.seed <- 190 M <- 3 N <- 4 R <- 2 data_output <- betaclust(pca.methylation.data[1:30,2:9], M, N, R, model_names = c("K..","KN.","K.R"), model_selection = "BIC", parallel_process = FALSE, seed = my.seed)
my.seed <- 190 M <- 3 N <- 4 R <- 2 data_output <- betaclust(pca.methylation.data[1:30,2:9], M, N, R, model_names = c("K..","KN.","K.R"), model_selection = "BIC", parallel_process = FALSE, seed = my.seed)
A function to identify the most differentially
methylated clusters from clusters.
DMC_identification( object, data, CpG_site_list, threshold = 0.65, metric = "AUC" )
DMC_identification( object, data, CpG_site_list, threshold = 0.65, metric = "AUC" )
object |
A betaclust object |
data |
A dataframe of dimension |
CpG_site_list |
The IlmnID of all the CpG sites analysed by
|
threshold |
The threshold value/s for selecting the most differentially methylated clusters, default= 0.65 |
metric |
The metric (AUC or WD selected). default="AUC" |
This function selects the most diffentially methylated clusters based on AUC and WD metric calculated and returns the CpG sites belonging to those clusters.
The function returns a dataframe of CpG sites and methylation values identified to belong to the most differentially methylated clusters
my.seed <- 190 M <- 3 N <- 4 R <- 2 data_output <- betaclust(pca.methylation.data[1:30,2:9], M, N, R, model_names = "K.R", parallel_process = FALSE, seed = my.seed) dmc_df <-DMC_identification(data_output,pca.methylation.data[1:30,2:9], pca.methylation.data[1:30,1], threshold = 0.65, metric = "AUC")
my.seed <- 190 M <- 3 N <- 4 R <- 2 data_output <- betaclust(pca.methylation.data[1:30,2:9], M, N, R, model_names = "K.R", parallel_process = FALSE, seed = my.seed) dmc_df <-DMC_identification(data_output,pca.methylation.data[1:30,2:9], pca.methylation.data[1:30,1], threshold = 0.65, metric = "AUC")
An empirical cumulative distribution function (ECDF) plot for a betaclust
object.
ecdf.betaclust(x, R = 2, sample_name = NULL, title = NULL)
ecdf.betaclust(x, R = 2, sample_name = NULL, title = NULL)
x |
A dataframe containing methylation values of identified differentially methylated regions related to a gene. Samples are grouped together in the dataframe such that the columns are ordered as Sample1_Patient1, Sample1_Patient2, Sample2_Patient1, Sample2_Patient2, etc. |
R |
Number of tissue sample types from which DNA methylation data are collected (default R = 2). |
sample_name |
The order in which the sample types are grouped in the dataframe. If no value is specified then default values of sample names, e.g. Sample 1, Sample 2, etc are used (default = NULL). |
title |
The title that the user wants to display on the graph. The default is "NULL". |
This function plots the ECDF of the differentially methylated CpG sites identified using the K.R model for all patient samples. The plot visualises the methylation state changes between the different DNA samples for each patient.
The ECDF plot for the selected CpG sites for all patients and their DNA sample types.
Compute the AIC value for the optimal model.
em_aic(llk, C, M, N, R, model_name = "K..")
em_aic(llk, C, M, N, R, model_name = "K..")
llk |
Log-likelihood value. |
C |
Number of CpG sites. |
M |
Number of methylation states identified in a DNA sample. |
N |
Number of patients. |
R |
Number of DNA sample types collected from each patient. |
model_name |
Fitted mixture model. Options are "K..", "KN." and/or "K.R" (default = "K.."). |
Computes the AIC for a specified model given the log-likelihood, the dimension of the data, and the model names.
The AIC value for the selected model.
Compute the BIC value for the optimal model.
em_bic(llk, C, M, N, R, model_name = "K..")
em_bic(llk, C, M, N, R, model_name = "K..")
llk |
Log-likelihood value. |
C |
Number of CpG sites. |
M |
Number of methylation states identified in a DNA sample. |
N |
Number of patients. |
R |
Number of DNA sample types collected from each patient. |
model_name |
Fitted mixture model. Options are "K..", "KN." and/or "K.R" (default = "K.."). |
Computes the BIC for a specified model given the log-likelihood, the dimension of the data, and the model names.
The BIC value for the selected model.
Compute the ICL value for the optimal model.
em_icl(llk, C, M, N, R, model_name = "K..", z)
em_icl(llk, C, M, N, R, model_name = "K..", z)
llk |
Log-likelihood value. |
C |
Number of CpG sites. |
M |
Number of methylation states identified in a DNA sample. |
N |
Number of patients. |
R |
Number of DNA sample types collected from each patient. |
model_name |
Fitted mixture model. Options are "K..", "KN." and/or "K.R" (default = "K.."). |
z |
A matrix of posterior probabilities of cluster membership, stored as z in the object from |
Computes the ICL for a specified model given the log-likelihood, the dimension of the data, and the model names. This criterion penalises the BIC by including an entropy term favouring well separated clusters.
The ICL value for the selected model.
A dataset containing a subset of the manifest data from the Illumina MethylationEPIC beadchip array. A subset of the complete dataset has been uploaded in the package for testing purpose. The complete dataset is available on GitHub.
data(legacy.data)
data(legacy.data)
A data frame with 5,080 rows and 6 columns.
IlmnID: The unique identifier from the Illumina CG database, i.e. the probe ID.
Genome_Build: The genome build referenced by the Infinium MethylationEPIC manifest.
CHR: The chromosome containing the CpG (Genome_Build = 37).
MAPINFO: The chromosomal coordinates of the CpG sites.
UCSC_RefGene_Name: The target gene name(s), from the UCSC database. Note: multiple listings of the same gene name indicate splice variants.
UCSC_CpG_Islands_Name: The chromosomal coordinates of the CpG Island from UCSC.
A dataset containing pre-processed beta methylation values from sample types, collected from
patients with prostate cancer.
data(pca.methylation.data)
data(pca.methylation.data)
A data frame with 5,067 rows and 9 columns. The data contain no missing values.
IlmnID: The unique identifier from the Illumina CG database, i.e. the probe ID.
Benign_Patient_1: Methylation values from benign prostate tissue from patient 1.
Benign_Patient_2: Methylation values from benign prostate tissue from patient 2.
Benign_Patient_3: Methylation values from benign prostate tissue from patient 3.
Benign_Patient_4: Methylation values from benign prostate tissue from patient 4.
Tumour_Patient_1: Methylation values from tumor prostate tissue from patient 1.
Tumour_Patient_2: Methylation values from tumor prostate tissue from patient 2.
Tumour_Patient_3: Methylation values from tumor prostate tissue from patient 3.
Tumour_Patient_4: Methylation values from tumor prostate tissue from patient 4.
The raw methylation array data was first quality controlled and preprocessed using the RnBeads package. The array data were then normalized and and probes located outside of CpG sites and on the sex chromosome were filtered out. The CpG sites with missing values were removed from the resulting dataset. A subset of the complete dataset has been uploaded in the package for testing purposes. The complete dataset is available on GitHub.
Mueller F, Scherer M, Assenov Y, Lutsik P, Walter J, Lengauer T, Bock C (2019). “RnBeads 2.0: comprehensive analysis of DNA methylation data.” Genome Biology, 20(55).
Assenov Y, Mueller F, Lutsik P, Walter J, Lengauer T, Bock C (2014). “Comprehensive Analysis of DNA Methylation Data with RnBeads.” Nature Methods, 11(11), 1138–1140.
Visualise a betaclust
clustering solution by plotting the fitted and kernel density estimates, the uncertainty and the information criterion.
## S3 method for class 'betaclust' plot( x, what = "fitted density", plot_type = "ggplot", data = NULL, sample_name = NULL, title = NULL, patient_number = 1, threshold = FALSE, scale_param = "free_y", ... )
## S3 method for class 'betaclust' plot( x, what = "fitted density", plot_type = "ggplot", data = NULL, sample_name = NULL, title = NULL, patient_number = 1, threshold = FALSE, scale_param = "free_y", ... )
x |
A |
what |
The different plots that can be obtained are either "fitted density","kernel density","uncertainty" or "information criterion" (default = "fitted density"). |
plot_type |
The plot type to be displayed are either "ggplot" or "plotly" (default = "ggplot"). |
data |
A dataframe of dimension |
sample_name |
The names of DNA sample types in the dataset analysed by the K.R model. If no value is passed then default values of sample names, e.g. Sample 1, Sample 2, etc are used as legend text (default = NULL). |
title |
The title that the user wants to display. If no title is to be displayed the default is "NULL". |
patient_number |
The column number representing the patient in the patient-wise ordered dataset selected for visualizing the clustering solution of the K.. or KN. model (default = 1). |
threshold |
The "TRUE" option displays the threshold points in the graph for the K.. and the KN. model (default = "FALSE"). |
scale_param |
The position scales can be fixed or allowed to vary between different panels generated for the density estimate plots for visualizing the K.R clustering solution. Options are "fixed", "free_y","free_x" or "free" (default = "free_y"). The option "fixed" results in the x and y scales being fixed across all panels, "free" varies the x and y scales across the panels, "free_x" fixes the y scale and lets the x scale vary across all panels and "free_y" fixes the x scale and lets the y scale vary across all panels. |
... |
Other graphics parameters. |
The fitted density estimates can be visualized under the optimal clustering solution by specifying what = "fitted density" and kernel density estimates under the optimal clustering solution by specifying what = "kernel density".
The threshold inferred can be visualized by specifying threshold = TRUE. The KN. model calculates different pairs of threshold points for each patient as the shape parameters are allowed to vary for each patient. So the patient for whom the threshold needs to be displayed can be specified by inputting the column number representing the patient in the patient-wise ordered dataset in the parameter patient_number.
Interactive plots can also be produced using plot_type = "plotly". The uncertainty in the clustering solution can be plotted using what = "uncertainty". The information criterion values for all fitted models can be plotted using what = "information criterion".
This function displays the following plots as requested by the user:
fitted density estimates - Plot showing the fitted density estimates of the clustering solution under the optimal model selected.
kernel density estimates - Plot showing the kernel density estimates of the clustering solution under the optimal model selected.
uncertainty - A boxplot showing the uncertainties in the optimal clustering solution.
information criterion - Plot showing the information criterion values for all models fitted to support the selection of the optimal model.
my.seed <- 190 M <- 3 N <- 4 R <- 2 data_output <- betaclust(pca.methylation.data[1:100,2:9], M, N, R, model_names = c("K..","KN.","K.R"), model_selection = "BIC", parallel_process = FALSE, seed = my.seed) plot(data_output, what = "fitted density", plot_type = "ggplot", data=pca.methylation.data[1:100,2:9]) plot(data_output, what = "uncertainty", plot_type = "ggplot") plot(data_output, what = "information criterion", plot_type = "ggplot")
my.seed <- 190 M <- 3 N <- 4 R <- 2 data_output <- betaclust(pca.methylation.data[1:100,2:9], M, N, R, model_names = c("K..","KN.","K.R"), model_selection = "BIC", parallel_process = FALSE, seed = my.seed) plot(data_output, what = "fitted density", plot_type = "ggplot", data=pca.methylation.data[1:100,2:9]) plot(data_output, what = "uncertainty", plot_type = "ggplot") plot(data_output, what = "information criterion", plot_type = "ggplot")
Summary method for a betaclust
object containing the results under the optimal model selected.
## S3 method for class 'betaclust' summary(object, ...)
## S3 method for class 'betaclust' summary(object, ...)
object |
A |
... |
Further arguments passed to or from other methods. |
An object of class summary.betaclust
which contains the following list:
C - The number of CpG sites analysed using the beta mixture models.
N - The number of patients analysed using the beta mixture models.
R - The number of sample types analysed using the beta mixture models.
K - The number of methylation states in R DNA samples.
modelName - The optimal model selected.
loglik - The log-likelihood value for the selected optimal model.
information_criterion - The information criterion used to select the optimal model.
ic_output - This stores the information criterion value calculated for each model.
classification - The total number of CpG sites in each cluster.
prop_data - The estimated mixing proportion for each cluster.
my.seed <- 190 M <- 3 N <- 4 R <- 2 data_output <- betaclust(pca.methylation.data[1:30,2:9], M, N, R, model_names=c("K..","KN.","K.R"), model_selection="BIC", parallel_process = FALSE, seed=my.seed) summary(data_output)
my.seed <- 190 M <- 3 N <- 4 R <- 2 data_output <- betaclust(pca.methylation.data[1:30,2:9], M, N, R, model_names=c("K..","KN.","K.R"), model_selection="BIC", parallel_process = FALSE, seed=my.seed) summary(data_output)
An objective method to calculate the threshold points for the clustering solution of the K.. and the KN. models.
threshold(object, data, model_name)
threshold(object, data, model_name)
object |
|
data |
A dataframe of dimension |
model_name |
The name of the model for which the thresholds need to be calculated. |
As the K.. model constrains the shape parameters to be equal for all patients, a single pair of threshold points are calculated for all patients. The KN. model allows patient-specific shape parameters which results in a pair of threshold points for each patient based on the shape parameters for that patient. The first threshold point means that any CpG site with beta value less than this value is likely to be hypomethylated. The second threshold point means that any CpG site with beta value greater than this is highly likely to be hypermethylated. A CpG site with beta value lying between the two threshold points is likely to be hemimethylated.
thresholds - The threshold points calculated for the selected model. A vector containing two threshold points are returned for the K.. model whereas a matrix containing two threshold points for each patient is returned for the KN. model.