Human CpG island cluster analysis:

We developed the MR-CpGCluster method to speed up CpG island (CGI) prediction using MapReduce and Hadoop Streaming. The main steps of the MR-CpGCluster framework are as follows:

1. Extraction and preprocessing of genomic data. Sequence data are extracted from the genome database (Step 1.1), classified by chromosome (Step 1.2), preprocessed, and then stored in HDFS. Hadoop divides the data into blocks according to the configured block size (256 MB here) and stores them in distributed mode (Step 1.3).

2. Distributed computation of CpGcluster results with MapReduce. As in the standard MapReduce architecture, the framework maps file blocks to input splits, and the Master node manages the Mapper and Reducer tasks. First, the Master splits the input by chromosome and sends Map tasks to the Mapper nodes (Step 2.1). Each Mapper node receives its split from the Master, starts a Hadoop Streaming process, and passes the data to it as standard input (Step 2.1.1). The Streaming process runs the CpGcluster algorithm, which can be implemented in any language, and returns the local results to the Mapper node as key-value pairs on standard output (Step 2.1.2). The Master node then triggers the Reducers, which process the per-chromosome clustering results produced by the individual Mapper nodes (Step 2.2) and merge them into the clustering result for the complete human genome (Step 2.3).

3. CGI statistics and analysis. The results for the whole human genome are collected from MapReduce (Step 3.1), and the CpGcluster results are analyzed (Step 3.2) as a basis for further CGI analysis (Step 4). The source code of MR-CpGCluster can be downloaded from here.
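
To make the data flow of Step 2 concrete, the following is a minimal sketch of a Hadoop Streaming mapper and reducer in Python. It is not the MR-CpGCluster source code. The input line format (`chromosome<TAB>offset<TAB>sequence`), the function names, and the `max_gap` and `min_cpgs` parameters are all assumptions made for illustration, and the fixed-gap grouping is only a simplified stand-in for the CpGcluster algorithm, which is not reproduced here.

```python
import sys
from itertools import groupby


def find_cpg_positions(seq):
    """Return the 0-based start position of every 'CG' dinucleotide in seq."""
    seq = seq.upper()
    positions = []
    i = seq.find("CG")
    while i != -1:
        positions.append(i)
        i = seq.find("CG", i + 1)
    return positions


def cluster_by_distance(positions, max_gap):
    """Group sorted CpG positions whose neighbor distance is <= max_gap.

    Simplified stand-in for CpGcluster (assumption): returns tuples of
    (start, end_exclusive, cpg_count).
    """
    clusters = []
    start = prev = None
    count = 0
    for p in positions:
        if start is None:
            start, prev, count = p, p, 1
        elif p - prev <= max_gap:
            prev, count = p, count + 1
        else:
            clusters.append((start, prev + 2, count))  # +2 covers the final "CG"
            start, prev, count = p, p, 1
    if start is not None:
        clusters.append((start, prev + 2, count))
    return clusters


def mapper(lines, max_gap=50, min_cpgs=2):
    """Map step: emit 'chrom<TAB>start<TAB>end<TAB>count', keyed by chromosome."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        chrom, offset, seq = line.split("\t")
        offset = int(offset)
        for start, end, n in cluster_by_distance(find_cpg_positions(seq), max_gap):
            if n >= min_cpgs:
                yield f"{chrom}\t{offset + start}\t{offset + end}\t{n}"


def reducer(lines):
    """Reduce step: merge per-chromosome records, ordered by start coordinate.

    Relies on Hadoop Streaming delivering lines sorted by key, so all
    records of one chromosome arrive contiguously.
    """
    records = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for chrom, group in groupby(records, key=lambda r: r[0]):
        for _, start, end, n in sorted(group, key=lambda r: int(r[1])):
            yield f"{chrom}\t{start}\t{end}\t{n}"


# Run as "script.py map" or "script.py reduce" (hypothetical script name).
if __name__ == "__main__" and len(sys.argv) > 1:
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

Because both roles only read standard input and write tab-separated lines to standard output, the same script could be passed to Hadoop Streaming as the mapper and the reducer command. One limitation of this sketch is that a cluster spanning two input records is not joined; handling such boundaries would need additional logic.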