After reading the introductory article on this portal about bioinformatics, in particular ChIP-Seq and RNA-Seq technologies, I really liked the idea of adding as many Russian-language articles on bioinformatics as possible, especially on its “practical” side. I therefore offer this brief overview of a pipeline for analyzing methylation data obtained with the Illumina 450K Human Methylation technology.
During the life of an organism, its nucleotide sequence remains largely unchanged (for more information on genes, the genome and DNA, see, for example, this article). Nevertheless, there are processes that affect how the genome functions and that can even be inherited. These processes are called epigenetic changes.
One of the main epigenetic mechanisms is DNA methylation. Methylation is a modification of the DNA molecule in which a methyl group (-CH3) is attached to a C nucleotide, and the C must be immediately followed by a G nucleotide. The nucleotide sequence -CG- is called a CpG dinucleotide, or CpG site. Methylation does not occur in all cells at the same time, so one speaks of the methylation percentage of a given CpG site.
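To make the two notions above concrete, here is a minimal sketch in plain Python (chosen only to keep the example dependency-free). The sequence, the per-cell methylation calls and both function names are hypothetical and serve purely as an illustration.

```python
def cpg_positions(sequence):
    """Return the 0-based positions of every C that is immediately followed by G."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def methylation_percentage(calls):
    """calls holds one boolean per cell: True if the site is methylated in that cell."""
    return 100.0 * sum(calls) / len(calls)


# Hypothetical DNA fragment containing three CpG sites.
sequence = "ATCGGACGTTCGA"
positions = cpg_positions(sequence)

# Hypothetical methylation state of the first CpG site in four cells.
calls_for_first_site = [True, True, False, True]
percentage = methylation_percentage(calls_for_first_site)

print(positions, percentage)
```

A real experiment never observes individual cells in this way; the percentage is estimated from bulk signal, which is what the array described below does.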
DNA methylation is one of the important mechanisms of regulation of gene expression. It has been shown that diseases such as various types of cancer, type 1 and type 2 diabetes, schizophrenia, etc. are associated with changes in the methylation profile. Therefore, it is important to be able to analyze the genome methylation profile.
Currently, several methods for quantitative measurement of the methylation profile are in common use. One of the most widespread is the Illumina microarray series. Below I describe the Illumina Infinium 450K array in more detail, along with the analysis of the data it produces.
The 450K chip measures the methylation level of approximately 486,000 CpG sites, more or less evenly distributed throughout the genome. Without going into the biochemical details of how the chip works, the technology can be briefly described as follows. Each CpG site is measured using two fluorescent probes. The fluorescent signals of the probes are proportional to the numbers of methylated and unmethylated copies of the CpG site in the test sample, respectively. The chip can process up to 12 biological samples simultaneously.
So, at the output we have a table of values in which the number of rows equals the number of CpG sites and the number of columns equals the number of analyzed biological samples. This is where the actual bioinformatics begins.
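As an illustration of how the two fluorescent signals become one table entry, the sketch below computes a Beta value, the methylated signal divided by the total signal plus a small stabilising offset. The intensities are invented, the function names are hypothetical, and the offset of 100 is the commonly quoted default, treated here as an assumption. Plain Python is used only to show the arithmetic; it is not the Bioconductor implementation.

```python
OFFSET = 100.0  # assumed stabilising constant; avoids division by near-zero totals


def beta_value(meth, unmeth, offset=OFFSET):
    """Estimated methylated fraction at one CpG site in one sample."""
    meth = max(meth, 0.0)      # negative intensities can appear after background subtraction
    unmeth = max(unmeth, 0.0)
    return meth / (meth + unmeth + offset)


def beta_matrix(meth_rows, unmeth_rows, offset=OFFSET):
    """Rows are CpG sites, columns are biological samples."""
    return [
        [beta_value(m, u, offset) for m, u in zip(meth_row, unmeth_row)]
        for meth_row, unmeth_row in zip(meth_rows, unmeth_rows)
    ]


# Invented intensities for 2 CpG sites (rows) in 3 samples (columns).
methylated = [
    [1200.0, 300.0, 2500.0],
    [150.0, 180.0, 90.0],
]
unmethylated = [
    [300.0, 1500.0, 400.0],
    [2850.0, 2700.0, 2910.0],
]

betas = beta_matrix(methylated, unmethylated)
for site, row in zip(["site_1", "site_2"], betas):
    print(site, [round(b, 3) for b in row])
```

On a real chip the same calculation yields a table with roughly 486,000 rows and one column per sample.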
The pipeline for analyzing the data with the R language and the Bioconductor library consists of approximately the following steps (with the corresponding Bioconductor packages):
1. Select the measurement scale (Beta value or M-value). Read more here.
2. Color channel balance adjustment. Some CpG sites are measured using probes of one color and others using probes of two colors. This problem is handled by normalizing the two probe signals within each biological sample.
3. Background correction. Each slot for biological samples on the chip has a different baseline background level, so a background correction is needed to make the values comparable between samples.
4. Between-sample normalization. Quantile normalization and SSN normalization (lumi package) are the main methods used.
5. Testing for batch effects using principal component analysis.
6. Peak correction.
7. Correction for batch effects using the ComBat and SVA packages.
8. Testing for statistical significance using linear models, permutations, or conventional hypothesis tests (limma and multtest packages).
9. Data analysis using various machine learning algorithms (I will not list them; there is a whole ocean of possibilities).
10. Correlation with gene expression and SNP data (methylation quantitative trait loci). The MatrixEQTL package is recommended for this.
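Two of the steps listed above, the choice of scale (step 1) and between-sample normalization (step 4), can be sketched in a few lines. This is a dependency-free Python illustration under stated assumptions, not the code of the lumi package: the M-value is taken as the log2 ratio Beta / (1 - Beta), the clipping constant `eps` is my own choice, ties in quantile normalization are broken by row order, and the Beta values are invented.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Convert a Beta value to an M-value: log2(beta / (1 - beta))."""
    b = min(max(beta, eps), 1.0 - eps)  # clip so that 0 and 1 do not produce infinities
    return math.log2(b / (1.0 - b))


def quantile_normalize(matrix):
    """Give every column (sample) the same distribution of values.

    The k-th smallest value of each column is replaced by the mean of the
    k-th smallest values across all columns. Ties are broken by row order.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[matrix[r][c] for r in range(n_rows)] for c in range(n_cols)]
    # orders[c][k] is the row index of the k-th smallest value in column c.
    orders = [sorted(range(n_rows), key=col.__getitem__) for col in columns]
    reference = [
        sum(columns[c][orders[c][k]] for c in range(n_cols)) / n_cols
        for k in range(n_rows)
    ]
    normalized = [[0.0] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        for k, r in enumerate(orders[c]):
            normalized[r][c] = reference[k]
    return normalized


# Invented Beta values: 4 CpG sites (rows) in 3 samples (columns).
betas = [
    [0.75, 0.16, 0.83],
    [0.05, 0.06, 0.03],
    [0.50, 0.40, 0.60],
    [0.90, 0.95, 0.20],
]

m_values = [[beta_to_m(b) for b in row] for row in betas]
normalized = quantile_normalize(m_values)

for row in normalized:
    print([round(v, 3) for v in row])
```

After normalization every sample contains the same set of values, while the ranking of CpG sites within each sample is preserved. Whether to normalize Beta values or M-values is a design decision; the sketch uses M-values only as an example.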
I apologize if this reads as somewhat muddled; that is a consequence of trying to fit everything into one short review article. If anyone is interested, I will describe the process of building the pipeline in several more detailed articles with code examples in R.