Multifractal bioinformatics: A proposal toward the nonlinear interpretation of the genome

The first draft of the human genome (HG) sequence was published in 2001 by two competing consortia. Since then, several structural and functional characteristics of the HG organization have been revealed. Today, more than 2,000 human genomes have been sequenced, and these findings are having a strong impact on academia and public health. Despite all this, a major bottleneck persists: genome interpretation, that is, the lack of a theory that explains the complex puzzle of coding and non-coding features that compose the HG as a whole. Ten years after the HG was sequenced, two recent studies framed in the multifractal formalism allow a nonlinear theory to be proposed that helps interpret the structural and functional variation of the genetic information in genomes. This review article discusses this new approach, called "multifractal bioinformatics".


Omic Sciences and Bioinformatics
The Human Genome Project (HGP) was created in 1990 to study genomes, their life properties, and the pathological consequences of their impairment. Since then, about 500 Gbp (EMBL), represented in thousands of prokaryotic genomes and tens of different eukaryotic genomes, have been sequenced (NCBI, 1000 Genomes, ENCODE). Today, genomics is defined as the set of sciences and technologies dedicated to the comprehensive study of the structure, function, and origin of genomes. Several types of genomics have arisen as a result of the expansion and application of genomics to the study of the Central Dogma of Molecular Biology (CDMB), Figure 1 (above). The catalog of different types of genomics uses the suffix "-omic", meaning "set of", to name the massive approaches of the new omics sciences (Moreno et al., 2009). Given the large amount of genomic information available in the databases and the urgency of actually interpreting it, the balance of infrastructure requirements in research laboratories has begun to lean heavily toward bioinformatics, Figure 1 (below).
Bioinformatics, or computational biology, is defined as the application of computer and information technology to the analysis of biological data (Mount, 2004). It is an interdisciplinary science that draws on computing, applied mathematics, statistics, computer science, artificial intelligence, biophysics, biochemistry, genetics, and molecular biology. Bioinformatics was born from the need to understand the sequences of nucleotide or amino acid symbols that make up DNA and proteins, respectively. These analyses are made possible by the development of powerful algorithms that predict and reveal a wide range of structural and functional features in genomic sequences, such as gene location, the discovery of homologies between macromolecules in databases (BLAST), phylogenetic analysis, regulatory analysis, and the prediction of protein folding, among others. This great development has created a multiplicity of approaches, giving rise to new types of bioinformatics such as the Multifractal Bioinformatics (MFB) proposed here.

Multifractal Bioinformatics and Theoretical Background
MFB is a proposal to analyze the information content of genomes and their life properties in a nonlinear way. It is part of a specialized sub-discipline called "nonlinear bioinformatics", which uses a number of related techniques for the study of nonlinearity (fractal geometry, Hurst exponents, power laws, wavelets, among others) applied to the study of biological problems (http://pharmaceuticalintelligence.com/tag/fractal-geometry/). Its application requires detailed knowledge of the structure of the genome to be analyzed and appropriate knowledge of multifractal analysis.

From the Worm Genome toward the Human Genome
To explore a complex genome such as the HG, it is useful to first implement multifractal analysis (MFA) in a simpler genome in order to show its practical utility. For example, the genome of the small nematode Caenorhabditis elegans is an excellent model from which many lessons can be extrapolated to complex organisms. Thus, if the MFA explains some of the structural properties of that genome, the same analysis is expected to reveal similar properties in the HG.
The C. elegans nuclear genome is composed of about 100 Mbp, with six chromosomes distributed into five autosomes and one sex chromosome. The molecular structure of the genome is particularly homogeneous along the chromosome sequences, due to the presence of several regular features, including a large content of genes and introns of similar sizes. The C. elegans genome also has a regional organization of the chromosomes, mainly because the majority of the repeated sequences are located in the chromosome arms, Figure 2 (left) (C. elegans Sequencing Consortium, 1998). Given these regular and irregular features, the MFA could be an appropriate approach to analyze such distributions.
Meanwhile, the sequencing of the HG revealed a surprising mosaicism of coding (genes) and noncoding (repetitive DNA) sequences, Figure 2 (right) (Venter et al., 2001). This structure of 6 Gbp is divided into 23 pairs of chromosomes (diploid cells), and these highly regionalized sequences introduce complex patterns of regularity and irregularity that must be accounted for in order to understand gene structure, the composition of repetitive DNA sequences, and their role in the study and application of the life sciences. The coding regions of the genome are estimated at ~25,000 genes, which constitute 1.4% of the HG.
Given that all these genomic variations, in both worm and human, produce regionalized genomic landscapes, it is proposed that Fractal Geometry (FG) would allow measuring how the genetic information content is fragmented. In this paper, the methodology and the nonlinear descriptive models for each of these genomes are reviewed.

The MFA and its Application to Genome Studies
Most problems in physics are implicitly nonlinear in nature and give rise to phenomena such as chaos. Chaos theory is the science that deals with certain nonlinear dynamic systems that are very sensitive to initial conditions yet rigorously deterministic, that is, their behavior can be completely determined by knowing the initial conditions (Peitgen et al., 1992). In turn, the FG is an appropriate tool to study chaotic dynamic systems (CDS). In other words, the FG and chaos are closely related because the region of space toward which a chaotic orbit tends asymptotically has a fractal structure (strange attractors). Therefore, the FG allows studying the framework on which CDS are defined (Moon, 1992). This is how the structure and function of the genome are expected to be organized.
The MFA is an extension of the FG and is related to (Shannon) information theory, disciplines that have been very useful for studying the information content of a sequence of symbols. Mandelbrot established the FG in the 1980s as a geometry capable of measuring the irregularity of nature by calculating the fractal dimension (D), an exponent derived from a power law (Mandelbrot, 1982). The value of D gives a measure of the level of fragmentation, or the information content, of a complex phenomenon, because D measures the degree of scaling of the fragmented self-similarity of the system. Thus, the FG looks for self-similar properties in structures and processes at different scales of resolution, and these self-similarities are organized following scaling or power laws.
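As a worked illustration of how D follows from a power law, consider the Koch curve, the example used later for box counting. The notation N(E) = E^D, with N the number of self-similar parts obtained at scaling factor E, is assumed here for the sketch. At each construction step every segment is replaced by 4 segments, each 1/3 as long, so after k steps:

$$
E = 3^{k}, \quad N(E) = 4^{k} \quad \Rightarrow \quad D = \frac{\ln N(E)}{\ln E} = \frac{k \ln 4}{k \ln 3} = \frac{\ln 4}{\ln 3} \approx 1.26186
$$

A value between 1 and 2 reflects a curve that fills more space than a line but less than a surface.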
Sometimes a single exponent is not sufficient to characterize a complex phenomenon, so more exponents are required. The multifractal formalism allows this, and applies when many subsets of fractals with different scaling properties, and therefore a large number of exponents or fractal dimensions, coexist simultaneously. As a result, when a multifractal singularity spectrum is generated, the scaling behavior of the frequency of symbols in a sequence can be quantified (Vélez et al., 2010).
The MFA has been implemented to study the spatial heterogeneity of theoretical and experimental fractal patterns in different disciplines. In post-genomic times, the MFA has been used to study multiple biological problems (Vélez et al., 2010). Nonetheless, very little attention has been given to the use of MFA to characterize the structural genetic information content of genomes obtained from images of the Chaos Game Representation (CGR).
The first studies at this level were carried out recently on the C. elegans genome (Vélez et al., 2010) and human genomes (Moreno et al., 2011).
The MFA methodology applied to the study of these genomes is developed below.

Methodology
The Multifractal Formalism from the CGR

Data Acquisition and Molecular Parameters
Databases for the C. elegans genome and the 36.2Hs_ refseq version of the HG were downloaded from the NCBI FTP server. Then, several strategies were designed to fragment the genomic DNA sequences into different length ranges. For example, the C. elegans genome was divided into 18 fragments, Figure 2 (left), and the human genome into 9,379 fragments. According to their annotation systems, the molecular parameters of coding sequences (genes, exons, and introns), noncoding sequences (repetitive DNA, Alu, LINEs, MIR, MER, LTR, promoters, etc.), and coding/non-coding DNA (TTAGGC, AAAAT, AAATT, TTTTC, TTTTT, CpG islands, etc.) are counted for each sequence.
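The fragment-and-count step can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `split_into_fragments` and `count_motif` are hypothetical, fragments are taken to be contiguous and of near-equal length (the fragmentation strategies actually used are not detailed here), and motif occurrences are counted on one strand with overlaps allowed.

```python
def split_into_fragments(sequence, n_fragments):
    """Split a sequence into contiguous pieces of near-equal length.

    Assumption: equal-length fragments; the source does not specify
    the exact fragmentation strategy.
    """
    size, extra = divmod(len(sequence), n_fragments)
    fragments, start = [], 0
    for i in range(n_fragments):
        # The first `extra` fragments receive one additional symbol.
        end = start + size + (1 if i < extra else 0)
        fragments.append(sequence[start:end])
        start = end
    return fragments


def count_motif(fragment, motif):
    """Count occurrences of a motif in a fragment, overlaps included."""
    count, pos = 0, fragment.find(motif)
    while pos != -1:
        count += 1
        pos = fragment.find(motif, pos + 1)
    return count


if __name__ == "__main__":
    toy = "TTAGGC" * 3 + "AAAAT" * 2  # made-up toy sequence
    for index, fragment in enumerate(split_into_fragments(toy, 2)):
        print(index, count_motif(fragment, "TTAGGC"), count_motif(fragment, "AAAAT"))
```

Annotated features such as genes or Alu elements would come from the annotation files rather than from string matching; only short sequence motifs lend themselves to this kind of direct count.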

Construction of the CGR
Subsequently, the CGR, a recursive algorithm (Jeffrey, 1990; Restrepo et al., 2009), is applied to each selected DNA sequence, Figure 3 (above, left), producing an image that is then quantified by the box-counting algorithm. For example, Figure 3 (above, left) shows a CGR image for a human DNA sequence 80,000 bp in length. Here, dark regions represent sub-quadrants with a high number of points (or nucleotides), and light regions represent sub-quadrants with a low number of points. The calculation of D for the Koch curve by the box-counting method is illustrated by a progression of changes in the grid size, together with its Cartesian graph, Table 1.
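A minimal sketch of the CGR iteration may help make the construction concrete. It assumes the commonly used layout in which the four nucleotides sit at the corners of the unit square (A bottom-left, C top-left, G top-right, T bottom-right) and the walk starts at the center; the corner assignment and the function name `cgr_points` are assumptions of this example. Each nucleotide moves the current point halfway toward its corner.

```python
# Assumed corner layout for the unit square (illustrative choice).
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def cgr_points(sequence):
    """Return the list of CGR points (x, y) for a DNA sequence."""
    x, y = 0.5, 0.5  # start at the center of the square
    points = []
    for base in sequence.upper():
        corner = CORNERS.get(base)
        if corner is None:
            # Skip symbols other than A, C, G, T (for example N).
            continue
        # Move halfway from the current point toward the corner.
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in cgr_points("gaattc"):
        print(point)
```

Plotting the points for a long sequence gives the CGR image, in which each sub-quadrant of side 1/2^k collects the positions ending in one particular k-letter word.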

Fractal Measurement by the Box Counting Method
The CGR image for a given DNA sequence is quantified by a standard fractal analysis.
A fractal is a fragmented geometric figure each of whose parts is an approximate reduced-scale copy of the whole, that is, the figure has self-similarity.
The D is basically a scaling rule that the figure obeys. Generally, a power law is given by the following expression:

$$N(E) = E^{D} \qquad (1)$$

where N(E) is the number of parts required to cover the figure when a scaling factor E is applied. The power law allows the fractal dimension to be calculated as:

$$D = \frac{\ln N(E)}{\ln E} \qquad (2)$$

The box-counting algorithm obtains D by covering the figure with disjoint boxes of size ɛ = 1/E and counting the number of boxes required. Figure 4 (above, left) shows the multifractal measure at moment q = 1.
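The box-counting estimate of D can be sketched as below. The sketch assumes the figure is given as a set of points in the unit square (as a CGR image would be), uses grids of 2^k boxes per side, and estimates D as the least-squares slope of ln N against ln(1/ɛ); the function names and the choice of grid levels are illustrative, not taken from the cited works.

```python
import math


def count_occupied_boxes(points, boxes_per_side):
    """Count grid boxes of size 1/boxes_per_side containing at least one point."""
    occupied = set()
    for x, y in points:
        # Clamp so that a coordinate equal to 1.0 falls in the last box.
        col = min(int(x * boxes_per_side), boxes_per_side - 1)
        row = min(int(y * boxes_per_side), boxes_per_side - 1)
        occupied.add((col, row))
    return len(occupied)


def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def box_counting_dimension(points, levels=range(1, 6)):
    """Estimate D as the slope of ln N(eps) versus ln(1/eps), eps = 2**-k."""
    log_inverse_size = [math.log(2 ** k) for k in levels]
    log_count = [math.log(count_occupied_boxes(points, 2 ** k)) for k in levels]
    return fit_slope(log_inverse_size, log_count)


if __name__ == "__main__":
    # Points on a diagonal line should give a dimension close to 1.
    line = [((i + 0.5) / 1024, (i + 0.5) / 1024) for i in range(1024)]
    print(box_counting_dimension(line))
```

With real data the range of grid sizes matters: boxes much smaller than the typical spacing between points each hold at most one point, and the slope then flattens toward zero.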

Multifractal Measurement
When the box-counting algorithm is generalized to the multifractal case according to the method of moments q, we obtain equation (3) (Gutiérrez et al., 1998; Yu et al., 2001):

$$\sum_{i} \left( \frac{M_i(\varepsilon)}{M_0} \right)^{q} \sim \varepsilon^{(q-1) D_q} \qquad (3)$$

where M_i is the number of points falling in the i-th grid box, M_0 is the total number of points, and ɛ is the box size. Thus, the MFA is used when multiple scaling rules apply. Figure 4 (above, right) shows the calculation of the multifractal measures at different moments q (partition function). Here, the linear regressions must have a coefficient of determination equal or close to 1. From each linear regression a D is obtained, and together these generate a spectrum of generalized fractal dimensions D_q for all integers q, Figure 4 (below, left). So, the multifractal spectrum is obtained as the limit:

$$D_q = \frac{1}{q-1} \lim_{\varepsilon \to 0} \frac{\ln \sum_{i} \left( M_i(\varepsilon)/M_0 \right)^{q}}{\ln \varepsilon}$$

The variation of the integer q allows different regions to be emphasized and their fractal behavior discriminated. Positive q values emphasize the dense regions, where a high D_q is synonymous with richness of structure and properties. Negative q values emphasize the sparse regions, where a high D_q likewise indicates a lot of structure and properties.

In real-world applications, the limit D_q is readily approximated from the data using a linear fit. The transformation of equation (3) yields:

$$\ln \sum_{i} M_i(\varepsilon)^{q} \approx D_q \,(q-1) \ln \varepsilon + q \ln M_0$$

which for a fixed q is a linear function of ln(ɛ); D_q can therefore be evaluated as the slope of the linear relationship between ln Σ_i M_i(ɛ)^q and (q-1) ln(ɛ). The methodologies and approaches for the box-counting method and MFA are detailed in Moreno et al., 2000; Yu et al., 2001; Moreno, 2005. For a rigorous mathematical development of MFA from images, consult the Wikipedia article "Multifractal system".
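The method of moments can be sketched in a few lines. The example below assumes points in the unit square (for instance CGR points), grids of 2^k boxes per side, and box probabilities p_i = M_i/M_0; D_q is estimated as the least-squares slope of ln Σ p_i^q against (q-1) ln ɛ. The case q = 1, where this expression is undefined, is handled with the usual limiting form, the slope of Σ p_i ln p_i against ln ɛ. Function names and grid levels are illustrative assumptions.

```python
import math


def box_probabilities(points, boxes_per_side):
    """Return p_i = M_i / M_0 for every occupied box of the grid."""
    counts = {}
    for x, y in points:
        col = min(int(x * boxes_per_side), boxes_per_side - 1)
        row = min(int(y * boxes_per_side), boxes_per_side - 1)
        counts[(col, row)] = counts.get((col, row), 0) + 1
    total = len(points)
    return [count / total for count in counts.values()]


def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def generalized_dimension(points, q, levels=range(1, 6)):
    """Estimate D_q by linear regression over box sizes eps = 2**-k."""
    xs, ys = [], []
    for k in levels:
        boxes_per_side = 2 ** k
        probs = box_probabilities(points, boxes_per_side)
        log_eps = -math.log(boxes_per_side)
        if q == 1:
            # Limiting case q -> 1 (information dimension).
            xs.append(log_eps)
            ys.append(sum(p * math.log(p) for p in probs))
        else:
            xs.append((q - 1) * log_eps)
            ys.append(math.log(sum(p ** q for p in probs)))
    return fit_slope(xs, ys)


if __name__ == "__main__":
    # A uniformly filled square is monofractal: every D_q is close to 2.
    grid = [((i + 0.5) / 64, (j + 0.5) / 64) for i in range(64) for j in range(64)]
    for q in (-5, 0, 1, 2, 5):
        print(q, generalized_dimension(grid, q))
```

For a uniformly filled square all D_q coincide, which is the monofractal case; a spread of D_q values across q is what signals multifractality. For strongly negative q the sums are dominated by the least populated boxes, so estimates there are the most sensitive to sampling noise.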

Measurement of Information Content
Subsequently, from the spectrum of generalized dimensions D_q, the degree of multifractality ΔD_q (MD) is calculated as the difference between the maximum and minimum values of D_q: ΔD_q = D_qmax - D_qmin (Ivanov et al., 1999). When ΔD_q is high, the multifractal spectrum is rich in information and highly aperiodic; when ΔD_q is small, the resulting dimension spectrum is poor in information and highly periodic. It is expected, then, that aperiodicity in the genome is related to highly polymorphic, aperiodic genomic structures, and periodic regions to highly repetitive and not very polymorphic genomic structures. The correlation exponent τ(q) = (q - 1)D_q, Figure 4 (below, right), can also be obtained from the multifractal dimension D_q. The generalized dimension also provides significant specific information. D(q = 0) is equal to the capacity dimension, which in this analysis is the box-counting dimension. D(q = 1) is equal to the information dimension and D(q = 2) to the correlation dimension. Based on these multifractal parameters, many of the structural genomic properties can be quantified, related, and interpreted.
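Given a spectrum of generalized dimensions, the two derived quantities above reduce to a few lines. In this sketch the spectrum is a dictionary mapping q to D_q; the numerical values are made up purely for illustration and do not come from any genome, and the function names are hypothetical.

```python
def degree_of_multifractality(spectrum):
    """Return Delta D_q = max(D_q) - min(D_q) for a {q: D_q} spectrum."""
    values = list(spectrum.values())
    return max(values) - min(values)


def correlation_exponents(spectrum):
    """Return tau(q) = (q - 1) * D_q for every q in the spectrum."""
    return {q: (q - 1) * d_q for q, d_q in spectrum.items()}


if __name__ == "__main__":
    # Hypothetical spectrum (illustrative values only).
    spectrum = {-2: 2.0, 0: 1.875, 1: 1.75, 2: 1.5}
    print(degree_of_multifractality(spectrum))
    print(correlation_exponents(spectrum))
```

Note that τ(1) is always 0 by construction, and that for a monofractal τ(q) is a straight line in q, so curvature in τ(q) is another way of reading multifractality.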

Multifractal Parameters and Statistical and Discrimination Analyses
Once the multifractal parameters are calculated (D_q for q from -20 to 20, ΔD_q, τ_q, etc.), correlations with the molecular parameters are sought. These relations are established by plotting the counts of the genomic molecular parameters versus MD, using discriminant analysis with Cartesian graphs in 2-D, Figure 5 (below, left), and 3-D, and by combining multifractal and molecular parameters. Finally, simple linear regression analysis, multivariate analysis, and analyses by ranges and clusters are performed to establish statistical significance.
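The simple linear regression step can be illustrated as follows, relating one molecular parameter per fragment to the degree of multifractality of that fragment. The data below are invented for the example (they are not results), and the function name `linear_regression` is hypothetical; the multivariate, range, and cluster analyses mentioned above are outside the scope of this sketch.

```python
def linear_regression(xs, ys):
    """Ordinary least squares fit y = slope * x + intercept.

    Returns (slope, intercept, r_squared), where r_squared is the
    coefficient of determination.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


if __name__ == "__main__":
    # Made-up values: element counts per fragment and Delta D_q per fragment.
    element_counts = [10, 20, 30, 40]
    delta_dq = [0.5, 0.75, 1.0, 1.25]
    print(linear_regression(element_counts, delta_dq))
```

A coefficient of determination near 1 would indicate that the molecular parameter accounts for most of the variation in multifractality across fragments; it says nothing by itself about statistical significance, which needs the number of fragments and an appropriate test.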

Non-linear Descriptive Model for the C. elegans Genome
When the C. elegans genome was analyzed with the multifractal formalism, the analysis revealed what the symmetry and asymmetry of the genome nucleotide composition had suggested. The multifractal scaling of the C. elegans genome is of interest because it indicates that the molecular structure of the chromosome may be organized as a system operating far from equilibrium, following nonlinear laws (Ivanov et al., 1999; Burgos and Moreno-Tovar, 1996). This can be discussed from two points of view:
1) When the C. elegans chromosomes were compared with each other, the X chromosome showed the lowest multifractality, Figure 5 (above). This means that the X chromosome is operating close to equilibrium, which results in increased genetic instability. Thus, the instability of the X could selectively contribute to the molecular mechanism that determines sex (XX or X0) during meiosis, and the X chromosome would be operating closer to equilibrium in order to maintain its particular sexual dimorphism.
2) When different chromosome regions of the C. elegans genome were compared, changes in multifractality were found in relation to the regional organization (center and arms) exhibited by the chromosomes, Figure 5 (below, left). These behaviors are associated with changes in the content of repetitive DNA, Figure 5 (below, right). The results indicated that the chromosome arms are even more complex than previously anticipated. Thus, TTAGGC telomere sequences would be operating far from equilibrium to protect the genetic information encoded by the entire chromosome.
All these biological arguments may explain why the C. elegans genome is organized in a nonlinear way. These findings provide insight for quantifying and understanding the nonlinear organization of the C. elegans genome, which may be extended to other genomes, including the HG (Vélez et al., 2010).

Nonlinear Descriptive Model for the Human Genome
Once the multifractal approach was validated in the C. elegans genome, the HG was analyzed exhaustively. This allowed us to propose a nonlinear model for the HG structure, which is discussed from three points of view.
1) It was found that the high multifractality of the HG depends strongly on the content of Alu sequences and, to a lesser extent, on the content of CpG islands. These contents would be located primarily in highly aperiodic regions, thus taking the chromosome far from equilibrium and giving it greater genetic stability, protection, and attraction of mutations, Figure 6 (A-C). Thus, hundreds of regions in the HG may have high genetic stability and the most important genetic information of the HG, the [...]
2) The multifractal context seems to be a significant requirement for the structural and functional organization of thousands of genes and gene families. Thus, a high multifractal (aperiodic) context appears to be a "genomic attractor" for many genes (KOGs, KEGGs), Figure 6 (E), and some gene families, Figure 6 (F), that are involved in genetic and deterministic processes, in order to maintain deterministic regulatory control in the genome, although most HG sequences may be subject to complex epigenetic control.
3) The classification of human chromosomes and the analysis of chromosome regions may have some medical implications (Moreno et al., 2002; Moreno et al., 2009). The low nonlinearity exhibited by some chromosomes (or chromosome regions) would involve an environmental predisposition, making them potential targets for structural or numerical chromosomal alterations, Figure 6 (G). Additionally, sex chromosomes should have low multifractality to maintain sexual dimorphism and probably X chromosome inactivation.
All these fractal and biological arguments could explain why Alu elements are shaping the HG in a nonlinear manner (Moreno et al., 2011). Finally, the multifractal modeling of the HG serves as a theoretical framework to examine new discoveries made by the ENCODE project and new approaches to human epigenomes. That is, the nonlinear organization of the HG might help to explain why most of the HG is expected to be functional.

Conclusions
All these results show that the multifractal formalism is appropriate for quantifying and evaluating the genetic information content of genomes and for relating it to the known molecular anatomy of the genome and some of its expected properties.
Thus, the MFB allows the structural nature and variation of the genome to be interpreted in a logical manner.
The MFB helps explain why a number of chromosomal diseases are likely to occur in the genome, thus opening a new perspective toward personalized medicine for studying and interpreting the HG and its diseases.
The entire genome contains nonlinear information that organizes it and supposedly makes it function, which leads to the conclusion that virtually 100% of the HG is functional.
Bioinformatics in general is enriched with a novel approach (MFB) that makes it possible to quantify the genetic information content of any DNA sequence, with practical applications in different disciplines of biology, medicine, and agriculture. This novel breakthrough in the computational analysis of genomes and diseases contributes to defining biology as a "hard" science.
MFB opens a door to developing a research program toward the establishment of an integrative discipline that contributes to "breaking" the code of human life (http://pharmaceuticalintelligence.com/page/3/).

Acknowledgements
Thanks to the directors of the EISC, the Universidad del Valle, and the School of Engineering for offering an academic, scientific, and administrative space for conducting this research. Likewise, thanks to the coauthors (professors and students) who participated in carrying out parts of some of the works cited here. Finally, thanks to Colciencias for the biotechnology project grant # 1103-12-16765.

Figure 1. Development of omics sciences through the flow of genetic information or CDMB (*), above. Changes in infrastructure requirements of the genome project, below.

Figure 2. Map of the C. elegans genome by chromosome divided into three regions (L: left, C: central and R: right) and map of the HG by chromosome.

Figure 3. CGR algorithm for a short DNA sequence (5'-gaattc-3') and CGR for a long sequence of the HG, above. Calculation of the D by the box-counting method applied to the Koch curve, which follows a power law (ln(N(l)) = -ln(l) - 0.2618) with exponent D = 1.2618, below.

Figure 4. MFA by the method of moments, with multifractal measurements calculated by the box-counting algorithm on the CGR for a DNA sequence. Adapted from Yu et al., 2001.

Figure 5. MFA for the C. elegans genome. Multifractal spectra and distribution of multifractality by chromosome, above. 2-D discrimination analysis by chromosome region and distribution of repeat length by chromosomal region, below.

Figure 6. MFA summary diagram of the HG. From left to right the multifractality increases. Adapted from Moreno et al., 2011.

Table 1. Data of the linear progression of grid size change for the Koch curve in Figure 3, below.