PROMOTER RECOGNITION USING NEURAL NETWORK APPROACHES part 2

The output is fed to an ANN, which performs multisensor integration. Scores that make the NN output greater than the selected threshold are treated as positive predictions of the promoter region. They obtained a sensitivity of 67%. The authors show that their method predicts fewer false positives than the then-existing algorithms.

Levitsky and Katokhin [4] used a genetic algorithm based on iterative discriminant analysis, which relies on a global signal, to classify eukaryotic (Drosophila) promoters. The negative set is obtained by shuffling the promoters. Two promoter sample sets are formed, one TATA-containing and one DPE-containing. The cross-correlation (CC) for TATA-containing promoters is reported to be 0.92 and for DPE-containing promoters 0.82.
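
The idea of building a negative set by shuffling promoters can be illustrated with a minimal sketch. It assumes plain per-nucleotide shuffling, which the text does not specify; the function name, the fixed seed, and the toy sequences are hypothetical and are not data from [4].

```python
import random
from collections import Counter


def shuffled_negative(promoter: str, rng: random.Random) -> str:
    """Return a random permutation of the promoter's nucleotides.

    The base composition is preserved, but any positional signal
    (e.g., a TATA box at a fixed offset) is destroyed.
    """
    bases = list(promoter)
    rng.shuffle(bases)
    return "".join(bases)


# Fixed seed (an assumption) so that repeated runs give the same negatives.
rng = random.Random(0)

# Toy sequences for illustration only, not real promoter data.
promoters = ["TATAAAGGCTAGCTAACGT", "GGCTATAAAAGCGCGTTAC"]
negatives = [shuffled_negative(p, rng) for p in promoters]

# Each negative has the same nucleotide counts as its source promoter.
same_composition = all(
    Counter(p) == Counter(n) for p, n in zip(promoters, negatives)
)
```

A shuffle of this kind keeps the negative set matched to the positive set in length and base composition, so a classifier cannot separate the two on composition alone.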

Pedersen et al. characterized the promoters of prokaryotes (E. coli) and eukaryotes (human) using self-organizing parallel HMMs [5]. They considered a set of three states (the main, the deletion, and the insertion states), in addition to start and end states. The set of emissions is the four nucleotides A, T, G, C. Main and insertion states always emit a nucleotide, whereas the deletion state is a no-emission state (i.e., a mute state). Given a set of K training sequences, the parameters of the HMM are iteratively modified to optimize the data fit using a measure based on the log-likelihood. A set of HMMs trained on 38 σ70 and 3 σ54 sequences are combined in parallel to create a super HMM for E. coli promoter recognition. Similarly, human promoter sequences are used to train another HMM model. Clear patterns of well-known consensus signals (TATA box, etc.) could be obtained from the emission probabilities of the main states of the HMM model. Their model classifies 162 out of 166 sequences as σ70 and 3 out of 166 as σ54 sequences. Only one σ70 sequence out of 166 is misclassified. The models have not been tested on nonpromoter sequences.

It is said that DNA encodes two levels of functional information. The first level is for proteins and targets for activators, enhancers, repressors, transcription factor binders, and so on. The second level of information is contained in the physical and structural properties of the DNA itself [15, 16]. In the literature, several groups have exploited these properties to distinguish between features specific to a particular set of DNA sequences and sequences that do not belong to that set. Physico-chemical parameters of a DNA double strand are available in the literature [16]. Kobe et al. reviewed the work of other groups that have considered the structural properties specific to mammals and plants [17]. Some groups have encoded the DNA independently of these properties, in terms of binary values. Whatever encoding is used, the whole sequence is considered for modeling in global signal-based methods. Conformational and physicochemical properties of B-DNA dinucleotides [16] are tabulated by the authors and used as global features for promoter recognition.

This work is based on global signal-based methods using a neural network classifier. For this purpose, we considered two global features: n-gram features and features based on signal processing techniques. It is shown that the n-gram features extracted for n = 2, 3, 4, 5 efficiently discriminate promoters from nonpromoters.

Posted in Uncategorized | Comments Off

Promoter Recognition Using n-Gram Features

A few research papers use n-grams for protein sequence classification and gene identification,
but very few in the literature apply them to promoter recognition. A new class of variable-order
Bayesian network models (VOBN) is proposed by Ben-Gal et al. [25]. These models generalize the
widely used position weight matrix (PWM), Markov, and Bayesian network models. Instead of
considering a fixed subset of the positions to model dependencies, in VOBN models these subsets
may vary based on the specific nucleotides observed, which are called the context. The VOBN model
is applied to a set of 238 σ70 binding sites in E. coli. The authors show that the VOBN model can
distinguish those 238 sites from a set of 472 intergenic nonpromoter sequences with higher accuracy
than fixed-order Markov models or Bayesian trees. They consider the statistical dependencies between
adjacent base pairs of nucleotides in E. coli to achieve a true positive recognition rate of
47.56% [25].
Leu et al. used n-gram features for n = 6–20 to predict promoters for vertebrates
[26]. They consider sequences of length 550 bp. Each sequence segment of length
200 bp is given a cumulative score using all these n-grams with the individual n-gram score
designed based on its occurrence only in promoter or in nonpromoter or in both promoter and
nonpromoter. They achieve an accuracy rate of 88% with this method. Ji et al. implemented a support
vector machine using n-gram features (n = 4, 5, 6, 7) for target gene prediction of Arabidopsis [27].
Wang and Hannenhalli proposed a position specific propensity analysis model (PSPA), which extracts
the propensity of DNA elements at a particular position and their cooccurrence with respect to TSS
in mammals [28]. They considered a set of top ranking k-mers (k = 1–5) at each position ±100 bp
relative to TSS and computed the cooccurrence with other top-ranking k-mers at other downstream
positions. The PSPA score for a sequence is computed as the product of scores for the 200 positions
of ±100 bp relative to TSS. They found many position-specific promoter elements that are strongly
linked to gene product function.
Li and Lin considered position-specific weight matrices of hexamers at 10 specific positions for
the promoter data of E. coli [29]. The position correlation scoring matrix (PCSM) is computed for
the promoter as well as the nonpromoter set of training sequences. If the score of a test sequence
is higher under the positive PCSM than under the negative PCSM, the sequence is identified as a
promoter; otherwise, it is identified as a nonpromoter. Li and Lin [29] report a sensitivity of 91%
and a specificity of 81% for nonpromoter data consisting of coding regions alone, and 90% and 77%
for nonpromoter data taken from intergenic portions only. Applying these scores to the whole genome
to predict the promoters, all 683 experimentally verified σ70 promoters are successfully predicted,
and a further 1567 predictions are reported as probable promoters.


Neural Network Classification Performance

Promoter classification is obtained using a multilayer perceptron having three layers, namely, an
input, a hidden, and an output layer. The output layer has one node to give a binary decision as to
whether the given input sequence is a promoter or nonpromoter. The input layer contains 16, 64, 256,
and 1024 nodes corresponding to the n-gram features for n = 2, 3, 4, and 5, respectively. Different
experiments are carried out to find the optimal number of hidden nodes that gives the best
classification performance. In a fivefold cross-validation, 80% of the data set is used for training
the network and the remaining 20% is used as the test data set. Average performance of the neural
network over the five folds is reported in order to evaluate the efficacy of the various n-gram
features for promoter classification. These simulations are done using the Stuttgart neural network
simulator [33].
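
The input-layer sizes and the fivefold split described above can be sketched as follows. This is only an illustration: the helper names, the random seed, and the round-robin assignment of shuffled samples to folds are assumptions, and the sketch does not reproduce the Stuttgart neural network simulator setup.

```python
import random


def input_layer_size(n: int) -> int:
    """Number of distinct n-grams over the alphabet {A, C, G, T}, i.e., 4**n."""
    return 4 ** n


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0):
    """Split sample indices into k (train, test) pairs.

    Indices are shuffled once, then dealt round-robin into k folds;
    each fold serves as the test set exactly once.
    """
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


sizes = [input_layer_size(n) for n in (2, 3, 4, 5)]  # 16, 64, 256, 1024
splits = k_fold_indices(100)  # 100 is a hypothetical data set size
```

With k = 5, each test fold holds 20% of the samples and the remaining 80% are used for training, matching the proportions given in the text.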
The classification results are evaluated on the test data set using different performance
measures (e.g., precision, specificity, sensitivity, and positive predictive value). Precision,
as used here, is the proportion of correctly classified sequences in the entire test data set
(a quantity often called accuracy elsewhere). Specificity is the proportion of the negative test
sequences that are correctly classified, and sensitivity is the proportion of the positive test
sequences that are correctly classified. Positive predictive value (PPV) is defined as the
proportion of true positives with respect to the total number of sequences that are predicted as
positive (true positives + false positives).
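
The four measures defined above can be computed from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The sketch below follows the definitions in the text, including its use of "precision" for the overall fraction of correctly classified sequences; the function name and the counts in the example are hypothetical and are not results from the chapter.

```python
def classification_measures(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Performance measures as defined in the text.

    Assumes every denominator is nonzero (i.e., the test set contains
    both classes and at least one sequence is predicted positive).
    """
    total = tp + tn + fp + fn
    return {
        # Fraction of all test sequences that are classified correctly.
        "precision": (tp + tn) / total,
        # Fraction of positive (promoter) sequences classified correctly.
        "sensitivity": tp / (tp + fn),
        # Fraction of negative (nonpromoter) sequences classified correctly.
        "specificity": tn / (tn + fp),
        # Fraction of predicted positives that are true positives.
        "ppv": tp / (tp + fp),
    }


# Hypothetical counts for a test set of 100 promoters and 200 nonpromoters.
measures = classification_measures(tp=80, tn=170, fp=30, fn=20)
```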
Using this architecture of the neural network, promoter classification is carried out for E. coli
and Drosophila for n = 2, 3, 4, and 5 grams. For E. coli, the PPV for 2-, 3-, 4-, and 5-grams is
81.29%, 82.97%, 80.03%, and 81.09%, respectively, and for Drosophila it is 85.5%, 89.28%, 89.35%,
and 91.2%, respectively. In the case of Drosophila, since the sensitivity for 5-grams is less than
that for 4-grams, 4-grams are chosen as the best n-gram features. The classification results for
the best n-grams are presented in Table 4.1. The results show that 3-grams are the best
discriminators in E. coli, whereas 4-grams are good at discriminating promoters from nonpromoters
in Drosophila. It can be seen that the identification of nonpromoters, being 85%, is much higher than


GLOBAL SIGNAL-BASED METHODS FOR

Promoter recognition has conventionally been attempted using binding-site prediction algorithms
that are primarily based on motif search techniques. We believe that, along with binding sites, the
upstream and downstream regions contribute to the function of the promoter, and hence we do an
in-depth study of the entire promoter region. There is an indication that codons, which are
triplets, constitute useful features [18] in a DNA sequence, and promoter regions are also shown to
have conserved hexamers [19]. On the other hand, computing hexamers, which are 4^6 = 4096 in number,
for every DNA sequence is computationally expensive. We present our study of the promoter region
using n-gram features, which are contiguous blocks of n characters from a sequence, for n = 2, 3, 4,
and 5.
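
A minimal sketch of n-gram feature extraction is given here for illustration. It assumes overlapping windows and relative-frequency normalization, neither of which is specified in the text; the function name and the lexicographic ordering of the features are also assumptions.

```python
from itertools import product

ALPHABET = "ACGT"


def ngram_features(sequence: str, n: int) -> list:
    """Relative frequency of each of the 4**n possible n-grams.

    Features are ordered lexicographically (AA, AC, AG, AT, CA, ...).
    Windows containing symbols outside ACGT (e.g., N) are skipped.
    """
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=n))}
    counts = [0] * len(index)
    total = 0
    # Slide a window of width n over the sequence, one position at a time.
    for start in range(len(sequence) - n + 1):
        gram = sequence[start:start + n]
        if gram in index:
            counts[index[gram]] += 1
            total += 1
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


features = ngram_features("ACGT", 2)  # windows: AC, CG, GT
```

The feature vector has 16, 64, 256, or 1024 entries for n = 2, 3, 4, or 5, whereas hexamers would already require 4096 entries per sequence.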
Traditionally, biomedical signals have been analyzed by signal processing techniques [e.g., FT
and wavelet transforms]. Biological data sets consist mostly of sequences made up of either
nucleotides or amino acids. Hence, an encoding system is required to convert these sequences into
numerical series. Once a numerical series is obtained, FT or wavelet transform (WT) can be applied.
Wavelets have been used in the literature to analyze biological signals (e.g., genome sequences,
protein structures, and gene expression data) [20]. It is assumed that the promoter signal that is
responsible for the binding is retained by the promoter whether it occurs in an intergenic
portion or in a coding region [21]. To start with, FT of the sequences is used to analyze the
promoter region to gain knowledge in the frequency domain. The Fourier transform per se cannot be
used for promoter recognition. Hence, its power spectrum, computed from the Fourier coefficients, is
used as the feature set. Since positional information is lost in FT, WT is used to retain that
information. Promoter recognition is posed as a binary classification problem. So far, FT has been
used by quite a few groups, but there is no work, as far as we know, that uses wavelets for
promoter recognition.
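
The step from a DNA sequence to power-spectrum features can be sketched as follows. The encoding is an assumption: each nucleotide is mapped to a binary indicator sequence and the four power spectra are summed, which is one common choice and not necessarily the encoding used in this work. The function names are hypothetical, and a direct O(N^2) DFT is used to keep the example free of dependencies.

```python
import cmath


def indicator(sequence: str, base: str) -> list:
    """Binary series: 1.0 where the sequence has `base`, 0.0 elsewhere."""
    return [1.0 if symbol == base else 0.0 for symbol in sequence]


def power_spectrum(signal: list) -> list:
    """Squared magnitude of each discrete Fourier coefficient."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def sequence_power_spectrum(sequence: str) -> list:
    """Sum of the power spectra of the four nucleotide indicator series."""
    total = [0.0] * len(sequence)
    for base in "ACGT":
        for k, power in enumerate(power_spectrum(indicator(sequence, base))):
            total[k] += power
    return total


# A toy sequence with period 4: power appears only at even frequencies.
spectrum = sequence_power_spectrum("ACGTACGT")
```

Because only magnitudes are kept, two sequences that differ by a cyclic shift give the same spectrum, which illustrates the loss of positional information mentioned above.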
4.3.1 Data Set
This section describes the prokaryotic and eukaryotic data sets that are used for the promoter
recognition problem and the n-gram feature extraction methods used for experimentation.
The prokaryotic data set of E. coli is built by taking 669 σ70 promoter sequences of length 80 bp,
with 60 base pairs (bp) upstream of the TSS and the rest downstream, as proposed in the literature,
from the RegulonDB and PromEC databases [22]. Both the positive and the negative data sets are
obtained from Gordon et al. [22]. There is no standard negative data set available. Gordon et al.
build the negative data set by choosing sequence fragments outside the promoter region. This is a
biologically meaningful data set that consists of 709 sequence fragments from the coding region and
709 sequence segments from intergenic portions.
The eukaryotic promoter data set of Drosophila is obtained from Ohler et al. [23], which is taken
from the eukaryotic promoter database (EPD) [24]. A negative data


PROMOTER RECOGNITION USING NEURAL NETWORK APPROACHES

The distinct feature of eukaryotic transcription is that the RNA polymerase does not bind to the promoter directly. A number of transcription-binding proteins bind to the binding sites and form a complex before RNA polymerase binds. Also, there are three kinds of RNA polymerase in eukaryotes, unlike in prokaryotes. For the proteins to bind to deoxyribonucleic acid (DNA), the DNA has to have a physical structure wherein the proteins can come and bind. Special proteins that are used for this purpose are helix-turn-helix and Zn++ finger proteins.

Promoter recognition is not a trivial problem for the following reasons. Unlike other recognition problems (e.g., exon prediction and gene recognition), promoter recognition does not yield good results with methods of alignment or sequence similarity searches, since promoters have very low sequence similarity. Though the patterns (e.g., TATA box) are known to be conserved, there exist many exceptions to this rule. Nonconservation of the patterns, the distance between them, and the presence or absence of the patterns themselves make the task of promoter prediction an even more complex problem. Also, the occurrence of a promoter is not restricted to the 5' end of a gene alone; in the case of prokaryotes, it could in fact be found in a coding region or may overlap with another promoter [3]. Additionally, in the case of eukaryotes, promoters may exist in an intron or in the 3' untranslated region. Hence, the problem of recognizing a promoter against various backgrounds gains importance computationally.

Recently, there has been a deluge of sequencing information due to efficient sequencing methods. Several mammalian, bacterial, and plant species have been sequenced. One can use experimental methods [e.g., DNA footprinting, DNA protein cross-linking, X-ray crystallography, and nuclear magnetic resonance (NMR) spectroscopy] to identify a promoter or a gene. Typically, there are millions of protein sequences, but experimentally determined protein structures are only on the order of 1000. Experimental methods to determine a promoter, a gene, or a protein structure are time-consuming processes. Hence, annotation of important regions (e.g., genes) is not very fast. To overcome this handicap, computational techniques or algorithms that can automatically identify these regions are required.

4.2 RELATED LITERATURE / BACKGROUND

The crux of the problem is to identify a promoter irrespective of its place of occurrence in the genome, by extracting features that are unique to it. Different research groups have been trying to identify these patterns or features specific to promoters by various feature extraction methods and different classifiers. Machine learning techniques can be used to address the issues mentioned above by modeling the recognition–prediction problem as a pattern recognition problem. To properly classify the promoter sequences in silico, one should get features that capture the essence of promoters. Some of the popular feature extraction methods are based on genetic algorithms [4], statistical models (e.g., hidden Markov models [5] and position weight matrices [6–8]), syntactic recognition algorithms [9], and the expectation and maximization algorithm.

Methods based on features extracted from the binding sites or local consensus regions can be termed local signal-based methods. Position weight matrices, the expectation and maximization algorithm, and hidden Markov models have been used in the literature to extract local signals for the promoter recognition problem [5, 6, 10]. Local signal-based methods for eukaryotic promoter recognition use specific motifs like the four binding sites: the TATA box, the initiator (Inr) region, an upstream activating element (UPE), and a downstream promoter element (DPE). The detection of transcription factor binding sites forms the core of the local signal-based methods.

The techniques that use the whole promoter sequence to extract features can be categorized as global signal-based methods. Techniques like Fourier transform (FT), sequence alignment method, and so on, fall under this category. Global signal-based methods use properties, such as GpC content, secondary structure elements, and cruciform DNA structure, for eukaryotic promoter recognition [12]. The literature is abundant with local signal-based methods. Global signal-based methods are also catching up. Some of the work on the promoter recognition problem of both these kinds, which was carried out in the last few years, is presented in Section 4.3.

Das and Dai [13] present a comprehensive literature survey on DNA motif finding algorithms. Motifs are generally searched for in the promoter sequences of coregulated genes, and more recently integrated approaches that include phylogenetic footprinting are being used to find motifs. This survey gives a view of the local signal-based methods that are used to extract conserved patterns in DNA promoter sequences. Huerta and Collado-Vides created an E. coli promoter data set called Regulon database [14]. They extracted and aligned motifs in a given set of unordered sequences, producing a frequency matrix. A set of 96 different weight matrices were created for promoter, coding, and noncoding regions. A score is computed using these weight matrices and the best weight matrix is used to predict a promoter. The predictive capacity of the method is 86%; however, accuracy, defined as the average of sensitivity and positive predictive rate, is 53%. An important contribution of this work is that they predict a high number of putative promoters (promoter-like signals) in the vicinity of a true promoter, which show a better score than the true promoter. The authors suggest that these putative promoters may be trying to bring RNA polymerase (RNAP) closer to the functional promoter.

Bajic et al. designed a local signal-based algorithm that combines a nonlinear promoter recognition model with signal processing, artificial neural networks (ANNs), and a set of sensors in Dragon fly (Drosophila melanogaster) promoter prediction [6]. These sensors are based on the statistical concept of oligonucleotide positional distributions in specific functional regions of DNA. Each sensor models a particular functional region (e.g., promoter, coding-exon, and intron). These distributions are modeled as a set of position weight matrices of the most significant oligonucleotides. Pentamers (regions of length 5) that most significantly contribute to the separation between the promoter and nonpromoter regions are chosen by determining the significance using their statistical relevance. The signals of a sequence using the positional weight matrices for the three functional regions are fed to a signal processing block.
