How to recognize conserved motifs in a protein


I would like to make sure that my reasoning is correct. Assume that I know the amino acid sequence of the protein of interest. I can't say anything about the structure by looking only at the amino acid sequence of this protein. But if I know this protein from another organism and the structure of that protein is known, then I can compare the two sequences and conclude something, right? What I mean is that there is no specific sequence corresponding to, for example, a helix-turn-helix motif that I could take, check whether my protein has it, and then say whether there is a helix-turn-helix motif or not. I can do this only by comparison to a protein whose structure is already known, right?


It seems to me that you're asking about homology modelling. In that case, yes you need to compare your protein of interest to a protein (or proteins) of known structure. Homology modelling in a nutshell includes three (four?) steps: template identification/template alignment, modelling, quality assessment.

You start by finding a template for your modelling. This is usually done by sequence alignment, for instance with BLAST. Preferably you use a multiple sequence alignment, which aligns conserved regions more sensitively. You then want a template with as high a sequence identity as possible: above 50 % usually produces models with about 1 Å RMSD [1] in main-chain atoms, while below 30 % modelling errors increase rapidly, so such templates are best avoided.
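As a rough illustration of the identity thresholds mentioned above, here is a minimal Python sketch that computes percent identity from an already aligned pair of sequences. The function names, the toy sequences and the choice to ignore gapped columns are assumptions made for this example; alignment tools differ in how they define the denominator.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: gapped columns are not counted
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def template_verdict(identity: float) -> str:
    """Map an identity value onto the rough thresholds quoted in the answer."""
    if identity >= 50.0:
        return "good template"
    if identity >= 30.0:
        return "usable with caution"
    return "poor template"


# Hypothetical aligned query/template pair, for illustration only.
query = "MKT-AYIAKQR"
template = "MKTLAYLAKQ-"
identity = percent_identity(query, template)
print(f"{identity:.1f}% identity: {template_verdict(identity)}")
```

In practice the identity would come from the output of the alignment program itself; the sketch only shows what the number means.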

There are then a number of different modelling strategies. Basically, they all aim to predict the structure of the conserved protein core as well as possible (which is usually what you're really interested in). Peripheral amino acids are more dynamic and more prone to evolutionary change and are therefore more difficult to predict. Then, most importantly, you assess the quality of your model. This can be done by calculating violations of statistical potentials or physics-based conformational energies (or by using more advanced approaches such as multivariate regression). As in all modelling, this is a crucial step, because predictions from a poor model are misleading and useless.

If you don't find any template you could resort to the exciting field of de novo protein structure prediction, where the aim is to predict the structure from the amino acid sequence alone. I am not very familiar with the methods, but de novo prediction is hard (!). I don't remember any exact numbers, but the number of possible conformations of a normal-sized protein is astronomically large, which leads to great algorithmic and computational challenges. Additionally, without any reference template, the modelling assumptions are greater than those of homology modelling. That said, I have heard that the field has been making great progress in the last few years.
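To give a feel for "astronomically large", here is a back-of-the-envelope calculation in the spirit of Levinthal's paradox. All three numbers (chain length, conformations per residue, sampling rate) are assumptions picked for illustration, not measured values.

```python
import math

residues = 100            # assumed chain length
states_per_residue = 3    # assumed number of backbone conformations per residue
conformations = states_per_residue ** residues

samples_per_second = 1e13             # assumed (very optimistic) sampling rate
seconds_per_year = 3600 * 24 * 365
years_needed = conformations / samples_per_second / seconds_per_year

print(f"conformations: roughly 10^{math.log10(conformations):.0f}")
print(f"years to enumerate them all: roughly 10^{math.log10(years_needed):.0f}")
```

Exhaustive enumeration is therefore not an option, which is why de novo methods have to rely on heuristics and sampling.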


Edit: It struck me that you might be asking about protein fold recognition as well. There are a large number of tools and methods for recognizing and locating protein domains using the amino acid sequence as input, and many of them are available as web servers. Examples include Phyre, which uses the amino acid profile and predicted secondary structure to search structure libraries, and threading-based methods like MUSTER. A number of methods based on hidden Markov models (HMMs) also exist, for instance FISH, which uses structure-anchored HMMs.


Interaction between ATP, a multifunctional and ubiquitous nucleotide, and proteins initiates phosphorylation, polypeptide synthesis and ATP hydrolysis, which supplies energy for metabolism. However, current knowledge concerning the mechanisms through which ATP is recognized by proteins is incomplete, scattered, and inaccurate. We systematically investigate sequence and structural motifs of proteins that recognize ATP. We identified three novel motifs and refined the known P-loop and class II aminoacyl-tRNA synthetase motifs. The five motifs define five distinct ATP–protein interaction modes, which concern over 5% of known protein structures. We demonstrate that although these motifs share a common GXG tripeptide, they recognize ATP through different functional groups. The P-loop motif recognizes ATP through phosphates, the class II aminoacyl-tRNA synthetase motif targets adenosine, and the other three motifs recognize both phosphates and adenosine. We show that some motifs are shared by different enzyme types. Statistical tests demonstrate that the five sequence motifs are significantly associated with nucleotide-binding proteins. A large-scale test on the PDB reveals that about 98% of proteins that include one of the structural motifs are confirmed to bind ATP.
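As a small illustration of what a shared GXG tripeptide looks like at the sequence level, the following sketch lists every glycine / any residue / glycine occurrence in a sequence, including overlapping ones. The function name and the toy sequence are hypothetical, and a hit on its own says nothing about ATP binding, since the abstract stresses that the surrounding motif and structure decide the binding mode.

```python
import re

# Lookahead so that overlapping occurrences (e.g. GSGKG) are all reported.
GXG = re.compile(r"(?=(G.G))")


def find_gxg(sequence: str) -> list[tuple[int, str]]:
    """Return (1-based position, tripeptide) for every G-x-G in the sequence."""
    return [(m.start() + 1, m.group(1)) for m in GXG.finditer(sequence.upper())]


# Hypothetical toy sequence, not taken from any database entry.
print(find_gxg("MAGSGKGTLA"))
```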

(A) Superimposed cluster of ATP-binding site structures that belong to the “class II aminoacyl-tRNA synthetase” binding mode. (B) Structural motif identified by a clustering method for the “class II aminoacyl-tRNA synthetase” binding mode.


Reporter's comments

Timeliness

There is no indication of when the site was last updated, or what version of each of the sequence databases is being searched.

Best feature

The site is very simple to use, and the integration of the various resources is very useful. One can make a motif, search for proteins with the motif, and then determine if they, in turn, share any other motifs.

Worst feature

Unfortunately, the results are of dubious use. Using one of my favorite proteins - a putative glycosyltransferase from Arabidopsis - one of the true conserved motifs was buried in a mess of false positives (though the page claims that no false positives are expected at that stringency). Worse, when I went to check on the description of the 'true hit' in the BLOCKS database using the supplied link, I received an error saying that no such BLOCK exists. When I used the link to initiate an EMOTIF scan, I was presented with a substantial list of matching proteins, from both SwissPROT and GenBank. But closer inspection revealed that a number of proteins that should have matched the same motif were not present. In fact, of the 22 known Arabidopsis proteins with this particular glycosyltransferase motif, not a single one was in the list - a very glaring omission. In the interests of fairness, I decided to test another protein: a multifunctional protein involved in beta-oxidation of fatty acids. There are several very clear domains in this protein, which match the PROSITE consensus sequences for these motifs. One domain was identified (in fact, 18 times), but the other domains were not. An EMOTIF scan with several of the motif matches again revealed an absence of any of the Arabidopsis sequences that contain these motifs. Although it is not stated anywhere on the site, it seems clear that only a subset of the protein database (or a very old version) is being searched.

When I tried to allow a single mismatch in the EMOTIF scan, thinking that perhaps a single amino-acid mismatch might cause some proteins to be omitted, I discovered that this feature is obviously broken. Instead of a short list of matching proteins with the protein motif highlighted, the search instead started spewing an incredible number of full-length protein sequences, without any highlighting or notation.

It should be noted that the EMOTIF site has undergone some revisions in the month since this report was written. The navigation has not changed and there still appear to be problems with the results - now it is more likely that no results will be returned than the user will be given spurious ones.

Wish list

The site needs better documentation to let people know how the programs work and to state clearly the limitations of the tools. I searched through most of the site and the only help pages I could find were for the construction of EMOTIFs from multiple sequence alignments.

Related websites

Two better sites for motif searches are the BLOCKS servers and the PROSITE database of protein families and domains.


Protein domain prediction

Protein domains are arrangements of secondary structure elements which confer a biological function. Complex proteins have evolved by mix-and-match assembly of individual domains or by concatenating several units of the same domain. Domains have similar functions in different organisms, so the domain organisation of a protein gives hints about its function. One widespread motif is the “helix-turn-helix”, which suggests that your protein is able to bind DNA in some capacity.

Examples of programs predicting specific domains:

PSIPRED – protein sequence analysis workbench including secondary structure and disordered protein prediction

Phobius – transmembrane helical segments and signal sequences

COILS – prediction of coiled-coil regions, characteristic for structural proteins or proteins involved in transcription regulation


Conservation motifs - a novel evolutionary-based classification of proteins

Cross-species protein conservation patterns, as directed by natural selection, are indicative of the interplay between protein function, protein-protein interaction and evolution. Since the beginning of the genomic era, proteins were characterized as either conserved or not conserved. This simple classification became archaic and cursory once data on protein orthologs became available for thousands of species.

To enrich the language used to describe protein conservation patterns, and to understand their biological significance, we classified 20,294 human proteins against 1096 species. Analyses of the conservation patterns of human proteins in different eukaryotic clades yielded extremely variable and rich patterns that had never been characterized or studied before. Using mathematical classifications, we defined seven conservation motifs: Steps, Critical, Lately Developed, Plateau, Clade Loss, Trait Loss and Gain, which describe the evolution of human proteins.

One type of motif, which we termed Gain, describes the human proteins that are highly conserved in a small number of organisms but are not found in most other species. Interestingly, this pattern predicts 73 possible instances of horizontal gene transfer in eukaryotes.

Overall, our work offers novel terms for conservation patterns and defines a new language intended to classify proteins based on evolution, reveal aspects of protein evolution, and improve the understanding of protein functions.


How to recognize conserved motifs in a protein - Biology

Hi Nicholas, thank you so much for giving a lot of information. Bioinformatic Methods II was a little difficult, but I understood it after repeating the lab discussions. Thanks a lot.

I really appreciate this series of courses. I want to thank Prof. Provart and his colleagues for their great job preparing and presenting this series. Thanks a lot!

In this module we'll be exploring conserved regions within protein families. Such regions can help us understand the biology of a sequence, in that they are likely important for biological function, and can also be used to help ascribe function to sequences where we can't identify any homologs in the databases. There are various ways of describing the conserved regions, from simple regular expressions to profiles to profile hidden Markov models (HMMs).

Instructors

Nicholas James Provart

Video transcript

All right, in today's lab, we are examining motifs in proteins. And often the presence of these motifs can tell us about the biological function of a given protein, especially if we can't find any homology to sequences in the database. So we're starting off using the Conserved Domain Database at NCBI. And, question 1a asks what the source databases are that comprise the CDD, and that should be pretty easy to find out under the help section. Question 1b asks about the size of the BRCA2 protein. It's quite large, greater than 3,000 amino acids long. Question 1c then asks how many distinct protein domains does BRCA2 possess. And if we simply count the number of unique accessions, or identifiers in this section here. We'll see that there are five different domains present. Question 1d asks, how many BRCA2 repeat domains are there? And there are eight of those things. So these are the green blobs in this region of the protein here. All right. Then we're exploring something called the CDART, which is the Conserved Domain Architecture Retrieval Tool. And this tool will allow you to identify proteins that have similar domain architectures, that is, the similar composition of domains as your protein of interest. So they don't necessarily have to be homologous. It just should have the same constituent parts as our protein of interest. [COUGH] Question 1e asks how many eukaryotic species contain the BRCA2 repeat region. BRCA2 region containing OB1, OB2, OB3, but actually lack the BRCA2 repeat region, which is denoted in CDART as BRCA2. We can use the filter tool to actually specify that by including and excluding those two domains, and we see that there are several proteins that actually lack the repeat region, the BRCA2 repeat region, but contain the OB1, OB2, OB3 domains. And that would suggest that those domains can actually function independently of one another. So, the one domain doesn't have to be there for the other one to function. 
So, they probably serve independent functions. So the next tool or the next database that we're exploring is SMART. And SMART will again scan a protein sequence for the presence of known regions, domains or of repetitive regions. We'll also identify repetitive regions. Or signal peptides, all of these kinds of signals in proteins that are important for function, and when we feed BRCA2 into SMART, we see that there are in fact no signal peptides or transmembrane domains. So it doesn't seem to be associated with, it would suggest that it's not associated with the membrane, and that it's not targeted to any particular subcellular compartment. Question 1h asks how many regions of low complexity does BRCA2 possess? And what we're looking for here are these low complexity regions. And we simply count those up in the list. So, that will help you answer one of the quiz questions. [COUGH] The next database that we're exploring is Pfam and here we are, again, feeding in our sequences. And asking how many different protein domains does Pfam identify. And here again we see five like we did with CDD. Again we see eight of the BRCA2 repeats, as well as these other unique domains at this end. So, that's nice to see the congruency between the CDD and Pfam search. So, question j asks whether or not we see the BRCA repeat domain occurring in non-BRCA2 orthologous proteins. So, this is kind of like the flip of the search that we did with CDART. Where we were looking for the presence of the non BRCA repeat domains in other proteins in the absence of the BRCA repeat. So here we're doing the opposite, we're asking whether or not the BRCA repeat occurs in non-BRCA2 orthologous proteins. And what we can do is we can simply scroll down the graphical output of our Pfam search, and we see that there are instances where we see the BRCA repeats in proteins, but we don't see these other regions here. 
So that does indicate again - confirms - that the two domains can act independently, presumably independently of one another, have different functions. that they don't have to be present to function together. So that's an important fact. And then question k asks, can we say anything interesting about the species that possess strictly the BRCA repeats and no other BRCA2-type domains and there seems to be quite a diversity of species. So, it's not limited to any particular species. All right. Now we're looking at the sequences that go into defining the BRCA2 repeat. This is Pfam entry number PF00634. And if we take all of those sequences that are found in various sequences that are in the databases, we can see that the best conserved position in this HMM for this BRCA2 repeat is in fact this position right here. Position seven. And that's a phenylalanine. It's almost completely conserved. There's a little bit of variation, but it's almost always a phenylalanine at that position. If you scroll across to the right, over here you'll find the answer to another quiz question. [COUGH] So question m asks, how was that HMM built for the BRCA2 repeat. And we can actually see the commands, the UNIX commands, that were issued in the standalone version of HMMer to create that HMM. And we're not using that, but it's good to know we can drill back down to the actual commands that were used to build that HMM. So the last part of the lab deals with using InterProScan. And as I mentioned in the lecture, InterPro's an overarching collection of all these different motifs and domains that have been collated into one master database. And this makes it very easy to search many different databases with the InterProScan tool. And question n asks, are the results of our InterProScan for BRCA2 congruent with those of the CDD search? And the answer again is yes, we do see congruency. So here are the BRCA repeats. 
There are eight of them plus these other domains that are found towards the C-terminal end of the BRCA2 protein. And one of the quiz questions asks whether or not there's a Prosite motif that's identified and contained within InterPro. And what we'd be looking for here is the presence of a PS designator on the accession identifiers here, so if there's a PS, that means Prosite: the motif came from Prosite. So that should help you answer that quiz question. All right, by the end of the first lab of Bioinformatic Methods II, you should know why we're interested in searching for motifs and profiles in sequences. You should know the advantages and disadvantages of representing structural elements in protein sequences as motifs, or even as profiles, which are slightly better. You should be able to generate a motif given a specific alignment. You should also be able to understand how to score a given sequence with a given position-specific scoring matrix, PSSM, and you should also be able to use CDD, CDART, SMART, Pfam, and InterProScan to identify specific functional units within the protein sequence.


Video transcript

All right, welcome to Bioinformatic Methods II. I'm your instructor Nicholas Provart. Course material for this course was developed by Ryan Austin, David Guttman, Laura Hug, Momoko Price, and myself. And the course was produced by Jamie Waese, Rohan Patel, William Heikoop and again myself. As a reminder, please do use the Coursera tools to discuss the lecture content and labs. The course format and syllabus are as follows. The course will cover motif searching, protein-protein interactions, structural bioinformatics, gene expression data analysis and cis-element prediction. Most of the tools used for exploration are web-based. Week 1, we'll cover protein motifs. Week 2, we'll cover protein-protein interactions. Week 3, protein structure. Weeks 4 and 5, gene expression analysis, and Week 6, cis-regulatory elements. The weekly material consists of mini lectures about 20 minutes long and short 2-minute intro and summary videos. Then there are the weekly labs, which will take you about 1 to 2 hours to do, and then there are lab quizzes associated with those, fairly short lab quizzes. There's also an optional online lab discussion video that you can watch to help you work through the lab. And there are two sectional quizzes, one after the first three weeks' material and the other one at the end of the course. Finally, we'll finish up with one assignment, which is due at the end of the course. I should add that it's not necessary to have taken Bioinformatic Methods I for this course, Bioinformatic Methods II. It would help but it's not necessary. All right, so in this week, we're doing motif and profile analysis, and we'll talk about motifs and profiles and profile HMMs, and touch on a tool called HMMer and a database of profiles and motifs. So why do we want motifs and profiles? Why do we care about them? The reason is that divergence, evolutionary divergence, gives rise to sequence families.
Given protein families have related structural elements necessary for biological function. And there are tight constraints on amino acid composition and on the orientation necessary for, for example, correct active site geometry. However, sequence divergence may result in no homologue being identified. But the structural elements might still be present, and we can use these to infer function if we can't identify a homologue. And also, having a model of the structural elements may allow better alignment of a new sequence family member. There are also sequence motifs that can be present in the promoters of genes. And these are necessary for the binding of transcription factors and other regulatory proteins. And we'll discuss these in greater detail in the cis-element laboratory in week 6. All right, we'll start with motifs, which are also called patterns or rules. And this is the simplest approach to structural element identification. An example database for motifs is Prosite. So given an alignment, here's an example alignment here, we can start to see that certain residues within the alignment are conserved or at least semi-conserved. For instance, at the second position, we see an aspartate which seems to be conserved. And then at the 4th position, we see a glycine that seems to be absolutely conserved. We can use the following set of rules to create or derive a motif. And the patterns in Prosite are described using these rules. First of all, we use the standard IUPAC one-letter code for the amino acids. We use an X to denote a position where any amino acid is accepted. We denote ambiguities within square brackets. So if we see something that looks like this, that means that an alanine, a leucine or threonine is allowed at that position. More general ambiguities use a pair of curly braces to indicate what is disallowed at that position. So for instance, this means that any amino acid except alanine or methionine is allowed at that position.
Now each element in the pattern is separated using a dash. It's not an absolute rule. Repetition is denoted using a numerical value or numerical range between parentheses. So x 3 for instance would mean three Xs, x 2 comma 4 would mean you could have two Xs in a row, three Xs in a row or four Xs in a row. Patterns at the N- or C-terminal end of the sequence can be denoted using the leftward pointing less-than symbol or the rightward pointing greater-than symbol, respectively. And a period ends the pattern; that's also not always observed. All right, coming back to our alignment, we use those rules to derive a motif, which we can see here. And we would read that motif as an alanine or serine at the first position, followed by an absolutely conserved aspartate, followed by I, V or L, followed by an absolutely conserved glycine, then any four amino acids, then anything except proline or glycine, followed by an absolutely conserved cysteine, then D or E, arginine, any one of phenylalanine or tyrosine, twice, and then ending up with a glutamine. So a real-life example would be the C2H2 zinc finger. And here we see two absolutely conserved cysteines, which are zinc ligands, as well as the two absolutely conserved histidines, which are also zinc ligands, and then this sort of intervening spacer region. The problem with the motif approach though is that there is no such thing as a partial match. So for instance, if we're searching with an evolutionarily divergent sequence and are trying to identify C2H2 zinc fingers, if that sequence doesn't have one of these amino acids in the spacer region, then it won't be found through a database search. So this leads us to the next way of scoring patterns, and that's using profiles, and these are also called position-specific scoring matrices or PSSMs. So here, we've got another alignment of five sequences. One, two, three, four, five, and there are five positions in this alignment, five columns.
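The pattern rules described above map almost one-to-one onto regular expressions. The following is a minimal sketch of such a conversion, not the code used by Prosite: the function name is made up for this example, the pattern string is the motif read out in the lecture written in Prosite-style syntax, and the test sequence is hypothetical. It also shows the limitation mentioned in the lecture, namely that a single disallowed residue means no match at all.

```python
import re

# One pattern element: a core (x, a letter, [..] or {..}) plus optional (n) or (n,m).
_ELEMENT = re.compile(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Convert a Prosite-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")      # the trailing period is optional
    pieces = []
    for element in pattern.split("-"):         # elements are separated by dashes
        at_start = element.startswith("<")     # '<' anchors at the N-terminus
        at_end = element.endswith(">")         # '>' anchors at the C-terminus
        match = _ELEMENT.fullmatch(element.strip("<>"))
        if match is None:
            raise ValueError(f"cannot parse element {element!r}")
        core, low, high = match.groups()
        if core.lower() == "x":                # x: any amino acid
            piece = "."
        elif core.startswith("{") and core.endswith("}"):
            piece = "[^" + core[1:-1] + "]"    # {..}: anything except these
        else:
            piece = core                       # single letter or [..] set
        if low is not None:                    # (n) or (n,m): repetition
            piece += "{" + low + ("," + high if high else "") + "}"
        pieces.append(("^" if at_start else "") + piece + ("$" if at_end else ""))
    return "".join(pieces)


# The motif read out in the lecture, written in Prosite-style syntax.
motif = "[AS]-D-[IVL]-G-x(4)-{PG}-C-[DE]-R-[FY](2)-Q."
regex = prosite_to_regex(motif)
print(regex)

# Hypothetical sequences: the first satisfies the motif, the second has a
# proline at the {PG} position and is therefore not found at all.
print(bool(re.search(regex, "MKADLGAVFALCDRYFQ")))
print(bool(re.search(regex, "MKADLGAVFAPCDRYFQ")))
```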
So we build a matrix of all of the amino acids on the rows here, cysteine, lysine, histidine, serine, and so on. And then at each position in the matrix, the positions correspond to the alignment columns. We just record the value, the number of times we see a cysteine or a glycine or histidine at that position. So in the first column, we have four of the five amino acids being cysteines. So we put in a probability of observing a cysteine in that position of 0.8, and a probability of observing a glycine of 0.2. And we do that across all of the positions. So we can then use this profile, this PSSM, to actually score any given sequence as to how well it matches the profile. So if we're given a sequence, so here C G G S V, we can calculate a score based on the profile that we have for it simply by multiplying the probability of observing a C at the first position times the probability of observing a G at the second position, a G at the third position, an S at the fourth position and a V at the fifth position, to come up with an overall score of 0.031. So it seems like a great thing. We can actually take into account the abundance of certain amino acids at given positions. There is some leeway given when creating the profiles in terms of deletions and the weights given to unlikely amino acids and so on. But these are all kind of tweaks that have to be done manually, and this leads us to a new kind of profile based on hidden Markov models. Now just as an aside, I'd like to introduce sequence logos to allow the visualization of conserved residues. So what we're looking at here, even though you can't see anything, is a set of sequences that are in common between triose phosphate isomerases. This is from a profile database, and we see that there's phenylalanine at the first position, some tryptophans here in sort of the middle and so on.
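The profile scoring described above, counting residues per column and then multiplying the per-column probabilities, can be sketched in a few lines of Python. The five-sequence alignment below is hypothetical: only its first column (four cysteines and one glycine) follows the lecture, so the resulting score differs from the 0.031 quoted for the slide's alignment.

```python
from collections import Counter


def build_profile(alignment: list[str]) -> list[dict[str, float]]:
    """One dict per alignment column, mapping residue to observed frequency."""
    n = len(alignment)
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in zip(*alignment)
    ]


def score(profile: list[dict[str, float]], sequence: str) -> float:
    """Product of the per-column probabilities of the residues in `sequence`."""
    if len(sequence) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    total = 1.0
    for column, residue in zip(profile, sequence):
        total *= column.get(residue, 0.0)  # unseen residue: probability 0
    return total


# Hypothetical alignment (assumption); the first column gives C = 0.8, G = 0.2.
alignment = ["CGGSV", "CGHSV", "CKGSL", "CGGTV", "GKHSV"]
profile = build_profile(alignment)
print(profile[0])
print(score(profile, "CGGSV"))
print(score(profile, "CGGSA"))  # A never seen in the last column, so score 0
```

The zero score for a single unseen residue is the kind of problem that the manual weighting tweaks mentioned in the lecture, and later the HMM pseudocounts, are meant to soften.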
But even if we add colour to denote residues that have the same physico-chemical properties, it's really hard to tell which residues are conserved and how well they are conserved. We might pick up this lysine here at this position here, the red stripe. But otherwise, it's kind of difficult, so we can use something called sequence logos to actually get at this in a visual way. And here, this is a sequence logo of that alignment, and what we can see actually is that there is absolute conservation of the lysine at the 7th position, and semi-conservation of the asparagine at the fifth position. And this tryptophan here is also somewhat conserved at the sixth position. Now the height of the letters in this sequence logo is determined by the conservation, as measured by the entropy. And we use something called the bit score to calculate that, and the bit score is calculated given this equation here. Basically, for each amino acid at a given position, we compute the frequency of that amino acid and we multiply it by the log 2 of the frequency of that amino acid at that position, and then we sum over all amino acids at that position. That sum, with its sign flipped, is the Shannon entropy. And we subtract the entropy from the log 2 of 20 in the case of protein sequences, amino acid sequences, because there are 20 amino acids, and in the case of nucleotide sequences we would subtract the Shannon entropy value from the log 2 of four because there are four different nucleotides. So the maximum value you can have, when the residue is absolutely conserved as is the case for this lysine residue at position 7, is 4.32, so keep that in mind. The other nice thing about sequence logos is that you can read off a consensus sequence by simply reading off the top letter in each pile. The letters are ordered in each pile according to their abundance in the amino acid alignment at the given column position. So to read off the consensus sequence, we would simply read the top letter in each column.
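The bit score used for the logo heights can be written as log2(alphabet size) minus the Shannon entropy of the column. Here is a minimal sketch of that calculation; the function name and the toy columns are assumptions, and real logo software typically adds a small-sample correction that is left out here.

```python
import math
from collections import Counter


def information_content(column: str, alphabet_size: int = 20) -> float:
    """Bits of information in one alignment column: log2(alphabet) - entropy."""
    total = len(column)
    entropy = -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(column).values()
    )
    return math.log2(alphabet_size) - entropy


# Hypothetical columns: a fully conserved lysine reaches the protein maximum.
print(round(information_content("KKKKKKKK"), 2))
print(round(information_content("NNNNNDDS"), 2))
# For nucleotides the maximum is log2(4) = 2 bits; an evenly mixed column gives 0.
print(round(information_content("ACGT", alphabet_size=4), 2))
```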
The consensus here reads W V M G N W K M N G T, and that gives us the consensus sequence for that particular alignment. So we can use these to examine bits of biology and look at, for instance, the CAP-DNA binding complex. We see that there are certain residues on the DNA sequence that this CAP protein recognizes, and these are visible here. We need a T G T G A and a T C A C A at this position, and then these in turn map to residues on the protein structure. These residues on the protein structure bind to these DNA residues. And we see conservation of these protein residues in terms of the DNA binding region of the helix-turn-helix motif. In the case of yeast TATA sites, we see that there certainly does seem to be a TATA motif. This is the start of transcription in yeast promoters. Some sites within the TATA box are better conserved than others. For instance, the second A seems to be an absolute requirement. We can also see in the case of intron-exon splice junctions that the signal is actually fairly weak. There does seem to be a requirement of a G and T at the first and second position of the intron, and an A and G at the last two positions of the intron. And then there's this polypyrimidine tract here towards the three prime end of the intron that is also required. But here again, it's not a very strong signal. We also see some requirement here of some nucleotide specificity at the 3 prime end of the exon. So we're coming now back to hidden Markov models, and hidden Markov models or HMMs offer a more systematic approach to estimating the model parameters if we're trying to describe a specific structural pattern. It's a dynamic kind of statistical profile, and as with an ordinary profile, we can build it by analyzing the distribution of the amino acids in the training set of related proteins of an alignment. However, an HMM has a more complex topology than a profile.
So rather than just having a matrix of values, we can use a finite state machine to represent not only the values at a given position but also the ability to transition into different states, so an insert state or delete state. And this little cartoon here just shows the kinds of states the hidden states that can exist within a model in terms of a finite state machine. In the case of a sequence HMM typically we have a certain number of match states for each position in the alignment that's well conserved / not gappy. And then we also have insert states as denoted by these characters here and then we also have delete states denoted by the circles. And to generate a sequence once we've created this HMM, we can actually generate a sequence by moving through the HMM starting at the beginning and then transitioning in any number of ways into either an insert state or a match state or a delete state. And the transition probabilities can all be described based on the data that we use to generate the HMM. And the emission probabilities associated with the match states and the insert states are also described based on the data that we use to generate the HMM. So this is sort of a cartoon of what a sequence HMM would look like. In the case of a real alignment, something like this where we have eight match states, we would basically for each match state in the sequence alignment where we have more than 50% of residues at each position, that's how we determine the number of match states with a simple heuristic here. there are more sophisticated ways of doing this, we would compute the frequency of each residue at each match state. So in this first column, for instance, we have one two, three, four five valines plus phenylalanine plus an isoleucine. And in the match state emission probability series, we would have the highest probability of emitting a valine at this given position followed by isoleucine and phenylalanine. 
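The two steps just described, choosing match states with the simple "more than 50% residues" heuristic and estimating match-state emission probabilities from column frequencies, can be sketched as follows. This is an illustration only, not how HMMer does it: the alignment is hypothetical (its first column copies the lecture's five valines, one phenylalanine and one isoleucine), and the pseudocount value is an assumption standing in for the small probability given to unseen amino acids.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def match_columns(alignment: list[str], threshold: float = 0.5) -> list[int]:
    """Indices of columns where more than `threshold` of the rows are residues."""
    n = len(alignment)
    return [
        i
        for i, column in enumerate(zip(*alignment))
        if sum(ch != "-" for ch in column) / n > threshold
    ]


def emission_probabilities(
    alignment: list[str], column: int, pseudocount: float = 0.01
) -> dict[str, float]:
    """Emission probabilities for one match state, with a small pseudocount."""
    counts = Counter(seq[column] for seq in alignment if seq[column] != "-")
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts.get(aa, 0) + pseudocount) / total for aa in AMINO_ACIDS}


# Hypothetical alignment of seven sequences; '-' marks a gap.
alignment = [
    "VGA-H",
    "VGA-H",
    "VNA-H",
    "VGS-H",
    "VG-WH",
    "FGA-H",
    "IGAQH",
]
print(match_columns(alignment))  # the mostly gapped fourth column is left out
first = emission_probabilities(alignment, 0)
print({aa: round(p, 3) for aa, p in first.items() if aa in "VIFA"})
```

Columns that fail the threshold would be modelled by insert states, and the transition probabilities between match, insert and delete states would be estimated from the same alignment in a similar counting fashion.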
We typically add in a very small probability of emitting other amino acids at a given position so that we can still use the HMM to score sequences rationally, and as I mentioned before we also capture the transition probabilities between states. So the transition probabilities here are denoted by the width of the arrows. So the vast majority of the sequences don't contain any insertions or deletions. And so the transition would be typically in this direction. However, we can at some points transition into a delete state or an insert state. We would need to transition into an insert state to generate this sequence. Or to generate this sequence, we need to transition into a delete state, and then we finish up at the end. And then we can use this HMM with the Viterbi algorithm, which is sort of beyond the scope of this course. But we can use this model of sequence properties, alignment properties, to then score any given sequence as to whether or not it matches the HMM or how well it matches the HMM. A database of profile HMMs is Pfam. And it encompasses a large collection of multiple sequence alignments, which are then used to generate a large collection of hidden Markov models. The current iteration encompasses around 18,000 protein families. Pfam is formed in two separate ways, so there are two flavours of Pfam models. Pfam-A HMMs are based on fairly accurate human-crafted multiple sequence alignments, whereas Pfam-B models are based on an automated clustering of the rest of SWISS-PROT using a program called Domainer. Pfam-A uses high-quality seed alignments to build HMMs and then additional sequences are added to generate a final set of aligned sequences. And the seeds for those alignments are honed by iterative methods. So there are issues. HMMs sound great and sound like they've solved all our problems. They allow gaps. They allow deletions. However, an HMM is a linear model and it's unable to capture higher order correlations among amino acids in a protein molecule.
So for instance, take amino acids which are far apart in the linear chain, but which may be in proximity to each other when the protein folds: the interactions between those amino acids, the dependencies, can't be captured with a linear model. And for HMMs, we assume that the probability of any amino acid in the sequence is independent of its neighbours. And this may not always be true. So in the case of the hydrophobic core of proteins, hydrophobic amino acids are likely to appear in proximity to each other. And so researchers have developed new kinds of statistical models: neural nets, hybrid HMMs, dynamic Bayesian nets, factorial HMMs, and so on. But for the purpose of this course, we're just going to explore HMMs and they really are quite useful. So in today's lab, we'll use several domain, motif, and profile HMM databases and tools to examine a representative sequence. We'll look at the CDD, the Conserved Domain Database. You should consider what was used to generate the CDD. We'll use CDART to identify conserved domain architectures. We'll look at SMART, which is the Simple Modular Architecture Research Tool, and look at Pfam. And actually, we won't be looking at HMMER, but it is a suite of tools for generating profile HMMs if you're interested in exploring that on your own. InterProScan offers a convenient way to search Pfam and other profile and motif databases. It's not completely comprehensive, but it's a really good starting place to scan for sequence patterns in a protein of unknown function if you can't find a homolog. All right, well, I hope you enjoy the lab and I'll see you in a bit.


A novel method to identify the DNA motifs recognized by a defined transcription factor

The interaction between a protein and DNA is involved in almost all cellular functions, and is vitally important in cellular processes. Two complementary approaches are used to detect the interactions between a transcription factor (TF) and DNA, i.e. the TF-centered or protein–DNA approach, and the gene-centered or DNA–protein approach. The yeast one-hybrid (Y1H) is a powerful and widely used system to identify DNA–protein interactions. However, a powerful method to study protein–DNA interactions like Y1H is lacking. Here, we developed a protein–DNA method based on the Y1H system to identify the motifs recognized by a defined TF, termed TF-centered Y1H. In this system, a random short DNA sequence insertion library was generated as the prey DNA sequences to interact with a defined TF as the bait. Using this system, novel interactions were detected between DNA motifs and the AtbZIP53 protein from Arabidopsis. We identified six motifs that were specifically bound by AtbZIP53, including five known motifs (DOF, G-box, I-box, BS1 and MY3) and a novel motif BRS1 [basic leucine zipper (bZIP) Recognized Site 1]. The different subfamily bZIP members also recognize these six motifs, further confirming the reliability of the TF-centered Y1H results. Taken together, these results demonstrated that TF-centered Y1H could identify quickly the motifs bound by a defined TF, representing a reliable and efficient approach with the advantages of Y1H. Therefore, this TF-centered Y1H may have a wide application in protein–DNA interaction studies.




There are many structural elements (motifs) that are conserved among different proteins. For example, carbohydrates can be attached to the amino acid asparagine in proteins through N-glycosylation sites, which are indicated by the consensus sequence Asn-Xaa-Ser/Thr. The first amino acid is asparagine (Asn), the second amino acid (Xaa) can be any amino acid except proline, and the third amino acid is either serine (Ser) or threonine (Thr). However, just because this consensus sequence appears does not mean that the site is glycosylated. You can also look for more complex motifs or domains, such as enzyme active sites and receptor binding sites.
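A short consensus like this can be searched for directly with a regular expression. The sketch below is a minimal illustration in Python: the function name and the example sequence are made up, and the pattern excludes proline at the Xaa position, as the PROSITE pattern for this site does. A hit only marks a candidate site; it does not show that the site is actually glycosylated.

```python
import re

# Asn, then any residue except Pro, then Ser or Thr.
# The lookahead lets overlapping candidate sites each be reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")

def find_sequons(sequence):
    """Return (1-based position, matched tripeptide) for each candidate site."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(sequence.upper())]

# Hypothetical sequence, invented for illustration only.
example = "MKNGSAANPTLLNNTSQ"
print(find_sequons(example))
# prints [(3, 'NGS'), (13, 'NNT'), (14, 'NTS')]
```

The N-P-T stretch at position 8 is skipped because of the proline, and the two overlapping sites at positions 13 and 14 are both listed.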

We will look at four different programs.

CDART: gives an interactive graphical display of conserved motifs in a protein

The following three can be accessed through BIOLOGY WORKBENCH.

PROSITE analyzes a protein sequence for known motifs

RPSBLAST performs a blast search of your sequence vs. a database of conserved domains

BLIMPS is similar to RPSBLAST, except that it looks for specific blocks or domains of sequence similarity

CDART: Conserved Domain Architecture Retrieval Tool. This program gives an interactive graphical display of the conserved motifs found in an amino acid sequence. You can click on each domain to learn more about its properties and consensus sequence. The program also provides graphical displays of all known proteins containing at least one of the domains found in your protein. One drawback is that this program only reports major domains, and not smaller motifs, and has fairly brief descriptions. It is a good place to start, but the programs described below under BIOLOGY WORKBENCH are more descriptive and thorough.

1. The program PROSITE analyzes a protein sequence for these known motifs and gives a description of each. This is useful when analyzing the sequence of a new protein to try to gain clues to its function.

Enter the amino acid sequence that you wish to analyze, or the accession number of the protein, and press Start the Scan. You will be given an output that lists the motifs found in the protein, indicating the sequence that was identified and its position in the protein. Each entry also contains a link to more information on that particular motif.

For example the sequence being analyzed has potential N-glycosylation sites at amino acids 233 and 556. By clicking on PDOC00001 more information on N-glycosylation will be provided.

Other motifs are more complex and can include sites that bind cofactors or substrates (active site). Such information would be valuable in identifying the function of a protein.

2. RPSBLAST performs a blast search of your sequence vs. a database of conserved domains in families of proteins. Your sequence is compared to the consensus sequence of many families of proteins to look for a match. This is very useful in identifying which family your protein belongs to, especially over larger domains.

For example, if we submitted a serine protease we would get the following matches.

If we click on the link smart00020 we would learn about the consensus sequence used, information on the family of proteins, and other sequences which are closely aligned to our sequence. There is a new 3D imaging program which allows one to view the aligned sequences. This is not loaded on our computer, but we can view it as an html image.

3. BLIMPS is similar to RPSBLAST, except that it looks for specific blocks or domains of sequence similarity. A protein may overall have relatively low similarity to another protein, but if it has high similarity in specific important regions it may have the same activity and be a homologous protein. BLIMPS compares a protein or nucleic acid sequence against the BLOCKS database of conserved protein motifs. The scores for high-scoring blocks found within the query sequence are totalled and a family classification is made based on the total score for each block found in the query sequence. Individual block scores are listed beneath the family classification along with the highest scoring alignments.

For example, the protein below matched 3 out of 3 blocks for the conserved sequence of an active site of a serine protease.

