However, any computational method is quite inefficient when no specific information is available and the entire protein surface has to be scanned, because the results can include hundreds of possible binding sites of roughly equivalent energy. Such methods may still be useful for determining the kinds of chemical groups that might bind to an already known active site (Petsko and Ringe, 2004).
To identify the correct binding site for a particular ligand accurately, several experimental methods have been used and improved in recent years. These methods are either X-ray crystallography-based or NMR-based, and both approaches have proved successful.
An X-ray crystallography-based technique, MSCS (multiple solvent crystal structures), can map the entire binding surface of a protein and identify the binding sites of small organic molecules, revealing possible functional sites on the surface of any protein that can be crystallized. In the MSCS method, protein crystals are soaked in an organic solvent probe that mimics a particular functional group on a ligand. Since solvent probes, regardless of their polarity, usually cluster in only a few binding sites, these sites are identified as functional sites. An advantage of MSCS over computational methods is that it finds a much more restricted set of binding sites. For example, comparison of experimentally determined solvent positions with those obtained by MCSS and GRID has revealed that both computational methods found the same sites but also identified many others as equally favorable for probe binding (Petsko and Ringe, 2004). The biggest advantage of the MSCS technique is that the locations of the various probes can be used to generate a map of functional groups that have good specificity and affinity for particular binding sites. With this map, scientists can learn how to modify existing compounds to obtain higher specificity; alternatively, the map may be used for the de novo design of a compound (Mattos and Ringe, 1996).
Recently, a cost-effective computational alternative has been developed: a probe-mapping technique called mixed-solvent molecular dynamics (MixMD). MixMD takes protein dynamics into account by running molecular dynamics simulations of proteins in binary solvent mixtures. Moreover, it can identify both competitive and allosteric sites on proteins, which are especially important for drug design (Ghanakota and Carlson, 2016).
NMR techniques are usually based on fragment screening. In addition to organic solvent probes, these techniques also use small-molecule compounds as probes. One such technique, SAR (structure-activity relationships) by NMR (Shuker, Hajduk, Meadows and Fesik, 1996), has proved quite successful. The technique builds a ligand from building blocks that each bind to an individual protein subsite, and it finds the consensus site, which represents the pockets of the protein’s active site (Petsko and Ringe, 2004).
1.3 Deriving Function from Sequence
In the past few decades, the field of genomics has undergone many revolutionary changes, progressing from understanding gene expression to sequencing the genomes of more than 800 organisms. Hand in hand with technological advances, sequence information is growing at an exponential rate. However, a large number of sequences in many sequenced organisms remain functionally uncharacterized. It is therefore not surprising that a great deal of effort goes into predicting the structures and functions of proteins directly from sequence. Such efforts usually rely on computational tools that compare a sequence with sequences from other organisms to obtain information about related sequences, that is, with homology in mind. This approach rests on the assumption that homologous proteins with similar sequences also have similar structures and functions (Lee, Redfern and Orengo, 2007). Although sequence similarity is usually indicative of underlying structural similarity, function prediction by sequence-based methods remains less reliable. Sequence-based methods for deriving protein structure and function can be used only for sequences that are quite closely related to those of known structure and function, and sometimes not even then (Petsko and Orengo, 2004; Sangar et al., 2007). In general, what makes it possible to assume that similar sequences have similar functions is the fact that protein function is often carried out by a set of specific conserved amino acids, which frequently form a pattern; examples are the residues that form active sites or bind specific ligands. Patterns of this kind are known as ‘deterministic patterns’. A ‘stochastic pattern’, on the other hand, reports the probability that a given amino acid occupies a certain position (Tramontano, 2005).
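The distinction between the two kinds of pattern can be illustrated with a minimal Python sketch. The aligned fragments and the pattern `G[DT]SG[GA]` below are hypothetical toy data invented for illustration, not taken from any database; the deterministic pattern is expressed as a regular expression, and the stochastic pattern as per-position residue frequencies computed from the toy alignment.

```python
import re
from collections import Counter

# Hypothetical toy alignment: four equal-length fragments (assumption).
ALIGNED = ["GDSGG", "GDSGA", "GTSGG", "GDSGG"]

# Deterministic pattern: each position lists the residues allowed there.
# The pattern itself is illustrative and was written to fit the toy data.
DETERMINISTIC = re.compile(r"G[DT]SG[GA]")


def matches_pattern(seq):
    """Return True if the deterministic pattern occurs anywhere in seq."""
    return DETERMINISTIC.search(seq) is not None


def position_probabilities(aligned):
    """Stochastic pattern: residue frequencies for each alignment column."""
    n = len(aligned)
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in zip(*aligned)
    ]


profile = position_probabilities(ALIGNED)
print(matches_pattern("AAGDSGGKK"))  # pattern present
print(matches_pattern("AAGESGGKK"))  # E is not allowed at position 2
print(profile[1])                    # frequencies at the second column
```

A deterministic pattern gives a yes-or-no answer, whereas the frequency table keeps the information that, in this toy alignment, D is three times as common as T at the second position.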
A sequence identity above 40% between two proteins is usually considered high enough for function to be transferred safely between them. What remains a major challenge for sequence-based methods is deducing the structure and function of proteins whose sequence identity lies significantly below the 40% threshold; as many studies have revealed, identifying functional similarity is far more difficult than identifying structural similarity (Petsko and Ringe, 2004).
As already mentioned, annotating function from sequence alone can be a challenging task. The most important question to consider is: “What sequence similarity measures or thresholds should be used for safely transferring function between related proteins?” Beyond this issue, nature provides examples in which underlying sequence similarity does not imply functional similarity. Several studies over the past few decades have investigated this issue and tried to elucidate the sequence-function relationship. Most authors agree that a sequence identity > 40% between two proteins can be enough to conclude that they share a common function (Table 1). Some authors have also reported that approximately 90% of protein pairs with sequence identity > 40% conserve all four digits of the EC number. The same authors investigated the accuracy of the annotation process and concluded that > 60% pairwise sequence identity is required for a transfer with less than 30% errors, and > 75% sequence identity for errors below 10% (Rost et al., 2003).
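The thresholds quoted above can be collected into a small lookup, sketched here in Python. The function name `transfer_confidence` and the wording of the returned labels are assumptions made for illustration; only the cut-offs (40%, 60% and 75%) and the associated error levels come from the text.

```python
def transfer_confidence(identity_percent):
    """Map pairwise sequence identity (%) to the error levels quoted in the text.

    The cut-offs follow the figures cited above; the label wording is
    illustrative only.
    """
    if not 0 <= identity_percent <= 100:
        raise ValueError("identity must be a percentage between 0 and 100")
    if identity_percent > 75:
        return "transfer with errors below 10%"
    if identity_percent > 60:
        return "transfer with errors below 30%"
    if identity_percent > 40:
        return "above the commonly accepted 40% threshold"
    return "below the 40% threshold"


for value in (82.0, 65.0, 45.0, 25.0):
    print(value, "->", transfer_confidence(value))
```

Such a lookup is only a reading aid: it restates reported averages and says nothing about individual protein pairs that break the trend.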
It is important to emphasize that even the identification of a common biochemical function does not imply the same or similar cellular and other higher-level functions. However, local alignments of motifs can usually identify at least one function of the protein. If a motif is long enough, it can be identified as a domain with a particular structure and function, which is of great importance for function prediction.
1.3.1 Sequence Alignment and Comparison
Sequence comparison is one of the key analyses in deducing function, since sequence similarity can reveal whether two sequences were derived by evolution from the same ancestral sequence. If they were, they are considered homologous and can be aligned to identify the correspondence between their amino acid sequences that most likely reflects their evolutionary history (Petsko and Ringe, 2004; Tramontano, 2005). Closely related homologs identified in this way can then be used to infer the function of uncharacterized proteins.
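As a rough illustration of how two sequences can be aligned and compared, the following Python sketch implements a basic global alignment (the Needleman-Wunsch dynamic programming scheme) and computes percent identity from the result. The scoring values (match +1, mismatch -1, gap -1) and the choice of alignment length as the identity denominator are simplifying assumptions; practical tools use substitution matrices such as BLOSUM and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return the two gapped strings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column correspond to leading gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the matrix with the best score for each pair of prefixes.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical, non-gap columns as a percentage of the alignment length."""
    if not aligned_a:
        return 0.0
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)


top, bottom = needleman_wunsch("ACGT", "AGT")
print(top)
print(bottom)
print(percent_identity(top, bottom))
```

Percent identity depends on the denominator chosen (alignment length here, but the shorter sequence length is also common), so identity values from different tools are not always directly comparable with a fixed threshold.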
However, one important issue must be considered in homology-based function inference. There are two types of homologous proteins: orthologues (proteins related by a speciation event) and paralogues (proteins related by a duplication event). Consequently, detecting homology between two proteins does not guarantee that they share a common function.