Design a Simple Computational Protocol to Estimate the Secondary Structure of Proteins Using the Amino Acid Propensities and the Graphic Programming Language LabVIEW

Glomen Tovar*

Institute of Biomedical Research (BIOMED), University of Carabobo, Núcleo Aragua, Las Delicias Maracay, Venezuela

*Corresponding Author:
Glomen Tovar
Institute of Biomedical Research (BIOMED)
University of Carabobo
Núcleo Aragua, Las Delicias Maracay
Venezuela
E-mail: tglomen@gmail.com

Received Date: February 08, 2021; Accepted Date: February 22, 2021; Published Date: March 01, 2021

Citation: Tovar G (2021) Design a Simple Computational Protocol to Estimate the Secondary Structure of Proteins Using the Amino Acid Propensities and the Graphic Programming Language LabVIEW. Am J Compt Sci Inform Technol Vol.9 No.2: 76.

Abstract

A simple computational protocol to estimate the secondary structure of a protein is developed, using the amino acid propensities and the graphic programming language LabVIEW. The protocol estimates the number of residues of the structures α-helix, β-sheet and turns; their validation can be checked by the method of leastsquares in the different established relationships. 31 proteins were analyzed using the NADH dehydrogenase as a standard to evaluate the performance of protocol, the results obtained when comparing the values of the programmed structures with those calculated, showed that the Pearson’s correlation coefficients (r) obtained were statistically significant (P<0.001). This result shows that the number of α-helix, β-sheet and turns are a linear function of the number of amino acid residues in the proteins. It can be concluded that this procedure can be useful for novice researchers with limited technological resources and poor laboratory infrastructure, to estimate the secondary structure of a variety of proteins, suggesting some consequences in protein folding.

Keywords

Protocol; Estimate; Least squares; Propensities; Secondary structure; LabVIEW

Introduction

Estimate the secondary structure of proteins is a fundamental part in the analytical study of their structure and function, whose sequences can help to assess their superior structures. Establishing a computational protocol to estimate the secondary structure of proteins, it can be knows the composition in terms of its amino acid sequence in order to reach its presumption [1]. In first instance, it can be visualized in a UniProt Swiss-Prot text-based computer file format FASTA the amino acid sequence, transpose it and generate a file. In the past decade’s technique of predicting the secondary structure of proteins has been developed in several studies. The Chou-Fasman method [2], was one of the first methods tested for this purpose, which is based on a successful statistical procedure on the potential conformation or propensities for all amino acid residues [3]. However, this method presented limitations due to the low precision in the results. These values have been utilized to offer a simple procedure without complex computer calculations to predict the secondary structure of proteins from their known amino acid sequence. Other prediction methods were developed, based on different algorithms, such as: the advent of X-ray crystallography [4], involving compose, to create the amino acids of the secondary structures of the proteins [5], exhaustive statistical analyses of the amino acid sequence as a function of (C) and (N) terminal [6,7], improvement of multiple linear regression methods based on the frequency of different amino acids and the length of the protein [8,9], improvement of the Chou-Fasman method thanks to the recent development in the folding of proteins, through improvements in propensities and the wavelet transform [10], choice of the best data set of proteins to check their propensities [11]. Other methods develop algorithms based on, such as, nonlinear models of neural networks [12,13], methods of the nearest neighbor [14,15], using multiple alignments [16], production of FASTA files from complete α-helix, β-sheet, and turns sequences using their propensities in computational applications or by their hydrophobic characteristics [17] among others. In the present work, the goal is to develop a simple protocol to estimate the secondary structure of proteins using LabVIEW programming language, the propensities of amino acids are finally evaluated and validated by comparing the results obtained in the structure programmed with the calculated structure. The protocol can be useful for novice researcher with limited technological resources and laboratory infrastructure.

Materials and Methods

Propensity of amino acids in different types of secondary structures

The Chou-Fasman method is an empirical technique for predict of secondary structures of proteins, originally developed. The method is based on the analysis of the relative frequencies of each amino acid in α-helix, β-sheets and turns on structures of known proteins solved with X-ray crystallography. From these frequencies, a set of probability parameters was derived for the appearance of each amino acid in each type of secondary structure, and these parameters are used to predict the probability that a given sequence of amino acids will form α-helix, a β-chain or a turn in a protein.

Residual propensity values (P (ES)) in different types of secondary structures, were determined from the ratio of the frequency of occurrence of the residue in α-helix(Pα), β-chain(Pβ) and turns(Pturn) versus their frequency of appearance in the protein by the following expression:

P (ES)=fraction of the residue/fraction of total residue (1)

In this work the propensities of Chou Fasman were used. The values can be seen, with the following considerations:

1. Any segment of six residues or more in a native protein with <Pα> ≥ 1.03 and <Pα >><Pβ>, is predicted as helical.

2. Any segment of three residues or more in a native protein with <Pβ> ≥ 1.05 and <Pβ >><Pα>, is predicted as β-sheet.

3. Any segment of four or more residues in a native protein with <Pturn> ≥ 1.00 and P (α-helix)<P (turn)>P (β-sheet), is predicted as a turn sheet.

The values shown in (1) were applied to get the graphs of the secondary structures (α, β and turns), considering the values of <Pα>, <Pβ> and <Pturn> ≥ 1.00 both for non-superimposed and superimposed on the front panel.

However, there are other versions of propensity estimates for the amino acids of other authors but they do not differ appreciably.

Method of least-squares: The least-squares is a numerical analysis procedure, in which we try to find the continuous function that best approximates the data by providing a visual demonstration of the relationship, between the points of the data and is expressed in the equation of the straight line, expressed as follows:

y=mx+b (2)

Where m is the slope and b the interception with the “x” axe.

The development and application of the equation, allows to search for a line of better fit that explains the possible relationship between the independent variable (in this case is the number of amino acid residues of the NADH dehydrogenase protein) and the dependent variable (number of residues of the structures secondary α-helix, β-sheet and turns). From these designations is obtained the equation for the line of best fit, determined from the method of least-squares validating the values gotten in the protocol designed [18].

Software used: The software LabVIEW (a trademark of National Instruments of Austin TX brand) is used to create the computational protocol, this software is a programming language that uses icons instead of lines of text. The user interface (known as the front panel) is constructed from graphic codes. The block diagram has the lines connected by guiding the flow of information from one code to the next [19].

Protocol: A set of proteins obtained from the UniProt database were used to build and validate the protocol. This protocol can be used to estimate about the secondary structure of the proteins, by means of a programmatic procedure and validated using the method of the least-squares [20].

The DATASET or amino acid sequences of the proteins were obtained through the UniProt FASTA. A total of 31 proteins randomly chosen with avary of the number of residues from 54 to 476 represent a significant sample to relate them to the number of secondary structures (α-helix, β-sheet and turns).This relationship was made based on its conformational parameters or propensities, which represents an intrinsic property of the amino acids.

Once the DATASET has been selected, it is necessary to create a SAR (Amino acid Residue String), that is amino acid residues of the protein coming from the FASTA format, which must be aligned through program TAAS (Transposition Algorithm Amino acid Sequences), to generate a xls file and converter a csv file for transposition or GFTS (Generate File Transpose Sequences). These files are introduced to the program or SSPA (Secondary Structure Predictive Algorithm), whose development will generate the -xls file of the estimated protein or GFPSS (Generate File Protein Secondary Structure), whose operability is explained in the programs following (Figure 1).

information-technology-predict

Figure 1: Protocol to predict secondary structure of proteins: DATASET (Amino Acid Sequences FASTA-Uniprot) Abbreviations: SAR: Amino acid Residues String; TAAS: Transposition Algorithm Amino Acid Sequences; GTFS Generate File Transpose Sequences; SSPA: Secondary Structure Predictive Algorithm; GFPSS: Generate File Secondary Structure Protein.

Operability of the programs

TAAS (Transposition Algorithm Amino acid Sequences): The algorithm generated in the so-called block diagram allows to concatenate amino acid sequences (FASTA) through the concatenate string function, which are connected to the programming structures: While loop that has the function of stopping program execution, the programmatic structure event structure that executes the action commands of the program and the for Loop structure that allows iteration of the data supplied from the concatenated sequences [21]. These structures generate data that is quantified using the string length function and once the program is activated, it generates the information; it transmits and connects to the reshape array function, which changes the dimensions of an array according to the size values of the dimension. When the previous actions are completed, an output array is generated that connects to the Insert into array function to which a control or column name is added and whose result is finally connected to a write to spreadsheet file, where it is generated the information that designs the route from the -xls file (Figure 2).

information-technology-visualize

Figure 2: Block diagram to visualize programming structure (while loop, even structure and for loop) to transpose sequences in NADH dehydrogenase and generate-xls file.

Front panel: When a virtual instrument (vi) is opened in LabVIEW, the front panel window representing the user interface appears. In the case of the TAAS program, it represents the virtual instrument of the user interface, which is made up of 4 string control where the “strings” that represent the amino acid sequences in the FASTA format and a file path control where the name and path of the generated file [22]. This file is generated by controls numerical where the size of the analyzed sequence appears, once the program has been activated with the button to generate the file, an array control to name the columns of the parameters of the data to be analyzed, and a program stop of exit button (Figure 3).

information-technology-transpose

Figure 3: Front panel to visualize transpose sequences secondary structure in NADH dehydrogenase and generated path-xls file.

SSPA (Secondary Structure Predictive Algorithm): This algorithm originated in the so-called block diagram allows producing XY graphs, control tables and -xls file that offer data to distinguish the secondary structure of the protein. These elements are produced by beginning to explore through the programming functions Read from Spreadsheet File, the -csv files of the protein sample, the patterns of the secondary structures and the overlapping and not overlapping structures; which will then be connected through a programming function delete from array with a for loop structure to iterate with a Search 1D array and an index array that are associated with a ship register and a built array, to produce an Array containing the information required to complement the action of the program structures (while loop, event structure, for loop and case structure). These structures will operate to originate the expected results when associated with other programming functions such as fract/exp, string to number and transpose 2D array to get the XY graphs of the secondary structures α-helix, β-sheet, turns and overlay graphics. At the same time the array produced that has the information required to complement the action of the programmatic structures, is associated with another functions as transpose array and a delete from array to link with the programming structures for loop and case structure to discriminate among them the α, β and turn forms that when associated in a built array generate a table of overlapping secondary structures and a -xls file using a function write to spreadsheet file (Figure 4).

information-technology-programming

Figure 4: Block diagram for programming structures (for loop, while loop, case structure and event structure) in the estimation of the secondary structure in NADH dehydrogenase.

Front panel: When the virtual instrument for estimate the secondary structure of proteins is open in LabVIEW, the user interface appears, in this case, the estimate algorithm of the secondary structure of the protein is constituted by a set of express control that are associated with the functions and controls of the user interface. In this way, can be found text control as file path control, for proteins, for amino acid patterns (α-helix, β-sheet and turns) and superimposed residuals, XY graph for α-helix, β-sheet, turns and overlays; list, table and tree of table for table of superimposed residues, string and path for generation of -xls file, numeric controls for name of rows of table and –xls file; button and switches to visualize graphics, tables, graphics of superposition, generate files and of exit of the program (Figure 5).

information-technology-Graph

Figure 5: (A) Graph overlap estimate secondary structure in NADH dehydrogenase and (B) Table overlap estimate secondary structure in NADH dehydrogenase.

The programs are activated in the status toolbar with the program execution buttons.

Particle swarm-based scheduling agent

In this section, the scheduling factor is an imitation-based method to create an optimal scheduling of the PSO algorithm. In this algorithm, the population is equal to the number of particles in the problem space. The particles are randomly initialized. Each particle will have a compatibility value and will be evaluated by a compatibility function that must be optimized in each generation [23,24]. In other words, this solution is equivalent to a bird in the pattern of collective movement of birds. Each particle has a merit value calculated by a merit function. The closer the particle is to the target in the search space (food in the bird movement model), the greater the merit is. Each particle also has a velocity that directs the particle's motion. Each particle continues to move in the problem space by following the optimal particles in the current state (Figure 6).

information-technology-squares

Figure 6: Least squares analysis as linear function between the number of residues (Nr) and the number of amino acids (Na) in NADH dehydrogenase for (A) alpha helice (B) beta sheet and (C) turns.

Click on → and activate → to run the program

With the program activated the button is pressed and the user must introduce the information required by the program, in order to generate the -xls file and visualize the table and the graphics of amino acid residues, without superimposing and superimposed, that have been related to the propensities of the amino acids of secondary structure α-helix, β-sheet and turns (Figure 7) (Tables 1 and 2).

Amino acid Propensity
Pa Pt
A-Ala 1,4 0,8 0,7
R-Arg 1,0 0,9 1,0
D-Asp 1,0 0,5 1,5
N-Asn 0,7 0,9 1,6
C-Cys 0,7 1,2 1,2
E-Glu 1,5 0,4 0,7
Q-Gln 1,1 1,1 1,0
G-Gly 0,6 0,8 1,6
H-His 1,0 0,9 1,0
I-Ile 1,1 1,6 0,5
L-Leu 1,2 1,3 0,6
K-Lys 1,1 0,7 1,0
M-Met 1,5 1,1 0,6
F-Phe 1,1 1,4 0,6
P-Pro 0,6 0,6 1,5
S-Ser 0,8 0,8 1,4
T-Thr 0,8 1,2 1,0
W-Trp 0,8 1,2 1,0
Y-Tyr 0,7 1,5 1,1
V-Val 1,1 1,7 0,5

Table 1: Relative amino acids propensity value for secondary structure used in the Chou-fasman methods.

Cod. Uniprot Proteins NAs Programmed structure Calculated structure  
Helix sheet Turn Helix Sheet Turn
      Nr Nr Nr Nr Nr Nr
P18669 Phosphoglycerate mutase 1 525 103 84 65 81 93 78
P00168 Cytochrome b5 87 35 29 23 27 94 26
Q3E846 Cytochrome c oxidase 104 50 28 26 32 40 32
Q9HTK7 Rubredoxin 3 54 15 22 17 16 22 16
Q4HWU5 Trypsin inhibitor 56 13 28 15 17 23 17
P80176 High potencial iron 83 33 25 25 26 32 25
P08905 Lysosome C -2 148 50 60 38 47 55 46
P83443 Macrodontain-1 213 53 83 77 68 79 66
P40571 Ribonuclease P 144 53 52 39 45 54 44
P00272 Rubredoxin 2b 173 49 65 59 55 64 53
Q9HTK8 Rubredoxin 2a 55 17 23 15 16 22 16
P04069 Carboxypeptidase B 303 77 123 103 97 111 94
P00766 Chymotrypsinogen A 245 59 103 83 78 90 76
P07630 Carbonic anhydrase 253 88 90 80 82 95 80
P168870 Carboxypeptidase E 476 159 161 156 153 173 149
P17538 Chymotrypsinogen B 263 67 108 88 84 97 82
P02866 Chonconavalin A 290 68 118 104 93 106 90
P61949 Ravadoxin 1 176 59 68 49 56 65 54
P35557 Glucokinase 465 181 163 121 150 169 146
O75438 NADH dehydrogenase 58 21 25 12 17 23 17
P00441 Superoxide dismutase 154 47 49 58 49 58 47
P83748 Serine Protease 304 90 117 97 97 111 95
P00044 Cytochrome c iso-1 109 42 36 31 34 42 33
A0A1 K0FU49 Myoglobin 154 68 48 38 49 58 47
P022655 Apolipoprotein C2 101 34 43 24 31 39 31
F7VJQ1 Alternative P nonprotein 73 24 27 22 22 29 22
PO2656 Apolipoprotein C3 99 40 34 25 25 38 30
QOVDE8 Adipogenin 80 18 40 22 31 31 24
Q13015 Protein A F1 q 90 27 32 31 35 35 27
Q9NZD4 Alpha hemoglobin 102 37 36 29 39 39 31
P14621 Acyphosphatase-2 99 30 36 33 38 38 30

Table 2: Secondary structure (alpha-helices, beta-sheet and turn) programmed and calculated of protein Uniprot.

information-technology-Pearson

Figure 7: Pearson correlation between the secondary structures (xxx) alpha helice, (+++) beta sheet and (...) turns programmed and calculated of the 31 proteins analysed.

Results and Discussion

The operability of the program is evidenced with the use of a protein sample extracted from the DATASET FASTA of UniProt from a total of 31 proteins randomly chosen; the NADH dehydrogenase of 58 amino acid residues was selected. When the program is activated in the bar tools, the front panel asks, for the place and identification of the corresponding parameters, then the file name is generated, the representative graphs of the amino acid residues of the secondary structure will be displayed (α-helix, β-sheet and turns) on the front panel. The graph of the amino acid residues of the secondary structure superimposed (define the protein secondary structure estimated), the table of the superimposed residues that show the quantified state of the same and the site path of the -xls file can observed. The result of the process described in the graph and the table of the quantification of the superimposed amino acid residues of the NADH dehydrogenase were extracted.

The procedure of the activation of the program previously described, was applied completely in the set of 31 selected proteins (DATASET FASTA of UniProt), obtaining the results quantified programmatically for each of them.

The values obtained with the program were validated using the method of the least-squares.

Both results were tabulated for the 31 proteins, reflecting the number of amino acids (Na), and the number of amino acid residues (Nr) of the estimated secondary structures (α-helix, β-sheet and turns).

It was constructed a X Y graph to relate the values obtained by the method of least-squares, and the values of the programmed residues (Nr) with the number of amino acids (Na) of the protein samples, obtaining the expressions of the straight lines for each structure programmed with their pearson correlation coefficients (r), which shown statistically significant results for P<0.001.

With the expressions of the straight line of the secondary structures (α-helices, β-sheet, and turns), the values of the calculated residuals (Nr) were obtained. All the values for the amino acid residues (Nr) of the programmed and calculated secondary structure in the 31 proteins set and their Pearson correlation coefficients (r), which showed statistically significant results for a P<0.001. This result shows that the number of α-helix, β-sheet and turns are a linear function of the number of amino acid residues in the protein, it can be suggested this fact as a result for protein folding due fundamentally to the linearity of the turns in the peptide chain as indicated.

Conclusion

A simple computational protocol to estimate the secondary structure of a protein is developed in this research, using the propensities of the amino acids and the graphic programming language LabVIEW.

The NADH dehydrogenase protein is utilized to evaluates the performance of protocol, the results shown that Pearson's correlation coefficients (r) obtained is statistically significant (P<0.001).

Estimate the secondary structure of proteins using the LabVIEW programming language, provides a tool to be used in some daily laboratory routines where the resources of a laboratory and the technology are insufficient to fulfill the purposes suitable for an investigation. That is because this protocol could solve these deficiency and help in the routine tasks of novice researchers, and can be used to estimate the properties of a variety of proteins.

The program would allow the user to manage a data from an Excel file in the -csv format for amino acid sequence of a protein, using residues of the 20 amino acids with propensity for structures (α, β or turns) without overlapping and overlapping.

In this way, the user can manipulate any protein in the previously proposed scheme and get all the necessary information to graph, tabulate and store information generated in -xls files that will be useful for his research work.

Acknowledgements

I want to thank to my daughter, Carolina Tovar PhD, Professor of the Central University of Venezuela for her valuable advice and continue support in writing the manuscript. I also want to thank to my son, Engineer Raimundo Tovar for his insightful comments on the manuscript.

The world is trying to become completely wireless, demanding uninterrupted access to information anytime and anywhere with better quality, high speed, increased bandwidth and reduction in cost.

References

Select your language of interest to view the total content in your interested language

Viewing options

Flyer image
journal indexing image

Share This Article