PhosPhAt 4.0 - Hotspots

Prediction of Phosphorylation Hotspots

Training data set

Experimentally determined phosphorylation hotspots: hotspots-experimental_w17_d10.txt

Resulting positive and negative data sets as sequences and vectors:

negative sequences
positive sequences
all resulting vectors
vectors training set 1
vectors training set 2
vectors training set 3
vectors training set 4
vectors training set 5
vectors test set 1
vectors test set 2
vectors test set 3
vectors test set 4
vectors test set 5

Prediction

Raw data of predictions

For each of the 12,866,960 windows in the Arabidopsis proteome, a prediction score was determined. A window can receive a positive score even if it contains no phosphorylatable amino acids, because the fractions of S, T and Y were not used as additional parameters in the training data set. This was done to keep this step unbiased; the S, T and Y content was considered only during consolidation.

All 12,866,960 scores
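The windowing described above can be sketched as follows. This is only an illustration, not the code used for PhosPhAt: the function names `iter_windows` and `sty_count` are hypothetical, the window length of 17 is taken from the window size mentioned in the consolidation step, and positions are assumed to be 1-based and inclusive.

```python
from typing import Iterator, Tuple

WINDOW_SIZE = 17                      # window length in amino acids (from the text)
PHOSPHO_RESIDUES = frozenset("STY")   # phosphorylatable amino acids


def iter_windows(sequence: str, size: int = WINDOW_SIZE) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, subsequence) for every window, 1-based inclusive."""
    for i in range(len(sequence) - size + 1):
        yield i + 1, i + size, sequence[i:i + size]


def sty_count(window: str) -> int:
    """Count S, T and Y residues; not used for scoring, only for later filtering."""
    return sum(1 for aa in window if aa in PHOSPHO_RESIDUES)
```

A protein of 21 residues yields five windows under these assumptions (1-17 through 5-21), and each of them would receive its own score regardless of its `sty_count`.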

Consolidation step 1: Runs

Consecutive windows with positive scores (not interrupted by windows with negative scores) were consolidated into one "run". Each "run" was then checked against the conditions for a phosphorylation hotspot: a phosphorylation hotspot was defined as containing at least 4 phosphorylatable amino acids (S, T or Y) that were no further apart than 10 amino acids.

If fewer than 17 amino acids (one window size) lay between two "runs", the two runs overlapped. For example: if the windows at amino acid positions 1-17, 2-18, 3-19, 4-20 and 5-21 were predicted with a positive score, they form a "run" (positions 1-21). If the next window (6-22) has a negative score and the following windows again form a stretch of positive scores (7-23, 8-24, 9-25 = "run" 7-25), these two runs overlap at positions 7-21.
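A minimal sketch of this consolidation step is given below. It rests on several assumptions that the text does not settle: a run is taken to span from the first position of its first window to the last position of its last window, a score of exactly 0 is treated as not positive, and "not further apart than 10 amino acids" is read as consecutive S/T/Y residues lying at most 10 positions apart. The names `find_runs` and `meets_hotspot_condition` are hypothetical.

```python
from typing import List, Sequence, Tuple

WINDOW_SIZE = 17   # window length in amino acids
MIN_SITES = 4      # required number of S/T/Y residues
MAX_DISTANCE = 10  # assumed maximum distance between neighbouring S/T/Y


def find_runs(scores: Sequence[float], threshold: float = 0.0,
              size: int = WINDOW_SIZE) -> List[Tuple[int, int]]:
    """Group consecutive windows scoring above threshold into runs.

    scores[i] is the score of the window starting at 1-based position i + 1.
    Returns (start, end) residue positions, 1-based inclusive.
    """
    runs = []
    run_start = None
    for i, score in enumerate(scores):
        if score > threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # the previous window (index i - 1) closes the run
            runs.append((run_start + 1, i - 1 + size))
            run_start = None
    if run_start is not None:
        runs.append((run_start + 1, len(scores) - 1 + size))
    return runs


def meets_hotspot_condition(sequence: str, start: int, end: int,
                            min_sites: int = MIN_SITES,
                            max_distance: int = MAX_DISTANCE) -> bool:
    """Check for min_sites (>= 2) S/T/Y residues chained at <= max_distance."""
    positions = [p for p in range(start, end + 1) if sequence[p - 1] in "STY"]
    chain = 1
    for prev, cur in zip(positions, positions[1:]):
        chain = chain + 1 if cur - prev <= max_distance else 1
        if chain >= min_sites:
            return True
    return False
```

With scores that are positive for the windows starting at positions 1 to 5, negative at 6 and positive again at 7 to 9, `find_runs` returns `[(1, 21), (7, 25)]`. Passing `threshold=1.0` would give the stricter variant of the same grouping.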

Runs for SVM (score>0)

Runs for SVM (score>1)

Consolidation step 2: From Runs to Hotspots

Overlapping "runs" were consolidated into hotspot regions. In the example above, the two "runs" 1-21 and 7-25 would be combined into a single hotspot 1-25.
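Combining overlapping runs amounts to a standard interval merge, sketched below. This is an illustration under assumptions rather than the original implementation: the name `merge_runs` is hypothetical, runs are (start, end) pairs with 1-based inclusive positions on one protein, and two runs count as overlapping only if they share at least one position.

```python
from typing import Iterable, List, Tuple


def merge_runs(runs: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping (start, end) runs into hotspot regions."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(runs):
        if merged and start <= merged[-1][1]:
            # shares at least one position with the previous region: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For instance, `merge_runs([(1, 21), (7, 25), (40, 60)])` gives `[(1, 25), (40, 60)]`: the first two runs share positions and collapse into one region, while the third stays separate.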

predicted Hotspots (Score > 0)

predicted Hotspots (Score > 1)

Statistics

Number of windows	Windows with score >0	Valid windows with score >0	Valid runs with score >0	Consolidated hotspots with score >0	Windows with score >1	Valid windows with score >1 (at least 4 STY)	Valid runs with score >1	Consolidated hotspots with score >1
12866960	945670	592681	1563102	54329	160780 (sic!)	102664	338681	13677

An overview of the score distributions (i.e. how many windows were predicted with which score) can be found here: window-score-distribution.ods