Protein ranking: from local to global structure in the protein similarity network -- Supplementary data
Jason Weston, Andre Elisseeff, Dengyong Zhou, Christina Leslie and William Stafford Noble
Abstract
Biologists regularly search databases of DNA or protein sequences for evolutionary or functional relationships to a given query sequence. We describe a ranking algorithm that exploits the entire network structure of similarity relationships among proteins in a sequence database by performing a diffusion operation on a pre-computed, weighted network. The resulting ranking algorithm, evaluated using a human-curated database of protein structures, is efficient and provides significantly better rankings than a local network search algorithm such as PSI-BLAST.
- Supplementary results (in PDF format).
- Animation (GIF) of Figure 12 from the supplement.
- ROC50 scores for all queries and all detection methods from the paper in plain text format.
- SCOP sequence file in FASTA format containing all sequences in SCOP version 1.59 with less than 95% identity.
- Swiss-Prot sequence file in FASTA format containing all sequences in Swiss-Prot version 40 (zipped, 21 MB).
- 7329x7329 kernel matrices for the methods used in the experiments (the IDs for each row and column are listed here):
  - BLAST matrix, ASCII text file, gzipped (49 MB).
  - PSI-BLAST (SCOP) matrix using the complete 7329 examples as a database, ASCII text file, gzipped (52 MB).
  - PSI-BLAST (SCOP+SPROT) matrix using all SCOP+SPROT examples as a database, ASCII text file, gzipped (9 MB).
- 108,931x108,931 PSI-BLAST score kernel matrix for RankProp used in the experiments (342 MB): the first 7329 IDs are from SCOP; IDs from 7330 onwards are SPROT proteins. Format: <index> <number of homologs> <indices of homologs> <E-values of homologs>.
- 7329x108,931 PSI-BLAST score kernel matrix for RankProp used for the queries in the experiments (189 MB): unlike the file above, all edges are given (not just the first 1000).
- C++ code to run the experiments:
  - RankProp code (there is also a more general command-line-driven version here).
  - Evaluation code for a ranking given by a distance matrix; returns the ROC-50 score of each query.
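
The sparse kernel matrix files listed above store, for each protein, its index, the number of homologs, the homolog indices, and then the matching E-values. A minimal Python sketch of a reader for that layout follows. It assumes whitespace-separated fields with one record per line, which the format description does not state explicitly. The names `parse_record` and `read_sparse_matrix` are hypothetical and are not part of the distributed C++ code.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[int, List[int], List[float]]


def parse_record(line: str) -> Record:
    """Parse one record: index, homolog count, homolog indices, E-values.

    Assumption: fields are whitespace-separated and a record fits on one line.
    """
    fields = line.split()
    index = int(fields[0])
    count = int(fields[1])
    if len(fields) != 2 + 2 * count:
        raise ValueError(
            f"record {index}: expected {2 + 2 * count} fields, got {len(fields)}"
        )
    homologs = [int(x) for x in fields[2:2 + count]]
    evalues = [float(x) for x in fields[2 + count:2 + 2 * count]]
    return index, homologs, evalues


def read_sparse_matrix(lines: Iterable[str]) -> Dict[int, Dict[int, float]]:
    """Build {index: {homolog index: E-value}} from an iterable of lines."""
    matrix: Dict[int, Dict[int, float]] = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        index, homologs, evalues = parse_record(line)
        matrix[index] = dict(zip(homologs, evalues))
    return matrix


# Made-up sample records, for illustration only (not taken from the real files).
sample = ["1 2 5 9 1e-30 0.002", "2 0", "3 1 1 4.5"]
matrix = read_sparse_matrix(sample)
print(matrix[1])
```

Because `read_sparse_matrix` accepts any iterable of lines, an open file handle can be passed directly, so a file of several hundred megabytes is read one record at a time rather than loaded whole.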
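
The evaluation code listed above reports a ROC-50 score for each query. As an illustration of that metric (not a port of the authors' C++ code), the sketch below computes the area under the ROC curve up to the first n false positives, normalized to the range 0 to 1. Its conventions are assumptions: a smaller distance means a better rank, ties keep their input order, and when a ranking contains fewer than n negatives the remaining steps are credited with the final true-positive count. The function names are hypothetical.

```python
from typing import Sequence


def roc_n(ranked_labels: Sequence[bool], n: int = 50) -> float:
    """ROC-n score of a ranking given as labels ordered from best to worst.

    True marks a homolog (positive), False a non-homolog (negative).
    """
    total_pos = sum(1 for label in ranked_labels if label)
    if total_pos == 0:
        return 0.0  # assumption: score a query with no positives as 0
    true_pos = 0
    false_pos = 0
    area = 0
    for label in ranked_labels:
        if label:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos  # positives ranked above this false positive
            if false_pos == n:
                break
    # Assumption: pad missing false positives with the final positive count.
    area += true_pos * (n - false_pos)
    return area / (n * total_pos)


def roc_n_from_distances(
    distances: Sequence[float], is_homolog: Sequence[bool], n: int = 50
) -> float:
    """Rank targets by increasing distance to the query, then score."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    return roc_n([is_homolog[i] for i in order], n)


# Toy example: positives at ranks 1 and 3, negatives at ranks 2 and 4.
score = roc_n([True, False, True, False], n=2)
print(score)
```

In the toy example, one positive precedes the first negative and two precede the second, so the score is (1 + 2) / (2 × 2) = 0.75.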