Input File Formatting

Overview:

Your input file should be a simple text file with comma-separated or tab-separated columns. If you use Microsoft Excel, save your data as a comma-separated file (the resulting file will have a .csv extension) or as a tab-separated file (the resulting file will have a .txt extension). You will need to specify the format of your file when uploading your data.

Protein RefSeq ID, UniProt ID, Ensembl ID, or Official Gene Symbol can be used to specify the proteins identified in each AP experiment.

Bait name must be specified as the Official Gene Name of the bait (for AP experiments with a bait protein). In the case of S. cerevisiae, please use the SGD standard name. To label the AP experiments that are negative controls, use C or CONTROL as the bait name.

In CRAPome version 1.1, the specification of the bait name for each AP experiment was extended to allow for different bait ‘conditions’: different AP experiments with the same bait gene that should be considered separately. This was implemented, for example, to distinguish isoforms or mutants of the same bait protein, or to distinguish between the interactomes of the same bait in untreated and drug-treated cells.

The extended format has the following pattern: BaitName_Condition, where BaitName is the Official Gene Symbol (e.g. SGD name) of the bait protein as described above, and Condition is a text label for the condition (this label must not contain any whitespace or special characters). The two descriptors (BaitName and Condition) must be separated by the ‘_’ character. Note that Condition is optional.

For example, if your experiment compares the interactomes of the wild type EZH2 protein and its mutant version, you can specify the baits in the BaitName_Condition field as EZH2 and EZH2_MUTANT for the wild type and the mutant experiment, respectively.
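The BaitName_Condition pattern described above can be checked before upload with a short script. The following is a minimal sketch, not the parser used by the server: the function name parse_bait_label is hypothetical, and it assumes that the label is split at the first ‘_’ and that "no special characters" means the condition contains only letters and digits.

```python
import re

# Labels reserved for negative controls, as stated in the text above.
CONTROL_LABELS = {"C", "CONTROL"}

# Assumption: a valid Condition consists of letters and digits only.
_CONDITION_RE = re.compile(r"[A-Za-z0-9]+")


def parse_bait_label(label):
    """Split a 'BaitName_Condition' label into (bait, condition).

    The condition is optional; it is returned as None when absent.
    """
    bait, sep, condition = label.partition("_")
    if not bait:
        raise ValueError(f"missing bait name in {label!r}")
    if not sep:
        return bait, None
    if not _CONDITION_RE.fullmatch(condition):
        raise ValueError(f"invalid condition in {label!r}")
    return bait, condition


def is_control(label):
    """Return True if the label marks a negative control (exact match)."""
    return label in CONTROL_LABELS


print(parse_bait_label("EZH2"))
print(parse_bait_label("EZH2_MUTANT"))
```

With this sketch, EZH2 is read as a bait without a condition and EZH2_MUTANT as the bait EZH2 with the condition MUTANT, while a label such as "EZH2_drug treated" is rejected because of the space.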

The input data can be structured in two ways:
i) List Format
ii) Matrix Format

List Format:

1. Each row provides information for a single bait-prey interaction. The file must contain the following four columns: 1) BaitName (extended to BaitName_Condition in version 1.1 as described above); 2) APName; 3) PreyName; 4) SpectralCounts.
2. Each "BaitName - APName - PreyName" combination must be unique.
3. SpectralCounts must be integers.
4. A header row is optional. If you choose to include one, you need to prefix it with a # character.
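The four List format rules above can be checked locally before uploading. The following is a minimal sketch under stated assumptions, not the validation performed by the server: the function name read_list_format is hypothetical, and the prey name, control AP name and counts in the sample are made up for illustration (only the RAF1 bait and the 7576_RAF1 AP name appear in this documentation).

```python
import csv
import io


def read_list_format(text, delimiter=","):
    """Parse List format text into (bait, ap, prey, spectral_count) tuples.

    Raises ValueError if a row does not have four columns, if the
    spectral count is not an integer, or if a bait/AP/prey combination
    occurs more than once.
    """
    rows = []
    seen = set()
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for line_no, fields in enumerate(reader, start=1):
        # Skip blank lines and the optional header, which starts with '#'.
        if not fields or fields[0].startswith("#"):
            continue
        if len(fields) != 4:
            raise ValueError(f"line {line_no}: expected 4 columns")
        bait, ap, prey, spc = (f.strip() for f in fields)
        try:
            count = int(spc)
        except ValueError:
            raise ValueError(f"line {line_no}: SpectralCounts must be an integer")
        key = (bait, ap, prey)
        if key in seen:
            raise ValueError(f"line {line_no}: duplicate combination {key}")
        seen.add(key)
        rows.append((bait, ap, prey, count))
    return rows


# Illustrative data only; the prey name and counts are invented.
sample = (
    "#BaitName,APName,PreyName,SpectralCounts\n"
    "RAF1,7576_RAF1,YWHAZ,12\n"
    "CONTROL,ctrl_1,YWHAZ,1\n"
)
rows = read_list_format(sample)
print(rows)
```

For a tab-separated file, pass delimiter="\t".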

An example file in the List format (CRAPome input file (benchmark data): MEPCE, WASL, EIF4A2, RAF1 baits) can be downloaded from the Supplementary Data page:
http://reprint-apms.org/?q=suppdata

For another example of a file in the List format, see the Supplementary Data page,
User case 2: Analysis of EIF4A interactions:
http://reprint-apms.org/sites/default/files/UserCase_3_EIF4A_from_prohit...
link to the input data file (in comma-separated format):
http://reprint-apms.org/sites/default/files/Dunham_UserCase_3.csv

Matrix Format:

1. Each row provides information for one protein and its spectral counts in all of the AP experiments in the dataset, including the negative controls.

2. The first column in the file must be the protein ID (the column must be named PROTID). Protein RefSeq ID, UniProt ID, Ensembl ID, or Official Gene Symbol can be used to specify the proteins.

3. One can specify the Official Gene Name (column named GENEID), protein length (PROTLEN) and protein description (DEFLINE) as additional columns in the input file. These columns can be located either immediately after the PROTID column or as the last three columns (following the columns with spectral counts), but their column names must be specified exactly as shown in the parentheses.

4. The rest of the columns in the file should contain the spectral counts.

5. The file must start with a header containing two rows.
The first row in the header should contain the names of the columns: PROTID (first column), and GENEID, PROTLEN, and DEFLINE if used (second through fourth columns, or the last columns in the file). The names of the columns containing the spectral counts must correspond to the AP Names, with a '_SPC' or a '_NUMSPECSTOT' suffix following the AP Name. For example, if the AP name is 7576_RAF1, the column containing the protein spectral counts in that experiment should be named 7576_RAF1_SPC or 7576_RAF1_NUMSPECSTOT.
The second row in the header should list the corresponding bait names/conditions for each of the AP experiments named in the first row (in BaitName_Condition format) directly underneath the AP name. For the other columns (PROTID, GENEID, PROTLEN, and DEFLINE), the second row in the header can contain any text, e.g. the names of these columns can be repeated.
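To illustrate the two-row header, the sketch below builds a tab-separated Matrix format file from List format rows. It is an illustration under stated assumptions, not a tool provided by the site: the function name list_to_matrix is hypothetical, the optional GENEID, PROTLEN and DEFLINE columns are omitted, the protein IDs and counts are made up, and filling a missing bait-prey pair with 0 is an assumption that this documentation does not confirm.

```python
def list_to_matrix(rows, suffix="_SPC", delimiter="\t"):
    """Build Matrix format text from (bait, ap, prey, count) tuples.

    Row 1 of the header holds PROTID and the AP names with the suffix;
    row 2 holds the bait label for each AP directly underneath.
    """
    if suffix not in ("_SPC", "_NUMSPECSTOT"):
        raise ValueError("suffix must be '_SPC' or '_NUMSPECSTOT'")

    ap_to_bait = {}  # AP name -> bait label, in order of first appearance
    counts = {}      # prey -> {AP name -> spectral count}
    for bait, ap, prey, spc in rows:
        ap_to_bait.setdefault(ap, bait)
        counts.setdefault(prey, {})[ap] = spc

    aps = list(ap_to_bait)
    lines = [
        ["PROTID"] + [ap + suffix for ap in aps],
        # The second header row may repeat the name of the PROTID column.
        ["PROTID"] + [ap_to_bait[ap] for ap in aps],
    ]
    for prey, by_ap in counts.items():
        # Assumption: a prey not reported for an AP gets a count of 0.
        lines.append([prey] + [str(by_ap.get(ap, 0)) for ap in aps])
    return "\n".join(delimiter.join(line) for line in lines) + "\n"


# Illustrative data only; protein IDs P1 and P2 and the counts are invented.
example_rows = [
    ("RAF1", "7576_RAF1", "P1", 12),
    ("CONTROL", "ctrl_1", "P1", 1),
    ("RAF1", "7576_RAF1", "P2", 3),
]
matrix_text = list_to_matrix(example_rows)
print(matrix_text)
```

The printed table has the columns PROTID, 7576_RAF1_SPC and ctrl_1_SPC in the first header row, with RAF1 and CONTROL under the two spectral count columns in the second.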

Special Instructions for MAC Users