Filtering Crosslink-Spectrum-Matches and Crosslinks


from pyXLMS import __version__
 
print(f"Installed pyXLMS version: {__version__}")



    Installed pyXLMS version: 1.4.3


from pyXLMS import parser
from pyXLMS import transform

All data transformation functionality, including all filters, is available via the transform submodule. We also import the parser submodule here for reading result files.


parser_result = parser.read(
    "../../data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1.pdResult",
    engine="MS Annika",
    crosslinker="DSS",
)



    Reading MS Annika CSMs...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 826/826 [00:00<00:00, 14874.06it/s]
    Reading MS Annika crosslinks...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:00<00:00, 20000.18it/s]

We read crosslink-spectrum-matches and crosslinks using the generic parser from a single .pdResult file.


csms = parser_result["crosslink-spectrum-matches"]
xls = parser_result["crosslinks"]

For easier access we assign our crosslink-spectrum-matches to the variable csms and our crosslinks to the variable xls.

Filtering by Crosslink Type


csms_by_crosslink_type = transform.filter_crosslink_type(csms)

We can filter crosslink-spectrum-matches by their crosslink type (intra-links and inter-links) by calling transform.filter_crosslink_type() and passing the crosslink-spectrum-matches as the first argument. The function returns a dictionary containing the keys "Intra" and "Inter" with their associated values being lists of intra-link CSMs and inter-link CSMs respectively. You can read more about the filter_crosslink_type() function and all its parameters here: docs.


print(f"Number of intra-link CSMs: {len(csms_by_crosslink_type['Intra'])}")
print(f"Number of inter-link CSMs: {len(csms_by_crosslink_type['Inter'])}")



    Number of intra-link CSMs: 803
    Number of inter-link CSMs: 23

Our example file contains 803 intra-link CSMs and 23 inter-link CSMs.

Important

Please note that any CSMs without associated protein accessions would be considered inter-links!
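
To make the note above concrete, here is a minimal sketch on hand-written toy records. It is not the pyXLMS implementation: the helper split_by_crosslink_type() is hypothetical, and the rule it applies (a match counts as an intra-link only if both peptides share at least one protein accession) is an assumption chosen to mirror the documented behaviour that matches without accessions end up under "Inter".

```python
def split_by_crosslink_type(records):
    """Toy stand-in that partitions records into "Intra" and "Inter" lists."""
    groups = {"Intra": [], "Inter": []}
    for record in records:
        # Missing accessions (None) are treated like an empty list.
        alpha = set(record.get("alpha_proteins") or [])
        beta = set(record.get("beta_proteins") or [])
        # Assumption: intra-link means the two peptides share an accession.
        key = "Intra" if alpha & beta else "Inter"
        groups[key].append(record)
    return groups


# Hand-written toy records, not taken from the example file.
toy_csms = [
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["Cas9"]},
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["sp"]},
    {"alpha_proteins": None, "beta_proteins": None},  # no accessions at all
]

toy_groups = split_by_crosslink_type(toy_csms)
print({key: len(value) for key, value in toy_groups.items()})
```

In this toy example the record without any accessions is counted under "Inter", so it is worth checking that protein accessions are present before interpreting the inter-link count.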


xls_by_crosslink_type = transform.filter_crosslink_type(xls)

Similarly, we can filter crosslinks by their crosslink type (intra-links and inter-links) by calling transform.filter_crosslink_type() and passing the crosslinks as the first argument. The function returns a dictionary containing the keys "Intra" and "Inter" with their associated values being lists of intra-link crosslinks and inter-link crosslinks respectively. You can read more about the filter_crosslink_type() function and all its parameters here: docs.


print(f"Number of intra-link XLs: {len(xls_by_crosslink_type['Intra'])}")
print(f"Number of inter-link XLs: {len(xls_by_crosslink_type['Inter'])}")



    Number of intra-link XLs: 279
    Number of inter-link XLs: 21

Our example file contains 279 intra-link crosslinks and 21 inter-link crosslinks.

Important

Please note that any crosslinks without associated protein accessions would be considered inter-links!

Filtering by Peptide Pair


csms_by_peptide_pair = transform.filter_peptide_pair_distribution(csms)

We can filter crosslink-spectrum-matches by their associated peptide pairs by calling transform.filter_peptide_pair_distribution() and passing the crosslink-spectrum-matches as the first argument. The function returns a dictionary containing the peptide pairs as keys with their associated values being lists of the corresponding CSMs. You can read more about the filter_peptide_pair_distribution() function and all its parameters here: docs.


list(csms_by_peptide_pair.keys())[:5]



    ['GQKNSR-GQKNSR',
     'GQKNSR-GSQKDR',
     'SDKNR-SDKNR',
     'DKQSGK-DKQSGK',
     'DKQSGK-HSIKK']

Here are the first five peptide pairs that were encountered in our crosslink-spectrum-matches.


len(csms_by_peptide_pair["LSKSR-MKNYWR"])


For the peptide pair LSKSR-MKNYWR we found 9 crosslink-spectrum-matches in our result…


import random
 
random.choice(csms_by_peptide_pair["LSKSR-MKNYWR"])



    {'data_type': 'crosslink-spectrum-match',
     'completeness': 'full',
     'alpha_peptide': 'LSKSR',
     'alpha_modifications': {3: ('DSS', 138.06808)},
     'alpha_peptide_crosslink_position': 3,
     'alpha_proteins': ['Cas9'],
     'alpha_proteins_crosslink_positions': [222],
     'alpha_proteins_peptide_positions': [220],
     'alpha_score': 116.2226526440155,
     'alpha_decoy': False,
     'beta_peptide': 'MKNYWR',
     'beta_modifications': {2: ('DSS', 138.06808)},
     'beta_peptide_crosslink_position': 2,
     'beta_proteins': ['Cas9'],
     'beta_proteins_crosslink_positions': [884],
     'beta_proteins_peptide_positions': [883],
     'beta_score': 97.32530917354812,
     'beta_decoy': False,
     'crosslink_type': 'intra',
     'score': 97.32530917354812,
     'spectrum_file': 'XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw',
     'scan_nr': 8961,
     'charge': 4,
     'retention_time': 2802.83574,
     'ion_mobility': 0.0,
     'additional_information': None}

…and here is a random LSKSR-MKNYWR crosslink-spectrum-match.
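
Because the result is an ordinary dictionary of lists, it can be summarised with plain Python. The following sketch ranks peptide pairs by their number of matches and picks the highest-scoring match of one pair. The helpers rank_peptide_pairs() and best_match() are hypothetical and not part of pyXLMS, the data is a hand-written toy dictionary in the shape described above, and treating a higher "score" as better is an assumption.

```python
def rank_peptide_pairs(groups, top_n=3):
    """Return (peptide_pair, number_of_matches) tuples, largest groups first."""
    counts = [(pair, len(matches)) for pair, matches in groups.items()]
    # Sort by descending count; ties are broken alphabetically by pair name.
    counts.sort(key=lambda item: (-item[1], item[0]))
    return counts[:top_n]


def best_match(matches):
    """Return the match with the highest value under the "score" key."""
    return max(matches, key=lambda match: match["score"])


# Hand-written toy data: peptide pair -> list of (heavily shortened) matches.
toy_by_pair = {
    "LSKSR-MKNYWR": [{"score": 97.3, "scan_nr": 1}, {"score": 120.5, "scan_nr": 2}],
    "SDKNR-SDKNR": [{"score": 45.0, "scan_nr": 3}],
    "DKQSGK-HSIKK": [{"score": 60.2, "scan_nr": 4}, {"score": 12.9, "scan_nr": 5}],
}

top_pairs = rank_peptide_pairs(toy_by_pair, top_n=2)
best = best_match(toy_by_pair["LSKSR-MKNYWR"])
print(top_pairs)
print(best["scan_nr"])
```

The same two helpers should work on the dictionary returned for real data, as long as every match carries a numeric "score".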

Filtering by Protein


csms_by_protein = transform.filter_protein_distribution(csms)

We can filter crosslink-spectrum-matches by their associated proteins by calling transform.filter_protein_distribution() and passing the crosslink-spectrum-matches as the first argument. The function returns a dictionary containing the proteins as keys with their associated values being lists of the corresponding CSMs. You can read more about the filter_protein_distribution() function and all its parameters here: docs.


list(csms_by_protein.keys())



    ['Cas9', 'sp']

In our example we have crosslink-spectrum-matches from two different proteins (or rather one protein "Cas9" and one protein group "sp" which denotes contaminants).


len(csms_by_protein["Cas9"])


In total we have 821 crosslink-spectrum-matches where at least one of the two crosslinked peptides is from Cas9.


xls_by_protein = transform.filter_protein_distribution(xls)

Similarly, we can filter crosslinks by their associated proteins by calling transform.filter_protein_distribution() and passing the crosslinks as the first argument. The function returns a dictionary containing the proteins as keys with their associated values being lists of the corresponding crosslinks. You can read more about the filter_protein_distribution() function and all its parameters here: docs.


list(xls_by_protein.keys())



    ['Cas9', 'sp']

We also have crosslinks from two different proteins (or rather one protein "Cas9" and one protein group "sp" which denotes contaminants).


len(xls_by_protein["Cas9"])


In total we have 295 crosslinks where at least one of the two crosslinked peptides is from Cas9.
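
Since a match is listed under a protein as soon as at least one of its peptides comes from that protein, the group sizes do not have to add up to the total number of matches. The sketch below illustrates this on toy records. The helper group_by_protein() is a hypothetical stand-in, not the pyXLMS implementation, and it assumes that a match between two different proteins is listed under both of them.

```python
from collections import defaultdict


def group_by_protein(records):
    """Toy stand-in: map each protein accession to the records that mention it."""
    groups = defaultdict(list)
    for record in records:
        # Collect accessions from both peptides; the set ensures an intra-link
        # is added only once to its protein.
        accessions = set(record["alpha_proteins"]) | set(record["beta_proteins"])
        for accession in sorted(accessions):
            groups[accession].append(record)
    return dict(groups)


# Hand-written toy records, not taken from the example file.
toy_records = [
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["Cas9"]},
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["Cas9"]},
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["sp"]},
]

toy_by_protein = group_by_protein(toy_records)
group_sizes = {protein: len(items) for protein, items in toy_by_protein.items()}
print(group_sizes)
print(sum(group_sizes.values()), "group entries for", len(toy_records), "records")
```

Here three records produce four group entries, because the Cas9-sp record appears in both groups.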

Getting Crosslink-Spectrum-Matches or Crosslinks of Specific Proteins


csms_cas9 = transform.filter_proteins(csms, proteins=["Cas9"])

If we are only interested in crosslink-spectrum-matches of a specific protein (or set of proteins), we can extract them by calling transform.filter_proteins() and passing the crosslink-spectrum-matches as the first argument. The second argument, proteins, should be a list or set of the protein accessions that we are interested in. In this example we are only interested in a single protein, namely "Cas9". You can read more about the filter_proteins() function and all its parameters here: docs.


list(csms_cas9.keys())



    ['Proteins', 'Both', 'One']

The function returns a dictionary with keys "Proteins", "Both", and "One":

"Proteins" gives access to the original list of proteins that was used for filtering (i.e., what was passed via the proteins parameter).
"Both" contains all crosslink-spectrum-matches where both peptides belong to one of the specified proteins; in our case both peptides are from Cas9.
"One" contains all crosslink-spectrum-matches where only one of the two crosslinked peptides belongs to one of the specified proteins; in our case only one peptide is from Cas9.


csms_cas9["Proteins"]



    ['Cas9']

Via "Proteins" we can access our original list of proteins that was used for filtering.


len(csms_cas9["Both"])


Via "Both" we get all crosslink-spectrum-matches where both peptides belong to one of the specified proteins of interest (in our case there was only one protein of interest: Cas9).


random.choice(csms_cas9["Both"])



    {'data_type': 'crosslink-spectrum-match',
     'completeness': 'full',
     'alpha_peptide': 'APLSASMIKR',
     'alpha_modifications': {9: ('DSS', 138.06808)},
     'alpha_peptide_crosslink_position': 9,
     'alpha_proteins': ['Cas9'],
     'alpha_proteins_crosslink_positions': [327],
     'alpha_proteins_peptide_positions': [319],
     'alpha_score': 150.4839485561563,
     'alpha_decoy': False,
     'beta_peptide': 'TEVQTGGFSKESILPK',
     'beta_modifications': {10: ('DSS', 138.06808)},
     'beta_peptide_crosslink_position': 10,
     'beta_proteins': ['Cas9'],
     'beta_proteins_crosslink_positions': [1111],
     'beta_proteins_peptide_positions': [1102],
     'beta_score': 304.4428310555497,
     'beta_decoy': False,
     'crosslink_type': 'intra',
     'score': 150.4839485561563,
     'spectrum_file': 'XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw',
     'scan_nr': 17156,
     'charge': 4,
     'retention_time': 4648.5357,
     'ion_mobility': 0.0,
     'additional_information': None}

Here is a random example crosslink-spectrum-match from "Both". As you can see, both peptides are from Cas9.


len(csms_cas9["One"])


Via "One" we get all crosslink-spectrum-matches where only one of the two crosslinked peptides belongs to one of the specified proteins of interest (in our case there was only one protein of interest: Cas9).


random.choice(csms_cas9["One"])



    {'data_type': 'crosslink-spectrum-match',
     'completeness': 'full',
     'alpha_peptide': 'VPSKKFK',
     'alpha_modifications': {4: ('DSS', 138.06808)},
     'alpha_peptide_crosslink_position': 4,
     'alpha_proteins': ['Cas9'],
     'alpha_proteins_crosslink_positions': [34],
     'alpha_proteins_peptide_positions': [31],
     'alpha_score': 23.380279868710552,
     'alpha_decoy': False,
     'beta_peptide': 'VTEFKYGAK',
     'beta_modifications': {5: ('DSS', 138.06808)},
     'beta_peptide_crosslink_position': 5,
     'beta_proteins': ['sp'],
     'beta_proteins_crosslink_positions': [272],
     'beta_proteins_peptide_positions': [268],
     'beta_score': 52.214502774371496,
     'beta_decoy': False,
     'crosslink_type': 'inter',
     'score': 23.380279868710552,
     'spectrum_file': 'XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw',
     'scan_nr': 9474,
     'charge': 3,
     'retention_time': 2945.7904200000003,
     'ion_mobility': 0.0,
     'additional_information': None}

Here is a random example crosslink-spectrum-match from "One". As you can see, only one of the two peptides is from Cas9.
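
The meaning of the three keys can be reproduced on toy records with a few lines of plain Python. This is only a sketch of the behaviour described above, not the pyXLMS implementation: split_by_proteins() is a hypothetical helper, and it assumes that a peptide counts as belonging to a protein of interest if any of its accessions is in the given list.

```python
def split_by_proteins(records, proteins):
    """Toy stand-in returning a dictionary with "Proteins", "Both" and "One"."""
    wanted = set(proteins)
    result = {"Proteins": list(proteins), "Both": [], "One": []}
    for record in records:
        # Assumption: a peptide matches if any of its accessions is wanted.
        alpha_hit = bool(wanted & set(record["alpha_proteins"]))
        beta_hit = bool(wanted & set(record["beta_proteins"]))
        if alpha_hit and beta_hit:
            result["Both"].append(record)
        elif alpha_hit or beta_hit:
            result["One"].append(record)
        # Records where neither peptide matches are left out entirely.
    return result


# Hand-written toy records, not taken from the example file.
toy_records = [
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["Cas9"]},
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["sp"]},
    {"alpha_proteins": ["sp"], "beta_proteins": ["sp"]},
]

toy_cas9 = split_by_proteins(toy_records, proteins=["Cas9"])
print(list(toy_cas9.keys()))
print(len(toy_cas9["Both"]), len(toy_cas9["One"]))
```

Note that in this sketch the sp-sp record appears under neither "Both" nor "One", so the two lists together can be shorter than the input.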


xls_cas9 = transform.filter_proteins(xls, proteins=["Cas9"])

Similarly, if we are only interested in crosslinks of a specific protein (or set of proteins), we can extract them by calling transform.filter_proteins() and passing the crosslinks as the first argument. The second argument, proteins, should be a list or set of the protein accessions that we are interested in. In this example we are only interested in a single protein, namely "Cas9". You can read more about the filter_proteins() function and all its parameters here: docs.


list(xls_cas9.keys())



    ['Proteins', 'Both', 'One']

The function returns a dictionary with keys "Proteins", "Both", and "One":

"Proteins" gives access to the original list of proteins that was used for filtering (i.e., what was passed via the proteins parameter).
"Both" contains all crosslinks where both peptides belong to one of the specified proteins; in our case both peptides are from Cas9.
"One" contains all crosslinks where only one of the two crosslinked peptides belongs to one of the specified proteins; in our case only one peptide is from Cas9.


xls_cas9["Proteins"]



    ['Cas9']

Via "Proteins" we can access our original list of proteins that was used for filtering.


len(xls_cas9["Both"])


Via "Both" we get all crosslinks where both peptides belong to one of the specified proteins of interest (in our case there was only one protein of interest: Cas9).


random.choice(xls_cas9["Both"])



    {'data_type': 'crosslink',
     'completeness': 'full',
     'alpha_peptide': 'KSEETITPWNFEEVVDK',
     'alpha_peptide_crosslink_position': 1,
     'alpha_proteins': ['Cas9'],
     'alpha_proteins_crosslink_positions': [472],
     'alpha_decoy': False,
     'beta_peptide': 'QITKHVAQILDSR',
     'beta_peptide_crosslink_position': 4,
     'beta_proteins': ['Cas9'],
     'beta_proteins_crosslink_positions': [933],
     'beta_decoy': False,
     'crosslink_type': 'intra',
     'score': 123.60677704209819,
     'additional_information': None}

Here is a random example crosslink from "Both". As you can see, both peptides are from Cas9.


len(xls_cas9["One"])


Via "One" we get all crosslinks where only one of the two crosslinked peptides belongs to one of the specified proteins of interest (in our case there was only one protein of interest: Cas9).


random.choice(xls_cas9["One"])



    {'data_type': 'crosslink',
     'completeness': 'full',
     'alpha_peptide': 'KAMAYWTGSFRAK',
     'alpha_peptide_crosslink_position': 1,
     'alpha_proteins': ['sp'],
     'alpha_proteins_crosslink_positions': [155],
     'alpha_decoy': True,
     'beta_peptide': 'NSDKLIAR',
     'beta_peptide_crosslink_position': 4,
     'beta_proteins': ['Cas9'],
     'beta_proteins_crosslink_positions': [1122],
     'beta_decoy': True,
     'crosslink_type': 'inter',
     'score': 14.236170633441338,
     'additional_information': None}

Here is a random example crosslink from "One". As you can see, only one of the two peptides is from Cas9.

Filtering by Target-Decoy Type


csms_td = transform.filter_target_decoy(csms)

We can filter crosslink-spectrum-matches by their target-decoy type by calling transform.filter_target_decoy() and passing the crosslink-spectrum-matches as the first argument. The function returns a dictionary containing the keys "Target-Target", "Target-Decoy", and "Decoy-Decoy" with their associated values being lists of the corresponding target-target, target-decoy, and decoy-decoy CSMs respectively. You can read more about the filter_target_decoy() function and all its parameters here: docs.


len(csms_td["Target-Target"])


Via "Target-Target" we can access all crosslink-spectrum-matches where both peptides are from the target database.


len(csms_td["Target-Decoy"])


Via "Target-Decoy" we can access all crosslink-spectrum-matches where one peptide is from the target database and one peptide is from the decoy database. Therefore both target-decoy and decoy-target matches are contained in "Target-Decoy".


len(csms_td["Decoy-Decoy"])


Via "Decoy-Decoy" we can access all crosslink-spectrum-matches where both peptides are from the decoy database.


xls_td = transform.filter_target_decoy(xls)

Similarly, we can filter crosslinks by their target-decoy type by calling transform.filter_target_decoy() and passing the crosslinks as the first argument. The function returns a dictionary containing the keys "Target-Target", "Target-Decoy", and "Decoy-Decoy" with their associated values being lists of the corresponding target-target, target-decoy, and decoy-decoy crosslinks respectively. You can read more about the filter_target_decoy() function and all its parameters here: docs.


len(xls_td["Target-Target"])


Via "Target-Target" we can access all crosslinks where both peptides are from the target database.


len(xls_td["Target-Decoy"])


Via "Target-Decoy" we can access all crosslinks where one peptide is from the target database and one peptide is from the decoy database. Therefore both target-decoy and decoy-target matches are contained in "Target-Decoy". For our MS Annika crosslink results the number of "Target-Decoy" matches is zero, because on the crosslink level MS Annika reports any target-decoy and decoy-target matches as full decoy-decoy matches.


len(xls_td["Decoy-Decoy"])


Via "Decoy-Decoy" we can access all crosslinks where both peptides are from the decoy database.

Important

Please note that any crosslink-spectrum-matches or crosslinks with missing target-decoy labels will be filtered out by this function!
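
The following sketch shows one way the three groups and the note above fit together, using the alpha_decoy and beta_decoy fields that appear in the example records. It is not the pyXLMS implementation: split_by_target_decoy() is a hypothetical helper, and representing a missing label as None is an assumption.

```python
def split_by_target_decoy(records):
    """Toy stand-in that groups records by the number of decoy peptides."""
    groups = {"Target-Target": [], "Target-Decoy": [], "Decoy-Decoy": []}
    names = ("Target-Target", "Target-Decoy", "Decoy-Decoy")
    for record in records:
        alpha_decoy = record.get("alpha_decoy")
        beta_decoy = record.get("beta_decoy")
        # Assumption: a missing label is stored as None; such records are skipped.
        if alpha_decoy is None or beta_decoy is None:
            continue
        # 0, 1 or 2 decoy peptides select the matching group name, so
        # target-decoy and decoy-target both land in "Target-Decoy".
        n_decoys = int(alpha_decoy) + int(beta_decoy)
        groups[names[n_decoys]].append(record)
    return groups


# Hand-written toy records, not taken from the example file.
toy_records = [
    {"alpha_decoy": False, "beta_decoy": False},
    {"alpha_decoy": False, "beta_decoy": True},
    {"alpha_decoy": True, "beta_decoy": False},
    {"alpha_decoy": True, "beta_decoy": True},
    {"alpha_decoy": None, "beta_decoy": False},  # missing label
]

toy_td = split_by_target_decoy(toy_records)
td_sizes = {key: len(value) for key, value in toy_td.items()}
print(td_sizes)
print(len(toy_records) - sum(td_sizes.values()), "record(s) dropped")
```

Comparing the sum of the group sizes with the length of the input, as in the last line, is a quick way to notice records that were dropped because of missing labels.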

Filtering Target Matches Only

Because we are often only interested in target-target matches, there is a shorthand function called transform.targets_only() that returns only target-target matches. In contrast to all previous filter functions, targets_only() accepts either a list of crosslink-spectrum-matches, a list of crosslinks, or a parser_result as input (see data type documentation here: docs). The return type will be the same as the input type. You can read more about the targets_only() function and all its parameters here: docs.


csms = transform.targets_only(csms)
print(f"Nr. of TT CSMs: {len(csms)}")



    Nr. of TT CSMs: 786

Here is an example of calling targets_only() on a list of crosslink-spectrum-matches: a list of crosslink-spectrum-matches containing only target-target matches is returned.


xls = transform.targets_only(xls)
print(f"Nr. of TT crosslinks: {len(xls)}")



    Nr. of TT crosslinks: 265

Here is an example of calling targets_only() on a list of crosslinks: a list of crosslinks containing only target-target matches is returned.


parser_result = transform.targets_only(parser_result)
print(f"Nr. of TT CSMs: {len(parser_result['crosslink-spectrum-matches'])}")
print(f"Nr. of TT crosslinks: {len(parser_result['crosslinks'])}")



    Nr. of TT CSMs: 786
    Nr. of TT crosslinks: 265

Here is an example of calling targets_only() on a parser_result: a parser_result containing only target-target matches is returned.

Important

Please note that any crosslink-spectrum-matches or crosslinks with missing target-decoy labels will be filtered out by this function!
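
As a rough illustration of "the return type will be the same as the input type", the sketch below filters either a list of records or a dictionary that holds such lists. The helper keep_targets() is hypothetical and not the pyXLMS implementation. The toy dictionary only borrows the two keys "crosslink-spectrum-matches" and "crosslinks" used earlier in this tutorial; its "label" key is made up for the example.

```python
def is_target_target(record):
    """True only if both decoy labels are present and explicitly False."""
    return record.get("alpha_decoy") is False and record.get("beta_decoy") is False


def keep_targets(data):
    """Toy stand-in: filter a list of records or a dictionary holding such lists."""
    if isinstance(data, dict):
        filtered = dict(data)  # shallow copy, all other keys are preserved
        for key in ("crosslink-spectrum-matches", "crosslinks"):
            if filtered.get(key) is not None:
                filtered[key] = [r for r in filtered[key] if is_target_target(r)]
        return filtered
    return [record for record in data if is_target_target(record)]


# Hand-written toy data, not taken from the example file.
toy_csms = [
    {"alpha_decoy": False, "beta_decoy": False},
    {"alpha_decoy": False, "beta_decoy": True},
    {"alpha_decoy": None, "beta_decoy": False},  # missing label, filtered out
]
toy_result = {
    "label": "toy example",  # made-up key for illustration only
    "crosslink-spectrum-matches": toy_csms,
    "crosslinks": None,
}

filtered_list = keep_targets(toy_csms)
filtered_result = keep_targets(toy_result)
print(type(filtered_list).__name__, len(filtered_list))
print(type(filtered_result).__name__, len(filtered_result["crosslink-spectrum-matches"]))
```

Unlike the tutorial cells, which reassign csms, xls, and parser_result, this sketch keeps the unfiltered toy data around under its original name so that both versions can be compared.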