Reading Scout Result Files
from pyXLMS import __version__
print(f"Installed pyXLMS version: {__version__}")
Installed pyXLMS version: 1.5.2
from pyXLMS import parser
from pyXLMS import transform
All functionality for parsing crosslink-spectrum-matches (CSMs) and crosslinks (XLs) from Scout result files is available via the parser submodule. We also import the transform submodule to show some summary statistics of the files we read.
Reading Scout Result Files via parser.read()
parser_result = parser.read(
    "../../data/scout/Cas9_Filtered_CSMs.csv",
    engine="Scout",
    crosslinker="DSSO",
)
Reading Scout filtered CSMs...: 100%|██████████| 1306/1306 [00:00<00:00, 12731.16it/s]
Any Scout result file can be read using the parser.read() method with engine="Scout". The method also requires us to specify the crosslinker that was used in the experiment, which in this case is DSSO (crosslinker="DSSO"). You can read the documentation for the parser.read() method here: docs.
for k, v in parser_result.items():
    print(f"{k}: {type(v) if isinstance(v, list) else v}")
data_type: parser_result
completeness: partial
search_engine: Scout
crosslink-spectrum-matches: <class 'list'>
crosslinks: None
The parser.read() method returns a dictionary with a set of specified keys and their values. We refer to this dictionary as a parser_result object. All parser.read* methods return such a parser_result object; you can read more about that here: docs, and here: data types specification.
As the printout shows (crosslink-spectrum-matches: <class 'list'>), this Scout result file contains CSMs, while no crosslinks were read (crosslinks: None). We can access the CSMs via parser_result["crosslink-spectrum-matches"], which we will do a bit further down.
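As a minimal sketch of how such a dictionary can be inspected, the snippet below counts the entries under each key while treating None as empty. The dictionary is a hand-written stand-in that mirrors the printout above, not an actual return value of parser.read(), and count_entries is a hypothetical helper, not part of pyXLMS.

```python
# Hand-written stand-in that mirrors the keys shown in the printout above.
# In practice this dictionary would come from parser.read().
parser_result = {
    "data_type": "parser_result",
    "completeness": "partial",
    "search_engine": "Scout",
    "crosslink-spectrum-matches": [{"data_type": "crosslink-spectrum-match"}],
    "crosslinks": None,
}


def count_entries(result, key):
    """Return the number of entries stored under key, treating None as empty."""
    entries = result.get(key)
    return 0 if entries is None else len(entries)


n_csms = count_entries(parser_result, "crosslink-spectrum-matches")
n_xls = count_entries(parser_result, "crosslinks")
print(f"CSMs: {n_csms}, crosslinks: {n_xls}")
```

Checking for None before calling len() avoids a TypeError when a file contains only one of the two data types.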
_ = transform.summary(parser_result)
Number of CSMs: 1306.0
Number of unique CSMs: 1306.0
Number of intra CSMs: 1306.0
Number of inter CSMs: 0.0
Number of target-target CSMs: 1306.0
Number of target-decoy CSMs: 0.0
Number of decoy-decoy CSMs: 0.0
Minimum CSM score: 0.061476927286783
Maximum CSM score: 0.462267769996323
With the transform.summary() method we can also print out some summary statistics about the CSMs in the file. You can read more about the method here: docs.
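To illustrate what such statistics mean, a few of them can be recomputed by hand from a list of CSM dictionaries. This is only a sketch: the three CSMs below are toy stand-ins with made-up values that carry just the two fields used here, and summarize_csms is a hypothetical helper, not the implementation of transform.summary().

```python
from collections import Counter

# Toy stand-ins with made-up values; real CSMs carry many more fields.
toy_csms = [
    {"crosslink_type": "intra", "score": 0.39},
    {"crosslink_type": "intra", "score": 0.12},
    {"crosslink_type": "inter", "score": 0.25},
]


def summarize_csms(csms):
    """Compute a few simple summary statistics for a list of CSM dictionaries."""
    types = Counter(csm["crosslink_type"] for csm in csms)
    scores = [csm["score"] for csm in csms]
    return {
        "Number of CSMs": len(csms),
        "Number of intra CSMs": types["intra"],
        "Number of inter CSMs": types["inter"],
        "Minimum CSM score": min(scores),
        "Maximum CSM score": max(scores),
    }


stats = summarize_csms(toy_csms)
for name, value in stats.items():
    print(f"{name}: {value}")
```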
sample_csm = parser_result["crosslink-spectrum-matches"][0]
for k, v in sample_csm.items():
    print(f"{k}: {v}")
data_type: crosslink-spectrum-match
completeness: partial
alpha_peptide: MLASAGELQKGNELALPSK
alpha_modifications: {10: ('DSSO', 158.00376), 1: ('Oxidation', 15.994915)}
alpha_peptide_crosslink_position: 10
alpha_proteins: ['sp|Cas9|Cas9']
alpha_proteins_crosslink_positions: [1226]
alpha_proteins_peptide_positions: [1217]
alpha_score: None
alpha_decoy: False
beta_peptide: MLASAGELQKGNELALPSK
beta_modifications: {10: ('DSSO', 158.00376)}
beta_peptide_crosslink_position: 10
beta_proteins: ['sp|Cas9|Cas9']
beta_proteins_crosslink_positions: [1226]
beta_proteins_peptide_positions: [1217]
beta_score: None
beta_decoy: False
crosslink_type: intra
score: 0.3903793560071014
spectrum_file: XLpeplib_Beveridge_Lumos_DSSO_stHCD-MS2.raw
scan_nr: 21781
charge: 3
retention_time: None
ion_mobility: None
additional_information: None
Using parser_result["crosslink-spectrum-matches"][0] we can get the first CSM of the file and take a closer look at it.
This is an example CSM; you can learn more about the specific attributes and their values here: docs, and here: data types specification.
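The position fields of the example CSM are consistent with each other if we assume that all positions are 1-based: a peptide that starts at protein position 1217 and is crosslinked at peptide position 10 gives protein position 1217 + 10 - 1 = 1226. The sketch below checks this on a stand-in dictionary that holds only the relevant fields, copied from the printout above; protein_position is a hypothetical helper, not a pyXLMS function.

```python
# Stand-in CSM with only the fields used here; values copied from the printout.
sample_csm = {
    "alpha_peptide": "MLASAGELQKGNELALPSK",
    "alpha_peptide_crosslink_position": 10,
    "alpha_proteins_crosslink_positions": [1226],
    "alpha_proteins_peptide_positions": [1217],
}


def protein_position(peptide_start, peptide_crosslink_position):
    """Map a 1-based position in the peptide to a 1-based position in the protein."""
    return peptide_start + peptide_crosslink_position - 1


# The crosslinked residue, assuming 1-based peptide positions.
residue = sample_csm["alpha_peptide"][sample_csm["alpha_peptide_crosslink_position"] - 1]
computed = protein_position(
    sample_csm["alpha_proteins_peptide_positions"][0],
    sample_csm["alpha_peptide_crosslink_position"],
)
print(residue, computed)
```

The crosslinked residue comes out as a lysine, which is what one would expect for DSSO.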
parser_result = parser.read(
    "../../data/scout/Cas9_Residue_Pairs.csv",
    engine="Scout",
    crosslinker="DSSO",
)
Reading Scout crosslinks...: 100%|██████████| 200/200 [00:00<00:00, 39993.36it/s]
We can read a Scout crosslink result file just as easily using parser.read(). We don't have to change anything other than the filename: the parser automatically infers from the content of the file whether crosslink-spectrum-matches or crosslinks should be read.
_ = transform.summary(parser_result)
Number of crosslinks: 200.0
Number of unique crosslinks by peptide: 200.0
Number of unique crosslinks by protein: 200.0
Number of intra crosslinks: 200.0
Number of inter crosslinks: 0.0
Number of target-target crosslinks: 200.0
Number of target-decoy crosslinks: 0.0
Number of decoy-decoy crosslinks: 0.0
Minimum crosslink score: 0.999900916500584
Maximum crosslink score: 0.9999984750460398
The transform.summary() method also works for printing summary statistics of the read crosslinks.
sample_crosslink = parser_result["crosslinks"][0]
for k, v in sample_crosslink.items():
    print(f"{k}: {v}")
data_type: crosslink
completeness: full
alpha_peptide: MLASAGELQKGNELALPSK
alpha_peptide_crosslink_position: 10
alpha_proteins: ['sp|Cas9|Cas9']
alpha_proteins_crosslink_positions: [1226]
alpha_decoy: False
beta_peptide: MLASAGELQKGNELALPSK
beta_peptide_crosslink_position: 10
beta_proteins: ['sp|Cas9|Cas9']
beta_proteins_crosslink_positions: [1226]
beta_decoy: False
crosslink_type: intra
score: 0.9999984750460398
additional_information: None
Just like for the CSMs, we can also look at specific crosslinks using parser_result["crosslinks"].
Here is an example crosslink; you can learn more about the specific attributes and their values here: docs, and here: data types specification.
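Because a crosslink is a plain dictionary, its fields can be combined freely. As a sketch, the hypothetical helper residue_pair_label below builds a readable residue-pair label from the first protein and position of each side. The dictionary is a stand-in with values copied from the printout above, and the label format is an arbitrary choice for illustration.

```python
# Stand-in crosslink with only the fields used here; values copied from the printout.
sample_crosslink = {
    "alpha_proteins": ["sp|Cas9|Cas9"],
    "alpha_proteins_crosslink_positions": [1226],
    "beta_proteins": ["sp|Cas9|Cas9"],
    "beta_proteins_crosslink_positions": [1226],
    "crosslink_type": "intra",
}


def residue_pair_label(crosslink):
    """Build a 'protein:position - protein:position' label from the first entries."""
    alpha = f"{crosslink['alpha_proteins'][0]}:{crosslink['alpha_proteins_crosslink_positions'][0]}"
    beta = f"{crosslink['beta_proteins'][0]}:{crosslink['beta_proteins_crosslink_positions'][0]}"
    return f"{alpha} - {beta}"


label = residue_pair_label(sample_crosslink)
print(label)
```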
parser_result = parser.read(
    [
        "../../data/scout/Cas9_Filtered_CSMs.csv",
        "../../data/scout/Cas9_Residue_Pairs.csv",
    ],
    engine="Scout",
    crosslinker="DSSO",
)
Reading Scout filtered CSMs...: 100%|██████████| 1306/1306 [00:00<00:00, 23845.80it/s]
Reading Scout crosslinks...: 100%|██████████| 200/200 [00:00<00:00, 32944.30it/s]
It is also possible to read multiple files into one parser_result. This makes sense, for example, if you have XLs and CSMs from the same run.
_ = transform.summary(parser_result)
Number of CSMs: 1306.0
Number of unique CSMs: 1306.0
Number of intra CSMs: 1306.0
Number of inter CSMs: 0.0
Number of target-target CSMs: 1306.0
Number of target-decoy CSMs: 0.0
Number of decoy-decoy CSMs: 0.0
Minimum CSM score: 0.061476927286783
Maximum CSM score: 0.462267769996323
Number of crosslinks: 200.0
Number of unique crosslinks by peptide: 200.0
Number of unique crosslinks by protein: 200.0
Number of intra crosslinks: 200.0
Number of inter crosslinks: 0.0
Number of target-target crosslinks: 200.0
Number of target-decoy crosslinks: 0.0
Number of decoy-decoy crosslinks: 0.0
Minimum crosslink score: 0.999900916500584
Maximum crosslink score: 0.9999984750460398
If the parser_result contains both crosslinks and CSMs, summary statistics for both will be calculated by transform.summary().
parser_result = parser.read(
    "../../data/scout/Cas9_Unfiltered_CSMs.csv",
    engine="Scout",
    crosslinker="DSSO",
)
Reading Scout unfiltered CSMs...: 100%|██████████| 1697/1697 [00:00<00:00, 25499.98it/s]
We can of course also read the unfiltered search results from Scout.
parser_result = parser.read(
    "../../data/scout/Cas9_Filtered_CSMs.csv",
    engine="Scout",
    crosslinker="DSSO",
    parse_modifications=False,
)
Reading Scout filtered CSMs...: 100%|██████████| 1306/1306 [00:00<00:00, 29232.50it/s]
We can also tell the parser not to parse modifications via parse_modifications=False. This might be useful if you do not care about post-translational modifications, or if your results contain unknown modifications that you would otherwise have to specify manually.
If you want to parse modifications but have unknown modifications in your results, you have to set them via the modifications parameter, which can be passed via **kwargs to parser.read_scout(). We will get back to that.
sample_csm = parser_result["crosslink-spectrum-matches"][0]
for k, v in sample_csm.items():
    print(f"{k}: {v}")
data_type: crosslink-spectrum-match
completeness: partial
alpha_peptide: MLASAGELQKGNELALPSK
alpha_modifications: None
alpha_peptide_crosslink_position: 10
alpha_proteins: ['sp|Cas9|Cas9']
alpha_proteins_crosslink_positions: [1226]
alpha_proteins_peptide_positions: [1217]
alpha_score: None
alpha_decoy: False
beta_peptide: MLASAGELQKGNELALPSK
beta_modifications: None
beta_peptide_crosslink_position: 10
beta_proteins: ['sp|Cas9|Cas9']
beta_proteins_crosslink_positions: [1226]
beta_proteins_peptide_positions: [1217]
beta_score: None
beta_decoy: False
crosslink_type: intra
score: 0.3903793560071014
spectrum_file: XLpeplib_Beveridge_Lumos_DSSO_stHCD-MS2.raw
scan_nr: 21781
charge: 3
retention_time: None
ion_mobility: None
additional_information: None
Notice how the fields alpha_modifications and beta_modifications are now empty (None) for our sample CSM, in contrast to when we looked at it further up.
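Code that works with CSMs should therefore be prepared for both cases. The sketch below uses a hypothetical helper, list_modifications, that returns an empty list when modifications were not parsed and otherwise lists each modification with the residue it sits on. It assumes that the dictionary keys are 1-based peptide positions, which matches the example (position 1 is M, position 10 is K). Positions outside the peptide are reported without a residue.

```python
def list_modifications(peptide, modifications):
    """Return (position, residue, name, mass) tuples sorted by position.

    Returns an empty list if modifications is None (modifications not parsed).
    """
    if modifications is None:
        return []
    described = []
    for position, (name, mass) in sorted(modifications.items()):
        # Assumption: keys are 1-based positions within the peptide sequence.
        residue = peptide[position - 1] if 1 <= position <= len(peptide) else None
        described.append((position, residue, name, mass))
    return described


peptide = "MLASAGELQKGNELALPSK"
# Values copied from the first printout of the sample CSM.
parsed = {10: ("DSSO", 158.00376), 1: ("Oxidation", 15.994915)}

with_mods = list_modifications(peptide, parsed)
without_mods = list_modifications(peptide, None)
print(with_mods)
print(without_mods)
```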
Reading Scout Result Files via parser.read_scout()
parser_result = parser.read_scout(
    "../../data/scout/Cas9_Filtered_CSMs.csv", crosslinker="DSSO"
)
Reading Scout filtered CSMs...: 100%|██████████| 1306/1306 [00:00<00:00, 23500.83it/s]
We can also read Scout result files using the parser.read_scout() method, which allows more fine-grained control over reading the result files, although in principle everything can be done with the parser.read() function as well. You can read the documentation for the parser.read_scout() method here: docs.
_ = transform.summary(parser_result)
Number of CSMs: 1306.0
Number of unique CSMs: 1306.0
Number of intra CSMs: 1306.0
Number of inter CSMs: 0.0
Number of target-target CSMs: 1306.0
Number of target-decoy CSMs: 0.0
Number of decoy-decoy CSMs: 0.0
Minimum CSM score: 0.061476927286783
Maximum CSM score: 0.462267769996323
from pyXLMS.constants import SCOUT_MODIFICATION_MAPPING
SCOUT_MODIFICATION_MAPPING
{'+57.021460': ('Carbamidomethyl', 57.021464),
'+15.994900': ('Oxidation', 15.994915),
'Oxidation of Methionine': ('Oxidation', 15.994915),
'Carbamidomethyl': ('Carbamidomethyl', 57.021464),
'BS3': ('BS3', 138.06808),
'DSS': ('DSS', 138.06808),
'DSSO': ('DSSO', 158.00376),
'DSBU': ('DSBU', 196.08479231),
'ADH': ('ADH', 138.09054635),
'DSBSO': ('DSBSO', 308.03883),
'PhoX': ('PhoX', 209.97181),
'DSG': ('DSG', 96.0211293726)}
By default, the Scout parser considers all modifications in constants.SCOUT_MODIFICATION_MAPPING, shown above for pyXLMS version 1.5.2. A full list of the default Scout modifications is given here: docs.
my_mods = dict(SCOUT_MODIFICATION_MAPPING)
my_mods["Methylation"] = ("Methylation", 14.01565)
my_mods
{'+57.021460': ('Carbamidomethyl', 57.021464),
'+15.994900': ('Oxidation', 15.994915),
'Oxidation of Methionine': ('Oxidation', 15.994915),
'Carbamidomethyl': ('Carbamidomethyl', 57.021464),
'BS3': ('BS3', 138.06808),
'DSS': ('DSS', 138.06808),
'DSSO': ('DSSO', 158.00376),
'DSBU': ('DSBU', 196.08479231),
'ADH': ('ADH', 138.09054635),
'DSBSO': ('DSBSO', 308.03883),
'PhoX': ('PhoX', 209.97181),
'DSG': ('DSG', 96.0211293726),
'Methylation': ('Methylation', 14.01565)}
If your result file(s) contain any additional modifications, the parser needs to know about them. This is done via the modifications parameter, which accepts a custom dictionary of modifications. It is usually a good idea to base this custom dictionary on constants.SCOUT_MODIFICATION_MAPPING and add your modifications afterwards, as shown here for methylation.
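Copying the default mapping with dict() before adding entries matters because it leaves the default untouched. The sketch below demonstrates this on a small stand-in dictionary, DEFAULT_MAPPING (a hypothetical name holding two entries copied from the printout above), so that it runs without pyXLMS installed. It also shows an equivalent one-step merge.

```python
# Small stand-in for the default mapping; two entries copied from the printout above.
DEFAULT_MAPPING = {
    "+15.994900": ("Oxidation", 15.994915),
    "DSSO": ("DSSO", 158.00376),
}

# Variant 1: copy first, then add the custom modification.
my_mods = dict(DEFAULT_MAPPING)
my_mods["Methylation"] = ("Methylation", 14.01565)

# Variant 2: build the extended dictionary in one step via unpacking.
merged = {**DEFAULT_MAPPING, "Methylation": ("Methylation", 14.01565)}

print("Methylation" in DEFAULT_MAPPING)  # the default is left unchanged
print(my_mods == merged)
```

A shallow copy is sufficient here because the values are immutable tuples.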
parser_result = parser.read_scout(
    "../../data/scout/Cas9_Filtered_CSMs.csv", crosslinker="DSSO", modifications=my_mods
)
Reading Scout filtered CSMs...: 100%|██████████| 1306/1306 [00:00<00:00, 23852.55it/s]
You can then pass the full dictionary of expected modifications, my_mods, via the modifications parameter.
_ = transform.summary(parser_result)
Number of CSMs: 1306.0
Number of unique CSMs: 1306.0
Number of intra CSMs: 1306.0
Number of inter CSMs: 0.0
Number of target-target CSMs: 1306.0
Number of target-decoy CSMs: 0.0
Number of decoy-decoy CSMs: 0.0
Minimum CSM score: 0.061476927286783
Maximum CSM score: 0.462267769996323
There are several other parameters that can be set; you can read more about them here: docs.