From Formula to Structure: Inferring Compound Class from MS1 + MS2

With an ultra-high-resolution Orbitrap MS1 scan, I can already obtain highly accurate chemical formulae. However, this chemical information is still not enough.

First, each formula may correspond to multiple isomeric structures, and each isomer can carry unique information, such as its specific emission sources, formation pathways, or toxicity.

Even worse, some ions have nearly identical m/z values that fall within the formula-fitting threshold (20 ppm tolerance) but correspond to entirely different molecular formulae.

m/z 261.12288 → C14H17N2O3  ↔  C12H18N2NaO3  
m/z 262.15073 → C12H21N3NaO2 ↔ C9H20N5O4
m/z 263.10217 → C13H15N2O4 ↔ C11H16N2NaO4
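
To make the ambiguity concrete, here is a minimal sketch that computes the theoretical m/z of two candidate ions and their ppm error against the measured value. The helper names (`ion_mz`, `ppm_error`) are my own for illustration, and the formulas are assumed to already include the adduct (H or Na) and to carry a single positive charge.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes, plus the electron mass
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462, "Na": 22.98976928}
ELECTRON = 0.00054858

def ion_mz(formula, charge=1):
    """m/z of a cation whose formula already includes the adduct atom."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONO[element] * int(count or 1)
    # Remove one electron mass per positive charge
    return (mass - charge * ELECTRON) / charge

def ppm_error(measured, theoretical):
    """Relative mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6

measured = 261.12288
for formula in ("C14H17N2O3", "C12H18N2NaO3"):
    theo = ion_mz(formula)
    print(f"{formula}: {theo:.5f} ({ppm_error(measured, theo):+.1f} ppm)")
```

Both candidates land inside the 20 ppm tolerance, so MS1 accuracy alone cannot separate them.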

If we apply MS² analysis, we can go beyond the limitations of MS1. In this process, a specific parent ion is first isolated in the quadrupole. Only this ion (together with any ions co-isolated in the same window) passes through to the next stage, where it enters the collision cell and collides with N2 gas.

The resulting fragment ions, which carry structural information about the parent ion, are then introduced into the Orbitrap for high-resolution detection. These fragments provide rich m/z information, often linked to the presence or loss of specific functional groups, making the spectrum highly informative.
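
As a small illustration of how fragments relate to functional-group losses, the sketch below compares the gap between precursor and fragment m/z against a short table of common neutral losses. The function name, the tolerance, and the example peak list are all made up for illustration; the loss table is only a small subset.

```python
# Common neutral losses (monoisotopic mass in u); an illustrative subset only
NEUTRAL_LOSSES = {
    "H2O": 18.010565,
    "NH3": 17.026549,
    "CO": 27.994915,
    "CO2": 43.989829,
}

def annotate_losses(precursor_mz, fragment_mzs, tol=0.005):
    """Return (fragment m/z, loss name) pairs whose gap to the precursor matches a known loss."""
    hits = []
    for frag in fragment_mzs:
        gap = precursor_mz - frag
        for name, mass in NEUTRAL_LOSSES.items():
            if abs(gap - mass) <= tol:
                hits.append((frag, name))
    return hits

# Hypothetical precursor at m/z 200.0000 with three invented fragments
print(annotate_losses(200.0000, [181.9894, 156.0102, 120.0500]))
```

A loss of H2O hints at a hydroxyl group and a loss of CO2 at a carboxylic acid, although such assignments are only suggestive on their own.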

By analyzing these MS2 spectra, either through in silico computational tools or by matching against curated spectral libraries, we can narrow down the structural identity of the parent ion. This capability is valuable not only in environmental science but also in many other fields. For example, in medical research, it is widely used for the structural identification of metabolites, disease diagnostics, and biomarker discovery. In food chemistry, it enables the profiling of bioactive compounds and nutritional components.

1. MS2 strategy knowledge base

Generally, there are two types of MS² (tandem mass spectrometry) acquisition strategies that can be adopted:

  1. DDA – Data-Dependent Acquisition
    In this approach, precursor ions are selected based on the MS1 signal intensity. The instrument automatically selects the top N most intense precursor ions (above a certain threshold) for fragmentation in MS2 (MS/MS).

    ✅ Advantage: High-abundance ions usually yield high-quality MS2 spectra, making structural interpretation more reliable.
    ⚠️ Limitation: Low-abundance ions may be missed, which is especially problematic in environmental samples like mine, where many of the most intense signals originate from background contaminants rather than meaningful analytes.

  2. DIA – Data-Independent Acquisition
    In DIA, all ions within predefined m/z windows are fragmented simultaneously, regardless of their intensity.

    ✅ Advantage: Provides broader MS2 coverage, ensuring that even low-abundance compounds are captured.
    ⚠️ Limitation: The resulting MS2 spectra are highly complex and often require advanced deconvolution algorithms to interpret.
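
The difference between the two strategies can be sketched in a few lines. This is a simplified illustration, not how the instrument firmware works: `select_topn` mimics the DDA choice from one MS1 scan, and `dia_windows` builds fixed isolation windows. All names and example values are assumptions of mine.

```python
def select_topn(peaks, n=10, threshold=1e4):
    """DDA-style choice: the n most intense MS1 peaks above an intensity threshold.

    peaks: list of (mz, intensity) tuples from one MS1 scan.
    """
    eligible = [p for p in peaks if p[1] >= threshold]
    return sorted(eligible, key=lambda p: p[1], reverse=True)[:n]

def dia_windows(mz_start, mz_end, width):
    """DIA-style choice: fixed, contiguous isolation windows covering the whole range."""
    windows = []
    low = mz_start
    while low < mz_end:
        windows.append((low, min(low + width, mz_end)))
        low += width
    return windows

# Invented MS1 scan: one weak peak sits below the intensity threshold
scan = [(150.1, 5e6), (197.0, 8e3), (261.1, 2e5), (305.2, 9e4)]
print(select_topn(scan, n=2))      # the weak 197.0 peak is never selected
print(dia_windows(100, 200, 25))   # every ion falls into some window
```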

In my case, I will first work with MS1 full-scan mode, using post-acquisition data analysis to identify and narrow down a list of important precursor ions across different m/z windows. Based on this information, I will then apply a targeted Selected Ion Monitoring (SIM) method focused on these specific ions. For each target, I will optimize key acquisition parameters with the goal of obtaining reproducible, stable, and chemically meaningful fragment patterns for downstream structural interpretation.
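
One possible way to build such a target list from an MS1 peak table is sketched below: keep the most intense peaks in each m/z window and skip ions listed as background. The function name, window size, and the example peaks and background value are hypothetical choices for illustration, not the exact procedure I use.

```python
from collections import defaultdict

def targets_per_window(peaks, window=50.0, per_window=2, background=(), tol=0.005):
    """Keep the most intense peaks in each m/z window, skipping listed background ions.

    peaks: list of (mz, intensity) tuples; background: m/z values to ignore.
    """
    bins = defaultdict(list)
    for mz, intensity in peaks:
        if any(abs(mz - b) <= tol for b in background):
            continue  # known background ion: not a useful target
        bins[int(mz // window)].append((mz, intensity))
    targets = []
    for key in sorted(bins):
        best = sorted(bins[key], key=lambda p: p[1], reverse=True)[:per_window]
        targets.extend(sorted(mz for mz, _ in best))
    return targets

# Invented peak table; 149.0233 is treated as a background ion in this example
peaks = [(149.0233, 9e6), (161.0450, 3e5), (197.0425, 4e5),
         (261.1229, 2e5), (262.1507, 1e5), (263.1022, 5e4)]
print(targets_per_window(peaks, background=(149.0233,)))
```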

From the paper by Assress et al. (2023), I summarize the important parameters used for MS2 DDA analysis:

| Parameter | Description & Role | Tuning Range | Optimal Setting | Notes |
|---|---|---|---|---|
| MS2 Resolution | For MS/MS spectra quality | 30k–120k | 30k | Higher resolution MS/MS = fewer scans due to slower cycle |
| RF Lens (%) | Focuses ions before analysis | 10–100% | 70% | Balance between signal intensity and ion transmission |
| Mass Isolation Width | m/z window for selecting precursors | 0.4–6.0 m/z | 2.0 m/z | Narrow windows increase purity but reduce sensitivity |
| Intensity Threshold | Min signal to trigger MS/MS | 1e3–1e8 | 1e4 | Lower thresholds = more spectra, but lower quality |
| TopN (MS/MS Events) | Max MS/MS per cycle | 5–20 | Top 10 | More MS/MS = more compounds, but slower scans |
| Cycle Time (s) | Max time per scan loop | 1–7 s | 1 s | Shorter = better chromatographic peak sampling |
| AGC Target | Target ion count (MS/MS) | 50–500% | 100% | Prevents overfilling or low ion count |
| MIT (MS2) (ms) | Maximum injection time for MS/MS scans | 50–300 ms | 50 ms | Longer MIT slows scan speed |
| Microscans | Number of averaged scans | 1–10 | 1 | Higher = better S/N, but much slower cycle |
| Collision Energy | For ion fragmentation | Fixed/Stepped | Stepped: 10 & 40 V | Stepped energies improve MS/MS spectrum richness |
| Dynamic Exclusion | Prevents redundant fragmentation | 3–100 s | 10 s | Ensures broader MS/MS coverage |
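
To show what dynamic exclusion does in practice, here is a minimal simulation. It filters a stream of precursor candidates so that an m/z already fragmented within the last 10 s is skipped. The function name, the m/z tolerance, and the candidate list are assumptions for illustration; the real instrument logic is more involved.

```python
def apply_dynamic_exclusion(candidates, exclusion_s=10.0, tol=0.01):
    """Filter (time_s, mz) candidates, skipping any m/z fragmented within the last exclusion_s seconds."""
    fragmented = []  # (time_s, mz) of precursors already sent to MS/MS
    accepted = []
    for t, mz in sorted(candidates):
        recently = any(abs(mz - m) <= tol and t - t0 < exclusion_s for t0, m in fragmented)
        if not recently:
            accepted.append((t, mz))
            fragmented.append((t, mz))
    return accepted

# Invented candidates: the same ion shows up again after 1 s and after 12 s
cands = [(0.0, 197.042), (1.0, 197.043), (1.0, 261.123), (12.0, 197.042)]
print(apply_dynamic_exclusion(cands))
```

The repeat at 1 s is skipped, which frees that MS/MS slot for another precursor, while the repeat at 12 s is allowed again.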

3. Simple MS2-Level Data Visualization

This section presents a basic workflow for parsing and visualizing MS1 and MS2 data exported from the Orbitrap. The vendor .raw files are first converted into .ms1 and .ms2 text formats using tools such as MZmine3 or RawConverter.

Two functions are provided as follows:

import pandas as pd


def ms1_segment_scan_read(file_path):
    """Parse a .ms1 text file into a long-format DataFrame (one row per peak)."""
    segments = []            # one dict per scan ("segment")
    current_segment = None   # the scan currently being filled

    with open(file_path, 'r') as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            tag = parts[0]

            if tag == 'S':  # start of a new scan
                if current_segment is not None:
                    segments.append(current_segment)
                current_segment = {
                    'scan': int(parts[1]) if len(parts) >= 2 else None,
                    'RetTime': None,
                    'IonInjectionTime': None,
                    'InstrumentType': None,
                    'data': [],  # [m/z, intensity] pairs
                }
            elif tag == 'H' or current_segment is None:
                continue  # file header, or content before the first scan
            elif tag == 'I':  # scan-level metadata line
                if len(parts) >= 3:
                    key, value = parts[1], parts[2]
                    if key in ('RetTime', 'IonInjectionTime'):
                        current_segment[key] = float(value)
                    elif key == 'InstrumentType':
                        current_segment[key] = value
            elif len(parts) == 4:
                # Peak line with four columns. The first two are read as m/z and
                # intensity; check the column order of your own export.
                current_segment['data'].append([float(parts[0]), float(parts[1])])

    if current_segment is not None:
        segments.append(current_segment)

    # Flatten: repeat the scan metadata on every peak row
    data_list = [
        [seg['scan'], seg['RetTime'], seg['IonInjectionTime'], seg['InstrumentType']] + row
        for seg in segments
        for row in seg['data']
    ]
    columns = ['Scan', 'RetTime', 'IonInjectionTime', 'InstrumentType', 'm/z', 'Intensity']
    return pd.DataFrame(data_list, columns=columns)


def parse_ms2_file(filename):
    """
    Parse an MS2 file exported by RawConverter (or similar).
    Returns a DataFrame with one row per fragment peak (and per Z line, if any).
    """
    rows = []
    current_scan = {}
    current_charges = []  # (charge, precursor m/z) pairs from the Z lines of the current scan

    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue

            # 1) S line: start a new scan (values are kept as text)
            if parts[0] == 'S' and len(parts) >= 4:
                current_scan = {
                    'Scan': parts[1],         # e.g. 000009
                    'Scan2': parts[2],        # sometimes the same as parts[1]
                    'PrecursorMz': parts[3],  # e.g. 225.0737
                }
                current_charges = []

            # 2) I line: scan-level metadata, e.g. "I RetTime 0.69"
            elif parts[0] == 'I' and len(parts) >= 2:
                current_scan[parts[1]] = parts[2] if len(parts) > 2 else ''

            # 3) Z line: "Z <charge> <precursor_mz>", e.g. "Z 2 449.14012"
            elif parts[0] == 'Z' and len(parts) >= 3:
                current_charges.append((parts[1], parts[2]))

            # 4) Anything else is a fragment line if every field is numeric
            else:
                try:
                    float_vals = [float(x) for x in parts]
                except ValueError:
                    continue  # non-numeric line (e.g. file header): skip

                # One row per Z line; a single row with blank charge if none was seen
                for charge, prec_mz in (current_charges or [('', '')]):
                    row = current_scan.copy()
                    row['Charge'] = charge
                    row['ChargePrecursorMz'] = prec_mz
                    for idx, val in enumerate(float_vals):
                        row[f'FragVal_{idx + 1}'] = val
                    rows.append(row)

    df = pd.DataFrame(rows)
    cols = [
        'Scan', 'PrecursorMz',
        'RetTime', 'IonInjectionTime', 'ActivationType', 'InstrumentType',
        'TemperatureFTAnalyzer', 'Filter', 'PrecursorScan', 'PrecursorInt',
        'Charge', 'ChargePrecursorMz',
    ]
    frag_cols = [c for c in df.columns if c.startswith('FragVal_')]
    # Final column order: known metadata first, then the fragment values
    df = df[[c for c in cols if c in df.columns] + frag_cols]
    if 'RetTime' in df.columns:
        df['RetTime'] = pd.to_numeric(df['RetTime'], errors='coerce')
    return df.rename(columns={'FragVal_1': 'm/z', 'FragVal_2': 'Intensity'})

Load MS1 and MS2 data

filename = './raw_data/20250312_tunning/MS2_trial_res30K_HCD20%_MIT1000.ms1'  # Adjust to your file path
hcd20_ms1 = ms1_segment_scan_read(filename)
filename = './raw_data/20250312_tunning/MS2_trial_res30K_HCD20%_MIT1000.ms2'
hcd20_ms2 = parse_ms2_file(filename)
print(len(hcd20_ms2['Scan'].unique()), len(hcd20_ms1['Scan'].unique()))
# Plot the MS1 profile and the MS2 spectrum for one target ion
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# df_ms1 / df_ms2 are assumed to be the frames loaded above
df_ms1, df_ms2 = hcd20_ms1, hcd20_ms2

tar_mz = 197.04248
sample_start, sample_end = 37.00, 45.00  # RetTime window of the sample
ms1_scan = 2242

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("MS1 Profile around m/z 197.04248 (C7H10NaO5+)", "MS2 fragmentation spectra"),
)

# MS1 peaks within +/- 1 m/z of the target, for the chosen scan
sel_ms1_df = df_ms1[(df_ms1['m/z'] > tar_mz - 1) & (df_ms1['m/z'] < tar_mz + 1)]
scan_ms1_df = sel_ms1_df[sel_ms1_df['Scan'] == ms1_scan]
fig.add_trace(go.Bar(
    x=scan_ms1_df['m/z'],
    y=scan_ms1_df['Intensity'],
    name="MS1",
    marker=dict(color='blue'),
), row=1, col=1)

tolerance = 0.02  # m/z tolerance for matching the recorded precursor

# PrecursorMz is stored as text, so compare on a numeric copy
precursor_mz_array = df_ms2['PrecursorMz'].unique()
precursor_mz_numeric = precursor_mz_array.astype(float)
selected_mz = precursor_mz_array[
    (precursor_mz_numeric >= tar_mz - tolerance) &
    (precursor_mz_numeric <= tar_mz + tolerance)
]

sel_ms2_df = df_ms2[
    df_ms2['PrecursorMz'].isin(selected_mz)
    & (df_ms2['RetTime'] > sample_start)
    & (df_ms2['RetTime'] < sample_end)
].sort_values(by='m/z')
fig.add_trace(go.Bar(
    x=sel_ms2_df['m/z'],
    y=sel_ms2_df['Intensity'],
    name="MS2",
    hovertext=sel_ms2_df['RetTime'],  # show the retention time on hover
    marker=dict(color='red'),
), row=1, col=2)

axis_style = dict(showline=True, linecolor='black', title_font=dict(size=14, color='black'))
fig.update_xaxes(title_text="m/z", range=[tar_mz - 0.2, tar_mz + 0.2], row=1, col=1, **axis_style)
fig.update_xaxes(title_text="m/z", row=1, col=2, **axis_style)
fig.update_yaxes(title_text="Intensity", rangemode='tozero', **axis_style)  # both panels start at 0
fig.update_layout(title_text='', width=1200, height=450, showlegend=False, plot_bgcolor='white')
fig.show()

Below is an example where the MS1 signal was strong, but the corresponding MS2 spectrum failed to produce meaningful fragments, likely due to poor fragmentation settings or low ion injection efficiency.
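
One rough way to flag such cases automatically is to measure how much of the MS2 intensity still sits at the precursor m/z. The sketch below is only a heuristic of my own, with an invented spectrum; the function name and the 0.02 m/z tolerance are assumptions.

```python
def precursor_survival(fragments, precursor_mz, tol=0.02):
    """Fraction of the total MS2 intensity that remains at the precursor m/z.

    fragments: list of (mz, intensity) tuples from one MS2 spectrum.
    """
    total = sum(intensity for _, intensity in fragments)
    if total == 0:
        return None  # empty spectrum: nothing to evaluate
    surviving = sum(i for mz, i in fragments if abs(mz - precursor_mz) <= tol)
    return surviving / total

# Invented spectrum in which the precursor dominates
spectrum = [(197.0425, 9.0e5), (179.0320, 5.0e4), (137.0210, 5.0e4)]
print(precursor_survival(spectrum, 197.04248))
```

A value close to 1 suggests that the precursor barely fragmented, so a higher collision energy may be worth testing.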

4. Tool recommendation for MS2 data processing

4.1 MetFrag

A web-based tool for in-silico fragmentation and metabolite annotation. It is commonly used for identifying small molecules based on MS2 spectra.

4.2 CFM-ID

Another tool for predicting MS2 fragmentation patterns, supporting different collision energies.
Note: The website is often unstable and may be temporarily inaccessible.

4.3 MassBank

Unlike MetFrag and CFM-ID, MassBank provides a searchable database of experimentally acquired MS2 spectra. Users can compare their spectra to community-uploaded reference data under similar experimental conditions.
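
Library comparison usually relies on some spectral similarity score. The sketch below is a simple greedy cosine score between two peak lists. It is a generic illustration, not the scoring that MassBank itself uses, and the function name and tolerance are my own choices.

```python
import math

def cosine_similarity(spec_a, spec_b, tol=0.01):
    """Greedy cosine score between two peak lists of (mz, intensity) tuples."""
    used = set()
    dot = 0.0
    for mz_a, int_a in spec_a:
        # Match each query peak to the closest unused reference peak within tol
        best, best_diff = None, tol
        for j, (mz_b, _) in enumerate(spec_b):
            diff = abs(mz_a - mz_b)
            if j not in used and diff <= best_diff:
                best, best_diff = j, diff
        if best is not None:
            used.add(best)
            dot += int_a * spec_b[best][1]
    norm_a = math.sqrt(sum(i * i for _, i in spec_a))
    norm_b = math.sqrt(sum(i * i for _, i in spec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented query and reference spectra
query = [(100.0, 1.0), (150.0, 0.5)]
reference = [(100.002, 0.9), (150.001, 0.6), (180.0, 0.1)]
print(round(cosine_similarity(query, reference), 3))
```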

4.4 MSBuddy

MSBuddy is an open-source, Python-based tool designed for molecular formula assignment with MS/MS (MS2) assistance. It is particularly useful for reducing ambiguity in formula annotation, though it is not intended for structural elucidation.

The tool is simple to use and accepts .mgf files as input. An example workflow is shown below:

from importlib.metadata import version

from msbuddy import Msbuddy, MsbuddyConfig

# Print the MSBuddy version
print(f"MSBuddy version: {version('msbuddy')}")

# Define MSBuddy parameters
msb_config = MsbuddyConfig(
    ms_instr="orbitrap",  # Instrument type
    ppm=True,
    ms1_tol=10,           # MS1 tolerance (ppm)
    ms2_tol=20,           # MS2 tolerance (ppm)
    halogen=True,
    timeout_secs=200,
)

# Specify the adducts to consider
msb_config.adduct = ["[M+H]+", "[M+Na]+"]

# Instantiate the engine and load the .mgf file
msb_engine = Msbuddy(msb_config)
msb_engine.load_mgf('./raw_data/20250312_tunning/20250314_test_amb_aged/MS2_trial_res60K_HCD80%_Punjab_ambient.mgf')

# Perform formula annotation
msb_engine.annotate_formula()
results = msb_engine.get_summary()

# Print the results, one key/value pair per line
for individual_result in results:
    for key, value in individual_result.items():
        print(key, value)

The output shows a successful identification of a likely molecular formula using MS2-assisted annotation:

adduct [M+H]+
formula_rank_1 C9H10N2
estimated_fdr 0.00536238379855114
formula_rank_2 C6H11FN2O
formula_rank_3 None
formula_rank_4 None
formula_rank_5 None

In this case, the MS2 fragments support C9H10N2 as the most probable formula.
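
Since MSBuddy reads .mgf input, spectra held in Python have to be written in that format first. Below is a minimal sketch of an MGF writer. The dictionary keys, the fixed `CHARGE=1+` line, and the example spectrum are my own assumptions, and I have not checked which fields MSBuddy strictly requires.

```python
import os
import tempfile

def write_mgf(spectra, path):
    """Write spectra to a Mascot Generic Format (.mgf) text file.

    spectra: list of dicts with keys 'title', 'precursor_mz', 'rt_seconds' and
             'peaks', where 'peaks' is a list of (mz, intensity) tuples.
    """
    with open(path, "w", encoding="utf-8") as fh:
        for spec in spectra:
            fh.write("BEGIN IONS\n")
            fh.write(f"TITLE={spec['title']}\n")
            fh.write(f"PEPMASS={spec['precursor_mz']:.5f}\n")
            fh.write("CHARGE=1+\n")  # assumes singly charged positive ions
            fh.write(f"RTINSECONDS={spec['rt_seconds']:.2f}\n")
            for mz, intensity in spec["peaks"]:
                fh.write(f"{mz:.5f} {intensity:.1f}\n")
            fh.write("END IONS\n\n")

# Invented spectrum written to a temporary file
example = [{
    "title": "scan_9",
    "precursor_mz": 197.04248,
    "rt_seconds": 41.4,
    "peaks": [(95.0491, 1200.0), (137.0233, 860.0)],
}]
out_path = os.path.join(tempfile.gettempdir(), "example_spectra.mgf")
write_mgf(example, out_path)
with open(out_path, encoding="utf-8") as fh:
    text = fh.read()
print(text)
```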

4.5 SIRIUS

The SIRIUS platform is a GUI-based software for analyzing MS2 datasets (it also has a command-line version, which I have not tried yet). SIRIUS combines library matching, in-silico fragmentation, and de novo molecular formula prediction to rank candidate molecular structures based on how well they match the observed MS2 fragments.

Below is an example output from my dataset, showcasing the top-ranked structure based on fragment scoring.

References

  1. Assress, H. A.; Ferruzzi, M. G.; Lan, R. S. Optimization of Mass Spectrometric Parameters in Data Dependent Acquisition for Untargeted Metabolomics on the Basis of Putative Assignments. J. Am. Soc. Mass Spectrom. 2023, 34(8), 1621–1631. https://doi.org/10.1021/jasms.3c00084

  2. Defossez, E.; Bourquin, J.; von Reuss, S.; et al. Eight key rules for successful data‐dependent acquisition in mass spectrometry‐based metabolomics. Mass Spectrometry Reviews 2023, 42(1), 131–143. https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/full/10.1002/mas.21715

  3. McEachran, A. D.; Balabin, I.; Cathey, T.; Transue, T. R.; Al-Ghoul, H.; Grulke, C.; …; Williams, A. J. Linking in silico MS/MS spectra with chemistry data to improve identification of unknowns. Scientific Data 2019, 6(1), 141. https://www.nature.com/articles/s41597-019-0145-z

  4. Xing, S.; Shen, S.; Xu, B.; Li, X.; Huan, T. BUDDY: molecular formula discovery via bottom-up MS/MS interrogation. Nature Methods 2023, 20(6), 881–890. https://www.nature.com/articles/s41592-023-01850-x

  5. Dührkop, K.; Fleischauer, M.; Ludwig, M.; Aksenov, A. A.; Melnik, A. V.; Meusel, M.; …; Böcker, S. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nature Methods 2019, 16(4), 299–302. https://www.nature.com/articles/s41592-019-0344-8
