OrbiTrack Dev Log 1: Parsing Orbitrap MS1 Data

The ultra-high resolution of Orbitrap mass spectrometry enables an unprecedented level of chemical detail. In my current research, we employ a direct-infusion Orbitrap approach (EESI inlet, Felipe et al, 2019) for non-targeted analysis. However, due to the lack of dedicated tools tailored for this application, I developed a new software package called OrbiTrack. This tool supports two core workflows:

(1) TOF peak guidance: Guiding ion fitting in EESI-TOF using high-resolution chemical information acquired at 120k resolution from Orbitrap.

(2) Directly utilization of Orbitrap MS1 data by integrating full time-series MS1 outputs, enabling online untargted molecular analysis without chromatographic sepeartion.

This blog series will introduce the OrbiTrack framework step by step, covering the conceptual foundations and sharing the original scripts for each component:

  1. Raw data parsing (covered in this post)
  2. Dynamic m/z calibration link
  3. Ion filtering and clustering link
  4. TOF missing peak supplement link
  5. TOF integration/refiner (optional, used when leveraging Orbitrap-resolved ions to support TOF fitting) link

Feedback and suggestions are highly welcome. Feel free to leave a comment below or contact me via email.

In this first post, I will introduce Part 1: loading Orbitrap raw MS1 data by user-defined time period.

1. Orbitrap Working Principle

Before diving into the code examples, I’ll first summarize the Orbitrap data format and its underlying principles. This not only helps me organize my own understanding but may also be useful for others new to Orbitrap data processing.

Orbitrap raw data can be conceptually understood as a sequence of dataframes, each corresponding to a snapshot in time. Each individual dataframe contains a list of m/z values and their corresponding intensities.

As a quick reference, I will also note the simplified principle that how Orbitrap detects m/z and intensity. Ions are injected into a central trap ans ossilcated allong the central electrode. The frequency of oscillation
𝑓 is inversely proportional to the square root of the ion’s mass-to-charge ratio m/z:

$$
\begin{equation}
f \propto \frac{1}{\sqrt{m/z}} \quad \Rightarrow \quad m/z = \left( \frac{A}{f} \right)^2
\end{equation}
$$
Here, A is a constant defiend by the instrument electric field geometry.

Meanwhile, as the ions oscillate, they induce a small image current on the detector plates (similar to the microchannel plate (MCP)in TOF systems). The more ions present, the stronger the induced current. In this way, intensity is proportional to ion abundance and can be scaled and recorded.

2. Code Library

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
## Module 1. MS reading  
def get_last_retention_time(file_path):
"""
Function to read a file and extract the last retention time.

Args:
- file_path (str): Path to the file.

Returns:
- last_retention_time (float): The last retention time found in the file.
"""
last_retention_time = None # Initialize as None

with open(file_path, 'r') as f:
for line in f:
if line.startswith("I"): # Look for the lines with retention time info
parts = line.strip().split('\t')[1:]
if len(parts) == 2:
key, value = parts
if key == "RetTime":
last_retention_time = float(value) # Update with the latest retention time
return last_retention_time

def get_retention_time_indices(file_path, start_retention_time, end_retention_time):
retention_time_indices = []
current_segment_index = -1 # Initialize segment index
retention_time_value = []
with open(file_path, 'r') as f:
for line in f:
if line.startswith("S"): # Check for segment start
current_segment_index += 1 # Increment segment index
elif line.startswith("I"):
parts = line.strip().split('\t')[1:]
if len(parts) == 2:
key, value = parts

if key == "RetTime":
retention_time = float(value)
if start_retention_time <= retention_time <= end_retention_time:
retention_time_indices.append(current_segment_index)
retention_time_value.append(retention_time)
return retention_time_indices,retention_time_value

def read_ms1_segments_between_retention_times(file_path, retention_time_indices):
segments = []
current_segment = []
current_segment_index = -1
with open(file_path, 'r') as f:
for line in f:
if line.startswith("S"):
current_segment_index += 1
if current_segment_index in retention_time_indices:
if current_segment:
segments.append(current_segment)
current_segment = []
elif line.strip() and current_segment_index in retention_time_indices:
current_segment.append(line.strip().split())
if current_segment:
segments.append(current_segment)
return segments

def save_segments_as_dataframes(segments,thres):
dataframes = []
for idx, segment in enumerate(segments):
retention_time = float(segment[0][2])
df = pd.DataFrame(segment[3:], columns=["m/z", "intensity", "unknown1", "unknown2"])
# Convert columns to numeric (excluding the first two columns)
df.iloc[:, 2:] = df.iloc[:, 2:].apply(pd.to_numeric)
# Keep only the first two columns
df = df.iloc[:, :2]
df['m/z'] = df['m/z'].astype(float)
df['intensity'] = df['intensity'].astype(float)
df = df[df['intensity']>thres]
df = df.reset_index(drop = True)
df['rt'] = retention_time
# Save the DataFrame
dataframes.append(df)
return dataframes

def merge_segments(dataframes):
merged_df = pd.concat(dataframes, ignore_index=True)
return merged_df

3. Data/Parameters Input

For TOF peak list fitting, we incorporate the raw TOF spectrum output from Tofware. This dataset is used to:

  • Evaluate the fitting accuracy of Orbitrap-assigned ions, and
  • Supplement ions that may be missing or undetectable in Orbitrap.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
# ========== 1. INPUT FILE ==========
file_path = "./../../../../../Africa/Angola/Data/Raw/Orbitrap/Angola_mixed_samples_20250620020218.ms1"

# ========== 2. Orbitrap Real Time Series ==========
end_time = '2025-06-20 16:40:00' # Define end time for the time series
start_time = pd.to_datetime(end_time) - pd.Timedelta(minutes=get_last_retention_time(file_path)) # Calculate start time based on the file retention time

# ========== 3. Pre-filtering Threshold for Noisy Ions ==========
pre_filtering_threshold = 1000 # Default threshold set to 1000 to filter out noisy ions

# ========== 4. Processing Mode Selection ==========
# Options: "full_ts" for full time series or "subset" for a specific peak list subset
# mode = 'full_ts' # Use full time series mode
# start_retention_time, end_retention_time = 0, get_last_retention_time(file_path) # Set for full time series
mode = 'sub_set'
start_retention_time, end_retention_time = 3.51, 13#get_last_retention_time(file_path) # Define subset range (retention times of the place we are interested)

# ========== 5. Parameters for M/Z Calibration ==========
ideal_calibrants_mass = [
172.88347, 322.77771, 472.67196, 108.9638, 166.9163
] # 5 calibrants for m/z calibration

calibration_tolerance_ppm = 20 # Set calibration tolerance to 20 ppm
overlapped_scan_option = 1 # Handle overlapping regions in neighboring segment scans (1 to enable handling)

# ========== 6. Parameters for Filtering and Clustering ==========
# Pre-filtering based on "Air-beam" (Na2I range)
Na2I_lower_bound, Na2I_upper_bound = 0, 50 * 1E6 # Filter to remove bad air-beam ions

# Data size threshold for KDE clustering chunk
data_size_threshold = 500 # Minimum data size for performing KDE clustering

# KDE Parameters (for clustering noisy ions)
kde_bandwidth = 0.0005 # KDE bandwidth for peak density estimation
kde_percentile = 5 # Remove ions below the 5th percentile of density score (to filter out rare/noisy ions)

# DBSCAN Clustering Parameters
eps_value = 0.5 # Epsilon value for DBSCAN (maximum clustering distance in ppm)
clustering_thre = 0.025 # Retain clusters appearing more than 2.5% of Na2I showing frequency

# ========== 7. EESI-TOF reference ==========
# tof_spectrum = pd.read_excel("./../Input/EESI-TOF_reference/EESI_LTOF_raw_curve_Punjab.xlsx")
# tof_add_ion_threshold = 50 # 50 ppm means there is no Orbitrap ions within 50 ppm window of a EESI-TOF identifed peak apex
# inorganic_ion_file = './../Input/EESI-TOF_reference/EESI_TOF_common_inorganic_ions.csv'
# tof_peak_list_template = './../Input/EESI-TOF_reference/ToFware_peak_list_template.txt'

# ========== 8. Peak List Output and refining parameters ==========
peak_list_prefix = './output/ToF_peak_list/20250720_Angola_Orbitrap+TOF_peak_list'
refining_distance,refining_ratio = 25, 4

4. Applying the Reading data within user-defined time period

The segments refer to raw data extracted from the Orbitrap .ms1 text files, based on user-defined start and end retention times. Users can specify the desired time window for processing and optionally read multiple time segments to be merged in later steps.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
retention_time_indices,rt_values = get_retention_time_indices(file_path, start_retention_time, end_retention_time)
segments = read_ms1_segments_between_retention_times(file_path, retention_time_indices)
df_segments = save_segments_as_dataframes(segments, pre_filtering_threshold)
merged_spectrum = merge_segments(df_segments)
merged_spectrum = merged_spectrum.sort_values(by = 'm/z', ascending=True)
merged_spectrum = merged_spectrum.reset_index(drop=True)
```
The `showing_freq`, representing the recurrence frequency of a primary ion within the defined time window, is crucial for subsequent analysis. It reflects how consistently an ion appears and can serve as a basis for filtering other less stable or noise-like signals.

```python
tolerance_ppm = 20 ## need to re-adjust based on the actual data
peak_results = {}
masses = [172.88347, 322.77771, 472.67196,622.566700,108.9638,166.9163]#
for mass in masses:
peaks = find_peak_near_mass(merged_spectrum, mass, tolerance_ppm)
peak_results[mass] = peaks
showing_freq = len(peak_results[masses[0]])
print (showing_freq)

We can also evaluate how closely the measured m/z values of calibrants match their theoretical values before applying calibration. This step is useful for pre-defining the tolerance_ppm parameter.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
subplot_titles = ['Na2I', 'Na3I2', 'Na4I3','Na2[15]NO3', 'Na3[34]SO4']
masses = [172.88347, 322.77771, 472.67196,108.9638,166.9163][0:]
colors = ['#447f44', '#cd4040', '#636efa', '#ddc927', '#9271dd','#2d85cc']
subplot_titles = []
for i in range(len(masses)):
ppm_differences = calculate_ppm(np.array(peak_results[masses[i]]['m/z']), masses[i])
mean_ppm = np.mean(ppm_differences)
subplot_titles.append(f"{['Na2I', 'Na3I2', 'Na4I3', 'Na2[15]NO3', 'Na3[34]SO4'][i]}: {mean_ppm:.2f} ppm")
fig = make_subplots(rows=1, cols=5, subplot_titles=subplot_titles)
for i in range(len(masses)):
fig.add_trace(
go.Histogram(x=peak_results[masses[i]]['m/z'], marker=dict(color=colors[i % len(colors)])),
row=1, col=i+1
)
fig.add_vline(x=masses[i], line_width=2, line_dash="dash", line_color="grey", row=1, col=i+1)

fig.update_xaxes(title_text="m/z", showline=True, linecolor='black', range=[masses[i]-0.0025, masses[i]+0.0025], col=i+1, row=1)
fig.update_yaxes(title_text="Count", showline=True, linecolor='black', col=i+1, row=1)
fig.update_layout(
height=400,
width=1500,
title_text="Calibrant Raw M/Z Distribution",
plot_bgcolor='white',
showlegend=False
)
fig.show()

OrbiTrack Dev Log 2: Dynamic M/Z Calibration 读书笔记-张笑宇文明三部曲

Comments

Your browser is out-of-date!

Update your browser to view this website correctly. Update my browser now

×