Preprocess Seismic Data

In this notebook, we pre-process the seismic data downloaded in the previous section 1_download_data.ipynb. This step is essential for the successful application of the detection and location workflow.

[14]:
import BPMF
import glob
import numpy as np
import os

import obspy as obs

from datetime import timedelta
from tqdm import tqdm
[15]:
DATE = obs.UTCDateTime("2012-07-26")
dirpath_data = os.path.join(BPMF.cfg.INPUT_PATH, str(DATE.year), DATE.strftime("%Y%m%d"))

Build the data reader

BPMF allows any data formats to be used. The price for flexibility is to define a function that reads the data and returns obspy.Stream.

[16]:
def data_reader_mseed(
    where,
    network="*",
    station="*",
    channel="*",
    location="*",
    starttime=None,
    endtime=None,
    attach_response=False,
    data_folder="",
    **kwargs
):
    """Data reader for BPMF.

    This data reader is specifically designed for the folder tree convention
    that we use in the tutorial. We will use the same data reader at later
    stages of the workflow.

    Parameters
    -----------
    where: string
        Path to data file or root data folder.
    network: string or list, optional
        Code(s) of the target network(s).
    station: string or list, optional
        Code(s) of the target station(s).
    channel: string or list, optional
        Code(s) of the target channel(s).
    location: string or list, optional
        Code(s) of the target location(s).
    starttime: string or obspy.UTCDateTime, optional
        Target start time.
    endtime: string or obspy.UTCDateTime, optional
        Target end time.
    attach_response: boolean, optional
        If True, find the instrument response from the xml files
        and attach it to the obspy.Stream output instance.
    data_folder: string, optional
        If given, is the child folder in `where` containing
        the mseed files to read.

    Returns
    -------
    traces: obspy.Stream
        The seismic data.
    """
    from obspy import Stream

    traces = Stream()
    # read your data into traces
    data_files = glob.glob(
        os.path.join(where, data_folder, f"{network}.{station}.{location}.{channel}*")
    )
    for fname in data_files:
        traces += obs.read(fname, starttime=starttime, endtime=endtime, **kwargs)
    if attach_response:
        resp_files = glob.glob(
            os.path.join(where, "resp", f"{network}.{station}.xml")
        )
        invs = list(map(obs.read_inventory, resp_files))
        traces.attach_response(invs)
    return traces

In general, you do want to make sure you personalize this data reader for your own folder architecture. If you choose to use a similar file architecture as in this tutorial, you may want to use a more complex version of this data reader, available at BPMF.data_reader_examples.data_reader_mseed.

Read the raw data

[17]:
raw_traces = data_reader_mseed(dirpath_data, attach_response=True, data_folder="raw")
[18]:
print(raw_traces.__str__(extended=True))

print(raw_traces[0].stats.response)
24 Trace(s) in Stream:
YH.DE08..BHE | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DE07..BHN | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DE08..BHZ | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DD06..BHN | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DC07..BHZ | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.SAUV..HHZ | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 100.0 Hz, 8740001 samples
YH.DC07..BHE | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.SAUV..HHE | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 100.0 Hz, 8740001 samples
YH.DC08..BHN | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.SPNC..BHZ | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.SPNC..BHE | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DC06..BHN | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DC06..BHZ | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DC06..BHE | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DC07..BHN | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.SAUV..HHN | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 100.0 Hz, 8740001 samples
YH.DC08..BHE | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DC08..BHZ | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.SPNC..BHN | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DE07..BHZ | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DE08..BHN | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DE07..BHE | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DD06..BHE | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DD06..BHZ | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
Channel Response
        From M/S (Velocity in Meters per Second) to COUNTS (Digital Counts)
        Overall Sensitivity: 2.45935e+09 defined at 1.000 Hz
        6 stages:
                Stage 1: PolesZerosResponseStage from M/S to V, gain: 2420
                Stage 2: CoefficientsTypeResponseStage from V to COUNTS, gain: 1.01626e+06
                Stage 3: FIRResponseStage from COUNTS to COUNTS, gain: 1
                Stage 4: FIRResponseStage from COUNTS to COUNTS, gain: 1
                Stage 5: FIRResponseStage from COUNTS to COUNTS, gain: 1
                Stage 6: FIRResponseStage from COUNTS to COUNTS, gain: 1

Preprocess the data

First, we call obspy.Stream.merge() to group the different segments of a same channel within a single trace.

[19]:
raw_traces.merge()
[19]:
24 Trace(s) in Stream:

YH.DC06..BHE | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
...
(22 other traces)
...
YH.SPNC..BHZ | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples

[Use "print(Stream.__str__(extended=True))" to print all Traces]
[20]:
# there is now a single trace per channel
print(raw_traces.__str__(extended=True))
24 Trace(s) in Stream:
YH.DC06..BHE | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DC06..BHN | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DC06..BHZ | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DC07..BHE | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DC07..BHN | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DC07..BHZ | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DC08..BHE | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DC08..BHN | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DC08..BHZ | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DD06..BHE | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DD06..BHN | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DD06..BHZ | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DE07..BHE | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DE07..BHN | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DE07..BHZ | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DE08..BHE | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DE08..BHN | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.DE08..BHZ | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.SAUV..HHE | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 100.0 Hz, 8740001 samples
YH.SAUV..HHN | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 100.0 Hz, 8740001 samples
YH.SAUV..HHZ | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 100.0 Hz, 8740001 samples
YH.SPNC..BHE | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.SPNC..BHN | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples
YH.SPNC..BHZ | 2012-07-25T23:51:40.000000Z - 2012-07-27T00:08:20.000000Z | 50.0 Hz, 4370001 samples

Then, we use BPMF.utils.preprocess_stream with the below-defined parameters (some parameters were already defined when initializing the project). We do not correct for the full instrument response, as the response is mostly flat in the frequency band of interest. However, we do correct for the instrument sensitivity so that the waveform amplitudes are in physical units [m/s].

[21]:
# preprocessing parameters
freqmin = BPMF.cfg.MIN_FREQ_HZ
freqmax = BPMF.cfg.MAX_FREQ_HZ
target_SR = BPMF.cfg.SAMPLING_RATE_HZ
target_starttime = DATE
target_endtime = DATE + timedelta(days=1.)
remove_response = False
remove_sensitivity = True
[22]:
preprocessed_traces = BPMF.utils.preprocess_stream(
    raw_traces,
    freqmin=freqmin,
    freqmax=freqmax,
    target_SR=target_SR,
    target_starttime=target_starttime,
    target_endtime=target_endtime,
    remove_response=remove_response,
    remove_sensitivity=remove_sensitivity,
    antialiasing_filter_order=8, # make it sharp
)
[23]:
print(preprocessed_traces.__str__(extended=True))
24 Trace(s) in Stream:
YH.DC06..BHE | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.DC06..BHN | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.DC06..BHZ | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.DC07..BHE | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.DC07..BHN | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.DC07..BHZ | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.DC08..BHE | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.DC08..BHN | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.DC08..BHZ | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.DD06..BHE | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.DD06..BHN | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.DD06..BHZ | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.DE07..BHE | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.DE07..BHN | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.DE07..BHZ | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.DE08..BHE | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.DE08..BHN | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.DE08..BHZ | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.SAUV..HHE | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.SAUV..HHN | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.SAUV..HHZ | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.SPNC..BHE | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.SPNC..BHN | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples
YH.SPNC..BHZ | 2012-07-26T00:00:00.000000Z - 2012-07-26T23:59:59.960000Z | 25.0 Hz, 2160000 samples

Plot the preprocessed traces

[24]:
%config InlineBackend.figure_formats = ["svg"]

fig = preprocessed_traces.plot(equal_scale=False)
../../_images/tutorial_notebooks_2_preprocess_data_18_0.svg

Write the preprocessed data

Save the preprocessed data to the data base.

[25]:
preproc_folder_name = f"preprocessed_{BPMF.cfg.MIN_FREQ_HZ:.0f}_{BPMF.cfg.MAX_FREQ_HZ:.0f}"
print("The preprocessed data will be stored at:", os.path.join(dirpath_data, preproc_folder_name))
if not os.path.isdir(os.path.join(dirpath_data, preproc_folder_name)):
    os.makedirs(os.path.join(dirpath_data, preproc_folder_name))
The preprocessed data will be stored at: ../BPMF_data/2012/20120726/preprocessed_2_12
[26]:
for tr in tqdm(preprocessed_traces, desc="Writing preprocessed data"):
    tr.write(os.path.join(dirpath_data, preproc_folder_name, f"{tr.id}_{DATE.strftime('%Y%m%d')}.mseed"), encoding="FLOAT64", format="mseed")
# tip: it is most likely that you don't need the float64 precision. Try converting all tr.data to float32 and use encoding="FLOAT32", you will
# reduce the memory footprint of your preprocessed data set by 2!
Writing preprocessed data: 100%|██████████| 24/24 [00:05<00:00,  4.32it/s]