Build Template Database

In this notebook, we use the events detected in notebook 5, and further characterized in notebooks 6 and 7, to build a database of template events. This step mostly consists in building data and metadata files, but it also lets us select events based on quality and group similar events together. We want to avoid having too many similar templates in the database, because similar templates detect the same events and therefore cause useless, redundant computation during the template matching process.

[1]:
import os
from multiprocessing import cpu_count
n_CPUs = min(cpu_count(), 24)  # use all available cores, capped at 24
os.environ["OMP_NUM_THREADS"] = str(n_CPUs)

import BPMF
import h5py as h5
import matplotlib.pyplot as plt
import numpy as np
from time import time as give_time
from tqdm import tqdm

from BPMF.data_reader_examples import data_reader_mseed

[2]:
%config InlineBackend.figure_formats = ["svg"]
[3]:
# program parameters
DB_FILENAME = "reloc_bp.h5"
DB_PATH_T = "template_db"
DATA_FOLDER = "preprocessed_2_12"
NETWORK_FILENAME = "network.csv"

[4]:
# read network metadata
net = BPMF.dataset.Network(NETWORK_FILENAME)
net.read()

Read the database of detected events

We read the database of events detected with backprojection and use them to create BPMF.dataset.Template instances.

[5]:
# HMAX_LOCATION_UNCERTAINTY_KM: All events with hmax_unc < HMAX_LOCATION_UNCERTAINTY_KM will be selected
#                               as candidate template events
HMAX_LOCATION_UNCERTAINTY_KM = 5.0
[6]:
templates = []
tid = 0

with h5.File(os.path.join(BPMF.cfg.OUTPUT_PATH, DB_FILENAME), mode="r") as f:
    print(f"Checking {DB_FILENAME}...")
    for key in tqdm(f.keys()):
        event = BPMF.dataset.Event.read_from_file(
            DB_FILENAME, db_path=BPMF.cfg.OUTPUT_PATH, gid=key
        )
        if "NLLoc_reloc" in event.aux_data:
            if event.hmax_unc < HMAX_LOCATION_UNCERTAINTY_KM:
                # if event was well relocated with NLLoc, use its location
                event.set_moveouts_to_theoretical_times()
        event.where = os.path.join(
            BPMF.cfg.INPUT_PATH,
            str(event.origin_time.year),
            event.origin_time.strftime("%Y%m%d"),
        )
        event.read_waveforms(
            BPMF.cfg.TEMPLATE_LEN_SEC,
            data_folder=DATA_FOLDER,
            data_reader=data_reader_mseed,
            n_threads=4
        )
        template = BPMF.dataset.Template.init_from_event(event)
        template.id = tid
        template.set_aux_data({"tid": tid})
        templates.append(template)
        tid += 1

Checking reloc_bp.h5...
Does not have a `arrival_times` attribute.
100%|██████████| 17/17 [00:02<00:00,  7.45it/s]
Does not have a `arrival_times` attribute.

Remove redundant templates

We first create a BPMF.dataset.TemplateGroup instance with the templates built in the previous cell. This class offers a number of methods to operate on groups of templates.

[7]:
template_group = BPMF.dataset.TemplateGroup(templates, net)
# this trick is necessary to make the class aware that the waveforms are already
# attached to the Template instances (_update_attributes should never be called at
# any other time of the workflow).
template_group._update_attributes.append("read_waveforms")
# normalize the waveforms
template_group.normalize()
[8]:
# parameters controlling how the inter-template correlation matrix is computed
DISTANCE_THRESHOLD_KM = 5.0  # in km, criterion on inter-template ellipsoid separation
N_CLOSEST_STATIONS = 10  # larger than the total number of stations, i.e. use all stations
MAX_LAG_SAMP = 5  # in samples, maximum time shift allowed when searching for similarity

TemplateGroup.compute_intertemplate_cc computes the network-averaged correlation coefficient between templates. To account for possible location errors – and, therefore, different alignments of the P- and S-wave arrivals – we search for the maximum correlation within +/- MAX_LAG_SAMP samples.
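To illustrate the idea behind the lag search (this is a toy sketch, not the BPMF implementation, and `max_lag_cc` is a hypothetical helper), the following computes the maximum normalized correlation between two single-channel traces over a range of time shifts:

```python
import numpy as np

def max_lag_cc(x, y, max_lag):
    """Toy illustration: maximum normalized cross-correlation between two
    1D traces, searched within +/- max_lag samples."""
    # standardize both traces (zero mean, unit std)
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    n = len(x)
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[lag:], y[: n - lag]
        else:
            a, b = x[: n + lag], y[-lag:]
        cc = np.sum(a * b) / len(a)  # normalized CC at this lag
        best = max(best, cc)
    return best

rng = np.random.default_rng(0)
sig = rng.normal(size=200)
shifted = np.roll(sig, 3)  # same waveform, shifted by 3 samples
print(f"{max_lag_cc(sig, shifted, 5):.2f}")
```

Even though the two traces are misaligned by 3 samples, the CC found within +/- 5 samples is close to 1; with a zero-lag-only comparison it would be near 0.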

[9]:
t1 = give_time()
template_group.compute_intertemplate_cc(
    distance_threshold=DISTANCE_THRESHOLD_KM,
    n_stations=N_CLOSEST_STATIONS,
    save_cc=False,
    max_lag=MAX_LAG_SAMP,
    compute_from_scratch=True,
)
t2 = give_time()
print(f"Computed inter-template CCs in {t2-t1:.2f}sec.")
Computing the similarity matrix...
Computing the inter-template directional errors...
/home/ebeauce/miniconda3/envs/hy7_py310/lib/python3.10/site-packages/BPMF/dataset.py:4436: RuntimeWarning: invalid value encountered in divide
  unit_direction /= np.sqrt(np.sum(unit_direction**2, axis=1))[
Computed inter-template CCs in 2.18sec.
[10]:
template_group.intertemplate_cc
[10]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 1.000000 0.477128 0.000000 0.412416 0.616547 0.346673 0.335522 0.379354 0.299425 0.000000 0.0 0.741204 0.532904 0.558584 0.317324 0.000000 0.175184
1 0.477128 1.000000 0.000000 0.468460 0.437401 0.319153 0.374733 0.434762 0.300028 0.000000 0.0 0.493984 0.389494 0.671390 0.459654 0.000000 0.185049
2 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.174151 0.0 0.000000 0.000000 0.000000 0.000000 0.190878 0.000000
3 0.412416 0.468460 0.000000 1.000000 0.410000 0.323889 0.369112 0.381414 0.303599 0.000000 0.0 0.410642 0.412339 0.461057 0.504043 0.000000 0.189004
4 0.616547 0.437401 0.000000 0.410000 1.000000 0.458776 0.349111 0.374305 0.286568 0.000000 0.0 0.584560 0.595355 0.407033 0.388314 0.000000 0.198617
5 0.346673 0.319153 0.000000 0.323889 0.458776 1.000000 0.355300 0.269141 0.295206 0.000000 0.0 0.409151 0.379693 0.340360 0.288253 0.000000 0.182702
6 0.335522 0.374733 0.000000 0.369112 0.349111 0.355300 1.000000 0.341919 0.287617 0.000000 0.0 0.352310 0.280423 0.389386 0.343758 0.000000 0.168698
7 0.379354 0.434762 0.000000 0.381414 0.374305 0.269141 0.341919 1.000000 0.307829 0.000000 0.0 0.431928 0.369352 0.474225 0.397337 0.000000 0.155042
8 0.299425 0.300028 0.000000 0.303599 0.286568 0.295206 0.287617 0.307829 1.000000 0.000000 0.0 0.340576 0.288347 0.312245 0.298257 0.000000 0.153908
9 0.000000 0.000000 0.174151 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.183477 0.000000
10 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
11 0.741204 0.493984 0.000000 0.410642 0.584560 0.409151 0.352310 0.431928 0.340576 0.000000 0.0 1.000000 0.514565 0.669701 0.413737 0.000000 0.162423
12 0.532904 0.389494 0.000000 0.412339 0.595355 0.379693 0.280423 0.369352 0.288347 0.000000 0.0 0.514565 1.000000 0.451129 0.444074 0.000000 0.164250
13 0.558584 0.671390 0.000000 0.461057 0.407033 0.340360 0.389386 0.474225 0.312245 0.000000 0.0 0.669701 0.451129 1.000000 0.574083 0.000000 0.157772
14 0.317324 0.459654 0.000000 0.504043 0.388314 0.288253 0.343758 0.397337 0.298257 0.000000 0.0 0.413737 0.444074 0.574083 1.000000 0.000000 0.179577
15 0.000000 0.000000 0.190878 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.183477 0.0 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
16 0.175184 0.185049 0.000000 0.189004 0.198617 0.182702 0.168698 0.155042 0.153908 0.000000 0.0 0.162423 0.164250 0.157772 0.179577 0.000000 1.000000
[11]:
# set the threshold above which two templates are considered redundant
# this parameter is quite arbitrary and really depends on how much computation
# time you are willing to spend on the matched-filter search
# note that templates with very low correlation coefficients (~0.10) can still
# detect many events in common
SIMILARITY_THRESHOLD = 0.50
[12]:
templates_filtered = []
tids_processed = set()
for tid in template_group.tids:
    if tid in tids_processed:
        continue
    tt = template_group.tindexes.loc[tid]
    similar = template_group.intertemplate_cc.loc[tid] > SIMILARITY_THRESHOLD
    similar_tids = set(template_group.intertemplate_cc[similar].index)
    # comment the following line for a more conservative linkage
    similar_tids.difference_update(tids_processed)
    n_similar = len(similar_tids)
    if n_similar > 1:
        print(f"Template {tid} is similar to: {similar_tids}")
        best_tid = -10
        best_unc = 10000000.0
        for tid_sim in similar_tids:
            tt_sim = template_group.tindexes.loc[tid_sim]
            unc = np.sqrt(
                template_group.templates[tt_sim].hmax_unc ** 2
                + template_group.templates[tt_sim].vmax_unc ** 2
            )
            if unc < best_unc:
                best_unc = unc
                best_tid = tid_sim
        print(f"Keeping only Template {best_tid}")
        ## uncomment if you're using the more conservative linkage
        #if best_tid in tids_processed:
        #    continue
        tids_processed.update(similar_tids)
        templates_filtered.append(
            template_group.templates[
                template_group.tindexes.loc[best_tid]
            ]
        )
    elif n_similar == 0:
        print(f"Problem with Template {tid}, it should at least be similar to itself")
    else:
        tids_processed.add(tid)
        templates_filtered.append(template_group.templates[tt])

Template 0 is similar to: {0, 4, 11, 12, 13}
Keeping only Template 0
Template 3 is similar to: {3, 14}
Keeping only Template 14

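The filtering loop above can be condensed into a toy sketch (this is illustrative only, with a made-up CC matrix and made-up uncertainties, not the notebook's data): greedily form a group of mutually similar templates around each unprocessed template, then keep the group member with the smallest location uncertainty.

```python
import pandas as pd

# hypothetical inter-template CC matrix: templates 0, 1, 2 are mutually
# similar; template 3 stands alone
tids = [0, 1, 2, 3]
cc = pd.DataFrame(
    [[1.0, 0.8, 0.7, 0.1],
     [0.8, 1.0, 0.6, 0.1],
     [0.7, 0.6, 1.0, 0.1],
     [0.1, 0.1, 0.1, 1.0]],
    index=tids, columns=tids,
)
# hypothetical location uncertainties: the best-located template wins its group
uncertainty = {0: 3.0, 1: 1.5, 2: 2.0, 3: 4.0}

THRESHOLD = 0.5
kept, processed = [], set()
for tid in tids:
    if tid in processed:
        continue
    # all templates similar to tid that were not already assigned to a group
    group = set(cc.index[cc.loc[tid] > THRESHOLD]) - processed
    processed |= group
    # keep the member with the smallest location uncertainty
    kept.append(min(group, key=uncertainty.get))
print(kept)
```

Here templates {0, 1, 2} collapse onto template 1 (smallest uncertainty) and template 3 is kept as-is, so the loop prints [1, 3].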
Plot the similar templates

How good was our criterion for grouping similar templates together? Let’s check the waveforms of the similar templates. The previous cell told us that template 0 is similar to templates 4, 11, 12 and 13.

[13]:
tid = 0
fig = template_group.templates[template_group.tindexes.loc[tid]].plot(figsize=(10, 10))
fig.suptitle(f"Template {tid}")
[13]:
Text(0.5, 0.98, 'Template 0')
../../_images/tutorial_notebooks_7_build_template_database_17_1.svg
[14]:
tid = 4
fig = template_group.templates[template_group.tindexes.loc[tid]].plot(figsize=(10, 10))
fig.suptitle(f"Template {tid}")
[14]:
Text(0.5, 0.98, 'Template 4')
../../_images/tutorial_notebooks_7_build_template_database_18_1.svg

Let’s plot the waveforms of all the templates similar to template 0 on station SAUV, east component (SAUV.E).

[15]:
from scipy.signal import hilbert

similar_tids = [0, 4, 11, 12, 13]
fig = plt.figure("waveforms_similar_templates", figsize=(15, 7))
ax = fig.add_subplot(111)
for tid in similar_tids:
    tp = template_group.templates[template_group.tindexes.loc[tid]]
    wav = tp.traces.select(station="SAUV", component="E")[0].data
    norm = np.std(wav)
    # align traces on max of envelope
    time = np.arange(len(wav))
    time -= np.abs(hilbert(wav)).argmax()
    # time -= np.abs(wav).argmax()
    if tid == 0:
        ax.plot(time, wav/norm, color="k", lw=1.5)
    else:
        ax.plot(time, wav/norm, color="dimgrey", lw=0.75)
ax.set_xlabel("Time to max (samples)")
ax.set_ylabel("Velocity (AU)")
[15]:
Text(0, 0.5, 'Velocity (AU)')
../../_images/tutorial_notebooks_7_build_template_database_20_1.svg

These templates are similar enough that they would detect the same events, wasting time on redundant computation and writing redundant information to the database.

Save the non-redundant templates in a database

[16]:
output_dir = os.path.join(BPMF.cfg.OUTPUT_PATH, DB_PATH_T)
if not os.path.isdir(output_dir):
    os.mkdir(output_dir)

for tid, template in enumerate(templates_filtered):
    # re-assign consecutive template ids after filtering
    template.id = tid
    template.set_aux_data({"tid": tid})
    template.write(
        f"template{tid}.h5",
        db_path=output_dir,
        overwrite=True,
    )