SeisIO: a fast, efficient geophysical data architecture for the Julia language

Joshua P. Jones 1*, Kurama Okubo 2, Tim Clements 2, and Marine A. Denolle 2

1 4509 NE Sumner St., Portland, OR, USA

2 Department of Earth and Planetary Sciences, Harvard University, MA, USA

* Corresponding author: Joshua P. Jones ([email protected])

Abstract

SeisIO for the Julia language is a new geophysical data framework that combines the intuitive syntax of a high-level language with performance comparable to FORTRAN or C. Benchmark comparisons with recent versions of popular programs for seismic data download and analysis demonstrate significant improvements in file read speed and orders-of-magnitude improvements in memory overhead. Because the Julia language natively supports parallel computing with an intuitive syntax, we benchmark parallel download and processing of multi-week segments of contiguous data from two sets of 10 broadband seismic stations, and find that SeisIO outperforms two popular Python-based tools for data downloads. The current capabilities of SeisIO include file read support for several geophysical data formats; online data access using FDSN web services, IRIS web services, and SeisComP SeedLink; and optimized versions of several common data processing operations. Tutorial notebooks and extensive documentation are available to improve the user experience (UX). As an accessible example of performant scientific computing for the next generation of researchers, SeisIO offers ease of use and rapid learning without sacrificing computational performance.


1 Introduction

The dramatic growth in the volume of collected geophysical data has the potential to lead to tremendous advances in the science (https://ds.iris.edu/data/distribution/). Leveraging the data revolution to gain knowledge that is useful for earthquake science, hydrology, industry, and climate science requires new tools to help Earth scientists extract meaningful information from arbitrarily large data sets. High-performance computing is necessary to manage the scale of these problems; however, this requires specialized training at the undergraduate and graduate levels, and such training is rarely part of undergraduate-level science curricula. On the other hand, open-source computing languages (e.g., Python) and codes (e.g., ObsPy; Beyreuther et al., 2010) have standardized seismic data processing and improved access to seismic data analysis for a new generation of seismologists. However, these tools suffer from slow computation time and inefficient memory allocation at scale. Therefore, the geophysics community is in need of a computational framework that is simultaneously easy to learn and efficient.

The Julia language combines the syntactic ease of high-level languages like MATLAB and Python with the performance of FORTRAN and C. Julia was developed for fast, efficient numerical computing: the first beta version appeared in February 2012, and version 1.0.0 was released in August 2018 (Bezanson et al., 2017, 2018). The language is known for impressive speed and computational efficiency: while still in beta testing, Julia became the fourth programming language to achieve a petaflop, after FORTRAN, C, and C++ (Regier et al., 2019; Perkel, 2019). Despite its relative youth, Julia supports a growing collection of open-source modules for numerical and scientific computing. Julia wrappers to C, FORTRAN, R, and Python allow seamless execution of external code, and third-party packages (https://github.com/JuliaInterop) extend interoperability to C++, Java, Mathematica, and MATLAB, including the ability to read .mat files (https://github.com/JuliaIO/MAT.jl).


2 SeisIO

The SeisIO package was created in May 2016 with the goal of rapid, efficient analysis of univariate geophysical data in the Julia language, using comprehensible, uniform syntax, and simple but powerful commands. Its design allows users to read univariate data from arbitrary instruments (e.g., seismic, geodetic, gas flux) into a single structure, including gapped and irregularly-sampled data. In the subsections below, we describe the capabilities of SeisIO, conduct benchmark tests, and introduce tutorials.

2.1 Capabilities

SeisIO includes well-tested read support for many geophysical time-series formats (Table 1). Readers for all formats but ASDF strictly use the Julia language; the ASDF reader uses wrappers to libhdf5, written in C. Current data processing operations include filling time gaps, removing the mean and linear trend, band-pass filtering, instrument response translation and removal (i.e., flattening to DC), resampling, cosine tapering, merging, seismogram differentiation/integration, and time synchronization. Tools for online acquisition support FDSN services (station, event, and dataselect), IRIS time-series requests, FDSN SeedLink, and the IRIS TauP interface (Crotwell et al., 1999).

SeisIO has been officially listed in the Julia package ecosystem since early 2019. Automated testing with Travis-CI (https://travis-ci.org/) and AppVeyor (https://www.appveyor.com/) supports Linux, Mac OS, and Windows installations. Code coverage estimates of 97-98% on Codecov (https://codecov.io/) and Coveralls (https://coveralls.io/) exceed the 95% coverage threshold typical of enterprise-level commercial software releases, yet both Julia and SeisIO are free.


2.2 Installation

Typical installation of the Julia language, SeisIO, and all dependencies requires three total steps:

1. Download and install the Julia language from https://julialang.org/downloads/

   • The Julia install directory will be denoted (juliaroot) hereafter.

   • (juliaroot) is typically a pattern like /home/username/julia-v.v.v/ in Linux, e.g., /home/josh/julia-1.1.0/.

2. Start the Julia command-line interface (CLI) with (juliaroot)/bin/julia

3. Type or copy: using Pkg; Pkg.add("SeisIO"); using SeisIO

Julia installs package dependencies automatically when Pkg.add is invoked. There is no need for dedicated environments or session-specific user settings; however, FFT performance can sometimes be improved by starting Julia in parallel-ready mode with (juliaroot)/bin/julia --procs auto. Total disk space required is typically under 4 GB: 300-400 MB for Julia; 4.2 MB for SeisIO v0.4.1; 300 MB for optional test and benchmark data; and 1-3 GB for a typical set of scientific computing packages. The last space requirement is much lower for non-Windows users who manually link existing libraries and software (e.g., BLAS, Conda, FFTW) to Julia, but this is only recommended for experienced Linux users.

2.3 SeisIO Data Structure

SeisIO is designed around easy, fluid, and fast data access. For example, a complete sequence of commands to download and process channel data can be executed in one function call with keywords:


julia> S = get_data("FDSN", "UW.LON..BH?", src="IRIS", s="2019-01-01", t=3600, detrend=true, rr=true, w=true)

SeisData with 3 channels (2 shown)
    ID: UW.LON..BHE                        UW.LON..BHN                        ...
  NAME: Longmire CREST broad-band          Longmire CREST broad-band          ...
   LOC: 46.7506 N, -121.81 E, 853.0 m      46.7506 N, -121.81 E, 853.0 m      ...
    FS: 40.0                               40.0                               ...
  GAIN: 7.51485e8                          7.51485e8                          ...
  RESP: a0 1.0, f0 1.0, 1z, 1p             a0 1.0, f0 1.0, 1z, 1p             ...
 UNITS: m/s                                m/s                                ...
   SRC: http://service.iris.edu/fdsnws/da  http://service.iris.edu/fdsnws/da  ...
  MISC: 4 entries                          4 entries                          ...
 NOTES: 2 entries                          2 entries                          ...
     T: 2019-01-01T00:00:00.010 (0 gaps)   2019-01-01T00:00:00.010 (0 gaps)   ...
     X: -1.511e+03                         +4.669e+03                         ...
        -1.512e+03                         +4.699e+03                         ...
            ...                                ...                            ...
        +1.540e+03                         +7.483e+02                         ...
        (nx = 144000)                      (nx = 144000)                      ...
     C: 0 open, 0 total

This example downloads 3600 seconds of data beginning 2019-01-01 00:00:00 (UTC) using FDSN dataselect with the IRIS DMC server. The keyword "detrend" removes the linear trend after download; "rr" removes (flattens to DC) the instrument response and replaces the .resp field of each channel with an all-pass filter. The keyword "w" writes the download directly to disk before processing. Access to data properties is straightforward and intentionally simple: for example, in all timeseries-data structures, the field .x holds univariate data.

2.4 Tutorials

A SeisIO tutorial is available from the project GitHub site, with three short, interactive Jupyter notebooks designed to take 5-10 minutes each. A few additional commands in the Julia CLI are required to run interactive notebooks:


    using Pkg
    Pkg.add(["Dates", "IJulia"])
    using IJulia
    cd(dirname(pathof(SeisIO))*"/../tutorial/")
    jupyterlab(dir=pwd())

The three tutorials are:

Part_1-Basic.pynb: introduction to SeisIO

Part_2-Data_Acquisition.pynb: downloading data & reading files

Part_3-Processing.pynb: data processing

Researchers familiar with MATLAB/Octave or Python will find Julia syntax intuitive and may need only the language's official documentation to begin coding. For those who want more guidance, many Julia-language tutorials are available at https://julialang.org/learning/.

3 Benchmarking

We conduct a series of benchmark tests on a 64-bit personal computer equipped with an Intel DH67CL motherboard, i7-2600 (3.4 GHz) CPU, and 16 GB Kingston DDR3 RAM, running Julia v1.1.0 on 64-bit Ubuntu Linux 18.04.3 (kernel 5.0.0-29). File read tests (Table 2) use SeisIO v0.4.1 and BenchmarkTools.jl with 100 samples per benchmark and one evaluation per sample. Because Julia uses a JIT compiler, an initial compile run precedes each test. The results shown in Fig. 1 suggest that read time and memory use scale quasi-linearly with file size.


3.1 File Reads

We now compare SeisIO read speeds with those of two popular, well-established seismic data packages: ObsPy for Python (Beyreuther et al., 2010; Megies et al., 2011) and SAC (Goldstein et al., 2003; Goldstein and Snoke, 2005). Comparative memory usage is shown in Fig. 2 and median read times for 100-trial test sets are shown in Fig. 3. For these tests, ObsPy v1.1.1 uses a dedicated Python 3.7.3 environment created with Conda 4.7.12; benchmarks use timeit.py and memory-profiler 0.55.0 with child processes included in memory estimates. ASDF files are benchmarked with pyasdf v0.5.1. SAC v106.a is compiled from source on the test machine and benchmarked with perf v5.0.21 and time -v; the median time and memory required to start and exit SAC without executing commands are subtracted from the test values.
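The timing protocol described above (an untimed warm-up, then the median of 100 timed trials, with a measured start-up baseline subtracted where relevant) can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the benchmark scripts used in this study; the function names `median_time_ms` and `subtract_baseline` and the toy workload are hypothetical.

```python
import statistics
import time


def median_time_ms(func, trials=100, warmup=1):
    """Return the median wall-clock time of func() in milliseconds.

    Untimed warm-up calls run first so that one-time costs
    (JIT compilation, imports, cold caches) are excluded.
    """
    for _ in range(warmup):
        func()
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def subtract_baseline(measured_ms, baseline_ms):
    """Remove a fixed start-up cost (e.g., launching and exiting a program)."""
    return measured_ms - baseline_ms


if __name__ == "__main__":
    # Toy workload standing in for a file read (hypothetical).
    workload = lambda: sum(range(10_000))
    t_med = median_time_ms(workload)
    print(f"median of 100 trials: {t_med:.3f} ms")
```

The median is used rather than the mean because it is less sensitive to occasional slow trials caused by background system activity.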

We compare programs for every test in Table 2 for which another program has a file reader. Comparisons with SAC are limited because SAC only reads two of these formats. ObsPy has no reader for PASSCAL, SUDS, or UW, and the ObsPy ASCII reader is incompatible with GeoCSV variants on time-series pair (tspair, ASCII) data. The ObsPy WIN reader could not read our test files, even though our data were downloaded directly from Hinet and integrity-checked by comparing with output from wintosac (http://wwweic.eri.u-tokyo.ac.jp/cgi-bin/show man en?wintosac). Thus, all possible comparisons with our benchmarks are shown in Figs. 2 & 3.

SeisIO uses less memory and reads files more quickly than both SAC and ObsPy; the former is especially noteworthy due to SAC's low-level coding. With the exception of ASDF read times, which differ by < 4%, performance differences cannot be explained by random variations in system background activity. Fig. 2 suggests that ObsPy has a considerable amount of static memory overhead associated with each file read, which may explain some read time differences (e.g., Fig. 3). The closest read times to SeisIO are obtained with ASDF, for which pyasdf also uses wrappers to libhdf5. The larger of the two mini-SEED benchmarks is also roughly comparable; notably, because the ObsPy mini-SEED reader is a wrapper to libmseed for C (Trabant, 2019), both the ObsPy and SAC comparisons strongly support the claim that well-optimized Julia code can outperform well-optimized C, even with Julia's high-level syntax, undaunting UX, and JIT compiler.

3.2 Download Throughput

With the data requirements of modern analysis techniques, download throughput is an increasingly important consideration when choosing data acquisition software. We benchmark download throughput using SeisIO and two popular Python tools: ObsPy and ROVER v1.0.4 (developed by IRIS-DMC and available at https://iris-edu.github.io/rover/). ROVER has built-in options for multi-worker SQL requests. We use mpi4py with the NoisePy noise-correlation toolbox (https://github.com/mdenolle/NoisePy; Jiang et al., in prep) to parallelize ObsPy downloads. For SeisIO, we use the SeisDownload.jl module (https://github.com/kura-okubo/SeisDownload.jl, version 1.2.0, last accessed 2019/10/02), developed to leverage Julia's built-in parallelization function pmap.

This benchmark test uses publicly-available data from three-component broadband seismograph stations archived at the IRIS DMC and the Northern California Earthquake Data Center (NCEDC). Each test uses 10 stations; download sizes are 7 GB for the TA network and 17 GB for the BP network. For the IRIS-DMC test, we use 8 worker CPUs to match server-side connection limits and the maximum workers available in NoisePy. The request comprises 16 days of continuous data sampled at 40 Hz. For the NCEDC test, we request 3-month segments of seismic data sampled at 20 Hz from stations in the Berkeley Parkfield (BP) High Resolution Seismic Network using SeisIO and ObsPy. Tests were performed using a 32-core Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90 GHz with 64 GB RAM.

The computation time for the tests includes the data request from the remote server and conversion to mseed format. The download efficiency is defined as the total amount of downloaded data divided by the total computation time, in MB/s. No preprocessing (e.g., detrending, tapering, filtering) is applied.

Figure 4 shows the download efficiency. The download efficiency of SeisIO can reach 3.3× that of ObsPy, in agreement with standard microbenchmarks of the Julia language (Bezanson et al., 2017). In the IRIS-DMC benchmark, the scaling of download speed with number of workers follows a power law with an exponent of 1.06 for ROVER, 0.97 for ObsPy, and 0.92 for SeisIO with the TA network (Figure 4a); in the NCEDC benchmark, the scaling exponents are 0.92 for ObsPy and 0.96 for SeisIO (Figure 4b). In larger downloads, where the computation time required to allocate workers is negligible compared to that of the data download itself, we report that the scaling exponent converges to 1.0. Therefore, the Julia language appears well-optimized for parallel computation using only built-in functions (pmap).
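To make the two quantities above concrete, the sketch below computes a download efficiency (data volume divided by total time, in MB/s) and estimates a power-law scaling exponent by least squares in log-log space. It is an illustration in Python with the standard library only; the function names and the synthetic measurements are hypothetical and are not the values plotted in Figure 4.

```python
import math


def download_efficiency(total_mb, total_seconds):
    """Download efficiency in MB/s: data volume / total computation time."""
    return total_mb / total_seconds


def scaling_exponent(workers, efficiency):
    """Fit efficiency = c * workers**k by least squares on logs; return k."""
    xs = [math.log(w) for w in workers]
    ys = [math.log(e) for e in efficiency]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    # Synthetic example: 7000 MB downloaded, with times chosen so that
    # efficiency follows an exact power law with exponent 0.92.
    workers = [1, 2, 4, 8]
    times = [7000.0 / (2.0 * w ** 0.92) for w in workers]
    eff = [download_efficiency(7000.0, t) for t in times]
    print(f"fitted exponent: {scaling_exponent(workers, eff):.2f}")
```

An exponent of 1.0 corresponds to efficiency that doubles whenever the number of workers doubles; values below 1.0 indicate that some time is spent on work that does not parallelize, such as allocating workers.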

3.3 Processing Example: Instrument Response Removal

The removal of an instrument response function is a general processing operation that converts recorded counts or Volts to the approximate physical units of measure, such as ground velocity (m/s), at frequencies from DC to the Nyquist frequency. This is a common preprocessing step in seismic data analysis, e.g., when comparing and/or cross-correlating waveforms recorded by different instruments (e.g., Bensen et al., 2007). We use the computational efficiency of response removal as an example processing operation and perform comparative benchmark tests using ObsPy and SeisIO.

The test data comprise a one-day digital seismogram from channel TA.121A.HHZ (network TA, station 121A), sampled at 100 Hz. Data are bandpass filtered before removing the instrument response, with a 4-corner cosine taper in ObsPy and a Butterworth filter in SeisIO. To ensure that the test measures a single processing step, the bandpass operation is not timed. We test on a single-core computer with an Intel(R) Core i5 CPU @ 3.4 GHz with 8 GB RAM.

Figure 5a shows computation times for file read and response removal. We conducted 100 trials of each process; mean values are shown, with standard deviations as error bars. The speedup of SeisIO is 1.6× relative to ObsPy for reading data, consistent with the results of test MSEED-1 in Figure 3; the speedup is 6.8× for instrument response removal. Figure 5b shows a graphical comparison of output waveforms, demonstrating the agreement between ObsPy and SeisIO. Although the differences near the edges of each trace are large compared to the middle, the artifacts can be adequately suppressed by cosine tapering before removing the instrument response (Figure 5b, top). In this test, the first and last 0.2% of samples in each window are tapered with both ObsPy and SeisIO. The small misfit in amplitude and/or phase arises from differences in filtering strategies.
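As an illustration of the edge tapering mentioned above, the following sketch applies a cosine ramp to the first and last 0.2% of samples of a trace. It is written in Python with the standard library only and is not the taper implementation of ObsPy or SeisIO; the function name `cosine_taper_edges` and the exact ramp formula are assumptions made for the example.

```python
import math


def cosine_taper_edges(x, fraction=0.002):
    """Return a copy of x with a cosine ramp applied to each end.

    fraction is the portion of the trace tapered at EACH edge
    (0.002 = 0.2%). Samples between the two ramps are unchanged.
    """
    n = len(x)
    m = int(n * fraction)          # number of samples in each ramp
    y = list(x)
    for i in range(m):
        # Ramp weight rises smoothly from 0 at the edge toward 1.
        w = 0.5 * (1.0 - math.cos(math.pi * i / m))
        y[i] *= w
        y[n - 1 - i] *= w
    return y


if __name__ == "__main__":
    trace = [1.0] * 10_000          # synthetic constant trace
    tapered = cosine_taper_edges(trace)
    print(tapered[0], tapered[10], tapered[20], tapered[-1])
```

For the one-day, 100 Hz trace used in this test (8,640,000 samples), a 0.2% ramp would cover 17,280 samples, or about 173 seconds, at each end.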

4 Conclusions and Future Directions

The SeisIO data framework is the first of its kind: high-level, easy, performant software that introduces the next generation of geophysics researchers to cutting-edge scientific computing in the Julia language. We have shown that SeisIO's speed and efficiency can outperform specialized precompiled C-language software. The benefits are lower computing requirements and costs.

The intent of SeisIO is to provide an efficient framework for geophysical data while maintaining comprehensible syntax. Core functionality will expand to additional data formats and acquisition methods based on demand; APIs and guides are available on the project homepage for potential contributors. Analysis programs based on SeisIO are in development, particularly for ambient-noise seismology (Bryan et al., 2019; Clements and Denolle, 2019). A SeisIO variant for GPU computing is in development and support for multiparametric volcano monitoring data is planned. As SeisIO is refined, and its scope expands to include GPU, cloud, and heterogeneous computing, we expect support to increase among seismologists and other geophysics researchers, many of whom find themselves spending valuable research time teaching new students to compile arcane (and sometimes, antique) programs.

Acknowledgments

The authors thank Andy Nowacki (University of Leeds, UK) for discussions on the Julia language; Douglas Neuhauser (University of California Seismological Laboratory, Berkeley, CA, USA) and David Shelly (US Geological Survey, Golden, CO, USA) for discussions on SAC and other data formats, which helped motivate the creation of SeisIO. J. Jones is thankful to Chad Trabant and Robert Casey (Incorporated Research Institutions for Seismology, Seattle, WA, USA) for assistance with IRIS web protocols. M. Denolle and J. Jones thank Ellen Yu and Aparna Bhaskaran (California Institute of Technology, Pasadena, California, USA) for assistance with SCSN FDSN and correspondence. J. Jones extends additional thanks to Wendy McCausland (USGS-VDAP, USA) and Ken Creager (University of Washington, USA) for contributing test data, and R. Carniel (Universita di Udine, Italy) for extensive early testing. mini-SEED handling was originally based on rdmseed.m for MATLAB by Francois Beauducel (Institut de Physique du Globe de Paris, France); SAC routines were originally based on SacIO for Julia by Ben Postlethwaite (https://github.com/bpostlethwaite/SacIO). This research was supported by a grant from the Packard Foundation.

Author Contributions

J. Jones created SeisIO, is the sole developer of the core package, and happily rules with an iron fist over its development and maintenance. T. Clements created the SeisIO notebook tutorial, developed a number of packages based on SeisIO, and created the prototypes of several data processing routines. K. Okubo wrote and conducted the benchmarks of download efficiency and instrumental response removal, and has developed a parallel downloader prototype, SeisDownload.jl, as an example of the many SeisIO applications created by M. Denolle's research group; its functionality is currently being integrated into SeisIO core. M. Denolle contributed to application development, research direction, and manuscript editing, and provides management and financial support for ongoing development.

Data and Resources

Data used in benchmark tests (Table 2) can be found in the SeisIO GitHub repository, with redistribution restrictions as noted below. Benchmarking scripts are available on the SeisIO GitHub page. Data sources in Table 2 use the following key:

1. Contributed by Prof. K. Creager, University of Washington, Seattle, WA, USA ([email protected]).

2. Retrieved with IRIS FDSN dataselect; to duplicate a data request, please contact the corresponding author for exact parameters. Each binary data file has a single data channel; each file name gives the time length and sampling frequency.

3. File is from the IRIS Mt. St. Helens 1980 special data set (IRIS virtual network STHELENS-1980). Original data are available by request from Incorporated Research Institutions for Seismology, Seattle, WA, USA.

4. File data are from the vertical-component channel of station EA3 in Jones et al. (2006). The original recording format was the SLIST variant of Lennartz MarsLite portable stations; the first line of text was manually edited to match SLIST syntax for this test.

5. Redistribution restricted; to request this file please contact Dr. W. McCausland, USGS-VDAP, Vancouver, WA, USA ([email protected]). Data file comprises five minutes of 100 Hz data on 22 channels beginning 2008-10-08T17:01:06.06 (UTC -6).

6. Available upon request from the corresponding author. Event data extracted from Pacific Northwest Seismic Network archives; data are fully described in Jones and Malone (2005).

7. Data from HiNet (NIED, 2019); redistribution prohibited. Request comprises one hour of 100 Hz data beginning 2014-09-27T09:00:00 (UTC+9) from 8 total channels (seismometer + infrasound at stations V.ONTA and V.ONTN). Benchmark uses the NIED channel file.

A standalone repository to reproduce the benchmark tests for download efficiency presented in Section 3.2 is available on GitHub. The required software, computational environment, data sets, and commands to execute the benchmark tests are documented in the repository.

The NoisePy module for ObsPy is part of a separate manuscript, currently in preparation. The repository is private until publication, but code is available upon request from its creator (Dr. C. Jiang, Harvard University, MA, USA, chengxin [email protected]).

Addendum

The SeisIO package presented in this work is the only official Julia package by this name. We recently learned of another, newer package that borrows the name SeisIO, consisting of reflection seismology software for SEGY data, whose code has migrated to another project. This other SeisIO is not part of the Julia registry and is completely unrelated to this work, but can be found on GitHub and via Google search, and packages that depend on it exist in the Julia registry. To minimize potential confusion, please follow the installation instructions in this manuscript or on our GitHub page.

References

Ahern, T., Casey, R., Barnes, D., Benson, R., & Knight, T. (2007). SEED standard for the exchange of earthquake data reference manual, format version 2.4. Incorporated Research Institutions for Seismology (IRIS), Seattle.

Bensen, G. D., Ritzwoller, M. H., Barmin, M. P., Levshin, A. L., Lin, F., Moschetti, M. P., Shapiro, N. M., & Yang, Y. (2007). Processing seismic ambient noise data to obtain reliable broad-band surface wave dispersion measurements. Geophysical Journal International, 169(3), 1239-1260.

Beyreuther, M., Barsch, R., Krischer, L., Megies, T., Behr, Y., & Wassermann, J. (2010). ObsPy: A Python Toolbox for Seismology. Seismological Research Letters, 81(3), 530-533, doi:10.1785/gssrl.81.3.530.

Bezanson, J., Edelman, A., Karpinski, S., & Shah, V. B. (2017). Julia: A fresh approach to numerical computing. SIAM Review, 59(1), 65-98.

Bezanson, J., Chen, J., Chung, B., Karpinski, S., Shah, V. B., Vitek, J., & Zoubritzky, L. (2018). Julia: dynamism and performance reconciled by design. Proceedings of the ACM on Programming Languages, 2(OOPSLA), 120.

Bryan, J. T., Okubo, K., Yuan, C., & Denolle, M. (2019). Improving the resolution of co-seismic velocity change monitoring at active fault zones using the ambient seismic field. Poster Presentation at 2019 SCEC Annual Meeting.

Clements, T., & Denolle, M. (2019, 08). Cactus to Clouds: Processing the SCEDC Open Data Set on AWS. Poster Presentation at 2019 SCEC Annual Meeting.

Crotwell, H. P., Owens, T. J., & Ritsema, J. (1999). The TauP Toolkit: Flexible seismic travel-time and ray-path utilities. Seismological Research Letters, 70, 154-160.

Goldstein, P., & Snoke, A. (2005). SAC Availability for the IRIS Community. Incorporated Research Institutions for Seismology Data Management Center Electronic Newsletter.

Goldstein, P., Dodge, D., Firpo, M., & Minner, L. (2003). SAC2000: Signal processing and analysis tools for seismologists and engineers. Invited contribution to The IASPEI International Handbook of Earthquake and Engineering Seismology, edited by W. H. K. Lee, H. Kanamori, P. C. Jennings, and C. Kisslinger, Academic Press, London.

Hagelund, R., & Levin, S. A. (Eds.) (2017). SEG-Y r2.0: SEG-Y revision 2.0 Data Exchange format. Tulsa, OK: Society of Exploration Geophysicists.

Jones, J. P., Carniel, R., Harris, A. J., & Malone, S. D. (2006). Seismic characteristics of variable convection at Erta 'Ale lava lake, Ethiopia. Journal of Volcanology and Geothermal Research, 153(1), 64-79.

Jones, J. P., & Malone, S. D. (2005). Mount Hood earthquake activity: Volcanic or tectonic origins? Bulletin of the Seismological Society of America, 95(3), 818-832.

Krischer, L., Smith, J., Lei, W., Lefebvre, M., Ruan, Y., Sales de Andrade, E., Podhorszki, N., Bozdağ, E., & Tromp, J. (2016). An Adaptable Seismic Data Format. Geophysical Journal International, 207(2), 1003-1011, https://doi.org/10.1093/gji/ggw319.


Megies, T., Beyreuther, M., Barsch, R., Krischer, L., & Wassermann, J. (2011). ObsPy – What can it do for data centers and observatories? Annals of Geophysics, 54(1), 47-58, doi:10.4401/ag-4838.

National Research Institute for Earth Science and Disaster Resilience (2019). NIED Hi-net. National Research Institute for Earth Science and Disaster Resilience, doi:10.17598/NIED.0003.

Perkel, J. M. (2019). Julia: come for the syntax, stay for the speed. Nature, 572, 141-142, doi:10.1038/d41586-019-02310-3.

Regier, J., Fischer, K., Pamnany, K., Noack, A., Revels, J., Lam, M., Howard, S., Giordano, R., Schlegel, D., McAuliffe, J., & Thomas, R. (2019). Cataloging the visible universe through Bayesian inference in Julia at petascale. Journal of Parallel and Distributed Computing, 127, 89-104.

Schorlemmer, D., Euchner, F., Kästli, P., & Saul, J. (2011). QuakeML: status of the XML-based seismological data exchange format. Annals of Geophysics, 54(1), 59-65.

Trabant, C. (2019). libmseed - The miniSEED library. https://github.com/iris-edu/libmseed, last accessed 2019-09-24.

Ward, P. L. (1989). SUDS; seismic unified data system. USGS Open-File Report 89-188, doi:10.3133/ofr89188.


Table 1: Data format support in SeisIO v0.4.1. Column "rw" is read/write support ("r" = read, "w" = write); column "Cov" is the lesser of % code coverage on CodeCov.io and Coveralls.io. Notes use the key below.

1. coverage reflects only supported blockette/packet types

2. support for Provenance not yet implemented (NYI)

3. supports IEEE-Float and integer data in SEGY rev 0 and rev 1 formats

Format Name                     SeisIO Name    rw   Cov   Notes   Reference
SEED                                                              Ahern et al. (2007)
  Dataless SEED                 dataless       r    96    1
  mini-SEED                     mseed          r    96    1
  SEED resp                     resp           r    96
SAC                                                               e.g. Goldstein et al. (2003)
  SAC data file                 sac            rw   97
  SAC pole-zero file            sacpz          rw   97
OTHER
  Ad Hoc (v1, v2)               ah1, ah2       r    96
  Advanced Seismic Data Format  asdf           rw   100   2       Krischer et al. (2016)
  GeoCSV sample list            geocsv.slist   r    98
  GeoCSV time-sample pair       geocsv         r    98
  QuakeML                       qml            r    100           e.g. Schorlemmer et al. (2011)
  SEG Y (rev 0, rev 1)          segy           r    93    3       Hagelund and Levin (2017)
  PASSCAL (SEG Y variant)       passcal        r    96
  Sample List ASCII             slist          r    100
  (SeisIO low-level format)     seisio         rw   100           this work
  FDSN Station XML              sxml           rw   100
  Seismic Unified Data System   suds           r    94    1       Ward (1989)
  UNAVCO Bottle                 bottle         r    100
  University of Washington      uw             r    98
  WIN (32-bit, v1)              win32          r    96            NIED (2019)


Table 2: Benchmark tests. Columns: Test Name is how the test is referenced in this manuscript; File is the name or search pattern in SeisIO/test/SampleFiles/; Format corresponds to column 2 of Table 1; SzF is file size on disk; SzO is object size in memory; Mem is peak memory usage; %Ovh ≡ 100 × (Mem/SzO − 1.0)%; T̃ is median read time in milliseconds for 100 trials. All memory and file size values are in MB. In column Notes, numeric values are data sources (see Data and Resources); lowercase letters denote special benchmark parameters:

a test uses read_asdf

b test uses read_data with keywords nx_new=36000, nx_add=1400000

c test uses read_data with keyword full=true

Test Name      File                Format   SzF    SzO    Mem    %Ovh   T̃       Notes
AH             1day-1hz.ah         ah1      0.33   0.33   0.33   1.11   0.49    1
ASDF           2days-40hz.h5       asdf     21.96  26.37  26.49  0.45   92.74   2,a
GeoCSV-tspair  geo-tspair.csv      geocsv   3.31   0.39   0.44   12.30  204.01  2
MSEED-1        1day-100hz.mseed    mseed    19.09  32.96  32.96  0.01   71.46   2
MSEED-2        SHW.UW.mseed        mseed    1.79   5.35   6.19   15.75  9.33    3,b
PASSCAL        1day-100hz.segy     passcal  32.96  32.96  32.99  0.08   22.30   2,c
SAC            1day-100hz.sac      sac      32.96  32.96  32.97  0.04   13.02   2,c
SLIST          1h-62.5hz.slist     slist    2.44   0.86   0.87   1.85   30.09   4
SUDS           10081701.WVP        suds     1.26   2.53   2.59   2.43   1.36    5
UW             99011116541W        uw       23.15  37.66  40.29  6.98   26.71   6
WIN32          2014092709*.cnt     win32    4.49   10.99  11.25  2.33   22.88   7
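The overhead column follows directly from the definition in the caption, %Ovh = 100 × (Mem/SzO − 1). The short Python sketch below evaluates it for one row; the function name `percent_overhead` is hypothetical. Because the tabulated Mem and SzO values are rounded to two decimals, recomputing from them reproduces the tabulated overhead only approximately, especially for small files.

```python
def percent_overhead(mem_mb, obj_mb):
    """Memory overhead in percent: 100 * (Mem / SzO - 1)."""
    return 100.0 * (mem_mb / obj_mb - 1.0)


if __name__ == "__main__":
    # UW row of Table 2: SzO = 37.66 MB, Mem = 40.29 MB (tabulated %Ovh: 6.98)
    print(f"{percent_overhead(40.29, 37.66):.2f} %")
```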


Figure 1: Benchmark tests (Table 2) in Julia v1.1.0 with SeisIO v0.4.1. Left: file read times. Right: peak memory use in SeisIO and file size on disk.


Figure 2: Memory use and overhead for all benchmarks in Table 2 that were testable in at least two of ObsPy, SAC, and SeisIO. (top) Memory usage and file sizes on disk. (bottom) Memory overhead. The y-axis is logarithmic. A missing bar with text label "NR" indicates no reader.


Figure 3: Read times in milliseconds for all benchmarks in Table 2 that were testable in at least two of ObsPy, SAC, and SeisIO. A missing bar with text label "NR" indicates no reader. Most read times fall in the range 10-100 ms. SeisIO AH and SUDS benchmarks are labeled with their respective values because the bars themselves are difficult to see. The ObsPy SLIST benchmark is labeled with its value because the full bar vastly exceeds the upper bound of the y-axis.


Figure 4: Download efficiency as a function of number of workers from (a) the IRIS-DMC server and (b) the Northern California Earthquake Data Center (NCEDC). The markers indicate individual speed tests. The dashed lines indicate the best-fit line (with logarithmic y-axis scaling) associated with each tool; the slope of each line is a proxy measure of the scaling performance.

Figure 5: Benchmark tests of instrument response removal. (a) Time benchmarks of data read and instrument response removal. Solid bar heights correspond to the mean times of each benchmark; 1σ error bars are shown as thin black lines. (b) Waveforms with their respective instrument responses removed are shown to demonstrate that the methods produce nearly identical output. For ease of visualization, lines are plotted every 20 points.
