Check out the Hyperspy Workshop May 13-17, 2024 Online

Big Data#

4-D STEM datasets are large and difficult to work with. In pyxem we try to get around this by using a lazy loading approach. This means that the data is not loaded into memory until it is needed.

This is a very powerful approach, but it can be confusing at first. In this guide we will discuss how to work with large datasets in pyxem.

Note

If you want more information on working with large datasets in pyxem, please see the Big Data <https://hyperspy.org/hyperspy-doc/current/user_guide/big_data.html>_ section of the HyperSpy User Guide <https://hyperspy.org/hyperspy-doc/current/user_guide/index.html>_.

Loading and Plotting a Dataset#

Let’s start by loading a dataset. We will use the load function to load a dataset from HyperSpy

import hyperspy.api as hs
s = hs.load("big_data.zspy", lazy=True)
s

The dataset here is not loaded into memory here, so this should happen instantaneously. We can then plot the dataset.

s.plot()

Which (in general) will be very slow. This is because entire dataset is loaded into memory (chuck by chunk) to create a navigator image. In many cases a HAADF dataset will be collected as well as a 4-D STEM dataset. In this case we can set the navigator to the HAADF dataset instead.

haadf = hs.load("haadf.zspy") # load the HAADF dataset
s.plot(navigator=haadf) # happens instantaneously

This is much faster as the navigator doesn’t need to be computed and instead only 1 chunk needs to be loaded into memory before plotting!

You can also set the navigator so that by default it is used when plotting.

haadf = hs.load("haadf.zspy") # load the HAADF dataset
s.navigator = haadf
s.plot()

Distributed Computing#

In pyxem we can use distributed computing to speed up the processing of large datasets. This is done using the Dask library. Dask is a library for parallel computing in Python. It is very powerful and can be used to speed up many different types of computations.

The first step is to set up a Dask Client. This can be done using the distributed scheduler.

from dask.distributed import Client
client = Client()

This will start a local cluster on your machine.

If you want to use a remote cluster using a scheduler such as Slurm you can do so by using the dask-jobqueue library. This is a library that allows you to use a scheduler to start a cluster on a remote machine.

from dask_jobqueue import SLURMCluster
cluster = SLURMCluster()
client = Client(cluster)