Big Data#
4-D STEM datasets are large and often difficult to work with. In pyxem we get around this by using a lazy loading approach: the data is not loaded into memory until it is needed.
This is a very powerful approach, but it can be confusing at first. In this guide we will discuss how to work with large datasets in pyxem.
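The idea behind lazy loading can be illustrated with a plain dask array (hyperspy's lazy signals wrap these under the hood). Building the array and describing operations on it are essentially free; values are only produced when ``.compute()`` is called. A minimal sketch, with made-up shapes:

```python
import dask.array as da

# A fake 4-D STEM dataset: 128x128 scan positions, 64x64 detector.
# `chunks` controls how much data is loaded/processed at a time.
data = da.ones((128, 128, 64, 64), chunks=(32, 32, 64, 64))

# No computation has happened yet -- these lines only record operations
# in a task graph.
pattern = data[0, 0]   # still lazy: one 64x64 diffraction pattern
total = pattern.sum()  # still lazy

# Only .compute() materialises values in memory, and only the chunks
# that are actually needed get read.
print(total.compute())  # sums 64*64 ones -> 4096.0
```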
Note
If you want more information on working with large datasets in pyxem, please see the Big Data (https://hyperspy.org/hyperspy-doc/current/user_guide/big_data.html) section of the HyperSpy User Guide (https://hyperspy.org/hyperspy-doc/current/user_guide/index.html).
Loading and Plotting a Dataset#
Let’s start by loading a dataset lazily, using HyperSpy’s load function:
import hyperspy.api as hs
s = hs.load("big_data.zspy", lazy=True)
s
The dataset is not loaded into memory here, so this should happen almost instantaneously. We can then plot the dataset.
s.plot()
This will (in general) be very slow. The reason is that the entire dataset has to be loaded into memory (chunk by chunk) to compute a navigator image. In many cases a HAADF image is collected alongside the 4-D STEM dataset; in that case we can use the HAADF image as the navigator instead.
haadf = hs.load("haadf.zspy") # load the HAADF dataset
s.plot(navigator=haadf) # happens instantaneously
This is much faster, as the navigator doesn’t need to be computed; only one chunk needs to be loaded into memory before plotting!
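To see why computing a navigator is expensive: the navigation image is essentially a sum over the two detector axes, and evaluating it has to touch every chunk of the dataset. A dask sketch of that reduction (the shapes are illustrative):

```python
import dask.array as da

# Fake 4-D dataset: (scan_y, scan_x, det_y, det_x)
data = da.ones((64, 64, 128, 128), chunks=(16, 16, 128, 128))

# One intensity per scan position -- lazy, shape (64, 64) --
# but evaluating it must read every chunk of the dataset.
navigator = data.sum(axis=(-2, -1))
nav_image = navigator.compute()  # this is the slow step
print(nav_image.shape)           # (64, 64)
```

Supplying a precomputed HAADF image as the navigator skips this reduction entirely, which is why plotting becomes fast.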
You can also set the navigator so that by default it is used when plotting.
haadf = hs.load("haadf.zspy") # load the HAADF dataset
s.navigator = haadf
s.plot()
Distributed Computing#
In pyxem we can use distributed computing to speed up the processing of large datasets. This is done using Dask, a library for parallel computing in Python. It is very powerful and can be used to speed up many different types of computations.
The first step is to set up a Dask Client. This can be done using the distributed scheduler.
from dask.distributed import Client
client = Client()
This will start a local cluster on your machine.
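Once the client exists, lazy dask computations are automatically scheduled on its workers, and you can also submit work to it directly. A small sketch (``processes=False`` keeps everything in a single process, which is convenient for a quick test; the default ``Client()`` starts a local multi-process cluster):

```python
from dask.distributed import Client

# Start an in-process cluster (no extra worker processes).
client = Client(processes=False)

# Submit a function call to the cluster and fetch the result.
future = client.submit(sum, [1, 2, 3])
print(future.result())  # 6

client.close()
```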
If you want to use a remote cluster managed by a job scheduler such as Slurm, you can do so with the dask-jobqueue library, which starts a Dask cluster through the scheduler.
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(cores=8, memory="32GB")  # resources per job; example values, adjust for your cluster
cluster.scale(jobs=4)  # ask Slurm for 4 worker jobs (example value)
client = Client(cluster)