Big Data
--------

4-D STEM datasets are large and difficult to work with. ``pyxem`` addresses this with
lazy loading: the data is not read into memory until it is needed. This is a powerful
approach, but it can be confusing at first. This guide discusses how to work with large
datasets in ``pyxem``.

.. note::

    If you want more information on working with large datasets in ``pyxem``, please see
    the Big Data section of the HyperSpy User Guide.

Loading and Plotting a Dataset
------------------------------

Let's start by loading a dataset, using the ``load`` function from HyperSpy.

.. code-block:: python

    import hyperspy.api as hs

    s = hs.load("big_data.zspy", lazy=True)
    s

Because the dataset is not loaded into memory, this should return almost immediately.
We can then plot the dataset.

.. code-block:: python

    s.plot()

In general this will be very slow, because the entire dataset is loaded into memory
(chunk by chunk) to create a navigator image. In many cases a HAADF dataset is collected
alongside the 4-D STEM dataset, and it can be used as the navigator instead.

.. code-block:: python

    haadf = hs.load("haadf.zspy")  # load the HAADF dataset
    s.plot(navigator=haadf)  # no navigator needs to be computed

This is much faster: the navigator doesn't need to be computed, and only 1 chunk needs
to be loaded into memory before plotting!

You can also set the navigator so that it is used by default when plotting.

.. code-block:: python

    haadf = hs.load("haadf.zspy")  # load the HAADF dataset
    s.navigator = haadf
    s.plot()

Distributed Computing
---------------------

In ``pyxem`` we can use distributed computing to speed up the processing of large
datasets. This is done using the Dask library, a Python library for parallel computing
that can speed up many different types of computations.

The first step is to set up a Dask Client.
This can be done using the distributed scheduler.

.. code-block:: python

    from dask.distributed import Client

    client = Client()

This will start a local cluster on your machine. If you want to use a remote cluster
managed by a job scheduler such as Slurm, you can do so with the dask-jobqueue library,
which starts the cluster by submitting jobs to that scheduler.

.. code-block:: python

    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    # Resources (e.g. cores, memory) must be given here or in the dask-jobqueue config
    cluster = SLURMCluster()
    client = Client(cluster)
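
To make the earlier point about navigator cost concrete, here is a minimal sketch of a chunk-by-chunk reduction written in plain NumPy. It is an illustration only, not how ``pyxem`` or HyperSpy implement it: the function name ``chunked_navigator``, the chunk size, and the small random array standing in for a 4-D STEM dataset are all assumptions made for this example.

```python
import numpy as np


def chunked_navigator(data, chunk=4):
    """Sum every diffraction pattern to a single pixel, one block of scan rows at a time."""
    nav = np.empty(data.shape[:2], dtype=np.float64)
    for start in range(0, data.shape[0], chunk):
        stop = min(start + chunk, data.shape[0])
        # Only this block of scan rows is materialised at once
        block = np.asarray(data[start:stop])
        # Reduce over the two detector axes
        nav[start:stop] = block.sum(axis=(2, 3))
    return nav


# Hypothetical stand-in for a dataset with axes (scan_y, scan_x, det_y, det_x)
rng = np.random.default_rng(0)
data = rng.random((10, 6, 16, 16))

navigator = chunked_navigator(data, chunk=4)
print(navigator.shape)
```

Every block has to be read once to build the navigator, which is why supplying a precomputed image such as a HAADF scan avoids most of the cost. A Dask scheduler performs the same kind of per-chunk work, but can spread the blocks across several workers.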