Software Design: Decoupling when highly dependent on a third party library

Question

As part of a university project, I am currently working on an EEG biosignal classifier. While the project itself doesn't really focus on design ("anything that works"), I am trying to learn and apply some best practices.

Some Background
The experimental setup is essentially as follows: A subject sits in front of a screen on which an optical stimulus is shown. EEG measurements are used to observe the triggered response to this stimulus. Each stimulus is assigned to an event code. The goal is to build a neural network that can classify a stimulus for a particular EEG measurement.

Thus, we have several datasets in the General Data Format (.gdf) containing EEG measurements labeled with event codes. My goal now is to load these datasets and extract the information I need to create a CustomDataset suitable for PyTorch.

This is my Dataset class:

import torch
from torch.utils.data.dataset import Dataset as TorchDataset


class Dataset(TorchDataset):

    class_mapping: dict[str, int]
    labels: torch.Tensor
    samples: torch.Tensor

    def __init__(
        self,
        class_mapping: dict[str, int],
        labels: torch.Tensor,
        samples: torch.Tensor
    ) -> None:

        self.class_mapping = class_mapping
        self.labels = labels
        self.samples = samples

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, item) -> tuple[torch.Tensor, torch.Tensor]:
        return self.labels[item], self.samples[item]

The class_mapping variable is a dict that maps event codes (the keys) to the integer class labels used with PyTorch.

The process of obtaining the samples and labels is as follows:

1. Load the raw files.
2. Extract data fragments in a specified time window around the event markers (a.k.a. epochs).
3. Extract the data and labels from the epochs objects.

Since these steps involve quite a bit of work, I decided to put them into a simple factory:

from datasets.loading import LoadingStrategy
from datasets.data import Dataset

from typing import Iterable
from pathlib import Path

import numpy as np
import torch
import mne
from mne.io import Raw
from mne import Epochs

class DatasetFactoryV2:
    @staticmethod
    def create_dataset(
            path_to_raw_files: Iterable[Path],
            file_loading_strategy: LoadingStrategy,
            considered_events: Iterable[str],
            time_window_min: float,
            time_window_max: float
    ) -> Dataset:

        class_mapping = _get_class_mapping(considered_events)
        raw_files = file_loading_strategy(path_to_raw_files)
        labels, samples = _get_labels_and_samples(
            raw_files=raw_files,
            class_mapping=class_mapping,
            time_window_min=time_window_min,
            time_window_max=time_window_max
        )

        return Dataset(
            class_mapping=class_mapping,
            labels=labels,
            samples=samples
        )

def _get_class_mapping(events_to_consider: Iterable[str]) -> dict[str, int]:
    return {event: idx for idx, event in enumerate(events_to_consider)}

def _get_labels_and_samples(
        raw_files: Iterable[Raw],
        class_mapping: dict[str, int],
        time_window_min: float,
        time_window_max: float,
) -> tuple[torch.Tensor, torch.Tensor]:

    labels = []
    samples = []

    for raw in raw_files:

        events_from_annot, event_dict = mne.events_from_annotations(raw=raw, event_id=class_mapping)
        channel_picks = mne.pick_types(raw.info, eeg=True, exclude='bads')
        epochs = Epochs(
            raw=raw,
            events=events_from_annot,
            on_missing='warn',
            tmin=time_window_min,
            tmax=time_window_max,
            baseline=(None, None),
            event_repeated='drop',
            preload=True,
            picks=channel_picks,
            reject_by_annotation=True  # exclude Epochs fully/partially overlapping with 'bad'-labelled data spans
        )

        labels.extend(epochs.events[:, -1])
        samples.extend(epochs.get_data())

    labels = torch.tensor(data=np.array(labels), dtype=torch.int64)
    samples = torch.tensor(data=np.array(samples), dtype=torch.float32)

    return labels, samples

And, just for completeness, the LoadingStrategy:

from typing import Protocol, Iterable
from pathlib import Path
from mne.io import Raw, read_raw_gdf


class LoadingStrategy(Protocol):
    def __call__(self, path_to_files: Iterable[Path]) -> list[Raw]:
        ...


class GDFLoadingStrategy:
    def __call__(self, path_to_files: Iterable[Path]) -> list[Raw]:
        return [read_raw_gdf(str(filepath)) for filepath in path_to_files]

This is basically the only part I expect to change within the scope of this educational project, so I decided to go with a strategy pattern that can be extended to any of the many possible data formats.

Problem / Question
In several lectures I have now heard that coupling is a bad thing and should be avoided as much as possible. If I understand it correctly, my DatasetFactory is strongly coupled to the creation of mne.io.Raw and mne.Epochs objects, and to a number of different functions from this library as well. Thinking in terms of tests, for example, it seems quite obvious that I will have a hard time testing without real data to load.

This brings me to my question:
How could I decouple a library if I need a lot of logic from it? Should I really create e.g. Protocols that mimic the mne.io.Raw and mne.Epochs classes and use them instead of the mne ones? It would certainly make mocking and thus testing easier, but it seems like a lot of work relative to the benefits.

Thanks for any help!

Julian

"I have now heard that coupling is a bad thing and should be avoided as much as possible." - that's your misunderstanding. Here is a better statement: coupling (to a library) should be avoided in case you expect the life time of your application to be a lot longer than the life time of the library. — Doc Brown, Commented Jan 25, 2023 at 7:37

JonasH · Accepted Answer · 2023-01-25 10:49:50Z

"heard that coupling is a bad thing and should be avoided as much as possible"

I think this is overly dogmatic.

In some cases it makes more sense to just accept the tight coupling. This is especially true if you are writing code that is not expected to live very long.

In other cases you might want to add an abstraction layer. The idea is to create interfaces that represent the required functionality, and implement these using your library.
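As a rough sketch of that idea for the code in the question: the dataset-building code depends only on a small protocol, and the MNE-specific calls would live in one class that satisfies it. All names here (`LabelledEpoch`, `EpochExtractor`, `InMemoryExtractor`, `build_labels_and_samples`) are hypothetical and not part of MNE or PyTorch, and plain lists stand in for tensors so the example runs without third-party packages.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol, Sequence


@dataclass(frozen=True)
class LabelledEpoch:
    """One stimulus response: a class label plus channel-by-time values."""
    label: int
    data: Sequence[Sequence[float]]


class EpochExtractor(Protocol):
    """Hypothetical interface: what the application needs, with no library types."""

    def __call__(
        self, class_mapping: dict[str, int], tmin: float, tmax: float
    ) -> Iterable[LabelledEpoch]:
        ...


def build_labels_and_samples(
    extract: EpochExtractor,
    class_mapping: dict[str, int],
    tmin: float,
    tmax: float,
) -> tuple[list[int], list[Sequence[Sequence[float]]]]:
    """Application code: knows only the protocol, not the EEG library."""
    epochs = list(extract(class_mapping, tmin, tmax))
    return [e.label for e in epochs], [e.data for e in epochs]


class InMemoryExtractor:
    """Stand-in implementation, e.g. for tests. An MNE-backed class would
    have the same __call__ signature and wrap the Epochs logic instead."""

    def __init__(self, epochs: list[tuple[str, Sequence[Sequence[float]]]]) -> None:
        self._epochs = epochs

    def __call__(
        self, class_mapping: dict[str, int], tmin: float, tmax: float
    ) -> Iterable[LabelledEpoch]:
        # Keep only events the caller asked for; tmin/tmax are ignored here
        # because the fake data is already cut into epochs.
        for event, data in self._epochs:
            if event in class_mapping:
                yield LabelledEpoch(label=class_mapping[event], data=data)


fake = InMemoryExtractor(
    [("left", [[0.1, 0.2]]), ("ignored", [[9.9]]), ("right", [[0.3, 0.4]])]
)
labels, samples = build_labels_and_samples(fake, {"left": 0, "right": 1}, -0.2, 0.8)
```

With this shape, only the MNE-backed extractor would need real .gdf files; everything downstream can be exercised with the in-memory stand-in.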

But a very common problem with this approach is leaky abstractions. There is a good chance your interfaces will be very similar to the library you are trying to decouple from. And this will likely prevent any easy replacement of the library.
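To make the leak concrete, here is a small sketch with two hypothetical protocols for the epoch-extraction step from the question. The first simply repeats keyword arguments the question passes to `mne.Epochs` (`on_missing`, `event_repeated`, `reject_by_annotation`), so any replacement library would have to emulate MNE's behaviour. The second states the requirement in the application's own terms. Both protocol names and the `parameter_names` helper are made up for illustration.

```python
import inspect
from typing import Any, Protocol, Sequence


class LeakyEpochExtractor(Protocol):
    """Mirrors the library: parameter names and values are MNE's own."""

    def __call__(
        self,
        raw: Any,                     # in practice an mne.io.Raw object
        events: Any,                  # MNE's event array
        tmin: float,
        tmax: float,
        on_missing: str,              # e.g. 'warn'
        event_repeated: str,          # e.g. 'drop'
        reject_by_annotation: bool,
    ) -> Any:
        ...


class DomainEpochExtractor(Protocol):
    """States the need: labelled windows around the stimuli of interest."""

    def __call__(
        self,
        event_names: Sequence[str],
        window_start_s: float,
        window_end_s: float,
    ) -> Sequence[tuple[str, Sequence[Sequence[float]]]]:
        ...


def parameter_names(protocol: type) -> list[str]:
    """List the call parameters of a protocol, excluding `self`."""
    params = inspect.signature(protocol.__call__).parameters
    return [name for name in params if name != "self"]


leaky = parameter_names(LeakyEpochExtractor)
domain = parameter_names(DomainEpochExtractor)
```

The second protocol is harder to design, because it forces a decision about what the application actually needs, but it is the only one of the two that another library could plausibly implement.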

Creating well designed interfaces so they can be used with multiple different libraries is very difficult, at least for more complex libraries with large APIs. So I would consider your actual goal. Is it to make testing easier? Or allow plugin replacement with another library?

You might also want to consider the risk with any kind of dependency. Is it open source? Does it have an active community? Could you fix a bug yourself if needed? Are there possible replacements available? How difficult would a replacement be? If it is commercial, is the company stable? What support guarantees do they have? Is the license compatible with whatever you want to do?

While recommendations on the internet can be useful, it is important to understand the problem they are trying to address, and when they are applicable.

Thanks to all of you for your answers and comments. I think I will stick with some sort of factory method then, where the methods "get_class_mapping" and "get_labels_and_samples" will be abstract. It still seems to be quite a coarse abstraction, but god knows how another library might work. – J. Lo, Commented Jan 25, 2023 at 17:15
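A minimal sketch of what this comment describes, assuming a hypothetical `DatasetFactory` base class: `create_dataset` fixes the overall sequence, while the two library-specific steps are abstract. Plain lists and a dict stand in for tensors and the `Dataset` class so that it runs without MNE or PyTorch; `FakeFactory` and its hard-coded recordings are likewise invented for the demonstration.

```python
from abc import ABC, abstractmethod
from typing import Iterable


class DatasetFactory(ABC):
    """Template method: the workflow is fixed, the library-specific steps are not."""

    def create_dataset(
        self, considered_events: Iterable[str], tmin: float, tmax: float
    ) -> dict:
        class_mapping = self.get_class_mapping(considered_events)
        labels, samples = self.get_labels_and_samples(class_mapping, tmin, tmax)
        if len(labels) != len(samples):
            raise ValueError("every sample needs exactly one label")
        # The question's code would build its torch-based Dataset here.
        return {"class_mapping": class_mapping, "labels": labels, "samples": samples}

    @abstractmethod
    def get_class_mapping(self, considered_events: Iterable[str]) -> dict[str, int]:
        ...

    @abstractmethod
    def get_labels_and_samples(
        self, class_mapping: dict[str, int], tmin: float, tmax: float
    ) -> tuple[list[int], list[list[float]]]:
        ...


class FakeFactory(DatasetFactory):
    """Library-free subclass; an MNE subclass would override the same two methods."""

    def get_class_mapping(self, considered_events):
        return {event: idx for idx, event in enumerate(considered_events)}

    def get_labels_and_samples(self, class_mapping, tmin, tmax):
        # Hard-coded stand-in for epochs cut from real recordings.
        recorded = [("left", [0.1, 0.2]), ("right", [0.3, 0.4]), ("left", [0.5, 0.6])]
        labels = [class_mapping[event] for event, _ in recorded]
        samples = [values for _, values in recorded]
        return labels, samples


dataset = FakeFactory().create_dataset(["left", "right"], tmin=-0.2, tmax=0.8)
```

In the question's code the class mapping is built by a plain dict comprehension that does not touch MNE, so that step may not need to be abstract at all.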

gnasher729 · Answer · 2023-01-26 08:53:06Z

You create an abstraction layer so that the rest of your application is coupled with the abstraction layer, but not with the library. The implementation of the abstraction layer is obviously tightly coupled with the library. That's fine. Something must be tightly coupled with it.

Now what you do need to do is make sure that your abstraction layer is indeed abstract and could be used with a different library; otherwise it is pointless. The goal is that you can rewrite the abstraction layer and leave everything else unchanged if you switch to a different library, or if the library makes changes that are incompatible with earlier versions.

Regarding the comments: perfection isn't needed. If you swap out a library that is accessed through an imperfect abstraction layer, you need only a few changes in the application instead of the many you would need without an abstraction layer. The abstraction layer itself probably has to be re-implemented, but you would have to do that anyway.

edited Jan 26, 2023 at 8:53

answered Jan 25, 2023 at 12:21

I tried creating an abstraction lawyer once. But he wanted to go to film school.
– candied_orange
Commented Jan 25, 2023 at 13:23
The problem is that it's nearly impossible to make sure the abstraction layer could be used with a different library, except if you actually have a different library around and actually implement it with that library as well to make sure you understand all the details of the APIs.
– Michael Borgwardt
Commented Jan 25, 2023 at 13:25
@MichaelBorgwardt oh it's not impossible. It requires that you separate your interesting business logic from your brain dead glue code that wires up to the library. Once that's done all you need is something that can do what the library does. The question is if you can get any value from that separation prior to actually needing to replace the library. The less your code knows about the library the better but at some point you need to create something that works.
– candied_orange
Commented Jan 25, 2023 at 13:36
Thanks for your answer! If I could, I would assign another "accept" to you. As described in a comment above, I tend towards some sort of factory method where "get_class_mapping" and "get_labels_and_samples" will be abstract. It carries the risk of repetition in case another library works in a similar way to mne, but I think it also offers the best flexibility in case another library works completely differently.
– J. Lo
Commented Jan 25, 2023 at 17:21
