
MAINT: use pydantic for parameter validation #514

Draft: wants to merge 37 commits into base: develop

Conversation

@DerWeh (Contributor) commented Feb 26, 2024

This is a first proof of concept for using pydantic to validate parameters. I used Python 3.10 syntax as it is more concise, and pydantic==2.6.2 without putting much thought into it.

Currently, the fit method is extremely messy due to all the checks. Of course, a refactoring that puts them into a dedicated method would already simplify things.
I think using pydantic might be a more elegant solution. This draft is mostly for discussing whether this is a good idea or not. I tried to be faithful to the checks that were implemented.
A disadvantage is that pydantic errors are harder to read. On the plus side, the type annotations are much clearer than having to search through the fit method, and the validation is probably also much easier to maintain.
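For illustration, a minimal sketch of what this declarative style could look like (field names, bounds, and defaults are only examples, assuming pydantic>=2; this is not the PR's actual model):

from typing import Literal
from pydantic import BaseModel, Field

class EBMParams(BaseModel):
    # Illustrative parameters only; the constraints replace hand-written checks in fit.
    max_bins: int = Field(256, gt=1)
    outer_bags: int = Field(8, ge=1)
    learning_rate: float = Field(0.01, gt=0.0)
    validation_size: float = Field(0.15, ge=0.0)
    binning: Literal["quantile", "uniform"] = "quantile"

EBMParams(max_bins=0)  # raises pydantic.ValidationError instead of a manual check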

Currently, I break the private versions of the classifier: attributes were set depending on whether the estimator is private or not. Do you think it would be better to define a super class with the common arguments and subclasses with the additional arguments?

Another question is how to best proceed with overwriting the default arguments. Currently, there is a lot of redundancy.

If we want to follow through with this idea, I think some sklearn checks should be enabled again using parametrize_with_checks or check_estimator.
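As a rough sketch (assuming pytest and a recent scikit-learn; this is not the repo's actual test code), re-enabling the checks could look like:

from sklearn.utils.estimator_checks import parametrize_with_checks
from interpret.glassbox import ExplainableBoostingClassifier, ExplainableBoostingRegressor

@parametrize_with_checks([ExplainableBoostingClassifier(), ExplainableBoostingRegressor()])
def test_sklearn_compatible_estimator(estimator, check):
    # pytest generates one test per (estimator, check) pair.
    check(estimator)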

@paulbkoch (Collaborator)

Hi @DerWeh -- I like what you're doing here and plan to merge it once it fully replaces the existing checks and passes the tests.

A few thoughts:
I think a good minimum version that we should target would be python 3.8 since that's still supported. Thankfully 3.8 supports things like the Literal annotation.

We can add some classes to formalize the DP vs Regular EBM class partition. Someday, I think we might support multi-output EBMs, and one strange type of EBM might be an EBM that has both classification and regression outputs. To create one of the strange beasts, you'd need to directly create an EBMModel object, so I like having the ability to make and use these classes directly. I can't think of any good reason though to mix private and non-private EBMs, so the private vs regular partition can be a first split in the class hierarchy below the EBMModel class.

You probably already saw that I had a test for check_estimator, but disabled it. I think we can probably now increase our minimum scikit-learn version to allow for this check now. We didn't pass all the checks since EBMs are more permissive in what they accept than scikit-learn, and I'd like to preserve our user-friendliness in that respect, so I expect we'll want to continue disabling some of the scikit-learn checks.

@DerWeh (Contributor, Author) commented Feb 28, 2024

I think a good minimum version that we should target would be python 3.8 since that's still supported. Thankfully 3.8 supports things like the Literal annotation.

Sure enough. I typically orient myself at NEP 29, but it is a good idea to be more conservative (after all, NumPy is very stable by now, so using an older NumPy version is typically sufficient).
I just wanted to avoid the additional boilerplate for type hints in the prototype, in case we reject the idea altogether.

Someday, I think we might support multi-output EBMs, and one strange type of EBM might be an EBM that has both classification and regression outputs.

Out of curiosity: what advantage do you expect? I don't really see the synergy for multi-output in EBMs compared to deep networks (or deep trees). Can we expect an advantage during the boosting phase?
But this is mostly an academic question. In practice, I am currently mostly interested in more convenient multi-class support. Interactions and post-processing still seem to be lacking.

You probably already saw that I had a test for check_estimator, but disabled it.

Yes, I saw it. The PR for the non-private EBMs is ready.

As soon as it is merged (and when I find time), I can continue working on pydantic.

@paulbkoch (Collaborator) commented Feb 28, 2024

Someday, I think we might support multi-output EBMs, and one strange type of EBM might be an EBM that has both classification and regression outputs.

Out of curiosity: what advantage do you expect? I don't really see the synergy for multi-output in EBMs compared to deep networks (or deep trees). Can we expect an advantage during the boosting phase? But this is mostly an academic question. In practice, I am currently mostly interested in more convenient multi-class support. Interactions and post-processing still seem to be lacking.

I think you're right that there will be very little technical benefit. Mostly, I think of it as potentially a nice packaging solution for grouping related predictions, although I might reject the idea if it adds anything more than a little complexity. EBMs do have some benefits in terms of the compactness of the model representation. For example, SoftAtHome, which wrote Ebm2Onnx (https://github.com/interpretml/ebm2onnx), uses EBMs inside their routers, and I also saw a paper where some company was evaluating EBMs to be used inside pacemakers (!!!). Sharing the feature definitions and bin cuts might be nice in those applications, both for the compactness of the model and because it could lower the computation requirements of predictions and speed them up.

Probably more likely than boosting multi-output models together would be constructing two separate EBMs and then merging them as a post-processing step, similarly to how merge_ebm works today. But who knows, maybe someone really will want some super weird objective that includes both classification and regression. I'd like that, even if just for the entertainment value of seeing someone use an EBM this way.

@paulbkoch (Collaborator)

But this is mostly an academic question. In practice, I am currently mostly interested in more convenient multi-class support. Interactions and post-processing seem to still be lacking.

We don't allow multi-class interactions in the default API entirely because of the difficulties in visualizing them. You can construct multi-class EBMs with interactions today if you use the measure_interactions function manually. Here's a slightly related example showing how to use it:

https://interpret.ml/docs/python/examples/custom-interactions.html

@paulbkoch (Collaborator)

Hi @DerWeh -- I've been thinking through this question the last few days and here's what I came up with in terms of a hierarchy:

EBMBase -> abstract base class for all EBM models
EBMModel(EBMBase) -> instantiable class for normal EBM classifiers/regressors
ExplainableBoostingClassifier(EBMModel)
ExplainableBoostingRegressor(EBMModel)
DPEBMModel(EBMBase) -> instantiable class for differential privacy classifiers/regressors
DPExplainableBoostingClassifier(DPEBMModel)
DPExplainableBoostingRegressor(DPEBMModel)

Thoughts?

@DerWeh (Contributor, Author) commented Mar 4, 2024

Yes, the hierarchy sounds very reasonable. Though, I am not quite sure whether making EBMBase an abstract base class has any benefit. See the discussion: https://stackoverflow.com/questions/3570796/why-use-abstract-base-classes-in-python

But it doesn't really seem very important to me, so I am fine with anything.


However, now that EBMs have the checks for scikit-learn compatibility, I ran into a bit of a problem: clone doesn't work with parameter validation anymore. clone performs explicit identity checks (old_parameter is new_parameter), which fail due to pydantic...
pydantic doesn't offer the option to defer the validation, see the rejected request pydantic/pydantic#559. I knew that scikit-learn urges you to put the validation into the fit method (https://scikit-learn.org/stable/developers/develop.html#instantiation), but I didn't think they were this pedantic. Currently, the only option I see to get this PR through is providing our own custom implementation of clone, via a custom _clone_parametrized, but this seems somewhat ugly...
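A minimal illustration of the problem (not code from the PR): any constructor that stores a transformed or copied value, as eager validation does, trips clone's identity check.

from sklearn.base import BaseEstimator, clone

class EagerlyValidated(BaseEstimator):
    def __init__(self, feature_names=None):
        # Validation/coercion at construction time stores a *new* object ...
        self.feature_names = list(feature_names) if feature_names is not None else None

# ... so clone(), which rebuilds the estimator and compares the stored parameters
# with `is`, raises RuntimeError here:
clone(EagerlyValidated(feature_names=["age", "income"]))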

What is your opinion @paulbkoch?

@paulbkoch (Collaborator)

Yes, the hierarchy sounds very reasonable. Though, I am not quite sure whether making EBMBase an abstract base class has any benefit. See the discussion: https://stackoverflow.com/questions/3570796/why-use-abstract-base-classes-in-python

But it doesn't really seem very important to me, so I am fine with anything.

You're probably right that it has very little technical benefit in python. I was thinking of it mainly as a form of documentation to indicate that the EBMBase class should not be instantiated directly and then just using duck typing as we currently are. I haven't written as much python as you have (seemingly) and have never made an abstract class in python, but I'm assuming adding the abstract part is fairly lightweight.

However, now that EBMs have the checks for scikit-learn compatibility, I ran into a bit of a problem: clone doesn't work with parameter validation anymore. clone performs explicit identity checks (old_parameter is new_parameter), which fail due to pydantic... pydantic doesn't offer the option to defer the validation, see the rejected request pydantic/pydantic#559. I knew that scikit-learn urges you to put the validation into the fit method (https://scikit-learn.org/stable/developers/develop.html#instantiation), but I didn't think they were this pedantic. Currently, the only option I see to get this PR through is providing our own custom implementation of clone, via a custom _clone_parametrized, but this seems somewhat ugly...

What is your opinion @paulbkoch?

Yeah, that does seem a bit ugly. I didn't realize pydantic was unable to use deferred checks. Even if we get the clone function working though, it's not clear if the scikit-learn requirement to defer checks will break anything else now or in the future.

Maybe we should instead move back to option #2 @DerWeh where we move the checks into a separate _validate function? That would at least move them out of the way so that someone reading the fit function would only need to see them if they were specifically interested in the checks. And we'd get to keep the more descriptive error messages too.

@DerWeh (Contributor, Author) commented Mar 5, 2024

You're probably right that it has very little technical benefit in python. I was thinking of it mainly as a form of documentation to indicate that the EBMBase class should not be instantiated directly and then just using duck typing as we currently are. I haven't written as much python as you have (seemingly) and have never made an abstract class in python, but I'm assuming adding the abstract part is fairly lightweight.

I agree, it should be lightweight, so I never did a benchmark (or, in fact, even used it). For documentation purposes, the most important part is probably not making the class public.

Maybe we should instead move back to option #2 @DerWeh where we move the checks into a separate _validate function? That would at least move them out of the way so that someone reading the fit function would only need to see them if they were specifically interested in the checks. And we'd get to keep the more descriptive error messages too.

Sure, having a separate _validate function is already a good first step. A question to consider is whether it is still a good idea to separate the EBMs and the DP versions (using your suggested class hierarchy), and whether we should use the vanilla dataclass offered by Python. It doesn't do any validation, but it avoids the boilerplate of writing the __init__ function with all the self.attribute = attribute statements.
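For concreteness, the boilerplate a dataclass would remove (field names are just examples):

from dataclasses import dataclass

class ManualInit:
    def __init__(self, max_bins=256, learning_rate=0.01):
        self.max_bins = max_bins          # repeated for every parameter ...
        self.learning_rate = learning_rate

@dataclass
class GeneratedInit:
    max_bins: int = 256                   # ... versus one declaration per parameter
    learning_rate: float = 0.01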

Another question is whether I should keep the 'Yoda conditions' that are currently used for the checks (example: passthrough < 0.0 or 1.0 < passthrough). Is this the code style you use? I don't really know why, but my brain doesn't seem to like these statements, and I always have to mentally convert them to variable < constant. Of course, this is a very specific example where we could use Python's chained comparisons: not (0.0 <= passthrough <= 1.0) (for some reason this is easy to read for me, but the single 0.0 <= passthrough takes time to digest).
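Side by side, the two styles in question (the error messages are just placeholders):

def check_passthrough(passthrough: float) -> None:
    # Yoda-style check as currently written:
    if passthrough < 0.0 or 1.0 < passthrough:
        raise ValueError("passthrough must be in [0, 1]")
    # Chained-comparison alternative:
    if not (0.0 <= passthrough <= 1.0):
        raise ValueError("passthrough must be in [0, 1]")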

Anyway, I'll play around a bit more in case I find a reasonable (elegant) way to use pydantic for validation. If I don't find a way, I'll at least do the refactor.


For documentation purposes: I just gave it a try out of curiosity, and it is __sklearn_clone__ that we have to override. The documentation indicates, however, that scikit-learn>=1.3 is required.

    # Note: requires `import copy` and `from sklearn.base import clone` at module level.
    def __sklearn_clone__(self, *, safe=True):
        """Customized implementation of clone. See :func:`sklearn.base.clone` for details."""
        klass = self.__class__
        new_object_params = self.get_params(deep=False)
        for name, param in new_object_params.items():
            new_object_params[name] = clone(param, safe=False)

        new_object = klass(**new_object_params)
        try:
            new_object._metadata_request = copy.deepcopy(self._metadata_request)
        except AttributeError:
            pass

        # _sklearn_output_config is used by `set_output` to configure the output
        # container of an estimator.
        if hasattr(self, "_sklearn_output_config"):
            new_object._sklearn_output_config = copy.deepcopy(
                self._sklearn_output_config
            )
        return new_object

@paulbkoch (Collaborator)

Sure, having a separate _validate function is already a good first step. A question to consider is whether it is still a good idea to separate the EBMs and the DP versions (using your suggested class hierarchy), and whether we should use the vanilla dataclass offered by Python. It doesn't do any validation, but it avoids the boilerplate of writing the __init__ function with all the self.attribute = attribute statements.

Avoiding boilerplate through dataclass sounds like a good idea. I guess this implies having the class hierarchy since we conditionally assign instance attributes based on whether the class is differentially private or not.

Another question is whether I should keep the 'Yoda conditions' that are currently used for the checks (example: passthrough < 0.0 or 1.0 < passthrough). Is this the code style you use? I don't really know why, but my brain doesn't seem to like these statements, and I always have to mentally convert them to variable < constant. Of course, this is a very specific example where we could use Python's chained comparisons: not (0.0 <= passthrough <= 1.0) (for some reason this is easy to read for me, but the single 0.0 <= passthrough takes time to digest).

If you want to change the ordering, go for it! I do use Yoda conditionals in C++ to combat assignment bugs, since "if(a = 5)" compiles but "if(5 = a)" does not. I don't care about the order in Python, though, since assignment isn't legal inside a conditional in Python.

Anyway, I'll play around a bit more in case I find a reasonable (elegant) way to use pydantic for validation. If I don't find a way, I'll at least do the refactor.

Sounds good. Thanks again for investigating all these options!

DerWeh added 11 commits March 7, 2024 01:35
We use a regular Python dataclass in combination with deferred
parameter validation using pydantic's `TypeAdapter`. This is necessary to
be compatible with scikit-learn's `clone` function.

In case we are not interested in an accurate result, we can drastically
reduce parameters to reduce the fit time.
On my old laptop, this reduces test time from 8 minutes to 1 minute.

@DerWeh (Contributor, Author) commented Mar 7, 2024

OK, the parameter validation seems to be working now. I managed to find a way based on pydantic's TypeAdapter.

The DP versions are currently broken, as I focused only on the non-private versions.

I am currently looking into breaking up the fit method a bit, looking for a way to avoid as much code duplication as possible when splitting between DP and non-private EBMs.
But this will probably take quite some time. The fit method is quite the monolith, so it takes a lot of time to understand what's going on.
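For reference, a minimal sketch of the TypeAdapter idea (hypothetical names, assuming pydantic>=2; the PR's actual code differs):

from dataclasses import dataclass, asdict
from typing import Annotated
from pydantic import Field, TypeAdapter

@dataclass
class EBMParams:
    # Plain dataclass: __init__ does no validation, so sklearn's clone() stays happy.
    learning_rate: Annotated[float, Field(gt=0.0)] = 0.01
    validation_size: Annotated[float, Field(ge=0.0, le=1.0)] = 0.15

_adapter = TypeAdapter(EBMParams)

def validate_params(params: EBMParams) -> EBMParams:
    # Deferred validation, e.g. called at the start of fit(); raises
    # pydantic.ValidationError on bad values and returns a checked copy.
    return _adapter.validate_python(asdict(params))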

@DerWeh (Contributor, Author) commented Mar 13, 2024

Hi @DerWeh -- I just noticed something a bit concerning. I downloaded the documentation produced from the PR and the documentation for the ExplainableBoostingClassifier and ExplainableBoostingRegressor is missing. Not sure what the problem is yet.

The docs are available to download here: https://dev.azure.com/ms/interpret/_build/results?buildId=552050&view=artifacts&pathAsName=false&type=publishedArtifacts

We use Jupyter Book to make our documentation which sits on top of Sphinx. The ExplainableBoostingClassifier is documented here: https://github.com/interpretml/interpret/blob/develop/docs/interpret/python/api/ExplainableBoostingClassifier.ipynb

I'll have a look.

Sorry, I have never really used notebooks to generate documentation, and I am struggling a bit with the Azure pipeline.
However, when I tried to build the documentation locally, the ExplainableBoostingClassifier documentation was empty. Is this the issue you are seeing? Locally, the problem is that requirements.txt pins typing_extensions==4.5.0. This is not compatible with pydantic, causing the documentation build to fail. If I update typing_extensions, everything seems to be fine.

Where does the pinning come from? Can I just update the package?

@paulbkoch (Collaborator)

Sorry, I have never really used notebooks to generate documentation, and I am struggling a bit with the Azure pipeline. However, when I tried to build the documentation locally, the ExplainableBoostingClassifier documentation was empty. Is this the issue you are seeing? Locally, the problem is that requirements.txt pins typing_extensions==4.5.0. This is not compatible with pydantic, causing the documentation build to fail. If I update typing_extensions, everything seems to be fine.

Where does the pinning come from? Can I just update the package?

Yes, just update the pinning if necessary. I think this is pinned because Jupyter Book changes format frequently and pinning preserves documentation consistency, but we'll need to update these periodically.

@paulbkoch (Collaborator)

If we dropped python 3.8 support, we could eliminate the typing_extensions dependency, right? Seems like a good tradeoff.

@paulbkoch (Collaborator)

I've updated the develop branch pipelines to use python 3.9 for building the docs (and other build tasks).

@DerWeh (Contributor, Author) commented Mar 20, 2024

Sorry for the wait; I didn't manage to spend much time on it last week.

Indeed, we only use Annotated from typing_extensions, which is available in the standard library typing module since Python 3.9.

However, the import is hard-coded, so I expect that typing_extensions needs to be installed in any case. Surprisingly, the docs build locally on my machine using Python 3.9 with the requirements.txt file.

I just googled a little: if you want to keep supporting Python 3.8 while keeping the dependencies as slim as possible, there is an option to make dependencies specific to a Python version.

setup(
    ...,
    install_requires=[
        "typing_extensions;python_version<'3.9'",
    ],
)

Of course, this would complicate the source code:

import sys
...
if sys.version_info < (3, 9):
    from typing_extensions import Annotated
else:
    from typing import Annotated

Your thoughts @paulbkoch?


I hope that I can resolve the conflicts with interpret 0.6.0 in the next few days and continue on this PR.

@paulbkoch (Collaborator)

Hi @DerWeh -- I'm not too fussy on this, but I think of the options given I'd have a preference towards keeping things simple and dropping support for 3.8. Anyone who needs python 3.8 can just stick with interpret 0.6.0 until support for 3.8 is officially ended in a few months.

@DerWeh (Contributor, Author) commented Jun 11, 2024

@paulbkoch The regular and the DP EBMs are now separated (though it is not quite pretty yet). Maybe this is the right point to reevaluate if this is where we want to go.

If so, I'll try to resolve the conflicts (the public API changed). It would probably be best to get it rapidly into a mergeable form, so that the develop branch doesn't deviate further. It's probably simpler to add small changes afterward.

@paulbkoch (Collaborator)

Hi @DerWeh! Great to hear you're still looking at this PR. I'm still thinking the following class hierarchy would be good to implement:

EBMBase -> abstract base class for all EBM models
EBMModel(EBMBase) -> instantiable class for normal EBM classifiers/regressors
ExplainableBoostingClassifier(EBMModel)
ExplainableBoostingRegressor(EBMModel)
DPEBMModel(EBMBase) -> instantiable class for differential privacy classifiers/regressors
DPExplainableBoostingClassifier(DPEBMModel)
DPExplainableBoostingRegressor(DPEBMModel)

Looking back at our discussion, there were some questions about whether EBMBase should be abstract or not, but I think we were in agreement on the rest? I'm fine with either abstract or not and would defer to someone (you) with more Python experience. It might make more sense to break this PR into one that handles the class hierarchy change and another with the rest, but do whatever works best for you.

Yeah, sorry about all the changes in that section of code. I added some new configuration options for greedy boosting that required some reworking there.

@DerWeh (Contributor, Author) commented Jun 12, 2024

Sorry, @paulbkoch, I wasn't clear enough. The class hierarchy is already implemented:

class EBMBase(ABC, BaseEstimator): ...
class EBMModel(EBMBase): ...
class ExplainableBoostingClassifier(EBMModel, ClassifierMixin, ExplainerMixin): ...
class ExplainableBoostingRegressor(EBMModel, RegressorMixin, ExplainerMixin): ...
class DPEBMModel(EBMBase): ...
class DPExplainableBoostingClassifier(DPEBMModel, ClassifierMixin, ExplainerMixin): ...
class DPExplainableBoostingRegressor(DPEBMModel, RegressorMixin, ExplainerMixin): ...

It passes all tests, but it currently cannot be merged because the API on develop changed. What I would like to ask of you is whether you could check out the PR and try it, to see if you like the changes. I only use EBMs in a limited context and don't have all intended usages in mind. I would like confirmation that this PR is going in the right direction; of course, critique is always more than welcome. When or if you're OK with the PR, I'll resolve the conflicts with the develop branch.

Yeah, sorry about all the changes in that section of code. I added some new configuration options for greedy boosting that required some reworking there.

Don't worry, it's not your fault that I am so slow. Every improvement is welcome.

@paulbkoch (Collaborator)

@DerWeh -- Overall I really like these changes. The code appears a lot more modular now, and I see you were able to use the new inheritance structure to good effect by creating private functions that differentiate the work based on whether the model is private or not. I also really like the declarative data validation that pydantic enables.

One area I'm less sure about is the use of @dataclass to remove the boilerplate __init__ functions. I like the theory behind this, but it appears that this change removes some control over the ordering of parameters. If I'm reading the docs correctly, you can order the __init__ parameters within each class, but you can't intermix parameters from base classes and child classes. We do have a common API that should maintain the position of some __init__ parameters. I'm less familiar with this aspect of our package since I didn't write this part, but here are the example APIs:

https://github.com/interpretml/interpret/blob/develop/python/interpret-core/interpret/ext/examples.py

EBMs fit into the GlassboxExplainer API, which requires having the feature_names and feature_types parameters first. There might be a way to make this work with @dataclass, but in addition to that, I also like how we are able to group common ideas together in the __init__ function and present them to the user in a certain order, even if most people are using keywords instead of positions for specifying these.
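A small illustration of the ordering constraint (field names are only examples):

from dataclasses import dataclass, fields
from typing import Optional, Sequence

@dataclass
class Base:
    feature_names: Optional[Sequence[str]] = None
    max_bins: int = 256

@dataclass
class Child(Base):
    # Child fields are always appended *after* all Base fields, so feature_types
    # cannot be placed right after feature_names without redeclaring every field.
    feature_types: Optional[Sequence[str]] = None

print([f.name for f in fields(Child)])  # ['feature_names', 'max_bins', 'feature_types']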

Does this make sense?

@DerWeh (Contributor, Author) commented Jun 16, 2024

Thank you for your valuable feedback @paulbkoch. It's important to get the API right. Personally, I opt for keyword-only arguments and declaring the order of parameters an implementation detail. But if you prefer having positional arguments, of course I will respect that.

The question is whether it's enough to be able to explicitly order the attributes of EBMModel and DPEBMModel (option 1)? Or do you want to go further and allow ExplainableBoostingClassifier and ExplainableBoostingRegressor to have argument orders different from EBMModel?

Option 1 should be easily doable in the current form: we drop the @dataclass decorator from EBMBase. It is abstract anyway, so there is no need for it to have an __init__ function. The subclasses EBMModel and DPEBMModel stay dataclasses and need to repeat all arguments of EBMBase, defining the order of the attributes.

To my knowledge, option 2 forces us to drop dataclass completely: we would need to write the __init__ boilerplate ourselves, but we would have complete freedom concerning the attributes. Probably this could also be done by replacing the standard-library dataclass with the much more powerful attrs package, at the cost of adding a dependency (context: attrs was the original inspiration for dataclasses, https://www.attrs.org/en/stable/why.html#data-classes).
I have no experience with attrs and cannot say whether it would be worth adding the dependency.

Are you fine with option 1, or would you prefer option 2, @paulbkoch?

@paulbkoch (Collaborator)

Hi @DerWeh -- I'm probably being a bit too sentimental towards compiled languages, but I'd prefer to go with option 2 and keep the __init__ functions. To me, the bigger win in this PR is eliminating all the boilerplate parameter checking, because I think that is a bigger barrier to entry for people coming to the code: new developers initially have to at least skim through most of it to see if it has any important logic, which it does! The __init__ functions, in contrast, people can mostly skip over after a few seconds, because the __init__ is pretty clearly just passing the parameters to the parent class without modifying them. So, the main pain point with having boilerplate __init__ functions is just maintaining their consistency. I think we're pretty much done making huge changes to these parameters, though. I can see us adding a parameter here or there, but I don't see that as a maintenance deal-breaker.

Perhaps if we moved the ExplainableBoostingClassifier and ExplainableBoostingRegressor and the DP equivalents into their own files, it would further reduce the cognitive load by reducing the length of the _ebm.py file? I think once people skim over these files, they pretty much don't need to re-examine them. I've been wanting for some time to move the DP classes into the "privacy" directory (https://github.com/interpretml/interpret/tree/develop/python/interpret-core/interpret/privacy), and it feels like your other changes to separate DP from non-DP make this possible now. 99% of our users don't need the additional cognitive load of DP, so better to move it out in my opinion.
