MAINT: use pydantic for parameter validation #514
base: develop
Conversation
Hi @DerWeh -- I like what you're doing here and plan to merge it once it fully replaces the existing checks and passes the tests. A few thoughts:

We can add some classes to formalize the DP vs Regular EBM class partition. Someday, I think we might support multi-output EBMs, and one strange type of EBM might be an EBM that has both classification and regression outputs. To create one of these strange beasts, you'd need to directly create an EBMModel object, so I like having the ability to make and use these classes directly. I can't think of any good reason though to mix private and non-private EBMs, so the private vs regular partition can be a first split in the class hierarchy below the EBMModel class.

You probably already saw that I had a test for check_estimator, but disabled it. I think we can probably now increase our minimum scikit-learn version to allow for this check. We didn't pass all the checks since EBMs are more permissive in what they accept than scikit-learn, and I'd like to preserve our user-friendliness in that respect, so I expect we'll want to continue disabling some of the scikit-learn checks.
Sure enough, I typically orient myself by NEP 29, but it is a good idea to be more conservative (after all, NumPy is by now very stable, such that using an older NumPy version is typically sufficient).
Out of curiosity: what advantage do you expect? I don't really see the synergy for multi-output in EBMs compared to deep networks (or deep trees). Can we expect an advantage during the boosting phase?
As soon as it is merged (and when I find time), I can continue working on …
I think you're right that there will be very little technical benefit. Mostly, I think of it as potentially a nice packaging solution for grouping related predictions, although I might reject the idea if it adds anything more than a little complexity.

EBMs do have some benefits in terms of the compactness of the model representation. For example, SoftAtHome, which wrote Ebm2Onnx (https://github.com/interpretml/ebm2onnx), uses EBMs inside their routers, and I also saw a paper where some company was evaluating EBMs to be used inside pacemakers (!!!). Sharing the feature definitions and bin cuts might be nice in those applications, both for the compactness of the model, because it could lower the computation requirements of predictions, and also to speed up predictions.

Probably more likely than boosting multi-output models together would be constructing two separate EBMs and then merging them together as a post-processing step, similarly to how merge_ebm works today. But who knows, maybe someone really will want some super weird objective that includes both classification and regression. I'd like that even if just for the entertainment value of seeing someone use an EBM this way.
We don't allow multi-class interactions in the default API entirely because of the difficulties in visualizing them. You can construct multi-class EBMs with interactions today if you use the measure_interactions function manually. Here's a slightly related example showing how to use it: https://interpret.ml/docs/python/examples/custom-interactions.html
Hi @DerWeh -- I've been thinking through this question the last few days and here's what I came up with in terms of a hierarchy:

EBMBase -> abstract base class for all EBM models

Thoughts?
Signed-off-by: DerWeh <andreas.weh@web.de>
force-pushed from c9ed51e to 56c8135
Yes, the hierarchy sounds very reasonable. Though I am not quite sure whether making EBMBase an abstract base class has any benefit. See the discussion: https://stackoverflow.com/questions/3570796/why-use-abstract-base-classes-in-python But it doesn't really seem very important to me, so I am fine with either.

However, now that EBMs have the checks for … What is your opinion @paulbkoch?
You're probably right that it has very little technical benefit in python. I was thinking of it mainly as a form of documentation to indicate that the EBMBase class should not be instantiated directly and then just using duck typing as we currently are. I haven't written as much python as you have (seemingly) and have never made an abstract class in python, but I'm assuming adding the abstract part is fairly lightweight.
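The abstract-class idea discussed here is indeed lightweight; a minimal sketch (the class names are only illustrative stand-ins, not the real interpret classes):

```python
from abc import ABC, abstractmethod


# Illustrative sketch: marking the base class abstract documents that it
# should not be instantiated directly, at essentially no runtime cost.
class EBMBase(ABC):
    @abstractmethod
    def fit(self, X, y): ...


class ConcreteEBM(EBMBase):
    def fit(self, X, y):
        return self
```

Attempting `EBMBase()` raises a `TypeError`, while subclasses that implement all abstract methods behave as usual.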
Yeah, that does seem a bit ugly. I didn't realize pydantic was unable to use deferred checks. Even if we get the clone function working though, it's not clear if the scikit-learn requirement to defer checks will break anything else now or in the future. Maybe we should instead move back to option #2 @DerWeh where we move the checks into a separate _validate function? That would at least move them out of the way so that someone reading the fit function would only need to see them if they were specifically interested in the checks. And we'd get to keep the more descriptive error messages too.
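A minimal sketch of what that separate-`_validate` shape could look like (the class name, parameters, and checks here are hypothetical, not the actual interpret code):

```python
# Hypothetical sketch: checks live in a private _validate method that
# fit() calls, so __init__ stays a plain parameter store (as scikit-learn
# expects for clone) while the descriptive error messages are kept.
class EBMModel:
    def __init__(self, max_bins=256, learning_rate=0.01):
        self.max_bins = max_bins
        self.learning_rate = learning_rate

    def _validate(self):
        if self.max_bins < 2:
            raise ValueError(f"max_bins={self.max_bins}, but at least 2 bins are required")
        if self.learning_rate <= 0:
            raise ValueError(f"learning_rate={self.learning_rate}, but it must be positive")

    def fit(self, X, y):
        self._validate()  # deferred: runs at fit time, not in __init__
        return self
```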
I agree, it should be lightweight, so I never did a benchmark (or in fact even used it). For documentation purposes, the most important part is probably not to make the class public.
Sure, having a separate …

Another question is whether I should keep the 'Yoda conditions' that are currently used for type checks (example: …).

Anyway, I'll play around a bit more in case I find a reasonable (elegant) way to use …

For documentation purposes: I just gave it a try out of curiosity, and it is …:

```python
import copy

from sklearn.base import clone


def __sklearn_clone__(self, *, safe=True):
    """Customized implementation of clone. See :func:`sklearn.base.clone` for details."""
    klass = self.__class__
    new_object_params = self.get_params(deep=False)
    for name, param in new_object_params.items():
        new_object_params[name] = clone(param, safe=False)
    new_object = klass(**new_object_params)
    try:
        new_object._metadata_request = copy.deepcopy(self._metadata_request)
    except AttributeError:
        pass
    # _sklearn_output_config is used by `set_output` to configure the output
    # container of an estimator.
    if hasattr(self, "_sklearn_output_config"):
        new_object._sklearn_output_config = copy.deepcopy(self._sklearn_output_config)
    return new_object
```
Avoiding boilerplate through dataclass sounds like a good idea. I guess this implies having the class hierarchy since we conditionally assign instance attributes based on whether the class is differentially private or not.
If you want to change the ordering, go for it! I do use Yoda-conditionals in C++ to combat assignment bugs since "if(a = 5)" compiles, but "if(5 = a)" does not, but I don't care about the order in python since assignment isn't legal inside a conditional in python.
Sounds good. Thanks again for investigating all these options!
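The dataclass idea discussed above can be sketched briefly; the class and field names here are hypothetical stand-ins, only meant to show how a subclass declares just its additional parameters:

```python
from dataclasses import dataclass, fields


# Hypothetical sketch: a dataclass hierarchy removes the boilerplate
# __init__; the DP subclass only declares its additional parameters.
@dataclass
class EBMModel:
    max_bins: int = 256
    learning_rate: float = 0.01


@dataclass
class DPEBMModel(EBMModel):
    epsilon: float = 1.0
    delta: float = 1e-5
```

The generated `__init__` of `DPEBMModel` accepts the base-class fields first, followed by the subclass's own fields.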
We use a regular Python dataclass in combination with deferred parameter validation using pydantic's `TypeAdapter`. This is necessary to be compatible with scikit-learn's `clone` function. Signed-off-by: DerWeh <andreas.weh@web.de>
In case we are not interested in an accurate result, we can drastically reduce parameters to reduce the fit time. On my old laptop, this reduces test time from 8 minutes to 1 minute. Signed-off-by: DerWeh <andreas.weh@web.de>
Ok, the parameter validation seems to be working now. I managed to find a way based on pydantic's TypeAdapter. The DP versions are currently broken, as I focused only on the non-private versions. I am currently looking into breaking up the …
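A minimal sketch of the TypeAdapter-based deferred validation mentioned here (the estimator and its single parameter are hypothetical, not the actual PR code; assumes pydantic v2):

```python
from typing import Annotated

from pydantic import Field, TypeAdapter

# Hypothetical sketch: the adapter is built once from an annotated type
# and applied at fit time, so __init__ can store parameters unvalidated
# (which scikit-learn's clone relies on).
LearningRate = TypeAdapter(Annotated[float, Field(gt=0.0)])


class Estimator:
    def __init__(self, learning_rate=0.01):
        self.learning_rate = learning_rate  # stored as-is, no check yet

    def fit(self, X, y):
        # validation deferred until fit; raises pydantic.ValidationError
        self.learning_rate_ = LearningRate.validate_python(self.learning_rate)
        return self
```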
force-pushed from 65d7448 to ac45540
Sorry, I never really used notebooks to generate documentation, and I am struggling a bit with the Azure pipeline. Where does the pinning come from? Can I just update the package?
Yes, just update the pinning if necessary. I think this is pinned because Jupyter Book changes format frequently and pinning preserves documentation consistency, but we'll need to update these periodically.
If we dropped python 3.8 support, we could eliminate the typing_extensions dependency, right? Seems like a good tradeoff.
I've updated the develop branch pipelines to use python 3.9 for building the docs (and other build tasks).
Sorry for the wait, I didn't manage to spend much time on it last week. Indeed, we only use Annotated from … However, the import is hard-coded, so I expect that … Just googled a little: if you want to still support Python 3.8 and keep the dependencies as slim as possible, there is an option to make dependencies specific to a Python version:

```python
setup(
    ...,
    install_requires=[
        "typing_extensions;python_version<'3.9'",
    ],
)
```

Of course, this would complicate the source code:

```python
import sys
...

if sys.version_info < (3, 9):
    from typing_extensions import Annotated
else:
    from typing import Annotated
```

Your thoughts @paulbkoch? I hope that I can resolve the conflicts with interpret 0.6.0 in the next few days and continue on this PR.
Hi @DerWeh -- I'm not too fussy on this, but I think of the options given I'd have a preference towards keeping things simple and dropping support for 3.8. Anyone who needs python 3.8 can just stick with interpret 0.6.0 until support for 3.8 officially ends in a few months.
force-pushed from 94cc351 to 385a66b
force-pushed from ec0985f to 0a7bc52
@paulbkoch The regular and the DP EBMs are now separated (though it is not quite pretty yet). Maybe this is the right point to reevaluate whether this is where we want to go. If so, I'll try to resolve the conflicts (the public API changed). It would probably be best to get it rapidly into a mergeable form, so that the develop branch doesn't deviate further. It should be simple to add small changes afterward.
Hi @DerWeh! Great to hear you're still looking at this PR. I'm still thinking the following class hierarchy would be good to implement:

EBMBase -> abstract base class for all EBM models

Looking back at our discussion, there were some questions about whether EBMBase should be abstract or not, but I think we were in agreement on the rest? I'm fine with either abstract or not and would defer to someone (you) with more python experience. It might make more sense to break this PR into one that handles the class hierarchy change, and the rest of this PR, but do whatever works best for you.

Yeah, sorry about all the changes in that section of code. I added some new configuration options for greedy boosting that required some reworking there.
Sorry, @paulbkoch, I wasn't clear enough. The class hierarchy is implemented by now:

```python
class EBMBase(ABC, BaseEstimator): ...

class EBMModel(EBMBase): ...

class ExplainableBoostingClassifier(EBMModel, ClassifierMixin, ExplainerMixin): ...
class ExplainableBoostingRegressor(EBMModel, RegressorMixin, ExplainerMixin): ...

class DPEBMModel(EBMBase): ...

class DPExplainableBoostingClassifier(DPEBMModel, ClassifierMixin, ExplainerMixin): ...
class DPExplainableBoostingRegressor(DPEBMModel, RegressorMixin, ExplainerMixin): ...
```

It passes all tests, but currently cannot be merged as the API on develop changed. What I would like to ask of you is whether you could check out the PR and give it a try, to see if you like the changes. I only use EBMs in a limited context and don't have all intended usages in mind. I would like confirmation whether this PR is going in the right direction; of course, critique is always more than welcome. When or if you're OK with the PR, I'll resolve the conflicts with the develop branch.
Don't worry, it's not your fault that I am so slow. Every improvement is welcome.
@DerWeh -- Overall I really like these changes. The code appears a lot more modular now, and I see you were able to use the new inheritance structure to good effect by creating private functions that differentiate the work based on whether the model is private or not. I also really like the declarative data validation that pydantic enables.

One area I'm less sure about is the use of @dataclass to remove the boilerplate __init__ functions. I like the theory behind this, but it appears that this change removes some control over the ordering of parameters. If I'm reading the docs correctly, you can order the __init__ parameters together within each class, but you can't intermix parameters from base classes and child classes. We do have a common API that should maintain the position of some __init__ parameters. I'm less familiar with this aspect of our package since I didn't write this part, but here are the example APIs: EBMs fit into the GlassboxExplainer API, which requires having the feature_names and feature_types parameters first.

There might be a way to make this work with @dataclass, but in addition to that, I also like how we are able to group common ideas together in the __init__ function and present them to the user in a certain order, even if most people are using keywords instead of position for specifying these. Does this make sense?
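The ordering constraint described here can be demonstrated directly with toy classes (these are not the real EBM classes): with dataclasses, inherited fields always come before the subclass's own fields in the generated `__init__`.

```python
import inspect
from dataclasses import dataclass


@dataclass
class Base:
    feature_names: object = None
    feature_types: object = None


@dataclass
class Child(Base):
    max_bins: int = 256


# Subclass fields are appended after all base-class fields; there is no
# built-in way to interleave them in the generated __init__ signature.
print(inspect.signature(Child.__init__))
```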
Thank you for your valuable feedback @paulbkoch. It's important to get the API right. Personally, I would opt for keyword-only arguments and declare the order of parameters an implementation detail. But if you prefer having positional arguments, of course I will respect that. The question is whether it's enough to be able to explicitly order the attributes of …

Option 1 should be easily doable in the current form: we drop the …

To my knowledge, option 2 forces us to drop …

Are you fine with option 1, or would you prefer option 2, @paulbkoch?
Hi @DerWeh -- I'm probably being a bit too sentimental towards compiled languages, but I'd prefer to go with option 2 and keep the __init__ functions. To me the bigger win in this PR is eliminating all the boilerplate parameter checking, because I think that is a bigger barrier to entry for new people coming to the code, since new developers have to initially at least skim through most of it to see if it has any important logic, which it does! The __init__ functions, in contrast, I feel people coming to the code can skip over for the most part after a few seconds, because the __init__ is pretty clearly just passing the parameters to the parent class without modifying them.

So, the main pain point with having boilerplate __init__ functions is just maintaining their consistency. I think we're pretty much done making huge changes to these parameters though. I can see us adding a parameter here or there, but I don't see that as a maintenance deal-breaker.

Perhaps if we moved the ExplainableBoostingClassifier and ExplainableBoostingRegressor and the DP equivalents into their own files it would further reduce the cognitive load by reducing the length of the _ebm.py file? I think once people skim over these files, they pretty much don't need to re-examine them. I've been wanting for some time to move the DP classes into the "privacy" directory (https://github.com/interpretml/interpret/tree/develop/python/interpret-core/interpret/privacy) and it feels like your other changes to separate DP from non-DP make this possible now. 99% of our users don't need the additional cognitive load of DP, so better to move it out in my opinion.
This is a first proof of concept to use pydantic for the validation of parameters. I used Python 3.10 syntax as it is more concise, and used `pydantic=2.6.2` without putting much thought into it.

Currently, the `fit` method is extremely messy due to all the checks. Of course, a refactoring putting these into a dedicated method would already simplify things. I think using `pydantic` might be a more elegant solution. This draft is mostly for discussion, whether this is a good idea or not. I tried to be faithful to the checks that were implemented.

A disadvantage is that `pydantic` errors are harder to read. On the plus side, the type annotations are much clearer than having to search through the `fit` method. It's probably also much easier to maintain the validation.

Currently, I break the private versions of the classifier. Attributes were set depending on whether the estimator is private or not. Do you think it would be better to define a super class with the common arguments and subclasses with the additional arguments?

Another question is how to best proceed with overwriting the default arguments. Currently, there is a lot of redundancy.

If we want to follow through with this idea, I think some sklearn checks should be enabled again using parametrize_with_checks or check_estimator.
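For reference, a minimal sketch of how scikit-learn's compliance checks can be run, shown here against a stock scikit-learn estimator rather than an EBM:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.utils.estimator_checks import check_estimator

# check_estimator runs scikit-learn's API-compliance test suite against
# an estimator instance and raises on the first failing check.
check_estimator(LogisticRegression())
```

In a pytest suite, `sklearn.utils.estimator_checks.parametrize_with_checks` exposes each individual check as its own test case instead, which makes it easier to mark specific checks as expected failures.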