33
$\begingroup$

Specifically, I am looking for tools with functionality that is specific to feature engineering. I would like to be able to easily smooth, visualize, fill gaps, etc. Something similar to MS Excel, but with R as the underlying language instead of VB.

$\endgroup$

6 Answers

23
$\begingroup$

Very interesting question (+1). While I am not aware of any software tools that currently offer comprehensive functionality for feature engineering, there is definitely a wide range of options in that regard. Currently, as far as I know, feature engineering is still largely a laborious and manual process (e.g., see this blog post). As for the feature engineering subject domain itself, this excellent article by Jason Brownlee provides a rather comprehensive overview of the topic.

Ben Lorica, Chief Data Scientist and Director of Content Strategy for Data at O'Reilly Media Inc., has written a very nice article describing the state-of-the-art (as of June 2014) approaches, methods, tools and startups in the area of automating (or, as he put it, streamlining) feature engineering.

I took a brief look at some of the startups that Ben referenced, and a product by Skytree indeed looks quite impressive, especially in regard to the subject of this question. Having said that, some of their claims sound really suspicious to me (e.g., "Skytree speeds up machine learning methods by up to 150x compared to open source options"). Continuing with commercial data science and machine learning offerings, I have to mention solutions by Microsoft, in particular their Azure Machine Learning Studio. This Web-based product is quite powerful and elegant and offers some feature engineering functionality (FEF). For an example of some simple FEF, see this nice video.

Returning to the question, I think that the simplest approach to automating feature engineering is to use corresponding IDEs. Since you (like me) are interested in the R language as a data science backend, I would suggest checking out, in addition to RStudio, a similar open source IDE called RKWard. One advantage of RKWard over RStudio is that it supports writing plugins for the IDE, thus enabling data scientists to automate feature engineering and streamline their R-based data analysis.

Finally, on the other side of the spectrum of feature engineering solutions we can find some research projects. The two most notable seem to be Stanford University's Columbus project, described in detail in the corresponding research paper, and Brainwash, described in this paper.

$\endgroup$
8
$\begingroup$

Featuretools is a recently released Python library for automated feature engineering. It is based on an algorithm called Deep Feature Synthesis, originally developed at MIT in 2015 and tested on public data science competitions on Kaggle.

Here is how it fits into the common data science process.

[Image: where Featuretools fits into the common data science process]

The aim of the library is not only to help experts build better machine learning models faster, but also to make the data science process less intimidating to people trying to learn. If you have event-driven or relational data, I highly recommend you check it out!
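To give a feel for what automating features on relational data means, here is a minimal hand-written sketch in pandas. It is not the Featuretools API and not the Deep Feature Synthesis algorithm itself; it only illustrates the basic idea of applying aggregation primitives to a child table and attaching the results to a parent table. The tables, column names and the helper `aggregate_child_features` are all hypothetical.

```python
import pandas as pd

# Hypothetical relational data: one row per customer, many transactions each.
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount": [10.0, 30.0, 5.0, 5.0, 20.0, 7.5],
})


def aggregate_child_features(parent, child, key, column, primitives):
    """Apply each aggregation primitive to child[column], grouped by key,
    and attach the results to the parent table as new feature columns."""
    aggregated = child.groupby(key)[column].agg(primitives)
    # Name the new columns after the primitive, e.g. "SUM(amount)".
    aggregated.columns = [f"{p.upper()}({column})" for p in primitives]
    return parent.merge(aggregated, left_on=key, right_index=True, how="left")


features = aggregate_child_features(
    customers,
    transactions,
    key="customer_id",
    column="amount",
    primitives=["sum", "mean", "max", "count"],
)
print(features)
```

An automated tool would, roughly speaking, enumerate such primitives across all related tables and columns for you instead of requiring one call per column.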

Disclaimer: I am one of the developers on the project.

$\endgroup$
2
$\begingroup$

Feature engineering is at the heart of machine learning and is rather laborious and time-consuming. There have been various attempts at automating feature engineering in the hope of taking the human out of the loop. One specific implementation that does this for classification problems is auto-sklearn. It uses an optimization procedure called SMAC under the hood to choose the appropriate set of transforms and algorithm (and algorithm parameters).
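The idea of searching jointly over transforms, algorithm and algorithm parameters can be sketched with plain scikit-learn. Note the assumptions: this is not auto-sklearn's API, and a simple random search (`RandomizedSearchCV`) stands in for SMAC; the dataset and the candidate steps are chosen only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Every step is a placeholder that the search is allowed to swap out.
pipeline = Pipeline([
    ("transform", StandardScaler()),
    ("reduce", "passthrough"),
    ("model", LogisticRegression(max_iter=5000)),
])

# Two sub-spaces: a scaled linear model, and a tree ensemble on raw features.
search_space = [
    {
        "transform": [StandardScaler(), MinMaxScaler()],
        "reduce": ["passthrough", PCA(n_components=10)],
        "model": [LogisticRegression(max_iter=5000)],
        "model__C": [0.01, 0.1, 1.0, 10.0],
    },
    {
        "transform": ["passthrough"],
        "reduce": ["passthrough"],
        "model": [RandomForestClassifier(random_state=0)],
        "model__n_estimators": [50, 100],
        "model__max_depth": [None, 5],
    },
]

# Random search samples 8 candidate configurations and cross-validates each.
search = RandomizedSearchCV(
    pipeline, search_space, n_iter=8, cv=3, random_state=0
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))
```

A model-based optimizer such as SMAC uses the results of earlier trials to decide which configuration to try next, whereas the random search above samples configurations independently.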

Note that Trifacta offers a really easy-to-use tool for data transformation. It has a highly intuitive GUI that lets you set up transformation/feature engineering maps. There is also a free trial version that can be used for reasonably sized problems.

$\endgroup$
1
  • $\begingroup$ auto-sklearn doc not working $\endgroup$
    – Escachator
    Commented Feb 7, 2021 at 21:24
2
$\begingroup$

Scikit-learn has recently released new transformers that tackle many aspects of feature engineering. For example:

  1. You can do multiple missing data imputation techniques with the SimpleImputer (http://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html), including mean, median and arbitrary value imputation in both numerical and categorical variables.

  2. You can do multivariate imputation using several estimators, such as Bayesian ridge regression, random forests and others (comparable to R's MICE, Amelia and missForest) with the IterativeImputer (https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer). Note that it is still marked as experimental and has to be enabled explicitly before import.

  3. You can do categorical one hot encoding with the OneHotEncoder() from Scikit-learn

  4. You can encode categorical variables as numbers with the OrdinalEncoder (for input features) or the LabelEncoder (which is intended for the target variable).

  5. You can do Yeo-Johnson variable transformation with the PowerTransformer (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html)

  6. You can do discretisation with the KBinsDiscretizer (https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization.html)

There are potentially other feature engineering transformers in Scikit-learn and the developers update the library quite regularly.
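As a rough sketch of how several of the transformers listed above fit together, the snippet below combines them in a `ColumnTransformer`. The toy data frame and its column names (`age`, `income_k`, `city`) are made up for illustration, and the particular choice of strategies is only one of many possible set-ups.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
# IterativeImputer is experimental and must be enabled before it is imported.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder, PowerTransformer

# Hypothetical toy data with gaps in numerical and categorical columns.
df = pd.DataFrame({
    "age": [22.0, 35.0, np.nan, 58.0, 41.0, 29.0],
    "income_k": [18.0, 52.0, 47.0, np.nan, 250.0, 31.0],
    "city": pd.Series(
        ["Leeds", np.nan, "York", "Leeds", "York", "Leeds"], dtype=object
    ),
})

# Multivariate imputation followed by a Yeo-Johnson transformation.
numeric = Pipeline([
    ("impute", IterativeImputer(random_state=0)),
    ("power", PowerTransformer(method="yeo-johnson")),
])

# Median imputation followed by equal-width discretisation into 3 bins.
binned_age = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("bins", KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")),
])

# Most-frequent imputation followed by one hot encoding.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    [
        ("numeric", numeric, ["age", "income_k"]),
        ("binned_age", binned_age, ["age"]),
        ("categorical", categorical, ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(df)
print(features.shape)
```

In a real project the fitted `preprocess` object would be placed in front of a model inside a `Pipeline`, so that the same transformations are learned on the training data and reapplied to new data.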

As an alternative to the well-known Scikit-learn library, there is a recently released open source library called feature-engine. With feature-engine you can:

  1. Do mean, median, arbitrary, end of tail and random imputation in numerical and categorical variables.
  2. Do various types of categorical encoding, including one hot, integer, ordinal, mean encoding and weight of evidence.
  3. Do various variable transformations, including log, reciprocal, exp and Box-Cox.
  4. Do various types of discretisation, including equal frequency, equal distance and tree based.
  5. Handle outliers.
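For readers unfamiliar with two of the encodings named in this list, here is a hand-rolled pandas sketch of mean encoding and weight of evidence for a binary target. It does not use the feature-engine API; the data are made up, and the weight of evidence convention assumed here is the log of the ratio between a category's share of all positives and its share of all negatives.

```python
import numpy as np
import pandas as pd

# Hypothetical data: every category contains both classes, so no log(0).
df = pd.DataFrame({
    "colour": ["red", "red", "red", "blue", "blue", "blue",
               "green", "green", "green"],
    "target": [1, 0, 1, 1, 0, 0, 1, 1, 0],
})

# Mean encoding: replace each category by the mean target within it.
mean_map = df.groupby("colour")["target"].mean()
df["colour_mean"] = df["colour"].map(mean_map)

# Weight of evidence (assumed convention):
#   ln( share of all positives in the category / share of all negatives )
positives = df.groupby("colour")["target"].sum()
negatives = df.groupby("colour")["target"].count() - positives
woe_map = np.log((positives / positives.sum()) / (negatives / negatives.sum()))
df["colour_woe"] = df["colour"].map(woe_map)

print(df)
```

Both mappings should be learned on the training set only and then applied to new data, otherwise the target leaks into the features.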

More details can be found in the GitHub repo and the docs (https://feature-engine.trainindata.com).

Disclaimer: I created feature-engine and made it open source.

Another open source Python package, Category Encoders, supports different types of categorical variable encoding: https://contrib.scikit-learn.org/categorical-encoding/

$\endgroup$
0
1
$\begingroup$

You should consider checking out the Azure Machine Learning platform. It is online and you can use it with a free account.

Azure ML provides you with a workflow built from modules in a graphical user interface. Many of them are related to data munging, so you can easily clean your data. If there is something that you cannot do in the GUI, you can just add a module that lets you run a custom R or Python script to manipulate your data.

The nice part is that you can easily visualise your data at any time and check simple stats, similar to what pandas' DataFrame.describe() or R's summary() give you.

$\endgroup$
1
$\begingroup$

Amazon Machine Learning is a tool that I sometimes use for feature engineering.

Since AWS services have a strong track record, I would definitely count on Amazon ML and its promise of making the workflow of data scientists simpler. As of now, though, it is still small.

Still, since you asked for feature engineering tools, this is one of them.

Some FAQs about using Amazon ML.

$\endgroup$
