-1
$\begingroup$

I struggled to find a clear solution online so I'm resorting to asking a fresh question.

I have two finite datasets (call them 40x1 vectors) of equal length that produce non-linear curves when plotted.

What are some methods for determining a 'best fit' equation that can convert dataset A into dataset B? I figure there must be some Matlab or Python function out there, but I have been unable to identify it. Also, I realize that there are infinitely many valid equations that meet this criterion. Primarily, I'm looking for a tool that lets me choose a function type (polynomial, exponential, log, etc.) and test different options until I find an ideal match.

Some context: I have captured an experimental dataset A and an experimental dataset B. These trends are similar in shape, but differ in steepness, curvature, intercept, etc. Specifically, dataset A is the output voltage of a sensor for a particular gas species. Dataset B is the output of the same sensor, but for a different gas species. Since the sensor is the same, we expect a dependent relationship to exist between A and B.

In the future, I would like to be able to capture a new dataset B, and to rely on a relationship equation to generate calculated values for what a corresponding dataset A would look like. Essentially, this is a regression problem.

The tools that I'm most comfortable with are Matlab and Python, but I am open to all suggestions. Thanks!

$\endgroup$
6
  • $\begingroup$ The answers to "what tool ..." may resolve to opinions. My Short List: Igor Pro, Origin, ... Matlab, python. Otherwise, please clarify what you want independent of the tool. I interpret your question one way to mean that you have an experimental data set S that should map into another experimental data set E using specific mathematical transformations so that E <- f(S). I interpret your question in another way to mean that you want to do regression fitting to data sets A and B using the same mathematical function but different regression coefficients. $\endgroup$ Commented May 19, 2022 at 22:31
  • $\begingroup$ Some context: I have captured an experimental dataset A and an experimental dataset B. In the future, I would like to be able to capture a new dataset B, only, and to rely on a relationship equation to generate calculated values for what a corresponding dataset A would look like. $\endgroup$ Commented May 19, 2022 at 22:35
  • 1
    $\begingroup$ Still not clear enough. Seems now as though A depends on B, but what makes them different in the first place? Expand your question with specifics. $\endgroup$ Commented May 20, 2022 at 0:48
  • 1
    $\begingroup$ Updated with some additional context. $\endgroup$ Commented May 20, 2022 at 1:39
  • 1
    $\begingroup$ Perhaps better asked on Cross Validated. $\endgroup$
    – Solar Mike
    Commented May 20, 2022 at 6:36

2 Answers

1
$\begingroup$

This is a common question which usually reveals some misunderstandings on the part of the person asking.

First point: in most cases you should have some reasonable premise about the type of equation to fit, such as polynomial, exponential, sinusoidal, etc. If you don't have some idea of how the factors (input variables) should affect the output, then doing a curve fit won't tell you anything.
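As a rough sketch of what "pick a function type and try it" can look like in Python, the snippet below fits three candidate families with `scipy.optimize.curve_fit` and compares their residuals. The data are synthetic stand-ins (an exponential relation plus noise, chosen purely for illustration), and the model functions, starting guesses, and variable names are all assumptions, not anything taken from the question.

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic stand-ins for the two 40-point datasets (assumed, not real sensor data).
rng = np.random.default_rng(0)
b_data = np.linspace(0.5, 3.0, 40)                                   # "dataset B"
a_data = 2.0 * np.exp(0.8 * b_data) + rng.normal(0, 0.05, b_data.size)  # "dataset A"

# Candidate model families, each mapping B -> A.
def quadratic(x, c0, c1, c2):
    return c0 + c1 * x + c2 * x**2

def exponential(x, a, k):
    return a * np.exp(k * x)

def logarithmic(x, a, c):
    return a * np.log(x) + c

# Each entry: (model function, initial parameter guess).
candidates = {
    "quadratic": (quadratic, (1.0, 1.0, 1.0)),
    "exponential": (exponential, (1.0, 0.5)),
    "logarithmic": (logarithmic, (1.0, 1.0)),
}

results = {}
for name, (model, p0) in candidates.items():
    params, cov = curve_fit(model, b_data, a_data, p0=p0)
    residuals = a_data - model(b_data, *params)
    rmse = float(np.sqrt(np.mean(residuals**2)))
    stderr = np.sqrt(np.diag(cov))  # 1-sigma parameter uncertainties
    results[name] = (params, stderr, rmse)
    print(f"{name:12s} rmse={rmse:.4f} params={np.round(params, 3)}")

# Lowest residual wins here, but see the caveat below.
best = min(results, key=lambda n: results[n][2])
print("lowest RMSE:", best)
```

A lower RMSE alone is not proof of the right model, since a family with more free parameters will usually fit more closely. The physical premise should narrow the candidates first.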

Next point: it's often possible to find some multivariate nonlinear formula that fits. See the software "Eureqa", for example. Unfortunately, these fancy fits often end up fitting a curve exactly to noisy data points, obscuring the true relationship.
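To illustrate the overfitting risk, here is a small sketch under assumed conditions: the "true" relation is a quadratic, the noise level is made up, and every other point is held out for checking. A degree-15 polynomial hugs the training points more tightly than a degree-2 one, yet it typically does worse on the held-out points.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Assumed ground truth: a quadratic plus Gaussian noise (illustrative only).
rng = np.random.default_rng(1)
x = np.linspace(0.0, 4.0, 40)
y = 1.0 + 0.5 * x + 0.25 * x**2 + rng.normal(0, 0.3, x.size)

# Hold out every other point to check how each fit generalizes.
train = np.arange(x.size) % 2 == 0
test = ~train

def rmse(model, mask):
    """Root-mean-square error of a fitted polynomial on the selected points."""
    return float(np.sqrt(np.mean((y[mask] - model(x[mask])) ** 2)))

simple = Polynomial.fit(x[train], y[train], deg=2)     # matches the true form
flexible = Polynomial.fit(x[train], y[train], deg=15)  # chases the noise

for label, model in (("degree 2", simple), ("degree 15", flexible)):
    print(f"{label:9s} train rmse={rmse(model, train):.3f} "
          f"held-out rmse={rmse(model, test):.3f}")
```

Comparing held-out error rather than training error is one simple guard against a fit that only reproduces the noise.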

Finally, as you might have guessed, the choice of programming language is completely irrelevant. Any decent language, including MATLAB, R, Julia, Python, and Octave (but NOT that devilspawn Excel), has linear and nonlinear fitting tools available.

$\endgroup$
5
  • $\begingroup$ Excel is like many tools - it is about using it properly. A hammer can be bad when employed incorrectly. However you can replace Excel with SAS, as long as you are happy with the bill. $\endgroup$
    – Solar Mike
    Commented May 20, 2022 at 13:37
  • $\begingroup$ @SolarMike Your comment is technically true but operationally misleading. The moment you start using Excel to do data analysis you run into risks such as: Excel's built-in functions are not always correct (nor can you review the source); every cell requires a formula, leading to a high risk of errors; macro programming is a nightmare and incredibly fragile; hiding/revealing rows in the spreadsheet display can break macros and formulas; and so on. Excel is a spreadsheet. Don't use it for anything else. And that includes not using it as a database tool. $\endgroup$ Commented May 20, 2022 at 14:02
  • $\begingroup$ @SolarMike "excel is like a hammer...." if your hammer has a cracked handle, has a 75-blade SwissArmyKnife welded onto it, has a USB port connected to an internal CPU of unknown origin which has repurposed the USB pins to a proprietary format... $\endgroup$ Commented May 20, 2022 at 14:04
  • $\begingroup$ Oh, so you can't program badly in any of the other software you mentioned... $\endgroup$
    – Solar Mike
    Commented May 20, 2022 at 14:13
  • $\begingroup$ @SolarMike Are you being deliberately obtuse? Bad programming is one thing; the opportunity for undetectable errors to occur is completely different. $\endgroup$ Commented May 20, 2022 at 16:04
1
$\begingroup$

The first step is regression fitting an analytical model to the empirical data. This step is made easier when you can start from a theoretical model. For example, you may have a first-principles reason to expect that the sensor responds linearly to the volume concentration (amount per volume) of the gas species, multiplied by a sensitivity factor specific to each gas species. For your system, consult with physical and analytical chemists to gain insights on how to craft a proper first-principles model.
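As a minimal sketch of this first step, suppose the linear-response example holds and the sensor voltage is a sensitivity factor times concentration plus a constant offset. The offset, the units, the numbers, and the variable names below are all assumptions for illustration. `numpy.polyfit` with `cov=True` returns the fitted coefficients together with their covariance matrix, from which standard errors follow.

```python
import numpy as np

# Hypothetical calibration data: concentration in ppm, sensor output in volts.
rng = np.random.default_rng(2)
conc = np.linspace(0.0, 100.0, 40)
true_sensitivity, true_offset = 0.02, 0.5  # assumed values for the demo
volts = true_offset + true_sensitivity * conc + rng.normal(0, 0.02, conc.size)

# Straight-line least-squares fit; cov=True also returns the covariance matrix.
coeffs, cov = np.polyfit(conc, volts, deg=1, cov=True)
sensitivity, offset = coeffs                   # highest power first
sens_err, offset_err = np.sqrt(np.diag(cov))   # 1-sigma standard errors

print(f"sensitivity = {sensitivity:.4f} +/- {sens_err:.4f} V/ppm")
print(f"offset      = {offset:.3f} +/- {offset_err:.3f} V")
```

Repeating the same fit for each gas gives one set of coefficients with uncertainties per gas, which is the input the comparison step needs.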

Once you have an acceptable and accepted analytical model, you will likely find any number of software applications that can fit it to your experimental data. The trade-off is typically cost versus time investment. The best course of action is to investigate the approaches that researchers in comparable fields of study are using. The temptation to use what is free (e.g. Python) or what you know well (e.g. Matlab) has to be tempered by a realization: if no one has laid foundations or provided a reasonable example path for you in your favorite software package, you will likely spend an inordinate amount of time just getting the tool to work, let alone getting it to work completely for your specific problem. Searching the user forums for the different software with keywords can help (in your case, "sensors" and "regression analysis" come to mind).

The second step is confidence testing. You have to compare the coefficients obtained by fitting to one gas with those obtained by fitting to another gas. This step works better when you have not only the regression coefficients but also their uncertainties. It does not rely on the $R^2$ value. It relies on the fact that coefficient alpha in your model has, say, a value of $23.4 \pm 0.7$ in the regression fit for gas A and $30 \pm 5$ in the regression fit for gas B. A rigorous statistical analysis may then show that the two parameters are not different from each other at some confidence level (e.g. the 95% confidence level). In that case, you cannot use parameter alpha to distinguish the two gases at that confidence level. For this step, engage with folks who have been doing robust, comparative data analysis for insights.
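One simple way to run that comparison, sketched here under stated assumptions: treat each quoted uncertainty as a 1-sigma standard error, treat the two fits as independent, and use a normal approximation for the difference. The helper name `compare_coefficients` is made up for this example, and a more careful analysis might use a t-distribution with the fits' degrees of freedom instead.

```python
import math

def compare_coefficients(value_a, sigma_a, value_b, sigma_b):
    """Two-sided z-test for the difference of two independent estimates.

    Assumes sigma_a and sigma_b are 1-sigma standard errors and that the
    difference is approximately normally distributed.
    """
    diff = value_b - value_a
    sigma_diff = math.sqrt(sigma_a**2 + sigma_b**2)  # errors add in quadrature
    z = diff / sigma_diff
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under the normal
    return z, p

# Values for coefficient alpha quoted above: gas A and gas B.
z, p = compare_coefficients(23.4, 0.7, 30.0, 5.0)
print(f"z = {z:.2f}, two-sided p = {p:.2f}")  # prints: z = 1.31, two-sided p = 0.19

distinguishable = p < 0.05  # 95% confidence level
print("distinguishable at 95%:", distinguishable)
```

With these numbers the difference of 6.6 is only about 1.3 combined standard errors, so alpha does not separate the two gases at the 95% level, mostly because of the large uncertainty on the gas B fit.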

$\endgroup$
