
Regression Analysis & Curve Fitting

Click the chart to add data points, then select your regression model. R², RMSE, the fitted equation and residual lines update instantly.

Scatter plot controls: click to add a point, click near an existing point to remove it. Red lines show the residuals.
Theory & Key Formulas
Minimize the sum of squared residuals:

$$S = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

What is Regression Analysis?

🙋
What exactly is regression analysis? I see the term "curve fitting" a lot.
🎓
Basically, it's a way to find the best mathematical relationship between variables. You have scattered data points, and you want to draw a line or curve that best summarizes their trend. In this simulator, you can click to add your own data points and instantly see the fitted curve.
🙋
Wait, really? How do we decide what "best" means? There are so many possible lines.
🎓
Great question! The most common method is "Least Squares." We define "best" as the line that minimizes the sum of the squared vertical distances from each point to the line. Those distances are called residuals. Try adding a few points and watch the red residual lines change as the model updates.
🙋
Okay, I see the red lines. But the simulator lets me choose different models like "Quadratic" or "Exponential." How do I know which one to use?
🎓
In practice, you use the data's shape and metrics like R². For instance, if your data points curve upwards, a straight line (Linear) will have large, patterned residuals. Switch to "Quadratic" and watch the R² value increase and the residuals become smaller and more random. That's a sign of a better fit.
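
To make the model comparison from the conversation above concrete, here is a minimal sketch in plain Python. It fits a linear and a quadratic polynomial to the same points by least squares and compares their R² values. The helper names (`polyfit`, `predict`, `r_squared`) and the data are made up for illustration; this is not how the simulator itself is implemented.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients [c0, c1, ..., cd] for y = c0 + c1*x + ... + cd*x**d.
    """
    k = degree + 1
    # Augmented matrix [A^T A | A^T y], where A has columns 1, x, x^2, ...
    aug = [
        [sum(x ** (i + j) for x in xs) for j in range(k)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(k)
    ]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution
    coeffs = [0.0] * k
    for i in reversed(range(k)):
        known = sum(aug[i][j] * coeffs[j] for j in range(i + 1, k))
        coeffs[i] = (aug[i][k] - known) / aug[i][i]
    return coeffs


def predict(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))


def r_squared(xs, ys, coeffs):
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - predict(coeffs, x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Made-up points that curve upwards, roughly following y = x^2
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 4.2, 8.8, 16.1, 25.2]

r2_linear = r_squared(xs, ys, polyfit(xs, ys, 1))
r2_quadratic = r_squared(xs, ys, polyfit(xs, ys, 2))
print(f"linear R^2 = {r2_linear:.4f}, quadratic R^2 = {r2_quadratic:.4f}")
```

Solving the normal equations directly is fine for a handful of points and low degrees. For larger problems a numerically more robust routine from a numerical library is usually a better choice.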

Physical Model & Key Equations

The core of regression is minimizing the sum of squared residuals. The residual for a data point $(x_i, y_i)$ is the difference between the observed value $y_i$ and the value predicted by the model $\hat{y}_i$.

$$S = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

Here, $S$ is the sum of squared residuals we aim to minimize. $y_i$ is the actual data point value, and $\hat{y}_i$ is the value predicted by our fitted equation (e.g., $\hat{y}= mx + b$ for a linear model).
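
For the linear model, setting the partial derivatives of $S$ with respect to $m$ and $b$ to zero gives the standard closed-form solution $m = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and $b = \bar{y} - m\bar{x}$. The sketch below applies these formulas in plain Python. The function names and the sample data are illustrative assumptions, not taken from the simulator.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b. Returns (m, b)."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least 2 data points")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_mean - m * x_mean
    return m, b


def residuals(xs, ys, m, b):
    """Observed minus predicted value for each point."""
    return [y - (m * x + b) for x, y in zip(xs, ys)]


# Illustrative data only
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

m, b = fit_line(xs, ys)
res = residuals(xs, ys, m, b)
S = sum(r ** 2 for r in res)  # the quantity being minimized
print(f"y_hat = {m:.2f}x + {b:.2f}, S = {S:.3f}")
```

For this data the fit comes out as $m = 1.94$ and $b = 0.15$, and the residuals sum to zero, which is a general property of least-squares lines that include an intercept.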

To quantify how well the model explains the variation in the data, we use the coefficient of determination, R-squared ($R^2$).

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}= 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$

$SS_{res}$ is the sum of squared residuals (the same as $S$ above). $SS_{tot}$ is the total sum of squares, which measures the total variation of the $y$ data around its mean $\bar{y}$. An $R^2$ close to 1 indicates that the model explains most of that variation; an $R^2$ near 0 means it does little better than simply predicting $\bar{y}$ everywhere.
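
As a small worked example, the following sketch computes $SS_{res}$, $SS_{tot}$, $R^2$ and RMSE from a list of observed values and model predictions. The RMSE is taken here as $\sqrt{SS_{res}/n}$, which is a common definition but an assumption about what the simulator reports. The function name `fit_metrics` and the numbers are made up for illustration.

```python
import math


def fit_metrics(ys, y_hats):
    """Return (ss_res, ss_tot, r2, rmse) for observed ys and predictions y_hats."""
    n = len(ys)
    y_mean = sum(ys) / n
    ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("all y values are identical; R^2 is undefined")
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)  # assumed definition: divide by n
    return ss_res, ss_tot, r2, rmse


# Illustrative values: every prediction is off by 0.5
ys = [2.0, 4.0, 6.0, 8.0]
y_hats = [2.5, 3.5, 6.5, 7.5]

ss_res, ss_tot, r2, rmse = fit_metrics(ys, y_hats)
print(f"SS_res={ss_res}, SS_tot={ss_tot}, R^2={r2}, RMSE={rmse}")
```

Here $SS_{res} = 1$, $SS_{tot} = 20$, so $R^2 = 0.95$ and the RMSE is 0.5, expressed in the same units as $y$.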

Real-World Applications

Predictive Maintenance in Engineering: Sensor data (like vibration, temperature) is collected from machinery over time. Regression models fit trends to this data, predicting when a measurement will cross a failure threshold, allowing maintenance before a breakdown occurs.

Financial Forecasting: Analysts use curve fitting on historical stock prices or economic indicators. While not perfectly predictive, identifying trends (linear growth, exponential decay) helps inform investment strategies and risk assessments.

Drug Dosage Response: In pharmacology, researchers test a drug at different doses and measure a biological response. A sigmoidal model, such as a logistic curve, is often fitted to these data to estimate the ED50, the dose that produces the desired effect in 50% of the population.

CAE & Material Science: When simulating material behavior, stress-strain data from physical tests is fitted with a constitutive model (like a polynomial or power-law). This fitted equation is then programmed into the simulation software to predict how a new part will deform under load.

Common Misconceptions and Points of Caution

First, the assumption that a high R² always means a good model is dangerous. For example, fitting a 5th-order polynomial to material creep data might yield an R² above 0.99, but that curve may have no physical meaning and fail entirely to predict future behavior. In practice, the balance between predictive performance and interpretability is crucial.

Next, the influence of outliers is often overlooked. If your experimental data has just one clearly distant point, the least squares method will be strongly pulled toward it, producing a regression formula that distorts the overall trend. Try it in NovaSolver: add a single point far away from a cluster of points lying in a straight line. You'll see the line shift significantly.

Finally, try to avoid making predictions outside the data range (extrapolation). Using a formula derived from experimental data between 20°C and 80°C to predict behavior at 150°C is very risky, even with a high R². Unforeseen phenomena, like material phase transitions, could occur.