In 1889, Sir Francis Galton, a cousin of Charles Darwin, published a book on heredity, “Natural Inheritance”. He reported his discovery that the sizes of seeds of sweet pea plants appeared to “revert”, or “regress”, to the mean size in successive generations. He also reported the results of a study of the relationship between the heights of fathers and the heights of their sons. A straight line was fit to the data pairs: height of father versus height of son. Here, too, he found a “regression to mediocrity”: the heights of the sons represented a movement away from the heights of their fathers, towards the average height. We credit Galton with the idea of statistical regression.

While most applications of regression analysis may have little to do with the “regression to the mean” discovered by Galton, the term “regression” remains. It now refers to the statistical technique of modeling the relationship between two or more variables. In a general sense, regression analysis means the estimation or prediction of the unknown value of one variable from the known value(s) of the other variable(s). It is one of the most important and widely used statistical techniques in almost all sciences: natural, social or physical.

In this lesson we will focus only on simple regression, that is, linear regression involving only two variables: a dependent variable and an independent variable. Regression analysis for studying more than two variables at a time is known as multiple regression.



Simple regression involves only two variables; one variable is predicted by another variable. The variable to be predicted is called the dependent variable. The predictor is called the independent variable, or explanatory variable. For example, when we are trying to predict the demand for television sets on the basis of population growth, we are using the demand for television sets as the dependent variable and the population growth as the independent or predictor variable.
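
To make the two roles concrete, here is a minimal sketch in Python. The figures are hypothetical, invented purely for illustration, and the fitting method (an ordinary least-squares straight line) is an assumption made for this sketch. The names fit_simple_regression and predict are likewise illustrative. Population growth plays the role of the independent variable x, and TV demand the role of the dependent variable y.

```python
# Hypothetical data, invented for illustration only (not real survey figures).
population_growth = [1.0, 2.0, 3.0, 4.0, 5.0]  # independent variable x
tv_demand = [12.0, 14.0, 16.0, 18.0, 20.0]     # dependent variable y


def fit_simple_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x_new):
    """Estimate the dependent variable for a known value of the independent one."""
    return intercept + slope * x_new


intercept, slope = fit_simple_regression(population_growth, tv_demand)
predicted = predict(intercept, slope, 6.0)
print(f"demand = {intercept:.1f} + {slope:.1f} * growth; at growth 6.0: {predicted:.1f}")
```

Note that the roles are not interchangeable: passing the two lists in the opposite order would fit a different line, one that predicts population growth from TV demand.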

The decision as to which variable is which sometimes causes problems. Often the choice is obvious, as in the case of demand for television sets and population growth, because it would make no sense to suggest that population growth could be dependent on TV demand! Population growth has to be the independent variable and TV demand the dependent variable.


If we are unsure, here are some points that might be of use:


  • if we have control over one of the variables, then that one is the independent variable. For example, a manufacturer can decide how much to spend on advertising and expect his sales to be dependent upon how much he spends
  • if there is any lapse of time between the two variables being measured, then the one measured later must depend upon the one measured earlier; it cannot be the other way round
  • if we want to predict the values of one variable from our knowledge of the other variable, the variable to be predicted must be dependent on the known one