Step 1: Initialization: Set all the weights and threshold levels of the network to random numbers uniformly distributed within a small range.
Step 2: Activation: Activate the back-propagation neural network by applying the inputs and desired outputs.
 Calculate the actual outputs of the neurons in the hidden layers.
 Calculate the actual outputs of the neurons in the output layer.
Step 3: Weight training:
 Update the weights in the back-propagation network by propagating backwards the errors associated with the output neurons.
 Calculate the error gradients of the output-layer neurons and, from them, of the neurons in the hidden layer.
Step 4: Iteration: Increase the iteration count and repeat Steps 2 and 3 until the selected error criterion is satisfied.
(Refer textbook for mathematical equations)
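The four steps above can be sketched as a minimal pure-Python network. The architecture (2 inputs, 2 sigmoid hidden units, 1 output), the AND training set, the learning rate and the epoch count are illustrative assumptions, not the textbook's values.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Training set (assumed for illustration): the logical AND function
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

# Step 1: initialise all weights and thresholds (biases) to small random numbers
n_in, n_hid = 2, 2
w_hid = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hid)]
b_hid = [random.uniform(-0.5, 0.5) for _ in range(n_hid)]
w_out = [random.uniform(-0.5, 0.5) for _ in range(n_hid)]
b_out = random.uniform(-0.5, 0.5)
lr = 0.5  # assumed learning rate

def forward(x):
    # Step 2: actual outputs of the hidden layer, then of the output layer
    h = [sigmoid(sum(w * xi for w, xi in zip(w_hid[j], x)) + b_hid[j])
         for j in range(n_hid)]
    y = sigmoid(sum(w * hj for w, hj in zip(w_out, h)) + b_out)
    return h, y

def mse():
    return sum((t - forward(x)[1]) ** 2 for x, t in data) / len(data)

err_before = mse()
for epoch in range(5000):        # Step 4: iterate until the error is acceptable
    for x, t in data:
        h, y = forward(x)
        # Step 3: error gradient of the output neuron, then of the hidden neurons
        delta_out = (t - y) * y * (1 - y)
        delta_hid = [delta_out * w_out[j] * h[j] * (1 - h[j]) for j in range(n_hid)]
        # propagate the output error backwards into the weight updates
        for j in range(n_hid):
            w_out[j] += lr * delta_out * h[j]
            b_hid[j] += lr * delta_hid[j]
            for i in range(n_in):
                w_hid[j][i] += lr * delta_hid[j] * x[i]
        b_out += lr * delta_out
err_after = mse()
print(err_before, err_after)
```

The error criterion here is simply a fixed epoch budget; in practice training stops when the mean squared error falls below a chosen threshold.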
Issues:
Overfitting
 Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship.
 Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
 A model which has been overfit will generally have poor predictive performance.
 Overfitting depends not only on the number of parameters and the amount of data, but also on the conformity of the model structure to the shape of the data.
In order to avoid overfitting, it is necessary to use additional techniques, e.g. cross-validation, pruning (pre- or post-pruning), and model comparison.
Reasons:
 Noise in the training data.
 Incomplete training data.
 Flaws in the assumed theory.
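As a small illustration of "too many parameters relative to the number of observations", the sketch below fits a degree-(n-1) interpolating polynomial (Lagrange form) through n noisy samples of a simple line. The underlying function, the noise level and the sample points are all assumed for the demonstration.

```python
import random

random.seed(1)

def f(x):
    return 2 * x + 1  # the true underlying relationship (assumed for the demo)

# Eight noisy training observations of the line
train = [(x, f(x) + random.gauss(0, 0.5)) for x in range(8)]
# Noise-free test points that fall between (and beyond) the training points
test = [(x + 0.5, f(x + 0.5)) for x in range(8)]

def lagrange(points, x):
    """Degree n-1 interpolating polynomial: one parameter per observation."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def mse(model, pts):
    return sum((y - model(x)) ** 2 for x, y in pts) / len(pts)

overfit = lambda x: lagrange(train, x)
train_err = mse(overfit, train)  # essentially zero: the noise is memorised
test_err = mse(overfit, test)    # large: the wiggles describe noise, not f
print(train_err, test_err)
```

The complex model reproduces the training data exactly, yet predicts poorly off the training points: it has described the random error instead of the underlying relationship.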
Validation
Validation techniques are motivated by two fundamental problems in pattern recognition: model selection and performance estimation.
Validation Approaches:
 One approach is to use the entire training data to select our classifier and estimate the error rate, but the final model will normally overfit the training data.
 A much better approach is to split the training data into disjoint subsets: cross-validation (the holdout method).
Cross-Validation (The Holdout Method)
The dataset is divided into two groups. Training set: used to train the classifier. Test set: used to estimate the error rate of the trained classifier.
Total number of examples = Training Set +Test Set
Approach:
Random Subsampling
 Random subsampling performs K data splits of the dataset.
 Each split randomly selects a fixed number of examples without replacement.
 For each data split we retrain the classifier from scratch with the training examples and estimate the error with the test examples.
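The holdout split and random subsampling can be sketched as follows; the toy 1-D two-class dataset and the nearest-mean "classifier" are stand-ins chosen only for the example.

```python
import random

random.seed(2)

# Toy 1-D two-class dataset (assumed): class 0 around 0, class 1 around 3
data = [(random.gauss(0, 1), 0) for _ in range(50)] + \
       [(random.gauss(3, 1), 1) for _ in range(50)]

def train_classifier(train):
    """Nearest-mean classifier: a stand-in for the learner being validated."""
    means = {c: sum(x for x, y in train if y == c) /
                sum(1 for _, y in train if y == c) for c in (0, 1)}
    return lambda x: min(means, key=lambda c: abs(x - means[c]))

def error_rate(clf, test):
    return sum(clf(x) != y for x, y in test) / len(test)

def random_subsampling(data, K, test_size):
    errors = []
    for _ in range(K):                              # K independent data splits
        shuffled = random.sample(data, len(data))   # shuffle without replacement
        test, train = shuffled[:test_size], shuffled[test_size:]
        clf = train_classifier(train)               # retrain from scratch per split
        errors.append(error_rate(clf, test))
    return sum(errors) / K                          # average of the K estimates

est = random_subsampling(data, K=10, test_size=30)
print(est)
```

A single split (K = 1) is the plain holdout method; averaging over K random splits reduces the variance of the error estimate.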
K-Fold Cross-Validation
 K-fold cross-validation is similar to random subsampling.
 Create a K-fold partition of the dataset; for each of K experiments, use K-1 folds for training and the remaining one for testing.
 The advantage of K-fold cross-validation is that all the examples in the dataset are eventually used for both training and testing.
 The true error is estimated as the average error rate over the K experiments.
Leave-One-Out Cross-Validation
 Leave-one-out is the degenerate case of K-fold cross-validation, where K is chosen as the total number of examples: exactly one sample is left out at each experiment.
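Both schemes above can be sketched with one function: K-fold partitions the data into K equal folds, and leave-one-out is simply the call with K equal to the number of examples. The toy dataset and nearest-mean classifier are assumptions made for illustration.

```python
import random

random.seed(3)

# Toy 1-D two-class dataset (assumed for the example)
data = [(random.gauss(0, 1), 0) for _ in range(30)] + \
       [(random.gauss(3, 1), 1) for _ in range(30)]
random.shuffle(data)

def train_classifier(train):
    """Nearest-mean classifier, standing in for the learner under study."""
    means = {c: sum(x for x, y in train if y == c) /
                sum(1 for _, y in train if y == c) for c in (0, 1)}
    return lambda x: min(means, key=lambda c: abs(x - means[c]))

def k_fold_cv(data, K):
    fold_size = len(data) // K
    errors = []
    for k in range(K):
        test = data[k * fold_size:(k + 1) * fold_size]             # held-out fold
        train = data[:k * fold_size] + data[(k + 1) * fold_size:]  # other K-1 folds
        clf = train_classifier(train)
        errors.append(sum(clf(x) != y for x, y in test) / len(test))
    return sum(errors) / K   # true error estimated as the average fold error

kfold_est = k_fold_cv(data, K=5)
loo_est = k_fold_cv(data, K=len(data))  # leave-one-out: K = number of examples
print(kfold_est, loo_est)
```

Every example appears in exactly one test fold, so all examples are eventually used for both training and testing.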
Model Comparison:
Models can be evaluated based on their outputs using different methods:
 Confusion Matrix
 ROC Analysis
 Others such as: Gain and Lift Charts, KS Charts
Confusion Matrix (Contingency Table):
 A confusion matrix contains information about the actual and predicted classifications produced by a classifier.
 The performance of such a system is commonly evaluated using the data in the matrix.
 Also known as a contingency table or error matrix, it is a specific table layout that allows visualization of the performance of an algorithm.
 Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.

                      Predicted Positive      Predicted Negative
Positive Examples     True Positive (TP)      False Negative (FN)
Negative Examples     False Positive (FP)     True Negative (TN)
Accuracy: (TP + TN) / Total data count
Precision: TP / (TP + FP) (analogously, the negative predictive value is TN / (TN + FN))

True Positive Rate (TPR, sensitivity): TP / (TP + FN)
True Negative Rate (TNR, specificity): TN / (TN + FP)

False Positive Rate (FPR): FP / (FP + TN)
False Negative Rate (FNR): FN / (FN + TP)
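Given the four cells of the confusion matrix, the measures above can be computed directly; the counts in the example call are made up to exercise the formulas.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Derive the standard rates from the four cells of the confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy":  (tp + tn) / total,
        "precision": tp / (tp + fp),
        "TPR": tp / (tp + fn),   # sensitivity, recall
        "TNR": tn / (tn + fp),   # specificity
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
    }

# Made-up counts, chosen only for illustration
m = confusion_metrics(tp=40, fn=10, fp=5, tn=45)
print(m)
```

Note that each rate divides by a row total of the matrix (actual positives or actual negatives), while precision divides by a column total (predicted positives).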
ROC Analysis
Receiver Operating Characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied.
The curve is created by plotting the true positive rate against the false positive rate at various threshold settings.
The ROC curve is thus the sensitivity as a function of fall-out (the false positive rate).
In general, if the probability distributions for both detection and false alarm are known, the ROC curve can be generated by plotting the cumulative distribution function (the area under the probability density from minus infinity up to the discrimination threshold) of the detection probability on the y-axis versus the cumulative distribution function of the false-alarm probability on the x-axis.
ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution.
ROC analysis is related in a direct and natural way to cost/benefit analysis of diagnostic decision making.
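A minimal sketch of building the ROC points from a finite sample: sort the examples by classifier score and lower the discrimination threshold past one example at a time, tracking (FPR, TPR). The scores and labels are invented purely for the example.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs as the threshold is lowered past each score."""
    P = sum(labels)            # number of positive examples
    N = len(labels) - P        # number of negative examples
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    pts = [(0.0, 0.0)]         # threshold above all scores: nothing positive
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        pts.append((fp / N, tp / P))
    return pts                 # ends at (1, 1): everything predicted positive

# Invented classifier scores and true labels, for illustration only
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
pts = roc_points(scores, labels)
print(pts)
```

Plotting these points traces the ROC curve; a classifier that ranks all positives above all negatives would rise to TPR = 1 before FPR leaves 0.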