Step 1: Initialization: Set all the weights and threshold levels of the network to random numbers uniformly distributed within a small range.

Step 2: Activation: Activate the back-propagation neural network by applying the inputs and the desired outputs.

  1. Calculate the actual outputs of the neurons in the hidden layer(s).
  2. Calculate the actual outputs of the neurons in the output layer.

Step 3: Weight training:

  1. Update the weights in the back-propagation network by propagating backwards the errors associated with the output neurons.
  2. Calculate the error gradients of the output-layer neurons and hence of the neurons in the hidden layer.

Step 4: Iteration: Increase the iteration count and repeat steps 2 and 3 until the selected error criterion is satisfied.

(Refer to the textbook for the mathematical equations.)
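For concreteness, here is a minimal NumPy sketch of the four steps above for a single-hidden-layer sigmoid network trained on an XOR-style toy problem; the layer sizes, learning rate, training data, and stopping criterion are illustrative assumptions, not values from the textbook.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Illustrative data (XOR): inputs X, desired outputs D
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)

# Step 1: initialise weights and thresholds to small uniform random values
W1 = rng.uniform(-0.5, 0.5, (2, 4))   # input -> hidden
b1 = rng.uniform(-0.5, 0.5, (1, 4))
W2 = rng.uniform(-0.5, 0.5, (4, 1))   # hidden -> output
b2 = rng.uniform(-0.5, 0.5, (1, 1))
lr = 0.5

for epoch in range(10000):             # Step 4: iterate until the error criterion is met
    # Step 2: activation (forward pass through hidden and output layers)
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)

    # Step 3: weight training (propagate output errors backwards)
    err = D - Y
    delta_out = err * Y * (1 - Y)                    # output-layer error gradient
    delta_hid = (delta_out @ W2.T) * H * (1 - H)     # hidden-layer error gradient
    W2 += lr * H.T @ delta_out
    b2 += lr * delta_out.sum(axis=0, keepdims=True)
    W1 += lr * X.T @ delta_hid
    b1 += lr * delta_hid.sum(axis=0, keepdims=True)

    if np.mean(err ** 2) < 1e-3:       # selected error criterion (assumed threshold)
        break
```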

Issues:


Overfitting

  • Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship.
  • Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
  • A model which has been overfit will generally have poor predictive performance.
  • Overfitting depends not only on the number of parameters and the amount of data, but also on how well the model structure conforms to the shape of the data.

In order to avoid overfitting, it is necessary to use additional techniques (e.g. cross-validation, pruning (pre- or post-pruning), model comparison).
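As a concrete illustration of the point above that a model with too many parameters relative to the number of observations ends up fitting noise, the following sketch (with made-up data) fits polynomials of increasing degree to a few noisy samples and compares training error against error on held-out points.

```python
import numpy as np

rng = np.random.default_rng(1)
true_f = lambda x: np.sin(x)

# A small noisy training set and a separate held-out set
x_train = np.linspace(0, 3, 10)
y_train = true_f(x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 3, 50)
y_test = true_f(x_test)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # fit a model of the given complexity
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # High-degree fits drive training error toward zero while held-out error grows
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```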

Reasons:

  • Noise in the training data.
  • Incomplete training data.
  • A flaw in the assumed theory.

Validation

Validation techniques are motivated by two fundamental problems in pattern recognition: model selection and performance estimation.

Validation Approaches:

  • One approach is to use the entire training data to select our classifier and estimate the error rate, but the final model will normally overfit the training data.
  • A much better approach is to split the data into disjoint subsets, as in cross-validation (the holdout method).

Cross Validation (The holdout method)

The data set is divided into two groups: a training set, used to train the classifier, and a test set, used to estimate the error rate of the trained classifier.

Total number of examples = training set size + test set size.
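A minimal sketch of the holdout split just described, using a plain NumPy permutation; the 70/30 ratio and the `train_and_score` helper are hypothetical illustrations, not part of the original text.

```python
import numpy as np

def holdout_split(X, y, train_fraction=0.7, seed=0):
    """Split the examples into disjoint training and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(train_fraction * len(X))
    train_idx, test_idx = idx[:n_train], idx[n_train:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Usage: train the classifier on the training set, estimate its error on the test set.
# X_tr, y_tr, X_te, y_te = holdout_split(X, y)
# error_rate = train_and_score(X_tr, y_tr, X_te, y_te)   # hypothetical helper
```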

Approach:

Random Sub sampling

  • Random subsampling performs K data splits of the dataset.
  • Each split randomly selects a fixed number of examples without replacement.
  • For each data split we retrain the classifier from scratch on the training examples and estimate its error on the test examples.
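A sketch of random subsampling under the same assumptions: K independent splits, retraining from scratch on each split, and averaging the test errors. `train_and_score` is again a hypothetical stand-in that trains a classifier and returns its test error rate.

```python
import numpy as np

def random_subsampling_error(X, y, train_and_score, k=10, train_fraction=0.7, seed=0):
    """Average test error over K random train/test splits (drawn without replacement)."""
    rng = np.random.default_rng(seed)
    n_train = int(train_fraction * len(X))
    errors = []
    for _ in range(k):
        idx = rng.permutation(len(X))          # each split selects examples without replacement
        tr, te = idx[:n_train], idx[n_train:]
        errors.append(train_and_score(X[tr], y[tr], X[te], y[te]))
    return float(np.mean(errors))
```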

K-Fold Cross-Validation

  • K-fold cross-validation is similar to random subsampling.
  • Create a K-fold partition of the dataset; for each of the K experiments, use K-1 folds for training and the remaining fold for testing.
  • The advantage of K-fold cross-validation is that all the examples in the dataset are eventually used for both training and testing.
  • The true error is estimated as the average error rate.
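The corresponding K-fold sketch, with the same hypothetical `train_and_score` helper: the data are partitioned into K folds, each fold serves once as the test set, and the true error is estimated as the average fold error.

```python
import numpy as np

def k_fold_error(X, y, train_and_score, k=5, seed=0):
    """Estimate the true error rate as the average error over K folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)              # K disjoint folds
    errors = []
    for i in range(k):
        test_idx = folds[i]                     # one fold for testing
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])  # K-1 for training
        errors.append(train_and_score(X[train_idx], y[train_idx], X[test_idx], y[test_idx]))
    return float(np.mean(errors))
```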

Leave-one-out Cross-Validation

  • Leave-one-out is the degenerate case of K-fold cross-validation, where K is chosen as the total number of examples: in each experiment a single example is left out for testing and the rest are used for training.
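Leave-one-out then needs no new code: it is the K-fold routine sketched above with K set to the number of examples.

```python
# Leave-one-out: each example is the test set exactly once (K = N),
# so it can reuse the k_fold_error sketch above:
# loo_error = k_fold_error(X, y, train_and_score, k=len(X))
```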

Model Comparison:

Models can be evaluated based on their output using different methods:

  1. Confusion Matrix
  2. ROC Analysis
  3. Others such as: Gain and Lift Charts, K-S Charts

Confusion Matrix (Contingency Table):

  • A confusion matrix contains information about the actual and predicted classifications done by a classifier.
  • The performance of such a system is commonly evaluated using the data in the matrix.
  • Also known as a contingency table or error matrix, it is a specific table layout that allows visualization of the performance of an algorithm.
  • Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.

 

|                   | Predicted Positive  | Predicted Negative  |
|-------------------|---------------------|---------------------|
| Positive Examples | True Positive (TP)  | False Negative (FN) |
| Negative Examples | False Positive (FP) | True Negative (TN)  |

 

Accuracy: (TP + TN) / total number of examples = (TP + TN) / (TP + TN + FP + FN)

Precision (positive predictive value): TP / (TP + FP); the analogous negative predictive value is TN / (TN + FN)

  • True Positive Rate (TPR, sensitivity): TP / (TP + FN)
  • True Negative Rate (TNR, specificity): TN / (TN + FP)
  • False Positive Rate (FPR): FP / (FP + TN)
  • False Negative Rate (FNR): FN / (FN + TP)
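The counts and rates above can be computed directly from paired actual/predicted labels; a minimal pure-Python sketch with made-up label vectors:

```python
def confusion_counts(actual, predicted, positive=1):
    """Tally TP, FN, FP, TN from paired actual/predicted labels."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp, fn, fp, tn

actual    = [1, 1, 0, 0, 1, 0, 1, 0]   # illustrative labels
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
tp, fn, fp, tn = confusion_counts(actual, predicted)

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
tpr = tp / (tp + fn)        # sensitivity / recall
tnr = tn / (tn + fp)        # specificity
fpr = fp / (fp + tn)        # fall-out
fnr = fn / (fn + tp)
print(accuracy, precision, tpr, tnr, fpr, fnr)
```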

ROC Analysis

Receiver Operating Characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied.

The curve is created by plotting the true positive rate against the false positive rate at various threshold settings.

The ROC curve is thus the sensitivity as a function of fall-out.

In general, if the probability distributions for both detection and false alarm are known, the ROC curve can be generated by plotting the cumulative distribution function of the detection probability (the area under the probability density from minus infinity up to the discrimination threshold) on the y-axis versus the cumulative distribution function of the false-alarm probability on the x-axis.
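A sketch of generating an empirical ROC curve in the way described above (true positive rate against false positive rate as the discrimination threshold is varied); the score and label arrays are made up for illustration.

```python
import numpy as np

def roc_points(scores, labels):
    """Return (FPR, TPR) pairs as the discrimination threshold is varied."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = scores >= threshold                   # classify as positive above the threshold
        tp = np.sum(predicted & (labels == 1))
        fp = np.sum(predicted & (labels == 0))
        fn = np.sum(~predicted & (labels == 1))
        tn = np.sum(~predicted & (labels == 0))
        points.append((fp / (fp + tn), tp / (tp + fn)))   # (fall-out, sensitivity)
    return points

scores = np.array([0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2, 0.1])  # illustrative classifier scores
labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])
for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```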

ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution.

ROC analysis is related in a direct and natural way to cost/benefit analysis of diagnostic decision making.