Instance-based classifiers store the training records and use them to predict the class label of unseen cases.

Examples:

i.  Rote-learner

  • Memorizes the entire training data and performs classification only if the attributes of a record exactly match one of the training examples.

ii. Nearest neighbor

  • Uses the k “closest” points (nearest neighbors) for classification. The k-nearest neighbors of a record ‘X’ are the data points that have the k smallest distances to ‘X’.
  • Classification is based on learning by analogy, i.e. by comparing a given test tuple with training tuples that are similar to it.
  • Training tuples are described by n attributes.
  • When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k training tuples that are closest to the unknown tuple.
  • A nearest neighbor classifier requires:
    • A set of stored records.
    • A distance metric to compute the distance between records; any standard measure, such as Euclidean distance, can be used.
    • The value of ‘K’, the number of nearest neighbors to retrieve.
  • To classify an unknown record:
    • Compute its distance to the training records.
    • Identify the k nearest neighbors.
    • Use the class labels of the nearest neighbors to determine the class label of the unknown record. In case of conflict, use majority vote for classification.
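The classification steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation; the function and variable names (`knn_classify`, `euclidean`, etc.) are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    # Straight-line distance between two numeric records
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training, labels, x, k=3):
    # Sort training-record indices by their distance to the unknown record x
    order = sorted(range(len(training)), key=lambda i: euclidean(training[i], x))
    # Collect the class labels of the k closest records
    nearest_labels = [labels[i] for i in order[:k]]
    # Majority vote resolves conflicts among the k neighbors
    return Counter(nearest_labels).most_common(1)[0][0]

# Usage: two well-separated classes in 2-D space
train = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
classes = ["A", "A", "A", "B", "B", "B"]
print(knn_classify(train, classes, (2, 2), k=3))  # A
print(knn_classify(train, classes, (8, 7), k=3))  # B
```

Note that all training records must be kept in memory and scanned for every query, which is exactly why lazy learners are cheap to train but expensive at classification time.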

Issues in k-nearest neighbor classification


i.  Choosing the value of K

  • One challenge in classification is choosing an appropriate value of K. If K is too small, the classifier is sensitive to noise points. If K is too large, the neighborhood may include points from other classes.
  • As the value of K changes, the classification result may vary.
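The sensitivity to K can be demonstrated with a small sketch (plain Python; the dataset, including the deliberately mislabeled noise point, is made up for illustration):

```python
import math
from collections import Counter

train = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (9, 9)]
labels = ["A", "A", "A", "B", "B", "B"]  # (2, 2) is a mislabeled noise point
query = (2.1, 2.1)

# Sort all (record, label) pairs by distance to the query point
by_dist = sorted(zip(train, labels), key=lambda p: math.dist(p[0], query))

results = {}
for k in (1, 3, 5):
    # Majority vote among the k nearest labels
    vote = Counter(lbl for _, lbl in by_dist[:k]).most_common(1)[0][0]
    results[k] = vote
    print(k, "->", vote)
# k=1 -> B  (the noise point alone decides)
# k=3 -> A  (the surrounding class outvotes the noise point)
```

With K = 1 the single noisy neighbor determines the answer; a larger K smooths it out.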

ii.  Scaling Issue

  • Attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes, e.g. height, temperature, etc.
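One common remedy is min-max scaling, which rescales each attribute to [0, 1]. A minimal sketch (the sample height/weight records are hypothetical):

```python
# Height in metres and weight in kilograms: on the raw scale, weight
# differences dwarf height differences in a Euclidean distance.
records = [(1.60, 50.0), (1.85, 95.0), (1.70, 70.0)]

def min_max_scale(data):
    # Rescale every attribute (column) to the range [0, 1]
    cols = list(zip(*data))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((v - l) / (h - l) for v, l, h in zip(row, lo, hi))
            for row in data]

scaled = min_max_scale(records)
print(scaled)
```

After scaling, each attribute contributes comparably to the distance.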

iii.  Distance computation for non-numeric data

  • Use a distance of 0 for identical values and the maximum possible distance for different values.

iv.  Missing values

  • Use the maximum possible distance.
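Both rules, the non-numeric case and the missing-value case, can be sketched in one per-attribute distance function (plain Python; function names and the assumption that numeric attributes are already scaled to [0, 1] are illustrative):

```python
def attr_distance(a, b):
    # Missing value on either side: assume the maximum possible
    # distance (1.0 after scaling to [0, 1]).
    if a is None or b is None:
        return 1.0
    # Non-numeric (nominal) attribute: 0 if equal,
    # maximum distance otherwise.
    if isinstance(a, str) or isinstance(b, str):
        return 0.0 if a == b else 1.0
    # Numeric attribute, assumed scaled to [0, 1]
    return abs(a - b)

def record_distance(x, y):
    # Sum of per-attribute distances between two records
    return sum(attr_distance(a, b) for a, b in zip(x, y))

d = record_distance(("red", 0.5, None), ("red", 0.8, 0.2))
print(d)  # 0 + 0.3 + 1.0, i.e. approximately 1.3
```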

Disadvantages:

  • Poor accuracy when the data contain noise or irrelevant attributes.
  • Slow when classifying test tuples, because each test tuple must be compared against all stored training records.
  • Classifying unknown records is therefore relatively expensive.