It stores the training records and uses them to predict the class label of unseen cases.
Examples:
i. Rote-learner
- Memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly.
ii. Nearest neighbor
- Uses the k “closest” points (nearest neighbors) for performing classification. The k-nearest neighbors of a record ‘X’ are the data points that have the k smallest distances to ‘X’.
- Classification is based on learning by analogy, i.e. by comparing a given test tuple with training tuples that are similar to it.
- Training tuples are described by n attributes.
- When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k training tuples that are closest to the unknown tuple.
- A nearest neighbor classifier requires:
  - A set of stored records.
  - A distance metric to compute the distance between records. Any standard approach can be used for distance calculation, such as Euclidean distance.
  - The value of ‘k’, the number of nearest neighbors to retrieve.
- To classify an unknown record:
  - Compute its distance to the training records.
  - Identify the k nearest neighbors.
  - Use the class labels of the nearest neighbors to determine the class label of the unknown record. In case of conflict, use a majority vote.
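The two example classifiers above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function names (`rote_classify`, `euclidean`, `knn_classify`) and the toy training set are assumptions made for the example, and Euclidean distance is just one of the standard choices mentioned above.

```python
import math
from collections import Counter


def rote_classify(train, record):
    """Rote learner: return a label only on an exact attribute match, else None."""
    for attributes, label in train:
        if attributes == record:
            return label
    return None


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train, record, k):
    """k-nearest-neighbor classification with a majority vote."""
    # Step 1: compute the distance from the unknown record to every training record.
    distances = [(euclidean(attributes, record), label) for attributes, label in train]
    # Step 2: keep the k training records with the smallest distances.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote over the class labels of those k neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (attributes, class label). The last "B" point
# sits inside the "A" region and plays the role of a noise point.
train = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 5.5), "B"),
    ((2.0, 2.5), "B"),
]

print(rote_classify(train, (5.0, 5.0)))    # exact match in the training data
print(rote_classify(train, (2.0, 2.2)))    # no exact match, so no prediction
print(knn_classify(train, (2.0, 2.2), 1))  # decided by the single closest point
print(knn_classify(train, (2.0, 2.2), 3))  # decided by a vote among three neighbors
```

On this toy data the rote learner cannot label `(2.0, 2.2)` at all, while the nearest-neighbor classifier returns "B" for k = 1 and "A" for k = 3, because the two next-closest points outvote the single closest one.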
Issues with k-nearest neighbor classification
i. Choosing the value of k
- One challenge is choosing an appropriate value of k. If k is too small, the classifier is sensitive to noise points. If k is too large, the neighborhood may include points from other classes.
- The classification result may vary as the value of k changes.
ii. Scaling Issue
- Attributes may have to be scaled to prevent the distance measure from being dominated by an attribute with a larger range of values, e.g. height, temperature, etc.
iii. Distance computation for non-numeric data
- Use a distance of 0 for identical values and the maximum possible distance for different values.
iv. Missing values
- Use the maximum possible distance.
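The scaling, non-numeric and missing-value rules above can be combined into one distance function. The sketch below is one possible way to do it, under stated assumptions: numeric attributes are rescaled to [0, 1] with min-max scaling (a common choice, not one named in these notes), so the maximum possible per-attribute distance is 1; non-numeric values are represented as strings; missing values are represented as `None`. All function names are hypothetical.

```python
import math


def min_max_scale(column):
    """Rescale a list of numeric values to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant attribute carries no distance information.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def attribute_distance(a, b):
    """Per-attribute distance in [0, 1]; numeric values are assumed pre-scaled."""
    if a is None or b is None:
        return 1.0  # missing value: use the maximum possible distance
    if isinstance(a, str) or isinstance(b, str):
        return 0.0 if a == b else 1.0  # non-numeric: 0 if same, maximum if different
    return abs(a - b)


def mixed_distance(x, y):
    """Euclidean-style distance over numeric, non-numeric and missing attributes."""
    return math.sqrt(sum(attribute_distance(a, b) ** 2 for a, b in zip(x, y)))


# Heights in cm are rescaled so they cannot dominate the other attributes.
print(min_max_scale([150, 160, 200]))

# Same scaled height, different colour, one missing third attribute.
print(mixed_distance((0.2, "red", None), (0.2, "blue", 0.5)))
```

With every attribute contributing at most 1, no single attribute can dominate the distance simply because it is measured in larger units.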
Disadvantages:
- Poor accuracy when the data contain noise and irrelevant attributes.
- Slow when classifying test tuples.
- Classifying unknown records is relatively expensive, since the distance to every stored training record must be computed.