A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.
- A cluster of data objects can be treated as one group.
- Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
- While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign the labels to the groups.
- The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.
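The two-step process described above (first partition by similarity, then label the groups) can be sketched with a small k-means-style routine. This is only an illustration: the notes do not name a specific algorithm, and the function name `kmeans`, the sample points, and the starting centroids are all assumptions made for the example.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Partition 2-D points into len(centroids) groups by nearest centroid."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: put each point in the group of its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group.
        for i, g in enumerate(groups):
            if g:
                centroids[i] = (sum(x for x, _ in g) / len(g),
                                sum(y for _, y in g) / len(g))
    return groups, centroids

# Hypothetical data: two visibly separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
groups, centroids = kmeans(points, centroids=[points[0], points[3]])

# Labels are assigned only after the groups have been formed.
labels = {f"cluster_{i}": g for i, g in enumerate(groups)}
```

Unlike classification, no labeled training examples are needed here; the labels in `labels` are simply names attached to groups that the data itself produced.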
Requirements of Clustering in Data Mining
- Scalability − We need highly scalable clustering algorithms to deal with large databases.
- Ability to deal with different kinds of attributes − Algorithms should be applicable to any kind of data, such as interval-based (numerical), categorical, and binary data.
- Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be restricted to distance measures that tend to find only small spherical clusters.
- High dimensionality − The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.
- Ability to deal with noisy data − Databases contain noisy, missing, or erroneous data. Some algorithms are sensitive to such data and may produce poor-quality clusters.
- Interpretability − The clustering results should be interpretable, comprehensible, and usable.
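As an illustration of the "different kinds of attributes" requirement, one common idea is to compute a per-attribute dissimilarity and average the results, in the spirit of Gower-style measures. The sketch below is a simplified version under stated assumptions: the function name `mixed_dissimilarity`, the record layout, and the sample values are hypothetical and not taken from these notes.

```python
def mixed_dissimilarity(a, b, numeric_ranges):
    """Average per-attribute dissimilarity between two records (dicts).

    numeric_ranges maps each numeric attribute to its (min, max) in the data;
    every other attribute is treated as categorical or binary.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            lo, hi = numeric_ranges[key]
            # Numeric: absolute difference scaled to [0, 1] by the attribute range.
            total += abs(a[key] - b[key]) / (hi - lo) if hi > lo else 0.0
        else:
            # Categorical or binary: 0 if the values match, 1 otherwise.
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)

# Hypothetical records mixing numeric, categorical, and binary attributes.
r1 = {"age": 30, "city": "Pune", "smoker": False}
r2 = {"age": 40, "city": "Pune", "smoker": True}
ranges = {"age": (20, 60)}

d = mixed_dissimilarity(r1, r2, ranges)  # (10/40 + 0 + 1) / 3
```

A measure like this can be plugged into any clustering method that only needs pairwise dissimilarities, which is one way to avoid being tied to purely numerical distance.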
Applications of Clustering
- Economic science (especially market research)
- Document classification
- Clustering weblog data to discover groups of similar access patterns
- Spatial Data Analysis: Create thematic maps in GIS by clustering feature spaces
- Image Processing
Spatial Data Mining
Spatial data mining is the application of data mining to spatial models. In spatial data mining, analysts use geographical or spatial information to produce business intelligence or other results. This requires specific techniques and resources to get the geographical data into relevant and useful formats.
- Search for spatial patterns.
- Non-trivial search: as “automated” as possible.
- Large search space of plausible hypotheses.
- Example: Asiatic cholera, whose possible causes included water, food, air, and insects.
- Interesting, useful, and unexpected spatial patterns.
- Useful in certain application domains.
- Example: shutting off the identified water pump saved human lives.
- May provide a new understanding of the world.
- Example: the connection between the water pump and cholera led to the “germ” theory.
Spatial Data Mining Tasks
- Geo-Spatial Warehousing and OLAP
- Spatial data classification/predictive modeling
- Spatial clustering/segmentation
- Spatial association and correlation analysis
- Spatial regression analysis
- Time-related spatial pattern analysis: trends, sequential patterns, partial periodicity analysis
Web mining is the use of data mining techniques to automatically discover and extract information from Web documents.
Web mining can be broadly divided into three distinct categories, according to the kinds of data to be mined:
- Web Content Mining
- Web Structure Mining
- Web Usage Mining
There are three general classes of information that can be discovered by web mining:
- Web activity, from server logs and Web browser activity tracking
- Web graph, from links between pages, people, and other data
- Web content, for the data found on Web pages and inside of documents
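Two of these classes can be sketched with the Python standard library alone: outgoing links as raw material for a Web graph, and request counts from server log lines as a simple form of Web activity. The page content, the log lines, and the `LinkCollector` class name are hypothetical, and the log format is assumed to put the request in the first quoted field.

```python
from collections import Counter
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href targets of <a> tags: raw material for a Web graph."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page content and server log lines, for illustration only.
page = '<p>See <a href="/products">products</a> and <a href="/contact">contact</a>.</p>'
log_lines = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /products HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:10:00:05 +0000] "GET /contact HTTP/1.1" 200 128',
    '10.0.0.1 - - [01/Jan/2024:10:01:00 +0000] "GET /products HTTP/1.1" 200 512',
]

# Web graph: outgoing links of the page.
collector = LinkCollector()
collector.feed(page)
outgoing = collector.links

# Web activity: requests per path, taken from the quoted request field.
hits = Counter(line.split('"')[1].split()[1] for line in log_lines)
```

Real logs and pages are messier than this; the point is only that structure mining starts from links and usage mining starts from access records.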
Uses of Web Content Mining
- To gather, categorize, organize, and provide the best possible information available on the WWW to the user requesting the information.
- To determine the relevance of the content to the search query.
- To improve the navigation of information on the web.
- To produce a higher quality of information for the user.
- To understand customer behavior, evaluate the effectiveness of a particular web site, and help quantify the success of a marketing campaign.
- Application areas: business intelligence, competitive intelligence, pricing analysis, product data, and reputation.
Web mining tools
- Automation Anywhere 1 (AA)
- Web Info Extractor (WIE)
- Web Content Extractor (WCE)
Text mining, also referred to as text data mining and roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived by devising patterns and trends through means such as statistical pattern learning.
Text analysis processes
- Information retrieval or identification of a corpus is a preparatory step: collecting or identifying a set of textual materials, on the Web or held in a file system, database, or content corpus manager, for analysis.
- Although some text analytics systems apply exclusively advanced statistical methods, many others apply more extensive natural language processing, such as part-of-speech tagging, syntactic parsing, and other types of linguistic analysis.
- Named entity recognition is the use of gazetteers or statistical techniques to identify named text features: people, organizations, place names, stock ticker symbols, certain abbreviations, and so on. Disambiguation (the use of contextual clues) may be required to decide whether, for instance, "Ford" refers to a former U.S. president, a vehicle manufacturer, a movie star, a river crossing, or some other entity.
- Recognition of Pattern Identified Entities: Features such as telephone numbers, e-mail addresses and quantities (with units) can be discerned via regular expression or other pattern matches.
- Coreference: identification of noun phrases and other terms that refer to the same object.
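The recognition of pattern-identified entities mentioned above can be sketched with regular expressions. The patterns below are deliberately simplified assumptions (one phone format, a short list of units) and the sample sentence is hypothetical; real text needs broader patterns.

```python
import re

# Simplified, illustrative patterns; real-world formats vary far more widely.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "quantity": re.compile(r"\b\d+(?:\.\d+)?\s?(?:kg|km|cm|mm|g|m)\b"),
}

def find_entities(text):
    """Return every match of each pattern, keyed by entity type."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}

# Hypothetical input sentence.
text = "Call 555-123-4567 or write to sales@example.com about the 2.5 kg parcel."
entities = find_entities(text)
```

Named entities such as "Ford" cannot be handled this way, since they have no fixed surface pattern; that is why the notes list gazetteers and statistical techniques for them separately.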
Applications of Text mining