Data Mining and Data Warehousing - Short Question Answer

This section of Data Mining and Data Warehousing Short Questions and Answers lists some important short questions with answers to help students answer them correctly in their university written exams.

1. What are the alternative terms for data mining?

Some alternative terms for data mining are:

  1. Knowledge mining
  2. Knowledge extraction
  3. Data/pattern analysis
  4. Data archaeology
  5. Data dredging
2. What is KDD?

KDD stands for Knowledge Discovery in Databases.

3. What are the steps involved in KDD?

The steps involved in KDD, in order, are:

  1. Data cleaning
  2. Data integration
  3. Data selection
  4. Data transformation
  5. Data mining
  6. Pattern evaluation
  7. Knowledge presentation
4. What is the use of the knowledge base?

The knowledge base is the domain knowledge that is used to guide the search or to evaluate the interestingness of the resulting patterns.

Such knowledge can include concept hierarchies, which organize attributes or attribute values into different levels of abstraction.

5. What is the architecture of a typical data mining system?

[Figure: architecture of a typical data mining system]

6. What are some of the data mining techniques?

Some of the data mining techniques are:

  1. Statistics
  2. Machine learning
  3. Decision Tree
  4. Hidden Markov models
  5. Artificial Intelligence
  6. Genetic Algorithm
  7. Meta learning
7. Give a few statistical techniques used in data mining.

Some statistical techniques in data mining are:

  1. Point Estimation
  2. Data Summarization
  3. Bayesian Techniques
  4. Testing Hypothesis
  5. Correlation
  6. Regression
8. What is meta learning?

Meta learning is the concept of combining the predictions made by multiple data mining models and analyzing those predictions to formulate a new and previously unknown prediction.

9. What is Genetic algorithm?

A genetic algorithm is a search algorithm. It enables us to locate an optimal binary string by processing an initial random population of binary strings through operations such as artificial mutation, crossover and selection.
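The three operations can be illustrated with a minimal sketch. It assumes a toy fitness function (the number of 1 bits in the string), tournament selection and single-point crossover; the names `one_max` and `genetic_search` and all parameter values are hypothetical choices for illustration, not part of any standard definition.

```python
import random


def one_max(bits):
    """Hypothetical fitness function: the number of 1 bits in the string."""
    return sum(bits)


def genetic_search(length=20, pop_size=30, generations=60,
                   mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    # Initial random population of binary strings.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=one_max)
    for _ in range(generations):
        next_population = []
        while len(next_population) < pop_size:
            # Selection: the fitter of two randomly drawn strings wins.
            parent1 = max(rng.sample(population, 2), key=one_max)
            parent2 = max(rng.sample(population, 2), key=one_max)
            # Crossover: splice the parents at a random cut point.
            cut = rng.randint(1, length - 1)
            child = parent1[:cut] + parent2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b
                     for b in child]
            next_population.append(child)
        population = next_population
        # Remember the best string seen in any generation.
        best = max(population + [best], key=one_max)
    return best


best = genetic_search()
print(best, one_max(best))
```

A real application would replace `one_max` with a fitness function that scores how well a binary string encodes a solution to the problem at hand.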

10. What is the purpose of Data mining Technique?

A data mining technique provides a way to carry out the various data mining tasks.

11. What is a predictive model?

A predictive model is used to predict the values of data by making use of known results from a different set of sample data.

Tasks that belong to the predictive model are:

  1. Classification
  2. Regression
  3. Time series analysis
12. What is a descriptive model?

A descriptive model is used to determine the patterns and relationships in sample data.

Data mining tasks that belong to the descriptive model are:

  1. Clustering
  2. Summarization
  3. Association rules
  4. Sequence discovery
13. What are the advanced database systems?

List of the advanced database systems:

  1. Extended-relational databases
  2. Object-oriented databases
  3. Deductive databases
  4. Spatial databases
  5. Temporal databases
  6. Multimedia databases
  7. Active databases
  8. Scientific databases
  9. Knowledge databases
14. What is Cluster Analysis?

Cluster analysis analyzes data objects without consulting a known class label. The class labels are not present in the training data simply because they are not known to begin with.

15. Classifications of Data mining systems

1. Based on the kinds of databases mined:

  • According to model
    • Relational mining system
    • Transactional mining system
    • Object-oriented mining system
    • Object-Relational mining system
    • Data warehouse mining system
  • Types of Data
    • Spatial data mining system
    • Time series data mining system
    • Text data mining system
    • Multimedia data mining system

2. Based on kinds of Knowledge mined

  • According to functionalities
    • Characterization
    • Discrimination
    • Association
    • Classification
    • Clustering
    • Outlier analysis
    • Evolution analysis
  • According to levels of abstraction of the knowledge mined
    • Generalized knowledge (High level of abstraction)
    • Primitive-level knowledge (Raw data level)
  • According to whether data regularities or data irregularities are mined

3. Based on kinds of techniques utilized

  • According to user interaction
    • Autonomous systems
    • Interactive exploratory system
    • Query-driven systems
  • According to methods of data analysis
    • Database-oriented
    • Data warehouse-oriented
    • Machine learning
    • Statistics
    • Visualization
    • Pattern recognition
    • Neural networks

4. Based on applications adopted

  • Finance
  • Telecommunication
  • DNA
  • Stock markets
  • E-mail and so on
16. Describe challenges to data mining regarding data mining methodology and user interaction issues.

The challenges to data mining regarding data mining methodology and user interaction issues are:

  • Mining different kinds of knowledge in databases
  • Interactive mining of knowledge at multiple levels of abstraction
  • Incorporation of background knowledge
  • Data mining query languages and ad hoc data mining
  • Presentation and visualization of data mining results
  • Handling noisy or incomplete data
  • Pattern evaluation
17. How is a data warehouse different from a database?

A data warehouse is a repository of data from multiple heterogeneous data sources, organized under a unified schema at a single site in order to facilitate management decision-making.

A database consists of a collection of interrelated data.

18. Define support and confidence in Association rule mining.

Support s is the percentage of transactions in D that contain A ∪ B, that is, both A and B.

Confidence c is the percentage of transactions in D containing A that also contain B.

$\text{support}(A \Rightarrow B) = P(A \cup B)$

$\text{confidence}(A \Rightarrow B) = P(B \mid A)$
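As a small illustration of these two definitions, the sketch below computes support and confidence as fractions over a list of transactions. The function names and the four sample transactions are made up for the example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also
    contain the consequent. Assumes the antecedent occurs at least once."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical transaction database D.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(transactions, {"bread", "milk"}))       # 2 of 4 transactions
print(confidence(transactions, {"bread"}, {"milk"}))  # 2 of the 3 with bread
```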

19. Describe the different classifications of Association rule mining.

Based on types of values handled in the Rule

  1. Boolean association rule
  2. Quantitative association rule

Based on the dimensions of data involved

  1. Single dimensional association rule
  2. Multidimensional association rule

Based on the levels of abstraction involved

  1. Multilevel association rule
  2. Single level association rule

Based on various extensions

  1. Correlation analysis
  2. Mining max patterns
20. What is the purpose of Apriori Algorithm?

Apriori algorithm is an influential algorithm for mining frequent item sets for Boolean association rules.

The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent item set properties.

21. How to generate association rules from frequent item sets?

Association rules can be generated as follows:

For each frequent itemset l, generate all non-empty subsets of l.

For every non-empty subset s of l, output the rule "s => (l - s)" if

support_count(l) / support_count(s) >= min_conf

where min_conf is the minimum confidence threshold.
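The procedure above can be sketched as follows. The sketch assumes the support counts are already available in a dictionary keyed by `frozenset`; the function name `generate_rules` and the sample counts are hypothetical. It skips the subset s = l, since that would leave an empty right-hand side.

```python
from itertools import combinations


def generate_rules(frequent_itemset, support_count, min_conf):
    """Return (s, l - s, confidence) for every rule that meets min_conf."""
    l = frozenset(frequent_itemset)
    rules = []
    # Enumerate the non-empty proper subsets s of l.
    for size in range(1, len(l)):
        for subset in combinations(sorted(l), size):
            s = frozenset(subset)
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                rules.append((s, l - s, conf))
    return rules


# Hypothetical support counts taken from some earlier mining pass.
counts = {
    frozenset({"bread"}): 3,
    frozenset({"milk"}): 3,
    frozenset({"bread", "milk"}): 2,
}

for s, rest, conf in generate_rules({"bread", "milk"}, counts, min_conf=0.6):
    print(sorted(s), "=>", sorted(rest), round(conf, 2))
```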

 

22. What are the techniques to improve the efficiency of Apriori algorithm?

Techniques to improve the efficiency of the Apriori algorithm are:

  • Hash-based technique
  • Transaction reduction
  • Partitioning
  • Sampling
  • Dynamic itemset counting

Apriori Algorithm – Frequent Pattern Algorithms


The Apriori algorithm was the first algorithm proposed for frequent itemset mining. It was later improved by R. Agrawal and R. Srikant and came to be known as Apriori. The algorithm uses two steps, "join" and "prune", to reduce the search space. It is an iterative approach to discovering the most frequent itemsets.

The Apriori property says:

  • If P(I) < minimum support threshold, then I is not frequent.
  • If P(I+A) < minimum support threshold, then I+A is not frequent, where A is another item added to the itemset.
  • If an itemset has support below the minimum support threshold, then all of its supersets also fall below minimum support and can be ignored. This property is called the antimonotone property.

The steps followed in the Apriori algorithm of data mining are:

  1. Join step: This step generates candidate (K+1)-itemsets from the frequent K-itemsets by joining the set of K-itemsets with itself.
  2. Prune step: This step scans the database to count each candidate itemset. If a candidate does not meet minimum support, it is regarded as infrequent and is removed. This step is performed to reduce the size of the candidate itemsets.
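The join and prune steps can be sketched in a few lines. This is a simplified illustration, not an optimized implementation: it assumes transactions are given as sets of items and that minimum support is given as an absolute count. The function name `apriori` and the sample data are chosen for the example.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return a dict mapping each frequent itemset to its support count."""
    transactions = [frozenset(t) for t in transactions]

    # First pass: count the individual items (1-itemsets).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support_count}
    frequent = dict(current)

    k = 1
    while current:
        # Join step: combine frequent k-itemsets into (k+1)-itemset candidates.
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) == k + 1:
                # Antimonotone property: every k-subset must be frequent.
                if all(frozenset(sub) in current
                       for sub in combinations(union, k)):
                    candidates.add(union)
        # Prune step: scan the database and drop infrequent candidates.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support_count}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
for itemset, count in apriori(transactions, min_support_count=2).items():
    print(sorted(itemset), count)
```

In this sample, {milk, butter} appears in only one transaction, so the candidate {bread, milk, butter} is discarded by the antimonotone check without another database scan.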
23. What is Clustering and Cluster Analysis?

What is Clustering?

Clustering is the process of grouping physical or conceptual data objects into clusters.

What do you mean by Cluster Analysis?

Cluster analysis is the process of analyzing the various clusters to organize the different objects into meaningful and descriptive groups.

24. What are the fields in which clustering techniques are used?
  • Clustering is used in biology to develop new plant and animal taxonomies.
  • Clustering is used in business to enable marketers to discover distinct groups of their customers and characterize each customer group on the basis of purchasing patterns.
  • Clustering is used in the identification of groups of automobile insurance policy holders.
  • Clustering is used in the identification of groups of houses in a city on the basis of house type, cost and geographical location.
  • Clustering is used to classify documents on the web for information discovery.
25. What are the requirements of cluster analysis?

The basic requirements of cluster analysis are:

  • Dealing with different types of attributes
  • Dealing with noisy data
  • Constraints on clustering
  • Dealing with clusters of arbitrary shape
  • High dimensionality
  • Insensitivity to the ordering of input data
  • Interpretability and usability
  • Minimal domain knowledge for determining input parameters
  • Scalability
26. What are the different types of data used for cluster analysis?

The different types of data used for cluster analysis are:

  • interval scaled
  • binary
  • nominal
  • ordinal and
  • ratio scaled data.
27. What are interval scaled variables?

Interval scaled variables are continuous measurements on a linear scale.

Examples include height and weight, weather temperature, or the coordinates of a cluster. The distance between objects described by such variables can be calculated using the Euclidean distance or the Minkowski distance.
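As a brief illustration, the Minkowski distance of order p between two points can be computed as below; p = 2 gives the Euclidean distance and p = 1 gives the Manhattan distance. The function name `minkowski` and the sample points are chosen for this sketch.

```python
def minkowski(x, y, p=2):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


print(minkowski((0, 0), (3, 4), p=2))  # Euclidean distance
print(minkowski((0, 0), (3, 4), p=1))  # Manhattan distance
```

In practice, interval scaled variables are usually standardized first, so that a variable measured in large units does not dominate the distance.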

28. Define binary variables. What are the two types of binary variables?

Binary variables have two states, 0 and 1: when the state is 0 the variable is absent, and when the state is 1 the variable is present.

There are two types of binary variables: symmetric and asymmetric.

  1. Symmetric binary variables are those whose two states are equally valuable and carry the same weight.
  2. Asymmetric binary variables are those whose two states are not equally important and do not carry the same weight.
29. Define nominal, ordinal and ratio scaled variables.

A nominal variable is a generalization of the binary variable: it can have more than two states. For example, a nominal variable color may have four states: red, green, yellow, or black.

The total number of states of a nominal variable is N, and the states are denoted by letters, symbols or integers.

An ordinal variable also has more than two states, but these states are ordered in a meaningful sequence.

A ratio scaled variable makes positive measurements on a non-linear scale, such as an exponential scale, approximately following the formula

$Ae^{Bt}$ or $Ae^{-Bt}$

where A and B are constants.

30. What do you mean by the partitioning method?

In partitioning method a partitioning algorithm arranges all the objects into various partitions, where the total number of partitions is less than the total number of objects. Here each partition represents a cluster.

The two types of partitioning method are

  1. k-means and
  2. k-medoids
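To make the idea of partitioning concrete, here is a minimal k-means sketch. It assumes points are tuples of numbers and, for simplicity, uses the first k points as the initial centroids, which is a naive choice made only to keep the example deterministic. The function name `k_means` and the sample points are hypothetical.

```python
def k_means(points, k, iterations=10):
    """Partition points into k clusters; returns (centroids, clusters)."""
    # Assumption for this sketch: seed the centroids with the first k points.
    centroids = [points[i] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[j])),
            )
            clusters[nearest].append(p)
        # Update: move each centroid to the mean of its members.
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, clusters


points = [(1, 1), (1, 2), (8, 8), (9, 8)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

k-medoids follows the same pattern but restricts each cluster representative to be an actual data object rather than a mean, which makes it less sensitive to outliers.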
31. What are CLARA and CLARANS?

CLARA

CLARA stands for Clustering LARge Applications. The efficiency of CLARA depends on the size of the representative data set.

CLARA does not work properly if the selected representative data sets do not contain the best k-medoids.

CLARANS

To overcome this drawback, a new algorithm, Clustering Large Applications based upon RANdomized Search (CLARANS), was introduced.

CLARANS works like CLARA; the only difference between them is the clustering process that is carried out after the representative data sets are selected.

32. What is Hierarchical method?

Hierarchical method groups all the objects into a tree of clusters that are arranged in a hierarchical order. This method works on bottom-up or top-down approaches.

33. Differentiate Agglomerative and Divisive Hierarchical Clustering?

Agglomerative Hierarchical clustering method works on the bottom-up approach.

In the agglomerative hierarchical method, each object starts as its own cluster. The single clusters are merged to make larger clusters, and the process of merging continues until all the singular clusters are merged into one big cluster that consists of all the objects.

Divisive Hierarchical clustering method works on the top-down approach. In this method all the objects are arranged within a big singular cluster and the large cluster is continuously divided into smaller clusters until each cluster has a single object.

Hierarchical Agglomerative vs Divisive clustering 


  • Divisive clustering is more complex than agglomerative clustering, because divisive clustering needs a flat clustering method as a "subroutine" to split each cluster until every data point has its own singleton cluster.
  • Divisive clustering is more efficient if we do not generate a complete hierarchy all the way down to individual data leaves. The time complexity of naive agglomerative clustering is $O(n^3)$, because we exhaustively scan the N x N matrix dist_mat for the lowest distance in each of N-1 iterations. Using a priority queue data structure, we can reduce this complexity to $O(n^2 \log n)$. With some more optimizations it can be brought down to $O(n^2)$. For divisive clustering, by contrast, given a fixed number of top levels and an efficient flat algorithm like K-Means, divisive algorithms are linear in the number of patterns and clusters.
  • The divisive algorithm is also more accurate. Agglomerative clustering makes decisions by considering the local patterns or neighbor points without initially taking into account the global distribution of data, and these early decisions cannot be undone. Divisive clustering, on the other hand, takes the global distribution of data into consideration when making top-level partitioning decisions.
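The naive agglomerative procedure can be sketched as below. This illustration assumes one-dimensional points and single-link distance (the smallest gap between any two members), and it recomputes distances on the fly instead of maintaining a `dist_mat` matrix; the function name `agglomerative` and the `num_clusters` stopping parameter are hypothetical.

```python
def agglomerative(points, num_clusters=1):
    """Naive bottom-up clustering of 1-D points with single-link distance."""
    # Every object starts in its own singleton cluster.
    clusters = [[p] for p in points]

    def single_link(a, b):
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > num_clusters:
        # Exhaustively scan all cluster pairs for the smallest distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the closest pair; this decision is never undone.
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
    return clusters


print(agglomerative([1, 2, 9, 10, 25], num_clusters=2))
```

The nested pair scan inside the merge loop is the source of the cubic cost of the naive approach.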
34. What is CURE?

CURE stands for Clustering Using REpresentatives. Clustering algorithms generally work well only on spherical clusters of similar size. CURE overcomes this limitation and is more robust with respect to outliers.

35. Define the Chameleon method.

Chameleon is another hierarchical clustering method that uses dynamic modeling. Chameleon was introduced to overcome the drawbacks of the CURE method. In this method, two clusters are merged if the interconnectivity between the two clusters is greater than the interconnectivity between the objects within a cluster.

36. What is OLTP?

An on-line operational database system that is used for efficient retrieval, efficient storage and management of large amounts of data is said to be an on-line transaction processing (OLTP) system.

37. What is OLAP?

Data warehouse systems serve users or knowledge workers in the role of data analysis and decision-making. Such systems can organize and present data in various formats. These systems are known as on-line analytical processing (OLAP) systems.

38. What does audio data mining mean?

Audio data mining uses audio signals to indicate patterns of data or the features of data mining results.

Patterns are transformed into sound and music.

Interesting or unusual patterns are then identified by listening to pitches, rhythms, tunes and melodies.

Steps involved in DNA analysis

  • Semantic integration of heterogeneous, distributed genome databases
  • Similarity search and comparison among DNA sequences
  • Association analysis: Identification of co-occurring gene sequences
  • Path analysis: Linking genes to different stages of disease development
  • Visualization tools and genetic data analysis
39. Explain the types of data mining.

The types of data mining are:

  • Audio data mining
  • Video data mining
  • Image data mining
  • Scientific and Statistical data mining
40. How is data mining used in the banking industry?

Applications of data mining in banking include:

  • Mining customer data of the bank
  • Mining for prediction and forecasting
  • Mining for fraud detection
  • Mining for cross-selling bank services
  • Mining for identifying customer preferences
41. How is data mining used in health care analysis?

Applications of data mining in health care include:

  • Segmenting patients into groups
  • Identifying patients with recurring health problems
  • Finding relations between diseases and symptoms
  • Curbing treatment costs
  • Predicting medical diagnoses
  • Medical research
  • Hospital administration
42. How is the DB Miner tool in data mining described?

The DB Miner tool in data mining is described in terms of:

  • System architecture
  • Input and Output
  • Data mining tasks supported by the system
  • Support of task and method selection
  • Support of the KDD process
  • Main applications
  • Current status
43. What is spatial data mining?

Spatial data mining is the extraction of undiscovered and implied spatial information.

Spatial data is data that is associated with a location. It is used in several fields such as geography, geology and medical imaging.

44. What is Text mining?

Text mining is the extraction of meaningful information from large amounts of free-format textual data.

It is useful in artificial intelligence and pattern matching.

It is also known as text data mining, knowledge discovery from text, or content analysis.

45. What are the factors involved while choosing data mining system?

The factors involved while choosing data mining system are:

  • Data types
  • System issues
  • Data sources
  • Data Mining functions and methodologies
  • Coupling data mining with database and/or data warehouse systems
  • Scalability
  • Visualization tools
  • Data mining query language and graphical user interface.
46. What is DMQL?

DMQL stands for Data Mining Query Language.

It specifies clauses and syntax for performing different types of data mining tasks, for example data classification, data clustering and mining association rules. It uses SQL-like syntax to mine databases.

It was proposed by Han, Fu, Wang, et al. for the DBMiner data mining system and is based on the Structured Query Language (SQL). Such query languages are designed to support ad hoc and interactive data mining, and they provide commands for specifying primitives.

We can use the Data Mining Query Language to work with databases and data warehouses, and to define data mining tasks. In particular, DMQL can be used to define data warehouses and data marts.

47. What are the areas in which data warehouses are used at present and in the future?

The potential subject areas in which data warehouses may be developed, at present and in the future, are:

1. Census data:

The Registrar General and Census Commissioner of India decennially compiles information on all individuals, villages, population groups, etc. This information is wide ranging. One example is the individual slip, a compilation of information on individual households, of which a database of a 5% sample is maintained for analysis. A data warehouse can be built from this database, upon which OLAP techniques can be applied. Data mining can also be performed for analysis and knowledge discovery.

2. Prices of essential commodities:

The Ministry of Food and Civil Supplies, Government of India, compiles daily data from about 300 observation centers across the country on the prices of essential commodities such as rice and edible oil.

A data warehouse can be built for this data, and OLAP techniques can be applied for its analysis.

48. What is the difference between generic single-task tools and generic multi-task tools?

The differences between generic single-task tools and generic multi-task tools are:

Generic single-task tools

Generic single-task tools generally use neural networks or decision trees.

They cover only the data mining part and require extensive pre-processing and post-processing steps.

Generic multi-task tools

Generic multi-task tools offer modules for the pre-processing and post-processing steps and also offer a broad selection of popular data mining algorithms, such as clustering.

49. What are Research prototypes?

Research prototypes are research products, some of which may find their way into the commercial market, for example 'DB Miner' from Simon Fraser University, British Columbia, and 'Mining Kernel System' from the University of Ulster, Northern Ireland.

50. What is VLDB?

VLDB stands for Very Large Data Base.

A database whose size is greater than 100 GB is said to be a very large database.

51. What is Virtual Warehouse?

A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on the operational database servers.
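The idea of a view over an operational database can be sketched with Python's built-in `sqlite3` module. The `sales` table, its columns and the `sales_by_region` view are hypothetical names invented for this example; a real virtual warehouse would define many such views over production systems.

```python
import sqlite3

# An in-memory database stands in for an operational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 5.0), ("south", 7.0)],
)

# A summary view: no data is copied, the query runs against the
# operational table every time the view is read.
conn.execute(
    "CREATE VIEW sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)

rows = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
print(rows)
```

Because the view is evaluated on the operational server at query time, analytical queries compete with transaction processing, which is why extra capacity is needed there.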

52. What are dependent and independent data marts?

Dependent data marts 

Dependent data marts are sourced directly from enterprise data warehouses.

Independent data marts

Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area.

53. What is Data Mart?

Data mart is a database that contains a subset of data present in a data warehouse.

Data marts are created to structure the data in a data warehouse according to issues such as hardware platforms and access control strategies.

We can divide a data warehouse into data marts after the data warehouse has been created.

Data marts are usually implemented on low-cost departmental servers that are UNIX or Windows NT based. The implementation cycle of a data mart is likely to be measured in weeks rather than months or years.

54. What is Enterprise Warehouse?

An enterprise warehouse collects all the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems or external information providers.

It contains detailed data as well as summarized data and can range in size from a few gigabytes to hundreds of gigabytes, terabytes or beyond.

An enterprise data warehouse may be implemented on traditional mainframes, UNIX super servers or parallel architecture platforms. It requires business modeling and may take years to design and build.

55. What is HOLAP?

The hybrid OLAP (HOLAP) approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store.
