(SEM VI) THEORY EXAMINATION 2021-22 DATA ANALYTICS

DATA ANALYTICS (KIT601)

B.Tech Semester VI – Theory Examination (2021–22)

Data Analytics is an interdisciplinary subject that focuses on the systematic analysis of data to extract meaningful patterns, trends, and insights for decision-making. In the modern digital world, massive volumes of data are generated every second through business transactions, social media, sensors, mobile devices, and online platforms. Raw data by itself has limited value, but when analyzed using appropriate statistical, computational, and machine-learning techniques, it becomes a powerful asset for organizations. Data analytics helps in improving efficiency, predicting future outcomes, optimizing processes, and supporting strategic decisions across domains such as business, healthcare, finance, manufacturing, and governance.

From the uploaded question paper, it is evident that the syllabus emphasizes data types, data streams, sampling, clustering, classification, decision trees, neural networks, data analytics life cycle, data stream algorithms, association rule mining, PCA, support vector machines, Hive architecture, and R programming. To score well, answers must be written in clear, explanatory paragraphs, showing understanding of concepts, algorithms, and applications rather than brief bullet points.

SECTION A – FUNDAMENTAL CONCEPTS OF DATA ANALYTICS

(Based on Section A, Page-1 of the paper)

The need for data analytics arises from the rapid growth of data and the requirement to convert this data into actionable knowledge. Organizations rely on data analytics to understand customer behavior, improve operational efficiency, detect fraud, predict trends, and gain competitive advantage. Without analytics, large datasets remain underutilized and decision-making becomes intuition-based rather than evidence-based.

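Data classification refers to categorizing data based on its structure and nature. Data may be structured (for example, rows in relational tables), semi-structured (for example, JSON or XML documents), or unstructured (for example, free text, images, and video), and this classification determines the choice of storage systems and analytical techniques.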

A neural network is a computational model inspired by the structure of the human brain. It consists of interconnected neurons organized into input, hidden, and output layers, and it is widely used for pattern recognition, classification, and prediction tasks.
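The layered structure described above can be sketched as a single forward pass. This is a minimal illustration, assuming a sigmoid activation and hand-picked weights; the names dense and forward and all numeric values are hypothetical and not taken from the question paper.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One fully connected layer: each neuron computes a weighted sum
    of the inputs plus a bias and passes it through the sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = dense(x, hidden_w, hidden_b)
    return dense(hidden, out_w, out_b)

# Hypothetical weights for 2 inputs, 2 hidden neurons, 1 output neuron.
hidden_w = [[0.5, -0.5], [1.0, 1.0]]
hidden_b = [0.0, -1.0]
out_w = [[1.0, -1.0]]
out_b = [0.0]

output = forward([1.0, 1.0], hidden_w, hidden_b, out_w, out_b)
print(output)
```

In a trained network the weights and biases are learned from data rather than chosen by hand; the sketch only shows how values flow through the layers.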

Multivariate analysis involves the examination of multiple variables simultaneously to understand relationships, dependencies, and patterns among them. It is commonly used in marketing analysis, finance, and scientific research.

The full form of RTAP is Real-Time Analytics Platform, which is used to analyze streaming data instantly to support time-critical decisions such as fraud detection, stock trading, and network monitoring.

The role of sampling in data streams is crucial because data streams are continuous and potentially infinite. Sampling techniques allow efficient approximation and analysis without storing the entire stream.
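One common way to sample a stream of unknown length is reservoir sampling, sketched below as an illustration. The function name reservoir_sample, the sample size, and the seed are assumptions made for this example only.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown
    length, using O(k) memory and a single pass."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# Usage: sample 5 items from a simulated stream of 10,000 readings.
sample = reservoir_sample(range(10_000), k=5, rng=random.Random(42))
print(sample)
```

The stream is never stored in full; only the k sampled items are held in memory at any time.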

The limited pass algorithm is used in data stream processing where data can be scanned only a limited number of times. Such algorithms are essential for real-time analytics with memory constraints.

The principle behind hierarchical clustering is to build a hierarchy of clusters either by successively merging smaller clusters into larger ones or by splitting larger clusters into smaller ones, based on similarity measures.
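The merging (agglomerative) direction can be sketched as follows. This is a simplified illustration that assumes one-dimensional points and single-linkage distance; the function names and the sample data are hypothetical.

```python
def single_link_distance(a, b):
    """Single linkage: distance between the closest pair of members."""
    return min(abs(x - y) for x in a for y in b)

def agglomerative(points, target_clusters):
    """Start with one cluster per point and repeatedly merge the two
    closest clusters until target_clusters remain."""
    clusters = [[p] for p in points]
    merges = []  # record of (left, right, distance) for each merge
    while len(clusters) > max(target_clusters, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

clusters, merges = agglomerative([1.0, 2.0, 9.0, 10.0, 25.0], target_clusters=2)
print(clusters)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The recorded merge order is the information a dendrogram displays; cutting the hierarchy at a different level gives a different number of clusters.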

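In descriptive statistics, R functions such as mean(), median(), sd(), var(), and summary() are commonly used to describe the central tendency and dispersion of data. The summary() function reports the minimum, quartiles, median, mean, and maximum of a numeric vector in a single call.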

Popular data visualization tools, such as Tableau, Power BI, and the ggplot2 package in R, present data graphically as charts, graphs, and dashboards so that patterns and insights are easy to understand.

SECTION B – DATA ANALYTICS MODELS AND ALGORITHMS

(Based on Section B, Page-1)

The process model and computation model for Big Data platforms describe how data is collected, stored, processed, and analyzed. The process model includes data acquisition, preprocessing, storage, analysis, and visualization, while the computation model focuses on distributed processing frameworks that enable scalable analytics.

Decision trees are supervised learning models used for classification and prediction. They are easy to interpret and visualize, and they work by recursively splitting data based on attribute values to reach a decision outcome.
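How a tree chooses the attribute to split on can be illustrated with entropy and information gain, the criterion used by ID3-style trees. The sketch below is one possible formulation; the tiny weather-style dataset and the function names are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of the target after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

def best_split(rows, attributes, target):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical training data.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]
chosen = best_split(rows, ["outlook", "windy"], "play")
print(chosen)
```

A full tree applies the same selection recursively to each resulting subset until the subsets are pure or no attributes remain.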

The architecture of a data stream model includes data sources, stream processing engine, memory management, and query processing modules. This architecture enables continuous analysis of real-time data.

The K-means algorithm is a popular clustering technique that partitions data into K clusters by minimizing intra-cluster variance. It works iteratively by assigning data points to the nearest centroid and updating centroids until convergence. Its simplicity and efficiency make it suitable for large datasets.
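The assign-and-update loop described above can be sketched in plain Python. This is a minimal illustration that assumes squared Euclidean distance and caller-supplied initial centroids; the sample points and starting centroids are hypothetical.

```python
def kmeans(points, centroids, max_iter=100):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid as the mean of its assigned points."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

The result depends on the initial centroids and on the chosen value of K, which is why the algorithm is often run several times with different starting points.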

The difference between NoSQL and RDBMS databases lies in schema flexibility, scalability, and consistency models. NoSQL databases support flexible or schema-less designs and scale horizontally across many machines, often relaxing consistency guarantees, while relational databases enforce a fixed, structured schema and emphasize strong consistency through ACID transactions.

SECTION C – DATA ANALYTICS LIFE CYCLE AND TOOLS

(Based on Section C, Page-1)

The data analytics life cycle consists of multiple phases including problem definition, data collection, data cleaning, exploratory analysis, model building, evaluation, and deployment. Each phase ensures that analytics results are accurate, relevant, and actionable.

Modern data analytics tools include platforms for data storage, processing, visualization, and modeling. These tools support large-scale analytics, machine learning, and real-time data processing.

ADVANCED ANALYTICS TECHNIQUES

(Based on Questions 4 & 5, Page-2)

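Support Vector Machines (SVM) and kernel methods are powerful supervised learning techniques used for classification and regression. An SVM finds the separating hyperplane with the maximum margin between classes. Kernel methods allow SVMs to handle non-linear data by implicitly mapping it into a higher-dimensional space, where a linear separator can be found, without computing the mapped coordinates explicitly.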
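The idea that a kernel computes a dot product in a higher-dimensional space can be checked numerically. The sketch below assumes two-dimensional inputs and a degree-2 polynomial kernel without a constant term; the function names, the gamma value, and the sample vectors are illustrative choices.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel computed in the original 2-D space."""
    return dot(x, y) ** 2

def poly2_feature_map(x):
    """Explicit 3-D feature map that the kernel above corresponds to."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel, another widely used choice."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

x, y = (1.0, 2.0), (3.0, -1.0)
implicit = poly2_kernel(x, y)                                 # no mapping needed
explicit = dot(poly2_feature_map(x), poly2_feature_map(y))    # map, then dot
print(implicit, explicit)
```

Both values agree up to floating-point rounding, which is the kernel trick: the SVM works as if the data were mapped to the larger space while only evaluating the kernel on the original vectors.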

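Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms correlated variables into a smaller set of uncorrelated principal components. The components are the eigenvectors of the data's covariance matrix, ordered so that the first few retain as much of the original variance as possible.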
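A rough sketch of finding the first principal component is given below. It assumes the usual formulation (center the data, build the covariance matrix, take its dominant eigenvector) and uses power iteration to avoid external libraries; the dataset and function names are hypothetical, and the method assumes the covariance matrix is non-zero and the starting vector is not orthogonal to the dominant eigenvector.

```python
import math

def covariance_matrix(data):
    """Center the data and return (sample covariance matrix, centered rows)."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered

def first_component(cov, iterations=200):
    """Power iteration: repeatedly multiply by cov and normalize to
    approach the eigenvector with the largest eigenvalue."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient: variance captured along direction v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return v, eigenvalue

# Two strongly correlated variables (hypothetical measurements).
data = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8)]
cov, centered = covariance_matrix(data)
direction, variance = first_component(cov)
projected = [sum(a * b for a, b in zip(row, direction)) for row in centered]
explained = variance / (cov[0][0] + cov[1][1])
print(direction, explained)
```

Here the two original variables are replaced by one projected value per row, and the explained ratio shows how much of the total variance that single component keeps.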

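Algorithms for counting distinct elements in a data stream, such as the Flajolet-Martin algorithm, are essential when memory is too limited to store every element seen so far. These algorithms hash each element and keep only a small summary, which yields an approximate count efficiently.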
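A simplified version of the Flajolet-Martin idea is sketched below. It assumes a single hash function (the first 32 bits of SHA-256, chosen here only for convenience) and omits the averaging over many hash functions that practical versions use to reduce error; the stream contents are made up.

```python
import hashlib

def trailing_zeros(n, width=32):
    """Number of trailing zero bits in n (width if n is zero)."""
    if n == 0:
        return width
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count

def fm_estimate(stream):
    """Estimate the number of distinct items as 2**R, where R is the
    largest number of trailing zeros seen in any item's hash."""
    max_zeros = 0
    for item in stream:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:4], "big")  # 32-bit hash value
        max_zeros = max(max_zeros, trailing_zeros(h))
    return 2 ** max_zeros

# 50,000 events generated by only 1,000 distinct users.
stream = [f"user{i % 1000}" for i in range(50_000)]
estimate = fm_estimate(stream)
print(estimate)
```

Only one integer is kept in memory regardless of stream length, and repeated items never change the estimate because equal items hash to the same value. With a single hash function the estimate is always a power of two, so it is coarse.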

The case study of stock market prediction demonstrates how historical price data, trends, and machine-learning models are used to forecast future market behavior, although predictions remain probabilistic due to market uncertainty.

DATA MINING, ASSOCIATION RULES & HIVE

(Based on Questions 6 & 7, Page-2)

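The difference between CLIQUE and PROCLUS clustering lies in how subspace clusters are discovered in high-dimensional data. CLIQUE is a grid-based and density-based method that works bottom-up: it finds dense units in low-dimensional subspaces and combines them into higher-dimensional ones. PROCLUS is a top-down, k-medoid-based projected clustering method that assigns each cluster its own subset of relevant dimensions. Both algorithms address the curse of dimensionality in clustering.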

The Apriori algorithm is used for mining frequent itemsets and generating association rules. By applying minimum support and confidence thresholds, meaningful relationships among items are discovered.
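The level-wise search can be sketched as follows. This is a compact illustration that assumes support is measured as a fraction of transactions; the basket data, the threshold, and the function names are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Keep only candidates that meet the support threshold.
        level = {c: s for c, s in ((c, support(c)) for c in current)
                 if s >= min_support}
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

def confidence(frequent, antecedent, consequent):
    """confidence(A -> B) = support(A and B together) / support(A)."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]
frequent = apriori(transactions, min_support=0.5)
print(confidence(frequent, {"bread"}, {"milk"}))
```

A rule such as bread -> milk is reported only if its confidence also meets the chosen minimum confidence threshold.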

The Hive architecture enables SQL-like querying (HiveQL) on large datasets stored in distributed file systems such as HDFS. It includes components such as the driver, query compiler, execution engine, and metastore, which stores table schemas and other metadata.

Writing an R function demonstrates practical data analytics skills by enabling statistical computation and data processing programmatically.

HOW TO WRITE DATA ANALYTICS ANSWERS IN THE EXAM

In Data Analytics, never write answers in short bullet points. Always start with a clear explanation of the concept, followed by algorithmic understanding, working principles, and applications. Use correct terminology such as clustering, classification, data streams, PCA, Apriori, and Hive. Examiners give maximum weightage to conceptual clarity, analytical reasoning, and real-world relevance.

Uploader

SuGanta International