(SEM VI) THEORY EXAMINATION 2024–25
BIT601 – DATA ANALYTICS
Section-Wise Solved Answers (2024–25)
SECTION A
Attempt all questions in brief (2 × 7 = 14 marks)
(a) What are the main sources of data in data analytics?
The main sources of data in data analytics include transactional data from business systems, sensor and machine-generated data from IoT devices, social media data, web logs, survey data, and publicly available datasets. These sources together provide both structured and unstructured data for analysis.
(b) What is the purpose of the ‘operationalization’ phase?
The operationalization phase focuses on deploying the developed analytics model into real-world use. It ensures that insights or models are integrated into business processes so that decisions can be taken automatically or semi-automatically based on analytics results.
(c) What is the purpose of Support Vector Machines (SVM) in classification?
SVM is used for classification by finding an optimal separating boundary, called a hyperplane, between different classes. It aims to maximize the margin between data points of different classes, leading to better generalization and accuracy.
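A minimal sketch of the idea, assuming scikit-learn (the answer above does not name a library); the toy 2-D data are made up to be linearly separable:

```python
# Hedged sketch: linear SVM classification with scikit-learn (our choice
# of library; the data below are invented, trivially separable points).
from sklearn.svm import SVC

# Two well-separated classes in 2-D.
X = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]]
y = [0, 0, 0, 1, 1, 1]

# A linear kernel finds the maximum-margin separating hyperplane.
clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))  # -> [0 1]
```

The maximized margin is what gives SVMs their good generalization: the boundary sits as far as possible from the nearest training points (the support vectors).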
(d) What are fuzzy decision trees?
Fuzzy decision trees are decision trees that use fuzzy logic instead of crisp values. They allow partial membership of data points in multiple classes, which helps in handling uncertainty and imprecise data.
(e) Define stream computing and mention one key feature.
Stream computing is the processing of continuous data streams in real time. A key feature is low latency, meaning data is processed immediately as it arrives rather than being stored first.
(f) What are the advantages of K-means clustering?
K-means clustering is simple to understand, computationally efficient, and works well for large datasets. It is widely used due to its speed and ease of implementation.
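The simplicity claimed above is visible in a pure-Python sketch of the algorithm's two alternating steps (assign, then update), here on made-up 1-D data:

```python
# Hedged sketch: a toy 1-D K-means (k = 2) showing the assign/update loop;
# the data and starting centroids are invented for illustration.
def kmeans_1d(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], centroids=[1, 12])
print(centroids)  # -> [2.0, 11.0]
```

Each iteration is linear in the number of points, which is why K-means stays efficient on large datasets.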
(g) What is the purpose of MapReduce in big data processing?
MapReduce is used to process large datasets by dividing tasks into smaller parts (map phase) and then combining results (reduce phase). It enables parallel processing across distributed systems.
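The map/shuffle/reduce structure can be sketched in a single process with the classic word-count example (real frameworks such as Hadoop run the map and reduce tasks in parallel across many machines):

```python
# Hedged sketch: the MapReduce pattern on toy text, single-process;
# function names and input lines are ours, for illustration only.
from collections import defaultdict

def map_phase(line):
    # Map: emit (word, 1) key-value pairs.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by their key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: combine each key's values into a final result.
    return {key: sum(values) for key, values in grouped.items()}

lines = ["big data big insights", "big analytics"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])  # -> 3
```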
SECTION B
Attempt any three (7 × 3 = 21 marks)
(a) Stages in a data analytics project
A data analytics project starts with business understanding, where goals are clearly defined. This is followed by data collection and data preparation, where raw data is cleaned and organized. Next comes data exploration to understand patterns. Model planning and model building are then carried out to develop analytical models. Finally, results are communicated and operationalized for real-world use.
(b) Support vector and kernel methods comparison
Linear SVM works well for linearly separable data. Polynomial kernels handle non-linear patterns, while radial basis function (RBF) kernels manage complex data distributions. Kernel methods allow SVMs to operate in higher-dimensional spaces without explicitly computing them.
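The kernels named above can be written out directly; each returns an inner product in some higher-dimensional feature space without ever constructing that space (a NumPy sketch with made-up parameter values):

```python
# Hedged sketch: the linear, polynomial, and RBF kernel functions;
# the degree, offset, and gamma values below are arbitrary examples.
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def polynomial_kernel(x, z, degree=2, c=1.0):
    return (np.dot(x, z) + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([2.0, 1.0])
print(linear_kernel(x, z))      # -> 4.0
print(polynomial_kernel(x, z))  # -> 25.0
print(rbf_kernel(x, x))         # -> 1.0 (identical points)
```

Replacing every dot product in the SVM optimization with one of these functions is the "kernel trick" that lets a linear method fit non-linear boundaries.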
(c) Mining data streams in stock market prediction
Data streams in stock markets include live price feeds and trading volumes. Stream mining helps detect trends and anomalies in real time. Challenges include high speed, noise, and concept drift, while benefits include timely decision-making and risk reduction.
(d) Apriori algorithm for frequent itemsets
The Apriori algorithm finds frequent itemsets by using the principle that subsets of frequent itemsets must also be frequent. It generates candidate itemsets and prunes those that do not meet minimum support, repeating this process iteratively.
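The generate-and-prune loop can be sketched on toy transactions (the item names and support threshold are made up; a production implementation would use a library or more careful candidate generation):

```python
# Hedged sketch: the Apriori generate-count-prune cycle on invented
# market-basket data, up to itemsets of size 2.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def frequent_itemsets(transactions, min_support=2, max_size=2):
    frequent = {}
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([i]) for i in items]
    for size in range(1, max_size + 1):
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        # Prune: drop candidates below minimum support.
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Generate next-level candidates from survivors only
        # (every subset of a frequent itemset must itself be frequent).
        candidates = [a | b for a, b in combinations(survivors, 2)
                      if len(a | b) == size + 1]
    return frequent

result = frequent_itemsets(transactions)
print(result[frozenset({"bread", "milk"})])  # -> 2
```

Note that {milk, butter} is pruned: it appears in only one transaction, so no larger itemset containing it is ever counted.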
(e) Pig vs Hive
Pig is a scripting platform used for data flow processing, while Hive provides a SQL-like interface for querying data. Pig is more flexible for procedural tasks, whereas Hive is user-friendly for structured queries and reporting.
SECTION C
Q3. Attempt any one (7 marks)
(a) Difference between data analysis and data reporting
Data analysis focuses on discovering insights and patterns using statistical or machine learning techniques. Data reporting summarizes existing data using charts and dashboards. For example, predicting sales trends is analysis, while monthly sales charts are reporting.
(b) Model planning vs model building
Model planning involves selecting techniques and defining evaluation criteria. Model building is the actual implementation of models using algorithms and training data. Planning decides what to build, while building focuses on how to build it.
Q4. Attempt any one (7 marks)
(a) Neural networks and learning
Neural networks are computational models inspired by the human brain. Learning occurs by adjusting weights based on error, while generalization allows the model to perform well on unseen data. This makes neural networks powerful for prediction tasks.
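The "adjusting weights based on error" step can be shown with the smallest possible case, a single linear neuron trained by gradient descent on squared error (toy data and learning rate are invented):

```python
# Hedged sketch: weight learning in one linear neuron; the samples
# encode the made-up target relationship y = 2x.
def train_neuron(samples, w=0.0, lr=0.1, epochs=50):
    for _ in range(epochs):
        for x, target in samples:
            y = w * x                 # forward pass
            error = y - target        # prediction error
            w -= lr * error * x       # gradient step on squared error
    return w

w = train_neuron([(1, 2), (2, 4), (3, 6)])
print(round(w, 3))  # -> 2.0
```

Real networks repeat this idea across many weights and layers via backpropagation; generalization is then checked on data held out of training.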
(b) PCA computation (conceptual explanation)
Principal Component Analysis reduces data dimensionality by identifying directions of maximum variance. It transforms original correlated variables into uncorrelated principal components, simplifying analysis while retaining essential information.
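The computation can be sketched with NumPy on made-up 2-D points that vary mostly along the line y = x, so one component carries almost all the variance:

```python
# Hedged sketch: PCA via eigendecomposition of the covariance matrix;
# the data points are invented for illustration.
import numpy as np

X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]])

Xc = X - X.mean(axis=0)                  # 1. centre the data
cov = np.cov(Xc, rowvar=False)           # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # 3. eigendecomposition (ascending)
order = np.argsort(eigvals)[::-1]        # 4. sort by variance, descending
components = eigvecs[:, order]           # columns = principal components
scores = Xc @ components                 # 5. project onto the components

explained = eigvals[order] / eigvals.sum()
print(explained[0] > 0.99)  # -> True: one component suffices here
```

Keeping only the first few columns of `scores` is the dimensionality reduction: correlated original variables become a smaller set of uncorrelated components.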
Q5. Attempt any one (7 marks)
(a) Real-Time Analytics Platform (RTAP)
RTAP processes streaming data instantly to generate immediate insights. Applications include fraud detection, smart cities, healthcare monitoring, and online recommendation systems.
(b) Sampling in data streams
Sampling selects representative data from continuous streams to reduce processing load. It helps manage memory, improves efficiency, and still provides accurate analytical insights.
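A standard way to do this is reservoir sampling, which keeps a fixed-size uniform sample from a stream of unknown length in constant memory (a sketch; the stream here is just a number range):

```python
# Hedged sketch: reservoir sampling over a simulated stream.
import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            # Replace a stored item with probability k / (i + 1),
            # which keeps every item's inclusion probability equal.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

random.seed(0)  # seeded only to make the demo repeatable
sample = reservoir_sample(range(100_000), k=10)
print(len(sample))  # -> 10
```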
Q6. Attempt any one (7 marks)
(a) Importance of parallelism in clustering
Parallelism speeds up clustering of large datasets by dividing data across multiple processors. Techniques include MapReduce-based clustering and distributed K-means algorithms.
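One K-means iteration decomposes naturally into the map and reduce phases that a distributed implementation parallelizes across machines; this single-process sketch (toy 1-D data) shows just that decomposition:

```python
# Hedged sketch: one K-means iteration written as map and reduce steps,
# the units a MapReduce cluster would distribute; data are invented.
def map_step(points, centroids):
    # Map: each point emits (index of nearest centroid, point).
    return [(min(range(len(centroids)),
                 key=lambda i: abs(p - centroids[i])), p) for p in points]

def reduce_step(pairs, k):
    # Reduce: average the points assigned to each centroid index.
    sums, counts = [0.0] * k, [0] * k
    for idx, p in pairs:
        sums[idx] += p
        counts[idx] += 1
    return [sums[i] / counts[i] for i in range(k)]

points = [1, 2, 3, 10, 11, 12]
new_centroids = reduce_step(map_step(points, centroids=[0, 10]), k=2)
print(new_centroids)  # -> [2.0, 11.0]
```

Because the map step touches each point independently, the data can be split across processors with only the small (index, sum, count) results exchanged.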
(b) ProCLUS vs CLIQUE
ProCLUS is a top-down, medoid-based projected clustering algorithm: it assigns each point to a single cluster defined over a relevant subset of dimensions. CLIQUE is a bottom-up, grid-and-density-based subspace clustering algorithm: it finds dense units in subspaces and can report overlapping clusters. ProCLUS generally scales better to large datasets because it avoids enumerating all dense subspaces, whereas CLIQUE's cost grows quickly with subspace dimensionality.
Q7. Attempt any one (7 marks)
(a) Sharding in NoSQL databases
Sharding divides large datasets into smaller parts across multiple servers. It improves scalability, load balancing, and performance when handling massive data volumes.
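The core routing idea can be sketched with hash-based sharding (server names and the key format are made up; real NoSQL systems typically use consistent hashing or range partitioning so that adding servers moves less data):

```python
# Hedged sketch: hash-based shard routing; shard names are invented.
import hashlib

SHARDS = ["server-0", "server-1", "server-2"]

def shard_for(key, shards=SHARDS):
    # A stable hash (md5 here) keeps the mapping identical across runs,
    # unlike Python's built-in hash() with its randomized seed.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return shards[digest % len(shards)]

# Every copy of the application routes the same key to the same shard.
print(shard_for("user:42") == shard_for("user:42"))  # -> True
```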
(b) Hadoop Distributed File System (HDFS)
HDFS stores data across multiple nodes and ensures fault tolerance by replicating data blocks. If one node fails, data is automatically retrieved from another replica, ensuring reliability.