(SEM VIII) THEORY EXAMINATION 2024-25 BIG DATA

B.Tech General

SECTION A – Short Answers (2 Marks Each)

a) What is Big Data?

Big Data refers to extremely large and complex datasets that cannot be processed efficiently using traditional data processing tools. These datasets are generated continuously from sources such as social media, sensors, mobile devices, transactions, and online platforms. Big Data requires advanced storage, processing, and analytical techniques to extract useful information and insights.

b) Explain the characteristics of Big Data.

The characteristics of Big Data describe the nature of the data and the challenges involved in handling it. Big Data is massive in size, generated at high speed, comes in different formats, and varies in quality and usefulness. These characteristics make traditional data management systems inadequate for Big Data processing.

c) Explain the four V’s of Big Data.

The four V’s of Big Data are Volume, Velocity, Variety, and Veracity. Volume refers to the huge amount of data generated daily. Velocity represents the speed at which data is generated and processed. Variety indicates different forms of data such as text, images, and videos. Veracity deals with the accuracy and reliability of the data.

d) Discuss applications of Big Data.

Big Data is widely used in healthcare for disease prediction, in finance for fraud detection, in e-commerce for recommendation systems, and in social media for sentiment analysis. It also plays an important role in smart cities, weather forecasting, and business decision-making.

e) What is Big Data Analytics?

Big Data Analytics is the process of examining large and complex datasets to uncover hidden patterns, trends, and relationships. It uses advanced analytical techniques such as machine learning, data mining, and statistical analysis to support better decision-making.

f) Discuss challenges of Big Data.

Big Data faces challenges such as data storage, data security, data integration, scalability, and processing speed. Managing data quality and ensuring privacy are also major concerns when dealing with large-scale data systems.

g) Differentiate between structured and unstructured data.

Structured data is organized in a predefined format such as tables and databases, making it easy to store and analyze. Unstructured data does not follow a fixed structure and includes text, images, audio, and videos, which require advanced tools for processing.

h) Differentiate between HDFS and HBase.

HDFS is a distributed file system designed to store very large files across many machines with high fault tolerance, while HBase is a column-oriented NoSQL database built on top of HDFS that provides real-time read and write access to individual rows. HDFS is optimized for high-throughput, sequential batch processing, whereas HBase supports low-latency random access by row key.
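
The difference in access pattern can be illustrated with a toy sketch. This is plain Python, not real HDFS or HBase client code; the names scan_file and KeyedTable are hypothetical and exist only for this illustration.

```python
# Toy contrast of the two access patterns; not real HDFS or HBase client code.

def scan_file(lines, user_id):
    """File-style access: read the records sequentially and filter."""
    for line in lines:
        key, value = line.split(",", 1)
        if key == user_id:
            return value
    return None


class KeyedTable:
    """Table-style access: each row is addressed directly by its row key."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, value):
        self._rows[row_key] = value  # write or overwrite a single row

    def get(self, row_key):
        return self._rows.get(row_key)  # direct lookup, no full scan


log_file = ["u1,login", "u2,purchase", "u3,logout"]

table = KeyedTable()
for line in log_file:
    key, value = line.split(",", 1)
    table.put(key, value)

table.put("u2", "refund")  # update one row without rewriting the file

print(scan_file(log_file, "u2"))  # value found by scanning the file
print(table.get("u2"))            # value found by row key
```

The file is read front to back and never modified in place, which suits batch jobs; the table changes one row at a time, which suits real-time lookups and updates.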

i) What is ZooKeeper? List its benefits.

ZooKeeper is a centralized coordination service used in distributed systems. It provides services such as configuration management, synchronization, and leader election. ZooKeeper improves reliability, simplifies coordination, and ensures consistent system operation.
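
Leader election can be sketched as follows. This is a simplified in-memory simulation of the common approach in which every participant registers with an increasing sequence number and the lowest number is the leader; it does not use the ZooKeeper client API, and the class ToyCoordinator is a hypothetical name.

```python
class ToyCoordinator:
    """In-memory stand-in for a coordination service (illustration only)."""

    def __init__(self):
        self._next_seq = 0
        self._members = {}  # sequence number -> client name

    def join(self, client):
        """Register a client and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._members[seq] = client
        return seq

    def leave(self, seq):
        """Remove a client, for example when it crashes or disconnects."""
        self._members.pop(seq, None)

    def leader(self):
        """The member with the lowest sequence number is the leader."""
        if not self._members:
            return None
        return self._members[min(self._members)]


coord = ToyCoordinator()
seq_a = coord.join("server-a")
seq_b = coord.join("server-b")
seq_c = coord.join("server-c")

print(coord.leader())  # server-a joined first, so it leads
coord.leave(seq_a)     # the leader fails
print(coord.leader())  # server-b takes over
```

Because every participant applies the same rule to the same shared state, all of them agree on the leader without talking to each other directly.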

j) Differentiate between Apache Pig and MapReduce.

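Apache Pig is a high-level data processing platform whose scripts are written in Pig Latin and compiled into MapReduce jobs, so common operations such as joins, filters, and grouping take only a few lines. MapReduce is a low-level programming model in which the developer writes the map and reduce functions directly, usually in Java. Pig simplifies development, while MapReduce offers finer control over processing.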

SECTION B – Descriptive Answers (10 Marks Each)

a) Explain how Big Data processing is different from distributed processing.

Distributed processing is the general approach of dividing a task across multiple machines. Big Data processing applies that approach to extremely large and diverse datasets using scalable frameworks such as Hadoop and Spark, and additionally has to manage data variety, velocity, and fault tolerance. It emphasizes data locality (moving the computation to the nodes that hold the data instead of moving the data), parallel computation, and horizontal scalability, concerns that traditional distributed systems do not necessarily address.

b) Discuss Hadoop YARN in detail with failures in classic MapReduce.

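Hadoop YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop that separates cluster resource management from job scheduling and monitoring. In classic MapReduce, a single JobTracker handled both resource management and job execution, while TaskTrackers ran tasks in fixed map and reduce slots. This tight coupling made the JobTracker a scalability bottleneck and a single point of failure, and the fixed slots led to inefficient resource utilization. YARN overcomes these limitations by splitting the work between a global ResourceManager, per-node NodeManagers, and a per-application ApplicationMaster, which also allows multiple processing engines to share the same cluster.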
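
The resource utilization problem can be shown with a small calculation. The sketch assumes that classic MapReduce gives each node a fixed number of map slots and reduce slots, while YARN hands out generic containers; the function names and the numbers are hypothetical.

```python
def slot_utilization(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """Classic model: a map task can use only a map slot, and likewise for reduce."""
    used = min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)
    return used / (map_slots + reduce_slots)


def container_utilization(containers, map_tasks, reduce_tasks):
    """YARN-style model: any waiting task can use any free container."""
    used = min(containers, map_tasks + reduce_tasks)
    return used / containers


# A job that is still in its map phase: 10 map tasks waiting, no reduce tasks yet.
print(slot_utilization(map_slots=6, reduce_slots=4, map_tasks=10, reduce_tasks=0))
print(container_utilization(containers=10, map_tasks=10, reduce_tasks=0))
```

With fixed slots, the four reduce slots stay idle while map tasks queue up; with generic containers, the same capacity is fully used.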

c) Explain the MapReduce framework in detail.

MapReduce is a programming model used for processing large datasets in parallel. It consists of a Map phase that processes input data and generates key-value pairs, and a Reduce phase that aggregates results. The framework ensures fault tolerance, scalability, and efficient distributed processing.
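
The division of work between the framework and the programmer can be sketched in a few lines. This is a single-process illustration, not Hadoop code: run_mapreduce is a hypothetical stand-in for the framework, and the temperature records are made-up sample data.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Minimal driver: the user supplies mapper and reducer, the driver does the rest."""
    intermediate = defaultdict(list)
    for record in records:
        # Map: each record may produce any number of (key, value) pairs.
        for key, value in mapper(record):
            intermediate[key].append(value)  # group values by key
    # Reduce: one call per key, processed in sorted key order.
    return {key: reducer(key, intermediate[key]) for key in sorted(intermediate)}


def year_temp_mapper(record):
    """Turn a 'year,temperature' line into a (year, temperature) pair."""
    year, temp = record.split(",")
    yield year, int(temp)


def max_reducer(key, values):
    """Keep the highest temperature seen for the key."""
    return max(values)


readings = ["2023,31", "2024,35", "2023,38", "2024,29"]
result = run_mapreduce(readings, year_temp_mapper, max_reducer)
print(result)
```

In a real cluster the map calls run in parallel on different nodes and a failed task is simply re-executed, which is possible because each map or reduce call depends only on its own input.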

d) What are NameNode and DataNode in Hadoop architecture?

The NameNode is the master node responsible for managing metadata and file system structure in HDFS. DataNodes are worker nodes that store actual data blocks. Together, they ensure reliable data storage and retrieval in a distributed environment.
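
The split between metadata and data can be sketched as follows. ToyNameNode is a hypothetical class, and the round-robin placement is an assumption made for simplicity; real HDFS chooses replica locations with a rack-aware policy.

```python
class ToyNameNode:
    """Keeps only metadata: which DataNodes hold each block of each file."""

    def __init__(self, datanodes, replication=3):
        # Assumes replication <= number of DataNodes so that replicas are distinct.
        self.datanodes = list(datanodes)
        self.replication = replication
        self.block_map = {}  # (filename, block index) -> list of DataNode names

    def add_file(self, filename, size, block_size):
        """Split a file into blocks and record where each replica is stored."""
        num_blocks = -(-size // block_size)  # ceiling division
        for i in range(num_blocks):
            replicas = [
                self.datanodes[(i + r) % len(self.datanodes)]
                for r in range(self.replication)
            ]
            self.block_map[(filename, i)] = replicas
        return num_blocks

    def locate(self, filename, index):
        """Answer a client's question: which DataNodes hold this block?"""
        return self.block_map[(filename, index)]


namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
blocks = namenode.add_file("logs.txt", size=300, block_size=128)
print(blocks)                           # number of blocks for a 300 MB file
print(namenode.locate("logs.txt", 0))   # DataNodes holding the first block
```

The NameNode never stores file contents; a client asks it for block locations and then reads the blocks directly from the DataNodes. Keeping several replicas of each block is what lets the file survive the loss of a DataNode.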

e) Differentiate between NoSQL and SQL databases.

SQL databases use structured schemas and relational models, making them suitable for structured data. NoSQL databases support flexible schemas and are designed for scalability and high availability, making them ideal for Big Data applications.
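
The schema difference can be demonstrated with a short sketch. It uses Python's built-in sqlite3 module for the SQL side and a plain dictionary as a stand-in for a document store; the table, field names, and sample values are hypothetical.

```python
import sqlite3

# SQL side: the schema is declared up front and every row must fit it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)")
conn.execute("INSERT INTO users (id, name, city) VALUES (?, ?, ?)", (1, "Asha", "Delhi"))
sql_rows = conn.execute("SELECT name, city FROM users WHERE id = ?", (1,)).fetchall()

# Inserting a column that the schema does not define is rejected.
try:
    conn.execute("INSERT INTO users (id, name, email) VALUES (2, 'Ravi', 'r@example.com')")
    schema_error = None
except sqlite3.OperationalError as exc:
    schema_error = str(exc)

# NoSQL side (document style): each record carries its own fields.
documents = {}
documents["u1"] = {"name": "Asha", "city": "Delhi"}
documents["u2"] = {"name": "Ravi", "email": "r@example.com", "tags": ["new"]}

print(sql_rows)
print(schema_error)
print(documents["u2"])
```

The relational table rejects the unexpected email column until the schema is altered, whereas the document store accepts records with different fields side by side.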

SECTION C – Long Answer (10 Marks Each)

a) What is MapReduce? Explain the working of various phases of MapReduce with example and diagram.

MapReduce is a distributed computing framework used for processing large datasets. The Map phase reads input data and converts it into key-value pairs. The Shuffle and Sort phase groups all values that share the same key. The Reduce phase processes each group of values to produce the final output. For example, in a word count program, the Map phase emits a (word, 1) pair for every word it reads, the Shuffle and Sort phase collects the pairs for each word, and the Reduce phase sums them to give the total occurrences of that word.
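
The word count example can be traced phase by phase. The sketch below runs in a single Python process on two made-up input lines; it shows the intermediate data of each phase and is not Hadoop code.

```python
from itertools import groupby
from operator import itemgetter

lines = ["big data is big", "data is everywhere"]

# Map phase: emit a (word, 1) pair for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle and Sort phase: sort the pairs by key, then group values per key.
mapped.sort(key=itemgetter(0))
grouped = {
    word: [count for _, count in pairs]
    for word, pairs in groupby(mapped, key=itemgetter(0))
}

# Reduce phase: sum the grouped values to get the total for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}

print(mapped)
print(grouped)
print(word_counts)
```

The map output contains one pair per word occurrence, the grouped output contains one list per distinct word, and the reduce output contains one total per distinct word.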

b) Explain the working of Hive with proper steps and diagram.

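Apache Hive is a data warehousing tool built on Hadoop that allows querying large datasets using HiveQL, a SQL-like language. The data is stored in HDFS, and the table definitions are kept in the metastore. A user submits a HiveQL query, Hive compiles it into an execution plan of MapReduce jobs (later versions can also use Tez or Spark), the execution engine runs these jobs on the cluster, and the results are returned to the user. This enables data analysis without writing low-level MapReduce code.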
