(SEM VIII) THEORY EXAMINATION 2023-24 BIG DATA

B.Tech

SECTION A

(Attempt all | 2 × 10 = 20 Marks)

a. List any five Big Data platforms
Apache Hadoop, Apache Spark, Apache Flink, Apache Storm, Google BigQuery.

b. Importance of Hadoop technology in Big Data Analytics
Hadoop enables distributed storage and parallel processing of large datasets at low cost, providing scalability, fault tolerance, and high availability.

c. Three benefits of MapReduce
MapReduce offers scalability, fault tolerance, and parallel processing of large data across clusters.

d. Define heartbeat in HDFS
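A heartbeat is a periodic signal (every 3 seconds by default) that each DataNode sends to the NameNode to show that it is alive and functioning properly. A DataNode that stops sending heartbeats is eventually marked dead.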

e. List any five Big Data platforms
Apache Hadoop, Apache Spark, Cassandra, MongoDB, Amazon EMR.

f. Define data replication in HDFS
Data replication is the process of storing multiple copies of each data block (three by default) on different DataNodes to ensure fault tolerance and data availability.

g. Name any two data ingestion tools in Hadoop
Apache Flume and Apache Sqoop.

h. Compare NoSQL and Relational Databases
Relational databases use structured schema and SQL, while NoSQL databases support flexible schema and handle large-scale unstructured data.

i. Advantages of Scala over Java
Scala supports functional programming, has a more concise syntax, favors immutability, and integrates natively with Apache Spark, which is itself written in Scala.

j. Differentiate between Pig and Hive
Pig uses procedural language (Pig Latin) for data flow, while Hive uses declarative SQL-like language (HiveQL) for querying data.

SECTION B

(Attempt any THREE | 3 × 10 = 30 Marks)

2(a) Structured, Semi-Structured & Unstructured Data

Structured data is organized in rows and columns such as databases and spreadsheets.
Semi-structured data has tags or markers like XML and JSON files.
Unstructured data has no predefined format, such as videos, images, emails, and social media posts.
Big Data technologies handle all three types efficiently.
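
The three categories can be illustrated with a small Python sketch. The sample records below are made up for illustration; the point is only how much structure each format gives the program for free.

```python
import csv
import io
import json

# Structured: fixed columns, and every row has the same fields.
csv_text = "id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing keys; fields may vary or nest.
json_text = '{"id": 3, "name": "Meera", "tags": ["student", "hostel"]}'
record = json.loads(json_text)

# Unstructured: free text with no schema; any structure must be derived.
email_body = "Hi team, the cluster was slow again on Monday. Regards, Asha"
word_count = len(email_body.split())

print(rows[0]["name"], record["tags"], word_count)
```

The CSV rows can be queried by column name immediately, the JSON record needs its keys inspected first, and the email text yields nothing until it is analyzed.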

2(b) Anatomy of a MapReduce Job Run

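A MapReduce job begins when the client submits it and the input is divided into input splits, one per map task. The Map phase turns each record into intermediate key-value pairs. The Shuffle and Sort phase transfers these pairs to the reducers and groups them by key. The Reduce phase aggregates the values for each key and writes the output to HDFS. In Hadoop 1 the JobTracker coordinates the job and TaskTrackers execute the tasks; in Hadoop 2 and later (YARN) the ResourceManager, a per-job ApplicationMaster, and NodeManagers take over these roles.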
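
The phases described above can be sketched as a single-process word count in plain Python. This is only a simulation under simplifying assumptions (no cluster, no scheduler, everything in memory); the function names are chosen for illustration and are not Hadoop APIs.

```python
from collections import defaultdict

def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    for line in split:
        for word in line.lower().split():
            yield word, 1

def shuffle_and_sort(pairs):
    """Group the values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Aggregate all values that share one key."""
    return key, sum(values)

def run_job(splits):
    """Run map over every split, then shuffle/sort, then reduce."""
    mapped = [pair for split in splits for pair in map_phase(split)]
    grouped = shuffle_and_sort(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped)

# Two input splits, as if the input file had been divided in two.
splits = [["big data big"], ["data hadoop"]]
result = run_job(splits)
print(result)
```

In a real job each split would be handled by a separate map task, and the shuffle would move data between machines instead of between Python lists.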

2(c) Design and Concept of HDFS

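HDFS follows a master-slave architecture in which the NameNode manages the file system metadata and the DataNodes store the data blocks. Files are split into large blocks (128 MB by default in Hadoop 2 and later), and each block is replicated on several DataNodes. HDFS provides high fault tolerance and scalability, and is optimized for batch processing with a write-once, read-many access pattern.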
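
A short calculation shows what block storage with replication costs. The sketch below assumes the usual defaults of a 128 MB block size and a replication factor of 3; the function name hdfs_footprint is made up for this example.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The last block only occupies its actual size, so raw storage is
    # simply the file size multiplied by the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb

one_gb = hdfs_footprint(1024)
print(one_gb)
```

For a 1 GB file this gives 8 blocks, 24 block replicas spread over the DataNodes, and 3 GB of raw disk space.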

2(d) CRUD operations in MongoDB

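CRUD stands for Create, Read, Update, and Delete. Create inserts documents using insertOne() or insertMany() (the older insert() is deprecated).
Read retrieves documents using find() or findOne(). Update modifies documents using updateOne() or updateMany() (the older update() is deprecated).
Delete removes documents using deleteOne() or deleteMany(). MongoDB stores data as flexible, JSON-like documents (BSON).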
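
The four operations can be sketched in Python using PyMongo's collection methods insert_one, find_one, update_one and delete_one. The function below takes an already opened collection as its argument, so it makes no assumption about where the server runs; the student record and its field names are invented for illustration.

```python
def crud_demo(collection):
    """Run one create, read, update and delete on a PyMongo-style collection."""
    # Create: insert a single document.
    collection.insert_one({"roll_no": 101, "name": "Asha", "branch": "CSE"})

    # Read: fetch the document back by a field value.
    created = collection.find_one({"roll_no": 101})

    # Update: change one field with the $set operator.
    collection.update_one({"roll_no": 101}, {"$set": {"branch": "IT"}})
    updated = collection.find_one({"roll_no": 101})

    # Delete: remove the document, then confirm that it is gone.
    collection.delete_one({"roll_no": 101})
    remaining = collection.find_one({"roll_no": 101})

    return created["branch"], updated["branch"], remaining
```

With a MongoDB server running locally, this could be called as crud_demo(MongoClient("mongodb://localhost:27017")["college"]["students"]) after importing MongoClient from pymongo, where the database and collection names are hypothetical.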

2(e) Role of ZooKeeper in HBase

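ZooKeeper provides coordination, synchronization, and configuration services for HBase. It elects the active HMaster, tracks which RegionServers are alive, and stores the location of the hbase:meta table so that clients can find regions. This keeps the distributed components reliable, fault tolerant, and consistent.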

SECTION C

3(a) 5 Vs of Big Data and their implications

The 5 Vs are:

Volume: Huge amount of data
Velocity: Speed of data generation
Variety: Different data formats
Veracity: Data quality and accuracy
Value: Useful insights from data

These characteristics require specialized tools for storage, processing, and analytics.

3(b) Components of Big Data Architecture

Components include data sources, data ingestion layer, storage layer (HDFS/NoSQL), processing layer (MapReduce/Spark), analytics layer, and visualization tools. Together they enable end-to-end Big Data processing.

4(a) HDFS Architecture & Fault Tolerance

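HDFS uses a NameNode, DataNodes, and a Secondary NameNode. The Secondary NameNode is not a standby; it periodically merges the edit log into the file system image to create checkpoints. Fault tolerance is achieved through data replication, heartbeat monitoring, and automatic re-replication of under-replicated blocks when a DataNode fails.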
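
The interplay of heartbeats and re-replication can be sketched as a small simulation. This is not how the NameNode is implemented; the node names, timestamps and the 30-second timeout are illustrative assumptions, and real HDFS also weighs rack placement and load when choosing targets.

```python
def find_dead_nodes(last_heartbeat, now, timeout):
    """Return the nodes whose last heartbeat is older than the timeout."""
    return {node for node, ts in last_heartbeat.items() if now - ts > timeout}

def plan_re_replication(block_locations, dead_nodes, live_nodes, target=3):
    """For each under-replicated block, pick live nodes to receive new copies."""
    plan = {}
    for block, holders in block_locations.items():
        alive = [node for node in holders if node not in dead_nodes]
        missing = target - len(alive)
        if missing > 0:
            # Only nodes that do not already hold the block are candidates.
            candidates = [n for n in sorted(live_nodes) if n not in alive]
            plan[block] = candidates[:missing]
    return plan

# Hypothetical cluster state: dn3 has been silent for 60 seconds.
last_heartbeat = {"dn1": 100, "dn2": 98, "dn3": 40, "dn4": 99}
dead = find_dead_nodes(last_heartbeat, now=100, timeout=30)
live = set(last_heartbeat) - dead

block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn1", "dn2", "dn4"],
}
plan = plan_re_replication(block_locations, dead, live)
print(dead, plan)
```

Only blk_1 lost a replica when dn3 went silent, so it is the only block scheduled for a new copy, and dn4 is the only live node that does not already hold it.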

4(b) Hadoop Streaming and Pipes

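Hadoop Streaming lets MapReduce programs be written in any language that can read standard input and write standard output, such as Python or Perl. Hadoop Pipes is the C++ interface to MapReduce; unlike Streaming, it communicates with the task process over sockets rather than standard input/output.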
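
A Streaming mapper and reducer exchange plain text lines, conventionally with a tab between key and value. The sketch below finds the maximum temperature per year; the "year,temperature" input format is an assumption for this example. The logic is written as functions over iterables of lines so that it can be tried without a cluster; in an actual job each function would sit in its own script that iterates over sys.stdin and prints its results.

```python
from itertools import groupby

def mapper(lines):
    """Turn each 'year,temperature' line into a 'year<TAB>temperature' line."""
    for line in lines:
        year, temp = line.strip().split(",")
        yield f"{year}\t{temp}"

def reducer(sorted_lines):
    """Read key-sorted lines and emit the maximum temperature for each year."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for year, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{year}\t{max(int(temp) for _, temp in group)}"

readings = ["1950,22", "1949,30", "1950,31"]
mapped = list(mapper(readings))
# Hadoop sorts the mapper output by key before it reaches the reducer;
# sorted() plays that role here.
reduced = list(reducer(sorted(mapped)))
print(reduced)
```

The reducer depends on the framework delivering its input sorted by key, which is why groupby over consecutive lines is enough.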

5(a) Client Read and Write Operations in HDFS

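For a write operation, the client asks the NameNode to create the file and allocate blocks. The NameNode returns a list of DataNodes for each block, and the client streams the data to the first DataNode, which forwards it along a pipeline to the other replicas while acknowledgements flow back to the client.
For a read operation, the client gets the block locations from the NameNode and then reads each block directly from the nearest DataNode that holds a replica.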

5(b) Cluster Specification & Hadoop Cluster Setup

Cluster specification includes number of nodes, CPU, RAM, storage, and network bandwidth.
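Setting up a Hadoop cluster involves installing Java and Hadoop on every node, configuring core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml, setting up passwordless SSH between the nodes, formatting the NameNode, and starting the HDFS and YARN services.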

6(a) Features of Apache Spark & Integration with Hadoop

Spark provides in-memory processing, high speed, fault tolerance, and supports batch, streaming, ML, and graph processing.
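Spark integrates with Hadoop by reading and writing data in HDFS and by running on YARN as its cluster manager, so it can share a cluster with existing MapReduce jobs.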

6(b) NameNode High Availability & HDFS Federation

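High Availability removes the NameNode as a single point of failure by running an Active and a Standby NameNode that share the edit log (typically through JournalNodes), with ZooKeeper-based failover promoting the Standby if the Active fails.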
HDFS Federation allows multiple NameNodes to manage separate namespaces for scalability.
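
The idea of several NameNodes each owning part of the namespace can be sketched as a mount table lookup, similar in spirit to a client-side mount table. The mount points and NameNode names below are hypothetical, and the sketch leaves out everything except path routing.

```python
# Hypothetical mount table: each namespace root is served by one NameNode.
MOUNT_TABLE = {
    "/user": "namenode-1",
    "/logs": "namenode-2",
    "/warehouse": "namenode-3",
}

def resolve_namenode(path, mount_table=MOUNT_TABLE):
    """Return the NameNode responsible for a path (longest matching mount)."""
    matches = [
        mount for mount in mount_table
        if path == mount or path.startswith(mount + "/")
    ]
    if not matches:
        raise KeyError(f"no NameNode is mounted for {path}")
    return mount_table[max(matches, key=len)]

owner = resolve_namenode("/logs/2024/app.log")
print(owner)
```

Because each NameNode only keeps metadata for its own subtree, adding a NameNode adds namespace capacity without changing the DataNodes, which store blocks for all namespaces.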

7(a) Need of Pig & Execution Modes

Pig simplifies complex data processing using Pig Latin.
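Execution modes are:
Local mode (pig -x local): runs in a single JVM against the local file system.
MapReduce mode (pig -x mapreduce): the default; runs as MapReduce jobs on a Hadoop cluster.
Tez mode (pig -x tez): runs on the Apache Tez engine on a Hadoop cluster.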

7(b) Apache Hive Architecture

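Hive architecture includes the UI, Driver, Compiler, Metastore, Execution Engine, and HDFS. The Metastore holds table schemas and partition metadata, while the Compiler and Execution Engine convert HiveQL queries into MapReduce, Tez, or Spark jobs for execution.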
