(SEM VIII) THEORY EXAMINATION 2022-2023 BIG DATA

SECTION A

(Attempt all | 2 × 10 = 20 Marks)

(a) Benefits of HDFS over NFS

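HDFS distributes and replicates data blocks across many nodes, which gives it high fault tolerance and horizontal scalability. NFS serves files from a single server, which becomes both a bottleneck and a single point of failure for large-scale data processing. HDFS is also optimized for big data analytics, because computation can run in parallel on the nodes that hold the data.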

(b) Structured, Semi-Structured & Unstructured Data

Structured data follows a fixed schema (tables, rows). Semi-structured data uses tags or markers (XML, JSON). Unstructured data has no predefined format, such as images, videos, and social media data.

(c) Sources of data in Big Data

Sources include social media, sensors and IoT devices, transaction logs, web clickstreams, mobile devices, multimedia content, and enterprise applications.

(d) Metadata in HDFS

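Metadata is information about files, such as file name, size, permissions, block locations, and replication factor. It is maintained by the NameNode, which keeps it in memory and persists the namespace in the FsImage and edit log; block locations are rebuilt from the block reports sent by DataNodes.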

(e) Map vs Reduce

Map processes input data and converts it into key-value pairs. Reduce aggregates and processes these key-value pairs to produce final output.
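The division of work between the two phases can be sketched in plain Python with a word count. This is only an illustration, not Hadoop's API: the function names `map_phase` and `reduce_phase` and the sample lines are assumptions, and the grouping step stands in for what the framework does between the phases.

```python
from collections import defaultdict

def map_phase(line):
    """Map: turn one input line into (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(word, counts):
    """Reduce: combine all values that share one key into a single result."""
    return word, sum(counts)

# Hypothetical input lines, for illustration only.
lines = ["big data is big", "data is everywhere"]

# Group the mapped values by key (a stand-in for the framework's grouping step).
grouped = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):
        grouped[word].append(one)

# Run the reducer once per key.
word_counts = dict(reduce_phase(word, counts) for word, counts in grouped.items())
print(word_counts)
```

Each mapper call sees only one line, and each reducer call sees only one key, which is what allows both phases to run in parallel.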

(f) Indexing

Indexing is a technique used to improve data retrieval speed by creating a data structure that allows quick access to records.
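As a minimal sketch of the idea, the Python below compares a full scan with a lookup through a hash index. The records, the `id` field, and the function names are hypothetical; real databases usually use B-trees or similar structures rather than a plain dictionary.

```python
# Hypothetical records, for illustration only.
records = [
    {"id": 101, "name": "Asha"},
    {"id": 205, "name": "Ravi"},
    {"id": 309, "name": "Meera"},
]

def scan_lookup(rows, record_id):
    """Without an index: check every record until a match is found."""
    for row in rows:
        if row["id"] == record_id:
            return row
    return None

# The index maps each key to the position of its record.
index = {row["id"]: pos for pos, row in enumerate(records)}

def index_lookup(rows, idx, record_id):
    """With an index: jump straight to the record's position."""
    pos = idx.get(record_id)
    return rows[pos] if pos is not None else None

print(index_lookup(records, index, 205))
```

The index costs extra memory and must be updated on every insert or delete; that is the usual trade-off for faster reads.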

(g) Shuffle vs Sort

Shuffle transfers intermediate map outputs to reducers. Sort arranges the data by keys before reduction.
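The difference can be sketched in Python as two separate steps. This is a simplified model under stated assumptions: the `partition` function is a toy stand-in for a hash partitioner, and the map output and the number of reducers are made up for illustration.

```python
from collections import defaultdict

NUM_REDUCERS = 2  # assumed number of reducers

def partition(key, num_reducers):
    """Toy deterministic partitioner: decides which reducer receives a key."""
    return sum(key.encode("utf-8")) % num_reducers

# Hypothetical intermediate output from the map phase.
map_output = [
    ("spark", 1), ("data", 1), ("hadoop", 1),
    ("big", 1), ("data", 1), ("hadoop", 1),
]

# Shuffle: move every pair to the reducer that owns its key.
reducer_inputs = defaultdict(list)
for key, value in map_output:
    reducer_inputs[partition(key, NUM_REDUCERS)].append((key, value))

# Sort: order each reducer's input by key so equal keys sit next to each other.
sorted_inputs = {r: sorted(pairs) for r, pairs in reducer_inputs.items()}

for reducer_id in sorted(sorted_inputs):
    print(reducer_id, sorted_inputs[reducer_id])
```

Shuffle answers "which reducer gets this pair", while sort answers "in what order does the reducer see its pairs".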

(h) TF-IDF

TF-IDF (Term Frequency–Inverse Document Frequency) measures the importance of a word in a document relative to a collection of documents.
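A minimal sketch of the calculation, assuming the common unsmoothed form tf × log(N / df), is given below. Libraries often use smoothed or normalized variants, so the exact numbers differ between tools. The sample documents are hypothetical.

```python
import math

# Hypothetical corpus of tokenized documents.
docs = [
    ["hadoop", "stores", "big", "data"],
    ["spark", "processes", "big", "data"],
    ["hadoop", "runs", "mapreduce"],
]

def tf(term, doc):
    """Term frequency: share of the document's words that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency: log(N / number of documents containing term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "spark" appears in one document, "big" in two, so "spark" scores higher.
print(tf_idf("spark", docs[1], docs))
print(tf_idf("big", docs[1], docs))
```

A word that appears in every document gets an IDF of log(1) = 0, which is why very common words contribute nothing to the score.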

(i) NameNode, DataNode, JobTracker, TaskTracker

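NameNode manages file system metadata, DataNode stores the actual data blocks, JobTracker schedules and monitors MapReduce jobs, and TaskTracker executes map and reduce tasks on slave nodes. JobTracker and TaskTracker belong to Hadoop 1; in Hadoop 2 and later, YARN's ResourceManager and NodeManager take over these roles.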

(j) File name and block size

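Windows (NTFS): maximum filename 255 characters, default block (cluster) size 4 KB

Linux (ext4): maximum filename 255 bytes, default block size 4 KB

Hadoop (HDFS): default block size 128 MB (64 MB in Hadoop 1.x); the large block size reduces NameNode metadata and disk-seek overhead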
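The effect of the block size on how a file is split can be worked out with a short calculation. The sketch below assumes the 128 MB default and a hypothetical helper name, `hdfs_blocks`; sizes are in whole megabytes to keep the arithmetic simple.

```python
BLOCK_SIZE_MB = 128  # HDFS default block size

def hdfs_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block holds only the leftover data, not a full 128 MB.
        sizes.append(remainder)
    return sizes

print(hdfs_blocks(300))        # a 300 MB file
print(len(hdfs_blocks(1024)))  # number of blocks in a 1 GB file
```

With a 4 KB block size the same 1 GB file would need 262,144 blocks, each requiring a metadata entry on the NameNode, which is one reason HDFS uses large blocks.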

SECTION B

(Attempt any THREE | 10 × 3 = 30 Marks)

2(a) 5 Vs of Big Data and their importance

The 5 Vs are Volume (huge data size), Velocity (speed of data generation), Variety (multiple data types), Veracity (data quality), and Value (useful insights). These characteristics explain why traditional systems fail and why specialized Big Data tools are required.

2(b) History and evolution of Hadoop

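Hadoop grew out of the Apache Nutch web-crawler project, in which Doug Cutting and Mike Cafarella implemented ideas from Google’s GFS and MapReduce papers. It became Apache Hadoop, providing open-source distributed storage (HDFS) and processing (MapReduce). Hadoop 2 added YARN for resource management, and an ecosystem of tools such as Hive, HBase, and Spark grew around it.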

2(c) Data replication in HDFS

Replication stores multiple copies of data blocks across different nodes. Benefits include fault tolerance and high availability. Challenges include increased storage cost and network overhead.
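Both the benefit and the cost can be illustrated with a small simulation. The sketch below assumes the HDFS default replication factor of 3 and uses a simple round-robin placement; real HDFS placement is rack-aware, and the node names and block IDs here are hypothetical.

```python
REPLICATION_FACTOR = 3  # HDFS default; configurable per file

# Hypothetical cluster and block IDs, for illustration only.
nodes = ["node1", "node2", "node3", "node4"]
blocks = ["blk_0", "blk_1", "blk_2"]

def place_replicas(block_ids, node_names, replication):
    """Assign each block to `replication` distinct nodes (round-robin, not rack-aware)."""
    if replication > len(node_names):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [
            node_names[(i + r) % len(node_names)] for r in range(replication)
        ]
    return placement

def surviving_replicas(placement, failed_node):
    """Drop the failed node from every block's replica list."""
    return {
        block: [n for n in replica_nodes if n != failed_node]
        for block, replica_nodes in placement.items()
    }

placement = place_replicas(blocks, nodes, REPLICATION_FACTOR)
after_failure = surviving_replicas(placement, "node2")

# Benefit: every block is still readable after one node fails.
print(after_failure)

# Cost: 300 MB of data occupies 900 MB of raw disk space.
logical_mb = 300
raw_mb = logical_mb * REPLICATION_FACTOR
print(raw_mb)
```

After a failure, the NameNode would also schedule new copies of the under-replicated blocks; that re-replication traffic is part of the network overhead mentioned above.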

2(d) Fair vs Capacity Scheduler in YARN

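Fair Scheduler shares cluster resources so that, over time, all running applications receive a roughly equal share. Capacity Scheduler divides the cluster into queues, each with a guaranteed capacity; idle capacity can be lent to other queues. Fair Scheduler is flexible and keeps short jobs from starving behind long ones; Capacity Scheduler is suitable for large organizations where several teams share one cluster.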
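The two allocation policies can be contrasted with a deliberately simplified sketch. It ignores weights, preemption, and borrowing between queues, all of which the real YARN schedulers support; the function names, application names, queue names, and percentages are assumptions made for illustration.

```python
def fair_shares(total_memory_gb, running_apps):
    """Fair policy (simplified): split resources equally among running applications."""
    share = total_memory_gb / len(running_apps)
    return {app: share for app in running_apps}

def capacity_shares(total_memory_gb, queue_capacity_pct):
    """Capacity policy (simplified): give each queue its configured percentage."""
    return {
        queue: total_memory_gb * pct / 100
        for queue, pct in queue_capacity_pct.items()
    }

# Hypothetical 120 GB cluster.
print(fair_shares(120, ["app1", "app2", "app3"]))
print(capacity_shares(120, {"prod": 75, "dev": 25}))
```

Under the fair policy each application's share shrinks as more applications start, whereas under the capacity policy a queue's guarantee does not depend on how busy the other queues are.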

2(e) Pig and its execution modes

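Apache Pig is a data-flow platform whose scripts are written in Pig Latin.
Execution modes: Local Mode (runs on a single machine using the local file system) and MapReduce Mode (runs on a Hadoop cluster using HDFS).

Pig Latin is procedural: a script describes a data flow step by step, which simplifies complex pipelines. Traditional databases instead use declarative SQL over a fixed, structured schema.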

SECTION C

3(a) Security, compliance, auditing & protection in Big Data

Big Data security covers authentication, authorization, encryption, auditing, and regulatory compliance. Key features are data privacy, secure access control, and activity monitoring, using tools such as Kerberos (authentication) and Apache Ranger (authorization policies and audit logs).

3(b) Challenges of conventional data systems

Traditional systems lack scalability, flexibility, and performance for large datasets. Big Data solves these using distributed storage, parallel processing, and fault tolerance.

4(a) Hadoop Distributed File System

HDFS stores large data across clusters using replication and parallel access. It supports scalability, fault tolerance, and high-throughput processing.

4(b) Anatomy of a MapReduce job

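Input split → Map → Shuffle → Sort → Reduce → Output. In Hadoop 1, the JobTracker coordinates tasks while TaskTrackers execute them; under YARN, the ResourceManager and a per-job ApplicationMaster take over the coordinating role.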

5(a) Data ingestion methods: Flume & Sqoop

Flume ingests streaming data like logs, while Sqoop transfers structured data between RDBMS and Hadoop.

5(b) Hadoop I/O support

Hadoop I/O covers compression (codecs such as gzip, bzip2, and Snappy), serialization (the Writable interface), Avro for schema-based serialization, and file formats such as SequenceFile and Parquet.

6(a) NoSQL & MongoDB

MongoDB is a document-oriented NoSQL database that stores data as JSON-like (BSON) documents. It supports CRUD operations, a flexible schema, indexing, and horizontal scalability through sharding.

6(b) Scala features

Scala supports object-oriented and functional programming, classes, objects, closures, pattern matching, and is tightly integrated with Spark.

7(a) HBase vs RDBMS

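HBase is a distributed, column-family-oriented NoSQL store that scales horizontally on top of HDFS, while a traditional RDBMS is typically centralized and enforces a fixed schema. HBase fixes only the column families when a table is created; the columns inside a family can vary from row to row. HBase indexes rows by row key only and has no SQL joins or built-in secondary indexes, whereas an RDBMS provides both.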
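The HBase data model can be pictured as a sorted map: row key → column family → column qualifier → value. The Python sketch below models that shape with nested dictionaries. It is not the HBase client API; the table contents and the `get` and `scan` helpers are hypothetical, and values are plain strings here although HBase stores raw bytes.

```python
from bisect import bisect_left

# Hypothetical table: rows may have different columns inside a family.
table = {
    "user#001": {"info": {"name": "Asha"}, "stats": {"logins": "12"}},
    "user#002": {"info": {"name": "Ravi", "email": "ravi@example.com"}},
    "user#010": {"info": {"name": "Meera"}},
}

# Rows are kept sorted by row key, which makes range scans cheap.
row_keys = sorted(table)

def get(row_key, family, qualifier):
    """Point lookup by row key; returns None if the cell does not exist."""
    return table.get(row_key, {}).get(family, {}).get(qualifier)

def scan(start_key, stop_key):
    """Range scan over row keys in [start_key, stop_key)."""
    lo = bisect_left(row_keys, start_key)
    hi = bisect_left(row_keys, stop_key)
    return row_keys[lo:hi]

print(get("user#002", "info", "email"))
print(get("user#001", "info", "email"))  # this row has no email column
print(scan("user#001", "user#010"))
```

Because the row key is the only index, choosing a row key that matches the main access pattern is the central design decision in an HBase table.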

7(b) Role of ZooKeeper

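ZooKeeper is a coordination service that provides configuration management, naming, synchronization, and leader election for Hadoop clusters; for example, HDFS NameNode high availability and HBase both rely on it. It helps build reliable distributed applications.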

Uploader

SuGanta International