(SEM VI) THEORY EXAMINATION 2021-22 BIG DATA

BIG DATA (KCS-061)

B.Tech Semester VI – Theory Examination (2021–22)

Big Data is an interdisciplinary subject that deals with the storage, processing, analysis, and management of extremely large and complex data sets that cannot be efficiently handled using traditional data processing techniques. With the exponential growth of data generated from social media, sensors, mobile devices, transactions, and online services, organizations require scalable and distributed systems to extract meaningful insights from data. Big Data technologies such as Hadoop, HDFS, Map-Reduce, Hive, Pig, and NoSQL databases like MongoDB provide cost-effective and fault-tolerant solutions for handling massive volumes of structured, semi-structured, and unstructured data.

The question paper focuses on Big Data fundamentals, the Hadoop ecosystem, HDFS architecture, Map-Reduce processing, the Hive and Pig frameworks, NoSQL databases, and MongoDB operations. To score well, write answers in clear, logically connected paragraphs, with conceptual explanations, a discussion of the architecture, and suitable examples wherever required.

SECTION A – FUNDAMENTAL BIG DATA CONCEPTS

(Based on Section A on Page-1 of the paper)

Big Data platforms refer to software frameworks that enable distributed storage and processing of large data sets. Examples include Hadoop, Spark, Cassandra, HBase, and Flink, all of which support scalability and fault tolerance.

Big Data finds extensive application across industries. In healthcare, it is used for patient data analysis and disease prediction, while in e-commerce it supports recommendation systems, customer behavior analysis, and demand forecasting.

In Map-Reduce, Sort and Shuffle play a critical role between the Map and Reduce phases. After the Map phase generates intermediate key-value pairs, the framework automatically sorts the data by keys and shuffles it across the network so that all values corresponding to the same key reach the same reducer. This process ensures correctness and efficiency of parallel processing.
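The sort and shuffle step can be sketched in plain Python as a single-process word count. This is only an illustration of the idea, not Hadoop code: the function names, the two-reducer setup, and the CRC32-based partitioner are assumptions made for the example.

```python
import zlib
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def partition(key, num_reducers):
    """Choose the reducer for a key (assumed partitioner: a stable CRC32 hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(mapped_pairs, num_reducers):
    """Shuffle: route each pair to its reducer. Sort: order every reducer's keys."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return [sorted(bucket.items()) for bucket in buckets]


def reduce_phase(key, values):
    """Reduce: sum all the counts that arrived for one key."""
    return key, sum(values)


lines = ["big data needs big clusters", "data moves to reducers"]
mapped = [pair for line in lines for pair in map_phase(line)]
reducer_inputs = shuffle_and_sort(mapped, num_reducers=2)

word_counts = dict(
    reduce_phase(key, values)
    for reducer_input in reducer_inputs
    for key, values in reducer_input
)
print(word_counts)
```

Because the partitioner depends only on the key, every value for a given word lands in the same reducer's input, which is the guarantee the Reduce phase relies on.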

The full form of HDFS is Hadoop Distributed File System, which is designed to store very large files across clusters of commodity hardware while providing high throughput access.

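The default block size of HDFS is 128 MB in Hadoop 2.x and later (it was 64 MB in Hadoop 1.x), which is far larger than the few-kilobyte blocks of traditional file systems. This large block size reduces the number of disk seeks, keeps the amount of metadata held by the NameNode small, and improves performance for large sequential reads.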
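The effect of the block size can be checked with simple arithmetic. The sketch below assumes sizes expressed in megabytes; the helper names are made up for this example, and the 4 KB figure is only a typical block size for a local file system.

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size stated above


def hdfs_block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed for a file; the last block may be only partly filled."""
    return math.ceil(file_size_mb / block_size_mb)


def last_block_size_mb(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Size of the final block, which occupies only the space it actually needs."""
    remainder = file_size_mb % block_size_mb
    return remainder if remainder else min(file_size_mb, block_size_mb)


file_size_mb = 1000
blocks = hdfs_block_count(file_size_mb)
tail = last_block_size_mb(file_size_mb)
print(f"{file_size_mb} MB file -> {blocks} blocks, last block holds {tail} MB")

# The same 1 GB file cut into 4 KB blocks (assumed local file system block size).
small_blocks = hdfs_block_count(1024, block_size_mb=4 / 1024)
print(f"1024 MB file in 4 KB blocks -> {small_blocks} blocks")
```

Fewer blocks per file means fewer entries for the NameNode to track and longer uninterrupted reads from each disk.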

HDFS, the storage layer of Hadoop, consists of two main types of nodes: the NameNode and the DataNodes. The NameNode is the master and manages metadata and namespace information, while the DataNodes are the workers that store the actual data blocks.

NoSQL databases differ from relational databases in schema flexibility, scalability, and consistency model. Relational databases follow a fixed schema and guarantee ACID properties, whereas NoSQL databases offer a flexible or schema-less design, scale horizontally across many machines, and often relax strict consistency in favor of availability in distributed environments.

MongoDB supports ACID properties, but the scope depends on the version. Operations on a single document have always been atomic. Since version 4.0, MongoDB also supports multi-document ACID transactions, although they carry a performance cost, so the usual design advice is to keep related data in one document wherever possible.

A schema defines the logical structure of data, including fields, data types, and relationships. In Big Data systems, schema can be enforced at write time or read time, providing flexibility in handling diverse data.
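The difference between enforcing a schema at write time and at read time can be illustrated with a small Python sketch. The sample records, the field names, and both helper functions are hypothetical and exist only for this example.

```python
import json

# Raw records kept exactly as they arrived; the second one has no "city" field.
raw_records = [
    '{"user": "asha", "age": "31", "city": "Pune"}',
    '{"user": "ravi", "age": "27"}',
]

# Assumed schema for this example: field name -> type to cast to.
READ_SCHEMA = {"user": str, "age": int, "city": str}


def read_with_schema(raw_line, schema):
    """Schema-on-read: interpret raw data only at query time; missing fields become None."""
    record = json.loads(raw_line)
    return {
        field: cast(record[field]) if field in record else None
        for field, cast in schema.items()
    }


def write_with_schema(store, record, schema):
    """Schema-on-write: reject a record that does not fit before it is stored."""
    missing = [field for field in schema if field not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    store.append({field: cast(record[field]) for field, cast in schema.items()})


rows = [read_with_schema(line, READ_SCHEMA) for line in raw_records]
print(rows)

strict_store = []
try:
    write_with_schema(strict_store, json.loads(raw_records[1]), READ_SCHEMA)
except ValueError as error:
    print("rejected at write time:", error)
```

Schema-on-read accepts both records and fills the gap at query time, while schema-on-write refuses the incomplete record up front.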

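Hive works primarily with structured data organized into tables, and it can also query semi-structured data such as JSON or delimited logs through serializer/deserializer (SerDe) libraries. Because it applies schema-on-read, the table definition is imposed when data is queried rather than when it is loaded, which makes Hive suitable for data warehousing on Hadoop.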

SECTION B – BIG DATA ARCHITECTURE AND PROCESSING

(Based on Section B on Page-1)

The three dimensions of Big Data are Volume, Velocity, and Variety. Volume refers to massive data sizes, velocity refers to the speed at which data is generated and processed, and variety refers to different data formats such as text, images, videos, and logs.

The Map-Reduce architecture consists of a client, a master daemon, worker daemons, and HDFS. In Hadoop 1.x the master is the JobTracker and the workers are TaskTrackers; in Hadoop 2.x and later, YARN replaces them with the ResourceManager and NodeManagers. The client submits a job, which is divided into map and reduce tasks. These tasks are executed in parallel across the cluster, enabling efficient large-scale processing.

In HDFS, when a client reads data, it first contacts the NameNode to obtain metadata and block locations. The actual data is then read directly from DataNodes. During write operations, data is split into blocks and replicated across multiple DataNodes to ensure fault tolerance.
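The read and write paths described above can be mimicked with a toy in-memory model. This is a sketch under stated assumptions, not the HDFS API: the class name, the round-robin placement rule, and the tiny 4-byte block size are invented for the example, and real HDFS uses rack-aware placement. A replication factor of 3 is the HDFS default.

```python
REPLICATION = 3  # HDFS default replication factor


class ToyCluster:
    """Toy stand-in for a NameNode (metadata) plus several DataNodes (block storage)."""

    def __init__(self, datanode_names, block_size):
        if len(datanode_names) < REPLICATION:
            raise ValueError("need at least as many DataNodes as replicas")
        self.block_size = block_size
        self.datanodes = {name: {} for name in datanode_names}  # name -> {block_id: bytes}
        self.namenode = {}  # filename -> [(block_id, [DataNode names])]

    def write(self, filename, data):
        """Split data into blocks and copy each block to REPLICATION DataNodes."""
        names = list(self.datanodes)
        entries = []
        for offset in range(0, len(data), self.block_size):
            index = offset // self.block_size
            block_id = f"{filename}#blk{index}"
            # Assumed placement policy: simple round-robin over the DataNodes.
            targets = [names[(index + r) % len(names)] for r in range(REPLICATION)]
            for target in targets:
                self.datanodes[target][block_id] = data[offset:offset + self.block_size]
            entries.append((block_id, targets))
        self.namenode[filename] = entries

    def read(self, filename, failed=()):
        """Ask the NameNode for block locations, then fetch each block from a live DataNode."""
        chunks = []
        for block_id, locations in self.namenode[filename]:
            live = [name for name in locations if name not in failed]
            if not live:
                raise IOError(f"all replicas of {block_id} are unavailable")
            chunks.append(self.datanodes[live[0]][block_id])
        return b"".join(chunks)


cluster = ToyCluster(["dn1", "dn2", "dn3", "dn4"], block_size=4)
payload = b"hello hdfs world"
cluster.write("notes.txt", payload)

# Two DataNodes fail, yet every block still has a surviving replica.
restored = cluster.read("notes.txt", failed={"dn1", "dn2"})
print(restored == payload)
```

The metadata lookup and the data transfer are separate steps, so the NameNode never sits on the data path.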

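CRUD stands for create, read, update, and delete, the four basic operations performed on documents stored in MongoDB collections. In the MongoDB shell these correspond to methods such as insertOne(), find(), updateOne(), and deleteOne(). For example, inserting a document adds a JSON-like object to a collection, and because documents in the same collection need not share the same fields, the schema remains flexible.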
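The four operations can be walked through with a toy in-memory collection. This is deliberately not MongoDB or a MongoDB driver: the ToyCollection class is hypothetical, its method names merely echo those used by MongoDB drivers, it matches on exact field equality only, and its update takes a plain dictionary of changed fields rather than MongoDB's update operators.

```python
class ToyCollection:
    """In-memory stand-in for a document collection (illustration only, not MongoDB)."""

    def __init__(self):
        self.documents = []
        self._next_id = 1

    def insert_one(self, document):
        """Create: store a copy of the document with a generated _id."""
        stored = {"_id": self._next_id, **document}
        self._next_id += 1
        self.documents.append(stored)
        return stored["_id"]

    def find(self, query=None):
        """Read: return documents whose fields equal every key/value in the query."""
        query = query or {}
        return [
            doc for doc in self.documents
            if all(doc.get(key) == value for key, value in query.items())
        ]

    def update_one(self, query, changes):
        """Update: apply the changes to the first matching document."""
        for doc in self.find(query)[:1]:
            doc.update(changes)
            return 1
        return 0

    def delete_one(self, query):
        """Delete: remove the first matching document."""
        for doc in self.find(query)[:1]:
            self.documents.remove(doc)
            return 1
        return 0


students = ToyCollection()
students.insert_one({"name": "Asha", "branch": "CSE"})
students.insert_one({"name": "Ravi", "branch": "IT", "cgpa": 8.4})  # extra field is fine

students.update_one({"name": "Asha"}, {"cgpa": 9.1})
students.delete_one({"name": "Ravi"})
print(students.find())
```

The second document carries a field the first one lacks, which is the flexible-schema behavior the paragraph above describes.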

Map-Reduce, Pig, and Hive differ in abstraction level. Map-Reduce is a low-level programming model, Pig provides a scripting language for data flow, and Hive offers SQL-like querying through HiveQL, making it more user-friendly.

SECTION C – BIG DATA STORAGE AND ANALYTICS

(Based on Section C and subsequent questions)

Big Data exists in multiple forms, including structured data like relational tables, semi-structured data like XML and JSON, and unstructured data like images, audio, and videos. Each form requires different processing approaches.

A typical Big Data architecture consists of five layers: the data sources, a data ingestion layer, a storage layer such as HDFS, a processing layer such as Map-Reduce or Spark, and an analytics and visualization layer.

The detailed architecture of Map-Reduce includes job submission, input splitting, mapping, sorting, shuffling, reducing, and output generation. This pipeline enables efficient distributed computation.

Scale-up involves increasing resources on a single machine, whereas scale-out involves adding more machines to a cluster. Hadoop uses scale-out architecture by distributing data and computation across multiple nodes, improving performance and fault tolerance.

HDFS is designed with a master-slave architecture, where the NameNode manages metadata and DataNodes store data blocks. Replication ensures data reliability even if nodes fail.

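Benefits of HDFS include scalability, fault tolerance, and cost effectiveness. Challenges include inefficient handling of many small files, since every file and block consumes NameNode memory, poor suitability for low-latency access, and the NameNode as a single point of failure in Hadoop 1.x. Hadoop 2.x addresses the last issue with a standby NameNode for high availability.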

NoSQL databases are classified into key-value stores, document stores, column-family stores, and graph databases. MongoDB falls under document-oriented NoSQL databases.

Indexing in MongoDB improves query performance by allowing faster data retrieval. For example, indexing a frequently queried field reduces search time significantly.
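The gain from an index can be seen by counting how many documents each approach has to examine. The sketch below is a plain-Python analogy with invented sample data and helper names; MongoDB's regular indexes are B-tree structures, and the dictionary here only stands in for the idea of a lookup structure keyed on one field.

```python
from collections import defaultdict

cities = ["Pune", "Delhi", "Pune", "Agra", "Delhi", "Pune"]
documents = [{"_id": i, "city": city} for i, city in enumerate(cities)]


def collection_scan(docs, field, value):
    """Without an index: examine every document in the collection."""
    examined = 0
    matches = []
    for doc in docs:
        examined += 1
        if doc.get(field) == value:
            matches.append(doc)
    return matches, examined


def build_index(docs, field):
    """Build a lookup from field value to the positions of matching documents."""
    index = defaultdict(list)
    for position, doc in enumerate(docs):
        index[doc.get(field)].append(position)
    return index


def indexed_lookup(docs, index, value):
    """With an index: jump straight to the matching documents."""
    positions = index.get(value, [])
    return [docs[p] for p in positions], len(positions)


city_index = build_index(documents, "city")
scan_matches, scan_examined = collection_scan(documents, "city", "Pune")
index_matches, index_examined = indexed_lookup(documents, city_index, "Pune")
print(scan_examined, index_examined)
```

Both approaches return the same documents, but the indexed lookup touches only the matches, at the cost of extra storage and extra work on every write to keep the index current.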

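Pig has two main execution modes. In local mode, scripts run in a single JVM against the local file system, which is convenient for testing on small data. In Map-Reduce mode, Pig Latin scripts are translated into Map-Reduce jobs and executed on a Hadoop cluster with data stored in HDFS.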

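Hive architecture includes user interfaces for submitting HiveQL queries, the Metastore that holds table definitions, the Driver that manages the life cycle of a query, the Compiler that turns a query into an execution plan, and the Execution Engine that runs the plan on the cluster. Together these components enable SQL-like querying of data stored in HDFS.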

HOW TO WRITE BIG DATA ANSWERS IN THE EXAM

In Big Data, do not write answers as short bullet points. Start with a clear definition, then explain the architecture and working principles in detail, and close with examples. Use correct terminology such as HDFS, Map-Reduce, schema-on-read, NoSQL, HiveQL, and scalability. Examiners give the most weight to conceptual clarity, explanation of the architecture, and a practical understanding of the Hadoop ecosystem.

Uploader

SuGanta International