(SEM VI) THEORY EXAMINATION 2017-18 BIG DATA
Big Data (NIT-067)
Complete Section-Wise Explanation – B.Tech Semester VI
Introduction to the Subject
Big Data as a subject focuses on understanding how extremely large, fast-growing, and complex data sets are stored, processed, and analyzed to extract meaningful insights. Traditional databases and systems are not capable of handling such data efficiently, which is why distributed systems like Hadoop, MapReduce, HDFS, Hive, HBase, and NoSQL databases are used.
This paper tests:
Conceptual understanding of Big Data fundamentals
Knowledge of the Hadoop ecosystem
Data storage and processing models
MapReduce working and design
NoSQL and graph databases
Real-world Big Data applications
The paper is divided into three sections: A, B, and C, and students must attempt questions as instructed.
SECTION A – Basic Concepts & Definitions
Pattern:
Attempt all questions
10 questions × 2 marks = 20 marks
Nature of Section A
Section A checks whether your basic concepts are clear. Answers should be short but meaningful. Even though the questions are brief, clarity is extremely important.
Explanation of Section A Topics
What is Big Data and why do we analyze it?
Big Data refers to extremely large and complex datasets that cannot be processed using traditional tools. We analyze Big Data to discover patterns, trends, user behavior, and insights that help in decision-making, prediction, and optimization.
Data Locality Optimization
Data locality means moving computation closer to where the data is stored instead of moving data across the network. This improves performance and reduces network congestion in distributed systems like Hadoop.
Tools related to Hadoop
Hadoop ecosystem includes tools such as HDFS, MapReduce, YARN, Hive, Pig, HBase, Sqoop, Flume, and Oozie, each serving a specific purpose in data storage, processing, or management.
Purpose of Hadoop Pipes
Hadoop Pipes is a C++ interface to Hadoop MapReduce. It allows developers to write map and reduce functions in C++ instead of Java, improving flexibility for teams with existing C++ code.
MapReduce
MapReduce is a programming model that processes large data sets by dividing tasks into Map and Reduce phases, enabling parallel computation.
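The two phases can be illustrated with a small word-count sketch in plain Python. This is a simulation of the programming model, not the Hadoop API itself; the shuffle step shown between the phases is what the framework performs automatically.

```python
from collections import defaultdict

# Map phase: emit a (word, 1) pair for every word in each input line
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle: group all emitted values by key, as Hadoop does between phases
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate the grouped values for each key
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data needs big tools", "hadoop processes big data"]
result = reduce_phase(shuffle(map_phase(lines)))
```

Each phase works on independent records, which is what lets Hadoop run many map and reduce tasks in parallel across a cluster.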
Operational vs Analytical Systems
Operational systems handle daily transactions like banking or billing, while analytical systems process historical data for reporting, analysis, and decision-making.
Hadoop Distributed File System (HDFS)
HDFS is a distributed file system that stores data across multiple nodes with replication to ensure fault tolerance and high availability.
Industry Examples of Big Data
Big Data is used in healthcare, banking, e-commerce, social media, telecom, and transportation sectors.
Entities of YARN
YARN includes the ResourceManager, NodeManager, ApplicationMaster, and containers, which together manage cluster resources and application execution.
Hadoop Architecture
Hadoop architecture consists of HDFS for storage, YARN for resource management, and MapReduce for data processing.
SECTION B – Conceptual & System-Level Understanding
Pattern:
Attempt any three questions
3 questions × 10 marks = 30 marks
Nature of Section B
This section requires descriptive answers written in paragraphs. Students must explain concepts clearly with examples and proper flow.
Explanation of Major Section B Topics
Crowdsourcing Analytics
Crowdsourcing analytics involves collecting and analyzing data generated by a large number of people through social media, surveys, mobile apps, and online platforms. It helps organizations understand public opinion, trends, and collective behavior at scale.
Relationship Between Cloud and Big Data
Cloud computing provides scalable infrastructure for storing and processing Big Data. Big Data applications rely on cloud resources for elasticity, cost efficiency, and high availability, while cloud platforms benefit from Big Data-driven insights.
Design of HDFS
HDFS follows a master-slave architecture. The NameNode manages metadata, while DataNodes store actual data blocks. Data is divided into blocks and replicated across nodes to ensure reliability and fault tolerance.
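The block-and-replication design can be made concrete with a small calculation. This sketch assumes HDFS's common defaults of a 128 MB block size and a replication factor of 3; actual values are configurable per cluster.

```python
import math

def hdfs_storage(file_size_mb, block_size_mb=128, replication=3):
    """Return the number of HDFS blocks for a file and the raw storage consumed."""
    blocks = math.ceil(file_size_mb / block_size_mb)   # last block may be partial
    raw_storage_mb = file_size_mb * replication        # each block stored `replication` times
    return blocks, raw_storage_mb

# A 1 GB (1024 MB) file splits into 8 blocks and consumes 3 GB of raw cluster storage
blocks, raw = hdfs_storage(1024)
```

The NameNode only tracks which DataNodes hold each block's replicas; losing one DataNode still leaves two copies of every affected block, which is the basis of HDFS fault tolerance.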
Hive Data Definition Queries
Hive uses HiveQL, which is similar to SQL. Data definition queries include CREATE, DROP, ALTER, and DESCRIBE, allowing structured access to large datasets stored in HDFS.
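The four statement types can be sketched as HiveQL strings (held here in Python for illustration; the table and column names are hypothetical, not from the paper).

```python
# Illustrative HiveQL DDL statements; `logs` and its columns are hypothetical names
create_stmt = (
    "CREATE TABLE logs (ip STRING, ts TIMESTAMP, url STRING) "
    "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
    "STORED AS TEXTFILE"
)
alter_stmt = "ALTER TABLE logs ADD COLUMNS (status INT)"     # evolve the schema
describe_stmt = "DESCRIBE logs"                              # inspect columns and types
drop_stmt = "DROP TABLE IF EXISTS logs"                      # remove the table definition
ddl = [create_stmt, alter_stmt, describe_stmt, drop_stmt]
```

Hive translates such statements into metadata operations and MapReduce (or other engine) jobs, so the files in HDFS can be queried as if they were relational tables.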
HBase and Pig Data Models
HBase uses a column-oriented data model suitable for real-time read/write access. Pig uses a high-level scripting language (Pig Latin) that simplifies MapReduce programming.
SECTION C – Advanced Analysis & Architecture
Pattern:
Attempt one part from each question
5 questions × 10 marks = 50 marks
This section carries the maximum marks and determines overall performance.
Question 3
How Hadoop Analyzes Data
Hadoop analyzes data by breaking it into smaller chunks stored in HDFS and processing them in parallel using MapReduce. The Map phase processes data blocks, while the Reduce phase aggregates results.
Cassandra Data Model
Cassandra uses a peer-to-peer distributed architecture. Its data model organizes data into keyspaces and tables (column families), where each row is identified by a partition key. The design is optimized for high availability and scalability without a single point of failure.
Question 4
Anatomy of a MapReduce Job Run
A MapReduce job involves job submission, input splitting, mapping, shuffling, sorting, reducing, and final output generation. Each step is managed by YARN for resource allocation.
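The shuffle-and-sort step between mapping and reducing can be sketched with Python's sorting and grouping tools. This simulates what the framework does between the phases, not the Hadoop API; the log-level keys are illustrative sample data.

```python
from itertools import groupby
from operator import itemgetter

# Intermediate (key, value) pairs as emitted by several map tasks
mapped = [("error", 1), ("info", 1), ("error", 1), ("warn", 1), ("error", 1)]

# Shuffle/sort: the framework sorts by key so each reducer
# receives all of one key's values together
mapped.sort(key=itemgetter(0))
grouped = {key: [v for _, v in group]
           for key, group in groupby(mapped, key=itemgetter(0))}

# Reduce: aggregate the grouped values per key
reduced = {key: sum(values) for key, values in grouped.items()}
```

In a real job run, YARN schedules the map and reduce tasks into containers, and the sorted map outputs are fetched over the network by the reducers during this shuffle step.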
Types and Formats of MapReduce
MapReduce supports several input and output formats, such as plain text (TextInputFormat), binary sequence files (SequenceFileInputFormat), and database-backed formats. Different data types require different input and output formats for efficient processing.
Question 5
Data Model: Aggregations and Relations
Aggregations summarize large datasets, while relations define connections between data entities. These concepts are crucial for analytics and reporting.
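Both ideas can be shown in a few lines of Python. The customer and order records below are hypothetical sample data: the aggregation sums order amounts per customer, and the relation joins those totals back to customer names.

```python
from collections import defaultdict

# Two relations (hypothetical sample data): orders reference customers by id
customers = [{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ravi"}]
orders = [{"cust_id": 1, "amount": 250},
          {"cust_id": 1, "amount": 100},
          {"cust_id": 2, "amount": 300}]

# Aggregation: total order amount per customer id
totals = defaultdict(int)
for order in orders:
    totals[order["cust_id"]] += order["amount"]

# Relation (join): attach customer names to the aggregated totals
by_name = {c["name"]: totals[c["id"]] for c in customers}
```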
Composing MapReduce Calculations
Complex MapReduce jobs can be composed by chaining multiple jobs where the output of one becomes the input of another.
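Chaining can be sketched as two passes where the first job's output records become the second job's input. This is plain Python illustrating the pattern, not Hadoop's job-chaining machinery; each "job" collapses its map and reduce phases for brevity.

```python
from collections import Counter

# Job 1: word count over the raw input (map + reduce collapsed for brevity)
def job1_word_count(lines):
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts  # this output becomes job 2's input

# Job 2: invert job 1's output, grouping words by their frequency
def job2_group_by_count(counts):
    groups = {}
    for word, n in counts.items():
        groups.setdefault(n, []).append(word)
    return {n: sorted(words) for n, words in groups.items()}

lines = ["big data big tools", "data flows"]
stage1 = job1_word_count(lines)
stage2 = job2_group_by_count(stage1)
```

In Hadoop, the same pattern is realized by writing job 1's reducer output to HDFS and pointing job 2's input path at it.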
Question 6
Master-Slave and Peer-to-Peer Replication
Master-slave replication uses a central controller, while peer-to-peer replication distributes control across nodes, improving fault tolerance.
Three Dimensions of Big Data
The three dimensions are Volume, Velocity, and Variety, representing data size, speed, and diversity.
Question 7
Graph Databases and Schema-less Databases
Graph databases store data as nodes and edges, ideal for relationship-based data. Schema-less databases offer flexibility by not enforcing fixed data structures.
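The nodes-and-edges idea can be sketched with a minimal in-memory graph. This is plain Python illustrating the model, not a real graph-database API, and the node ids and relationship labels are hypothetical.

```python
# Nodes carry properties; edges connect node ids with a relationship label
nodes = {
    "alice": {"type": "person"},
    "bob": {"type": "person"},
    "bigdata": {"type": "course"},
}
edges = [("alice", "FRIEND_OF", "bob"), ("alice", "ENROLLED_IN", "bigdata")]

# A relationship query: who is alice directly connected to, and how?
def neighbours(node_id):
    return [(rel, dst) for src, rel, dst in edges if src == node_id]
```

A graph database answers such traversal queries directly by following edges, rather than by joining tables; the schema-less property shows up in the node dictionaries, which need not share a fixed set of fields.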
Graph Mapping Schemas and Replication Rate
Graph mapping schemas define how graph data is structured and mapped onto storage. The lower-bound replication rate is the minimum amount of data duplication required to maintain availability and performance.