THEORY EXAMINATION (SEM–VI) 2016-17 BIG DATA

BIG DATA (NIT067)

Time: 3 Hours Max Marks: 100

SECTION – A (Short Answer Questions)

(10 × 2 = 20 Marks)

(a) Characteristics of Big Data

The main characteristics of Big Data are known as the 5 V’s:

Volume: Huge amount of data

Velocity: Speed of data generation and processing

Variety: Structured, semi-structured, and unstructured data

Veracity: Data quality and uncertainty

Value: Useful insights derived from data

(b) Calculation of risk in marketing

Risk in marketing is calculated by analyzing customer behavior, purchase patterns, probability of loss, and uncertainty using statistical and predictive analytics techniques.

(c) Use of inferential statistics in Big Data

Inferential statistics is used to draw conclusions about a population from sampled data, helping in prediction, decision-making, and hypothesis testing.
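As a minimal sketch of drawing a conclusion about a population from a sample, the snippet below estimates a 95% confidence interval for a population mean. The sample values and the function name mean_confidence_interval are made up for illustration, and the normal approximation (z = 1.96) is an assumption that suits reasonably large samples.

```python
import math
import statistics


def mean_confidence_interval(sample, z=1.96):
    """Return (low, high) bounds for the population mean.

    Uses the normal approximation: mean +/- z * (s / sqrt(n)).
    """
    mean = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - z * std_error, mean + z * std_error


if __name__ == "__main__":
    # Hypothetical sample of order values drawn from a much larger dataset.
    sample = [12.0, 15.5, 11.0, 14.2, 13.3, 16.1, 12.8, 15.0]
    low, high = mean_confidence_interval(sample)
    print(f"Estimated population mean lies between {low:.2f} and {high:.2f}")
```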

(d) Sharding

Sharding is the process of splitting large datasets into smaller parts (shards) and distributing them across multiple servers to improve performance and scalability.
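One common way to decide which shard holds a record is to hash its key. The sketch below assumes simple hash-modulo routing; the helper names shard_for and distribute and the sample keys are hypothetical, and real systems often use range-based or consistent hashing instead.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribute(keys, num_shards):
    """Group keys by the shard that would store them."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


if __name__ == "__main__":
    placement = distribute(["user:1", "user:2", "user:3", "user:4"], num_shards=3)
    for shard_id, keys in placement.items():
        print(shard_id, keys)
```

Because the hash is stable, the same key always routes to the same shard, so a lookup only has to contact one server.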

(e) Usage of Hadoop Pipes

Hadoop Pipes allows developers to write MapReduce programs in languages like C++ instead of Java.

(f) Master-Slave vs Peer-to-Peer architecture in NoSQL

Master-Slave	Peer-to-Peer
Central master controls slaves	No central controller
Single point of failure	High fault tolerance
Used in HBase and MongoDB (and in HDFS itself)	Used in Cassandra

(g) Purpose of Bloom filter

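A Bloom filter is a space-efficient probabilistic data structure used to quickly test whether an element may be present in a set. It can return false positives but never false negatives, so a negative answer lets the system skip an unnecessary disk access.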
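The idea can be sketched with a bit array and a few hash functions. This is a minimal illustration, not how any particular database implements it; the class name BloomFilter, the default sizes, and the seeded SHA-256 hashing are all assumptions chosen for clarity.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter backed by a list of booleans."""

    def __init__(self, size: int = 1024, num_hashes: int = 3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item: str):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


if __name__ == "__main__":
    bf = BloomFilter()
    bf.add("row-42")
    print(bf.might_contain("row-42"))
```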

(h) Classic MapReduce vs YARN

Classic MapReduce	YARN
JobTracker + TaskTracker	ResourceManager + NodeManager
Limited scalability	Better scalability
Single processing model	Supports multiple models

(i) Usage of Grunt

Grunt is an interactive shell for Apache Pig, used for writing, testing, and debugging Pig scripts.

(j) Date and Time data types in Hive

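Hive provides the DATE type (YYYY-MM-DD) and the TIMESTAMP type (date and time with optional fractional seconds) for date and time data. Dates are also often stored as STRING values and converted with built-in date functions when querying.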

(k) Why Hive is preferred over PigLatin

Hive is preferred because it uses SQL-like syntax (HiveQL), making it easier for users with database background.

SECTION – B (Long Answer Questions)

(Attempt any FIVE – 5 × 10 = 50 Marks)

2(a) Relationship between crowdsourcing and Big Data

Crowdsourcing involves collecting data from a large number of users through platforms like social media, surveys, and mobile apps.
This data is:

High in volume

Generated continuously

Diverse in nature

Hence, crowdsourcing is a major source of Big Data.

Example:
User reviews on e-commerce platforms help companies analyze customer sentiment.

2(b) Aggregate Data Model

The aggregate data model groups related data into aggregates which are accessed together.

Features:

Reduces join operations

Improves performance

Used in NoSQL databases

Example:
An order aggregate contains order details, customer details, and item list.
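A sketch of such an order aggregate as a single nested document is shown below. All field names and values are hypothetical; the point is that the customer details and item list live inside the order, so the whole unit can be stored and fetched by one key without a join.

```python
import json

# One aggregate: the order plus everything that is read together with it.
order_aggregate = {
    "order_id": "ORD-1001",
    "customer": {"customer_id": "C-42", "name": "Asha"},
    "items": [
        {"sku": "BOOK-1", "quantity": 2, "unit_price": 250.0},
        {"sku": "PEN-7", "quantity": 5, "unit_price": 10.0},
    ],
}


def order_total(order: dict) -> float:
    """Compute the order total from the embedded item list."""
    return sum(item["quantity"] * item["unit_price"] for item in order["items"])


if __name__ == "__main__":
    # A plain dict stands in for a key-value or document store.
    store = {order_aggregate["order_id"]: json.dumps(order_aggregate)}
    loaded = json.loads(store["ORD-1001"])
    print(loaded["customer"]["name"], order_total(loaded))
```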

2(c) Scale-up vs Scale-out and Hadoop

Scale-up: Adding more power (CPU, RAM) to a single machine

Scale-out: Adding more machines to distribute workload

Hadoop uses scale-out architecture by distributing data across multiple nodes using HDFS, improving fault tolerance and performance.

2(d) Building blocks of Hadoop

Main components:

HDFS (Hadoop Distributed File System) – Storage

MapReduce – Data processing

YARN – Resource management

Hadoop Common – Libraries and utilities

Together, they enable distributed storage and parallel processing.

2(e) MapReduce workflows

MapReduce workflow consists of:

Input splitting

Map phase (key-value generation)

Shuffle and sort

Reduce phase (aggregation)

Output generation

This workflow enables large-scale parallel processing.
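The phases can be imitated in a single process with a word count. This is only a sketch of the data flow, not Hadoop code; the function names map_phase, shuffle_and_sort, reduce_phase, and run_job are hypothetical, and real Hadoop runs the map and reduce tasks in parallel across nodes.

```python
from collections import defaultdict


def map_phase(split: str):
    """Emit a (word, 1) pair for every word in one input split."""
    for word in split.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Group values by key and return the groups sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Aggregate all values for one key."""
    return key, sum(values)


def run_job(splits):
    """Run map, shuffle/sort, and reduce over a list of input splits."""
    mapped = [pair for split in splits for pair in map_phase(split)]
    return [reduce_phase(key, values) for key, values in shuffle_and_sort(mapped)]


if __name__ == "__main__":
    for word, count in run_job(["big data", "big cluster"]):
        print(word, count)
```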

2(f) HBase data model

HBase is a column-oriented NoSQL database.

Data model includes:

Table

Row key

Column family

Column qualifier

Timestamp

Cell value

It supports real-time read/write access.
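How these pieces nest can be sketched with plain dictionaries: a row key leads to (column family, column qualifier) pairs, and each pair holds timestamped cell values. The class TinyTable below is a hypothetical in-memory model for illustration and does not use the HBase client API.

```python
from collections import defaultdict


class TinyTable:
    """In-memory model: row key -> (family, qualifier) -> {timestamp: value}."""

    def __init__(self, column_families):
        # Column families are fixed when the table is created.
        self.column_families = set(column_families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value, timestamp):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][(family, qualifier)][timestamp] = value

    def get(self, row_key, family, qualifier):
        """Return the cell value with the latest timestamp, or None."""
        versions = self.rows.get(row_key, {}).get((family, qualifier), {})
        if not versions:
            return None
        return versions[max(versions)]


if __name__ == "__main__":
    table = TinyTable(["info"])
    table.put("student1", "info", "district", "Kanpur", timestamp=1)
    table.put("student1", "info", "district", "Lucknow", timestamp=2)
    print(table.get("student1", "info", "district"))
```

Column qualifiers can be added freely per row, while an unknown column family is rejected, which mirrors the fixed-family, flexible-qualifier design described above.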

2(g) Data modeling rules in Cassandra

Rules:

Design based on queries

Avoid joins

Use denormalization

Prefer wide rows

Relationships are handled using partition keys and clustering keys instead of joins.
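A rough sketch of query-first design: the table below is built to answer one hypothetical query, "all posts by a user in time order". The partition key (user id) selects the partition, the clustering key (posting time) orders rows inside it, and the user's name is copied into every row instead of being joined in. The class QueryTable and all sample values are assumptions; this is not the Cassandra driver API.

```python
from collections import defaultdict


class QueryTable:
    """Model of one query-specific table: partition key -> rows ordered by clustering key."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, row):
        rows = self.partitions[partition_key]
        rows.append((clustering_key, row))
        # Keep each partition ordered by its clustering key.
        rows.sort(key=lambda pair: pair[0])

    def select(self, partition_key):
        """Read one partition; no other partition is touched."""
        return [row for _, row in self.partitions.get(partition_key, [])]


if __name__ == "__main__":
    posts_by_user = QueryTable()
    # The user's name is denormalized into each row.
    posts_by_user.insert("u1", "2017-05-02", {"user_name": "Asha", "text": "second"})
    posts_by_user.insert("u1", "2017-05-01", {"user_name": "Asha", "text": "first"})
    for row in posts_by_user.select("u1"):
        print(row["user_name"], row["text"])
```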

2(h) Hive queries for joins

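Natural Join:

HiveQL has no NATURAL JOIN keyword, so the join on the common column is written explicitly as an inner join:

SELECT * FROM A JOIN B ON (A.id = B.id);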

Outer Join:

SELECT * FROM A LEFT OUTER JOIN B ON A.id = B.id;

Used for combining data from multiple tables.
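To make the difference between the join types concrete, the sketch below imitates an inner join on a shared id column and a left outer join over small in-memory tables. The function names and sample rows are hypothetical, and Python's None stands in for SQL NULL.

```python
def inner_join(a_rows, b_rows, key="id"):
    """Keep only the pairs of rows whose key values match."""
    return [{**a, **b} for a in a_rows for b in b_rows if a[key] == b[key]]


def left_outer_join(a_rows, b_rows, key="id"):
    """Keep every row of A; fill B's columns with None when there is no match."""
    b_columns = {col for b in b_rows for col in b if col != key}
    result = []
    for a in a_rows:
        matches = [b for b in b_rows if b[key] == a[key]]
        if matches:
            result.extend({**a, **b} for b in matches)
        else:
            result.append({**a, **{col: None for col in b_columns}})
    return result


if __name__ == "__main__":
    A = [{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ravi"}]
    B = [{"id": 1, "district": "Kanpur"}]
    print(inner_join(A, B))
    print(left_outer_join(A, B))
```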

SECTION – C (Very Long Answer Questions)

(Attempt any TWO – 2 × 15 = 30 Marks)

3(i) Hadoop job processing

Steps:

Client submits job

ResourceManager allocates resources

Map tasks process input splits

Shuffle and sort phase

Reduce tasks generate output

Results stored in HDFS

This ensures fault-tolerant and parallel execution.

3(ii) Hadoop cluster modes and local mode installation

Modes of Hadoop:

Standalone (Local) mode

Pseudo-distributed mode

Fully distributed mode

Standalone mode:

Single JVM

No HDFS

Used for testing and learning

Configuration involves setting environment variables such as JAVA_HOME and HADOOP_HOME and running Hadoop commands locally against the local file system; no daemons need to be started.

4(i) Pig scripts

Given data: Name, District, Age, Gender

Female students:

A = LOAD 'st.txt' USING PigStorage(',') AS (name:chararray, district:chararray, age:int, gender:chararray);
B = FILTER A BY gender == 'Female';
DUMP B;

Students from specific district:

C = FILTER A BY district == 'XXXX';
D = GROUP C ALL;
E = FOREACH D GENERATE COUNT(C);
DUMP E;

District-wise male count:

M = FILTER A BY gender == 'Male';
G = GROUP M BY district;
H = FOREACH G GENERATE group, COUNT(M);
DUMP H;
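For checking the expected results on a small input, the same three queries can be imitated in plain Python. The sample rows below are made up, and the helper names load, count_in_district, and male_count_by_district are hypothetical; this is an illustration of the logic, not a replacement for the Pig scripts.

```python
import csv
import io
from collections import Counter

# Made-up rows in the same layout as st.txt: name, district, age, gender.
SAMPLE = """Anita,Lucknow,20,Female
Ravi,Kanpur,21,Male
Meena,Kanpur,19,Female
Arjun,Lucknow,22,Male
Vikram,Kanpur,20,Male
"""


def load(text):
    """Parse comma-separated rows into dictionaries (like LOAD ... AS)."""
    reader = csv.reader(io.StringIO(text))
    return [
        {"name": name, "district": district, "age": int(age), "gender": gender}
        for name, district, age, gender in reader
    ]


def count_in_district(rows, district):
    """Count students from one district (FILTER, GROUP ALL, COUNT)."""
    return sum(1 for s in rows if s["district"] == district)


def male_count_by_district(rows):
    """Count male students per district (FILTER, GROUP BY, COUNT)."""
    return Counter(s["district"] for s in rows if s["gender"] == "Male")


students = load(SAMPLE)
females = [s for s in students if s["gender"] == "Female"]  # FILTER by gender

if __name__ == "__main__":
    print([s["name"] for s in females])
    print(count_in_district(students, "Kanpur"))
    print(dict(male_count_by_district(students)))
```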