THEORY EXAMINATION (SEM–VI) 2016-17 BIG DATA
BIG DATA (NIT067)
Time: 3 Hours Max Marks: 100
SECTION – A (Short Answer Questions)
(10 × 2 = 20 Marks)
(a) Characteristics of Big Data
The main characteristics of Big Data are known as the 5 V’s:
Volume: Huge amount of data
Velocity: Speed of data generation and processing
Variety: Structured, semi-structured, and unstructured data
Veracity: Data quality and uncertainty
Value: Useful insights derived from data
(b) Calculation of risk in marketing
Risk in marketing is calculated by analyzing customer behavior, purchase patterns, probability of loss, and uncertainty using statistical and predictive analytics techniques.
(c) Use of inferential statistics in Big Data
Inferential statistics is used to draw conclusions about a population from sampled data, helping in prediction, decision-making, and hypothesis testing.
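As a minimal sketch of the idea, the snippet below estimates a population mean from a small random sample and attaches an approximate 95% confidence interval. The synthetic population, the sample size of 400, and the normal critical value 1.96 are assumptions chosen for illustration only.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical "population": 100,000 synthetic order values.
population = [random.gauss(50.0, 10.0) for _ in range(100_000)]

# Draw a small random sample instead of scanning the whole population.
sample = random.sample(population, 400)

sample_mean = statistics.mean(sample)
std_error = statistics.stdev(sample) / math.sqrt(len(sample))

# Approximate 95% confidence interval using the normal critical value 1.96.
ci_low = sample_mean - 1.96 * std_error
ci_high = sample_mean + 1.96 * std_error

print(f"estimated mean = {sample_mean:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The interval is what makes the estimate inferential: it states how far the sample mean is likely to be from the true population mean.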
(d) Sharding
Sharding is the process of splitting large datasets into smaller parts (shards) and distributing them across multiple servers to improve performance and scalability.
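One common way to decide which shard holds a record is to hash its key. The sketch below assumes four shards modeled as in-memory dictionaries; the function names `shard_for`, `shard_put`, and `shard_get` are hypothetical and not taken from any particular database.

```python
import hashlib

NUM_SHARDS = 4  # assumed cluster size, for illustration only
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for one server


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a record key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def shard_put(key: str, value: str) -> None:
    """Store the value on the shard chosen by the key's hash."""
    shards[shard_for(key)][key] = value


def shard_get(key: str):
    """Look up the key on the single shard that can hold it."""
    return shards[shard_for(key)].get(key)


for user_id in ["user:1", "user:2", "user:3", "user:4", "user:5"]:
    shard_put(user_id, f"profile of {user_id}")
```

Because the hash is stable, a read only has to contact one shard. Real systems usually add consistent hashing or range-based partitioning so that changing the number of shards does not remap most keys.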
(e) Usage of Hadoop Pipes
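Hadoop Pipes is the C++ interface to Hadoop MapReduce. It lets developers write map and reduce functions in C++ instead of Java; the C++ process communicates with the Hadoop framework over sockets.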
(f) Master-Slave vs Peer-to-Peer architecture in NoSQL
| Master-Slave | Peer-to-Peer |
|---|---|
| Central master controls slaves | No central controller |
| Single point of failure | High fault tolerance |
| Used in HDFS | Used in Cassandra |
(g) Purpose of Bloom filter
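A Bloom filter is a probabilistic data structure that quickly tests whether an element may be present in a set. It can return false positives but never false negatives, so a negative answer lets a system skip an unnecessary disk read.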
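A minimal Bloom filter can be sketched as a bit array plus a few hash functions. The class below is illustrative only: the bit-array size, the number of hashes, and the salted SHA-256 scheme are assumptions, not the parameters of any specific database.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: membership answers are "maybe" or "definitely not"."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # False means the item was never added; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
for key in ["row-17", "row-42", "row-99"]:
    bf.add(key)
```

A storage engine would call `might_contain` before opening a data file and read from disk only when the answer is `True`.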
(h) Classic MapReduce vs YARN
| Classic MapReduce | YARN |
|---|---|
| JobTracker + TaskTracker | ResourceManager + NodeManager |
| Limited scalability | Better scalability |
| Single processing model | Supports multiple models |
(i) Usage of Grunt
Grunt is an interactive shell for Apache Pig, used for writing, testing, and debugging Pig scripts.
(j) Date and Time data types in Hive
Hive uses DATE, TIMESTAMP, and STRING data types to store and manipulate date and time-based data for querying and analytics.
(k) Why Hive is preferred over PigLatin
Hive is preferred because it uses a SQL-like syntax (HiveQL), which makes it easier for users with a database background.
SECTION – B (Long Answer Questions)
(Attempt any FIVE – 5 × 10 = 50 Marks)
2(a) Relationship between crowdsourcing and Big Data
Crowdsourcing involves collecting data from a large number of users through platforms like social media, surveys, and mobile apps.
This data is:
High in volume
Generated continuously
Diverse in nature
Hence, crowdsourcing is a major source of Big Data.
Example:
User reviews on e-commerce platforms help companies analyze customer sentiment.
2(b) Aggregate Data Model
The aggregate data model groups related data into aggregates which are accessed together.
Features:
Reduces join operations
Improves performance
Used in NoSQL databases
Example:
An order aggregate contains order details, customer details, and item list.
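The order aggregate can be pictured as one nested document stored under a single key. In the sketch below, every field name and value is hypothetical, and a plain dictionary stands in for a key-value or document store.

```python
import json

# Hypothetical order aggregate: order details, customer details, and item list.
order_aggregate = {
    "order_id": "ORD-1001",
    "customer": {"customer_id": "C-17", "name": "Asha", "city": "Lucknow"},
    "items": [
        {"product": "keyboard", "quantity": 1, "unit_price": 1200.0},
        {"product": "mouse", "quantity": 2, "unit_price": 450.0},
    ],
}

# The whole aggregate is written and read as one unit under one key.
store = {order_aggregate["order_id"]: json.dumps(order_aggregate)}


def order_total(order_id: str) -> float:
    """Fetch one aggregate and compute its total with no joins."""
    order = json.loads(store[order_id])
    return sum(item["quantity"] * item["unit_price"] for item in order["items"])


print(order_total("ORD-1001"))
```

Everything needed to answer a question about the order arrives in a single read, which is why aggregates suit distribution across a cluster.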
2(c) Scale-up vs Scale-out and Hadoop
Scale-up: Adding more power (CPU, RAM) to a single machine
Scale-out: Adding more machines to distribute workload
Hadoop uses scale-out architecture by distributing data across multiple nodes using HDFS, improving fault tolerance and performance.
2(d) Building blocks of Hadoop
Main components:
HDFS (Hadoop Distributed File System) – Storage
MapReduce – Data processing
YARN – Resource management
Hadoop Common – Libraries and utilities
Together, they enable distributed storage and parallel processing.
2(e) MapReduce workflows
MapReduce workflow consists of:
Input splitting
Map phase (key-value generation)
Shuffle and sort
Reduce phase (aggregation)
Output generation
This workflow enables large-scale parallel processing.
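The phases can be traced on a single machine with a word-count job. This is only an in-memory simulation of the data flow, not Hadoop code; the function names and the two input splits are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(split: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in split.lower().split():
        yield (word, 1)


def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key and order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce: aggregate all values that share a key."""
    return (key, sum(values))


def run_job(input_splits):
    mapped = [pair for split in input_splits for pair in map_phase(split)]
    return dict(reduce_phase(k, v) for k, v in shuffle_and_sort(mapped))


input_splits = ["big data needs big storage", "hadoop stores big data"]
word_counts = run_job(input_splits)
print(word_counts)
```

In a real cluster each split is mapped on a different node and the shuffle moves data over the network, but the key-value contract between the phases is the same.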
2(f) HBase data model
HBase is a column-oriented NoSQL database.
Data model includes:
Table
Row key
Column family
Column qualifier
Timestamp
Cell value
It supports real-time read/write access.
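One way to picture this model is as a nested map: row key, then column family, then column qualifier, then timestamp, ending at the cell value. The sketch below uses plain dictionaries and hypothetical helper names (`hbase_put`, `hbase_get_latest`); it does not use the HBase client API.

```python
def hbase_put(table, row_key, family, qualifier, value, timestamp):
    """Store one versioned cell: table[row][family][qualifier][timestamp] = value."""
    cell = table.setdefault(row_key, {}).setdefault(family, {}).setdefault(qualifier, {})
    cell[timestamp] = value


def hbase_get_latest(table, row_key, family, qualifier):
    """Return the value with the highest timestamp, or None if the cell is absent."""
    versions = table.get(row_key, {}).get(family, {}).get(qualifier, {})
    return versions[max(versions)] if versions else None


users = {}  # hypothetical table
hbase_put(users, "user1", "info", "email", "old@example.com", timestamp=1)
hbase_put(users, "user1", "info", "email", "new@example.com", timestamp=2)
hbase_put(users, "user1", "stats", "logins", 7, timestamp=1)

print(hbase_get_latest(users, "user1", "info", "email"))
```

Keeping older timestamps alongside the newest one is what lets a cell hold multiple versions of its value.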
2(g) Data modeling rules in Cassandra
Rules:
Design based on queries
Avoid joins
Use denormalization
Prefer wide rows
Relationships are handled using partition keys and clustering keys instead of joins.
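The query-first approach can be sketched for an assumed query, "latest readings for one sensor". Here the partition key is `sensor_id` and the clustering key is `reading_time`; the table name, columns, and helper functions are hypothetical, and an in-memory dictionary stands in for Cassandra.

```python
from bisect import insort
from collections import defaultdict

# partition key (sensor_id) -> rows kept sorted by clustering key (reading_time)
readings_by_sensor = defaultdict(list)


def insert_reading(sensor_id: str, reading_time: int, value: float) -> None:
    """Write a row into its partition, preserving clustering order."""
    insort(readings_by_sensor[sensor_id], (reading_time, value))


def latest_readings(sensor_id: str, limit: int):
    """Serve the query from a single partition, newest first, with no join."""
    if limit <= 0:
        return []
    rows = readings_by_sensor.get(sensor_id, [])
    return rows[-limit:][::-1]


insert_reading("s1", 3, 21.5)
insert_reading("s1", 1, 20.0)
insert_reading("s1", 2, 20.7)
insert_reading("s2", 1, 18.2)

print(latest_readings("s1", 2))
```

All rows for one sensor live together in one wide partition, so the query touches a single partition and reads rows already in the order it needs.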
2(h) Hive queries for joins
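Natural join (HiveQL has no NATURAL JOIN keyword, so it is written as an inner equi-join on the common column):
SELECT * FROM A JOIN B ON A.id = B.id;
Outer join:
SELECT * FROM A LEFT OUTER JOIN B ON A.id = B.id;
Joins combine rows from multiple tables on a matching column; the left outer join also keeps the rows of A that have no match in B.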
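To see how an inner equi-join and a left outer join differ in their results, the sketch below runs both on two tiny tables. It uses Python's built-in `sqlite3` as a stand-in engine, not Hive, and the table contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE A (id INTEGER, name TEXT);
    CREATE TABLE B (id INTEGER, city TEXT);
    INSERT INTO A VALUES (1, 'Asha'), (2, 'Ravi'), (3, 'Meena');
    INSERT INTO B VALUES (1, 'Delhi'), (3, 'Pune');
    """
)

# Inner join: only ids present in both tables survive.
inner_rows = conn.execute(
    "SELECT A.id, A.name, B.city FROM A JOIN B ON A.id = B.id ORDER BY A.id"
).fetchall()

# Left outer join: every row of A survives; missing matches become NULL (None).
left_rows = conn.execute(
    "SELECT A.id, A.name, B.city FROM A LEFT OUTER JOIN B ON A.id = B.id ORDER BY A.id"
).fetchall()

conn.close()
print(inner_rows)
print(left_rows)
```

The row for id 2 appears only in the left outer join result, with `None` in the `city` column.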
SECTION – C (Very Long Answer Questions)
(Attempt any TWO – 2 × 15 = 30 Marks)
3(i) Hadoop job processing
Steps:
Client submits job
ResourceManager allocates resources
Map tasks process input splits
Shuffle and sort phase
Reduce tasks generate output
Results stored in HDFS
This ensures fault-tolerant and parallel execution.
3(ii) Hadoop cluster modes and local mode installation
Modes of Hadoop:
Standalone (Local) mode
Pseudo-distributed mode
Fully distributed mode
Standalone mode:
Single JVM
No HDFS (uses the local file system)
Used for testing and learning
Installation involves setting environment variables such as JAVA_HOME, adding Hadoop's bin directory to PATH, and running Hadoop commands locally; no daemons need to be started.
4(i) Pig scripts
Given data: Name, District, Age, Gender
Female students:
A = LOAD 'st.txt' USING PigStorage(',') AS (name:chararray, district:chararray, age:int, gender:chararray);
B = FILTER A BY gender == 'Female';
DUMP B;
Students from specific district:
C = FILTER A BY district == 'XXXX';
D = GROUP C ALL;
E = FOREACH D GENERATE COUNT(C);
DUMP E;
District-wise male count:
M = FILTER A BY gender == 'Male';
G = GROUP M BY district;
H = FOREACH G GENERATE group, COUNT(M);
DUMP H;
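To check what the three Pig scripts above should return, the same filter, group, and count logic can be run on a few rows in plain Python. The sample rows below are invented stand-ins for the contents of st.txt, and 'Lucknow' is a placeholder district.

```python
from collections import Counter

# Hypothetical rows in the order (name, district, age, gender).
rows = [
    ("Asha", "Lucknow", 20, "Female"),
    ("Ravi", "Kanpur", 21, "Male"),
    ("Meena", "Lucknow", 19, "Female"),
    ("Amit", "Lucknow", 22, "Male"),
    ("Suresh", "Kanpur", 20, "Male"),
]

# FILTER ... BY gender == 'Female'
females = [r for r in rows if r[3] == "Female"]

# FILTER by district, GROUP ALL, COUNT
district_count = sum(1 for r in rows if r[1] == "Lucknow")

# FILTER by gender == 'Male', GROUP BY district, COUNT
male_by_district = Counter(r[1] for r in rows if r[3] == "Male")

print(females)
print(district_count)
print(dict(male_by_district))
```

Running a small reference like this next to the Pig output is a quick way to confirm that the FILTER and GROUP conditions match the intended columns.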
4(ii) Pig operators
Data access: LOAD, STORE
Transformations: FILTER, GROUP, JOIN, FOREACH
Debugging: DUMP, DESCRIBE, EXPLAIN
These operators simplify large-scale data processing.
5(i) Version stamps
Ways:
Auto-generated timestamp
User-defined timestamp
Pros: Data versioning, consistency
Cons: Storage overhead, complexity
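A version stamp is typically used for a conditional update: a write succeeds only if the stamp the client read is still current. The sketch below is an assumption-laden illustration; it uses an incrementing counter as the stamp instead of a timestamp so the behavior is deterministic, and `VersionedStore` and `StaleVersionError` are hypothetical names.

```python
class StaleVersionError(Exception):
    """Raised when a write is based on an out-of-date version stamp."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version_stamp, value)

    def read(self, key):
        """Return (version_stamp, value); unknown keys have stamp 0."""
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        """Apply the write only if no one else has updated the key meanwhile."""
        current_version, _ = self._data.get(key, (0, None))
        if expected_version != current_version:
            raise StaleVersionError(
                f"expected stamp {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._data[key] = (new_version, value)
        return new_version


store = VersionedStore()
stamp, _ = store.read("profile")           # stamp is 0 for a new key
store.write("profile", "v1", stamp)        # succeeds, stamp becomes 1
try:
    store.write("profile", "v2", stamp)    # stale stamp 0, so this is rejected
    conflict_detected = False
except StaleVersionError:
    conflict_detected = True
```

The extra stamp stored with every record is the storage overhead, and the retry logic a client needs after a rejected write is the added complexity.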
5(ii) Three dimensions of Big Data
Volume: Size of data
Velocity: Speed of data generation
Variety: Different data formats
These dimensions define Big Data complexity and challenges.