KCS061 – BIG DATA (B.Tech Sem VI) Theory Examination 2023–24
SECTION A
Attempt all questions in brief (2 × 10 = 20 marks)
(a) Types of digital data in Big Data with examples
Digital data in Big Data applications is broadly classified into structured, semi-structured, and unstructured data. Structured data is well organized in rows and columns, such as data stored in relational databases like student records or bank transactions. Semi-structured data does not follow a strict table format but contains tags or markers, such as XML and JSON files used in web applications. Unstructured data has no predefined structure and includes text documents, images, videos, audio files, emails, and social media posts.
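The three categories are easiest to tell apart side by side; a small Python sketch (the sample records below are invented for illustration):

```python
import csv
import io
import json

# Structured: fixed rows and columns, as in a relational table or CSV export
structured = list(csv.DictReader(io.StringIO(
    "roll_no,name,marks\n101,Asha,88\n102,Ravi,76\n")))

# Semi-structured: tagged fields (JSON), but no rigid table schema --
# note the second record carries an extra field the first one lacks
semi_structured = json.loads(
    '[{"id": 1, "name": "Asha"},'
    ' {"id": 2, "name": "Ravi", "email": "r@x.in"}]')

# Unstructured: free text with no predefined fields at all
unstructured = "Great lecture on HDFS today! #bigdata"

print(structured[0]["name"])        # fields addressed by column name
print(semi_structured[1]["email"])  # fields addressed by tag/key
```

The point of the sketch is the middle case: semi-structured data is self-describing through its tags, but nothing forces every record to share the same fields.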
(b) What constitutes a Big Data platform?
A Big Data platform consists of components for data ingestion, storage, processing, analysis, and visualization. It includes distributed storage systems like HDFS, processing frameworks such as MapReduce or Spark, resource management tools like YARN, analytics tools, and visualization or reporting tools. Together, these components enable handling of large, complex datasets efficiently.
(c) Hadoop Streaming
Hadoop Streaming is a utility that allows users to write MapReduce programs using any programming language that can read from standard input and write to standard output, such as Python or Perl, instead of using Java.
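A word-count job for Hadoop Streaming is just small scripts that talk over standard input and output. The sketch below simulates both phases in one Python file for clarity; on a real cluster the map and reduce logic would live in separate scripts passed to the streaming jar via its -mapper and -reducer options (the function names here are illustrative, not part of any Hadoop API):

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' pair per word (what the -mapper script prints)."""
    for line in lines:
        for word in line.split():
            yield word + "\t1"

def reducer(pairs):
    """Sum the counts for each word. Hadoop sorts map output by key
    before the reducer sees it; sorted() simulates that shuffle here."""
    keyed = sorted(p.split("\t") for p in pairs)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield word + "\t" + str(sum(int(count) for _, count in group))

if __name__ == "__main__":
    # Local simulation of map -> shuffle/sort -> reduce
    for out in reducer(mapper(["big data big cluster"])):
        print(out)  # big 2, cluster 1, data 1
```

Because the contract is only "read stdin, write stdout", the same logic could be written in Perl, Ruby, or even a shell script, which is exactly the flexibility Hadoop Streaming provides.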
(d) Data formats used in Hadoop environments
Common data formats in Hadoop include Text files, Sequence files, Avro, Parquet, and ORC. Text files are simple and readable, while Avro, Parquet, and ORC are optimized formats that support schema evolution, compression, and faster query performance.
(e) File sizes, block sizes, and block abstraction in HDFS
In HDFS, files are split into large fixed-size blocks, 128 MB by default in modern Hadoop (earlier versions used 64 MB). A file is stored as multiple blocks across different DataNodes, and a file smaller than one block does not occupy a full block of disk space. Block abstraction allows HDFS to manage storage and replication independently of the file structure, improving scalability and fault tolerance.
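The mapping from file size to blocks is simple integer arithmetic; a quick check with the default 128 MB block size (the 300 MB file size is chosen only for illustration):

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size in modern Hadoop

def block_count(file_size_mb: float) -> int:
    """Number of HDFS blocks a file of this size occupies."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

# A 300 MB file takes three blocks: 128 + 128 + 44 MB --
# the final, partial block only uses the space it actually needs
print(block_count(300))      # -> 3
print(block_count(300) * 3)  # block replicas at replication factor 3 -> 9
```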
(f) Benefits and challenges of using HDFS
HDFS provides high fault tolerance, scalability, and cost-effective storage on commodity hardware. However, it is not suitable for low-latency access or handling a large number of small files, and it works best with batch processing rather than real-time operations.
(g) Fair Scheduler and Capacity Scheduler
The Fair Scheduler ensures that all applications get a fair share of cluster resources over time. It is useful in multi-user environments. The Capacity Scheduler divides resources into queues with guaranteed capacity, making it suitable for large organizations with multiple teams sharing a Hadoop cluster.
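Capacity Scheduler queues are declared in YARN's capacity-scheduler.xml. A minimal sketch splitting the cluster between two teams (the queue names "analytics" and "etl" and the percentages are invented for illustration; the property-name pattern follows the Capacity Scheduler's yarn.scheduler.capacity.* convention):

```xml
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>analytics,etl</value>
  </property>
  <property>
    <!-- the analytics queue is guaranteed 60% of cluster resources -->
    <name>yarn.scheduler.capacity.root.analytics.capacity</name>
    <value>60</value>
  </property>
  <property>
    <!-- the etl queue is guaranteed the remaining 40% -->
    <name>yarn.scheduler.capacity.root.etl.capacity</name>
    <value>40</value>
  </property>
</configuration>
```

Each queue's guaranteed share is what distinguishes this model from the Fair Scheduler, which balances shares dynamically across running applications instead of reserving fixed capacities per queue.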
(h) YARN
YARN (Yet Another Resource Negotiator) is Hadoop’s resource management layer. It manages cluster resources and schedules applications, allowing multiple data processing engines to run on the same cluster.
(i) Apache Pig
Apache Pig is a high-level data processing platform used with Hadoop. It uses a scripting language called Pig Latin, which simplifies writing complex data transformations compared to raw MapReduce code.
(j) Grunt shell in Apache Pig
The Grunt shell is an interactive command-line interface of Apache Pig. It allows users to execute Pig Latin commands interactively, test scripts, and debug data processing logic.
SECTION B
Attempt any three (3 × 10 = 30 marks)
(a) Data analysis vs reporting in Big Data
Reporting focuses on summarizing historical data using predefined queries, dashboards, and charts. It answers questions like “what happened” and “when it happened.” Data analysis, especially advanced analytics, goes beyond reporting by exploring data to identify hidden patterns, correlations, and trends. Techniques such as machine learning, predictive analytics, and data mining help organizations make future-oriented decisions rather than just reviewing past performance.
(b) Apache Hadoop and its role in Big Data processing
Apache Hadoop is an open-source framework designed to store and process large datasets across clusters of computers. Its core components include HDFS for distributed storage, MapReduce for distributed processing, YARN for resource management, and Hadoop Common for utilities. These components work together to provide scalable, fault-tolerant Big Data processing.
(c) Core concepts of HDFS: NameNode and DataNode
HDFS follows a master-slave architecture. The NameNode maintains metadata such as file names, block locations, and access permissions. DataNodes store actual data blocks and handle read/write requests. The NameNode coordinates data placement and replication, ensuring reliability and efficient data access across the cluster.
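The division of labour can be pictured as a small lookup held by the NameNode; a toy sketch in Python (the file path, block IDs, and node names are invented for illustration):

```python
# NameNode metadata: which blocks make up each file, and where each
# block's replicas live. The NameNode never stores block contents.
namenode_metadata = {
    "/logs/access.log": ["blk_001", "blk_002"],
}
block_locations = {
    "blk_001": ["datanode1", "datanode3", "datanode5"],  # replication factor 3
    "blk_002": ["datanode2", "datanode4", "datanode5"],
}

def read_file(path: str):
    """Client flow: ask the NameNode for block locations, then read each
    block directly from a DataNode. Returns the (block, node) read plan."""
    return [(blk, block_locations[blk][0])  # pick the first listed replica
            for blk in namenode_metadata[path]]

print(read_file("/logs/access.log"))
```

The key design point the sketch captures is that actual data transfer happens directly between the client and the DataNodes; the NameNode only answers the metadata question of where the blocks are.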
(d) NoSQL databases and their benefits
NoSQL databases are non-relational databases designed to handle large volumes of unstructured and semi-structured data. They offer schema flexibility, horizontal scalability, high performance, and fault tolerance. Compared to traditional relational databases, NoSQL systems like MongoDB and Cassandra are better suited for Big Data applications.
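Schema flexibility is the easiest of these benefits to see concretely. A toy in-memory "document store" in Python (the insert/find style loosely mimics document databases such as MongoDB, but this class is purely illustrative, not a real client API):

```python
class Collection:
    """Tiny stand-in for a NoSQL document collection."""

    def __init__(self):
        self.docs = []

    def insert(self, doc: dict):
        # No fixed schema: each document may carry different fields,
        # and no migration is needed to add a new one
        self.docs.append(doc)

    def find(self, **criteria):
        """Return documents whose fields match all given criteria."""
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in criteria.items())]

users = Collection()
users.insert({"name": "Asha", "city": "Delhi"})
users.insert({"name": "Ravi", "city": "Pune", "followers": 1200})  # extra field

print(users.find(city="Pune"))  # -> [{'name': 'Ravi', 'city': 'Pune', 'followers': 1200}]
```

In a relational table, adding the followers column would require an ALTER TABLE affecting every row; here the two documents simply coexist with different shapes.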
(e) Apache Hive architecture
Apache Hive provides a data warehousing layer on Hadoop. Its architecture includes a driver, a HiveQL parser, compiler, optimizer, and execution engine, along with a metastore that holds table schemas and partition metadata. Hive translates SQL-like HiveQL queries into MapReduce, Tez, or Spark jobs, allowing users to analyze large datasets without writing complex code.
SECTION C
Attempt any one (1 × 10 = 10 marks)
(a) The 5 Vs of Big Data
The 5 Vs of Big Data are Volume, Velocity, Variety, Veracity, and Value. Volume refers to the massive amount of data generated. Velocity describes the speed at which data is produced and processed. Variety represents different data types and formats. Veracity deals with data quality and reliability. Value focuses on extracting meaningful insights that support decision-making.
(b) Real-world applications of Big Data analytics
In healthcare, Big Data helps in disease prediction and personalized treatment. In finance, it is used for fraud detection and risk analysis. E-commerce platforms analyze customer behavior to recommend products. Transportation systems use Big Data for traffic management, route optimization, and predictive maintenance.