KCS061 – BIG DATA (B.Tech Sem VI) Theory Examination 2023–24
SECTION A
Attempt all questions in brief (2 × 10 = 20 marks)
(a) Types of digital data in Big Data with examples
Digital data in Big Data applications is broadly classified into structured, semi-structured, and unstructured data. Structured data is well organized in rows and columns, such as data stored in relational databases like student records or bank transactions. Semi-structured data does not follow a strict table format but contains tags or markers, such as XML and JSON files used in web applications. Unstructured data has no predefined structure and includes text documents, images, videos, audio files, emails, and social media posts.
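The three categories are easiest to tell apart side by side; a small Python sketch (the sample records below are invented for illustration):

```python
import csv
import io
import json

# Structured: fixed rows and columns, as in a relational table or CSV export
structured = list(csv.DictReader(io.StringIO(
    "roll_no,name,marks\n101,Asha,88\n102,Ravi,76\n")))

# Semi-structured: tagged fields (JSON), but no rigid table schema --
# note the second record carries an extra field the first one lacks
semi_structured = json.loads(
    '[{"id": 1, "name": "Asha"},'
    ' {"id": 2, "name": "Ravi", "email": "r@x.in"}]')

# Unstructured: free text with no predefined fields at all
unstructured = "Great lecture on HDFS today! #bigdata"

print(structured[0]["name"])        # fields addressed by column name
print(semi_structured[1]["email"])  # fields addressed by tag/key
```

The point of the sketch is the middle case: semi-structured data is self-describing through its tags, but nothing forces every record to share the same fields.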
(b) What constitutes a Big Data platform?
A Big Data platform consists of components for data ingestion, storage, processing, analysis, and visualization. It includes distributed storage systems like HDFS, processing frameworks such as MapReduce or Spark, resource management tools like YARN, analytics tools, and visualization or reporting tools. Together, these components enable handling of large, complex datasets efficiently.
(c) Hadoop Streaming
Hadoop Streaming is a utility that allows users to write MapReduce programs using any programming language that can read from standard input and write to standard output, such as Python or Perl, instead of using Java.
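A word-count job for Hadoop Streaming is just small scripts that talk over standard input and output. The sketch below simulates both phases in one Python file for clarity; on a real cluster the map and reduce logic would live in separate scripts passed to the streaming jar via its -mapper and -reducer options (the function names here are illustrative, not part of any Hadoop API):

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' pair per word (what the -mapper script prints)."""
    for line in lines:
        for word in line.split():
            yield word + "\t1"

def reducer(pairs):
    """Sum the counts for each word. Hadoop sorts map output by key
    before the reducer sees it; sorted() simulates that shuffle here."""
    keyed = sorted(p.split("\t") for p in pairs)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield word + "\t" + str(sum(int(count) for _, count in group))

if __name__ == "__main__":
    # Local simulation of map -> shuffle/sort -> reduce
    for out in reducer(mapper(["big data big cluster"])):
        print(out)  # big 2, cluster 1, data 1
```

Because the contract is only "read stdin, write stdout", the same logic could be written in Perl, Ruby, or even a shell script, which is exactly the flexibility Hadoop Streaming provides.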
(d) Data formats used in Hadoop environments
Common data formats in Hadoop include Text files, Sequence files, Avro, Parquet, and ORC. Text files are simple and readable, while Avro, Parquet, and ORC are optimized formats that support schema evolution, compression, and faster query performance.
(e) File sizes, block sizes, and block abstraction in HDFS
In HDFS, files are split into large fixed-size blocks, 128 MB by default in modern Hadoop (earlier versions used 64 MB). A file is stored as multiple blocks across different DataNodes, and a file smaller than one block does not occupy a full block of disk space. Block abstraction allows HDFS to manage storage and replication independently of the file structure, improving scalability and fault tolerance.
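The mapping from file size to blocks is simple integer arithmetic; a quick check with the default 128 MB block size (the 300 MB file size is chosen only for illustration):

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size in modern Hadoop

def block_count(file_size_mb: float) -> int:
    """Number of HDFS blocks a file of this size occupies."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

# A 300 MB file takes three blocks: 128 + 128 + 44 MB --
# the final, partial block only uses the space it actually needs
print(block_count(300))      # -> 3
print(block_count(300) * 3)  # block replicas at replication factor 3 -> 9
```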
(f) Benefits and challenges of using HDFS
HDFS provides high fault tolerance, scalability, and cost-effective storage on commodity hardware. However, it is not suitable for low-latency access or handling a large number of small files, and it works best with batch processing rather than real-time operations.
(g) Fair Scheduler and Capacity Scheduler
The Fair Scheduler ensures that all applications get a fair share of cluster resources over time. It is useful in multi-user environments. The Capacity Scheduler divides resources into queues with guaranteed capacity, making it suitable for large organizations with multiple teams sharing a Hadoop cluster.
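Capacity Scheduler queues are declared in YARN's capacity-scheduler.xml. A minimal sketch splitting the cluster between two teams (the queue names "analytics" and "etl" and the percentages are invented for illustration; the property-name pattern follows the Capacity Scheduler's yarn.scheduler.capacity.* convention):

```xml
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>analytics,etl</value>
  </property>
  <property>
    <!-- the analytics queue is guaranteed 60% of cluster resources -->
    <name>yarn.scheduler.capacity.root.analytics.capacity</name>
    <value>60</value>
  </property>
  <property>
    <!-- the etl queue is guaranteed the remaining 40% -->
    <name>yarn.scheduler.capacity.root.etl.capacity</name>
    <value>40</value>
  </property>
</configuration>
```

Each queue's guaranteed share is what distinguishes this model from the Fair Scheduler, which balances shares dynamically across running applications instead of reserving fixed capacities per queue.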
(h) YARN
YARN (Yet Another Resource Negotiator) is Hadoop’s resource management layer. It manages cluster resources and schedules applications, allowing multiple data processing engines to run on the same cluster.
(i) Apache Pig
Apache Pig is a high-level data processing platform used with Hadoop. It uses a scripting language called Pig Latin, which simplifies writing complex data transformations compared to raw MapReduce code.
(j) Grunt shell in Apache Pig
The Grunt shell is an interactive command-line interface of Apache Pig. It allows users to execute Pig Latin commands interactively, test scripts, and debug data processing logic.
SECTION B
Attempt any three (3 × 10 = 30 marks)
(a) Data analysis vs reporting in Big Data
Reporting focuses on summarizing historical data using predefined queries, dashboards, and charts. It answers questions like “what happened” and “when it happened.” Data analysis, especially advanced analytics, goes beyond reporting by exploring data to identify hidden patterns, correlations, and trends. Techniques such as machine learning, predictive analytics, and data mining help organizations make future-oriented decisions rather than just reviewing past performance.
(b) Apache Hadoop and its role in Big Data processing
Apache Hadoop is an open-source framework designed to store and process large datasets across clusters of computers. Its core components include HDFS for distributed storage, MapReduce for distributed processing, YARN for resource management, and Hadoop Common for utilities. These components work together to provide scalable, fault-tolerant Big Data processing.
(c) Core concepts of HDFS: NameNode and DataNode
HDFS follows a master-slave architecture. The NameNode maintains metadata such as file names, block locations, and access permissions. DataNodes store actual data blocks and handle read/write requests. The NameNode coordinates data placement and replication, ensuring reliability and efficient data access across the cluster.
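The division of labour can be pictured as a small lookup held by the NameNode; a toy sketch in Python (the file path, block IDs, and node names are invented for illustration):

```python
# NameNode metadata: which blocks make up each file, and where each
# block's replicas live. The NameNode never stores block contents.
namenode_metadata = {
    "/logs/access.log": ["blk_001", "blk_002"],
}
block_locations = {
    "blk_001": ["datanode1", "datanode3", "datanode5"],  # replication factor 3
    "blk_002": ["datanode2", "datanode4", "datanode5"],
}

def read_file(path: str):
    """Client flow: ask the NameNode for block locations, then read each
    block directly from a DataNode. Returns the (block, node) read plan."""
    return [(blk, block_locations[blk][0])  # pick the first listed replica
            for blk in namenode_metadata[path]]

print(read_file("/logs/access.log"))
```

The key design point the sketch captures is that actual data transfer happens directly between the client and the DataNodes; the NameNode only answers the metadata question of where the blocks are.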
(d) NoSQL databases and their benefits
NoSQL databases are non-relational databases designed to handle large volumes of unstructured and semi-structured data. They offer schema flexibility, horizontal scalability, high performance, and fault tolerance. Compared to traditional relational databases, NoSQL systems like MongoDB and Cassandra are better suited for Big Data applications.
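Schema flexibility is the easiest of these benefits to see concretely. A toy in-memory "document store" in Python (the insert/find style loosely mimics document databases such as MongoDB, but this class is purely illustrative, not a real client API):

```python
class Collection:
    """Tiny stand-in for a NoSQL document collection."""

    def __init__(self):
        self.docs = []

    def insert(self, doc: dict):
        # No fixed schema: each document may carry different fields,
        # and no migration is needed to add a new one
        self.docs.append(doc)

    def find(self, **criteria):
        """Return documents whose fields match all given criteria."""
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in criteria.items())]

users = Collection()
users.insert({"name": "Asha", "city": "Delhi"})
users.insert({"name": "Ravi", "city": "Pune", "followers": 1200})  # extra field

print(users.find(city="Pune"))  # -> [{'name': 'Ravi', 'city': 'Pune', 'followers': 1200}]
```

In a relational table, adding the followers column would require an ALTER TABLE affecting every row; here the two documents simply coexist with different shapes.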
(e) Apache Hive architecture
Apache Hive provides a data warehousing layer on Hadoop. Its architecture includes a driver, a HiveQL parser, compiler, optimizer, and execution engine, along with a metastore that holds table schemas and partition metadata. Hive translates SQL-like HiveQL queries into MapReduce, Tez, or Spark jobs, allowing users to analyze large datasets without writing complex code.
SECTION C
Attempt any one (1 × 10 = 10 marks)
(a) The 5 Vs of Big Data
The 5 Vs of Big Data are Volume, Velocity, Variety, Veracity, and Value. Volume refers to the massive amount of data generated. Velocity describes the speed at which data is produced and processed. Variety represents different data types and formats. Veracity deals with data quality and reliability. Value focuses on extracting meaningful insights that support decision-making.
(b) Real-world applications of Big Data analytics
In healthcare, Big Data helps in disease prediction and personalized treatment. In finance, it is used for fraud detection and risk analysis. E-commerce platforms analyze customer behavior to recommend products. Transportation systems use Big Data for traffic management, route optimization, and predictive maintenance.