(SEM VI) THEORY EXAMINATION 2022-23 BIG DATA AND ANALYTICS

BIG DATA AND ANALYTICS – KDS-601

Section-wise Important Questions & Ready Answers

SECTION A

(Attempt all questions – 2 marks each)

(a) Different Kinds of Digital Data

Digital data can be classified as structured data, semi-structured data, and unstructured data. Structured data is organized in tabular form like databases, semi-structured data includes XML and JSON files, while unstructured data includes images, videos, audio files, emails, and social media content.
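
The three kinds of data above can be illustrated with a minimal Python sketch. The sample records, names, and review text are made up for illustration only; the point is how much structure each kind gives a program before any parsing logic is written.

```python
import csv
import io
import json

# Structured: every row has the same fixed columns, as in a database table.
structured = io.StringIO("id,name,amount\n1,Asha,250\n2,Ravi,400\n")
rows = list(csv.DictReader(structured))

# Semi-structured: JSON is self-describing, but fields may differ per record.
semi_structured = json.loads(
    '[{"id": 1, "tags": ["new"]}, {"id": 2, "email": "r@example.com"}]'
)

# Unstructured: free text has no predefined fields; any structure
# (here, a simple word count) has to be derived by the program.
unstructured = "Great phone, but the battery drains quickly."
word_count = len(unstructured.split())

print(rows[0]["name"], sorted(semi_structured[1]), word_count)
```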

(b) Drivers of Big Data

The major drivers of Big Data include the rapid growth of social media, mobile devices, IoT sensors, cloud computing, digital transactions, and the need for real-time analytics. These factors generate massive volumes of diverse and fast-moving data.

(c) Importance of Hadoop Data Format

The choice of data format is important in Hadoop because it determines how efficiently large datasets are stored and processed. Hadoop supports plain Text as well as binary formats such as SequenceFile, Avro, and Parquet. The binary formats support compression and remain splittable for parallel processing, which improves performance, and they are compatible with MapReduce and other ecosystem tools.

(d) Distributed File System

A distributed file system stores data across multiple machines while appearing as a single logical system to users. It provides scalability, fault tolerance, and high availability by distributing data blocks across nodes.
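
The idea of distributing data blocks across nodes can be sketched in a few lines of Python. This is a simplified illustration, not how HDFS itself places blocks: the function names, the round-robin placement, and the tiny block size are assumptions chosen to keep the example readable.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes):
    """Assign block ids to nodes in round-robin order (illustrative policy)."""
    placement = {node: [] for node in nodes}
    for block_id in range(num_blocks):
        placement[nodes[block_id % len(nodes)]].append(block_id)
    return placement


data = b"x" * 10
blocks = split_into_blocks(data, block_size=4)
placement = place_blocks(len(blocks), ["node1", "node2"])

# The user still sees one logical file: joining the blocks restores the data.
assert b"".join(blocks) == data
print(placement)
```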

(e) Working of File System

A file system manages how data is stored, retrieved, and organized on storage devices. It handles file naming, access control, data allocation, and metadata management to ensure efficient and secure data access.
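
As a small illustration of these responsibilities on a local file system, the sketch below uses only the Python standard library to create a file and then read back its name, size, and access permission. The file name and contents are arbitrary examples.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "notes.txt")

    # Storage: the file system allocates space and records the file name.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("big data")

    # Metadata: size and timestamps are kept separately from the contents.
    info = os.stat(path)
    size_bytes = info.st_size

    # Access control: ask whether the current user may read the file.
    is_readable = os.access(path, os.R_OK)

    # Organization: the directory lists the names it contains.
    listing = os.listdir(directory)

print(listing, size_bytes, is_readable)
```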

(f) Use of Data Replication

Data replication creates multiple copies of data across different nodes. It improves fault tolerance, data availability, and reliability by ensuring data remains accessible even if a node fails.
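
The effect of replication on availability can be shown with a short Python sketch. The placement rule here (consecutive nodes) is an assumption made for simplicity; real HDFS uses a rack-aware policy. The function and node names are hypothetical.

```python
def replicate(block_ids, nodes, replication_factor):
    """Map each block to a set of distinct nodes holding a copy."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    replicas = {}
    for index, block_id in enumerate(block_ids):
        replicas[block_id] = {
            nodes[(index + offset) % len(nodes)]
            for offset in range(replication_factor)
        }
    return replicas


def available_blocks(replicas, failed_nodes):
    """Return the blocks that still have at least one copy on a live node."""
    return {block for block, holders in replicas.items() if holders - failed_nodes}


replicas = replicate(["b0", "b1", "b2"], ["n1", "n2", "n3"], replication_factor=2)
print(available_blocks(replicas, failed_nodes={"n1"}))
```

With two copies of each block, losing one node leaves every block readable; losing two nodes loses any block whose copies were both on those nodes.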

(g) Need of Scheduler in Hadoop

A scheduler is required in Hadoop to allocate cluster resources efficiently among multiple jobs. It ensures fairness, optimal resource utilization, and balanced workload execution across nodes.
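
Fair allocation can be sketched as follows. This is not the algorithm used by Hadoop's schedulers; it is a minimal round-robin illustration of fairness, with a hypothetical `fair_share` function that hands out one slot at a time until the cluster capacity or every job's demand is exhausted.

```python
def fair_share(capacity, demands):
    """Give each job one slot per round until capacity or demand runs out."""
    allocation = {job: 0 for job in demands}
    remaining = capacity
    while remaining > 0:
        progressed = False
        for job, demand in demands.items():
            if remaining > 0 and allocation[job] < demand:
                allocation[job] += 1
                remaining -= 1
                progressed = True
        if not progressed:  # every job is fully satisfied
            break
    return allocation


print(fair_share(10, {"small_job": 2, "etl_job": 8, "report_job": 8}))
```

The small job gets everything it asked for, and the two large jobs split the remaining slots equally, so no single job starves the others.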

(h) Data Types Used in MongoDB

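MongoDB stores documents in BSON and supports data types such as String, Integer (32-bit and 64-bit), Boolean, Double, Array, Object (embedded document), Date, ObjectId, Null, and Binary data. Because each document carries its own fields and types, collections can be stored without a fixed schema.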
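
The sketch below shows how a document with several of these types might look when written as a Python dictionary. The `bson_type_name` helper is hypothetical and only mirrors, as an assumption, how a Python driver would typically map native values to BSON type names; it does not call MongoDB. The `_id` field, normally an ObjectId generated by the driver, is left out because it needs the driver's `bson` package.

```python
from datetime import datetime, timezone

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def bson_type_name(value):
    """Return the BSON type name a Python value would usually be stored as."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check before int: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int" if INT32_MIN <= value <= INT32_MAX else "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, bytes):
        return "binData"
    if isinstance(value, datetime):
        return "date"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"no mapping for {type(value).__name__}")


# Hypothetical student document; all values are made up.
student = {
    "name": "Asha",
    "semester": 6,
    "enrolled": True,
    "cgpa": 8.4,
    "subjects": ["KDS-601"],
    "address": {"city": "Lucknow"},
    "admitted_on": datetime(2020, 8, 1, tzinfo=timezone.utc),
    "photo": b"\x89PNG",
}
field_types = {field: bson_type_name(value) for field, value in student.items()}
print(field_types)
```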

(i) Applications of Big Data Using Pig

Pig is used for data cleansing, transformation, aggregation, and analysis of large datasets. It is widely applied in log analysis, ETL processes, customer behavior analysis, and recommendation systems.

(j) Data Processing Operators Used in Pig

Pig Latin operators include LOAD, FILTER, GROUP, FOREACH ... GENERATE, JOIN, ORDER BY, DISTINCT, UNION, LIMIT, and STORE. Chained together, they express complex data transformations as a short sequence of steps.
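
To show what a chain of these operators does, the sketch below emulates a small pipeline in plain Python. The Pig Latin statements in the comments are a hypothetical script written for this example, and the sample log records are made up; the Python code only mimics the effect of each step on an in-memory list.

```python
from collections import defaultdict

# logs = LOAD 'logs.csv' USING PigStorage(',') AS (user:chararray, url:chararray, ms:int);
logs = [
    ("asha", "/home", 120),
    ("ravi", "/cart", 80),
    ("asha", "/pay", 300),
    ("ravi", "/home", 150),
    ("asha", "/home", 90),
]

# slow = FILTER logs BY ms > 100;
slow = [record for record in logs if record[2] > 100]

# by_user = GROUP slow BY user;
by_user = defaultdict(list)
for record in slow:
    by_user[record[0]].append(record)

# counts = FOREACH by_user GENERATE group, COUNT(slow);
counts = [(user, len(records)) for user, records in by_user.items()]

# ordered = ORDER counts BY $1 DESC;
ordered = sorted(counts, key=lambda pair: (-pair[1], pair[0]))

# STORE ordered INTO 'slow_requests';  (here we simply print)
print(ordered)
```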

SECTION B

(Attempt any three – 10 marks each)

2(a) Overcoming Challenges of Conventional Data Analysis Systems

Conventional systems fail due to limited scalability, high cost, and inability to process unstructured data. These challenges are overcome using distributed computing, parallel processing, cloud infrastructure, and Big Data frameworks like Hadoop and Spark, which enable scalable and cost-effective analytics.

2(b) Hadoop Ecosystem – Concept and Architecture

The Hadoop ecosystem consists of HDFS for storage, MapReduce for processing, YARN for resource management, and tools like Hive, Pig, HBase, Sqoop, Flume, and Oozie. Together, they support data ingestion, storage, processing, and analysis of large datasets.
(In the exam, a neat, labeled architecture diagram is expected.)

2(c) HDFS Monitoring and Maintenance Process

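HDFS monitoring involves checking disk usage, DataNode health, and block replication using the NameNode web UI, log files, and commands such as hdfs dfsadmin -report and hdfs fsck. Maintenance includes rebalancing data across DataNodes with the HDFS balancer, replacing or decommissioning failed nodes, repairing missing or corrupted blocks, and keeping every block at its target replication factor for reliability.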
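
Two of these checks can be sketched in Python. The functions below are hypothetical helpers operating on made-up numbers, not calls into Hadoop: one lists blocks with fewer live copies than the target, and the other flags nodes whose disk usage differs from the cluster average by more than a threshold, which is the general idea behind rebalancing.

```python
def under_replicated(live_replicas, target):
    """Return block ids that have fewer live copies than the target."""
    return sorted(block for block, count in live_replicas.items() if count < target)


def unbalanced_nodes(usage_percent, threshold=10.0):
    """Return nodes whose usage is more than `threshold` points from the mean."""
    mean = sum(usage_percent.values()) / len(usage_percent)
    return sorted(
        node for node, used in usage_percent.items() if abs(used - mean) > threshold
    )


# Hypothetical cluster snapshot.
print(under_replicated({"blk_1": 3, "blk_2": 1, "blk_3": 2}, target=3))
print(unbalanced_nodes({"n1": 85, "n2": 40, "n3": 55}))
```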

2(d) New Features in Hadoop 2.0

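Hadoop 2.0 introduced YARN, which separates resource management from data processing and allows processing models other than MapReduce to run on the same cluster. It also added NameNode High Availability to remove the single point of failure, HDFS Federation for scaling the namespace across multiple NameNodes, and better overall scalability and performance compared to Hadoop 1.x.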

2(e) Apache Hive Installation and Architecture

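Hive is installed on top of Hadoop to enable SQL-like querying using HiveQL. Its architecture includes the user interface, driver, compiler, optimizer, execution engine, and metastore, which holds table schemas and locations. Hive compiles each query into MapReduce jobs (or Tez or Spark jobs in later versions) that run on the cluster over data stored in HDFS.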
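
To make the translation step concrete, the sketch below shows in plain Python how a query such as SELECT dept, COUNT(*) FROM emp GROUP BY dept could be expressed as map, shuffle, and reduce phases. The table, column names, and functions are hypothetical, and this is an illustration of the idea rather than the plan Hive actually generates.

```python
from collections import defaultdict

# Hypothetical rows of an "emp" table.
emp = [
    {"name": "Asha", "dept": "sales"},
    {"name": "Ravi", "dept": "hr"},
    {"name": "Meena", "dept": "sales"},
]


def map_phase(rows):
    """Emit a (key, 1) pair per row; the GROUP BY column becomes the key."""
    return [(row["dept"], 1) for row in rows]


def shuffle(pairs):
    """Collect all values that share a key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values per key, which implements COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


result = reduce_phase(shuffle(map_phase(emp)))
print(result)
```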

SECTION C

3(a) Big Data Architecture and Characteristics

Big Data architecture includes data sources, data ingestion, storage layer, processing layer, analytics layer, and visualization. Its key characteristics are Volume, Velocity, Variety, Veracity, and Value, which define the nature and complexity of Big Data systems.

3(b) Big Data Security, Protection, and Auditing Features

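Big Data security includes authentication, authorization, encryption, data masking, and auditing. In the Hadoop ecosystem, Kerberos provides authentication, Apache Ranger provides centralized authorization policies and audit logs, and Apache Knox acts as a secure gateway to cluster services. Together they ensure secure access, data protection, and compliance monitoring.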
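
Authorization, auditing, and data masking can be sketched together in Python. Everything here is hypothetical: the policy table, user names, and functions are illustrative only and do not reflect the APIs of Kerberos, Ranger, or Knox.

```python
from datetime import datetime, timezone

# Hypothetical policy table: role -> set of (action, resource) pairs allowed.
POLICIES = {
    "analyst": {("read", "sales")},
    "admin": {("read", "sales"), ("write", "sales")},
}
USERS = {"asha": "analyst", "ravi": "admin"}  # hypothetical user -> role
audit_log = []


def authorize(user, action, resource):
    """Check the policy table and record every decision in the audit log."""
    role = USERS.get(user)
    allowed = role is not None and (action, resource) in POLICIES.get(role, set())
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "resource": resource,
        "allowed": allowed,
    })
    return allowed


def mask(value, visible=4):
    """Replace all but the last `visible` characters with '*'."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]
```

Calling authorize("asha", "write", "sales") is denied because the analyst role has no write permission, and the denial is still written to the audit log so it can be reviewed later.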
