Big Data Analytics | Undrstanding

00 / Overview

What is Big Data?

Big Data refers to datasets that are too large, complex, or grow too fast to be handled by traditional data management tools (like RDBMS). It's not just about size; it's about the ability to extract value from data that was previously discarded or ignored.

The 6 Vs of Big Data

1. Volume

The sheer amount of data (Terabytes to Zettabytes). Generated by IoT, logs, transactions.

2. Velocity

The speed at which data is generated and processed (Real-time streams, sensors).

3. Variety

Different forms of data: Structured (SQL), Semi-structured (JSON/XML), Unstructured (Video/Text).

4. Veracity

The quality and trustworthiness of data. Handling noise, bias, and abnormalities.

5. Value

The usefulness of data. Data is useless unless it turns into actionable insights/profit.

6. Variability

Inconsistency in data flows. Managing peaks, troughs, and changing meanings over time.

Big Data vs Traditional Data

Feature	Traditional Data	Big Data
Volume	Gigabytes (GB)	Petabytes (PB), Exabytes (EB)
Architecture	Centralized	Distributed
Structure	Structured (Tables)	Semi/Unstructured
Access	Interactive	Batch or Stream

Necessity & Applications

We reached a limit where vertical scaling (adding CPU/RAM to one machine) became too expensive. Horizontal scaling (adding more cheap machines) was the solution.

Healthcare: Predicting epidemics, personalized medicine.
Finance: Real-time fraud detection, high-frequency trading.
Retail: Recommendation engines (Amazon/Netflix), inventory optimization.

Challenges & Critique

Data Privacy: Risk of exposing sensitive user info.
Skill Gap: Shortage of data scientists and engineers.
"Data Swamp": Dumping data without governance leads to unusable, messy reservoirs.
Hype: Companies collecting data "just in case" without a clear ROI strategy.

01 / Foundation

Apache Hadoop

An open-source software framework used for distributed storage and processing of dataset of big data using the MapReduce programming model. It runs on clusters of commodity hardware (inexpensive computers).

Core Modules

Hadoop Common: The libraries and utilities needed by other Hadoop modules.
HDFS: A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN: A resource-management platform responsible for managing computing resources in clusters and using them for scheduling users' applications.
Hadoop MapReduce: An implementation of the MapReduce programming model for large-scale data processing.

Key Features

● Reliability: Assumes hardware failure is the norm. It handles failures automatically by maintaining multiple copies of data.
● Scalability: Designed to scale linearly from a single server to thousands of machines, each expanding local computation and storage.
● Cost-Effective: Runs on commodity hardware rather than expensive specialized servers.

02 / Storage

HDFS (Hadoop Distributed File System)

A highly fault-tolerant distributed file system designed to run on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets.

Architecture: Master/Slave architecture.

Blocks: Files are split into blocks (default 128MB). Each block is stored as a separate file on the local file system of a DataNode.

Replication Factor: To ensure fault tolerance, each block is replicated across multiple DataNodes (default 3x).

Replication Simulator (Factor: 3)

DataNode 1

A

B

DataNode 2

A

C

DataNode 3

B

C

System Healthy. Click status light to simulate node failure.

NameNode (Master)

The centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept.

Manages File System Namespace.
Regulates client access to files.
Executes operations like opening, closing, and renaming files/directories.

DataNode (Slave)

Stores the actual data in blocks. There are usually one per node in the cluster.

Responsible for serving read and write requests from file system clients.
Performs block creation, deletion, and replication upon instruction from the NameNode.

03 / Compute

MapReduce

A core component of the Hadoop ecosystem that provides a paradigm for processing massive data across distributed clusters. It abstracts the complexity of parallel computing, fault tolerance, and data locality.

Key Concepts

Data Locality: Moving computation to the data (cheaper) rather than moving data to computation (expensive network I/O).
Scalability: Tasks are independent, allowing valid parallelization across thousands of nodes.

Pipeline Visualization

Status: Idle

INPUT

MAP

SHUFFLE

REDUCE

HDFS

1. Map Phase

Converts input data into key-value pairs.

InputSplits: Logic splits of data.
Mapping: Processing record by record.

2. Shuffle Phase

Transfers data from Mappers to Reducers.

Sorting: Groups values by key.
Partitioner: Decides which Reducer gets which key.

3. Reduce Phase

Aggregates the grouped data.

Reducing: Summarizes the list of values for each key.
Output: Writes final result to HDFS.

04 / Management

YARN (Yet Another Resource Negotiator)

Introduced in Hadoop 2.0 to solve the scalability issues of MapReduce v1. It separates the resource management layer from the processing layer, allowing other engines (like Spark) to run on the same cluster.

Resource Dashboard

Status: Ready

Client

Job Submission

Resource Manager

Global Scheduler

Node Manager

RAM Usage 0%

ResourceManager (Master)

The ultimate authority that arbitrates resources among all the applications in the system.

Scheduler: Allocates resources.
ApplicationsManager: Accepts job submissions.

NodeManager (Slave)

The per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting to the ResourceManager.

ApplicationMaster (Per-App)

A framework-specific library that negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the tasks.

05 / Speed

Apache Spark

A unified analytics engine for large-scale data processing. Known for being 100x faster than Hadoop MapReduce in memory by reducing the number of read/write operations to disk.

RDD (Resilient Distributed Dataset): The fundamental data structure of Spark. It is an immutable distributed collection of objects.

DAG (Directed Acyclic Graph): Spark creates a graph of tasks to execute. This allows for optimization (like pipelining map tasks).

Lazy Evaluation: Transformations (like map, filter) are not executed until an Action (like count, collect) is called.

DAG Execution Pipeline

RDD 1 Load

RDD 2 Map

RDD 3 Reduce

The Spark Ecosystem

Spark SQL

Module for working with structured data. Allows querying via SQL or DataFrame API.

Spark Streaming

Scalable, high-throughput, fault-tolerant stream processing of live data streams.

MLlib (Machine Learning)

Scalable machine learning library (Classification, Regression, Clustering, etc.).

GraphX

API for graphs and graph-parallel computation.

06 / Real-time

Apache Storm

A free and open source distributed realtime computation system. Unlike Hadoop (which runs "Jobs" that end), Storm runs "Topologies" that process data forever until killed.

TOPOLOGY VISUALIZER

Spout

Bolt

Core Concepts

Spouts: Source of streams in a topology (e.g., reads from Kafka).
Bolts: Processes input streams (filter, aggregate, join) and produces new streams.
Nimbus: Master node (like JobTracker).
Supervisor: Worker nodes.

Stream Groupings

How tuples are routed between components:

Shuffle: Equal random distribution.
Fields: Tuples with same field value go to same task.
All: Send to all tasks (replication).
Global: Send to task with lowest ID.

07 / SQL Interface

Apache Hive

A data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in HDFS and other compatible systems.

How it works

Hive translates HiveQL (an SQL-like language) into MapReduce, Tez, or Spark jobs that run on the Hadoop cluster.

Key Feature

Metastore: A central repository that stores the structure (metadata) of the tables, allowing SQL-on-HDFS.

# HiveQL vs SQL

CREATE TABLE logs (ip STRING, url STRING)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Simple Aggregation

SELECT url, COUNT(*) FROM logs GROUP BY url;

# Running MapReduce Job...

08 / Scripting

Apache Pig

A high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. It abstracts the Java complexity of MapReduce.

Why Pig?

Ease of Programming: 10 lines of Pig Latin = 200 lines of Java MapReduce.
Optimization: Automatically optimizes execution.
Extensibility: Users can create custom functions (UDFs).

Key Feature

Schema-on-Read: Unlike RDBMS (Schema-on-Write), Pig loads data without a predefined schema and applies schema only when reading.

# Pig Latin ETL Example

visits = LOAD '/data/visits' USING PigStorage(',');

-- Filter for US User

us_visits = FILTER visits BY country == 'US';

-- Group by URL

grouped = GROUP us_visits BY url;

-- Count visits

total = FOREACH grouped GENERATE group as url, COUNT(us_visits);

STORE total INTO '/data/us_url_counts';

09 / NLP Process

Text Mining & NLP Pipeline

Text mining is the process of deriving high-quality information from text. It involves preprocessing raw text into a structured format that can be analyzed by machine learning algorithms.

Preprocessing Simulation

Status: Ready

Raw Text

"The players are running fast"

Tokenization

Refining

Vectorization

1. Tokenization

Breaking text into individual units like words or phrases (tokens).

2. Stopwords

Removing common words (the, a, is) that don't add significant meaning.

3. Stemming

Reducing words to their base root form (e.g., 'running' to 'run').

4. Vectorization

Converting text into numerical vectors that computers can process.

10 / Appendix

Big Data Glossary

HDFS

Hadoop Distributed File System; provides high-throughput access to application data.

MapReduce

Software framework for easily writing applications which process vast amounts of data in-parallel.

YARN

Yet Another Resource Negotiator; the cluster management layer of Hadoop.

Spark

A fast and general-purpose cluster computing system for big data processing.

RDD

Resilient Distributed Dataset; fundamental data structure of Apache Spark.

NameNode

The centerpiece of HDFS that keeps the directory tree and tracks where data is kept.

DataNode

Responsible for storing the actual data blocks in HDFS.

ResourceManager

The ultimate authority that arbitrates resources among all applications in YARN.

NodeManager

The per-machine agent responsible for containers and monitoring resource usage in YARN.

ApplicationMaster

A framework-specific library that negotiates resources from ResourceManager for a single job.

Container

A logical bundle of resources (CPU, RAM) on a single node where tasks run.

Spout

A source of streams in an Apache Storm topology.

Bolt

Processes input streams and produces new streams in Apache Storm.

Topology

A network of spouts and bolts that defines a real-time computation in Storm.

Pig Latin

A data flow language used in Apache Pig to analyze large datasets.

ETL

Extract, Transform, Load; the general process of data movement and refinement.

DAG

Directed Acyclic Graph; a representation of data processing steps with no loops.

In-Memory

Storing data in RAM for faster access compared to disk storage.

Fault Tolerance

The ability of a system to continue functioning after a component failure.

Data Locality

Moving the computation to where the data is stored to reduce network traffic.

11 / Verification

Knowledge Check

Verify your understanding of Big Data Analytics. Pass the exam with a perfect score to earn your completion certificate.

05 / Certification

Completion Certificate

Enter your name below to generate your personalized certificate of completion for the Big Data Analytics series.

Full Name

This certificate confirms successful completion of learning and assessment on this platform and is not an accredited or government-recognized qualification.

Distributed Foundations

What is Big Data?

The 6 Vs of Big Data

Big Data vs Traditional Data

Necessity & Applications

Challenges & Critique

Apache Hadoop

Core Modules

Key Features

HDFS (Hadoop Distributed File System)

Replication Simulator (Factor: 3)

NameNode (Master)

DataNode (Slave)

MapReduce

Key Concepts

Pipeline Visualization

1. Map Phase

2. Shuffle Phase

3. Reduce Phase

YARN (Yet Another Resource Negotiator)

Resource Dashboard

Client

Resource Manager

Node Manager

ResourceManager (Master)

NodeManager (Slave)

ApplicationMaster (Per-App)

Apache Spark

DAG Execution Pipeline

The Spark Ecosystem

Spark SQL

Spark Streaming

MLlib (Machine Learning)

GraphX

Apache Storm

Apache Hive

How it works

Key Feature

Apache Pig

Why Pig?

Key Feature

Text Mining & NLP Pipeline

Preprocessing Simulation

1. Tokenization

2. Stopwords

3. Stemming

4. Vectorization

Big Data Glossary

Knowledge Check

Completion Certificate