Big Data

Distributed Foundations

Processing massive datasets that traditional databases cannot handle. The architecture of distributed storage and parallel computation.

00 / Overview

What is Big Data?

Big Data refers to datasets that are too large, complex, or grow too fast to be handled by traditional data management tools (like RDBMS). It's not just about size; it's about the ability to extract value from data that was previously discarded or ignored.

The 6 Vs of Big Data

1. Volume

The sheer amount of data (Terabytes to Zettabytes). Generated by IoT, logs, transactions.

2. Velocity

The speed at which data is generated and processed (Real-time streams, sensors).

3. Variety

Different forms of data: Structured (SQL), Semi-structured (JSON/XML), Unstructured (Video/Text).

4. Veracity

The quality and trustworthiness of data. Handling noise, bias, and abnormalities.

5. Value

The usefulness of data. Data is useless unless it turns into actionable insights/profit.

6. Variability

Inconsistency in data flows. Managing peaks, troughs, and changing meanings over time.

Big Data vs Traditional Data

Feature Traditional Data Big Data
Volume Gigabytes (GB) Petabytes (PB), Exabytes (EB)
Architecture Centralized Distributed
Structure Structured (Tables) Semi/Unstructured
Access Interactive Batch or Stream

Necessity & Applications

We reached a limit where vertical scaling (adding CPU/RAM to one machine) became too expensive. Horizontal scaling (adding more cheap machines) was the solution.

  • Healthcare: Predicting epidemics, personalized medicine.
  • Finance: Real-time fraud detection, high-frequency trading.
  • Retail: Recommendation engines (Amazon/Netflix), inventory optimization.

Challenges & Critique

  • Data Privacy: Risk of exposing sensitive user info.
  • Skill Gap: Shortage of data scientists and engineers.
  • "Data Swamp": Dumping data without governance leads to unusable, messy reservoirs.
  • Hype: Companies collecting data "just in case" without a clear ROI strategy.
01 / Foundation

Apache Hadoop

An open-source software framework used for distributed storage and processing of dataset of big data using the MapReduce programming model. It runs on clusters of commodity hardware (inexpensive computers).

Core Modules

  • Hadoop Common: The libraries and utilities needed by other Hadoop modules.
  • HDFS: A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
  • Hadoop YARN: A resource-management platform responsible for managing computing resources in clusters and using them for scheduling users' applications.
  • Hadoop MapReduce: An implementation of the MapReduce programming model for large-scale data processing.

Key Features

  • Reliability: Assumes hardware failure is the norm. It handles failures automatically by maintaining multiple copies of data.
  • Scalability: Designed to scale linearly from a single server to thousands of machines, each expanding local computation and storage.
  • Cost-Effective: Runs on commodity hardware rather than expensive specialized servers.
02 / Storage

HDFS (Hadoop Distributed File System)

A highly fault-tolerant distributed file system designed to run on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets.

Architecture: Master/Slave architecture.

Blocks: Files are split into blocks (default 128MB). Each block is stored as a separate file on the local file system of a DataNode.

Replication Factor: To ensure fault tolerance, each block is replicated across multiple DataNodes (default 3x).

Replication Simulator (Factor: 3)

DataNode 1
A
B
DataNode 2
A
C
DataNode 3
B
C

System Healthy. Click status light to simulate node failure.

NameNode (Master)

The centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept.

  • Manages File System Namespace.
  • Regulates client access to files.
  • Executes operations like opening, closing, and renaming files/directories.

DataNode (Slave)

Stores the actual data in blocks. There are usually one per node in the cluster.

  • Responsible for serving read and write requests from file system clients.
  • Performs block creation, deletion, and replication upon instruction from the NameNode.
03 / Compute

MapReduce

A core component of the Hadoop ecosystem that provides a paradigm for processing massive data across distributed clusters. It abstracts the complexity of parallel computing, fault tolerance, and data locality.

Key Concepts

  • Data Locality: Moving computation to the data (cheaper) rather than moving data to computation (expensive network I/O).
  • Scalability: Tasks are independent, allowing valid parallelization across thousands of nodes.

Pipeline Visualization

Status: Idle

INPUT
MAP
SHUFFLE
REDUCE
HDFS

1. Map Phase

Converts input data into key-value pairs.

  • InputSplits: Logic splits of data.
  • Mapping: Processing record by record.

2. Shuffle Phase

Transfers data from Mappers to Reducers.

  • Sorting: Groups values by key.
  • Partitioner: Decides which Reducer gets which key.

3. Reduce Phase

Aggregates the grouped data.

  • Reducing: Summarizes the list of values for each key.
  • Output: Writes final result to HDFS.
04 / Management

YARN (Yet Another Resource Negotiator)

Introduced in Hadoop 2.0 to solve the scalability issues of MapReduce v1. It separates the resource management layer from the processing layer, allowing other engines (like Spark) to run on the same cluster.

Resource Dashboard

Status: Ready

Client

Job Submission

Resource Manager

Global Scheduler

Node Manager
RAM Usage 0%

ResourceManager (Master)

The ultimate authority that arbitrates resources among all the applications in the system.

  • Scheduler: Allocates resources.
  • ApplicationsManager: Accepts job submissions.

NodeManager (Slave)

The per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting to the ResourceManager.

ApplicationMaster (Per-App)

A framework-specific library that negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the tasks.

05 / Speed

Apache Spark

A unified analytics engine for large-scale data processing. Known for being 100x faster than Hadoop MapReduce in memory by reducing the number of read/write operations to disk.

RDD (Resilient Distributed Dataset): The fundamental data structure of Spark. It is an immutable distributed collection of objects.

DAG (Directed Acyclic Graph): Spark creates a graph of tasks to execute. This allows for optimization (like pipelining map tasks).

Lazy Evaluation: Transformations (like map, filter) are not executed until an Action (like count, collect) is called.

DAG Execution Pipeline

RDD 1 Load
RDD 2 Map
RDD 3 Reduce

The Spark Ecosystem

Spark SQL

Module for working with structured data. Allows querying via SQL or DataFrame API.

Spark Streaming

Scalable, high-throughput, fault-tolerant stream processing of live data streams.

MLlib (Machine Learning)

Scalable machine learning library (Classification, Regression, Clustering, etc.).

GraphX

API for graphs and graph-parallel computation.

06 / Real-time

Apache Storm

A free and open source distributed realtime computation system. Unlike Hadoop (which runs "Jobs" that end), Storm runs "Topologies" that process data forever until killed.

TOPOLOGY VISUALIZER
Spout
Bolt
Core Concepts
  • Spouts: Source of streams in a topology (e.g., reads from Kafka).
  • Bolts: Processes input streams (filter, aggregate, join) and produces new streams.
  • Nimbus: Master node (like JobTracker).
  • Supervisor: Worker nodes.
Stream Groupings

How tuples are routed between components:

  • Shuffle: Equal random distribution.
  • Fields: Tuples with same field value go to same task.
  • All: Send to all tasks (replication).
  • Global: Send to task with lowest ID.
07 / SQL Interface

Apache Hive

A data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in HDFS and other compatible systems.

How it works

Hive translates HiveQL (an SQL-like language) into MapReduce, Tez, or Spark jobs that run on the Hadoop cluster.

Key Feature

Metastore: A central repository that stores the structure (metadata) of the tables, allowing SQL-on-HDFS.

# HiveQL vs SQL
CREATE TABLE logs (ip STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- Simple Aggregation
SELECT url, COUNT(*) FROM logs GROUP BY url;
# Running MapReduce Job...
08 / Scripting

Apache Pig

A high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. It abstracts the Java complexity of MapReduce.

Why Pig?

  • Ease of Programming: 10 lines of Pig Latin = 200 lines of Java MapReduce.
  • Optimization: Automatically optimizes execution.
  • Extensibility: Users can create custom functions (UDFs).

Key Feature

Schema-on-Read: Unlike RDBMS (Schema-on-Write), Pig loads data without a predefined schema and applies schema only when reading.

# Pig Latin ETL Example
visits = LOAD '/data/visits' USING PigStorage(',');
-- Filter for US User
us_visits = FILTER visits BY country == 'US';
-- Group by URL
grouped = GROUP us_visits BY url;
-- Count visits
total = FOREACH grouped GENERATE group as url, COUNT(us_visits);
STORE total INTO '/data/us_url_counts';
09 / NLP Process

Text Mining & NLP Pipeline

Text mining is the process of deriving high-quality information from text. It involves preprocessing raw text into a structured format that can be analyzed by machine learning algorithms.

Preprocessing Simulation

Status: Ready

Raw Text
"The players are running fast"
Tokenization
Refining
Vectorization

1. Tokenization

Breaking text into individual units like words or phrases (tokens).

2. Stopwords

Removing common words (the, a, is) that don't add significant meaning.

3. Stemming

Reducing words to their base root form (e.g., 'running' to 'run').

4. Vectorization

Converting text into numerical vectors that computers can process.

10 / Appendix

Big Data Glossary

HDFS

Hadoop Distributed File System; provides high-throughput access to application data.

MapReduce

Software framework for easily writing applications which process vast amounts of data in-parallel.

YARN

Yet Another Resource Negotiator; the cluster management layer of Hadoop.

Spark

A fast and general-purpose cluster computing system for big data processing.

RDD

Resilient Distributed Dataset; fundamental data structure of Apache Spark.

NameNode

The centerpiece of HDFS that keeps the directory tree and tracks where data is kept.

DataNode

Responsible for storing the actual data blocks in HDFS.

ResourceManager

The ultimate authority that arbitrates resources among all applications in YARN.

NodeManager

The per-machine agent responsible for containers and monitoring resource usage in YARN.

ApplicationMaster

A framework-specific library that negotiates resources from ResourceManager for a single job.

Container

A logical bundle of resources (CPU, RAM) on a single node where tasks run.

Spout

A source of streams in an Apache Storm topology.

Bolt

Processes input streams and produces new streams in Apache Storm.

Topology

A network of spouts and bolts that defines a real-time computation in Storm.

Pig Latin

A data flow language used in Apache Pig to analyze large datasets.

ETL

Extract, Transform, Load; the general process of data movement and refinement.

DAG

Directed Acyclic Graph; a representation of data processing steps with no loops.

In-Memory

Storing data in RAM for faster access compared to disk storage.

Fault Tolerance

The ability of a system to continue functioning after a component failure.

Data Locality

Moving the computation to where the data is stored to reduce network traffic.

11 / Verification

Knowledge Check

Verify your understanding of Big Data Analytics. Pass the exam with a perfect score to earn your completion certificate.