Distributed Foundations
Processing massive datasets that traditional databases cannot handle. The architecture of distributed storage and parallel computation.
What is Big Data?
Big Data refers to datasets that are too large, complex, or grow too fast to be handled by traditional data management tools (like RDBMS). It's not just about size; it's about the ability to extract value from data that was previously discarded or ignored.
The 6 Vs of Big Data
Volume: The sheer amount of data (terabytes to zettabytes), generated by IoT devices, logs, and transactions.
Velocity: The speed at which data is generated and processed (real-time streams, sensors).
Variety: Different forms of data: structured (SQL), semi-structured (JSON/XML), and unstructured (video/text).
Veracity: The quality and trustworthiness of data. Handling noise, bias, and abnormalities.
Value: The usefulness of data. Data is useless unless it turns into actionable insights or profit.
Variability: Inconsistency in data flows. Managing peaks, troughs, and changing meanings over time.
Big Data vs Traditional Data
| Feature | Traditional Data | Big Data |
|---|---|---|
| Volume | Gigabytes (GB) | Petabytes (PB), Exabytes (EB) |
| Architecture | Centralized | Distributed |
| Structure | Structured (Tables) | Semi/Unstructured |
| Access | Interactive | Batch or Stream |
Necessity & Applications
We reached a limit where vertical scaling (adding CPU/RAM to one machine) became too expensive. Horizontal scaling (adding more cheap machines) was the solution.
- Healthcare: Predicting epidemics, personalized medicine.
- Finance: Real-time fraud detection, high-frequency trading.
- Retail: Recommendation engines (Amazon/Netflix), inventory optimization.
Challenges & Critique
- Data Privacy: Risk of exposing sensitive user info.
- Skill Gap: Shortage of data scientists and engineers.
- "Data Swamp": Dumping data without governance leads to unusable, messy reservoirs.
- Hype: Companies collecting data "just in case" without a clear ROI strategy.
Apache Hadoop
An open-source software framework for the distributed storage and processing of large datasets using the MapReduce programming model. It runs on clusters of commodity hardware (inexpensive computers).
Core Modules
- Hadoop Common: The libraries and utilities needed by other Hadoop modules.
- HDFS: A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
- Hadoop YARN: A resource-management platform responsible for managing computing resources in clusters and using them for scheduling users' applications.
- Hadoop MapReduce: An implementation of the MapReduce programming model for large-scale data processing.
Key Features
- Reliability: Assumes hardware failure is the norm. It handles failures automatically by maintaining multiple copies of data.
- Scalability: Designed to scale from a single server to thousands of machines, each offering local computation and storage.
- Cost-Effective: Runs on commodity hardware rather than expensive specialized servers.
HDFS (Hadoop Distributed File System)
A highly fault-tolerant distributed file system designed to run on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets.
Architecture: Master/Slave architecture.
Blocks: Files are split into blocks (default 128MB). Each block is stored as a separate file on the local file system of a DataNode.
Replication Factor: To ensure fault tolerance, each block is replicated across multiple DataNodes (default 3x).
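The block and replication arithmetic above can be sketched in a few lines. This is an illustrative calculation only: the function name `hdfs_footprint` and the 1000 MB file are assumptions made for the example, while the 128 MB block size and 3x replication are the defaults stated above.

```python
import math

BLOCK_SIZE_MB = 128      # default block size stated above
REPLICATION_FACTOR = 3   # default replication factor stated above

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Estimate block count and raw storage for one file (illustrative only)."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # The final block only occupies as much disk as the data it holds,
    # so raw storage is file size times replication, not blocks times block size.
    raw_storage_mb = file_size_mb * replication
    return {
        "blocks": num_blocks,
        "block_replicas": num_blocks * replication,
        "raw_storage_mb": raw_storage_mb,
    }

footprint = hdfs_footprint(1000)  # hypothetical 1000 MB file
print(footprint)
```

Under these assumptions a 1000 MB file needs 8 blocks, 24 block replicas across the cluster, and 3000 MB of raw disk.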
NameNode (Master)
The centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept.
- Manages File System Namespace.
- Regulates client access to files.
- Executes operations like opening, closing, and renaming files/directories.
DataNode (Slave)
Stores the actual data in blocks. There is usually one DataNode per node in the cluster.
- Responsible for serving read and write requests from file system clients.
- Performs block creation, deletion, and replication upon instruction from the NameNode.
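To make the division of labour concrete, here is a toy model of a NameNode that keeps only metadata (which DataNodes hold which block) and restores the replication factor after a node fails. It is a sketch, not HDFS code: the class `ToyNameNode`, its methods, and the least-loaded placement rule are all assumptions made for illustration. Real HDFS placement is rack-aware, and failures are detected through missed heartbeats rather than an explicit call.

```python
from collections import defaultdict

class ToyNameNode:
    """Tracks which DataNodes hold each block (metadata only, no file data)."""

    def __init__(self, datanodes, replication=3):
        self.live_nodes = set(datanodes)
        self.replication = replication
        self.block_locations = defaultdict(set)  # block id -> DataNode names

    def _load(self, node):
        # Number of block replicas currently assigned to a node.
        return sum(node in holders for holders in self.block_locations.values())

    def place_block(self, block_id):
        # Toy rule: choose the least-loaded live nodes (ties broken by name).
        ranked = sorted(self.live_nodes, key=lambda n: (self._load(n), n))
        self.block_locations[block_id].update(ranked[: self.replication])

    def handle_failure(self, dead_node):
        """Drop a failed node and re-replicate its blocks onto other live nodes."""
        self.live_nodes.discard(dead_node)
        for holders in self.block_locations.values():
            holders.discard(dead_node)
            candidates = sorted(self.live_nodes - holders,
                                key=lambda n: (self._load(n), n))
            while len(holders) < self.replication and candidates:
                holders.add(candidates.pop(0))

namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
for block in ["blk_1", "blk_2"]:
    namenode.place_block(block)
namenode.handle_failure("dn1")
print(dict(namenode.block_locations))
```

After `dn1` fails, every block is back to three replicas on the surviving nodes, which is the behaviour the replication factor is meant to guarantee.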
MapReduce
A core component of the Hadoop ecosystem that provides a paradigm for processing massive data across distributed clusters. It abstracts the complexity of parallel computing, fault tolerance, and data locality.
Key Concepts
- Data Locality: Moving computation to the data (cheaper) rather than moving data to computation (expensive network I/O).
- Scalability: Tasks are independent, allowing parallelization across thousands of nodes.
1. Map Phase
Converts input data into key-value pairs.
- InputSplits: Logical splits of the input data.
- Mapping: Processing record by record.
2. Shuffle Phase
Transfers data from Mappers to Reducers.
- Sorting: Groups values by key.
- Partitioner: Decides which Reducer gets which key.
3. Reduce Phase
Aggregates the grouped data.
- Reducing: Summarizes the list of values for each key.
- Output: Writes final result to HDFS.
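The three phases can be traced with a single-process word count. This is a minimal sketch of the programming model, not the Hadoop API: the function names, the two sample input splits, and the byte-sum partitioning rule are assumptions chosen for the example (the byte sum simply gives a result that is stable across runs).

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]

def partition(key, num_reducers):
    """Partitioner: decide which reducer receives a key (toy, deterministic rule)."""
    return sum(key.encode("utf-8")) % num_reducers

def shuffle_phase(mapped_pairs, num_reducers):
    """Shuffle: route each pair to its reducer and group values by key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        reducers[partition(key, num_reducers)][key].append(value)
    return reducers

def reduce_phase(grouped):
    """Reduce: summarize the list of values for each key, in sorted key order."""
    return {key: sum(values) for key, values in sorted(grouped.items())}

input_splits = ["big data is big", "data is everywhere"]   # sample input
mapped = [pair for split in input_splits for pair in map_phase(split)]

word_counts = {}
for grouped in shuffle_phase(mapped, num_reducers=2):
    word_counts.update(reduce_phase(grouped))   # stands in for writing to HDFS
print(word_counts)
```

Because every occurrence of a key lands on the same reducer, each reducer can compute its totals without talking to the others, which is what makes the tasks independently parallelizable.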
YARN (Yet Another Resource Negotiator)
Introduced in Hadoop 2.0 to solve the scalability issues of MapReduce v1. It separates the resource management layer from the processing layer, allowing other engines (like Spark) to run on the same cluster.
ResourceManager (Master)
The ultimate authority that arbitrates resources among all the applications in the system.
- Scheduler: Allocates resources.
- ApplicationsManager: Accepts job submissions.
NodeManager (Slave)
The per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting to the ResourceManager.
ApplicationMaster (Per-App)
A framework-specific library that negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the tasks.
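The interaction between the three roles can be sketched as a toy allocation loop. None of this is the YARN API: the classes `ToyNodeManager` and `ToyResourceManager`, the function `run_application_master`, the memory-only resource model, and the first-fit policy are assumptions made for illustration. Real YARN schedulers are pluggable and considerably more involved.

```python
from dataclasses import dataclass, field

@dataclass
class ToyNodeManager:
    name: str
    free_memory_mb: int
    containers: list = field(default_factory=list)

@dataclass
class ToyResourceManager:
    nodes: list

    def allocate(self, app_id, memory_mb):
        """Scheduler role: grant a container on the first node with enough memory."""
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb:
                node.free_memory_mb -= memory_mb
                container = (app_id, node.name, memory_mb)
                node.containers.append(container)
                return container
        return None  # no capacity left; a real scheduler would queue the request

def run_application_master(rm, app_id, num_tasks, memory_mb):
    """ApplicationMaster role: negotiate one container per task."""
    return [rm.allocate(app_id, memory_mb) for _ in range(num_tasks)]

rm = ToyResourceManager([ToyNodeManager("node1", 4096),
                         ToyNodeManager("node2", 2048)])
granted = run_application_master(rm, "app_1", num_tasks=4, memory_mb=2048)
print(granted)
```

With 6144 MB in the toy cluster, the first three 2048 MB requests are granted and the fourth returns `None`, showing why resource arbitration is kept separate from the processing logic.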
Apache Spark
A unified analytics engine for large-scale data processing. It is often cited as up to 100x faster than Hadoop MapReduce for in-memory workloads, because it keeps intermediate results in memory and so reduces the number of disk reads and writes.
RDD (Resilient Distributed Dataset): The fundamental data structure of Spark. It is an immutable distributed collection of objects.
DAG (Directed Acyclic Graph): Spark creates a graph of tasks to execute. This allows for optimization (like pipelining map tasks).
Lazy Evaluation: Transformations (like map, filter) are not executed until an Action (like count, collect) is called.
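Lazy evaluation is easier to see in a small pure-Python stand-in than in prose. The class below is not Spark and does not use the Spark API: `LazyDataset` and its methods are hypothetical names that mimic the idea of an immutable dataset that records transformations and only computes when an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: immutable, records transformations, runs on action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)   # recorded chain of transformations

    def map(self, fn):
        # Transformation: returns a new dataset and computes nothing yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _iterate(self):
        # Pipelining: each element flows through every step in a single pass.
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    def collect(self):   # action
        return list(self._iterate())

    def count(self):     # action
        return sum(1 for _ in self._iterate())

calls = []
def square(x):
    calls.append(x)      # record when the function actually runs
    return x * x

numbers = LazyDataset(range(1, 6))
pipeline = numbers.map(square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)   # still 0: nothing has executed
result = pipeline.collect()        # the action triggers the whole chain
print(calls_before_action, result)
```

Building `pipeline` runs no user code at all; `square` is only invoked once `collect()` is called, and `numbers` itself is never modified.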
The Spark Ecosystem
Spark SQL
Module for working with structured data. Allows querying via SQL or DataFrame API.
Spark Streaming
Scalable, high-throughput, fault-tolerant stream processing of live data streams.
MLlib (Machine Learning)
Scalable machine learning library (Classification, Regression, Clustering, etc.).
GraphX
API for graphs and graph-parallel computation.
Apache Storm
A free and open-source distributed real-time computation system. Unlike Hadoop (which runs "Jobs" that end), Storm runs "Topologies" that process data continuously until they are killed.
- Spouts: Source of streams in a topology (e.g., reads from Kafka).
- Bolts: Processes input streams (filter, aggregate, join) and produces new streams.
- Nimbus: Master node (like JobTracker).
- Supervisor: Worker nodes.
Stream groupings define how tuples are routed between components:
- Shuffle: Equal random distribution.
- Fields: Tuples with same field value go to same task.
- All: Send to all tasks (replication).
- Global: Send to task with lowest ID.
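The four routing rules above can be expressed as small functions that return the task IDs a tuple is sent to. This is a sketch of the routing logic only, not Storm's API: the function names, the use of `crc32` as the hash, and the assumption that tasks are numbered from 0 are all illustrative choices.

```python
import random
from zlib import crc32

def shuffle_grouping(tuple_, num_tasks, rng=random):
    """Shuffle: pick a random task so load spreads roughly evenly."""
    return [rng.randrange(num_tasks)]

def fields_grouping(tuple_, num_tasks, field):
    """Fields: hash the chosen field so equal values reach the same task."""
    return [crc32(str(tuple_[field]).encode("utf-8")) % num_tasks]

def all_grouping(tuple_, num_tasks):
    """All: replicate the tuple to every task."""
    return list(range(num_tasks))

def global_grouping(tuple_, num_tasks):
    """Global: send every tuple to the task with the lowest ID (0 here)."""
    return [0]

click = {"user": "alice", "action": "click"}
view = {"user": "alice", "action": "view"}
same_task = fields_grouping(click, 4, "user") == fields_grouping(view, 4, "user")
print(same_task, all_grouping(click, 4), global_grouping(click, 4))
```

Fields grouping is the one to reach for when a bolt keeps per-key state, such as a running count per user, because both of Alice's events are guaranteed to arrive at the same task.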
Apache Hive
A data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in HDFS and other compatible systems.
How it works
Hive translates HiveQL (an SQL-like language) into MapReduce, Tez, or Spark jobs that run on the Hadoop cluster.
Key Feature
Metastore: A central repository that stores the structure (metadata) of the tables, allowing SQL-on-HDFS.
Apache Pig
A high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. It abstracts the Java complexity of MapReduce.
Why Pig?
- Ease of Programming: As a commonly cited rule of thumb, about 10 lines of Pig Latin can replace roughly 200 lines of Java MapReduce.
- Optimization: Automatically optimizes execution.
- Extensibility: Users can create custom functions (UDFs).
Key Feature
Schema-on-Read: Unlike RDBMS (Schema-on-Write), Pig loads data without a predefined schema and applies schema only when reading.
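The schema-on-read idea can be sketched in plain Python: raw lines are stored untouched, and a schema is applied only when they are read. This is not Pig Latin or Pig's implementation. The function `read_with_schema`, the sample rows, and the choice to turn unparseable fields into `None` are assumptions made for this example.

```python
# Raw data is stored as-is; nothing validates it at write time.
raw_lines = ["1,alice,34", "2,bob,not_a_number"]

def read_with_schema(lines, schema):
    """Apply (name, type) pairs at read time; unparseable fields become None."""
    for line in lines:
        row = {}
        for (name, cast), value in zip(schema, line.split(",")):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None   # choice made for this sketch
        yield row

schema = [("id", int), ("name", str), ("age", int)]
rows = list(read_with_schema(raw_lines, schema))
print(rows)
```

The malformed second record was accepted at write time and only surfaces as a missing `age` when the schema is applied, which is the trade-off against schema-on-write systems that would have rejected it on load.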
Text Mining & NLP Pipeline
Text mining is the process of deriving high-quality information from text. It involves preprocessing raw text into a structured format that can be analyzed by machine learning algorithms.
1. Tokenization
Breaking text into individual units like words or phrases (tokens).
2. Stopwords
Removing common words (the, a, is) that don't add significant meaning.
3. Stemming
Reducing words to their base or root form (e.g., 'running' to 'run').
4. Vectorization
Converting text into numerical vectors that computers can process.
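The four steps can be chained into a small pipeline using only the standard library. This is a minimal sketch under stated assumptions: the stopword list is a tiny illustrative one, the suffix-stripping `stem` function is a crude stand-in for a real stemmer such as Porter's, and vectorization is shown as a simple bag-of-words count.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "is", "are", "and", "in"}   # tiny illustrative list
SUFFIXES = ("ing", "ed", "s")                        # crude rules, not a real stemmer

def tokenize(text):
    """Step 1: split lowercased text into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Step 2: drop common words that carry little meaning."""
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    """Step 3: strip one known suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stemmed = token[: -len(suffix)]
            # Undo a doubled final consonant: "runn" -> "run".
            if stemmed[-1] == stemmed[-2]:
                stemmed = stemmed[:-1]
            return stemmed
    return token

def preprocess(text):
    return [stem(t) for t in remove_stopwords(tokenize(text))]

def vectorize(documents):
    """Step 4: bag-of-words count vectors over a shared, sorted vocabulary."""
    processed = [preprocess(doc) for doc in documents]
    vocabulary = sorted({t for tokens in processed for t in tokens})
    vectors = []
    for tokens in processed:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

docs = ["The dog is running", "A dog jumped and the dogs barked"]
vocabulary, vectors = vectorize(docs)
print(vocabulary)
print(vectors)
```

Here the vocabulary is `['bark', 'dog', 'jump', 'run']` and the two documents become `[0, 1, 0, 1]` and `[1, 2, 1, 0]`: "dog" and "dogs" collapse to one feature, which is the point of stemming before vectorization.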
Big Data Glossary
HDFS: Hadoop Distributed File System; provides high-throughput access to application data.
MapReduce: Software framework for easily writing applications that process vast amounts of data in parallel.
YARN: Yet Another Resource Negotiator; the cluster management layer of Hadoop.
Apache Spark: A fast and general-purpose cluster computing system for big data processing.
RDD: Resilient Distributed Dataset; fundamental data structure of Apache Spark.
NameNode: The centerpiece of HDFS that keeps the directory tree and tracks where data is kept.
DataNode: Responsible for storing the actual data blocks in HDFS.
ResourceManager: The ultimate authority that arbitrates resources among all applications in YARN.
NodeManager: The per-machine agent responsible for containers and monitoring resource usage in YARN.
ApplicationMaster: A framework-specific library that negotiates resources from the ResourceManager for a single job.
Container: A logical bundle of resources (CPU, RAM) on a single node where tasks run.
Spout: A source of streams in an Apache Storm topology.
Bolt: Processes input streams and produces new streams in Apache Storm.
Topology: A network of spouts and bolts that defines a real-time computation in Storm.
Pig Latin: A data flow language used in Apache Pig to analyze large datasets.
ETL: Extract, Transform, Load; the general process of data movement and refinement.
DAG: Directed Acyclic Graph; a representation of data processing steps with no loops.
In-memory storage: Storing data in RAM for faster access compared to disk storage.
Fault tolerance: The ability of a system to continue functioning after a component failure.
Data locality: Moving the computation to where the data is stored to reduce network traffic.