Hadoop


Apache Hadoop is a software framework for distributed storage and distributed processing of big data using the MapReduce programming model.

Hadoop consists of
1. Hadoop Common : the libraries, JARs, and scripts needed to start Hadoop
2. HDFS (Hadoop Distributed File System)
3. Hadoop YARN (Yet Another Resource Negotiator) : splits resource management and job scheduling/monitoring into separate daemons : a global ResourceManager and a per-application ApplicationMaster
4. Hadoop MapReduce : the programming model for large-scale processing (see the WordCount sketch below)
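
To make the MapReduce model concrete, here is a minimal sketch of the classic WordCount job written against the standard org.apache.hadoop.mapreduce API. The class name and input/output paths are illustrative; the job is packaged into a JAR and submitted with the hadoop jar command.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase : emit (word, 1) for every token in an input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase : sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /output (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Submitted with something like `hadoop jar wordcount.jar WordCount /input /output`, YARN schedules the map and reduce tasks across the cluster.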

and other tools in the Hadoop ecosystem:
* Apache Pig : platform for data-flow scripts (Pig Latin) compiled to MapReduce
* Apache Hive :
- data warehouse for data query and analysis
- input : HiveQL
- output : queries compiled to MapReduce, Apache Tez, or Spark jobs
1. Metastore : stores table metadata in an RDBMS (embedded Apache Derby by default)
2. Driver : controller that receives, coordinates, and executes HiveQL statements
3. Compiler : HiveQL query -> Abstract Syntax Tree (AST) -> Directed Acyclic Graph (DAG)
4. Optimizer : rewrites the DAG into an optimized DAG
5. Executor : interacts with Hadoop's job tracker to schedule and run the tasks of the DAG
6. CLI / UI / Thrift Server for network clients such as ODBC/JDBC (see the JDBC sketch after this list)
- supports ACID transactions : Atomicity, Consistency, Isolation, and Durability
* Apache HBase :
- non-relational, distributed database on top of HDFS
- features : compression, in-memory operation, and Bloom filters on a per-column basis
- can serve as both input and output for MapReduce jobs
- accessed through the Java API and through REST, Avro, or Thrift gateway APIs
* Apache Phoenix : SQL layer for HBase
* Apache Spark : analytics engine for large-scale data processing
* Apache ZooKeeper : distributed coordination and configuration service
* Cloudera Impala : distributed SQL query engine
* Apache Flume : collection and aggregation of log and event data
* Apache Sqoop : bulk data transfer between Hadoop and relational databases
* Apache Oozie : workflow scheduler for Hadoop jobs
* Apache Storm : distributed stream processing
* Apache Mahout : machine learning (ML) library for
- collaborative filtering,
- clustering, and
- classification
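
As referenced in the Hive notes above, HiveServer2 exposes a Thrift service that standard JDBC clients can use. The following is a minimal sketch; the host, port, database, credentials, and the access_logs table are placeholders, and the hive-jdbc driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // The Hive JDBC driver talks to HiveServer2 over its Thrift service.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive-server:10000/default", "user", "");
         Statement stmt = conn.createStatement()) {

      // HiveQL is compiled to an optimized DAG and executed as MapReduce/Tez/Spark jobs.
      try (ResultSet rs = stmt.executeQuery(
               "SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page")) {
        while (rs.next()) {
          System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
        }
      }
    }
  }
}
```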

Architecture 

* Master node :
- JobTracker
- TaskTracker
- NameNode (primary and secondary)
- DataNode

* A slave or worker node acts as :
- both DataNode and TaskTracker (each task runs in a separate JVM so that a crashing task does not take down the TaskTracker), or
- DataNode only (data-only node), or
- TaskTracker only (compute-only node)

The JobTracker and TaskTracker expose their status and information through an embedded Jetty web server, viewable from a browser.

HDFS

With the default replication value of 3, data is stored on three nodes: two on the same rack and one on a different rack.

HDFS was designed for mostly immutable files and is not suitable for concurrent write operations.

File access is possible through
- the native Java API (see the sketch below),
- the Thrift API (a binary RPC protocol that generates clients in many languages),
- the command-line interface (CLI),
- the HDFS-UI web application over HTTP, and
- 3rd-party network client libraries.
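
A minimal sketch of the native Java API (org.apache.hadoop.fs.FileSystem) writing and then reading a small file. The NameNode URI and path are placeholders; in a real deployment fs.defaultFS normally comes from core-site.xml rather than being set in code.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020");  // placeholder NameNode address

    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/tmp/example.txt");

      // Write a small file; HDFS splits it into blocks and replicates each block
      // (3 copies by default, two on one rack and one on another).
      try (FSDataOutputStream out = fs.create(file, true)) {
        out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
      }

      // Read the file back through the same API.
      try (BufferedReader in = new BufferedReader(
               new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
        System.out.println(in.readLine());
      }
    }
  }
}
```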

Monitoring
- Hortonworks
- Cloudera
- Datadog

File Systems
- HDFS
- FTP file system
- Amazon S3 (Simple Storage Service) object storage
- Windows Azure Storage Blobs (WASB) file system
- IBM General Parallel File System
- Parascale file system
- CloudIQ Storage product by Appistry
- location-aware IBRIX Fusion file system driver by HP
- MapR FS by MapR Technologies Inc
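
These back ends all plug into Hadoop's FileSystem abstraction and are selected by URI scheme. A small sketch, assuming the relevant connector (e.g. hadoop-aws for s3a://) is on the classpath; the bucket and path are placeholders.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListAnyFileSystem {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // The scheme in the URI (hdfs://, s3a://, wasb://, ftp://, ...) picks the
    // FileSystem implementation; the calling code stays the same.
    URI uri = URI.create(args.length > 0 ? args[0] : "s3a://example-bucket/");

    try (FileSystem fs = FileSystem.get(uri, conf)) {
      for (FileStatus status : fs.listStatus(new Path(uri))) {
        System.out.println(status.getPath() + "\t" + status.getLen());
      }
    }
  }
}
```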

Hadoop major versions
Hadoop 1
Hadoop 2
- introduces YARN
Hadoop 3
- multiple NameNodes
- container support in YARN (e.g. Docker)
- decreases storage overhead with erasure coding
- can use GPU hardware for deep learning

Hadoop on AWS
- Amazon Elastic MapReduce (EMR) : managed Hadoop clusters
- Amazon Elastic Compute Cloud (EC2) : compute instances for the cluster nodes
- Amazon Simple Storage Service (S3) : object storage for input and output data
