Hadoop


Apache Hadoop is a software framework for distributed storage and distributed processing of big data using the MapReduce programming model.

Hadoop consists of
1. Hadoop Common : the libraries, JARs, and scripts needed to start Hadoop
2. HDFS (Hadoop Distributed File System)
3. Hadoop YARN (Yet Another Resource Negotiator) : splits resource management and job scheduling/monitoring into separate daemons : a global ResourceManager and a per-application ApplicationMaster
4. Hadoop MapReduce : the programming model for large-scale processing (see the WordCount sketch below)
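
To make the MapReduce model concrete, here is a minimal sketch of the classic WordCount job written against the standard org.apache.hadoop.mapreduce API. The class name and input/output paths are illustrative; the job is packaged into a JAR and submitted with the hadoop jar command.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase : emit (word, 1) for every token in an input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase : sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /output (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Submitted with something like `hadoop jar wordcount.jar WordCount /input /output`, YARN schedules the map and reduce tasks across the cluster.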

and other tools in the Hadoop ecosystem:
* Apache Pig : platform for data-flow scripts (Pig Latin) compiled to MapReduce
* Apache Hive :
- data warehouse for data query and analysis
- input : HiveQL
- output : queries compiled to MapReduce, Apache Tez, or Spark jobs
1. Metastore : stores table metadata in an RDBMS (embedded Apache Derby by default)
2. Driver : controller that receives, coordinates, and executes HiveQL statements
3. Compiler : HiveQL query -> Abstract Syntax Tree (AST) -> Directed Acyclic Graph (DAG)
4. Optimizer : rewrites the DAG into an optimized DAG
5. Executor : interacts with Hadoop's job tracker to schedule and run the tasks of the DAG
6. CLI / UI / Thrift Server for network clients such as ODBC/JDBC (see the JDBC sketch after this list)
- supports ACID transactions : Atomicity, Consistency, Isolation, and Durability
* Apache HBase :
- non-relational, distributed database on top of HDFS
- features : compression, in-memory operation, and Bloom filters on a per-column basis
- can serve as both input and output for MapReduce jobs
- accessed through the Java API and through REST, Avro, or Thrift gateway APIs
* Apache Phoenix : SQL layer for HBase
* Apache Spark : analytics engine for large-scale data processing
* Apache ZooKeeper : distributed coordination and configuration service
* Cloudera Impala : distributed SQL query engine
* Apache Flume : collection and aggregation of log and event data
* Apache Sqoop : bulk data transfer between Hadoop and relational databases
* Apache Oozie : workflow scheduler for Hadoop jobs
* Apache Storm : distributed stream processing
* Apache Mahout : machine learning (ML) library for
- collaborative filtering,
- clustering, and
- classification
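
As referenced in the Hive notes above, HiveServer2 exposes a Thrift service that standard JDBC clients can use. The following is a minimal sketch; the host, port, database, credentials, and the access_logs table are placeholders, and the hive-jdbc driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // The Hive JDBC driver talks to HiveServer2 over its Thrift service.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive-server:10000/default", "user", "");
         Statement stmt = conn.createStatement()) {

      // HiveQL is compiled to an optimized DAG and executed as MapReduce/Tez/Spark jobs.
      try (ResultSet rs = stmt.executeQuery(
               "SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page")) {
        while (rs.next()) {
          System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
        }
      }
    }
  }
}
```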

Architecture 

* Master node :
- JobTracker
- TaskTracker
- NameNode (primary and secondary)
- DataNode

* A slave or worker node acts as :
- both DataNode and TaskTracker (each task runs in a separate JVM so that a crashing task does not take down the TaskTracker), or
- DataNode only (data-only node), or
- TaskTracker only (compute-only node)

The JobTracker and TaskTracker expose their status and information through an embedded Jetty web server, viewable from a browser.

HDFS

With the default replication value of 3, data is stored on three nodes: two on the same rack and one on a different rack.

HDFS was designed for mostly immutable files and is not suitable for concurrent write operations.

File access is possible through
- the native Java API (see the sketch below),
- the Thrift API (a binary RPC protocol that generates clients in many languages),
- the command-line interface (CLI),
- the HDFS-UI web application over HTTP, and
- 3rd-party network client libraries.
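
A minimal sketch of the native Java API (org.apache.hadoop.fs.FileSystem) writing and then reading a small file. The NameNode URI and path are placeholders; in a real deployment fs.defaultFS normally comes from core-site.xml rather than being set in code.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020");  // placeholder NameNode address

    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/tmp/example.txt");

      // Write a small file; HDFS splits it into blocks and replicates each block
      // (3 copies by default, two on one rack and one on another).
      try (FSDataOutputStream out = fs.create(file, true)) {
        out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
      }

      // Read the file back through the same API.
      try (BufferedReader in = new BufferedReader(
               new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
        System.out.println(in.readLine());
      }
    }
  }
}
```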

Monitoring
- Hortonworks
- Cloudera
- Datadog

File Systems
- HDFS
- FTP file system
- Amazon S3 (Simple Storage Service) object storage
- Windows Azure Storage Blobs (WASB) file system
- IBM General Parallel File System
- Parascale file system
- CloudIQ Storage product by Appistry
- location-aware IBRIX Fusion file system driver by HP
- MapR FS by MapR Technologies Inc
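
These back ends all plug into Hadoop's FileSystem abstraction and are selected by URI scheme. A small sketch, assuming the relevant connector (e.g. hadoop-aws for s3a://) is on the classpath; the bucket and path are placeholders.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListAnyFileSystem {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // The scheme in the URI (hdfs://, s3a://, wasb://, ftp://, ...) picks the
    // FileSystem implementation; the calling code stays the same.
    URI uri = URI.create(args.length > 0 ? args[0] : "s3a://example-bucket/");

    try (FileSystem fs = FileSystem.get(uri, conf)) {
      for (FileStatus status : fs.listStatus(new Path(uri))) {
        System.out.println(status.getPath() + "\t" + status.getLen());
      }
    }
  }
}
```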

Hadoop major versions
Hadoop 1
Hadoop 2
- introduces YARN
Hadoop 3
- multiple NameNodes
- container support in YARN (e.g. Docker)
- decreases storage overhead with erasure coding
- can use GPU hardware for deep learning

Hadoop on AWS
- Amazon Elastic MapReduce (EMR) : managed Hadoop clusters
- Amazon Elastic Compute Cloud (EC2) : compute instances for the cluster nodes
- Amazon Simple Storage Service (S3) : object storage for input and output data
