Hadoop
Posted by Manish Panchmatia on Thursday, March 7, 2019
Labels: ArtificialIntelligence, Bangalore, MachineLearning, software
Hadoop is a software framework for distributed storage and processing of big data using the MapReduce programming model.
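As a quick illustration of the MapReduce programming model, here is the classic word-count example against the Hadoop Java API (org.apache.hadoop.mapreduce). This is a minimal sketch; the input and output paths come from the command line and are placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```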
Hadoop consists of
1. Hadoop Common : the common JARs and scripts needed to start Hadoop
2. HDFS (Hadoop Distributed File System)
3. Hadoop YARN (Yet Another Resource Negotiator) : two core daemons : the ResourceManager for resource management and job tracking, and the per-application ApplicationMaster for progress monitoring
4. Hadoop MapReduce
and other tools in the Hadoop ecosystem:
* Apache Pig,
* Apache Hive :
- data warehouse for data query and analysis
- input : HiveQL queries
- output : MapReduce, Apache Tez, or Spark jobs
1. Metastore : stores metadata; an embedded Apache Derby RDBMS by default
2. Driver : acts as the controller, receiving and coordinating queries
3. Compiler : HiveQL query -> Abstract Syntax Tree (AST) -> Directed Acyclic Graph (DAG)
4. Optimizer : produces an optimized DAG
5. Executor : interacts with the Hadoop Job Tracker to run the jobs
6. Interfaces : CLI / UI / Thrift Server for network clients such as ODBC/JDBC (see the JDBC sketch after this list)
- supports ACID transactions: Atomicity, Consistency, Isolation, and Durability
* Apache HBase :
- non-relational, distributed database
- features : compression, in-memory operation, and Bloom filters on a per-column basis
- can serve as both input and output for MapReduce jobs
- accessed through the Java API (see the sketch after this list), REST, Avro, or Thrift gateway APIs
* Apache Phoenix : SQL Layer for HBase.
* Apache Spark : Analytics Engine
* Apache ZooKeeper,
* Cloudera Impala,
* Apache Flume,
* Apache Sqoop,
* Apache Oozie,
* Apache Storm.
* Apache Mahout : machine learning algorithms for
- collaborative filtering,
- clustering and
- classification
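To make the Hive access path above concrete, here is a minimal sketch of querying HiveServer2 through its Thrift-based JDBC driver. The host, port, credentials, and the sales table are placeholders, not from the original post.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 exposes a Thrift service; the Hive JDBC driver speaks it.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive-host:10000/default", "user", "");
         Statement stmt = conn.createStatement();
         // The HiveQL below is compiled into MapReduce/Tez/Spark jobs.
         ResultSet rs = stmt.executeQuery(
             "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}
```

Similarly, a hedged sketch of the HBase Java client API mentioned above; the users table and its info column family are illustrative.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn =
             ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("users"))) {
      // Write one cell: row "row1", column family "info", qualifier "name".
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Alice"));
      table.put(put);

      // Read the cell back.
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(name));
    }
  }
}
```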
Architecture
* Master node :
Job Tracker,
Task Tracker,
NameNode (primary and secondary),
DataNode.
* A slave or worker node can be :
- a DataNode plus a Task Tracker (the Task Tracker runs in a separate JVM),
- a DataNode only, or
- compute only (a Task Tracker without a DataNode).
The Job Tracker and Task Tracker expose status and other information over an embedded Jetty web server.
HDFS
With the default replication factor of 3, data is stored on three nodes: two on the same rack and one on a different rack.
HDFS was designed for mostly immutable files; it is not suitable for concurrent write operations.
Files can be accessed through
- the native Java API (see the sketch below),
- the Thrift API (a binary RPC protocol),
- the CLI,
- the HDFS-UI web application over HTTP, or
- 3rd-party network client libraries.
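A minimal sketch of the native Java API path, assuming a placeholder NameNode URI and file path:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder URI
    try (FileSystem fs = FileSystem.get(conf)) {
      Path path = new Path("/tmp/example.txt");

      // Create and write; HDFS files are write-once, so there is no
      // support for concurrent writers.
      try (FSDataOutputStream out = fs.create(path, true)) {
        out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
      }

      // Read the file back.
      try (BufferedReader in = new BufferedReader(
               new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
        System.out.println(in.readLine());
      }

      // Replication is set per file; 3 is the usual cluster default.
      fs.setReplication(path, (short) 3);
    }
  }
}
```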
Monitoring
- Hortonworks
- Cloudera
- Datadog
File Systems
- HDFS
- FTP file system
- Amazon S3 (Simple Storage Service) object storage
- Windows Azure Storage Blobs (WASB) file system
- IBM General Parallel File System
- Parascale file system
- CloudIQ Storage product by Appistry
- location-aware IBRIX Fusion file system driver by HP
- MapR FS by MapR Technologies Inc
Hadoop Major Versions
Hadoop 1
Hadoop 2
- YARN
Hadoop 3
- support for multiple standby NameNodes
- support for containers
- decreased storage overhead with erasure coding
- use of GPU hardware within the cluster for deep learning
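As a rough arithmetic sketch of the erasure-coding saving (assuming the Reed-Solomon RS(6,3) policy): 3x replication turns 6 data blocks into 18 stored blocks, a 200% overhead, while RS(6,3) stores the 6 data blocks plus 3 parity blocks, 9 in total, a 50% overhead, and can still tolerate the loss of any 3 of those 9 blocks.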
Hadoop on AWS
- Amazon Elastic MapReduce (EMR)
- Amazon Elastic Compute Cloud (EC2)
- Amazon Simple Storage Service (S3)