Apache Spark
Posted by
Manish Panchmatia
on Wednesday, March 6, 2019
Labels:
ArtificialIntelligence,
MachineLearning,
software
Spark is a lightning-fast, unified analytics engine for large-scale data processing.
Spark APIs
- Dataset API (Spark 2.x)
- DataFrame API
- Resilient Distributed Dataset (RDD, Spark 1.x). An RDD can hold any type of Python, Java, or Scala object.
Spark Cluster Manager
- native Spark cluster
- Hadoop YARN
- Apache Mesos
- Kubernetes (K8s)
Spark Distributed storage
- Alluxio,
- Hadoop Distributed File System (HDFS),
- MapR File System (MapR-FS),
- Cassandra,
- OpenStack Swift,
- Amazon S3,
- Kudu,
- a custom solution
Languages
- Java,
- Scala,
- Python,
- R, and
- SQL.
Spark Components
1. Spark Core
- RDD-centric functional programming
2. Spark SQL
- Domain-specific language (DSL) for manipulating datasets
- supports CLI and JDBC/ODBC server
3. Spark Streaming
- Consume from Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP/IP sockets
4. MLlib
5. GraphX
Spark can perform
- batch processing (similar to MapReduce)
- streaming,
- interactive queries, and
- machine learning
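To make the MapReduce comparison concrete, here is a plain-Python sketch of the map, shuffle, and reduce steps for a word count. It does not use Spark. The function names and input lines are hypothetical and only illustrate the model. In PySpark the same pipeline is commonly written with the RDD methods flatMap, map, and reduceByKey.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


# Hypothetical input; a real batch job would read from distributed storage.
lines = ["spark is fast", "spark is unified"]
word_counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(word_counts)
```

In a cluster, the map and reduce steps run in parallel on partitions of the data, and the shuffle moves records with the same key to the same worker.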
Reference
https://spark.apache.org/third-party-projects.html
https://spark.apache.org/docs/latest/quick-start.html
https://github.com/apache/spark/tree/master/examples/src/main/python