Apache Spark


Spark is a lightning-fast unified analytics engine for large-scale data processing.

Spark APIs
- Dataset API (Spark 2.x)
- DataFrame API
- Resilient Distributed Dataset (RDD, Spark 1.x). An RDD can hold any type of Python, Java, or Scala objects.

Spark Cluster Managers
- standalone (native Spark cluster)
- Hadoop YARN
- Apache Mesos
- Kubernetes (K8s)

Spark Distributed Storage
- Alluxio
- Hadoop Distributed File System (HDFS)
- MapR File System (MapR-FS)
- Cassandra
- OpenStack Swift
- Amazon S3
- Kudu
- a custom solution

Languages
- Java
- Scala
- Python
- R
- SQL

Spark Components
1. Spark Core
- RDD-centric functional programming
2. Spark SQL
- DSL (domain-specific language) for manipulating DataFrames/Datasets
- supports a CLI and a JDBC/ODBC server
3. Spark Streaming
- consumes data from Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP/IP sockets
4. MLlib
- distributed machine learning library
5. GraphX
- distributed graph-processing framework

Spark can perform

- batch processing (similar to MapReduce)
- streaming
- interactive queries
- machine learning

References

https://spark.apache.org/third-party-projects.html
https://spark.apache.org/docs/latest/quick-start.html
https://github.com/apache/spark/tree/master/examples/src/main/python
