Apache Spark

Apache Spark is a fast, unified analytics engine for large-scale data processing.

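Spark APIs
- Dataset API: the main programming interface since Spark 2.x
- DataFrame API: a Dataset organized into named columns
- Resilient Distributed Dataset (RDD): the main programming interface in Spark 1.x. An RDD can hold any type of Python, Java, or Scala objects.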

Spark Cluster Managers
- Spark standalone (native Spark cluster)
- Hadoop YARN
- Apache Mesos
- Kubernetes (K8s)
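
Each cluster manager listed above is selected through the master URL handed to Spark, for example with `spark-submit --master`. The sketch below gathers the usual URL forms in one small helper. The helper name `master_url`, the host names, and the port defaults are illustrative assumptions, not part of Spark itself.

```python
# Sketch: typical Spark master URL forms, one per cluster manager.
# master_url() is a hypothetical helper; hosts and ports are placeholders.

def master_url(manager, host="localhost", port=None):
    """Return a master URL string for the given cluster manager."""
    if manager == "local":
        # Run Spark in-process, using all available cores.
        return "local[*]"
    if manager == "standalone":
        # Native Spark cluster; 7077 is the usual master port.
        return f"spark://{host}:{port or 7077}"
    if manager == "yarn":
        # The cluster location is read from the Hadoop configuration.
        return "yarn"
    if manager == "mesos":
        # 5050 is the usual Mesos master port.
        return f"mesos://{host}:{port or 5050}"
    if manager == "k8s":
        # Points at the Kubernetes API server; the port here is an assumption.
        return f"k8s://https://{host}:{port or 443}"
    raise ValueError(f"unknown cluster manager: {manager}")


if __name__ == "__main__":
    for name in ("local", "standalone", "yarn", "mesos", "k8s"):
        print(name, "->", master_url(name, host="cluster.example.com"))
```

The resulting string can be passed to `--master` on the command line or to `SparkSession.builder.master(...)` in application code.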

Spark Distributed Storage
- Alluxio
- Hadoop Distributed File System (HDFS)
- MapR File System (MapR-FS)
- Cassandra
- OpenStack Swift
- Amazon S3
- Kudu
- a custom solution

Languages
- Java
- Scala
- Python
- R
- SQL

Spark Components
1. Spark Core
- RDD-centric functional programming
2. Spark SQL
- Domain-specific language (DSL) for manipulating Datasets and DataFrames
- Supports a CLI and a JDBC/ODBC server
3. Spark Streaming
- Consumes data from Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP/IP sockets
4. MLlib
- Machine learning library
5. GraphX
- Graph processing library
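
To make the Spark SQL DSL more concrete, here is a minimal sketch that runs the same filter twice: once in plain Python as a reference, and once through the DataFrame DSL. The sample data `PEOPLE` and both function names are made up for illustration. The Spark variant assumes `pyspark` is installed and runs in local mode.

```python
# Sketch: one query in plain Python and with the Spark SQL DataFrame DSL.
# PEOPLE, adults_plain() and adults_spark() are hypothetical names.

PEOPLE = [("Ann", 34), ("Bob", 15), ("Cid", 52)]


def adults_plain(rows, min_age=18):
    """Reference result computed without Spark."""
    return sorted(name for name, age in rows if age >= min_age)


def adults_spark(rows, min_age=18):
    """Same query through the DataFrame DSL; requires pyspark."""
    # Imported lazily so this file still loads where Spark is not installed.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("dsl-sketch")
             .getOrCreate())
    try:
        df = spark.createDataFrame(rows, ["name", "age"])
        # filter() and select() are DSL calls, the counterpart of
        # SELECT name FROM people WHERE age >= min_age
        result = df.filter(df.age >= min_age).select("name").collect()
        return sorted(row["name"] for row in result)
    finally:
        spark.stop()


if __name__ == "__main__":
    print(adults_plain(PEOPLE))
```

With Spark available, `adults_spark(PEOPLE)` should return the same names as `adults_plain(PEOPLE)`.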

Spark can perform

- batch processing (similar to MapReduce)
- streaming
- interactive queries
- machine learning
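
The comparison between batch processing and MapReduce can be illustrated with a word count. The sketch below first spells out the map and reduce steps in plain Python, then expresses the same job with RDD transformations. The sample input `LINES` and the function names are assumptions for illustration. The Spark variant needs `pyspark` and runs in local mode.

```python
# Sketch: word count as a MapReduce-style batch job.
# LINES, word_count_plain() and word_count_spark() are hypothetical names.
from collections import defaultdict

LINES = ["to be or not to be", "to see or not to see"]


def word_count_plain(lines):
    """Plain-Python reference that mirrors the map and reduce phases."""
    # Map phase: emit a (word, 1) pair for every word.
    pairs = [(word, 1) for line in lines for word in line.split()]
    # Shuffle and reduce phase: sum the counts per word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def word_count_spark(lines):
    """Same job with RDD transformations; requires pyspark."""
    # Imported lazily so this file still loads where Spark is not installed.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("wordcount-sketch")
             .getOrCreate())
    try:
        rdd = spark.sparkContext.parallelize(lines)
        counts = (rdd.flatMap(lambda line: line.split())
                     .map(lambda word: (word, 1))
                     .reduceByKey(lambda a, b: a + b))
        return dict(counts.collect())
    finally:
        spark.stop()


if __name__ == "__main__":
    print(word_count_plain(LINES))
```

In the RDD version, `flatMap` and `map` play the role of the map phase, while `reduceByKey` covers the shuffle and reduce.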

Reference

https://spark.apache.org/third-party-projects.html
https://spark.apache.org/docs/latest/quick-start.html
https://github.com/apache/spark/tree/master/examples/src/main/python
