Comparison: Spark vs Hadoop and On-premise vs Cloud

In the last session (Session 2) we discussed some challenges we were facing with Hadoop. Let's see how we can solve those challenges.

Spark comes to the rescue.
What is Spark?
It is an in-memory distributed processing engine. It should be compared with MapReduce, not with Hadoop as a whole, because it replaces only the computation layer.

Advantages
- Very flexible - it is a portable framework, meaning we can use it with many different storage systems and resource managers
Example:
Any big data application needs three components
1. Scalable storage
2. Scalable computation
3. Resource Manager

With Hadoop the options are very limited:
Scalable storage - HDFS
Scalable computation - MapReduce
Resource Manager - YARN

 

With Spark, on the other hand, we have multiple options:
Scalable storage - HDFS, S3, ADLS Gen2, GCS
Scalable computation - Spark itself, or managed services that run it: Databricks, Amazon EMR, Azure Synapse, Google Cloud Dataproc
Resource Manager - YARN, Mesos, Kubernetes
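
To make the "mix and match" idea above concrete, here is a minimal sketch in plain Python (it does not need Spark installed). It only shows that changing storage or resource manager mostly means changing a URI scheme or a master setting. The class name `SparkDeployment`, the host names and the bucket name are made up for illustration; the schemes (`hdfs://`, `s3a://`, `abfss://`, `gs://`) and master prefixes (`yarn`, `mesos://`, `k8s://`) are the ones commonly used with Spark, but check your own cluster setup before relying on them.

```python
from dataclasses import dataclass

# URI schemes commonly used to read from each storage system listed above.
STORAGE_SCHEMES = {
    "HDFS": "hdfs://",
    "S3": "s3a://",
    "ADLS Gen2": "abfss://",
    "GCS": "gs://",
}

# Prefix of the Spark master setting for each resource manager listed above.
MASTER_PREFIXES = {
    "YARN": "yarn",
    "Mesos": "mesos://",
    "Kubernetes": "k8s://",
}


@dataclass(frozen=True)
class SparkDeployment:
    """Hypothetical helper: one choice of storage plus one resource manager."""
    storage: str
    resource_manager: str

    def input_path(self, location: str) -> str:
        # The job logic stays the same; only the path prefix changes.
        return STORAGE_SCHEMES[self.storage] + location

    def master(self, endpoint: str = "") -> str:
        # YARN needs no endpoint here; Mesos and Kubernetes take a host/URL.
        return MASTER_PREFIXES[self.resource_manager] + endpoint


# Two example combinations (host and bucket names are placeholders).
on_prem = SparkDeployment("HDFS", "YARN")
cloud = SparkDeployment("S3", "Kubernetes")

print(on_prem.master(), on_prem.input_path("namenode:8020/data/events"))
print(cloud.master("https://k8s.example.com:6443"),
      cloud.input_path("my-bucket/data/events"))
```

The point of the sketch is that the same Spark job can be pointed at a different storage layer or cluster manager without rewriting the transformation logic.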

- Relatively easy to learn, with several APIs available for writing transformations: RDDs, DataFrames and Spark SQL
- Often cited as 10 to 100x faster than MapReduce, because intermediate data is kept in memory (RDDs) instead of being written to disk between steps
- Can do batch as well as streaming processing

Companies are moving from on-premise to the cloud to build their applications and software.

So, first let's see what that actually means.

On-premise vs Cloud
On-premise - owning the IT infrastructure (hardware and software) used to build applications
- Need to buy the machines and set up the networking yourself
- Need space/infra to house these machines and to maintain them (cooling, applying patches and upgrades, security, and an admin team)
- Cannot scale instantly if unexpected demand comes - not scalable
- Huge cap-ex and op-ex (hardware failures, licensed software) - not cost effective
- High latency if the application is used from a different geo location/country
- No disaster recovery unless you build and maintain a second site yourself

Cloud - hardware and software are provided by a cloud provider and accessed over the internet, for example through a web interface
- No need to buy machines; the cloud provider supplies them and takes care of networking (public IP and VNet)
- With a couple of clicks your machines are ready - agile
- No need for space, as the cloud provider takes care of it
- Can scale up quickly, even in unexpected demand scenarios - scalable
- No cap-ex and low op-ex (maintenance cost is built into what the provider charges) - pay as you go
- Geo distribution - low latency
- Disaster recovery
- The cloud provider is the service provider and the users are tenants
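
The cap-ex vs pay-as-you-go point above is really just arithmetic, so here is a small sketch of it. Every number below (server price, hourly rate, usage pattern) is invented purely for illustration and is not real pricing; with a different usage pattern, such as a fleet that is busy all year, the comparison can come out the other way.

```python
from typing import Sequence


def on_prem_cost(servers: int, price_per_server: float,
                 monthly_opex_per_server: float, months: int) -> float:
    """Up-front purchase (cap-ex) plus running cost (op-ex) for a fixed fleet."""
    capex = servers * price_per_server
    opex = servers * monthly_opex_per_server * months
    return capex + opex


def cloud_cost(hourly_rate: float, server_hours_per_month: Sequence[float]) -> float:
    """Pay as you go: billed only for the server-hours actually used."""
    return hourly_rate * sum(server_hours_per_month)


HOURS_PER_MONTH = 730  # rough average, assumption for this sketch

# On-premise must be sized for the peak: 10 servers owned for the whole year.
on_prem = on_prem_cost(servers=10, price_per_server=5_000,
                       monthly_opex_per_server=100, months=12)

# Cloud: 2 servers for 11 quiet months, 10 servers for 1 peak month.
usage = [2 * HOURS_PER_MONTH] * 11 + [10 * HOURS_PER_MONTH]
cloud = cloud_cost(hourly_rate=0.50, server_hours_per_month=usage)

print(f"on-premise: {on_prem:,.0f}")  # on-premise: 62,000
print(f"cloud:      {cloud:,.0f}")    # cloud:      11,680
```

The gap in this made-up scenario comes from paying for peak capacity all year on-premise while the cloud bill follows actual usage.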

Cloud Types
Public - In a public cloud, you share the same hardware, storage, and network devices with other organisations or cloud tenants. We use a public cloud when the data is not highly confidential. (Banks mostly use on-premise/private cloud.)

Private - This is also a cloud-like setup with on-demand deployment, but the cloud computing services and infra (servers) are hosted privately within a company's own environment using proprietary resources. Example vendors: VMware, Dell, Oracle, etc.

Hybrid - A combination of both: if a company has both generic and confidential data, it uses the public cloud for generic data and the private cloud for highly confidential data.
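
The hybrid rule (generic data on the public cloud, highly confidential data on the private cloud) can be sketched as a tiny routing function. The dataset names and the two sensitivity labels are made up for this example; a real company would have its own classification policy.

```python
from enum import Enum


class Sensitivity(Enum):
    GENERIC = "generic"
    CONFIDENTIAL = "confidential"


def choose_cloud(sensitivity: Sensitivity) -> str:
    """Hybrid placement rule: confidential data stays on the private cloud."""
    return "private" if sensitivity is Sensitivity.CONFIDENTIAL else "public"


# Hypothetical datasets and their classification.
datasets = {
    "product_catalog": Sensitivity.GENERIC,
    "customer_accounts": Sensitivity.CONFIDENTIAL,
}

placement = {name: choose_cloud(level) for name, level in datasets.items()}
print(placement)
# {'product_catalog': 'public', 'customer_accounts': 'private'}
```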

Types of services available in cloud:

IaaS (Infrastructure as a Service): This is like renting a computer over the internet. You get full control over the virtual machines and can install whatever software you need. Example service: Azure Virtual Machines.

PaaS (Platform as a Service): PaaS gives you a platform to develop, run, and manage applications without dealing with the underlying infrastructure. Example: Azure App Service.

SaaS (Software as a Service): With SaaS, you use software that's hosted in the cloud, accessible via the internet. You don't need to install or maintain it; just use it through your web browser. Example: Microsoft Office 365.

Serverless Models: Serverless means you don't need to manage the servers yourself. You just write and upload your code, and the cloud provider takes care of running and scaling it automatically. Example: Azure Functions.
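
To give a feel for "you just write and upload your code", here is a minimal sketch of the kind of function you would hand to a serverless platform. The handler name `handle_request` and the dict-based event shape are assumptions for illustration only; real Azure Functions in Python use the `azure.functions` package with its own request and response types.

```python
import json


def handle_request(event: dict) -> dict:
    """Hypothetical handler: the platform would call this once per request."""
    name = event.get("name", "world")
    return {
        "status": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


# Locally we have to call the function ourselves. On a serverless platform
# the provider invokes it and scales the number of instances with demand.
print(handle_request({"name": "Spark"}))
# {'status': 200, 'body': '{"message": "Hello, Spark!"}'}
```

Notice there is no server setup, port binding or scaling logic in the code; that part is what the provider takes over.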