Apache Spark for the Impatient

 Analytics Vidhya

Below is a list of the most important topics in Spark that everyone who does not have the time to go through an entire book but wants to discover the amazing power of this distributed computing…

Read more at Analytics Vidhya

Beginner’s Guide to Apache Spark

 Level Up Coding

The company founded by the creators of Spark — Databricks — summarizes its functionality best in their Gentle Intro to Apache Spark eBook (highly recommended read — link to PDF download provided at…

Read more at Level Up Coding

Getting started with Apache Spark — Part 1

 Analytics Vidhya

In this era of big data, where mind-boggling amounts of data are being created every minute, it is becoming increasingly important for businesses to analyze this data for quick insights. This has…

Read more at Analytics Vidhya

Getting Started with Apache Spark

 Towards Data Science

Medium article on the architecture of Apache Spark, with implementations of some core APIs in Java (with code) and memory and performance tuning for better-running jobs.

Read more at Towards Data Science

A Beginner’s Guide to Apache Spark

 Towards Data Science

The company founded by the creators of Spark — Databricks — summarizes its functionality best in their Gentle Intro to Apache Spark eBook (highly recommended read - link to PDF download provided at…

Read more at Towards Data Science

1. Introduction To Apache Spark

 Towards Data Science

Apache Spark is a popular framework in the field of Big Data. Coming from a background of coding in Python and SQL, it didn’t take me long to get my hands on using Spark. However, without…

Read more at Towards Data Science

High Level Overview of Apache Spark

 Better Programming

Spark is the cluster computing framework for large-scale data processing. Spark offers a set of libraries in 3 languages (Java, Scala, Python) for its unified computing engine.

Read more at Better Programming

The What, Why, and When of Apache Spark

 Towards Data Science

Spark has been called a “general purpose distributed data processing engine”¹ and “a lightning fast unified analytics engine for big data and machine learning”². It lets you process big data sets…

Read more at Towards Data Science

Apache Spark: A Conceptual Orientation

 Towards Data Science

Apache Spark, once part of the Hadoop ecosystem, is a powerful open-source, general-purpose distributed data-processing engine that provides real-time stream processing, interactive processing, graph…

Read more at Towards Data Science

A n00bs guide to Apache Spark

 Towards Data Science

I wrote this guide to help myself understand the basic underlying functions of Spark, where it fits in the Hadoop ecosystem and how it works in Java and Scala. I hope it helps you as much as it helped…

Read more at Towards Data Science

Apache Spark with Python

 Python in Plain English

What is Apache Spark? Apache Spark is an open-source processing system that is distributed and commonly utilized for dealing with large-scale data workloads. The system is designed to ensure fast anal...

Read more at Python in Plain English

Apache Spark for Data Science — How to Install and Get Started with PySpark

 Towards Data Science

Install PySpark locally and load your first dataset. Only 5 minutes required.

Read more at Towards Data Science

Apache Spark Primer

 Analytics Vidhya

Apache Spark is an open-source, fast, distributed cluster-computing framework for large-scale data processing. Spark is an execution engine that runs not only on Hadoop YARN but also on Apache Mesos…

Read more at Analytics Vidhya

Apache Spark 3.0: The 5 Most Exciting New Features

 Towards Data Science

A new major release was made available on the 10th of June 2020 for Apache Spark. Version 3.0 — a result of more than 3,400 tickets — builds on top of version 2.x and comes with numerous features —…

Read more at Towards Data Science

Apache Spark Optimization Techniques

 Towards Data Science

A review of some of the most common Spark performance problems and how to address them.

Read more at Towards Data Science

Analyzing Data and Performance Tuning of Apache Spark Engine

 Analytics Vidhya

Apache Spark is a fast, in-memory processing framework designed to support and process big data. Any form of data which is immensely large in size (i.e. GBs, TBs, PBs) and unable to be processed…

Read more at Analytics Vidhya

Which Language to choose when working with Apache Spark

 Javarevisited

I have been working with Java for 7 years now and lately started working with Apache Spark for some real-world big data and data science projects. And when starting with Apache Spark, and based on ...

Read more at Javarevisited

Apache Spark — Fast and Furious.

 Analytics Vidhya

Apache Spark is a data processing framework that can quickly perform processing tasks on very large data sets, and can also distribute data processing tasks across multiple computers, either on its…

Read more at Analytics Vidhya

Finding Needle in Haystack with Apache Spark

 Towards Data Science

TL;DR: Customer churn is a big deal for businesses, and predicting which users are likely to churn can be difficult with ever-growing (big) data. Apache Spark allows data scientists to do data…

Read more at Towards Data Science

Big Data Engineering —  Apache Spark

 Towards Data Science

This is part 2 of a series on data engineering in a big data environment. It will reflect my personal journey of lessons learnt and culminate in the open source tool Flowman I created to take the…

Read more at Towards Data Science

Spark 3.0  — New Functions in a Nutshell

 Javarevisited

The Apache Spark community recently released a preview of Spark 3.0, which holds many significant new features that will help Spark make a powerful mark. Spark already has a wide range of enterprise u...

Read more at Javarevisited

Introduction to Apache Spark with Scala

 Towards Data Science

This article is a follow-up note for the March edition of the Scala-Lagos meet-up, where we discussed Apache Spark, its capabilities and use cases, as well as a brief example in which the Scala API was used…

Read more at Towards Data Science

Running a Spark Job in less than 10 minutes with No Infrastructure

 Towards Data Science

A quick hands-on tutorial on setting up Spark with Google Cloud Platform.

Read more at Towards Data Science

Beginners guide to Apache Spark for data analytics — Part 1

 Analytics Vidhya

A Spark DataFrame is a distributed collection of data organized into named columns, equivalent to a table in a relational database. DataFrames can be constructed from a wide array of sources such as: structu...

Read more at Analytics Vidhya