PySpark | Cookbook

Ralph/ March 31, 2020/ Apache Spark, Cookbook, PySpark

Websites

The Blaze Ecosystem (Blaze)
Dask: Flexible library for parallel computing in Python.
DataShape: Data layout language for array programming.
Odo: Shapeshifting for your data. It efficiently migrates data from the source to the target through a network of conversions.

Reading Textfiles

Read CSV file with known structure
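A minimal sketch of reading a CSV file whose structure is known in advance. The column names, the DDL schema string, the application name, and the helper `read_csv_with_schema` are illustrative assumptions, not taken from the original post. Passing an explicit schema means Spark does not have to infer column types with an extra pass over the file.

```python
# Hypothetical schema for illustration: adjust names and types to your file.
PEOPLE_SCHEMA = "id INT, name STRING, signup_date DATE"


def read_csv_with_schema(path, schema=PEOPLE_SCHEMA):
    """Read a CSV file with a header row into a DataFrame using a fixed schema."""
    # Imported here so that defining the helper does not require a running Spark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("read-csv-known-structure")  # assumed application name
        .getOrCreate()
    )
    # `schema` accepts a DDL-formatted string or a StructType;
    # `header=True` treats the first line as column names, not data.
    return spark.read.csv(path, schema=schema, header=True)
```

With a local Spark installation, `read_csv_with_schema("people.csv").printSchema()` should list the three declared columns with their types; `people.csv` is a placeholder file name.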

Apache Spark | Getting started

Ralph/ June 28, 2019/ Apache, Apache Spark

Apache Spark is a cluster computing framework designed for fast computation. It builds on the Hadoop MapReduce model and extends it to efficiently support more types of computation, including interactive queries and stream processing. This is an extract from a brief tutorial that explains the basics of Spark Core programming.

Environment / Requirements

Spark
Java
