PySpark | Cookbook

Websites

  • The Blaze Ecosystem (Blaze)
  • Dask: Flexible library for parallel computing in Python.
  • DataShape: Data layout language for array programming. 
  • Odo: Shapeshifting for your data
    It efficiently migrates data from the source to the target through a network of conversions.

Reading Textfiles

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Demo") \
    .config("spark.demo.config.option", "demo-value") \
    .getOrCreate()

df = spark.read.csv("input.csv", header=True, sep="|")

from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd

sc = SparkContext('local','example') 
sql_sc = SQLContext(sc)

p_df = pd.read_csv('file.csv')  # assuming the file contains a header
# p_df = pd.read_csv('file.csv', names=['column 1', 'column 2'])  # if no header
s_df = sql_sc.createDataFrame(p_df)  # convert the pandas DataFrame into a Spark DataFrame
sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line)>1) \
    .map(lambda line: (line[0],line[1])) \
    .collect()
sc.textFile("input.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line)<=1) \
    .collect()
# `schema` is the StructType defined under "Read CSV file with known structure"
spark.read.csv(
    "input.csv", header=True, mode="DROPMALFORMED", schema=schema
)
(spark.read
    .schema(schema)
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .csv("input.csv"))
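
The two sc.textFile pipelines above split each line on a plain comma, which miscounts the fields as soon as a quoted value itself contains a comma. As a minimal sketch in plain Python (no Spark involved; the sample rows and the helper names naive_rows and csv_rows are made up for illustration), the difference from a CSV-aware parser looks like this:

```python
import csv
import io

# Hypothetical sample data, not taken from input.csv.
SAMPLE = 'id,name\n1,"Doe, Jane"\n2,Smith\nbroken\n'


def naive_rows(text):
    """Mirror the RDD pipeline: split on ',' and keep rows with more than one field."""
    rows = [line.split(",") for line in text.splitlines()]
    return [row for row in rows if len(row) > 1]


def csv_rows(text):
    """Apply the same filter, but let the csv module handle quoted fields."""
    return [row for row in csv.reader(io.StringIO(text)) if len(row) > 1]


naive = naive_rows(SAMPLE)
parsed = csv_rows(SAMPLE)
print(naive[1])   # the quoted name is split into two fields
print(parsed[1])  # the quoted name stays one field
```

This is one reason the spark.read.csv variants are usually the safer choice for real CSV files; the plain split is fine only when the data is known to contain no quoted separators.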

Read CSV file with known structure

from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType

schema = StructType([
    StructField("A", IntegerType()),
    StructField("B", DoubleType()),
    StructField("C", StringType())
])

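# Older API style based on the Databricks spark-csv package;
# assumes an existing SQLContext named sqlContext
(sqlContext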
    .read
    .format("com.databricks.spark.csv")
    .schema(schema)
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .load("input.csv"))
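
To make the idea behind mode DROPMALFORMED more concrete, here is a rough sketch in plain Python. It assumes that a row counts as malformed when it has the wrong number of fields or a value that cannot be cast to the column type; this is a simplification for illustration, not Spark's actual parser, and the names SCHEMA, parse_row and drop_malformed are hypothetical.

```python
# Column names and types mirror the StructType above; the casting rules
# are a simplification, not Spark's real implementation.
SCHEMA = [("A", int), ("B", float), ("C", str)]


def parse_row(line, sep=","):
    """Return a dict for a well-formed row, or None if it does not fit SCHEMA."""
    fields = line.split(sep)
    if len(fields) != len(SCHEMA):
        return None
    try:
        return {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}
    except ValueError:
        return None


def drop_malformed(lines):
    """Keep only the rows that parse cleanly against SCHEMA."""
    parsed = (parse_row(line) for line in lines)
    return [row for row in parsed if row is not None]


# Hypothetical sample rows (header already removed).
lines = ["1,2.5,foo", "x,3.0,bar", "2,4.0", "3,1.0,baz"]
rows = drop_malformed(lines)
print(rows)
```

In this sketch the second row is dropped because "x" is not an integer, and the third because it has only two fields.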

Build a Full-Stack Development Environment

In this post, we describe how to set up a modern development environment, so that you can work with tools such as Git, Gogs, Jenkins, CI/CD, and unit testing, using languages like Python, Groovy, and Bash scripting.

Introduction

Working as a software developer today places high demands on your know-how, especially because of the large number of existing products and technologies.

It is not easy to always stay up to date.

A first step toward this is to have your own software environment in which you can work and play with the necessary products.

This post describes how such an environment can be built for a full stack developer. Components of the environment are:

Jenkins | Build and Deploy a Groovy App

Introduction

Using Jenkins as an automation server for your development, you can automate such repeating tasks as testing and deploying your app.

Starting with a sample Groovy app (a simple calculator) with tests, you will learn how to integrate your app into Jenkins and build a pipeline, so that Jenkins runs the desired tasks every time you change the code.

Prepare the sources

Clone the sample repository from GitHub.

You should fork (copy) the demo repository into your own account, because you may change some files during this post and you will not get write permissions for the demo repository.

Also, clone the repository to your local machine to see what our demo app looks like.

$ git clone https://github.com/jenkins-toolbox/SampleApp_GroovyCalculator
Cloning into 'SampleApp_GroovyCalculator'...
remote: Enumerating objects: 194, done.
remote: Counting objects: 100% (194/194), done.
remote: Compressing objects: 100% (138/138), done.
remote: Total 194 (delta 44), reused 137 (delta 23), pack-reused 0
Receiving objects: 100% (194/194), 93.40 KiB | 817.00 KiB/s, done.
Resolving deltas: 100% (44/44), done.

Go into the newly created folder

$ cd SampleApp_GroovyCalculator/
$ ls
Jenkinsfile      README.md        bin              build.gradle     gradlew          src
Makefile         SampleCalculator build            gradle           settings.gradle

The first task Jenkins will perform in our pipeline is to build your app:

$ ./gradlew build

Because this is the first time you start gradlew, the required software is downloaded.

First, the current Gradle version is fetched (Gradle is the build tool used by this Groovy project):

Downloading https://services.gradle.org/distributions/gradle-6.2.1-bin.zip
………10%………20%………30%……….40%………50%………60%……….70%………80%………90%……….100%

Welcome to Gradle 6.2.1!

Here are the highlights of this release:
 - Dependency checksum and signature verification
 - Shareable read-only dependency cache
 - Documentation links in deprecation messages

For more details see https://docs.gradle.org/6.2.1/release-notes.html

Starting a Gradle Daemon, 2 stopped Daemons could not be reused, use --status for details

After this, your app is tested:

> Task :test

Calculator02Spec > two plus two should equal four PASSED

Calculator01Spec > add: 2 + 3 PASSED

Calculator01Spec > subtract: 4 - 3 PASSED

Calculator01Spec > multiply: 2 * 3 PASSED

BUILD SUCCESSFUL in 34s
5 actionable tasks: 5 executed

Perform the build again

No download is required. The build is much quicker.

$ ./gradlew build

BUILD SUCCESSFUL in 1s
5 actionable tasks: 5 up-to-date

Now, test our app:

$ ./gradlew clean test

> Task :test

Calculator02Spec > two plus two should equal four PASSED

Calculator01Spec > add: 2 + 3 PASSED

Calculator01Spec > subtract: 4 - 3 PASSED

Calculator01Spec > multiply: 2 * 3 PASSED

BUILD SUCCESSFUL in 4s
5 actionable tasks: 5 executed

Create a Jenkins Pipeline

Start by clicking on the BlueOcean menu item.

Hint: Blue Ocean is not installed with the default Jenkins installation.

You have to install the corresponding Plugins.

Select Manage Jenkins > Manage Plugins.

Then, select the tab Available and enter Blue Ocean in the Filter box.

Install all plugins that are listed.

Next, click on New Pipeline to create your first pipeline.

Use the item GitHub to specify where our code is stored.

Next, use your GitHub account.

Be sure that you have cloned the demo repository.

Next, we select the demo repository SampleApp_GroovyCalculator

Click on Create Pipeline and after a few seconds, the pipeline is created.

Immediately after creating the pipeline, Jenkins starts it and runs all included steps.

If everything went well, you see a positive status.

Now, click on the pipeline (e.g. the text master or the status icon) and you will see the pipeline with all steps and their corresponding state.

If you want to edit the pipeline, for example to add another step, click on the pencil in the header.

Click on Cancel to leave the Pipeline editor.

Hint: If you click on Save, all changes are pushed back to the repository and Jenkins starts the Pipeline again.

Run the Pipeline

If you want to run your pipeline, click on the rerun icon for your pipeline

Jenkins | Cookbook

Working with VS Code

Validate Jenkins File

Install VS Code Plugin Jenkins Pipeline Linter Connector

Add configuration in .vscode/settings.json

"jenkins.pipeline.linter.connector.crumbUrl": "<JENKINS_URL>/crumbIssuer/api/xml?xpath=concat(//crumbRequestField,%22:%22,//crumb)",
"jenkins.pipeline.linter.connector.user": "<USERNAME>",
"jenkins.pipeline.linter.connector.pass": "<PASSWORD>",
"jenkins.pipeline.linter.connector.url": "<JENKINS_URL>/pipeline-model-converter/validate",

Replace <USERNAME>, <PASSWORD> and <JENKINS_URL> with your values, for example:

"jenkins.pipeline.linter.connector.crumbUrl": "http://localhost:8080/crumbIssuer/api/xml?xpath=concat(//crumbRequestField,%22:%22,//crumb)",
"jenkins.pipeline.linter.connector.user": "admin",
"jenkins.pipeline.linter.connector.pass": "secret",
"jenkins.pipeline.linter.connector.url": "http://localhost:8080/pipeline-model-converter/validate",

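If you maintain several Jenkins instances, the four entries above can be generated instead of edited by hand. The following is a small sketch in Python, assuming the same setting keys as shown above; the helper name linter_settings is hypothetical, and the values are the example values from this section.

```python
import json


def linter_settings(jenkins_url, user, password):
    """Build the four linter-connector settings for one Jenkins instance."""
    base = jenkins_url.rstrip("/")
    prefix = "jenkins.pipeline.linter.connector"
    crumb_path = "/crumbIssuer/api/xml?xpath=concat(//crumbRequestField,%22:%22,//crumb)"
    return {
        f"{prefix}.crumbUrl": base + crumb_path,
        f"{prefix}.user": user,
        f"{prefix}.pass": password,
        f"{prefix}.url": base + "/pipeline-model-converter/validate",
    }


settings = linter_settings("http://localhost:8080", "admin", "secret")
# Print a JSON fragment whose entries can be merged into .vscode/settings.json
print(json.dumps(settings, indent=4))
```

The printed entries still have to be merged into your existing .vscode/settings.json; the sketch does not write the file itself.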
Working with Jenkins Client (CLI)

Download Client

wget localhost:8080/jnlpJars/jenkins-cli.jar

Working with Plugins

Create a Plugin

mkdir SamplePlugin
cd SamplePlugin
mvn -U archetype:generate -Dfilter="io.jenkins.archetypes:"
mvn -U archetype:generate -Dfilter="io.jenkins.archetypes:global-configuration-plugin"
[INFO] Scanning for projects...
Downloading from central: https://repo.maven.apache.org/maven2/org/apache/maven/plugins/maven-metadata.xml
Downloading from central: https://repo.maven.apache.org/maven2/org/codehaus/mojo/maven-metadata.xml
Downloaded from central: https://repo.maven.apache.org/maven2/org/apache/maven/plugins/maven-metadata.xml (14 kB at 32 kB/s)
Downloaded from central: https://repo.maven.apache.org/maven2/org/codehaus/mojo/maven-metadata.xml (20 kB at 44 kB/s)
Downloading from central: https://repo.maven.apache.org/maven2/org/apache/maven/plugins/maven-archetype-plugin/maven-metadata.xml
Downloaded from central: https://repo.maven.apache.org/maven2/org/apache/maven/plugins/maven-archetype-plugin/maven-metadata.xml (918 B at 18 kB/s)
[INFO]
[INFO] ------------------< org.apache.maven:standalone-pom >-------------------
[INFO] Building Maven Stub Project (No POM) 1
[INFO] --------------------------------[ pom ]---------------------------------
[INFO]
[INFO] >>> maven-archetype-plugin:3.1.2:generate (default-cli) > generate-sources @ standalone-pom >>>
[INFO]
[INFO] <<< maven-archetype-plugin:3.1.2:generate (default-cli) < generate-sources @ standalone-pom <<<
[INFO]
[INFO]
[INFO] --- maven-archetype-plugin:3.1.2:generate (default-cli) @ standalone-pom ---
[INFO] Generating project in Interactive mode
[INFO] No archetype defined. Using maven-archetype-quickstart (org.apache.maven.archetypes:maven-archetype-quickstart:1.0)
Choose archetype:
1: remote -> io.jenkins.archetypes:global-configuration-plugin (Skeleton of a Jenkins plugin with a POM and an example piece of global configuration.)
Choose a number or apply filter (format: [groupId:]artifactId, case sensitive contains): : 1
Choose io.jenkins.archetypes:global-configuration-plugin version:
1: 1.2
2: 1.3
3: 1.4
4: 1.5
5: 1.6
Choose a number: 5:
[INFO] Using property: groupId = unused
Define value for property 'artifactId': com.examples.jenkins.plugins
Define value for property 'version' 1.0-SNAPSHOT: :
[INFO] Using property: package = io.jenkins.plugins.sample
Confirm properties configuration:
groupId: unused
artifactId: com.examples.jenkins.plugins
version: 1.0-SNAPSHOT
package: io.jenkins.plugins.sample
 Y: : y
[INFO] ----------------------------------------------------------------------------
[INFO] Using following parameters for creating project from Archetype: global-configuration-plugin:1.6
[INFO] ----------------------------------------------------------------------------
[INFO] Parameter: groupId, Value: unused
[INFO] Parameter: artifactId, Value: com.examples.jenkins.plugins
[INFO] Parameter: version, Value: 1.0-SNAPSHOT
[INFO] Parameter: package, Value: io.jenkins.plugins.sample
[INFO] Parameter: packageInPathFormat, Value: io/jenkins/plugins/sample
[INFO] Parameter: version, Value: 1.0-SNAPSHOT
[INFO] Parameter: package, Value: io.jenkins.plugins.sample
[INFO] Parameter: groupId, Value: unused
[INFO] Parameter: artifactId, Value: com.examples.jenkins.plugins
[INFO] Project created from Archetype in dir: /Users/Shared/CLOUD/Kunde.BSH/workspace/SamplePlugin_Config/com.examples.jenkins.plugins
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  45.525 s
[INFO] Finished at: 2020-03-01T17:28:27+01:00
[INFO] ------------------------------------------------------------------------

Verify Plugin

cd com.examples.jenkins.plugins
mvn verify

Run Plugin

mvn hpi:run

Working with Groovy Scripts

Include a common Groovy script in a Jenkinsfile

1: Create a common.groovy file with the functions you need

def mycommoncode() {
}

return this

2: In the main Jenkinsfile, load the file and use the function as shown below

node {
    def common = load 'common.groovy'
    common.mycommoncode()
}

Basic example of Loading Groovy scripts

File example.groovy

def example1() {
    println 'Hello from example1'
}

def example2() {
    println 'Hello from example2'
}

return this

The example.groovy script defines the functions example1 and example2 and ends with return this. Note that return this is required; a common mistake is to forget to end the Groovy script with it.

Jenkinsfile

def code

node('java-agent') {
    stage('Checkout') { checkout scm }
    stage('Load') { code = load 'example.groovy' }
    stage('Execute') { code.example1() }
}

code.example2()

Processing Github JSON from Groovy

In this demo, we first show how to process a JSON response from the GitHub API in Groovy.

Processing JSON from GitHub

import groovy.json.JsonSlurper

String username = System.getenv('GITHUB_USERNAME')
String password = System.getenv('GITHUB_PASSWORD')
String GITHUB_API = 'https://api.github.com/repos'
String repo = 'groovy'
String PR_ID = '2' // Pull request ID
String url = "${GITHUB_API}/${username}/${repo}/pulls/${PR_ID}"

println "Querying ${url}"

def text = url.toURL().getText(requestProperties: ['Authorization': "token ${password}"])
def json = new JsonSlurper().parseText(text)

def bodyText = json.body
// Check if Pull Request body has certain text
if (bodyText.find('Safari')) {
    println 'Found Safari user'
}

The equivalent bash command for retrieving the JSON response from the GitHub API is as follows:

Equivalent bash command

// Groovy formatted string 
String cmd = "curl -s -H \"Authorization: token ${password}\" ${url}" 
// Example String 
example = 'curl -s -H "Authorization: token XXX" https://api.github.com/repos/tdongsi/groovy/pulls/2'
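
For comparison, the same request and check can be sketched in Python with only the standard library. This is an illustration under assumptions, not part of the original demo: the helper names build_pr_request and body_mentions are hypothetical, the owner, repository and token reuse the placeholder values from the curl example above, and the sample payload is made up and trimmed to the one field that is used.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com/repos"


def build_pr_request(owner, repo, pr_id, token):
    """Prepare (but do not send) the authenticated pull request query."""
    url = f"{GITHUB_API}/{owner}/{repo}/pulls/{pr_id}"
    return urllib.request.Request(url, headers={"Authorization": f"token {token}"})


def body_mentions(response_text, needle):
    """Return True if the pull request body contains the given text."""
    body = json.loads(response_text).get("body") or ""
    return needle in body


request = build_pr_request("tdongsi", "groovy", "2", "XXX")
print(request.full_url)

# Hypothetical response payload, trimmed to the one field used here.
sample = '{"number": 2, "body": "Fails for Safari users"}'
print(body_mentions(sample, "Safari"))
```

To query the real API, pass the prepared request to urllib.request.urlopen and feed the decoded response text to body_mentions.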

Processing Github JSON from Jenkinsfile

Continuing the demo from the last section, we now put the Groovy code into a callable function in a script called “github.groovy”. Then, in our Jenkinsfile, we load the script and use the function to process the JSON response from the GitHub API.

github.groovy

import groovy.json.JsonSlurper

def getPrBody(String githubUsername, String githubToken, String repo, String id) {
    String GITHUB_API = 'https://api.github.com/repos'
    String url = "${GITHUB_API}/${githubUsername}/${repo}/pulls/${id}"

    println "Querying ${url}"

    def text = url.toURL().getText(requestProperties: ['Authorization': "token ${githubToken}"])
    def json = new JsonSlurper().parseText(text)
    def bodyText = json.body
    return bodyText
}

return this

Jenkinsfile

def code

node('java-agent') {
    stage('Checkout') { checkout scm }
    stage('Load') { code = load 'github.groovy' } 
    stage('Execute') {

Best Practice