Apache Spark | Getting started

Apache Spark is a cluster computing framework designed for fast computation. It builds on the Hadoop MapReduce model and extends it to efficiently support more types of computation, including interactive queries and stream processing.

This is an extract from a brief tutorial that explains the basics of Spark Core programming.

Environment / Requirements

Installation on Mac OS X

Check or install Java

$ java -version
java version "12.0.1" 2019-04-16
Java(TM) SE Runtime Environment (build 12.0.1+12)
Java HotSpot(TM) 64-Bit Server VM (build 12.0.1+12, mixed mode, sharing)

Check or install Scala

$ brew install scala
$ scala -version
Scala code runner version 2.13.0 -- Copyright 2002-2019, LAMP/EPFL and Lightbend, Inc.

Check or install Apache Spark





Set up the environment in .bashrc

export PATH="$PATH:$SPARK_HOME/bin"
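The export above only has the intended effect if SPARK_HOME is already defined. A minimal sketch for checking this, assuming a hypothetical helper named spark_bin_on_path that inspects an environment mapping (by default the environment of the current process):

```python
import os


def spark_bin_on_path(env=None):
    """Return True if $SPARK_HOME/bin is listed in PATH.

    `env` is a mapping of environment variables; it defaults to os.environ.
    This helper is hypothetical and only illustrates the check.
    """
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        # SPARK_HOME is unset or empty, so "$SPARK_HOME/bin" would expand to "/bin".
        return False
    spark_bin = os.path.join(spark_home, "bin")
    path_entries = env.get("PATH", "").split(os.pathsep)
    return spark_bin in path_entries


if __name__ == "__main__":
    print("Spark bin directory on PATH:", spark_bin_on_path())
```

If the check reports False, SPARK_HOME probably needs to be exported before the PATH line in .bashrc.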

Installation on Ubuntu

Prepare Ubuntu

apt update
apt upgrade
apt-get install openjdk-8-jdk
java -version

Links and Resources

Build a Jekyll Template based on Bootstrap 4

TL;DR

Combine two amazing open source tools: Jekyll and Bootstrap. The final template is here.

Bootstrap Template and Jekyll: two powerful tools

Start Point

Because I wanted to learn about and work with Bootstrap, I decided to build a Jekyll template, so that I could build a dynamic website.

Asking Google for some first inspiration led me to this wonderful blog entry:

Choose a Bootstrap Template

Quite nice. So I decided to use one of the free templates from Start Bootstrap: Modern Business.

When I downloaded the template from GitHub and examined the content, I found that for each component (Pricing, Service, Contact) there is a corresponding HTML file with all the content and all the formatting code:

  • about.html
  • blog-home-1.html
  • blog-home-2.html
  • blog-post.html
  • contact.html
  • faq.html
  • full-width.html
  • index.html
  • portfolio-1-col.html
  • portfolio-2-col.html
  • portfolio-3-col.html
  • portfolio-4-col.html
  • portfolio-item.html
  • pricing.html
  • services.html
  • sidebar.html

The Plan

My plan was to separate the presentation layer (what you will see) from the business layer (what creates the content for the presentation layer).

To achieve this with Jekyll, I converted the Bootstrap pages to Jekyll include pages. The final result should look like this:

The front page for the component:

---
layout: page
title: Services
---
<div class="container">
    <h1 class="mt-4 mb-3">{{ page.title }}</h1>
    {% include component/services.html %}
</div>

The Jekyll include file with the component:

{% assign images = site.url | append: '/' | append: site.baseurl | append: '/assets/img/services' %}

<h2>Services: {{ site.services.title }}</h2>

<!-- Image Header -->
<img class="img-fluid rounded mb-4" src="{{ images }}/header.jpg" alt="">

<div class="row">
    {% for item in site.services %}
        <div class="col-lg-4 mb-4">
            <div class="card h-100">
                <h4 class="card-header">{{ item.title }}</h4>
                <div class="card-body">
                    <p class="card-text">{{ item.text | markdownify }}</p>
                </div>
                <div class="card-footer">
                    <a href="#" class="btn btn-primary">Learn More</a>
                </div>
            </div>
        </div>
    {% endfor %}
</div>

The next step was to convert every Bootstrap template page to a Jekyll include file.

About Page

FAQ Page

Portfolio Page with 1 Column

Portfolio Page with 2 Columns

Services Page

Pricing Page

The main challenge in separating the presentation from the business layer was: where to place the data to be displayed?

Depending on the type of the component, I chose one of three solutions:

  1. Place the data in the corresponding include file of the component
  2. Place the data in the page that calls the corresponding include file of the component
  3. Place the data in a Jekyll collection file

Data in corresponding include file of the component

I used this approach for components that are used only once on the website and have mostly static content, e.g. the FAQ page.

The component page

The frontend page

Data in the page that calls the corresponding include file of the component

I used this approach for components that are used more than once on the website, e.g. a blog post.

The component page

The frontend page

Data in a Jekyll collection file

I used this approach for components that are used only once on the website but need more configuration information, e.g. the Services or Portfolio page.

This step needs an additional configuration task: create the Jekyll Collections.

Jekyll collections are a great way to group related content like members of a team or talks at a conference.

To use a Collection you first need to define it in your _config.yml.

#
collections_dir: collections # folder, where collections files are stored
collections:
  services:
    title: "Services"
    output: true # store output files for each item under the collections folder

Then create the collection files, one file for each item in your collection:

These files look like this:

---
img: 1.jpg
title: Development
subtitle: 
footer: 
text: Lorem ipsum dolor sit amet, consectetur adipisicing elit. Possimus aut mollitia eum ipsum fugiat odio officiis odit.
---
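To make the structure of such a collection file concrete, here is a minimal Python sketch that reads the front matter into a dictionary. The function name parse_front_matter is hypothetical, and the parser only handles flat key: value lines. Real front matter is YAML, and Jekyll reads it itself, so this is an illustration rather than a replacement.

```python
def parse_front_matter(text):
    """Split a Jekyll-style file into (front matter dict, body).

    Only flat `key: value` lines are handled; real front matter is YAML
    and should be parsed with a YAML library.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    data = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the body.
            body = "\n".join(lines[index + 1:])
            return data, body
        key, sep, value = line.partition(":")
        if sep:
            data[key.strip()] = value.strip()
    # No closing delimiter found: treat the whole text as body.
    return {}, text


if __name__ == "__main__":
    sample = "---\nimg: 1.jpg\ntitle: Development\nsubtitle:\n---\n"
    item, _ = parse_front_matter(sample)
    print(item["title"], item["img"])
```

Each key in the resulting dictionary corresponds to an attribute that Liquid exposes on the collection item, such as item.title.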

The data in these files can be accessed in the Jekyll include file as follows:

  • all items of the collection: site.services, {% for item in site.services %}
  • data of each item: {{ item.title }}, {{ item.text | markdownify }}

{% for item in site.services %}
    <div class="col-lg-4 mb-4">
        <div class="card h-100">
            <h4 class="card-header">{{ item.title }}</h4>
            <div class="card-body">
                <p class="card-text">{{ item.text | markdownify }}</p>
            </div>
            <div class="card-footer">
                <a href="#" class="btn btn-primary">Learn More</a>
            </div>
        </div>
    </div>
{% endfor %}

The final result

Bootstrap Template and Jekyll: two powerful tools

Jekyll | Cookbook

Working with Arrays

Define the array

---
layout: post
title:  "Universe"
date:   2019-06-17 10:00:00
planets:
    - mercury
    - venus
    - earth
---

Access the array

{% for planet in page.planets %}
    <a href="https://{{ planet }}.universe">{{ planet }}</a>
{% endfor %}

Liquid

Links

https://github.com/Shopify/liquid/wiki/Liquid-for-Designers#optional-arguments

Code Snippets

for-loop-sorted-collection

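<ul>
    {% assign sorted = site.collection_name | sort: 'date' | reverse %}
    {% for item in sorted %}
    <li>{{ item.title }}</li>
    {% endfor %}
</ul>

Alternatively, iterate over the collection in reverse order directly:

{% for item in site.collection_name reversed %}
    <li>{{ item.title }}</li>
{% endfor %}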
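Sorting by date and then reversing yields newest-first order. The same idea as a minimal Python sketch, assuming a hypothetical list of items that each carry a title and a date:

```python
from datetime import date

# Hypothetical collection items; in Jekyll these would come from site.collection_name.
items = [
    {"title": "First", "date": date(2019, 6, 1)},
    {"title": "Third", "date": date(2019, 6, 17)},
    {"title": "Second", "date": date(2019, 6, 10)},
]


def newest_first(collection):
    """Sort by date ascending, then reverse, like `sort: 'date' | reverse`."""
    return list(reversed(sorted(collection, key=lambda item: item["date"])))


titles = [item["title"] for item in newest_first(items)]
print(titles)
```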

Code Snippets and Recipes

Jekyll & Liquid Cheatsheet

A list of the most common functionalities in Jekyll (Liquid). You can use Jekyll with GitHub Pages, just make sure you are using the proper version.

Running

Running a local server for testing purposes:

jekyll serve
jekyll serve --watch --baseurl ''

Creating a final outcome (or for testing on a server):

jekyll build
jekyll build -w

The -w or --watch flag enables auto-regeneration; the --baseurl '' option is useful for server testing.

Troubleshooting

On Windows you can get this error when building/serving:

Liquid Exception: incompatible character encodings: UTF-8 and IBM437 in index.html

You need to set the code-page first:

chcp 65001

Liquid

Output

Simple example of Output:

Hello {{name}}
Hello {{user.name}}
Hello {{ 'leszek' }}

Filtering output:

Word hello has {{ 'hello' | size }} letters!
Today is {{ 'now' | date: "%Y %h" }}

Useful example of the where filter, getting a single item from _data:

{% assign currentItem = site.data.foo | where:"slug","bar" %}
{{ currentItem[0].name }}
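The where filter returns an array of matches, which is why the single item is read at index 0. A minimal Python sketch of the same selection, assuming hypothetical data that stands in for the contents of site.data.foo:

```python
def where(items, key, value):
    """Return the items whose `key` equals `value`, like Liquid's `where` filter."""
    return [item for item in items if item.get(key) == value]


# Hypothetical contents of a data file such as _data/foo.yml.
foo = [
    {"slug": "baz", "name": "Baz item"},
    {"slug": "bar", "name": "Bar item"},
]

matches = where(foo, "slug", "bar")
# `where` always returns a list, so the single match is at index 0.
print(matches[0]["name"])
```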

Most common filters:

  • where — select elements from array with given property value: {{ site.posts | where:"category","foo" }}
  • group_by — group elements from array by given property: {{ site.posts | group_by:"category" }}
  • markdownify — convert markdown to HTML
  • jsonify — convert data to JSON: {{ site.data.dinosaurs | jsonify }}
  • date — reformat a date (syntax reference)
  • capitalize — capitalize words in the input sentence
  • downcase — convert an input string to lowercase
  • upcase — convert an input string to uppercase
  • first — get the first element of the passed in array
  • last — get the last element of the passed in array
  • join — join elements of the array with certain character between them
  • sort — sort elements of the array: {{ site.posts | sort: 'author' }}
  • size — return the size of an array or string
  • strip_newlines — strip all newlines (\n) from string
  • replace — replace each occurrence: {{ 'foofoo' | replace:'foo','bar' }}
  • replace_first — replace the first occurrence: {{ 'barbar' | replace_first:'bar','foo' }}
  • remove — remove each occurrence: {{ 'foobarfoobar' | remove:'foo' }}
  • remove_first — remove the first occurrence: {{ 'barbar' | remove_first:'bar' }}
  • truncate — truncate a string down to x characters
  • truncatewords — truncate a string down to x words
  • prepend — prepend a string: {{ 'bar' | prepend:'foo' }}
  • append — append a string: {{ 'foo' | append:'bar' }}
  • minus, plus, times, divided_by, modulo — working with numbers: {{ 4 | plus:2 }}
  • split — split a string on a matching pattern: {{ "a~b" | split:"~" }}

Tags

Tags are used for the logic in your template.

Comments

For swallowing content.

We made 1 million dollars {% comment %} in losses {% endcomment %} this year

Raw

Disables tag processing.

{% raw %}
    In Handlebars, {{ this }} will be HTML-escaped, but {{{ that }}} will not.
{% endraw %}

If / Else

Simple expression with if/unless, elsif [sic!] and else.

{% if user.name == "The Dude" %}
    Are you employed, sir?
{% elsif user %}
    Hello {{ user.name }}
{% else %}
    Who are you?
{% endif %}

{% unless user.name == "leszek" and user.race == "human" %}
    Hello non-human non-leszek
{% endunless %}

# array: [1,2,3]
{% if array contains 2 %}
    array includes 2
{% endif %}

Case

For more conditions.

{% case condition %}
    {% when 1 %}
        hit 1
    {% when 2 or 3 %}
        hit 2 or 3
    {% else %}
        don't hit
{% endcase %}

For loop

Simple loop over a collection:

{% for item in array %}
    {{ item }}
{% endfor %}

Simple loop with iteration:

{% for i in (1..10) %}
    {{ i }}
{% endfor %}

There are helper variables for special occasions:

  • forloop.length — length of the entire for loop
  • forloop.index — index of the current iteration
  • forloop.index0 — index of the current iteration (zero based)
  • forloop.rindex — how many items are still left?
  • forloop.rindex0 — how many items are still left? (zero based)
  • forloop.first — is this the first iteration?
  • forloop.last — is this the last iteration?
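How these helper variables relate to each other can be sketched in Python. The generator name forloop_helpers is hypothetical; the arithmetic follows the descriptions in the list above.

```python
def forloop_helpers(items):
    """Yield (item, helpers) pairs mirroring Liquid's forloop variables."""
    length = len(items)
    for index0, item in enumerate(items):
        yield item, {
            "length": length,
            "index": index0 + 1,             # one-based position
            "index0": index0,                # zero-based position
            "rindex": length - index0,       # items left, counting this one
            "rindex0": length - index0 - 1,  # items left after this one
            "first": index0 == 0,
            "last": index0 == length - 1,
        }


for item, loop in forloop_helpers(["a", "b", "c"]):
    print(item, loop["index"], loop["rindex"], loop["first"], loop["last"])
```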

Limiting the number of items and starting the loop at an offset:

# array: [1,2,3,4,5,6]
{% for item in array limit:2 offset:2 %}
    {{ item }}
{% endfor %}
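The combination of limit and offset behaves like a slice: skip the first offset items, then take at most limit items. A minimal Python sketch (the function name limit_offset is hypothetical):

```python
def limit_offset(items, limit, offset=0):
    """Return at most `limit` items, skipping the first `offset` items."""
    return items[offset:offset + limit]


array = [1, 2, 3, 4, 5, 6]
print(limit_offset(array, limit=2, offset=2))  # prints [3, 4]
```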

You can also reverse the loop:

{% for item in array reversed %}
...

Storing variables

Storing data in variables:

{% assign name = 'leszek' %}

Combining multiple strings into one variable:

{% capture full-name %}{{ name }} {{ surname }}{% endcapture %}

Permalinks

Permalinks are constructed with a template:

/:categories/:year/:month/:day/:title.html

These variables are available:

  • year — year from the filename
  • short_year — same as above but without the century
  • month — month from the filename
  • i_month — same as above but without leading zeros
  • day — day from the filename
  • i_day — same as above but without leading zeros
  • title — title from the filename
  • categories — specified categories for the post
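As a rough illustration of how such a template is filled in, the following Python sketch expands the placeholders from a post filename. The function expand_permalink is hypothetical and simplified; Jekyll's own rules (for example, dates set in front matter) are not modelled here.

```python
import re


def expand_permalink(template, filename, categories=()):
    """Fill a permalink template from a post filename like YYYY-MM-DD-title.ext."""
    match = re.match(r"(\d{4})-(\d{2})-(\d{2})-(.+)\.\w+$", filename)
    if match is None:
        raise ValueError(f"unexpected post filename: {filename!r}")
    year, month, day, title = match.groups()
    values = {
        "year": year,
        "short_year": year[2:],
        "month": month,
        "i_month": str(int(month)),  # without leading zero
        "day": day,
        "i_day": str(int(day)),      # without leading zero
        "title": title,
        "categories": "/".join(categories),
    }
    # Replace each :name placeholder; unknown placeholders are left untouched.
    url = re.sub(r":(\w+)", lambda m: values.get(m.group(1), m.group(0)), template)
    # Collapse the double slash that appears when there are no categories.
    return re.sub(r"/{2,}", "/", url)


print(expand_permalink("/:categories/:year/:month/:day/:title.html",
                       "2019-06-17-universe.md", categories=["space"]))
```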
Django Cookbook

Working with Django

Tutorial: Django-Polls App

We start with this amazing tutorial on building a Django polls app. The final code of the tutorial is here.

After this, we will extend the app with additional features such as Bootstrap and Angular 8.

Starting with Django

Installation with Pip

$ pip install Django
$ python -m django --version
$ python -c "import django; print(django.__file__)"
<BASEDIR>/python/lib/python3.7/site-packages/django/__init__.py
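The same check can be done from Python without importing Django itself. This is a minimal sketch using importlib.metadata from the standard library, which requires Python 3.8 or newer; the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("Django")
print("Django version:", version if version else "not installed")
```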

Install from Source

$ git clone https://github.com/django/django.git

Tutorial

https://docs.djangoproject.com/en/2.2/intro/tutorial02/

Configure development environment

Visual Studio Code

pip install pylint-django

Helpful links and documentation

Links

https://www.fullstackpython.com/django.html

Tips and Tricks

Customize 404 Page

If Django cannot find a page for a given URL and DEBUG is enabled, the following template is used as the error page:

<DJANGOINSTALLDIR>/views/templates/technical_404.html

Find the line with

{% for pat in pattern %}
{{ pat.pattern }}
{% if forloop.last and pat.name %}[name='{{ pat.name }}']{% endif %}
{% endfor %}

Replace the content with

{% for pat in pattern %}
<a href="{{ request.build_absolute_uri }}/{{ pat.pattern }}">{{ pat.pattern }}</a>
{% if forloop.last and pat.name %}[name='{{ pat.name }}']{% endif %}        
{% endfor %}

Resources

https://github.com/gothinkster/thinkster-django-angular-tutorial

Hadoop | Getting started

Modules

  • HDFS: Hadoop’s file share, which can be local or shared depending on your setup
  • MapReduce: Hadoop’s aggregation/synchronization tool enabling highly parallel processing; this is the true “engine” or time saver in Hadoop
  • Hive: Hadoop’s SQL query window, equivalent to Microsoft Query Analyzer
  • Pig: dataflow scripting tool similar to a batch job or simplistic ETL processor
  • Flume: collector/facilitator of log file information
  • Ambari: web-based admin tool for managing, provisioning, and monitoring a Hadoop cluster
  • Cassandra: high-availability, scalable, multi-master database platform; an RDBMS on steroids
  • Mahout: machine learning engine that performs complex calculations, algorithmic processing, and statistical/stochastic operations using R and other frameworks; it does serious math!
  • Spark: programmatic compute engine allowing for ETL, machine learning, stream processing, and graph computation
  • ZooKeeper: coordinator service for all your distributed processing
  • Oozie: workflow scheduler managing Hadoop jobs
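To illustrate what the MapReduce module does conceptually, here is a minimal single-process word count in Python. It is only a sketch of the map, shuffle, and reduce phases. All function names are hypothetical, and nothing here uses the Hadoop API or runs in parallel.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in the document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group the emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values of each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run the three phases in sequence over all documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))


print(word_count(["big data", "big cluster"]))
```

In a real cluster the map and reduce phases run on many nodes at once, which is where the time saving comes from.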

Links

Apache

https://sentry.apache.org/

https://de.hortonworks.com/apache/ranger/

https://mahout.apache.org/

https://pig.apache.org/

https://zookeeper.apache.org/

https://oozie.apache.org/

Miscellaneous

http://ercoppa.github.io/HadoopInternals/