Author Archives: matteo

Scheduling start/stop of Amazon RDS instances using CDK libraries

Instead of creating the necessary AWS resources using the AWS Console, I wanted to use the new AWS CDK libraries: this way the resources can be created and deleted using Python.

“The AWS Cloud Development Kit (AWS CDK) is an open source software development framework to model and provision your cloud application resources using familiar programming languages.

Provisioning cloud applications can be a challenging process that requires you to perform manual actions, write custom scripts, maintain templates, or learn domain-specific languages. AWS CDK uses the familiarity and expressive power of programming languages for modeling your applications. ” [source aws-cdk]

As suggested by https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html, I installed the required software and then ran:

mkdir rds-start-stop-cdk
cd rds-start-stop-cdk
cdk init --language python
python3 -m venv .env
source .env/bin/activate
# now I added the code you can see at
# https://gitlab.com/matteo.redaelli/rds-start-stop-cdk
pip install -r requirements.txt
cdk ls

You can see my sample code at https://gitlab.com/matteo.redaelli/rds-start-stop-cdk
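
As a rough idea of what such a stack can look like (this is only a sketch, not the code from the repository: the function name, the schedule and the CDK v1-style imports are my assumptions), a CloudWatch Events rule can trigger a small Lambda function that stops the database in the evening:

from aws_cdk import core
from aws_cdk import aws_events as events
from aws_cdk import aws_events_targets as targets
from aws_cdk import aws_iam as iam
from aws_cdk import aws_lambda as _lambda


class RdsStartStopStack(core.Stack):
    def __init__(self, scope: core.Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        # Lambda that calls rds:StopDBInstance on the instance named in its environment
        # (the "lambda" directory and stop_rds.handler are hypothetical)
        stop_fn = _lambda.Function(
            self, "StopRdsFunction",
            runtime=_lambda.Runtime.PYTHON_3_7,
            handler="stop_rds.handler",
            code=_lambda.Code.from_asset("lambda"),
            environment={"DB_INSTANCE_ID": "mydb"},
        )
        stop_fn.add_to_role_policy(iam.PolicyStatement(
            actions=["rds:StopDBInstance"],
            resources=["*"],
        ))

        # CloudWatch Events rule: stop the instance every weekday evening at 20:00 UTC
        events.Rule(
            self, "StopRdsSchedule",
            schedule=events.Schedule.cron(minute="0", hour="20", week_day="MON-FRI"),
            targets=[targets.LambdaFunction(stop_fn)],
        )

A cdk deploy then provisions both the function and the schedule; a second rule and a Lambda calling rds:StartDBInstance can bring the database back up in the morning.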

Scheduling AWS EMR clusters resize

Below is a sample of how to schedule an Amazon Elastic MapReduce (EMR) cluster resize. It is useful if you have a cluster that is less used at night or on weekends.

I used a Lambda function triggered by a CloudWatch rule. Here is my Python Lambda function:

import boto3

# allowed range for the instance group size
MIN = 1
MAX = 10

def lambda_handler(event, context):
    region = event["region"]
    ClusterId = event["ClusterId"]
    InstanceGroupId = event["InstanceGroupId"]
    InstanceCount = int(event["InstanceCount"])

    if InstanceCount >= MIN and InstanceCount <= MAX:
        # resize the given instance group of the EMR cluster
        client = boto3.client("emr", region_name=region)
        response = client.modify_instance_groups(
            ClusterId=ClusterId,
            InstanceGroups=[{
                "InstanceGroupId": InstanceGroupId,
                "InstanceCount": InstanceCount
            }])
        return response
    else:
        msg = "EMR cluster id %s (%s), instance group %s: InstanceCount=%d is NOT allowed [%d,%d]" % (
            ClusterId, region, InstanceGroupId, InstanceCount, MIN, MAX)
        return {"response": "ko", "message": msg}

Below is the CloudWatch rule, where the input event is a constant JSON object like:

{"region": "eu-west-1","ClusterId": "j-dsds","InstanceGroupId": "ig-sdsd","InstanceCount": 8}



Exporting database tables to CSV files with Apache Camel

Below is the relevant part of the code, using Spring XML:

     <bean id="ds-patriot-dw_ro" class="org.springframework.jdbc.datasource.DriverManagerDataSource">
         <property name="driverClassName" value="oracle.jdbc.OracleDriver" />
          <property name="url" value="jdbc:oracle:thin:@//patriot.redaelli.org:1111/RED"/>
          <property name="Username" value="user"/>
          <property name="Password" value="pwd"/>
  </bean>


<camelContext id="MyCamel" streamCache="true" xmlns="http://camel.apache.org/schema/spring">

    <route id="scheduler">
      <from uri="timer:hello?repeatCount=1"/>
      <setHeader headerName="ndays">
        <constant>0</constant>
      </setHeader>
      <to uri="direct:start"/>
    </route>

    <route>
      <from uri="direct:start"/>
      <setBody>
        <constant>table1,table2,table3</constant>
      </setBody>
      <split streaming="true">
        <tokenize token="," />
        <setHeader headerName="tablename">
          <simple>${body}</simple>
        </setHeader>
        <to uri="direct:jdbc2csv"/>
      </split>
    </route>

    <route>
      <from uri="direct:jdbc2csv"/>
        <to uri="direct:get-jdbc-data" pattern="InOut" />
        <to uri="direct:export-csv" />
    </route>

    <route>
      <from uri="direct:get-jdbc-data"/>
      <log message="quering table ${headers.tablename}..."/>

      <setBody>
        <groovy><![CDATA[
          "SELECT * from " + request.headers.get('tablename')
    ]]>
        </groovy>
      </setBody>
      <log message="quering statement: ${body}..."/>
      <to uri="jdbc:ds-patriot-dw_ro?useHeadersAsParameters=true&outputType=StreamList"/>
    </route>

    <route>
      <from uri="direct:export-csv"/>
            <log message="saving table ${headers.tablename} to ${headers.CamelFileName}..."/>
      <setHeader headerName="CamelFileName">
        <groovy>
          request.headers.get('tablename').replace(".", "_") + "/" + request.headers.get('tablename') + ".csv"
        </groovy>
      </setHeader>
      
      <!-- <marshal><csv/></marshal> does not include the header, so I have to export it manually... -->
      
      <multicast stopOnException="true">
        <pipeline>
          <log message="saving table ${headers.tablename} header to ${headers.CamelFileName}..."/>
          <setBody>
            <groovy>request.headers.get('CamelJdbcColumnNames').join(";") + "\n"</groovy>
          </setBody>
          <to uri="file:output"/>
        </pipeline>

        <pipeline>
          <log message="saving table ${headers.tablename} rows to ${headers.CamelFileName}..."/>
          <marshal>
            <csv delimiter=";" headerDisabled="false" useMaps="true"/>
          </marshal>
          <to uri="file:output?fileExist=Append"/>
        </pipeline>
      </multicast>

      <log message="saved table ${headers.tablename} to ${headers.CamelFileName}..."/>
  </route>

  </camelContext>

Add AD users from CSV to a group using PowerShell

# AD group to populate and CSV file with one mail address per line
$GroupName = "Qliksense_SI_Techedge"
$Users = "e:\scripts\users.csv"

Import-Module ActiveDirectory

# Find the closest domain controller for the domain
$dc = Get-ADDomainController -DomainName mydomain.redaelli.org -Discover -NextClosestSite
$server = $dc.HostName[0]

# Look up each user by mail address and add it to the group
Get-Content $Users | ForEach-Object {
  Get-ADUser -Server $server -LDAPFilter "(mail=$_)" } |
  Select-Object -ExpandProperty sAMAccountName |
  ForEach-Object { Add-ADGroupMember -Server $server -Identity $GroupName -Member $_ }

AWS Lake Formation: the new data lake solution proposed by Amazon

AWS Lake Formation is a service that makes it easy to set up a secure data lake in days. A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. A data lake enables you to break down data silos and combine different types of analytics to gain insights and guide better business decisions.

However, setting up and managing data lakes today involves a lot of manual, complicated, and time-consuming tasks. This work includes loading data from diverse sources, monitoring those data flows, setting up partitions, turning on encryption and managing keys, defining transformation jobs and monitoring their operation, re-organizing data into a columnar format, configuring access control settings, deduplicating redundant data, matching linked records, granting access to data sets, and auditing access over time.

Creating a data lake with Lake Formation is as simple as defining where your data resides and what data access and security policies you want to apply. Lake Formation then collects and catalogs data from databases and object storage, moves the data into your new Amazon S3 data lake, cleans and classifies data using machine learning algorithms, and secures access to your sensitive data. Your users can then access a centralized catalog of data which describes available data sets and their appropriate usage. Your users then leverage these data sets with their choice of analytics and machine learning services, like Amazon EMR for Apache Spark, Amazon Redshift, Amazon Athena, Amazon Sagemaker, and Amazon QuickSight. [aws.amazon.com]

Lake Formation automatically configures underlying AWS services, including S3, AWS Glue, AWS IAM, AWS KMS, Amazon Athena, Amazon Redshift, and Amazon EMR for Apache Spark, to ensure compliance with your defined policies. If you’ve set up transformation jobs spanning AWS services, Lake Formation configures the flows, centralizes their orchestration, and lets you monitor the execution of your jobs. With Lake Formation, you can configure and manage your data lake without manually integrating multiple underlying AWS services.


Building a Cloud-Agnostic Serverless infrastructure with Apache OpenWhisk

“Apache OpenWhisk (Incubating) is an open source, distributed Serverless platform that executes functions (fx) in response to events at any scale. OpenWhisk manages the infrastructure, servers and scaling using Docker containers so you can focus on building amazing and efficient applications…

DEPLOY Anywhere: Since Apache OpenWhisk builds its components using containers it easily supports many deployment options both locally and within Cloud infrastructures. Options include many of today’s popular Container frameworks such as Kubernetes, Mesos, and Compose.

ANY LANGUAGES: Work with what you know and love. OpenWhisk supports a growing list of your favorite languages such as NodeJS, Swift, Java, Go, Scala, Python, PHP and Ruby.

If you need languages or libraries the current “out-of-the-box” runtimes do not support, you can create and customize your own executables as Zip Actions which run on the Docker runtime by using the Docker SDK. ” [openwhisk.apache.org]
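
To give an idea of what an action looks like, a Python action is simply a file whose main function receives the invocation parameters as a dictionary and returns a JSON-serializable dictionary; the wsk commands in the comments show the usual create/invoke cycle (the action name here is arbitrary):

# hello.py - a minimal OpenWhisk Python action
#
#   wsk action create hello hello.py
#   wsk action invoke hello --result --param name World

def main(params):
    # OpenWhisk passes the invocation parameters as a dict
    # and expects a dict back as the result
    name = params.get("name", "world")
    return {"greeting": "Hello " + name + "!"}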

Building a Cloud-Agnostic Serverless infrastructure with Knative

Knative is a Kubernetes-based platform to build, deploy, and manage modern serverless workloads.

“Knative provides a set of middleware components that are essential to build modern, source-centric, and container-based applications that can run anywhere: on premises, in the cloud, or even in a third-party data center. Knative components are built on Kubernetes and codify the best practices shared by successful real-world Kubernetes-based frameworks. It enables developers to focus just on writing interesting code, without worrying about the “boring but difficult” parts of building, deploying, and managing an application.” [https://cloud.google.com/knative/]

“Knative has been developed by Google in close partnership with Pivotal, IBM, Red Hat, and SAP.” [infoq.com]

A simple REST web service with PowerShell

Below is a sample web service for exposing Active Directory queries using a PowerShell server. Test it with http://localhost:8000/user/<domainname>/<SamAccountName>

# Create a listener on port 8000
$listener = New-Object System.Net.HttpListener
$listener.Prefixes.Add('http://+:8000/')
$listener.Start()
'Listening ...'

# Run until you send a GET request to /end
while ($true) {
    $context = $listener.GetContext()

    # Capture the details about the request
    $request = $context.Request

    # Setup a place to deliver a response
    $response = $context.Response

    # Break from loop if GET request sent to /end
    if ($request.Url -match '/end$') {
        break
    } else {

        # Split request URL to get command and options
        $requestvars = ([String]$request.Url).split("/");

        # If a request is sent to http://<host>:8000/user/<domainname>/<SamAccountName>
        if ($requestvars[3] -eq "user") {
            $dom = $requestvars[4]
            $user = $requestvars[5]
            $domainname = $dom + ".redaelli.org"
            $dc = Get-ADDomainController -DomainName $domainname -Discover -NextClosestSite
            echo $dc
            $searchbase = 'DC=' + $dom + ',DC=redaelli,DC=org'

            # Query Active Directory for the requested user
            $result = Get-ADUser -Server $dc.HostName[0] -SearchBase $searchbase -Filter {SamAccountName -eq $user} -Properties * | select SamAccountName, sn,GivenName,DisplayName,mail,DistinguishedName,telephoneNumber,mobile,l,company,co,whenCreated,whenChanged,PasswordExpired,PasswordLastSet,PasswordNeverExpires,lockedOut,LastLogonDate,lockoutTime

            # Convert the returned data to JSON and set the HTTP content type to JSON
            $message = $result | ConvertTo-Json;
            $response.ContentType = 'application/json';

        } else {

            # If no matching subdirectory/route is found generate a 404 message
            $message = "This is not the page you're looking for.";
            $response.ContentType = 'text/html';
        }

        # Convert the data to UTF8 bytes
        [byte[]]$buffer = [System.Text.Encoding]::UTF8.GetBytes($message)

        # Set length of response
        $response.ContentLength64 = $buffer.length

        # Write response out and close
        $output = $response.OutputStream
        $output.Write($buffer, 0, $buffer.length)
        $output.Close()
    }
}

# Terminate the listener
$listener.Stop()

 

Querying public knowledge graph databases

You can query public knowledge graph databases (like wikidata.org and dbpedia.org) using SPARQL. For instance, to extract all "known" programming languages, you can use the following query:

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q9143.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
LIMIT 1000

There are also SPARQL clients for most programming languages.
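
For example, with Python you could use the SPARQLWrapper library (a sketch, not part of the original post) to run the same query against the Wikidata endpoint:

from SPARQLWrapper import SPARQLWrapper, JSON

# Wikidata public SPARQL endpoint
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery("""
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q9143.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
LIMIT 1000
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["item"]["value"], row["itemLabel"]["value"])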

With SWI-Prolog you can easily run:

[library(semweb/sparql_client)].
sparql_query('SELECT ?item ?itemLabel WHERE {?item wdt:P31 wd:Q9143. SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }} LIMIT 1000', Row, [ scheme(https),host('query.wikidata.org'), path('/sparql')]).