
Wednesday, August 29, 2012

How to control and manage your cloud servers from a bastion server

The idea is very simple. We would like to have one (or more) servers that belong to our cloud account and use them only to execute and orchestrate various tasks. The diagram below shows the concept. From your machine you ssh to the bastion (1) and from there you run any further tasks (2).



A hardened server should be used as the bastion host. This server will provide the following functions:

  • Act as a secure gateway into the cloud environment 
  • All other servers should be configured to accept connections from this server only
  • From the bastion you can launch tasks that perform further actions on the other cloud servers

Problem

How to run an ssh or scp command, initiated from the client, that needs to be executed from the bastion host against another cloud server.

Solution

This relatively long Python script, which uses the paramiko module, demonstrates the idea. It can definitely be extended and improved, but I hope you get the idea :).
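The full script is not reproduced in this post, but the core idea can be sketched roughly as follows. The hostnames, the root user, and both helper names are illustrative assumptions, not taken from the original script.

```python
# Sketch: run a command on a target cloud server *via* the bastion.
# Requires the third-party paramiko module; all names are illustrative.

def nested_ssh_command(target_ip, remote_cmd):
    """Build the ssh command line the bastion will run against the target."""
    return ("ssh -o StrictHostKeyChecking=no root@%s '%s'"
            % (target_ip, remote_cmd))

def run_via_bastion(bastion_ip, password, target_ip, remote_cmd):
    import paramiko  # imported lazily so the helper above stays standalone
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    # Step (1): ssh from our machine to the bastion.
    client.connect(bastion_ip, username="root", password=password)
    # Step (2): the bastion opens a second ssh session to the target
    # over the local ServiceNet and runs the command there.
    stdin, stdout, stderr = client.exec_command(
        nested_ssh_command(target_ip, remote_cmd))
    output = stdout.read()
    client.close()
    return output
```

This assumes key-based auth is already set up between the bastion and the target; as a later post here shows, answering a password prompt from the inner ssh/scp needs a tty and extra work.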

Monday, August 27, 2012

Rackconnect load and performance tests

Rackconnect 

Rackconnect is a hybrid hosting solution that Rackspace has to offer, and an excellent technology if you want to use the open cloud together with your dedicated hardware.

When we think of dedicated hardware like servers, performance statistics are relatively easy to estimate and understand. This is different, and can be much more complicated, for an IaaS cloud. An open cloud like IaaS is highly API driven, and the performance results may differ based on the time of day, the physical location of the data center or the presence of other users.

It is to be expected that the overall performance will be impacted when we try to integrate with the open cloud. Rackconnect is a hybrid technology that on one side uses the available public cloud API and on the other side manages the dedicated hardware. The cloud poses a challenge for it because it can be unpredictable and can have time spikes, timeouts or even local outages that last from minutes to hours.

In this post I would like to summarize and list some of the experiments I did to better understand the cloud's potential as well as to find areas of improvement for Rackconnect itself.

Rackconnect load and performance tests articles:
  1. Part 1 - cloud build bursting performance from 1 to 5 cloud servers (the client uses a Cisco ASA 5505 and a core cloud account with a physical presence in the London data center for the Rackconnect testing)
  2. Part 2 - TO BE DONE
  3. Extra - first post about this problem
References

http://www.rackspace.com/cloud/hybrid/
http://rtomaszewski.blogspot.co.uk/search/label/rackconnect

How long does it take to Rackconnect one to five cloud servers on a Cisco ASA 5505 (Part 1)

Update:
There is a more comprehensive comparison and listing of available resources for the topic:
Rackconnect load and performance tests 

This post is a continuation of the previous one: How long does it take to rackconnect a newly built cloud server

The data below has been generated with the same script [1], specifically modified to produce the necessary statistics. The method used to measure the time of a Rackconnect build has been improved as well, so with the new code we should get even better and more accurate results.

For the tests below the new measurement error should be only around 7 seconds. This has been achieved mainly with the modified script check_rackconnect.sh [2].

Basically, the new code uses a heuristic based on the bastion host's local time stamps as well as the time stamps from the check_rackconnect.sh script. The final calculation of how long the Rackconnect build took is done later in the log_status3 function [1].

All tests below are executed on the same Rackconnect environment: a Cisco ASA 5505 and a core cloud account. Every cloud server is built with the same configuration: flavor 1 (256MB), image type 112 (Ubuntu 10.04).

Test case #1

How long does it take to create 1 cloud server and Rackconnect it?

The test below will simulate:
  • Using cloud API create one cloud server 
  • Monitor the cloud build 
  • Once the cloud server is built start to monitor Rackconnect (RC) build 
  • Once the RC build is done generate stats and delete the cloud server 
  • Repeat the above cycle 10 times
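The cycle above can be sketched as a simple timing loop. The `run_tests` and `build_one_server` names are placeholders, not functions from the actual script:

```python
# Minimal sketch of the test driver: time each create/monitor/delete
# cycle and collect the rows that make up the duration table.
import datetime

def run_tests(repeats, build_one_server):
    """Run the build cycle `repeats` times; `build_one_server` is a
    placeholder for: create the server, wait for the cloud and
    Rackconnect builds, record their times, delete the server."""
    rows = []
    for n in range(1, repeats + 1):
        start = datetime.datetime.now()
        build_one_server()
        end = datetime.datetime.now()
        rows.append((n, start, end, int((end - start).total_seconds())))
    return rows
```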
The logs below show how to start the test and the resulting statistics.

$ python -u performance-single-cs.py -v -t 10 -s 1 -b pass@bastion -u user -k key  run | tee log.$(date +%s).txt
$ cat firstgen_rc_performance_report.1346017811.txt
Overall tests duration and statistics
test #,                          start,                            end, duration [s]
     1,     2012-08-26 22:07:06.914565,     2012-08-26 22:11:30.591570, 263
     2,     2012-08-26 22:11:30.592652,     2012-08-26 22:16:00.184358, 269
     3,     2012-08-26 22:16:00.185239,     2012-08-26 22:20:29.710007, 269
     4,     2012-08-26 22:20:29.711945,     2012-08-26 22:24:53.598046, 263
     5,     2012-08-26 22:24:53.598958,     2012-08-26 22:29:41.288767, 287
     6,     2012-08-26 22:29:41.290309,     2012-08-26 22:33:40.750456, 239
     7,     2012-08-26 22:33:40.752308,     2012-08-26 22:37:40.368505, 239
     8,     2012-08-26 22:37:40.370067,     2012-08-26 22:41:39.680134, 239
     9,     2012-08-26 22:41:39.680827,     2012-08-26 22:46:03.081319, 263
    10,     2012-08-26 22:46:03.083230,     2012-08-26 22:50:11.637474, 248

cloud building statistics
 cs #,   test1,   test2,   test3,   test4,   test5,   test6,   test7,   test8,   test9,  test10
    1,     213,     213,     214,     214,     244,     213,     213,     213,     214,     183

rackconnect building statistics
 cs #,   test1,   test2,   test3,   test4,   test5,   test6,   test7,   test8,   test9,  test10
    1,      30,      42,      42,      29,      24,      23,      22,      22,      28,      48

A graphical representation of the above data.

 


Test case #2

How long does it take to create 2 cloud servers and Rackconnect them?

As before, the following test simulates:
  • Using the cloud API create 2 cloud servers
  • Monitor the builds of the 2 cloud servers
  • Once any of the cloud servers is built start to monitor its Rackconnect (RC) build
  • Once the RC build is done save the stats and delete the cloud server
  • Generate the statistics when both cloud servers are deleted
  • Repeat the above cycle 10 times
The logs below show how to start the test and the results.

$ python -u firstgen_cs_performance.py -v -t 10 -s 2 -b pass@bastion -u user -k key  run 2>&1 | tee log.$(date +%s).txt 
$ cat firstgen_rc_performance_report.1346020358.txt
Overall tests duration and statistics
test #,                          start,                            end, duration [s]
     1,     2012-08-26 22:43:00.756231,     2012-08-26 22:48:06.580328, 305
     2,     2012-08-26 22:48:06.580876,     2012-08-26 22:53:06.387345, 299
     3,     2012-08-26 22:53:06.388921,     2012-08-26 22:57:45.312555, 278
     4,     2012-08-26 22:57:45.313452,     2012-08-26 23:02:42.359540, 297
     5,     2012-08-26 23:02:42.361089,     2012-08-26 23:08:06.284967, 323
     6,     2012-08-26 23:08:06.286676,     2012-08-26 23:13:06.233792, 299
     7,     2012-08-26 23:13:06.235701,     2012-08-26 23:18:06.165773, 299
     8,     2012-08-26 23:18:06.166642,     2012-08-26 23:23:02.864452, 296
     9,     2012-08-26 23:23:02.865955,     2012-08-26 23:27:35.655586, 272
    10,     2012-08-26 23:27:35.662204,     2012-08-26 23:32:38.403999, 302

cloud building statistics
 cs #,   test1,   test2,   test3,   test4,   test5,   test6,   test7,   test8,   test9,  test10
    1,     216,     216,     216,     216,     186,     248,     247,     216,     216,     216
    2,     212,     213,     213,     212,     244,     213,     213,     213,     213,     213

rackconnect building statistics
 cs #,   test1,   test2,   test3,   test4,   test5,   test6,   test7,   test8,   test9,  test10
    1,      62,      44,      19,      69,      32,      46,      46,      68,      19,      69
    2,      55,      62,      55,      29,      51,      47,      40,      28,      50,      35

A graphical representation of the test #2 data.
 



Test case #3

How long does it take to create 3 cloud servers and Rackconnect them?

The logs below show how to start the test and the resulting statistics.
 
$ python -u firstgen_cs_performance.py -v -t 10 -s 3 -b pass@bastion -u user -k key  run 2>&1 | tee log.$(date +%s).txt 
$ cat firstgen_rc_performance_report.1346023741.txt
Overall tests duration and statistics
test #,                          start,                            end, duration [s]
     1,     2012-08-26 23:36:24.625215,     2012-08-26 23:41:45.844673, 321
     2,     2012-08-26 23:41:45.846358,     2012-08-26 23:47:07.165236, 321
     3,     2012-08-26 23:47:07.166380,     2012-08-26 23:51:55.442095, 288
     4,     2012-08-26 23:51:55.443153,     2012-08-26 23:57:13.466668, 318
     5,     2012-08-26 23:57:13.468356,     2012-08-27 00:02:58.579180, 345
     6,     2012-08-27 00:02:58.580118,     2012-08-27 00:08:19.777111, 321
     7,     2012-08-27 00:08:19.778890,     2012-08-27 00:13:37.742591, 317
     8,     2012-08-27 00:13:37.744408,     2012-08-27 00:18:55.713891, 317
     9,     2012-08-27 00:18:55.715903,     2012-08-27 00:23:46.616462, 290
    10,     2012-08-27 00:23:46.617623,     2012-08-27 00:29:01.689222, 315

cloud building statistics
 cs #,   test1,   test2,   test3,   test4,   test5,   test6,   test7,   test8,   test9,  test10
    1,     220,     188,     219,     220,     219,     219,     219,     219,     220,     220
    2,     216,     215,     215,     216,     215,     215,     215,     247,     216,     216
    3,     212,     211,     212,     212,     211,     242,     242,     212,     212,     212

rackconnect building statistics
 cs #,   test1,   test2,   test3,   test4,   test5,   test6,   test7,   test8,   test9,  test10
    1,      35,      98,      17,      44,      93,      36,      38,      42,      15,      79
    2,      64,      36,      55,      81,      67,      58,      32,      40,      60,      25
    3,      87,      60,      57,      47,      39,      57,      52,      36,      56,      63

As before, the data is visualised on graphs.
 


Test case #4

How long does it take to create four cloud servers and Rackconnect them?

The logs below show how to start the test and results.

$ python -u firstgen_cs_performance.py -v -t 10 -s 4 -b pass@bastion -u user -k key  run 2>&1 | tee log.$(date +%s).txt 
$ cat firstgen_rc_performance_report.1346071307.txt 
Overall tests duration and statistics
test #,                          start,                            end, duration [s]
     1,     2012-08-27 12:42:58.464643,     2012-08-27 12:48:51.826041, 353
     2,     2012-08-27 12:48:51.827840,     2012-08-27 12:54:51.083906, 359
     3,     2012-08-27 12:54:51.085615,     2012-08-27 13:00:59.075276, 367
     4,     2012-08-27 13:00:59.076493,     2012-08-27 13:06:37.176187, 338
     5,     2012-08-27 13:06:37.177456,     2012-08-27 13:12:24.448441, 347
     6,     2012-08-27 13:12:24.449740,     2012-08-27 13:18:02.517695, 338
     7,     2012-08-27 13:18:02.519349,     2012-08-27 13:23:52.897831, 350
     8,     2012-08-27 13:23:52.898722,     2012-08-27 13:29:58.276944, 365
     9,     2012-08-27 13:29:58.278082,     2012-08-27 13:35:42.351643, 344
    10,     2012-08-27 13:35:42.352320,     2012-08-27 13:41:47.521873, 365

cloud building statistics
 cs #,   test1,   test2,   test3,   test4,   test5,   test6,   test7,   test8,   test9,  test10
    1,     224,     224,     224,     224,     224,     224,     192,     257,     195,     225
    2,     157,     222,     222,     221,     221,     222,     189,     222,     225,     254
    3,     249,     219,     250,     219,     218,     219,     217,     219,     221,     219
    4,     215,     216,     216,     216,     215,     216,     245,     248,     218,     216

rackconnect building statistics
 cs #,   test1,   test2,   test3,   test4,   test5,   test6,   test7,   test8,   test9,  test10
    1,      58,      62,      90,      91,      42,      48,      33,     100,      43,      82
    2,      59,      23,      71,      37,      93,      87,      63,      12,      89,      83
    3,      70,      41,      90,      40,      89,      81,     112,      44,      49,      21
    4,      61,     107,      45,      71,      84,      41,      55,      62,      94,      51

As before, the data is visualised in the form of graphs.
 


Test case #5

How long does it take to create five cloud servers and Rackconnect them?

The logs below show how to start the test and results.

$ python -u firstgen_cs_performance.py -v -t 10 -s 5 -b pass@bastion -u user -k key  run 2>&1 | tee log.$(date +%s).txt 
$ cat firstgen_rc_performance_report.1346075296.txt
Overall tests duration and statistics
test #,                          start,                            end, duration [s]
     1,     2012-08-27 13:46:48.810109,     2012-08-27 13:53:00.032992, 371
     2,     2012-08-27 13:53:00.034355,     2012-08-27 13:59:41.839530, 401
     3,     2012-08-27 13:59:41.840434,     2012-08-27 14:05:53.244177, 371
     4,     2012-08-27 14:05:53.245105,     2012-08-27 14:11:55.703266, 362
     5,     2012-08-27 14:11:55.704239,     2012-08-27 14:18:19.149206, 383
     6,     2012-08-27 14:18:19.150468,     2012-08-27 14:24:33.617616, 374
     7,     2012-08-27 14:24:33.618816,     2012-08-27 14:30:51.220322, 377
     8,     2012-08-27 14:30:51.221913,     2012-08-27 14:36:53.563330, 362
     9,     2012-08-27 14:36:53.564169,     2012-08-27 14:42:31.914619, 338
    10,     2012-08-27 14:42:31.915871,     2012-08-27 14:48:16.481674, 344

cloud building statistics
 cs #,   test1,   test2,   test3,   test4,   test5,   test6,   test7,   test8,   test9,  test10
    1,     228,     228,     227,     194,     195,     228,     228,     227,     259,     228
    2,     225,     192,     225,     224,     256,     225,     226,     192,     192,     225
    3,     222,     222,     222,     222,     221,     223,     254,     222,     189,     223
    4,     220,     220,     250,     219,     218,     251,     188,     250,     217,     220
    5,     217,     217,     248,     246,     247,     215,     217,     216,     214,     217

rackconnect building statistics
 cs #,   test1,   test2,   test3,   test4,   test5,   test6,   test7,   test8,   test9,  test10
    1,      36,      54,      30,      44,      64,      30,      61,      48,      43,      44
    2,      52,      56,      54,     103,      97,      80,      50,      43,      51,      67
    3,      97,     166,      58,      42,      93,      68,      90,      52,      58,      69
    4,     114,      42,      71,     101,      53,      89,      55,      65,      49,     100
    5,      67,      74,      88,      73,      60,      80,      93,     113,     101,      67
As before, the data is visualised in the form of a graph.
 



Summary and results description

Through all the 5 test cases above we have been creating cloud servers, repeating the process 10 times in each case. The graphs show that the numbers vary, and the visible trend is that the times are increasing.

We can take a look at each of the 5 test cases again and summarize the 10 repetitions into a 3-number result: the min, max and average Rackconnect build time. Next we can combine all 5 tests (50 cloud builds in total) and represent all the data in a single graph. The graph below compares all of the above tests.


As an example, for test 1 we have the min, max and average times, calculated from the results of Test Case #1. We did the same for the other test cases.
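As a concrete illustration, the ten Rackconnect build times for the single server in Test Case #1 reduce to the three numbers like this:

```python
# Rackconnect build times (seconds) for cs #1 in Test Case #1 above.
rc_times = [30, 42, 42, 29, 24, 23, 22, 22, 28, 48]

rc_min = min(rc_times)                         # 22
rc_max = max(rc_times)                         # 48
rc_avg = sum(rc_times) / float(len(rc_times))  # 31.0
```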

We can see that every time we increase the number of cloud servers to build by one, the Rackconnect build time increases as well. It means that if we burst 1, 2 and up to 5 cloud servers in one single test, the time the cloud infrastructure needs to provision and then finish up and RackConnect a single cloud server increases as well.

References
  1. https://github.com/rtomaszewski/cloud-performance/tree/nextgenv1.0
  2. https://github.com/rtomaszewski/cloud-performance/blob/nextgenv1.0/check_rackconnect.sh
  3. http://www.rackspace.com/cloud/hybrid/

Friday, August 24, 2012

How long does it take to rackconnect a newly built cloud server

Update:
There is a more comprehensive comparison and listing of available resources for the topic:
Rackconnect load and performance tests

Rackconnect is a Rackspace cloud product that allows you to create a secure environment where cloud servers can communicate with your dedicated hardware and vice versa. In the current version of Rackconnect and the Rackspace tools you can observe and monitor the progress in the MyRackspace portal. It gives you good visibility of what tasks need to be done and what the progress is. This article describes this a little better: How to check or monitor build status of a cloud server that belong to a RackConnect cloud account

Problem

After a cloud server is fully created and built how long does it take on average to execute all the Rackconnect tasks?

Analysis

In my attempt to answer the question I wrote a Python script [1]. The script is capable of doing a lot more than what is described below (for more info please take a look at GitHub [1] and the other blog entries here). With the help of this script I repeatedly ran a simple test, collected the data and summarized it here.

All tests have been performed on a Rackconnect account using a Cisco ASA 5505 and a core cloud account.

Test case for Rackconnect build
  • Using the cloud API create a cloud server. 
  • Using a polling mechanism, every 30 seconds send a 'check' API request to confirm if the cloud build is complete.
  • Once the cloud server is built, poll again every 20 seconds to verify if all Rackconnect tasks are completed for the cloud server.
  • Generate a summary and report.
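The polling described above can be sketched as a generic wait loop. The `is_cloud_built`/`is_rackconnected` predicates and the timeout value are placeholders for the real API checks, not names from the script:

```python
import time

def wait_until(predicate, interval, timeout):
    """Poll predicate() every `interval` seconds until it returns True;
    raise if `timeout` seconds pass first. Returns the elapsed time,
    which is also where the polling interval becomes the measurement
    error of the result."""
    start = time.time()
    while not predicate():
        if time.time() - start > timeout:
            raise RuntimeError("timed out after %d seconds" % timeout)
        time.sleep(interval)
    return time.time() - start

# cloud_build_time = wait_until(is_cloud_built, 30, 1800)    # check every 30s
# rc_build_time    = wait_until(is_rackconnected, 20, 1800)  # check every 20s
```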
Starting a test


$ for i in $(seq 1 7); do 
  python -u firstgen_cs_performance.py -v -t 1 -s 1 -b pass@bastion_ip -u user -k key run 2>&1 | tee log.$(date +%s).txt
done

The test generates a log file for each run. These simple bash aliases will help you to parse the data [2].

Example output from the test

[ ][ ] Preparing to start all 1 tests
[ ][ ] test nr 1 started at 2012-08-23 22:45:23.631241
[ 1][  ] starting test nr 1, creating 1 cloud server, please wait ...
[ 1][ 1] created image: {'flavor': 1, 'image': 112, 'name': 'csperform1345758323'}
[ 1][ 1] cloud server build [csperform1345758323] created in 214.059273 seconds / 3.56765455 minutes
[ 1][ 1] rackconnect build [csperform1345758323] finished in 46.875914 seconds / 0.781265233333 minutes
[ ][ ] test nr 1 finished at 2012-08-23 22:49:50.318006

[ ][ ] Preparing to start all 1 tests
[ ][ ] test nr 1 started at 2012-08-23 22:49:51.105979
[ 1][  ] starting test nr 1, creating 1 cloud server, please wait ...
[ 1][ 1] created image: {'flavor': 1, 'image': 112, 'name': 'csperform1345758591'}
[ 1][ 1] cloud server build [csperform1345758591] created in 213.882997 seconds / 3.56471661667 minutes
[ 1][ 1] rackconnect build [csperform1345758591] finished in 41.747438 seconds / 0.695790633333 minutes
[ ][ ] test nr 1 finished at 2012-08-23 22:54:14.590053

[ ][ ] Preparing to start all 1 tests
[ ][ ] test nr 1 started at 2012-08-23 22:54:15.326702
[ 1][  ] starting test nr 1, creating 1 cloud server, please wait ...
[ 1][ 1] created image: {'flavor': 1, 'image': 112, 'name': 'csperform1345758855'}
[ 1][ 1] cloud server build [csperform1345758855] created in 213.884405 seconds / 3.56474008333 minutes
[ 1][ 1] rackconnect build [csperform1345758855] ERROR, couldn't find server or timeout after 1008.921862 seconds / 16.8153643667 minutes
[ ][ ] test nr 1 finished at 2012-08-23 23:14:39.879807

[ ][ ] Preparing to start all 1 tests
[ ][ ] test nr 1 started at 2012-08-23 23:33:54.517918
[ 1][  ] starting test nr 1, creating 1 cloud server, please wait ...
[ 1][ 1] created image: {'flavor': 1, 'image': 112, 'name': 'csperform1345761234'}
[ 1][ 1] cloud server build [csperform1345761234] created in 214.204459 seconds / 3.57007431667 minutes
[ 1][ 1] rackconnect build [csperform1345761234] finished in 13.489741 seconds / 0.224829016667 minutes
[ ][ ] test nr 1 finished at 2012-08-23 23:37:53.865203

[ ][ ] Preparing to start all 1 tests
[ ][ ] test nr 1 started at 2012-08-23 23:37:54.768286
[ 1][  ] starting test nr 1, creating 1 cloud server, please wait ...
[ 1][ 1] created image: {'flavor': 1, 'image': 112, 'name': 'csperform1345761474'}
[ 1][ 1] cloud server build [csperform1345761474] created in 213.867189 seconds / 3.56445315 minutes
[ 1][ 1] rackconnect build [csperform1345761474] finished in 41.842998 seconds / 0.6973833 minutes
[ ][ ] test nr 1 finished at 2012-08-23 23:42:18.215062

[ ][ ] Preparing to start all 1 tests
[ ][ ] test nr 1 started at 2012-08-23 23:47:07.441398
[ 1][  ] starting test nr 1, creating 1 cloud server, please wait ...
[ 1][ 1] created image: {'flavor': 1, 'image': 112, 'name': 'csperform1345762027'}
[ 1][ 1] cloud server build [csperform1345762027] created in 214.401686 seconds / 3.57336143333 minutes
[ 1][ 1] rackconnect build [csperform1345762027] finished in 88.597479 seconds / 1.47662465 minutes
[ ][ ] test nr 1 finished at 2012-08-23 23:52:16.013129

Results discussion

Firstly, all the measurements of how long it took to build a cloud server or Rackconnect it carry a measurement error of approximately 30 or 20 seconds respectively. But even with these relatively big numbers we can clearly see a pattern:

  1. It is possible to Rackconnect a server in 13s (*) 
  2. Many Rackconnect builds finished in about 41-46s
  3. In an extreme case a cloud server build finished in 3.5 minutes but the Rackconnect was still ongoing after 16 minutes.

Ad 1.
This value is acceptable and correct even though the measurement error for the Rackconnect task is 20s. The reason is that the firstgen_cs_performance.py script uses different threads to perform various tasks; the 20s is simply the delay between the API calls performed by a single thread. For more info please take a look at the evaluate_rackconnect_status(self, test_nr) function.

References
  1. https://github.com/rtomaszewski/cloud-performance/blob/master/firstgen_cs_performance.py
  2. https://github.com/rtomaszewski/cloud-performance/blob/master/myaliases.sh

Sunday, August 19, 2012

Problem running ssh or scp from a python script using the paramiko module

The Paramiko project [1] is a native SSH library for Python scripting. It provides an extensible API that allows you to open an SSH session, control it and execute commands over it.

I tried to use it to implement one of my Rackconnect scripts:
  • create a cloud server, let's call it a bastion server 
  • connect to bastion server over SSH 
  • execute ssh command from bastion against other cloud server (10.178.7.217/ServiceNet) or
  • execute scp command to copy a file from bastion to a cloud server over local ServiceNet 
My first attempt to solve this problem failed: I discovered that the SSH session doesn't have a tty attached to it.

First attempt.
Below is the error message I always got.

$ python -u example_paramiko_notty.py

 test 1
 cmd pwd; ls; date

stdout : /root
stdout : check_rackconnect.sh
stdout : Sun Aug 19 19:04:03 UTC 2012

 test 2
 cmd scp -q  -o NumberOfPasswordPrompts=1 -o StrictHostKeyChecking=no /root/check_rackconnect.sh root@10.178.7.217:~/; echo $? done.

stdout : 1 done.
stderr : lost connection

 test 3
 cmd scp -q -v -o NumberOfPasswordPrompts=1 -o StrictHostKeyChecking=no /root/check_rackconnect.sh root@10.178.7.217:~/; echo $? done.

stdout : 1 done.
stderr : Executing: program /usr/bin/ssh host 10.178.7.217, user root, command scp -v -t ~/
stderr : OpenSSH_5.3p1 Debian-3ubuntu3, OpenSSL 0.9.8k 25 Mar 2009
stderr : debug1: Reading configuration data /etc/ssh/ssh_config
stderr : debug1: Applying options for *
stderr : debug1: Connecting to 10.178.7.217 [10.178.7.217] port 22.
stderr : debug1: Connection established.
stderr : debug1: permanently_set_uid: 0/0
stderr : debug1: identity file /root/.ssh/identity type -1
stderr : debug1: identity file /root/.ssh/id_rsa type -1
stderr : debug1: identity file /root/.ssh/id_dsa type -1
stderr : debug1: Remote protocol version 2.0, remote software version OpenSSH_5.3p1 Debian-3ubuntu3
stderr : debug1: match: OpenSSH_5.3p1 Debian-3ubuntu3 pat OpenSSH*
stderr : debug1: Enabling compatibility mode for protocol 2.0
stderr : debug1: Local version string SSH-2.0-OpenSSH_5.3p1 Debian-3ubuntu3
stderr : debug1: SSH2_MSG_KEXINIT sent
stderr : debug1: SSH2_MSG_KEXINIT received
stderr : debug1: kex: server-client aes128-ctr hmac-md5 none
stderr : debug1: kex: client-server aes128-ctr hmac-md5 none
stderr : debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024 1024 8192) sent
stderr : debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
stderr : debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
stderr : debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
stderr : debug1: Host '10.178.7.217' is known and matches the RSA host key.
stderr : debug1: Found key in /root/.ssh/known_hosts:5
stderr : debug1: ssh_rsa_verify: signature correct
stderr : debug1: SSH2_MSG_NEWKEYS sent
stderr : debug1: expecting SSH2_MSG_NEWKEYS
stderr : debug1: SSH2_MSG_NEWKEYS received
stderr : debug1: SSH2_MSG_SERVICE_REQUEST sent
stderr : debug1: SSH2_MSG_SERVICE_ACCEPT received
stderr : debug1: Authentications that can continue: publickey,password
stderr : debug1: Next authentication method: publickey
stderr : debug1: Trying private key: /root/.ssh/identity
stderr : debug1: Trying private key: /root/.ssh/id_rsa
stderr : debug1: Trying private key: /root/.ssh/id_dsa
stderr : debug1: Next authentication method: password
stderr : debug1: read_passphrase: can't open /dev/tty: No such device or address
stderr : debug1: Authentications that can continue: publickey,password
stderr : debug1: No more authentication methods to try.
stderr : Permission denied (publickey,password).
stderr : lost connection

The problem is that the exec_command() function [2] doesn't open an SSH session with a terminal attached. I couldn't find a working solution using it, so I rewrote the script to use the invoke_shell() function [3] instead. Most of the code was inspired by and copied from an example found here [4]. Below is my working script.
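The working script is not reproduced here, but the invoke_shell() approach can be sketched as follows. The fixed pauses, buffer size and prompt detection are simplifying assumptions (a more robust version would poll the channel's recv_ready() instead of sleeping):

```python
import time

def run_in_shell(client, command, password, pause=2.0):
    """Run `command` in an interactive shell (which, unlike
    exec_command(), has a pty attached) and answer a password prompt
    if one appears. `client` is an already-connected paramiko.SSHClient."""
    shell = client.invoke_shell()
    time.sleep(pause)
    shell.recv(65536)                    # discard the login banner and prompt
    shell.send(command + "\n")
    time.sleep(pause)
    out = shell.recv(65536).decode("utf-8", "replace")
    if "password:" in out:               # scp now prompts on the attached tty
        shell.send(password + "\n")
        time.sleep(pause)
        out += shell.recv(65536).decode("utf-8", "replace")
    return out
```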

Second attempt. This is the output when we run the script this time.

$ python -u example_paramiko_with_tty.py 

 test 2
 cmd scp -q  -o NumberOfPasswordPrompts=1 -o StrictHostKeyChecking=no /root/check_rackconnect.sh root@10.178.7.217:~/; echo $? done.

Linux rctest 2.6.32-31-server #61-Ubuntu SMP Fri Apr 8 19:44:42 UTC 2011 x86_64 GNU/Linux
Ubuntu 10.04 LTS

Welcome to the Ubuntu Server!
 * Documentation:  http://www.ubuntu.com/server/doc
Last login: Sun Aug 19 19:47:09 2012 from bbb.rrr.com

root@rctest:~# 
/root/check_rackconnect.sh root@10.178.7.217:~/; echo $? done.no  

root@10.178.7.217's password: 

0 done.
root@rctest:~# 
command was successful:True

As we can see, the invoke_shell() function behaves very differently from exec_command(). To make it work we have to deal with all the terminal output and make sure we send the command string at the right time. We can also see that with the default terminal settings the output does not show all the text: only part of the overlapped command string we sent is visible in the output above.

References
  1. http://www.lag.net/paramiko/
  2. http://www.lag.net/paramiko/docs/paramiko.SSHClient-class.html#exec_command
  3. http://www.lag.net/paramiko/docs/paramiko.SSHClient-class.html#invoke_shell
  4. http://stackoverflow.com/questions/1911690/nested-ssh-session-with-paramiko

Other interesting links:
  5. http://www.minvolai.com/blog/2009/09/how-to-ssh-in-python-using-paramiko/
  6. http://jessenoller.com/2009/02/05/ssh-programming-with-paramiko-completely-different/

Saturday, August 18, 2012

Red Hat says OpenStack is going to win the cloud battle

Looking at the numerous OpenStack blogs on the Internet, I found this one from Bryan Che very interesting. Bryan is a Red Hat employee, and on his blog he recently published an article on why he believes OpenStack is the future of the cloud.

The article is interesting because it is not pure marketing; instead the author tries to convince us by listing a number of semi-technical arguments. The full article can be found here: The 2nd Tenet of Open Source: Bet on the Community, Not the Current State of Technology

For me the most important argument, which I agree with, is that OpenStack is going to win as long as its community continues to grow. The only thing I would like to say at the end is: let's keep doing the good work and prove him right ;)


Thursday, August 16, 2012

How to calculate the number of new SSL/TCP connections in every 10ms

Hardware load balancers like F5 are great products that offer a lot of features combined with a simple and intuitive management GUI. The only problem is the price you have to pay to buy one, and then the ongoing support and license fees.

When working with F5 I once ran into an interesting SSL/TLS problem. It is documented and described in SOL6475: Overview of SSL TPS licensing limits.

The most important part of the solution is:

The BIG-IP system measures SSL TPS based on client-side connection attempts to any
virtual server configured with a Client SSL profile. SSL TPS is enforced across a
sliding time window. The BIG-IP system utilizes a 10ms window (1/100 of a second)
to calculate the current TPS. If the number of TPS requests within any 10ms window
exceeds 1/100 of the licensed TPS, an error message regarding the TPS limit being
reached is sent to the /var/log/ltm file.
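As a worked example (the 500 TPS licensed limit below is an assumption for illustration; the actual limit depends on your license):

```python
# The sliding window is 10 ms, i.e. 1/100 of a second, so the
# per-window budget is 1/100 of the licensed TPS.
licensed_tps = 500                       # assumed license limit
per_window_limit = licensed_tps / 100.0  # 5.0 new SSL connections per 10 ms
# A burst of 6 or more new SSL connections inside any single 10 ms
# window trips the limit, even if the per-second average stays
# well below 500.
```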

Problem

How to know which client IPs cause the error to be logged, and how to measure and calculate the number of SSL connections per second, or even per 10ms.

Solution

As there are no tools on the F5 that help you find this out, I thought a simple way to get some visibility would be to capture all TCP SYN packets hitting the LB and then analyse them later. An implementation of this idea in the form of a Python script can be found here [1].

Demonstration

To test the sslAnalyze.py script we first need a tcpdump file. For this purpose we can use the nmap command to run a SYN flood. For a description of the nmap options you can take a look here [2].

$ nmap -P0 -TNormal -D 1.2.3.4,1.2.3.5,1.2.3.6,1.2.3.7,1.2.3.8,1.2.3.9,1.2.3.10 -iR 10

All we have to do now is run tcpdump in one session and the nmap command in another. As we are only interested in the TCP SYN packets, we should tailor the tcpdump filter syntax properly: 'tcp[13]&2!=0 and tcp[13]&16==0' matches packets where the SYN flag is set and the ACK flag is clear (byte 13 of the TCP header holds the flag bits). A tcpdump that will capture only the SYN packets:

$ tcpdump -vvv -nn -i eth0 -w /var/tmp/syn-flood-example.pcap 'tcp[13]&2!=0 and tcp[13]&16==0' 

Now all we have to do is run our script to see the statistics.

Let me quickly explain the script itself. Once run, it prints to stdout a listing of the connections found and additionally creates a log file named sslConnHigh.txt containing only the connections that are over the threshold.

The parameters you have to specify are:
  • param1 - tcpdump file (it has to contain only SYN packets) 
  • param2 - time fraction in microseconds (1000000 microseconds -> 1 second) 
  • param3 - connection threshold per time fraction above which the result is logged to the sslConnHigh.txt file
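The core of the analysis can be sketched as bucketing epoch timestamps (as printed by `tcpdump -tt`) into fixed windows. The function name and the sample timestamps below are illustrative, not taken from sslAnalyze.py:

```python
from collections import Counter

def count_per_window(timestamps_us, window_us):
    """Map each window's start time (in microseconds) to the number of
    SYN packets whose timestamp falls inside that window."""
    counts = Counter()
    for ts in timestamps_us:
        counts[(ts // window_us) * window_us] += 1
    return counts

# Five SYNs bucketed into 500000-microsecond (half second) windows:
syns = [0, 400000, 600000, 999999, 1000001]
windows = count_per_window(syns, 500000)
# windows == {0: 2, 500000: 2, 1000000: 1}
```

Any window whose count exceeds the param3 threshold would then be written to sslConnHigh.txt.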

Examples

# Example 1: to see the number of connections per 1 second 

$ python sslAnalyze.py  syn-flood-example.pcap 1000000 1

# Example 2: to see the number of connections per every 500ms (half a second)

$ python sslAnalyze.py  syn-flood-example.pcap 500000 1

# Example 3: to see the number of connections per every 500ms (half a second) and log only
# those timestamps that have more than 100 connections in a single half second
# (some example output is attached below as well)

$ python sslAnalyze.py  syn-flood-example.pcap 500000 100

keeping the line: reading from file syn-flood-example.pcap, link-type EN10MB (Ethernet)
                     date     timestamp     sumOfConn [... 500000 microsecond periods ... ]
 Tue Aug 14 23:33:30 2012    1344983610       sum:183     0  183 
 Tue Aug 14 23:33:31 2012    1344983611        sum:95     6   89 
 Tue Aug 14 23:33:32 2012    1344983612       sum:614   430  184 
 Tue Aug 14 23:33:33 2012    1344983613       sum:520   216  304 

To better understand why F5 logs the error message and what triggers the TPS error messages we have to run this command:

# 10 milliseconds = 10000 microseconds
$ python sslAnalyze.py  syn-flood-example.pcap 10000 [F5_SSL_total_TPS]
$ cat sslConnHigh.txt

In the output you are going to see the timestamps (rounded to 1 second) where the number of connections in a single 10ms window is above the licensing limit your device has. For further analysis you can extract these data from the tcpdump with the help of the tcpslice tool.


# 1268649656 is an example timestamp from above
$ tcpslice 1268649656  +1 syn-flood-example.pcap -w 1268649656.pcap

$ tcpdump -tt -nr 1268649656.pcap

reading from file 1268649656.pcap, link-type EN10MB (Ethernet)
1268649656.042723 vlan 4093, p 0, IP 19.26.168.192.4598 - 19.26.225.215.443: S 2973530156:2973530156(0) win 64512 mss 1460,nop,nop,sackOK
1268649656.056163 vlan 4093, p 0, IP 19.89.139.199.1622 - 19.26.225.23.443: S 1522394445:1522394445(0) win 64512 mss 1460,nop,wscale 0,nop,nop,sackOK

References
  1. https://github.com/rtomaszewski/experiments/blob/master/sslAnalyze.py
  2. http://www.hcsw.org/reading/nmapguide.txt
  3. http://danielmiessler.com/study/tcpdump/

Tuesday, August 14, 2012

How to find the public IP address for my rackconnected cloud server

Every cloud server that belongs to a cloud account linked with Rackconnect is going to have a static NAT created on the external network device, like an ASA firewall or an F5 load balancer. At the moment there is no easy way to find out what this IP actually is.

When you first create a cloud server it will have an IP assigned from the cloud public IP address pool. You can find it in the example output below.

$ cloudservers --username user --apikey  key  boot rctest --flavor 1 --image 112
+-----------+------------------------------------------------------------------+
|  Property |                              Value                               |
+-----------+------------------------------------------------------------------+
| addresses | {u'public': [u'31.222.163.128'], u'private': [u'10.177.69.211']} |
| adminPass |                         rrrrrrrrrrrr                          |
|  flavorId |                                1                                 |
|   hostId  |                 0652da292b44004e3aa76dc80bd912d5                 |
|     id    |                             10209889                             |
|  imageId  |                               112                                |
|  metadata |                                {}                                |
|    name   |                              rctest                              |
|  progress |                                0                                 |
|   status  |                              BUILD                               |
+-----------+------------------------------------------------------------------+

The initial IP of 31.222.163.128 is going to be changed as soon as all RackConnect tasks have run against this cloud server. Another problem is that all subsequent API calls may still return the original IP address instead of the new one assigned by the RackConnect system.

$ cloudservers --username user  --apikey  key show 10209889
+------------+----------------------------------+
|  Property  |              Value               |
+------------+----------------------------------+
|   flavor   |            256 server            |
|   hostId   | 0652da292b44004e3aa76dc80bd912d5 |
|     id     |             10209889             |
|   image    |         Ubuntu 10.04 LTS         |
|  metadata  |                {}                |
|    name    |              rctest              |
| private ip |          10.177.69.211           |
|  progress  |                0                 |
| public ip  |          31.222.163.128          |
|   status   |              BUILD               |
+------------+----------------------------------+

Problem

How do I find the external IP address that Rackconnect assigns to my cloud server?

Solution

Once the cloud server is built you can open a browser from the cloud server (which may be a bit of a problem) and google for: what's my IP.

Alternatively, if you have a bastion server that you can log in to over SSH, you can run this command from there against the internal IP address of your new cloud server:

mybastion$ ssh root@10.177.69.211 "curl  http://icanhazip.com"
11.138.183.11
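
If you need this from code rather than an interactive shell, the same trick can be wrapped in a small helper. This is only a sketch: icanhazip.com as the echo service and the root user are just the assumptions carried over from the command above.

```python
import subprocess

def public_ip_probe_cmd(internal_ip, user="root",
                        probe_url="http://icanhazip.com"):
    """Build the ssh command that asks an external IP-echo service, via the
    target server, what address the server appears as on the internet."""
    return ["ssh", "%s@%s" % (user, internal_ip), "curl -s %s" % probe_url]

# run on the bastion:
# ip = subprocess.check_output(public_ip_probe_cmd("10.177.69.211")).decode().strip()
```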

How to terminate an ssh session to a cloud server that hung

When working with Rackspace cloud servers you sometimes run into an issue where the remote ssh session hangs.

This is expected behavior. After some time of inactivity the session times out, and in my case this left my bash terminal session hanging. As I didn't want to terminate and close my terminal to resolve this issue, and I wanted to keep my previous log still available, I was looking for a possible solution.

Solution

To terminate a hung ssh session type the following keys: [enter]~. 

Example 

root@mycloud:~# ~?
Supported escape sequences:
  ~.  - terminate connection (and any multiplexed sessions)
  ~B  - send a BREAK to the remote system
  ~C  - open a command line
  ~R  - Request rekey (SSH protocol 2 only)
  ~^Z - suspend ssh
  ~#  - list forwarded connections
  ~&  - background ssh (when waiting for connections to terminate)
  ~?  - this message
  ~~  - send the escape character by typing it twice
(Note that escapes are only recognized immediately after newline.)
# we are waiting now for the session to timeout 
root@rctest:~#

# now press the magic keys :)
root@rctest:~# Connection to 83.138.183.15 closed.

References
  1. http://www.thelinuxblog.com/ssh-escape/

Monday, August 13, 2012

How to check or monitor the build status of a cloud server that belongs to a RackConnect cloud account

There is a difference when you use a cloud account that is linked with the Rackspace RackConnect (RC) product.

Every cloud server from a cloud account that is rackconnected is going to be reconfigured. All tasks that the RackConnect system executes can be seen and followed on the MyRackspace portal. In short, these tasks will change the initial IP settings, route configuration and firewall settings on the original cloud server.
  • RC tasks from MyRackspace portal
Cloud Server Created: Add "rackconnect" user 
Cloud Server Created: Validate existence of gateway interface on dedicated network device 
Cloud Server Created: Retrieve metadata 
Cloud Server Created: Provision public IP address 
Cloud Server Created: Update access on dedicated network devices 
Cloud Server Created: Configure network stack 
Cloud Server Created: Configure software firewall 
Cloud Server Created: Update software firewall on other Cloud Servers
  • Status and monitoring
At the moment the only way to know that RC is done is to monitor the network settings on the cloud server manually. This is a known limitation and changes to address it are going to be deployed in the near future. For now, to know that RC is done we can for example monitor the last possible task: Cloud Server Created: Configure software firewall.

As soon as we know that the firewall config has changed, RC is almost done (there is one last task that can still affect our cloud server). A simple example of how the settings change is below.
  • Before the RC changes
# iptables -nL
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         

  • After the RC changes
# iptables -nL
Chain INPUT (policy DROP)
target     prot opt source               destination         
RS-RackConnect-INBOUND  all  --  0.0.0.0/0            0.0.0.0/0           /* RackConnectChain-INBOUND */ 

Chain FORWARD (policy DROP)
target     prot opt source               destination         
RS-RackConnect-INBOUND  all  --  0.0.0.0/0            0.0.0.0/0           /* RackConnectChain-FORWARD */ 

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         

Chain RS-RackConnect-INBOUND (2 references)
target     prot opt source               destination         
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           state RELATED,ESTABLISHED /* RackConnectChain-INBOUND-RE */ 
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           /* Local-Loopback */ 
...

To know that RC is done you need a simple script to check this. An example script is listed below.
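
The script originally embedded in this post is no longer available; a minimal Python sketch of the same check could look like the following. It simply detects the RackConnect iptables chain shown in the listing above (the chain name RS-RackConnect-INBOUND is taken from that listing; everything else is my own naming).

```python
import subprocess

RC_CHAIN = "RS-RackConnect-INBOUND"  # chain added by RackConnect, see listing above

def looks_rackconnected(iptables_output):
    """Return True once the RackConnect chain shows up in `iptables -nL` output."""
    return RC_CHAIN in iptables_output

def check():
    """Print 'yes' when the local firewall has been reconfigured by RC."""
    out = subprocess.check_output(["iptables", "-nL"]).decode()
    print("yes" if looks_rackconnected(out) else "no")
```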


That means you can use the example script above and run it as many times as you want (in some loop with delays between executions). As soon as the cloud server is rackconnected the script output will turn to 'yes'.

Rackspace Cloud server API performance analysis for cloud server bursting

In the cloud world the public cloud API allows us to create cloud servers of different sizes. The current list of available flavors (combinations of RAM size and disk space) for the FirstGen Rackspace cloud is:

$ /usr/bin/cloudservers --username $FG_OS_USERNAME --apikey $FG_OS_PASSWORD  flavor-list
+----+---------------+-------+------+
| ID |      Name     |  RAM  | Disk |
+----+---------------+-------+------+
| 1  |   256 server  |  256  |  10  |
| 2  |   512 server  |  512  |  20  |
| 3  |   1GB server  |  1024 |  40  |
| 4  |   2GB server  |  2048 |  80  |
| 5  |   4GB server  |  4096 | 160  |
| 6  |   8GB server  |  8192 | 320  |
| 7  | 15.5GB server | 15872 | 620  |
| 8  |  30GB server  | 30720 | 1200 |
+----+---------------+-------+------+

Problem

Performance can be measured in many different ways. When working with the API I recently started asking myself the questions below. As I couldn't find a definitive answer I decided to perform some testing to get a better feel for what you can expect.
  1. How long does it take to create XYZ identical cloud servers and get access to them (XYZ can vary from 4 to 100)?
  2. Is there any performance degradation to expect when we create a high number of cloud servers on demand in a short amount of time?
  3. Does the cloud server size affect the time needed to create a single server?
  4. Does the cloud server size have to be taken into consideration when we perform cloud bursting? 
  5. What is the cloud server build failure rate when performing cloud bursting?

Analysis

I have run a number of tests to find out more about the performance when doing cloud bursting. The tests were relatively simple but they still give us a good overview.

Single test case description
 - Create XYZ cloud servers as quickly as possible (for big numbers we need to watch the API limits and introduce artificial delays).
 - Using API polling, verify when the build is complete and how long it takes.
 - Save logs for offline review and generate build statistics.
 - Measure how long the overall test takes.
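
The polling step from the test case above can be sketched like this. The names are mine, not from the original test harness, and get_status stands in for whatever API call returns the server build status.

```python
import time

def wait_until_active(get_status, server_id, timeout_s=600, poll_every_s=10,
                      sleep=time.sleep):
    """Poll get_status(server_id) until the build finishes, fails or times out.
    Returns (final_status, seconds_waited)."""
    waited = 0
    while waited <= timeout_s:
        status = get_status(server_id)
        if status in ("ACTIVE", "ERROR"):
            return status, waited
        sleep(poll_every_s)          # injectable for testing
        waited += poll_every_s
    return "TIMEOUT", waited
```

In the tests below a server that was still not ACTIVE after roughly 10 minutes was counted as an error.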

For different cloud flavors I simulated different cloud bursting scenarios. All the tests were tailored and limited to a maximum of 150GB of RAM. The tests were run sequentially, one after another, with a small gap of about 2-10 min between them (log review and results collection).

 simulate test#1 -s 100 -f 1  # build 100 cloud servers, using 256MB instance
 simulate test#2 -s 50  -f 2  
 simulate test#3 -s 50  -f 3  
 simulate test#4 -s 50  -f 4
 simulate test#5 -s 50  -f 5
 simulate test#6 -s 18  -f 6 
 simulate test#7 -s 9   -f 7
 simulate test#8 -s 4   -f 8


Results



Each test has a different color. A single dot represents the whole time needed to start a build for a single cloud server and then check its status until it is ready to be used.

The table below shows the overall time for running each test.

+----------+------------+-----------+-------------+-------------+--------------+--------+
| bursting | Flavor RAM | Flavor Id | test start  | test end    | duration [m] | errors |
+----------+------------+-----------+-------------+-------------+--------------+--------+
| 100      | 256        | 1         | 03:28:27 PM | 04:11:35 PM | 43:08.10     | 0      |
| 50       | 512        | 2         | 08:33:23 PM | 08:56:02 PM | 22:39.12     | 1      |
| 50       | 1024       | 3         | 09:09:55 PM | 09:32:04 PM | 22:09.25     | 1      |
| 50       | 2048       | 4         | 09:46:14 PM | 10:02:41 PM | 16:26.50     | 1      |
| 50       | 4096       | 5         | 10:22:21 PM | 10:35:05 PM | 12:43.30     | 12     |
| 18       | 8192       | 6         | 11:06:48 PM | 11:14:28 PM | 07:39.87     | 0      |
| 9        | 15872      | 7         | 11:16:23 PM | 11:23:38 PM | 07:15.35     | 0      |
| 4        | 30720      | 8         | 11:35:33 PM | 11:46:27 PM | 10:53.72     | 1      |
+----------+------------+-----------+-------------+-------------+--------------+--------+


Having these results I can say that:

1. On average it takes from 300 to 500 seconds to create a cloud server. The only exception is when you build the biggest one with 30GB of RAM, which can take up to 600s and more.

2. The API limits for creating servers directly influence the results. The burst test for 256MB instances took 43 minutes to complete. The create times for a single cloud server ranged from a minimum of 250 to almost 500 seconds.

3. Based on the tests we don't see any significant build performance degradation pattern. This is a somewhat expected result, as the cloud servers are built on different hypervisors in different zones (huddles). This means that as long as there are enough free resources in the data center the builds should be fine.

4. On average the build time for the different flavors is similar. Every test had its min and max times. The more cloud servers we build during the burst, the more the distribution varies. We need to keep in mind that the number of cloud servers to build was not the same in each test.

5. Although all the build times are between 250 and 500 seconds (the exception being the 30GB cloud server), it is visible that bigger instances require more time to build. Interestingly, within a single test the graph shows that as we progress and create more and more cloud servers the build time first decreases and later spikes up. This pattern can be clearly observed for the 256MB instances.

6. The results for the 4GB instances are not fully correct. It has to be noted that the errors in the table were a result of the hard API limits we ran into. The test cloud account had a limit of about 150GB of RAM that we could consume.

7. With this in mind, the build failure rates were minimal or zero for all tests. It is important to note that the errors are actually only timeout issues. In every test we waited a maximum of about 10 min for a cloud server to be up and running. Only 3 tests produced cloud servers that were not accessible within 10 min. I believe that if we had waited longer these servers would have been built successfully.

References

1. http://searchcloudcomputing.techtarget.com/definition/cloud-bursting

Sunday, August 12, 2012

What cloud provider do you use

Everyone knows that there are a number of cloud players on the market. Some of them have been there from the beginning, like Amazon; others are new, like Google with the recent release of Google Compute Engine (GCE) [1].

These links give you a list of some of the big names behind the ongoing cloud adoption and industry transformation. But this information is only a listing of possible vendors to choose from; how is the market shared between them? These questions are not answered there, and there is little publicly available material on the Internet that elaborates on it.

This is one of the links I found on the Openstack mailing list that gives us a small inside view into this problem: Cloud tools survey

The most interesting results are listed in points #4 and #5.

4. Which clouds do you use (choose all that are appropriate)?
Amazon: 47.4%
Rackspace: 36.8%
HP Cloud: 21.1%
Go Grid: 0%
Linode: 31.6%
Softlayer: 0%
Joyent: 0%
Azure: 5.3%
Google: 10.5%
Private Cloud: 42.1%
Other responses included AppFog, OrionVM (Australian provider), KT UCloud Biz (Korean provider), eNoCloud and Physical Servers.

5. How many cloud based servers do you manage (not physical servers)?
1-10: 45.5%
11-50: 27.3%
50-200: 9.1%
200-1000: 18.2%
1000+: 0%
A lot of fairly large deployments – many of them appear to be private clouds (who responded via the OpenStack list).

References
  1. http://cloud.google.com/products/compute-engine.html

High Availability trends and architectures in cloud computing

Every now and then there is a shift and change in the IT industry. The changes try to provide solutions to our old problems as well as predict and introduce new ideas. Many times it is not a change in hardware alone or in software alone but a mixture of both.

The Cloud is today the buzzword that drives the changes and powers the transformation. In my attempts to embrace and understand what it is I have found a couple of videos on YouTube that give a very nice inside view into the cloud: what it is, how it works and what ideas it brings with it. 

This is a link to the repository of all videos: http://www.youtube.com/user/TheCloudcastNET

The one I particularly like because it is discussing concept of High Availability in Cloud:



Saturday, August 11, 2012

How to execute remote commands from python over ssh connection

There are a number of possible solutions for Python that allow you to execute a command remotely. Below is a list of libraries/modules I found when researching it.
  1. paramiko
    http://www.lag.net/paramiko/
    http://pypi.python.org/pypi/paramiko/1.7.7.2
  2. ssh module (a wrapper for paramiko)
    http://stackoverflow.com/questions/1939107/python-libraries-for-ssh-handling
    http://media.commandline.org.uk/code/ssh.txt
  3. Python bindings for the C library libssh2
    http://www.no-ack.org/2010/11/python-bindings-for-libssh2.html
  4. fabric
    http://docs.fabfile.org/en/1.4.3/index.html
  5. python SSH module (based on paramiko)
    http://pypi.python.org/pypi/ssh/1.7.11 
    https://github.com/bitprophet/ssh
An interesting introduction to paramiko can be found here: SSH Programming with Paramiko

This is a simple python example as well.

The cloudservers script doesn't work any longer after installing novaclient

I have played with the FirstGen Rackspace cloud on my box. It was working fine. Some info about how to start with it can be found here: rackspace-cloudservers.

At some point I started using the Openstack nova tool to interact with the Rackspace NextGen cloud to run some tests. Once I was done I returned to the cloudservers tool and discovered that it had stopped working.

Problem

For every command I ran I always got the same error message. An example output:
$ /usr/bin/cloudservers --username user --apikey 123  list
Traceback (most recent call last):
  File "/usr/bin/cloudservers", line 9, in module
    load_entry_point('python-cloudservers==1.0a5', 'console_scripts', 'cloudservers')()
  File "/usr/lib/pymodules/python2.7/cloudservers/shell.py", line 413, in main
    CloudserversShell().main(sys.argv[1:])
  File "/usr/lib/pymodules/python2.7/cloudservers/shell.py", line 127, in main
    args.func(args)
  File "/usr/lib/pymodules/python2.7/cloudservers/shell.py", line 279, in do_list
    print_list(self.cs.servers.list(), ['ID', 'Name', 'Status', 'Public IP', 'Private IP'])
  File "/usr/lib/pymodules/python2.7/cloudservers/shell.py", line 402, in print_list
    pt.printt(sortby=fields[0])
  File "/usr/local/lib/python2.7/dist-packages/prettytable.py", line 163, in __getattr__
    raise AttributeError(name)
AttributeError: printt
Solution

After debugging the prettytable.py code it turned out that there is no such function as printt. Further research confirmed this [1]. To fix this I changed the code of the cloudservers module on my PC. The new code after the changes is listed below.

# vim +399 /usr/lib/pymodules/python2.7/cloudservers/shell.py

def print_list(objs, fields):
    pt = prettytable.PrettyTable([f for f in fields], caching=False)
    pt.aligns = ['l' for f in fields]
    for o in objs:
        pt.add_row([getattr(o, f.lower().replace(' ', '_'), '') for f in fields])

    # pt.printt(sortby=fields[0])
    print pt.get_string(sortby=fields[0])
References
  1. http://code.google.com/p/prettytable/issues/detail?id=14&q=printt
  2. https://answers.launchpad.net/nova/+question/198709

  3. Other examples where the code was broken as well
  4. https://github.com/calebgroom/clb/issues/24
  5. http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=673790

Thursday, August 9, 2012

Difference between class and instance variables in Python

Object oriented programming in Python can be tricky if your background is C++ or Java. The first thing is that both C++ and Java are statically typed. This is different in Python, which belongs to the class of dynamically typed languages [4].

In Java and C++, for example, if you forget to declare a variable before you try to access it, an error is generated during the compilation phase. Below is the compiler output for an example Java class that illustrates this.

Java

$ javac JavaExample.java 
JavaExample.java:7: cannot find symbol
symbol  : variable number
location: class JavaExample
        number=_number;
        ^
JavaExample.java:11: cannot find symbol
symbol  : variable number
location: class JavaExample
       System.out.println("[test " + number + "] list=%s");
                                     ^
2 errors
Python

When you start applying object oriented programming paradigms in Python and begin to write classes and create instances you may encounter an interesting phenomenon. The code below illustrates this.

The difference between the two classes is that in the class Test2ClassVariable we have additionally defined the instance variable self.mylist=[].
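
The code originally embedded in this post is missing; a reconstruction that produces the output shown below could look like this (the exact class and method names are assumptions, kept consistent with the text above).

```python
class TestClassVariable(object):
    mylist = []                       # class variable, shared by all instances

    def add(self, item):
        self.mylist.append(item)

    def show(self, n):
        print("[test %d] list=%s" % (n, " ".join(self.mylist)))

class Test2ClassVariable(TestClassVariable):
    def __init__(self):
        self.mylist = []              # instance variable shadowing the class one

a = TestClassVariable();  a.add("one");    a.show(1)
b = TestClassVariable();  b.add("two");    b.show(2)   # shares mylist with a
c = Test2ClassVariable(); c.add("three");  c.show(3)
d = Test2ClassVariable(); d.add("four");   d.show(4)
```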

When you run this code this is the output:
$ python object_variables_example.py
[test 1] list=one
[test 2] list=one two
[test 3] list=three
[test 4] list=four
The interesting part of the output is in line #3. The variable list has 2 entries where at first look we would expect it to have only one. Not knowing anything about objects in Python, this would be the natural intuition for everyone with a Java or C++ background.

Explanation

We see this behaviour because Python uses the concept of class and instance variables. For Java/C++ programmers, a Python class variable is simply like a Java/C++ static class variable. In our example the list variable is a class variable and is shared between all class instances. That explains the output we see.

References
  1. http://stackoverflow.com/questions/68645/static-class-variables-in-python
  2. http://docs.python.org/tutorial/classes.html#instance-objects
  3. http://en.wikipedia.org/wiki/Class_variable
  4. http://en.wikipedia.org/wiki/Python_(programming_language)

Tuesday, August 7, 2012

Basic tutorial on how to use and debug the Cloud API for the NextGen (Openstack based) or FirstGen (built by Mosso) Rackspace cloud infrastructure

Before we start writing code and playing with the API we have to install the necessary Python libraries first. I used Ubuntu 11.04 for my testing.

Tools installation
  • Openstack API library
There is a package with the Openstack libraries in the repositories; unfortunately it is not the latest code compatible with the NextGen Rackspace cloud release. We have to install the tools manually as described here [1].

# don't do this in Ubuntu 11.04 because it is based on an older version of the library 
# and will not work
aptitude install python-novaclient

aptitude install python-setuptools
easy_install pip
pip install python-novaclient
pip install --upgrade  python-novaclient

# To verify that the files were installed 
find /usr -name novaclient
/usr/share/pyshared/novaclient
/usr/local/lib/python2.7/dist-packages/novaclient
  • FirstGen
# to install the library
aptitude install python-rackspace-cloudservers

# to verify where the files were installed
find /usr -name cloudservers
/usr/share/pyshared/cloudservers
/usr/lib/pymodules/python2.7/cloudservers

Debugging Nova Openstack API

As Rackspace launched its cloud first in the USA, with Europe following a few weeks later, the URL below is for a US based cloud account. More about the URLs can be found here [2].

nova_example.py

# cat nova_example.py 
import httplib2
httplib2.debuglevel = 1

OS_USERNAME="user"
OS_PASSWORD="api"
OS_AUTH_URL="https://identity.api.rackspacecloud.com/v2.0/"

from novaclient.v1_1 import client
nt=client.Client(OS_USERNAME,OS_PASSWORD,'',OS_AUTH_URL)
nt.flavors.list()

You can then run this like python -i nova_example.py, or run python and copy the code into it like this:

$ python -u
>>> import httplib2
>>> httplib2.debuglevel = 1
>>> 
>>> OS_USERNAME="user"
>>> OS_PASSWORD="api"
>>> OS_AUTH_URL="https://identity.api.rackspacecloud.com/v2.0/"
>>> 
>>> from novaclient.v1_1 import client
>>> nt=client.Client(OS_USERNAME,OS_PASSWORD,'',OS_AUTH_URL)
>>> nt.flavors.list()
connect: (dfw.servers.api.rackspacecloud.com, 443)
send: 'GET /v2/672114/flavors/detail HTTP/1.1\r\nHost: dfw.servers.api.rackspacecloud.com\r\nx-auth-token: mytoken\r\naccept-encoding: gzip, deflate\r\naccept: application/json\r\nuser-agent: python-novaclient\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Tue, 07 Aug 2012 21:16:28 GMT
header: Content-Length: 2448
header: Content-Type: application/json
header: X-Compute-Request-Id: req-4e7dda9c-7eee-402f-acc8-b98dc08e21b5
header: Server: Jetty(8.0.y.z-SNAPSHOT)
[Flavor: 512MB Standard Instance, Flavor: 1GB Standard Instance, Flavor: 2GB Standard Instance, Flavor: 4GB Standard Instance, Flavor: 8GB Standard Instance, Flavor: 15GB Standard Instance, Flavor: 30GB Standard Instance]

Debugging FirstGen API

firstgen_example.py

# cat  firstgen_example.py
import httplib2
httplib2.debuglevel = 1

u='user'
k='api'

from cloudservers import CloudServers
cs=CloudServers(u,k)
cs.flavors.list()

An example output

$ python -u
>>> import httplib2
>>> httplib2.debuglevel = 1
>>>
>>> from cloudservers import CloudServers
>>> u='user'
>>> k='api'
>>> cs=CloudServers(u,k)
>>> cs.flavors.list()
connect: (lon.auth.api.rackspacecloud.com, 443)
send: 'GET /v1.0 HTTP/1.1\r\nHost: lon.auth.api.rackspacecloud.com\r\nx-auth-key: key\r\naccept-encoding: gzip, deflate\r\nx-auth-user: hugoalmeidauk\r\nuser-agent: python-cloudservers/1.0a1\r\n\r\n'
reply: 'HTTP/1.1 204 No Content\r\n'
header: Server: Apache/2.2.3 (Red Hat)
header: vary: X-Auth-User,X-Auth-Key,X-Storage-User,X-Storage-Pass
header: X-Storage-Url: https://storage101.lon3.clouddrive.com/v1/MossoCloudFS_c99f3ebf-f27d-4933-94b7-9e9ebf2bc7cd
header: Cache-Control: s-maxage=60806
header: Content-Type: text/xml
header: Date: Tue, 07 Aug 2012 21:19:36 GMT
header: X-Auth-Token: token
header: X-Storage-Token: token2
header: X-Server-Management-Url: https://lon.servers.api.rackspacecloud.com/v1.0/10001641
header: Connection: Keep-Alive
header: X-CDN-Management-Url: https://cdn3.clouddrive.com/v1/MossoCloudFS_c99f3ebf-f27d-4933-94b7-9e9ebf2bc7cd
header: Content-Length: 0
connect: (lon.servers.api.rackspacecloud.com, 443)
send: 'GET /v1.0/10001641/flavors/detail?fresh HTTP/1.1\r\nHost: lon.servers.api.rackspacecloud.com\r\nx-auth-token: token\r\naccept-encoding: gzip, deflate\r\nuser-agent: python-cloudservers/1.0a1\r\n\r\n'
reply: 'HTTP/1.1 203 OK\r\n'
header: Server: Apache-Coyote/1.1
header: vary:  Accept, Accept-Encoding, X-Auth-Token
header: Content-Encoding: gzip
header: Vary: Accept-Encoding
header: Last-Modified: Tue, 21 Jun 2011 21:09:45 GMT
header: X-PURGE-KEY: /flavors
header: Cache-Control: s-maxage=1800
header: Content-Type: application/json
header: Content-Length: 175
header: Date: Tue, 07 Aug 2012 21:19:36 GMT
header: X-Varnish: 1664913388 1664913324
header: Age: 44
header: Via: 1.1 varnish
header: Connection: keep-alive
[Flavor: 256 server, Flavor: 512 server, Flavor: 1GB server, Flavor: 2GB server, Flavor: 4GB server, Flavor: 8GB server, Flavor: 15.5GB server, Flavor: 30GB server]

References
  1. http://docs.rackspace.com/servers/api/v2/cs-gettingstarted/content/section_gs_install_nova.html
  2. http://docs.rackspace.com/servers/api/v2/cs-gettingstarted/content/section_gs_auth.html
  3. http://pypi.python.org/pypi/python-novaclient
  4. https://github.com/openstack/python-novaclient
  5. http://www.rackspace.com/knowledge_center/article/cloud-servers-how-to-articles-other-resources

Monday, August 6, 2012

Links to Cloud provider web management consoles

For each cloud provider you have to learn the most important links and how to use their tools. Below is a short list, in alphabetical order, of those I came across.

  • Amazon Web Service (AWS) management console
https://console.aws.amazon.com 


  • HP Cloud based on Openstack management console 
https://console.hpcloud.com 

  • Rackspace Cloud Control Panel (FirstGen Cloud) 

https://lon.manage.rackspacecloud.com 
https://manage.rackspacecloud.com 


  • Rackspace Cloud Control Panel (NextGen Cloud base on Openstack)

https://mycloud.rackspace.com


I'm using Rackconnect and after my cloud server is built I can't ping anything

Rackconnect is a Rackspace product that allows you to use dedicated hardware as well as the cloud products (cloud servers, cloud files and others). It provides cloud servers a secure way to connect to the dedicated environment and the other way around. The majority of implementations are done using F5 load balancers and Cisco ASA firewalls, but the company has plans to support other vendors as well. To get a better understanding of RC, an implementation can for example be done on:
  • standalone ASA
  • standalone F5
  • HA ASA 
  • HA F5 
  • single ASA and single F5 
  • single ASA and HA f5 
  • HA ASA and single F5 
  • HA ASA and HA F5
There are sometimes situations when things don't work as expected. It can be because of a misconfiguration, or in very rare cases because of bugs. Below is an example of a problem I was asked to help troubleshoot.

Problem 

My newly built cloud server is isolated. After logging in over the console I can't ping any other servers. Example:
  • I can't ping my default gateway from the cloud
  • I can't ping any other internal or external address (example Google DNS 8.8.8.8)
Troubleshooting and Analysis

The customer is using Rackconnect. That makes all the network and routing configuration a little bit different compared to how a standard cloud server without Rackconnect is set up.

Example of a route output from a cloud server that is rackconnected:

$ ip r | sort | column -t
10.176.0.0/12    via  10.176.0.1     dev    eth1
10.176.0.0/18    dev  eth1           proto  kernel  scope   link  src  10.176.4.37
10.191.192.0/18  via  10.176.0.1     dev    eth1
default          via  10.176.11.111  dev    eth1

For a comparison a routing table from a cloud server that belongs to a cloud account that is not linked to Rackconnect:

# ip r | sort | column -t
10.176.0.0/12     via  10.177.128.1   dev    eth1
10.177.128.0/18   dev  eth1           proto  kernel  scope   link  src  10.177.132.15
10.191.192.0/18   via  10.177.128.1   dev    eth1
164.177.146.0/24  dev  eth0           proto  kernel  scope   link  src  164.177.146.87
default           via  164.177.146.1  dev    eth0    metric  100

Starting the troubleshooting, I was able to reproduce the issue and confirm that you can't ping the default gateway or any internal or external IP:

$ ping 10.176.11.111
PING 10.176.11.111 (10.176.11.111) 56(84) bytes of data.
^C
--- 10.176.11.111 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 1293ms

Looking further I saw that the host fails to resolve the IP 10.176.11.111 (the default gateway) to its MAC address. The following output from the cloud server confirmed this.

$ arp -an
? (10.176.11.111) at incomplete on eth1

As the customer was using RC with an F5 load balancer I went there to check what traffic was hitting the LB. I confirmed that the cloud default gateway is defined on the F5 as a self IP object, so the F5 should be responding to the pings.

# tmsh list /net self
net self 10.176.11.111/18 {
    vlan hybridServiceNet-100
}

When pinging from the cloud server I saw that the F5 sees the ARP request but never replies.

[lb:Active]~ # tcpdump -s0 -l -nn -i 0.0:nnn arp or icmp or host 10.176.4.37 | grep --color=always 10.176.4.37
15:14:26.465108 arp who-has 10.176.11.111 tell 10.176.4.37


I tried to ping the cloud server from the F5 and, to my surprise, it worked fine.

Shortly after I successfully pinged the cloud server from the F5, I could even see ICMP requests coming from the cloud server, but the F5 was still not responding. This means that once the cloud server learned the MAC address of its default gateway, it stopped sending ARP requests and proceeded to send ICMP requests, as expected.

The cloud server kept the MAC of its default gateway in its ARP cache for a while, but once the entry timed out there were only ARP requests on the wire again.

[lb:Active] ~ # tcpdump -s0 -l -nn -i 0.0:nnn arp or icmp or host 10.176.4.37 | grep --color=always 10.176.4.37
15:20:35.309694 IP 10.176.4.37 > 10.176.11.111: ICMP echo request, id 36453, seq 1, length 64 in slot1/tmm0 lis= flowtype=0 flowid=0 =00000000:00000000:00000000:00000000 remoteport=0 localport=0 proto=0 vlan=0

Looking further at the F5 configuration, I found that a filter to allow this traffic was missing. This matters because the default Rackconnect implementation drops any traffic sent from the cloud network to the Rackconnect device (the F5 for my customer). Below is an example of the filter that was needed for the ping to start working.

Example of a missing filter

net packet-filter RCAuto-ID_8-NP_28780-CS_57663-GW_62497 {
    action accept
    order 8
    rule "( src host 10.176.4.37 ) and ( dst host 10.176.11.111 )"
}

The easy way to create it is to go to the MyRackspace portal and add all the basic network policies under Rackconnect. Based on these network policies the Rackconnect system generated a number of rules for the F5, and one of them was the missing filter described above:

Basic Access Configuration Policy 1  CLOUD SERVERS  CLOUD SERVERS  ANY  ALL
Basic Access Configuration Policy 2  CLOUD SERVERS  DEDICATED      ANY  ALL
Basic Access Configuration Policy 3  CLOUD SERVERS  INTERNET       ANY  ALL
Basic Access Configuration Policy 4  DEDICATED      CLOUD SERVERS  ANY  ALL

This resolved the issue with the default gateway, but I was still unable to ping 8.8.8.8. Once again, looking at the tcpdumps on the F5, I saw the traffic hitting the F5, but the load balancer never forwarded it further; the traffic was simply dropped on the incoming VLAN.

To understand why this happens, it is important to know that by default the F5 drops any traffic unless there is a configuration object to handle it. In my case there were no specific virtual servers, NATs or SNATs that could handle the traffic; we had only a forwarding virtual server (VS). As the VS was enabled only on specific VLANs, it was not used to handle the cloud traffic, so all of it was dropped.

ltm virtual VS-FORWARDING {
    destination any:any
    ip-forward
    mask any
    profiles {
        PROF-FASTL4-FORWARDING { }
    }
    translate-address disabled
    translate-port disabled
    vlans {
        internal
        external
    }
    vlans-enabled
}

Once we enabled the VS on the necessary VLANs, all issues were resolved ;).

Sunday, August 5, 2012

Do you know what is powering Rackconnect at Rackspace?

These few videos give some insight into the Hybrid Cloud product that Rackspace has to offer to its customers:





If you are looking for more information, here are some links to read more about Rackconnect:


How to configure the Python interpreter to save all your commands to a history file on a disk

The interactive Python interpreter is a great way to quickly test Python expressions and code snippets. With the default configuration the only disadvantage is that it doesn't keep any history of the commands you ran last time. If you restart the interpreter, you have to type all the commands again and again.

Solution

One of the solutions is to define a script to execute when the interpreter starts. This is described here [1]. My own configs can be found here [2].
  1. http://docs.python.org/tutorial/interactive.html
  2. https://github.com/rtomaszewski/configs
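
For reference, a minimal version of such a startup script might look like the sketch below. It uses the standard readline and atexit modules; the history file name (~/.python_history) and the startup file location are my own choices, not anything Python mandates:

```python
# Save e.g. as ~/.pythonstartup and point the interpreter at it,
# for example in ~/.bashrc:
#   export PYTHONSTARTUP=~/.pythonstartup
import atexit
import os
import readline

HISTFILE = os.path.expanduser("~/.python_history")

# Reload the history saved by previous interactive sessions, if any.
if os.path.exists(HISTFILE):
    readline.read_history_file(HISTFILE)

# Write the in-memory history back to disk when the interpreter exits.
atexit.register(readline.write_history_file, HISTFILE)
```

With this in place, pressing the Up arrow in a new interpreter session brings back the commands from the previous one.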

What does the cloud API return when a server build fails

Cloud is a very flexible solution that can be used to build resilient, highly available and self-healing architectures. But it is important to understand that it is not unbreakable, and your solution has to be able to deal with failures.

As an example, once in a while you are going to see that the cloud API call to create a server succeeded, but the server itself never came online because the build failed. On the Rackspace cloud infrastructure this can look like the following.

  • In the original Cloud Control Panel (CP / CCP)
https://lon.manage.rackspacecloud.com/CloudServers/ServerList.do 



  • In the new Cloud Control Panel (CP / CCP)
https://mycloud.rackspace.com/a/accountname/# 




  • Cloud API
$ curl -v  -H 'x-auth-token: token' https://lon.servers.api.rackspacecloud.com/v1.0/10001641/servers/10204743?fresh | json_xs 
 GET /v1.0/10001641/servers/10204743?fresh HTTP/1.1
 User-Agent: curl/7.21.3 (i686-pc-linux-gnu) libcurl/7.21.3 OpenSSL/0.9.8o zlib/1.2.3.4 libidn/1.18
 Host: lon.servers.api.rackspacecloud.com
 Accept: */*
 x-auth-token: token
 
 HTTP/1.1 203 OK
 Server: Apache-Coyote/1.1
 vary:  Accept, Accept-Encoding, X-Auth-Token
 Last-Modified: Sun, 05 Aug 2012 01:58:58 GMT
 X-PURGE-KEY: /10001641/servers/10204743
 Cache-Control: s-maxage=1800
 Content-Type: application/json
 Content-Length: 238
 Date: Sun, 05 Aug 2012 11:19:24 GMT
 X-Varnish: 1664599762 1664599608
 Age: 50
 Via: 1.1 varnish
 Connection: keep-alive
{
   "server" : {
      "status" : "ERROR",
      "progress" : 0,
      "name" : "csperform1344130906",
      "imageId" : 112,
      "flavorId" : 1,
      "addresses" : {
         "private" : [
            "10.176.4.188"
         ],
         "public" : [
            "46.38.185.119"
         ]
      },
      "hostId" : "cb728dfc3549dc7c92fb4abcba89dd0a",
      "metadata" : {},
      "id" : 10204743
   }
}
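
In automation, this means a successful create call is not enough; you have to poll the server resource and check the status field until it reaches a final state. A minimal sketch of that check (the helper names are mine; the JSON layout follows the v1.0 response above):

```python
import json

def build_finished(doc):
    """A build is finished once the status is ACTIVE or ERROR;
    anything else (e.g. BUILD) means keep polling."""
    return doc["server"]["status"] in ("ACTIVE", "ERROR")

def build_failed(doc):
    """True when the API reports the build ended in ERROR."""
    return doc["server"]["status"] == "ERROR"

# Abbreviated version of the API response shown above.
doc = json.loads("""
{ "server": { "status": "ERROR", "progress": 0,
              "name": "csperform1344130906", "id": 10204743 } }
""")

if build_finished(doc) and build_failed(doc):
    # In real code: delete the broken server and issue a new create call.
    print("build of %s failed" % doc["server"]["name"])
```

Run against the response above, this prints "build of csperform1344130906 failed"; a self-healing deployment script would then recreate the server instead of waiting forever for it to come online.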

Saturday, August 4, 2012

How to change the action for Alt-Tab shortcut in VirtualBox and switch application windows running on your Windows 7 desktop rather than switching windows between the applications inside the VM only

Using multiple virtual machines on a single desktop can be a little bit tricky when you need to quickly change focus between the various windows. In my practice I often have a Windows 7 based desktop and run multiple VMs using either VirtualBox or VMware Workstation.

Problem

On the Windows 7 desktop, Alt+Tab switches between all the running application windows perfectly. Once inside a single VM, the same combination switches between the windows of applications inside the VM rather than all your original Windows applications.

Solution

Use the host key. In VirtualBox, by default, this is the right Ctrl key on your keyboard. It changes the focus: once pressed, any key pressed afterwards is evaluated by your host operating system and not by the VM itself.

References
  1. http://ron.shoutboot.com/2010/09/11/alt-tab-out-of-virtualbox/
  2. http://pg-jottings.blogspot.co.uk/2011/05/host-key-and-virtual-box-how-to-alt-tab.html
  3. http://www.virtualbox.org/manual/ch01.html

Thursday, August 2, 2012

How to use Cloud Files internally from dedicated servers when using Rackconnect

This is similar to the previous post: It is possible to use Cloud Files on the private ServiceNet from Cloud Servers.

This issue is an example of where this question was asked: Gladinet CloudAFS: RackSpace Servicenet.

When Rackconnect is implemented, it changes the default gateway on the cloud servers. It is also used as the default gateway for the dedicated servers when they want to send traffic to the Rackspace cloud infrastructure. All together this means that when a dedicated server wants to send a Cloud Files API request, all it has to do is use the private network IP address/URL name [1]:

$ dig +short A snet-storage101.lon3.clouddrive.com
10.191.209.30

$ curl -X GET -D - -H "X-Auth-Token: mytoken" https://snet-storage101.lon3.clouddrive.com/v1/MossoCloudFS_f3ae48bb-983f-49e1-b656-0c58730fed1b

References
  1. http://www.rackspace.com/blog/networking-and-cloud-servers-more-on-the-interfaces/
  2. http://support.rightscale.com/06-FAQs/What_is_Rackspace_ServiceNet_and_how_can_I_use_it%3F
  3. http://www.rackspace.com/knowledge_center/frequently-asked-question/what-is-servicenet
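
The same internal request can be issued from a short Python script as well; the only difference from a public-URL client is the snet- prefix on the storage hostname. A minimal sketch (the token is a placeholder and the helper name is mine):

```python
import urllib.request

def snet_url(public_url):
    """Turn a public Cloud Files storage URL into its ServiceNet
    variant by prefixing the hostname with "snet-"."""
    return public_url.replace("https://", "https://snet-", 1)

url = snet_url("https://storage101.lon3.clouddrive.com"
               "/v1/MossoCloudFS_f3ae48bb-983f-49e1-b656-0c58730fed1b")
req = urllib.request.Request(url, headers={"X-Auth-Token": "mytoken"})

# urllib.request.urlopen(req) would then list your containers; it is not
# called here because the snet- endpoint is reachable only from inside
# the Rackspace network.
print(url)
```

This prints the ServiceNet URL used in the curl example above.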