# Automatic Deployment of a Hadoop-Spark Cluster using Terraform
---

This is a project I created for the `Big Data Systems & Techniques` course of my MSc in Data Science.
It is a basic implementation of an orchestration system that provisions and configures a 3-node cluster (the number of data nodes can easily be extended) with Apache Hadoop and Apache Spark.

## Project's Task

The task of this project is to use the [Terraform](https://terraform.io) IaC (Infrastructure as Code) tool to automatically provision Amazon VMs and install Hadoop on the cluster.
The resources used are:
- Linux image: `Ubuntu 16.04`
- Java: `jdk1.8.0_131`
- Apache Hadoop: `hadoop-2.7.2`
- Apache Spark: `spark-2.1.1`

## What is Infrastructure as Code and what is Terraform?

Infrastructure as code is a DevOps practice in which application infrastructure is no longer created by hand but programmatically. The benefits are numerous,
including but not limited to:
- Speed of deployment
- Version control of infrastructure
- Engineer-agnostic infrastructure (no single point of failure/no single person to bug)
- Better lifetime management (automatic scale up/down, healing)
- Cross-provider deployment with minimal changes

Terraform is a tool that helps in this direction. It is an open source tool developed by [Hashicorp](https://www.hashicorp.com/).

It allows you to declare the final state that you wish your infrastructure to have, and Terraform applies the necessary changes for you.

You can provision VMs, create subnets, assign security groups and perform pretty much any action that any cloud provider allows.

Terraform supports a wide range of [providers](https://www.terraform.io/docs/providers/index.html), including the big three: AWS, GCP and Microsoft Azure.
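
To give a flavour of what "declaring the final state" looks like, here is a minimal, illustrative configuration for a single AWS VM (the region and AMI are taken from the examples later in this guide):

```hcl
# Illustrative only: this declares *what* should exist;
# Terraform works out *how* to create, change or destroy it.
provider "aws" {
  region = "eu-west-1"
}

resource "aws_instance" "example" {
  ami           = "ami-a8d2d7ce"   # Ubuntu 16.04
  instance_type = "t2.micro"
}
```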

## Installing Terraform

Terraform is written in Go and is provided as a binary for the major OSs, but it can also be compiled from [source code](https://github.com/hashicorp/terraform).

The binary can be downloaded from the Terraform [site](https://www.terraform.io/downloads.html) and does not require any installation. We just need to add it to the PATH variable (instructions for Linux/macOS can be found [here](https://stackoverflow.com/questions/14637979/how-to-permanently-set-path-on-linux) and for Windows [here](https://stackoverflow.com/questions/1618280/where-can-i-set-path-to-make-exe-on-windows)) so that it is accessible from any location on our system.

Once this is done, we can confirm that it is ready to be used by running the `terraform` command; we should get something like the following:
```
$ terraform
Usage: terraform [--version] [--help] <command> [args]

The available commands for execution are listed below.
The most common, useful commands are shown first, followed by
less common or more advanced commands. If you're just getting
started with Terraform, stick with the common commands. For the
other commands, please read the help and docs before usage.

Common commands:
    apply              Builds or changes infrastructure
    console            Interactive console for Terraform interpolations
    destroy            Destroy Terraform-managed infrastructure
    env                Environment management
    fmt                Rewrites config files to canonical format
    get                Download and install modules for the configuration
    graph              Create a visual graph of Terraform resources
    import             Import existing infrastructure into Terraform
    init               Initialize a new or existing Terraform configuration
    output             Read an output from a state file
    plan               Generate and show an execution plan
    push               Upload this Terraform module to Atlas to run
    refresh            Update local state file against real resources
    show               Inspect Terraform state or plan
    taint              Manually mark a resource for recreation
    untaint            Manually unmark a resource as tainted
    validate           Validates the Terraform files
    version            Prints the Terraform version

All other commands:
    debug              Debug output management (experimental)
    force-unlock       Manually unlock the terraform state
    state              Advanced state management
```

Now we can move on to using the tool.

## Setting up the AWS account

This step is not specific to this project; rather, it is something that needs to be configured whenever a new AWS account is set up.
When we create a new account with Amazon, the default account we are given has root access to any action. As with the Linux root user, we do not want to use this account for day-to-day actions, so we need to create a new user.

We navigate to the [Identity and Access Management (IAM)](https://console.aws.amazon.com/iam/home#) page, click on `Users`, then the `Add user` button. We provide the user name and tick the `Programmatic access` checkbox so that an access key ID and a secret access key will be generated.

Clicking next, we are asked to provide a group that this user will belong to. Groups are the main way to grant permissions and restrict access to the specific actions required. For the purposes of this project we will give this user the `AdministratorAccess` permission; however, in a professional setting it is advised to allow only the permissions a user needs (like `AmazonEC2FullAccess` if a user will only be creating EC2 instances).

After finishing the review step, Amazon provides the access key ID and secret access key. We will hand these to Terraform to grant it access to create the resources for us. We need to keep them safe, as they are shown only once and cannot be retrieved later (however, we can always create a new pair).

The secure way to store these credentials, as recommended by [Amazon](https://aws.amazon.com/blogs/security/a-new-and-standardized-way-to-manage-credentials-in-the-aws-sdks/), is to keep them in a file called `credentials` inside a hidden folder. Terraform can read this file to retrieve them.

```
$ cd
$ mkdir .aws
$ cd .aws
~/.aws$ vim credentials
```

We add the following to the credentials file after replacing `ACCESS_KEY` and `SECRET_KEY` and then save it:

```
[default]
aws_access_key_id = ACCESS_KEY
aws_secret_access_key = SECRET_KEY
```

We also restrict access to this file to the current user only:

```
~/.aws$ chmod 600 credentials
```
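
Terraform's AWS provider falls back to this file automatically, so no keys need to appear in the configuration itself. A minimal provider block (the region is an assumption; the console links in this guide use `eu-west-1`) is enough:

```hcl
# No access keys here: the AWS provider reads ~/.aws/credentials
# (the [default] profile) when no keys are set inline.
provider "aws" {
  region = "eu-west-1"
}
```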

## Setting up a key pair

The next step is to create a key pair so that Terraform can access the newly created VMs. Notice that this is different from the credentials above: the Amazon credentials allow the AWS service to create the required resources, while this key pair will be used for accessing the new instances.

Log into the [AWS console](https://eu-west-1.console.aws.amazon.com/ec2/v2/home?region=eu-west-1#KeyPairs:sort=keyName) and select `Create Key Pair`. Add a name and click `Create`. AWS will create a .pem file and download it locally.

Move this file to the `.aws` directory.
```
~/Downloads$ mv ssh-key.pem ../.aws/
```

Then restrict the permissions:
```
$ chmod 400 ssh-key.pem
```

Now we are ready to use this key pair, either via a direct ssh to our instances, or for Terraform to use it to connect to the instances and run some scripts.
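
As a rough sketch (the resource name is illustrative; `ubuntu` is the default login user for Ubuntu AMIs), the key pair is wired into a Terraform resource like this:

```hcl
resource "aws_instance" "example" {
  ami           = "ami-a8d2d7ce"   # Ubuntu 16.04
  instance_type = "t2.micro"
  key_name      = "ssh-key"        # the name given in the AWS console

  # How Terraform itself logs in to run scripts on the instance
  connection {
    user        = "ubuntu"
    private_key = "${file("~/.aws/ssh-key.pem")}"
  }
}
```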

## Provisioning VMs & Configuring Them

The following Terraform script is responsible for creating the VM instances, copying the relevant keys to give us access to them, and running the startup script that configures the nodes.

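The script itself ships as a separate file in this commit; as a hedged sketch, reconstructed from the plan output below rather than copied from the actual `main.tf`, its core resources might look like:

```hcl
provider "aws" {
  region = "eu-west-1"
}

resource "aws_security_group" "instance" {
  name = "Namenode-instance"

  # One ingress block per port; the plan output shows rules for
  # 22 (ssh), 80 (http), 9000 (HDFS namenode) and 50010 (HDFS datanode).
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "Namenode" {
  ami                    = "ami-a8d2d7ce"   # Ubuntu 16.04
  instance_type          = "t2.micro"
  key_name               = "ssh-key"
  private_ip             = "172.31.32.101"
  vpc_security_group_ids = ["${aws_security_group.instance.id}"]

  tags {
    Name = "s01"
  }
}

resource "aws_instance" "Datanode" {
  count                  = 2
  ami                    = "ami-a8d2d7ce"
  instance_type          = "t2.micro"
  key_name               = "ssh-key"
  private_ip             = "172.31.32.10${count.index + 2}"   # .102, .103
  vpc_security_group_ids = ["${aws_security_group.instance.id}"]

  tags {
    Name = "s0${count.index + 2}"   # s02, s03
  }
}
```
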
We run this with `terraform plan`, and Terraform informs us about the changes it is going to make:

```
$ terraform plan
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.

The Terraform execution plan has been generated and is shown below.
Resources are shown in alphabetical order for quick scanning. Green resources
will be created (or destroyed and then created if an existing resource
exists), yellow resources are being changed in-place, and red resources
will be destroyed. Cyan entries are data sources to be read.

Note: You didn't specify an "-out" parameter to save this plan, so when
"apply" is called, Terraform can't guarantee this is what will execute.

+ aws_instance.Datanode.0
    ami: "ami-a8d2d7ce"
    associate_public_ip_address: "<computed>"
    availability_zone: "<computed>"
    ebs_block_device.#: "<computed>"
    ephemeral_block_device.#: "<computed>"
    instance_state: "<computed>"
    instance_type: "t2.micro"
    ipv6_address_count: "<computed>"
    ipv6_addresses.#: "<computed>"
    key_name: "ssh-key"
    network_interface.#: "<computed>"
    network_interface_id: "<computed>"
    placement_group: "<computed>"
    primary_network_interface_id: "<computed>"
    private_dns: "<computed>"
    private_ip: "172.31.32.102"
    public_dns: "<computed>"
    public_ip: "<computed>"
    root_block_device.#: "<computed>"
    security_groups.#: "<computed>"
    source_dest_check: "true"
    subnet_id: "<computed>"
    tags.%: "1"
    tags.Name: "s02"
    tenancy: "<computed>"
    volume_tags.%: "<computed>"
    vpc_security_group_ids.#: "<computed>"

+ aws_instance.Datanode.1
    ami: "ami-a8d2d7ce"
    associate_public_ip_address: "<computed>"
    availability_zone: "<computed>"
    ebs_block_device.#: "<computed>"
    ephemeral_block_device.#: "<computed>"
    instance_state: "<computed>"
    instance_type: "t2.micro"
    ipv6_address_count: "<computed>"
    ipv6_addresses.#: "<computed>"
    key_name: "ssh-key"
    network_interface.#: "<computed>"
    network_interface_id: "<computed>"
    placement_group: "<computed>"
    primary_network_interface_id: "<computed>"
    private_dns: "<computed>"
    private_ip: "172.31.32.103"
    public_dns: "<computed>"
    public_ip: "<computed>"
    root_block_device.#: "<computed>"
    security_groups.#: "<computed>"
    source_dest_check: "true"
    subnet_id: "<computed>"
    tags.%: "1"
    tags.Name: "s03"
    tenancy: "<computed>"
    volume_tags.%: "<computed>"
    vpc_security_group_ids.#: "<computed>"

+ aws_instance.Namenode
    ami: "ami-a8d2d7ce"
    associate_public_ip_address: "<computed>"
    availability_zone: "<computed>"
    ebs_block_device.#: "<computed>"
    ephemeral_block_device.#: "<computed>"
    instance_state: "<computed>"
    instance_type: "t2.micro"
    ipv6_address_count: "<computed>"
    ipv6_addresses.#: "<computed>"
    key_name: "ssh-key"
    network_interface.#: "<computed>"
    network_interface_id: "<computed>"
    placement_group: "<computed>"
    primary_network_interface_id: "<computed>"
    private_dns: "<computed>"
    private_ip: "172.31.32.101"
    public_dns: "<computed>"
    public_ip: "<computed>"
    root_block_device.#: "<computed>"
    security_groups.#: "<computed>"
    source_dest_check: "true"
    subnet_id: "<computed>"
    tags.%: "1"
    tags.Name: "s01"
    tenancy: "<computed>"
    volume_tags.%: "<computed>"
    vpc_security_group_ids.#: "<computed>"

+ aws_security_group.instance
    description: "Managed by Terraform"
    egress.#: "1"
    egress.482069346.cidr_blocks.#: "1"
    egress.482069346.cidr_blocks.0: "0.0.0.0/0"
    egress.482069346.from_port: "0"
    egress.482069346.ipv6_cidr_blocks.#: "0"
    egress.482069346.prefix_list_ids.#: "0"
    egress.482069346.protocol: "-1"
    egress.482069346.security_groups.#: "0"
    egress.482069346.self: "false"
    egress.482069346.to_port: "0"
    ingress.#: "4"
    ingress.2214680975.cidr_blocks.#: "1"
    ingress.2214680975.cidr_blocks.0: "0.0.0.0/0"
    ingress.2214680975.from_port: "80"
    ingress.2214680975.ipv6_cidr_blocks.#: "0"
    ingress.2214680975.protocol: "tcp"
    ingress.2214680975.security_groups.#: "0"
    ingress.2214680975.self: "false"
    ingress.2214680975.to_port: "80"
    ingress.2319052179.cidr_blocks.#: "1"
    ingress.2319052179.cidr_blocks.0: "0.0.0.0/0"
    ingress.2319052179.from_port: "9000"
    ingress.2319052179.ipv6_cidr_blocks.#: "0"
    ingress.2319052179.protocol: "tcp"
    ingress.2319052179.security_groups.#: "0"
    ingress.2319052179.self: "false"
    ingress.2319052179.to_port: "9000"
    ingress.2541437006.cidr_blocks.#: "1"
    ingress.2541437006.cidr_blocks.0: "0.0.0.0/0"
    ingress.2541437006.from_port: "22"
    ingress.2541437006.ipv6_cidr_blocks.#: "0"
    ingress.2541437006.protocol: "tcp"
    ingress.2541437006.security_groups.#: "0"
    ingress.2541437006.self: "false"
    ingress.2541437006.to_port: "22"
    ingress.3302755614.cidr_blocks.#: "1"
    ingress.3302755614.cidr_blocks.0: "0.0.0.0/0"
    ingress.3302755614.from_port: "50010"
    ingress.3302755614.ipv6_cidr_blocks.#: "0"
    ingress.3302755614.protocol: "tcp"
    ingress.3302755614.security_groups.#: "0"
    ingress.3302755614.self: "false"
    ingress.3302755614.to_port: "50010"
    name: "Namenode-instance"
    owner_id: "<computed>"
    vpc_id: "<computed>"


Plan: 4 to add, 0 to change, 0 to destroy.
```

Then we run `terraform apply` to start the creation of our resources. Once it is done, we can see that Terraform has output the DNS name of the master node, so we can log in to it and start our services.

In order to remove all resources, we run `terraform destroy`.
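
The DNS name mentioned above is produced by an `output` block along these lines (the output name here is illustrative):

```hcl
# Printed after `terraform apply`; the name "namenode_public_dns"
# is an assumption, not necessarily the one used in this repo.
output "namenode_public_dns" {
  value = "${aws_instance.Namenode.public_dns}"
}
```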

## Using Configuration Tools

While the bash-script approach above is fine for a small project, we want to use a more advanced configuration tool if we are going to use Terraform in production.
There are many choices here, the main one being `Chef`, which Terraform supports natively; however, we can also use the other major tools such as Ansible or Puppet, as long as they are installed on our local Terraform machine.
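
For context, the bash-script approach corresponds to Terraform's built-in `remote-exec` provisioner; a configuration-management tool would take the place of a block like this (the script path is illustrative):

```hcl
resource "aws_instance" "Namenode" {
  # ... instance arguments as above ...

  # Runs setup commands over SSH once the instance is up.
  # In a production setup this block would be replaced by a
  # "chef" (or similar) provisioner.
  provisioner "remote-exec" {
    inline = [
      "sudo bash /tmp/setup.sh",   # illustrative script name
    ]
  }
}
```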

Furthermore, Terraform suggests creating custom images using the [Packer](https://www.packer.io) tool. Custom images are built from a base Linux (or any other) image after the required software has been added. The result is packaged into a single image that is loaded onto the VM ready to be used. This saves both time and bandwidth when creating the infrastructure.

## Future improvements

As mentioned above, the goal is to make this more customizable regarding the number of nodes that can be created, the versions of Java, Hadoop and Spark used, and the instance type of the nodes.

## Resources

- [Terraform](https://www.terraform.io/intro/index.html)
- [AWS](https://www.terraform.io/docs/providers/aws/index.html)
- [Google Cloud](https://www.terraform.io/docs/providers/google/index.html)
- [Introduction to Packer](https://www.packer.io/intro/getting-started/install.html)
- [Using Chef](https://www.terraform.io/docs/provisioners/chef.html)
- [Using Ansible](https://www.trainingdevops.com/training-material/ansible-workshop/using-ansible-with-terrafoam)
