
Commit 03cc765

Merge pull request #881 from pwendell/master
Extend QuickStart to include next steps
2 parents: 0e9565a + 0e375a3

File tree

1 file changed: +31 −8 lines

docs/quick-start.md

@@ -53,7 +53,7 @@ scala> textFile.filter(line => line.contains("Spark")).count() // How many lines
res3: Long = 15
{% endhighlight %}

-## More On RDD Operations
+## More on RDD Operations
RDD actions and transformations can be used for more complex computations. Let's say we want to find the line with the most words:

{% highlight scala %}
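For context, the hunk cuts off at the opening highlight tag. A computation like the one described (finding the line with the most words) could look roughly like the following sketch; it assumes the `textFile` RDD defined earlier in the quick start and is not part of this patch:

{% highlight scala %}
// Map each line to its word count, then reduce by keeping the larger of each pair.
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
{% endhighlight %}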
@@ -163,8 +163,6 @@ $ sbt run
Lines with a: 46, Lines with b: 23
{% endhighlight %}

-This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode](spark-standalone.html) documentation, and consider using a distributed input source, such as HDFS.
-
# A Standalone Job In Java
Now say we wanted to write a standalone job using the Java API. We will walk through doing this with Maven. If you are using other build systems, consider using the Spark assembly JAR described in the developer guide.

@@ -252,8 +250,6 @@ $ mvn exec:java -Dexec.mainClass="SimpleJob"
Lines with a: 46, Lines with b: 23
{% endhighlight %}

-This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode](spark-standalone.html) documentation, and consider using a distributed input source, such as HDFS.
-
# A Standalone Job In Python
Now we will show how to write a standalone job using the Python API (PySpark).

@@ -290,6 +286,33 @@ $ ./pyspark SimpleJob.py
Lines with a: 46, Lines with b: 23
{% endhighlight python %}

-This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode](spark-standalone.html) documentation, and consider using a distributed input source, such as HDFS.
-
-Also, this example links against the default version of HDFS that Spark builds with (1.0.4). You can run it against other HDFS versions by [building Spark with another HDFS version](index.html#a-note-about-hadoop-versions).
+# Running Jobs on a Cluster
+
+There are a few additional considerations when running jobs on a
+[Spark](spark-standalone.html), [YARN](running-on-yarn.html), or
+[Mesos](running-on-mesos.html) cluster.
+
+### Including Your Dependencies
+If your code depends on other projects, you will need to ensure they are also
+present on the slave nodes. A popular approach is to create an
+assembly jar (or "uber" jar) containing your code and its dependencies. Both
+[sbt](https://github.com/sbt/sbt-assembly) and
+[Maven](http://maven.apache.org/plugins/maven-assembly-plugin/)
+have assembly plugins. When creating assembly jars, list Spark
+itself as a `provided` dependency; it need not be bundled since it is
+already present on the slaves. Once you have an assembled jar,
+add it to the SparkContext as shown here. It is also possible to submit
+your dependent jars one-by-one when creating a SparkContext.
+
+### Setting Configuration Options
+Spark includes several configuration options which influence the behavior
+of your job. These should be set as
+[JVM system properties](configuration.html#system-properties) in your
+program. The options will be captured and shipped to all slave nodes.
+
+### Accessing Hadoop Filesystems
+
+The examples here access a local file. To read data from a distributed
+filesystem, such as HDFS, include
+[Hadoop version information](index.html#a-note-about-hadoop-versions)
+in your build file. By default, Spark builds against HDFS 1.0.4.
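To make the "Including Your Dependencies" paragraph in this hunk concrete, here is a rough sketch (not part of the patch) of marking Spark as `provided` and passing an assembly jar to the SparkContext. The dependency coordinates, master URL, job name, and jar path are illustrative assumptions:

{% highlight scala %}
import spark.SparkContext  // org.apache.spark.SparkContext in later releases

// In the sbt build, Spark itself would be scoped "provided" (coordinates illustrative):
//   "org.spark-project" %% "spark-core" % "<spark-version>" % "provided"

// Ship the assembled jar to the slaves by listing it when creating the SparkContext.
val jars = Seq("target/my-job-assembly.jar")  // placeholder path to your assembly jar
val sc = new SparkContext("spark://master:7077", "My Job", System.getenv("SPARK_HOME"), jars)
{% endhighlight %}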
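Similarly, for "Setting Configuration Options", a minimal sketch (not part of the patch) of setting an option as a JVM system property before the context is created; `spark.cores.max` is one such option and the values are placeholders:

{% highlight scala %}
// System properties must be set before the SparkContext is constructed;
// they are then captured and shipped to the slave nodes.
System.setProperty("spark.cores.max", "4")  // cap the total cores this job uses
val sc = new SparkContext("spark://master:7077", "My Job")
{% endhighlight %}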
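And for "Accessing Hadoop Filesystems", a hedged sketch (not part of the patch): with a `hadoop-client` dependency matching your cluster's HDFS version in the build file, reading from HDFS is the same call as the local-file examples; only the URI changes. The namenode host, port, and path are placeholders:

{% highlight scala %}
// Build dependency (version must match your HDFS cluster), e.g.:
//   "org.apache.hadoop" % "hadoop-client" % "<your-hdfs-version>"

// Read from HDFS instead of the local filesystem by using an hdfs:// URI.
val logData = sc.textFile("hdfs://namenode:9000/path/to/logs.txt")
println("Lines in file: " + logData.count())
{% endhighlight %}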
