Guide to Spark with Python
The Spark Python API (PySpark) exposes the Spark programming model to Python. To learn the basics of Spark, we recommend reading through the Scala programming guide first; it should be easy to follow even if you don't know Scala. This guide will show how to use the Spark features described there in Python.

Key Differences in the Python API

There are a few key differences between the Python and Scala APIs.
In PySpark, RDDs support the same methods as their Scala counterparts but take Python functions and return Python collection types. Short functions can be passed to RDD methods using Python's lambda syntax.
You can also pass functions that are defined with the def keyword; this is useful for longer functions that cannot easily be written as a lambda.
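A sketch of the same filter written with a named function:

```python
# Equivalent to the lambda above, but defined as a named function.
def is_error(line):
    return "ERROR" in line

errors = logData.filter(is_error)
```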
Functions can access objects in enclosing scopes, although modifications to those objects within RDD methods will not be propagated back.
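A minimal sketch of this behavior, reusing the logData RDD from above (the variable names are illustrative):

```python
error_keywords = ["Exception", "Error"]   # read from the enclosing scope on the workers

def is_error(line):
    return any(keyword in line for keyword in error_keywords)

errors = logData.filter(is_error)

# Mutations made inside an RDD method stay on the workers' copies:
seen = []
logData.foreach(lambda line: seen.append(line))
# len(seen) is still 0 on the driver afterwards.
```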
PySpark will automatically ship these functions to workers, along with any objects that they reference. Instances of classes will be serialized and shipped to workers by PySpark, but classes themselves cannot be automatically distributed to workers. The Standalone Programs section below describes how to ship code dependencies to workers. In addition, PySpark fully supports interactive use: simply run ./bin/pyspark to launch an interactive shell.
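As a sketch of that distinction (the module and class names here are hypothetical, not from the guide): an instance created on the driver can be referenced from a passed function and is pickled along with it, but the module that defines its class must be shipped separately as a code dependency.

```python
# my_matchers.py defines KeywordMatcher; the module itself must be shipped
# to workers (e.g. via the pyFiles option or sc.addPyFile), because only the
# instance referenced below, not the class definition, travels with the function.
from my_matchers import KeywordMatcher   # hypothetical helper module

matcher = KeywordMatcher("ERROR")                            # created on the driver
errors = logData.filter(lambda line: matcher.matches(line))  # matcher is pickled and shipped
```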
Installing and Configuring PySpark

PySpark requires Python 2.6 or higher. PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions. We have not tested PySpark with Python 3 or with alternative Python interpreters, such as PyPy or Jython.

By default, PySpark expects python to be available on the system PATH and uses it to run programs; an alternate Python executable may be specified by setting the PYSPARK_PYTHON environment variable in conf/spark-env.sh (or .cmd on Windows).

All of PySpark's library dependencies, including Py4J, are bundled with PySpark and automatically imported.

Standalone PySpark applications should be run using the bin/pyspark script, which automatically configures the Java and Python environment using the settings in conf/spark-env.sh (or .cmd).

Interactive Use

The bin/pyspark script launches a Python interpreter that is configured to run PySpark applications. To use it interactively, first build Spark, then launch bin/pyspark from the command line without any options.
The Python shell can be used to explore data interactively and is a simple way to learn the API.
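For instance, a session along these lines (the dictionary file path is just an example; sc is provided by the shell):

```python
>>> words = sc.textFile("/usr/share/dict/words")
>>> words.filter(lambda w: w.startswith("spar")).take(5)   # first five words starting with "spar"
>>> words.count()                                          # total number of words in the file
```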
By default, the bin/pyspark shell creates a SparkContext that runs applications locally on a single core. To connect to a non-local cluster, or to use multiple cores, set the MASTER environment variable when launching the shell; for example, MASTER=spark://IP:PORT ./bin/pyspark connects to a standalone Spark cluster.
Or, to use four cores on the local machine, launch it as MASTER=local[4] ./bin/pyspark.
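The same choice can also be made programmatically when a SparkContext is created in a script, by passing a master URL to its constructor (a minimal sketch; the application name is illustrative):

```python
from pyspark import SparkContext

# "local[4]" runs locally on four cores; a URL such as "spark://IP:PORT"
# would point at a standalone cluster instead.
sc = SparkContext("local[4]", "Example App")
```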
IPython

It is also possible to launch PySpark in IPython, the enhanced Python interpreter. PySpark works with IPython 1.0.0 and later. To use IPython, set the IPYTHON environment variable to 1 when running bin/pyspark, for example IPYTHON=1 ./bin/pyspark.
Alternatively, you can customize the ipython command by setting IPYTHON_OPTS; for example, IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark launches the IPython Notebook with PyLab graphing support.
IPython also works on a cluster or on multiple cores if you set the MASTER environment variable as described above.

Standalone Programs

PySpark can also be used from standalone Python scripts by creating a SparkContext in your script and running the script using bin/pyspark. Code dependencies can be deployed by listing them in the pyFiles argument to the SparkContext constructor.
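For example (a sketch; the file names are placeholders):

```python
from pyspark import SparkContext

# Each listed file is added to the PYTHONPATH and shipped to worker machines.
sc = SparkContext("local", "App Name",
                  pyFiles=["MyFile.py", "lib.zip", "app.egg"])
```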
Files listed there will be added to the PYTHONPATH and shipped to remote worker machines; dependencies can also be added to an existing SparkContext with its addPyFile() method. You can set configuration properties by passing a SparkConf object to SparkContext.
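A minimal sketch (the master, application name, and property value are illustrative):

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)
```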
API Docs

API documentation for PySpark is available as Epydoc. Many of the methods also contain doctests that provide additional usage examples.

Libraries

MLlib is also available in PySpark. To use it, you'll need NumPy version 1.7 or newer, and Python 2.7. The MLlib guide contains some example applications.

Where to Go from Here

PySpark also includes several sample programs in the python/examples folder. You can run them by passing the script to bin/pyspark, for example ./bin/pyspark python/examples/wordcount.py.
Each program prints usage help when run without arguments.