Initialization and Termination

To get started with Project Orca, you need to call init_orca_context to create or get a SparkContext for your Spark cluster (and launch Ray services across the cluster if necessary). When your application finishes, call stop_orca_context to stop the SparkContext (and stop the Ray services across the cluster if necessary).

from zoo.orca import init_orca_context, stop_orca_context

# At the very beginning:
sc = init_orca_context(cluster_mode="local", cores=2, memory="2g", num_nodes=1,
                       init_ray_on_spark=False, **kwargs)

# Your application goes after init_orca_context.

# When your application finishes:
stop_orca_context()

Arguments for init_orca_context:

- cluster_mode: the mode of the Spark cluster, e.g. "local", "yarn-client" or "spark-submit". Default to be "local".
- cores: the number of cores to use on each node. Default to be 2.
- memory: the memory allocated for each node. Default to be "2g".
- num_nodes: the number of nodes to use in the cluster. Default to be 1.
- init_ray_on_spark: whether to launch Ray services across the cluster. Default to be False.
- kwargs: any additional arguments, e.g. extra Spark configurations.

For the "spark-submit" cluster mode, use spark-submit to submit the application, and set the Spark configurations through command-line options or the properties file. You need to use "spark-submit" for yarn-cluster or k8s-cluster mode. To make things easier, we recommend using the launch scripts we provide.

For the other cluster modes, we recommend installing and running analytics-zoo through pip, which is more convenient.
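For instance, a pip-installed application can initialize against a YARN cluster directly from Python. A minimal sketch, assuming a "yarn-client" cluster_mode value and that extra keyword arguments are forwarded as Spark configurations (the resource arguments mirror the local example above):

```python
from zoo.orca import init_orca_context, stop_orca_context

# Sketch only: cluster_mode value and resource sizes are illustrative.
sc = init_orca_context(cluster_mode="yarn-client", cores=4, memory="4g",
                       num_nodes=2, init_ray_on_spark=True)

# ... your application ...

stop_orca_context()
```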

Extra Configurations

Users can set extra configurations for the functionalities of Project Orca via OrcaContext.

Import OrcaContext using from zoo.orca import OrcaContext, and then you can choose to modify the following options:

OrcaContext.log_output = False  # Default
OrcaContext.log_output = True

Whether to redirect the Spark driver JVM's stdout and stderr to the current Python process. This is useful when running Analytics Zoo in a Jupyter notebook. Default to be False. Needs to be set before initializing the SparkContext.
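The idea behind this redirection can be sketched with only the standard library: capture a child process's stdout/stderr and forward each line into the current Python process, where a notebook cell would display it. This is an illustration of the pattern, not Orca's actual implementation:

```python
import subprocess
import sys

def stream_output(cmd):
    """Run cmd and forward its stdout/stderr lines into this process.

    Sketch of the log_output idea: the child's output is surfaced
    where this Python process prints (e.g. a notebook cell).
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    lines = []
    for line in proc.stdout:
        lines.append(line.rstrip("\n"))
        print(line, end="")  # re-emit in the current process
    proc.wait()
    return lines

captured = stream_output([sys.executable, "-c", "print('driver log line')"])
```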

OrcaContext.pandas_read_backend = "spark"  # Default
OrcaContext.pandas_read_backend = "pandas"

The backend for reading CSV/JSON files, either "spark" or "pandas". The "spark" backend calls spark.read and the "pandas" backend calls pandas.read. Default to be "spark".
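The option simply selects which reader is invoked under the hood. A toy dispatcher illustrating that kind of backend switch, using only the standard library (the "pandas"-style path is emulated with the csv module, and the "spark" path is stubbed out since it needs a cluster; none of this is Orca's real code):

```python
import csv
import io

def read_csv(source, backend="spark"):
    """Toy backend dispatcher in the spirit of pandas_read_backend."""
    if backend == "pandas":
        # Eager, single-process parse, like pandas.read_csv.
        return list(csv.DictReader(io.StringIO(source)))
    elif backend == "spark":
        # The real backend would call spark.read.csv on a cluster.
        raise NotImplementedError("requires a SparkSession")
    raise ValueError(f"unknown backend: {backend}")

rows = read_csv("a,b\n1,2\n3,4\n", backend="pandas")
```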

OrcaContext.serialize_data_creation = False  # Default
OrcaContext.serialize_data_creation = True

Whether to add a file lock to the data loading process for PyTorch Horovod training. This is useful when you run multiple workers on a single node that download data to the same destination. Default to be False.
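The file-lock pattern this option enables can be sketched with the standard library: workers sharing a node serialize on an advisory lock, and only the first one to acquire it performs the download. This is an illustration of the technique (fcntl is Unix-only), not Orca's internal code:

```python
import fcntl
import os
import tempfile

def download_once(dest, create):
    """Run create(dest) at most once across concurrent workers.

    Workers block on an exclusive lock; whichever wins first creates
    the file, and later workers see it already exists and skip.
    """
    lock_path = dest + ".lock"
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # block until lock is held
        try:
            if not os.path.exists(dest):
                create(dest)  # only the first worker runs this
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

calls = []

def fake_download(path):
    calls.append(path)
    with open(path, "w") as f:
        f.write("data")

dest = os.path.join(tempfile.mkdtemp(), "dataset.bin")
download_once(dest, fake_download)
download_once(dest, fake_download)  # skipped: file already exists
```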