SHARP - Scalable Hierarchical Aggregation Protocol (2.1.0)
-------------------------------------------------------------------------------

Copyright (c) 2016-2019 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

License
-------------------------------------------------------------------------------

See LICENSE file.

Overview
-------------------------------------------------------------------------------
This document addresses system-level management of the Scalable Hierarchical
Aggregation Protocol (SHARP) resources. This covers the system-wide resource
manager (Aggregation Manager - AM); the SHARP Daemon (SD), which runs locally on each
compute node and provides access to switch-based collective communication
capabilities; the user-level communication libraries libsharp and libsharp_coll;
and the Job Scheduler API, sharp_job_quota.


### Terminology

* __AN (Aggregation Node)__:  ASIC hardware and local firmware implemented in
  Switch-IB 2.
* __Tree (Aggregation Tree)__: A SHARP tree represents a reduction tree.
  The tree is composed of leaves representing data sources and internal nodes representing
  aggregation nodes, with the edges entering a node representing the association of the children with the parent node.
* __Job__: SHARP resources are allocated for a job.
* __CP__: Computation Process. An OOB (Out Of Band) process, e.g. an MPI process. In the notation CP#__n__, n is the process id.
* __Group__: A SHARP group is an aggregation collective group that describes the vertices,
  leaves and aggregation nodes, associated with a given concrete reduction operation.
  For example, the leaves of a collective group may be mapped to an MPI communicator,
  with the rest of the elements being mapped to switches. A specific reduction operation has its data sources
  on a subset of the system nodes. The subset of leaf nodes together with the aggregation nodes forming the reduction tree is called the aggregation
  group, and corresponds to a subtree of the SHARP tree.
* __AM (Aggregation Manager)__: A system-wide entity responsible for SHARP resource management.
* __SD (SHARP Daemon)__: A daemon local to each compute node, responsible for connection
  establishment. In the notation SD#__n__, n is the process id
  of the SD in the job.
  The SD that created the job has special responsibilities, including communication
  with the AM and job-level resource management. In the current implementation, Computation Process id 0 (CP#0), e.g. MPI rank 0,
  initiates job creation, so SD#0 is the special SD.
* __libsharp API__: a library (shared object) used to instruct the SD to perform actions.
* __libsharp_coll API__: a high-level API that exposes a collective abstraction over SHARP.
* __SMX__: the communication library used for SD-to-AM and SD-to-SD messaging.
* __OST__: Outstanding Operation.
* __Group channel__: a client process (Computation Process) on the node, selected for
  sending collective operations to the assigned AN.
* __Radix__: the number of children of an Aggregation Node.
* __Child index__: the index of a group member in the list of a node's children.
* __"Job Scheduler" ("JS")__: a system for managing resources in an HPC cluster, for example SLURM or IBM Platform LSF.
* __Managed environment ("managed mode")__: an environment in which a job scheduler is running
  and, correspondingly, a mode in which the SD expects notifications from the JS.
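Several of the terms above (radix, child index, and the parent/child edges of an aggregation tree) can be pictured with a minimal tree model. This is an illustrative Python sketch, not SHARP's internal representation; all names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AggNode:
    """One vertex of a reduction tree: a data source (leaf) or an AN."""
    name: str
    children: List["AggNode"] = field(default_factory=list)
    parent: Optional["AggNode"] = None

    def add_child(self, child: "AggNode") -> int:
        """Attach a child; the return value is its child index."""
        child.parent = self
        self.children.append(child)
        return len(self.children) - 1

    @property
    def radix(self) -> int:
        """Radix: the number of children of this node."""
        return len(self.children)

# A two-level tree: one aggregation node with two leaf data sources.
root = AggNode("an-0")
idx0 = root.add_child(AggNode("host-0"))
idx1 = root.add_child(AggNode("host-1"))
```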

### Aggregation Manager

The Aggregation Manager (AM) is a system management component used for system-level
configuration and management of the switch-based reduction capabilities.
It sets up the SHARP trees and manages the use of these entities.

AM is responsible for:

* SHARP resource discovery.
* Creating topology aware SHARP trees.
* Configuring SHARP switch capabilities.
* Managing SHARP resources.
* Assigning SHARP resource on request.
* Freeing SHARP resources on job termination.

AM is configured by a topology file created by the Subnet Manager (SM).
The file includes information about switches and HCAs.

Relevant parameters (AM):

* `fabric_lst_file` OpenSM v4.7.x-4.8.x
* `fabric_smdb_file` OpenSM v4.9 or later

Following the topology, AM discovers SHARP capabilities using MADs. During
discovery, AM cleans SHARP resources previously allocated on the ANs.

Relevant parameters (AM):

* `clean_an_on_discovery`

Based on the topology, AM creates Aggregation Trees. An Aggregation Tree is
a logical tree that defines the flow of collective operations. The communication capabilities
(QPs) between tree nodes are created during system initialization.

A user can configure pre-defined trees in AM. In the user-defined trees file,
the ANs are identified by the node names, as in the topology file created by the SM.
The file format is as follows:

```
tree <tree-id>
node {node description} [GUID:<port_guid_num>]
subNode {node description} [GUID:<port_guid_num>]
subNode {node description} [GUID:<port_guid_num>]
...
node {node description} [GUID:<port_guid_num>]
subNode {node description} [GUID:<port_guid_num>]
...
node {node description} [GUID:<port_guid_num>]
computePort {node description} [GUID:<port_guid_num>]
computePort {node description} [GUID:<port_guid_num>]
...
tree <tree-id>
node {node description} [GUID:<port_guid_num>]
```
See also [Trees Configuration Reference](doc/TreesConfigurationFile.md).
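For illustration, the format above can be parsed with a few lines of code. This sketch is hypothetical (it is not AM's parser); it only extracts the line kind and the trailing GUID field, and the sample input uses made-up node names and GUIDs:

```python
import re
from collections import defaultdict

# Match the line kinds from the example format and capture the port GUID.
LINE_RE = re.compile(r"^(node|subNode|computePort)\s+.*\[GUID:(\S+)\]\s*$")

def parse_trees(text):
    """Return {tree_id: [(kind, guid), ...]} for a user-defined trees file."""
    trees = defaultdict(list)
    tree_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("tree "):
            tree_id = int(line.split()[1])
            continue
        m = LINE_RE.match(line)
        if m and tree_id is not None:
            trees[tree_id].append((m.group(1), m.group(2)))
    return dict(trees)

sample = """\
tree 0
node {switch A} [GUID:0x11]
subNode {switch B} [GUID:0x22]
computePort {host1 HCA-1} [GUID:0x33]
"""
parsed = parse_trees(sample)
```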

Relevant parameters (AM):

* `trees_file`

AM computes Aggregation Trees automatically for quasi fat tree
topologies (a user-defined root GUIDs file is needed for OpenSM v4.7-4.8).

Relevant parameters (AM):

* `root_guids_file`

For a new job launch, AM allocates SHARP resources. The resource allocation
includes two main steps:

* __Tree matching.__ AM selects an available tree that has a non-broken subtree spanning
  all job hosts. For each host, AM assigns the AN with which the host may form a connection.
* __Resource allocation.__ AM sets resources for each AN that serves the job. This includes
  buffers, OSTs, the maximum number of groups, and the QPs available for children connections.

 Relevant parameters (AM):

 * `max_tree_radix`
 * `max_quota`
 * `default_quota`
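The tree-matching step can be sketched as a simple coverage check: pick a tree whose working leaves span every job host. This is an illustrative model with hypothetical tree contents and host names, not AM's actual algorithm:

```python
# Hypothetical inputs: tree_id -> set of hosts reachable through working
# (non-broken) leaves of that tree.
def match_tree(trees, job_hosts):
    need = set(job_hosts)
    for tree_id, reachable in sorted(trees.items()):
        if need <= reachable:          # the subtree spans all job hosts
            return tree_id
    return None                        # no tree matches: allocation fails

trees = {0: {"h1", "h2"}, 1: {"h1", "h2", "h3", "h4"}}
assert match_tree(trees, ["h1", "h3"]) == 1
assert match_tree(trees, ["h5"]) is None
```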

 A user application may ask for a specific amount of SHARP resources. An application can operate with
 OSTs, user data per group, and number of groups. If any of these resources is 0, AM uses the default value from its
 configuration file. OSTs, user data per OST, and max radix are translated into the size of the buffer that AM allocates for the job.
 AM can return fewer resources to the application than requested, and can even decline the resource allocation request. If there are no
 available resources for the job, HCOLL falls back to a non-SHARP implementation.

 Relevant parameters (HCOLL, SHARP_COLL):

 * `HCOLL_ENABLE_SHARP`
 * `SHARP_COLL_JOB_QUOTA_OSTS`
 * `SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST`
 * `SHARP_COLL_JOB_QUOTA_MAX_GROUPS`
 * `SHARP_COLL_JOB_QUOTA_MAX_QPS_PER_PORT`
 * `SHARP_COLL_OSTS_PER_GROUP`
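As a rough illustration of how OSTs, payload per OST, and radix translate into a buffer size, the sketch below simply multiplies the three factors. The exact formula AM uses is not documented here, so both the model and the numbers are assumptions:

```python
# Hypothetical buffer-size model: each OST holds one payload per child,
# so the AN buffer scales with OSTs x payload-per-OST x radix.
def an_buffer_bytes(osts: int, payload_per_ost: int, radix: int) -> int:
    return osts * payload_per_ost * radix

# e.g. 16 OSTs x 128 B per OST on a radix-64 subtree (made-up numbers):
size = an_buffer_bytes(16, 128, 64)  # 131072 bytes = 128 KiB
```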

AM can read configuration parameters from the command line, environment variables, or a configuration file.

Here is a simple usage example:

```
sharp_am -B --fabric_lst_file subnet.lst
```

AM supports the following configuration parameters:

```
Aggregation Manager 2.1.0
-------------------------
Usage: sharp_am [OPTION]
Examples:
sharp_am -B --fabric_lst_file subnet.lst

OPTIONS:
  -O, --config_file <value>
	Configuration file.
	If specified with '-' prefix, ignore configuration file read errors
	and used default configuration file.Exit if '-' is not specified
	and fails to read configuration file.
	Default value: /etc/sharp/sharp_am.cfg
	Parameter supports update during runtime: no

  -l, --log_file <value>
	Log file
	Default value: /var/log/sharp_am.log
	Parameter supports update during runtime: no

  --log_verbosity <value>
	Log verbosity level:
	1 - Errors
	2 - Warnings
	3 - Info
	4 - Debug
	5 - Trace
	Default value: 3
	Parameter supports update during runtime: yes

  --syslog_verbosity <value>
	Syslog verbosity level:
	1 - Errors
	2 - Warnings
	3 - Info
	4 - Debug
	5 - Trace
	Note: MADs and Trace messages are logged in syslog with debug
	verbosity level
	Default value: 1
	Parameter supports update during runtime: yes

  -V, --verbose
	Run with full verbosity
	Parameter supports update during runtime: no

  --log_categories_file <value>
	The location of the log categories file, which defines log level per category. If the file does not exists AM will ignore it
	Default value: /home/dschaffer/sharp/out/etc/fabric_log_categories.cfg
	Parameter supports update during runtime: yes

  --log_max_backup_files <value>
	Number of backup log files. Used for log rotation
	Default value: 9
	Valid range: 0-1000
	Parameter supports update during runtime: no

  --log_file_max_size <value>
	Maximum size of a log file, in MBs
	If value is 0,log rotation isn't used
	Default value: 64
	Valid range: 0-4096
	Parameter supports update during runtime: no

  --accumulate_log <value>
	Accumulate log file over multiple sessions.
	If set to FALSE and log rotation is disabled, log file is
	truncated on startup.
	Default value: TRUE
	Parameter supports update during runtime: no

  -B, --daemon
	Run in daemon mode - sharp_am will run in the background
	Parameter supports update during runtime: no

  -p, --pid_file <value>
	PID file. Makes sharp_am to write its PID to the specified file when running as daemon
	Default value: /var/run/sharp_am.pid
	Parameter supports update during runtime: no

  -c, --create_config <value>
	sharp_am will dump its configuration to the specified file and exit
	Default value: (null)
	Parameter supports update during runtime: no

  --ftree_ca_order_file <value>
	Path of ftree CA order file generated by OpenSM.
	 Its contents can be used when implementing all-to-all communication
	Default value: /var/log/opensm-ftree-ca-order.dump
	Parameter supports update during runtime: yes
	This parameter is deprecated

  -t, --trees_file <value>
	SHARP trees file
	If NULL, calculate trees automatically
	Default value: (null)
	Parameter supports update during runtime: no

  --max_tree_radix <value>
	The maximum radix used in the system.
	The value should be a multiple of four.
	Default value: 252
	Valid range: 16-252
	Parameter supports update during runtime: no

  --span_all_agg_nodes <value>
	Generate trees that span all possible aggregation nodes.
	Relevant only if topology_type is tree.
	Default value: TRUE
	Parameter supports update during runtime: no

  --control_path_version <value>
	The control path version (IB: AM class version) to be set on all fabric aggregation nodes.
	If set to 0 the value will be the minimal supported version discovered on sharp_am startup.
	Aggregation nodes that does not support the selected or minimal discovered control path version, will be excluded from aggregation trees.
	1 - SHARPv1 (SwitchIB2)
	2 - SHARPv2 (Quantum)
	Default value: 0
	Parameter supports update during runtime: no

  --clean_and_exit <value>
	Clean all resources on aggregation nodes and exit
	Default value: FALSE
	Parameter supports update during runtime: no

  --fabric_smdb_file <value>
	Fabric SMDB file
	Default value: /var/log/opensm-smdb.dump
	Parameter supports update during runtime: no

  --fabric_virt_file <value>
	Fabric virtualization file
	Default value: /var/log/opensm-virtualization.dump
	Parameter supports update during runtime: no

  --lst_file_timeout <value>
	Length of timeout [in seconds] between attempts to load the LST file
	Default value: 3
	Valid range: 0-255
	Parameter supports update during runtime: yes

  --lst_file_retries <value>
	Max number of retry attempts when loading the LST file, and encountering "No such file " errors
	Default value: 0
	Valid range: 0-255
	Parameter supports update during runtime: yes

  --topology_type <value>
	Topology type
	The following topology types are supported:
	tree, hypercube, dfp, auto
	auto - set topology type according to routing engine in smdb file
	Default value: auto
	Parameter supports update during runtime: no

  --hyper_cube_coordinates_file <value>
	Hyper Cube coordinates file
	Required when running on Hyper Cube topologies
	Default value: /var/log/opensm-dor-coordinates
	Parameter supports update during runtime: no
	This parameter is deprecated

  --enable_sat <value>
	Enable Streaming Aggregation Trees (SAT) creation and usage.
	Default value: TRUE
	Parameter supports update during runtime: no
	This parameter is deprecated

  --recovery_retry_interval <value>
	Set auto recovery for trees.
	 0 - Do not recover trees
	 x - Recovery retry timeout in seconds.
	 Recommended to set to > 30 seconds.
	Default value: 300
	Valid range: 0-1800
	Parameter supports update during runtime: no

  --jobs_reconnection_timeout <value>
	A time that AM waits till port is UP.
	Upon port is UP, AM query sharp jobs to check if it are still active.
	0 - disable jobs reconnection
	x - timeout in seconds
	Default value: 30
	Valid range: 0-1800
	Parameter supports update during runtime: no

  --enable_seamless_restart <value>
	Enable Seamless restart.
	If enabled, AM tries to recover state from last SHARP AM run.
	Default value: TRUE
	Parameter supports update during runtime: no

  --persistent_dir <value>
	Path to persistent data directory
	Default value: /var/lib/sharp
	Parameter supports update during runtime: no

  --dump_dir <value>
	Path to dump files directory
	Default value: /var/log
	Parameter supports update during runtime: no

  --generate_dump_files <value>
	Dump internal state to files for debug and diagnostics
	Default value: TRUE
	Parameter supports update during runtime: yes

  --max_quota <value>
	Maximum quota that can be requested by a single job
	It is guarantee that no job will receive more than max quota
	Format: "(Trees-per-job, OSTs-per-tree, User-data-per-ost, Groups-per-tree, QPs-per-port-per-tree)"
	Default value: (4, 500, 256, 500, 180)
	Parameter supports update during runtime: no
	This parameter is deprecated

  --default_quota <value>
	Default quota to be requested for a single job
	The quota that will be requested for a job if no quota was requested explicitly
	Format: "(Trees-per-job, OSTs-per-tree, User-data-per-ost, Groups-per-tree, QPs-per-port-per-tree)"
	Default value: (1, 16, 128, 8, 64)
	Parameter supports update during runtime: no
	This parameter is deprecated

  --per_prio_max_quota <value>
	Maximum percentage of quota (OSTs, Buffers and Groups) per aggregation node per tree, that can be requested by a single job by its priority.
	It is guarantee that no job will receive more than max quota
	Format: "prio_0_quota, [prio_1_quota, ..., prio_9_quota] "
	Default value: 100
	Parameter supports update during runtime: yes

  --per_prio_default_quota <value>
	Default percentage of quota (OSTs, Buffers and Groups) per aggregation node per tree, to be requested for a single LLT job by its priority.
	The quota in percent per tree that will be requested for a job if no quota was requested explicitly
	Format: "prio_0_quota, [prio_1_quota, ..., prio_9_quota] "
	Default value: 20
	Parameter supports update during runtime: yes

  --per_prio_default_sat_quota <value>
	Default percentage of quota (OSTs, Buffers and Groups) per aggregation node per tree, to be requested for a single SAT job by its priority.
	The quota in percent per tree that will be requested for a job if no quota was requested explicitly
	Format: "prio_0_quota, [prio_1_quota, ..., prio_9_quota] "
	Default value: 3
	Parameter supports update during runtime: yes

  --low_prio_max_accumulated_quota <value>
	Maximum accumulated quota (OSTs, Buffers and Groups) percentage that can be allocated for all low priority jobs (priority = 0) on a single AN.
	Default value: 100
	Valid range: 1-100
	Parameter supports update during runtime: no

  --sat_jobs_default_absolute_osts <value>
	Default number of OSTs to be allocated for sat jobs per aggregation node per tree.
	Zero value means use the percentage default.
	Default value: 0
	Valid range: 0-1000
	Parameter supports update during runtime: yes

  --max_trees_to_build <value>
	Maximum number of trees to build
	Default value: 126
	Valid range: 63-1022
	Parameter supports update during runtime: no

  --max_trees_per_job <value>
	Maximum number of trees per job It is guarantee that no job will receive more than max trees
	Default value: 8
	Valid range: 1-8
	Parameter supports update during runtime: yes

  --default_trees_per_job <value>
	Default number of trees per job The number of trees allocated for a job if it was not requested explicitly
	Default value: 1
	Valid range: 1-8
	Parameter supports update during runtime: yes

  --dynamic_tree_allocation <value>
	Enable dynamically allocated trees for each sharp job.Default: True (trees are dynamically allocated for each sharp job)
	Default value: TRUE
	Parameter supports update during runtime: no

  --dynamic_tree_algorithm <value>
	Set which algorithm should be used by the dynamic tree mechanism.
	This parameter is ignored when dynamic_tree_allocation is false.
	0 - Super pod oriented algorithm.
	1 - Quasi Fat Tree and Dragonfly+ oriented algorithm.
	Default value: 0
	Parameter supports update during runtime: no

  --DFP_max_childless_islands <value>
	Maximum number of islands with no allocated HCA that can participate in job's tree, in Dragonfly+ topology.
	This parameter is ignored unless dynamic_tree_allocation is true and dynamic_tree_algorithm is 1.
	Default value: 1
	Valid range: 0-255
	Parameter supports update during runtime: no

  --default_reproducibility <value>
	Default value for reproducibility mode
	Default value: TRUE
	Parameter supports update during runtime: yes

  --enable_job_pkey_on_tree <value>
	Set job PKEY on aggregation tree allocate for the job.
	Default value: FALSE
	Parameter supports update during runtime: no

  --enable_exclusive_lock <value>
	Enable allocation of exclusive lock per job (if requested in BeginJob).
	Default value: TRUE
	Parameter supports update during runtime: no

  --smx_protocol <value>
	SMX default protocol:
	1 - UCX
	2 - Sockets
	3 - File (Debug mode)
	Default value: 2
	Parameter supports update during runtime: no

  --smx_enabled_protocols <value>
	Bit mask of enabled SMX protocols, multiple protocols can be enabled
	bit 0 - UCX
	bit 1 - Socket
	bit 2 - Unix socket
	Notice that Unix socket is required to communicate with UFM
	Default value: 6
	Valid range: 1-7
	Parameter supports update during runtime: no

  --smx_sock_interface <value>
	Network interface to be used by SMX for socket connections:
	empty string (default) - Use first interface found in UP state
	UFM configuration am_interface is setting this interface
	Default value: (null)
	Parameter supports update during runtime: no

  --smx_sock_port <value>
	The external port to be used by SMX
	Default value: 6126
	Valid range: 1024-49151
	Parameter supports update during runtime: no

  --smx_sock_addr_family <value>
	Determines which address family will be used in SMX's sockets, 'auto' will use both IPv4 and IPv6 if they are available
	The value needs to be one of the following: { auto, ipv4, ipv6 }
	Default value: auto
	Parameter supports update during runtime: no

  --smx_ucx_interface <value>
	Network interface to be used by SMX for UCX connections:
	empty string (default) - Use first IB interface found in Active state
	Default value: (null)
	Parameter supports update during runtime: no

  --smx_keepalive_refresh_interval <value>
	Keepalive configuration,
	Periodically check for connections that should be refreshed.
	0 - disable periodic refresh
	x - interval in [seconds]
	Default value: 30
	Valid range: 0-65535
	Parameter supports update during runtime: no

  --smx_keepalive_min_time_before_connection_refresh <value>
	Keepalive configuration,
	Defines the minimum amount of time (in seconds) a keepalive connection exists before it is refreshed.
	Valid values are between 60 seconds and 1 day (86400 seconds)
	Default value: 600
	Valid range: 60-86400
	Parameter supports update during runtime: no

  --smx_keepalive_min_percentage_of_connections_to_refresh_at_iteration <value>
	Keepalive configuration,
	Defines the percentage of connections from the total number of connection that can be refreshed at each refresh iteration.
	Default value: 10
	Valid range: 1-100
	Parameter supports update during runtime: no

  --smx_init_timeout <value>
	Maximum time [in seconds] for waiting for reply from SMX threads before terminating gracefully.
	Default value: 5
	Valid range: 0-65535
	Parameter supports update during runtime: no

  -g, --ib_port_guid <value>
	GUID of the port to which aggregation manager binds to
	Default value: 0x0
	Parameter supports update during runtime: no

  --ib_sa_key <value>
	SA key
	Default value: 0x1
	Parameter supports update during runtime: no
	This parameter is deprecated

  --ib_am_key <value>
	AMKey: In order to authenticate that vendor specific class MADs originated from a trusted source, all vendor specific MADs must include the AMKey
	Default value: 0x0
	Parameter supports update during runtime: no

  --ib_am_key_protect_bit <value>
	AMkey protection bit. Used only if ib_am_key is provided
	0 - Protection provided, but allows VSMgrs to read the KeyInfo:Key in this mode
	1 - Protection provided and does not allow anyone to read the Vend_Key in the port until the lease period has expired.
	The Vend_Key lease period is a mechanism to allow the Vend_Key to be protected only for a given amount of time
	Default value: 1
	Parameter supports update during runtime: no

  --ib_am_key_protect_lease_period <value>
	AMKey lease period. Used only if ib_am_key is provided
	Timer value used to indicate how long the AM_KeyProtectBit is to remain non zero after a AMMgtSet(AMKeyInfo) MAD that failed a AM_Key check is dropped.
	The value of the timer indicates the number of seconds for the lease period.
	With a 16 bit counter, the period can range from 1 second to approximately 18 hours. 0 shall mean infinite.
	Default value: 300
	Valid range: 0-65535
	Parameter supports update during runtime: no

  --ib_service_key <value>
	Service key for SHARP service record.
	Service key should be used when SHARP service name is associated with
	specific service key in the subnet manager.
	Service key should be specified in IPv6 format.
	Default value: 0::0
	Parameter supports update during runtime: no

  --ib_sharp_sl <value>
	SL for SHARP control path communication (MADs)
	Default value: 0
	Valid range: 0-15
	Parameter supports update during runtime: no

  --support_multicast <value>
	Support return result by multicast
	Default value: TRUE
	Parameter supports update during runtime: no

  --ib_qpc_use_grh <value>
	IB QP Context - Use GRH for AN to AN communication
	Default value: FALSE
	Parameter supports update during runtime: no

  --ib_qpc_pkey <value>
	IB QP Context - Partition Key for SHARP
	Default value: 0xFFFF
	Valid range: 0-65535
	Parameter supports update during runtime: no

  --ib_qpc_sl <value>
	IB QP Context - SL for SHARP LLT data path communication
	Default value: 3
	Valid range: 0-15
	Parameter supports update during runtime: no

  --ib_sat_qpc_sl <value>
	IB QP Context - SL for SHARP SAT (streaming) data path communication
	Default value: 2
	Valid range: 0-15
	Parameter supports update during runtime: no

  --ignore_host_guids_file <value>
	Host GUIDs file to not include in SHARP trees
	Default value: (null)
	Parameter supports update during runtime: no

  --ignore_sm_guids <value>
	Do not include SM GUIDs in SHARP trees parsed from SMDB file.
	Default value: TRUE
	Parameter supports update during runtime: no

  --reservation_mode <value>
	Enable reservations in AM.
	Default value: FALSE
	Parameter supports update during runtime: no

  --load_reservation_files <value>
	Load reservation files on startup.
	Default value: TRUE
	Parameter supports update during runtime: no

  --reservation_force_guid_assignment <value>
	Accept new reservation or update in an existing reservation,
	even if there are guids that are already assigned to a different reservation.
	Default value: FALSE
	Parameter supports update during runtime: yes

  --reservation_stop_jobs_upon_scale_in <value>
	In case a reservation is deleted,
	or some guids are removed from a reservation,
	any sharp jobs associated with the removed guids will be stopped.
	When this parameter is set to false, a request to delete or reduce guids will fail,
	if there are related active jobs
	Default value: TRUE
	Parameter supports update during runtime: yes

  --reservation_max_jobs_per_hca <value>
	The maximum number of allowed jobs that can use the same HCA.
	A value of 0 means no limit
	Default value: 1
	Valid range: 0-511
	Parameter supports update during runtime: no

  --app_resources_default_limit <value>
	Sets the default max number of trees allowed to be used in parallel by a single app.
	A value of -1 means no limit by default.
	A value of 0 means no SHARP resources are provided to any sharp reservation by default.
	This default value can be overridden per app at reservation request.
	Default value: -1
	Valid range: -1-24575
	Parameter supports update during runtime: no

  --rdma_sr_enable <value>
	When set to true, AM will run 2 additional threads:
	1. Thread listening on ibv_device waiting for connection request.
	2. Thread waiting for recv completions to post_send service record.
	Default value: TRUE
	Parameter supports update during runtime: no

  --telemetry_interval <value>
	Time interval between telemetry messages, in seconds.
	A value of 0 means no telemetry.
	Default value: 60
	Valid range: 0, 10-3600
	Parameter supports update during runtime: no

  --telemetry_file_path <value>
	Telemetry's metrics file path
	Default value: 
	Parameter supports update during runtime: no

  --force_app_id_match <value>
	When set to true, an application id must be provided at begin job request
	and it must match the application id provided at reservation request,
	otherwise the job will be denied.
	Default value: FALSE
	Parameter supports update during runtime: no

  --enable_topology_api <value>
	Enable retrieving topology information.
	Default value: FALSE
	Parameter supports update during runtime: no

  --pending_mode_timeout <value>
	A time period during which AM waiting for a job to be completed.
	-1 - no wait
	 0 - wait forever
	 x - pending mode duration in minutes
	Default value: 0
	Parameter supports update during runtime: no

  --job_info_polling_interval <value>
	While the AM is in pending mode state,
	it periodically query relevant sharp daemons to check if the jobs are still active.
	use job_info_polling_interval to set the interval between query in minutes or 0 for single query.
	0 - single msg mode
	x - interval in [minutes] between queries
	Default value: 60
	Valid range: 0-65535
	Parameter supports update during runtime: no

  --device_configuration_file <value>
	Device configuration.
	Default value: (null)
	Parameter supports update during runtime: no

  -h, --help
	Show usage and exit
	Parameter supports update during runtime: no

  -v, --version
	Prints sharp_am version and exit
	Parameter supports update during runtime: no

```
## SHARP Daemon

The SHARP Daemon is local to each node and is expected to persist as long as the network is available.
The SD interacts with the following entities:

 * AM. Job startup/termination.
 * SM. Service record fetching.
 * Other SD. Group creation and destruction.
 * libsharp communication library. Job/Group management.

Only SD#0 interacts with the AM for job management. This interaction is limited to sending the
resource allocation request for a job, receiving the job data, and sending the termination request.
Job data distribution between the SDs participating in the job is out of scope of the SHARP software
and has to be done at the OOB (Out Of Band) level using the push API.
All SDs can interact with the AM to request data from the opensm-ftree-ca-order.dump file,
a file generated by the SM (subnet manager),
but it is assumed that only one SD per job does this, and that the data is requested only once per job.
SD#0 is responsible for resource management at the communicator level. SD#n (n>0) interacts with
SD#0 and requests resources for a group. For each group, a fraction of the available resources
can be allocated.
A user application can control the resource allocation policy using the following environment variables:

* `SHARP_COLL_GROUP_RESOURCE_POLICY` (1 - equal, 2 - take_all by first group, 3 - user input percent)
* `SHARP_COLL_USER_GROUP_QUOTA_PERCENT`
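The three policies can be sketched as follows. The policy numbering matches `SHARP_COLL_GROUP_RESOURCE_POLICY`; the split arithmetic itself is illustrative, not SD's actual bookkeeping:

```python
# Sketch of the group resource policies, applied to a pool of OSTs on SD#0.
def group_share(total_osts, policy, expected_groups=1, user_percent=100):
    if policy == 1:                       # equal split across expected groups
        return total_osts // expected_groups
    if policy == 2:                       # first group takes everything
        return total_osts
    if policy == 3:                       # user-specified percentage
        return total_osts * user_percent // 100
    raise ValueError("unknown policy")

assert group_share(16, 1, expected_groups=4) == 4
assert group_share(16, 2) == 16
assert group_share(16, 3, user_percent=25) == 4
```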

The SD connects a local Computation Process (CP) to an Aggregation Tree. The connection is based on an RC QP connected to the nearest
AN. AM is responsible for assigning an AN to each compute port. The connection can be reused for
multiple collective operations. Each group must be joined to the Aggregation Tree before sending
collective operations. If multiple processes on the same node participate in the group, HCOLL
can group these processes based on socket locality and use multiple processes for sending collective
operations to the network. Inside each sub-group, shared memory is used for the collective. A group channel
process is a process selected to participate in the SHARP group. An application can request a number of
group channels from the AM. Multiple group channels affect the tree radix and, as a result, the buffer allocation in the AN.
If an AN cannot allocate the requested number of group channels, the computation job fails. See [Multi-channel group].
Communication between a Computation Process (libsharp) and the SD is based on UNIX domain sockets.

A detailed description of the flow between SD and CP can be found in [sharp_ctl.h](src/api/sharp_ctl.h).
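The transport itself can be demonstrated with a generic UNIX-domain-socket exchange. This sketch only shows the mechanism; the real CP/SD message layout is defined in sharp_ctl.h and is not reproduced here:

```python
import socket

# Generic UNIX-domain-socket exchange; socketpair() stands in for the
# socket the SD listens on. Message contents here are arbitrary bytes.
cp_side, sd_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

cp_side.sendall(b"request")            # CP side sends a request to the SD
request = sd_side.recv(64)
sd_side.sendall(b"reply:" + request)   # SD side answers on the same socket
reply = cp_side.recv(64)

cp_side.close()
sd_side.close()
```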

SD discovers the AM address by fetching a Service Record from the SM.

SD has limited support for resiliency features:

* If the AM connection is broken, the SD tries to reconnect to the AM.
* SD#0 monitors CP#0 (Computation Process id 0). If the process dies, SD#0 issues a job termination request to the AM.
  The monitoring is based on socket hangup status and doesn't require CPU cycles.
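Hangup-based monitoring of this kind can be sketched with `poll()`: the watcher sleeps until the peer end closes, so no CPU cycles are spent while the process is alive. This is a generic illustration, not SD's implementation:

```python
import select
import socket

# Watch one end of a connection and wake only when something happens.
sd_side, cp_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

poller = select.poll()
poller.register(sd_side, select.POLLIN | select.POLLHUP)

cp_side.close()                    # simulate the monitored process dying
events = poller.poll(1000)         # returns promptly on peer hangup
# Peer close makes the socket readable at EOF; a zero-byte read confirms it.
hung_up = bool(events) and sd_side.recv(1) == b""
sd_side.close()
```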

For any job, HCOLL issues two end-job requests: through SD#0 and through the last SD. The redundant
job termination request covers an SD#0 crash.


### Inter-component messaging

The SMX messaging library is responsible for communication between the SHARP software components.
There are three communication protocols:

* AM <-> SD#0. This protocol is used at the job level. It includes the following messages:

   * SHARP_MSG_TYPE_BEGIN_JOB
   * SHARP_MSG_TYPE_END_JOB
   * SHARP_MSG_TYPE_JOB_DATA

SD#0 initiates the connection to the AM. The SD discovers the AM's address using the service record. No special
configuration is needed in a production environment. For debug purposes, the SMX_AM_SERVER environment variable can be used.

* AM <-> SD. This protocol is used at the job level. It includes the following messages:

   * SHARP_MSG_TYPE_REQUEST_SM_DATA
   * SHARP_MSG_TYPE_GET_SM_DATA_BUF_LEN
   * SHARP_MSG_TYPE_GET_SM_DATA

The SD initiates this protocol; it is assumed that it is used by only one SD per job,
and only once per job per data type.
The SD discovers the AM's address using the service record. No special configuration is needed in a production environment.
For debug purposes, the SMX_AM_SERVER environment variable can be used.

* SD <-> SD#0. This protocol is used at the communicator level and includes the following messages:

   * SHARP_MSG_TYPE_ALLOC_GROUP
   * SHARP_MSG_TYPE_GROUP_DATA
   * SHARP_MSG_TYPE_GET_JOB_DATA
   * SHARP_MSG_TYPE_RELEASE_GROUP

SD#n (n>0) learns the SD#0 address from the job information distributed among the SDs.

SMX wraps the following underlying communication mechanisms:

*   TCP socket. This is the main communication mechanism, used in production environments. A user
    has to configure at least one network interface.
*   Files. This mode serves debug and verification purposes.
*   UCX. This mode allows in-band message communication and uses the
    [UCX - Unified Communication X library](https://github.com/openucx/ucx).
    This is an experimental mode and cannot be used in a production environment.

Relevant parameters (AM, SD):

* `smx_sock_interface`
* `smx_sock_port`
* `smx_sock_backlog`
* `smx_sock_addr_family`

### MAD communication

AM uses [ibis](https://github.com/Mellanox/ibis_tools) for high-performance, parallel MAD processing.
SD is a libibumad-based application.

### APIs

SHARP includes APIs:

* [libsharp_coll](src/api/sharp.h). This is the high-level public API
  available for third-party integration.
* [libsharp](src/api/sharp_ctl.h). This is the low-level private API.
* sharp_job_quota. This is the Job Scheduler API.

libsharp is an interface library used for communication with the local SD. A UNIX domain socket is used for the communication.

Versions
-------------------------------------------------------------------------------

|SHARP version      |MOFED version         |SwitchIB-2 FW|HPCX version|UFM version|SMX protocol|
|-------------------|----------------------|-------------|------------|-----------|------------|
|v1.0               |MLNX OFED 3.3-x.x.x   |15.1100.0072 |1.6.392     |    -      |    new     |
|v1.1               |MLNX OFED 3.4-0.1.2.0 |15.1200.0102 |1.7.405     |    -      |  changed   |
|v1.2               |MLNX OFED 4.0-x.x.x   |15.1200.0102 |1.8.xxx     |  5.8-5.9  |  changed   |
|v1.3               |MLNX OFED 4.1-1.0.2.0 |15.1460.0162 |1.9.5       |           | unchanged  |
|v1.4               |MLNX OFED 4.2-1.2.0.0 |15.1500.0106 |2.0         |  5.9.5-4  | unchanged  |
|v1.5               |MLNX_OFED 4.3-1.0.1.0 |15.1600.0182 |2.1         |  5.10-x   |  changed   |
|v1.6               |MLNX OFED 4.3-3.x     |15.1630.0216 |      -     |  5.11     | unchanged  |
|v1.7               |MLNX OFED 4.4-x.x.x   |15.1630.0216 |2.2         |  6.0      |changed (SD<->SD) only|
|v2.0               |MLNX OFED 4.7-x.x.x   |15.2000.2626 |2.3         |  6.3      |changed, new SMX serialization|
|v2.1               |MLNX OFED 4.8-x.x.x   |15.2000.2626 |2.6         |  6.4      |changed|

Prerequisites
-------------------------------------------------------------------------------

 * SwitchIB-2 or Quantum based fabric.
 * SwitchIB-2 or Quantum FW (see the table above).
 * MLNX OS 3.6.1002 (for managed switches).
 * MOFED (see HCOLL prerequisites).
 * HPCX bundle (see the table above).
 * MLNX OpenSM 4.7.0 or later (available with MLNX OFED 3.3-x.x.x or UFM 5.6).
 * MLNX OpenSM 4.9 or later to eliminate manual configuration of root GUIDs.
 * ConnectX HCA. ConnectX-6 is required for SAT aggregation (blocks larger than 64K).
 * Kernel >= 2.6.22.
 * SHARP is compiled on the following operating systems:

|Distro      |Platform |Kernel          |
|------------|---------|----------------|
|RHEL 6.1    |x86-64   |2.6.32-131.0.15 |
|RHEL 6.2    |x86-64   |2.6.32-220      |
|RHEL 6.3    |x86-64   |2.6.32-279      |
|RHEL 6.4    |x86-64   |2.6.32-358      |
|RHEL 6.5    |x86-64   |2.6.32-431      |
|RHEL 7.0    |x86-64   |3.10.0-123      |
|RHEL 7.2    |x86-64   |3.10.0-327      |
|RHEL 7.2    |ppc64le  |3.10.0-327      |
|RHEL 7.3    |aarch64  |4.5.0-15.el7    |
|RHEL 7.3    |x86-64   |3.10.0-514      |
|RHEL 7.4    |x86-64   |3.10.0-693      |
|RHEL 7.4    |aarch64  |4.11.0-44       |
|RHEL 7.5    |x86-64   |3.10.0-862      |
|Fedora14    |x86-64   |2.6.35.6-45     |
|Fedora16    |x86-64   |3.1.0-7         |
|Fedora17    |x86-64   |3.3.4-5         |
|Fedora18    |x86-64   |3.6.10-4        |
|Fedora26    |x86-64   |4.11.8-300      |
|Fedora28    |x86-64   |4.13.9-300      |
|SLES 11 SP3 |x86-64   |3.0.76-0.11     |
|SLES 11 SP4 |x86-64   |3.0.101-57      |
|SLES 12 SP1 |x86-64   |3.12.49-11      |
|SLES 12 SP2 |x86-64   |4.4.21-68       |
|SLES 12 SP3 |x86-64   |4.4.73-5        |
|SLES 15 SP0 |x86-64   |4.12.14-23      |
|SLES 18 SP0 |x86-64   |4.18.0-10       |
|Ubuntu16.10 |x86-64   |4.8.0-26        |
|Ubuntu17.10 |x86-64   |4.13.0-17       |
|Ubuntu14.04 |ppc64le  |3.13.0-32       |
|Ubuntu18.04 |x86-64   |4.15.0-20       |
|Centos6.3   |x86-64   |2.6.32-279      |
|Centos6.0   |x86-64   |2.6.32-71       |


 * SHARP is tested on:
     * Intel architecture: RHEL 7.4 (3.10.0-693).
     * PPC architecture (little-endian): Ubuntu 14.04 (3.13.0-32), Power8.

System configuration
-------------------------------------------------------------------------------

* Each compute node needs to run a local SHARP daemon.
* Only one instance of AM is allowed.
* AM and SM have to run on the same server.

```
  +--------------------------------------+    +---------------------------------+
  |  Compute host                        |    | Dedicated server                |
  |                                      |    |                                 |
  |  +---------------+  +-------------+  |    |  +------------+  +-----------+  |
  |  | libsharp      +-->  SD         |  |    |  |  AM        |  |  SM       |  |
  |  +---------------+  |             |  |    |  |            |  |           |  |
  |  | libsharp_coll |  |             |  |    |  |            |  |           |  |
  |  +---------------+  |             |  |    |  |            |  |           |  |
  |  | hcoll         |  |             |  |    |  |            |  |           |  |
  |  +---------------+  |             |  |    |  |            |  |           |  |
  |  | Computation   |  +-------------+  |    |  +------------+  |           |  |
  |  | Process (CP)  |  |    SMX      +----------->SMX        |  |           |  |
  |  +---------------+  +-------------+  |TCP |  +------------+  +-----------+  |
  |                                      |    |                                 |
  +--------------------------------------+    +---------------------------------+
```


Installation - Linux
-------------------------------------------------------------------------------

To build SHARP from source, the following tools are needed:
 * autoconf
 * automake
 * libtool
 * pkg-config

If you get the SHARP sources from GitHub, you first have to generate the
configure script:

```shell
% ./autogen.sh
```

To build and install SHARP, run the following:

```shell
% module load mofed/hpcx
% ./configure --with-mpi=$OMPI_HOME --prefix=$PWD/install
% make install
% module unload mofed/hpcx
```

To build SHARP with ROCm support:

Install ROCm following the installation guide here: https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html. ROCm packages are installed to /opt/rocm by default.

```shell
./configure --with-cuda=/usr/local/cuda --with-rocm=/opt/rocm
```

To compile debugging code, configure the project with the `--enable-debug` option:
```shell
% ./configure --with-mpi=$OMPI_HOME --prefix=$PWD/install --enable-debug
```

RPM installation:
```js
# rpm -ivh <sharp.rpm>
```

DEB installation:
```js
# dpkg -i <sharp.deb>
```

After installation, the following daemons will be set up:

- *sharp_am* - will be disabled in all rc levels

For manual daemon installation/removal (requires root permission):
```
$prefix/bin/sharp_daemons_setup.sh or $top_source_dir/contrib/sharp_daemons_setup.sh
```

How to use the script:

```
Usage: sharp_daemons_setup.sh <-s SHARP location dir> <-r> <-a> <-d daemon> <-b>
    -s - Setup SHARP daemons
    -r - Remove SHARP daemons
    -a - All daemons (sharp_am)[default]
    -d - Daemon name (sharp_am)
    -b - Enable socket based activation of the service
```
The -b option is only available on systems that support socket-based activation.
Socket-based activation is supported on RH 7.2 and above and requires systemd.

https://github.com/Mellanox/sharp/wiki/Socket-Based-Activation-Installation-Procedure-for-Sharpd

Examples of daemon configuration:

```js
# $prefix/sharp_daemons_setup.sh -s -d sharp_am  # Setup sharp_am daemon
# $prefix/sharp_daemons_setup.sh -r             # Remove the sharp_am daemon
# $prefix/sharp_daemons_setup.sh -s -d sharp_am -b # Setup sharp_am as a socket based activated daemon
```

After the setup procedure, the daemon startup scripts are:

```
/etc/init.d/sharp_am
```
The daemon config files are:
```
/etc/sysconfig/sharp_am
```

`$SHARP_OPTIONS`, defined in the sysconfig file, will be passed to the appropriate daemon as parameters.

SHARP daemons can be run from a nonstandard location by setting the `SHARP_STARTUP_SCRIPT` environment variable.

Example:
```js
# SHARP_STARTUP_SCRIPT=/my/script/location/sharp_am /etc/init.d/sharp_am start
```

Also, `SHARP_DEVEL` may be set to force the daemon script to use the `$prefix` dir for `lockfile` and `pidfile` instead of the default locations (`/var/lock/subsys` and `/var/run`, respectively).


**Unit Test**

```
% make unittest
% make gtest
```

**Run Valgrind Test**

```
% make valgrind
```

**Run MPI Test**

```
% make runtest
```

Using libsharp
-------------------------------------------------------------------------------

To compile a package using libsharp, you need to provide
CFLAGS and LDFLAGS. libsharp is integrated with pkg-config,
which you can use to get the compilation flags:

```
PKG_CONFIG_PATH=<sharp destination folder> pkg-config --cflags sharp # prints CFLAGS
PKG_CONFIG_PATH=<sharp destination folder> pkg-config --libs sharp # prints LDFLAGS
```
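
For example, a compile line can consume those flags directly. The snippet below is a self-contained sketch: because a SHARP installation may not be present, it generates a minimal stand-in `sharp.pc` (the `/opt/sharp` prefix and field values are placeholders) and then queries it the same way you would query the real file in the SHARP destination folder.

```shell
# Generate a minimal stand-in sharp.pc so the example is self-contained;
# with a real install, point PKG_CONFIG_PATH at the SHARP destination folder.
mkdir -p /tmp/sharp-pc
cat > /tmp/sharp-pc/sharp.pc <<'EOF'
prefix=/opt/sharp

Name: sharp
Description: SHARP communication library (stand-in for illustration)
Version: 2.1.0
Cflags: -I${prefix}/include
Libs: -L${prefix}/lib -lsharp
EOF
# Query compile and link flags:
PKG_CONFIG_PATH=/tmp/sharp-pc pkg-config --cflags sharp   # prints -I/opt/sharp/include
PKG_CONFIG_PATH=/tmp/sharp-pc pkg-config --libs sharp     # prints -L/opt/sharp/lib -lsharp
# A compile line would then look like ('myprog.c' is a placeholder):
#   cc $(pkg-config --cflags sharp) myprog.c $(pkg-config --libs sharp)
```

With a real installation, skip the stand-in generation and point `PKG_CONFIG_PATH` at the pkg-config directory under the SHARP install prefix (the exact subdirectory depends on the install layout).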

Using sharp_job_quota
-------------------------------------------------------------------------------

The sharp_job_quota executable should be run as the same user as SharpD (root).
It is used by the job scheduler to set, by uid, the user allowed to run
SHARP jobs, as well as to limit the amount of resources that the user is
allowed to request for the job, if such a limitation is necessary.

When running sharp_job_quota, there are two required arguments
(they can be passed on the command line, in a configuration file, or through the environment).
These arguments are:

* operation
* allocation ID

In addition, if the "set" operation is chosen, either the uid argument or the user_name argument must be provided.

For more information, see the full list of options below.

sharp_job_quota supports an options file.

Here are 2 simple usage examples:

```
sharp_job_quota --operation set --user_name jobrunner --allocation_id 2017 --coll_job_quota_max_groups 10
sharp_job_quota --operation remove --allocation_id 2017
```

```
Description: sharp_job_quota - Set or remove sharp job quota

Usage: sharp_job_quota [OPTION]
Examples:
sharp_job_quota --operation set --user_name jobrunner --allocation_id 2017 --coll_job_quota_max_groups 10
sharp_job_quota --operation remove --allocation_id 2017

OPTIONS:
  -O, --config_file <value>
	Configuration file.
	If specified with a '-' prefix, configuration file read errors are ignored
	and the default configuration file is used. If '-' is not specified and
	the configuration file cannot be read, the program exits.
	default value: /etc/sharp/sharp_job_quota.cfg

  -c, --create_config <value>
	sharp_job_quota will dump its configuration to the specified file and exit
	default value: (null)

  -t, --operation <value>
	The 2 valid values are:
	"set" - set quota. Either uid or user_name must be provided for this operation
	"remove" - remove quota

  -i, --allocation_id <value>
	The Job scheduler's ID for the job. No other job in the system can have the same ID

  -u, --uid <value>
	UID of the user that will be allowed to run jobs

  -n, --user_name <value>
	Name of the user that will be allowed to run jobs

  -g, --coll_job_quota_max_groups <value>
	Maximum number of groups (comms) allowed.
	The default value means that all quota requests in range will be accepted
	default value: 0

  -q, --coll_job_quota_max_qps_per_port <value>
	Maximum QPs/port allowed.
	The default value means that all quota requests in range will be accepted
	default value: 0

  -p, --coll_job_quota_max_payload_per_ost <value>
	Maximum payload per OST allowed.
	The default value means all quota requests in range will be accepted
	default value: 256

  -o, --coll_job_quota_max_osts <value>
	Indicates the maximum number of OSTs allowed for job per collective operation.
	The default value means that all quota requests in range will be accepted
	default value: 0

  --coll_job_quota_max_num_trees <value>
	Indicates the maximum number of trees allowed for job.
	The default value means that any number of trees in range will be accepted
	default value: 0

  --job_priority <value>
	Priority of the job.
	default value: 0

  --coll_job_quota_percentage <value>
	Sets upper limit of SHARP resources available for the job.
	Default value 0 means, use sharp_am default according to priority.
	default value: 0

  -h, --help
	Show usage and exit

  -v, --version
	Print version and exit

```

Logging
-------------------------------------------------------------------------------

The following logs are useful for SHARP troubleshooting:

* AM log. The default location is /var/log/sharp_am.log.
* The following parameters control log creation in AM and SD:

  * `log_file`
  * `log_verbosity`. Possible values: 1 - Errors; 2 - Warnings; 3 - Info;
    4 - Debug; 5 - Trace.
  * `verbose`
  * `log_max_backup_files`
  * `log_file_max_size`

* SHARP_COLL logging.

   * `SHARP_COLL_LOG_LEVEL` - Messages with a level higher or equal to the selected will be printed.
     Possible values are: 0 - fatal, 1 - error, 2 - warn, 3 - info, 4 - debug, 5 - trace.

* SM log.

* SMX doesn't have its own logging system. It reports messages into the application log (AM or SD).

* sharp_job_quota does not currently have logging.
  It prints messages and errors to stdout and stderr, respectively.

* ibis has its own log. The log location and size are configured using the following parameters:

  * `ibis_log_file`
  * `ibis_log_size`
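
As an illustration of the `SHARP_COLL_LOG_LEVEL` knob above, the variable can be set in the launch environment for a single run. This is a sketch: `my_mpi_app` is a placeholder, and the `-x` export flag shown in the comment is Open MPI syntax.

```shell
# Select SHARP_COLL verbosity for one run: 0 fatal .. 5 trace (see list above).
SHARP_COLL_LOG_LEVEL=4           # 4 = debug
export SHARP_COLL_LOG_LEVEL
echo "SHARP_COLL_LOG_LEVEL=$SHARP_COLL_LOG_LEVEL"
# Illustrative launch line ('my_mpi_app' is a placeholder):
#   mpirun -np 2 -x SHARP_COLL_LOG_LEVEL ./my_mpi_app
```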

Switch IB2 capabilities
-------------------------------------------------------------------------------

* Maximum node radix: 64.
* Maximum number of trees: 64. Tree number 63 is reserved.
* Maximum number of groups: 128.
* Data buffer size: 192K.
* Maximum number of QPs: 2K.
* Maximum operation size: 256 Bytes.
* Minimal MTU: 512 Bytes.
* Outstanding operations: up to 384. The actual number of OSTs is limited by the buffer size.


Limitations
-------------------------------------------------------------------------------

### v1.7.0

* Mellanox OS is not supported. SM has to run on dedicated server.
* AM has limited support for topology updates. Switch reboot and link bouncing are supported. Fabric extension for non-root switches is supported.
* Only one job allowed per host.
* AM in HPCX/MOFED doesn't support handover/failover. UFM supports AM handover.
* Each new instance of AM cleans SHARP resources in all discovered ANs.
* Only fat-tree, quasi fat-tree, hypercube and dragonfly+ topologies are supported.
* Only homogeneous fabrics are supported (all switches must be SHARP compatible).
* A non-homogeneous fabric (where not all switches are SHARP compatible) needs manual configuration of
  trees and connections between hosts and ANs. See [TreesConfigurationFile.md](doc/TreesConfigurationFile.md).
* Limited support for partial software update.  SD and AM can communicate if they use the same SMX protocol version.
* Only AM Key 0 has been tested.
* SD v1.7 is not compatible with previous SD versions. (SD v1.7 is compatible with AM v1.67.)

Known issues
-------------------------------------------------------------------------------

### v1.7

* In some cases, HCOLL asks for surplus group channels. The number of requested group channels
  is a function of the maximum socket id. Even if a socket with a lower id is not used, HCOLL
  still asks for SHARP resources for it.
* FW 15.1430.0160 is unusable with SHARP. The FW should be upgraded to the latest GA (15.1630.0210).
* FW supports only one trim request at a time. This means group trimming in the data path can fail
  if two groups run trimming through the same AN.

Releases
------------------------------------------------------------------------------

### Naming convention

 __XX.YY.ZZ-PRERELEASE__

*  __XX__: Major version
*  __YY__: Minor version
*  __ZZ__: Fix for released version
*  __PRERELEASE__: This is an optional tag that indicates a pre-release. A pre-release is not yet stable enough for production use.
                 It is an essential milestone before a release.

### How to check SHARP version

* Open `<SHARP FOLDER>/share/doc/sharp/SHARP_VERSION`

	```
	PACKAGE VERSION: 1.3.0
	PRERELEASE: 0dev
	SOURCE REVISION: c51b664
	IBIS SOURCE REVISION: ab5c9f9
	BUILD DATE: Jan/30/2017 11:40:30
	```

* Run `<SHARP application> --version`

	```
	sharpd (sharp) 1.3.0-0dev
	Copyright (C) 2016 Mellanox Technologies, Inc.
	License: See LICENSE file
	There is NO WARRANTY, to the extent permitted by law.

	Build Date: Jan 30 2017
	Last commit: c51b664
	```

* Search for "Version:" in log file

```
[Jan 30 15:48:26 338167][SD][14193][output] - Package: sharp-0dev
[Jan 30 15:48:26 338275][SD][14193][output] - Version: 1.3.0
[Jan 30 15:48:26 338290][SD][14193][output] - Build Date: Jan 30 2017
[Jan 30 15:48:26 338303][SD][14193][output] - Last commit: c51b664
```

### Pre-release tags

Only the following names can be used in pre-release tags.

* __0dev__: The project is not in a stable state. The API may change without notice.
* __alpha__: The project is not tested and may contain many unknown bugs. Not suitable for
  production sites.
* __beta__: API should be considered frozen. Not generally suitable for production sites,
  but may be used on some production sites if the site administrator knows the project
  well, and knows how to handle any outstanding issues.
* __rc__: All features in the version are implemented. All reported, critical issues are
  fixed. From developer perspective, the project is ready for production.

### Branches

Development is done in "master" branch. Once a version is released a new maintenance branch is created.

### Examples

|SHARP version      |Git tag         | Git branch | Description                               |
|-------------------|----------------|------------|-------------------------------------------|
|v1.0               |v1.0.0          | v1.0       | First official release of the SHARP software |
|v1.1.1             |v1.1.1          | v1.1       | Update for version v1.1                   |
|v1.2.0-alpha1      |v1.2.0-alpha1   | master     | Alpha version for v1.2                    |
|v1.2.0-beta        |v1.2.0-beta     | master     | Beta version for v1.2                     |
|v1.2.0-rc2         |v1.2.0-rc2      | master     | Release candidate  for v1.2               |
|v1.2.0             |v1.2.0          | v1.2       | Release of v1.2                           |
|v1.3.0-0dev        |v1.3.0-0dev     | master     | v1.3 under development                    |

Contributing to the project
-------------------------------------------------------------------------------

See [CONTRIBUTING.md](.github/CONTRIBUTING.md)

References
------------------------------------------------------------------------------
[Multi-channel group]: https://github.com/Mellanox/sharp/wiki/Multi-channel-group
