SHARP - Scalable Hierarchical Aggregation Protocol (2.1.0)
-------------------------------------------------------------------------------

Copyright (c) 2016-2019 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

License
-------------------------------------------------------------------------------

See LICENSE file.

Overview
-------------------------------------------------------------------------------
This document addresses system-level management of the Scalable Hierarchical
Aggregation Protocol (SHARP) resources. It covers the system-wide resource
manager (Aggregation Manager, AM); the SHARP Daemon (SD), which is local to each
compute node and provides access to switch-based collective communication
capabilities; the user-level communication libraries libsharp and libsharp_coll;
and sharp_job_quota, the Job Scheduler API.


### Terminology

* __AN (Aggregation Node)__:  ASIC hardware and local firmware implemented in
  Switch-IB 2.
* __Tree (Aggregation Tree)__: A SHARP tree represents a reduction tree.
  The tree is composed of leaves representing data sources and internal nodes representing
  aggregation nodes, with the edges entering a junction representing the association of the children with the parent node.
* __Job__: SHARP resources are allocated per job.
* __CP__: Computation Process. An OOB (Out Of Band) process, e.g. an MPI process. CP#__n__ denotes the CP whose process ID is n.
* __Group__: A SHARP group is an aggregation collective group that describes the vertices
  (leaves and aggregation nodes) associated with a given concrete reduction operation.
  For example, the leaves of a collective group may be mapped to an MPI communicator,
  with the rest of the elements mapped to switches. Specific reduction operations have their data sources
  on a subset of the system nodes. The subset of leaf nodes and the aggregation nodes that form the reduction tree
  is called the aggregation group, and corresponds to a subtree of the SHARP tree.
* __AM (Aggregation Manager)__: A system-wide entity responsible for SHARP resource management.
* __SD (SHARP Daemon)__: A daemon local to each compute node, responsible for connection
  establishment. SD#__n__ denotes the SD whose ID in the job is n.
  The SD that created the job has special responsibilities, including communication
  with the AM and resource management at the job level. In the current implementation, Computation Process 0 (CP#0), e.g. MPI rank 0,
  initiates job creation, so SD#0 is the special SD.
* __libsharp API__: A library (shared object) used to instruct the SD to perform actions.
* __libsharp_coll API__: A high-level API exposing a collective abstraction over SHARP.
* __SMX__: The communication library used for SD-to-AM and SD-to-SD messaging.
* __OST__: Outstanding Operation.
* __Group channel__: A client process (Computation Process) in the node selected for
  sending collective operations to the assigned AN.
* __Radix__: The number of children of an Aggregation Node.
  Switch-IB 2 limits this number to 64.
* __Child index__: The index of a group member in the list of the node's children.
* __"Job Scheduler" ("JS")__: A system for managing resources in an HPC cluster, for example SLURM or IBM Platform LSF.
* __Managed environment ("managed mode")__: An environment in which a job scheduler is running
  and, correspondingly, the SD runs in a mode in which it expects notifications from the JS.

### Aggregation Manager

The Aggregation Manager (AM) is a system management component used for system-level
configuration and management of the switch-based reduction capabilities.
It is used to set up the SHARP trees and manage the use of these entities.

AM is responsible for:

* SHARP resource discovery.
* Creating topology aware SHARP trees.
* Configuring SHARP switch capabilities.
* Managing SHARP resources.
* Assigning SHARP resource on request.
* Freeing SHARP resources on job termination.

AM is configured by a topology file created by the Subnet Manager (SM).
The file includes information about switches and HCAs.

Relevant parameters (AM):

* `fabric_lst_file` OpenSM v4.7.x-4.8.x
* `fabric_smdb_file` OpenSM v4.9 or later

Following the topology, AM discovers SHARP capabilities using MADs. During
discovery, AM cleans SHARP resources allocated in the ANs.

Relevant parameters (AM):

* `clean_an_on_discovery`

Based on the topology, AM creates Aggregation Trees. An Aggregation Tree is
a logical tree that defines the flow of collective operations. The communication capabilities
(QPs) between tree nodes are created during system initialization.

A user can configure pre-defined trees in AM. In the user-defined trees file,
the ANs are identified by the node names, as in the topology file created by the SM.
The file format is as follows:

```
tree <tree-id>
node {node description} [GUID:<port_guid_num>]
subNode {node description} [GUID:<port_guid_num>]
subNode {node description} [GUID:<port_guid_num>]
...
node {node description} [GUID:<port_guid_num>]
subNode {node description} [GUID:<port_guid_num>]
...
node {node description} [GUID:<port_guid_num>]
computePort {node description} [GUID:<port_guid_num>]
computePort {node description} [GUID:<port_guid_num>]
...
tree <tree-id>
node {node description} [GUID:<port_guid_num>]
```
See also [Trees Configuration Reference](doc/TreesConfigurationFile.md).
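A minimal hypothetical example in this format (the switch names and GUIDs below are invented for illustration; see the reference above for the authoritative syntax):

```
tree 0
node SW-spine-1 GUID:0x0002c90000000010
subNode SW-leaf-1 GUID:0x0002c90000000020
subNode SW-leaf-2 GUID:0x0002c90000000030
node SW-leaf-1 GUID:0x0002c90000000020
computePort host01 GUID:0x0002c90000000101
computePort host02 GUID:0x0002c90000000102
```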

Relevant parameters (AM):

* `trees_file`

AM computes Aggregation Trees automatically for quasi fat tree
topologies (a user-defined root GUIDs file is required for OpenSM v4.7-4.8).

Relevant parameters (AM):

* `root_guids_file`
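As a sketch, a root GUIDs file lists one switch port GUID per line; the format assumed here mirrors OpenSM's root_guid_file, and the GUIDs below are invented:

```
0x0002c90000000010
0x0002c90000000011
```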

For a new job launch, AM allocates SHARP resources. The resource allocation
includes two main steps:

* __Tree matching.__ AM selects an available tree with a non-broken subtree that spans
  all job hosts. For each host, AM assigns the AN with which the host may form a connection.
* __Resource allocation.__ AM sets resources for each AN that serves the job. This includes
  buffers, OSTs, and the maximum number of groups and QPs available for children connections.

Relevant parameters (AM):

* `max_tree_radix`
* `max_quota`
* `default_quota`

A user application may request a specific amount of SHARP resources. An application can operate with
OSTs, user data per OST, and number of groups. If any of these resources is 0, AM uses the default value from its
configuration file. OSTs, user data per OST, and max radix are translated into the size of the buffer that AM allocates for the job.
AM can return fewer resources than requested and may even decline the resource allocation request. If no
resources are available for the job, HCOLL falls back to a non-SHARP implementation.

Relevant parameters (HCOLL, SHARP_COLL):

* `HCOLL_ENABLE_SHARP`
* `SHARP_COLL_JOB_QUOTA_OSTS`
* `SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST`
* `SHARP_COLL_JOB_QUOTA_MAX_GROUPS`
* `SHARP_COLL_JOB_QUOTA_MAX_QPS_PER_PORT`
* `SHARP_COLL_OSTS_PER_GROUP`
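As a sketch, a job can request a specific quota by exporting these variables before launch. The values and the commented launcher line below are illustrative assumptions, not recommendations:

```shell
# Request a specific SHARP quota for a job (illustrative values; 0 = AM default).
export HCOLL_ENABLE_SHARP=1                      # enable SHARP support in HCOLL
export SHARP_COLL_JOB_QUOTA_OSTS=64              # OSTs per tree
export SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=256  # user data per OST
export SHARP_COLL_JOB_QUOTA_MAX_GROUPS=16        # groups per tree
# Hypothetical Open MPI launch forwarding the variables to all ranks:
# mpirun -x HCOLL_ENABLE_SHARP -x SHARP_COLL_JOB_QUOTA_OSTS \
#        -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST \
#        -x SHARP_COLL_JOB_QUOTA_MAX_GROUPS ./app
```

If AM cannot satisfy the request, it may grant less or decline, as described above.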

AM can read configuration parameters from the command line, environment variables, or a configuration file.

Here is a simple usage example:

```
sharp_am -B --fabric_lst_file subnet.lst
```

AM supports the following configuration parameters:

```
Aggregation Manager 2.1.0
-------------------------
Usage: sharp_am [OPTION]
Examples:
sharp_am -B --fabric_lst_file subnet.lst

OPTIONS:
  -O, --config_file <value>
	Configuration file.
	If specified with '-' prefix, ignore configuration file read errors
	and use the default configuration file. Exit if '-' is not specified
	and reading the configuration file fails.
	default value: /etc/sharp/sharp_am.cfg

  -l, --log_file <value>
	Log file
	default value: /var/log/sharp_am.log

  --log_verbosity <value>
	Log verbosity level:
	1 - Errors
	2 - Warnings
	3 - Info
	4 - Debug
	5 - Mad
	6 - Trace
	default value: 2
	can be updated in run-time through the configuration file

  --syslog_verbosity <value>
	Syslog verbosity level:
	1 - Errors
	2 - Warnings
	default value: 1
	can be updated in run-time through the configuration file

  -V, --verbose
	Run with full verbosity

  --log_max_backup_files <value>
	Number of backup log files. Used for log rotation
	default value: 9

  --log_file_max_size <value>
	Maximum size of a log file, in MBs
	If the value is 0, log rotation isn't used
	default value: 64

  --accumulate_log <value>
	Accumulate log file over multiple sessions.
	If set to FALSE and log rotation is disabled, log file is
	truncated on startup.
	default value: TRUE

  -B, --daemon
	Run in daemon mode - sharp_am will run in the background

  -p, --pid_file <value>
	PID file. Makes sharp_am write its PID to the specified file when running as daemon
	default value: /var/run/sharp_am.pid

  -c, --create_config <value>
	sharp_am will dump its configuration to the specified file and exit
	default value: (null)

  --ftree_ca_order_file <value>
	Path of ftree CA order file generated by OpenSM.
	 Its contents can be used when implementing all-to-all communication
	default value: /var/log/opensm/opensm-ftree-ca-order.dump
	can be updated in run-time through the configuration file

  -t, --trees_file <value>
	SHArP trees file
	If NULL, calculate trees automatically
	default value: (null)

  --max_tree_radix <value>
	The maximum radix used in the system.
	The value should be a multiple of four.
	default value: 64

  --span_all_agg_nodes <value>
	Generate trees that span all possible aggregation nodes.
	Relevant only if topology_type is tree.
	default value: TRUE

  --control_path_version <value>
	The control path version (IB: AM class version) to be set on all fabric aggregation nodes.
	If set to 0 the value will be the minimal supported version discovered on sharp_am startup.
	Aggregation nodes that do not support the selected or minimal discovered control path version will be excluded from aggregation trees.
	1 - SHARPv1 (SwitchIB2)
	2 - SHARPv2 (Quantum)
	default value: 0

  --fabric_lst_file <value>
	Fabric LST file
	default value: /var/log/opensm-subnet.lst
	this parameter is deprecated

  --fabric_smdb_file <value>
	Fabric SMDB file
	default value: /var/log/opensm-smdb.dump

  --fabric_virt_file <value>
	Fabric virtualization file
	default value: /var/log/opensm-virtualization.dump

  --lst_file_timeout <value>
	Length of timeout [in seconds] between attempts to load the LST file
	default value: 3
	can be updated in run-time through the configuration file

  --lst_file_retries <value>
	Max number of retry attempts when loading the LST file, and encountering "No such file " errors
	default value: 0
	can be updated in run-time through the configuration file

  --topology_type <value>
	Topology type
	The following topology types are supported:
	tree, hypercube, dfp, auto
	auto - set topology type according to routing engine in smdb file
	default value: auto

  --hyper_cube_coordinates_file <value>
	Hyper Cube coordinates file
	Required when running on Hyper Cube topologies
	default value: /var/log/opensm-dor-coordinates
	this parameter is deprecated

  --root_guids_file <value>
	Root guids file
	default value: ./root_guid.cfg
	this parameter is deprecated

  --enable_sat <value>
	Enable Streaming Aggregation Trees (SAT) creation and usage.
	default value: TRUE

  --persistent_dir <value>
	Path to persistent data directory
	default value: /var/lib/sharp

  --dump_dir <value>
	Path to dump files directory
	default value: .

  --generate_dump_files <value>
	Dump internal state to files for debug and diagnostics
	default value: FALSE
	can be updated in run-time through the configuration file

  --max_quota <value>
	Maximum quota that can be requested by a single job
	It is guaranteed that no job will receive more than the max quota
	Format: "(Trees-per-job, OSTs-per-tree, User-data-per-ost, Groups-per-tree, QPs-per-port-per-tree)"
	default value: (4, 500, 256, 500, 180)
	this parameter is deprecated

  --default_quota <value>
	Default quota to be requested for a single job
	The quota that will be requested for a job if no quota was requested explicitly
	Format: "(Trees-per-job, OSTs-per-tree, User-data-per-ost, Groups-per-tree, QPs-per-port-per-tree)"
	default value: (1, 16, 128, 8, 64)
	this parameter is deprecated

  --per_prio_max_quota <value>
	Maximum percentage of quota (OSTs, Buffers and Groups) per aggregation node per tree, that can be requested by a single job by its priority.
	It is guaranteed that no job will receive more than the max quota
	Format: "prio_0_quota, [prio_1_quota, ..., prio_9_quota] "
	default value: 100
	can be updated in run-time through the configuration file

  --per_prio_default_quota <value>
	Default percentage of quota (OSTs, Buffers and Groups) per aggregation node per tree, to be requested for a single job by its priority.
	The quota in percent per tree that will be requested for a job if no quota was requested explicitly
	Format: "prio_0_quota, [prio_1_quota, ..., prio_9_quota] "
	default value: 20
	can be updated in run-time through the configuration file

  --low_prio_max_accumulated_quota <value>
	Maximum accumulated quota (OSTs, Buffers and Groups) percentage that can be allocated for all low priority jobs (priority = 0) on a single AN.
	default value: 100

  --max_trees_per_job <value>
	Maximum number of trees per job. It is guaranteed that no job will receive more than the max trees
	default value: 2
	can be updated in run-time through the configuration file

  --default_trees_per_job <value>
	Default number of trees per job. The number of trees allocated for a job if it was not requested explicitly
	default value: 1
	can be updated in run-time through the configuration file

  --max_compute_ports_per_agg_node <value>
	Maximum number of compute ports connected to the same aggregation node.
	Used for calculating the number of maximum QPs-per-port-per-tree
	When set to 0, AM determines the maximum number of compute ports according to topology structure.
	default value: 0
	can be updated in run-time through the configuration file

  --default_reproducibility <value>
	Default value for reproducibility mode
	default value: TRUE
	can be updated in run-time through the configuration file

  --smx_sock_interface <value>
	Network interface to be used by SMX:
	empty string (default) - Use first interface found in UP state
	default value: 

  --smx_sock_port <value>
	The external port to be used by SMX
	default value: 6126

  --smx_sock_backlog <value>
	Defines the maximum length to which the queue of pending connections for the SMX listen socket may grow.
	default value: 5

  --smx_sock_addr_family <value>
	Determines which address family will be used in SMX's sockets
	The value needs to be one of the following: { ipv4, ipv6 }
	IPv4 support is required even when choosing the ipv6 option
	default value: ipv6

  -g, --ib_port_guid <value>
	GUID of the port to which aggregation manager binds to
	default value: 0x0

  --ib_max_mads_on_wire <value>
	Maximum number of MADs that can be sent before waiting for a response
	default value: 100

  --ib_sa_key <value>
	SA key
	default value: 0x1

  --ib_am_key <value>
	AM key
	default value: 0x0

  --ib_sharp_sl <value>
	SL for SHArP control path communication (MADs)
	default value: 0

  --support_multicast <value>
	Support return result by multicast
	default value: TRUE

  --ib_qpc_transport_service <value>
	IB QP Context - transport service
	0 - Reliable connection
	1 - Unreliable connection
	2 - Reliable datagram
	3 - Unreliable datagram
	4 - Dynamically connected
	default value: 0

  --ib_qpc_use_grh <value>
	IB QP Context - Use GRH for AN to AN communication
	default value: FALSE

  --ib_qpc_pkey <value>
	IB QP Context - Partition Key for SHArP
	default value: 0xFFFF

  --ib_qpc_sl <value>
	IB QP Context - SL for SHArP data path communication
	default value: 0

  --ib_sat_qpc_sl <value>
	IB QP Context - SL for SHArP streaming data path communication
	default value: 0

  --ib_qpc_traffic_class <value>
	IB QP Context - Traffic class for SHArP
	default value: 0

  --ib_qpc_rq_psn <value>
	IB QP Context - The transport Packet Sequence Number at which
	the remote end of the QP shall begin transmitting over the
	newly established channel. This value should be chosen to
	minimize the chance that a packet from a previous connection
	could fall within the valid PSN window
	default value: 0

  --ib_sat_qpc_rq_psn <value>
	IB Streaming Aggregation QP Context - The transport Packet Sequence Number at which
	the remote end of the QP shall begin transmitting over the
	newly established channel. This value should be chosen to
	minimize the chance that a packet from a previous connection
	could fall within the valid PSN window
	default value: 0

  --ib_qpc_sq_psn <value>
	IB QP Context - The transport Packet Sequence Number at which
	the local end of the QP shall begin transmitting over the newly
	established channel. This value should be chosen to minimize
	the chance that a packet from a previous connection could fall
	within the valid PSN window
	default value: 0

  --ib_sat_qpc_sq_psn <value>
	IB Streaming Aggregation QP Context - The transport Packet Sequence Number at which
	the local end of the QP shall begin transmitting over the newly
	established channel. This value should be chosen to minimize
	the chance that a packet from a previous connection could fall
	within the valid PSN window
	default value: 0

  --ib_qpc_rnr_mode <value>
	IB QP Context - RNR mode
	0 - SHArP level resources do not apply for RNR
	1 - SHArP level resources apply to the IB transport RNR NACK
	default value: 0

  --ib_sat_qpc_rnr_mode <value>
	IB Streaming Aggregation QP Context - RNR mode
	0 - SHArP level resources do not apply for RNR
	1 - SHArP level resources apply to the IB transport RNR NACK
	default value: 1

  --ib_qpc_rnr_retry_limit <value>
	IB QP Context - RNR retry limit
	The total number of times that the sender wishes the receiver to
	retry RNR NAK errors before posting a completion error
	default value: 0x7

  --ib_sat_qpc_rnr_retry_limit <value>
	IB Streaming Aggregation QP Context - RNR retry limit
	The total number of times that the sender wishes the receiver to
	retry RNR NAK errors before posting a completion error
	default value: 0x7

  --ib_qpc_local_ack_timeout <value>
	IB QP Context - Local ACK timeout
	Value representing the transport (ACK) timeout for use by the
	remote end, expressed as (4.096 µs * 2^Local_ACK_Timeout)
	default value: 0x1F

  --ib_sat_qpc_local_ack_timeout <value>
	IB Streaming Aggregation QP Context - Local ACK timeout
	Value representing the transport (ACK) timeout for use by the
	remote end, expressed as (4.096 µs * 2^Local_ACK_Timeout)
	default value: 0x1F

  --ib_qpc_timeout_retry_limit <value>
	IB QP Context - Timeout retry limit
	The total number of times that the sender wishes the receiver to
	retry timeout, packet sequence, etc. errors before posting a 
	completion error
	default value: 7

  --ib_sat_qpc_timeout_retry_limit <value>
	IB Streaming Aggregation QP Context - Timeout retry limit
	The total number of times that the sender wishes the receiver to
	retry timeout, packet sequence, etc. errors before posting a 
	completion error
	default value: 7

  --trimming_mode <value>
	Configured trimming mode:
	0 - No Trimming
	1 - Data Path Trimming
	default value: 1
	can be updated in run-time through the configuration file

  --pending_mode_timeout <value>
	A time period during which AM waits for a job to complete.
	-1 - no wait
	 0 - wait forever
	 x - pending mode duration in minutes
	default value: 0

  --job_info_polling_interval <value>
	While the AM is in the pending mode state,
	it periodically queries the relevant sharp daemons to check whether the jobs are still active.
	Use job_info_polling_interval to set the interval between queries in minutes, or 0 for a single query.
	0 - single msg mode
	x - interval in [minutes] between queries
	default value: 60

  -h, --help
	Show usage and exit

  -v, --version
	Prints sharp_am version and exit

```
## SHARP Daemon

The SHARP Daemon is local to each node and is expected to persist as long as the network is available.
SD interacts with the following entities:

 * AM. Job startup/termination.
 * SM. Service record fetching.
 * Other SDs. Group creation and destruction.
 * libsharp communication library. Job/Group management.

Only SD#0 interacts with AM for job management. This interaction is limited to sending a resource allocation
request for a job, receiving job data, and sending a termination request. Job data distribution
between the SDs participating in the job is out of scope of the SHARP software and has to be done at the
OOB (Out Of Band) level using the push API.
In addition, all SDs can interact with the AM to request data from the opensm-ftree-ca-order.dump file,
a file generated by the SM (Subnet Manager),
but it is assumed that only one SD per job does this, and that the data is requested only once per job.
SD#0 is responsible for resource management at the communicator level. SD#n (n>0) interacts with
SD#0 and requests resources for a group. For each group, a fraction of the available resources
can be allocated.
A user application can control the resource allocation policy using the following environment variables:

* `SHARP_COLL_GROUP_RESOURCE_POLICY` (1 - equal, 2 - take_all by first group, 3 - user-specified percentage)
* `SHARP_COLL_USER_GROUP_QUOTA_PERCENT`
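For example, a sketch of a launch environment that gives each group a fixed share of the job's resources (the values are illustrative only):

```shell
# Policy 3 = user-specified percentage; grant 50% of the job's SHARP
# resources per group. Illustrative values, not a recommendation.
export SHARP_COLL_GROUP_RESOURCE_POLICY=3
export SHARP_COLL_USER_GROUP_QUOTA_PERCENT=50
```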

SD connects the local Computation Process (CP) to an Aggregation Tree. The connection is based on an RC QP connected to the nearest
AN. AM is responsible for the AN assignment to each compute port. The connection can be reused for
multiple collective operations. Each group should be joined to the Aggregation Tree before sending
collective operations. If multiple processes in the group run on the same node, HCOLL
can group these processes based on socket locality and use multiple processes for sending collective
operations to the network. Inside the sub-group, shared memory is used for the collective. A group channel
process is a process selected for participating in the SHARP group. An application can request a number of
group channels from AM. Multiple group channels affect the tree radix and, as a result, the buffer allocation in the AN.
If the AN cannot allocate the requested number of group channels, the computation job fails. See [Multi-channel group].
Communication between the Computation Process (libsharp) and SD is based on UNIX domain sockets.

A detailed description of the flow between SD and CP can be found in [sharp_ctl.h](src/api/sharp_ctl.h).

SD discovers the AM address using Service Record fetching from the SM.

SD has limited support for resiliency features:

* If the AM connection is broken, SD tries to reconnect to AM.
* SD#0 monitors CP#0 (Computation Process 0). If the process dies, SD#0 issues a job termination request to AM.
  The monitoring is based on socket hangup status and doesn't require CPU cycles.

For any job, HCOLL issues two end-job requests: through SD#0 and through the last SD. The redundant
job termination request covers an SD#0 crash.


### Inter-component messaging

The SMX messaging library is responsible for communication between SHARP software components.
There are three communication protocols:

* AM <-> SD#0. This protocol is used at the job level. It includes the following messages:

   * SHARP_MSG_TYPE_BEGIN_JOB
   * SHARP_MSG_TYPE_END_JOB
   * SHARP_MSG_TYPE_JOB_DATA

SD#0 initiates the connection to AM. SD discovers AM's address using a service record. No special configuration
is needed in a production environment. For debug purposes, the SMX_AM_SERVER environment variable can be used.
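For debugging, the AM endpoint can be set explicitly, bypassing service-record discovery. The `host:port` value format below is an assumption; 6126 is the documented default `smx_sock_port`:

```shell
# Debug-only sketch: point SD at a known AM endpoint.
# "am-host" is a placeholder hostname; 6126 is the default smx_sock_port.
export SMX_AM_SERVER=am-host:6126
```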

* AM <-> SD. This protocol is used at the job level. It includes the following messages:

   * SHARP_MSG_TYPE_REQUEST_SM_DATA
   * SHARP_MSG_TYPE_GET_SM_DATA_BUF_LEN
   * SHARP_MSG_TYPE_GET_SM_DATA

SD initiates this protocol. It is assumed to be used by only one SD per job,
and only once per job per data type.
SD discovers AM's address using a service record. No special configuration is needed in a production environment.
For debug purposes, the SMX_AM_SERVER environment variable can be used.

* SD <-> SD#0. This protocol is used at the communicator level and includes the following messages:

   * SHARP_MSG_TYPE_ALLOC_GROUP
   * SHARP_MSG_TYPE_GROUP_DATA
   * SHARP_MSG_TYPE_GET_JOB_DATA
   * SHARP_MSG_TYPE_RELEASE_GROUP

SD#n (n>0) knows SD#0's address from the job information distributed among the SDs.

SMX wraps the following underlying communication mechanisms:

*   TCP socket. This is the main communication mechanism, used in production environments. A user
    has to configure at least one network interface.
*   Files. This mode serves debug and verification purposes.
*   UCX. This mode allows in-band message communication and uses the [UCX - Unified Communication X library]
    (https://github.com/openucx/ucx). This is an experimental mode and can't be used in a production environment.

Relevant parameters (AM, SD):

* `smx_sock_interface`
* `smx_sock_port`
* `smx_sock_backlog`
* `smx_sock_addr_family`
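For example, a configuration fragment pinning SMX to a specific interface and port might look like this. This is a sketch assuming simple `key value` lines; the exact syntax for your version can be generated with `sharp_am -c <file>`, and `eth0` is a placeholder interface name:

```
smx_sock_interface eth0
smx_sock_port 6126
smx_sock_addr_family ipv4
```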

### MAD communication

AM uses [ibis](https://github.com/Mellanox/ibis_tools) for high-performance, parallel MAD processing.
SD is a libibumad-based application.

### APIs

SHARP includes the following APIs:

* [libsharp_coll](src/api/sharp.h). This is a high-level public API
  available for third-party integration.
* [libsharp](src/api/sharp_ctl.h). This is a low-level private API.
* sharp_job_quota. This is the Job Scheduler API.

libsharp is an interface library used for communication with the local SD. A UNIX domain socket is used for the communication.

Versions
-------------------------------------------------------------------------------

|SHARP version      |MOFED version         |SwitchIB-2 FW|HPCX version|UFM version|SMX protocol|
|-------------------|----------------------|-------------|------------|-----------|------------|
|v1.0               |MLNX OFED 3.3-x.x.x   |15.1100.0072 |1.6.392     |    -      |    new     |
|v1.1               |MLNX OFED 3.4-0.1.2.0 |15.1200.0102 |1.7.405     |    -      |  changed   |
|v1.2               |MLNX OFED 4.0-x.x.x   |15.1200.0102 |1.8.xxx     |  5.8-5.9  |  changed   |
|v1.3               |MLNX OFED 4.1-1.0.2.0 |15.1460.0162 |1.9.5       |           | unchanged  |
|v1.4               |MLNX OFED 4.2-1.2.0.0 |15.1500.0106 |2.0         |  5.9.5-4  | unchanged  |
|v1.5               |MLNX_OFED 4.3-1.0.1.0 |15.1600.0182 |2.1         |  5.10-x   |  changed   |
|v1.6               |MLNX OFED 4.3-3.x     |15.1630.0216 |      -     |  5.11     | unchanged  |
|v1.7               |MLNX OFED 4.4-x.x.x   |15.1630.0216 |2.2         |  6.0      |changed (SD<->SD) only|
|v2.0               |MLNX OFED 4.7-x.x.x   |15.2000.2626 |2.3         |  6.3      |changed, new SMX serialization|
|v2.1               |MLNX OFED 4.8-x.x.x   |15.2000.2626 |2.6         |  6.4      |changed|

Prerequisites
-------------------------------------------------------------------------------

 * SwitchIB-2 or Quantum based fabric.
 * SwitchIB-2, or Quantum FW (see the table above).
 * MLNX OS 3.6.1002 (for managed switches).
 * MOFED (see HCOLL prerequisites)
 * HPCx bundle (see the table above)
 * MLNX OpenSM 4.7.0 or later (available with MLNX OFED 3.3-x.x.x or UFM 5.6).
 * MLNX OpenSM 4.9 or later for eliminating manual configuration for root guids.
 * ConnectX HCA. ConnectX-6 is required for SAT aggregation (blocks bigger than 64K).
 * Kernel >= 2.6.22.
 * SHARP is compiled on the following OSes:

|Distro      |Platform |Kernel          |
|------------|---------|----------------|
|RHEL 6.1    |x86-64   |2.6.32-131.0.15 |
|RHEL 6.2    |x86-64   |2.6.32-220      |
|RHEL 6.3    |x86-64   |2.6.32-279      |
|RHEL 6.4    |x86-64   |2.6.32-358      |
|RHEL 6.5    |x86-64   |2.6.32-431      |
|RHEL 7.0    |x86-64   |3.10.0-123      |
|RHEL 7.2    |x86-64   |3.10.0-327      |
|RHEL 7.2    |ppc64le  |3.10.0-327      |
|RHEL 7.3    |aarch64  |4.5.0-15.el7    |
|RHEL 7.3    |x86-64   |3.10.0-514      |
|RHEL 7.4    |x86-64   |3.10.0-693      |
|RHEL 7.4    |aarch64  |4.11.0-44       |
|RHEL 7.5    |x86-64   |3.10.0-862      |
|Fedora14    |x86-64   |2.6.35.6-45     |
|Fedora16    |x86-64   |3.1.0-7         |
|Fedora17    |x86-64   |3.3.4-5         |
|Fedora18    |x86-64   |3.6.10-4        |
|Fedora26    |x86-64   |4.11.8-300      |
|Fedora28    |x86-64   |4.13.9-300      |
|SLES 11 SP3 |x86-64   |3.0.76-0.11     |
|SLES 11 SP4 |x86-64   |3.0.101-57      |
|SLES 12 SP1 |x86-64   |3.12.49-11      |
|SLES 12 SP2 |x86-64   |4.4.21-68       |
|SLES 12 SP3 |x86-64   |4.4.73-5        |
|SLES 15 SP0 |x86-64   |4.12.14-23      |
|SLES 18 SP0 |x86-64   |4.18.0-10       |
|Ubuntu16.10 |x86-64   |4.8.0-26        |
|Ubuntu17.10 |x86-64   |4.13.0-17       |
|Ubuntu14.4  |ppc64le  |3.13.0-32       |
|Ubuntu18.04 |x86-64   |4.15.0-20       |
|Centos6.3   |x86-64   |2.6.32-279      |
|Centos6.0   |x86-64   |2.6.32-71       |


 * SHARP is tested on:
     * Intel architecture: RHEL 7.4 (3.10.0-693).
     * PPC architecture (little-endian): Ubuntu 14.4 (3.13.0-32), Power8.

System configuration
-------------------------------------------------------------------------------

* Each compute node needs to run a local SHARP daemon.
* Only one instance of AM is allowed.
* AM and SM have to share the same server.

```
  +--------------------------------------+    +---------------------------------+
  |  Compute host                        |    | Dedicated server                |
  |                                      |    |                                 |
  |  +---------------+  +-------------+  |    |  +------------+  +-----------+  |
  |  | libsharp      +-->  SD         |  |    |  |  AM        |  |  SM       |  |
  |  +---------------+  |             |  |    |  |            |  |           |  |
  |  | libsharp_coll |  |             |  |    |  |            |  |           |  |
  |  +---------------+  |             |  |    |  |            |  |           |  |
  |  | hcoll         |  |             |  |    |  |            |  |           |  |
  |  +---------------+  |             |  |    |  |            |  |           |  |
  |  | Computation   |  +-------------+  |    |  +------------+  |           |  |
  |  | Process (CP)  |  |    SMX      +----------->SMX        |  |           |  |
  |  +---------------+  +-------------+  |TCP |  +------------+  +-----------+  |
  |                                      |    |                                 |
  +--------------------------------------+    +---------------------------------+
```


Installation - Linux
-------------------------------------------------------------------------------

To build SHARP from source, the following tools are needed:
 * autoconf
 * automake
 * libtool
 * pkg-config

If you get the SHARP sources from GitHub, you first have to generate the
configure script:

```shell
% ./autogen.sh
```

To build and install SHARP, run the following:

```shell
% module load mofed/hpcx
% ./configure --with-mpi=$OMPI_HOME --prefix=$PWD/install
% make install
% module unload mofed/hpcx
```

To build SHARP with ROCm support:

Install ROCm following installation guide here: https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html. ROCm packages are installed to /opt/rocm by default.

```shell
./configure --with-cuda=/usr/local/cuda --with-rocm=/opt/rocm
```

To compile debugging code, configure the project with the `--enable-debug` option:
```shell
% ./configure --with-mpi=$OMPI_HOME --prefix=$PWD/install --enable-debug
```

RPM installation:
```shell
# rpm -ivh <sharp.rpm>
```

DEB installation:
```shell
# dpkg -i <sharp.deb>
```

After installation, the following daemon will be set up:

- *sharp_am* - disabled in all runlevels

For manual installation/removal of the daemons (requires root permission):
```
$prefix/bin/sharp_daemons_setup.sh or $top_source_dir/contrib/sharp_daemons_setup.sh
```

How to use the script:

```
Usage: sharp_daemons_setup.sh <-s SHARP location dir> <-r> <-a> <-d daemon> <-b>
    -s - Setup SHARP daemons
    -r - Remove SHARP daemons
    -a - All daemons (sharp_am)[default]
    -d - Daemon name (sharp_am)
    -b - Enable socket based activation of the service
```
The -b option is only available on systems that support socket-based activation.
Socket-based activation is supported on RHEL 7.2 and above and requires systemd.

https://github.com/Mellanox/sharp/wiki/Socket-Based-Activation-Installation-Procedure-for-Sharpd

Examples of daemon configuration:

```shell
# $prefix/sharp_daemons_setup.sh -s -d sharp_am     # Set up the sharp_am daemon
# $prefix/sharp_daemons_setup.sh -r                 # Remove all daemons
# $prefix/sharp_daemons_setup.sh -s -d sharp_am -b  # Set up sharp_am as a socket-activated daemon
```

After the setup procedure, the daemon startup script is:

```
/etc/init.d/sharp_am
```
The daemon config file is:
```
/etc/sysconfig/sharp_am
```

The `$SHARP_OPTIONS` variable defined in the sysconfig file is passed to the corresponding daemon as its parameters.

It is possible to run SHARP daemons from a nonstandard location by setting the `SHARP_STARTUP_SCRIPT` environment variable.

Example:
```shell
# SHARP_STARTUP_SCRIPT=/my/script/location/sharp_am /etc/init.d/sharp_am start
```

Also, `SHARP_DEVEL` may be set to force the daemon script to use the `$prefix` directory for the `lockfile` and `pidfile` instead of the default locations (`/var/lock/subsys` and `/var/run`, respectively).


**Unit Test**

```
% make unittest
% make gtest
```

**Run Valgrind Test**

```
% make valgrind
```

**Run MPI Test**

```
% make runtest
```

Using libsharp
-------------------------------------------------------------------------------

To compile a package using libsharp, you need to provide
CFLAGS and LDFLAGS. libsharp is integrated with pkg-config;
if you use pkg-config, you can use it to get the compilation
flags:

```
PKG_CONFIG_PATH=<sharp destination folder> pkg-config --cflags sharp # prints CFLAGS
PKG_CONFIG_PATH=<sharp destination folder> pkg-config --libs sharp # prints LDFLAGS
```
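The resulting flags can then be fed straight to the compiler. The sketch below is self-contained for illustration: it generates a stand-in `sharp.pc` file with a placeholder prefix (the paths and flag values here are assumptions, not SHARP's real ones) and shows how `PKG_CONFIG_PATH` makes pkg-config resolve them:

```shell
# Create a stand-in sharp.pc with a placeholder prefix (illustration only;
# a real install ships its own .pc file under $prefix/lib/pkgconfig).
pcdir=$(mktemp -d)
cat > "$pcdir/sharp.pc" <<'EOF'
prefix=/opt/sharp/install

Name: sharp
Description: placeholder .pc file for illustration
Version: 2.1.0
Cflags: -I${prefix}/include
Libs: -L${prefix}/lib -lsharp
EOF

# Resolve compile and link flags the same way a build system would.
CFLAGS=$(PKG_CONFIG_PATH=$pcdir pkg-config --cflags sharp)
LDFLAGS=$(PKG_CONFIG_PATH=$pcdir pkg-config --libs sharp)
echo "CFLAGS=$CFLAGS"      # the -I include flag derived from the prefix
echo "LDFLAGS=$LDFLAGS"    # the -L library path plus -lsharp
rm -rf "$pcdir"
```

A typical compile line would then be `gcc $CFLAGS -o app app.c $LDFLAGS`, with `PKG_CONFIG_PATH` pointing at the real SHARP installation folder.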

Using sharp_job_quota
-------------------------------------------------------------------------------

The sharp_job_quota executable should be run as the same user as SharpD (root).
It is intended to be used by the job scheduler to set, by uid,
the user allowed to run SHARP jobs, as well as to limit the amount of resources
that the user is allowed to request for the job, if such a limitation
is necessary.

When running sharp_job_quota, two arguments are required
(they can be provided on the command line, in a configuration file, or through the environment).
These arguments are:

* operation
* allocation ID

In addition, if the "set" operation is chosen, then either the uid argument or the user_name argument must be provided.

For more information about these, see the full list of options below.

sharp_job_quota supports an options file.

Here are two simple usage examples:

```
sharp_job_quota --operation set --user_name jobrunner --allocation_id 2017 --coll_job_quota_max_groups 10
sharp_job_quota --operation remove --allocation_id 2017
```

```
Description: sharp_job_quota - Set or remove sharp job quota

Usage: sharp_job_quota [OPTION]
Examples:
sharp_job_quota --operation set --user_name jobrunner --allocation_id 2017 --coll_job_quota_max_groups 10
sharp_job_quota --operation remove --allocation_id 2017

OPTIONS:
  -O, --config_file <value>
	Configuration file.
	If specified with a '-' prefix, configuration file read errors are ignored
	and the default configuration is used. If '-' is not specified and the
	configuration file cannot be read, sharp_job_quota exits.
	default value: /etc/sharp/sharp_job_quota.cfg

  -c, --create_config <value>
	sharp_job_quota will dump its configuration to the specified file and exit
	default value: (null)

  -t, --operation <value>
	The 2 valid values are:
	"set" - set quota. Either uid or user_name must be provided for this operation
	"remove" - remove quota

  -i, --allocation_id <value>
	The job scheduler's ID for the job. No other job in the system can have the same ID

  -u, --uid <value>
	UID of the user that will be allowed to run jobs

  -n, --user_name <value>
	Name of the user that will be allowed to run jobs

  -g, --coll_job_quota_max_groups <value>
	Maximum number of groups (comms) allowed.
	The default value means that all quota requests in range will be accepted
	default value: 0

  -q, --coll_job_quota_max_qps_per_port <value>
	Maximum QPs/port allowed.
	The default value means that all quota requests in range will be accepted
	default value: 0

  -p, --coll_job_quota_max_payload_per_ost <value>
	Maximum payload per OST allowed.
	The default value means all quota requests in range will be accepted
	default value: 256

  -o, --coll_job_quota_max_osts <value>
	Indicates the maximum number of OSTs allowed for job per collective operation.
	The default value means that all quota requests in range will be accepted
	default value: 0

  --coll_job_quota_max_num_trees <value>
	Indicates the maximum number of trees allowed for job.
	The default value means that all number of trees in range will be accepted
	default value: 0

  --job_priority <value>
	Priority of the job.
	default value: 0

  --coll_job_quota_percentage <value>
	Sets upper limit of SHARP resources available for the job.
	Default value 0 means, use sharp_am default according to priority.
	default value: 0

  -h, --help
	Show usage and exit

  -v, --version
	Print version and exit

```

Logging
-------------------------------------------------------------------------------

The following logs are useful for SHARP troubleshooting:

* AM log. The default location is /var/log/sharp_am.log.
* The following parameters control log creation in AM and SD:

  * `log_file`
  * `log_verbosity`. Possible values: 1 - Errors; 2 - Warnings; 3 - Info;
    4 - Debug; 5 - Trace.
  * `verbose`
  * `log_max_backup_files`
  * `log_file_max_size`

* SHARP_COLL logging.

   * `SHARP_COLL_LOG_LEVEL` - Messages with a level higher or equal to the selected will be printed.
     Possible values are: 0 - fatal, 1 - error, 2 - warn, 3 - info, 4 - debug, 5 - trace.

* SM log.

* SMX doesn't have its own logging system. It reports messages into the application log (AM or SD).

* sharp_job_quota does not currently have logging.
  It prints messages and errors to stdout and stderr, respectively

* ibis has its own log. The log location and size are configured using the following parameters:

  * `ibis_log_file`
  * `ibis_log_size`
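
For example, AM logging can be tuned through the `$SHARP_OPTIONS` variable in the daemon's sysconfig file described above. This is a sketch only: the long-option spelling of these parameters and the chosen values are assumptions, so verify the exact syntax your daemon version accepts.

```
# /etc/sysconfig/sharp_am  (sketch -- flag spelling is an assumption)
SHARP_OPTIONS="--log_file /var/log/sharp_am.log --log_verbosity 4 --log_max_backup_files 5"
```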

Switch IB2 capabilities
-------------------------------------------------------------------------------

* Maximum node radix: 64.
* Maximum number of trees: 64. Tree number 63 is reserved.
* Maximum number of groups: 128.
* Data buffer size: 192K.
* Maximum number of QPs: 2K.
* Maximum operation size: 256 Bytes.
* Minimal MTU: 512 Bytes.
* Outstanding operations: Up to 384. The actual number of OSTs is limited by the buffer size.


Limitations
-------------------------------------------------------------------------------

### v1.7.0

* Mellanox OS is not supported. The SM has to run on a dedicated server.
* AM has limited support for topology updates. Switch reboot and link bouncing are supported. Fabric extension for non-root switches is supported.
* Only one job is allowed per host.
* AM in HPCX/MOFED doesn't support handover/failover. UFM supports AM handover.
* Each new instance of AM cleans SHARP resources in all discovered ANs.
* Only fat-tree, quasi fat-tree, hypercube and dragonfly+ topologies are supported.
* Only homogeneous fabrics are supported (all switches must be SHARP compatible).
* A non-homogeneous fabric (where not all switches are SHARP compatible) needs manual configuration of
  trees and of the connections between hosts and ANs. See [TreesConfigurationFile.md](doc/TreesConfigurationFile.md).
* Limited support for partial software update. SD and AM can communicate if they use the same SMX protocol version.
* Only AM Key 0 has been tested.
* SD v1.7 is not compliant with previous SD versions. (SD v1.7 is compliant with AM v1.67)

Known issues
-------------------------------------------------------------------------------

### v1.7

* In some cases, HCOLL asks for surplus group channels. The number of requested group channels
  is a function of the maximum socket id; even if a socket with a lower id is not used, HCOLL
  still requests SHARP resources for it.
* FW version 15.1430.0160 is unusable with SHARP. The FW should be upgraded to the latest GA (15.1630.0210).
* FW supports only one trim request at a time. This means group trimming in the data path can fail
  if two groups run trimming through the same AN.

Releases
------------------------------------------------------------------------------

### Naming convention

 __XX.YY.ZZ-PRERELEASE__

*  __XX__: Major version
*  __YY__: Minor version
*  __ZZ__: Fix for released version
*  __PRERELEASE__: An optional tag that indicates a pre-release. A pre-release is not yet
                 stable enough for production use, but is an essential milestone before the release.

### How to check SHARP version

* Open `<SHARP FOLDER>/share/doc/sharp/SHARP_VERSION`

	```
	PACKAGE VERSION: 1.3.0
	PRERELEASE: 0dev
	SOURCE REVISION: c51b664
	IBIS SOURCE REVISION: ab5c9f9
	BUILD DATE: Jan/30/2017 11:40:30
	```

* Run `<SHARP application> --version`

	```
	sharpd (sharp) 1.3.0-0dev
	Copyright (C) 2016 Mellanox Technologies, Inc.
	License: See LICENSE file
	There is NO WARRANTY, to the extent permitted by law.

	Build Date: Jan 30 2017
	Last commit: c51b664
	```

* Search for "Version:" in log file

```
[Jan 30 15:48:26 338167][SD][14193][output] - Package: sharp-0dev
[Jan 30 15:48:26 338275][SD][14193][output] - Version: 1.3.0
[Jan 30 15:48:26 338290][SD][14193][output] - Build Date: Jan 30 2017
[Jan 30 15:48:26 338303][SD][14193][output] - Last commit: c51b664
```

### Pre-release tags

Only the following names can be used in pre-release tags.

* __0dev__: The project is not in a stable state. The API may change without notice.
* __alpha__: The project is untested and may contain many unknown bugs. Not suitable for
 production sites.
* __beta__: API should be considered frozen. Not generally suitable for production sites,
  but may be used on some production sites if the site administrator knows the project
  well, and knows how to handle any outstanding issues.
* __rc__: All features in the version are implemented. All reported, critical issues are
  fixed. From developer perspective, the project is ready for production.

### Branches

Development is done in "master" branch. Once a version is released a new maintenance branch is created.

### Examples

|SHARP version      |Git tag         | Git branch | Description                               |
|-------------------|----------------|------------|-------------------------------------------|
|v1.0               |v1.0.0          | v1.0       | First official release of the SHARP software |
|v1.1.1             |v1.1.1          | v1.1       | Update for version v1.1                   |
|v1.2.0-alpha1      |v1.2.0-alpha1   | master     | Alpha version for v1.2                    |
|v1.2.0-beta        |v1.2.0-beta     | master     | Beta version for v1.2                     |
|v1.2.0-rc2         |v1.2.0-rc2      | master     | Release candidate  for v1.2               |
|v1.2.0             |v1.2.0          | v1.2       | Official release of v1.2                  |
|v1.3.0-0dev        |v1.3.0-0dev     | master     | v1.3 under development                    |

Contributing to the project
-------------------------------------------------------------------------------

See [CONTRIBUTING.md](.github/CONTRIBUTING.md)

References
------------------------------------------------------------------------------
[Multi-channel group]: https://github.com/Mellanox/sharp/wiki/Multi-channel-group
