| Coherent Accelerator Interface (CXL) |
| ==================================== |
| |
| Introduction |
| ============ |
| |
| The coherent accelerator interface is designed to allow the |
| coherent connection of accelerators (FPGAs and other devices) to a |
| POWER system. These devices need to adhere to the Coherent |
| Accelerator Interface Architecture (CAIA). |
| |
| IBM refers to this as the Coherent Accelerator Processor Interface |
| or CAPI. In the kernel it's referred to by the name CXL to avoid |
| confusion with the ISDN CAPI subsystem. |
| |
| Coherent in this context means that the accelerator and CPUs can |
| both access system memory directly and with the same effective |
| addresses. |
| |
| |
| Hardware overview |
| ================= |
| |
| POWER8/9 FPGA |
| +----------+ +---------+ |
| | | | | |
| | CPU | | AFU | |
| | | | | |
| | | | | |
| | | | | |
| +----------+ +---------+ |
| | PHB | | | |
| | +------+ | PSL | |
| | | CAPP |<------>| | |
| +---+------+ PCIE +---------+ |
| |
| The POWER8/9 chip has a Coherently Attached Processor Proxy (CAPP) |
| unit which is part of the PCIe Host Bridge (PHB). This is managed |
| by Linux by calls into OPAL. Linux doesn't directly program the |
| CAPP. |
| |
| The FPGA (or coherently attached device) consists of two parts. |
| The POWER Service Layer (PSL) and the Accelerator Function Unit |
| (AFU). The AFU is used to implement specific functionality behind |
| the PSL. The PSL, among other things, provides memory address |
| translation services to allow each AFU direct access to userspace |
| memory. |
| |
| The AFU is the core part of the accelerator (eg. the compression, |
| crypto etc function). The kernel has no knowledge of the function |
| of the AFU. Only userspace interacts directly with the AFU. |
| |
| The PSL provides the translation and interrupt services that the |
| AFU needs. This is what the kernel interacts with. For example, if |
| the AFU needs to read a particular effective address, it sends |
| that address to the PSL, the PSL then translates it, fetches the |
| data from memory and returns it to the AFU. If the PSL has a |
| translation miss, it interrupts the kernel and the kernel services |
| the fault. The context to which this fault is serviced is based on |
| who owns that acceleration function. |
| |
| POWER8 <-----> PSL Version 8 is compliant to the CAIA Version 1.0. |
| POWER9 <-----> PSL Version 9 is compliant to the CAIA Version 2.0. |
| This PSL Version 9 provides new features such as: |
| * Interaction with the nest MMU on the P9 chip. |
| * Native DMA support. |
| * Supports sending ASB_Notify messages for host thread wakeup. |
| * Supports Atomic operations. |
| * .... |
| |
| Cards with a PSL9 won't work on a POWER8 system and cards with a |
| PSL8 won't work on a POWER9 system. |
| |
| AFU Modes |
| ========= |
| |
| There are two programming modes supported by the AFU. Dedicated |
| and AFU directed. AFU may support one or both modes. |
| |
| When using dedicated mode only one MMU context is supported. In |
| this mode, only one userspace process can use the accelerator at |
| time. |
| |
| When using AFU directed mode, up to 16K simultaneous contexts can |
| be supported. This means up to 16K simultaneous userspace |
| applications may use the accelerator (although specific AFUs may |
| support fewer). In this mode, the AFU sends a 16 bit context ID |
| with each of its requests. This tells the PSL which context is |
| associated with each operation. If the PSL can't translate an |
| operation, the ID can also be accessed by the kernel so it can |
| determine the userspace context associated with an operation. |
| |
| |
| MMIO space |
| ========== |
| |
| A portion of the accelerator MMIO space can be directly mapped |
| from the AFU to userspace. Either the whole space can be mapped or |
| just a per context portion. The hardware is self describing, hence |
| the kernel can determine the offset and size of the per context |
| portion. |
| |
| |
| Interrupts |
| ========== |
| |
| AFUs may generate interrupts that are destined for userspace. These |
| are received by the kernel as hardware interrupts and passed onto |
| userspace by a read syscall documented below. |
| |
| Data storage faults and error interrupts are handled by the kernel |
| driver. |
| |
| |
| Work Element Descriptor (WED) |
| ============================= |
| |
| The WED is a 64-bit parameter passed to the AFU when a context is |
| started. Its format is up to the AFU hence the kernel has no |
| knowledge of what it represents. Typically it will be the |
| effective address of a work queue or status block where the AFU |
| and userspace can share control and status information. |
| |
| |
| |
| |
| User API |
| ======== |
| |
| 1. AFU character devices |
| |
| For AFUs operating in AFU directed mode, two character device |
| files will be created. /dev/cxl/afu0.0m will correspond to a |
| master context and /dev/cxl/afu0.0s will correspond to a slave |
| context. Master contexts have access to the full MMIO space an |
| AFU provides. Slave contexts have access to only the per process |
| MMIO space an AFU provides. |
| |
| For AFUs operating in dedicated process mode, the driver will |
| only create a single character device per AFU called |
| /dev/cxl/afu0.0d. This will have access to the entire MMIO space |
| that the AFU provides (like master contexts in AFU directed). |
| |
| The types described below are defined in include/uapi/misc/cxl.h |
| |
| The following file operations are supported on both slave and |
| master devices. |
| |
| A userspace library libcxl is available here: |
| https://github.com/ibm-capi/libcxl |
| This provides a C interface to this kernel API. |
| |
| open |
| ---- |
| |
| Opens the device and allocates a file descriptor to be used with |
| the rest of the API. |
| |
| A dedicated mode AFU only has one context and only allows the |
| device to be opened once. |
| |
| An AFU directed mode AFU can have many contexts, the device can be |
| opened once for each context that is available. |
| |
| When all available contexts are allocated the open call will fail |
| and return -ENOSPC. |
| |
| Note: IRQs need to be allocated for each context, which may limit |
| the number of contexts that can be created, and therefore |
| how many times the device can be opened. The POWER8 CAPP |
| supports 2040 IRQs and 3 are used by the kernel, so 2037 are |
| left. If 1 IRQ is needed per context, then only 2037 |
| contexts can be allocated. If 4 IRQs are needed per context, |
| then only 2037/4 = 509 contexts can be allocated. |
| |
| |
| ioctl |
| ----- |
| |
| CXL_IOCTL_START_WORK: |
| Starts the AFU context and associates it with the current |
| process. Once this ioctl is successfully executed, all memory |
| mapped into this process is accessible to this AFU context |
| using the same effective addresses. No additional calls are |
| required to map/unmap memory. The AFU memory context will be |
| updated as userspace allocates and frees memory. This ioctl |
| returns once the AFU context is started. |
| |
| Takes a pointer to a struct cxl_ioctl_start_work: |
| |
| struct cxl_ioctl_start_work { |
| __u64 flags; |
| __u64 work_element_descriptor; |
| __u64 amr; |
| __s16 num_interrupts; |
| __s16 reserved1; |
| __s32 reserved2; |
| __u64 reserved3; |
| __u64 reserved4; |
| __u64 reserved5; |
| __u64 reserved6; |
| }; |
| |
| flags: |
| Indicates which optional fields in the structure are |
| valid. |
| |
| work_element_descriptor: |
| The Work Element Descriptor (WED) is a 64-bit argument |
| defined by the AFU. Typically this is an effective |
| address pointing to an AFU specific structure |
| describing what work to perform. |
| |
| amr: |
| Authority Mask Register (AMR), same as the powerpc |
| AMR. This field is only used by the kernel when the |
| corresponding CXL_START_WORK_AMR value is specified in |
| flags. If not specified the kernel will use a default |
| value of 0. |
| |
| num_interrupts: |
| Number of userspace interrupts to request. This field |
| is only used by the kernel when the corresponding |
| CXL_START_WORK_NUM_IRQS value is specified in flags. |
| If not specified the minimum number required by the |
| AFU will be allocated. The min and max number can be |
| obtained from sysfs. |
| |
| reserved fields: |
| For ABI padding and future extensions |
| |
| CXL_IOCTL_GET_PROCESS_ELEMENT: |
| Get the current context id, also known as the process element. |
| The value is returned from the kernel as a __u32. |
| |
| |
| mmap |
| ---- |
| |
| An AFU may have an MMIO space to facilitate communication with the |
| AFU. If it does, the MMIO space can be accessed via mmap. The size |
| and contents of this area are specific to the particular AFU. The |
| size can be discovered via sysfs. |
| |
| In AFU directed mode, master contexts are allowed to map all of |
| the MMIO space and slave contexts are allowed to only map the per |
| process MMIO space associated with the context. In dedicated |
| process mode the entire MMIO space can always be mapped. |
| |
| This mmap call must be done after the START_WORK ioctl. |
| |
| Care should be taken when accessing MMIO space. Only 32 and 64-bit |
| accesses are supported by POWER8. Also, the AFU will be designed |
| with a specific endianness, so all MMIO accesses should consider |
| endianness (recommend endian(3) variants like: le64toh(), |
| be64toh() etc). These endian issues equally apply to shared memory |
| queues the WED may describe. |
| |
| |
| read |
| ---- |
| |
| Reads events from the AFU. Blocks if no events are pending |
| (unless O_NONBLOCK is supplied). Returns -EIO in the case of an |
| unrecoverable error or if the card is removed. |
| |
| read() will always return an integral number of events. |
| |
| The buffer passed to read() must be at least 4K bytes. |
| |
| The result of the read will be a buffer of one or more events, |
| each event is of type struct cxl_event, of varying size. |
| |
| struct cxl_event { |
| struct cxl_event_header header; |
| union { |
| struct cxl_event_afu_interrupt irq; |
| struct cxl_event_data_storage fault; |
| struct cxl_event_afu_error afu_error; |
| }; |
| }; |
| |
| The struct cxl_event_header is defined as: |
| |
| struct cxl_event_header { |
| __u16 type; |
| __u16 size; |
| __u16 process_element; |
| __u16 reserved1; |
| }; |
| |
| type: |
| This defines the type of event. The type determines how |
| the rest of the event is structured. These types are |
| described below and defined by enum cxl_event_type. |
| |
| size: |
| This is the size of the event in bytes including the |
| struct cxl_event_header. The start of the next event can |
| be found at this offset from the start of the current |
| event. |
| |
| process_element: |
| Context ID of the event. |
| |
| reserved field: |
| For future extensions and padding. |
| |
| If the event type is CXL_EVENT_AFU_INTERRUPT then the event |
| structure is defined as: |
| |
| struct cxl_event_afu_interrupt { |
| __u16 flags; |
| __u16 irq; /* Raised AFU interrupt number */ |
| __u32 reserved1; |
| }; |
| |
| flags: |
| These flags indicate which optional fields are present |
| in this struct. Currently all fields are mandatory. |
| |
| irq: |
| The IRQ number sent by the AFU. |
| |
| reserved field: |
| For future extensions and padding. |
| |
| If the event type is CXL_EVENT_DATA_STORAGE then the event |
| structure is defined as: |
| |
| struct cxl_event_data_storage { |
| __u16 flags; |
| __u16 reserved1; |
| __u32 reserved2; |
| __u64 addr; |
| __u64 dsisr; |
| __u64 reserved3; |
| }; |
| |
| flags: |
| These flags indicate which optional fields are present in |
| this struct. Currently all fields are mandatory. |
| |
| address: |
| The address that the AFU unsuccessfully attempted to |
| access. Valid accesses will be handled transparently by the |
| kernel but invalid accesses will generate this event. |
| |
| dsisr: |
| This field gives information on the type of fault. It is a |
| copy of the DSISR from the PSL hardware when the address |
| fault occurred. The form of the DSISR is as defined in the |
| CAIA. |
| |
| reserved fields: |
| For future extensions |
| |
| If the event type is CXL_EVENT_AFU_ERROR then the event structure |
| is defined as: |
| |
| struct cxl_event_afu_error { |
| __u16 flags; |
| __u16 reserved1; |
| __u32 reserved2; |
| __u64 error; |
| }; |
| |
| flags: |
| These flags indicate which optional fields are present in |
| this struct. Currently all fields are Mandatory. |
| |
| error: |
| Error status from the AFU. Defined by the AFU. |
| |
| reserved fields: |
| For future extensions and padding |
| |
| |
| 2. Card character device (powerVM guest only) |
| |
| In a powerVM guest, an extra character device is created for the |
| card. The device is only used to write (flash) a new image on the |
| FPGA accelerator. Once the image is written and verified, the |
| device tree is updated and the card is reset to reload the updated |
| image. |
| |
| open |
| ---- |
| |
| Opens the device and allocates a file descriptor to be used with |
| the rest of the API. The device can only be opened once. |
| |
| ioctl |
| ----- |
| |
| CXL_IOCTL_DOWNLOAD_IMAGE: |
| CXL_IOCTL_VALIDATE_IMAGE: |
| Starts and controls flashing a new FPGA image. Partial |
| reconfiguration is not supported (yet), so the image must contain |
| a copy of the PSL and AFU(s). Since an image can be quite large, |
| the caller may have to iterate, splitting the image in smaller |
| chunks. |
| |
| Takes a pointer to a struct cxl_adapter_image: |
| struct cxl_adapter_image { |
| __u64 flags; |
| __u64 data; |
| __u64 len_data; |
| __u64 len_image; |
| __u64 reserved1; |
| __u64 reserved2; |
| __u64 reserved3; |
| __u64 reserved4; |
| }; |
| |
| flags: |
| These flags indicate which optional fields are present in |
| this struct. Currently all fields are mandatory. |
| |
| data: |
| Pointer to a buffer with part of the image to write to the |
| card. |
| |
| len_data: |
| Size of the buffer pointed to by data. |
| |
| len_image: |
| Full size of the image. |
| |
| |
| Sysfs Class |
| =========== |
| |
| A cxl sysfs class is added under /sys/class/cxl to facilitate |
| enumeration and tuning of the accelerators. Its layout is |
| described in Documentation/ABI/testing/sysfs-class-cxl |
| |
| |
| Udev rules |
| ========== |
| |
| The following udev rules could be used to create a symlink to the |
| most logical chardev to use in any programming mode (afuX.Yd for |
| dedicated, afuX.Ys for afu directed), since the API is virtually |
| identical for each: |
| |
| SUBSYSTEM=="cxl", ATTRS{mode}=="dedicated_process", SYMLINK="cxl/%b" |
| SUBSYSTEM=="cxl", ATTRS{mode}=="afu_directed", \ |
| KERNEL=="afu[0-9]*.[0-9]*s", SYMLINK="cxl/%b" |