eCos SMP Support ================ eCos contains support for limited Symmetric Multi-Processing (SMP). This is only available on selected architectures and platforms. This first part of this document describes the platform-independent parts of the SMP support. Annexes at the end of this document describe any details that are specific to a particular platform. Target Hardware Limitations --------------------------- To allow a reasonable implementation of SMP, and to reduce the disruption to the existing source base, a number of assumptions have been made about the features of the target hardware. - Modest multiprocessing. The typical number of CPUs supported is two to four, with an upper limit around eight. While there are no inherent limits in the code, hardware and algorithmic limitations will probably become significant beyond this point. - SMP synchronization support. The hardware must supply a mechanism to allow software on two CPUs to synchronize. This is normally provided as part of the instruction set in the form of test-and-set, compare-and-swap or load-link/store-conditional instructions. An alternative approach is the provision of hardware semaphore registers which can be used to serialize implementations of these operations. Whatever hardware facilities are available, they are used in eCos to implement spinlocks. - Coherent caches. It is assumed that no extra effort will be required to access shared memory from any processor. This means that either there are no caches, they are shared by all processors, or are maintained in a coherent state by the hardware. It would be too disruptive to the eCos sources if every memory access had to be bracketed by cache load/flush operations. Any hardware that requires this is not supported. - Uniform addressing. It is assumed that all memory that is shared between CPUs is addressed at the same location from all CPUs. Like non-coherent caches, dealing with CPU-specific address translation is considered too disruptive to the eCos source base. This does not, however, preclude systems with non-uniform access costs for different CPUs. - Uniform device addressing. As with access to memory, it is assumed that all devices are equally accessible to all CPUs. Since device access is often made from thread contexts, it is not possible to restrict access to device control registers to certain CPUs, since there is currently no support for binding or migrating threads to CPUs. - Interrupt routing. The target hardware must have an interrupt controller that can route interrupts to specific CPUs. It is acceptable for all interrupts to be delivered to just one CPU, or for some interrupts to be bound to specific CPUs, or for some interrupts to be local to each CPU. At present dynamic routing, where a different CPU may be chosen each time an interrupt is delivered, is not supported. ECos cannot support hardware where all interrupts are delivered to all CPUs simultaneously with the expectation that software will resolve any conflicts. - Inter-CPU interrupts. A mechanism to allow one CPU to interrupt another is needed. This is necessary so that events on one CPU can cause rescheduling on other CPUs. - CPU Identifiers. Code running on a CPU must be able to determine which CPU it is running on. The CPU Id is usually provided either in a CPU status register, or in a register associated with the inter-CPU interrupt delivery subsystem. Ecos expects CPU Ids to be small positive integers, although alternative representations, such as bitmaps, can be converted relatively easily. Complex mechanisms for getting the CPU Id cannot be supported. Getting the CPU Id must be a cheap operation, since it is done often, and in performance critical places such as interrupt handlers and the scheduler. Kernel Support -------------- This section describes how SMP is handled in the kernel, and where system behaviour differs from a single CPU system. System Startup ~~~~~~~~~~~~~~ System startup takes place on only one CPU, called the primary CPU. All other CPUs, the secondary CPUs, are either placed in suspended state at reset, or are captured by the HAL and put into a spin as they start up. The primary CPU is responsible for copying the DATA segment and zeroing the BSS (if required), calling HAL variant and platform initialization routines and invoking constructors. It then calls cyg_start() to enter the application. The application may then create extra threads and other objects. It is only when the application calls Cyg_Scheduler::start() that the secondary CPUs are initialized. This routine scans the list of available secondary CPUs and calls HAL_SMP_CPU_START() to start each one. Finally it calls Cyg_Scheduler::start_cpu(). Each secondary CPU starts in the HAL, where it completes any per-CPU initialization before calling into the kernel at cyg_kernel_cpu_startup(). Here it claims the scheduler lock and calls Cyg_Scheduler::start_cpu(). Cyg_Scheduler::start_cpu() is common to both the primary and secondary CPUs. The first thing this code does is to install an interrupt object for this CPU's inter-CPU interrupt. From this point on the code is the same as for the single CPU case: an initial thread is chosen and entered. From this point on the CPUs are all equal, eCos makes no further distinction between the primary and secondary CPUs. However, the hardware may still distinguish them as far as interrupt delivery is concerned. Scheduling ~~~~~~~~~~ To function correctly an operating system kernel must protect its vital data structures, such as the run queues, from concurrent access. In a single CPU system the only concurrent activities to worry about are asynchronous interrupts. The kernel can easily guard its data structures against these by disabling interrupts. However, in a multi-CPU system, this is inadequate since it does not block access by other CPUs. The eCos kernel protects its vital data structures using the scheduler lock. In single CPU systems this is a simple counter that is atomically incremented to acquire the lock and decremented to release it. If the lock is decremented to zero then the scheduler may be invoked to choose a different thread to run. Because interrupts may continue to be serviced while the scheduler lock is claimed, ISRs are not allowed to access kernel data structures, or call kernel routines that can. Instead all such operations are deferred to an associated DSR routine that is run during the lock release operation, when the data structures are in a consistent state. By choosing a kernel locking mechanism that does not rely on interrupt manipulation to protect data structures, it is easier to convert eCos to SMP than would otherwise be the case. The principal change needed to make eCos SMP-safe is to convert the scheduler lock into a nestable spin lock. This is done by adding a spinlock and a CPU id to the original counter. The algorithm for acquiring the scheduler lock is very simple. If the scheduler lock's CPU id matches the current CPU then it can increment the counter and continue. If it does not match, the CPU must spin on the spinlock, after which it may increment the counter and store its own identity in the CPU id. To release the lock, the counter is decremented. If it goes to zero the CPU id value must be set to NONE and the spinlock cleared. To protect these sequences against interrupts, they must be performed with interrupts disabled. However, since these are very short code sequences, they will not have an adverse effect on the interrupt latency. Beyond converting the scheduler lock, further preparing the kernel for SMP is a relatively minor matter. The main changes are to convert various scalar housekeeping variables into arrays indexed by CPU id. These include the current thread pointer, the need_reschedule flag and the timeslice counter. At present only the Multi-Level Queue (MLQ) scheduler is capable of supporting SMP configurations. The main change made to this scheduler is to cope with having several threads in execution at the same time. Running threads are marked with the CPU they are executing on. When scheduling a thread, the scheduler skips past any running threads until it finds a thread that is pending. While not a constant-time algorithm, as in the single CPU case, this is still deterministic, since the worst case time is bounded by the number of CPUs in the system. A second change to the scheduler is in the code used to decide when the scheduler should be called to choose a new thread. The scheduler attempts to keep the *n* CPUs running the *n* highest priority threads. Since an event or interrupt on one CPU may require a reschedule on another CPU, there must be a mechanism for deciding this. The algorithm currently implemented is very simple. Given a thread that has just been awakened (or had its priority changed), the scheduler scans the CPUs, starting with the one it is currently running on, for a current thread that is of lower priority than the new one. If one is found then a reschedule interrupt is sent to that CPU and the scan continues, but now using the current thread of the rescheduled CPU as the candidate thread. In this way the new thread gets to run as quickly as possible, hopefully on the current CPU, and the remaining CPUs will pick up the remaining highest priority threads as a consequence of processing the reschedule interrupt. The final change to the scheduler is in the handling of timeslicing. Only one CPU receives timer interrupts, although all CPUs must handle timeslicing. To make this work, the CPU that receives the timer interrupt decrements the timeslice counter for all CPUs, not just its own. If the counter for a CPU reaches zero, then it sends a timeslice interrupt to that CPU. On receiving the interrupt the destination CPU enters the scheduler and looks for another thread at the same priority to run. This is somewhat more efficient than distributing clock ticks to all CPUs, since the interrupt is only needed when a timeslice occurs. Device Drivers ~~~~~~~~~~~~~~ The main area where the SMP nature of a system will be most apparent is in device drivers. It is quite possible for the ISR, DSR and thread components of a device driver to execute on different CPUs. For this reason it is much more important that SMP-capable device drivers use the driver API routines correctly. Synchronization between threads and DSRs continues to require that the thread-side code use cyg_drv_dsr_lock() and cyg_drv_dsr_unlock() to protect access to shared data. Synchronization between ISRs and DSRs or threads requires that access to sensitive data be protected, in all places, by calls to cyg_drv_isr_lock() and cyg_drv_isr_unlock(). The ISR lock, for SMP systems, not only disables local interrupts, but also acquires a spinlock to protect against concurrent access from other CPUs. This is necessary because ISRs are not run with the scheduler lock claimed. Hence they can run in parallel with other components of the device driver. The ISR lock provided by the driver API is just a shared spinlock that is available for use by all drivers. If a driver needs to implement a finer grain of locking, it can use private spinlocks, accessed via the cyg_drv_spinlock_*() functions (see API later). API Extensions -------------- In general, the SMP support is invisible to application code. All synchronization and communication operations function exactly as before. The main area where code needs to be SMP aware is in the handling of interrupt routing, and in the synchronization of ISRs, DSRs and threads. The following sections contain brief descriptions of the API extensions added for SMP support. More details will be found in the Kernel C API and Device Driver API documentation. Interrupt Routing ~~~~~~~~~~~~~~~~~ Two new functions have been added to the Kernel API and the device driver API to do interrupt routing. These are: void cyg_interrupt_set_cpu( cyg_vector_t vector, cyg_cpu_t cpu ); void cyg_drv_interrupt_set_cpu( cyg_vector_t vector, cyg_cpu_t cpu ); cyg_cpu_t cyg_interrupt_get_cpu( cyg_vector_t vector ); cyg_cpu_t cyg_drv_interrupt_get_cpu( cyg_vector_t vector ); the *_set_cpu() functions cause the given interrupt to be handled by the nominated CPU. The *_get_cpu() functions return the CPU to which the vector is routed. Although not currently supported, special values for the cpu argument may be used to indicate that the interrupt is being routed dynamically or is CPU-local. Once a vector has been routed to a new CPU, all other interrupt masking and configuration operations are relative to that CPU, where relevant. Synchronization ~~~~~~~~~~~~~~~ All existing synchronization mechanisms work as before in an SMP system. Additional synchronization mechanisms have been added to provide explicit synchronization for SMP. A set of functions have been added to the Kernel and device driver APIs to provide spinlocks: void cyg_spinlock_init( cyg_spinlock_t *lock, cyg_bool_t locked ); void cyg_drv_spinlock_init( cyg_spinlock_t *lock, cyg_bool_t locked ); void cyg_spinlock_destroy( cyg_spinlock_t *lock ); void cyg_drv_spinlock_destroy( cyg_spinlock_t *lock ); void cyg_spinlock_spin( cyg_spinlock_t *lock ); void cyg_drv_spinlock_spin( cyg_spinlock_t *lock ); void cyg_spinlock_clear( cyg_spinlock_t *lock ); void cyg_drv_spinlock_clear( cyg_spinlock_t *lock ); cyg_bool_t cyg_spinlock_try( cyg_spinlock_t *lock ); cyg_bool_t cyg_drv_spinlock_try( cyg_spinlock_t *lock ); cyg_bool_t cyg_spinlock_test( cyg_spinlock_t *lock ); cyg_bool_t cyg_drv_spinlock_test( cyg_spinlock_t *lock ); void cyg_spinlock_spin_intsave( cyg_spinlock_t *lock, cyg_addrword_t *istate ); void cyg_drv_spinlock_spin_intsave( cyg_spinlock_t *lock, cyg_addrword_t *istate ); void cyg_spinlock_clear_intsave( cyg_spinlock_t *lock, cyg_addrword_t istate ); void cyg_drv_spinlock_clear_intsave( cyg_spinlock_t *lock, cyg_addrword_t istate ); The *_init() functions initialize the lock, to either locked or clear, and the *_destroy() functions destroy the lock. Init() should be called before the lock is used and destroy() should be called when it is finished with. The *_spin() functions will cause the calling CPU to spin until it can claim the lock and the *_clear() functions clear the lock so that the next CPU can claim it. The *_try() functions attempts to claim the lock but returns false if it cannot. The *_test() functions simply return the state of the lock. None of these functions will necessarily block interrupts while they spin. If the spinlock is only to be used between threads on different CPUs, or in circumstances where it is known that the relevant interrupts are disabled, then these functions will suffice. However, if the spinlock is also to be used from an ISR, which may be called at any point, a straightforward spinlock may result in deadlock. Hence the *_intsave() variants are supplied to disable interrupts while the lock is held. The *_spin_intsave() function disables interrupts, saving the current state in *istate, and then claims the lock. The *_clear_intsave() function clears the spinlock and restores the interrupt enable state from *istate. HAL Support ----------- SMP support in any platform depends on the HAL supplying the appropriate operations. All HAL SMP support is defined in the hal_smp.h header (and if necessary var_smp.h and plf_smp.h). SMP support falls into a number of functional groups. CPU Control ~~~~~~~~~~~ This group consists of descriptive and control macros for managing the CPUs in an SMP system. HAL_SMP_CPU_TYPE A type that can contain a CPU id. A CPU id is usually a small integer that is used to index arrays of variables that are managed on an per-CPU basis. HAL_SMP_CPU_MAX The maximum number of CPUs that can be supported. This is used to provide the size of any arrays that have an element per CPU. HAL_SMP_CPU_COUNT() Returns the number of CPUs currently operational. This may differ from HAL_SMP_CPU_MAX depending on the runtime environment. HAL_SMP_CPU_THIS() Returns the CPU id of the current CPU. HAL_SMP_CPU_NONE A value that does not match any real CPU id. This is uses where a CPU type variable must be set to a nul value. HAL_SMP_CPU_START( cpu ) Starts the given CPU executing at a defined HAL entry point. After performing any HAL level initialization, the CPU calls up into the kernel at cyg_kernel_cpu_startup(). HAL_SMP_CPU_RESCHEDULE_INTERRUPT( cpu, wait ) Sends the CPU a reschedule interrupt, and if _wait_ is non-zero, waits for an acknowledgment. The interrupted CPU should call cyg_scheduler_set_need_reschedule() in its DSR to cause the reschedule to occur. HAL_SMP_CPU_TIMESLICE_INTERRUPT( cpu, wait ) Sends the CPU a timeslice interrupt, and if _wait_ is non-zero, waits for an acknowledgment. The interrupted CPU should call cyg_scheduler_timeslice_cpu() to cause the timeslice event to be processed. Test-and-set Support ~~~~~~~~~~~~~~~~~~~~ Test-and-set is the foundation of the SMP synchronization mechanisms. HAL_TAS_TYPE The type for all test-and-set variables. The test-and-set macros only support operations on a single bit (usually the least significant bit) of this location. This allows for maximum flexibility in the implementation. HAL_TAS_SET( tas, oldb ) Performs a test and set operation on the location _tas_. _oldb_ will contain *true* if the location was already set, and *false* if it was clear. HAL_TAS_CLEAR( tas, oldb ) Performs a test and clear operation on the location _tas_. _oldb_ will contain *true* if the location was already set, and *false* if it was clear. Spinlocks ~~~~~~~~~ Spinlocks provide inter-CPU locking. Normally they will be implemented on top of the test-and-set mechanism above, but may also be implemented by other means if, for example, the hardware has more direct support for spinlocks. HAL_SPINLOCK_TYPE The type for all spinlock variables. HAL_SPINLOCK_INIT_CLEAR A value that may be assigned to a spinlock variable to initialize it to clear. HAL_SPINLOCK_INIT_SET A value that may be assigned to a spinlock variable to initialize it to set. HAL_SPINLOCK_SPIN( lock ) The caller spins in a busy loop waiting for the lock to become clear. It then sets it and continues. This is all handled atomically, so that there are no race conditions between CPUs. HAL_SPINLOCK_CLEAR( lock ) The caller clears the lock. One of any waiting spinners will then be able to proceed. HAL_SPINLOCK_TRY( lock, val ) Attempts to set the lock. The value put in _val_ will be *true* if the lock was claimed successfully, and *false* if it was not. HAL_SPINLOCK_TEST( lock, val ) Tests the current value of the lock. The value put in _val_ will be *true* if the lock is claimed and *false* of it is clear. Scheduler Lock ~~~~~~~~~~~~~~ The scheduler lock is the main protection for all kernel data structures. By default the kernel implements the scheduler lock itself using a spinlock. However, if spinlocks cannot be supported by the hardware, or there is a more efficient implementation available, the HAL may provide macros to implement the scheduler lock. HAL_SMP_SCHEDLOCK_DATA_TYPE A data type, possibly a structure, that contains any data items needed by the scheduler lock implementation. A variable of this type will be instantiated as a static member of the Cyg_Scheduler_SchedLock class and passed to all the following macros. HAL_SMP_SCHEDLOCK_INIT( lock, data ) Initialize the scheduler lock. The _lock_ argument is the scheduler lock counter and the _data_ argument is a variable of HAL_SMP_SCHEDLOCK_DATA_TYPE type. HAL_SMP_SCHEDLOCK_INC( lock, data ) Increment the scheduler lock. The first increment of the lock from zero to one for any CPU may cause it to wait until the lock is zeroed by another CPU. Subsequent increments should be less expensive since this CPU already holds the lock. HAL_SMP_SCHEDLOCK_ZERO( lock, data ) Zero the scheduler lock. This operation will also clear the lock so that other CPUs may claim it. HAL_SMP_SCHEDLOCK_SET( lock, data, new ) Set the lock to a different value, in _new_. This is only called when the lock is already known to be owned by the current CPU. It is never called to zero the lock, or to increment it from zero. Interrupt Routing ~~~~~~~~~~~~~~~~~ The routing of interrupts to different CPUs is supported by two new interfaces in hal_intr.h. Once an interrupt has been routed to a new CPU, the existing vector masking and configuration operations should take account of the CPU routing. For example, if the operation is not invoked on the destination CPU itself, then the HAL may need to arrange to transfer the operation to the destination CPU for correct application. HAL_INTERRUPT_SET_CPU( vector, cpu ) Route the interrupt for the given _vector_ to the given _cpu_. HAL_INTERRUPT_GET_CPU( vector, cpu ) Set _cpu_ to the id of the CPU to which this vector is routed. Annex 1 - Pentium SMP Support ============================= ECos supports SMP working on Pentium class IA32 CPUs with integrated SMP support. It uses the per-CPU APIC's and the IOAPIC to provide CPU control and identification, and to distribute interrupts. Only PCI interrupts that map into the ISA interrupt space are currently supported. The code relies on the MP Configuration Table supplied by the BIOS to discover the number of CPUs, IOAPIC location and interrupt assignments - hardware based MP configuration discovery is not currently supported. Inter-CPU interrupts are mapped into interrupt vectors from 64 up. Each CPU has its own vector at 64+CPUID. Interrupt delivery is initially configured to deliver all interrupts to the initial CPU. HAL_INTERRUPT_SET_CPU() currently only supports the ability to deliver interrupts to specific CPUs, dynamic CPU selection is not currently supported. eCos has only been tested in a dual processor configuration. While the code has been written to handle an arbitrary number of CPUs, this has not been tested.