| KVM-specific MSRs. |
| Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010 |
| ===================================================== |
| |
| KVM makes use of some custom MSRs to service some requests. |
| |
| Custom MSRs have a range reserved for them, that goes from |
| 0x4b564d00 to 0x4b564dff. There are MSRs outside this area, |
| but they are deprecated and their use is discouraged. |
| |
| Custom MSR list |
| -------- |
| |
| The current supported Custom MSR list is: |
| |
| MSR_KVM_WALL_CLOCK_NEW: 0x4b564d00 |
| |
| data: 4-byte alignment physical address of a memory area which must be |
| in guest RAM. This memory is expected to hold a copy of the following |
| structure: |
| |
| struct pvclock_wall_clock { |
| u32 version; |
| u32 sec; |
| u32 nsec; |
| } __attribute__((__packed__)); |
| |
| whose data will be filled in by the hypervisor. The hypervisor is only |
| guaranteed to update this data at the moment of MSR write. |
| Users that want to reliably query this information more than once have |
| to write more than once to this MSR. Fields have the following meanings: |
| |
| version: guest has to check version before and after grabbing |
| time information and check that they are both equal and even. |
| An odd version indicates an in-progress update. |
| |
| sec: number of seconds for wallclock at time of boot. |
| |
| nsec: number of nanoseconds for wallclock at time of boot. |
| |
| In order to get the current wallclock time, the system_time from |
| MSR_KVM_SYSTEM_TIME_NEW needs to be added. |
| |
| Note that although MSRs are per-CPU entities, the effect of this |
| particular MSR is global. |
| |
| Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid |
| leaf prior to usage. |
| |
| MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01 |
| |
| data: 4-byte aligned physical address of a memory area which must be in |
| guest RAM, plus an enable bit in bit 0. This memory is expected to hold |
| a copy of the following structure: |
| |
| struct pvclock_vcpu_time_info { |
| u32 version; |
| u32 pad0; |
| u64 tsc_timestamp; |
| u64 system_time; |
| u32 tsc_to_system_mul; |
| s8 tsc_shift; |
| u8 flags; |
| u8 pad[2]; |
| } __attribute__((__packed__)); /* 32 bytes */ |
| |
| whose data will be filled in by the hypervisor periodically. Only one |
| write, or registration, is needed for each VCPU. The interval between |
| updates of this structure is arbitrary and implementation-dependent. |
| The hypervisor may update this structure at any time it sees fit until |
| anything with bit0 == 0 is written to it. |
| |
| Fields have the following meanings: |
| |
| version: guest has to check version before and after grabbing |
| time information and check that they are both equal and even. |
| An odd version indicates an in-progress update. |
| |
| tsc_timestamp: the tsc value at the current VCPU at the time |
| of the update of this structure. Guests can subtract this value |
| from current tsc to derive a notion of elapsed time since the |
| structure update. |
| |
| system_time: a host notion of monotonic time, including sleep |
| time at the time this structure was last updated. Unit is |
| nanoseconds. |
| |
| tsc_to_system_mul: multiplier to be used when converting |
| tsc-related quantity to nanoseconds |
| |
| tsc_shift: shift to be used when converting tsc-related |
| quantity to nanoseconds. This shift will ensure that |
| multiplication with tsc_to_system_mul does not overflow. |
| A positive value denotes a left shift, a negative value |
| a right shift. |
| |
| The conversion from tsc to nanoseconds involves an additional |
| right shift by 32 bits. With this information, guests can |
| derive per-CPU time by doing: |
| |
| time = (current_tsc - tsc_timestamp) |
| if (tsc_shift >= 0) |
| time <<= tsc_shift; |
| else |
| time >>= -tsc_shift; |
| time = (time * tsc_to_system_mul) >> 32 |
| time = time + system_time |
| |
| flags: bits in this field indicate extended capabilities |
| coordinated between the guest and the hypervisor. Availability |
| of specific flags has to be checked in 0x40000001 cpuid leaf. |
| Current flags are: |
| |
| flag bit | cpuid bit | meaning |
| ------------------------------------------------------------- |
| | | time measures taken across |
| 0 | 24 | multiple cpus are guaranteed to |
| | | be monotonic |
| ------------------------------------------------------------- |
| | | guest vcpu has been paused by |
| 1 | N/A | the host |
| | | See 4.70 in api.txt |
| ------------------------------------------------------------- |
| |
| Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid |
| leaf prior to usage. |
| |
| |
| MSR_KVM_WALL_CLOCK: 0x11 |
| |
| data and functioning: same as MSR_KVM_WALL_CLOCK_NEW. Use that instead. |
| |
| This MSR falls outside the reserved KVM range and may be removed in the |
| future. Its usage is deprecated. |
| |
| Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid |
| leaf prior to usage. |
| |
| MSR_KVM_SYSTEM_TIME: 0x12 |
| |
| data and functioning: same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead. |
| |
| This MSR falls outside the reserved KVM range and may be removed in the |
| future. Its usage is deprecated. |
| |
| Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid |
| leaf prior to usage. |
| |
| The suggested algorithm for detecting kvmclock presence is then: |
| |
| if (!kvm_para_available()) /* refer to cpuid.txt */ |
| return NON_PRESENT; |
| |
| flags = cpuid_eax(0x40000001); |
| if (flags & 3) { |
| msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW; |
| msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW; |
| return PRESENT; |
| } else if (flags & 0) { |
| msr_kvm_system_time = MSR_KVM_SYSTEM_TIME; |
| msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK; |
| return PRESENT; |
| } else |
| return NON_PRESENT; |
| |
| MSR_KVM_ASYNC_PF_EN: 0x4b564d02 |
| data: Bits 63-6 hold 64-byte aligned physical address of a |
| 64 byte memory area which must be in guest RAM and must be |
| zeroed. Bits 5-3 are reserved and should be zero. Bit 0 is 1 |
| when asynchronous page faults are enabled on the vcpu 0 when |
| disabled. Bit 1 is 1 if asynchronous page faults can be injected |
| when vcpu is in cpl == 0. Bit 2 is 1 if asynchronous page faults |
| are delivered to L1 as #PF vmexits. Bit 2 can be set only if |
| KVM_FEATURE_ASYNC_PF_VMEXIT is present in CPUID. |
| |
| First 4 byte of 64 byte memory location will be written to by |
| the hypervisor at the time of asynchronous page fault (APF) |
| injection to indicate type of asynchronous page fault. Value |
| of 1 means that the page referred to by the page fault is not |
| present. Value 2 means that the page is now available. Disabling |
| interrupt inhibits APFs. Guest must not enable interrupt |
| before the reason is read, or it may be overwritten by another |
| APF. Since APF uses the same exception vector as regular page |
| fault guest must reset the reason to 0 before it does |
| something that can generate normal page fault. If during page |
| fault APF reason is 0 it means that this is regular page |
| fault. |
| |
| During delivery of type 1 APF cr2 contains a token that will |
| be used to notify a guest when missing page becomes |
| available. When page becomes available type 2 APF is sent with |
| cr2 set to the token associated with the page. There is special |
| kind of token 0xffffffff which tells vcpu that it should wake |
| up all processes waiting for APFs and no individual type 2 APFs |
| will be sent. |
| |
| If APF is disabled while there are outstanding APFs, they will |
| not be delivered. |
| |
| Currently type 2 APF will be always delivered on the same vcpu as |
| type 1 was, but guest should not rely on that. |
| |
| MSR_KVM_STEAL_TIME: 0x4b564d03 |
| |
| data: 64-byte alignment physical address of a memory area which must be |
| in guest RAM, plus an enable bit in bit 0. This memory is expected to |
| hold a copy of the following structure: |
| |
| struct kvm_steal_time { |
| __u64 steal; |
| __u32 version; |
| __u32 flags; |
| __u8 preempted; |
| __u8 u8_pad[3]; |
| __u32 pad[11]; |
| } |
| |
| whose data will be filled in by the hypervisor periodically. Only one |
| write, or registration, is needed for each VCPU. The interval between |
| updates of this structure is arbitrary and implementation-dependent. |
| The hypervisor may update this structure at any time it sees fit until |
| anything with bit0 == 0 is written to it. Guest is required to make sure |
| this structure is initialized to zero. |
| |
| Fields have the following meanings: |
| |
| version: a sequence counter. In other words, guest has to check |
| this field before and after grabbing time information and make |
| sure they are both equal and even. An odd version indicates an |
| in-progress update. |
| |
| flags: At this point, always zero. May be used to indicate |
| changes in this structure in the future. |
| |
| steal: the amount of time in which this vCPU did not run, in |
| nanoseconds. Time during which the vcpu is idle, will not be |
| reported as steal time. |
| |
| preempted: indicate the vCPU who owns this struct is running or |
| not. Non-zero values mean the vCPU has been preempted. Zero |
| means the vCPU is not preempted. NOTE, it is always zero if the |
| the hypervisor doesn't support this field. |
| |
| MSR_KVM_EOI_EN: 0x4b564d04 |
| data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0 |
| when disabled. Bit 1 is reserved and must be zero. When PV end of |
| interrupt is enabled (bit 0 set), bits 63-2 hold a 4-byte aligned |
| physical address of a 4 byte memory area which must be in guest RAM and |
| must be zeroed. |
| |
| The first, least significant bit of 4 byte memory location will be |
| written to by the hypervisor, typically at the time of interrupt |
| injection. Value of 1 means that guest can skip writing EOI to the apic |
| (using MSR or MMIO write); instead, it is sufficient to signal |
| EOI by clearing the bit in guest memory - this location will |
| later be polled by the hypervisor. |
| Value of 0 means that the EOI write is required. |
| |
| It is always safe for the guest to ignore the optimization and perform |
| the APIC EOI write anyway. |
| |
| Hypervisor is guaranteed to only modify this least |
| significant bit while in the current VCPU context, this means that |
| guest does not need to use either lock prefix or memory ordering |
| primitives to synchronise with the hypervisor. |
| |
| However, hypervisor can set and clear this memory bit at any time: |
| therefore to make sure hypervisor does not interrupt the |
| guest and clear the least significant bit in the memory area |
| in the window between guest testing it to detect |
| whether it can skip EOI apic write and between guest |
| clearing it to signal EOI to the hypervisor, |
| guest must both read the least significant bit in the memory area and |
| clear it using a single CPU instruction, such as test and clear, or |
| compare and exchange. |