summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2014-03-24powerpc/perf: Make some new raw event codes available in sysfsAnshuman Khandual
This patchset adds some missing event list for POWER7 PMU raw events which are exported through sysfs interface. Also updates the ABI documentation to add all the sysfs exported raw events. Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-24powerpc: Update ppc4xx maintainerJosh Boyer
Alistair Popple has volunteered to take over maintainership of the ppc4xx stuff upstream. Switch the MAINTAINERS entry over to him. Signed-off-by: Josh Boyer <jwboyer@gmail.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-24powerpc/powernv: hwmon driver for power values, fan rpm and temperatureShivaprasad G Bhat
This patch adds basic kernel enablement for reading power values, fan speed rpm and temperature values on powernv platforms which will be exported to user space through sysfs interface. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.vnet.ibm.com> Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-24powerpc/powernv: Enable fetching of platform sensor dataNeelesh Gupta
This patch enables fetching of various platform sensor data through OPAL and expects a sensor handle from the driver to pass to OPAL. Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-24powerpc/powernv: Enable reading and updating of system parametersNeelesh Gupta
This patch enables reading and updating of system parameters through OPAL call. Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-24powerpc/powernv: Infrastructure to support OPAL async completionNeelesh Gupta
This patch adds support for notifying the clients of their request completion. Clients request for the token before making OPAL call and then wait for the response. This patch uses messaging infrastructure to pull the data to linux by registering itself for the message type OPAL_MSG_ASYNC_COMP. Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-07powerpc/powernv Platform dump interfaceStewart Smith
This enables support for userspace to fetch and initiate FSP and Platform dumps from the service processor (via firmware) through sysfs. Based on original patch from Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Flow: - We register for OPAL notification events. - OPAL sends new dump available notification. - We make information on dump available via sysfs - Userspace requests dump contents - We retrieve the dump via OPAL interface - User copies the dump data - userspace sends ack for dump - We send ACK to OPAL. sysfs files: - We add the /sys/firmware/opal/dump directory - echoing 1 (well, anything, but in future we may support different dump types) to /sys/firmware/opal/dump/initiate_dump will initiate a dump. - Each dump that we've been notified of gets a directory in /sys/firmware/opal/dump/ with a name of the dump type and ID (in hex, as this is what's used elsewhere to identify the dump). - Each dump has files: id, type, dump and acknowledge dump is binary and is the dump itself. echoing 'ack' to acknowledge (currently any string will do) will acknowledge the dump and it will soon after disappear from sysfs. OPAL APIs: - opal_dump_init() - opal_dump_info() - opal_dump_read() - opal_dump_ack() - opal_dump_resend_notification() Currently we are only ever notified for one dump at a time (until the user explicitly acks the current dump, then we get a notification of the next dump), but this kernel code should "just work" when OPAL starts notifying us of all the dumps present. Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-07powerpc/powernv: Read OPAL error log and export it through sysfsStewart Smith
Based on a patch by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> This patch adds support to read error logs from OPAL and export them to userspace through a sysfs interface. We export each log entry as a directory in /sys/firmware/opal/elog/ Currently, OPAL will buffer up to 128 error log records, we don't need to have any knowledge of this limit on the Linux side as that is actually largely transparent to us. Each error log entry has the following files: id, type, acknowledge, raw. Currently we just export the raw binary error log in the 'raw' attribute. In a future patch, we may parse more of the error log to make it a bit easier for userspace (e.g. to be able to display a brief summary in petitboot without having to have a full parser). If we have >128 logs from OPAL, we'll only be notified of 128 until userspace starts acknowledging them. This limitation may be lifted in the future and with this patch, that should "just work" from the linux side. A userspace daemon should: - wait for error log entries using normal mechanisms (we announce creation) - read error log entry - save error log entry safely to disk - acknowledge the error log entry - rinse, repeat. On the Linux side, we read the error log when we're notified of it. This possibly isn't ideal as it would be better to only read them on-demand. However, this doesn't really work with current OPAL interface, so we read the error log immediately when notified at the moment. I've tested this pretty extensively and am rather confident that the linux side of things works rather well. There is currently an issue with the service processor side of things for >128 error logs though. Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-07powerpc/book3e: Fix check for linear mapping in TLB miss handlerBenjamin Krill
The previous code added wrong TLBs and causes machine check errors if a driver accessed passed the end of the linear mapping instead of a clean page fault. Signed-off-by: Ralph E. Bellofatto <ralphbel@us.ibm.com> Signed-off-by: Benjamin Krill <ben@codiert.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-07powerpc: Add "force config cmd line" Kconfig optionSebastian Siewior
powerpc uses early_init_dt_scan_chosen() from common fdt code. By enabling this option, the common code can take the built in command line over the one that is comming from bootloader / DT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-07powerpc/pseries: Expose in kernel device tree update to drmgrTyrel Datwyler
Traditionally it has been drmgr's responsibilty to update the device tree through the /proc/ppc64/ofdt interface after a suspend/resume operation. This patchset however has modified suspend/resume ops to preform an update entirely in the kernel during the resume. Therefore, a mechanism is required to expose that information to drmgr. This patch adds a show function to the "hibernate" attribute that returns 1 if the kernel performs a device tree update after the resume and 0 otherwise. This allows newer versions of drmgr to avoid doing a second unnecessary device tree update. Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-07powerpc/pseries: Update dynamic cache nodes for suspend/resume operationHaren Myneni
pHyp can change cache nodes for suspend/resume operation. Currently the device tree is updated by drmgr in userspace after all non boot CPUs are enabled. Hence, we do not modify the cache list based on the latest cache nodes. Also we do not remove cache entries for the primary CPU. This patch removes the cache list for the boot CPU, updates the device tree before enabling nonboot CPUs and adds cache list for the boot cpu. This patch also has the side effect that older versions of drmgr will perform a second device tree update from userspace. While this is a redundant waste of a couple cycles it is harmless since firmware returns the same data for the subsequent update-nodes/properties rtas calls. Signed-off-by: Haren Myneni <hbabu@us.ibm.com> Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-07powerpc/pseries: Device tree should only be updated once after suspend/migrateHaren Myneni
The current code makes rtas calls for update-nodes, activate-firmware and then update-nodes again. The FW provides the same data for both update-nodes calls. As a result a proc entry exists error is reported for the second update while adding device nodes. This patch makes a single rtas call for update-nodes after activating the FW. It also add rtas_busy delay for the activate-firmware rtas call. Signed-off-by: Haren Myneni <hbabu@us.ibm.com> Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-07macintosh/adb: Change platform power management to use dev_pm_opsShuah Khan
Change adb platform driver to register pm ops using dev_pm_ops instead of legacy pm_ops. .pm hooks call existing legacy suspend and resume interfaces by passing in the right pm state. Signed-off-by: Shuah Khan <shuah.kh@samsung.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-07powerpc: Delete old PrPMC 280/2800 supportPaul Gortmaker
This processor/memory module was mostly used on ATCA blades and before that, on cPCI blades. It wasn't really user friendly, with custom non u-boot bootloaders (powerboot/motload) and no real way to recover corrupted boot flash (which was a common problem). As such, it had its day back before the big ppc --> powerpc move to device trees, and that was largely through commercial BSPs that started to dry up around 2007. Systems using one were largely in a "deploy and sustain" mode, so interest in upgrading to new kernels in the field was nil. Also, requiring 50A, 48V power supplies and a 2'x2'x2' ATCA chassis largely rules out any hobbyist/enthusiast interest. The point of all this, is that we might as well delete the in kernel files relating to this platform. No point in continuing to build it via walking the defconfigs or via linux-next testing. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-07macintosh/adb: Fixed some coding style problemsBrandon Stewart
Signed-off-by: Brandon Stewart <stewartb2@gmail.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-07powerpc/pseries: Use remove_memory() to remove memoryNathan Fontenot
The memory remove code for powerpc/pseries should call remove_memory() so that we are holding the hotplug_memory lock during memory remove operations. This patch updates the memory node remove handler to call remove_memory() and adds a ppc_md.remove_memory() entry to handle pseries specific work that is called from arch_remove_memory(). During memory remove in pseries_remove_memblock() we have to stay with removing memory one section at a time. This is needed because of how memory resources are handled. During memory add for pseries (via the probe file in sysfs) we add memory one section at a time which gives us a memory resource for each section. Future patches will aim to address this so will not have to remove memory one section at a time. Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-07selftests/powerpc: Import Anton's memcpy / copy_tofrom_user testsMichael Ellerman
Turn Anton's memcpy / copy_tofrom_user test into something that can live in tools/testing/selftests. It requires one turd in arch/powerpc/lib/memcpy_64.S, but it's pretty harmless IMHO. We are sailing very close to the wind with the feature macros. We define them to nothing, which currently means we get a few extra nops and include the unaligned calls. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-07powerpc/book3s: Recover from MC in sapphire on SCOM read via MMIO.Mahesh Salgaonkar
Detect and recover from machine check when inside opal on a special scom load instructions. On specific SCOM read via MMIO we may get a machine check exception with SRR0 pointing inside opal. To recover from MC in this scenario, get a recovery instruction address and return to it from MC. OPAL will export the machine check recoverable ranges through device tree node mcheck-recoverable-ranges under ibm,opal: # hexdump /proc/device-tree/ibm,opal/mcheck-recoverable-ranges 0000000 0000 0000 3000 2804 0000 000c 0000 0000 0000010 3000 2814 0000 0000 3000 27f0 0000 000c 0000020 0000 0000 3000 2814 xxxx xxxx xxxx xxxx 0000030 llll llll yyyy yyyy yyyy yyyy ... ... # where: xxxx xxxx xxxx xxxx = Starting instruction address llll llll = Length of the address range. yyyy yyyy yyyy yyyy = recovery address Each recoverable address range entry is (start address, len, recovery address), 2 cells each for start and recovery address, 1 cell for len, totalling 5 cells per entry. During kernel boot time, build up the recovery table with the list of recovery ranges from device-tree node which will be used during machine check exception to recover from MMIO SCOM UE. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-07powerpc/pseries: Don't try to register pseries cpu hotplug on non-pseriesBenjamin Herrenschmidt
This results in oddball messages at boot on other platforms telling us that CPU hotplug isn't supported even when it is. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-07powerpc: Fix xmon disassembler for little-endianPhilippe Bergheaud
This patch fixes the disassembler of the powerpc kernel debugger xmon, for little-endian. Signed-off-by: Philippe Bergheaud <felix@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-07powerpc: Revert c6102609 and replace it with the correct fix for vio dma ↵Li Zhong
mask setting This patch reverts my previous "fix", and replace it with the correct fix from Russell. And as Russell pointed out -- dma_set_mask_and_coherent() (and the other dma_set_mask() functions) are really supposed to be used by drivers only. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-07powerpc: : Kill CONFIG_MTD_PARTITIONS송은봉
This patch removes CONFIG_MTD_PARTITIONS in config files for powerpc. Because CONFIG_MTD_PARTITIONS was removed by commit 6a8a98b22b10f1560d5f90aded4a54234b9b2724. Signed-off-by: Eunbong Song <eunb.song@samsung.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-07powerpc: Align p_dyn, p_rela and p_st symbolsAnton Blanchard
The 64bit relocation code places a few symbols in the text segment. These symbols are only 4 byte aligned where they need to be 8 byte aligned. Add an explicit alignment. Signed-off-by: Anton Blanchard <anton@samba.org> Cc: stable@vger.kernel.org Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-07powerpc/tm: Fix crash when forking inside a transactionMichael Neuling
When we fork/clone we currently don't copy any of the TM state to the new thread. This results in a TM bad thing (program check) when the new process is switched in as the kernel does a tmrechkpt with TEXASR FS not set. Also, since R1 is from userspace, we trigger the bad kernel stack pointer detection. So we end up with something like this: Bad kernel stack pointer 0 at c0000000000404fc cpu 0x2: Vector: 700 (Program Check) at [c00000003ffefd40] pc: c0000000000404fc: restore_gprs+0xc0/0x148 lr: 0000000000000000 sp: 0 msr: 9000000100201030 current = 0xc000001dd1417c30 paca = 0xc00000000fe00800 softe: 0 irq_happened: 0x01 pid = 0, comm = swapper/2 WARNING: exception is not recoverable, can't continue The below fixes this by flushing the TM state before we copy the task_struct to the clone. To do this we go through the tmreclaim patch, which removes the checkpointed registers from the CPU and transitions the CPU out of TM suspend mode. Hence we need to call tmrechkpt after to restore the checkpointed state and the TM mode for the current task. To make this fail from userspace is simply: tbegin li r0, 2 sc <boom> Kudos to Adhemerval Zanella Neto for finding this. Signed-off-by: Michael Neuling <mikey@neuling.org> cc: Adhemerval Zanella Neto <azanella@br.ibm.com> cc: stable@vger.kernel.org Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-28powerpc/powernv: Fix indirect XSCOM unmanglingBenjamin Herrenschmidt
We need to unmangle the full address, not just the register number, and we also need to support the real indirect bit being set for in-kernel uses. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> CC: <stable@vger.kernel.org> [v3.13]
2014-02-28powerpc/powernv: Fix opal_xscom_{read,write} prototypeBenjamin Herrenschmidt
The OPAL firmware functions opal_xscom_read and opal_xscom_write take a 64-bit argument for the XSCOM (PCB) address in order to support the indirect mode on P8. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> CC: <stable@vger.kernel.org> [v3.13]
2014-02-28powerpc/powernv: Refactor PHB diag-data dumpGavin Shan
As Ben suggested, the patch prints PHB diag-data with multiple fields in one line and omits the line if the fields of that line are all zero. With the patch applied, the PHB3 diag-data dump looks like: PHB3 PHB#3 Diag-data (Version: 1) brdgCtl: 00000002 RootSts: 0000000f 00400000 b0830008 00100147 00002000 nFir: 0000000000000000 0030006e00000000 0000000000000000 PhbSts: 0000001c00000000 0000000000000000 Lem: 0000000000100000 42498e327f502eae 0000000000000000 InAErr: 8000000000000000 8000000000000000 0402030000000000 0000000000000000 PE[ 8] A/B: 8480002b00000000 8000000000000000 [ The current diag data is so big that it overflows the printk buffer pretty quickly in cases when we get a handful of errors at once which can happen. --BenH ] Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> CC: <stable@vger.kernel.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-28powerpc/powernv: Dump PHB diag-data immediatelyGavin Shan
The PHB diag-data is important to help locating the root cause for EEH errors such as frozen PE or fenced PHB. However, the EEH core enables IO path by clearing part of HW registers before collecting this data causing it to be corrupted. This patch fixes this by dumping the PHB diag-data immediately when frozen/fenced state on PE or PHB is detected for the first time in eeh_ops::get_state() or next_error() backend. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> CC: <stable@vger.kernel.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-28powerpc: Increase stack redzone for 64-bit userspace to 512 bytesPaul Mackerras
The new ELFv2 little-endian ABI increases the stack redzone -- the area below the stack pointer that can be used for storing data -- from 288 bytes to 512 bytes. This means that we need to allow more space on the user stack when delivering a signal to a 64-bit process. To make the code a bit clearer, we define new USER_REDZONE_SIZE and KERNEL_REDZONE_SIZE symbols in ptrace.h. For now, we leave the kernel redzone size at 288 bytes, since increasing it to 512 bytes would increase the size of interrupt stack frames correspondingly. Gcc currently only makes use of 288 bytes of redzone even when compiling for the new little-endian ABI, and the kernel cannot currently be compiled with the new ABI anyway. In the future, hopefully gcc will provide an option to control the amount of redzone used, and then we could reduce it even more. This also changes the code in arch_compat_alloc_user_space() to preserve the expanded redzone. It is not clear why this function would ever be used on a 64-bit process, though. Signed-off-by: Paul Mackerras <paulus@samba.org> CC: <stable@vger.kernel.org> [v3.13] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-28powerpc/ftrace: bugfix for test_24bit_addrLiu Ping Fan
The branch target should be the func addr, not the addr of func_descr_t. So using ppc_function_entry() to generate the right target addr. Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-28powerpc/crashdump : Fix page frame number check in copy_oldmem_pageLaurent Dufour
In copy_oldmem_page, the current check using max_pfn and min_low_pfn to decide if the page is backed or not, is not valid when the memory layout is not continuous. This happens when running as a QEMU/KVM guest, where RTAS is mapped higher in the memory. In that case max_pfn points to the end of RTAS, and a hole between the end of the kdump kernel and RTAS is not backed by PTEs. As a consequence, the kdump kernel is crashing in copy_oldmem_page when accessing in a direct way the pages in that hole. This fix relies on the memblock's service memblock_is_region_memory to check if the read page is part or not of the directly accessible memory. Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Tested-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> CC: <stable@vger.kernel.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-28powerpc/le: Ensure that the 'stop-self' RTAS token is handled correctlyTony Breeds
Currently we're storing a host endian RTAS token in rtas_stop_self_args.token. We then pass that directly to rtas. This is fine on big endian however on little endian the token is not what we expect. This will typically result in hitting: panic("Alas, I survived.\n"); To fix this we always use the stop-self token in host order and always convert it to be32 before passing this to rtas. Signed-off-by: Tony Breeds <tony@bakeyournoodle.com> Cc: stable@vger.kernel.org Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-17powerpc/eeh: Disable EEH on rebootGavin Shan
We possiblly detect EEH errors during reboot, particularly in kexec path, but it's impossible for device drivers and EEH core to handle or recover them properly. The patch registers one reboot notifier for EEH and disable EEH subsystem during reboot. That means the EEH errors is going to be cleared by hardware reset or second kernel during early stage of PCI probe. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-17powerpc/eeh: Cleanup on eeh_subsystem_enabledGavin Shan
The patch cleans up variable eeh_subsystem_enabled so that we needn't refer the variable directly from external. Instead, we will use function eeh_enabled() and eeh_set_enable() to operate the variable. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-17powerpc/powernv: Rework EEH resetGavin Shan
When doing reset in order to recover the affected PE, we issue hot reset on PE primary bus if it's not root bus. Otherwise, we issue hot or fundamental reset on root port or PHB accordingly. For the later case, we didn't cover the situation where PE only includes root port and it potentially causes kernel crash upon EEH error to the PE. The patch reworks the logic of EEH reset to improve the code readability and also avoid the kernel crash. Cc: stable@vger.kernel.org Reported-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com> Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-17powerpc: Use unstripped VDSO image for more accurate profiling dataAnton Blanchard
We are seeing a lot of hits in the VDSO that are not resolved by perf. A while(1) gettimeofday() loop shows the issue: 27.64% [vdso] [.] 0x000000000000060c 22.57% [vdso] [.] 0x0000000000000628 16.88% [vdso] [.] 0x0000000000000610 12.39% [vdso] [.] __kernel_gettimeofday 6.09% [vdso] [.] 0x00000000000005f8 3.58% test [.] 00000037.plt_call.gettimeofday@@GLIBC_2.18 2.94% [vdso] [.] __kernel_datapage_offset 2.90% test [.] main We are using a stripped VDSO image which means only symbols with relocation info can be resolved. There isn't a lot of point to stripping the VDSO, the debug info is only about 1kB: 4680 arch/powerpc/kernel/vdso64/vdso64.so 5815 arch/powerpc/kernel/vdso64/vdso64.so.dbg By using the unstripped image, we can resolve all the symbols in the VDSO and the perf profile data looks much better: 76.53% [vdso] [.] __do_get_tspec 12.20% [vdso] [.] __kernel_gettimeofday 5.05% [vdso] [.] __get_datapage 3.20% test [.] main 2.92% test [.] 00000037.plt_call.gettimeofday@@GLIBC_2.18 Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-17powerpc: Link VDSOs at 0x0Anton Blanchard
perf is failing to resolve symbols in the VDSO. A while (1) gettimeofday() loop shows: 93.99% [vdso] [.] 0x00000000000005e0 3.12% test [.] 00000037.plt_call.gettimeofday@@GLIBC_2.18 2.81% test [.] main The reason for this is that we are linking our VDSO shared libraries at 1MB, which is a little weird. Even though this is uncommon, Alan points out that it is valid and we should probably fix perf userspace. Regardless, I can't see a reason why we are doing this. The code is all position independent and we never rely on the VDSO ending up at 1M (and we never place it there on 64bit tasks). Changing our link address to 0x0 fixes perf VDSO symbol resolution: 73.18% [vdso] [.] 0x000000000000060c 12.39% [vdso] [.] __kernel_gettimeofday 3.58% test [.] 00000037.plt_call.gettimeofday@@GLIBC_2.18 2.94% [vdso] [.] __kernel_datapage_offset 2.90% test [.] main We still have some local symbol resolution issues that will be fixed in a subsequent patch. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-17mm: Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bitAneesh Kumar K.V
Archs like ppc64 doesn't do tlb flush in set_pte/pmd functions when using a hash table MMU for various reasons (the flush is handled as part of the PTE modification when necessary). ppc64 thus doesn't implement flush_tlb_range for hash based MMUs. Additionally ppc64 require the tlb flushing to be batched within ptl locks. The reason to do that is to ensure that the hash page table is in sync with linux page table. We track the hpte index in linux pte and if we clear them without flushing hash and drop the ptl lock, we can have another cpu update the pte and can end up with duplicate entry in the hash table, which is fatal. We also want to keep set_pte_at simpler by not requiring them to do hash flush for performance reason. We do that by assuming that set_pte_at() is never *ever* called on a PTE that is already valid. This was the case until the NUMA code went in which broke that assumption. Fix that by introducing a new pair of helpers to set _PAGE_NUMA in a way similar to ptep/pmdp_set_wrprotect(), with a generic implementation using set_pte_at() and a powerpc specific one using the appropriate mechanism needed to keep the hash table in sync. Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-17mm: Dirty accountable change only apply to non prot numa caseAneesh Kumar K.V
So move it within the if loop Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-17powerpc/mm: Add new "set" flag argument to pte/pmd update functionAneesh Kumar K.V
pte_update() is a powerpc-ism used to change the bits of a PTE when the access permission is being restricted (a flush is potentially needed). It uses atomic operations on when needed and handles the hash synchronization on hash based processors. It is currently only used to clear PTE bits and so the current implementation doesn't provide a way to also set PTE bits. The new _PAGE_NUMA bit, when set, is actually restricting access so it must use that function too, so this change adds the ability for pte_update() to also set bits. We will use this later to set the _PAGE_NUMA bit. Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-17powerpc/pseries: Add Gen3 definitions for PCIE link speedKleber Sacilotto de Souza
Rev3 of the PCI Express Base Specification defines a Supported Link Speeds Vector where the bit definitions within this field are: Bit 0 - 2.5 GT/s Bit 1 - 5.0 GT/s Bit 2 - 8.0 GT/s This vector definition is used by the platform firmware to export the maximum and current link speeds of the PCI bus via the "ibm,pcie-link-speed-stats" device-tree property. This patch updates pseries_root_bridge_prepare() to detect Gen3 speed buses (defined by 0x04). Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-17powerpc/pseries: Fix regression on PCI link speedKleber Sacilotto de Souza
Commit 5091f0c (powerpc/pseries: Fix PCIE link speed endian issue) introduced a regression on the PCI link speed detection using the device-tree property. The ibm,pcie-link-speed-stats property is composed of two 32-bit integers, the first one being the maxinum link speed and the second the current link speed. The changes introduced by the aforementioned commit are considering just the first integer. Fix this issue by changing how the property is accessed, using the helper functions to properly access the array of values. The explicit byte swapping is not needed anymore here, since it's done by the helper functions. Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-17powerpc: Set the correct ksp_limit on ppc32 when switching to irq stackKevin Hao
Guenter Roeck has got the following call trace on a p2020 board: Kernel stack overflow in process eb3e5a00, r1=eb79df90 CPU: 0 PID: 2838 Comm: ssh Not tainted 3.13.0-rc8-juniper-00146-g19eca00 #4 task: eb3e5a00 ti: c0616000 task.ti: ef440000 NIP: c003a420 LR: c003a410 CTR: c0017518 REGS: eb79dee0 TRAP: 0901 Not tainted (3.13.0-rc8-juniper-00146-g19eca00) MSR: 00029000 <CE,EE,ME> CR: 24008444 XER: 00000000 GPR00: c003a410 eb79df90 eb3e5a00 00000000 eb05d900 00000001 65d87646 00000000 GPR08: 00000000 020b8000 00000000 00000000 44008442 NIP [c003a420] __do_softirq+0x94/0x1ec LR [c003a410] __do_softirq+0x84/0x1ec Call Trace: [eb79df90] [c003a410] __do_softirq+0x84/0x1ec (unreliable) [eb79dfe0] [c003a970] irq_exit+0xbc/0xc8 [eb79dff0] [c000cc1c] call_do_irq+0x24/0x3c [ef441f20] [c00046a8] do_IRQ+0x8c/0xf8 [ef441f40] [c000e7f4] ret_from_except+0x0/0x18 --- Exception: 501 at 0xfcda524 LR = 0x10024900 Instruction dump: 7c781b78 3b40000a 3a73b040 543c0024 3a800000 3b3913a0 7ef5bb78 48201bf9 5463103a 7d3b182e 7e89b92e 7c008146 <3ba00000> 7e7e9b78 48000014 57fff87f Kernel panic - not syncing: kernel stack overflow CPU: 0 PID: 2838 Comm: ssh Not tainted 3.13.0-rc8-juniper-00146-g19eca00 #4 Call Trace: The reason is that we have used the wrong register to calculate the ksp_limit in commit cbc9565ee826 (powerpc: Remove ksp_limit on ppc64). Just fix it. As suggested by Benjamin Herrenschmidt, also add the C prototype of the function in the comment in order to avoid such kind of errors in the future. Cc: stable@vger.kernel.org # 3.12 Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-11powerpc/powernv: Add iommu DMA bypass support for IODA2Benjamin Herrenschmidt
This patch adds the support for to create a direct iommu "bypass" window on IODA2 bridges (such as Power8) allowing to bypass iommu page translation completely for 64-bit DMA capable devices, thus significantly improving DMA performances. Additionally, this adds a hook to the struct iommu_table so that the IOMMU API / VFIO can disable the bypass when external ownership is requested, since in that case, the device will be used by an environment such as userspace or a KVM guest which must not be allowed to bypass translations. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-11powerpc: Fix endian issues in kexec and crash dump codeAnton Blanchard
We expose a number of OF properties in the kexec and crash dump code and these need to be big endian. Cc: stable@vger.kernel.org # v3.13 Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-11powerpc/ppc32: Fix the bug in the init of non-base exception stack for UPKevin Hao
We would allocate one specific exception stack for each kind of non-base exceptions for every CPU. For ppc32 the CPU hard ID is used as the subscript to get the specific exception stack for one CPU. But for an UP kernel, there is only one element in the each kind of exception stack array. We would get stuck if the CPU hard ID is not equal to '0'. So in this case we should use the subscript '0' no matter what the CPU hard ID is. Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-11powerpc/xmon: Don't signal we've entered until we're finished printingMichael Ellerman
Currently we set our cpu's bit in cpus_in_xmon, and then we take the output lock and print the exception information. This can race with the master cpu entering the command loop and printing the backtrace. The result is that the backtrace gets garbled with another cpu's exception print out. Fix it by delaying the set of cpus_in_xmon until we are finished printing. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-11powerpc/xmon: Fix timeout loop in get_output_lock()Michael Ellerman
As far as I can tell, our 70s era timeout loop in get_output_lock() is generating no code. This leads to the hostile takeover happening more or less simultaneously on all cpus. The result is "interesting", some example output that is more readable than most: cpu 0x1: Vector: 100 (Scypsut e0mx bR:e setV)e catto xc0p:u[ c 00 c0:0 000t0o0V0erc0td:o5 rfc28050000]0c00 0 0 0 6t(pSrycsV1ppuot uxe 1m 2 0Rx21e3:0s0ce000c00000t00)00 60602oV2SerucSayt0y 0p 1sxs Fix it by using udelay() in the timeout loop. The wait time and check frequency are arbitrary, but seem to work OK. We already rely on udelay() working so this is not a new dependency. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-11powerpc/xmon: Don't loop forever in get_output_lock()Michael Ellerman
If we enter with xmon_speaker != 0 we skip the first cmpxchg(), we also skip the while loop because xmon_speaker != last_speaker (0) - meaning we skip the second cmpxchg() also. Following that code path the compiler sees no memory barriers and so is within its rights to never reload xmon_speaker. The end result is we loop forever. This manifests as all cpus being in xmon ('c' command), but they refuse to take control when you switch to them ('c x' for cpu # x). I have seen this deadlock in practice and also checked the generated code to confirm this is what's happening. The simplest fix is just to always try the cmpxchg(). Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>