These files were getting the moduleparam infrastructure from the
implicit presence of module.h being everywhere, but that is going
away soon.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
These were getting it implicitly via device.h --> module.h but
we are going to stop that when we clean up the headers.
Fix these in advance so the tree remains biscect-clean.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
They had been getting it implicitly via device.h but we can't
rely on that for the future, due to a pending cleanup so fix
it now.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Fix an issue where the link would come up after replugging a cable
even if it has been DISABLED manually.
Signed-off-by: Mitko Haralanov <mitko@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Hold the link state machine until the tuning data is read from the
QSFP EEPROM so correct tuning settings are applied before the state
machine attempts to bring the link up. Link is also held on cable
unplug in case a different cable is used.
Signed-off-by: Mitko Haralanov <mitko@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This was probably present from initial submission.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Review of qib_ruc_check_hdr() shows that the s_lock is not required in
the normal case. The r_lock is held in all cases, and protects the qp
fields that are read.
The s_lock will be needed to around the call to qib_migrate_qp() to
insure that the send engine sees a consistent set of fields.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
A new field is added to qib_qp called timeout_jiffies. It is
initialized upon create and modify.
The field is now used instead of a computation based on qp->timeout.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The heavy weight spinlock in qib_lookup_qpn() is replaced with RCU.
The hash list itself is now accessed via jhash functions instead of mod.
The changes should benefit multiple receive contexts in different
processors by not contending for the lock just to read the hash
structures.
The patch also adds a lookaside_qp (pointer) and a lookaside_qpn in
the context. The interrupt handler will test the current packet's qpn
against lookaside_qpn if the lookaside_qp pointer is non-NULL. The
pointer is NULL'ed when the interrupt handler exits.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The context init now saves a shift from rcvegrbufs_perchunk
rcvegrbufs_perchunk_shift using ilog2. A BUG_ON() protects the
power of 2 assumption.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Store both the encoded and decoded MTU in the QP structure as a minor
optimization for UC/RC receive routines.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The memset for zeroing work completions had been unconditional.
This patch removes the memset and moves the zeroing into the work
completion with a more explicit field by field set. With this patch,
non-ONLY/non-LAST packets will avoid the overhead since they will not
generate a completion.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Currently, there is only a single ("basic") type of SRQ, but with XRC
support we will add a second. Prepare for this by defining an SRQ type
and setting all current users to IB_SRQT_BASIC.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The code that was recently introduced to report the number
of free contexts is flawed for multiple HCAs:
/* Return the number of free user ports (contexts) available. */
return scnprintf(buf, PAGE_SIZE, "%u\n", dd->cfgctxts -
dd->first_user_ctxt - (u32)qib_stats.sps_ctxts);
The qib_stats is global to the module, not per HCA, so the code is broken
for multiple HCAs.
This patch adds a qib_devdata field, freectxts, that reflects the free
contexts for this HCA.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Reviewed-by: Ram Vepa <ram.vepa@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
With ib_qib options:
options ib_qib krcvqs=1 pcie_caps=0x51 rcvhdrcnt=4096 singleport=1 ibmtu=4
a run of ib_write_bw -a yields the following:
------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec]
1048576 5000 2910.64 229.80
------------------------------------------------------------------
The top cpu use in a profile is:
CPU: Intel Architectural Perfmon, speed 2400.15 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask
of 0x00 (No unit mask) count 1002300
Counted LLC_MISSES events (Last level cache demand requests from this core that
missed the LLC) with a unit mask of 0x41 (No unit mask) count 10000
samples % samples % app name symbol name
15237 29.2642 964 17.1195 ib_qib.ko qib_7322intr
12320 23.6618 1040 18.4692 ib_qib.ko handle_7322_errors
4106 7.8860 0 0 vmlinux vsnprintf
Analysis of the stats, profile, the code, and the annotated profile indicate:
- All of the overflow interrupts (one per packet overflow) are
serviced on CPU0 with no mitigation on the frequency.
- All of the receive interrupts are being serviced by CPU0. (That is
the way truescale.cmds statically allocates the kctx IRQs to CPU)
- The code is spending all of its time servicing QIB_I_C_ERROR
RcvEgrFullErr interrupts on CPU0, starving the packet receive
processing.
- The decode_err routine is very inefficient, using a printf variant
to format a "%s" and continues to loop when the errs mask has been
cleared.
- Both qib_7322intr and handle_7322_errors read pci registers, which
is very inefficient.
The fix does the following:
- Adds a tasklet to service QIB_I_C_ERROR
- Replaces the very inefficient scnprintf() with a memcpy(). A field
is added to qib_hwerror_msgs to save the sizeof("string") at
compile time so that a strlen is not needed during err_decode().
- The most frequent errors (Overflows) are serviced first to exit the
loop as early as possible.
- The loop now exits as soon as the errs mask is clear rather than
fruitlessly looping through the msp array.
With this fix the performance changes to:
------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec]
1048576 5000 2990.64 2941.35
------------------------------------------------------------------
During testing of the error handling overflow patch, it was determined
that some CPU's were slower when servicing both overflow and receive
interrupts on CPU0 with different MSI interrupt vectors.
This patch adds an option (krcvq01_no_msi) to not use a dedicated MSI
interrupt for kctx's < 2 and to service them on the default interrupt.
For some CPUs, the cost of the interrupt enter/exit is more costly
than then the additional PCI read in the default handler.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Move the various definitions and mad structures needed for software
implementation of IBA PM agent from the ipath and qib drivers into a
single include file, which in turn could be used by more consumers.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Update the active link width on QLE7220 chips when link goes down if
chip width does not match shadowed width.
Signed-off-by: Mitko Haralanov <mitko@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
There is a possibility of a deadlock due to the way locks are
acquired and released in qib_set_uevent_bits(). The function
qib_set_uevent_bits() is called in process context and it uses
spin_lock() and spin_unlock(). This same lock is acquired/released
in interrupt context which can lead to a deadlock when running on
the same cpu.
The fix is to replace spin_lock() and spin_unlock() with
spin_lock_irqsave() and spin_unlock_irqrestore() respectively in
qib_set_uevent_bits().
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Indicate the number of free user contexts via the sysfs file
/sys/class/infiniband/qib0/nfreectxts as required for PSM.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The PCIE capability offset is saved during PCI bus walking. It will
remove an unnecessary search in the PCI configuration space if this
value is referenced instead of reacquiring it.
Signed-off-by: Jon Mason <jdmason@kudzu.us>
Acked-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Adapt to use new APIs. We plan to remove old one later and plan to
change current->cpus_allowed implementation.
No functional change.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Due to timing, it is possible for the LOS and DFE to remain on. This
is due to the link progressing to LinkUP prior to the driver getting
the first Status Changed interrupt. By expanding the conditions under
which LOS is turned off and DFE timeout is being set, timing is no
longer an issue.
Signed-off-by: Mitko Haralanov <mitko@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add basic RDMA netlink infrastructure that allows for registration of
RDMA clients for which data is to be exported and supplies message
construction callbacks.
Signed-off-by: Nir Muchtar <nirm@voltaire.com>
[ Reorganize a few things, add CONFIG_NET dependency. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
The driver reads PCI revision ID from the PCI configuration register
while it's already stored by PCI subsystem in the revision field of
struct pci_dev.
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The time limit test now correctly checks against current jiffies to
avoid the hang.
Signed-off-by: Mitko Haralanov <mitko@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In most cases, get_user_pages and get_user_pages_fast should be used
to pin user pages in memory. But sometimes, some special flags except
FOLL_GET, FOLL_WRITE and FOLL_FORCE are needed, for example in
following patch, KVM needs FOLL_HWPOISON. To support these users,
__get_user_pages is exported directly.
There are some symbol name conflicts in infiniband driver, fixed them too.
Signed-off-by: Huang Ying <ying.huang@intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Michel Lespinasse <walken@google.com>
CC: Roland Dreier <roland@kernel.org>
CC: Ralph Campbell <infinipath@qlogic.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Set the M_Key field in SubnGet and SugnGetResp MADs based on correctly
interpreting the protection level specified in the M_KeyProtBits field.
Signed-off-by: Mitko Haralanov <mitko@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
For active and far-EQ cables use an LE2 value of 0 for improved SI.
Signed-off-by: Mitko Haralanov <mitko@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix a bug which causes the driver to return incorrect MADs as a
response to Set(PortInfo) which sets the link width to 0xFF or link
speed to 0xF.
Signed-off-by: Mitko Haralanov <mitko@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
There is a double completion associated with error handling for RC QPs.
The sequence is:
- The do_rc_ack() routine fields an RNR nack and there are 0
rnr_retries configured on the QP.
- qib_error_qp() stops the pending timer
- qib_rc_send_complete() is called from sdma_complete()
- qib_rc_send_complete() starts the timer because the msb of the psn
just completed says an ack is needed.
- a bunch of flushes occur as ipoib posts WQEs to an error'ed QP
- rc_timeout() calls qib_restart_rc()
- qib_restart_rc() calls qib_send_complete() with a
IB_WC_RETRY_EXC_ERR on a wqe that has already been completed in the
past
The fix avoids starting the timer since another packet will never
arrive.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The following panic BUG_ON occurs during qib testing:
Kernel BUG at include/linux/timer.h:82
RIP [<ffffffff881f7109>] :ib_qib:start_timer+0x73/0x89
RSP <ffffffff80425bd0>
<0>Kernel panic - not syncing: Fatal exception
<0>Dumping qib trace buffer from panic
qib_set_lid INFO: IB0:1 got a lid: 0xf8
Done dumping qib trace buffer
BUG: warning at kernel/panic.c:137/panic() (Tainted: G
The flaw is due to a missing state test when processing responses that
results in an add_timer() call when the same timer is already queued.
This code was executing in parallel with a QP destroy on another CPU
that had changed the state to reset, but the missing test caused to
response handling code to run on into the panic.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Hold the IB link at DISABLED until we get the correct TX settings
on mezz boards.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
RDMA: Update workqueue usage
RDMA/nes: Fix incorrect SFP+ link status detection on driver init
RDMA/nes: Fix SFP+ link down detection issue with switch port disable
RDMA/nes: Generate IB_EVENT_PORT_ERR/PORT_ACTIVE events
RDMA/nes: Fix bonding on iw_nes
IB/srp: Test only once whether iu allocation succeeded
IB/mlx4: Handle protocol field in multicast table
RDMA: Use vzalloc() to replace vmalloc()+memset(0)
mlx4_{core, ib, en}: Fix driver when sizeof (phys_addr_t) > sizeof (long)
IB/mthca: Fix driver when sizeof (phys_addr_t) > sizeof (long)
* ib_wq is added, which is used as the common workqueue for infiniband
instead of the system workqueue. All system workqueue usages
including flush_scheduled_work() callers are converted to use and
flush ib_wq.
* cancel_delayed_work() + flush_scheduled_work() converted to
cancel_delayed_work_sync().
* qib_wq is removed and ib_wq is used instead.
This is to prepare for deprecation of flush_scheduled_work().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (42 commits)
IB/qib: Fix refcount leak in lkey/rkey validation
IB/qib: Improve SERDES tunning on QMH boards
IB/qib: Unnecessary delayed completions on RC connection
IB/qib: Issue pre-emptive NAKs on eager buffer overflow
IB/qib: RDMA lkey/rkey validation is inefficient for large MRs
IB/qib: Change QPN increment
IB/qib: Add fix missing from earlier patch
IB/qib: Change receive queue/QPN selection
IB/qib: Fix interrupt mitigation
IB/qib: Avoid duplicate writes to the rcv head register
IB/qib: Add a few new SERDES tunings
IB/qib: Reset packet list after freeing
IB/qib: New SERDES init routine and improvements to SI quality
IB/qib: Clear WAIT_SEND flags when setting QP to error state
IB/qib: Fix context allocation with multiple HCAs
IB/qib: Fix multi-Florida HCA host panic on reboot
IB/qib: Handle transitions from ACTIVE_DEFERRED to ACTIVE better
IB/qib: UD send with immediate receive completion has wrong size
IB/qib: Set port physical state even if other fields are invalid
IB/qib: Generate completion callback on errors
...
The mr optimization introduced a reference count leak on an exception
test. The lock/refcount manipulation is moved down and the problematic
exception test now calls bail to insure that the lock is released.
Additional fixes as suggested by Ralph Campbell <ralph.campbell@qlogic.org>:
- reduce lock scope of dma regions
- use explicit values on returns vs. automatic ret value
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Improve the QMH SERDES tunning on initial driver load by having the
driver go through a link state change.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Currently on receipt of a response message (ACKs, RDMA Response,
Atomic Responses etc.) if the SDMA completion counter is not advanced
the driver delays the completion of the WQE. In most cases this is
overly pessimistic as the response (ACK) to a previously transmitted
send implies that the send is complete. Ensure that SDMA queue is
progressed appropriately before determining if a send has delayed
completions.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Under congestion resulting in eager buffer overflow attempt to send
pre-emptive NAKs if header queue entries with TID errors are generated
and a valid header is present. This prevents long timeouts and flow
restarts if a trailing set of packets are dropped due to eager
overflows. Pre-emptive NAKs are currently only supported for RDMA
writes.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The current code loops during rkey/lkey validiation to isolate the MR
for the RDMA, which is expensive when the current operation is inside
a very large memory region.
This fix optimizes rkey/lkey validation routines for user memory
regions and fast memory regions. The MR entry can be isolated by
shifts/mods instead of looping. The existing loop is preserved for
phys memory regions for now.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Changing from +1 to +2 allows for better QP distribution across
receive contexts.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The upstream code was missing part of a receive/error race fix from
the internal tree. Add the missing part, which makes future merges
possible.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The basic idea is that on SusieQ, the difficult part of mapping QPN to
context is handled by the mapping registers so the generic QPN
allocation doesn't need to worry about chip specifics. For Monty and
Linda, there is no mapping table so the qpt->mask (same as
dd->qpn_mask), is used to see if the QPN to context falls within
[zero..dd->n_krcv_queues).
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
For SusieQ we need to write to the interrupt timer register before
updating the header queue head with interrupt count. This is to
ensure that the timer is enabled properly and a receive available
interrupt is delivered. Otherwise this interrupt can be lost if the
receiver header/eager queues are full before the timer is enabled.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>