TL;DR This appears to be a bug in the emulation of the PTX instruction mul.hi.s64 that is specific to sm_5x platforms, so filing a bug report with NVIDIA is the recommended course of action.
Generally, NVIDIA GPUs are 32-bit architectures, so all 64-bit integer instructions require emulation sequences. In the particular case of 64-bit integer multiplies, for sm_2x and sm_3x platforms, these are constructed from the machine code instruction IMAD.U32, which is a 32-bit integer multiply-add instruction.
For the Maxwell architecture (that is, sm_5x), a high-throughput, but lower-width, integer multiply-add instruction XMAD was introduced, although a low-throughput legacy 32-bit integer multipy IMUL was apparently retained. Inspection of disassembled machine code generated for sm_5x by the CUDA 7.5 toolchain with cuobjdump --dumpsass shows that for ptxas optimization level -O0 (which is used for debug builds), the 64-bit multiplies are emulated with the IMUL instruction, while for optimization level -O1 and higher XMAD is used. I cannot think of a reason why two fundamentally different emulation sequences are employed.
As it turns out, the IMUL-based emulation for mul.hi.s64 for sm_5x is broken while the XMAD-based emulation works fine. Therefore, one possible workaround is to utilize an optimization level of at least -O1 for ptxas, by specifying -Xptxas -O1 on the nvcc command line. Note that release builds use -Xptxas -O3 by default, so no corrective action is necessary for release builds.
From code analysis, the emulation for mul.hi.s64 is implemented as a wrapper around the emulation for mul.hi.u64, and this latter emulation seems to work fine on all platforms including sm_5x. Thus another possible workaround is to use our own wrapper around mul.hi.u64. Coding with inline PTX is unnecessary in this case, since mul.hi.s64 and mul.hi.u64 are accessible via the device intrinsics __mul64hi() and __umul64hi(). As can be seen from the code below, the adjustments to convert a result from unsigned to signed multiplication are fairly trivial.
long long int m1, m2, result;
#if 0 // broken on sm_5x at optimization level -O0
asm(" mul.hi.s64 %0, %1, %2; \n\t"
: "=l"(result)
: "l"(m1), "l"(m2));
#else
result = __umul64hi (m1, m2);
if (m1 < 0LL) result -= m2;
if (m2 < 0LL) result -= m1;
#endif