Denys Vlasenko [Sat, 28 Jan 2012 00:46:33 +0000 (01:46 +0100)]
Use process_vm_readv instead of PTRACE_PEEKDATA to read data blocks
Currently, we use PTRACE_PEEKDATA to read things like filenames and
data passed by I/O syscalls.
PTRACE_PEEKDATA gets one word per syscall. This is VERY expensive.
For example, in order to print fstat syscall, we need to perform
more than twenty trips into kernel to fetch one struct stat!
Kernel 3.2 got a new syscall, process_vm_readv(), which can be used to
copy data blocks out of process' address space.
This change uses it in umoven() and umovestr() functions if possible,
with fallback to old method if process_vm_readv() fails.
If it returns ENOSYS, we don't try to use it anymore, eliminating
overhead of trying it on older kernels.
Result of "time strace -oLOG ls -l /usr/lib >/dev/null":
before patch: 0.372s
After patch: 0.262s
* util.c (process_vm_readv): Wrapper to call process_vm_readv syscall.
(umoven): Use process_vm_readv for block reads of tracee memory.
(umovestr): Likewise.
* linux/syscall.h: Declare new function sys_process_vm_readv.
* process.c (sys_process_vm_readv): Decoder for new syscall.
* linux/i386/syscallent.h: Add process_vm_readv, process_vm_writev syscalls.
* linux/x86_64/syscallent.h: Likewise.
* linux/powerpc/syscallent.h: Likewise.
Denys Vlasenko [Sat, 28 Jan 2012 00:25:03 +0000 (01:25 +0100)]
Fix a case of broken output if last seen syscall was exit
* defs.h: Rename tcp_last to printing_tcp. Explain what it means.
Remove printtrailer() function.
* process.c (sys_exit): Convert printtrailer() call to "printing_tcp = NULL".
* strace.c: Add new variable printing_tcp.
(cleanup): Convert printtrailer() call to "printing_tcp = NULL".
(trace): Likewise.
(trace): Fix checks for incomplete line - it was working wrongly if last syscall was exit.
(printleader): Set printing_tcp.
(printtrailer): Remove this function.
* syscall.c: Remove tcp_last variable.
(trace_syscall_entering): Don't set printing_tcp, printleader call now does it.
(trace_syscall_exiting): Convert printtrailer() call to "printing_tcp = NULL".
Denys Vlasenko [Sat, 28 Jan 2012 00:16:02 +0000 (01:16 +0100)]
Fix handling of test/threaded_execve.c testcase
Since 3.0, Linux has a way to identify which thread execve'ed.
This patch makes use of it in order to properly dispose
of disappeared ("superseded") thread leader,
and replace it with execve'ed thread.
Before this patch, strace was "leaking" thread which exec'ed.
It was thinking that it still runs. It would look like this:
18460 pause( <unfinished ...> <=== thread leader
18466 execve("/proc/self/exe", ["exe", "exe"], [/* 47 vars */] <unfinished ...>
18465 +++ exited with 0 +++ <=== exits from other threads
18460 <... pause resumed> ) = 0
The last line is wrong: it's not pause resumed, it's execve resumed.
If thread leader would do exit instead of pause, it is much worse:
strace panics because it thinks it sees return from exit syscall!
And strace isn't aware 18466 (exec'ed thread) is gone.
It still thinks it's executes execve syscall.
* strace.c: New variable "static char *os_release".
(get_os_release): New static function.
(main): Call get_os_release to retrieve Linux version.
(trace): If we see PTRACE_EVENT_EXEC, retrieve old pid, and if it
differs from new one, free one of tcbs and print correct messages.
In code before commit 44d0532, single fprintf was used on purpose:
we want to send entire message as one write() call. Since stderr
is unbuffered, separate fprintf's to it always result in separate
writes, they are not coalesced. If we aren't the only program
which writes to this particular stderr, this may result
in interleaved messages.
Since this function is not performance critical, I guess
it's ok to make it less efficient.
* strace.c (verror_msg): Attempt to print the message in single
write operation. Use separate fprintfs as a fallback if malloc fails.
Denys Vlasenko [Sat, 21 Jan 2012 03:01:56 +0000 (04:01 +0100)]
Improve code readability (logic is unchanged)
* util.c (umoven): Move assignment out of function call. Make assignment
to a flag variable later, closer to the place where it will be used.
(umovestr): Likewise.
(uload): Likewise.
Denys Vlasenko [Fri, 20 Jan 2012 10:56:00 +0000 (11:56 +0100)]
Change umovestr API: return > 0 instead of 0 if NUL was seen
* pathtrace.c (upathmatch): Adjust umovestr return value check for new API.
* util.c (printpathn): Use umovestr() > 0 return value for more efficient
(and robust - we don't depend on "no overwrote past NUL" behavior anymore)
handling of terminating NUL.
(printstr): Remove useless NUL placement before umovestr() call.
Allocate 1 byte more to outstr[] array - for NUL.
(umovestr): Change to return 1 if NUL was seen.
Denys Vlasenko [Fri, 20 Jan 2012 10:04:04 +0000 (11:04 +0100)]
Eliminate code duplication in time printing, reduce a few static buffers
text data bss dec hex filename
238454 664 28772 267890 41672 strace.before
238106 664 28676 267446 414b6 strace
* defs.h: Add TIMESPEC_TEXT_BUFSIZE and TIMEVAL_TEXT_BUFSIZE defines.
Add 'int special' parameter to sprinttv().
* time.c (sprinttv): Add 'int special' parameter, and use it
similarly to 'int special' parameter of printtv_bitness().
(printtv_bitness): Use sprinttv() instead of duplicating its code.
(print_timespec): Use sprint_timespec() instead of duplicating
its code.
* desc.c (decode_select): Use TIMEVAL_TEXT_BUFSIZE instead of 128
when checking remaining buffer size.
* net.c (sys_recvmsg): Use TIMESPEC_TEXT_BUFSIZE instead of 128
for static buffer size.
* stream.c (decode_poll): Use TIMESPEC_TEXT_BUFSIZE instead of 128
when checking remaining buffer size.
Denys Vlasenko [Thu, 19 Jan 2012 16:20:23 +0000 (17:20 +0100)]
Reduce bss usage and speed up string printing
text data bss dec hex filename
237913 660 49284 287857 46471 strace.before
237973 660 28772 267405 4148d strace
This reduces L1 D-cache pressure a bit: instead of dirtying
20k of bss, we will reuse already dirty stack area.
* util.c (printpathn): Use on-stack buffers instead of static ones.
Saves 5*MAXPATHLEN in bss.
(printstr): Use tprints() instead of tprintf("%s") when printing
formatted string. May be a bit faster, depending on libc.
Denys Vlasenko [Wed, 18 Jan 2012 15:30:47 +0000 (16:30 +0100)]
Get rid of TCB_SIGTRAPPED
On attempts to block or set SIGTRAP handler,
for example, using sigaction syscall, we generate
an additional SIGSTOP.
This change gets rid of this SIGSTOP sending/ignoring.
It appears to work just fine.
It also works if I force strace to not use PTRACE_O_TRACESYSGOOD,
which means strace stops will be marked with SIGTRAP,
not (SIGTRAP | 0x80) - I wondered maybe that's when
this hack is needed.
So, why we even have TCB_SIGTRAPPED? No one knows. It predates
version control: this code was present in the initial commit,
in 1999. No adequate comments, either.
Moreover, TCB_SIGTRAPPED is not set in sys_rt_sigaction
and sys_sigprocmask syscalls - the ones which are most usually
used to implement signal blocking, it is only set in obsolete
sys_signal, sys_sigaction, sys_sigsetmask, and in some dead
non-Linux code.
I think whatever bug it was fixing is gone long ago -
at least as long as sys_rt_sigaction is used by glibc.
Again, since glibc (and uclibc) uses sys_rt_sigaction
and sys_sigprocmask, modified code paths are not used
by most programs anyway.
* defs.h: Remove definition of TCB_SIGTRAPPED.
* signal.c (sys_sigvec): Don't set TCB_SIGTRAPPED and don't send SIGSTOP.
(sys_sigsetmask): Likewise.
(sys_sigaction): Likewise.
(sys_signal): Likewise.
* strace.c (trace): Remove code which executes if TCB_SIGTRAPPED is set.
Denys Vlasenko [Thu, 12 Jan 2012 10:26:34 +0000 (11:26 +0100)]
Make ERESTARTxyz messages more descriptive
There is widespread confusion about exact meaning
of ERESTARTxyz codes. Before this change, we were showing
all four of them the same: as "(To be restarted)".
This change prints better explanations for these codes,
and contains verbose comments which explain *why* we display
codes that way - or else someone confused
is bound to come later and mangle them again.
New messages are:
ERESTARTSYS (To be restarted if SA_RESTART is set)
ERESTARTNOINTR (To be restarted)
ERESTARTNOHAND (Interrupted by signal)
ERESTART_RESTARTBLOCK (Interrupted by signal)
* syscall.c (trace_syscall_exiting): Make ERESTARTxyz messages
more descriptive.
Denys Vlasenko [Tue, 10 Jan 2012 15:40:35 +0000 (16:40 +0100)]
Display mask on enter to sigreturn, not on exit
sys_sigreturn() performs ugly manipulations in order to show
signal mask which is restored by this syscall: on syscall entry,
fetches it from the stack, saves it in tcp->u_arg[]
(where it used to overflow this array - fixed sometime ago),
then retrieves the mask and displays it on syscall exit.
Apparently, the motivation is to make it slightly more obvious
to user that signal mask is restored only when this syscall returns.
IMO, this hardly justifies the necessary hacks. It is much easier
to display the mask at the point when we fetch it - on syscall entry.
While at it, I made it so that we do display returned value/errno.
I see no point in hiding it and showing uninformative "= ?" instead.
Example of pause() being interrupted by ALRM which has installed handler
which re-arms ALRM:
* defs.h: Declare struct pt_regs i386_regs and struct pt_regs x86_64_regs.
* syscall.c: Remove "static" keywork from these structures' definitions.
* signal.c (sys_sigreturn): Display mask on enter, not on exit.
Denys Vlasenko [Wed, 4 Jan 2012 14:15:26 +0000 (15:15 +0100)]
Do not detach from tracee which experienced ptrace error.
Before this patch, if a thread got nuked by exit in another thread
and we happened to poke it at the same time, we print "????(" thingy
and detach the thread. Since we removed "detach before death" logic,
this no longer matches the behavior of other threads.
Before patch:
[pid 1780] exit_group(1) = ?
[pid 1778] ????( <unfinished ...>
Process 1778 detached
[pid 5860] +++ exited with 1 +++
After:
[pid 17765] exit_group(1) = ?
[pid 21680] ????( <unfinished ...>
[pid 17791] +++ exited with 1 +++
[pid 21680] +++ exited with 1 +++
* strace (trace): Do not detach from tracee which experienced ptrace error.
Dmitry V. Levin [Mon, 26 Dec 2011 20:12:02 +0000 (20:12 +0000)]
Enhance decoding for personalities with small wordsize
* util.c (umoven, umovestr) [SUPPORTED_PERSONALITIES > 1]: If current
personality's wordsize is less than sizeof(long), use only significant
bits of the given address.
Dmitry V. Levin [Fri, 23 Dec 2011 00:50:49 +0000 (00:50 +0000)]
Enhance personality switching
On syscall entry, save current personality in the tcb structure
along with scno.
On syscall exit, restore current personality from the tcb structure.
* defs.h (struct tcb) [SUPPORTED_PERSONALITIES > 1]: Add currpers
field.
* strace.c (alloc_tcb) [SUPPORTED_PERSONALITIES > 1]: Initialize
tcp->currpers.
* syscall.c (update_personality) [SUPPORTED_PERSONALITIES > 1]: New
function.
(get_scno, trace_syscall_exiting): Use it.
Reported-by: Michael A Fetterman <mafetter@nvidia.com>
Heiko Carstens [Wed, 30 Nov 2011 12:16:29 +0000 (13:16 +0100)]
Fix sys_ipc/sys_semtimedop decoding on s390
The s390 kernel sys_ipc system call only takes five arguments instead of
six arguments which the common code sys_ipc implementation takes.
One of the arguments of the sys_semtimedop subcall is therefore passed in
a different register than in the common code implementation.
This leads to broken decoding of the timespec argument:
Dmitry V. Levin [Wed, 12 Oct 2011 19:03:29 +0000 (19:03 +0000)]
Add names for dummy parsers. No code changes
* linux/dummy.h: Add aliases to printargs() for those of dummy parsers
that had no own names before.
* linux/*/syscallent.h: Use these new names instead of printargs.
Dmitry V. Levin [Tue, 11 Oct 2011 17:07:05 +0000 (17:07 +0000)]
Implement decoding of splice, tee and vmsplice(2) syscalls
* io.c (print_loff_t): New function.
(sys_sendfile64): Use it.
(splice_flags): New xlat structure.
(sys_tee, sys_splice, sys_vmsplice): New functions.
* linux/syscall.h (sys_tee, sys_splice, sys_vmsplice): Declare them.
* linux/*/syscallent.h: Use them.
Do post-attach initialization earlier; fix "we ignore SIGSTOP on NOMMU" bug
We set ptrace options when we see post-attach SIGSTOP.
This is wrong: it's better to set them right away on the very first
stop (whichever it will be). It also will make adding SEIZE support easier,
since SEIZE has no post-attach SIGSTOP.
We do it by adding a new bit, TCB_IGNORE_ONE_SIGSTOP, and treating
TCB_STARTUP and TCB_IGNORE_ONE_SIGSTOP as two slightly different things.
* defs.h: Add a new flag bit, TCB_IGNORE_ONE_SIGSTOP.
* process.c (internal_fork): Set TCB_IGNORE_ONE_SIGSTOP on a newly added child.
* strace.c (startup_attach): Set TCB_IGNORE_ONE_SIGSTOP after attach.
Fix a case when "strace -p PID" found PID dead but sone other of its threads
still alive.
(startup_child): Set TCB_IGNORE_ONE_SIGSTOP after attach, _if needed_.
This fixes a bogus case where we can ignore a _real_ SIGSTOP on NOMMU.
(detach): Perform anti-SIGSTOP dance only if TCB_IGNORE_ONE_SIGSTOP is set,
not if TCB_STARTUP is set.
(trace): Set TCB_IGNORE_ONE_SIGSTOP after attach.
Clear TCB_STARTUP and initialize tracee on the very first tracee stop.
Clear TCB_IGNORE_ONE_SIGSTOP when SIGSTOP is seen.
* defs.h: Remove TCB_ATTACH_DONE constant.
* strace.c (startup_attach): Use TCB_STARTUP instead of TCB_ATTACH_DONE
to distinquish attached from not-yet-attached threads.
This fixes logic in detach() which thinks that TCB_STARTUP
means that we are already attached, but did not see SIGSTOP yet.
This also allows to get rid of TCB_ATTACH_DONE flag.
* process.c (internal_fork): Set TCB_STARTUP after attach.
* strace.c (startup_attach): Likewise.
(startup_child): Likewise.
(alloc_tcb): Do not set TCB_STARTUP on tcb allocation - we are
not attached yet.
(trace): Set TCB_STARTUP when we detech an auto-attached child.
After recent change, select(2^31-1, NULL, NULL, NULL)
would make strace exit. This change caps fdsize so that
it is always in [0, 1025*1024], IOW: we will try to allocate at most
1 megabyte, which in practice will almost always work,
unlike malloc(2Gig).
* desc.c (decode_select): Cap fdsize to 1024*1024.
* pathtrace.c (pathtrace_match): Cap fdsize to 1024*1024.
* file.c (sys_getdents): Cap len to 1024*1024.
(sys_getdents64): Cap len to 1024*1024.
* util.c (dumpiov): Refuse to process iov with more than 1024*1024
elements. Don't die on malloc failure.
(dumpstr): Don't die on malloc failure.
Minor tweaks in startup_child(). Logic isn't changed (but code is)
* strace.c (startup_attach): Tweak comment.
(startup_child): Move common code out of ifdef.
Indent nested ifdefs. Tweak comments. Remove two
unnecessary calls to getpid().
Denys Vlasenko [Wed, 31 Aug 2011 13:58:06 +0000 (15:58 +0200)]
Add README-linux-ptrace file
I tried to push this doc to Michael Kerrisk <mtk.manpages@gmail.com>,
but got no reply. To avoid losing the document, let it live
in strace tree for now.
Denys Vlasenko [Wed, 31 Aug 2011 10:26:03 +0000 (12:26 +0200)]
Optimization: eliminate all remaining usages of strcat()
After this change, we don't use strcat() anywhere.
* defs.h: Change sprinttv() return type to char *.
* time.c (sprinttv): Return pointer past last stored char.
* desc.c (decode_select): Change printing logic in order to eliminate
usage of strcat() - use stpcpy(), *outptr++ = ch, sprintf() instead.
Also reduce usage of strlen().
* stream.c (decode_poll): Likewise.
Denys Vlasenko [Wed, 31 Aug 2011 10:22:56 +0000 (12:22 +0200)]
Optimize string_quote() for speed
* util.c (string_quote): Speed up check for terminating NUL.
Replace strintf() with open-coded binary to hex/oct conversions -
we potentially do them for every single byte, need to be fast.
Denys Vlasenko [Tue, 30 Aug 2011 16:53:49 +0000 (18:53 +0200)]
On X86_64 and I386, use PTRACE_GETREGS to fetch all registers
Before this change, registers were read with PTRACE_PEEKUSER
ptrace operation, one per register. This is slower than
fetching them all in one ptrace operation.
* defs.h: include asm/ptrace.h on X86_64 and I386.
* syscall.c: New static variables i386_regs and x86_64_regs.
Remove static eax/rax variables.
(get_scno): Fetch all registers with single PTRACE_GETREGS operation.
(get_syscall_result): Likewise.
(syscall_fixup_on_sysenter): Use PTRACE_GETREGS results in i386/x86_64_regs.
(syscall_enter): Set tcp->u_arg[i] from PTRACE_GETREGS results.
(get_error): Set tcp->u_rval, tcp->u_error from PTRACE_GETREGS results.
Denys Vlasenko [Thu, 25 Aug 2011 08:25:35 +0000 (10:25 +0200)]
Simplify syscall_fixup[_on_sysexit]
* syscall.c (syscall_fixup): Remove checks for entering(tcp).
Remove code which executes if exiting(tcp).
(syscall_fixup_on_sysexit): Remove code which executes
if entering(tcp). Remove checks for exiting(tcp).
Denys Vlasenko [Thu, 25 Aug 2011 08:23:00 +0000 (10:23 +0200)]
Split syscall_fixup into enter/exit pair of functions
* syscall.c: Create syscall_fixup_on_sysexit() which is a copy of
syscall_fixup().
(trace_syscall_exiting): Call syscall_fixup_on_sysexit() instead of
syscall_fixup().
Denys Vlasenko [Wed, 24 Aug 2011 23:27:59 +0000 (01:27 +0200)]
Optimize tabto()
tabto is used in many lines of strace output.
On glibc, tprintf("%*s", col - curcol, "") is noticeably slow
compared to tprintf(" "). Use the latter.
Observed ~15% reduction of time spent in userspace.
* defs.h: Drop extern declaration of acolumn. Make tabto()
take no parameters.
* process.c (sys_exit): Call tabto() with no parameters.
* syscall.c (trace_syscall_exiting): Call tabto() with no parameters.
* strace.c: Make acolumn static, add static char *acolumn_spaces.
(main): Allocate acolumn_spaces as a string of spaces.
(printleader): Call tabto() with no parameters.
(tabto): Use simpler method to print lots of spaces.
Denys Vlasenko [Wed, 24 Aug 2011 23:13:43 +0000 (01:13 +0200)]
Opotimize "scno >= 0 && scno < nsyscalls" check
gcc can't figure out on its own that this check can be done with
single compare, and does two compares. We can help it by casting
scno to unsigned long: ((unsigned long)(scno) < nsyscalls)
* defs.h: New macro SCNO_IN_RANGE(long_var).
* count.c (count_syscall): Use SCNO_IN_RANGE() instead of open-coded check.
* syscall.c (getrval2): Use SCNO_IN_RANGE() instead of open-coded check.
This fixes a bug: missing check for scno < 0 and scno > nsyscalls
instead of scno >= nsyscalls.
(get_scno): Use SCNO_IN_RANGE() instead of open-coded check.
This fixes a bug: scno > nsyscalls instead of scno >= nsyscalls.
(known_scno): Use SCNO_IN_RANGE() instead of open-coded check.
(internal_syscall): Likewise.
(syscall_enter): Likewise.
(trace_syscall_entering): Likewise.
(get_error): Likewise.
(trace_syscall_exiting): Likewise.