March 2009


RFC only, not for pulling, unless everyone is fine with these :-)

After telling the folks from Google that I had splice code that I needed
to get working, I finally got around to doing it.

Not only did I get the splice code working, but I also made a perl
script (might want to cover you eyes if you look at that code) that
can automagically create the output from the binary files reading
the format arguments in the /debugfs/tracing/events///format

From previous patches, we have in include/trace/sched_event_types.h:

#define TRACE_SYSTEM sched

TPPROTO(struct rq *rq, struct task_struct *prev,
struct task_struct *next),
TPARGS(rq, prev, next),
TPFMT("task %s:%d ==> %s:%d",
prev->comm, prev->pid, next->comm, next->pid),
TRACE_FIELD(pid_t, prev_pid, prev->pid)
TRACE_FIELD(int, prev_prio, prev->prio)
TRACE_FIELD(pid_t, next_pid, next->pid)
TRACE_FIELD(int, next_prio, next->prio)
TPRAWFMT("prev %d:%d ==> next %s:%d:%d")

By making that, we automagically get this file:

# cat /debug/tracing/events/sched/sched_switch/format
name: sched_switch
ID: 29
field:unsigned char type; offset:0; size:1;
field:unsigned char flags; offset:1; size:1;
field:unsigned char preempt_count; offset:2; size:1;
field:int pid; offset:4; size:4;
field:int tgid; offset:8; size:4;

field:pid_t prev_pid; offset:12; size:4;
field:int prev_prio; offset:16; size:4;
field special:char next_comm[TASK_COMM_LEN]; offset:20; size:16;
field:pid_t next_pid; offset:36; size:4;
field:int next_prio; offset:40; size:4;

print fmt: "prev %d:%d ==> next %s:%d:%d"

Now with this patch set, we create a way to read the ftrace ring buffers
directly, as a binary page. Splice has been used such that the user
could splice the ring buffers without need to copy the pages.
The pages are taken from the ring buffers and can be placed directly
into files, without extra copies.

# ls /debug/tracing/binary_buffers/
0 1 2 3 4 5 6 7

One can either just use the read/write to grab live data from these
buffers, or they could use splice. I have a simple file that reads
this buffers using splice. Note, it only runs on one file, you
can make multiple copies to run more. The ring buffers in ftrace
are per cpu and they are not dependent on each other.

Also, if there is no data in the buffer, it returns -EAGAIN.

# find /debug/tracing/events -name 'type' | while read f; do
> echo raw > $f; done

# find /debug/tracing/events -name 'enable' | while read f; do
> echo 1 > $f; done

# ./splice /debug/tracing/binary_buffers/0 /tmp/buf-0

Yes you can run multiple instances of this on different buffers.

# ./ /tmp/buf-0


0 [000] 7071.936459: (irq_handler_entry) irq 48
0 [000] 7071.936462: (irq_handler_exit) irq 48 ret 1
0 [000] 7071.988801: (sched_signal_send) sig: 14 task 0
0 [000] 7071.988813: (sched_wakeup) task 0 success=1
0 [000] 7071.988823: (sched_switch) prev 0:140 ==> next ntpd:6582272:0
3303 [000] 7071.988916: (sched_switch) prev 0:120 ==> next swap:7497072:0
0 [000] 7072.020370: (sched_wakeup) task 0 success=1
0 [000] 7072.020407: (sched_switch) prev 0:140 ==> next even:808416116:0
16 [000] 7071.936157: (sched_switch) prev 0:115 ==> next swap:7497072:0

Notice the "(sched_switch)" lines.

The splice C code:

The perl script to parse:

And yes I know, they are both ugly :-p

The following patches are in:


branch: rfc/splice/tip/tracing/ftrace

Steven Rostedt (5):
ring-buffer: reset write field for ring_buffer_read_page
ring-buffer: fix ring_buffer_read_page
ring-buffer: replace sizeof of event header with offsetof
ring-buffer: make ring_buffer_read_page read from start on partial page
tracing: add binary buffer files for use with splice

include/linux/ring_buffer.h | 7 +-
kernel/trace/ring_buffer.c | 118 +++++++++++++------
kernel/trace/trace.c | 274 +++++++++++++++++++++++++++++++++++++++++-
kernel/trace/trace.h | 1 +
4 files changed, 357 insertions(+), 43 deletions(-)
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at
Please read the FAQ at


> > Could you please update Documentation/ftrace.txt?
> > I guess many user interesting this patch. :)
> Yeah, it has become a really nice feature this way. As i told it
> to K.Prasad before: we need this tracer because the data tracer
> will likely become the most common usecase of this facility. We
> will get the hw breakpoints facility tested and used.
> And in fact we can go one step further: it would also be nice to
> wire it up with the ftrace histogram code: so that we can get
> usage histograms of kernel symbol read/write activities without
> the overhead of tracing. (The branch tracer already has this.)
> Especially frequently used variables generate a _lot_ of events.
> Ingo

Right, it will even be an occasion to improve and further test
the histogram tracing.
K. Prasad if you need some help on how to use it, don't hesitate to tell.


please don't hesitate to tell. で遠慮せずに聞いてね。という慣用句らしい

id:kazuhookuさんから、「don't hesitate to ask の間違いでなくて? 」とツッコミをいただいた。


Hi All,

I just pushed out support for kernel tracepoints in systemtap -- a copy
of the manpage section is included below for reference. Try this
command to see what's available in your kernel:
stap -l 'kernel.trace("*")'

On Fedora rawhide, the kernel does have tracepoints enabled, but the
kernel-devel package is missing the headers needed to define the
tracepoints. This has been fixed in the package spec now, so the next
kernel package should be ready to go. If you're impatient, you can also
grab the source rpm and copy include/trace into
/usr/src/kernels/$(uname -r)/include/trace/.

RHEL5 U3 also contains tracepoints, and is also missing the headers from
kernel-devel. Again, copying the files from the source RPM should let
you use them now. I don't have an ETA for when the official package
might be updated though...

For other distributions or custom kernels, YMMV, but anything 2.6.28+
should have at least a few tracepoints available.

As always, any feedback is welcome...




This family of probe points hooks up to static probing trace-
points inserted into the kernel or modules. As with markers,
these tracepoints are special macro calls inserted by kernel de-
velopers to make probing faster and more reliable than with
DWARF-based probes, and DWARF debugging information is not re-
quired to probe tracepoints. Tracepoints have an extra advan-
tage of more strongly-typed parameters than markers.

Tracepoint probes begin with kernel. The next part names the
tracepoint itself: trace("name"). The tracepoint name string,
which may contain the usual wildcard characters, is matched
against the names defined by the kernel developers in the trace-
point header files.

The handler associated with a tracepoint-based probe may read
the optional parameters specified at the macro call site. These
are named according to the declaration by the tracepoint author.
For example, the tracepoint probe kernel.trace("sched_switch")
provides the parameters $rq, $prev, and $next. If the parameter
is a complex type, as in a struct pointer, then a script can ac-
cess fields with the same syntax as DWARF $target variables.
Also, tracepoint parameters cannot be modified, but in guru-mode
a script may modify fields of parameters.

The name of the tracepoint is available in $$name, and a string
of name=value pairs for all parameters of the tracepoint is
available in $$vars.



I have a small problem. Maybe its just a misunderstanding but I cant
solve it.

I think that writing "3" to drop_caches should drop all buffers and
caches which are already written. So its recommended to put a "sync"
infront of it.
So I did "free -m ; sync ; echo 3 > /proc/sys/vm/drop_caches ; free -m"
And it gave me:

total used free shared buffers cached
Mem: 3950 3922 28 0 1 879
-/+ buffers/cache: 3041 909
Swap: 5342 205 5136

total used free shared buffers cached
Mem: 3950 3907 43 0 0 864
-/+ buffers/cache: 3041 908
Swap: 5341 206 5135

So the buffer was 1 and is 0 afterthat. But cached is at 879 MB before
and is still 864 MB (!!!) after that!

I am at swappiness=0 and when I remove and readd one swap-partition
after another (so there is always swap). It will keep the cached and
put the swapped memory on other swaps?!

I _think_ thats not the way it should go?

It would be really kind if someone could explain that issue and
what "cached" is at all!

Have a nice day...

PS: Please CC me as I am not subscribed.


# echo 3 > /proc/sys/vm/drop_caches

してもcached 欄が減らないぞーというわけ。

さて、このcached欄であるが、/proc/meminfoの "Cached:"フィールドの数字を表示している。

I rebooted in console without the nvidia-module loaded and have the same
results (updated to btw):
# sync ; echo 3 > /proc/sys/vm/drop_caches ; free -m ; cat /proc/meminfo
total used free shared buffers
Mem: 3950 1647 2303 0 0
-/+ buffers/cache: 722 3228
Swap: 5890 0 5890
MemTotal: 4045444 kB
MemFree: 2358944 kB
Buffers: 544 kB
Cached: 946624 kB
SwapCached: 0 kB
Active: 1614756 kB
Inactive: 7632 kB
Active(anon): 1602476 kB
Inactive(anon): 0 kB
Active(file): 12280 kB
Inactive(file): 7632 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 6032344 kB
SwapFree: 6032344 kB
Dirty: 72 kB
Writeback: 32 kB
AnonPages: 675224 kB
Mapped: 17756 kB
Slab: 19936 kB
SReclaimable: 9652 kB
SUnreclaim: 10284 kB
PageTables: 8296 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 8055064 kB
Committed_AS: 3648088 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 10616 kB
VmallocChunk: 34359716459 kB
DirectMap4k: 6080 kB
DirectMap2M: 4188160 kB

Cachedと{In}Active(file) が合ってないように見えるが、これはtmpfs用のページが{In}Active(anon)にチャージされる仕様のため。
ようするに Cached:の900MはほとんどがtmpfsファイルでActive(anon)にチャージされていると考えるとつじつまがある。



Tom Zanussi pointed out that the simple TRACE_FIELD was not enough to
record trace data that required memcpy. This patch addresses this issue
by adding a TRACE_FIELD_SPECIAL. The format is similar to TRACE_FIELD
but looks like so:

「パッチ address プロブレム名」 で、このパッチはこのプロブレムに(取り組む|対処する)パッチだよん。

ITpro様に第10回The Linux Foundation Japan Symposiumの様子をとりあげていただきました。


* Linus Torvalds wrote:

> On Tue, 24 Feb 2009, Dave Hansen wrote:
> >
> > Yeah, that's a good point. Are we sure that's what is
> > happening here, though? That's one thing a profile would
> > hopefully help with.
> One thing to note is that _if_ it's purely an issue of
> nontemporal stores vs normal stores, then profiling is very
> likely going to be almost entirely useless. You'll get
> "results", but the results have nothing what-so-ever to do
> with reality or anything interesting.
> The nontemporal stores may stand out in the profiles, but the
> actual performance impact will be all about whether totally
> unrelated code got cache misses or not. Quite often those
> cache misses will also be in user mode, and very possibly in
> other processes.
> So profiles can certainly be interesting, but if Salman says
> that his patch makes a difference for his benchmark, then
> profiling is almost certainly not interesting FOR THAT PATCH.
> It's interesting mainly as a way to look at whether there are
> then also _other_ issues that are worth addressing (ie the
> whole atime thing is in a whole different dimension and an
> independent issue).

a 'perfstat' run would certainly be interesting (for cases where
a pure /usr/bin/time run is inconclusive), comparing the
unpatched and patched kernel.

That way we can see summary counts for the whole workload, like:

| Performance counter stats for './mmap-perf' |
| |
| x86-defconfig | PARAVIRT=y
| 1311.554526 | 1360.624932 task clock ticks (msecs) +3.74%
| |
| 1 | 1 CPU migrations
| 91 | 79 context switches
| 55945 | 55943 pagefaults
| ............................................
| 3781392474 | 3918777174 CPU cycles +3.63%
| 1957153827 | 2161280486 instructions +10.43%
| 50234816 | 51303520 cache references +2.12%
| 5428258 | 5583728 cache misses +2.86%
| 437983499 | 478967061 branches +9.36%
| 32486067 | 32336874 branch-misses -0.46%
| |
| 1314.782469 | 1363.694447 time elapsed (msecs) +3.72%
| |

Such a comparison of would certainly be more meaningful for such
things than a profile.

Salman, if you are interested in doing a perfstat comparison,
just pick up a tip:master kernel [perfcounters are
default-enabled in it]:

and run perfstat on it (as root, to get the kernel-mode counts

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at
Please read the FAQ at


KOSAKI Motohiro, there is a small introduction guide for how to do it.

CAI Qian


On Sun, 22 Feb 2009, Mel Gorman wrote:

> I haven't run a page-allocator micro-benchmark to see what sort of figures
> that gives. Christoph, I recall you had some sort of page allocator
> micro-benchmark. Do you want to give it a shot or remind me how to use
> it please?

The page allocator / slab allocator microbenchmarks are in my VM
development git tree. The branch is named tests.;a=shortlog;h=tests


On Fri, 27 Feb 2009, Steven Rostedt wrote:
> HAHAHAHHAAAA!!!! My evil scheme is working! I post some sub-optimal code,
> and have others do the nasty work for me!!!!
> Oh, did I just say that out loud?

Not only that, I will also sue you for my patent on that algorithm.


Steven Rostedt(ftraceの作者)がAndrew Mortonから、この設計はそもそもダメじゃね?とレビューコメントもらったとき、開き直って「いいもん。evilでも動くもん!!」と暴れた後「うるさく言い過ぎた?」