July 2009

Apparently so, even if you use ext4.
Come to think of it, 32-bit max (4G) x page size (4KB) = 16TB, so page structs beyond 16TB simply cannot be pointed to; it is the natural consequence.
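
A rough standalone illustration of that arithmetic (my own sketch, not code from the mail quoted below): with a 32-bit page index and 4KB pages, a byte offset past 16TB silently wraps when converted to a page index.

/* Toy user-space illustration of the 16TB ceiling: pgoff_t is "unsigned
 * long" (32 bits on a 32-bit host), so offset >> PAGE_SHIFT wraps past 16TB. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                          /* 4096-byte pages */
typedef uint32_t pgoff32_t;                    /* stand-in for a 32-bit pgoff_t */

int main(void)
{
    uint64_t max_pages = (uint64_t)UINT32_MAX + 1;           /* 2^32 page indexes */
    printf("cacheable bytes: %llu (16TB)\n",
           (unsigned long long)(max_pages << PAGE_SHIFT));

    uint64_t pos = 17ULL << 40;                /* a byte offset at 17TB */
    pgoff32_t index = (pgoff32_t)(pos >> PAGE_SHIFT);        /* high bits dropped */
    printf("offset 17TB -> page index %u (wrapped back below 16TB)\n", index);
    return 0;
}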

Still, the migration to 64-bit is not going all that smoothly, and yet 16TB disks are just around the corner. What a headache.

Update: pasting the original mail below.


Subject: How to handle >16TB devices on 32 bit hosts ??

Hi,
It has recently come to my attention that Linux on a 32 bit host does
not handle devices beyond 16TB particularly well.

In particular, any access that goes through the page cache for the
block device is limited to a pgoff_t number of pages.
As pgoff_t is "unsigned long" and hence 32bit, and as page size is
4096, this comes to 16TB total.

A filesystem created on a 17TB device should be able to access and
cache file data perfectly providing CONFIG_LBDAF is set.
However if the filesystem caches metadata using the block device,
then metadata beyond 16TB will be a problem.

Access to the block device (/dev/whatever) via open/read/write will
also cause problems beyond 16TB, though if O_DIRECT is used I think
it should work OK (it will probably try to flush out completely
irrelevant parts of the page cache before allowing the IO, but that
is a benign error case I think).

With 2TB drives easily available, more people will probably try
building arrays this big and we cannot just assume they will only do
it on 64bit hosts.

So the question I wanted to ask really is: is there any point in
allowing >16TB arrays to be created on 32bit hosts, or should we just
disallow them? If we allow them, what steps should we take to make
the possible failure modes more obvious?

As I said, I think O_DIRECT largely works fine on these devices and
we could fix the few irregularities with little effort. So one step
might be to make mkfs/fsck utilities use O_DIRECT on >16TB devices on
32bit hosts.

Given that non-O_DIRECT can fail (e.g. in do_generic_file_read,
index = *ppos >> PAGE_CACHE_SHIFT
will lose data if *ppos is beyond 44 bits) we should probably fail
opens on devices larger than 16TB.... though just failing the open
doesn't help if the device can change size, as dm and md devices can.

I believe ext[234] uses the block device's page cache for metadata, so
they cannot safely be used with >16TB devices on 32bit. Is that
correct? Should they fail a mount attempt? Do they?

Are there any filesystems that do not use the block device cache and
so are not limited to 16TB on 32bit?

Even if no filesystem can use >16TB on 32bit, I suspect dm can
usefully use such a device for logical volume management, and as long
as each logical volume does not exceed 16TB, all should be happy. So
completely disallowing them might not be best.

I suppose we could add a CONFIG option to make pgoff_t be
"unsigned long long". Would the cost/benefit of that be acceptable?

Your thoughts are most welcome.

Thanks,
NeilBrown



http://japan.zdnet.com/news/os/story/0,2000056192,20396891,00.htm

I got a mention in a ZDNet article.

Fixed a problem where, when memory ran low with a large number of threads created, every thread would pile into the memory reclaim logic at once and trigger unnecessary OOMs.
Along with that, substantially beefed up /proc/meminfo and the memory-usage logging printed at OOM time.

It made it into mmotm safely.

There seems to be a discussion to that effect.

http://lwn.net/Articles/341244/

・It is risky: if it fails, the system can end up unable to boot
・Library addresses end up changing only once every two weeks, so security suffers
・On recent 32-bit x86 the VDSO is placed at a random address, so prelink's
 underlying assumption, that library load addresses can be computed in
 advance, no longer holds.
 In fact, glibc and the VDSO collide all the time
・Besides, given that even SELinux gets accepted, most users clearly do not
 care about performance to that degree
・In the end, the only case where the benefit is actually noticeable is when
 using OOo. Is that really worth meddling with the whole system?

This week's kernel podcast
http://www.kernelpodcast.org/2009/07/14/20090709-linux-kernel-podcast/


OOM. Rik van Riel posted a patch aimed at addressing some of the recent OOM situations, as touched upon in yesterday’s podcast. Rik noticed that vmscan can get horribly confused when too many tasks go into direct reclaim, and trigger an OOM situation that is caused because too few pages are on the LRU. Instead, Rik proposes limiting the number of tasks that may enter page reclaim to allow at most half of each inactive list to be isolated at any one time. In yet another semi-related VM thread, Kosaki Motohiro posted version 2 of his patches aimed at helping to track down OOM situations with more information presented to the user in the generated kernel log messages.
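
As a toy sketch of the idea described there, limiting how much of an inactive list direct reclaimers may isolate at once (my own illustration in plain C, not the actual mm/vmscan.c change):

/* Toy model: a reclaimer may only isolate pages while the pages already
 * isolated do not exceed what is still left on the inactive list, i.e. at
 * most half of the list is off being reclaimed at any one time. */
#include <stdbool.h>
#include <stdio.h>

struct lru {
    long nr_inactive;   /* pages currently on the inactive list */
    long nr_isolated;   /* pages pulled off the list by reclaimers */
};

static bool too_many_isolated(const struct lru *l)
{
    return l->nr_isolated > l->nr_inactive;
}

int main(void)
{
    struct lru l = { .nr_inactive = 1000, .nr_isolated = 0 };

    for (int task = 0; task < 6; task++) {
        if (too_many_isolated(&l)) {
            /* instead of piling in and declaring OOM, this task backs off */
            printf("task %d: too many pages isolated, waiting\n", task);
            continue;
        }
        l.nr_inactive -= 200;              /* this task isolates a batch of 200 */
        l.nr_isolated += 200;
        printf("task %d: isolated 200 pages (%ld left, %ld isolated)\n",
               task, l.nr_inactive, l.nr_isolated);
    }
    return 0;
}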




       {    !      _,, -ェェュ、   |
ィ彡三ミヽ  `ヽ     ,ィハミミミミミミミミミヽ、|
彡'⌒ヾミヽ   `ー  /ililハilミilミliliミliliミミヾ|
     ヾ、        /iiiiイ!ヾヾミ、ミニ=ー-ミii|
  _    `ー―' i!ハ:.:.\\_::::::::::::::/:.|   このスレは
彡三ミミヽ        i! ヽ:.:.:.:冫': : :::/,,∠|
彡'   ヾ、    _ノ i!::: ̄二ー:: : ::::ソ ・ ,|   Kernel Podcastに
      `ー '    {ヘラ' ・_>シ;テツ"''''"|
 ,ィ彡三ニミヽ  __ノ ヽヘ`" 彡' 〈     |    監視されています
彡'      ` ̄       `\   ー-=ェっ |
      _  __ ノ  {ミ;ヽ、   ⌒   |
   ,ィ彡'   ̄        ヾミミミミト-- '  |
ミ三彡'        /⌒ / ̄ ̄ | : ::::::::::|
       ィニニ=- '     / i   `ー-(二つ
     ,ィ彡'         { ミi      (二⊃
   //        /  l ミii       ト、二)
 彡'       __,ノ   | ミソ     :..`ト-'
        /          | ミ{     :.:.:..:|
            ノ / ヾ\i、   :.:.:.:.:|
      ィニ=-- '"  /  ヾヾiiヽ、 :.:.:.:.::::|
    /     /  `/ ̄ ̄7ハヾヾ : .:.:.|
   ノ     _/   /   /  |:. :.:.:.:.:.:.:|
      /     /   /   |::.:.:.:.:.:.:.:|



Scary. How do they even know?

Someone seems to be saying "hey, let's add a deadline scheduler".


Hi all!

This is a proposal for a global [1], deadline driven scheduler for
real-time tasks in the Linux kernel. I thought I should send out an RFC to
gather some feedback instead of wildly hacking away at it.

This proposed scheduler is a modified MLLF (modified Least Laxity First)
called Earliest Failure First (EFF) as it orders tasks according to when
they will miss their deadlines, not when the actual deadline is.

== Motivation and background ==

Deadlines will give the developer greater flexibility and expressiveness
when creating real-time applications. Compared to a priority scheme,
this simplifies the process considerably as it removes the need for
calculating the priorities off-line in order to find the priority-map
that will order the tasks in the correct order. Yet another important
aspect of deadlines instead of priorities is systems too dynamic to
analyze (a large application with 100s of processes/threads or a system
running more than one rt-application).

* In very large systems, it becomes very difficult to find the correct
set of priorities, even with sophisticated tools, and a slight change
will require a re-calculation of all priorities.

* In very dynamic systems, it can be impossible to analyze the system
off-line, reducing the calculated priorities to best-effort only

* If more than one application runs at the same time, this becomes even
worse.


As a final point, a schedule of tasks with their priorities is, in
almost all scenarios, a result of all deadlines for all tasks. This also
goes for non-rt tasks, even though the concept of deadlines is a bit
more fuzzy here. The problem is that this mapping is a one-way function,
--you cannot determine the deadlines from a set of priorities.

The problem is, how to implement this efficiently in a priority-oriented
kernel, and more importantly, how to extend this to multi-core
systems. For single-core systems, EDF is optimal and can accept tasks up
to 100% utilization and still guarantee that all deadlines are
met. However, on multi-core, this breaks down and the utilization bound
must be set very low if any guarantee should be given (known as "Dhall's
effect").

== Related Work ==

Recently, I've been working away on a pfair-based scheduler (PD^2 to be
exact), but this failed for several reasons [2]. The most prominent were
its sensitivity to timer inaccuracies and very frequent task
preemption. pfair has several good qualities, as it reduces jitter,
scales to many cores and achieves high sustained utilization. However,
the benefits do not outweigh the added overhead and strict requirements
placed on the timer infrastructure.

This latter point is what haunts EDF on multi-core platforms. A global
EDF-US[1/2] cannot exceed (m+1)/2, standard EDF is much
worse. Partitioned can reach higher limits, but is very susceptible to
the bin-packing heuristics. Going fully partitioned will also introduce
other issues like the need for load balancing and more complex deadline
inheritance logic. However, one clear benefit with EDF, is that it will
minimize the number of task-switches, clearly something desirable.

== Proposed algorithm ==

So, with this in mind, my motivation was to find a way to extend a
deadline-driven scheduler to battle Dhall's effect. This can
be done if you look at time in a more general sense than just
deadlines. What you must do is look at how the expected computation
time needed by a task, with respect to the task's deadline, compares to
that of other tasks.

=== Notation ===

- Take a set of tasks with corresponding attributes. This set and their
attributes are called the schedule, 'S' and contains *all* tasks for
the given scheduling class (i.e. all EFF-tasks).

- Consider a multi-core system with 'm' processors.

- Let the i'th task in the schedule be denoted tau_i. [3]

- Each task will run in intervals, each 'round' is called a job. A task
consists of an infinite sequence of jobs. The k'th job of tau_i is
called tau_{i,k}

- Each task has a set of (relative) attributes supplied when the task is
inserted into the scheduler (passed via syscall)
* Period T_i
* Deadline D_i
* WCET C_i

- Each job (tau_{i,k}) has absolute attributes (computed from the relative
tasks-attributes coupled with physical time).
* Release-time r_{i,k}
* Deadline d_{i,k}
* Allocated time so far for a job, C_a(t, tau_{i,k})
When C_a equals WCET, the job's budget is exhausted and it should
start a new cycle. This is tested (see below) by the scheduler.
* Remaining time for the job, C_r(t, tau_{i,k})

- The acceptance function for EFF screens new tasks on their expected
utilization. Depending on the mode and implementation, it can be based
on the period, or on the deadline. The latter imposes stricter
constraints, but may lead to wasted resources.

U = C_i / T_i For SRT (bounded deadline tardiness)
U = C_i / D_i For HRT

- A relative measure, time to failure, ttf, indicates how much time is
left before a job must be scheduled to run in order to avoid a
deadline-miss. This will decrease as time progresses and the job is
not granted CPU time. For tasks currently running on a CPU, this value
will be constant.

Take a job with a WCET of 10ms, it has been allowed to run for 4
ms so far. The deadline is 8 ms away. Then the job must be
scheduled to run within the next 4 ms, otherwise it will not be
able to finish in time.

- An absolute value, time of failure (tof) can also be computed in a
static manner. For tasks not running on a CPU, the allocated time is
static. That means you can take the absolute deadline, subtract the
allocated time and you have the absolute point in time when a given
job will fail to meet its deadline.
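
To make the quantities above concrete, here is a small C sketch with hypothetical field names and deliberately simple numbers; it follows my reading of the definitions above and is not code from the RFC.

/* Sketch of the per-job bookkeeping described above (hypothetical names,
 * times in milliseconds). */
#include <stdio.h>

struct eff_job {
    double wcet;          /* C_i: worst-case execution time */
    double period;        /* T_i */
    double rel_deadline;  /* D_i: relative deadline */
    double abs_deadline;  /* d_{i,k}: absolute deadline of the current job */
    double allocated;     /* C_a: CPU time this job has consumed so far */
};

/* remaining work, C_r = C_i - C_a */
static double c_remaining(const struct eff_job *j) { return j->wcet - j->allocated; }

/* acceptance-test utilization: C/T for SRT, C/D for HRT */
static double util_srt(const struct eff_job *j) { return j->wcet / j->period; }
static double util_hrt(const struct eff_job *j) { return j->wcet / j->rel_deadline; }

/* time to failure: slack left before the job must be running to still make it */
static double ttf(const struct eff_job *j, double now)
{
    return (j->abs_deadline - now) - c_remaining(j);
}

/* time of failure: the absolute instant at which that slack runs out */
static double tof(const struct eff_job *j, double now) { return now + ttf(j, now); }

int main(void)
{
    double now = 0.0;
    /* hypothetical job: WCET 10ms, already ran 5ms, deadline 9ms from now */
    struct eff_job j = { .wcet = 10, .period = 20, .rel_deadline = 12,
                         .abs_deadline = now + 9, .allocated = 5 };

    printf("U (SRT, C/T) = %.2f   U (HRT, C/D) = %.2f\n", util_srt(&j), util_hrt(&j));
    printf("ttf = %.1f ms (slack before the job must be on a CPU)\n", ttf(&j, now));
    printf("tof = t = %.1f ms (constant while the job is off-CPU)\n", tof(&j, now));
    return 0;
}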

=== Outline of scheduler ===

Store tasks in 2 queues. One of size m, containing all the tasks
currently running on the CPUs (queue R). The other will hold all
currently active tasks waiting to execute (queue W).

queue R is sorted based on ttf (time to failure, the relative time left
until a task will miss its deadline). As the task approaches the
absolute time of failure at the same rate as C_a increases, ttf is
constant. R is only a 'map' of tasks to the CPUs. Position 0 in R
(i.e. smallest ttf) does not correspond to CPU#0, as the position->CPU
mapping will be quite fluid.

queue W is sorted based on absolute time of failure (tof). Since this is
a fixed point in time, and the tasks in W are not running (C_a is
unchanged), this value is constant.

When a task is scheduled to run, a timer is set at the point in time
where it has exhausted its budget (t_now + WCET - C_a). This is to
ensure that a runaway task does not grab the CPU.

When a new task arrives, it is handled according to the following rules:
- The system has one or more CPUs not running EFF-tasks. Pick any of the
free CPUs and assign the new job there. Set a timer to fire when its
budget is exhausted (t_now + WCET - C_a), as described above.

- All CPUs are busy, the new task has greater time to failure than the
head of W. The task is inserted into W at the appropriate place.

- All CPUs are busy and the new task has smaller time to failure than
the head of W. The new task is compared to the last task in R. If time
to failure is larger than the task at the tail, it is added to the
head of W.

- If all CPUs are busy, and time to failure is smaller than the tail of
R, the new task is a candidate for insertion. At this point the tasks
must be compared to see if picking one or the other will cause a
deadline-miss. If both will miss the deadline if the other is
scheduled, keep the existing running and place the new at the head of
W (as you will have a deadline-miss anyway unless the task is
picked up by another CPU soon).

- A task running on a CPU with ttf=0 should *never* be preempted by
another task. If all tasks in R have ttf=0, and a newly arrived task
has ttf=0, a deadline-miss is inevitable and switching tasks will only
waste resources.

When a task in R finishes (or is stopped due to the timer-limit), it is
removed from R, and the head of W is added to R, inserted at the
appropriate place.
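
A rough C sketch of those arrival rules (hypothetical structures; both queues are flattened into sorted arrays, the current time is fixed at t = 0, and rule 4 is simplified):

/* Sketch of the arrival rules above; hypothetical types, not RFC code. */
#include <stdio.h>

struct job { double ttf; double tof; };

enum placement { USE_FREE_CPU, PREEMPT_TAIL_OF_R, ENQUEUE_W };

/* R: running jobs, sorted by ttf (R[0] smallest), nr_running <= nr_cpus.
 * W: waiting jobs, sorted by tof (W[0] smallest, the most urgent waiter). */
static enum placement place_new_job(const struct job *new_job,
                                    const struct job *R, int nr_running,
                                    int nr_cpus,
                                    const struct job *W, int nr_waiting)
{
    if (nr_running < nr_cpus)
        return USE_FREE_CPU;                       /* rule 1: an idle CPU exists */

    const struct job *tail = &R[nr_running - 1];   /* running job with largest ttf */

    if (nr_waiting > 0 && new_job->ttf > W[0].ttf)
        return ENQUEUE_W;                          /* rule 2: less urgent than head of W */

    if (new_job->ttf > tail->ttf)
        return ENQUEUE_W;                          /* rule 3: cannot displace any running job */

    if (tail->ttf <= 0.0)
        return ENQUEUE_W;                          /* rule 5: never preempt a job with ttf = 0 */

    /* rule 4, simplified: the newcomer is more urgent and the tail still has
     * slack, so preempt the tail; the RFC additionally keeps the running job
     * when swapping would make both of them miss their deadlines. */
    return PREEMPT_TAIL_OF_R;
}

int main(void)
{
    struct job R[2] = { { .ttf = 1.0, .tof = 1.0 }, { .ttf = 5.0, .tof = 5.0 } };
    struct job W[1] = { { .ttf = 6.0, .tof = 6.0 } };
    struct job incoming = { .ttf = 2.0, .tof = 2.0 };

    switch (place_new_job(&incoming, R, 2, 2, W, 1)) {
    case USE_FREE_CPU:      printf("run on a free CPU\n"); break;
    case PREEMPT_TAIL_OF_R: printf("preempt the tail of R\n"); break;
    case ENQUEUE_W:         printf("wait on W\n"); break;
    }
    return 0;
}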

There has been some discussion lately (in particular on #linux-rt) about
the bandwidth inheritance (BWI) and proxy execution protocol (PEP). It
should be possible to extend EFF to handle both. As a side note, if
anyone has some good information about PEP, I'd like a copy :)

Based on this, I think the utilization can be set as high as M
(i.e. full utilization of all CPUs), but the jitter can probably be
quite bad, so for jitter-sensitive tasks, a short period/deadline should
be used.

There are still some issues left to solve, for instance how to best
handle sporadic tasks, and whether or not deadline misses should be allowed,
or just 'bounded deadline tardiness'. Either way, EFF should be able to
handle it. Then, there are problems concerning blocking of tasks. One
solution would be BWI or PEP, but I have not had the time to read
properly through those; from what I've gathered, a combination of BWI
and PEP looks promising (anyone with good info about BWI and PEP - feel
free to share! (-: ).



Henrik


1) Before you freeze at 'global' and get all caught up on "This won't
ever scale to X", or "He will be haunted by Y" - I do not want to
extend a global algorithm to 2000 cores. I would like to scale to a
single *chip* and then we can worry about 2-way and 4-way systems
later. For the record, I've donned my asbestos suit anyway.

2) http://austad.us/kernel/thesis_henrikau.pdf

3) Anyone want to include LaTeX-notation into an email-rfc?



--
-> henrik



Apparently I showed up on the Linux Kernel Mailing List (LKML) Summary Podcast that Jon Masters runs. You are amazing, man. How do you manage to watch all of this?

http://www.kernelpodcast.org/2009/07/08/20090705-linux-kernel-podcast/

OOM. Following up to recent discussion concerning noswap related patches triggering excessive OOM kill scenarios, and simply in reaction to the general mess that is trying to figure out exactly why an OOM occurred, Kosaki Motohiro posted an “OOM analysis helper” patch series which adds a number of statistics to the output produced by the kernel on an OOM condition.



http://www.phoronix.com/scan.php?page=article&item=ext4_btrfs_nilfs2&num=1

Since they are presumably benchmarking with default parameters, I assume ext3 was in writeback mode.

Summary
・The SQLite test finishes in 20 seconds on ext3, but took 870 seconds on ext4 and a full 1472 seconds on btrfs
・PostgreSQL's pgbench could not run to completion on btrfs or XFS
・IOZone
Write: ext3: 107MB/s, ext4: 131MB/s, Btrfs: 89MB/s
Read: ext3: 202MB/s, ext4: 219MB/s, Btrfs: 93MB/s
Btrfs is slow...
・Dbench: ext3: 100MB/s, ext4: 32MB/s, Btrfs: 46MB/s
As expected, ordered mode is just too weak at parallel IO. I suspect Btrfs could be tuned a bit further, though
・PostMark is an overwhelming win for ext4, though I have no real idea what that benchmark is doing in the first place
・On BlogBench (a web-server workload benchmark), Btrfs does well
Probably because when an application does mostly append writes, the IO no longer goes through the filesystem's COW path

It has been published.

http://www.atmarkit.co.jp/flinux/rensai/watch2009/watch06a.html

Apparently it has been decided that the sharing advice will be given via madvise.
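
The entry does not say what the advice is for; my assumption is that it refers to KSM's opt-in via madvise, in which case marking an anonymous region as mergeable would look roughly like this (an illustration only, requiring a kernel with KSM):

/* Assumption: the entry refers to KSM's madvise-based opt-in.  If so, a
 * process would mark an anonymous mapping as mergeable roughly like this. */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4 * 1024 * 1024;
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

#ifdef MADV_MERGEABLE
    /* hint to the kernel that identical pages in this range may be merged */
    if (madvise(buf, len, MADV_MERGEABLE) != 0)
        perror("madvise(MADV_MERGEABLE)");
#else
    fprintf(stderr, "MADV_MERGEABLE is not defined on this system\n");
#endif

    munmap(buf, len);
    return 0;
}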
