Soramichi's blog

Some seek complex solutions to simple problems; it is better to find simple solutions to complex problems

Ochanomizu University to accept non-women

Ochanomizu University is going to accept non-women who identify as female, in addition to any women (including those who identify as male or as neutral). I think it is a very large step, and I want them to keep going rather than stop here.

Note for those who are not familiar with Japanese universities: it has been a women-only university, but it will introduce the new policy next year.

In the short term, there is a strong need to support the students admitted under the new policy, not only inside the school but also around it. Do all apartment owners accept these students? I don't think so. Do all stations in Tokyo have universal restrooms? No, they don't. I hope this new policy will have an impact on the society around the school as well.

In the long term, I believe that now, in this information era, is the time to reconsider the long-established concept of male vs. female. Why do we need to distinguish the two and map every single person to one of them? Why should biological differences affect the way of thinking? How can we find a compromise in the real world, where some do not want to be distinguished (as opposed to discriminated against) while others feel comfortable being categorized? I hope this new policy will trigger constructive discussions, not the meaningless quarrels usually seen around equality-related topics.

Public Welfare (for the Constitution Memorial Day of Japan)

The Japanese constitution has several clauses that have the phrase "for the public welfare" in them. Basically what it says is that the fundamental human rights are respected if (and only if) they do not conflict with the "public welfare".

I have long thought that this "public welfare" is so vague that it can be used to condemn almost anything, because the constitution does not actually define it. The vagueness may not be a problem if the people are culturally, religiously, morally and biologically "uniform", but obviously that is not the case. In an extreme case, if say 99% of people think that something is ugly, then that thing can be criminalized in order to promote the public welfare (people actually do claim that things they hate must be criminalized, and they often cite public welfare, although to my understanding things rarely end up really being criminalized).

It turns out that this vagueness has also been pointed out by the United Nations. The Human Rights Committee, which monitors the International Covenant on Civil and Political Rights, expressed its 'concern that the concept of "public welfare" is vague and open-ended' (page 8 of http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=CCPR/C/JPN/CO/6&Lang=En ). There is a very good read (in Japanese) written by a professor of Osaka Sangyo University.

I guess there are two issues here. First, people really do not care, because for the majority who share the same way of thinking, something against the public welfare is something they do not want anyway. So condemning something against the public welfare is actually good for them! It may even be that people do not notice that the phrase "public welfare" is vague. Second, having a constructive and objective discussion on how to change the constitution is very difficult in Japan, because people are paranoid about Article 9, which concerns war and armed forces (and so am I).

'who am i' does not work in recent GNOME-terminal (and MATE-terminal)

The 'who am i' idiom and the problem

The *unix command who is used to "show who is logged on" (cf. https://linux.die.net/man/1/who ), and who am i, which is equivalent to who -m, is an idiomatic usage of who that shows only the user who invokes the command. So it tells who you are. A funny thing is that any strings work as the arguments: the command only checks whether the number of arguments is 2 and executes who -m if it is. Thus who mother loves yields the same result (the manual itself gives 'am i' and 'mom likes' as the usual choices).

The problem is that it does not work in recent GNOME-terminal, MATE-terminal, or any other terminal emulator that relies on libvte. In these terminal emulators, who am i just prints nothing. I think this is a critical issue, because who am i is used in many shell scripts to retrieve the current user name.


How it is implemented

Before explaining why it doesn't work, I describe how it is implemented. The who.c file of GNU coreutils relies on a file named /var/run/utmp. This file contains the login records of the system, and what who.c does is basically just open the file, parse it and print the entries.

Each entry of /var/run/utmp records a username and the login information of that user, such as the pid of the login process and the device name of the terminal the user is associated with (see the manual for the details). who am i compares the terminal device name in each entry with the string obtained by ttyname(3), and prints the entry if they match.

  if (my_line_only) // when -m is specified
    {
      // retrieves the device name of stdin
      ttyname_b = ttyname (STDIN_FILENO);
      ...
    }

  while (n--) // for each entry of /var/run/utmp
    {
      // if -m is not specified, or
      // the device name of stdin is equal to the device name of the terminal associated with this user
      if (!my_line_only
          || STREQ_LEN (ttyname_b, utmp_buf->ut_line,
                        sizeof (utmp_buf->ut_line)))
        {
        // parse and print the entry
        ...
        }

Code: Part of who.c from GNU coreutils (comments added by me)
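The matching logic can be sketched in a few lines of Python. This is only an illustration, not the coreutils code: UtmpEntry and who_am_i are hypothetical names, the entries are made up, and the sketch assumes that utmp stores the terminal name without the /dev/ prefix while ttyname returns the full path.

```python
from dataclasses import dataclass
from typing import List

DEV_PREFIX = "/dev/"


@dataclass
class UtmpEntry:
    """Hypothetical, simplified view of one utmp record."""
    user: str
    line: str  # terminal device name, assumed to lack the /dev/ prefix


def who_am_i(entries: List[UtmpEntry], stdin_tty: str) -> List[UtmpEntry]:
    """Return the entries whose terminal matches the tty behind stdin."""
    # Assumption: strip the /dev/ prefix so "/dev/pts/0" can match "pts/0".
    if stdin_tty.startswith(DEV_PREFIX):
        stdin_tty = stdin_tty[len(DEV_PREFIX):]
    return [e for e in entries if e.line == stdin_tty]


# Made-up records: one console login and one pseudo terminal.
entries = [UtmpEntry("alice", "tty1"), UtmpEntry("alice", "pts/0")]

# The terminal has a record, so its entry is found.
print(who_am_i(entries, "/dev/pts/0"))
# No record exists for this terminal, so nothing matches.
print(who_am_i(entries, "/dev/pts/1"))
```

The second call is the interesting one: when nobody wrote a record for the terminal, the loop finds no match and prints nothing, without any error.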

Why it doesn't work and what to do

The root cause of this problem is that /var/run/utmp is not magically maintained by the OS; it has to be updated by each process, and recent versions of libvte do not do this. This is not a bug in libvte but an intentional choice that was discussed by the developers in a thread. Although a concern about who am i was raised in that thread, the feature was removed anyway to keep the code clean (it seems that the function that updates /var/run/utmp lived in a very nasty file that is almost never used). Because both GNOME-terminal and MATE-terminal use libvte, the problem exists in both of them.

Two easy alternatives are (1) to use xterm, which does not rely on libvte and updates /var/run/utmp by itself, and (2) to use the whoami command, which uses a different mechanism, instead of who am i.
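For scripts written in Python, the same idea as alternative (2) can be sketched as below: look up the effective user ID in the password database instead of consulting utmp. The function name current_user is made up for this example, and the fallback to getpass.getuser() is my own addition for systems where the UID has no passwd entry.

```python
import getpass
import os
import pwd


def current_user() -> str:
    """Name of the user running this process, without consulting utmp."""
    try:
        # Same principle as whoami: map the effective UID to a user name.
        return pwd.getpwuid(os.geteuid()).pw_name
    except KeyError:
        # The UID has no passwd entry (this can happen in some containers),
        # so fall back to environment variables such as LOGNAME or USER.
        return getpass.getuser()


print(current_user())
```

Note that this answers a slightly different question from who am i: it reports the effective user of the process, not the user recorded as logged in on the terminal.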

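Replacing the CPU and fan of a cube PC (SX58H7)

A while ago an acquaintance gave me a cube PC they no longer needed. It gets extremely loud when I keep the CPU under heavy load, and I was slightly unhappy with its performance, so I replaced the CPU and the fan.

It is basically a Shuttle SX58H7 (this one), but because it is a barebone kit, I have no idea which vendor actually installed the CPU and other parts and sold it. Since it is a cube, I was not sure whether I could simply take it apart and swap parts, but once I opened it, it turned out to be designed so that you can disassemble it yourself. (It is a barebone, so of course you can?)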

[Photo] Exterior. The cover of the optical drive is broken.

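The original Core i7 920 is from a generation even older than Sandy Bridge and belongs to what Intel's terminology calls a "legacy processor family". However, the job I want it to run fetches little data and (I think) has few conditional branches, so it is a pure brute-force workload, and I judged that the machine could still do fine with a small upgrade.

There is only one fan. It blows air over a heat sink that extends from the CPU and exhausts the warmed air out of the case. In Shuttle's official assembly video, the heat sink and the other parts around the CPU appear at around the 6:00 mark. In the video the heat sink and the fan look like a single unit, but they are actually separate, so if you only want to replace the fan you just need to remove the screws on the back of the case.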

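[Photo] The case opened. The fan mounted sideways on the left exhausts the heat coming from the heat sink. In front of it is a graphics card, which I do not really need, but its screw head is stripped and I can no longer remove it.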

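The extremely loud fan that came installed is model AD0912UX-A7BGL, and apparently this model is not great. Reviews on Amazon US say things like "it sounds like a Pentium 4 case fan" and "sounds like a meat grinder" (is a meat grinder some kind of meat-crushing machine used to make sausages??). I replaced it with a GELID Silent 9. In Japan, Scythe seems to import and sell it. It was sold for about 1,000 yen in Akihabara. The points I cared about were that its speed can be controlled by PWM and that it is marketed as a quiet product. I probably should have looked at the number of blades, the rotation speed and so on, but I do not know much about fans, so I went with my gut. The original is labeled 90 mm and this one 92 mm, but 90 mm and 92 mm fans are apparently (?) the same size in practice, and it fit without problems into the aluminum frame that surrounds the fan, which is part of the barebone.

Next, the CPU. The job I want to run is purely CPU-bound and can be parallelized as much as you like (it weak-scales), so a higher frequency and more cores are better. Among the CPUs that fit the same socket as the original (LGA1366), the Xeon X5690 has the most cores, and its frequency is also high enough, so ideally that is the CPU to get. The catches were that Xeon CPUs are not common on the second-hand market, and that its microarchitecture is one generation newer than that of the original Core i7 920, so I might have to update the BIOS (and since the barebone itself is old, a newer BIOS might no longer be available for download).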

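In the end I bought a Core i7 965 Extreme Edition at a second-hand shop in Akihabara (as expected, they had no Xeons). Thanks to a year-end sale (I bought it on December 31), it cost 4,900 yen. It seems to go for around 6,500 yen on Yahoo! Auctions, so it was quite a bargain. Its launch price was apparently $990. This was actually my first time buying a used CPU, but unlike storage and power supply units, CPUs almost never break unless they burn because of a cooling failure, so I felt that buying a slightly older CPU second-hand may give the best value for money (especially if you use it at home and do not need new hardware features such as hardware transactional memory or Cache Allocation Technology). Finally I bought some really cheap thermal grease and was done. I have never compared different greases myself, so I do not really know how much they differ.

I was worried because I felt I had slightly scratched the socket pins while replacing the CPU (on LGA1366 the pins are on the socket side, not on the CPU side), but after reassembling, the machine booted fine. The heat sink gets properly hot, so the cheap grease seems to be OK. The new fan probably spins slower than the original one, and the cooling performance feels rather marginal. It is fine in winter, but running at full load all the time in midsummer may not be possible. In exchange, the noise dropped dramatically: up to one step below the maximum fan speed, it is drowned out by the HDD noise of the NAS sitting next to it and I cannot hear it. At the maximum speed I can hear it a little, but it does not feel loud at all.

Most importantly, the performance of the workload improved by a little over 16%. The number of cores, the cache size and the microarchitecture are the same as those of the original CPU, so this is purely the effect of the higher frequency. Note that I made this decision only because I know the workload is completely CPU-bound. In general, memory performance (bandwidth and latency) also matters a lot, so discussing performance by comparing frequencies alone is in fact meaningless in many cases. Also, comparing only the frequencies of machines whose cores are several generations apart is about as meaningless as comparing happiness today and 100 years ago by nominal salary.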

How to configure LBR (Last Branch Record) on Intel CPUs

Introduction

LBR (Last Branch Record) is a functionality to record information about branch instructions that a CPU takes, especially the linear addresses which the CPU has jumped from and to.

The unique point of LBR is that the records are taken 100% by hardware. In contrast, the record btrace functionality of gdb records branches by using the "step execution" mode of a CPU. This mode raises an interrupt on every branch instruction (or on every instruction of any type, depending on the configuration) so that software such as gdb can record information about branches. This is more flexible than pure-hardware recording because software can record any information (such as internal states of the OS scheduler), but the overhead is huge due to the many interrupts. LBR provides almost zero overhead at the cost of reduced flexibility.

This post explains how to configure LBR by actually setting model specific registers (MSRs). On why and how LBR is useful, you can refer to other articles such as this or this.

Configuring LBR

The table below shows MSRs that are important for LBR configurations.

Name | Address | Description
IA32_DEBUGCTL | 0x1d9 | Setting bit 0 of this register to 1 starts LBR recording. Setting it to 0 disables recording.
MSR_LASTBRANCH_x_FROM_IP | 0x680 - 0x69f | x: 0 - 31. The originating addresses of the 32 most recent branches are recorded.
MSR_LASTBRANCH_x_TO_IP | 0x6c0 - 0x6df | x: 0 - 31. The destination addresses of the 32 most recent branches are recorded.
MSR_LBR_TOS | 0x1c9 | "Top of the Stack" of the records. It indicates which MSR holds the most recent record.
MSR_LBR_SELECT | 0x1c8 | Filters the records by conditions such as "do not record when in ring 0".

LBR recording starts merely by setting bit 0 of the IA32_DEBUGCTL MSR to 1. For example, you can do it for all CPU cores with $ sudo wrmsr -a 0x1d9 0x1, or for a specific core (say core #3) with $ sudo wrmsr -p 3 0x1d9 0x1.

The saved records can be retrieved by reading the MSR_LASTBRANCH_x_FROM_IP and MSR_LASTBRANCH_x_TO_IP MSRs. They work like ring buffers whose head is indicated by the MSR_LBR_TOS MSR: the 33rd record is stored into MSR_LASTBRANCH_0_FROM_IP, overwriting the 1st record, and the index of the register that holds the newest record is in MSR_LBR_TOS.
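The ring-buffer behavior is easy to get wrong when post-processing a dump, so here is a minimal Python sketch of the reordering step. It assumes you have already read the 32 FROM/TO pairs and the index from MSR_LBR_TOS by some other means; newest_first and the addresses in the simulation are made up for illustration.

```python
from typing import List, Tuple

NUM_LBR = 32  # number of FROM/TO register pairs, as in the table above


def newest_first(records: List[Tuple[int, int]], tos: int) -> List[Tuple[int, int]]:
    """Order (from_ip, to_ip) pairs from the newest branch to the oldest.

    records[i] holds the values of MSR_LASTBRANCH_i_FROM_IP / _TO_IP and
    tos is the index read from MSR_LBR_TOS.
    """
    n = len(records)
    # Walk backwards from the top of the stack, wrapping around at slot 0.
    return [records[(tos - k) % n] for k in range(n)]


# Simulate 33 branches: branch j is written to slot j % 32, so the 33rd
# branch (j = 32) overwrites the 1st one in slot 0.
records = [(0, 0)] * NUM_LBR
tos = 0
for j in range(33):
    tos = j % NUM_LBR
    records[tos] = (0x1000 + j, 0x2000 + j)

ordered = newest_first(records, tos)
print(hex(ordered[0][0]))   # source address of the newest branch
print(hex(ordered[-1][0]))  # source address of the oldest surviving branch
```

In the simulation the newest entry is the 33rd branch and the oldest surviving one is the 2nd, because the 1st has been overwritten.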

MSR_LBR_SELECT MSR is used to selectively record LBRs. For example, you can record branches only when the CPU is in ring 0 (or only when not in ring 0).

[Figure: screenshot from Intel's manual]

Things to be careful about when using LBRs

There are two things about which you must be very careful.

First, LBRs are cleared when the CPU goes to a sleep state deeper than C2, and there is no configuration to preserve them. C2 is not that deep, so just letting the CPU idle after a workload execution will clear the LBRs that have just been recorded.

I guess the only way to prevent them from being cleared is to force the CPU to stay awake all the time. You can easily do it by adding intel_idle.max_cstate=1 and intel_pstate=disable to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then running $ sudo update-grub and rebooting your machine.

Second, stopping LBR recording is somewhat tricky. Because there are only 32 records, you want to stop the LBRs from being updated as soon as your workload finishes (or is suspended due to an event of interest such as a SEGV). Setting bit 0 of IA32_DEBUGCTL to 0 by hand (or by a script) may not work, because a modern processor executes 32 branches something like a million times faster than the blink of an eye.

The bad news is that the only way the CPU provides to stop LBR recording automatically is to use PMIs (performance monitoring interrupts). If bit 11 of IA32_DEBUGCTL is 1, the CPU "freezes" LBRs when it raises a PMI. I guess this is why gdb does not support retrieving LBRs, although LBR has existed since the ancient ages of 32-bit CPUs.
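As a small illustration of the two bits mentioned so far, the Python sketch below composes a value for IA32_DEBUGCTL and prints the corresponding wrmsr command instead of running it. The constant and function names are my own. Note that writing the whole register this way also sets every other bit of IA32_DEBUGCTL to 0, which this sketch assumes is acceptable.

```python
IA32_DEBUGCTL = 0x1D9
LBR_ENABLE = 1 << 0           # bit 0: record branches
FREEZE_LBRS_ON_PMI = 1 << 11  # bit 11: freeze the LBRs when a PMI is raised


def debugctl_value(enable_lbr: bool, freeze_on_pmi: bool) -> int:
    """Compose the IA32_DEBUGCTL value from the two LBR-related bits."""
    value = 0
    if enable_lbr:
        value |= LBR_ENABLE
    if freeze_on_pmi:
        value |= FREEZE_LBRS_ON_PMI
    return value


# Print the command rather than executing it; it needs root and the msr module.
print(f"sudo wrmsr -a {IA32_DEBUGCTL:#x} {debugctl_value(True, True):#x}")
```

With both bits set, the printed command is sudo wrmsr -a 0x1d9 0x801.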

The good news, however, is that with a software trick you can freeze LBRs as soon as any interrupt is raised. This allows you to safely retrieve LBRs when a workload is stopped by a SIGSEGV or SIGFPE (or whatever interrupt you're interested in).

To do this, you have to add a single line of code that sets bit 0 of IA32_DEBUGCTL to 0 in an exception handler of the Linux kernel. For example, inserting wrmsrl(0x1d9, 0); into do_coprocessor_error and do_simd_coprocessor_error in arch/x86/kernel/traps.c lets the kernel freeze LBRs as soon as it handles an exception that results in a SIGFPE. Because the CPU jumps directly to the handler when an exception occurs, at most 1 record is overwritten (or actually none, if you selectively record branches only in ring > 0).

Tips for enjoying studying abroad

Introduction

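This is the day-16 entry of the 研究留学 Advent Calendar 2017 (an advent calendar about going abroad for research). Others have already written wonderful articles about what research is like at corporate research labs and universities abroad and about how to find an internship, so I will change the topic a little and write about tips for enjoying a stay abroad (including overseas trips such as international conferences). The calendar's description also says "it would be nice if this encourages people who are thinking about taking on the challenge", so I will write from that point of view.

Template

A template is provided, so I include it here, although it has little to do with the content.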

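  • When: May to August 2013
  • Where: Microsoft Research (Redmond, Washington, USA)
  • What I did: Performance improvement of a programming framework for large-scale distributed systems (a rare case that is not HCI, CV or machine learning)
  • How I got there: Introduced through a senior labmate. For more about connections, this article may cover it in detail?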

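The main topic

Now, the main topic. The key to enjoying a stay abroad is, simply, making lots of friends. Not only during a long stay but even at an international conference, after about four nights you may start to feel worn down by fatigue and loneliness, but if you make friends there you can have a good time.

You may be thinking "Come on, I know that!!! My problem is that I don't have any friends!!!!", so I will write concretely what to do and what good things happen when you make many friends, hoping it serves as a reference and an encouragement.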

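What to do #1: Just start talking

When you sit next to someone at the first orientation or meal of your internship, just talk to them. You do not need to say anything clever; something like "Hi, I'm XXXX. What is your research about?" is fine (once the topic turns to research, you should be able to talk forever?). The point here is that the bar for being friends differs between Japan and the US. In Japan, with its village-like society, having talked once at a banquet only gets you to "Hmm, I have talked with that person, but I don't know their name or much about them", whereas in the US, once you have had one conversation you are already friends. This is really true, so take my word for it and just talk to people.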

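What to do #2: Aim for small groups

I said just talk to people, but you may feel your English is not good enough. In that case, go for someone who happens to have an empty seat next to them. In Japan people pick seats as if doing perfect hashing, trying as hard as possible not to sit next to a stranger, but in the US it is not strange at all to sit next to someone with a free seat and chat. Just sit down without worrying. By the way, chit-chat among many people in a noisy place is hard to follow even for people who are quite good at English, so there is no need to feel down because you cannot catch it. Accept that and aim for small groups.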

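What to do #3: Greet with a smile

Once you have talked and become friends, greet them with a smile the next time you pass each other. In Japan you would give a slight bow or, in the worst case, deliberately look away, but in the West you make eye contact and smile when you pass someone. As long as you do this, you can keep the friendship alive even when you are too busy to talk much. Think of it as something like a Poke.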

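What good comes from making friends?

Making friends at your host institution or at an international conference of course keeps you from being lonely during your stay, but it also pays off in many ways after you go home. First, the people who come to places like MSR, CMU or MIT are excellent students from around the world, so you can collaborate on research afterwards or, to put it more bluntly, gain connections. Also, if you make friends during your stay, they may let you stay at their place or show you around when you travel abroad. So far, a friend I met at a conference has put me up in Switzerland, another showed me around MIT in Boston, and a friend I met during my internship put me up in New York. (Please read this not as showing off but as "look how much fun making friends can bring!!!")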

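Advanced: choosing topics

I wrote that any topic is fine for starting a conversation, but people often seem pleased when you say something like "I know XXX from your country!". However, some topics that people living in Japan would not think twice about are politically sensitive for the other person's country, so if you are not confident in your social studies knowledge, it is probably safer to skip this and talk about research.

Closing

I am not very good at writing prose in Japanese, so parts of this may be hard to follow or a bit odd, but as I wrote at the beginning, I hope it encourages and helps those who are about to study abroad.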

Dataflow Analysis to Semi-Automatically Find Chainer Bugs

Preface

As a system software researcher working for an (you know, one of many) "artificial intelligence research center", I use Chainer to explore what kind of system characteristics/supports the real AI applications need. Chainer is really good for this purpose because the framework itself is really simple so it is easy to hack as you wish.

Although the framework is intensively maintained, I sometimes happen to find bugs, especially when I use it in a slightly unusual way. This post explains a tiny, tiny idea I came up with to (kind of) semi-automatically find a certain type of bug in Chainer.

The Idea

So the idea is "the forward and the backward propagations for the same calculation are supposed to do similar things, especially for preparation". For example, both forward and backward of the linear function convert the first input into a matrix and assign the result to x (x = _as_mat(inputs[0])), and assign the second input to W (W = inputs[1]).

Given this idea, I extract all assignments for each variable, and compare the extracted assignments between the forward and backward functions. If there is a variable with the same name in forward and backward but with different assignments, it might be a bug. In the linear example, both x and W have the same assignments in forward and backward.
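To make the idea concrete, here is a minimal sketch of such a comparison using Python's ast module. It is not the script used in this post, only an illustration under simplifying assumptions: it handles plain and tuple assignment targets only, it compares the right-hand sides structurally, and the toy class and the function names (target_names, assignments, differing_variables) are made up for the example.

```python
import ast
from collections import defaultdict


def target_names(target):
    """Variable names bound by an assignment target (plain names and tuples)."""
    if isinstance(target, ast.Name):
        return [target.id]
    if isinstance(target, (ast.Tuple, ast.List)):
        return [name for elt in target.elts for name in target_names(elt)]
    return []  # attributes, subscripts, etc. are ignored in this sketch


def assignments(func):
    """Map each assigned variable to the dumps of its right-hand sides."""
    found = defaultdict(list)
    for node in ast.walk(func):
        if isinstance(node, ast.Assign):
            rhs = ast.dump(node.value)  # structural dump, no line numbers
            for target in node.targets:
                for name in target_names(target):
                    found[name].append(rhs)
    return found


def differing_variables(source, fwd="forward", bwd="backward"):
    """Names assigned in both functions but with different right-hand sides."""
    funcs = {n.name: n for n in ast.walk(ast.parse(source))
             if isinstance(n, ast.FunctionDef)}
    a, b = assignments(funcs[fwd]), assignments(funcs[bwd])
    return sorted(v for v in a.keys() & b.keys() if sorted(a[v]) != sorted(b[v]))


# Toy input: backward forgets the extra conversion applied to b in forward.
SOURCE = '''
class Linear:
    def forward(self, inputs):
        x = _as_mat(inputs[0])
        W = inputs[1]
        b = inputs[2]
        b = ascontiguousarray(b)

    def backward(self, inputs, gy):
        x = _as_mat(inputs[0])
        W = inputs[1]
        b = inputs[2]
'''

print(differing_variables(SOURCE))
```

On this toy input only b is reported, because x and W are assigned identically in both functions.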

Bugs It Found

Let's see how it works. Here is the code I wrote to extract the assignments and compare them. You should set the names of forward and backward functions by hand (l13 and l15), depending on whether they are vanilla forward/backward, forward_cpu/backward_cpu, or forward_gpu/backward_gpu.

Clone the Chainer repository and check out a commit from before the bug I found with this method was fixed. After that, apply my script to chainer/chainer/functions/connection/deconvolution_2d.py.

$ git clone chainer && cd chainer
$ git checkout e6a7ec62773f0df0e3e0
$ ~/src/chainer_dataflow/chainer_dataflow.py chainer/functions/connection/deconvolution_2d.py
different data flow! ( b )
forward:
111 b = inputs[2] if len(inputs) == 3 else None
137 b = cuda.cupy.ascontiguousarray(b)
backward:
228 b = inputs[2] if len(inputs) == 3 else None
--------------------------------------------------
different data flow! ( kh )
forward:
123 kh, kw = W.shape[2:]
backward:
242 _, out_channels, kh, kw = W.shape
--------------------------------------------------
different data flow! ( kw )
forward:
123 kh, kw = W.shape[2:]
backward:
242 _, out_channels, kh, kw = W.shape
--------------------------------------------------
different data flow! ( c )
forward:
125 c = W.shape[1]  # out_c
backward:
243 c, h, w = gy.shape[1:]
--------------------------------------------------
different data flow! ( algo )
forward:
160 algo = libcudnn.getConvolutionBackwardDataAlgorithm(
165 algo = cuda.cupy.cuda.cudnn.CUDNN_CONVOLUTION_BWD_DATA_ALGO_1  # NOQA
backward:
258 algo = libcudnn.getConvolutionForwardAlgorithm(
283 algo = libcudnn.getConvolutionBackwardFilterAlgorithm(
288 algo = cuda.cupy.cuda.cudnn.CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1  # NOQA
--------------------------------------------------

There are many outputs, but (unfortunately) only the first one (b) is relevant here. The output shows that, in forward, b is assigned from inputs[2] on line 111 and converted to a C-contiguous array on line 137. In backward, however, b is assigned on line 228 and that is it, with no conversion to C-contiguous, which is a bug (#2666). In the same way, it can also find a similar bug such as #2582 (do not forget to set l13 and l15 of chainer_dataflow.py to forward and backward, instead of forward_gpu and backward_gpu). That bug fix is actually the one that motivated me to try this idea.

Here's another example:

$ git checkout e6a7ec62773f0df0 # same commit as the above
$ ~/chainer_dataflow.py chainer/functions/connection/dilated_convolution_2d.py
...
...
--------------------------------------------------
different data flow! ( x_desc )
forward:
133 x_desc = cudnn.create_tensor_descriptor(xji)
backward:
247 x_desc = cudnn.create_tensor_descriptor(x)
--------------------------------------------------

In this case x_desc is assigned tensor descriptors created from different tensors, which turned out to be not a critical bug but a naming inconsistency (#2665).

Limitations and Potential Extensions

Because both the idea and the script are very simple, of course there are many limitations. The aim of this post is not to be "a research paper that claims super novelty", but to share the idea with other people in the hope that they may come up with a more clever idea based on mine, which will be beneficial to the whole community. One obvious limitation is that it yields a loooot of false positives. It might be useful to define a threshold of "relevant difference level".

A possible extension I have in mind is to compare the code between forward_cpu and forward_gpu, instead of between forward and backward. This is based on the thought that some preparation code must be shared by the cpu mode and the gpu mode. For example, #2589 fixed a missing assertion in the gpu mode code that already existed in the cpu mode code.