Wednesday, January 30, 2013

Writing with FUSE / NFS client in GlusterFS, distributed mode


Out of curiosity ;)
We know that the GlusterFS server provides both FUSE and NFS interfaces.
What's the difference?
I ran a test in a VM environment.
All VMs use CentOS 6.3, the stock 2.6.32 kernel, and GlusterFS 3.3.1.

server: gluster volume "test"
c6:/brick1
c61:/brick2
client: c6c
iftop was installed from the EPEL repository to see how the data flows.

Differences between the two mount methods in Gluster distributed mode (1)

Out of curiosity...
As mentioned before, GlusterFS can be mounted with either a FUSE client or an NFS client.
But what difference does that make in practice?
I ran a test, in an environment built on VMs.
All of them run CentOS 6.3, the stock 2.6.32 kernel, and GlusterFS 3.3.1.

server: gluster volume "test"
c6:/brick1
c61:/brick2
client: c6c
The three machines are connected on the same subnet, and I installed iftop from the EPEL repository to watch the network traffic and see how the data is transferred.
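For reference, watching the traffic is as simple as this (eth0 is an assumption; substitute your actual interface):
yum install iftop
iftop -i eth0   # live per-connection bandwidth between hosts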

First, using the NFS client, write data into /data.
You can see that all of c6c's data goes to c6, i.e., the NFS server; c6c sends no data directly to c61.
Once the data reaches c6, c6 then forwards part of it on to c61.

Switching to the FUSE client:
c6c distributes the data directly to both c6 and c61 as it writes.
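For reference, the two client mounts on c6c look roughly like this (a sketch; Gluster's built-in NFS server speaks only NFSv3 over TCP, hence the mount options):
mount -t nfs -o vers=3,tcp c6:/test /data    # NFS: every byte funnels through c6
mount -t glusterfs c6:/test /data            # FUSE: c6c writes to c6 and c61 directly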

So we can conclude that FUSE is more economical with network bandwidth: with NFS, the data destined for c61 is effectively transmitted one extra time.
A corollary: with many clients, you should avoid pointing every NFS mount at the same Gluster server, or its bandwidth will become the bottleneck.

I haven't tested the read case yet; I'll try it when I have time, though I expect it to be much the same...

Note on Using Drobo with GlusterFS

(Though I really doubt that anyone on this planet is unfortunate enough to use the same configuration as me lol)

OK. Here we know that:
1. Under Linux, Drobo can only use ext3, with a maximum volume size of 8TB.
2. Using 2 volumes on CentOS 6.3 requires a USB connection and kernel-ml 3.7.x from the ELREPO repository.

Setting up single-machine GlusterFS on this box goes smoothly, except for one thing:
attempting to ls the mounted GlusterFS volume hangs.
How does one use the volume if one can't see its contents?

The bug is caused by this (reference):
the ext4 hash function in fs/ext4/dir.c changed in kernel 3.3.x.
The same change was made to ext3 in kernel 3.7.x, so the kernel-ml from ELREPO is affected by it.
And since GlusterFS relies on that hash to process the metadata, listing breaks.

So my problem was to find a kernel new enough to correctly detect multiple USB volumes on the Drobo, yet old enough to keep the ext3 hash that works with GlusterFS.
It turned out that kernel.org version 3.4.27 is OK.
I compiled this kernel myself, since no ready-made RPM could be found.
How to compile kernel.org kernels on CentOS can be found here.
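In case that link rots, the procedure is roughly as follows (a sketch, not the exact recipe from the linked page; the mirror URL and -j4 are assumptions):
yum groupinstall "Development Tools"
yum install ncurses-devel
wget https://www.kernel.org/pub/linux/kernel/v3.x/linux-3.4.27.tar.bz2
tar xjf linux-3.4.27.tar.bz2 && cd linux-3.4.27
cp /boot/config-$(uname -r) .config   # start from the running kernel's config
make oldconfig                        # accept defaults for options new since 2.6.32
make -j4 && make modules_install install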

Using GlusterFS client / server on the same machine

Unlike Lustre (as stated in its README), GlusterFS can run the server and client on the same machine.
It's even simpler because there's only one machine to setup.

I'm using HP z620 with CentOS 6.3, hostname "hydro1", with 2 Drobo S (4 volumes total).

Installing the software is as simple as:
Download the glusterfs repo file into /etc/yum.repos.d
yum install glusterfs glusterfs-server glusterfs-fuse

Start the glusterd service:
service glusterd start
chkconfig glusterd on (to make it auto-start during boot)

In my case the bricks are mounted at /bricks/1~4, ext3 format.

Issue the command to create and start the GlusterFS volume "drobo":

gluster volume create drobo hydro1:/bricks/1 hydro1:/bricks/2 hydro1:/bricks/3 hydro1:/bricks/4
gluster volume start drobo

Note that although we're using the local machine for the bricks, the command does not accept "localhost" as the brick prefix.
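If you want to double-check before mounting, the standard inspection command works here too (output omitted; it lists the volume type, status, and bricks):
gluster volume info drobo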

The rest is simple:
mount hydro1:drobo /data -t glusterfs

You can also mount with an NFS client, but be sure to install and start the NFS-related services (rpcbind, nfslock, etc.) before glusterd starts.
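A sketch of the NFS mount, assuming the same volume as above (the built-in Gluster NFS server only speaks NFSv3 over TCP, so the options matter on CentOS 6):
mount -t nfs -o vers=3,tcp hydro1:/drobo /data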

Problems when GlusterFS meets Drobo

Why are there so many problems orz

OK, a quick recap.
Under Linux, Drobo can only use ext3, with a maximum volume size of 8TB.
It has to be connected over USB, and on CentOS 6 you have to switch to ELREPO's kernel-ml (3.7.x) to detect the full capacity.
And GlusterFS, which "in theory" doesn't care about the kernel version (we'll soon see how badly that went), can merge the space into one.

So what is the problem, given how rosy everything looked in the previous single-machine GlusterFS hack post?

If you actually try it, you'll find that the mounted GlusterFS volume (/data) can be written to just fine; both cp and rsync work without problems.
But when you try to ls its contents... it hangs. orz
How is anyone supposed to use it like that!

The problem lies here: (reference)
In short, kernel 3.3.x changed the hash function in ext4's dir.c, and since GlusterFS relies on that hash to handle metadata, it falls over.
As of kernel 3.7.x, ext3 received a similar change, so now even ext3 + GlusterFS is broken...
That is why ls hangs.

OK, given these constraints, we have to find a kernel version that can:
1. Detect both volumes of the Drobo over USB, with the correct capacities.
2. Still have the old ext3 hash function.

After some digging, kernel 3.4.x turned out to satisfy both conditions. So in the end I solved the problem by compiling kernel.org 3.4.27 myself... who said it doesn't care about the kernel (flips table)
For how to compile kernel.org kernels under CentOS, see here.

Playing with GlusterFS (single-machine hack edition)

Last time I mentioned wanting a way to merge the 4 Drobo volumes into one space.
Because Drobo can only work with ext3, the two usual approaches, LVM and mdadm, are both unusable.
And if the underlying filesystem has to be ext3... after much thought, GlusterFS is about the only option.
So let's play with it (or get played by it)...

Since GlusterFS is built on top of FUSE and runs in user space, there's none of the kernel-compiling hassle. (Yes, Lustre, I'm talking about you XD)
Also, it sits on an ordinary filesystem underneath, so ext3, ext4, XFS, and so on all work, as long as extended attributes are available.
Compared with typical distributed filesystems such as Lustre, GlusterFS hashes the file name and stores some metadata in the extended attributes of the underlying files, so it needs no metadata server; that is its distinctive point.
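You can actually peek at those attributes on a brick (a sketch; the file name is hypothetical, getfattr comes from the attr package, and root is needed to read the trusted.* namespace):
getfattr -m . -d -e hex /bricks/1/somefile   # shows e.g. trusted.gfid set by GlusterFS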
Also, although it was originally designed as a server-client architecture over the network, GlusterFS allows a machine to act as server and client at the same time, which suits our case nicely. Lustre, by contrast, cannot do this.

Installation is super simple.
On CentOS, fetch the repo file from the Gluster website into /etc/yum.repos.d, and then:
(server)
yum install glusterfs-server glusterfs-fuse glusterfs fuse fuse-libs
service glusterd start
chkconfig glusterd on
(client)
yum install glusterfs-fuse glusterfs fuse fuse-libs
That's it XD

In this example the server and the client are the same machine, so installation is even simpler.
I first mounted the 4 ext3 volumes from the Drobos at:
/bricks/1 ~ /bricks/4
Then set it up:
gluster volume create drobo hydro1:/bricks/1 hydro1:/bricks/2 hydro1:/bricks/3 hydro1:/bricks/4
(hydro1 is the hostname; even though it is the local machine, you cannot use localhost)
gluster volume start drobo
And the server side is done. Ridiculously simple XD

Next comes the client mount.
GlusterFS supports two client modes: native GlusterFS (FUSE), and NFS.
If you want NFS, though, remember that services such as rpcbind and nfslock must also be installed, and must start before glusterd does; a sketch follows below.
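A minimal sketch of what that means on CentOS 6 (stock package and service names; restarting glusterd afterwards takes care of the ordering):
yum install nfs-utils rpcbind
service rpcbind start; service nfslock start
chkconfig rpcbind on; chkconfig nfslock on
service glusterd restart   # glusterd's built-in NFS server registers itself with rpcbind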
For the native mount, you've probably guessed the command already:
mount hydro1:drobo /data -t glusterfs
Done. (sips tea)
And then you'll see:


[root@hydro1 data]# df -h /data /bricks/*
Filesystem            Size  Used Avail Use% Mounted on
hydro1:/drobo          19T  4.1T   15T  22% /data
/dev/sda1             4.7T 1000G  3.7T  21% /bricks/1
/dev/sdb1             4.7T  1.1T  3.7T  23% /bricks/2
/dev/sdc1             4.7T  1.1T  3.7T  22% /bricks/3
/dev/sdd1             4.7T  1.1T  3.7T  23% /bricks/4


The capacity of /data is the sum of bricks/1~4, which is exactly what I wanted.
If you get curious and look inside the bricks, you'll see the data spread across them with whole files as the unit.
This is GlusterFS's default mode: distributed mode.
The upside is simplicity, and because the unit is a whole file, a dead brick doesn't take all the data with it.
GlusterFS also offers striped mode (similar to RAID 0) and replicated mode (similar to RAID 1), but since I'm short on space and a coward (lol) I didn't test them~
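For the record, creating those two modes would look like this (sketches only, untested as noted above; the volume names are made up, and a brick can belong to only one volume at a time, so this assumes a fresh layout):
gluster volume create drobo-rep replica 2 hydro1:/bricks/1 hydro1:/bricks/2   # each file kept on both bricks (RAID 1-like)
gluster volume create drobo-str stripe 2 hydro1:/bricks/3 hydro1:/bricks/4    # each file chunked across both bricks (RAID 0-like)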



Experiences with Drobo on Linux

OK, this has been troubling me for 2 weeks...
Some notes to keep things on record, and hopefully to help others ;)

First of all, Drobo Inc. DOES NOT CLAIM TO SUPPORT LINUX.
It's not unusable, but there are indeed some limits.

Currently our lab has 2 Drobo S units (a bit old), each with five 3TB hard disks.
This box has 5 bays (max 4TB per bay), two FireWire 800 ports (for daisy-chaining), one USB 3.0 port, and one eSATA port.
The computer paired with them is an HP z620 workstation running CentOS 6.3.

Here are the main points and experiences:
1. There's no Drobo Dashboard for Linux. You can set up the Drobo's volumes on Mac / Windows, and then connect it to the Linux box.
2. The only filesystem that can be used on a Drobo under Linux is ext3. ext4 or XFS? No go, sorry.
3. The maximum ext3 volume size is 8TB. If you have more than that, you'll have to make 2 volumes.
Yes, it can be set to 16TB with the Dashboard and Linux recognizes it, but the Drobo still uses 8TB as the volume capacity limit, so it will report the drive as full once the actual data reaches 8TB, and writing becomes very slow.
4. When using 2 volumes, it becomes picky about the interface.
5. USB 3.0 will present both volumes, but the stock CentOS 6.3 2.6.32.x kernel can't detect the correct sizes: the first volume shows as 8TB but the second as only 2.4TB. This can be solved with kernel-ml (3.7.x) from the ELREPO repository, or kernel.org 3.4.x if you really want to build your own.
6. Using (onboard SATA as) eSATA, at least on this HP box, presents only 1 volume. Using the SAS port on the LSI SAS2008 chip card... no volume at all. (ow!)
7. This computer has no FireWire 800 port (400 only), so I didn't test FireWire.
(Edit: 2013/2/1) Another computer detects only 1 volume with kernel 2.6.32, but upgrading to kernel 3.4.27 solves the problem.

Note that many distros backport patches into their kernels, so the required kernel version may differ on other distros. Another distro that works over USB 3.0 is OpenSUSE 12.1, with its 3.1.x kernel.

Overall, Drobo is a good product, with thin provisioning and great flexibility in hard disk configuration. It's just not that good when used with Linux!

(Our 2 Drobos thus become 4 volumes, which is not easy to use. Combining them into one turned into quite an adventure. Challenge accepted lol)

Drobo and Linux... not a good match

Drobo is an interesting product, especially the BeyondRAID feature.
Simply put, you just plug in all your disks and Drobo automatically uses the largest one for redundancy.
It doesn't force you to use disks of identical capacity either.

Another notable feature... Drobo supports thin provisioning.
That is, even if your disks currently total only 6TB, you can still tell Drobo to set up an 8TB space.
It will report itself to the computer as 8TB; you just keep loading data onto it, and add disks when it's nearly full.

The unit can monitor capacity usage, which means Drobo reaches into the filesystem itself, and that brings compatibility problems.
In short... you cannot use an unsupported filesystem on a Drobo, or all sorts of strange problems appear.
(This differs from an ordinary block device, which normally doesn't care about the FS...)
On Windows and MacOS this is fine, since there are only a few filesystems anyway, but on Linux it's a big problem...

(Note: Drobo does not claim Linux support on its official website!)

What we use here is the Drobo S, a somewhat older model, supporting five 4TB disks (I installed five 3TB ones), with USB 3.0, two FireWire 800 ports (daisy-chainable), and eSATA.
The computer is an HP z620 workstation, running CentOS 6.3.

After much struggle, here are the conclusions:
1. Drobo's utility, Drobo Dashboard, has no Linux version, so the configuration work has to be done on Windows or Mac first.
2. Under Linux, Drobo can only use ext3; filesystems such as ext4 or XFS won't work.
3. An ext3 volume can be at most 8TB, so if your Drobo holds more than 8TB, you have to split it into two volumes.
You can set 16TB in the Dashboard and the OS will see it, but Drobo still calculates against an 8TB ceiling; even with enough physical disk space, once you store 8TB of data, Drobo decides the space is full and asks for more disks.
4. Even when split into two, this machine's onboard SATA wired out as eSATA detects only the first volume. Connected to the SAS port on the LSI SAS2008 chip card... it detects none at all orz
5. USB detects both volumes, but the CentOS 6.3 2.6.32 kernel cannot see the full capacity: the first volume shows 8TB but the second only 2.4TB. This can be solved with the kernel-ml kernel (3.7.x) from the ELREPO repository; even kernel-lt (3.0.x) is not new enough.
6. FireWire 800... the computer has no FireWire 800 port, so I didn't test it.
(2013/2/1) Tested on another machine today: with the 2.6.32 kernel only one volume is detected, but after updating to 3.4.27 both are detected.
7. Performance... uh... about 60MB/s on this unit. Isn't that slower than a single disk (flips table)

Since CentOS / RH backport many things into their kernels, the version information may not carry over to other distros. So far I've tried OpenSUSE 12.1 with its 3.1.x kernel, and the full USB capacity is detected.

All in all, Drobo isn't bad; paired with Windows or Mac it's actually a fairly complete solution, fine in both hardware and software. But paired with Linux... honestly, not a good fit XD

I have two identical Drobo S units here, which after all this end up as 4 volumes, and that's awkward to use. So I came up with some ideas, but things only got messier... stay tuned (in the tone of TV narrator Sheng Chu-ju)