Jul
24
[原]解决OCFS2的o2net_connect_expired问题
接上次的文章,在修改/etc/sysconfig/o2cb的配置后,发现两机器只有一台可以自动挂载ocfs2分区,而另外一台不能自动挂载。但启动完毕后,手动挂载正常。
一、详细情况
两机器分别是dbsrv-1和dbsrv-2,使用交叉线做网络心跳,并在cluster.conf中使用私有心跳IP,非公用IP地址。
1、检查o2cb状态
启动后,o2cb服务是启动正常的,ocfs2模块也加载正常的,但心跳是Not Active:
2、检查/etc/fstab文件
配置正确;
3、检查两机器的/etc/ocfs2/cluster.conf内容
已经确认,两机器该文件是完全相同的。
4、查看系统日志
报错信息如下:
二、分析问题
1、node节点的启动顺序
从Google搜索到如此的信息:
说明o2cb启动的时候,是根据node节点的大小顺序启动的。
而在cluster.conf中,node0是dbsrv-2,node1是dbsrv-1,所以,dbsrv-1在启动的时候马上可联通本机IP,然后挂载ocfs2分区;但dbsrv-2启动的时候,则不能即时发现对方IP地址,所以启动失败。
2、尝试修改HEARTBEAT_THRESHOLD参数
从Goolge搜索到另外一条信息:
并参考网上的资料,修改/etc/sysconfig/o2cb的HEARTBEAT_THRESHOLD参数为301,启动后报:
问题依旧。
※注释
综上所述,已经能清楚所有配置都是正确的。
导致故障的原因是:
在启动o2cb服务的前,由于某些原因,o2cb依赖的IP地址未能及时取得联系,操作了其限定的时间,而启动失败。而在机器完整启动后,网络已经正常,所以,手动挂载ocfs2分区正常。
内文分页: [1] [2]
一、详细情况
两机器分别是dbsrv-1和dbsrv-2,使用交叉线做网络心跳,并在cluster.conf中使用私有心跳IP,非公用IP地址。
1、检查o2cb状态
启动后,o2cb服务是启动正常的,ocfs2模块也加载正常的,但心跳是Not Active:
引用
Checking heartbeat: Not Active
2、检查/etc/fstab文件
引用
#cat /etc/fstab|grep ocfs2
/dev/sdc1 /oradata ocfs2 _netdev,datavolume,nointr 0 0
/dev/sdc1 /oradata ocfs2 _netdev,datavolume,nointr 0 0
配置正确;
3、检查两机器的/etc/ocfs2/cluster.conf内容
引用
# more /etc/ocfs2/cluster.conf
node:
ip_port = 7777
ip_address = 172.20.3.2
number = 0
name = dbsrv-2
cluster = ocfs2
node:
ip_port = 7777
ip_address = 172.20.3.1
number = 1
name = dbsrv-1
cluster = ocfs2
cluster:
node_count = 2
name = ocfs2
node:
ip_port = 7777
ip_address = 172.20.3.2
number = 0
name = dbsrv-2
cluster = ocfs2
node:
ip_port = 7777
ip_address = 172.20.3.1
number = 1
name = dbsrv-1
cluster = ocfs2
cluster:
node_count = 2
name = ocfs2
已经确认,两机器该文件是完全相同的。
4、查看系统日志
报错信息如下:
引用
Jul 20 19:33:18 dbsrv-2 kernel: OCFS2 1.2.3
Jul 20 19:33:24 dbsrv-2 kernel: (4452,0):o2net_connect_expired:1446 ERROR: no connection established with node 1 after 10 seconds, giving up and returning errors.
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_request_join:786 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_try_to_join_domain:934 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_join_domain:1186 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_register_domain:1379 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):ocfs2_dlm_init:2009 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):ocfs2_mount_volume:1062 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: ocfs2: Unmounting device (8,33) on (node 0)
Jul 20 19:33:26 dbsrv-2 mount: mount.ocfs2: Transport endpoint is not connected
Jul 20 19:33:26 dbsrv-2 mount:
Jul 20 19:33:26 dbsrv-2 netfs: Mounting other filesystems: failed
Jul 20 19:33:24 dbsrv-2 kernel: (4452,0):o2net_connect_expired:1446 ERROR: no connection established with node 1 after 10 seconds, giving up and returning errors.
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_request_join:786 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_try_to_join_domain:934 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_join_domain:1186 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_register_domain:1379 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):ocfs2_dlm_init:2009 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):ocfs2_mount_volume:1062 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: ocfs2: Unmounting device (8,33) on (node 0)
Jul 20 19:33:26 dbsrv-2 mount: mount.ocfs2: Transport endpoint is not connected
Jul 20 19:33:26 dbsrv-2 mount:
Jul 20 19:33:26 dbsrv-2 netfs: Mounting other filesystems: failed
二、分析问题
1、node节点的启动顺序
从Google搜索到如此的信息:
引用
Mount triggers the heartbeat thread which triggers the o2net
to make a connection to all heartbeating nodes. If this connection
fails,the mount fails. (The larger node number initiates the connection
to the lower node number.)
to make a connection to all heartbeating nodes. If this connection
fails,the mount fails. (The larger node number initiates the connection
to the lower node number.)
说明o2cb启动的时候,是根据node节点的大小顺序启动的。
而在cluster.conf中,node0是dbsrv-2,node1是dbsrv-1,所以,dbsrv-1在启动的时候马上可联通本机IP,然后挂载ocfs2分区;但dbsrv-2启动的时候,则不能即时发现对方IP地址,所以启动失败。
2、尝试修改HEARTBEAT_THRESHOLD参数
从Goolge搜索到另外一条信息:
引用
After confirming with Stephan, this problem appears to relate to the HEARTBEAT_THRESHOLD parameter as set in /etc/sysconfig/o2cb. After encountering this myself and having confirmed with a couple of other people in the list that it has caused problems, it seems that the default threshold of 7 is possibly too short, even in reasonably fast server-storage solutions such as an HP DL380 Packaged Cluster.
Does the OCFS2 development team also consider this to be too short, or is altering the paramater just a workaround that shouldn't be used? If this is the case then how should we approach the problem of self-fencing nodes?
Also, can we expect this behaviour with some platforms but not others, or is it too short for all platforms? If it is a blanket problem, then should the default threshold be raised?
Finally, if the altering the threshold is a valid solution, could it please be added to the FAQs and the user guide so that people know to adjust it as a first step on encountering the problem, rather than having to post to the list and wait for replies.
Does the OCFS2 development team also consider this to be too short, or is altering the paramater just a workaround that shouldn't be used? If this is the case then how should we approach the problem of self-fencing nodes?
Also, can we expect this behaviour with some platforms but not others, or is it too short for all platforms? If it is a blanket problem, then should the default threshold be raised?
Finally, if the altering the threshold is a valid solution, could it please be added to the FAQs and the user guide so that people know to adjust it as a first step on encountering the problem, rather than having to post to the list and wait for replies.
并参考网上的资料,修改/etc/sysconfig/o2cb的HEARTBEAT_THRESHOLD参数为301,启动后报:
引用
Jul 23 13:59:50 dbsrv-2 kernel: (4477,0):o2hb_check_slot:883 ERROR: Node 1 on device sdc1 has a dead count of 14000 ms, but our count is 602000 ms.
Jul 23 13:59:50 dbsrv-2 kernel: Please double check your configuration values for 'O2CB_HEARTBEAT_THRESHOLD'
Jul 23 13:59:54 dbsrv-2 kernel: OCFS2 1.2.3
Jul 23 14:00:00 dbsrv-2 kernel: (4449,0):o2net_connect_expired:1446 ERROR: no connection established with node 1 after 10 seconds, giving up and returning errors.
Jul 23 14:00:00 dbsrv-2 kernel: (4475,2):dlm_request_join:786 ERROR: status = -107
Jul 23 13:59:50 dbsrv-2 kernel: Please double check your configuration values for 'O2CB_HEARTBEAT_THRESHOLD'
Jul 23 13:59:54 dbsrv-2 kernel: OCFS2 1.2.3
Jul 23 14:00:00 dbsrv-2 kernel: (4449,0):o2net_connect_expired:1446 ERROR: no connection established with node 1 after 10 seconds, giving up and returning errors.
Jul 23 14:00:00 dbsrv-2 kernel: (4475,2):dlm_request_join:786 ERROR: status = -107
问题依旧。
※注释
引用
[隔离时间(秒)] = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2
(301 - 1) * 2 = 600 秒
(301 - 1) * 2 = 600 秒
综上所述,已经能清楚所有配置都是正确的。
导致故障的原因是:
在启动o2cb服务的前,由于某些原因,o2cb依赖的IP地址未能及时取得联系,操作了其限定的时间,而启动失败。而在机器完整启动后,网络已经正常,所以,手动挂载ocfs2分区正常。
