Use Dubbo + Nacos This Way and You Lose High Availability

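The microservice stack we usually work with is the Spring Cloud family, and it offers plenty of options for both remote service calls and the registry. For remote calls, common choices include Feign and Dubbo; for the registry, common choices include Nacos, Zookeeper, Consul, and Eureka. Our project settled on Dubbo for RPC calls and Nacos as both the registry and the configuration center.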
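1. How the incident began

The project had been running smoothly for several years. One day we noticed that memory usage on the Nacos cluster servers was somewhat high, so we decided to upgrade the machine configuration and restart them. We got straight to it: in the test environment, which runs a three-node Nacos-Server cluster, we picked one node at random and shut it down. Let's call it Nacos-Server-1. That is where the trouble started.

As soon as the node went down, the interfaces of many services in the test environment became unreachable. We waited for quite a while, but the failure never cleared, so we hurriedly started Nacos-Server-1 again. We had to find the root cause; otherwise we could not safely restart Nacos-Server in production.

My long-held view is this: when you hit a tricky problem, read the exception messages first, then form a hypothesis about the cause, then verify it by experiment, and only at the end confirm it in the source code. Do not dive into the source code right away; that gives you a worse headache than a Moutai-flavored latte.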
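2. Exception messages

When Nacos-server-1 was shut down, the exceptions showed up first on the Nacos client side (that is, in one of the microservice applications). There were two main kinds:
  • heartbeat exceptions between nacos-client and nacos-server
  • Dubbo microservice call exceptions
(1) Heartbeat exceptions between nacos-client and nacos-server: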
2023-09-06 08:10:09|ERROR|com.alibaba.nacos.client.naming.net.NamingProxy:reqApi|548|com.alibaba.nacos.naming.beat.sender|"request: /nacos/v1/ns/instance/beat failed, servers: [10.20.1.13:8848, 10.20.1.14:8848, 10.20.1.15:8848], code: 500, msg: java.net.SocketTimeoutException: Read timed out"|""

2023-09-06 08:10:09|ERROR|com.alibaba.nacos.client.naming.beat.BeatReactor$BeatTask:run|198|com.alibaba.nacos.naming.beat.sender|"[CLIENT-BEAT] failed to send beat: {"port":0,"ip":"10.21.230.14","weight":1.0,"serviceName":"DEFAULT_GROUP@@consumers:com.cloud.usercenter.api.PartyCompanyMemberApi:1.0:","metadata":{"owner":"ehome-cloud-owner","init":"false","side":"consumer","Application.version":"1.0","methods":"queryGroupMemberCount,queryWithValid,query,queryOne,update,insert,queryCount,queryPage,delete,queryList","release":"2.7.8","dubbo":"2.0.2","pid":"6","check":"false","interface":"com.bm001.ehome.cloud.usercenter.api.PartyCompanyMemberApi","version":"1.0","qos.enable":"false","timeout":"20000","revision":"1.2.38-SNAPSHOT","retries":"0","path":"com.bm001.ehome.cloud.usercenter.api.PartyCompanyMemberApi","protocol":"consumer","metadata-type":"remote","application":"xxxx-cloud","sticky":"false","category":"consumers","timestamp":"1693917779436"},"scheduled":false,"period":5000,"stopped":false}, code: 500, msg: failed to req API:/nacos/v1/ns/instance/beat after all servers([10.20.1.13:8848, 10.20.1.14:8848, 10.20.1.15:8848])"|""

2023-09-06 08:10:10|ERROR|com.alibaba.nacos.client.naming.net.NamingProxy:callServer|613|com.alibaba.nacos.naming.beat.sender|"[NA] failed to request"|"com.alibaba.nacos.api.exception.NacosException: java.net.ConnectException: 拒绝连接 (Connection refused)
    at com.alibaba.nacos.client.naming.net.NamingProxy.callServer(NamingProxy.java:611)
    at com.alibaba.nacos.client.naming.net.NamingProxy.reqApi(NamingProxy.java:524)
    at com.alibaba.nacos.client.naming.net.NamingProxy.reqApi(NamingProxy.java:491)
    at com.alibaba.nacos.client.naming.net.NamingProxy.sendBeat(NamingProxy.java:426)
    at com.alibaba.nacos.client.naming.beat.BeatReactor$BeatTask.run(BeatReactor.java:167)
Caused by: java.io.IOException: Server returned HTTP response code: 502 for URL: http://10.20.1.14:8848/nacos/v1/ns/instance/beat?app=unknown&serviceName=DEFAULT_GROUP%40%40providers%3AChannelOrderExpressApi%3A1.0%3A&namespaceId=dev&port=20880&ip=10.20.0.200
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1914)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1512)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
    at com.alibaba.nacos.common.http.client.response.JdkHttpClientResponse.getStatusCode(JdkHttpClientResponse.java:75)
    at com.alibaba.nacos.common.http.client.handler.AbstractResponseHandler.handle(AbstractResponseHandler.java:43)"
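
The message "failed to req API ... after all servers([...])" in the log above suggests that the client walks through its configured server list and reports a failure only once every address has failed. The sketch below illustrates that general failover pattern in plain Java. It is not the Nacos client's actual NamingProxy code: the class ServerFailoverSketch, the method callWithFailover, and the simulated request are hypothetical names made up for this illustration, and which of the three addresses was really the stopped node is an assumption here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of "try every server, fail only after all of them";
// this is NOT the real Nacos NamingProxy implementation.
public class ServerFailoverSketch {

    /** Thrown only when every server in the list has been tried without success. */
    public static class AllServersFailedException extends RuntimeException {
        public AllServersFailedException(String message, Throwable lastCause) {
            super(message, lastCause);
        }
    }

    /**
     * Sends the request to each server in order and returns the first successful
     * response. A failure on one server is remembered and the next one is tried.
     */
    public static <T> T callWithFailover(List<String> servers, String api,
                                         Function<String, T> request) {
        RuntimeException last = null;
        for (String server : servers) {
            try {
                return request.apply(server);
            } catch (RuntimeException e) {
                last = e; // keep the most recent cause and move on to the next server
            }
        }
        throw new AllServersFailedException(
                "failed to req API:" + api + " after all servers(" + servers + ")", last);
    }

    public static void main(String[] args) {
        // Server addresses taken from the log above.
        List<String> servers =
                Arrays.asList("10.20.1.13:8848", "10.20.1.14:8848", "10.20.1.15:8848");

        // Simulated request: assume (for illustration only) that the first node is down.
        String result = callWithFailover(servers, "/nacos/v1/ns/instance/beat", server -> {
            if (server.equals("10.20.1.13:8848")) {
                throw new IllegalStateException("Connection refused: " + server);
            }
            return "ok from " + server;
        });

        System.out.println(result); // prints "ok from 10.20.1.14:8848"
    }
}
```

With a pattern like this, stopping a single node should cost at most one failed attempt per request. The log, however, shows the heartbeat failing against all three addresses (a read timeout, a refused connection, and an HTTP 502), which is what makes the incident worth investigating.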

