今天运维在执行线上功能时发现服务异常,跟进之后发现Redis出现问题,如下:

[ERROR] [2022-05-26 00:05:01.518] [c.q.d.utils.DataAccessSampleRedisUtil:326][Thread-154] 服务名称:ivdg-data-access-service--> 查询缓存数据出错:Unknown redis exception; nested exception is java.util.concurrent.RejectedExecutionException: event executor terminated
org.springframework.data.redis.RedisSystemException: Unknown redis exception; nested exception is java.util.concurrent.RejectedExecutionException: event executor terminated
	at org.springframework.data.redis.FallbackExceptionTranslationStrategy.getFallback(FallbackExceptionTranslationStrategy.java:53)
	at org.springframework.data.redis.FallbackExceptionTranslationStrategy.translate(FallbackExceptionTranslationStrategy.java:43)
	at org.springframework.data.redis.connection.lettuce.LettuceConnection.convertLettuceAccessException(LettuceConnection.java:273)
	at org.springframework.data.redis.connection.lettuce.LettuceScriptingCommands.convertLettuceAccessException(LettuceScriptingCommands.java:236)
	at org.springframework.data.redis.connection.lettuce.LettuceScriptingCommands.evalSha(LettuceScriptingCommands.java:195)
	at org.springframework.data.redis.connection.DefaultedRedisConnection.evalSha(DefaultedRedisConnection.java:1502)
	at org.springframework.data.redis.connection.DefaultStringRedisConnection.evalSha(DefaultStringRedisConnection.java:1757)
	at sun.reflect.GeneratedMethodAccessor606.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.springframework.data.redis.core.CloseSuppressingInvocationHandler.invoke(CloseSuppressingInvocationHandler.java:61)
	at com.sun.proxy.$Proxy258.evalSha(Unknown Source)
	at org.springframework.data.redis.core.script.DefaultScriptExecutor.eval(DefaultScriptExecutor.java:77)
	at org.springframework.data.redis.core.script.DefaultScriptExecutor.lambda$execute$0(DefaultScriptExecutor.java:68)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:228)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:188)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:175)
	at org.springframework.data.redis.core.script.DefaultScriptExecutor.execute(DefaultScriptExecutor.java:58)
	at org.springframework.data.redis.core.script.DefaultScriptExecutor.execute(DefaultScriptExecutor.java:52)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:350)
	at com.qsdi.dataAccess.utils.DataAccessSampleRedisUtil.lambda$scanData$7(DataAccessSampleRedisUtil.java:324)
	at java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:110)
	at java.util.stream.IntPipeline$Head.forEach(IntPipeline.java:559)
	at com.qsdi.dataAccess.utils.DataAccessSampleRedisUtil.scanData(DataAccessSampleRedisUtil.java:318)
	at com.qsdi.dataAccess.service.impl.DataSamplingInspectionHourStatisticsServiceImpl.handleHourData(DataSamplingInspectionHourStatisticsServiceImpl.java:69)
	at com.qsdi.dataAccess.service.impl.DataSamplingInspectionHourStatisticsServiceImpl.statisticsDataByTime(DataSamplingInspectionHourStatisticsServiceImpl.java:58)
	at com.qsdi.dataAccess.job.HourStatisticsByVehicleJobHandler.execute(HourStatisticsByVehicleJobHandler.java:29)
	at com.xxl.job.core.thread.JobThread.run(JobThread.java:163)
Caused by: java.util.concurrent.RejectedExecutionException: event executor terminated
	at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:926)
	at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:353)
	at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:346)
	at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:828)
	at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:818)
	at io.netty.util.concurrent.AbstractScheduledEventExecutor.schedule(AbstractScheduledEventExecutor.java:261)
	at io.netty.util.concurrent.AbstractScheduledEventExecutor.schedule(AbstractScheduledEventExecutor.java:177)
	at io.netty.util.concurrent.AbstractEventExecutorGroup.schedule(AbstractEventExecutorGroup.java:50)
	at io.netty.util.concurrent.AbstractEventExecutorGroup.schedule(AbstractEventExecutorGroup.java:32)
	at io.lettuce.core.protocol.CommandExpiryWriter.potentiallyExpire(CommandExpiryWriter.java:168)
	at io.lettuce.core.protocol.CommandExpiryWriter.write(CommandExpiryWriter.java:115)
	at io.lettuce.core.RedisChannelHandler.dispatch(RedisChannelHandler.java:195)
	at io.lettuce.core.StatefulRedisConnectionImpl.dispatch(StatefulRedisConnectionImpl.java:176)
	at io.lettuce.core.AbstractRedisAsyncCommands.dispatch(AbstractRedisAsyncCommands.java:474)
	at io.lettuce.core.AbstractRedisAsyncCommands.evalsha(AbstractRedisAsyncCommands.java:512)
	at sun.reflect.GeneratedMethodAccessor608.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at io.lettuce.core.FutureSyncInvocationHandler.handleInvocation(FutureSyncInvocationHandler.java:63)
	at io.lettuce.core.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:79)
	at com.sun.proxy.$Proxy254.evalsha(Unknown Source)
	at org.springframework.data.redis.connection.lettuce.LettuceScriptingCommands.evalSha(LettuceScriptingCommands.java:193)
	... 23 common frames omitted

如上代码段所示, 再调用redis执行luna脚本查询时提示异常,具体异常描述是由SingeThreadEventExecutor这个类调用execute方法时出现的异常,源码如下:

private void execute(Runnable task, boolean immediate) {
        boolean inEventLoop = inEventLoop();
        addTask(task);
        if (!inEventLoop) {
            startThread();
            if (isShutdown()) {
                boolean reject = false;
                try {
                    if (removeTask(task)) {
                        reject = true;
                    }
                } catch (UnsupportedOperationException e) {
                    // The task queue does not support removal so the best thing we can do is to just move on and
                    // hope we will be able to pick-up the task before its completely terminated.
                    // In worst case we will log on termination.
                }
                if (reject) {
                    reject();
                }
            }
        }

        if (!addTaskWakesUp && immediate) {
            wakeup(inEventLoop);
        }
    }

分析这个方法,大意是将当前写事件提交到 SingleThreadEventExecutor 任务队列中。报错是在 addTask 这个方法:

/**
     * Add a task to the task queue, or throws a {@link RejectedExecutionException} if this instance was shutdown
     * before.
     */
    protected void addTask(Runnable task) {
        ObjectUtil.checkNotNull(task, "task");
        if (!offerTask(task)) {
            reject(task);
        }
    }

这个方法判断如果没有插入任务队列成功就调用 reject (task) 拒绝任务,reject (task) 里面抛出的异常就是我们看到的最外面的异常。于是看一看 offerTask 方法:

final boolean offerTask(Runnable task) {
        if (isShutdown()) {
            reject();
        }
        return taskQueue.offer(task);
    }
	
	@Override
    public boolean isShutdown() {
        return state >= ST_SHUTDOWN;
    }
	
    private static final int ST_NOT_STARTED = 1;
    private static final int ST_STARTED = 2;
    private static final int ST_SHUTTING_DOWN = 3;
    private static final int ST_SHUTDOWN = 4;
    private static final int ST_TERMINATED = 5;

这里可以看出 offerTask 的时候校验了当前的 SingleThreadEventExecutor 的状态是否是结束和终止。那么问题来了,是什么导致的 SingleThreadEventExecutor 的状态为终止的勒,通过前面缘由中日志可以看到 SingleThreadEventExecutor 也打印了堆溢出的错误日志,搜索这个日志是哪里打印的,发现:

private void doStartThread() {
        assert thread == null;
        executor.execute(new Runnable() {
            @Override
            public void run() {
                thread = Thread.currentThread();
                if (interrupted) {
                    thread.interrupt();
                }

                boolean success = false;
                updateLastExecutionTime();
                try {
                    SingleThreadEventExecutor.this.run();
                    success = true;
                } catch (Throwable t) {
                    logger.warn("Unexpected exception from an event executor: ", t);
                } finally {
                    for (;;) {
                        int oldState = state;
                        if (oldState >= ST_SHUTTING_DOWN || STATE_UPDATER.compareAndSet(
                                SingleThreadEventExecutor.this, oldState, ST_SHUTTING_DOWN)) {
                            break;
                        }
                    }

                    // Check if confirmShutdown() was called at the end of the loop.
                    if (success && gracefulShutdownStartTime == 0) {
                        if (logger.isErrorEnabled()) {
                            logger.error("Buggy " + EventExecutor.class.getSimpleName() + " implementation; " +
                                    SingleThreadEventExecutor.class.getSimpleName() + ".confirmShutdown() must " +
                                    "be called before run() implementation terminates.");
                        }
                    }

                    try {
                        // Run all remaining tasks and shutdown hooks. At this point the event loop
                        // is in ST_SHUTTING_DOWN state still accepting tasks which is needed for
                        // graceful shutdown with quietPeriod.
                        for (;;) {
                            if (confirmShutdown()) {
                                break;
                            }
                        }

                        // Now we want to make sure no more tasks can be added from this point. This is
                        // achieved by switching the state. Any new tasks beyond this point will be rejected.
                        for (;;) {
                            int oldState = state;
                            if (oldState >= ST_SHUTDOWN || STATE_UPDATER.compareAndSet(
                                    SingleThreadEventExecutor.this, oldState, ST_SHUTDOWN)) {
                                break;
                            }
                        }

                        // We have the final set of tasks in the queue now, no more can be added, run all remaining.
                        // No need to loop here, this is the final pass.
                        confirmShutdown();
                    } finally {
                        try {
                            cleanup();
                        } finally {
                            // Lets remove all FastThreadLocals for the Thread as we are about to terminate and notify
                            // the future. The user may block on the future and once it unblocks the JVM may terminate
                            // and start unloading classes.
                            // See https://github.com/netty/netty/issues/6596.
                            FastThreadLocal.removeAll();

                            STATE_UPDATER.set(SingleThreadEventExecutor.this, ST_TERMINATED);
                            threadLock.countDown();
                            int numUserTasks = drainTasks();
                            if (numUserTasks > 0 && logger.isWarnEnabled()) {
                                logger.warn("An event executor terminated with " +
                                        "non-empty task queue (" + numUserTasks + ')');
                            }
                            terminationFuture.setSuccess(null);
                        }
                    }
                }
            }
        });
    }

这个方法,这个方法里面同时在 finally 设置了状态为 ST_TERMINATED,这个方法大意是启动 eventloop 真正的线程来轮询读写事件,这个方法将在首次执行 excute 方法的时候被调用,注意这里的 executor 变量是一个线程池,于是查阅资料才认识到,eventloopgroup 下的每个 eventloop 实际上都会引用到一个线程池,这个线程池线程数目与 eventloop 总大小相同,是 fork/join 框架创建的,真正执行轮询逻辑实际上是这个线程池去提交的。eventloop 的意义实际上只是为了保存任务队列。真正的去轮询任务队列的逻辑是由其代理的线程池去执行的。

这样也就清楚了,整个事故实际很简单,就是先由于内存溢出的错误导致 eventloop 的状态为终止,这个终止状态据我看到的代码貌似是不可逆的,然后当在调用 chanel.write 的时候提交任务到 eventloop 校验状态为终止所以报了拒绝错误。

解决这个 bug 的重要途径是找到内存溢出的原因,这里就不赘述了,因为在网上没有搜到这个错的原因,所以跟了一下代码,在这里做个记录,后续分析一下到底是哪里有内存溢出问题再贴上解决方案

Logo

华为开发者空间,是为全球开发者打造的专属开发空间,汇聚了华为优质开发资源及工具,致力于让每一位开发者拥有一台云主机,基于华为根生态开发、创新。

更多推荐