<?xml version="1.0" encoding="UTF-8" ?>

<bugzilla version="5.2"
          urlbase="https://bugzilla.altlinux.org/"
          maintainer="jenya@basealt.ru"
>

    <bug>
          <bug_id>45601</bug_id>
          <creation_ts>2023-03-21 03:40:13 +0300</creation_ts>
          <short_desc>hangs on access to /etc/pve (freeze in pmxcfs)</short_desc>
          <delta_ts>2023-11-08 01:17:42 +0300</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>4</classification_id>
          <classification>Development</classification>
          <product>Sisyphus</product>
          <component>pve-cluster</component>
          <version>unstable</version>
          <rep_platform>x86_64</rep_platform>
          <op_sys>Linux</op_sys>
          <bug_status>CLOSED</bug_status>
          <resolution>WORKSFORME</resolution>
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P5</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          <everconfirmed>1</everconfirmed>
          <reporter name="Vitaly Lipatov">lav</reporter>
          <assigned_to name="Alexey Shabalin">shaba</assigned_to>
          <cc>shaba</cc>
          <cc>shrek</cc>
          <qa_contact>qa-sisyphus</qa_contact>
          <comment_sort_order>oldest_to_newest</comment_sort_order>
          <long_desc isprivate="0" >
    <commentid>223090</commentid>
    <comment_count>0</comment_count>
    <who name="Vitaly Lipatov">lav</who>
    <bug_when>2023-03-21 03:40:13 +0300</bug_when>
    <thetext>Steps to reproduce on a machine with PVE installed:

0. Preparation
# killall -9 pmxcfs
# umount /etc/pve -l

1. Run pmxcfs manually:
# pmxcfs -l -f -d

2. Access to /etc/pve:
# ls /etc/pve

The last message printed on this access:
[main] debug: enter cfs_plug_link_getattr  (cfs-plug-link.c:58:cfs_plug_link_getattr)


The code:
static int cfs_plug_link_getattr(cfs_plug_t *plug, const char *path, struct stat *stbuf)
{
        g_return_val_if_fail(plug != NULL, PARAM_CHECK_ERRNO);
        g_return_val_if_fail(plug-&gt;ops == &amp;cfs_ops, PARAM_CHECK_ERRNO);
        g_return_val_if_fail(path != NULL, PARAM_CHECK_ERRNO);
        g_return_val_if_fail(stbuf != NULL, PARAM_CHECK_ERRNO);

        cfs_debug(&quot;enter cfs_plug_link_getattr %s&quot;, path);

        memset(stbuf, 0, sizeof(struct stat));

        if (cfs_is_quorate()) {
                stbuf-&gt;st_mode = S_IFLNK | 0777;
        } else {
                stbuf-&gt;st_mode = S_IFLNK | 0555;
        }

        stbuf-&gt;st_nlink = 1;

        return 0;
}


After installing pve-cluster-debuginfo, gdb shows that pmxcfs hangs on the cfs_is_quorate() call:


1948	gboolean
1949	cfs_is_quorate(void)
1950	{
1951		g_mutex_lock (&amp;mutex);
1952		gboolean res =  cfs_status.quorate;
1953		g_mutex_unlock (&amp;mutex);
1954	
1955		return res;
1956	}


1951		g_mutex_lock (&amp;mutex);
(gdb) s
g_mutex_lock (mutex=0x555555584a60 &lt;mutex&gt;) at ../glib/gthread-posix.c:1515
1515	  if G_UNLIKELY (g_atomic_int_add (&amp;mutex-&gt;i[0], 1) != 0)
1516	    g_mutex_lock_slowpath (mutex);
g_mutex_lock_slowpath (mutex=0x555555584a60 &lt;mutex&gt;) at ../glib/gthread-posix.c:1491
1491	  while (exchange_acquire (&amp;mutex-&gt;i[0], 2) != 0)
1492	    syscall (__NR_futex, &amp;mutex-&gt;i[0], (gsize) FUTEX_WAIT_PRIVATE, (gsize) 2, NULL);
(hung)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>223169</commentid>
    <comment_count>1</comment_count>
    <who name="Vitaly Lipatov">lav</who>
    <bug_when>2023-03-21 18:35:19 +0300</bug_when>
    <thetext>It hangs during rrdcached access, in rrd_client_update:


(gdb) info threads
  Id   Target Id                                   Frame 
* 1    Thread 0x7f665918ba00 (LWP 3193) &quot;pmxcfs&quot;   futex_abstimed_wait_cancelable (private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x7ffe6657f8e0) at ../sysdeps/nptl/futex-internal.h:320
  2    Thread 0x7f6659156640 (LWP 3194) &quot;cfs_loop&quot; 0x00007f665a089856 in epoll_wait (epfd=8, events=events@entry=0x7f6659155990, maxevents=maxevents@entry=12, timeout=timeout@entry=999)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  3    Thread 0x7f6658934640 (LWP 3318) &quot;server&quot;   __libc_recv (flags=&lt;optimized out&gt;, len=4096, buf=0x7f665a1de290 &lt;default_client+16&gt;, fd=21) at ../sysdeps/unix/sysv/linux/recv.c:28
  4    Thread 0x7f6653fff640 (LWP 3319) &quot;pmxcfs&quot;   __libc_read (nbytes=135168, buf=0x7f6658112010, fd=7) at ../sysdeps/unix/sysv/linux/read.c:26
  5    Thread 0x7f66537fe640 (LWP 3320) &quot;pmxcfs&quot;   syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38

(gdb) thread 3
[Switching to thread 3 (Thread 0x7f6658934640 (LWP 3318))]
#0  __libc_recv (flags=&lt;optimized out&gt;, len=4096, buf=0x7f665a1de290 &lt;default_client+16&gt;, fd=21) at ../sysdeps/unix/sysv/linux/recv.c:28
28	  return SYSCALL_CANCEL (recvfrom, fd, buf, len, flags, NULL, NULL);

(gdb) bt
#0  __libc_recv (flags=&lt;optimized out&gt;, len=4096, buf=0x7f665a1de290 &lt;default_client+16&gt;, fd=21) at ../sysdeps/unix/sysv/linux/recv.c:28
#1  __libc_recv (fd=21, buf=buf@entry=0x7f665a1de290 &lt;default_client+16&gt;, len=len@entry=4096, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recv.c:23
#2  0x00007f665a1b88bc in recv (__flags=0, __n=4096, __buf=0x7f665a1de290 &lt;default_client+16&gt;, __fd=&lt;optimized out&gt;) at /usr/include/bits/socket2.h:44
#3  recvline (client=client@entry=0x7f665a1de280 &lt;default_client&gt;, buf=buf@entry=0x7f66589306b0 &quot;&quot;, n=4095) at rrd_client.c:526
#4  0x00007f665a1b8bd4 in recvline (n=4096, buf=0x7f66589306b0 &quot;&quot;, client=0x7f665a1de280 &lt;default_client&gt;) at rrd_client.c:514
#5  response_read (client=client@entry=0x7f665a1de280 &lt;default_client&gt;, ret_response=ret_response@entry=0x7f6658931700) at rrd_client.c:588
#6  0x00007f665a1b9373 in request (buffer=0x7f6658931790 &quot;update /var/lib/rrdcached/db/pve2-node/aspetos 1679411717:1.24:32:0:0:134163562496:1661198336:134179975168:0:1882764304384:12975132672:693041:297340\n&quot;, buffer_size=0, 
    ret_response=0x7f6658931788, client=0x7f665a1de280 &lt;default_client&gt;) at rrd_client.c:692
#7  request (client=client@entry=0x7f665a1de280 &lt;default_client&gt;, 
    buffer=buffer@entry=0x7f6658931790 &quot;update /var/lib/rrdcached/db/pve2-node/aspetos 1679411717:1.24:32:0:0:134163562496:1661198336:134179975168:0:1882764304384:12975132672:693041:297340\n&quot;, buffer_size=&lt;optimized out&gt;, 
    ret_response=ret_response@entry=0x7f6658931788) at rrd_client.c:671
#8  0x00007f665a1b9c07 in rrd_client_update (client=client@entry=0x7f665a1de280 &lt;default_client&gt;, filename=filename@entry=0x7f664c005950 &quot;/var/lib/rrdcached/db/pve2-node/aspetos&quot;, values_num=1486038928, values_num@entry=1, 
    values=values@entry=0x7f6658933800) at rrd_client.c:1090
#9  0x00007f665a1b9cb0 in rrdc_update (filename=0x7f664c005950 &quot;/var/lib/rrdcached/db/pve2-node/aspetos&quot;, values_num=1, values=0x7f6658933800) at rrd_client.c:1109
#10 0x00005571d2457853 in update_rrd_data (len=&lt;optimized out&gt;, data=&lt;optimized out&gt;, key=&lt;optimized out&gt;) at status.c:1374
#11 rrdentry_hash_set (rrdhash=0x2, key=0x7f664c005950 &quot;/var/lib/rrdcached/db/pve2-node/aspetos&quot;, data=0x7f66500cfe0c &quot;1679411717:1.24:32:0:0:134163562496:1661198336:134179975168:0:1882764304384:12975132672:693041:297340&quot;, 
    len=140077549434880) at status.c:1523
#12 0x00005571d245ad16 in cfs_status_set (key=&lt;optimized out&gt;, data=0x7f66500cfe08, len=len@entry=106) at status.c:1697
#13 0x00005571d2455ef2 in s1_msg_process_fn (c=0x7f664c006900, data=&lt;optimized out&gt;, size=&lt;optimized out&gt;) at server.c:234
#14 0x00007f665a4af409 in _process_request_ (ms_timeout=10, c=0x7f664c006900) at ipcs.c:699
#15 qb_ipcs_dispatch_connection_request (fd=&lt;optimized out&gt;, revents=&lt;optimized out&gt;, data=0x7f664c006900) at ipcs.c:801
#16 0x00007f665a4ab5df in _poll_dispatch_and_take_back_ (item=0x7f664c0012e8, p=&lt;optimized out&gt;) at loop_poll.c:109
#17 0x00007f665a4ab3fc in qb_loop_run_level (level=&lt;optimized out&gt;) at loop.c:43
#18 qb_loop_run (lp=&lt;optimized out&gt;) at loop.c:211
#19 0x00005571d2455833 in worker_thread (data=&lt;optimized out&gt;) at server.c:592
#20 0x00007f665a3ea1cd in g_thread_proxy (data=0x5571d400d060) at ../glib/gthread.c:826
#21 0x00007f665a169eb0 in start_thread (arg=0x7f6658934640) at pthread_create.c:463
#22 0x00007f665a08952f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>223170</commentid>
    <comment_count>2</comment_count>
    <who name="Vitaly Lipatov">lav</who>
    <bug_when>2023-03-21 18:36:52 +0300</bug_when>
    <thetext>Disabling socket activation of the rrdcached service seems to have helped:
 # systemctl disable rrdcached.socket
Removed /etc/systemd/system/sockets.target.wants/rrdcached.socket.

And enabling it via rrdcached.service (although it seems to work even without that).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>223176</commentid>
    <comment_count>3</comment_count>
    <who name="Alexey Shabalin">shaba</who>
    <bug_when>2023-03-21 20:50:33 +0300</bug_when>
    <thetext>rrdcached is needed by PVE; it caches the CPU load graphs and the like in it,
while the rrd files themselves live in /etc/pve.
It is quite possible there is a race: something knocks on the socket too early and the service starts before it should. If that really is the problem, we could drop the socket file from the package altogether.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>224849</commentid>
    <comment_count>4</comment_count>
    <who name="Vitaly Lipatov">lav</who>
    <bug_when>2023-04-22 01:26:52 +0300</bug_when>
    <thetext>(In reply to Alexey Shabalin's comment #3)
&gt; rrdcached is needed by PVE; it caches the CPU load graphs and the like in
&gt; it, while the rrd files themselves live in /etc/pve.
&gt; It is quite possible there is a race: something knocks on the socket too
&gt; early and the service starts before it should. If that really is the
&gt; problem, we could drop the socket file from the package altogether.
We could not reproduce the problem on a test bench.
If the problem does occur, the simpler fix is to remove the socket file and require that rrdcached be started explicitly.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>236554</commentid>
    <comment_count>5</comment_count>
    <who name="Vitaly Lipatov">lav</who>
    <bug_when>2023-11-08 01:17:42 +0300</bug_when>
    <thetext>(In reply to Vitaly Lipatov's comment #2)
&gt; Disabling socket activation of the rrdcached service seems to have helped:
&gt;  # systemctl disable rrdcached.socket
&gt; Removed /etc/systemd/system/sockets.target.wants/rrdcached.socket.
Ran into this again on a new machine after another reboot. Disabling socket activation helped.</thetext>
  </long_desc>
    </bug>

</bugzilla>