Discussion: [SSSD-users] "Child not responding" on loaded servers
Patrick Coleman
2016-04-29 16:56:52 UTC
Hi,

We've got a number of machines using sssd to connect to LDAP for auth.
In the past we've had problems with sssd crashing regularly[1], but
after posting here we built some custom packages to disable netlink
notifications from the kernel, and it's generally improved.

We're still seeing auth failures across random machines - perhaps 1-2%
when we run a process which connects to all hosts. The machines are
generally heavily loaded when this happens, and sssd.log looks like:

(Fri Apr 29 09:31:19 2016) [sssd] [ping_check] (0x0020): A service
PING timed out on [nss]. Attempt [0]
(Fri Apr 29 09:31:29 2016) [sssd] [tasks_check_handler] (0x0020):
Child (meraki) not responding! (yet)
(Fri Apr 29 09:31:39 2016) [sssd] [tasks_check_handler] (0x0020):
Child (meraki) not responding! (yet)
(Fri Apr 29 09:31:39 2016) [sssd] [ping_check] (0x0020): A service
PING timed out on [nss]. Attempt [0]

While sssd is in this state, it seems to deny auth randomly for LDAP
users - they receive "connection closed by remote host". It will
eventually restart its children, but that doesn't seem to fix the
problem.

Logs for the meraki domain and for nss indicate the subprocesses are running:

/var/log/sssd/sssd_meraki.log
(Fri Apr 29 09:30:53 2016) [sssd[be[meraki]]] [sdap_save_user]
(0x0400): Storing info for user blinken
(Fri Apr 29 09:31:22 2016) [sssd[be[meraki]]]
[sdap_initgr_rfc2307_next_base] (0x0400): Searching for groups with
base [dc=meraki,dc=com]
(Fri Apr 29 09:31:22 2016) [sssd[be[meraki]]]
[sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with
[(&(memberuid=blinken)(objectClass=posixGroup)(cn=*)(&(gidNumber=*)(!(
gidNumber=0))))][dc=meraki,dc=com].
(Fri Apr 29 09:31:22 2016) [sssd[be[meraki]]]
[sdap_get_generic_op_finished] (0x0400): Search result: Success(0), no
errmsg set

/var/log/sssd/sssd_nss.log
(Fri Apr 29 09:31:22 2016) [sssd[nss]] [nss_cmd_getgrgid_search]
(0x0080): No matching domain found for [1155]
(Fri Apr 29 09:31:22 2016) [sssd[nss]] [nss_cmd_getbynam] (0x0100):
Requesting info for [blinken] from [<ALL>]
(Fri Apr 29 09:31:22 2016) [sssd[nss]] [nss_cmd_initgroups_search]
(0x0100): Requesting info for [***@meraki]
(Fri Apr 29 09:31:26 2016) [sssd[nss]] [calc_flat_name] (0x0080): Flat
name requested but domain has no flat name set, falling back to domain
name
(Fri Apr 29 09:31:26 2016) [sssd[nss]] [nss_cmd_getbynam] (0x0100):
Requesting info for [meraki] from [<ALL>]
(Fri Apr 29 09:31:26 2016) [sssd[nss]] [nss_cmd_initgroups_search]
(0x0080): No matching domain found for [meraki], fail!

We first saw the behaviour on sssd 1.11.7 and have upgraded to sssd
version 1.13.4, with more or less the same symptoms. We've turned
enumerate on and off with no apparent change in behaviour.

Does anyone have any suggestions here? Let me know if I can provide
more detailed debugging information (perhaps off-list).

Cheers,

Patrick


1. https://lists.fedorahosted.org/archives/list/sssd-***@lists.fedorahosted.org/message/V6W4ONRKOOIR7SWSSGRN7DD6PSKKW5YQ/
Lukas Slebodnik
2016-04-29 20:10:26 UTC
Post by Patrick Coleman
Hi,
We've got a number of machines using sssd to connect to LDAP for auth.
In the past we've had problems with sssd crashing regularly[1], but
after posting here we built some custom packages to disable netlink
notifications from the kernel, and it's generally improved.
We're still seeing auth failures across random machines - perhaps 1-2%
when we run a process which connects to all hosts. The machines are
Do you mean IO-related load or CPU-related load?

If the issue is CPU-related, then you can mount the sssd cache on tmpfs
to avoid such problems. (There are plans to improve this in 1.14.)
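
For example, something like this (the size and mode values are only
illustrative, and note that a tmpfs cache does not survive a reboot, so
cached credentials for offline auth would be lost):

# /etc/fstab - illustrative values only
tmpfs  /var/lib/sss/db  tmpfs  size=256M,mode=0700  0  0

# or, after stopping sssd, mount it by hand:
mount -t tmpfs -o size=256M,mode=0700 tmpfs /var/lib/sss/db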

LS
Patrick Coleman
2016-04-30 08:28:58 UTC
Post by Lukas Slebodnik
Post by Patrick Coleman
Hi,
We've got a number of machines using sssd to connect to LDAP for auth.
In the past we've had problems with sssd crashing regularly[1], but
after posting here we built some custom packages to disable netlink
notifications from the kernel, and it's generally improved.
We're still seeing auth failures across random machines - perhaps 1-2%
when we run a process which connects to all hosts. The machines are
Do you meand IO related load or CPU related load?
Lots of both, but we're typically IO bound more of the time.
Post by Lukas Slebodnik
If there is issue with CPU then you can mount sssd cache to tmpfs
to avoid such issues. (there are plans to improve it in 1.14)
Cool, I'll give that a go.

Cheers
Patrick
Jakub Hrozek
2016-05-01 16:04:48 UTC
Post by Patrick Coleman
Post by Lukas Slebodnik
Post by Patrick Coleman
Hi,
We've got a number of machines using sssd to connect to LDAP for auth.
In the past we've had problems with sssd crashing regularly[1], but
after posting here we built some custom packages to disable netlink
notifications from the kernel, and it's generally improved.
We're still seeing auth failures across random machines - perhaps 1-2%
when we run a process which connects to all hosts. The machines are
Do you meand IO related load or CPU related load?
Lots of both, but we're typically IO bound more of the time.
Post by Lukas Slebodnik
If there is issue with CPU then you can mount sssd cache to tmpfs
to avoid such issues. (there are plans to improve it in 1.14)
Cool, I'll give that a go.
Alternatively, increase the 'timeout' option in sssd's sections..
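
For example (the values here are only illustrative, and [domain/meraki]
is a guess at your domain section name):

[nss]
timeout = 30

[pam]
timeout = 30

[domain/meraki]
timeout = 30

'timeout' is the heartbeat interval the monitor uses to decide whether a
child process is still responding (default 10 seconds), which, as far as
I understand, is what the ping_check messages in your sssd.log relate to.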
Post by Patrick Coleman
Cheers
Patrick
Patrick Coleman
2016-05-03 14:52:03 UTC
Post by Jakub Hrozek
Post by Patrick Coleman
Post by Lukas Slebodnik
Do you meand IO related load or CPU related load?
Lots of both, but we're typically IO bound more of the time.
Post by Lukas Slebodnik
If there is issue with CPU then you can mount sssd cache to tmpfs
to avoid such issues. (there are plans to improve it in 1.14)
Cool, I'll give that a go.
Alternatively, increase the 'timeout' option in sssd's sections..
I appreciate the advice, thank you. I've put /var/lib/sss onto a tmpfs
filesystem on a couple of loaded machines and seen what I believe to
be improvements - it's a little too early to say, but I'll report back
once I have a wider deployment.

I did want to feed back a little of our research into this issue. If
we strace the sssd_be subprocess on a loaded machine, we see it
sitting in msync() and fdatasync() for periods of up to 7.3 seconds in
one test. This is perhaps expected, given the machine is under heavy
IO load, but sssd makes a *lot* of these calls.

In a 7m 49.985s test (this is as long as the sssd_be process lasts
before it is killed by the parent for not replying to ping) on a
machine with moderate disk load and no new interactive logins, sssd
made 232 *sync calls. The median syscall takes only 67ms, but the
maximum is more than seven seconds - in the eight minute test sssd
spent 1m 00.044s in *sync system calls.
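
For anyone who wants to reproduce this, the data came from something
like the following (the exact flags and the pidof filter are an
illustration of the approach, not the precise command line we ran):

# time every sync-family call made by the backend
strace -f -tt -T -e trace=msync,fdatasync,fsync -p "$(pidof sssd_be)" -o sssd_be.strace

# or just collect an aggregate count / total-time summary
strace -c -e trace=msync,fdatasync,fsync -p "$(pidof sssd_be)"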

My (naive) analysis here is that the backend process is spending 13%
of its time unavailable to service account queries, because it's doing
cache maintenance. This seems to rather defeat the point of having a
cache... are my assumptions correct here? I'm happy to send the strace
log (and any other data) to interested parties off-list, just let me
know.

In an attempt to improve this behaviour, in addition to a tmpfs for
/var/lib/sss I've also just added the following to the nss and pam
stanzas in the config:

memcache_timeout = 1800
entry_cache_timeout = 1800

...the idea being they will respond from their own cache without
contacting the backend, which may be busy per the above. Is this
reasonable?

Cheers,

Patrick
Jakub Hrozek
2016-05-03 15:30:23 UTC
Post by Patrick Coleman
Post by Jakub Hrozek
Post by Patrick Coleman
Post by Lukas Slebodnik
Do you meand IO related load or CPU related load?
Lots of both, but we're typically IO bound more of the time.
Post by Lukas Slebodnik
If there is issue with CPU then you can mount sssd cache to tmpfs
to avoid such issues. (there are plans to improve it in 1.14)
Cool, I'll give that a go.
Alternatively, increase the 'timeout' option in sssd's sections..
I appreciate the advice, thankyou. I've put /var/lib/sss on to a tmpfs
filesystem on a couple of loaded machines and seen what I believe to
be improvements - it's a little too early to say, but I'll report back
once I have a wider deployment.
I did want to feed back a little of our research into this issue. If
we strace the sssd_be subprocess on a loaded machine, we see it
sitting in msync() and fdatasync() for periods of up to 7.3 seconds in
one test. This is perhaps expected, given the machine is under heavy
IO load, but sssd makes a *lot* of these calls.
Yes, every cache update does 4 of these. This is a known issue I'm
working on right now:
https://fedorahosted.org/sssd/ticket/2602
In particular:
https://fedorahosted.org/sssd/wiki/DesignDocs/OneFourteenPerformanceImprovements
Post by Patrick Coleman
In a 7m 49.985s test (this is as long as the sssd_be process lasts
before it is killed by the parent for not replying to ping) on a
machine with moderate disk load and no new interactive logins, sssd
made 232 *sync calls. The median syscall takes only 67ms, but the
maximum is more than seven seconds - in the eight minute test sssd
spent 1m 00.044s in *sync system calls.
My (naive) analysis here is that the backend process is spending 13%
of its time unavailable to service account queries, because it's doing
cache maintenance.
Very nice analysis.

Just a detail, it's not cache maintenance, but updates. The thing we are
doing wrong at the moment is that we do a full write of the whole object
even if nothing changes.
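
Conceptually, the planned change looks something like this (a toy sketch
in Python, not the actual sysdb code): compare the freshly fetched entry
with what is already cached and skip the write - and therefore the
msync()/fdatasync() - when nothing changed:

# toy illustration only - not SSSD code
def maybe_store_user(cache, name, fresh_attrs):
    """Write the cached entry only when its attributes actually changed."""
    cached = cache.get(name)
    if cached == fresh_attrs:
        return False                     # identical entry: no transaction, no sync
    cache[name] = dict(fresh_attrs)      # new or changed entry: write it out
    return True

# example: the second call is a no-op because nothing changed
cache = {}
maybe_store_user(cache, "blinken", {"uid": 1155, "gid": 1155})   # True, written
maybe_store_user(cache, "blinken", {"uid": 1155, "gid": 1155})   # False, skipped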
Post by Patrick Coleman
This seems to rather defeat the point of having a
cache... are my assumptions correct here? I'm happy to send the strace
log (and any other data) to interested parties off-list, just let me
know.
In an attempt to improve this behaviour, in addition to a tmpfs for
/var/lib/sss I've also just added the following to the nss and pam
memcache_timeout = 1800
entry_cache_timeout = 1800
...the idea being they will respond from their own cache without
contacting the backend, which may be busy per the above. Is this
reasonable?
Yes, it is, in the sense that the cache writes would be performed less
frequently. But your value of entry_cache_timeout is too low; the default
is 5400. By the way, an authentication request ignores the cache validity,
since during authentication we always try to get fresh group membership.
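
So if the aim is to answer more lookups purely from the local cache, the
value wants to be at or above that default. entry_cache_timeout is
documented as a domain-section option, so a sketch of where it would go
(the section name is a guess, the value illustrative):

[domain/meraki]
entry_cache_timeout = 5400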

When #2602 is implemented, the cost of cache updates should (mostly) go
away.
Jakub Hrozek
2016-05-03 16:04:31 UTC
Post by Jakub Hrozek
Post by Patrick Coleman
Post by Jakub Hrozek
Post by Patrick Coleman
Post by Lukas Slebodnik
Do you meand IO related load or CPU related load?
Lots of both, but we're typically IO bound more of the time.
Post by Lukas Slebodnik
If there is issue with CPU then you can mount sssd cache to tmpfs
to avoid such issues. (there are plans to improve it in 1.14)
Cool, I'll give that a go.
Alternatively, increase the 'timeout' option in sssd's sections..
I appreciate the advice, thankyou. I've put /var/lib/sss on to a tmpfs
filesystem on a couple of loaded machines and seen what I believe to
be improvements - it's a little too early to say, but I'll report back
once I have a wider deployment.
I did want to feed back a little of our research into this issue. If
we strace the sssd_be subprocess on a loaded machine, we see it
sitting in msync() and fdatasync() for periods of up to 7.3 seconds in
one test. This is perhaps expected, given the machine is under heavy
IO load, but sssd makes a *lot* of these calls.
Yes, every cache update does 4 of these. This is a know issue I'm
https://fedorahosted.org/sssd/ticket/2602
https://fedorahosted.org/sssd/wiki/DesignDocs/OneFourteenPerformanceImprovements
By the way, some comparison from my WIP branch. Without the patches,
updating a user who is a member of several hundred large groups with 'id'
takes the following:
Total run time of id was: 19415 ms
Number of zero-level cache transactions: 283
--> Time spent in level-0 sysdb transactions: 7694 ms
--> Time spent writing to LDB: 2958 ms
Number of LDAP searches: 562
Time spent waiting for LDAP: 4548 ms

With the patches to avoid cache writes:
Total run time of id was: 9482 ms
Number of zero-level cache transactions: 283
--> Time spent in level-0 sysdb transactions: 1074 ms
--> Time spent writing to LDB: 38 ms
Number of LDAP searches: 562
Time spent waiting for LDAP: 4792 ms

So I think this already shows a nice improvement, although there is
still quite a bit to work on..
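
To put those numbers another way: that is roughly a 2x overall speedup
(19415 ms down to 9482 ms), with time spent in level-0 sysdb transactions
down about 86% (7694 ms to 1074 ms) and LDB write time down from 2958 ms
to 38 ms, while the LDAP wait time stays essentially the same - which is
what you would expect if only the cache writes were avoided.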
Patrick Coleman
2016-05-06 14:25:28 UTC
Post by Jakub Hrozek
Post by Jakub Hrozek
Yes, every cache update does 4 of these. This is a know issue I'm
https://fedorahosted.org/sssd/ticket/2602
https://fedorahosted.org/sssd/wiki/DesignDocs/OneFourteenPerformanceImprovements
By the way, some comparison from my WIP branch. Without the patches,
updating a user who is a member of several hundred large groups with 'id'
Total run time of id was: 19415 ms
Number of zero-level cache transactions: 283
--> Time spent in level-0 sysdb transactions: 7694 ms
--> Time spent writing to LDB: 2958 ms
Number of LDAP searches: 562
Time spent waiting for LDAP: 4548 ms
Total run time of id was: 9482 ms
Number of zero-level cache transactions: 283
--> Time spent in level-0 sysdb transactions: 1074 ms
--> Time spent writing to LDB: 38 ms
Number of LDAP searches: 562
Time spent waiting for LDAP: 4792 ms
That's good to know, and I'm happy to test any patches - just get in
touch directly.

I wanted to report back that, after a few days of running with a tmpfs
at /var/lib/sss/db, the issue with child processes timing out seems to
have been mostly resolved. Our most loaded machines will report
occasional timeouts, but this doesn't seem to impact service:

(Wed May 4 03:32:38 2016) [sssd] [mark_service_as_started] (0x0400):
SSSD is initialized, terminating parent process
(Wed May 4 03:32:43 2016) [sssd] [services_startup_timeout] (0x0400):
Handling timeout
(Wed May 4 17:15:48 2016) [sssd] [ping_check] (0x0020): A service
PING timed out on [pam]. Attempt [0]
(Thu May 5 04:57:28 2016) [sssd] [ping_check] (0x0020): A service
PING timed out on [nss]. Attempt [0]
(Thu May 5 11:17:18 2016) [sssd] [ping_check] (0x0020): A service
PING timed out on [nss]. Attempt [0]
(Thu May 5 15:44:18 2016) [sssd] [ping_check] (0x0020): A service
PING timed out on [nss]. Attempt [0]
(Fri May 6 03:57:48 2016) [sssd] [ping_check] (0x0020): A service
PING timed out on [nss]. Attempt [0]
(Fri May 6 05:17:28 2016) [sssd] [ping_check] (0x0020): A service
PING timed out on [nss]. Attempt [0]

As far as the actual auth failures go, it's hard to quantify whether
we're now seeing _zero_ problems because they tend to not get
reported, but our metrics are showing substantially fewer failures.

Really appreciate your assistance here, and if you need any further
debugging information just let me know.

Cheers,

Patrick
