Discussion: [SSSD-users] "Child not responding" on loaded servers
Patrick Coleman
2016-04-29 16:56:52 UTC
Hi,

We've got a number of machines using sssd to connect to LDAP for auth.
In the past we've had problems with sssd crashing regularly[1], but
after posting here we built some custom packages to disable netlink
notifications from the kernel, and it's generally improved.

We're still seeing auth failures across random machines - perhaps 1-2%
when we run a process which connects to all hosts. The machines are
generally heavily loaded when this happens, and sssd.log looks like:

(Fri Apr 29 09:31:19 2016) [sssd] [ping_check] (0x0020): A service
PING timed out on [nss]. Attempt [0]
(Fri Apr 29 09:31:29 2016) [sssd] [tasks_check_handler] (0x0020):
Child (meraki) not responding! (yet)
(Fri Apr 29 09:31:39 2016) [sssd] [tasks_check_handler] (0x0020):
Child (meraki) not responding! (yet)
(Fri Apr 29 09:31:39 2016) [sssd] [ping_check] (0x0020): A service
PING timed out on [nss]. Attempt [0]

While sssd is in this state, it seems to deny auth randomly for LDAP
users - they receive "connection closed by remote host". It will
eventually restart its children, but that doesn't seem to fix the
problem.

Logs for the meraki domain and for nss indicate the subprocesses are running:

/var/log/sssd/sssd_meraki.log
(Fri Apr 29 09:30:53 2016) [sssd[be[meraki]]] [sdap_save_user]
(0x0400): Storing info for user blinken
(Fri Apr 29 09:31:22 2016) [sssd[be[meraki]]]
[sdap_initgr_rfc2307_next_base] (0x0400): Searching for groups with
base [dc=meraki,dc=com]
(Fri Apr 29 09:31:22 2016) [sssd[be[meraki]]]
[sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with
[(&(memberuid=blinken)(objectClass=posixGroup)(cn=*)(&(gidNumber=*)(!(
gidNumber=0))))][dc=meraki,dc=com].
(Fri Apr 29 09:31:22 2016) [sssd[be[meraki]]]
[sdap_get_generic_op_finished] (0x0400): Search result: Success(0), no
errmsg set

/var/log/sssd/sssd_nss.log
(Fri Apr 29 09:31:22 2016) [sssd[nss]] [nss_cmd_getgrgid_search]
(0x0080): No matching domain found for [1155]
(Fri Apr 29 09:31:22 2016) [sssd[nss]] [nss_cmd_getbynam] (0x0100):
Requesting info for [blinken] from [<ALL>]
(Fri Apr 29 09:31:22 2016) [sssd[nss]] [nss_cmd_initgroups_search]
(0x0100): Requesting info for [***@meraki]
(Fri Apr 29 09:31:26 2016) [sssd[nss]] [calc_flat_name] (0x0080): Flat
name requested but domain has no flat name set, falling back to domain
name
(Fri Apr 29 09:31:26 2016) [sssd[nss]] [nss_cmd_getbynam] (0x0100):
Requesting info for [meraki] from [<ALL>]
(Fri Apr 29 09:31:26 2016) [sssd[nss]] [nss_cmd_initgroups_search]
(0x0080): No matching domain found for [meraki], fail!

We first saw the behaviour on sssd 1.11.7 and have upgraded to sssd
version 1.13.4, with more or less the same symptoms. We've turned
enumerate on and off with no apparent change in behaviour.

Does anyone have any suggestions here? Let me know if I can provide
more detailed debugging information (perhaps off-list).

Cheers,

Patrick


1. https://lists.fedorahosted.org/archives/list/sssd-***@lists.fedorahosted.org/message/V6W4ONRKOOIR7SWSSGRN7DD6PSKKW5YQ/
Lukas Slebodnik
2016-04-29 20:10:26 UTC
Post by Patrick Coleman
Hi,
We've got a number of machines using sssd to connect to LDAP for auth.
In the past we've had problems with sssd crashing regularly[1], but
after posting here we built some custom packages to disable netlink
notifications from the kernel, and it's generally improved.
We're still seeing auth failures across random machines - perhaps 1-2%
when we run a process which connects to all hosts. The machines are
Do you mean IO-related load or CPU-related load?

If the issue is CPU-related, then you can mount the sssd cache on tmpfs
to avoid such problems. (There are plans to improve this in 1.14.)
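
For example, something like this (the size and mode values are only
illustrative, and note that a tmpfs cache does not survive a reboot, so
cached credentials for offline auth would be lost):

# /etc/fstab - illustrative values only
tmpfs  /var/lib/sss/db  tmpfs  size=256M,mode=0700  0  0

# or, after stopping sssd, mount it by hand:
mount -t tmpfs -o size=256M,mode=0700 tmpfs /var/lib/sss/db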

LS
Patrick Coleman
2016-04-30 08:28:58 UTC
Post by Lukas Slebodnik
Post by Patrick Coleman
Hi,
We've got a number of machines using sssd to connect to LDAP for auth.
In the past we've had problems with sssd crashing regularly[1], but
after posting here we built some custom packages to disable netlink
notifications from the kernel, and it's generally improved.
We're still seeing auth failures across random machines - perhaps 1-2%
when we run a process which connects to all hosts. The machines are
Do you meand IO related load or CPU related load?
Lots of both, but we're typically IO bound more of the time.
Post by Lukas Slebodnik
If there is issue with CPU then you can mount sssd cache to tmpfs
to avoid such issues. (there are plans to improve it in 1.14)
Cool, I'll give that a go.

Cheers
Patrick
Jakub Hrozek
2016-05-01 16:04:48 UTC
Post by Patrick Coleman
Post by Lukas Slebodnik
Post by Patrick Coleman
Hi,
We've got a number of machines using sssd to connect to LDAP for auth.
In the past we've had problems with sssd crashing regularly[1], but
after posting here we built some custom packages to disable netlink
notifications from the kernel, and it's generally improved.
We're still seeing auth failures across random machines - perhaps 1-2%
when we run a process which connects to all hosts. The machines are
Do you meand IO related load or CPU related load?
Lots of both, but we're typically IO bound more of the time.
Post by Lukas Slebodnik
If there is issue with CPU then you can mount sssd cache to tmpfs
to avoid such issues. (there are plans to improve it in 1.14)
Cool, I'll give that a go.
Alternatively, increase the 'timeout' option in sssd's sections..
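
For example (the values here are only illustrative, and [domain/meraki]
is a guess at your domain section name):

[nss]
timeout = 30

[pam]
timeout = 30

[domain/meraki]
timeout = 30

'timeout' is the heartbeat interval the monitor uses to decide whether a
child process is still responding (default 10 seconds), which, as far as
I understand, is what the ping_check messages in your sssd.log relate to.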
Post by Patrick Coleman
Cheers
Patrick
Patrick Coleman
2016-05-03 14:52:03 UTC
Post by Jakub Hrozek
Post by Patrick Coleman
Post by Lukas Slebodnik
Do you meand IO related load or CPU related load?
Lots of both, but we're typically IO bound more of the time.
Post by Lukas Slebodnik
If there is issue with CPU then you can mount sssd cache to tmpfs
to avoid such issues. (there are plans to improve it in 1.14)
Cool, I'll give that a go.
Alternatively, increase the 'timeout' option in sssd's sections..
I appreciate the advice, thank you. I've put /var/lib/sss onto a tmpfs
filesystem on a couple of loaded machines and seen what I believe to
be improvements - it's a little too early to say, but I'll report back
once I have a wider deployment.

I did want to feed back a little of our research into this issue. If
we strace the sssd_be subprocess on a loaded machine, we see it
sitting in msync() and fdatasync() for periods of up to 7.3 seconds in
one test. This is perhaps expected, given the machine is under heavy
IO load, but sssd makes a *lot* of these calls.

In a 7m 49.985s test (this is as long as the sssd_be process lasts
before it is killed by the parent for not replying to ping) on a
machine with moderate disk load and no new interactive logins, sssd
made 232 *sync calls. The median syscall takes only 67ms, but the
maximum is more than seven seconds - in the eight minute test sssd
spent 1m 00.044s in *sync system calls.
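
For anyone who wants to reproduce this, the data came from something
like the following (the exact flags and the pidof filter are an
illustration of the approach, not the precise command line we ran):

# time every sync-family call made by the backend
strace -f -tt -T -e trace=msync,fdatasync,fsync -p "$(pidof sssd_be)" -o sssd_be.strace

# or just collect an aggregate count / total-time summary
strace -c -e trace=msync,fdatasync,fsync -p "$(pidof sssd_be)"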

My (naive) analysis here is that the backend process is spending 13%
of its time unavailable to service account queries, because it's doing
cache maintenance. This seems to rather defeat the point of having a
cache... are my assumptions correct here? I'm happy to send the strace
log (and any other data) to interested parties off-list, just let me
know.

In an attempt to improve this behaviour, in addition to a tmpfs for
/var/lib/sss I've also just added the following to the nss and pam
stanzas in the config:

memcache_timeout = 1800
entry_cache_timeout = 1800

...the idea being they will respond from their own cache without
contacting the backend, which may be busy per the above. Is this
reasonable?

Cheers,

Patrick
Jakub Hrozek
2016-05-03 15:30:23 UTC
Post by Patrick Coleman
Post by Jakub Hrozek
Post by Patrick Coleman
Post by Lukas Slebodnik
Do you meand IO related load or CPU related load?
Lots of both, but we're typically IO bound more of the time.
Post by Lukas Slebodnik
If there is issue with CPU then you can mount sssd cache to tmpfs
to avoid such issues. (there are plans to improve it in 1.14)
Cool, I'll give that a go.
Alternatively, increase the 'timeout' option in sssd's sections..
I appreciate the advice, thankyou. I've put /var/lib/sss on to a tmpfs
filesystem on a couple of loaded machines and seen what I believe to
be improvements - it's a little too early to say, but I'll report back
once I have a wider deployment.
I did want to feed back a little of our research into this issue. If
we strace the sssd_be subprocess on a loaded machine, we see it
sitting in msync() and fdatasync() for periods of up to 7.3 seconds in
one test. This is perhaps expected, given the machine is under heavy
IO load, but sssd makes a *lot* of these calls.
Yes, every cache update does 4 of these. This is a known issue I'm
working on right now:
https://fedorahosted.org/sssd/ticket/2602
In particular:
https://fedorahosted.org/sssd/wiki/DesignDocs/OneFourteenPerformanceImprovements
Post by Patrick Coleman
In a 7m 49.985s test (this is as long as the sssd_be process lasts
before it is killed by the parent for not replying to ping) on a
machine with moderate disk load and no new interactive logins, sssd
made 232 *sync calls. The median syscall takes only 67ms, but the
maximum is more than seven seconds - in the eight minute test sssd
spent 1m 00.044s in *sync system calls.
My (naive) analysis here is that the backend process is spending 13%
of its time unavailable to service account queries, because it's doing
cache maintenance.
Very nice analysis.

Just a detail, it's not cache maintenance, but updates. The thing we are
doing wrong at the moment is that we do a full write of the whole object
even if nothing changes.
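
Conceptually, the planned change looks something like this (a toy sketch
in Python, not the actual sysdb code): compare the freshly fetched entry
with what is already cached and skip the write - and therefore the
msync()/fdatasync() - when nothing changed:

# toy illustration only - not SSSD code
def maybe_store_user(cache, name, fresh_attrs):
    """Write the cached entry only when its attributes actually changed."""
    cached = cache.get(name)
    if cached == fresh_attrs:
        return False                     # identical entry: no transaction, no sync
    cache[name] = dict(fresh_attrs)      # new or changed entry: write it out
    return True

# example: the second call is a no-op because nothing changed
cache = {}
maybe_store_user(cache, "blinken", {"uid": 1155, "gid": 1155})   # True, written
maybe_store_user(cache, "blinken", {"uid": 1155, "gid": 1155})   # False, skipped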
Post by Patrick Coleman
This seems to rather defeat the point of having a
cache... are my assumptions correct here? I'm happy to send the strace
log (and any other data) to interested parties off-list, just let me
know.
In an attempt to improve this behaviour, in addition to a tmpfs for
/var/lib/sss I've also just added the following to the nss and pam
memcache_timeout = 1800
entry_cache_timeout = 1800
...the idea being they will respond from their own cache without
contacting the backend, which may be busy per the above. Is this
reasonable?
Yes, it is, in the sense that the cache writes would be performed less
frequently. But your value of entry_cache_timeout is too low; the default
is 5400. By the way, an authentication request ignores the cache validity,
since during authentication we always try to get fresh group membership.
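
So if the aim is to answer more lookups purely from the local cache, the
value wants to be at or above that default. entry_cache_timeout is
documented as a domain-section option, so a sketch of where it would go
(the section name is a guess, the value illustrative):

[domain/meraki]
entry_cache_timeout = 5400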

When #2602 is implemented, the cost of cache updates should (mostly) go
away.
Jakub Hrozek
2016-05-03 16:04:31 UTC
Post by Jakub Hrozek
Post by Patrick Coleman
Post by Jakub Hrozek
Post by Patrick Coleman
Post by Lukas Slebodnik
Do you meand IO related load or CPU related load?
Lots of both, but we're typically IO bound more of the time.
Post by Lukas Slebodnik
If there is issue with CPU then you can mount sssd cache to tmpfs
to avoid such issues. (there are plans to improve it in 1.14)
Cool, I'll give that a go.
Alternatively, increase the 'timeout' option in sssd's sections..
I appreciate the advice, thankyou. I've put /var/lib/sss on to a tmpfs
filesystem on a couple of loaded machines and seen what I believe to
be improvements - it's a little too early to say, but I'll report back
once I have a wider deployment.
I did want to feed back a little of our research into this issue. If
we strace the sssd_be subprocess on a loaded machine, we see it
sitting in msync() and fdatasync() for periods of up to 7.3 seconds in
one test. This is perhaps expected, given the machine is under heavy
IO load, but sssd makes a *lot* of these calls.
Yes, every cache update does 4 of these. This is a know issue I'm
https://fedorahosted.org/sssd/ticket/2602
https://fedorahosted.org/sssd/wiki/DesignDocs/OneFourteenPerformanceImprovements
By the way, some comparison from my WIP branch. Without the patches,
updating a user who is a member of several hundred large groups with 'id'
takes the following:
Total run time of id was: 19415 ms
Number of zero-level cache transactions: 283
--> Time spent in level-0 sysdb transactions: 7694 ms
--> Time spent writing to LDB: 2958 ms
Number of LDAP searches: 562
Time spent waiting for LDAP: 4548 ms

With the patches to avoid cache writes:
Total run time of id was: 9482 ms
Number of zero-level cache transactions: 283
--> Time spent in level-0 sysdb transactions: 1074 ms
--> Time spent writing to LDB: 38 ms
Number of LDAP searches: 562
Time spent waiting for LDAP: 4792 ms

So I think this already shows a nice improvement, although there is
still quite a bit to work on..
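
To put those numbers another way: that is roughly a 2x overall speedup
(19415 ms down to 9482 ms), with time spent in level-0 sysdb transactions
down about 86% (7694 ms to 1074 ms) and LDB write time down from 2958 ms
to 38 ms, while the LDAP wait time stays essentially the same - which is
what you would expect if only the cache writes were avoided.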
Patrick Coleman
2016-05-06 14:25:28 UTC
Post by Jakub Hrozek
Post by Jakub Hrozek
Yes, every cache update does 4 of these. This is a know issue I'm
https://fedorahosted.org/sssd/ticket/2602
https://fedorahosted.org/sssd/wiki/DesignDocs/OneFourteenPerformanceImprovements
By the way, some comparison from my WIP branch. Without the patches,
updating a user who is a member of several hundred large groups with 'id'
Total run time of id was: 19415 ms
Number of zero-level cache transactions: 283
--> Time spent in level-0 sysdb transactions: 7694 ms
--> Time spent writing to LDB: 2958 ms
Number of LDAP searches: 562
Time spent waiting for LDAP: 4548 ms
Total run time of id was: 9482 ms
Number of zero-level cache transactions: 283
--> Time spent in level-0 sysdb transactions: 1074 ms
--> Time spent writing to LDB: 38 ms
Number of LDAP searches: 562
Time spent waiting for LDAP: 4792 ms
That's good to know, and I'm happy to test any patches - just get in
touch directly.

I wanted to report back that, after a few days of running with a tmpfs
at /var/lib/sss/db, the issue with child processes timing out seems to
have been mostly resolved. Our most loaded machines will report
occasional timeouts, but this doesn't seem to impact service:

(Wed May 4 03:32:38 2016) [sssd] [mark_service_as_started] (0x0400):
SSSD is initialized, terminating parent process
(Wed May 4 03:32:43 2016) [sssd] [services_startup_timeout] (0x0400):
Handling timeout
(Wed May 4 17:15:48 2016) [sssd] [ping_check] (0x0020): A service
PING timed out on [pam]. Attempt [0]
(Thu May 5 04:57:28 2016) [sssd] [ping_check] (0x0020): A service
PING timed out on [nss]. Attempt [0]
(Thu May 5 11:17:18 2016) [sssd] [ping_check] (0x0020): A service
PING timed out on [nss]. Attempt [0]
(Thu May 5 15:44:18 2016) [sssd] [ping_check] (0x0020): A service
PING timed out on [nss]. Attempt [0]
(Fri May 6 03:57:48 2016) [sssd] [ping_check] (0x0020): A service
PING timed out on [nss]. Attempt [0]
(Fri May 6 05:17:28 2016) [sssd] [ping_check] (0x0020): A service
PING timed out on [nss]. Attempt [0]

As far as the actual auth failures go, it's hard to quantify whether
we're now seeing _zero_ problems because they tend to not get
reported, but our metrics are showing substantially fewer failures.

Really appreciate your assistance here, and if you need any further
debugging information just let me know.

Cheers,

Patrick
