Discussion:
sssd-ldap caching issue ?
Thomas HUMMEL
2015-04-03 15:59:10 UTC
Permalink
Hello,

I'm using sssd-ldap-1.11.6 (from the official CentOS repo) on CentOS release
6.6 (Final) on a cluster of compute nodes running the slurm scheduler
(http://slurm.schedmd.com/) in 14.11 version.

Sssd is configured without enumerate, with cache_credentials, and with the default
cache timeout values.

It works fine except in the following case, where there seems to be a caching
issue :

[ the following is 100% reproducible ]

a) I clear the cache with the following commands :

. /etc/init.d/sssd stop
. rm -rf /var/lib/sss/mc/* /var/lib/sss/db/*
. /etc/init.d/sssd start

b) I launch a "job array" consisting of 100 or so simple tasks. Basically this
executes in batch many instances (each one called a task) of the same
program in parallel on the compute node.

Such a job writes its output in a .out text file owned by <user>:<gid>.

-> so many processes end up querying sssd in parallel to retrieve the user's groups

What happens is that :

. the first task completes without error
. tasks 2 and 3 (or something like that) fail with a "permission denied" message
. tasks > 3 complete without error

. also if we ask slurm to launch each task one after the other instead of in
a parallel fashion, the problem does not occur

Note :

- the job array is very fast since each task is very simple. Many tasks complete
in under a second.

- if I don't clear the sssd cache, or if I just issue sss_cache -E or -g, the
problem occurs randomly and may be hard to reproduce.

At full debug level, sssd shows the LDAP answer correctly and, only for entries
not already in the cache, adds so-called "fake groups" :

ex : 'Adding fake group gensoft to sysdb'

A simple patch to slurm to print (with getgroups(2)) the number of groups of the
user shows that, for failed tasks, the list of groups retrieved for <user>
is incomplete, which explains the "permission denied" message.

In fact, the missing groups seem to be the "fake" groups, which are first
put in the sssd cache by the first task.

So my guess is that :

. task 1 fetches the groups missing from the cache and first flags them as "fake"
. before task 1 finishes "resolving" the fake group entries, tasks 2 and 3 read
those incomplete entries and discard them
. task 1 finishes replacing the fake groups with real ones
. the following tasks behave as expected regarding groups

Any ideas ?

Thanks

Here is my sssd.conf file :

[sssd]
config_file_version = 2
services = nss, pam
domains = pasteur_ldap_home

[nss]
filter_users = root,ldap,named,avahi,haldaemon,dbus,radiusd,news,nscd

[pam]



[domain/pasteur_ldap_home]
ldap_tls_reqcert = allow

auth_provider = ldap
ldap_schema = rfc2307
ldap_search_base = xxxx
ldap_group_search_base = xxxx
id_provider = ldap
ldap_id_use_start_tls = True
# We do not authorize password change
chpass_provider = none
ldap_uri = ldap://xxxx/
cache_credentials = True
ldap_tls_cacertdir = /etc/openldap/certs
ldap_network_timeout = 3
# getent passwd will only list /etc/passwd, but id or getent passwd login will query ldap
#enumerate = True
ldap_page_size = 500
#debug_level = 0x02F0
debug_level = 0x77F0
--
Thomas Hummel | Institut Pasteur
<***@pasteur.fr> | Groupe Exploitation et Infrastructure
Jakub Hrozek
2015-04-07 19:42:35 UTC
Permalink
Post by Thomas HUMMEL
Hello,
I'm using sssd-ldap-1.11.6 (from the official CentOS repo) on CentOS release
6.6 (Final) on a cluster of compute nodes running the slurm scheduler
(http://slurm.schedmd.com/) in 14.11 version.
Sssd is configured without enumerate, with cache_credential and default various
cache timeout values.
It works fine except in the following case where there seem to be a caching
[ the following is 100% reproducible ]
. /etc/init.d/sssd stop
. rm -rf /var/lib/sss/mc/* /var/lib/sss/db/*
. /etc/init.d/sssd start
b) I launch a "job array" consisting of 100 or so simple task. Basically this
will execute in batch many instances (each one called a task) of the same
program in parallel on the compute node.
Such a job write its output in a .out text file owned by <user>:<gid>.
-> so many processes end up querying sssd in parallel to retrieve the user groups
. the first task completes without error
. tasks 2 and 3 (or something like that) fail with a "permission denied" message
. tasks > 3 complete without error
. also if we ask slurm to launch each task one after the other instead of in
a parallel fashion, the pb does not occur
- the job array is very fast since each task is very simple. Many tasks can be
completed under a second of time.
- if I don't clear sssd cache or if I just issue sss_cache -E or -g, the
problem occurs randomly and may be hard to reproduce.
At full debug level, sssd shows ldap answer correcty and sssd, only for entries
ex : 'Adding fake group gensoft to sysdb'
A simple patch to slurm in order to print (with getgroups(2)) the number of
group of user shows that, for failed tasks, the number of groups retrieved for <user>
is incomplete, which explains the "permission denied" message.
In fact, the missing groups seem to be the "fake" groups which seem to be first
put in sssd cache by the first task.
. task 1 fetches groups missing from cache and first flag them as "fake"
. before task1 finishes "resolving" fake groups entry, tasks 2 and 3 discard
those incomplete entries
. task 1 finishes replacing fake by real groups
. following tasks behave as expected regarding groups
That sounds like a good analysis, except it would also be a bug :-)

In case the back end is contacted at all and fetches data from the
server, the other requests should be suspended until the first one
finishes.

You said earlier this is 100% reproducible and you were able to gather
the debug logs, right? Could we see them? Since there seems to be some
kind of a race condition, it might be nice to also enable debug_microseconds.
Post by Thomas HUMMEL
Any ideas ?
Thanks
[sssd]
config_file_version = 2
services = nss, pam
domains = pasteur_ldap_home
[nss]
filter_users = root,ldap,named,avahi,haldaemon,dbus,radiusd,news,nscd
[pam]
[domain/pasteur_ldap_home]
ldap_tls_reqcert = allow
auth_provider = ldap
ldap_schema = rfc2307
ldap_search_base = xxxx
ldap_group_search_base = xxxx
id_provider = ldap
ldap_id_use_start_tls = True
# We do not authorize password change
chpass_provider = none
ldap_uri = ldap://xxxx/
cache_credentials = True
ldap_tls_cacertdir = /etc/openldap/certs
ldap_network_timeout = 3
# getent passwd will only list /etc/passwd, but id or getent passwd login will query ldap
#enumerate = True
ldap_page_size = 500
#debug_level = 0x02F0
debug_level = 0x77F0
--
Thomas Hummel | Institut Pasteur
Thomas HUMMEL
2015-04-08 13:27:30 UTC
Permalink
Post by Jakub Hrozek
You said earlier this is 100% reproducable and you were able to gather
the debug logs, right? Could we see them?
Correct. Please see the attached file below.
Post by Jakub Hrozek
Since there seems to be some kind of a race condition, it might be nice to
also enable debug_microseconds.
This log was generated with this option set to True.

Thanks.
--
Thomas Hummel | Institut Pasteur
<***@pasteur.fr> | Groupe Exploitation et Infrastructure
Thomas HUMMEL
2015-04-08 16:04:44 UTC
Permalink
Post by Thomas HUMMEL
. /etc/init.d/sssd stop
. rm -rf /var/lib/sss/mc/* /var/lib/sss/db/*
. /etc/init.d/sssd start
b) I launch a "job array" consisting of 100 or so simple task.
Please note that, if after step a) I issue a command such as

id <user>

before running step b) (as <user>), the problem does not occur.

Also, changing "enumerate" to "True" seems to work around the problem, at least
if we wait for the enumeration to complete.

Thanks.
--
Thomas Hummel | Institut Pasteur
<***@pasteur.fr> | Groupe Exploitation et Infrastructure
Jakub Hrozek
2015-04-09 10:27:14 UTC
Permalink
Post by Thomas HUMMEL
Post by Thomas HUMMEL
. /etc/init.d/sssd stop
. rm -rf /var/lib/sss/mc/* /var/lib/sss/db/*
. /etc/init.d/sssd start
b) I launch a "job array" consisting of 100 or so simple task.
Please note that, if after step a) I issue a command such as
id <user>
before running step b) (as <user>), the problem does not occur.
Also, changing "enumerate" to "True" seems to workaround the problem, at least
if we wait for the enumeration to complete.
Thanks for the logs, the back end logs show no error. Would it be
possible to also see the NSS responder logs that capture the failure?
Thomas HUMMEL
2015-04-09 14:14:28 UTC
Permalink
Would it be possible to also see the NSS responder logs that capture the
failure?
Here it is.

Thanks
--
Thomas Hummel | Institut Pasteur
<***@pasteur.fr> | Groupe Exploitation et Infrastructure
Thomas HUMMEL
2015-04-10 13:26:02 UTC
Permalink
Post by Thomas HUMMEL
Here it is.
Hello,

We tried with sssd 1.12.4 and it doesn't fix the problem

Thanks.
--
Thomas Hummel | Institut Pasteur
<***@pasteur.fr> | Groupe Exploitation et Infrastructure
Thomas HUMMEL
2015-04-10 16:08:03 UTC
Permalink
Post by Thomas HUMMEL
We tried with sssd 1.12.4 and it doesn't fix the problem
To go further in the debugging process, we wanted to know whether the problem comes from
slurm, glibc or sssd. Here's what we tried :

1. we hacked the slurmd code to add a getgroups() call before and after slurm's
initgroups() call :

debug2("Uncached user/gid: %s/%ld", job->user_name, (long)job->gid);
debug2("Before initgroups number of groups for %s/%ld : %d", job->user_name, (long)job->gid, getgroups(0, NULL));
if ((rc = initgroups(job->user_name, job->gid))) {
if ((errno == EPERM) && (getuid() != (uid_t) 0)) {
debug("Error in initgroups(%s, %ld): %m",
job->user_name, (long)job->gid);
} else {
error("Error in initgroups(%s, %ld): %m",
job->user_name, (long)job->gid);
}
return -1;
}
debug2("After initgroups number of groups for %s/%ld : %d", job->user_name, (long)job->gid, getgroups(0, NULL));
return 0;


-> when the problem occurs (note that slurmd is running as root before dropping privileges) :

Apr 10 17:10:28 myriad-n407 slurmstepd[7219]: Before initgroups number of groups for njoly/3044 : 0
Apr 10 17:10:28 myriad-n407 slurmstepd[7219]: After initgroups number of groups for njoly/3044 : 1

-> when the problem does not occur

Apr 10 17:32:14 myriad-n407 slurmstepd[11075]: Before initgroups number of groups for njoly/3044 : 0
Apr 10 17:32:14 myriad-n407 slurmstepd[11075]: After initgroups number of groups for njoly/3044 : 11

So our understanding is that slurm is not to blame

Note : in previous tests where we put a getgroups() call elsewhere in the code,
we sometimes noticed that more than one group was retrieved. So sometimes only a
subset of the supplementary groups is retrieved.

2. We stopped sssd, removed the cache files (mc/* db/*), put the user in the
/etc/passwd file, and put all his groups (supplementary as well as primary)
in /etc/group :

-> the problem does not occur anymore

So we think that glibc is not to blame either.

Conclusion : it seems to us that it really is an sssd problem. Can you point us to
somewhere in the sssd source code where we could start investigating further ? We
are unable to build a test case without slurm.


Thanks
--
Thomas Hummel | Institut Pasteur
<***@pasteur.fr> | Groupe Exploitation et Infrastructure
Jakub Hrozek
2015-04-13 09:24:22 UTC
Permalink
Post by Thomas HUMMEL
Conclusion : it seems to us that it really is an sssd problem. Can you hint us
somewhere in the sssd source code we can start to further investigate because
we are unable to build a test case without slurm.
Thank you for the logs. Can you tell which was the first failing
request? The very first request in the nss log is received at
16:06:15:464051:
(Thu Apr 9 16:06:15:464051 2015) [sssd[nss]]
[nss_cmd_initgroups_search] (0x0100): Requesting info for
[***@pasteur_ldap_home]
There is no cached info, so the request goes to the back end:
(Thu Apr 9 16:06:15:464290 2015) [sssd[nss]] [sss_dp_issue_request]
(0x0400): Issuing request for [0x418850:3:***@pasteur_ldap_home]
And completes:
(Thu Apr 9 16:06:15:481246 2015) [sssd[nss]]
[nss_cmd_initgroups_search] (0x0400): Initgroups for
[***@pasteur_ldap_home] completed
Then another initgroups request is received:
(Thu Apr 9 16:06:15:481526 2015) [sssd[nss]]
[nss_cmd_initgroups_search] (0x0100): Requesting info for
[***@pasteur_ldap_home]

Which is returned from the cache. So I currently don't see a problem in
the logs, but apparently there is one.

If you want to add more debugging to SSSD, the function to start at is
check_cache(). In particular, the part that decides whether the info is
valid is on line 632 in git master. Also, the part that actually sends
the GIDs back to the client is fill_initgr().
Thomas HUMMEL
2015-04-13 17:50:55 UTC
Permalink
Post by Jakub Hrozek
Thank you for the logs. Can you tell which was the first failing
request?
It's hard to tell because, by definition, slurm tasks run in parallel.

In fill_initgr() we added the 2 following debug lines :

DEBUG(SSSDBG_TRACE_FUNC, "XXXX Retrieve %d groups\n", num-1);
/* skip first entry, it's the user entry */
for (i = 0; i < num; i++) {
gid = sss_view_ldb_msg_find_attr_as_uint64(dom, res->msgs[i + 1],
SYSDB_GIDNUM, 0);
posix = ldb_msg_find_attr_as_string(res->msgs[i + 1],
SYSDB_POSIX, NULL);
DEBUG(SSSDBG_TRACE_FUNC, "XXXX lookup entry %d/%d %d\n", i, num-1, gid);
if (!gid) {

Most of the time num matches the correct number of supplementary groups, but
sometimes its value is 0 (when our problem occurs, we guess).

How can we easily print the number of groups for each component/stage of the
request/answer flow (backend, responder, cache, memcache) to narrow the search
?

Besides, can you elaborate on the so-called fake groups and ghost users ? Our
understanding is that fake groups are incomplete group entries put in the cache to
reduce the load on the backend server, and that ghost users are group attributes
meant to avoid creating fake users as group members. Does it make sense to look
in the direction of the fake groups to understand our problem ?

In particular (and if this understanding is correct) : is the responder aware
that it is reading a fake group ? Which component's job is it to fully resolve
the fake group (the one which put it in the cache, or the one which needs the info) ?

Thanks
--
Thomas Hummel | Institut Pasteur
<***@pasteur.fr> | Groupe Exploitation et Infrastructure
Thomas HUMMEL
2015-04-13 18:05:01 UTC
Permalink
Also, is it possible to deactivate or bypass the fast cache and/or the system cache ?


Thanks
--
Thomas Hummel | Institut Pasteur
<***@pasteur.fr> | Groupe Exploitation et Infrastructure
Lukas Slebodnik
2015-04-13 20:12:22 UTC
Permalink
Post by Thomas HUMMEL
Also, is it possible to desactivate or bypass the fastcache and/or the system cache ?
You can disable the fast memory cache if the environment variable SSS_NSS_USE_MEMCACHE
is set to "no". The environment variable needs to be set on the client side (the
process which calls getpw*/getgr*).

But I do not expect any problem with the memory cache. The problem seems to be on
the responder side.

BTW, there is a wiki page [1] which describes some sssd internals.
At least the diagrams might be useful for a better understanding of the work flow.

LS

[1] https://fedorahosted.org/sssd/wiki/InternalsDocs
Thomas HUMMEL
2015-04-14 08:33:18 UTC
Permalink
Post by Lukas Slebodnik
You can disable fast memory cache if environment variable SSS_NSS_USE_MEMCACHE
is set to no. The environment variable need to be set on client side (the
process which call getpw*/getgr*).
Yes thank you, we discovered that a moment after we posted the message ;-)
Post by Lukas Slebodnik
But I do not expect any problem with memory cache. The problem seems to be on
responder side.
Why do you say that ? I mean can't the problem be on the backend side ? We
suspect some kind of race condition because we can only (deterministically)
reproduce the problem with slurm which spawns many processes in the same
second.
Post by Lukas Slebodnik
BTW. There is a wiki page [1] which describes some sssd internals.
At least diagrams might be useful for better understanding of work flow.
Thanks, we've seen and read that. Very good and interesting doc !
--
Thomas Hummel | Institut Pasteur
<***@pasteur.fr> | Groupe Exploitation et Infrastructure
Thomas HUMMEL
2015-04-14 08:36:32 UTC
Permalink
Post by Lukas Slebodnik
But I do not expect any problem with memory cache.
As a matter of fact, we tested it, and disabling the memory cache does not change our problem...
--
Thomas Hummel | Institut Pasteur
<***@pasteur.fr> | Groupe Exploitation et Infrastructure
Lukas Slebodnik
2015-04-14 10:15:14 UTC
Permalink
Post by Thomas HUMMEL
Post by Lukas Slebodnik
But I do not expect any problem with memory cache.
As a matter of fact, we tested it and no memory cache does not change our problem...
I spent a lot of time debugging memory cache bugs, so I expected there could not
be a problem there.

At least you have reduced the scope of the problem.
Thank you very much for troubleshooting. I hope you/we will fix the bug very
soon.

LS
Jakub Hrozek
2015-04-14 09:06:44 UTC
Permalink
Post by Thomas HUMMEL
Post by Lukas Slebodnik
You can disable fast memory cache if environment variable SSS_NSS_USE_MEMCACHE
is set to no. The environment variable need to be set on client side (the
process which call getpw*/getgr*).
Yes thank you, we discovered that a moment after we posted the message ;-)
Post by Lukas Slebodnik
But I do not expect any problem with memory cache. The problem seems to be on
responder side.
Why do you say that ? I mean can't the problem be on the backend side ? We
suspect some kind of race condition because we can only (deterministically)
reproduce the problem with slurm which spawns many processes in the same
second.
Perhaps the backend is signalling to the frontend too soon to
check the cache again after the initial update.

But I'm not sure how to help you without a local reproducer :-/
Jean-Baptiste Denis
2015-04-14 13:37:33 UTC
Permalink
Post by Jakub Hrozek
Perhaps, maybe the backend is signalling to the frontend too soon to
check the cache again after the inital update.
In the function sysdb_initgroups_with_views in the file src/db/sysdb_search.c, we
wrapped the ldb_wait call :

ret = ldb_request(domain->sysdb->ldb, req);
if (ret == LDB_SUCCESS) {
    DEBUG(SSSDBG_TRACE_FUNC, "XXSYSDB before %d %s %d\n", ret, name, res->count);
    ret = ldb_wait(req->handle, LDB_WAIT_ALL);
    DEBUG(SSSDBG_TRACE_FUNC, "XXSYSDB after %d %s %d\n", ret, name, res->count);
}

In some cases (we guess the ones that cause the problem on the client side), we only
have one result after the ldb_wait call :

/var/log/sssd/sssd_nss.log:(Tue Apr 14 15:25:32:222973 2015) [sssd[nss]]
[sysdb_initgroups_with_views] (0x0400): XXSYSDB before 0 jbdenis 1
/var/log/sssd/sssd_nss.log:(Tue Apr 14 15:25:32:223031 2015) [sssd[nss]]
[sysdb_initgroups_with_views] (0x0400): XXSYSDB after 0 jbdenis 1

We suppose that when everything is fine on the client side, we've got six
results after ldb_wait :

/var/log/sssd/sssd_nss.log:(Tue Apr 14 15:25:32:438755 2015) [sssd[nss]]
[sysdb_initgroups_with_views] (0x0400): XXSYSDB before 0 jbdenis 1
/var/log/sssd/sssd_nss.log:(Tue Apr 14 15:25:32:439140 2015) [sssd[nss]]
[sysdb_initgroups_with_views] (0x0400): XXSYSDB after 0 jbdenis 6

Do you think this is relevant to our problem here ? If indeed the backend is
signalling to the frontend too soon, where could we check that ? Do you have a
hint ?
Post by Jakub Hrozek
But I'm not sure how to help you without a local reproducer :-/
Yep, we understand. We're trying to build a test case, but no luck so far.

Thank you for your help.

Jean-Baptiste
Jakub Hrozek
2015-04-15 06:41:38 UTC
Permalink
Post by Jean-Baptiste Denis
Post by Jakub Hrozek
Perhaps, maybe the backend is signalling to the frontend too soon to
check the cache again after the inital update.
In function sysdb_initgroups_with_views in file src/db/sysdb_search.c, We
ret = ldb_request(domain->sysdb->ldb, req);
if (ret == LDB_SUCCESS) {
DEBUG(SSSDBG_TRACE_FUNC, "XXSYSDB before %d %s %d\n", ret, name, res->count);
ret = ldb_wait(req->handle, LDB_WAIT_ALL);
DEBUG(SSSDBG_TRACE_FUNC, "XXSYSDB after %d %s %d\n", ret, name, res->count);
}
In some cases (we guess the ones that cause problem on the client side), we only
/var/log/sssd/sssd_nss.log:(Tue Apr 14 15:25:32:222973 2015) [sssd[nss]]
[sysdb_initgroups_with_views] (0x0400): XXSYSDB before 0 jbdenis 1
/var/log/sssd/sssd_nss.log:(Tue Apr 14 15:25:32:223031 2015) [sssd[nss]]
[sysdb_initgroups_with_views] (0x0400): XXSYSDB after 0 jbdenis 1
We suppose that when everything is fine on the client side, we've got six
/var/log/sssd/sssd_nss.log:(Tue Apr 14 15:25:32:438755 2015) [sssd[nss]]
[sysdb_initgroups_with_views] (0x0400): XXSYSDB before 0 jbdenis 1
/var/log/sssd/sssd_nss.log:(Tue Apr 14 15:25:32:439140 2015) [sssd[nss]]
[sysdb_initgroups_with_views] (0x0400): XXSYSDB after 0 jbdenis 6
DO you think this is relevant to our problem here ? If indeed the backend is
signalling to thre frontend too soon, where could we check that ? Do you have a
hint ?
I think this means the frontend (responder) either checks too soon or
the back end wrote incomplete data.

The responder is the sssd_nss process. When the getgrouplist() request
arrives, the cache validity is checked. If the cache is empty or too
old, the sssd_nss process queries the sssd_be process to update the
cache. When the sssd_be process is done, it sends a dbus signal (over a
private unix socket, not the system bus) that the cache is up-to-date
and the responder then re-reads the cache and returns the result.

I wonder if adding another sysdb_initgroups call into
sdap_get_initgr_recv() would verify when/if the groups were written?
Post by Jean-Baptiste Denis
Post by Jakub Hrozek
But I'm not sure how to help you without a local reproducer :-/
Yep, we understand. We're trying to build a test case, but no luck so.
Thank you for your help.
I tried to write a simple program that just calls getgrouplist() in many
concurrent threads to simulate your behaviour, but couldn't reproduce
the problem...
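Roughly something along these lines (just a sketch of the kind of test, not the
exact program; the user name and thread count are placeholders) :

/* sketch: call getgrouplist() from many threads at the same time */
#define _GNU_SOURCE
#include <grp.h>
#include <pwd.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 24

static const char *user;
static gid_t primary_gid;

static void *worker(void *arg)
{
    gid_t groups[128];
    int ngroups = 128;

    (void)arg;
    if (getgrouplist(user, primary_gid, groups, &ngroups) == -1) {
        fprintf(stderr, "buffer too small, need %d entries\n", ngroups);
    } else {
        printf("got %d groups for %s\n", ngroups, user);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t tids[NTHREADS];
    struct passwd *pw;
    int i;

    user = (argc > 1) ? argv[1] : "someuser";   /* placeholder */
    pw = getpwnam(user);
    if (pw == NULL) {
        perror("getpwnam");
        return 1;
    }
    primary_gid = pw->pw_gid;

    for (i = 0; i < NTHREADS; i++) {
        pthread_create(&tids[i], NULL, worker, NULL);
    }
    for (i = 0; i < NTHREADS; i++) {
        pthread_join(tids[i], NULL);
    }
    return 0;
}

(compile with -pthread)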
Jean-Baptiste Denis
2015-04-15 10:05:03 UTC
Permalink
Post by Jakub Hrozek
I think this means the frontend (responder) either checks too soon or
the back end wrote incomplete data.
OK.
Post by Jakub Hrozek
The responder is the sssd_nss process. When the getgrouplist() request
arrives, the cache validity is checked. If the cache is empty or too
old, the sssd_nss process queries the sssd_be process to update the
cache. When the sssd_be process is done, it sends a dbus signal (over a
private unix socket, not the system bus) that the cache is up-to-date
and the
Thank you for clearing that up. It corresponds quite well to the model we had already
built ourselves by reading the different design documents on the wiki and your blog
post "Anatomy of SSSD user lookup".
(https://jhrozek.wordpress.com/2015/03/11/anatomy-of-sssd-user-lookup/)

We are not developers per se (and certainly not in C), and we have a hard time
matching this model to the asynchronous nature of the code using tevent, SBus
and ldb. Trial and error is our best tool at the moment =)
Post by Jakub Hrozek
I wonder if adding another sysdb_initgroups call into
sdap_get_initgr_recv() would verify when/if the groups were written?
Do you mean literally adding a call to sysdb_initgroups within this function ?
Is that possible with the res parameter of the sdap_get_initgr_recv function ?

int sdap_get_initgr_recv(struct tevent_req *req)
{
    TEVENT_REQ_RETURN_ON_ERROR(req);

    return EOK;
}
Post by Jakub Hrozek
I tried to write a simple program that just calls getgrouplist() in many
concurrent threads to simulate your behaviour, but couldn't reproduce
the problem...
We tried that with processes too, but we were not able to reproduce the problem either.

Our next step will be to mimic the flow of slurmd on job execution to build a
test case : the slurmd process (double- ?) forks a slurmstepd process that will
call the _initgroups slurm function within the _drop_privileges function :

https://github.com/SchedMD/slurm/blob/master/src/slurmd/slurmstepd/mgr.c

That's our best bet at the moment.

Jean-Baptiste
Thomas HUMMEL
2015-04-15 12:17:38 UTC
Permalink
Post by Jakub Hrozek
I think this means the frontend (responder) either checks too soon
But in that case, wouldn't it see no answer instead of a wrong or incomplete answer ?
Post by Jakub Hrozek
or the back end wrote incomplete data.
My understanding is that it can be valid (for the backend to write incomplete
data), and that it has something to do with the 'fake group' concept (which is
why I was asking you earlier how they work) : is that correct ?

Thanks
--
Thomas Hummel | Institut Pasteur
<***@pasteur.fr> | Groupe Exploitation et Infrastructure
Jakub Hrozek
2015-04-15 18:48:02 UTC
Permalink
Post by Thomas HUMMEL
Post by Jakub Hrozek
I think this means the frontend (responder) either checks too soon
But in that case wouldn't it see no answer instead of wrong or incomplete answer ?
I suspected that the user entry is written but not the groups.
Post by Thomas HUMMEL
Post by Jakub Hrozek
or the back end wrote incomplete data.
My undestanding is that it can be valid (for the backend to write incomplete
data) and that it has something to do with the 'fake group' concept (which is
why I was asking you how they worked previously) : is that correct ?
A shot in the dark but maybe worth a try - can you try disabling the
cleanup task?

ldap_purge_cache_timeout = 0

in the [domain] section. The cleanup might cause some groups with no
members to be removed, I wonder if that is your case..
Jean-Baptiste Denis
2015-04-15 20:58:12 UTC
Permalink
Post by Jakub Hrozek
A shot in the dark but maybe worth a try - can you try disabling the
cleanup task?
ldap_purge_cache_timeout = 0
in the [domain] section. The cleanup might cause some groups with no
members to be removed, I wonder if that is your case..
Just did this, but it didn't work.

Maybe I don't understand the purpose of this test, but the result does not
surprise me, because the ldap cache is empty at that time. As Thomas stated in
the initial message of this thread, our actual test case involves:

. /etc/init.d/sssd stop
. rm -rf /var/lib/sss/mc/* /var/lib/sss/db/*
. /etc/init.d/sssd start

before running anything else. So I guess the ldap backend has no need to be
cleaned up at this particular time. If I run the test case again without
restarting sssd and without cleaning up the cache, I get no problem for the next
jobs (maybe until the next ldap purge - I think that this is exactly how we first
encountered the problem : sometimes, some jobs were failing with a permission
denied error while accessing a directory owned by one of the user's supplementary
groups. The instrumented slurmd code showed us that initgroups was not
correctly getting the secondary groups. And the sssd backend log showed some
purge activity, if I remember correctly - needs confirmation).
Post by Jakub Hrozek
I think this means the frontend (responder) either checks too soon or
the back end wrote incomplete data.
We are not 100% sure that we've found the right place to look, but each time
we instrumented the code to print the number of groups, we got the correct
answer.

Maybe you could show us exactly where to look for the following :

- where the backend is writing the groups data to the sysdb cache
- where the backend is signaling to the responder that the cache has been updated
- where the responder is aware that he can now check the cache to get the answer
- where the responder is actually getting the data from the sysdb cache

Thank you for your help,

Jean-Baptiste
Jakub Hrozek
2015-04-16 07:59:16 UTC
Permalink
Post by Jean-Baptiste Denis
Post by Jakub Hrozek
A shot in the dark but maybe worth a try - can you try disabling the
cleanup task?
ldap_purge_cache_timeout = 0
in the [domain] section. The cleanup might cause some groups with no
members to be removed, I wonder if that is your case..
Just did this, but didn't work.
Maybe I don't understand the purpose of this test, but the result does not
surprise me because the ldap cache is empty at that time. As Thomas stated in
. /etc/init.d/sssd stop
. rm -rf /var/lib/sss/mc/* /var/lib/sss/db/*
. /etc/init.d/sssd start
before running anything else. So I guess the ldap backend has no need to be
cleaned up at this particular time.
I was suspecting a race condition because, like the rest of SSSD,
the cleanup task is asynchronous. I was suspecting the following might
have happened:
 - initgroups starts:
   - users are written to the cache
   - groups are written to the cache but not linked yet to the user
     objects
 - cleanup task starts
 - cleanup task removes the group objects because they are
   "empty". It shouldn't happen because the cleanup task should
   only remove expired entries, but IIRC Lukas saw a similar
   race-condition elsewhere.
Post by Jean-Baptiste Denis
If I run the test case again without
restarting sssd and without cleaning up the cache, I've got no problem for next
jobs (maybe until the next ldap purge. I think that this is exactly how we first
encounter the problem : sometimes, some jobs were failing with a permission
denied error while accessing a directory owned by one the user supplementary
groups. The instrumented slurmd code showed us that the initgroups was not
correctly getting the secondary groups. And the sssd backend log showed some
purge activity if I remember correcty - need confirmation -)
Post by Jakub Hrozek
I think this means the frontend (responder) either checks too soon or
the back end wrote incomplete data.
We are not 100% sure that we've found the right place to look at, but each time
we instrumented the code to print the number of groups, we've got the correct
answer.
- where the backend is writing the groups data to the sysdb cache
So the operation that evaluates what groups the user is a member of is
called initgroups. IIRC you're using the rfc2307 (non-bis) schema, so
the initgroups request that you run starts at
src/providers/ldap/sdap_async_initgroups.c:385 in function
sdap_initgr_rfc2307_send() and ends at sdap_initgr_rfc2307_recv()
Post by Jean-Baptiste Denis
- where the backend is signaling to the responder that the cache has been updated
The schema-specific request is the one I listed above; it then
returns to the generic LDAP code in ldap_common.c. The function that
signals over sbus (the dbus protocol used over a unix socket) is
sdap_handler_done(), in particular be_req_terminate()
Post by Jean-Baptiste Denis
- where the responder is aware that he can now check the cache to get the answer
This is done in src/responder/common/responder_dp.c. The request is
sent with sss_dp_get_account_send().

This code is a bit complex, because concurrent requests are just added to a
queue in sss_dp_issue_request() if the corresponding request is already
found in the rctx->dp_request_table hash table. But the first request that
finishes would receive an sbus message from the provider in
sss_dp_internal_get_done(). Then it would iterate over the queue of
requests and mark them as done or failed.

The callback that should be invoked by this generic NSS code is
nss_cmd_getby_dp_callback().
Post by Jean-Baptiste Denis
- where the responder is actually getting the data from the sysdb cache
src/responder/nss/nsssrv_cmd.c, in particular
nss_cmd_initgroups_search() and the function check_cache().
Jean-Baptiste Denis
2015-04-16 09:37:53 UTC
Permalink
Post by Jakub Hrozek
I was suspecting a race condition, because as well as the rest of SSSD,
the cleanup task is asynchronous. I was suspecting the following might
- users are written to the cache
- groups are written to the cache but not linked yet to the user
objects
- cleanup tasks starts
- cleanup task removes the group objects because they are
"empty". It shouldn't happen because the cleanup task should
only remove expired entries, but IIRC Lukas saw a similar
race-condition elsewhere.
"groups are written to the cache but not linked yet to the user objects"

Is it possible for the responder to answer a client about group information
before the groups are written to the cache AND linked to it ? That's what the
getgroups syscall (from the client) returning the wrong number of groups would
suggest when the problem occurs. Could that be related to ghost or fake entries ?
Post by Jakub Hrozek
Post by Jean-Baptiste Denis
- where the backend is writing the groups data to the sysdb cache
So the operation that evaluates what groups the user is a member of is
called initgroups. IIRC you're using the rfc2307 (non-bis) schema, so
the initgroups request that you run starts at
src/providers/ldap/sdap_async_initgroups.c:385 in function
sdap_initgr_rfc2307_send() and ends at sdap_initgr_rfc2307_recv()
Post by Jean-Baptiste Denis
- where the backend is signaling to the responder that the cache has been updated
The schema-specific request is the one I listed above, then
returns to the generic LDAP code in ldap_common.c. The function that
signals over sbus (dbus protocol used over unix socket) is at
sdap_handler_done(), in particular be_req_terminate()
Post by Jean-Baptiste Denis
- where the responder is aware that he can now check the cache to get the answer
This is done in src/responder/common/responder_dp.c. The request is
sent with sss_dp_get_account_send().
This code is a bit complex, because concurrent requests are just added to
queue in sss_dp_issue_request() if the corresponding request is already
found in rctx->dp_request_table hash table. But the first request that
finishes would receive an sbus message from the provider in
sss_dp_internal_get_done(). Then it would iterate over the queue of
requests and mark them as done or failed.o
The callback that should be invoked by this generic NSS code is
nss_cmd_getby_dp_callback().
Post by Jean-Baptiste Denis
- where the responder is actually getting the data from the sysdb cache
src/responder/nss/nsssrv_cmd.c, in particular
nss_cmd_initgroups_search() and the function check_cache().
Thank you for this extensive answer. We were quite close to this understanding.
We'll try to dig more.

Jean-Baptiste
Jakub Hrozek
2015-04-16 10:01:29 UTC
Permalink
Post by Jean-Baptiste Denis
Post by Jakub Hrozek
I was suspecting a race condition, because as well as the rest of SSSD,
the cleanup task is asynchronous. I was suspecting the following might
- users are written to the cache
- groups are written to the cache but not linked yet to the user
objects
- cleanup tasks starts
- cleanup task removes the group objects because they are
"empty". It shouldn't happen because the cleanup task should
only remove expired entries, but IIRC Lukas saw a similar
race-condition elsewhere.
"groups are written to the cache but not linked yet to the user objects"
Is it possible for the responder to answer a client about groups information
before the groups are written to the cache AND linked to it ? That's what the
getgroups syscall (from the client) returning the wrong number of group would
suggest when the problem occurs. Could that be related to ghost or fake entries ?
No, it shouldn't be. The whole backend request should run, and only then should
the backend signal to the frontend to re-check the cache. That's why
I was suspecting the cleanup task: it's asynchronous.
Jean-Baptiste Denis
2015-04-16 10:31:39 UTC
Permalink
Post by Jakub Hrozek
No, it shouldn't be. The whole backend request should run and only then
the backend should signal to frontend to re-check the cache. That's why
I was suspecting the cleanup task, it's asynchronous.
Got it, thank you.

Jean-Baptiste
Jean-Baptiste Denis
2015-04-18 01:27:33 UTC
Permalink
No, it shouldn't be. The whole backend request should run and only then the
backend should signal to frontend to re-check the cache. That's why I was
suspecting the cleanup task, it's asynchronous.
I think I've got a test case without involving slurm. It is quite reproducible
on my machine. Since it looks like a race, you may need to tweak the parameters
of the python script.

The basic idea is to run a bunch of processes that each wait for a short, randomized
amount of time before calling the initgroups libc function for a specific user.

You have to log in as root and not use sudo, to prevent the sssd cache from being
populated before the test is started. You also *need* to clean up the sssd state
before running the test.

usage:

## log as root
## check the number of secondary groups for a user, using id for example
# id jbdenis

uid=21489(jbdenis) gid=110(sis)
groups=110(sis),3044(CIB),19(floppy),1177(dump-projets),56(netadm),3125(vpn-ssl-admin)

Here I've got 5 secondary groups (sis is my primary group)

## !! VERY IMPORTANT !! cleanup sssd state
# /etc/init.d/sssd stop && rm -f /var/lib/sss/mc/* /var/lib/sss/db/* && /etc/init.d/sssd start


## run this program
# python initgroups.py jbdenis 110 5 24 200
wrong number of secondary groups in process 17145 : 0 instead of 5 (sleep 55ms)
wrong number of secondary groups in process 17149 : 0 instead of 5 (sleep 55ms)
2/24 failed

# first parameter is a login
# second parameter is your primary gid (could be anything)
# third parameter is your number of secondary groups
# fourth parameter is the number of process you want to run concurrently
# the last parameter is the maximum delay in milliseconds before calling
initgroups (the delay is randomized up to this maximum)

I've got good results with 24 processes and a randomized delay of 200ms between
startups. Those parameters probably depend somewhat on the machine you're running
the script on, I guess. You may have to run this test multiple times before
triggering the bug.

I'm unable to reproduce the bug when I use 0 delay, and I think that is why we could
reproduce it with our initial test case.
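For the record, the same idea can be sketched in C as well (this is not the attached
Python script; the user name, gid and parameters below are placeholders, and it has
to run as root because initgroups() changes the process group list) :

/* sketch: fork N children; each sleeps a random delay, then calls
 * initgroups() and counts the resulting groups with getgroups() */
#define _GNU_SOURCE
#include <grp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *user = (argc > 1) ? argv[1] : "jbdenis";   /* placeholder */
    gid_t gid = (argc > 2) ? (gid_t)atoi(argv[2]) : 110;   /* primary gid */
    int nprocs = 24;                                       /* concurrency */
    int max_delay_ms = 200;                                /* max random delay */
    int i;

    for (i = 0; i < nprocs; i++) {
        if (fork() == 0) {
            srand(getpid());
            usleep((rand() % (max_delay_ms + 1)) * 1000);

            if (initgroups(user, gid) != 0) {              /* needs root */
                perror("initgroups");
                _exit(1);
            }
            /* getgroups(0, NULL) only returns the size of the group set */
            printf("pid %d: %d groups\n", (int)getpid(), getgroups(0, NULL));
            _exit(0);
        }
    }
    while (wait(NULL) > 0) {
        /* wait for all children */
    }
    return 0;
}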

I really hope that you could reproduce the bug on your side.

Thank you for your help,

Jean-Baptiste
Jean-Baptiste Denis
2015-04-21 21:37:44 UTC
Permalink
Post by Jean-Baptiste Denis
I think I've got a test case without involving slurm. It is quite reproductible
on my machine. Since it looks like a race, you may need to tweak the parameter
of the python script.
Hi,

has anyone from the sssd team (or not ;)) had a chance to reproduce the bug
with the attached script in the previous message ?

Jean-Baptiste
Jakub Hrozek
2015-04-22 13:57:39 UTC
Permalink
Post by Jean-Baptiste Denis
Post by Jean-Baptiste Denis
I think I've got a test case without involving slurm. It is quite reproductible
on my machine. Since it looks like a race, you may need to tweak the parameter
of the python script.
Hi,
does anyone from the sssd team (or not ;)) had a chance to reproduce the bug
with the attached script in the previous message ?
Sorry, but not yet. I keep this task on my TODO list, but at the moment,
I need to finish another task, sorry.
Jean-Baptiste Denis
2015-04-23 16:12:15 UTC
Permalink
Post by Jakub Hrozek
Sorry, but not yet. I keep this task on my TODO list, but at the moment,
I need to finish another task, sorry.
No need to justify, I was just relieved to have been able to reproduce the
problem without involving slurm.

Jean-Baptiste
Chris Petty
2015-04-23 19:11:56 UTC
Permalink
I actually tried it and it was reproducible on my system using sssd 1.11.6 ( ad and ldap config ).

[***@dirac linux]# python initgroups.py cmp12 119549 95 24 200
wrongs number of secondary groups in process 4363 : 5 instead of 95 (sleep 78ms)
wrongs number of secondary groups in process 4366 : 5 instead of 95 (sleep 95ms)
wrongs number of secondary groups in process 4353 : 5 instead of 95 (sleep 90ms)
wrongs number of secondary groups in process 4362 : 5 instead of 95 (sleep 108ms)
wrongs number of secondary groups in process 4358 : 5 instead of 95 (sleep 110ms)
wrongs number of secondary groups in process 4371 : 5 instead of 95 (sleep 121ms)


I’ve been following the thread because I see this same behavior on our Linux cluster, which uses sssd for authentication.

When a lot of jobs hit the cluster, sometimes we’ll get failures because of authentication:
"failed assumedly before job:can't get password entry for user "wfb6". Either the user does not exist or NIS error!"

Presumably the authentication mechanism could not keep up with the number of requests (or the large number of groups per user in the domain).

-Chris
Post by Jean-Baptiste Denis
Post by Jean-Baptiste Denis
I think I've got a test case without involving slurm. It is quite reproductible
on my machine. Since it looks like a race, you may need to tweak the parameter
of the python script.
Hi,
does anyone from the sssd team (or not ;)) had a chance to reproduce the bug
with the attached script in the previous message ?
Jean-Baptiste
Jean-Baptiste Denis
2015-04-24 17:40:13 UTC
Permalink
Post by Chris Petty
I actually tried it and it was reproducible on my system using sssd 1.11.6 ( ad and ldap config ).
Thank you for trying it on your side and reporting it. I was able to reproduce
it with 1.12.4, 1.11.6 and 1.9.7, in the hope of bisecting the bug =)

Since we're not the only ones anymore, I think it is a good time to open a bug :
#2634 (https://fedorahosted.org/sssd/ticket/2634)

Jean-Baptiste
Jakub Hrozek
2015-04-24 17:49:26 UTC
Permalink
Post by Jean-Baptiste Denis
Post by Chris Petty
I actually tried it and it was reproducible on my system using sssd 1.11.6 ( ad and ldap config ).
Thank you for trying it on your side and reporting it. I was able to reproduce
it with 1.12.4, 1.11.6 and 1.9.7 in the hope of bissecting the bug =)
#2634 (https://fedorahosted.org/sssd/ticket/2634)
Thanks for opening the ticket, that way the thread won't be lost!
Jean-Baptiste Denis
2015-04-27 10:12:24 UTC
Permalink
Post by Jean-Baptiste Denis
#2634 (https://fedorahosted.org/sssd/ticket/2634)
I've also opened a support case (#01436512) on access.redhat.com (I've
reproduced the bug on a production RHEL6 server).

Jean-Baptiste
Jean-Baptiste Denis
2015-04-28 14:50:52 UTC
Permalink
Post by Jean-Baptiste Denis
I've also opened a support case (#01436512) on access.redhat.com (I've
reproduced the bug on a production RHEL6 server).
Private bug #1215765 opened at redhat:

https://bugzilla.redhat.com/show_bug.cgi?id=1215765

Jean-Baptiste
Lukas Slebodnik
2015-04-30 08:11:51 UTC
Permalink
Post by Jean-Baptiste Denis
No, it shouldn't be. The whole backend request should run and only then the
backend should signal to frontend to re-check the cache. That's why I was
suspecting the cleanup task, it's asynchronous.
I think I've got a test case without involving slurm. It is quite reproductible
on my machine. Since it looks like a race, you may need to tweak the parameter
of the python script.
The basic idea is to run a bunch of process and wait for a slight amount of time
before calling the initgroups libc function for a specific user
You have to log as root and not use sudo to prevent sssd cache to be populated
before the test is started. You also *need* to cleanup sssd state before running
the test.
## log as root
## check the number secondary group for a user using id for example
# id jbdenis
uid=21489(jbdenis) gid=110(sis)
groups=110(sis),3044(CIB),19(floppy),1177(dump-projets),56(netadm),3125(vpn-ssl-admin)
Here I've got 5 secondary groups (sis is my primary group)
## !! VERY IMPORTANT !! cleanup sssd state
# /etc/init.d/sssd stop && rm -f /var/lib/sss/mc/* /var/lib/sss/db/* &&
/etc/init.d/sssd start
## run this program
# python initgroups.py jbdenis 110 5 24 200
wrong number of secondary groups in process 17145 : 0 instead of 5 (sleep 55ms)
wrong number of secondary groups in process 17149 : 0 instead of 5 (sleep 55ms)
2/24 failed
# first parameter is a login
# second parameter is your primary gid (could be anything)
# third parameter is your number of secondary groups
# fourth parameter is the number of process you want to run concurrently
# the last parameter is the maximum delay in milliseconds before calling
initgroups (the delay is randomized up to this maximum)
I've got good results with 24 processes and randomized delay of 200ms between
startup. Those parameters are somewhat relative to the machine you're running
the script on I guess. You may have to run this test multiple time before
triggering the bug.
I'm unable to reproduce the bug when I use 0 delay and I think that why we could
reproduce it with our initial test case.
I really hope that you could reproduce the bug on your side.
Thank you for your help,
I tried to reproduce the bug with your script, but I was not successful.

Domain section from sssd.conf
[domain/refLDAP]
id_provider = ldap
auth_provider = ldap
debug_level = 0xFFF0
ldap_uri = ldap://172.17.0.1
ldap_search_base = dc=example,dc=com
ldap_schema = rfc2307bis
ldap_group_object_class = groupOfNames
timeout = 600
ldap_pwd_policy = shadow

I tried different values for the number of processes and the maximum delay in milliseconds:
{1..12} x {50ms..300ms, step 10ms}

My laptop has 4 cores and "Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz"

There has to be something different in my configuration.
Could you provide more information on how to reproduce?

LS
Jean-Baptiste Denis
2015-04-30 13:39:09 UTC
Permalink
Post by Lukas Slebodnik
I tried to reproduce bug with your script but I was not successful.
Domain section from sssd.conf
[domain/refLDAP]
id_provider = ldap
auth_provider = ldap
debug_level = 0xFFF0
ldap_uri = ldap://172.17.0.1
ldap_search_base = dc=example,dc=com
ldap_schema = rfc2307bis
ldap_group_object_class = groupOfNames
timeout = 600
ldap_pwd_policy = shadow
I tried different values for number of process and maximum delay in milliseconds
{1..12}x{50ms..300ms/step 10ms}
There have to be something different in my configuration.
Could you provide more information how to reproduce?
Mmmm...

This is our domain section :

[domain/pasteur_ldap_home]
ldap_tls_reqcert = allow
auth_provider = ldap
ldap_schema = rfc2307
ldap_search_base = xxxx
ldap_group_search_base = xxxx
id_provider = ldap
ldap_id_use_start_tls = True
chpass_provider = none
ldap_uri = ldap://xxxx/
cache_credentials = True
ldap_tls_cacertdir = /etc/openldap/certs
ldap_network_timeout = 3
ldap_page_size = 500
debug_level = 0x77F0

We're using the rfc2307 schema and the default ldap_group_object_class value
(posixGroup). Besides that, I don't see what could explain why you can't
reproduce the problem. Chris Petty is using AD, hence the rfc2307bis schema, so I
don't know if that is relevant.

Just to be sure, did you log in as root (no sudo), stop sssd, clean up the cache,
restart it (all as root, without sudo), and run the script (as root) ?

# (logged as root)
# /etc/init.d/sssd stop && rm -f /var/lib/sss/mc/* /var/lib/sss/db/* && /etc/init.d/sssd start
# python initgroups.py jbdenis 110 5 24 200

Sometimes I have to perform these steps multiple times to catch the problem.

Jean-Baptiste
Chris Petty
2015-04-30 14:29:56 UTC
Permalink
Here is my domain section … reproducible every time if I clear the sssd cache.


[domain/default]
debug_level = 9
id_provider = ad
auth_provider = ad
access_provider = ldap
chpass_provider = ad
ad_domain = dhe.duke.edu
ldap_search_base = DC=dhe,DC=duke,DC=edu
ldap_idmap_default_domain = dhe.duke.edu
ldap_sasl_mech = GSSAPI
ldap_user_principle = workAround
ldap_account_expire_policy = ad
ldap_access_order = expire
ldap_schema = ad
ldap_referrals = False
ldap_id_mapping = True
ldap_force_upper_case_realm = True
ldap_user_search_base = DC=dhe,DC=duke,DC=edu?subtree?(memberOf=CN=BIAC-Users,OU=Groups,OU=BIAC,OU=SOM,OU=EnterpriseResources,DC=dhe,DC=duke,DC=edu)
ldap_idmap_default_domain_sid = REMOVED
ldap_tls_reqcert = never
case_sensitive = False
krb5_lifetime = 10h
krb5_renewable_lifetime = 7d
krb5_renew_interval = 3600
krb5_ccachedir = /mnt/cluster_dhe/clustertmp/common/krb5ccache
krb5_ccname_template = FILE:%d/krb5cc_%U_XXXXXX
ldap_account_expire_policy = ad
krb5_realm = DHE.DUKE.EDU
#these will go away with IDMU uid
ldap_idmap_range_size = 20000000
ldap_idmap_range_min = 0
ldap_idmap_range_max = 2000000000
min_id = 500
override_gid = 197250
cache_credentials = True
ignore_group_members = True
Post by Jean-Baptiste Denis
Post by Lukas Slebodnik
I tried to reproduce bug with your script but I was not successful.
Domain section from sssd.conf
[domain/refLDAP]
id_provider = ldap
auth_provider = ldap
debug_level = 0xFFF0
ldap_uri = ldap://172.17.0.1
ldap_search_base = dc=example,dc=com
ldap_schema = rfc2307bis
ldap_group_object_class = groupOfNames
timeout = 600
ldap_pwd_policy = shadow
I tried different values for number of process and maximum delay in milliseconds
{1..12}x{50ms..300ms/step 10ms}
There have to be something different in my configuration.
Could you provide more information how to reproduce?
Mmmm...
[domain/pasteur_ldap_home]
ldap_tls_reqcert = allow
auth_provider = ldap
ldap_schema = rfc2307
ldap_search_base = xxxx
ldap_group_search_base = xxxx
id_provider = ldap
ldap_id_use_start_tls = True
chpass_provider = none
ldap_uri = ldap://xxxx/
cache_credentials = True
ldap_tls_cacertdir = /etc/openldap/certs
ldap_network_timeout = 3
ldap_page_size = 500
debug_level = 0x77F0
We're using rfc2307 schema and default ldap_group_object_class value
(posixGroup). Besides that, I don't see what could explain that you can't
reproduce the problem. Chris Petty is using AD hence rc2307bis schema. So I
don't know if it is relevant.
Just to sure, did you log as root (no sudo), stopped sssd, cleanup the cache,
restarting it (all as root without sudo), and ran the script (as root) ?
# (logged as root)
# /etc/init.d/sssd stop && rm -f /var/lib/sss/mc/* /var/lib/sss/db/* &&
/etc/init.d/sssd start
# python initgroups.py jbdenis 110 5 24 200
Sometimes I have to perform these steps multiple time to catch the problem.
Jean-Baptiste
Lukas Slebodnik
2015-05-05 12:56:35 UTC
Permalink
Post by Chris Petty
Here is my domain section … reproducible every time if i clear the sssd cache.
[domain/default]
debug_level = 9
id_provider = ad
auth_provider = ad
access_provider = ldap
chpass_provider = ad
ad_domain = dhe.duke.edu
ldap_search_base = DC=dhe,DC=duke,DC=edu
ldap_idmap_default_domain = dhe.duke.edu
ldap_sasl_mech = GSSAPI
ldap_user_principle = workAround
ldap_account_expire_policy = ad
ldap_access_order = expire
ldap_schema = ad
ldap_referrals = False
ldap_id_mapping = True
ldap_force_upper_case_realm = True
ldap_user_search_base = DC=dhe,DC=duke,DC=edu?subtree?(memberOf=CN=BIAC-Users,OU=Groups,OU=BIAC,OU=SOM,OU=EnterpriseResources,DC=dhe,DC=duke,DC=edu)
ldap_idmap_default_domain_sid = REMOVED
ldap_tls_reqcert = never
case_sensitive = False
krb5_lifetime = 10h
krb5_renewable_lifetime = 7d
krb5_renew_interval = 3600
krb5_ccachedir = /mnt/cluster_dhe/clustertmp/common/krb5ccache
krb5_ccname_template = FILE:%d/krb5cc_%U_XXXXXX
ldap_account_expire_policy = ad
krb5_realm = DHE.DUKE.EDU
#these will go away with IDMU uid
ldap_idmap_range_size = 20000000
ldap_idmap_range_min = 0
ldap_idmap_range_max = 2000000000
min_id = 500
override_gid = 197250
cache_credentials = True
ignore_group_members = True
You may be hitting a different problem, caused by having id-mapping and
ignore_group_members enabled at the same time.

@see https://fedorahosted.org/sssd/ticket/2646

The second execution of "id user" does not return supplementary groups.

The workaround is to disable tokengroups for that domain:
man sssd.conf -> ldap_use_tokengroups
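i.e. something like this in the [domain] section (a sketch, using the [domain/default]
name from the config you posted) :

[domain/default]
# work around the tokengroups problem (see ticket #2646)
ldap_use_tokengroups = False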

LS
Jean-Baptiste Denis
2015-05-05 12:36:27 UTC
Permalink
Post by Jean-Baptiste Denis
...
We're using rfc2307 schema and default ldap_group_object_class value
(posixGroup). Besides that, I don't see what could explain that you can't
reproduce the problem. Chris Petty is using AD hence rc2307bis schema. So I
don't know if it is relevant.
Just to sure, did you log as root (no sudo), stopped sssd, cleanup the cache,
restarting it (all as root without sudo), and ran the script (as root) ?
# (logged as root)
# /etc/init.d/sssd stop && rm -f /var/lib/sss/mc/* /var/lib/sss/db/* &&
/etc/init.d/sssd start
# python initgroups.py jbdenis 110 5 24 200
Sometimes I have to perform these steps multiple time to catch the problem.
Hi,

Did you have a chance to try again, in case you didn't follow this modus operandi
(everything as root, no sudo) in your previous attempt ?

Since Chris Petty managed to reproduce it with the script in a different
environment, I'm quite confident that we're onto something here.

Jean-Baptiste
Lukas Slebodnik
2015-05-05 12:40:59 UTC
Permalink
Post by Jean-Baptiste Denis
Post by Jean-Baptiste Denis
...
We're using rfc2307 schema and default ldap_group_object_class value
(posixGroup). Besides that, I don't see what could explain that you can't
reproduce the problem. Chris Petty is using AD hence rc2307bis schema. So I
don't know if it is relevant.
Just to sure, did you log as root (no sudo), stopped sssd, cleanup the cache,
restarting it (all as root without sudo), and ran the script (as root) ?
# (logged as root)
# /etc/init.d/sssd stop && rm -f /var/lib/sss/mc/* /var/lib/sss/db/* &&
/etc/init.d/sssd start
# python initgroups.py jbdenis 110 5 24 200
Sometimes I have to perform these steps multiple time to catch the problem.
Hi,
Did you have a chance trying again if you didn't follow this modus operandi
(everything as root, no sudo) in your previous attempt ?
Since Chris Petty managed to reproduce it with the script with in a different
environement, I'm quite confident that we're on something here.
I executed the attached script as root.

LS
Jean-Baptiste Denis
2015-05-05 13:04:11 UTC
Permalink
Post by Lukas Slebodnik
I executed attached script as a root.
Thank you for sharing the script.

I reproduced the bug with it after some iterations :

# ./reproduce.sh
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
wrongs number of secondary groups in process 17288 : 0 instead of 5 (sleep 49ms)
1/20 failed
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
wrongs number of secondary groups in process 17451 : 0 instead of 5 (sleep 56ms)
1/20 failed
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
wrongs number of secondary groups in process 17535 : 0 instead of 5 (sleep 60ms)
wrongs number of secondary groups in process 17523 : 0 instead of 5 (sleep 60ms)
wrongs number of secondary groups in process 17540 : 0 instead of 5 (sleep 61ms)
wrongs number of secondary groups in process 17528 : 0 instead of 5 (sleep 63ms)
wrongs number of secondary groups in process 17537 : 0 instead of 5 (sleep 64ms)
5/20 failed
Stopping sssd: [ OK ]
Starting sssd:

I've got the same behaviour if I remove the "sleep 9"; it's just quicker to test
=) I've also tested with another user: same behaviour.

Are you 100% sure that the user "user_many_groups" is not doing anything, nor any
user belonging to one of its 23 secondary groups ?

Jean-Baptiste
Lukas Slebodnik
2015-05-05 13:14:12 UTC
Permalink
Post by Jean-Baptiste Denis
Post by Lukas Slebodnik
I executed attached script as a root.
Thank your for sharing the script.
# ./reproduce.sh
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
wrongs number of secondary groups in process 17288 : 0 instead of 5 (sleep 49ms)
1/20 failed
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
wrongs number of secondary groups in process 17451 : 0 instead of 5 (sleep 56ms)
1/20 failed
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
wrongs number of secondary groups in process 17535 : 0 instead of 5 (sleep 60ms)
wrongs number of secondary groups in process 17523 : 0 instead of 5 (sleep 60ms)
wrongs number of secondary groups in process 17540 : 0 instead of 5 (sleep 61ms)
wrongs number of secondary groups in process 17528 : 0 instead of 5 (sleep 63ms)
wrongs number of secondary groups in process 17537 : 0 instead of 5 (sleep 64ms)
5/20 failed
Stopping sssd: [ OK ]
I've got the same behaviour if I remove the "sleep 9", it's just quicker to test
=) I've also tested with another user, same behaviour.
I added "sleep 9" later, because I was not able to reproduce your bug.

The results can be influenced by the fact that I used Fedora and sssd
master for testing.
Post by Jean-Baptiste Denis
Are you 100% sure that the user "user_many_groups" is not doing anything, and
neither is any user belonging to one of its 23 secondary groups ?
Yes; I'm sure.
I created a new user and new groups in LDAP (openldap).

Could you describe your testing machine?
Hardware, OS, network, ... (anything that could influence the results)

LS
Jean-Baptiste Denis
2015-05-05 13:41:13 UTC
Permalink
Post by Lukas Slebodnik
I added "sleep 9" later, because I was not able to reproduce your bug.
The results can be influenced by the fact that I used Fedora and the sssd
master branch for testing.
OK. I've tested with 1.12.4, but not with the master since I don't have python3
available and it looks like this is a dependency (I didn't check if I could
build it without python3, could you tell me ?)
Post by Lukas Slebodnik
Post by Jean-Baptiste Denis
Are you 100% sure that the user "user_many_groups" is not doing anything, and
neither is any user belonging to one of its 23 secondary groups ?
Yes; I'm sure.
I created a new user and new groups in LDAP (openldap).
Noted.
Post by Lukas Slebodnik
Could you describe your testing machine?
Hardware, OS, network, ... (anything that could influence the results)
Since I've reproduced it on a physical machine running CentOS 6 and a VMware
guest running RHEL 6, I don't really think that's relevant. The CentOS 6 physical
machine is a dual Xeon E5-2630, with a 10 Gb Intel network card, a Supermicro
motherboard and 64 GB of RAM.

Jean-Baptiste
Lukas Slebodnik
2015-05-05 14:06:47 UTC
Permalink
Post by Jean-Baptiste Denis
Post by Lukas Slebodnik
I added "sleep 9" later, because I was not able to reproduce your bug.
The results can be influenced by the fact that I used Fedora and the sssd
master branch for testing.
OK. I've tested with 1.12.4, but not with the master since I don't have python3
available and it looks like this is a dependency (I didn't check if I could
build it without python3, could you tell me ?)
python3 has been optional from the beginning, but we recently added a hint to
the configure script on how to disable it.
sssd-1.12 is very close to master, so I do not expect any difference.
Post by Jean-Baptiste Denis
Post by Lukas Slebodnik
Post by Jean-Baptiste Denis
Are you 100% sure that the user "user_many_groups" is not doing anything, and
neither is any user belonging to one of its 23 secondary groups ?
Yes; I'm sure.
I created a new user and new groups in LDAP (openldap).
Noted.
Post by Lukas Slebodnik
Could you describe your testing machine?
Hardware, OS, network, ... (anything that could influence the results)
Since I've reproduced it on a physical machine running CentOS 6 and a VMware
guest running RHEL 6, I don't really think that's relevant. The CentOS 6 physical
machine is a dual Xeon E5-2630, with a 10 Gb Intel network card, a Supermicro
motherboard and 64 GB of RAM.
So you tested with el6.

I will try to reproduce with RHEL6/CentOS6.6

LS
Jean-Baptiste Denis
2015-05-05 14:22:59 UTC
Permalink
Post by Lukas Slebodnik
python3 has been optional from the beginning, but we recently added a hint to
the configure script on how to disable it.
sssd-1.12 is very close to master, so I do not expect any difference.
Good to know.
Post by Lukas Slebodnik
So you tested with el6.
I will try to reproduce with RHEL6/CentOS6.6
OK cool, thank you.

I've also tested with 1.12.4 (from source) on each of those : same problem.

Jean-Baptiste
Lukas Slebodnik
2015-05-06 08:35:50 UTC
Permalink
Post by Jean-Baptiste Denis
Post by Lukas Slebodnik
python3 has been optional from the beginning, but we recently added a hint to
the configure script on how to disable it.
sssd-1.12 is very close to master, so I do not expect any difference.
Good to know.
Post by Lukas Slebodnik
So you tested with el6.
I will try to reproduce with RHEL6/CentOS6.6
OK cool, thank you.
I've also tested with 1.12.4 (from source) on each of those : same problem.
I tried with RHEL 6.6 but I wasn't able to reproduce.

My script ran for a long time.
real 175m1.614s
user 15m22.642s
sys 12m5.248s


I can try to test with different machines, but you were able to reproduce it in
a VM as well, so I'm not sure it will help.

BTW, HPC machines are usually diskless. Is that your case as well?
It is not supported to have the sssd cache (/var/lib/sss/) on NFS.

LS
Jean-Baptiste Denis
2015-05-06 09:26:29 UTC
Permalink
Post by Lukas Slebodnik
I tried with RHEL 6.6 but I wasn't able to reproduce.
Weird :|
Post by Lukas Slebodnik
My script ran for a long time.
real 175m1.614s
user 15m22.642s
sys 12m5.248s
I can try to test with different machines, but you were able to reproduce it in
a VM as well, so I'm not sure it will help.
BTW, HPC machines are usually diskless. Is that your case as well?
It is not supported to have the sssd cache (/var/lib/sss/) on NFS.
The sssd cache is on a local disk.

Jean-Baptiste
Thomas Hummel
2015-05-06 10:08:32 UTC
Permalink
Post by Lukas Slebodnik
I tried with RHEL 6.6 but I wasn't able to reproduce.
My script ran for a long time.
real 175m1.614s
user 15m22.642s
sys 12m5.248s
One thing comes to mind : we are using LDAP paging.

ldap_page_size = 500

Would it change something if you use it too ?

Thanks.
--
Thomas Hummel | Institut Pasteur
<***@pasteur.fr> | Groupe Exploitation et Infrastructure
Thomas Hummel
2015-05-06 13:37:51 UTC
Permalink
Post by Thomas Hummel
One thing comes to mind : we are using LDAP paging.
ldap_page_size = 500
Would it change something if you use it too ?
Don't bother testing it : we tried to bind with a «non paged» dn (using
ldap_default_bind_dn) and without ldap_page_size and we still reproduced the
bug.
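
For reference, this is roughly what the relevant part of sssd.conf looks like
for such a test (the bind DN below is only a placeholder, not our real one) :

ldap_default_bind_dn = cn=reader,dc=example,dc=com
#ldap_page_size = 500

i.e. we bound as a DN for which the server does not return paged results and
left ldap_page_size unset.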

Thanks.
--
Thomas Hummel | Institut Pasteur
<***@pasteur.fr> | Groupe Exploitation et Infrastructure
Jean-Baptiste Denis
2015-05-05 17:02:12 UTC
Permalink
Post by Lukas Slebodnik
python3 has been optional from the beginning, but we recently added a hint to
the configure script on how to disable it.
sssd-1.12 is very close to master, so I do not expect any difference.
Indeed.

I've just compiled the git master (56552c518a07b45b25d4a2ef58d37fac0918ce60) and
was still able to reproduce the bug on CentOS 6.6.

Jean-Baptiste
Jakub Hrozek
2015-05-06 03:30:34 UTC
Permalink
Post by Jean-Baptiste Denis
Post by Lukas Slebodnik
python3 has been optional from the beginning, but we recently added a hint to
the configure script on how to disable it.
sssd-1.12 is very close to master, so I do not expect any difference.
Indeed.
I've just compiled the git master (56552c518a07b45b25d4a2ef58d37fac0918ce60) and
was still able to reproduce the bug on CentOS 6.6.
Jean-Baptiste
I guess none of your machines are (or could be) accessible publicly if
we can't reproduce the bug in-house at all?
Jean-Baptiste Denis
2015-05-06 09:30:48 UTC
Permalink
Post by Jakub Hrozek
I guess none of your machines are (or could be) accessible publicly if
we can't reproduce the bug in-house at all?
This should be doable in a few days/next week. May I contact you and Lukas
off-list for the details ?

Thank you for proposing that.

Jean-Baptiste
Jakub Hrozek
2015-05-06 10:08:02 UTC
Permalink
Post by Jean-Baptiste Denis
Post by Jakub Hrozek
I guess none of your machines are (or could be) accessible publicly if
we can't reproduce the bug in-house at all?
This should be doable in a few days/next week. May I contact you and Lukas
off-list for the details ?
Sure.
Post by Jean-Baptiste Denis
Thank you for proposing that.
I think it would save time on both ends unless we can reproduce
ourselves :-)
Jean-Baptiste Denis
2015-05-06 22:26:47 UTC
Permalink
Post by Jakub Hrozek
I think it would save time on both ends unless we can reproduce
ourselves :-)
We've got a "recipie" and configuration files to reproduce the bug from scratch,
on a vanilla CentOS 6 distro (the ldap part is inspired from
http://wiki.openiam.com/pages/viewpage.action?pageId=7635198)

# yum install sssd sssd-common openldap-servers openldap-clients perl-LDAP.noarch
# cp /usr/share/openldap-servers/DB_CONFIG.example /var/lib/ldap/DB_CONFIG
# chown -R ldap:ldap /var/lib/ldap
# cd /etc/openldap && mv slapd.d slapd.d.original
# cp /root/slapd-minimal.conf /etc/openldap/slapd.conf # use the one provided
with this message
# chown ldap:ldap /etc/openldap/slapd.conf
# chmod 600 /etc/openldap/slapd.conf
# Add this line to /etc/sysconfig/ldap
SLAPD_OPTIONS="-h \"ldap://127.0.0.1 ldaps://127.0.0.1\""
# service slapd start
# chkconfig slapd on

Check that you can connect (the Manager password is "openldap") :

# ldapsearch -h localhost -x -w openldap -D 'cn=Manager,dc=example,dc=com' -b
'dc=example,dc=com' 'objectclass=*'

Time to populate our ldap server with our provided file (one user "user1" with
password "openldap" belonging to 29 secondary groups):

# ldapadd -h localhost -x -w openldap -D 'cn=Manager,dc=example,dc=com' -f
/root/ldap-init.ldif
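
The ldif file itself is not pasted here, but to give an idea of its shape with
the plain rfc2307 schema: essentially a posixAccount entry for user1 plus
posixGroup entries that list it via memberUid. A hypothetical excerpt
(illustrative names and numbers, base and ou entries omitted) :

dn: uid=user1,ou=people,dc=example,dc=com
objectClass: account
objectClass: posixAccount
uid: user1
cn: user1
uidNumber: 50001
gidNumber: 50001
homeDirectory: /home/user1
userPassword: openldap

dn: cn=group01,ou=group,dc=example,dc=com
objectClass: posixGroup
cn: group01
gidNumber: 50101
memberUid: user1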

You can check that everything went fine with the previous ldapsearch command.

Copy our sssd configuration file:

# cp /root/sssd-minimal.conf /etc/sssd/sssd.conf
# chown root:root /etc/sssd/sssd.conf && chmod 600 /etc/sssd/sssd.conf
# service sssd start
# chkconfig sssd on
# # not sure if the authconfig is strictly necessary here
# authconfig --enablesssd --enablesssdauth --enablelocauthorize
--enablemkhomedir --enablepamaccess --updateall --nostart
# service sssd restart

In /etc/nsswitch.conf, check for :

passwd: files sss
shadow: files sss
group: files sss
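
At this point a quick sanity check that resolution really goes through sssd
doesn't hurt. A small sketch (not part of the original recipe; it needs
python >= 3.3 for os.getgrouplist and uses the user1 account described above) :

import os
import pwd

pw = pwd.getpwnam("user1")                       # resolved through nss -> sssd
groups = os.getgrouplist(pw.pw_name, pw.pw_gid)  # primary gid + secondary groups
print("%d groups: %s" % (len(groups), sorted(groups)))

With 29 secondary groups you should see 30 entries once sssd has resolved them all.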

# cat /etc/sssd/sssd.conf
[sssd]
config_file_version = 2
services = nss, pam
domains = ldap_local

[nss]
filter_users = root,ldap,named,avahi,haldaemon,dbus,radiusd,news,nscd
override_shell = /bin/bash

[pam]


[domain/ldap_local]
override_homedir = /home/%u
auth_provider = ldap
ldap_schema = rfc2307
ldap_search_base = ou=people,dc=example,dc=com
ldap_group_search_base = ou=group,dc=example,dc=com
id_provider = ldap
ldap_uri = ldap://localhost/

You can now run your script or mine. Just adapt the initgroups.py call or use
the one provided with this message:

python initgroups.py user1 50001 29 $num_proc $delay
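
For readers who don't have the attachment, here is a rough sketch of what such
an initgroups.py checker can look like. This is a guess, not the real script:
the argument order (username, primary gid, expected number of secondary groups,
number of processes, max random delay in ms) is inferred from the call above,
and it has to run as root for os.initgroups (python >= 2.7) :

#!/usr/bin/env python
# Hypothetical reconstruction of an initgroups.py-style checker; the script
# actually used in this thread was attached earlier and may differ.
# Assumed arguments: username gid expected_secondary_groups num_procs max_delay_ms
import os
import random
import sys
import time

def worker(user, gid, expected, max_delay_ms):
    # stagger the children a little so their requests overlap differently
    delay_ms = random.randint(0, max_delay_ms)
    time.sleep(delay_ms / 1000.0)
    os.initgroups(user, gid)        # asks sssd for the supplementary groups
    secondary = [g for g in os.getgroups() if g != gid]
    if len(secondary) != expected:
        print("wrong number of secondary groups in process %d : %d instead of %d (sleep %dms)"
              % (os.getpid(), len(secondary), expected, delay_ms))
        return 1
    return 0

def main():
    user = sys.argv[1]
    gid = int(sys.argv[2])
    expected = int(sys.argv[3])
    nproc = int(sys.argv[4])
    max_delay_ms = int(sys.argv[5])

    pids = []
    for _ in range(nproc):
        pid = os.fork()
        if pid == 0:                # child: run one check and exit with its result
            os._exit(worker(user, gid, expected, max_delay_ms))
        pids.append(pid)

    failed = 0
    for pid in pids:
        _, status = os.waitpid(pid, 0)
        if status != 0:
            failed += 1
    if failed:
        print("%d/%d failed" % (failed, nproc))

if __name__ == "__main__":
    main()

The only point is that several processes hit the initgroups() code path against
a freshly started sssd at almost the same time.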

And run:

# ./run_initgroups.sh
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
.wrongs number of secondary groups in process 17626 : 0 instead of 29 (sleep 16ms)
wrongs number of secondary groups in process 17630 : 0 instead of 29 (sleep 26ms)
wrongs number of secondary groups in process 17634 : 0 instead of 29 (sleep 49ms)
wrongs number of secondary groups in process 17615 : 0 instead of 29 (sleep 53ms)
4/24 failed

OR

# ./reproduce.sh
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
wrongs number of secondary groups in process 15664 : 0 instead of 29 (sleep 10ms)
wrongs number of secondary groups in process 15672 : 0 instead of 29 (sleep 9ms)
wrongs number of secondary groups in process 15673 : 0 instead of 29 (sleep 10ms)
3/20 failed
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
wrongs number of secondary groups in process 15747 : 0 instead of 29 (sleep 3ms)
wrongs number of secondary groups in process 15734 : 0 instead of 29 (sleep 4ms)
wrongs number of secondary groups in process 15735 : 0 instead of 29 (sleep 10ms)
wrongs number of secondary groups in process 15748 : 0 instead of 29 (sleep 3ms)
wrongs number of secondary groups in process 15743 : 0 instead of 29 (sleep 7ms)
wrongs number of secondary groups in process 15745 : 0 instead of 29 (sleep 7ms)
wrongs number of secondary groups in process 15736 : 0 instead of 29 (sleep 5ms)
wrongs number of secondary groups in process 15742 : 0 instead of 29 (sleep 4ms)
wrongs number of secondary groups in process 15731 : 0 instead of 29 (sleep 10ms)
wrongs number of secondary groups in process 15732 : 0 instead of 29 (sleep 14ms)
wrongs number of secondary groups in process 15739 : 0 instead of 29 (sleep 4ms)
wrongs number of secondary groups in process 15749 : 0 instead of 29 (sleep 4ms)



Tell me if you're able to reproduce that. If not, we have something on the
machine that has a weird interaction with sssd. Besides some local
configuration (ntp, network, selinux disabled, some system packages), I don't
see anything.

If you cannot reproduce it, I'll give you access to the machine (ssh or VMware
ova export or something like that if you prefer).

Thank you for helping.

Jean-Baptiste
Lukas Slebodnik
2015-05-11 11:55:00 UTC
Permalink
Post by Jean-Baptiste Denis
Post by Jakub Hrozek
I think it would save time on both ends unless we can reproduce
ourselves :-)
We've got a "recipie" and configuration files to reproduce the bug from scratch,
on a vanilla CentOS 6 distro (the ldap part is inspired from
http://wiki.openiam.com/pages/viewpage.action?pageId=7635198)
# yum install sssd sssd-common openldap-servers openldap-clients perl-LDAP.noarch
# cp /usr/share/openldap-servers/DB_CONFIG.example /var/lib/ldap/DB_CONFIG
# chown -R ldap:ldap /var/lib/ldap
# cd /etc/openldap && mv slapd.d slapd.d.original
# cp /root/slapd-minimal.conf /etc/openldap/slapd.conf # use the one provided
with this message
# chown ldap:ldap /etc/openldap/slapd.conf
# chmod 600 /etc/openldap/slapd.conf
# Add this line to /etc/sysconfig/ldap
SLAPD_OPTIONS="-h \"ldap://127.0.0.1 ldaps://127.0.0.1\""
# service slapd start
# chkconfig slapd on
# ldapsearch -h localhost -x -w openldap -D 'cn=Manager,dc=example,dc=com' -b
'dc=example,dc=com' 'objectclass=*'
Time to populate our ldap server with our provided file (one user "user1" with
# ldapadd -h localhost -x -w openldap -D 'cn=Manager,dc=example,dc=com' -f
/root/ldap-init.ldif
You can check that everything went fine with the previous ldapsearch command.
# cp /root/sssd-minimal.conf /etc/sssd/sssd.conf
# chown root:root /etc/sssd/sssd.conf && chmod 600 /etc/sssd/sssd.conf
# service sssd start
# chkconfig sssd on
# # not sure if the authconfig is strictly necessary here
# authconfig --enablesssd --enablesssdauth --enablelocauthorize
--enablemkhomedir --enablepamaccess --updateall --nostart
# service sssd restart
passwd: files sss
shadow: files sss
group: files sss
# cat /etc/sssd/sssd.conf
[sssd]
config_file_version = 2
services = nss, pam
domains = ldap_local
[nss]
filter_users = root,ldap,named,avahi,haldaemon,dbus,radiusd,news,nscd
override_shell = /bin/bash
[pam]
[domain/ldap_local]
override_homedir = /home/%u
auth_provider = ldap
ldap_schema = rfc2307
ldap_search_base = ou=people,dc=example,dc=com
ldap_group_search_base = ou=group,dc=example,dc=com
id_provider = ldap
ldap_uri = ldap://localhost/
You can now run your script or mine. Just adapt the initgroups.py call or use
python initgroups.py user1 50001 29 $num_proc $delay
# ./run_initgroups.sh
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
.wrongs number of secondary groups in process 17626 : 0 instead of 29 (sleep 16ms)
wrongs number of secondary groups in process 17630 : 0 instead of 29 (sleep 26ms)
wrongs number of secondary groups in process 17634 : 0 instead of 29 (sleep 49ms)
wrongs number of secondary groups in process 17615 : 0 instead of 29 (sleep 53ms)
4/24 failed
OR
# ./reproduce.sh
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
wrongs number of secondary groups in process 15664 : 0 instead of 29 (sleep 10ms)
wrongs number of secondary groups in process 15672 : 0 instead of 29 (sleep 9ms)
wrongs number of secondary groups in process 15673 : 0 instead of 29 (sleep 10ms)
3/20 failed
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
wrongs number of secondary groups in process 15747 : 0 instead of 29 (sleep 3ms)
wrongs number of secondary groups in process 15734 : 0 instead of 29 (sleep 4ms)
wrongs number of secondary groups in process 15735 : 0 instead of 29 (sleep 10ms)
wrongs number of secondary groups in process 15748 : 0 instead of 29 (sleep 3ms)
wrongs number of secondary groups in process 15743 : 0 instead of 29 (sleep 7ms)
wrongs number of secondary groups in process 15745 : 0 instead of 29 (sleep 7ms)
wrongs number of secondary groups in process 15736 : 0 instead of 29 (sleep 5ms)
wrongs number of secondary groups in process 15742 : 0 instead of 29 (sleep 4ms)
wrongs number of secondary groups in process 15731 : 0 instead of 29 (sleep 10ms)
wrongs number of secondary groups in process 15732 : 0 instead of 29 (sleep 14ms)
wrongs number of secondary groups in process 15739 : 0 instead of 29 (sleep 4ms)
wrongs number of secondary groups in process 15749 : 0 instead of 29 (sleep 4ms)
Thank you very much for the scripts.
I used some of them and I'm able to reproduce the problem with the rfc2307 schema.

I can see messages "0 instead of 29".
I hope it is the same problem as the one you pasted earlier in this thread.

[***@dirac linux]# python initgroups.py cmp12 119549 95 24 200
wrongs number of secondary groups in process 4363 : 5 instead of 95 (sleep 78ms)
wrongs number of secondary groups in process 4366 : 5 instead of 95 (sleep 95ms)
wrongs number of secondary groups in process 4353 : 5 instead of 95 (sleep 90ms)
wrongs number of secondary groups in process 4362 : 5 instead of 95 (sleep 108ms)
wrongs number of secondary groups in process 4358 : 5 instead of 95 (sleep 110ms)
wrongs number of secondary groups in process 4371 : 5 instead of 95 (sleep 121ms)

BTW: I had the rfc2307bis schema in my 1st reproducer, so I will check whether
that could be the reason why I couldn't reproduce it.
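
For context, and as a generic illustration rather than anything taken from
either reproducer: with rfc2307 a group lists its members by bare user name,
while an rfc2307bis-style group references member DNs, which makes sssd take a
different code path when resolving memberships. Roughly :

# rfc2307
dn: cn=group01,ou=group,dc=example,dc=com
objectClass: posixGroup
cn: group01
gidNumber: 50101
memberUid: user1

# rfc2307bis
dn: cn=group01,ou=group,dc=example,dc=com
objectClass: posixGroup
objectClass: groupOfNames
cn: group01
gidNumber: 50101
member: uid=user1,ou=people,dc=example,dc=com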

LS
Jean-Baptiste Denis
2015-05-11 12:14:08 UTC
Permalink
Post by Lukas Slebodnik
Thank you very much for the scripts.
I used some of them and I'm able to reproduce the problem with the rfc2307 schema.
Post by Lukas Slebodnik
I can see messages "0 instead of 29".
I hope it is the same problem as the one you pasted earlier in this thread.
wrongs number of secondary groups in process 4363 : 5 instead of 95 (sleep 78ms)
wrongs number of secondary groups in process 4366 : 5 instead of 95 (sleep 95ms)
wrongs number of secondary groups in process 4353 : 5 instead of 95 (sleep 90ms)
wrongs number of secondary groups in process 4362 : 5 instead of 95 (sleep 108ms)
wrongs number of secondary groups in process 4358 : 5 instead of 95 (sleep 110ms)
wrongs number of secondary groups in process 4371 : 5 instead of 95 (sleep 121ms)
This run was from Chris Petty, who was following the thread and was able to
reproduce the problem with his setup.
Post by Lukas Slebodnik
BTW: I had the rfc2307bis schema in my 1st reproducer, so I will check whether
that could be the reason why I couldn't reproduce it.
If I'm not mistaken, Chris is using AD (hence rfc2307bis, right ?), so I won't
bet on it, but maybe it could give you hints on where the problem is within the
code.

Jean-Baptiste
John Beranek
2015-05-18 08:24:42 UTC
Permalink
For what it's worth, in my environment at work, I am *not* able to
reproduce this. Ran the test on a RHEL 6.7 Beta (sssd-1.12.4-31.el6.x86_64)
VM, with SSSD configured for our corporate Active Directory.

Cheers,

John
Lukas Slebodnik
2015-05-18 08:33:08 UTC
Permalink
Post by John Beranek
For what it's worth, in my environment at work, I am *not* able to
reproduce this. Ran the test on a RHEL 6.7 Beta (sssd-1.12.4-31.el6.x86_64)
VM, with SSSD configured for our corporate Active Directory.
Just a small clarification: the upstream ticket[1] is not fixed in RHEL 6.7 Beta
(sssd-1.12.4-31.el6.x86_64).
However, the patches are ready and just need to be reviewed :-)

LS

[1] https://fedorahosted.org/sssd/ticket/2634
Jean-Baptiste Denis
2015-05-18 14:20:51 UTC
Permalink
Post by Lukas Slebodnik
Just a small clarification: the upstream ticket[1] is not fixed in RHEL 6.7 Beta
(sssd-1.12.4-31.el6.x86_64).
However, the patches are ready and just need to be reviewed :-)
That's great news, thank you for the patch and the update !

Jean-Baptiste
Lukas Slebodnik
2015-05-22 15:16:42 UTC
Permalink
Post by Jean-Baptiste Denis
Post by Lukas Slebodnik
Just a small clarification: the upstream ticket[1] is not fixed in RHEL 6.7 Beta
(sssd-1.12.4-31.el6.x86_64).
However, the patches are ready and just need to be reviewed :-)
That's great news, thank you for the patch and the update !
The patches have just been pushed upstream, and my testing repo[1] contains the
latest packages with the fixes you need.

LS
[1] https://copr.fedoraproject.org/coprs/lslebodn/sssd-1-12-latest/
Jean-Baptiste Denis
2015-06-18 09:09:56 UTC
Permalink
Post by Lukas Slebodnik
The patches have just been pushed upstream, and my testing repo[1] contains the
latest packages with the fixes you need.
Ooops.

I didn't see your message until today; what a waste of time on my side with a
Red Hat support ping-pong game :D They gave me RPMs to test yesterday and it
worked. So cool. Support also told me this will be backported to RHEL 6 in the
near future (no ETA, though).

Sorry again about the lack of feedback on my side after all your effort.

Jean-Baptiste
