SSSD Tips and Tricks Vol. 2 – LDAP

Multiple Search Bases

Starting with SSSD 1.7.0, the System Security Services Daemon now has the ability to search for users, groups and netgroups in multiple search bases. Some LDAP deployments divide groups into different trees so that individual clients can receive different “views” of a group. (This is especially useful for some application software that restricts access to hard-coded group names. In these cases, you can install the same application in different configurations by providing different views of that group).

In SSSD 1.7.0, we extended the ldap_search_base (and ldap_[user|group|netgroup]_search_base) options so that they will now take three pieces of information instead of one (the base itself). Search bases are specified by the pseudocode regular expression:

<base_dn>?<search_scope>?<search_filter>

The search filter can be either an empty string (indicating that no special filtering should be done) or it can be a valid LDAP search filter, which will be ANDed with the standard search filters SSSD uses internally.

“But that’s only one search base, you say!”. And you would be correct. You can add additional search bases by inserting another ‘?’ after the search filter of the preceding base.

For backwards-compatibility, a single search base can be specified simply by base_dn, which will be treated as

<base_dn>?subtree?

Paged lookups

For performance reasons, many LDAP servers limit the number of records that can be retrieved by an LDAP search at one time. Beginning with SSSD 1.6.0, the System Security Services Daemon can now perform paged searches to retrieve responses containing more entries than this limit. This is particularly useful for deployments with large group memberships or those with large numbers of users and groups with enumeration enabled.

SSSD now provides an option ldap_page_size that specifies how many records to request for each individual lookup. The default was set to 1000 records (chosen as it was the lowest common default value of OpenLDAP, 389 DS and Active Directory). However, if your LDAP server has been tuned to have a higher or lower value for this limit, you may wish to update this value to match.

As a general rule, you will want this page size to match the limit on the server, for maximum performance. A larger page size will mean fewer round-trips across the wire to LDAP.

Password Changes and Read-Only Replicas

Due to its use of the openldap client libraries, SSSD has a limitation when it comes to performing password changes. Normally, the way that a read-only replica will be set up is that it will be configured to return a referral to a read-write master LDAP server when a password-change attempt occurs. The client is expected to process this referral and then open a new connection to the read-write master.

Unfortunately, due to an open bug in the OpenLDAP client libraries (which SSSD uses internally for much of its LDAP communication), this referral is not processed correctly. As we’ve not had much traction with the OpenLDAP developers in fixing this bug, SSSD has implemented a workaround. We provide a new option ldap_chpass_uri (valid when using chpass_provider = ldap). It’s semantics are identical to that of ldap_uri, except that it will only be used when a password-change is implemented. This option allows the system administrator to specify a list of read-write servers against which the password-change should occur. The ldap_uri option can remain pointed at the read-only replicas for performance, without compromising the ability to perform password changes on the client.

Protected: The Truth About Passwords

This post is password protected. To view it please enter your password below:


SSSD Tips and Tricks vol. 1 – Kerberos

Automatic TGT Acquisition

You probably already know all about how the System Security Services Daemon can make your offline life easier by enabling cached-credential login to your system while you don’t have access to the central authentication servers.

What you might not know, however, is that when using SSSD to perform Kerberos auth, it’s also possible to configure it to automatically acquire your network credentials when you go online. By setting the ‘krb5_store_password_if_offline’ option to ‘True’ in the [domain/DOMAINNAME] sections of sssd.conf, you can configure SSSD to store a user’s password when they log in while offline (for example, working from home). Then later, if access to the network KDC is restored (for example, connecting to the VPN), SSSD will perform a kinit on your behalf to automatically acquire a TGT for single-sign-on with your network resources.

Now, some of you will be saying to yourselves: “Wait, doesn’t this mean that my password is being stored on the system in a readable way?”. This is true but not the whole story. Yes, the password is stored on the system in such a way that SSSD (and theoretically the root user on the system, with some effort) can read the password. Without doing so, there would be no way for us to acquire the ticket granting ticket on your behalf. However, we do store the password in the most secure way possible: in the kernel keyring. This makes it very difficult for root to gain access to this password and essentially impossible for any non-root process. The risk factor is not zero, which explains why this is an optional feature, disabled by default. However, in the common laptop case (where it’s assumed that the owner of the laptop is likely to be its only user), this security/convenience trade-off is probably worthwhile.

Automatic Ticket Renewal

The second advanced Kerberos feature I’d like to discuss today is automatic ticket renewal. User processes sometimes need access to the user’s Kerberos credentials, even when the user is no longer logged in. An example might be a regular cron job that the user wants to run every day a few hours after leaving work. With traditional Kerberos configurations, this user would be forced to remember to manually renew his Kerberos credentials before leaving for the day, to ensure that the expiration time on his TGT did not expire before his cron job completed.

With SSSD 1.5.0 and later, it can be configured to automatically renew Kerberos tickets for the full renewable life of the TGT. This is different from the automatic TGT acquisition above, as we do not need to store the user’s Kerberos password to accomplish this. It does require some additional configuration on the KDC server, however.

If the KDC permits users to request “renewable” TGT tickets, then what it is allowing the user to do is to use their current TGT in place of their password in order to acquire an updated TGT (with a later expiration).

SSSD 1.5.0 and later can set two options to enable it to automatically renew the user’s TGT for as long as the KDC permits.

The first option is krb5_renewable_lifetime. When set, it specifies the maximum renewable duration that the SSSD will attempt to request from the KDC. Note that this is only a request, and the KDC itself may choose to return a much shorter duration, or disallow renewals entirely.

Assuming that a renewable ticket was granted, the second option is krb5_renew_interval. This option specifies how often the SSSD should poll to see if any of the user TGTs have gone beyond 50% of their current lifetime. If they have, SSSD will perform a TGT renewal on the user’s behalf, extending the lifetime of the TGT.

Code reviews, collaboration and the SSSD 1.5.0 release

Yesterday, the SSSD team hit a major milestone: we released SSSD 1.5.0. This is probably the largest release we’ve done since 1.0, with over 150 commits. With several new features, particularly related to access-control, we’ve worked closely with our users to ensure that 1.5.0 is the most attuned project to their real-world needs. As I’ve mentioned in the past, SSSD relies on a very stringent code-review process for its development. Nothing makes it into the codebase without the close inspection and testing of one of the other developers. Simo Sorce, one of the original designers and developers of the System Security Services Daemon wrote an excellent blog post on how this review policy has enabled us to write very secure code.

One of the major changes we made to our process in SSSD 1.5.0 is that we started using the Coverity Integrity Manager as part of our continuous-integration environment. By running these scans on the full codebase regularly (twice a day), we were able to keep ahead of common coding mistakes such as resource leaks or missing NULL checks. While this scanning did reveal several legitimate bugs in the SSSD, it also revealed that we had been doing an exceptional job of avoiding serious bugs in the code. For a detailed analysis of our Coverity results, see Simo’s blog post.

So, what have we learned from the SSSD 1.5.0 cycle? First, there is no substitute for feedback from real-world users of your project. By maintaining our relationship with our most vocal users, I think we’ve achieved some truly great things. Secondly, when developing a security product, there’s no such thing as too many eyes on the code, be they other developers, users or automated scanning technologies.

 Continue reading 

Net Neutrality and you

By now, you’ve probably heard the term “Net Neutrality” banded about on the Internet. Perhaps you wonder what it means. Perhaps you’ve been watching certain biased news networks and have a warped understanding of the term. In either case, I’d like to try and provide a human-readable explanation of what the positions are on Net Neutrality.

I think the first thing I need to describe is a little bit of how the Internet actually works. When you type in an address in your web browser, you are not connecting directly to that web server. Instead, what happens is that your request travels through intermediaries. First it goes to your Internet service provider that you bought your connection from (e.g. Comcast, Verizon, AT&T, NetZero, etc.). From there, it’s transmitted through several high-bandwidth providers, sometimes owned by the same company, sometimes another company, until it arrives finally at the server you wanted to talk to. We’ll call these middle hops (including your ISP, and the backbones that they talk to) “intermediaries”. I will use several of these companies as examples below, however they are only hypothetical examples and should not be misconstrued as an endorsement or accusation of misconduct by any particular entity.

Next, lets describe bandwidth a little bit. The classic example is to compare it to modern running-water plumbing. Bandwidth in this case could be described by the diameter of the water pipe. The wider the pipe, the more water than can flow through it each minute. Similar to pipes as well, you have a problem when the amount of water (data) you try to send through the pipe (Internet connection) is greater than its ability to pass it through.

Let’s try and define the term “Net Neutrality”. What does it mean, and how does it affect you?

Net Neutrality is a proposal that the Internet needs to be legislated to guarantee that the intermediaries between you and the server you’re trying to talk to cannot make deals to disallow access to certain services. For example, lets say that your Internet service provider is Comcast. Comcast owns NBC/Universal, which provides a website for streaming the latest episodes of its television programs. They have other competitors in the television market, so they decide that anyone using their communication lines can only connect at full speed to nbc.com, and that sites like hulu.com, cw.com and cbs.com are going to be limited to 10% speed so that they don’t use up the available bandwidth that Comcast wants to be used for nbc.com. As mentioned above, Comcast has a certain amount of bandwidth that it can take advantage of. As more and more people get on the Internet with an assortment of devices (computers, smartphones, Internet-enabled televisions, etc.), there is an increased demand on Comcast’s bandwidth.

From Comcast’s perspective, it would now have two choices:

  1. Increase its bandwidth. This is very expensive, as it involves expending millions of dollars on new equipment
  2. Find ways to divide the bandwidth it already has

The second option is where we start to get into Net Neutrality territory. In the example above, I provided the example that, in situations where the available bandwidth is at its limit, Comcast might choose to restrict the passage of data destined for one of its competitors, so that it could reserve a larger part of its available bandwidth for its own services. This is not necessarily a particularly good example, because situations like this might lead to antitrust lawsuits.

There is, however, another case that is the crux of the Net Neutrality argument. It is essentially this: Should Comcast be allowed to sell different tiers of service to different customers. For example, can hulu.com come to Comcast and offer to pay double the going rate for traffic through its network, and be granted a larger percentage of the available pipe? This is the source of the largest piece of confusion in the Net Neutrality debate. Those who do not understand the technologies behind the Internet see this discussion as a free market situation. In other words, if a customer of Comcast can afford to pay more for higher quality of service, then they should be allowed to do so.

However, think back to the plumbing explanation. Comcast is not building brand-new infrastructure to support these new higher-paying clients. They are merely guaranteeing them a higher percentage of the available bandwidth pass-through. This cannot happen without lowering the available bandwidth for those services that cannot afford the higher tier. Furthermore, if large numbers of clients with deep pockets pay for the privilege of higher bandwidth, they are further reducing the available bandwidth for the lower-paying customers.

In other words, no real product is gained by paying for the new service. It is merely redistributed (and how’s that for socialism, guys?). At the same time, it reduces the ability of smaller players from being able to deliver new, disruptive technologies and products.

Where would the world be if companies like Google and Apple had not been able to do business on the Internet because Yahoo and Microsoft were the big players and owned all of the bandwidth? Without a guarantee of Net Neutrality, the next great invention may never see the light of day because there will be no way to deliver it into the hands of customers.

The guarantee of Net Neutrality is that all bandwidth through an intermediary must be treated with the same priority as any other. There should be no artificial slowing of any customer’s data simply because another customer has deeper pockets.

Many of the larger telecommunications providers argue that it’s their right as the owner and maintainer of the communication lines to do with them as they please. There is certainly a point to be made from this, but at this point, the economy of the United States and the world is at stake if the fundamental operation of the Internet changes.

I hope I’ve explained this issue in a way that is easy to understand. If so, please write your local legislators and tell them that you support Net Neutrality for the rights of consumers and small business.

On the importance of collaboration

First of all, thank you. Yup, I was talking to you.

People like me tend to get the credit when things go right, and the axe when things don’t, but in the open-source world it’s you that ultimately decides the fate of a project. Engineers and managers and designers work hard, this is true. None of that matters unless we have an involved community simultaneously pointing at the shiny object up in the clouds while holding our feet firmly to the ground.

If you’re confused, please allow me to explain. Engineers, managers and designers tend to have a critical failing: being creators has a tendency to go to our heads, and we assume (when not told otherwise) that our way of doing things is The Right Way. Sometimes, we need to be deflated. It can be oh-so-tempting as an engineer to throw yourself head-first into an exciting new feature, but it’s our community that reminds us when this just isn’t the best way to spend our time. Sometimes it’s not the most glamorous work to play bug whack-a-mole instead of working on the shiny feature, but this doesn’t make it any less important. So having a community that’s active (rather than reactive), following our progress and reminding us where our priorities lie is an invaluable advantage.

Similarly, the existence of a proactive community can bring thoughts to mind that we might never have dreamed of on our own. In the last six months, we’ve had people throw some fantastic ideas at us for the SSSD. Some have appeared in the 1.1.x release and others in the 1.2.x series. We originally approached the SSSD from the perspective of a laptop user being able to maintain a single consistent user account whether connected to their corporate network or not. Thanks to some tips from our community, we’ve realized that we also have a place in the datacenter, helping servers ride out the occasional internal outage.

These are examples of the true power of open-source. It’s easy to come out and say “well, it’s better because more people can look at the code and find fault with it”. The true power of open-source is that people can look at the code and make of it anything that they can dream. In a closed development style, the product is always going to be limited by the imaginations of the engineers and product management team in the company, with maybe the occasional support call having some effect (though not often, in my experience in the closed-source world). Not only does open-source welcome such ideas, we encourage them and nurture them. If we don’t have the time to implement a feature, we’ll try to hold your hand and guide you to do so for yourself.

Now, that last topic I need to discuss in a bit more detail. It’s all very good to say “we appreciate our community” and “we welcome patches”, but sometimes it can be hard to hold to that. There are times when it’s difficult to include a community-member’s ideas. Sometimes the reasons are obvious: the community member wants the world, on a silver platter, with the crusts cut off. Providing us with our next great idea is a wonderful thing, but sometimes people expect miracles. We’re only human (or in some cases, extremely sophisticated shell scripts).

Then there’s the flip side. Sometimes you have a community member that is so driven to see a feature that they want included that they will spend the time and do the work themselves. Proud with the work they’ve done, they volunteer a patch. Unfortunately, as I mentioned above, sometimes being creators has gone to our heads. There may be no disagreement that the patch would solve a problem or provide a great new feature, but sometimes it’s just not feasible to accept a patch as-is. Sometimes this is an innocuous thing: maybe the patch fails to meet the project’s style guidelines. Sometimes it might get ignored for an extended period because the main team is working under time constraints by someone that’s paying them for their work. Sometimes the patch may truly be flawed, and no amount of massaging can make it into anything usable (in a reasonable amount of time).

At times like these, communication becomes the key. As I said above, an open-source project lives or dies by its community. It’s all too easy to alienate people, especially with the aforementioned hubris we developers often suffer from. It should be understood that while contributions in general are always welcome, sometimes specific contributions are not (or at least are not ready for the mainstream). At times like these, it becomes more important than ever to have established a good relationship with your community.

I’m going to admit here that I don’t have the answer to many of the difficult questions like “How do I deal with an unresponsive upstream?” or “Why does this person keep submitting the same patch that I told them doesn’t work?” It’s a learning process for everyone involved, on both sides. And come to think on it, maybe thinking of it as sides is part of the problem.

Why you should use talloc for your next project

Memory management is hard. This is one of the first things a programmer learns (usually by trial and much error) when they leave academia and get out into the real world. It is very easy to make mistakes when managing memory, especially when a particular piece of data needs to live beyond the life of the function that created it. It can become difficult to know when the memory is safe to destroy, as well as when it is optimal to destroy it.

In standard C, a programmer would use malloc() and free() to manage their memory. The problem with this is that every section of memory is allocated independently. There are no inherent relationships between bits of data. The programmer is required to maintain any relationships between data in their own code.

Enter talloc, which is a hierarchical memory-management tool wrapped around C’s malloc(). The basics of talloc are easy to pick up. With talloc, you have the option of declaring that the memory you are allocating is a child of another piece of memory. The advantage to this approach is that calling talloc_free() on any piece of talloc-allocated memory will not only delete that memory, but will recursively descend through any children of that memory and free them first.

To provide a trivial example, consider that you wanted to create a new struct containing student data:

struct student {
   char *name;
}

In a traditional C approach, you would allocate memory for a new student in this manner:

student1 = malloc(sizeof(struct student));
student1->name = strdup("steve");

and would sometime later be freed with:

free(student1->name);
free(student1);

That works fine in the trivial case, but start considering what happens when you have much more complicated data structures. It becomes a challenge to ensure that you free all memory in the proper order so as to ensure that you don’t leave any dangling memory behind. Traditionally, this would be done by creating a cleanup function for your structure. Internally, this cleanup function would recursively call the cleanup functions for every subordinate structure, until finally it removed the toplevel memory.

The problem with this approach is that it requires the creation and maintenance of large numbers of cleanup functions.

The same problem with talloc is markedly simpler.

student1 = talloc(NULL, struct student);
student1->name = talloc_strdup(student1, "Steve");

Later, the struct can be freed with the single command:

talloc_free(student);

Now, in the trivial case this doesn’t look terribly impressive, but try considering when you have nested structs, structs containing large numbers of strings, etc.  talloc_free(<toplevel>) will recursively clean up all of the child memory. No need to write complicated cleanup scripts to ensure that the memory is all gone.

Furthermore, talloc makes it very easy to abort the changes in a function. For example, partway through a complicated function, a fatal error occurs. In a traditional model, one would now need to examine all the memory that has been allocated thus far in the function and free it. A cleanup function may not be of any help here, as it would expect a fully-constructed structure to remove. With talloc, you simply need to delete the parent context and you’ll be certain to know that it will be completely cleaned up, regardless of its partially-constructed state.

So lets talk about more advanced and useful applications of talloc. Consider the case of asynchronous services. A request comes in (on a pipe, a TCP connection, etc.) requesting some information. Assuming that the service is unable to return a reply without performing additional functions (for example, contacting a remote server for authoritative data), the program would allocate memory to hold the data provided for the request, and then queue it up internally, to be processed when resources allow.

This request might require multiple trips to and from a remote server, it might require memory allocation and deallocation in many places, and it could fail with an error or be cancelled if the requesting process disconnects or otherwise indicates that it no longer cares about the reply.

So now we have a new concept: requests. With talloc, the way one would handle a request would be to create a request context. This request context would be a structure containing all of the data necessary to execute the event. As the event is processed by the mainloop, it may have additional subrequests (such as the example remote server query) attached as children to it. If at any time the request needs to be terminated, such as the original client has disconnected, all that is needed is to call talloc_free() on the original request and it will iterate through all of the allocated memory and clean up after itself.

Now, one thing I’ve glossed over is the case where just freeing the memory might not be enough. In the case of a request, before freeing memory it might be necessary to send a disconnect command to a remote server, or close a file descriptor. Talloc makes it easy to add a destructor to any allocated memory, such that when talloc_free() is called, it will first invoke this destructor and allow cleanup to commence. So in the case described above, one might add a destructor to the remote server query sub-request that would terminate the server connection in a non-destructive manner (or cancel a transaction but leave the connection in place, etc.)

By now, I think you begin to see the power inherent in the use of talloc over malloc. It’s five O’clock – do you know where your memory is?

How NOT to run a community

As you probably know, I am generally in favor of community-driven software development. I think being able to work alongside others of similar (or different!) goals can result in excellent progress in many different directions. It’s a great boon to development to not be forced to reinvent the wheel in order to move forward.

However, sometimes the naysayers have it right. There are times when, no matter how much you try to be a good citizen of a community, they just won’t let you.

I’ve been working for some time now on molding the fantastic Review Board software into a deployment for the Fedora Hosted infrastructure. Today, I was doing some testing on the upgrade feature, to make sure we wouldn’t get bitten in the future. Well, I’m glad I did, because it didn’t work.

After a bit of intense Google-searching, I finally happened upon the source of the problem: django_evolution has a long-standing (years) issue when used with PostgreSQL. That bug report, however, has a link to a patch that one intrepid user constructed as a means to work around the problem. I tested it myself and found that it worked. However, this is where we begin our cautionary tale.

Mistake number 1) Offhanded disregard for a community-submitted patch. The response from the upstream maintainers for this godsend of a patch was less than helpful. “Why did you copy the code from here instead of trying to make a common change?” and “Your patch breaks our tests. Go fix it.” (paraphrased). These are not friendly responses to an obviously helpful individual.

Since the discussion thread on that bug pretty much ended there, I decided to try myself to pick up where the original author left off. I downloaded the patch and modified it so that it would apply cleanly on the HEAD of the django_evolution repository. I tried it out on ReviewBoard, and miracle of miracles: the upgrade completed successfully.

So, armed with the knowledge that I now have a working solution to the problem, I decided to see what I could do to massage the patch into a format that would be accepted by upstream (given their unhelpful replies). So I dug into the source code… and discovered that I couldn’t figure out how to run this much-vaunted test suite. So I found my way to the Django IRC channel and started to ask questions about how to set up django_evolution to test my patch.

Mistake number 2) The denizens of that channel were… less than helpful. In the first place, I was berated for attempting to write a patch for “a dead project”. They paid no attention to my assertions that django_evolution worked just fine for ReviewBoard, and I just needed to solve this one little problem to grease the wheels and start the ball rolling again. They continuously insisted that I switch the project over to use a project called South and give up on django_evolution. Now, while I certainly understand the desire to always be using the Next Big Thing, I’m not actually a developer on the ReviewBoard project. I in fact have very little say about the architectural direction that the project takes. I certainly have no control over the use of django_evolution. These reasoned arguments were ignored, opting instead to extol the virtues of South and why it will work better and cure cancer in the process. (I exaggerated that last part).

Now, this is the behavior shown to an interested participant in their community. Moreso, it was a person who was trying very hard to improve upon a project, and was seeking only enough aid to simplify any review that might need to be done before accepting the patch. If this is how we treat those who are interested in the work we do, is it any surprise at all when our project fails? Why should we expect anyone who isn’t already intimately familiar with our work to offer even a second glance?

A community needs to be run with an understanding that not every member is going to be a lifelong hacker with three advanced degrees that are all directly applicable to the project. A community needs to be welcoming and understanding. A community needs to be willing to mentor and market itself in a positive light.

A community needs to be communal.

The road to SSSD 1.0.3

So, today we released SSSD 1.0.3. We fixed a few bugs, but more importantly we did it in an open way.

I generally assume that anyone reading this blog has heard of open-source before. If not, I suggest you log on to the Internet more. More eloquent bloggers than I have done a better job explaining the open-source concepts. The SSSD is more of a case-study in why open-source works. In this last week, we’ve done a bit of work on our own inside the core SSSD team headed in the direction of our next feature release, but at the same time we were working with outside developers.

One of these developers took a look at the SSSD and decided that it might come in handy for use on embedded devices like smart routers. So he tried to build the SSSD on the ARM processor. This failed, because most of our development on the SSSD takes place on i686 or x86_64 systems (with some limited testing done on PowerPC). Here we have the first example of why open-source works: someone came in from Outside with an idea: that the SSSD might have a place beyond personal computers. Beyond that, with the source code available to them, they began the effort of implementing this idea. Unfortunately, in its current state of development, the SSSD was unable to run on the ARM platform. This developer then took it upon himself to dive into the code and identify the source of the problem (if you care about the details regarding alignment of memory, feel free to look at our git history. I’m not going to bore you any more than I have to). Once he identified the issues, he created a patch and submitted it to our mailing list. We reviewed it and included it in the 1.0.3 release today. Open-source works because people like this  not only identify new ideas, but they have the ability to run with them as well. In a traditional, closed-source development team, the process would have looked more like this: 1) Open a bug. 2) Wait to see if the company developing the product agrees with this idea. 3) Wait to see if the company has resources to expend on implementing the idea. 4) Wait for a release including this fix.  With open-source, even had the SSSD team decided not to release 1.0.3 for weeks or months, this gentleman’s patch was available for anyone to apply atop our public source and get to work on.

The second story I want to tell is about the benefits of cooperation between distributions. Another prominent GNU/Linux distribution recently began work on updating their SSSD package to 1.0.2 (previously they had been running a very old and buggy 0.5.0). They ran into a bit of trouble and contacted us through the usual channels to help them track down what was going wrong. It turned out that one of the differences between these two distributions was this (and I apologize, but there’s no real way to avoid going into a little bit of detail on this): on Fedora, when you link against a shared library, you automatically inheret any link that this library has. In our case, we linked against libldb, which in turn linked against libkrb5. On the other distribution, however, there is no implicit linking. All links need to be made explicitly. So we tracked down where we needed to link properly and added that. “But Steve,” you ask. “How is this an example of distro cooperation improving the product for both distros?”.  The answer is this: working with this other distribution helped us to avoid a painfully embarrassing situation in the coming months. The reason for this is because one of the features coming in Fedora 13 is the very same requirement for explicit linking. So thanks to the developers from the other distribution, we were able to fix this ahead of time and save ourselves a lot of trouble (and support calls!) when Fedora 13 arrives.

Code review methodologies

One of the most important parts of any development process is probably its least-beloved: the code-review. Any project that entertains a notion of one day being stable and production-ready knows that it needs to do at least periodic reviews of the source.

There are many different approaches to the code review process. Some projects allow open commits during the first phase of development, then lock the tree and do in-depth code reviews, followed by a “hard freeze” wherein only reviewed and approved changes are further permitted. On the other end of the spectrum, many open-source projects follow the “hard freeze” party for their entire development cycle. For these projects (among which the SSSD is one), every patch is reviewed by at least one gatekeeper.

The former approach can serve a young project quite well. At the start of development, there are usually many small commits, few of which are functional or testable. By the time the code hits its first stable (or at least public) release, I suggest that it’s time to switch to the gatekeeper approach. The team should examine its membership for a very short list of gatekeepers. These will be the only members from there on that will have privilege to commit to the master repository. These individuals are the final say on what patches are accepted into a particular development branch.

This may sound a bit authoritarian. Perhaps it even sounds contrary to an open development process. I contend that stability should be the ultimate goal of any project, above and beyond “shiny features”. To ensure that this remains an open process, it becomes necessary for the code review process to be designed to be completely transparent. For example, it is usually not necessary for the gatekeepers to be the only code-reviewers on a project. In most cases, the reviews can and should be performed by any member of the team. The gatekeepers need be only a sanity check to ensure that the code-review was performed by a qualified individual.

Ok, so that’s all well and good, but how does one design a transparent and effective code-review system? There are two common approaches, which I will discuss briefly, and then I’d like to discuss a less-common approach that is gaining traction.

All of the common code-review processes involve somehow getting a patch file in the hands of a reviewer.  One process favored by very well-established projects is the ticket-driven review process. For old and stable projects, it is common for the process to require that all changes to the source correspond to an issue-tracker ticket (such as a Bugzilla or Trac ticket). The code-review process will often be designed around this ticketing system. When a developer has a patch ready for review, it will be attached to the ticket and the appropriate flags/status will be set to indicate that it needs a reviewer. Potential reviewers will then be informed that the patch is ready (perhaps because a saved search will now reveal that the ticket is awaiting review). Code review comments and resubmissions will then be handled within ticket comments. Once the review is passed, a gatekeeper will take the patch and apply it to the master repository.

The second (and probably most popular) code-review process for open-source development is by email. When a patch is ready, a developer will email the patch as an attachment to an email sent to the developer mailing list for the project. Another developer will then take up the review, responding with an ack or nack (indicating corrections that need to be made). The gatekeeper will then take the patch and apply it to the master repository.

Both of these approaches have their merits and disadvantages. The ticket approach works well for well-established projects, providing easy history and a link to every issue or feature that the patches are meant to address. On the other hand, for young projects, the additional overhead of filing a ticket for every change can be unnecessarily heavy. Young projects tend to have a much more rapid development pace and filing a ticket for every change will usually serve only to train engineers to write short, worthless ticket descriptions.

The email approach is much more agile, but it is rather less transparent, as it requires all participants to be subscribed to the relevant list. It adds a barrier to entry for new contributors that may have only a single small patch to submit (and may not want to deal with a heavy-traffic development list). Further, the email approach adds additional effort on the part of the gatekeepers. It is necessary for gatekeepers to develop for themselves a system to ensure that patches are reviewed in a timely manner, as well as not forgotten.

To address these issues, some projects have begun looking to new code-review tools. As an example, I recently contributed to the packaging effort for the ReviewBoard tool. Tools such as this provide many benefits over the above approaches. It has built-in functionality to keep track of pending reviews and their status (approved, awaiting review, needs update etc.) Furthermore, they provide additional user interface enhancements, such as the ability to make review comments attached to the specific code that is being discussed. Such tools provide much greater visibility than the email approach, while retaining the same agility (especially if the tool is configured to announce updates to reviews on the developer mailing list).

As mentioned above, I’ve been working on packaging ReviewBoard for Fedora and EPEL, with the goal of making ReviewBoard available for projects hosted by the Fedora Hosted project. I think that the advantages provided by such a code-review system would contribute greatly to the security and stability of open-source software.