
An upcoming Tuleap feature is the ability to search for anything, anywhere. It is currently in lab mode (see your user preferences) and available on Tuleap platforms where the fulltextSearch plugin is installed and activated. This plugin relies on Elasticsearch.

Elasticsearch is both very powerful and very fast, and its data structure is very different from what you are used to with classic MySQL. The challenge with NoSQL databases is getting a good trade-off. SQL has been around for quite some time and I’m not expecting to see any major optimisations or performance gains from it; there’s only so far you can go in one direction before starting to max out. Elasticsearch, on the other hand, is relatively new. The approach is very different and what we lose in solid data structure, we gain in searchability.

Now, one of the most appreciated features of Tuleap is its permission granularity. For example, in the document manager, you have Linux-like permissions. Using a project’s user groups, you can determine who has which type of access to which folder, the folders inside it and the documents inside them. This is particularly useful when you have public projects that contain some sensitive documents that only certain people should access.

What does this have to do with Elasticsearch? Well, a lot. It’s all very nice being able to search for anything anywhere, but you don’t want people to be able to find things they shouldn’t. The currently recommended way to ensure that only certain people can find certain documents in Elasticsearch is via an Nginx set-up. Unfortunately, that solution doesn’t seem to have the level of granularity that Tuleap needs. So are Tuleap and Elasticsearch just not meant for each other? How can we solve this?

The first attempt: over-restrictive

Let’s use a simple example: one file in one folder. Two user groups: builders and designers. Using these elements, we have multiple permutations including:

  1. builders have access to the file and the folder;
  2. designers have access to the file and the folder;
  3. builders and designers both have access to the file and the folder;
  4. builders have access to the file and designers have access to the folder;
  5. designers have access to the file and builders have access to the folder.

In our first attempt, we indexed our Elasticsearch document in such a way that only a group that had access to both the file and the folder was listed as having access. In the query, we then passed the list of groups to which a user belonged in order to check if one of them matched. It looked something like this:

#the document for permutation 3
{
    "name" : "my doc",
    "permissions" : ["@builders", "@designers"]
}
#the query for a user who only belongs to the builders group
{
    "query" : {
        "filtered" : {
            "filter" : {
                "terms" : {
                    "permissions" : ["@builders"]
                }
            }
        }
    }
}

Now, this works perfectly well until you encounter a simple case: what if a builder is also a designer? For permutations 1, 2 and 3, all is fine. For 4 and 5, however, the document is not returned by our query, even though such a user can access both the file and the folder. No single group has access to both, so the document is indexed with an empty array of permission groups. So how do you index a document with multi-level permissions in Elasticsearch?
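The failure described above can be reproduced outside Elasticsearch. The following is a minimal Python sketch of the first attempt, not the actual Tuleap indexing code: `index_permissions` and `terms_filter_matches` are hypothetical helper names, and the terms filter is approximated by a simple set intersection.

```python
def index_permissions(file_groups, folder_groups):
    """First attempt: list only the groups that can read both the file and the folder."""
    return sorted(set(file_groups) & set(folder_groups))


def terms_filter_matches(doc_permissions, user_groups):
    """Approximates a terms filter: true if any of the user's groups is listed on the document."""
    return bool(set(doc_permissions) & set(user_groups))


# Permutation 3: both groups can read the file and the folder.
doc3 = index_permissions(["@builders", "@designers"], ["@builders", "@designers"])

# Permutation 4: builders can read the file, designers can read the folder.
doc4 = index_permissions(["@builders"], ["@designers"])

dave = ["@builders", "@designers"]  # a user who belongs to both groups

print(doc3, terms_filter_matches(doc3, ["@builders"]))
print(doc4, terms_filter_matches(doc4, dave))
```

For permutation 4 the indexed list is empty, so no query can ever match it, whatever groups the user belongs to.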

Blowing things up

The first step is to figure out what we really need, so let’s take permutation 4: builders have access to the file and designers have access to the folder. The permissions of the document cannot be defined by a single group, and we have already decided that an array of groups corresponds to the list of different groups that can access the document. Let’s continue with these rules. In order to index the permissions of a file that requires the user to be a member of both the designers and builders groups, we can use concatenation to create a new group. Using a rule that groups are concatenated in alphabetical order, we get

"permissions" : ["@builder@designer"]

The problem with this approach is scalability. Let’s have an imaginary person named Dave and say that Dave is both a designer and a builder. Thus, Dave’s permissions have both groups in them. However, Dave also needs to access our document, so he must also have the concatenated group in his permissions.

"permissions" : ["@designer", "@builder", "@builder@designer"]

This means that for Dave, who belongs to 2 groups, we must add 3 groups in the search query. No big deal? Let’s do some maths. It’s no coincidence that 2 user groups result in 3 in the query; there’s a natural sequence behind it. For anyone interested, here is how it grows with the number of groups:

f(1) = 1
f(2) = 2+1 = 3
f(3) = 3+2+1 = 6
f(4) = 4+3+2+1 = 10
f(5) = 5+4+3+2+1 = 15
...
f(n) = n+(n-1)+...+1 = n(n+1)/2

We can see that this gets bigger quite quickly, but let’s say we consider it manageable in most cases.
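The sequence above corresponds to each group on its own plus each pair of groups, which is what a file in a single folder can require. As a rough Python sketch (the function name `user_query_terms` and the `max_levels` parameter are hypothetical, introduced only for illustration), the terms a user would have to send can be generated like this:

```python
from itertools import combinations


def user_query_terms(user_groups, max_levels=2):
    """Each group alone, plus every alphabetical concatenation of up to
    `max_levels` distinct groups (2 = one file inside one folder)."""
    groups = sorted(set(user_groups))
    terms = []
    for size in range(1, min(max_levels, len(groups)) + 1):
        for combo in combinations(groups, size):
            terms.append("".join(combo))
    return terms


# Dave belongs to two groups and ends up with three query terms.
print(user_query_terms(["@designer", "@builder"]))

# Number of query terms for users who belong to 1..5 groups.
counts = [len(user_query_terms([f"@g{i}" for i in range(n)])) for n in range(1, 6)]
print(counts)
```

If a document could require more than two groups at once, raising `max_levels` to the number of groups yields every non-empty subset, that is 2^n - 1 terms, which grows much faster than the sequence above.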

The real problem comes when we try to index a document that’s within a big folder hierarchy. Let’s say that there are also many user group permissions on each folder. The number of permission combinations that need to be indexed in this case can quite quickly become massive. I got as far as finding that it was something like n! modulo the number of groups that were present in more than one folder.
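To get a feel for the document side of the problem, here is a hedged Python sketch that enumerates the concatenated groups a document would need, assuming one list of allowed groups per level (the file first, then each parent folder). The name `document_permission_terms` is hypothetical and the code does not attempt to reproduce the exact formula mentioned above.

```python
from itertools import product


def document_permission_terms(levels):
    """levels: one list of allowed groups per level (file, then each parent folder).
    Pick one group per level and concatenate the distinct picks alphabetically."""
    terms = set()
    for choice in product(*levels):
        terms.add("".join(sorted(set(choice))))
    return sorted(terms)


# Permutation 4: builders on the file, designers on the folder.
print(document_permission_terms([["@builder"], ["@designer"]]))

# Three levels sharing some groups: duplicates collapse, but several terms remain.
print(document_permission_terms([["@a", "@b"], ["@a", "@c"], ["@b", "@c"]]))

# Three levels with four unrelated groups each: 4 * 4 * 4 combinations.
deep = [[f"@g{level}_{i}" for i in range(4)] for level in range(3)]
print(len(document_permission_terms(deep)))
```

When the groups on each level do not overlap, the count is simply the product of the list sizes, so every extra folder level multiplies the number of terms to index.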

With this not looking overly promising, I turned to the Internet.

Let’s go scripting

The first thing I tried was to piggy-back off some other person’s trial and error. This page on permission filtering seems interesting but it doesn’t seem to provide a full answer (I may be misunderstanding it). After many hours of fruitlessly searching the internet, reading the documentation seemed like a good idea. The nice thing about Elasticsearch’s documentation is that it’s easy on the eye and doesn’t over-complicate things. It’s also full of little snippets and walk-throughs. Eventually, I came across scripting and decided to give it a go. Elasticsearch doesn’t come with a large range of search keywords and that’s not a bad thing if you take scripting seriously. I think the idea here is that if the function you oh-so-desperately-would-love-to-have doesn’t exist, you can create it in exactly the way you want it!

There are multiple ways of adding a script to your Elasticsearch but I currently prefer the plugin option. There is a very good example on the net, and such plugins are simple enough that you don’t need to have Java as your primary language in order to write them (although it would be faster).

The way I intend to use scripting is as follows. I can index a document’s permissions as an array of arrays: one sub-array for the file and one for each folder above it:

"permissions" : [
    ["@designer"], #file permission groups
    ["@builder"]   #folder permission groups
]

Next, I will create a script in my plugin called something like all_match_at_least_one. This script will run through each of the sub-arrays and check that each one contains at least one of the user’s permission groups, i.e. that none of the intersections is empty. Finally, I can run a query.

"filter" : {
    "script" : {
        "script" : "all_match_at_least_one",
        "lang" : "native",
        "params" : {
            "field" : "permissions",
            "matches" : ["@designer", "@builder"] #Dave's user groups
        }
    }
}
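The logic such a script would need is small. The following Python sketch only illustrates the intended check; it is not the plugin itself, which would be written against Elasticsearch’s own scripting interface, and the function signature here is an assumption made for the example.

```python
def all_match_at_least_one(permissions, matches):
    """True if every sub-array of `permissions` shares at least one group
    with `matches` (the user's groups)."""
    user_groups = set(matches)
    return all(user_groups.intersection(level) for level in permissions)


permissions = [
    ["@designer"],  # file permission groups
    ["@builder"],   # folder permission groups
]

print(all_match_at_least_one(permissions, ["@designer", "@builder"]))  # Dave
print(all_match_at_least_one(permissions, ["@builder"]))               # a builder only
```

Two points are worth checking before relying on this. An empty `permissions` array passes the check vacuously, so a document without any permission data would be visible to everyone. Also, Elasticsearch flattens arrays of arrays when it indexes a field, so a real script would probably need to read the nested structure from the original source document or use a different encoding.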

I haven’t quite finished this yet but I will publish the results when it’s done.
If anyone reading this thinks that scripting is not the solution to the problem, then I would really like to hear why.

Note: for the first approach, there is an alternative with prime numbers. To each user group, you can associate a different prime number. A user’s permissions then equate to the product of all the numbers that correspond to the groups they belong to. The permissions for a document would equate to an array of numbers. The rule would then be that if one of the document’s numbers divides the user’s number, then the user has access to it. E.g. if Dave belongs to three groups, associated with 2, 3 and 5 respectively, then his number would be 2 x 3 x 5 = 30. If a document then required that the user belongs to designers and builders, represented by 2 and 3 respectively, then the document would require that the user’s number is divisible by 2 x 3 = 6. The is_divisible operation doesn’t exist in Elasticsearch but it could easily be created via a script. However, deciding whether one number divides another is not a lightweight operation, and we have to consider the possibility that there may be thousands of different user groups.
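The prime number idea can be tried out with a few lines of Python. This is only a sketch under assumptions: the third group `@tester` and all function names are hypothetical, and Python’s arbitrary-precision integers hide a cost that a fixed-width numeric field would not.

```python
import math


def first_primes(count):
    """Return the first `count` primes using simple trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


group_names = ["@designer", "@builder", "@tester"]  # "@tester" is a made-up third group
prime_of = dict(zip(group_names, first_primes(len(group_names))))


def user_number(groups):
    """Product of the primes of every group the user belongs to."""
    return math.prod(prime_of[g] for g in groups)


def document_numbers(required_sets):
    """One number per alternative set of groups that grants access."""
    return [math.prod(prime_of[g] for g in groups) for groups in required_sets]


def can_access(doc_numbers, user_num):
    """Access is granted if any of the document's numbers divides the user's number."""
    return any(user_num % n == 0 for n in doc_numbers)


dave = user_number(["@designer", "@builder", "@tester"])
doc = document_numbers([["@designer", "@builder"]])
print(dave, doc, can_access(doc, dave))

# With many groups the products quickly outgrow a 64-bit integer.
print(math.prod(first_primes(1000)).bit_length())
```

The last line gives an idea of the scaling concern: a user who belonged to a thousand groups would carry a number far too large for a standard numeric field.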

Note 2: The prime number approach also highlights that the first approach could be further optimised in certain scenarios. What it tells us is that if one of the document’s numbers divides another, then we don’t need to index the larger one. This is intuitive when you consider a document that requires membership of either designers, or designers and builders; only designers needs to be indexed.
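Expressed with sets of groups rather than numbers, this pruning amounts to dropping any requirement that strictly contains another one. A minimal Python sketch, with the hypothetical name `prune_redundant`:

```python
def prune_redundant(required_sets):
    """Keep only the requirements that do not strictly contain another requirement."""
    sets = {frozenset(groups) for groups in required_sets}
    kept = [s for s in sets if not any(other < s for other in sets)]
    return sorted(sorted(s) for s in kept)


# Either designers alone, or designers and builders: only designers is needed.
print(prune_redundant([["@designer"], ["@designer", "@builder"]]))

# The three-group requirement is implied by either of the two-group ones.
print(prune_redundant([["@a", "@b"], ["@b", "@c"], ["@a", "@b", "@c"]]))
```

The comparison is quadratic in the number of requirements, which should be acceptable if pruning happens once at indexing time rather than at query time.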


About

Manon Midy

Manon has been working in the software engineering world since 2008. She enjoys the challenge of creating economic value for a company where innovation and open source are the DNA. She is convinced it’s possible to provide professional services that embrace FLOSS values (open-mindedness, transparency, co-elaboration) while meeting business objectives. She believes the real strength of Enalean comes from the valuable men and women in its teams, as well as the powerful Tuleap technology.