On lists and sets

You probably don’t even need to think about which collection type to use when: if you need to preserve order or allow duplicate entries, you take a List. Otherwise, if order does not matter and every element should be contained only once, you use a Set.
Although nearly every programmer can explain that when asked, in practice I observe a tendency to use Lists by default, even if the data is logically unordered and unique. I’ve seen that in different teams and code bases by now. Have you observed the same? Do you have an explanation for it?

I came up with a couple of possible reasons:

  • In some programming languages it’s common to use arrays for collections. I remember back in school, when learning to program in Pascal, there was no collection framework like in Java that would distinguish between Lists and Sets; there were only plain arrays. And if I’m not mistaken, vanilla JavaScript used to be the same. Is this the reason why many devs are just used to arrays and therefore use Lists, which are close in terms of their properties?
  • The widespread Java Persistence API always returns a List from a query, no matter whether you specify an ordering or not. That kind of makes sense, because the result of a database query always arrives as an observably ordered list of records, even if you don’t specify the order. So if you logically have a collection of results without a specified order, you need to convert it to a Set explicitly.
  • The REST API of your Java service provides the data converted to JSON, so any collection is returned as an array, which inherently has an order of elements, even if that is not intended. If the presentation layer is a UI instead of REST, the elements are still displayed in some order. Do the devs feel that it’s not worth the effort to convert the JPA-returned List to a Set for the business layer and then have it converted back to a List in the presentation layer?
  • Often the same code also works with a List, even if you logically deal with a Set. Are the devs just too lazy to think about it and make the distinction, because it’s comfortable to just use Lists everywhere? You hardly ever actively want to remove duplicates, and you also don’t care if there is an (unneeded) order, right? If all you do with the collection is persist it, iterate over it, and check whether something is contained, there is no difference?
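
The explicit conversion mentioned in the second bullet is a one-liner. Here is a minimal sketch, assuming a hypothetical findUserIds() method that stands in for a JPA query (the class and method names are made up for illustration, and no real persistence code is involved):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class QueryResultToSet {

    // Hypothetical stand-in for a JPA query result: a List whose order
    // carries no meaning for the caller.
    static List<Integer> findUserIds() {
        return Arrays.asList(345, 123, 234);
    }

    // Copies the result into a Set, so the type itself says
    // "unordered, no duplicates". Duplicate entries are dropped.
    static Set<Integer> toSet(List<Integer> queryResult) {
        return new HashSet<>(queryResult);
    }

    public static void main(String[] args) {
        Set<Integer> userIds = toSet(findUserIds());
        System.out.println(userIds.contains(123)); // true
        System.out.println(userIds.size());        // 3
    }
}
```

Iterating and contains() work on the Set just as they did on the List, which is probably why the conversion feels optional in the first place.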

There is actually a reason why this bothers me. In our current project, we write a lot of tests using Mockito. For a unit test, e.g. in the presentation layer, you then want to verify that the method under test called the (mocked-out) service layer with exactly the expected parameters.
A typical way to do that is this:

// ... some test code that calls the method under test ... 

// afterwards verify the expected behavior:
MyClass expectedParameter = new MyClass("foo", "bar");
verify(myService).doMagic(expectedParameter);

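It checks that the passed parameter object has exactly the values foo and bar. (Mockito compares arguments with equals() by default, so this assumes MyClass implements equals() based on those values.)
Now, what if the method under test also computes a collection of elements that is passed to the mocked method as a parameter? Let’s say, a collection of user ids: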

// ... some test code that calls the method under test ... 

// afterwards verify the expected behavior:
List<Integer> expectedUserIds = Arrays.asList(123, 234, 345);
MyClass expectedParameter = new MyClass("foo", "bar");
verify(myService).doMagic(expectedParameter, expectedUserIds);

See how we unintentionally test the order of the user ids?
In case the method under test calls the service with the List [234, 345, 123], the test will fail, since the given List [123, 234, 345] does not match the actual argument.
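
The failure comes down to how the two collection types define equality, which is what Mockito relies on by default when comparing arguments. A small sketch without Mockito (the class and method names are mine, just for illustration):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class OrderSensitivity {

    // List.equals compares element by element, so the order matters.
    static boolean sameAsList(List<Integer> expected, List<Integer> actual) {
        return expected.equals(actual);
    }

    // Set.equals only checks that both sets contain the same elements.
    // Note that copying into a HashSet also drops duplicates.
    static boolean sameAsSet(List<Integer> expected, List<Integer> actual) {
        Set<Integer> expectedSet = new HashSet<>(expected);
        Set<Integer> actualSet = new HashSet<>(actual);
        return expectedSet.equals(actualSet);
    }

    public static void main(String[] args) {
        List<Integer> expected = Arrays.asList(123, 234, 345);
        List<Integer> actual = Arrays.asList(234, 345, 123);
        System.out.println(sameAsList(expected, actual)); // false
        System.out.println(sameAsSet(expected, actual));  // true
    }
}
```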

I picked user ids as an example because it’s easy to understand that you usually have a Set in mind: you want to pass all the users that are concerned; no user id should occur twice, and the order doesn’t matter. Therefore, the Set type would be the better pick for the method parameter; the List probably works, but it forces you to specify the order of the expected elements. If the underlying implementation changes, you will see the test fail and have to adjust it. Or worse, if the method under test calculates the user ids in an unpredictable order, the tests will sometimes succeed and sometimes fail without any code change.
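
To illustrate how a Set parameter removes the order from the test, here is a rough sketch that uses a hand-written recording fake instead of Mockito, so it runs on its own. MagicService, RecordingMagicService and methodUnderTest are hypothetical names, and the MyClass parameter is left out to keep it short:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

class SetParameterSketch {

    // Hypothetical service interface whose parameter type is a Set.
    interface MagicService {
        void doMagic(Set<Integer> userIds);
    }

    // Hand-written fake that remembers the argument of the last call.
    static class RecordingMagicService implements MagicService {
        Set<Integer> lastUserIds;

        @Override
        public void doMagic(Set<Integer> userIds) {
            lastUserIds = userIds;
        }
    }

    // Hypothetical method under test: collects the ids in some arbitrary order.
    static void methodUnderTest(MagicService service) {
        Set<Integer> ids = new LinkedHashSet<>(Arrays.asList(234, 345, 123));
        service.doMagic(ids);
    }

    // Returns true if the service was called with exactly the expected ids,
    // regardless of the order in which they were collected.
    static boolean calledWithExpectedIds() {
        RecordingMagicService service = new RecordingMagicService();
        methodUnderTest(service);
        Set<Integer> expected = new HashSet<>(Arrays.asList(123, 234, 345));
        return expected.equals(service.lastUserIds);
    }

    public static void main(String[] args) {
        System.out.println(calledWithExpectedIds()); // true
    }
}
```

With Mockito the same idea should carry over: if doMagic takes a Set, the expected value passed to verify is a Set as well, and the comparison no longer depends on the order.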

Because of this, I would strongly recommend consistently using Sets wherever a Set is logically what the collection represents, and only using Lists when order or duplicates are required.

On the other hand, if you are already working in a codebase where you have Lists everywhere, it’s a tough call. We started to introduce Sets over time, doing it “right” for all the new code, but changing the existing code only where required to make it compile. Since one incoming HTTP request invokes Controllers, Services and Repositories of both legacy and new code, we have a lot of inconsistencies now. A service method may be invoked with a List, converting it to a Set for two further method calls but passing the List to two others. It’s tempting to change those two to Sets as well on the fly, but they again call other methods and are called by different Controllers, and each of the callers and called methods is covered by tests that specify that there are either Sets or Lists… it’s a mess!

My advice for you: Using Sets the right way is great. But being consistent is nearly as great. When you feel like changing List parameters to Sets where it makes sense, discuss it with your team first (or with whoever is contributing to your code base). If you have different opinions, don’t start changing the previous style on your own! That just creates chaos.
If you are all in on the change, consider refactoring the whole code base (or at least one encapsulated component) in one go. Changing only some methods and leaving others as-is results in chaos as well. So be sure it’s worth it!
