Sequential Bias

When dealing with containers we may loosely focus on three major attributes:

order, whenever elements or slots are ordered or not
uniqueness, whenever one element can show up multiple times
association, whenever slots are identifiable by another value

And so a list could be considered an ordered, non-unique, associative container while a set could be an unordered, unique and non-associative one.

Of course, there are two significant gotchas here. This is a simplified classification that completely forgets to consider operations supported by container and a number of abstract containers are defined exactly by that. And let's not forget about the structure and implementation details, especially for the very specialized data structures.

The other gotcha is that in normal conditions we would reserve "associative" for containers like associative arrays and not simple lists. Here, I decided to classify it as such, because it is indexed and accessible. A stack, where elements are not necessarily accessible unless they are at the top and where indexing is a hidden detail would be considered as non-associative in this scenario. In general, I'd consider order, indexing, accessibility, and association to play closely together.

Detailed nature of the classification method, while interesting, is not the topic for today. For today's purpose, this simplified approach seems to be coherent and sufficient.

With that out of the way, let's talk about generating documents. A sudden change, I know.

In one of projects we were generating quite a lot of documents. In situations where we were handling data that had no significant order by itself I noticed I have tendency to sort it before putting it "on paper". Vague reason was that this way each revision of the document "felt similar". It felt similar because it was quicker to compare it by hand. This was reinforced by the documentation process which heavily relied on redlines and manual reviews. Consistent ordering across lifetime of a multi-revision document kept redline size small and easy to skim through.

But if I represent unsorted data as sorted in two different orders, is the data any different? No, because order does not matter. Why do I sort it then? Because it's easier to see differences. And why is that? Because I use tools (wetware or software) designed to handle sequential data. Why? Because these tools are commonly available.

The exact same thing happened when we were migrating data from the very same system into markdown text files tracked by git. Since both git and our brains were looking for ordered data, we exported ordered data.

Now, consider a Python script:

def do_one_thing():
	pass

def do_another_thing():
	pass

def do_both_things():
	do_one_thing()
	do_another_thing()

If we move do_both_things to the top, will the script change? Yes. It's a list of statements. Python interprets the code sequentially and the code itself is represented as a text. But will it really change? No. It's a map of functions. Python will bind the names the same way in both cases and the behaviour will not change. Both answers can be reasoned to be true.

Of course, if we take a look at the AST, we will confirm that from parser standpoint the first is true. Body of a module is a list, so the order matters. This makes sense since Python allows to put behaviour at module level. What if we simulate this situation in different languages?

Take C++ next. One can shuffle things around but only after preparing enough declarations first. Funny enough this reinforces both sides equally. If you shuffle things without declarations, they will easily break - meaning the program changed. If you prepare all the declarations needed, you can shuffle or split things as much as you want - meaning the program does not change.

In C++ case, we could summarise it as legacy and compatibility reasons. Can I continue to analyse and reason it for every single case or will it make sense to use an unordered representation sometimes? How would changing the assumptions change the problems, tooling and processes? One way to find out. For programming languages, I currently see one concrete thing: semantic diff and merge tools. You may already find those in the wild, but it also sounds like a cool weekend project to work on. Moreover, on the more theoretical side of things, I think there are still things to consider regarding structure and representation. I'm preparing a post about those, but the summary is "I'm quite stupid."

Nonetheless, I plan to keep this little bias in mind. Whether it is a result of my brain, tools, legacy or anything else. If I perform actions on certain representation of data with all its constraints, I will get results for exactly that and not the actual abstract data that is being represented. That's why analysing and representing problems should always be on of the priorities.