Empowering Excel with pycel and xargs

How to easily run SQL queries against your org-mode spreadsheet.

background

a lawyer client approached me with a non-typical problem. his office renders services to a realestate purchase group in another country.

the group, consisting of a heterogenous investors, is subject to regulation that puts constraints on how the individual members investments are distributed across the multiple projects they finance.

a issue that emerged was the risk dispersion among members of the group in such a way that would not expose a single investor more than the rest of the group. this is done by striving to diversify the investment of each participant across as many objects as possible.

the law office then found it difficult to reliably predict & decide upon the optimal distribution of equity among its clients, the investors.

challenge

the client was eventually able to come up with an elaborate spreadsheet that calculated risk based on how equity was divided across objects. he couldn't, however, use that spreadsheet to iterate through all possible combinations with the purpose of finding the optimal one.

search

with a rather constrained budget, i realized that re-implementing his model in code would be way too labor intensive, and require constant maintenance in parallel to the spreadsheet, and thus not be acceptable.

i proceeded to look for an approach to programmatically compute the same spreadsheet after changing its input values an arbitrary amount of times. a solution was found in the form of pycel.

trouble

at this point, thinking that the tricky part is complete, i began looking at the amount of possible permutations to the sheet's input values. oops - it was a factorial of 30. the universe would end before brute-forcing so many permutations would complete.

seeking advice

i then consulted a few smarter colleagues. they all suggested, in one way or another that the only way to solve this would be by reformulating the problem. instead of using the client's spreadsheet as a black box, the algorithm needs to be understood and reimplemented in a way that is either more efficient to compute, or enables us to find the result mathematically.

as usual, i didn't take their advice. and this time for a good reason: it was beyond the scope of this project and the client's allocated budget.

a flawed but practical solution

upon consulting with the client, it was decided try & randomly generate as many permutations as possible, in order to get a "feel" for where the ballpark of the best result would be.

pycel invocation was significantly sped up by switching to pypy. from around 10 permutations per second, we went up to 30.

i then employed a shell script and xargs to launch a bunch of instances in parallel over my desktop machine's cores.

the speed went up to 330 permutations per second. leaving this for the night resulted in 5 million computed permutations, which, upon parsing, resulted in a single result significantly better than the rest in the same set.

epilogue

despite the fact that the result was not subject to mathematical validation

nor does it provide any guarantees about being the best, it is acceptable.

the client's spreadsheet is an approximative risk model

this makes the point of reaching a "perfect" solution rather moot.

we were able to reach a result that was verifiable by the client, using their own sheet, and providing a non-scientific, yet strong indication regaring the optimal way to distribute equity among the investors. and all in just 5 hours of work.

this problem stroke me as particularly interesting

due to the fact that it stands right on the crossroads between practicality and academia.

a satisfactory result was delivered

within a short timeframe and a low cost to the client thanks to an efficient & seamless integration into the client's workflow (microsoft excel), and the fact that everyone involved was pragmatic to quickly accept that a relatively large sample of random permutations was the best we could achieve.