BioImplement: The Cognitive Cost of Switching Technology Stacks

I do kinda feel like my head is full!
My context switching penalty is high and my process isolation is not what it used to be.

- Elon Musk, Reddit AMA, Jan 5, 2015

Cognitive load is a term for the total effort being used in working memory by an individual performing a task. Faced with any technology choice, we tend to concoct an approximation in our minds of the cost of the effort compared to the benefit of the change. The cost that has been on my mind recently is that of cognitive load. Even thinking about the irony of that statement adds to my cognitive load.

I moved to Singapore in 2007, with roads and driver’s seats opposite those I learned on in Canada. When driving there, conversations with passengers were halting, stressful and mentally draining. I could feel my brain fighting to avoid old reflexes, which seemed to conspire against my progress. Switching contexts between driving and navigating was a chore. I circled and doubled back quite often.

This experience left me sensitized to my own cognitive load. I began to notice how I reacted to context switching between computing technology and my own scientific domain, biology. Like Elon Musk, my context switching comes with a penalty best described as a “mental lag”: a period of time when I can remember nothing about what I know. This lag is a brief moment of stupidity, lasting seconds to minutes. It is as though my brain needs time and more clues to rebuild the branching needed to recall those things that do indeed reside deep in my memory. It seems that the path into deep memory gets displaced by whatever I was last doing. The more cognitive load my last task used, the longer the lag seems. The discomfort of switching contexts seems to drive me to try to reduce my cognitive load.

Educators design instructional material to reduce cognitive load in a few ways.

Physical integration of information. (Think Wikipedia as our savior for any trivia question.)
Eliminating unnecessary redundancy. (We Canadians fill out government forms in one or the other official language, never both, no matter how fluently bilingual we are.)
Worked examples.
Open-ended exercises.

So my hypothesis here is that technology stack components that are successful – ones that entice people to switch to them – seem to reduce cognitive load in ways that approach the list above. At the same time they have a low switching cost. It is as though knowledge workers carry this approximation in their heads, balancing the real and cognitive costs and benefits of switching to new technologies, while at the same time watching network behavior so as not to get stranded on the wrong side of emerging successes.

Over the last 3 years, I have changed my complete computing stack, including infrastructure, operating systems, databases, and language. I attribute this big changeover to my fundamental need to reduce my cognitive load, and I am pleased to say it has done so.

Switching from physical infrastructure to a hybrid IaaS system has made life much easier. Aside from the usual number-of-cores-on-the-head-of-a-pin, or CapEx-vs-OpEx arguments, I would argue that the popularity of IaaS and PaaS cloud computing relates directly to reducing the cognitive load of developers. The cloud-based web developer doesn't need to carry around the mental baggage of physical systems knowledge. Maybe this is obvious to everyone.

Cloud computers are invisible computers, and “out of sight – out of mind” seems to me to be a key to minimizing cognitive load. Cloud VMs remove the burden of needing to know much about the idiosyncrasies of physical server hardware installations: power, security, firewalls, networking, cabling, storage formatting, remote console booting, and OS driver updating. By using cloud computing, developers get a free chunk of cognitive load back to use for other fun software development stuff.

Another example is the rise of server-side JavaScript. Through Node.js, JavaScript has become a general purpose computing language, adding to its original role as the common language in web browsers. This is a clear case of two birds, one stone. Developers work more efficiently in a single language that spans the back-end and front-end of modern web applications. Efficiency improvements come from re-using code on both the server and browser side, but also from no longer carrying the cognitive load of coding in two separate languages and from avoiding context switching penalties.

The popular rise of Docker also seems to fit the pattern of a reduced cognitive load, as it chunks software deployments, abstracts Linux system calls, and promises to make it easier to deploy across heterogeneous cloud platforms. Packaging a Docker image consolidates familiar Git-based builds and package-manager-based installations, carrying with it all the application dependencies. The Docker ecosystem promises to democratize the compute image itself, making it vendor neutral, and, at the same time, reducing the worry over cloud vendor lock-in. Docker embodies the concept of the physical integration of computing information, which was the first bullet point from the list above. But it will take at least another year of ecosystem development before Docker is widely used in production systems.

While these three are arguably successful and appear to reduce cognitive load, other new systems are eating your newly-freed cognitive capacity for lunch.

There has been a sprawl of database and storage technologies supporting unstructured data and Big Data. These aspire to provide solutions that scale by changing fundamental compute paradigms of databases and storage. If you believe Big Data listicles, our fate seems sealed: we are driving towards a NoSQL, MongoDB, Hadoop/HDFS world for Data Science. But Hadoop, in particular, has not won many converts among my kindred spirits in Bioinformatics. The Bioinformatics stack depends on general purpose Unix/Linux computing and POSIX file systems. Somehow the cost to convert tools to the Hadoop/HDFS world is not yet justified to my peers.

So, what happens when general purpose computing systems catch up with these diverse and specialized Big Data systems? Consider two systems, PostgreSQL 9.4 and Manta, both released as open source in late 2014. Both of these systems offer consolidations that enable unstructured data and Big Data computing within existing computing paradigms.

In December 2014 PostgreSQL 9.4 was released with the JSONB data type and extensions for JSON indexing and querying. These extensions allow any SQL column containing JSON data to work as a fast, indexed and queryable JSON store. In practice, this makes NoSQL/JSON databases functionally redundant by embedding their functionality in an SQL system. One can query arbitrary JSON structures with SQL syntax extensions and, importantly, connect rich SQL-based visualization and computing tools like Tableau and R to the database. The specialization movement towards NoSQL/JSON databases, as implemented in MongoDB, has now lost its raison d'être. SQL has caught up in functionality. PostgreSQL could be a winner in the long-term and start edging out NoSQL and MySQL systems in a couple of short years.
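As a rough sketch of what this looks like in practice, the Python snippet below only builds the SQL text and does not connect to a database. The table name `samples`, the column `doc`, and the JSON keys are hypothetical. The `jsonb` type, the `@>` containment operator, the `->>` text-extraction operator, and GIN indexes are PostgreSQL 9.4 features, and the `%s` placeholder assumes a DB-API driver such as psycopg2.

```python
import json

# Hypothetical table and column names, used only for illustration.
TABLE = "samples"
COLUMN = "doc"


def create_statements(table=TABLE, column=COLUMN):
    """Return DDL for a table with a JSONB column and a GIN index on it."""
    return [
        f"CREATE TABLE {table} (id serial PRIMARY KEY, {column} jsonb NOT NULL);",
        f"CREATE INDEX {table}_{column}_gin ON {table} USING GIN ({column});",
    ]


def containment_query(pattern, field, table=TABLE, column=COLUMN):
    """Build a parameterized query that finds rows whose JSON contains
    `pattern` (the @> operator) and returns one field as text (->>)."""
    sql = (
        f"SELECT id, {column} ->> '{field}' AS {field} "
        f"FROM {table} WHERE {column} @> %s::jsonb;"
    )
    # The pattern is passed as a bound parameter, serialized to JSON text.
    params = (json.dumps(pattern, sort_keys=True),)
    return sql, params


if __name__ == "__main__":
    for statement in create_statements():
        print(statement)
    sql, params = containment_query({"organism": "E. coli"}, "accession")
    print(sql)
    print(params)
```

With a live connection, the returned `sql` and `params` pair would be handed to a cursor's `execute` call. The GIN index is what lets the containment test avoid scanning every row.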

In terms of Big Data/Data Science technology stacks, consider the on-disk data (and concept) redundancies that exist between traditional POSIX data storage, persistent cloud object storage (e.g. AWS S3), and Hadoop HDFS block storage. These are all separate systems that the popular Data Science stack depends upon for Big Data computing. Data Science practitioners must recall all the implementation details necessary to manage conventional POSIX file systems, cloud storage objects (buckets and eventual consistency models) and some obscure bits, like how to write Java tools to pack small files into optimally sized HDFS SequenceFiles for Hadoop processing.
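To make the small-file packing chore concrete, here is a minimal sketch that uses Python's standard `tarfile` module as a stand-in. It is not the Hadoop SequenceFile API or format; it only illustrates the idea of bundling many tiny files into one large blob that a block-oriented store handles more efficiently. The file names and contents are made up.

```python
import io
import tarfile


def pack_small_files(files):
    """Pack a mapping of {name: bytes} into one in-memory tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def unpack(archive_bytes):
    """Recover the {name: bytes} mapping from the packed archive."""
    result = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as archive:
        for member in archive.getmembers():
            result[member.name] = archive.extractfile(member).read()
    return result


if __name__ == "__main__":
    # Hypothetical small records, e.g. many short sequence files.
    small = {f"read_{i}.fa": f">read_{i}\nACGT\n".encode() for i in range(1000)}
    packed = pack_small_files(small)
    print(len(small), "small files packed into one blob of", len(packed), "bytes")
    assert unpack(packed) == small
```

Every consumer of the packed blob now needs matching unpacking logic, which is exactly the kind of extra detail a practitioner has to keep in mind.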

Manta consolidates this fragmentation of storage by providing a persistent object store with general purpose Map/Reduce computing using the CPU cores of commodity storage nodes. It provides a strongly consistent, hierarchical object store built on ZFS, a POSIX-compliant copy-on-write file system. It allows secure container-based compute tasks to be moved to storage, rather than moving storage to computing nodes. This eliminates AWS S3 to EC2 to HDFS transfer time and duplicated data costs, and saves us from having to optimize HDFS by packing small files into a larger SequenceFile via a peculiar re-invention of tar.

On Manta, a user can build complex Map/Reduce pipelines involving any runtime language or compiled special purpose software (e.g. ffmpeg, OpenCV, BLAST, CRAM), and run Map/Reduce in a secure multi-tenant environment. Manta functions without the drag of Hadoop’s Java-based process control code and HDFS management code running on top of the operating system and storage. Manta ZFS storage is LZ4 compressed at the operating system level, a unique feature that makes better use of commodity disk space. Open-source Manta software development kits (SDKs) allow Map/Reduce systems to be rapidly constructed with one-line Bash or PowerShell scripts, as well as in Node.js, Python, Go, R, Java and Ruby.
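The shape of such a pipeline can be sketched locally in plain Python. This is a simulation only and does not call Manta or any of its SDKs. It assumes a job model in which the map step runs once per stored object and the reduce step sees all of the map outputs. The object paths, contents, and function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_word_count(text):
    """Map phase: count the words in a single object."""
    return Counter(text.lower().split())


def reduce_counts(partials):
    """Reduce phase: merge the per-object counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_job(objects, mapper, reducer):
    """Run the mapper once per object, then feed all outputs to the reducer."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(mapper, objects.values()))
    return reducer(partials)


if __name__ == "__main__":
    # Hypothetical object paths and contents standing in for stored objects.
    objects = {
        "/store/a.txt": "to be or not to be",
        "/store/b.txt": "be quick",
    }
    print(run_job(objects, map_word_count, reduce_counts).most_common(2))
```

In a real job the mapper and reducer would be arbitrary commands running next to the data. The point of the sketch is only that the user supplies two small steps and the system handles the fan-out.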

Despite the elimination of real storage redundancies and of the cognitive load imposed by maintaining three separate storage paradigms, Manta has a switching penalty: it is not a Linux-based system. It runs on SmartOS, an x86 illumos-based distro and server operating system derived from OpenSolaris. SmartOS itself is specialized in running secure operating system containers (Zones), as well as the KVM hypervisor which can run Linux or Windows virtual machines.

Released as a cloud service in 2013, Manta had four distinct cognitive costs that gave adopters pause. Now that it is open source, as of Nov 2014, there are three left. First, it has to be installed on physical or hosted commodity computing infrastructure. Someone has to rig the required networking and learn how to administer the system. Second is the cognitive cost of switching your operating system command memory from Linux to SmartOS (i.e. reading the docs, cheat sheets, list-servers).

Third, and arguably the biggest sticking point, is the cognitive cost of cross-compiling Linux-specialized code to SmartOS. Although over ten thousand packages have been ported to SmartOS’s package manager, writing software that is cross-platform between Linux and Unix systems is, in many cases, a dying art form. It would seem that the success of Linux has allowed developers to reduce their cognitive load and stop caring about cross-platform software compatibility.

But this last cognitive load barrier is falling fast. In 2014 Joyent’s engineers began updating the lx-branded zone, based on the old OpenSolaris Linux operating system call emulator. Now in open beta, lx-branded zones are containers that run current versions of CentOS and Ubuntu, and a majority of current 32- and 64-bit Linux software. Extensive community testing is helping find bugs, which are being eliminated fast and furiously.

The lx zone provides a way for Linux software to run on SmartOS at bare metal speeds without cross-compiling code, and gives die-hard Linux users their cherished apt and yum package managers.

A key piece of Linux software targeted for the lx zone is the Docker daemon itself. For Manta this is most significant as it will allow Docker images to form the building blocks of Big Data computing on storage. When Joyent succeeds at this effort, it will tip the balance. The actual cost of maintaining data and moving it between three separate storage paradigms, POSIX, S3 and HDFS, will outweigh the cognitive cost of switching to Manta.

Then the Big Data/Data Science stack may be ready for serious disruption.

Disclaimer: I no longer work for Joyent, and have no competing financial interests.

PostgreSQL 9.4 JSONB

SlideShow:

http://www.slideshare.net/jkatz05/webscale-postgresql-jsonb

What's New:

https://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.4#JSONB_Binary_JSON_storage

Hadoop Sequence File

Docs:

https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/SequenceFile.html

Manta: Object Storage with Integrated Compute

Overview and Docs:

https://www.joyent.com/object-storage

Open Source:

https://www.joyent.com/blog/sdc-and-manta-are-now-open-source

GitHub:

https://github.com/joyent/manta

Docker Rising:

https://www.joyent.com/blog/2014-in-review-docker-rising

SmartOS & lx-branded zones (LXz)

lx-branded zones on the SmartOS Wiki for Beta Testers:

https://wiki.smartos.org/display/DOC/LX+Branded+Zones

Bryan Cantrill on OS emulation in - The Dream Is Alive Video:

http://youtu.be/TrfD3pC0VSs?list=PLH8r-Scm3-2VmZhZ76tFPAhPOG0pvmjdA

Why SmartOS in my Lab?

http://smartos.blueprint.org/home/why-smartos-in-my-lab

Linux to SmartOS Cheat Sheet:

http://wiki.joyent.com/wiki/display/jpc2/The+Joyent+Linux-to-SmartOS+Cheat+Sheet

Thursday, January 15, 2015
