I do kinda feel like my head is full!
My context switching penalty is high and my process isolation is not what it used to be.
- Elon Musk, Reddit AMA, Jan 5, 2015
Cognitive load is a term for the total effort used in working memory by an individual performing a task. Faced with any technology choice, we tend to concoct a mental approximation of the cost of the effort compared to the benefit of the change. The cost that has been on my mind recently is that of cognitive load. Even thinking about the irony of that statement adds to my cognitive load.
I moved to Singapore in 2007, with roads and
driver’s seats opposite those I learned on in Canada. When driving there, conversations
with passengers were halting, stressful and mentally draining. I could feel my
brain fighting to avoid old reflexes, which seemed to conspire against my
progress. Switching contexts between driving and navigating was a chore. I
circled and doubled back quite often.
This experience left me sensitized to my own
cognitive load. I began to notice how I reacted to context switching between
computing technology and my own scientific domain, biology. Like Elon Musk, my
context switching comes with a penalty best described as a “mental lag”: a period during which I can recall almost nothing of what I know. This lag is a brief moment of stupidity, lasting seconds to minutes. It is as though my brain needs time, and more clues, to rebuild the branches required to recall things that do indeed reside deep in my memory. The path into deep memory seems to get displaced by whatever I was doing last, and the more cognitive load my last task used, the longer the lag seems to be. The discomfort of switching contexts drives me to try to reduce my cognitive load.
Educators design instructional material to
reduce cognitive load in a few ways.
- Physical integration of information. (Think Wikipedia as our savior for any trivia question.)
- Eliminating unnecessary redundancy. (We Canadians fill out government forms in one or the other official language, never both, no matter how fluently bilingual we are. )
- Worked examples.
- Open-ended exercises.
So my hypothesis here is that successful technology stack components – the ones that entice people to switch to them – reduce cognitive load in ways that approach the list above, while at the same time carrying a low switching cost. It is as though knowledge workers carry this approximation in their heads, balancing the real and cognitive costs and benefits of switching to new technologies, while watching network effects so as not to get stranded on the wrong side of emerging successes.
Over the last three years, I have changed my complete computing stack, including infrastructure, operating systems, databases, and language. I attribute this big changeover to my fundamental need to reduce my cognitive load, and I am pleased to say it has done exactly that.
Switching from physical
infrastructure to a hybrid IaaS system has made life much easier. Aside from
the usual number-of-cores-on-the-head-of-a-pin, or CapEx-vs-OpEx arguments, I
would argue that the popularity of IaaS and PaaS cloud computing relates directly
to reducing the cognitive load of developers. The cloud based web developer doesn't
need to carry around the mental baggage of physical systems knowledge. Maybe
this is obvious to everyone.
Cloud computers are invisible computers, and “out
of sight – out of mind” seems to me to be a key to minimizing cognitive load. Cloud
VMs remove the burden of needing to know much about the idiosyncrasies of
physical server hardware installations: power, security, firewalls, networking,
cabling, storage formatting, remote console booting, and OS driver updating. By
using cloud computing, developers get a free chunk of cognitive load back to
use for other fun software development stuff.
Another example is the rise of server-side JavaScript. Through Node.js, JavaScript has become a general-purpose computing language, adding to its original role as the common language of web browsers. This is a clear case of two birds, one stone: developers work more efficiently in a single language that spans the back end and front end of modern web applications. Efficiency improvements come from re-using code on both the server and the browser side, but also from shedding the cognitive load of coding in two separate languages and the context-switching penalties that come with it.
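As a sketch of that reuse (the function and sample values are my own illustration, not from the post), a single validation routine can gate a form in the browser and re-check the same rule in Node.js before data reaches the database:

```javascript
// One rule, written once, shared verbatim by browser and server.
// (isValidEmail and the sample addresses are hypothetical examples.)
function isValidEmail(address) {
  // Deliberately simple pattern: something@something.tld
  return /^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(address);
}

// Server side: export for require() in Node.js; in the browser the
// same file can simply be loaded with a <script> tag.
if (typeof module !== 'undefined') {
  module.exports = { isValidEmail };
}

console.log(isValidEmail('ada@example.org')); // true
console.log(isValidEmail('not-an-email'));    // false
```

One file, one rule, zero translation between languages – which is exactly the cognitive saving being described.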
The popular rise of Docker also seems to fit
the pattern of a reduced cognitive load – as it chunks software deployments,
abstracts Linux system calls, and promises to make it easier to deploy across
heterogeneous cloud platforms. Packaging a Docker image consolidates familiar git-based builds and package-manager-based installations, carrying with it all of the application's dependencies. The
Docker ecosystem promises to democratize the compute image itself, making it
vendor neutral, and, at the same time, reducing the worry over cloud vendor lock-in.
Docker embodies the concept of the physical integration of computing information,
which was the first bullet point from the list above. But it will take at least another year of
ecosystem development before Docker is widely used in production systems.
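That consolidation is easiest to see in a minimal, hypothetical Dockerfile, where the OS base, the package-manager install, and the application code (the file names here are invented for illustration) are declared as one chunk:

```dockerfile
# Hypothetical image: the whole deployment recipe travels with the application.
# Base OS layer, vendor neutral
FROM ubuntu:14.04
# Package-manager-based installation
RUN apt-get update && apt-get install -y python
# Application code, e.g. from a git checkout
COPY . /app
# One documented entry point
CMD ["python", "/app/server.py"]
```

Everything a deploy used to scatter across runbooks and sysadmin memory is physically integrated into one small text file.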
While these three are arguably successful and
appear to reduce cognitive load, other new systems are eating your newly-freed cognitive
capacity for lunch.
There has been a sprawl of database and storage
technologies supporting unstructured data and Big Data. These aspire to provide
solutions that scale by changing fundamental compute paradigms of databases and
storage. If you believe Big Data
listicles, our fate seems sealed: we are driving toward a NoSQL MongoDB and Hadoop/HDFS world for Data Science. But Hadoop, in particular, has not won many converts among
my kindred spirits in Bioinformatics. The Bioinformatics stack depends on general
purpose Unix/Linux computing and POSIX file systems. Somehow the cost to convert
tools to the Hadoop/HDFS world is not yet justified to my peers.
So - what happens when general purpose
computing systems catch up with these diverse and specialized Big Data systems?
Consider two new systems, PostgreSQL 9.4 and Manta, both released as open source in late 2014. Both of these systems
offer consolidations that enable unstructured data and Big Data computing
within existing computing paradigms.
In December 2014 PostgreSQL 9.4 was released
with the JSONB data type and extensions for JSON indexing and querying. These
extensions allow any SQL column containing JSON data to work as a fast, indexed, SQL-queryable JSON store. In practice, this makes dedicated NoSQL/JSON databases functionally redundant by embedding the same functionality in an SQL system. One can
query arbitrary JSON structures with SQL syntax extensions, and importantly,
connect rich SQL-based visualization and computing tools, like Tableau and R, to the database. The specialization movement toward NoSQL/JSON databases, as implemented in MongoDB, has now lost its raison d'être: SQL has caught up in functionality. PostgreSQL could be a long-term winner and start edging out NoSQL and MySQL systems within a couple of short years.
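A minimal sketch of the 9.4 idiom (the table, column, and document fields here are hypothetical) shows the whole trick:

```sql
-- A JSONB column plus a GIN index behaves like an indexed document store.
CREATE TABLE samples (
    id  serial PRIMARY KEY,
    doc jsonb NOT NULL
);
CREATE INDEX samples_doc_idx ON samples USING gin (doc);

INSERT INTO samples (doc)
VALUES ('{"organism": "E. coli", "read_count": 120000}');

-- @> (containment) can use the GIN index; ->> extracts a field as text,
-- so ordinary SQL clauses and SQL-speaking tools can consume it.
SELECT doc ->> 'read_count'
FROM   samples
WHERE  doc @> '{"organism": "E. coli"}';
```

Arbitrary JSON documents, but the querying, indexing, and tooling are all plain SQL.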
In terms of Big Data/Data Science technology
stacks, consider the on-disk data (and concept) redundancies that exist between
traditional POSIX data storage, persistent cloud object storage (e.g. AWS S3),
and Hadoop HDFS block storage. These are all separate systems that the popular Data
Science stack depends upon for Big Data computing. Data Science practitioners
must recall all the implementation details necessary to manage conventional POSIX
file systems, cloud storage objects (buckets and eventual consistency models)
and some obscure bits, like how to write Java tools to pack small files into
optimally sized HDFS Sequence Files for Hadoop processing.
Manta consolidates this fragmentation of
storage by providing a persistent object store with general purpose Map/Reduce
computing using the CPU cores of commodity storage nodes. It provides a
strongly consistent, hierarchical object store built on the ZFS (POSIX-compliant) copy-on-write file system. It allows secure container-based compute tasks to be moved to the storage, rather than moving storage to compute nodes. This eliminates S3-to-EC2-to-HDFS transfer time and duplicated-data costs, and saves us from having to optimize HDFS by packing small files into larger SequenceFiles via a peculiar re-invention of tar.
On Manta, a user can build complex Map/Reduce
pipelines involving any runtime language
or compiled special purpose software (e.g. ffmpeg, OpenCV, BLAST, CRAM), and
run Map/Reduce in a secure multi-tenant environment. Manta functions without the
drag of Hadoop’s Java-based process control code and HDFS management code
running on top of the operating system and storage. Manta ZFS storage is LZ4
compressed at the operating-system level, a distinctive feature that makes better use of commodity disk space. Open-source Manta software development kits (SDKs)
allow Map/Reduce systems to be rapidly constructed with one-line Bash or
PowerScript scripts, as well as in
Node.js, Python, Go, R, Java and Ruby.
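The shape of such a job can be sketched locally in Node.js (a toy stand-in of my own, not the Manta SDK itself): each stored object is run through a map step where it lives, and the concatenated map outputs feed one reduce step:

```javascript
// Toy local model of a Manta-style Map/Reduce job: count lines across objects.
// On Manta, the map command (think `wc -l`) runs next to each stored object,
// and the reduce command aggregates the per-object outputs.
const objects = [
  'ACGT\nTTGA\nCCGA\n', // stand-in for stored object 1
  'GGTA\nAACC\n',       // stand-in for stored object 2
];

// Map phase: one line count per object
const mapped = objects.map((body) => body.split('\n').filter(Boolean).length);

// Reduce phase: a single sum over all map outputs
const total = mapped.reduce((sum, n) => sum + n, 0);

console.log(mapped); // [ 3, 2 ]
console.log(total);  // 5
```

On the real service, the map and reduce commands are submitted as a one-line job through the SDK command-line tools, and run inside secure containers on the storage nodes themselves.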
Despite the elimination of real storage redundancies
and cognitive load redundancies imposed by maintaining three separate storage
paradigms, Manta has a switching penalty: it is not a Linux-based system. It runs on SmartOS, an x86 illumos-based distro and server operating system forked from OpenSolaris. SmartOS itself specializes in running secure operating
system containers (Zones), as well as the KVM hypervisor which can run Linux or
Windows virtual machines.
Released as a cloud service in 2013, Manta had four distinct cognitive costs that gave adopters pause. Now that it is open source, as of November 2014, three remain. First, it has to be installed on physical or hosted commodity computing infrastructure; someone has to rig the required networking and learn how to administer the system. Second is the cognitive cost of switching your operating-system command memory from Linux to SmartOS (i.e. reading the docs, cheat sheets, and mailing lists).
Third, and arguably the biggest sticking point,
is the cognitive cost in cross-compiling Linux specialized code to SmartOS. Although
over ten thousand packages have been ported to SmartOS’s package manager, writing
software that is cross-platform between Linux and Unix systems is, in many
cases, a dying art form. It would seem that the success of Linux has allowed
developers to reduce their cognitive load and stop caring about cross-platform software
compatibility.
But this last cognitive load barrier is falling
fast. In 2014 Joyent’s engineers began updating the lx-branded zone, based on OpenSolaris’s old Linux system-call emulator. Now in open
beta, lx-branded zones are containers that run current versions of CentOS and Ubuntu, and a
majority of current 32- and 64-bit Linux software. Extensive community testing
is helping find bugs, which are being eliminated fast and furiously.
The lx zone provides a way for Linux software to run on SmartOS at bare-metal speeds without cross-compiling code, and gives die-hard Linux users their cherished apt and yum package managers.
A key piece of Linux software targeted for the lx zone is the Docker daemon itself. For Manta this is most significant
as it will allow Docker images to form the building blocks of Big Data
computing on storage. When Joyent succeeds at this effort, it will tip the
balance. The actual cost of maintaining data and moving it between three
separate storage paradigms, POSIX, S3 and HDFS, will outweigh the cognitive cost
of switching to Manta.
Then the Big Data/Data Science stack may be ready for serious
disruption.
Disclaimer: I no longer work for Joyent, and have no competing financial interests.
Links:
- PostgreSQL 9.4 JSONB: slideshow; what's new
- Hadoop SequenceFile: docs
- Manta (Object Storage with Integrated Compute): overview and docs; open source; GitHub
- Docker Rising
- SmartOS & lx-branded zones (LXz): lx-branded zones on the SmartOS Wiki for beta testers; Bryan Cantrill on OS emulation in "The Dream Is Alive" (video); Why SmartOS in my Lab?; Linux-to-SmartOS cheat sheet