Textonic - An explanation

Overview

Textonic is an open-source software project designed to permit users to classify and code large sets of simple qualitative data quickly, easily, and at reasonable cost.

Why Textonic matters

Imagine that you have a survey with a thousand respondents and one of the questions was "When did you last drive a motor vehicle?" You have two options in how you might let your respondents answer. First, you can decide what data you want up front and require respondents to select among those options ("in the past day", "in the past week", "in the past month", "none of these"). Second, you may let the respondents fill in whatever answers they want ("yesterday", "Tuesday", "I don't drive").

Traditionally, with a survey population of 1000, especially when the survey has a number of questions, the first option is chosen. It is extremely simple to convert a pre-defined set of answers into a computer-readable format, and thus it is easy to do statistical analysis on the results of a survey formatted this way. There are a number of other advantages to surveys designed in this manner, but by far the most important one is ease of analysis. It simply doesn't take that much work to go from the raw surveys to analyzable data sets. And with more and more surveys being distributed digitally, it's getting easier all the time.

The disadvantages of doing things this way are numerous. You give up a lot of serendipity, as respondents cannot give answers which surprise you because you've decided what the possible answers are in advance. Your data is considerably less flexible, as you commit yourself to a specific set of analyses up front. In our "in the past day/week/month" example, consider the impossibility of figuring out who had driven in the past three days without an entire additional survey. Of course, historically, the analysis advantages of pre-defined surveys have always been such that doing a new survey (or making do with what data you have on hand) was more feasible than collecting data in another format.

It all boils down to the fact that a human can categorize most responses to a question with relative ease, but no one really wants to be involved in the drudgery of categorizing thousands of such responses, and even if they were willing to do so it turns out to be an exhausting and time-consuming process. Further, the bigger the dataset and the more complex the answers, the worse it gets. But time marches on, and technology changes the way we do things...

What Textonic actually does

Textonic seeks to solve this problem by splitting categorization tasks into small chunks and distributing them across dozens (or hundreds, or even thousands) of different people. By distributing the work, and doing so in a scalable manner, Textonic allows incredibly large data sets to be categorized very quickly in order to minimize turn-around time between data collection and analysis.
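
The splitting step can be pictured with a minimal sketch. This is not Textonic's actual code; the function name chunk_responses and the chunk size are assumptions made purely for illustration.

```python
from typing import Iterator, List, Sequence


def chunk_responses(responses: Sequence[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most chunk_size responses.

    Hypothetical helper: each batch would become one small task
    handed to a single worker.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(responses), chunk_size):
        yield list(responses[start:start + chunk_size])


# Free-text answers to "When did you last drive a motor vehicle?"
answers = ["yesterday", "Tuesday", "I don't drive", "last month", "this morning"]
batches = list(chunk_responses(answers, chunk_size=2))
```

Because every batch is independent, the batches can be worked on by different people at the same time, which is where the speed-up comes from.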

The most important piece of technology that Textonic currently uses is Amazon's Mechanical Turk service. The Mechanical Turk service allows you to submit simple tasks which other people will complete for the payment of a few cents.

The work-flow is relatively simple. You load your collected data into Textonic, identify the categories you want the data divided into for each question in your survey, and set your budget. Textonic takes care of the rest. It provides updates on the progress that has been made on your data set, and it will notify you when it needs you to look at something directly in order to make a decision about how to proceed.
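
To make the budget step concrete, here is a rough sketch of the arithmetic involved. The function name plan_tasks, the per-task reward, and the batch size are all hypothetical, and the sketch ignores any service fees charged on top of the reward.

```python
import math


def plan_tasks(num_responses, responses_per_task, reward_cents, budget_cents):
    """Work out how many tasks a budget can pay for (illustrative only).

    Amounts are whole cents so the arithmetic stays exact.
    """
    if responses_per_task < 1 or reward_cents < 1:
        raise ValueError("responses_per_task and reward_cents must be positive")
    tasks_needed = math.ceil(num_responses / responses_per_task)
    tasks_affordable = budget_cents // reward_cents
    tasks_to_submit = min(tasks_needed, tasks_affordable)
    return {
        "tasks_to_submit": tasks_to_submit,
        "responses_covered": min(tasks_to_submit * responses_per_task, num_responses),
        "cost_cents": tasks_to_submit * reward_cents,
        "fully_funded": tasks_to_submit == tasks_needed,
    }


# 1000 answers, 10 per task, 5 cents per task, and a 3 dollar budget.
plan = plan_tasks(num_responses=1000, responses_per_task=10,
                  reward_cents=5, budget_cents=300)
```

With these made-up numbers the budget pays for 60 of the 100 tasks needed, which is the kind of situation where a tool would have to ask you how to proceed.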

How Textonic works (the technical details)

Textonic is still in the early stages of development, so it's definitely rough around the edges. It is, in many senses, in alpha, but it's a functional alpha. So in this section I'll be differentiating between functionality that is already built and functionality that is still planned.

Textonic's user-facing bits are written in Python using the Django framework to create a simple web-based user interface. This allows Textonic to be easily deployed on a local or a remote machine. It draws your qualitative data set from a MySQL database. Currently Textonic itself doesn't handle getting data into this database; it simply assumes your data collection system puts it there.

The front end allows you to define a template for the types of categories you want your data to be classified into. You can then save these templates for reuse in later tasks. Once a classification schema has been defined you pick a set of data to be classified, and a schema, and you click a button, and presto, Textonic works its magic.
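
As a rough picture of what a reusable template might hold, consider the sketch below. It is an assumption-laden illustration, not Textonic's real data model: the class name ClassificationTemplate, its fields, and the JSON round-trip are all invented here, and the real project would presumably store templates through Django instead.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ClassificationTemplate:
    """Hypothetical container for one question's category scheme."""
    name: str
    question: str
    categories: List[str]

    def to_json(self) -> str:
        # Serialize so the template can be saved and reused later.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "ClassificationTemplate":
        return cls(**json.loads(text))


driving = ClassificationTemplate(
    name="last-drove",
    question="When did you last drive a motor vehicle?",
    categories=["past day", "past week", "past month", "does not drive", "unclear"],
)

# Saving and reloading should give back an equal template.
restored = ClassificationTemplate.from_json(driving.to_json())
```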

Textonic's magic is another set of Python code, built upon the Boto library (a Python library for working with Amazon Web Services). Using Boto, the tasks are submitted based on a user-defined budget. Task completion status is checked on a regular basis, and users are notified when their tasks are done. At that point they are directed to a download link for a CSV file (generated automatically by AWS), which can then be used in whatever way the user needs.
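
The "checked on a regular basis" part could look something like the loop below. This is only a sketch: wait_until_done and its parameters are hypothetical, and the status check is passed in as a plain callable so that no Boto call has to be guessed at. In a real deployment that callable would wrap whatever Boto request reports outstanding tasks.

```python
import time
from typing import Callable


def wait_until_done(get_pending_count: Callable[[], int],
                    poll_seconds: float = 60.0,
                    max_polls: int = 1000,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until no tasks are pending; return False if we give up first.

    get_pending_count is an assumed hook returning how many submitted
    tasks have not been completed yet.
    """
    for _ in range(max_polls):
        if get_pending_count() == 0:
            return True
        sleep(poll_seconds)
    return False


# Stand-in status source: 3 tasks pending, then 1, then none.
fake_counts = iter([3, 1, 0])
finished = wait_until_done(lambda: next(fake_counts),
                           poll_seconds=0,
                           sleep=lambda seconds: None)
```

Injecting the status check and the sleep function keeps the loop easy to test without touching AWS.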

Where is Textonic going?

Future endeavors are mainly built around higher rates of automation. Ideally, Textonic should allow you to upload CSVs which it can then put inside its database automatically. Similarly, Textonic needs to retrieve completed classification data automatically and store it in its database so that you can create more flexible and customized CSVs from the data.
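
A CSV upload feature of this kind might start out along the following lines. Nothing here exists in Textonic yet; the table layout and column names are hypothetical, and the standard library's sqlite3 module stands in for MySQL so that the sketch runs on its own.

```python
import csv
import io
import sqlite3


def import_responses(csv_text: str, conn: sqlite3.Connection) -> int:
    """Load survey answers from CSV text into a responses table.

    Assumes (hypothetically) a header row with the columns
    respondent_id, question, and answer. Returns the number of rows stored.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS responses ("
        "respondent_id TEXT, question TEXT, answer TEXT)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(r["respondent_id"], r["question"], r["answer"]) for r in reader]
    # Parameterized inserts keep free-text answers from breaking the SQL.
    conn.executemany("INSERT INTO responses VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


sample = (
    "respondent_id,question,answer\n"
    "r1,last_drove,yesterday\n"
    "r2,last_drove,I don't drive\n"
)
conn = sqlite3.connect(":memory:")
imported = import_responses(sample, conn)
```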

Basically, Textonic works, but it's not really easy enough to be generally useful yet. Unfortunately, while the project is still a compelling one, neither I nor the other people who originally put it together have time to contribute these days. Fortunately, the project is open-source and all our code is available on GitHub.