Using node_modules in Manta

How to use Node.js with node_modules in Manta

Continuing in the series of “common real world Manta questions”, another one we hear a lot, at least from newcomers, is how to run node with node_modules. If you’re not familiar with node, it’s basically the same problem as using ruby gems, perl modules, python eggs, etc.: you have a particular VM, and a set of add-ons built to go with it, and you want all of that available when you run your program.

This post walks you through writing a node program that uses several add-on modules (including native ones) to accomplish a real world task: an ad-hoc comparison of how well Google’s Compact Language Detector matches the “language” attribute that exists on tweets. Because tweets are obviously very small, I was genuinely curious how well this would work, so I built a small node script that uses extra npm modules to test it out.

Concepts

If you’re not familiar yet with Manta, it is an object store with a twist: you can run compute in-situ on objects stored there. While the compute environment comes preloaded with a ton of standard utilities and libraries, sometimes you need custom code that isn’t available, or that you’ve customized in some way. To accomplish this you leverage two Manta concepts: assets and init; the two are often used together, as I’ll show you here.

The gist is that you bundle up your necessary code as an asset and upload it as an object to Manta. When you submit a compute job, you specify the path to that object as an asset, and Manta automatically makes it available in your compute environment on the filesystem, under /assets/$MANTA_USER/.... While you could just unpack it as part of your exec line, that is fairly heavyweight, as exec gets run on every input object (recall that, when it can, Manta optimizes by not evicting you from the virtual machine between objects). init lets you run that unpacking once for the full slice of time you get in the compute container.
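To make that concrete before we dig in, the general shape of a job that uses an asset looks something like this (a minimal sketch; the paths and script names are purely illustrative, and we’ll build the real versions below):

$ mput -f my_code.tar /$MANTA_USER/stor/my_code.tar
$ mfind -t o /$MANTA_USER/stor/my_data | \
    mjob create -o -s /$MANTA_USER/stor/my_code.tar \
                --init 'tar -xf /assets/$MANTA_USER/stor/my_code.tar' \
                -m 'node my_script.js'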

Step 0: acquire some data

Since most of the twitter datasets are for-pay, I needed to build up a sample dataset myself. I wrote a small node2manta daemon that just buffers tweets into 1GB files locally and then pushes those up into Manta under a date-based naming scheme. Beyond pointing you at the source (below; the only npm module in play besides manta is twit), I won’t go into any detail, as it’s pretty straightforward. The end result is 1GB of tweets per object, named /$MANTA_USER/stor/twitter/$DATE.json. Note you need a Twitter developer account and application to fill in the ... credentials in this snippet.

Pump tweet stream into Manta
var assert = require('assert');
var fs = require('fs');
var manta = require('manta');
var Twit = require('twit');
var util = require('util');

// Manta client; requests are signed with the local SSH key using the MANTA_* environment variables
var M = manta.createClient({
  sign: manta.privateKeySigner({
    key: fs.readFileSync(process.env.HOME + '/.ssh/id_rsa', 'utf8'),
    keyId: process.env.MANTA_KEY_ID,
    user: process.env.MANTA_USER
  }),
  user: process.env.MANTA_USER,
  url: process.env.MANTA_URL
});
// Twitter streaming API client; fill in your application's credentials
var T = new Twit({
  consumer_key: '...',
  consumer_secret: '...',
  access_token: '...',
  access_token_secret: '...'
});

// Buffer tweets into a local file; once it reaches ~1GB, upload it to Manta and start over
var bytes = 0;
var f = '/var/tmp/tweets.json';
var live = false;

var fstream = fs.createWriteStream(f, {encoding: 'utf8'});
fstream.once('open', function () {
  var k = util.format('/%s/stor/twitter', process.env.MANTA_USER);
  M.mkdir(k, function (err) {
    assert.ifError(err);
    live = true;
  });
});

var tstream = T.stream('statuses/sample');
tstream.on('tweet', function (tweet) {
  if (live) {
    var s = JSON.stringify(tweet) + '\n';
    bytes += Buffer.byteLength(s);
    fstream.write(s);
    if (bytes >= 1000000000) {
      live = false;
      fstream.end();
      var k = util.format('/%s/stor/twitter/%s.json',
                          process.env.MANTA_USER,
                          new Date().toISOString());
      var opts = {
        type: 'application/json'
      };
      fstream = fs.createReadStream(f);
      M.put(k, fstream, opts, function (err) {
        assert.ifError(err);
        fstream = fs.createWriteStream(f, {encoding: 'utf8'});
        fstream.once('open', function () {
          live = true;
          bytes = 0;
        });
      });
    }
  }
});

After running that script for a while, I had this:

$ mls /$MANTA_USER/stor/twitter
2013-07-23T00:11:23.772Z.json
2013-07-23T01:47:49.732Z.json
2013-07-23T03:18:16.774Z.json
2013-07-23T04:49:40.730Z.json
2013-07-23T06:41:58.752Z.json
2013-07-23T09:03:19.772Z.json
2013-07-23T11:20:00.741Z.json
2013-07-23T13:07:43.800Z.json
2013-07-23T14:37:33.797Z.json
2013-07-23T16:03:44.764Z.json
2013-07-23T17:36:13.063Z.json

Step 1: mlogin, and write your code

Ok, so we’ve got some data; now it’s time to write our map script. In this case I’m going to develop the entire workflow out of Manta using mlogin. If you’ve not seen mlogin before, it’s basically the REPL of Manta: it lets you log in to a temporary compute container with one of your objects mounted. This is critical for us in building an asset with node_modules, as we need an OS environment (compilers, shared libraries, etc.) that matches what our code will run on. So I just fired up mlogin and set up my project with npm (the export HOME bit is only to make gyp happy). Then I hacked out a script in the Manta VM by prototyping with this:

$ mlogin /mark.cavage/stor/twitter/2013-07-23T00:11:23.772Z.json
mark.cavage@manta # export HOME=/root
mark.cavage@manta # cd $HOME
mark.cavage@manta # npm install cld
mark.cavage@manta # emacs lang.js
mark.cavage@manta # head -1 $MANTA_INPUT_FILE | node lang.js

And the script I ended up with was:

Parsing a stream of newline separated tweets
var cld = require('cld');
var readline = require('readline');
var sprintf = require('util').format;

var FMT = '%d %s %s';
var rl = readline.createInterface({
  input: process.stdin,
  output: false
});

// For each tweet, compare cld's detected language against Twitter's own lang field
rl.on('line', function (l) {
  try {
    var obj = JSON.parse(l);
    var lang = cld.detect(obj.text);

    console.log(sprintf(FMT, (lang.code === obj.lang ? 1 : 0), lang.code, obj.lang));
  } catch (e) {}
});

So every tweet gets mapped to a three-column output of $match $cld $twitter, which we can reduce on. Anyway, now that we’ve got this coded up, let’s tar it up and save it into Manta (again, from the mlogin session):

mark.cavage@manta # tar -cf tweet_lang_detect.tar lang.js node_modules
mark.cavage@manta # mput -f tweet_lang_detect.tar /$MANTA_USER/stor
...avage/stor/tweet_lang_detect.tar ==========================>] 100%  14.00MB
mark.cavage@manta #

To be pedantic while we’re here, we’ll go ahead and write the reduce step as well, even though it’s trivial. I’m just going to output two numbers: the number of matches, and the total dataset size. Note the reduce line below uses maggr, which is just a simple “math utility” Manta provides for common summing/averaging operations. Other users report success using crush-tools. Use what you like; that’s the power of Manta :)

mark.cavage@manta # head -10 $MANTA_INPUT_FILE | node lang.js | maggr -c1='sum,count'
6,10
mark.cavage@manta #

So given 10 inputs, we’ve got a 60% success rate with cld. Let’s see how it does on a larger sample set.

You can now exit the mlogin session; we’re ready to rock.

Step 2: Run a Map/Reduce Job

Ok, so to recap: we hacked up a map/reduce script with an asset using mlogin, and now we want to run a job on our dataset. Twitter throttles your ability to suck down their feed pretty aggressively, so by the time I wrote this post I only had 11GB of data. That said, these are just text files, so that should still be a fairly large number of tweets. Let’s see how it does:

$ mfind -t o /$MANTA_USER/stor/twitter | \
    mjob create -o -s /$MANTA_USER/stor/tweet_lang_detect.tar \
                --init 'tar -xf /assets/$MANTA_USER/stor/tweet_lang_detect.tar' \
                -m 'node lang.js' \
                -r 'maggr -c1="sum,count"'
added 11 inputs to f7af6bcf-2126-4b1d-b9d5-c0f25a162786
2610121,3860111

How did I figure out what the -s and --init options should be, you ask? Simple: I ran mlogin again with -s specified, and tested out what my untar line should be.
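Concretely, that check looked something like the following (a sketch of the session rather than a verbatim transcript; mlogin takes the same -s asset option the job does):

$ mlogin -s /$MANTA_USER/stor/tweet_lang_detect.tar \
    /$MANTA_USER/stor/twitter/2013-07-23T00:11:23.772Z.json
mark.cavage@manta # tar -xf /assets/$MANTA_USER/stor/tweet_lang_detect.tar
mark.cavage@manta # head -1 $MANTA_INPUT_FILE | node lang.js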

Side point, if you’re interested: this initial prototype took 1m19s to run. An enterprising individual could likely cut that (at least) in half by “pre-reducing” as part of the map phase; in my case, ~1m of latency was fine, because I’m lazy. Also, the entire time it took me to prototype this, from no code to actually having my answer, was about 20 minutes (not counting the time it took to ingest data; I just ran that overnight).
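If you wanted to try that, one approach (a sketch I didn’t run, using plain awk instead of maggr) is to have each map task collapse its entire input object down to a single “matches total” line, so the reducer only has to sum one line per object:

# map phase: emit one "matches total" line per input object
node lang.js | awk '{m += $1; n += 1} END {print m, n}'

# reduce phase: sum the per-object pairs into the final totals
awk '{m += $1; n += $2} END {print m, n}'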

Step 3: There is no step 3

We’re actually done now. Clearly you could go figure out more interesting statistics here, but really I just wanted a quick read on how cld does against a reasonable dataset (~3.8M tweets); it turned out surprisingly close to my original 10-tweet prototype’s 60%, coming in at ~67% agreement with Twitter’s own labels.

Also, while this example used node to illustrate a real world “custom code” problem, the same technique applies to python, ruby, etc.: you build up your “tarball” in the same way and push it back into Manta for future jobs to consume.
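For example, the asset-building step for a Python script might look something like this from an mlogin session (purely a sketch: it assumes pip is available in the compute image, and my_script.py, deps/ and some_module are hypothetical names):

mark.cavage@manta # pip install --target deps some_module
mark.cavage@manta # tar -cf my_job.tar my_script.py deps
mark.cavage@manta # mput -f my_job.tar /$MANTA_USER/stor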

Hopefully that helps!
