First Tinkerings with a Raspberry Pi

I’ve settled on DietPi (the Jessie version) as an operating system, mostly because I wanted the easy-peasy WiFi possibilities it offered.
I’m on a Mac and needed to get the DietPi image onto an SD card. So I downloaded the 7z file, noticed that I had nothing to unpack 7zip files with, got myself the command-line version of p7zip via sudo port install p7zip (MacPorts, mostly because brew refused to follow through), and unpacked the file with 7za e DietPi_RPi-Jessie.7z. So far, so unexciting.
Then I looked for the SD card plugged into my Mac: diskutil list.

My SD Card is disk4 – and it needs to have FAT32 partitioning.
Then, just unmount the disk and copy the image file to the SD Card:
diskutil unmountDisk /dev/disk4
sudo dd bs=1m if=/Users/flffy/Downloads/DietPi_v94_RPi-Jessie.img of=/dev/disk4

Then I edited dietpi.txt on the SD card to enable WiFi, and also added the credentials.
I plugged the SD card into the Pi, turned it on and … it worked. Including the WiFi. Including ssh (to find the IP, there’s nmap). Amazing. 🙂

Data Science Retreat Berlin – A Resume

Data Science Retreat is a three-month, full-time data science course for people who have a background in programming (R or Python) and experience in data analysis. The course is taught in Berlin by a number of specialist mentors, and focuses on predictive modelling and big data applications.

In November, I stumbled upon the webpage of the Data Science Retreat here in Berlin, completely by chance. I liked what I saw, and applied for the programme, expecting to never hear anything from them. Surprisingly, I actually heard back soon after, and a couple of interview calls later, I was accepted.

Also in November, I was working in one of Berlin’s big digital agencies. I had started as an assistant to the CEO, and now worked in the company’s newly founded online marketing unit as (mostly) an analyst.

Why did I fill in that application questionnaire? Because I wanted more programming and analytics in my working life. I had spent a significant part of my PhD years in front of an R console crunching data – and loved it. Unlike in 2009, people now actually seemed to care about this language, and there might be an enjoyable career around the corner. The curriculum of the Data Science Retreat looked interesting, and like a good fit for what I needed.

All I was expecting from this course was an opportunity to find out whether I am good enough at what is now called ‘data science’ to do it as a job. I could probably have done this with online courses too, but meeting people (a.k.a. building a local network) was very important to me. My previous experience of learning programming and statistics in a non-programming/little-statistics research environment was somewhat painful. Without the proper network you’re screwed (or just really slow in producing something useful).

Was that expectation met? Absolutely. The course was refreshingly non-academic: lectures came entangled with programming exercises, and a significant part of the time was spent building a project/product chosen by each participant. As I wasn’t looking for a certificate (I have more than enough certificates), but rather for a hands-on course with mentors to point out why your function isn’t running properly, this was perfect for me.

Also, I met lots of smart and helpful people: my co-students came from everywhere (US to UK, Germany to China, and anything in between), and were nice and knowledgeable men (yes, only men!) with quite some experience in programming and/or impressive degrees in mathematics or similar. The mentors were super-accessible, and experts in their fields. In terms of personal network, DSR was an absolute win.

And the result? I went back to my former employer in a new role, before I got an offer for a position that was so interesting I couldn’t refuse. I’ll start in a month’s time, and am very much looking forward to it. Data Science Retreat Berlin FTW! 🙂

10 thoughts on leaving academia

Prelude: I left academic research 3 years ago – and went from modelling gene networks to working in an internet agency. These years often felt like “That’s not what I signed up for!!!”, and I needed about half a year to find out what the hell I was even supposed to do there – and now I love it.
Recently, the subject of leaving academia keeps coming up in my conversations with random science people. Here’s a very abridged version of my view.

1. Universities produce too many life science PhDs.

2. Universities employ very few professors and permanent research staff.

3. The market is flooded with PhDs looking for permanent academic positions. The pay is low, and the contracts are short, and at the age of 40, your chances of being jobless or unhappy in a research tech job are quite high.

4. Getting out of academia if you’re older than 35 is not easy.
You will have a lot of experience, but that’s probably a lot of the wrong experience. I was 28 when I quit my postdoc. At that point, I had spent 10 years in science (I started working in labs when I started uni) – and had some “minor” (irony alert!) issues with the transition. Which brings us to:

5. Academia is very different from industry.
There are processes, meetings and deadlines every day. It’s productive. Also you don’t just think about a problem really long and hard. You talk about it in a series of scheduled workshops and then find a good solution for everybody. It feels slow, but in the end it’s faster than the academic option of just doing what you think is best.

6. Industry is (in my experience) MORE free than academia (while not being a professor).
In industry, you take the job if you like it. If you get bored, or don’t like it anymore, you just find a new one. Good luck trying that in academia.

7. The pay is a whole lot better (if you have the right skills).
Obviously you need to make sure you acquire the right skills. If you really feel like saving the world, how about saving the world with the money you make and the influence and knowledge you have because you worked your way up?

8. PhD degrees are very useful, if you are in an environment where few people have them.
Somehow people actually think we are smart.

9. You really love research? Which part of research exactly?
What’s the part that is most fun? Solving problems? Uncovering knowledge? Doing things that nobody else is able to do? For me it was the last one … and now I’m doing more of this than I ever did in my lab years.

10. I hate these “10 reasons to …” posts.
(I always hated them. Leaving academia doesn’t change your personality that much.)

How to run Spark on EC2


I have recently been trying to run Spark on Amazon’s EC2. Here’s how it worked – and how it didn’t.

I have Spark 1.2.1 installed locally on my machine. Spark already comes with a script for running it on EC2, which does most of the work for you. You just need

  • well, you need Spark. Get it from the Spark project site. The documentation there on how to set up Spark on EC2 is quite good.
  • a working AWS account. This is easy, but takes a few minutes.
  • a user-specific access key + key id. This needs to be generated in the (not ultra intuitive) AWS web interface.
  • a key pair. This is required for the ssh login. It’s most easily created in the Amazon web interface. You will only see that RSA private key once, so save it in a text file “mykey.pem” and put it somewhere useful, e.g. ~/.ssh/

Then, open a shell, and

1. set variables:

 export AWS_SECRET_ACCESS_KEY=l+m2qlU+DlERPQvnXRZEXAMPLE 

 export AWS_ACCESS_KEY_ID=AKIAJJ222FEXAMPLE 

 

2.  change permissions on “mykey.pem”, such that your user can access the file:

 chmod 400 ~/.ssh/mykey.pem 

 

3. cd into the right folder (your-spark-version/ec2), and launch a cluster instance:

 ./spark-ec2 --spark-version=1.2.1 --key-pair=mykey --identity-file=mykey.pem launch test-cluster

Now wait. This is probably going to stall for a while – telling you it’s just getting all instances ssh-ready.

When it’s done, you should see something like this:

[Screenshot: terminal output of the finished cluster launch]

If you don’t, then try to resume the job:

 ./spark-ec2 --spark-version=1.2.1 --key-pair=mykey --identity-file=mykey.pem launch test-cluster --resume

After that, I saw what I wanted to see. The failed tests are actually non-critical errors: these processes are just not running when the script tries to shut them down.

4. Now, ssh into your cluster:

 ./spark-ec2 -k mykey -i /Users/daniela.drechsel/mykey.pem -w 400 login test-cluster 

and see this:

[Screenshot: terminal after logging in to the cluster]

You’re in! At some point, you should probably also log out (quit), and shut down your cluster. That’s done with:

 ./spark-ec2 destroy test-cluster  

DSR Day 13 – You need a Corgi!

Did you know that Corgis are from Wales, and that the Welsh apparently say that in the old days, fairies used them to ride on? And that the white spot on the back of many Corgis is hence called a “fairy saddle”?
You didn’t? Well now you know, and so do I – thanks to today’s presentation training. We’ve covered Corgis, how Shazam works (super interesting!), sleep (and the lack thereof), random graphs, and how to master a new skill.

What have we learned from that?

If you can choose a topic, pick something everybody can relate to.
Everybody likes Corgis! And everybody suffers from a lack of sleep. Also, if you can find something remotely scientific (when presenting to a lay audience) – people will love that. Science papers just look impressive. Careful with scientists though: I personally felt the urge to just look for a paper that claims the opposite of the one presented. And knew there would be one.

Make it personal.
Somebody told his personal story today, the story of how he wanted to learn touch typing. He also told how he failed, and what made him finally succeed. A good personal drama just works.

Be concise.
Goethe (the German Shakespeare) once wrote in a letter: “Please forgive me for sending such a long letter, I simply had no time to write a short one.” Running overtime is annoying because it means your talk was not well planned, and you don’t value your audience’s time.

Talk to your audience.
Don’t use the “whiteboard of death” (teacher’s words) because it’s difficult to talk and write at the same time, and you will inevitably talk to the whiteboard. Personally I don’t mind whiteboards. You just shut up while you’re writing, then turn around and get the attention back on you. No big problem at all.
Also remove obstacles between you and the audience. A stand doesn’t help you.

– Apart from that…the usual: Visually appealing, non-cluttered slides. Know how to pronounce specific terms (Poisson ≠ Poison!), don’t mumble or stutter if you’re thinking – just make a strategic pause. Be funny.

– Also (my opinion), a bit of jargon (a.k.a. buzzwords) does some good. If you only have ten minutes, you don’t want to spend eight of them explaining what you’re talking about. Just use the damn buzzword, even if it’s not 100% right. Chances are that your audience will not know the difference anyway – but they immediately understand your talk.

Btw: Shazam recognises songs by Fourier-transforming the song (using time slots) – i.e. you end up with a function of frequency at a given time slot. This info is then converted into a (not so) gigantic hash table that can be easily compared to the existing database.
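For the curious, here is a toy Python sketch of that idea. It is heavily simplified and certainly not Shazam’s actual algorithm: I assume a naive DFT per time slot, keep only the single loudest frequency bin of each slot, and use pairs of neighbouring peaks as hash keys. All function names (dft_magnitudes, peak_bins, fingerprint, tone) are made up for this illustration.

```python
import cmath
import math


def dft_magnitudes(window):
    """Magnitude of each frequency bin (below Nyquist) for one time slot."""
    n = len(window)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(window)))
        for k in range(n // 2)
    ]


def peak_bins(signal, window_size):
    """Cut the signal into time slots and keep the loudest bin of each."""
    peaks = []
    for start in range(0, len(signal) - window_size + 1, window_size):
        mags = dft_magnitudes(signal[start:start + window_size])
        peaks.append(max(range(len(mags)), key=mags.__getitem__))
    return peaks


def fingerprint(peaks):
    """Use pairs of neighbouring peaks as keys; store the slot they occur in."""
    table = {}
    for slot, pair in enumerate(zip(peaks, peaks[1:])):
        table.setdefault(pair, []).append(slot)
    return table


def tone(bin_index, window_size):
    """A pure sine wave that falls exactly into one frequency bin."""
    return [math.sin(2 * math.pi * bin_index * t / window_size)
            for t in range(window_size)]


WINDOW = 32  # samples per time slot (arbitrary choice for this toy example)

# A fake "song" of three notes, and a "recording" of its last two notes.
song = tone(3, WINDOW) + tone(7, WINDOW) + tone(5, WINDOW)
snippet = tone(7, WINDOW) + tone(5, WINDOW)

song_peaks = peak_bins(song, WINDOW)
song_table = fingerprint(song_peaks)
snippet_table = fingerprint(peak_bins(snippet, WINDOW))

# Keys present in both tables are candidate matches.
shared = set(snippet_table) & set(song_table)
```

The point of the hash table is that matching a snippet becomes a set of dictionary lookups instead of a comparison of raw audio.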

Coming Up: International Open Data Hackathon, Feb 21st 2015

This year’s open data hackathon is just around the corner. The event is about the possibilities of open data, but also aims to connect people working in the field across the world, as well as to attract interested folks who would like to have a first taste of what open data can actually do.

There’s a wiki listing events around the world. Alternatively, you could just check the German website. Or just head directly to Berlin’s hackathon, kicking off at 10 am at Correct!v.

DSR, Day 4 – From WTF to OMG

R, R all over again. It’s fun: I feel like I’m back on track now. Things are still a little slow for me. I have done very little coding for the past two years, and it’s noticeable. I moved from “WTF!” (I can’t do anything anymore) to “OMG” (this is awesome!).

We’re still covering variable types, and fairly simple operations on them. Today was data frames (love them, extremely versatile), matrices (always hated them, but now made my peace with them); also working with attributes and factors.
I liked factors because they used to make things run quicker for categorical data, but apparently that feature is gone. Yet, they still make life simpler through easy renaming and ordering (good for plotting graphs!).

Apart from that … cut(): it bins data into different buckets according to breaks. That does make categorising much easier than using the plain old data$category[which(data$x > 5)] <- "category5".
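Since we are learning Python in parallel, here is a rough Python counterpart to the binning idea, as a minimal sketch. I assume right-closed buckets, which is R’s default for cut(); the function name, its labels argument and the example values are my own and not part of any library. Values outside the breaks get None, roughly where R would give NA.

```python
from bisect import bisect_left


def cut(values, breaks, labels):
    """Assign each value to a right-closed bucket (breaks[i], breaks[i+1]].

    breaks must be sorted in ascending order.
    """
    if len(labels) != len(breaks) - 1:
        raise ValueError("need exactly one label per interval")
    result = []
    for value in values:
        position = bisect_left(breaks, value)
        inside = 0 < position < len(breaks)
        result.append(labels[position - 1] if inside else None)
    return result


x = [1, 5, 6, 12, 20]
categories = cut(x, breaks=[0, 5, 10, 20], labels=["low", "mid", "high"])
outside = cut([0, 25], breaks=[0, 5, 10, 20], labels=["low", "mid", "high"])
```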

Dolores burritos for lunch – very packed, both the place and the burritos, yet both absolutely great.

Data Science Retreat, Day 3 – Queen of Typos

– Learning two different programming languages at the same time is strange – like learning two foreign languages at the same time. I’ve been using R for most of my PhD, and feel quite okay with it, and with its philosophy. I’m learning lots of new things – the kind of stuff one only learns when somebody with wide knowledge is explaining it PROPERLY. But still, the R thinking has made me lazy. Learning Python makes me notice. I have absolutely no idea why Python needs a while or for loop so frequently (well, I have, but why is there no such thing as lapply in Python?).

– I am the queen of typos: writing more or less complex functions is no massive problem, but there will inevitably be a stray bracket somewhere.

– Project ideas! I need one. One of my “colleagues” came up with an idea for media recommendation that involved scraping data from a company (“I just change my IP as soon as they block me.”). Well, if they’re blocking your IP when you scrape data, it means they don’t WANT you to steal their data anonymously. At this point one might at the very least ask them whether they would give you access to the dataset you need?

– I am thinking about a project revolving around either social media and prediction of x, or something around finding flats for sale/sold flats and trying to predict the price development. Please not a recommendation algorithm. The world doesn’t need another one.

– I haven’t seen Sascha Lobo today.
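On the lapply question from the first point: as far as I can tell, list comprehensions and the built-in map() come fairly close. A minimal sketch, where the lapply wrapper is just a name I made up to mirror R, not something Python ships with:

```python
def lapply(items, func):
    """Apply func to every element and return the results as a list."""
    return [func(item) for item in items]


words = ["data", "science", "retreat"]

# R: lapply(words, nchar)
lengths = lapply(words, len)

# The built-in map() does the same, but returns a lazy iterator in Python 3,
# hence the list() around it.
squares = list(map(lambda n: n * n, [1, 2, 3]))
```

One difference to R remains: lapply always returns a list, while map() has to be wrapped in list() to get one.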

Data Science Retreat, Day 2 – Intimately R

– Today was R day! 🙂

I’ve been using R since 2008, and I’ve learnt a lot today. My favourite was that

"<-"(x,1)

works the same way as

x<-1

That seems minor, but it is very neat, especially when writing more complex calls. I knew that basically everything relies on functions in R, but somehow I had never thought about what this means for “<-”, “[” etc.

– Also, R is the language for lazy programmers: there are many built-in functions that work like loops. Mostly there’s no need to rely heavily on if statements and for/while loops. That explains why I usually feel like cheating when saying I’d be quite ok in R “programming”: the language does a lot for you. So, for me, using R was always more about stitching together existing functions and finding the right packages than about fancy programming.

– I still feel a little weird because of that, but that’s going to pass, probably. We will see.

– Oh, and I saw Sascha Lobo this morning on my way to Zalando.

Data Science Retreat, Day 1 – The Nerd Shock

I woke up really early (5.30 am!), got myself a coffee, and decided to quickly update my computer to MacOS Yosemite. Spoiler: This was a spectacularly stupid decision. But more on that later.

On the way to the tube I ran across a Sascha Lobo.

I got to Zalando’s offices fairly early, entered, was welcomed by the organiser in the entrance hall and ushered to the seminar room on the 9th floor. There: billions of cables. I think they had every cable ever produced, to connect monitors to the laptops people brought. The place looked like the mutant child of an unholy wedding of an IT storage room and a student union office.

The view from the 9th floor of Zalando’s Mollstrasse offices is fabulous – even when it’s very overcast. You can see as far as Prenzlauer Berg.

The people were … frankly, mostly a bunch of nerds. Who on earth would ask super detailed questions that will most probably be cleared up during the intro (because everybody will have the same questions)? Who would start the small talk at lunch with the question “So what do you prefer, MongoDB or MySQL?”.

Women (among trainees): 1 out of 10 (incl. me). Women (among teachers/mentors): 0 (up to now.)

There was no internet. How can there be no internet? At some point there was really slow WiFi. And there was cable LAN – but I don’t have an Ethernet port on my computer. I also don’t usually carry a Thunderbolt-to-Ethernet adapter.

My computer lagged massively, to the point that it was completely unusable. (Remember? Yosemite update!) Later it turned out to be caused by Ghostery (my tracking-blocker browser plugin) and the horrible WiFi. As soon as I got rid of both, everything was running smoothly again.

The teaching was pretty good. We covered the basics of Python (I am a Python idiot novice), and I found that afternoon really useful.

It’s noticeable that the programme is still fairly new. There’s the odd non-organisation here and there. But hey. Next time, they’ll probably send around the software requirements in advance. Then the 80 min it took for everybody to install Python, Anaconda, and IPython Notebook might be filled with more useful teaching.

They’re mostly nerds, but they’re quite charming. And smart. And did I mention charming? Yet, I usually work for an internet agency that sells brand consulting, shiny web applications and marketing. Our style is very different.