Category Archives: devops

One year with Stackdriver

A year ago this week, I started at Stackdriver.  I was the first engineer at the company, and there wasn’t anything when I started. We had some ideas, solid funding and a small core group to start figuring out our product and building it.

Working out of our first “office” in a classroom at Northeastern

Going in, I thought I had some idea of what I was doing and what I was in for. I mean, I’d been at later stage startups and heard the stories. I’d read the Eric Ries Lean Startup book. I had read all of the “blogs you’re supposed to read” and had them in my RSS reader so that I could soak up collective wisdom regularly.

In some ways, I was right… in others, I was wrong. Here’s a non-exhaustive list of things that are sticking with me after a year. That said, if you’re reading this because you’re thinking about or planning on joining a startup at an early stage, there’s a good chance you’ll have different ones that are important for you 🙂

  • Hiring is the single most important thing you can do. Good hires amplify everyone around them and make the team smarter and more effective. It’s pithy and everyone says it. But they say it because it’s true.
  • Hiring takes way more time than you think it will. Your network (and your network’s network) is the best way to source candidates. You’ll post your positions on various job boards and you’ll very very very occasionally get lucky with it. Recruiters will come from everywhere promising great results but will take up even more time since their candidate flow is way higher. Another challenge with recruiters is that when you are before the launch of your product, it can be difficult to convey what you’re doing as well as exactly the type of person you are looking for to join the team.
  • Fancy new technology is sexy and fun to work with. It also has a lot of problems you don’t have experience with. If you can minimize the number of these you have to deal with, you’ll be able to spend more time focusing on building the right product for the right problem
  • It’s hard to spend too much time talking with your target customer. If you think that what you have (be it a mockup or a prototype or something else) is “ready” to show to them, you’ve probably waited too long and could have gotten feedback already and learned something
  • Fail fast. Some things you try won’t work. Don’t continue to fixate on them and sink more time on them if they’re not coming together.
  • If you’re building something that’s a B2B product, think of your beta as a chance to test selling in addition to the product. Sure, you’re not going to have them pay today but you do want to know that you’re building something that people will pay for when you’re ready to switch to that.
  • Think iteratively. Once you start to home in on a degree of product market fit, you’re going to discover that some of the things you built for prototyping/testing purposes don’t work. Don’t be afraid to replace them. But do so in a way that lets you regularly checkpoint the replacement to test that you are making things better. Grand rewrites are rarely as grand as you think and always take longer
  • If you’re going to spend a lot of time on something product-side, focus more on getting the interfaces right than the implementation. So, for example, if you decide that something is going to have clear boundaries passing messages over a queue, then you can switch between RabbitMQ, SQS and others with minimal effort as you learn the constraints that actually matter for your implementation.
  • Try to find the one piece of your product that immediately pops in a demo for most of your target users. This is one of those things where you’ll know what it is when you see it. And then use it as a hook to start drawing people in.
  • Interactions with the customer don’t end when they sign up for your product. Continue to nurture them and do regular feedback calls with them as you iterate on the product. This will help to make them into advocates for your service, and they’re already bought into your vision
  • Bugs happen. Fixing them and providing awesome customer service is a great way to foster great customer relationships
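
To make the point about interfaces concrete, here is a minimal sketch of a queue boundary. The names (MessageQueue, InMemoryQueue, record_signup) are made up for illustration and are not from anything we actually built; a RabbitMQ or SQS backend would be another class with the same two methods.

```python
from collections import deque
from typing import Optional, Protocol


class MessageQueue(Protocol):
    """The boundary the rest of the product codes against (hypothetical name)."""

    def publish(self, body: str) -> None: ...

    def receive(self) -> Optional[str]: ...


class InMemoryQueue:
    """Stand-in backend for prototypes and tests.

    A RabbitMQ- or SQS-backed class would expose the same two methods.
    """

    def __init__(self) -> None:
        self._messages = deque()

    def publish(self, body: str) -> None:
        self._messages.append(body)

    def receive(self) -> Optional[str]:
        # Return the oldest message, or None when the queue is empty.
        return self._messages.popleft() if self._messages else None


def record_signup(queue: MessageQueue, email: str) -> None:
    """Product code depends only on the interface, never on a specific broker."""
    queue.publish(f"signup:{email}")


queue = InMemoryQueue()
record_signup(queue, "user@example.com")
print(queue.receive())
```

Because record_signup only knows about publish, swapping the backend later means writing one new class rather than touching every caller.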

It’s been a wild and crazy year but I have had a blast. I’ve done a bit of everything and learned things I didn’t even know there were to learn. Launching the beta of our cloud monitoring product a few months ago was an awesome experience and watching as we’ve started ramping up our sales engine to engage with customers and try it out has been phenomenal. And I’m really looking forward to the next step of launching the paid product and starting to track everything that goes along with that.

Here’s to the next year and many more after that!

Thoughts on DevOpsDays NYC

I’m currently on the train on my way back from DevOpsDays in Brooklyn. The conference was great — lots of smart people facing a lot of similar problems and trying to see what we could learn from each other. The scale was small, with only like 100-ish people present and not a ton of huge, in your face sponsorship. And the venue was a college campus. And so I kept making these comparisons in my head to LUG meetings, installfests and small scale Linux conferences.

Obviously the subject matter was a bit different — talking about and thinking about running large scale production infrastructures is a little bit different than the next cool Linux distribution. This tended, I think, to more discussion around patterns and best practices than about the specifics of “you should do X to get Y to work”. So a higher level and more abstract discussion.

The composition of the audience and attendees was a pretty similar make-up. Linux events always had a strong majority of the attendees who self-identified as sysadmins and then there tended to be a smaller number of developers. And many of the latter group had ended up in that camp due to necessity. The breakdown for DevOpsDays felt pretty similar with an interesting twist where there were speakers who said they were (paraphrasing) “developers first and fell into operations because they needed to”.

One thing that felt more evolutionary than anything else was that the side channel discussion for the event took place on Twitter rather than on IRC. I have (fond) memories of many conferences where attendees sat in an IRC channel and then basically continued to interact on IRC long after the conference had ended. In fact, I made many friends in this fashion. Similarly there was an ongoing discussion on Twitter using the #devopsdays hash tag and I have followed (and am being followed by) a number of the other attendees and hope to keep in touch and call them friends in the future.

And maybe the thing that struck me the most strongly was where people were “from”. Not in the sense of where they lived but rather where they worked. The attendees were almost all from startups. We were in Brooklyn and not the heart of downtown Manhattan, but NYC is probably home to more financial services companies than anywhere else in the world. And all of those companies have *many* people working in software dev and operations-y roles. But they weren’t there.

So it feels like “the DevOps movement” is going through a similar growth and evangelism pattern as open source and Linux did years ago. Maybe that’s why it feels so comfortable to me.

Can’t code with AWS outages so blogged instead

Although I haven’t really talked about it here, I joined a new startup a couple of months ago called Stackdriver where we’re working on building a hosted solution to make infrastructure monitoring and management suck less for users of the public cloud.  After having to duct tape the various pieces together a couple of times now, it’s super clear that the need is there so it’s exciting to be working on solving it.  More on the side of being at a very early startup to come in the future.

Today I had planned to do some work around some of our provisioning and deployment code and Amazon had another EBS outage making the AWS API pretty unavailable for much of the afternoon.  So after doing some other things, I took a look at what else fails along with EBS so that we’d remember it, and thought it was interesting enough to share.

A repo for the chef-omnibus packages

I finally got around to trying the Chef omnibus installer and it’s a step up from what I was doing previously but still not great.  Grabbing a shell script with curl or wget and piping it to your shell is an anti-pattern which I wish had never taken off.  Luckily, in this case, the shell script is just pulling down an rpm and installing it.  One step nicer would be if there were just a repo that you could use via yum and have things a yum install chef-full away.  And as I thought that this afternoon, I remembered the baseurl support in createrepo.  Thus, without further ado, I’ve thrown together a quick set of repos that just point to the files in the opscode s3 bucket and minimize the amount of storage I have to do 😉  If you want to use them, just drop a file into /etc/yum.repos.d named something obvious like chef.repo

[chef]
name=Chef Omnibus Packages
baseurl=http://katzj.fedorapeople.org/chef-omnibus/el$releasever/$basearch
enabled=1
gpgcheck=0
#gpgkey=

I’ve only tested the EL6 x86_64 package but I went ahead and created the repos for EL5 and EL6, both i686 and x86_64.  Yes, the packages aren’t signed right now.  Hopefully that’s something that can be remedied relatively easily.  And even better would be if Opscode would just integrate the simple call to createrepo into their build process for the omnibus installer.
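
If the $releasever and $basearch bits of the baseurl look opaque, this little sketch shows what the URL turns into once they are filled in. It only imitates the substitution for illustration; yum does this itself, and the values "6" and "x86_64" are assumptions matching the one combination tested above (a real host may report something different).

```python
from string import Template

# baseurl copied from the chef.repo file above.
BASEURL = Template(
    "http://katzj.fedorapeople.org/chef-omnibus/el$releasever/$basearch"
)


def expand_baseurl(releasever: str, basearch: str) -> str:
    """Imitate yum's variable substitution, for illustration only."""
    return BASEURL.substitute(releasever=releasever, basearch=basearch)


# Assumed values for an EL6 x86_64 host.
print(expand_baseurl("6", "x86_64"))
```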

A Puppet User Trying Chef

I have a decent amount of experience at this point with puppet, both from using it to manage the infrastructure running Fedora and from setting it up at a pretty large scale at HubSpot.  But in a new gig, I decided it was worth rounding myself out a bit and giving chef a try.  Not out of any deep seated dislike of puppet but there are a few pieces that I’ve continued to run up against which are a little grating and so I figured it was worth broadening my horizons.  The nice thing is that both are fairly successful open source communities and realistically, as long as you’re using a system, you probably can’t go that wrong, and you can always switch in the future.

Side-note: I’ve also been playing with Michael DeHaan’s new project, ansible, which is also interesting. But I don’t think it’s mature enough to use for a production environment yet and I also was mostly interested in it as a better remote execution layer as opposed to another full fledged config management tool. But yeah. It’s there.  It’s interesting. I’ll probably write more about it later.

With a little bit of chef time under my belt, I have to say that I’m not struck by drastic differences.  The terminologies are different, the DSL used on the config side is a bit different but they act pretty similarly and you can get either of them to do what you want.  That said, there are a few things (good and bad) that I’ve noticed about chef and figured I’d share for others who are looking at deciding for themselves.  Note that a few of the things in the dislikes section may well just be me missing something and being a n00b… suggestions welcome!

Things I’ve Liked 

  • Hosted Chef is a very very nice option to have.  Props to the Opscode team for building an infrastructure to run the server side for you and especially for making the barrier to entry nearly zero by letting you manage up to five hosts for free.  Given some of my headaches around running a puppetmaster previously, I’m glad not to be having to pull together everything to run a chef server
  • Knife is actually pretty cool.  I was skeptical before using it but it does a pretty nice job of encapsulating a lot of common tasks for you
  • Knife gets really cool with the addition of the ec2 plugin.  Launch servers, register them with hosted chef and have them ready to go.  I’ve built all of the surrounding bits and as the environment I’m dealing with grows, I think I’ll grow out of being able to use knife ec2 effectively, but it’s great for an easy starting point
  • Chef solo seems to work okay and have a few niceties over a master-less puppet setup but I didn’t spend much time with masterless puppet, so it’s probably just that I didn’t find the related nice pieces

Things I’ve Disliked / Been Annoyed By

  • The package support in the Fedora/CentOS/RHEL universe is pretty poor.  I realize that all the cool kids use Ubuntu these days but tons of server infrastructures are not.  Todd does a great job with the puppet (+ ecosystem) packages for Fedora and EPEL. Would love to see someone do similar for all of the Chef stuff
  • A lot of the cookbooks that are out there and published are Ubuntu specific. Even the ones which strive to work across distros often end up coercing the Fedora universe to look more like Debian.  Which isn’t necessarily a path I want to go down
    • Probably just a side effect of this but a lot of cookbooks use things which aren’t the standard init system (eg, depending on runit)
  • knife-ec2 makes you think you can get away with using it but I keep tripping across things it doesn’t support, which makes me consider abandoning it
  • Trying out cookbooks from others drives me crazy.  I’m pretty sure I’m missing the good workflow here but it pollutes my checkout by adding vendor branches and auto-committing things.  There’s gotta be something I’m missing here

So am I now a rabid chef fan?  Nope.  But it’s a nice system with some definite advantages for certain use cases.  I suspect I’ll find more of them as I use it more.

Graphing Jenkins Statistics

Like many people, we use Jenkins at work as our continuous integration server and we require that all changes that are committed go through being built in CI before they can get deployed.  Yesterday, someone asked if we could add another jenkins slave to try to reduce the amount of time spent waiting on builds.  While the slaves are fully puppetized and so it’s not much work to bring an additional slave online, my own anecdotal experience made me think that we weren’t really held up often in a way that additional slaves would help.  I had a vague memory of some graphs within jenkins so eventually found them but didn’t really find them that enlightening.  The scale is funky, it’s a weird exponential moving average and I just didn’t find it that easy to get any insight from them.

So last night, I sat down and wrote a quick little script to run via cron and pull some statistics and throw them into graphite.  Already with less than a day of data, I’m better able to tell that we end up with a few periods of about ten minutes where having more executors could help that are correlated with when someone does a commit to one of the projects at the base of our dependency tree.  So that gives us a lot better idea of whether or not the cost of an additional machine is worth the few minutes that we’d be able to save in those cases.

Since it didn’t look like anyone else had done anything along these lines yet, I put the code up on github.  There are a lot more stats that could be pulled out via the jenkins api, this is really just a starting point for what I needed today.
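
For a rough idea of the shape of such a script, here is a minimal sketch. It is not the code from the github repo; it assumes the Jenkins JSON API endpoints /computer/api/json and /queue/api/json, Graphite's plaintext protocol on port 2003, and made-up host names and metric prefix.

```python
import json
import socket
import time
from urllib.request import urlopen


def fetch_json(url: str) -> dict:
    """GET a Jenkins JSON API endpoint and decode the response."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def summarize(computers: dict, queue: dict) -> dict:
    """Reduce the two API responses to a handful of gauge values."""
    return {
        "executors.total": computers.get("totalExecutors", 0),
        "executors.busy": computers.get("busyExecutors", 0),
        "queue.length": len(queue.get("items", [])),
    }


def to_graphite_lines(metrics: dict, prefix: str, timestamp: int) -> str:
    """Format metrics in Graphite's plaintext form: 'path value timestamp'."""
    return "".join(
        f"{prefix}.{name} {value} {timestamp}\n"
        for name, value in sorted(metrics.items())
    )


def send_to_graphite(payload: str, host: str, port: int = 2003) -> None:
    """Send the payload to a Carbon listener over TCP."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(payload.encode("ascii"))


def main(jenkins_url: str, graphite_host: str) -> None:
    """Collect one sample and ship it; meant to be run from cron."""
    computers = fetch_json(f"{jenkins_url}/computer/api/json")
    queue = fetch_json(f"{jenkins_url}/queue/api/json")
    payload = to_graphite_lines(
        summarize(computers, queue), "jenkins", int(time.time())
    )
    send_to_graphite(payload, graphite_host)
```

Calling main("http://jenkins.example.com", "graphite.example.com") once a minute from cron (both hosts hypothetical) is enough to see when the queue backs up while every executor is busy.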

Velocity 2011

I spent last week out in California for the O’Reilly Velocity Conference.  It was in Santa Clara, which I hadn’t been to and frankly, I would be perfectly happy to not return.  Parts of California are nice, Santa Clara is an office building wasteland.  No good food options, nothing really going on, etc.  But I was there for a conference and not for other stuff, so it sufficed.

The conference was actually very good.  It has been a few years since I’ve been to a conference between grad school, my daughter being born, and being at a startup where conferences weren’t the priority.  But it was good to get back to it.  Had a lot of good hallway conversations with people about things that are relevant to us and saw a lot of good presentations.  And Velocity is especially relevant to me at this point as it was all about various web performance and operations stuff.  Where, unsurprisingly, there’s a lot of cool stuff going on.

I mostly kept to the more operations-y tracks just because they map better to what I’m currently working on.  I’ve come away with a bunch of things to look into and posted a whole bunch of choice quotes over on Twitter, but a few takeaways boiled down for here would include

  • If you’re using a public cloud provider, plan for things to fail.  Build your systems expecting it and you’ll have less pain.
  • HubSpot is doing an awesome job with post-mortems.  DanM actually posted a great blog post over on our dev blog about things we’ve learned from doing a lot of them.
  • Everyone complains and focuses on javascript performance but that’s misguided.  The bottleneck is the DOM.  Interestingly, none of the browser guys talked about that apparently
  • DevOps has mostly been about putting developers into ops (hi!) but also needs to be about putting ops into dev
  • Web performance has been very successful in tying itself to business metrics.  Weirdly, operations has overall been less successful at that
  • There’s a lot of work going on to help with debugging and working on webapps for mobile platforms.  Very cool.

None of those are particularly earth shattering revelations, but still good to see/hear.

Also, on Tuesday night I did a talk for the Ignite track.  So 5 minutes, 20 slides, auto-advancing.  My topic was “Just Too Late” and was largely around some things I’ve discovered transitioning into a role where I’m doing more ops stuff and the fact that I feel like I get to things too late.  But then turning it around and showing that’s not really so.  Stay tuned for a longer blog post on the topic.  But the talk went really well.  It was fun, a lot of positive feedback and was good for me to get back to it.  Looking forward to submitting some (full-length) proposals for talks for some conferences later this year.

I also had a few thoughts on the way conferences have changed since I last went to one

  • Twitter really is a pretty big game changer.  Lots of conversation on twitter during the conference about which sessions were good, useful tidbits from sessions, etc.  I actually felt that the experience was pretty strongly enhanced by it
  • Conference wireless still sucks.  But you can get decent data now for devices and avoid the use of the conference wireless entirely.  This made it easier to stay on twitter during the conference
  • An iPad (or other tablet) is a pretty perfect device for looking at stuff during a conference.  It sits on your lap so you can just check it sporadically, the battery lasts all day, you can get data from a cellular provider, and it’s reasonably fast.

Anyway, good time was had.  Thanks to all the people that I met and chatted up.  And hopefully it won’t be as long before I make it to another conference 🙂

My new role

I’m still at HubSpot but my role within the company has changed a bit over the past few months.  Related to the article that Yoav wrote which was posted on onStartups today about how we’re trying to better empower our engineers and teams to really own things, I’ve shifted my focus some.

Instead of working on the product which is front and center to all of our customers or even working on the free tools at grader.com that millions of people use, I’m now instead focused quite a bit on various infrastructure related things for us. Obviously, I’ve done some of that all along, but at this point, it’s my primary job.

It’s a lot of fun. We are heavy users of EC2 and some of the other Amazon services. We also are using Rackspace Cloud some. And I wouldn’t be surprised if we add another provider in the future. So there is a challenge in making all of these environments look the same for the rest of our dev team as well as our on call folks.  We’re also working to make it so that we can easily continue to scale out as our compute needs increase.  All the sorts of things that I’ve spent some time thinking about over the years, but there’s no theoretical here: we’re really deploying, managing and doing everything else for a pretty large distributed system. We are using a fair bit of open source stuff in addition to building some stuff ourselves.  The first thing was obviously ami-creator but there’s more to come almost certainly. In addition, we’ll probably be doing some work and submitting some patches to improve some of the tools and things that we use as it makes sense to do so.

And as we are growing like crazy, I’m looking to hire some people to join my team to help us get even more things done. If I were writing a job description it would probably include bits and pieces like Linux administration, python, puppet, probably devops (as it’s something that’s in mind), cloud automation (… even though I still hate the word cloud), release and build tooling, monitoring, and more. Sound interesting? Drop me a line and let’s talk.