serialized.net

A study in fascination burnout

Using Ogres to Improve Your Designs

The problem

Creating new designs (in whatever domain) is always a bit of a tightrope walk – especially as an organization grows. How much do we try and figure out in advance, vs learning and adjusting as we go? How many people do we involve at what stage of the process?

When the design you’re working on will be largely owned and maintained by a 24x7 operations team, this “involvement” aspect can be particularly vital. It’s hard to imagine building a great and coherent design with 20+ people, but those 20+ people still have to be fully educated about how the system works and (largely) satisfied with the decisions that are made. More importantly, they have years of collective experience that can help find weak points or suggest vital improvements.

An approach

Our current strategy is:

  • have a small team produce designs/prototypes/working code
  • have periodic larger-scale reviews for education, analysis and feedback.

What those “reviews” look like is a design that is, itself, undergoing incremental learning and feedback.

Our first stabs at it were fine, but clearly could be improved a lot. We brought the team together, walked through some slides, and did Q&A. In that format, though, there’s a pretty low upper bound on the number of people that can contribute at any moment. 20+ people in a “one person talks” scenario means a lot of standing around, which isn’t a great use of anyone’s time, and isn’t likely to hold much of anyone’s attention and focus.

This week, we attempted a “version 2”, which, while still not perfect, was a lot better.

Objectives

  1. Actually make the product great. (Or, barring that, “better.”)
  2. Create more (actual) shared ownership and understanding of the design.
  3. Enhance the social network between the teams so that ad-hoc collaboration becomes more likely in future.
  4. Improve shared vocabulary and methods around value and risk management (which we are bad at doing intuitively)
  5. Make good use of everyone’s time: be engaging and effective.
  6. Have fun!

How It Works, and what was that you were saying about Ogres?

Background

I love the book Gamestorming. It’s full of frameworks and patterns for having more than a few people interact in ways that are both engaging and effective.

The review we did was based on the Gamestorming game Challenge Cards. In brief, you form two teams: one team creates challenges to the design, the other creates solutions.

The challenge team picks a card from the deck and plays it on the table, describing a scene or event where the issue might realistically arise. The solution team must then pick a card from their deck that addresses the challenge. If they have a solution they get a point, and if they don’t have a solution the challenge team gets a point. The teams then work together to design a card that addresses that challenge.

For fun, and to emphasize the game aspect (and dampen issues with taking criticism personally), I brought in a silly element of making the “solution team” be Knights, defending a castle – and the “challenge team” be gnarly Ogres.

The art and concepts were introduced in some pre-made cards for playing the game, and in the slide deck that introduced the game to the players. (All of that can be downloaded at the end of the post.)

Gameplay

The basic flow of the event (which took about 90 minutes total, each step being timeboxed) was:

  1. Get together, and do a brief review of the current design (focusing on recent changes, and in-progress work.) People get handouts of the architecture to review through the game.
  2. Split into two teams, Knights and Ogres, each of which heads to its own room. (Anyone can choose to be on either team.)
  3. As a group, use the Heuristic Ideation Technique (more on that below) to brainstorm approaches for solution/challenge.
  4. Form smaller groups, and create the cards for the Challenge.
  5. Meet back up, and play through the decks as described above.
  6. Reconvene as a larger group and debrief.

Heuristic Ideation?

This is a clunky name for a simple and cool idea.

I used it to help brainstorm points of potential weakness in the design – breaking things down into “attributes and actions” and “components.”

Attributes are things about the system: Capacity, Security, Reliability, … Actions are things we know can change in the system: Upgrade, Deploy, Replace, Fail, ….

Components cover physical things: Hard Drive, Network Port, Switch, CPU, DIMM, …. They also cover logical things, like Authentication System, Filesystem, User Interface, …

I knew we’d end up with large enough lists that the grid layout called for in Heuristic Ideation wouldn’t work – our boards aren’t that big – so I opted for a 2 column layout, with attributes and actions on the left and components on the right:

You can apply each attribute or action in the left column to each of the components on the right; by the time you get through all of them, you’ll have really thought through possible fragile areas of the system.

For example, take ‘Filesystem’ as a component; we can discuss its capacity, security, and reliability, as well as what happens if we upgrade it, run a deploy of new code, if it fails, … and so on. Then, think about all of those attributes and actions applied to the next component! (Say, a network switch.)
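The two-column pairing described above is essentially a cross product, and a short sketch can generate the full list of discussion prompts. This is only an illustration: the function name `ideation_pairs` is made up here, and the lists contain just a few of the example entries named in this post.

```python
from itertools import product

# Left column: attributes and actions (a few of the examples from this post).
attributes_and_actions = [
    "Capacity", "Security", "Reliability",   # attributes
    "Upgrade", "Deploy", "Replace", "Fail",  # actions
]

# Right column: physical and logical components.
components = ["Filesystem", "Network Switch", "Hard Drive"]


def ideation_pairs(left_column, right_column):
    """Pair every attribute/action with every component, one component at a time."""
    return [(component, item) for component, item in product(right_column, left_column)]


pairs = ideation_pairs(attributes_and_actions, components)

# Print each pairing as a prompt for the group to talk through.
for component, item in pairs:
    print(f"{component}: {item}?")
```

Even with these short lists there are 21 prompts to walk through, which is a good reason to split into smaller groups before writing cards.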

Comparative Risk

Instead of just using blank index cards, I threw a bit of layout at it.

The extra fields help scope the risk a bit. (Obviously they are all going to be wild-ass guesses, but they still let us group things by order of magnitude.) Primarily, they’re just there to get people talking in these kinds of terms about relative possible impacts of unlikely things.

The #ragemode tag came from one of our customers who was on the bad end of a miscommunication about some backups and how fresh they were (a bit too fresh, in his case), which led to lost data. It refers to the fact that some things, when they go wrong, give us the reaction “hey, it’s the internet, these things happen.” Others are infuriating. So it’s an attempt to let us weight possible failures by emotional impact.

When 2 cards are paired up, you can just do the math:

(Time to detect + Time to repair) * Customers Impacted * Expected times per year that this freakish thing might happen * #ragemode == A very fuzzy estimate of “customer-minutes of impact per year.”
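As a rough sketch, that arithmetic could be written down like this. The class and field names are hypothetical, times are assumed to be in minutes, and treating #ragemode as a plain numeric multiplier (1.0 meaning “these things happen”) is an assumption for the example, not something the cards prescribe.

```python
from dataclasses import dataclass


@dataclass
class PairedRisk:
    """The fields from a paired-up challenge and solution (names are hypothetical)."""
    name: str
    minutes_to_detect: float
    minutes_to_repair: float
    customers_impacted: int
    times_per_year: float   # expected occurrences per year; may be well below 1
    ragemode: float = 1.0   # assumed multiplier for emotional impact

    def customer_minutes_per_year(self) -> float:
        """(detect + repair) * customers * frequency * #ragemode."""
        outage_minutes = self.minutes_to_detect + self.minutes_to_repair
        return (outage_minutes * self.customers_impacted
                * self.times_per_year * self.ragemode)


# Made-up numbers, purely to show the calculation.
corruption = PairedRisk(
    name="Filesystem gets corrupted",
    minutes_to_detect=10,
    minutes_to_repair=50,
    customers_impacted=200,
    times_per_year=0.5,
    ragemode=2.0,
)
impact = corruption.customer_minutes_per_year()
print(f"{corruption.name}: ~{impact:.0f} customer-minutes per year")
```

Since every input is a guess, the result is only useful for sorting risks into rough order-of-magnitude buckets.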

Lessons

It was a lot of fun, and met (to at least some degree) all of the objectives I had for it. We’re still going through the cards and learning from them.

However, there was a lot of room for improvement.

First, Heuristic Ideation is awesome. We can use that in all kinds of different scenarios. I’m writing up a list of the Components/Actions/Attributes the teams brainstormed for the wiki, so we can use and further develop them in the future.

The biggest improvement to the game is in the Challenge/Solution dynamic. The Challenges tended to be very specific. (“Filesystem gets corrupted.”) However, the Solutions have to be pretty generic. (“Reinstall the server from scratch and restore from backup.”) Technically, that counted as a point-worthy exchange for the defense, but it didn’t really help us explore the problem.

I’d like to try it this way:

  • Still have the Knights do their defensive planning, but not actually make cards in advance
  • When the challenge card is played, come up with (on the fly) the best solution we can currently execute.
    • If that is “good enough” (as decided by the players, or arbitrated by the facilitator), the Knights get a point.
    • If it’s not, then the Ogres get the point, and then the two teams move on to collaboratively coming up with the solution.

That should keep the focus more on what really matters, and at the level of specifics, and less on the technicalities of “is your vague defensive card applicable here or not.”

A neat thing we discovered was that if someone proposed an attack for which our “defenses” were already good enough, that was a good data point that training/documentation/education needed to be beefed up around that aspect.

One of the Operations Managers had a great suggestion: now that everyone gets the basics, use this framework for fire drills. (Now known as “surprise attacks” or “Ogre Rush!”) Just pop into the NOC with a few of the “Attack” cards all filled out, and make sure people are solid on what the procedure would be to deal with that.

Share and Enjoy

Please feel free to rip off any part of this that was useful at all. I have compiled both PDF versions of the instructional slide deck and playing cards, as well as the original Keynote files.

Knights vs Ogres Game Materials

I made my best effort to track down royalty free art – please contact me if I’ve inadvertently used something of yours!

An Iteration of the Lean Meetings Concept

The Personal Kanban blog ran a great pair of articles on “Lean Meetings” (part 1, part 2.)

I recommend reading through them to get more context for what follows, but I’ll re-share one of the biggest insights: the point of a meeting is to have a conversation. Pre-set agendas feel like best practice, but they can get in the way of that ‘true goal’ more often than not.

Control, agendas, and procedures impeded conversation, focusing on the
structure of the meeting rather than the topics at hand. If you want
people to engage in and feel they’ve derived value from your meeting,
make them feel respected, not restricted.

After having orchestrated a few meetings using this idea, some helpful practices have emerged, which seemed worth sharing.

The basic framework still holds:

  1. Framework: Draw a Personal Kanban
  2. Personal Agendas: Invite all attendees to write their topics on sticky notes
  3. Democratization: Invite all attendees to vote on the topics on the table (each person gets two votes)
  4. Group Agenda: Prioritize the sticky notes
  5. Discuss

The riffs on that list are:

Framework

We found that converting (To Do, Doing, Done) to (To Do, Goals, Doing, Takeaways, Done) has been helpful.

Here is what those new substeps entail:

Goals

Before we actually start discussing a topic, create a shared understanding of what we hope to get out of it. Paraphrasing David Allen, “what does ‘wild success’ look like at the end of this?” What is the value we need to deliver to our customers or organization by the time we’re done?

Concretely, this can come down to basic things. Plan fall employee event? The goals might be “Define vision, date, coordinator, and budget.”

It can also be fuzzier: “Understand each other’s feelings about this topic.”

In practice, it’s been interesting how hard it can be for a team to get this articulated – and there’s a high correlation between “hard to find goals” and “conversations that would wander and go nowhere.”

Mechanically, as is hinted at in the diagram above, we found it natural to add a new sticky with the goals on it as a “rider” to the original topic note. That kept the goals clear and visible to everyone, and they were referred to frequently.

Takeaways

This is probably implicit in what it means to move a topic from “Doing” into “Done”, but it’s helpful as a reminder. “Who is responsible for taking things away from this item?” It’s a good hook for recording things, especially if there’s no formal note-taker.

We also use this as a hook to make sure to consider how any conclusions (or lack thereof) need to be communicated to the wider group that was not in attendance at the meeting.

Personal Agendas

At this point we’ve only really used this idea as-written, but I’m curious to play with various brainstorming/gamestorming or focusing techniques to help generate higher-value items for the team to choose from, rather than those that happen to be top-of-mind for the participants (or have been recorded in their own trusted systems.)

We did bring an idea from Kanban in – ‘item types.’ This was helpful at an all-day monthly meeting one of our teams has. We created a special designation (color of sticky works, as pictured above, or a separate section in the backlog) for items that are “Today Or They Die” – things which will happen between now and the next meeting, regardless of whether we discuss them or not. If it’s December 1, and your next meeting is Jan 1, you should probably make your Christmas plans.

This can help with voting; those items may still not end up getting selected for discussion, but at least the team has a shared understanding that we’re letting the opportunity pass.

It’s mentioned in the linked articles, but the feature of being able to discover new items “on the fly” and work them seamlessly into the backlog is amazingly useful. In life, things are fundamentally interconnected, so it’s to be expected and encouraged that we’ll discover or create things that need discussion as a function of discussion. Agendas hate this idea.

Democratization

The ‘everyone gets N votes’ method works well. We also experimented with a few other methods of prioritization:

  • “If you only get to discuss one thing today…”: each member gets to select one item to bring into To Do. Perhaps more Socialist than Democratic, but it was a nice way to start off the day.
  • “Lightning Round”: Are there any topics remaining in the backlog we can get through productively in 5 minutes or less? This was a fun way to get back into things after a lunch break, but somewhat risky – one of the items selected turned into a 45 minute discussion.

One challenge with voting early in the discussion, particularly for all-day meetings, is allowing priority to evolve over time as perspectives and time pressures change. Using voting just to bring a few items into the ready queue, and then voting again (from scratch) to repopulate when it empties, seemed to work well.

Conclusion

This way of organizing meetings has really enhanced the quality of our discussions and the value they deliver. It ends up feeling much more natural – still providing the compass and context we’re frustrated about not having in a truly chaotic meeting, without the rigidity and creator bias that an agenda brings. After two separate meetings, participants made comments like “that was the most productive and engaging meeting I’ve ever been a part of.” That’s reason enough to continue exploring!

Thanks again to Jim and Tonianne for the education and inspiration.

Getting More Signal From Your Noise

At SCALE 9x I presented a talk in the DevOps track called Getting more Signal from your Noise. You can download the slides (with notes, without notes), and this is a companion post which contains links and further information. I’d recommend reviewing the slides before reading more, as I won’t rehash what I covered there. Due to the 30 minute timebox, I cut out even discussing a few large areas that I can address (briefly) here.

The Point

  • Data + Open Source is an explosively growing field.
  • The complexity of our software systems – the way we deliver applications and services – is exploding.
  • The number of people trying/needing to build systems like this is exploding. (Not just Flickr and Facebook anymore.)
  • The demands on our time and attention are exploding.
  • The data that’s available to us from our infrastructures AND the “world around” (finance, customers, social media, etc) is exploding

The other thing that’s exploding is the number of businesses that will provide you some form of “shrink-wrapped” delivery of the kinds of tools (or at least results) discussed here. Depending on your business, going DIY and duct-taping together what you need may be the wrong idea. However, there are a couple of major reasons that DIY can be a good idea.

  • Flexibility: We are learning every day the kinds of things we need our systems to tell us in a hurry. Being able to quickly tune them makes a difference.
  • Latency and Connectivity: When you’re using a system for real-time decisionmaking, at least having it on-premise means you can throw GB/sec at it, and have results in seconds.

The Data Stack

In the talk, I introduced a model for thinking about what types of functionality the different tools available provide.

  • Collect
  • Transport
  • Process
  • Store
  • Present

Many tools provide just one part of this stack, but even more are ‘hybrids’ that cover several layers. Getting the data you need often means mixing and matching.
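To make the layers concrete, here is a toy sketch with one small function per layer. Everything in it is made up for illustration (the metric names, the sample values, the in-memory “database”); in practice each function would be replaced by a real tool, or one hybrid tool would cover several of them.

```python
from collections import defaultdict
from statistics import mean


def collect():
    """Collect: gather raw samples (hard-coded here in place of a real agent)."""
    return [("web1.cpu", 0.50), ("web1.cpu", 0.70), ("web2.cpu", 0.20)]


def transport(samples):
    """Transport: move samples to where they get processed (a plain generator here)."""
    yield from samples


def process(samples):
    """Process: reduce the raw samples to one average per metric."""
    grouped = defaultdict(list)
    for metric, value in samples:
        grouped[metric].append(value)
    return {metric: mean(values) for metric, values in grouped.items()}


def store(summary, database):
    """Store: persist the processed values (an in-memory dict here)."""
    database.update(summary)
    return database


def present(database):
    """Present: render the stored values for a human."""
    return [f"{metric}: {value:.2f}" for metric, value in sorted(database.items())]


database = {}
report = present(store(process(transport(collect())), database))
print("\n".join(report))
```

Keeping the layers as separate steps is what makes mixing and matching possible: any one of them can be swapped out without touching the others.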

I called out graphite, collectd, OpenTSDB, reconnoiter, esper, and protovis as being particularly worth at least knowing about.

Other projects and ecosystems worth studying:

Hadoop

The literal elephant in the room that I discussed only tangentially, Hadoop (and the huge family of tools around it) can be an incredible asset to learning more about your world via storing/managing/questioning your data. Cloudera remains a great source of both software and education, and is a good place to start.

It’s now possible to get real-er time information from a Hadoop system, but historically it’s been best suited to questions on the 1-day to 1-month time range. (Trends, capacity, etc.)

Log Processing and Management

The state of the art with log management used to be syslog + logrotate = done. There are a lot more options today.

  • Many people are using HDFS (before or after processing) for its scalability, resilience, and ability to integrate with the larger Hadoop family. Orbitz (awesome at sharing, first graphite, now this) have a great presentation about ‘Hadoop for Logs’ which is a good overview of what you’d be getting into.
  • Logstash and graylog2 can bring some of the utility of Splunk without the (ahem) cost structures. While graylog2 stores data in MongoDB, Logstash can optionally use ElasticSearch, which is a nicely packaged “throw text in and RESTfully search” engine. (Thanks to the author of Graylog2, @lennart, for clarifying that.) @timetabling suggests you can glue ElasticSearch to MongoDB, so that when the data changes you index it – but out of the box, that’s not an option.
  • A new entrant, Luwak runs on top of Riak, so you get the availability + scalability and a Map/Reduce interface to boot.
  • Flume and Scribe (as well as messaging systems like RabbitMQ) can replace syslog as the way to shovel raw logs around.
  • Google’s Sawzall is more on the processing side, but it allows you to describe patterns of information you want from your logs, and then in a map/reduce-like way aggregate them.
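The map/reduce style of log aggregation mentioned in the last item can be sketched in a few lines of plain Python. To be clear, this is not Sawzall syntax or any particular tool’s API; the log format and function names are invented for the example.

```python
from collections import Counter

# Made-up log lines: client address, request, HTTP status.
log_lines = [
    '10.0.0.1 "GET /index.html" 200',
    '10.0.0.2 "GET /missing" 404',
    '10.0.0.1 "GET /about.html" 200',
]


def map_line(line):
    """Map step: emit one (key, value) pair per log line; here, (status code, 1)."""
    status = line.rsplit(" ", 1)[-1]
    return status, 1


def reduce_pairs(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


status_counts = reduce_pairs(map_line(line) for line in log_lines)
print(status_counts)
```

Because the map step looks at one line at a time and the reduce step only adds numbers, the same shape of job can be spread across many machines and the partial results combined afterwards.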

Data Analysis

A basic-to-advanced knowledge of statistics is becoming essential. There are powerful tools (like R, and libraries for other languages, such as SciPy for Python) – but if you don’t know what operation you want them to do, they won’t help.

Our tax dollars have actually provided a pretty useful introduction, the NIST Handbook. I have been overwhelmingly happy with how useful the book Data Analysis with Open Source Tools has been – it takes some decent energy to get through, especially if your background is not so math/dev heavy, but it’s insanely rewarding.

Machine Learning

“Machine Learning” is still a pretty intimidating thing to Google for unless you’ve got a C.S. PhD. However, it’s starting to be packaged and democratized enough that mere mortals can start to play.

Apache Mahout has a lot of potential to be of tremendous use here. Many people are using it more in text-related spaces, but the ability to find patterns and trends across multiple disparate systems is exactly what we need for things like botnet combat. I’ve just started looking at this and so far am very inspired to dig further.

People to Watch

This is a by-no-means-comprehensive list of the people whose tweets, software, and writing I’ve found useful to keep exploring the possibilities and pitfalls here:

And so

Short of a series of books, updated monthly, all I can really provide is an appetiser. Hopefully you’ve been inspired to move on to your own data hacking “main course.”