At SCALE 9x I presented a talk in the DevOps track called "Getting More Signal from Your Noise." You can download the slides (with notes, without notes); this is a companion post with links and further information. I'd recommend reviewing the slides before reading on, as I won't rehash what I covered there. Due to the 30-minute timebox, I had to cut even a brief discussion of a few large areas, which I can address here.
The other thing that's exploding is the number of businesses that will provide you some form of "shrink-wrapped" delivery of the kinds of tools (or at least results) discussed here. Depending on your business, going DIY and duct-taping together what you need may be the wrong idea. However, there are a few major reasons DIY can be a good one. Flexibility: we are learning every day what kinds of things we need our systems to tell us in a hurry, and being able to tune them quickly makes a difference. Latency and connectivity: when you're using a system for real-time decision-making, having it on-premises means you can throw GB/sec at it and have results in seconds.
In the talk, I introduced a model for thinking about what types of functionality the different tools available provide.
Many tools provide just one part of this stack, but many others are "hybrids." Getting the data you need often means mixing and matching.
Other projects and ecosystems worth studying:
The literal elephant in the room that I discussed only tangentially: Hadoop (and the huge family of tools around it) can be an incredible asset for learning more about your world by storing, managing, and querying your data. Cloudera remains a great source of both software and education, and is a good place to start.
It's now possible to get real-er time information out of a Hadoop system, but historically it has been best suited to questions on the 1-day/1-month time range (trends, capacity, etc.).
The state of the art with log management used to be syslog + logrotate = done. There are a lot more options today.
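Even before reaching for a full log-management stack, a few lines of scripting can pull signal out of a syslog stream. This is a minimal sketch, not any particular product's approach; the sample lines and the severity-keyword convention are assumptions, since real syslog formats vary by daemon and configuration:

```python
import re
from collections import Counter

# Match common syslog severity keywords. Hypothetical convention --
# adjust the pattern to whatever your daemons actually emit.
SEVERITY_RE = re.compile(r"\b(emerg|alert|crit|err|warning|notice|info|debug)\b")

def severity_counts(lines):
    """Count lines per severity keyword, for a quick triage view."""
    counts = Counter()
    for line in lines:
        match = SEVERITY_RE.search(line)
        counts[match.group(1) if match else "unknown"] += 1
    return counts

# Made-up sample lines in a syslog-ish shape:
sample = [
    "Mar 12 06:25:01 web1 sshd[1234]: err: Failed password for root",
    "Mar 12 06:25:02 web1 cron[999]: info: job started",
    "Mar 12 06:25:03 web1 kernel: warning: disk latency high",
]
print(severity_counts(sample))
```

In practice you'd feed this from a tail of a centralized log stream rather than a static list; the point is that a first pass at "signal from noise" can be this small.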
A basic-to-advanced knowledge of statistics is becoming essential. There are powerful tools available (like R, and libraries such as SciPy for various languages) -- but if you don't know what operation you want them to perform, they won't help.
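As one concrete example of the kind of basic statistics that pays off: flagging outliers with a z-score. This sketch uses only Python's stdlib `statistics` module (rather than SciPy) to stay self-contained; the latency numbers are invented, and the normality assumption it leans on is often wrong for real operational data, which tends to be long-tailed:

```python
import statistics

def zscore_outliers(samples, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean.

    Naive sketch: assumes roughly normal data, which real latency
    distributions frequently are not.
    """
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    if stdev == 0:
        return []
    return [x for x in samples if abs(x - mean) / stdev > threshold]

# e.g. response times in ms, with one obvious spike (made-up data)
latencies = [12, 14, 11, 13, 12, 15, 13, 240]
print(zscore_outliers(latencies, threshold=2.0))
```

Knowing *why* a z-score misleads on skewed data -- and when to reach for percentiles instead -- is exactly the statistical literacy the resources below teach.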
Our tax dollars have actually provided a pretty useful introduction: the NIST Handbook. I have been overwhelmingly happy with how useful the book Data Analysis with Open Source Tools has been -- it takes some real energy to get through, especially if your background is not so math/dev heavy, but it's insanely rewarding.
"Machine Learning" is still a pretty intimidating thing to Google for unless you've got a C.S. PhD. However, it's starting to be packaged and democratized enough that mere mortals can start to play.
Apache Mahout has a lot of potential to be of tremendous use here. Many people are using it more in text-related spaces, but the ability to find patterns and trends across multiple disparate systems is exactly what we need for things like botnet combat. I've just started looking at this and so far am very inspired to dig further.
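To make the "finding patterns across systems" idea concrete, here is a toy 1-D k-means clustering sketch in pure Python -- illustration only, and emphatically not Mahout's implementation, which is distributed and far more robust (smarter seeding, convergence checks, sparse vectors). The request-rate numbers are invented:

```python
import random

def kmeans_1d(points, k, iterations=20, seed=0):
    """Toy 1-D k-means: the flavor of clustering Mahout runs at Hadoop scale."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# e.g. outbound request rates from two populations of hosts:
# normal traffic vs. something botnet-shaped (hypothetical numbers)
rates = [5, 6, 4, 7, 5, 95, 102, 99, 101]
print(kmeans_1d(rates, k=2))
```

The two centers it finds separate the "normal" hosts from the anomalous ones; scale that idea up to millions of feature vectors across disparate systems and you have the shape of the botnet-combat use case.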
This is a by-no-means-comprehensive list of people whose tweets, software, and writing I've found useful for continuing to explore the possibilities and pitfalls here:
Short of a series of books, updated monthly, all I can really provide is an appetiser. Hopefully you've been inspired to move on to your own data hacking "main course."