I build data pipelines and analytical systems at massive scale. My experience lies in distributed systems, with a focus on data-driven, large-scale systems (10,000+ nodes).
For highly concurrent workloads, my development environments of choice are Erlang (BEAM) and Clojure (JVM). Functional languages that support thousands of lightweight threads communicating through message passing, combined with inverted concurrency control, enable low latency and high throughput in thread-safe software.
Storing data at scale has long interested me. I am familiar with the RCFile whitepaper and the more recent publications on ORC and Parquet. I have used columnar stores alongside classical row-oriented stores (SQL servers) and key-value stores (Riak, Couchbase).
Analyzing large datasets can be challenging. Caching, sampling, and a few other techniques make it possible to query these datasets. I am familiar with a few query engines (Hive, PrestoDB, Tez).
I am currently working at StreamBright.