Less Boring Resume

ℹ️ This is a work-in-progress narrative version of my resume, meant to resemble what you'd hear from me if we were talking over coffee or a drink. My hope is that it's a bit more fun to read than my standard resume and gives a better sense of context for my contributions.

Citrine Informatics

Customer's Problem

Let's say you're a materials scientist in 2010. You make advanced polymers for car door panels and bumpers. Your boss comes to you and says: "Audi wants a door panel that is 15% lighter, but still doesn't dent or scratch easily, and doesn't deform in the temperature range of -40C to 55C. Can we do that? How much would it cost? Do we need to buy new equipment to make it work?" This is a very hard set of questions! You'd probably say something like "I'm not sure, but if you give me $1m and two years I'll get you an answer," and your boss would say "I can give you 18 months and $600k," and you'd grumble a little and get to work pursuing research ideas for the lightweight glass-embedded polymers you've recently read about in papers or heard about at a conference.

Citrine's Product

Now let's say you're a materials scientist in 2022. Your boss comes and asks the same question. You say, "I'm not sure, but I used that Materials AI software we bought recently to generate some candidate materials that fit those criteria, along with estimated probabilities of success. We can systematically work through that list according to success probabilities and/or estimated performance, updating our estimates along the way, and either hit those targets or get back to Audi with a compromise spec within four months."

That is a much better answer! Everyone is happier about this. Your boss says "That software is amazing, how did we ever live without it?" Hundreds of 📈 and 💰 emojis spontaneously appear at Citrine Informatics headquarters, and somewhere in San Mateo a venture capitalist smiles.

Citrine's Problem

Unfortunately, Citrine employees have no idea how to make glass-embedded polymers. Well, maybe one or two of them do, because Citrine poaches a lot of PhD materials scientists from academia, but they definitely don't know the specifics of how you make them. And even if they did, they're a software company, not a consultancy. They want to sell software to customers making steel, and polymers, and ceramics, and adhesives, and paints and coatings, and every other kind of thing. So, they need some way for each customer to provide training data but also, critically, the subject matter expertise required to make their thing. Without some ground-truth analytic relationships (e.g., the Arrhenius equation) injected into the models, there is almost never enough materials data to make decent predictions. Most companies have a few thousand real-world experimental data points at most. Materials R&D, unlike A/B tests on the internet, is an extremely expensive way to generate training data; individual data points constitute trade secrets that cost tens or hundreds of thousands of dollars to produce, so there aren't very many of them.
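
To make the "inject a ground-truth relationship" idea concrete, here's a toy sketch (my own illustration, not Citrine code): rather than asking a model to rediscover an exponential temperature dependence from a handful of expensive experiments, you hand it features derived from the Arrhenius form up front.

```python
import math

# Toy illustration (not Citrine code): derive features from the Arrhenius
# relationship k = A * exp(-Ea / (R * T)), so that ln(k) is linear in 1/T
# and a model can fit a simple linear trend instead of an exponential one.
R = 8.314  # gas constant, J/(mol*K)

def arrhenius_features(temperature_k: float, rate: float) -> dict:
    """Physics-informed features: inverse temperature and log of the rate."""
    return {
        "inv_temperature": 1.0 / temperature_k,
        "log_rate": math.log(rate),
    }

# A linear fit of log_rate against inv_temperature recovers -Ea/R as the
# slope -- far less data-hungry than learning the curve from scratch.
samples = [arrhenius_features(t, k) for t, k in [(300.0, 1.2e-3), (350.0, 9.8e-3)]]
```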

So, Citrine's software is useless until they can onboard the customer's data into a format their AI models can train on to generate candidate materials. Those data are often scattered among various Excel files, PDFs, and in some cases physical records in labs and production facilities. There isn't a standard way to record processing and measurement data in the industry. The data are multidimensional, sparse, and small. They often can't be transformed into nice arrays-of-arrays to train AI models on without detailed subject matter expertise (read: additional user input).
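
To give a flavor of "multidimensional, sparse, and small" (a made-up example, not real customer data): concatenate two labs' spreadsheets for nominally the same material and you get mostly-empty columns, mismatched names and units, and nothing that says which columns are process inputs versus measured properties.

```python
import pandas as pd

# Made-up example, not real customer data: two labs' spreadsheets for
# "the same" polymer. Columns only partially overlap, units differ, and
# nothing in the data itself says which columns are inputs vs. outputs.
lab_a = pd.DataFrame({
    "sample": ["A-001", "A-002"],
    "cure temp (C)": [120, 140],
    "tensile strength (MPa)": [48.2, None],   # one measurement never recorded
})
lab_b = pd.DataFrame({
    "sample": ["B-17"],
    "Cure Temperature [F]": [266],            # same quantity, different name and units
    "scratch resistance": ["pass"],           # categorical, not numeric
})

combined = pd.concat([lab_a, lab_b], ignore_index=True)
print(combined)  # mostly NaN: sparse, heterogeneous, and tiny
```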

What I did at Citrine

My very first task at Citrine was to collaborate with our in-house materials scientists, AI researchers, and product team to design a data model that could faithfully represent all of our customers' data. I cracked open "Information Architecture" from O'Reilly, read it cover-to-cover twice, and started asking a lot of questions about things like the scale of our customers' data (anywhere from small Excel files to billions of simulated records from DFT models), likely access patterns, and the format our AI models needed the input data to be in. We came up with GEMD - an open-source specification and Python library that served as the canonical data exchange format for the platform. Working with a growing team of engineers, we designed and built a Python API client, backend API, storage layer, and data transformation system for GEMD.
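
For a flavor of what GEMD records look like, here's a rough sketch written from memory; the exact import paths and constructor signatures may differ from the current gemd release. The core idea is that processes, materials, and measurements are explicit, linked objects rather than columns in a spreadsheet.

```python
# Rough sketch from memory -- exact import paths and constructor signatures
# may differ in the current gemd release.
from gemd.entity.object import ProcessRun, MaterialRun, MeasurementRun
from gemd.entity.attribute import Condition, Property
from gemd.entity.value import NominalReal

# A process produces a material; a measurement is performed on that material.
curing = ProcessRun(
    name="cure door-panel polymer",
    conditions=[Condition("cure temperature", value=NominalReal(120, "degC"))],
)
panel = MaterialRun(name="panel sample 42", process=curing)
tensile = MeasurementRun(
    name="tensile test",
    material=panel,
    properties=[Property("tensile strength", value=NominalReal(48.2, "MPa"))],
)
```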

I'm proud of that work, and particularly fond of a few features that are only there because of me. The data structure is a doubly-linked graph, but we very carefully chose which relationships in the graph are one-to-many, one-to-one, and many-to-many in order to make certain traversals and queries as efficient as possible. Those restrictions are what make it possible to rapidly transform the graph into nice two-dimensional matrices of data to use for AI model training. Citrine's backend implementation of GEMD also treats customers' unique identifiers as first-class citizens, so that folks can make use of the platform without ever needing to record a crosswalk from their identifiers to Citrine's. We even allowed for multiple customer IDs to exist on each record (namespaced to avoid collisions). Finally, I added a hierarchical tagging system that could be queried by prefix, allowing our customers and professional services teams to segment and query data with arbitrary granularity - without any backend development necessary. I referred to the tagging system as "transparently laundering the capabilities of DynamoDB directly to end users," which is a joke I think only a few other people ever really understood. The tagging system was heavily used by our customers, and yielded more concrete data about how our customers wanted to model and query their data than anything else in the system; it was an incredibly rich source of data-driven product and user feedback.
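
The namespaced-ID and prefix-tag ideas in miniature (an illustrative sketch only, not the platform's actual API):

```python
# Illustrative sketch only -- not Citrine's actual API. It shows two ideas:
# (1) identifiers namespaced per customer system, so no crosswalk is needed, and
# (2) hierarchical tags that can be queried by prefix at any granularity.
from dataclasses import dataclass, field

@dataclass
class Record:
    name: str
    uids: dict = field(default_factory=dict)   # namespace -> identifier
    tags: list = field(default_factory=list)   # "level1::level2::..."

records = [
    Record("panel sample 42",
           uids={"acme-lims": "LIMS-0042", "citrine": "8f1c-demo"},
           tags=["project::audi::door-panel", "lab::stuttgart"]),
    Record("bumper sample 7",
           uids={"acme-lims": "LIMS-0107"},
           tags=["project::audi::bumper"]),
]

def by_tag_prefix(recs, prefix):
    """Return records carrying any tag at or under the given hierarchical prefix."""
    return [r for r in recs if any(t == prefix or t.startswith(prefix + "::") for t in r.tags)]

print([r.name for r in by_tag_prefix(records, "project::audi")])  # both samples
```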

My team working on the GEMD API systems grew over time, from one engineer to as many as seven engineers at its peak. We worked closely with the professional services teams to provide speed and quality-of-life improvements for data onboarding. Initially, it would take weeks of work to onboard a new customer - our professional services teams would write reams of integration code to pipeline Excel files or customer databases into GEMD records via the API, then select the data they wanted to vectorize and use in AI models. By the end of my time at Citrine we could take a customer's mess of data and have an AI-ready vectorized dataset, with no code written at all, in a matter of hours - where the bulk of that time was spent transforming arbitrary customer data into one of a few standard data layouts in Excel or a CSV. All the API integration capabilities still existed and were extended over time to help customers with more sophisticated in-house data systems, and to power our browser SPA for data exploration and visualization.
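
The gist of the "no code" path, sketched hypothetically (this is not Citrine's actual configuration format): a subject matter expert declares once how the columns of a standard spreadsheet layout map onto GEMD concepts, and the platform does the rest.

```python
# Hypothetical sketch of the idea behind no-code onboarding -- not Citrine's
# actual configuration format. A subject matter expert supplies a declarative
# column mapping once; the platform then turns every spreadsheet row into
# linked process/material/measurement records and a vectorized training table,
# with no custom integration code.
column_mapping = {
    "Sample ID":     {"role": "material-id", "namespace": "acme-lims"},
    "Cure Temp (C)": {"role": "process-condition", "name": "cure temperature", "units": "degC"},
    "Tensile (MPa)": {"role": "measured-property", "name": "tensile strength", "units": "MPa"},
}
```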

As the team grew, I wore a lot of hats.

I was a de facto engineering manager during a few 6-month-plus stretches when my team's on-paper manager had >20 direct reports. I did dozens of interviews, wrote performance reviews, managed projects and JIRA boards, ran standups for three years, and reported on project status to upper management and other stakeholders. I had regular 1:1s with every member of my team. I wrote our API and coding style guides, set up and managed our on-call rotation, onboarded and mentored junior engineers, and tried to build an inclusive, growth-minded culture.

I was the lead engineer for the team, prototyping new features and helping to triage and debug problems with the system. I did a lot of PR review and pairing with folks who were stuck or confused. I developed a detailed understanding of the domain and the various personas we were trying to serve, and helped my team understand why something needed to be implemented in a particular way, even if it was more work for us. Conversely, I communicated outward and upward as we delivered incremental progress, explaining why architectural and implementation choices were made and what capabilities we would or would not have as a result. I gave demos, but also tried hard to grow those capabilities in individual team members - I always felt uncomfortable being the main face associated with the team's work.