Less Boring Resume
ℹ️ This is a narrative version of selected parts of my resume. It resembles what you'd hear from me if we were talking over coffee and should be a bit more fun to read than my standard resume.
Citrine Informatics
Customer's Problem
Let's say you're a materials scientist in 2010. You make advanced polymers for car door panels and bumpers. Your boss comes to you and says: "Audi wants a door panel that is 15% lighter but still doesn't dent or scratch easily, and that doesn't deform in the temperature range of -40°C to 55°C. Can we do that? How much would it cost? Do we need to buy new equipment to make it work?" This is a hard set of questions! You'd probably say something like "I'm not sure, but if you give me a million dollars and two years I'll get you an answer." Your boss would say "I can give you eighteen months and $600k" and you'd grumble a little and get to work pursuing research ideas for lightweight glass-embedded polymers you've read about in papers or heard about at a conference.
Citrine's Product
Now let's say you're a materials scientist in 2022. Your boss comes and asks the same question. You say, "I'm not sure, but I used that Materials AI software we bought recently to generate some candidate materials that fit those criteria, along with estimated probabilities of success. We can systematically work through that list according to success probabilities and estimated performance, updating our estimates along the way, and either hit those targets or get back to Audi with a compromise spec within four months."
That is a much better answer! Everyone is happier about this. Your boss says: "That software is amazing! How did we ever live without it?" Hundreds of 📈 and 💰 emojis spontaneously appear at Citrine Informatics headquarters. Somewhere in San Mateo, a venture capitalist smiles.
Citrine's Problem
Citrine employees have no idea how to make glass-embedded polymers. Well, maybe one or two of them do because Citrine often poached PhD materials scientists from academia, but they don't know the specifics of how you make them. And even if they did, Citrine is a software company, not a consultancy. It wants to sell software to customers making steel, polymers, ceramics, adhesives, paints and coatings, and every other kind of thing. So, Citrine needs some way for each customer to provide training data and, critically, the subject matter expertise to make their thing. Without some ground-truth analytic relationships (e.g., the Arrhenius equation) injected into the models, there is never enough data to make decent predictions about materials. Most companies have a few thousand real-world experimental data points, at most. Materials R&D is expensive; individual data points constitute trade secrets that cost tens or hundreds of thousands of dollars. There aren't very many of them.
So, Citrine's software is useless until they can onboard the customer's data into a format their AI models can use. The data are often scattered among Excel files, PDFs, and in some cases physical records in labs and production facilities. There isn't a standard way to record processing and measurement data in the industry. The data are multidimensional, sparse, and small. They often can't be transformed into nice n-dimensional arrays without detailed subject matter expertise (read: additional user input).
What I did at Citrine
My very first task at Citrine was to collaborate with our in-house materials scientists, AI researchers, and product team to design a data model that could faithfully represent all of our customers' data. I cracked open "Information Architecture" from O'Reilly, read it cover-to-cover twice, and started asking a lot of questions about things like the scale of our customers' data (anywhere from small Excel files to billions of simulated records from DFT models), likely access patterns, and the format our AI models needed as training input. We came up with GEMD, an open-source specification and Python library that served as the canonical data exchange format for the platform. Working with a growing team of engineers, we designed and built a Python API client, backend API, storage layer, and data transformation system for GEMD.
I'm proud of all of GEMD, but I'm particularly fond of a few features I snuck into the specification early on. The data structure is a doubly-linked graph, but I carefully chose which points are one-to-many, one-to-one, and many-to-many to make certain traversals and queries as efficient as possible. Those choices made it possible to rapidly transform the graph into flat n-dimensional matrices of data to use for AI model training. Citrine's backend implementation of GEMD also treats customers' unique identifiers as first-class citizens, so that folks can use the platform without needing to record a crosswalk from their local identifiers to Citrine's. We even allowed for multiple customer IDs to exist on each record (namespaced to avoid collisions). Finally, I added a hierarchical tagging system, allowing our customers and professional services teams to segment and query data precisely and quickly, disregarding the complex graph structure of the data when needed. I referred to the tagging system as "transparently laundering the capabilities of DynamoDB directly to end users" which I think only a few other engineers understood. The tagging system was heavily used by our customers, yielding concrete data about how users wanted to model and query their data; it was a rich source of data-driven product and user feedback.
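To make the identifier and tagging ideas a bit more concrete, here's a minimal sketch in Python. It is not GEMD's actual API or Citrine's backend: the `Record` class, its field names, the `"::"` tag separator, and both helper functions are hypothetical, chosen only to show the shape of the idea. Each record carries several customer IDs keyed by namespace, plus a list of hierarchical tags that can be queried by prefix without touching the graph at all.

```python
from __future__ import annotations

from dataclasses import dataclass, field

TAG_SEPARATOR = "::"  # assumed separator for tag hierarchy levels


@dataclass
class Record:
    """A hypothetical stand-in for one node in the materials data graph."""
    name: str
    # namespace -> customer-supplied ID, so IDs from different systems can't collide
    uids: dict[str, str] = field(default_factory=dict)
    # hierarchical tags, e.g. "project::door-panel::trial-3"
    tags: list[str] = field(default_factory=list)


def find_by_uid(records: list[Record], namespace: str, uid: str) -> Record | None:
    """Look a record up by the customer's own ID within one namespace."""
    for record in records:
        if record.uids.get(namespace) == uid:
            return record
    return None


def find_by_tag_prefix(records: list[Record], prefix: str) -> list[Record]:
    """Return records with a tag equal to `prefix` or nested underneath it."""
    nested = prefix + TAG_SEPARATOR
    return [
        record
        for record in records
        if any(tag == prefix or tag.startswith(nested) for tag in record.tags)
    ]


records = [
    Record(
        name="Panel batch A",
        uids={"lims": "0042", "erp": "PB-A"},
        tags=["project::door-panel::trial-3"],
    ),
    Record(
        name="Bumper batch B",
        uids={"lims": "0043"},
        tags=["project::bumper"],
    ),
]

panel = find_by_uid(records, "erp", "PB-A")
door_panel_records = find_by_tag_prefix(records, "project::door-panel")
```

A real system would back both lookups with an index rather than a linear scan; the point here is only that neither query needs to walk the graph.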
My team working on the GEMD API systems grew from one engineer (me!) to as many as seven. We worked closely with the professional services teams to provide speed and quality-of-life improvements for ETL, authorization, data lifecycles, and systems integration. Initially, it took weeks to onboard a new customer: our professional services teams would write reams of integration code to pipeline Excel files or customer databases into GEMD records via the API, then select the data they wanted to vectorize and use in AI models. By the end of my time at Citrine we could take a customer's mess of data and have an AI-ready vectorized dataset, with no code written at all, in hours.
As the team grew, I wore a lot of hats. My role morphed from being a sole IC into a Tech Lead. Rather than doing everything myself, I began prototyping new features before handing design docs and skeletal code to the team. I spent much more of my time triaging error reports and debugging. I did a lot of PR reviews, pairing with folks who were stuck or confused. I developed a detailed understanding of the domain and the various product personas we were trying to serve. I tried to help my team understand why something needed to be implemented in a particular way, even if it was more work for us. I also communicated outward and upward as we delivered incremental progress, explaining the reasoning behind our architecture and design decisions, and the impact those would have on the product.
I was a de facto engineering and product manager during a few six-month periods when my team's on-paper manager had more than twenty direct reports. I did dozens of technical and behavioral interviews for both engineers and others in the company. I wrote performance reviews, managed projects, maintained JIRA boards, ran standups, and reported on project status to upper management and internal stakeholders. I had regular one-on-ones with every member of my team. I wrote our API and coding style guides, set up and managed our on-call rotation, onboarded and mentored junior engineers, and tried to build an inclusive, growth-minded culture.
Citrine was a career stepping stone for me, and it was thrilling to watch the product grow and get used by such brilliant people. I will always value my time there and remain grateful for the growth opportunities afforded to me.