A design isn't done because it's shipped. I test new work against the thing it replaces, validate the architecture of experimentation features with the people who run experiments for a living, and ground interaction decisions in how customers actually test today. This page collects four threads where evidence, not opinion, settled the design.
A new drag-and-drop creative builder was meant to replace a workflow where users uploaded full ad creatives as flat images and hand-edited CSS to make changes. "Better" was easy to assert and hard to prove. I needed evidence that the new approach actually beat the old one for the people who'd use it.
I ran head-to-head testing with enterprise customers, asking them to rate the new builder directly against the system they used today. The gap was decisive: the new builder scored in the high range while the current system sat near the bottom, and one media customer called it a thousand-percent improvement. I captured the qualitative wins too, the ability to test different headlines and calls to action without rebuilding an entire creative, so the rating was backed by specific reasons, not just a number.
The platform needed a built-in A/B testing flow so marketers could test creative variants. Experimentation UI is unforgiving: traffic logic, locking, and the control case all have to behave exactly right, and the people who'd use it run tests for a living and would spot anything that didn't match how experiments actually work.
I designed the flow against a full set of acceptance criteria and validated it directly with customers who run experiments daily. Sessions confirmed the core interaction decisions, inline traffic editing, lock behavior, and a split-traffic-equally action, and surfaced a subtle issue where applying a setting at the wrong level would corrupt the metrics, which I routed to engineering as a spike before it could ship. The flow was reviewed and validated live with a customer ahead of its release.
To measure lift, customers needed a holdout group, a slice of traffic that sees nothing. There was no real holdout feature, so customers faked one with invisible variants set to most of the traffic. That hack caused scroll-lock bugs, CSS conflicts, and support tickets. The feature had to be designed from what people were actually doing to cope.
I documented the workaround in detail and turned it into requirements for a dedicated holdout: a small default, pixel-based, rendering no creative at all, so measurement stayed clean without the side effects. I validated the surrounding architecture question, whether dismissal settings belong at the campaign or variant level, with multiple customers, and let the split in their answers drive a configurable default rather than forcing one opinion on everyone.
Before validating a design with customers, the team had to agree on what it even believed. A core system concept was being built on assumptions nobody had checked, and quiet internal disagreement about how it worked would have turned any customer test into noise.
I ran an internal assumption-mapping session that surfaced more than twenty assumptions about how the system behaved, and it exposed real disagreement on fundamentals, including whether editing a shared template changed work that already existed. Rather than paper over it, I separated what the team genuinely agreed on from what it only assumed, then turned the open questions into a concrete evidence plan: competitor research plus a customer survey routed through the advisory community.