What 27 AI-Tagged Tasks Revealed About AI Productivity Measurement

Posted on May 22, 2026 by Alex Morgan
Estimated reading time: 6 minutes

I pulled the AI productivity measurement data from our Wrike system last week, expecting a clean win story. Twenty-seven AI-tagged tasks across our most recent compensation consulting projects. Planned hours: 150. Actual hours: 183. We ran 22% over plan, not under.

The headline looked like bad news for AI adoption on our own team. After two weeks with the data, I think it is the right number. It is also the most useful AI measurement story we have found so far.

The AI Productivity Measurement Headline Looked Bad

Our team had been tagging AI work in Wrike for several months before I sat down with the numbers. The tagging system captures six dimensions on every task that touches AI: integration level, human-AI collaboration model, tool used, readiness barrier, skill development category, and conversion mechanism. Planned versus actual hours and fees are recorded alongside the tags. This post focuses on the five dimensions where the data was most readable across the 27-task sample; the skill development category will receive its own analysis once the sample is larger.

[Figure: Recommended five-tag system for AI productivity measurement]

Our goal was simple. We wanted to measure AI productivity on our own US-based compensation team before recommending measurement approaches to clients.

When I aggregated the first 27 completed AI-tagged tasks, the gap was hard to miss. Planned hours totaled 150. Actual hours came in at 183. Across the board, our analysts and consultants were spending more time per task with AI, not less.

That number contradicted every assumption I walked in with. AI is supposed to save time. Vendor decks promise speed. Industry benchmarks report hours saved. Then I saw a 22% overrun on tasks where we were specifically using AI tools.
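For readers who want to check the headline arithmetic, here is a minimal sketch in Python. The function name `overrun_pct` is my own for illustration, not something from our Wrike setup; the only real inputs are the two totals quoted above.

```python
def overrun_pct(planned_hours: float, actual_hours: float) -> float:
    """Percent by which actual hours exceed plan (negative means under plan)."""
    if planned_hours <= 0:
        raise ValueError("planned_hours must be positive")
    return 100 * (actual_hours - planned_hours) / planned_hours


# Totals reported in this post for the 27 AI-tagged tasks.
headline = overrun_pct(planned_hours=150, actual_hours=183)
print(f"{headline:.0f}% over plan")  # 22% over plan
```

The 33 extra hours divided by the 150 planned hours gives the 22% figure used throughout this post.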
My first instinct was to look for a measurement error. There wasn’t one. The data was clean. Our team really was investing more hours per AI-touched task than we had planned.

After a week of staring at it, I started asking a different question. What if “over plan” is the right outcome for this stage of AI adoption? And what if “saved hours” is the wrong measurement frame entirely?

One note on scope before I go further. These 27 tasks cover compensation consulting work. Categories include merit plan design, total rewards structuring, SimplyMerit deployment configuration, and client deliverable drafting. The findings reflect expert knowledge work, not high-volume transactional work. AI productivity measurement looks different in those two contexts, and I will come back to that distinction below.

Why Saved Hours Misses the Point of AI Productivity Measurement

Most published AI productivity measurement I see reports a single metric. Whether the source is a vendor report, a consulting firm study, or an internal HR dashboard, that metric is hours saved. The Stanford AI Index Report 2025 documents the same pattern at a much larger scale: AI adoption is now widespread across organizations, yet productivity reporting still leans almost entirely on speed metrics, even as the underlying research points to quality and capability gains showing up earlier in the adoption curve.

Saved hours are easy to count. Yet they only matter if they convert into something the business actually cares about. That means a change to financial outcomes, capacity, or risk exposure.

Saved hours convert into value through one of four mechanisms. The first is Speed: the same work, done in less time. Quality is the second: the same time invested, but a better deliverable. Capability comes third: work that the team could not have done before, now in scope. Last is Risk reduction: the same work, with lower downstream exposure to errors, compliance gaps, or rework.
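If it helps to see the four mechanisms as a field on a task record, here is a rough Python sketch. The class names, field names, and the sample task are all hypothetical; this is one way such a record could look, not our actual Wrike schema.

```python
from dataclasses import dataclass
from enum import Enum


class ConversionMechanism(Enum):
    """The four ways saved hours can convert into value, as named in this post."""
    SPEED = "Speed"
    QUALITY = "Quality"
    CAPABILITY = "Capability"
    RISK = "Risk reduction"


@dataclass
class TaggedTask:
    """Hypothetical record for one AI-tagged task."""
    name: str
    planned_hours: float
    actual_hours: float
    mechanism: ConversionMechanism

    @property
    def over_plan(self) -> bool:
        return self.actual_hours > self.planned_hours


# Invented example task, for illustration only.
task = TaggedTask("Draft merit plan memo", 4.0, 5.5, ConversionMechanism.QUALITY)
print(task.over_plan, task.mechanism.value)
```

Keeping the mechanism next to the hours means an over-plan task can be read alongside the kind of value it was tagged with.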
[Figure: MorganHR’s conversion mechanism framework for AI productivity measurement. Saved hours only matter if they convert into Speed, Quality, Capability, or Risk reduction.]

Here is what most measurement frameworks miss. In expert knowledge work, Speed is usually the last mechanism to show up, not the first. Quality, Capability, and Risk reduction arrive earlier in the adoption curve because that is where AI changes the work itself. Speed comes only after the team has built reliable patterns for using AI on real tasks.

Building those patterns takes deliberate time. Our 22% overrun was the team doing exactly that work. They were drafting prompts, validating outputs, and comparing AI-generated language to client-tested language. The team also redid pieces that did not pass review and developed internal standards for what “finished” looks like when AI is in the loop.

None of that work shows up as “hours saved.” All of it shows up as Quality, Capability, and Risk conversion, which most measurement dashboards do not track.

Three Patterns That Reframed How I Read the AI Productivity Data

Once I stopped reading the data as a speed report and started reading it as a conversion report, three patterns came forward. Each one surprised me. Together, they pointed toward measurement choices we would have missed if we had only tracked saved hours.

Pattern 1: Collaboration Model Beats Tool Choice in AI Productivity Data

Two collaboration models dominated the data. Tasks tagged “AI Drafts then Human Refines” ran 17% over plan. The same team used the same tools on tasks tagged “Human Outlines, then AI Expands, then Human Edits.” Those tasks ran 210% over plan. Same team and same tools, but very different time signatures.

Most teams I talk to debate which AI tool to license. Based on our data, the more important variable is how humans and AI hand work back and forth. Handoff structure determines how much rework and validation the human has to absorb.
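A comparison like this comes from pooling planned and actual hours within each tag value. Here is a minimal sketch, assuming tasks are exported as plain dictionaries. The field names and the sample rows are made up for illustration; they are not the 27 tasks from our dataset, and the percentages they produce are not our results.

```python
from collections import defaultdict


def overrun_by_tag(tasks, tag):
    """Return {tag value: percent over plan}, pooling hours within each group."""
    planned = defaultdict(float)
    actual = defaultdict(float)
    for task in tasks:
        planned[task[tag]] += task["planned_hours"]
        actual[task[tag]] += task["actual_hours"]
    # Groups with no planned hours are skipped to avoid dividing by zero.
    return {
        value: 100 * (actual[value] - planned[value]) / planned[value]
        for value in planned
        if planned[value] > 0
    }


# Invented rows, for illustration only.
sample_tasks = [
    {"collaboration_model": "AI Drafts then Human Refines",
     "planned_hours": 10, "actual_hours": 12},
    {"collaboration_model": "AI Drafts then Human Refines",
     "planned_hours": 10, "actual_hours": 11},
    {"collaboration_model": "Human Outlines, then AI Expands, then Human Edits",
     "planned_hours": 2, "actual_hours": 6},
]

by_model = overrun_by_tag(sample_tasks, "collaboration_model")
for model, pct in by_model.items():
    print(f"{model}: {pct:.0f}% over plan")
```

The same function works for any other tag, such as a readiness barrier field, by passing a different key. Pooling hours before dividing weights large tasks more heavily than averaging each task's own percentage would, which is a design choice worth deciding on deliberately.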
Pattern 2: “Ready to Adopt” Was Our Most Expensive Readiness State

Tasks tagged with no AI readiness barrier ran 235% over plan. By comparison, tasks tagged “Risk and Compliance Concerns” ran only 11% over plan. When the team felt no friction, they planned less rigorously, and the work expanded to fill the unprepared space.

Caution forced better planning. Teams that flagged compliance concerns built in time for legal review, slowed down on prompt design, and ended up closer to the plan. By contrast, the team that felt comfortable with AI also felt comfortable underestimating, and the data caught it.

This was the most counterintuitive pattern in the dataset. Since then, it has changed how I scope any AI-touched engagement.

Pattern 3: Most AI Work Was Not in the Original Plan

Across most projects, AI-tagged tasks were not in the original scope. They emerged during execution. That is not a planning failure on its own. Instead, it is a sign that AI enabled work the team could not have scoped in advance. This is the Capability conversion mechanism showing up in real data: work that would not have existed without the tool.

Most measurement frameworks miss this entirely. The reason is that they start from a planned-versus-actual frame and treat unplanned tasks as scope creep by default. Our data suggests that some of that “creep” is the actual value showing up in the AI productivity data.

What HR Leaders Can Borrow From Our AI Productivity Measurement Setup

If you are running AI productivity measurement in your own organization, three takeaways from our data may be worth testing against yours.

First, do not stop at saved hours. Build the conversion mechanism (Speed, Quality, Capability, or Risk) into your tagging. That way, you can tell which kind of value showed up, not just whether time went up or down.

Second, track the collaboration model separately from the tool choice.
The handoff structure between human and AI is doing more work in the time data than the tool brand is.

Third, watch your “no barrier” tasks. Comfort with AI correlated, in our data, with looser planning and bigger overruns.

A simple decision framework for HR Directors: when an AI-touched project goes over plan, do not default to “AI failed.” Ask three questions before drawing that conclusion. Did the deliverable get materially better? Has the team taken on work it could not have done before? Were we able to reduce a downstream risk? If yes to any of those, the value is real even when the hours went the wrong way. A “no” on all three indicates genuine scope creep, which is also useful information.

Honest AI productivity measurement is going to beat the optimistic version on every multi-quarter horizon. CFOs will not trust speed numbers that do not survive scrutiny. The fastest way to lose credibility on AI investment is to report a number that contradicts what people are seeing on the ground. Our 22% overrun is a number our team recognizes when they see it, which is what makes it usable.

Key Takeaways

AI productivity measurement that stops at saved hours misses where the value actually shows up first in expert knowledge work: Quality, Capability, and Risk reduction, not Speed.

Our team ran 22% over plan across 27 AI-tagged tasks. That overrun reflects investment in prompt design, output validation, and workflow patterns the team did not have before.

The collaboration model is a stronger predictor of time signature than tool choice. How human and AI hand work back and forth matters more than which AI brand the team uses.

“Ready to adopt” tasks ran 235% over plan in our data. Comfort with AI correlated with looser planning. Caution forced better scoping.

Tag every AI-touched task with a conversion mechanism so you can answer the CFO question: where did the value actually land?

Quick Implementation Checklist

Pick one AI-touched workstream to instrument first.
Do not boil the ocean.

Add five tags to every task in that workstream. The tags are AI integration level, collaboration model, tool used, readiness barrier, and conversion mechanism (Speed, Quality, Capability, or Risk).

Record planned versus actual hours on every tagged task. Without that comparison, the conversion data has no anchor.

Wait until you have at least 25 completed tagged tasks before drawing any conclusion. Smaller samples will mislead.

Run the analysis monthly once the volume is there. Look for patterns by collaboration model and by readiness barrier, not just averages.

Report findings honestly. If the headline number is bad, lead with it and explain the conversion data underneath. CFOs trust numbers that contradict expectations more than numbers that confirm them.

Frequently Asked Questions About AI Productivity Measurement

AI Productivity Measurement Questions From HR Leaders

Q: How big a sample size do I need before AI productivity measurement gives me a real signal?
A: At least 25 completed AI-tagged tasks per workstream is a reasonable minimum. Below that threshold, individual outliers will distort the patterns. Above it, the collaboration model and readiness barrier patterns become readable. Importantly, the goal is not statistical certainty; the goal is enough data to challenge your assumptions.

Q: We do not have a project management system that supports custom tags. Can we still do this?
A: Yes, though it is slower. Build a parallel spreadsheet with the same five tag categories and update it weekly. Many teams start here before they invest in tooling. After a quarter of manual tracking, you will know whether the patterns are worth automating.

Q: What is the single most important field to add if I can only add one?
A: Conversion mechanism. Without a Speed, Quality, Capability, or Risk tag, you cannot interpret time data correctly.
Specifically, an over-plan task with a “Capability” tag is a different story from an over-plan task with a “Speed” tag. That difference matters when you report to leadership.

AI Productivity Measurement Questions From Executives

Q: How do I explain a 22% over-plan number to my CFO without losing credibility on AI investment?
A: Lead with the number, not around it. Then walk the CFO through the conversion mechanism. Identify which AI-touched tasks delivered Quality improvement. Note which delivered Capability the team did not have before. Then identify which reduced the downstream risk. CFOs trust people who report data that contradicts expectations and explain it. They do not trust people whose numbers always confirm the budget pitch.

Q: Is AI productivity measurement going to look the same in expert knowledge work as in transactional work?
A: No. Transactional work includes claims processing, basic data entry, and ticket routing. That kind of work converts saved hours to Speed quickly because the work is well-defined, and AI replaces a clear step. Expert knowledge work (strategy, advisory, custom analysis) converts saved hours to Quality and Capability first, and to Speed last. Both are real AI adoption stories. They just unfold on different timelines.

Q: When should we expect to see Speed conversion in expert knowledge work?
A: Based on what we are seeing, Speed conversion in expert work typically appears 12 to 24 months into deliberate AI adoption, after the team has built repeatable patterns. The McKinsey State of AI 2025 survey reports a similar pattern at scale, with most organizations realizing quality and capability gains earlier than measurable speed gains. Until then, expect Quality and Capability gains and measure those instead. Otherwise, you will report an early-stage adoption story as a failure when it is actually working as expected.

AI Productivity Measurement Questions From Implementation Teams

Q: Our team is comfortable with AI.
Should we still flag readiness barriers?
A: Especially if you are comfortable. In our data, “no barrier” tasks ran the furthest over plan. Comfort correlated with looser scoping. Force the team to name at least one barrier on every tagged task, even if it is a small one. The planning discipline that follows is where the time accuracy comes from.

Q: Does this approach apply to teams using SimplyMerit or other compensation administration platforms?
A: Yes, with one adjustment. Platform deployments have a more defined scope than open-ended advisory work, so Speed conversion tends to show up earlier. Still, the same four mechanisms apply. Tagging deployment tasks by conversion mechanism gives you a cleaner read on where AI is helping versus where the team is still building patterns.

Want to Compare Notes on Your Own AI Productivity Measurement?

If you are building AI work tracking on your own team and want to compare notes, reach out. I am happy to share the tagging schema we use internally and walk through what each field is meant to catch. Honest measurement is more useful when more teams are doing it the same way.

Contact the MorganHR team

Author’s note: This post was drafted with AI assistance, edited by hand, and logged in the same Wrike tagging system referenced above. Conversion mechanism: Quality (the AI helped me structure a long-form argument I had been circling for two weeks). Time signature: roughly 4 hours actual against 1.5 hours planned. About what the data above predicted.

About the Author: Alex Morgan

As a Senior Compensation Consultant for MorganHR, Inc. and an expert in the field since 2013, Alex Morgan excels in providing clients with top-notch performance management and compensation consultation. Alex specializes in delivering tailored solutions to clients in the areas of market and pay analyses, job evaluations, organizational design, HR technology, and more.