


Building an AI Data Analyst: The Engineering Nightmares Nobody Warns You About
Building production AI is 20% models, 80% engineering. Discover how Harbor AI evolved into a secure analytical engine using table-level isolation, tiered memory, and specialized tools. A deep dive into moving beyond prompt engineering to reliable architecture


When we launched Harbor AI in October 2024, we thought we were shipping a chatbot. Three months and two complete rewrites later, we realized we'd accidentally built something far more complex: a real-time analytical engine that combines conversational AI, statistical computing, visualization generation, and multi-tenant data isolation, all while streaming results in a matter of seconds.
This is the story of the engineering challenges we didn't see coming, and the solutions we built to solve them.
The Security Nightmare
Here's the mistake we almost made: giving the AI full database access and relying on prompt engineering to keep it honest.
Our initial design was beautifully simple: connect to the database and let the AI write SQL queries. We'd filter by user schema in prompts. Standard row-level security. Should be fine, right?
Wrong. Catastrophically wrong.
The problem revealed itself during security review: LangChain agents are autonomous. They write their own SQL. What if the AI simply writes a query that ignores your filters? Or gets creative with JOINs? Or someone manipulates the prompt to bypass restrictions?
We couldn't trust prompt engineering to enforce security. Prompts can be manipulated, forgotten, or creatively interpreted by the model.
The Solution: Read-Only Table-Level Isolation
We went nuclear on database security. Every AI connection gets credentials that can only SELECT from one specific table, nothing else.
The beauty of this approach: no matter how a prompt is manipulated or misinterpreted, the database itself refuses anything beyond a SELECT on that one table.
The cost: we had to build a credential management system that creates database users, grants permissions, and handles cleanup on account termination. But the security guarantee is absolute.
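As a rough illustration, here is the kind of DDL such a credential manager might issue. The role name, table name, and schema are hypothetical, and real code would parameterize the password safely rather than interpolate it.

```python
# Hypothetical sketch of the DDL a credential manager might issue to create
# a read-only, single-table database role. Names are illustrative only.

def scoped_readonly_ddl(username: str, password: str, table: str, schema: str = "public") -> list[str]:
    """Generate statements that grant SELECT on exactly one table and nothing else."""
    return [
        # A login role with no elevated capabilities.
        f"CREATE ROLE {username} WITH LOGIN PASSWORD '{password}' "
        f"NOSUPERUSER NOCREATEDB NOCREATEROLE",
        # Start from zero access, then grant only what's needed.
        f"REVOKE ALL ON SCHEMA {schema} FROM {username}",
        f"GRANT USAGE ON SCHEMA {schema} TO {username}",
        # The single permission this role will ever hold:
        f"GRANT SELECT ON {schema}.{table} TO {username}",
    ]

stmts = scoped_readonly_ddl("harbor_ai_user_42", "s3cret", "sensor_readings")
for s in stmts:
    print(s + ";")
```

Because the grant lives in the database, not the prompt, there is nothing for a manipulated model to bypass.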
The Memory Architecture: From Chat History to Contextual Intelligence
Our first memory implementation was hilariously naive: just dump everything into Redis and send all messages to OpenAI every time.
Within a week, users were hitting context limits. A 30-message conversation would consume 80K tokens. One power user's session cost us a lot of money in a single afternoon.
The Three-Tier Memory System
We built a memory architecture inspired by how human memory actually works:
Working Memory (Last 10 Messages): Recent context stays as complete, raw messages. Critical for immediate follow-ups.
Short-Term Memory (Summary): Messages 11-50 get compressed into a narrative summary using GPT-4o-mini. We preserve user's domain, recurring concerns, established baselines, and previous decisions, not word-for-word transcripts.
Long-Term Memory (Metadata Cache): Schema information, available metrics, date ranges, and data statistics get cached in Redis. This gets injected into every request as context without repeated database queries.
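A minimal sketch of how the three tiers might be assembled per request. The window sizes match the text, but the function and field names are our own, and the summarizer is stubbed where production would call GPT-4o-mini.

```python
# Sketch of the three-tier split, assuming a summarize() callable
# (in production this would call GPT-4o-mini; here it's a stub).

WORKING_WINDOW = 10   # tier 1: raw recent messages
SUMMARY_WINDOW = 50   # tier 2: messages 11-50 get compressed

def build_context(messages: list[str], metadata: dict, summarize) -> dict:
    """Split history into working memory, a summary, and permanent metadata."""
    working = messages[-WORKING_WINDOW:]               # raw, for immediate follow-ups
    older = messages[-SUMMARY_WINDOW:-WORKING_WINDOW]  # compressed into a narrative
    return {
        "metadata": metadata,                          # tier 3: cached schema/stats
        "summary": summarize(older) if older else "",
        "working_memory": working,
    }

ctx = build_context(
    [f"msg {i}" for i in range(30)],
    {"metrics": ["temperature", "humidity"]},
    summarize=lambda msgs: f"{len(msgs)} earlier messages summarized",
)
print(len(ctx["working_memory"]), ctx["summary"])
```

Only the ten raw messages and one short summary reach the model, regardless of how long the conversation runs.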
The result: Token usage dropped massively. At scale, that's hundreds of dollars saved daily.
Metadata as Permanent Context
Schema information (available metrics, date ranges, data statistics) gets cached and injected into every request.
The AI always knows what data exists without querying. This eliminates entire classes of "I can't find that metric" errors.
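A sketch of what injecting that cached metadata could look like. The cache shape and field names are assumptions, not Harbor AI's actual schema.

```python
# Sketch of rendering cached schema metadata as a prompt preamble,
# so the AI never has to query the database to learn what data exists.
# The cache structure here is an assumption.

def metadata_context(cache: dict) -> str:
    """Render cached metadata as a context block for every request."""
    lines = ["Available data:"]
    for metric, info in cache["metrics"].items():
        lines.append(
            f"- {metric}: {info['points']} points, "
            f"{info['start']} to {info['end']}"
        )
    return "\n".join(lines)

cache = {
    "metrics": {
        "temperature": {"points": 604800, "start": "2024-10-01", "end": "2024-10-08"},
        "humidity": {"points": 604800, "start": "2024-10-01", "end": "2024-10-08"},
    }
}
print(metadata_context(cache))
```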
The Visualization Pipeline: From Matplotlib to User's Screen in Seconds
Generating charts in production is harder than it sounds. Matplotlib is synchronous and slow, blocking your async event loop for 2-3 seconds per chart.
Problem 1: Render Performance
We switched to the Agg backend (no GUI, no display server) and disabled font caching to prevent threading conflicts.
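A minimal sketch of headless rendering with the Agg backend. The key detail is selecting the backend before pyplot is imported, then rendering straight into an in-memory buffer instead of touching the display or the disk.

```python
# Headless chart rendering: select the Agg backend *before* importing pyplot.
import io

import matplotlib
matplotlib.use("Agg")          # no GUI, no display server required
import matplotlib.pyplot as plt

def render_png(xs, ys) -> bytes:
    """Render a simple line chart to PNG bytes entirely in memory."""
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.plot(xs, ys)
    buf = io.BytesIO()
    fig.savefig(buf, format="png", dpi=100)
    plt.close(fig)             # free the figure; leaks add up fast in a server
    return buf.getvalue()

png = render_png(range(10), [x * x for x in range(10)])
print(len(png), "bytes")
```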
This cut rendering time significantly for complex multi-panel plots.
Problem 2: Storage and Retrieval
We generate a lot of charts per day. Disk storage is slow and doesn't scale horizontally. S3 adds network latency. Database BLOBs kill performance.
Solution: Redis with Base64 encoding and automatic TTL.
Redis is in-memory (retrieval <5ms), auto-cleanup means no cron jobs, and it clusters horizontally. At 100KB per image and, say, 5,000 active images, we're only using 500MB of Redis memory. Having images disappear after seven days isn't an ideal user experience, but it's acceptable for time-based analytics: since the data is inherently time-driven, charts typically become stale beyond that window anyway.
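A sketch of the chart store, assuming a redis-py-style client with `setex`/`get`. A dict-backed fake stands in for a live Redis server here so the example is self-contained; real code would use `redis.Redis()`.

```python
# Sketch of storing rendered charts in Redis as Base64 with an automatic TTL.
import base64

IMAGE_TTL = 7 * 24 * 3600  # seven days, matching the retention window above

def store_chart(client, chart_id: str, png: bytes) -> str:
    """Base64-encode a rendered chart and store it with an automatic expiry."""
    key = f"chart:{chart_id}"
    client.setex(key, IMAGE_TTL, base64.b64encode(png).decode())
    return key

def load_chart(client, key: str):
    raw = client.get(key)
    return base64.b64decode(raw) if raw else None

class FakeRedis:
    """In-memory stand-in for redis.Redis(), for illustration only."""
    def __init__(self):
        self.data = {}
    def setex(self, key, ttl, value):
        self.data[key] = value  # the fake ignores TTL; real Redis expires the key
    def get(self, key):
        return self.data.get(key)

r = FakeRedis()
key = store_chart(r, "abc123", b"\x89PNG fake bytes")
print(key, load_chart(r, key))
```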
We do plan to support long-term message history in the future, but for now, the additional complexity of syncing and long-term storage does not justify the development effort.
From SQL to Statistical Computing
The biggest architectural shift was moving from "AI writes SQL" to "AI orchestrates specialized tools."
The Problem with Raw SQL
Letting the AI write arbitrary SQL failed in three ways: it couldn't be trusted to respect security boundaries, it fumbled statistical computations that belong in tested code, and unbounded queries drowned it in raw data.
The Solution: Domain-Specific Tools
We built 15+ specialized tools that each do ONE thing expertly. Instead of teaching the AI statistics via prompts, we gave it functions that encapsulate statistical expertise.
For example, consider our anomaly detection tool.
The key insight: Time-series analysis has well-established patterns. Why make the AI rediscover Z-score calculation every time? The tool handles the how, the AI handles the why and what it means.
Each tool automatically generates visualizations, handles missing data, skips insufficient samples, and returns both narrative and visual results. The AI orchestrates them intelligently based on user questions.
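A simplified sketch of what a Z-score-based anomaly tool with those guardrails might look like. The threshold, sample minimum, and function names are illustrative; the real tool also generates a visualization alongside the narrative result.

```python
# Sketch of a Z-score anomaly tool with the guardrails described above:
# skip short series, skip zero-variance series, flag |z| > threshold.
from statistics import mean, stdev

def detect_anomalies(values: list[float], threshold: float = 3.0):
    """Return (indices of anomalies, narrative note)."""
    if len(values) < 10:
        return [], "insufficient samples for anomaly detection"
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return [], "constant-value metric (zero variance), skipped"
    anomalies = [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]
    return anomalies, f"mean={mu:.2f}, stdev={sigma:.2f}, threshold={threshold}"

# A flat series with one obvious spike at index 30.
data = [20.0] * 30 + [95.0] + [20.0] * 30
idx, note = detect_anomalies(data)
print(idx, note)
```

The model never computes a Z-score; it only decides that anomaly detection is the right question to ask, then explains the answer.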
The Data Fetching Layer
Here's the secret sauce for handling large time-series datasets without drowning in data. Don't handle large time-series data.
Dynamic Downsampling
We automatically choose an aggregation bucket size based on the requested time range, then use TimescaleDB's time_bucket to pre-aggregate.
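A sketch of dynamic bucket selection feeding a time_bucket query. The breakpoints are illustrative, not Harbor AI's exact thresholds, and the table and column names are hypothetical.

```python
# Sketch of choosing a TimescaleDB bucket width from the requested time range,
# then building a pre-aggregating query. Breakpoints are illustrative.
from datetime import timedelta

def choose_bucket(time_range: timedelta) -> str:
    """Pick a bucket width so result sets stay small at any zoom level."""
    if time_range <= timedelta(hours=6):
        return "1 minute"
    if time_range <= timedelta(days=2):
        return "5 minutes"
    if time_range <= timedelta(days=14):
        return "10 minutes"
    return "1 hour"

def downsample_query(table: str, metric: str, time_range: timedelta) -> str:
    bucket = choose_bucket(time_range)
    return (
        f"SELECT time_bucket('{bucket}', time) AS bucket, "
        f"avg({metric}) AS value "
        f"FROM {table} GROUP BY bucket ORDER BY bucket"
    )

print(downsample_query("sensor_readings", "temperature", timedelta(days=7)))
```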
The impact: A 7-day query without aggregation could return 604,800 rows. With 10-minute buckets, that drops to just 1,008 rows, roughly 600× fewer rows, dramatically faster queries, and significantly lower memory usage.
The AI still gets the patterns it needs without drowning in raw data, and users aren’t left waiting tens of seconds for queries to complete.
Making AI Feel Responsive
Users hate waiting. Even if analysis takes 15 seconds, you can't leave them staring at a blank screen.
The Multi-Event Stream Protocol
We built a streaming system with distinct event types: status updates, SQL queries, answer tokens, and a final complete-answer event.
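A sketch of the protocol as server-sent events, using the event types described in this section. The payload shapes and the example answer are assumptions.

```python
# Sketch of the multi-event stream as server-sent events (SSE).
# Event names follow this section: status, sql, token, final.
import json

def sse(event: str, data: dict) -> str:
    """Format one SSE frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def stream_answer(question: str):
    """Yield typed events so the frontend can render each one differently."""
    yield sse("status", {"message": "Analyzing question..."})
    yield sse("sql", {"query": "SELECT ..."})   # populates the collapsible debugger
    answer = "Temperature spiked at 14:00."
    for token in answer.split():
        yield sse("token", {"text": token})     # appended to the answer in real time
    yield sse("final", {"answer": answer})      # full answer, guards against dropped tokens

events = list(stream_answer("What happened to temperature?"))
print(events[0], events[-1], sep="")
```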
Frontend: Smart Event Handling
The frontend handles these events differently: status updates show in a progress indicator, SQL queries populate a collapsible debugger, and tokens append to the answer in real time.
The "final" event trick: Network issues or rate limits sometimes cause the streaming LLM to drop tokens. The final event sends the complete answer from memory, ensuring nothing is lost.
This architecture makes the AI feel responsive even when doing heavy computation. Users see exactly what's happening at each step.
The Frontend: Making Complexity Feel Simple
The frontend has to handle streaming text, embedded images with zoom/pan, collapsible SQL debuggers, and smooth auto-scrolling, all while feeling snappy.
The Auto-Scroll Problem
Scrolling sounds trivial. It's not. You need to scroll smoothly as tokens arrive (or it's jarring), but don't scroll if the user has manually scrolled up to read something.
useLayoutEffect fires before paint (instant repositioning); useEffect fires after paint (smooth scrolling). Using the right one for each update gives the UX we wanted.
The Image Modal: Zoom/Pan with Touch Support
Users needed to examine charts closely, especially on mobile. We built a full-featured viewer with mouse wheel zoom, click-and-drag panning, and unified touch/mouse handling.
Key details: translate3d() triggers GPU acceleration for smooth 60fps performance. touch-action: none prevents page scrolling on mobile. Download button extracts base64, creates a blob, and triggers browser download.
Is it the cleanest implementation? Probably not, but for now it works.
Local Storage: Persistent Conversations
Users can refresh the page and continue their conversations without interruption. Conversation history is stored client-side and remains available until the user explicitly clears it. While client-only persistence isn’t ideal from a UX perspective, it provides a simple and reliable solution for the current stage of the product. As we scale and the team grows, we plan to introduce full history synchronization and long-term storage.
The Persona: Teaching the AI to Think Like an Analyst
The most important engineering decision wasn't technical, it was philosophical. Before writing any code, we defined who Harbor AI should be. These principles were embedded directly into the system prompt to shape Harbor AI's reasoning style and responses.
Harbor AI is not designed to behave like a generic chatbot, but like a Principal Data Analyst.
Hypothesis-Driven Analysis
Real analysts form hypotheses before running any analysis. We encoded this rule explicitly in the system prompt.
Because of this rule, when a user asks about temperature spikes, Harbor AI responds with reasoning first:
The analysis is intentional, not reactive.
The “So What?” Test
Raw statistics are rarely useful on their own. To prevent Harbor AI from producing shallow, dashboard-style output, we added a strict rule to the system prompt: every finding must pass the "So what?" test.
Before: Temperature and humidity have a correlation of 0.73.
After: Temperature and humidity are strongly correlated (0.73), suggesting your HVAC system may struggle to remove moisture during heat spikes. This may indicate insufficient dehumidification capacity.
Same data. Completely different value.
Socratic Guidance
When users ask vague or underspecified questions, Harbor AI is instructed not to guess. Instead, the system prompt directs it to clarify intent through targeted follow-up questions.
User: Check the temperature.
Harbor AI: I can analyze temperature in several ways. Are you interested in current readings, unusual spikes this week, or long-term trends over the past month?
This may feel slower initially, but it teaches users how to ask better analytical questions. After a few interactions, users naturally begin asking more precise and sophisticated queries.
Never Mention Tools
Another explicit instruction in the system prompt: never mention internal tools, functions, or implementation details.
Users interact with Harbor AI the analyst, not an LLM pipeline.
Instead of: I’ll use the correlation matrix tool…
Harbor AI says: Let me examine how these metrics relate to each other…
This keeps the experience human, credible, and focused on insight, not mechanics.
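A condensed, hypothetical sketch of how such persona rules might be expressed in a system prompt. This is not Harbor AI's actual prompt, just an illustration of encoding the behaviors above as explicit instructions.

```python
# Hypothetical condensed system prompt encoding the persona rules described
# in this section. Wording is illustrative, not Harbor AI's real prompt.
PERSONA_PROMPT = """\
You are a Principal Data Analyst, not a generic chatbot.

Rules:
1. State a hypothesis before running any analysis.
2. Every statistic must pass the "So what?" test: explain what it means
   for the user's system, not just report the number.
3. If a question is vague, ask one targeted clarifying question
   instead of guessing the user's intent.
4. Never mention internal tools, functions, or implementation details.
"""

print(PERSONA_PROMPT)
```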
The Tool Philosophy: Encapsulating Expertise
Each of our 15+ tools encapsulates domain expertise that would be impossible to teach via prompts:
Anomaly Detection Tool knows how to use Z-scores, handle missing data via interpolation, skip constant-value metrics (zero variance), and visualize results with reference lines for mean ± threshold.
Trend Analysis Tool knows how to perform seasonal decomposition with proper period selection, handle insufficient data gracefully, compute trend slopes, and measure seasonal strength.
Correlation Tool knows how to time-bucket data before correlation (raw timestamps would show zero correlation), filter out metrics with high missingness, and identify significant relationships.
Forecasting Tool knows how to select appropriate seasonality periods, choose between additive and multiplicative models, generate prediction intervals, and validate forecast quality.
The AI doesn't need to know how to compute autocorrelation or perform DBSCAN clustering. It just needs to know when these methods are appropriate. The tools handle implementation details.
The Failure Cases: What We're Still Working On
Honesty time. Here's what still doesn't work well:
Multi-Metric Comparison at Scale
Asking "Compare all 50 of my sensors" bogs down. The AI tries to analyze everything, burning tokens and time. We need intelligent sampling and clustering.
Causal Inference
Harbor AI is great at correlation but can't reliably determine causation. It hedges appropriately ("suggests," "may indicate") but users sometimes want definitive answers we can't provide.
Domain-Specific Knowledge
Harbor AI knows time-series analysis but doesn't know that "0.3 bar" is critically low pressure in your specific industrial process. We're exploring industry-specific fine-tuning.
Proactive Monitoring
Currently it's reactive: you ask, it answers. We want proactive: "Hey, I noticed energy consumption trending up 15% this week. Should I investigate?"
Complex Multi-Turn Investigations
For deep investigations requiring 10+ tool calls, the AI can lose thread. We're experimenting with sub-agent architectures.
The Lessons: What We’d Do Differently
Building Harbor AI forced us to unlearn a lot of “standard” LLM advice. These are the lessons we’d bake in from day one if we were starting again.
1. Start With Hard Security Boundaries, Not Prompts
We rewrote our authentication and access logic twice before accepting a simple truth: prompt engineering is not a security model.
Scoped, read-only, table-level database credentials eliminated entire classes of failures. The AI cannot access what it isn’t physically permitted to read, no matter how clever, confused, or compromised it becomes.
If it’s a multi-tenant system, enforce isolation at the database level on day one.
2. Cache Aggressively, Invalidate Intentionally
Anything that doesn’t change frequently should never be recomputed or refetched.
Schema metadata, metric availability, date ranges, and statistics belong in cache with clear TTLs. This reduced latency, eliminated redundant queries, and simplified prompt construction.
Compute is expensive. Redis is cheap. Use that fact ruthlessly.
3. Streaming Is Not a Feature, It’s a Requirement
Users will tolerate slow analysis. They will not tolerate silence.
If a response takes 15 seconds, they need to see progress: status updates, SQL execution, partial answers. Streaming transformed perceived performance even when actual compute time didn’t change. If your AI blocks the UI, users will assume it’s broken.
4. Treat Visualizations as First-Class Output
In analytical products, charts are not decorations, they are the answer.
Automatically generating, embedding, and streaming visualizations alongside text changed how users reasoned about data. Text explains. Charts convince.
5. Memory Management Is the Product
Context windows are not memory. Passing everything to the model is both expensive and ineffective.
Summarization, tiered memory, and permanent metadata context reduced token usage dramatically while improving answer quality. Without this, we were burning money and still losing important context.
6. Use Tools to Encode Expertise, Not Prompts
We stopped trying to teach the AI statistics via instructions and started giving it tools that already know how to do the right thing.
Z-scores, seasonal decomposition, correlation analysis, forecasting, these are solved problems. Encode them once, correctly, and let the AI focus on deciding when and why to use them. Prompts describe intent. Tools implement truth.
What's Next: Teaching Harbor AI to Teach
Our 2026 roadmap isn't about adding more tools, it's about making Harbor AI a better teacher:
Proactive Learning: After analyzing a pattern, Harbor AI asks: "Would you like me to monitor this and alert you if it deviates?"
Collaborative Sessions: Share analysis threads with your team, add comments, export as reports. The AI becomes shared intelligence.
Industry Specialization: Fine-tuned versions that understand manufacturing, energy, agriculture, not just generic time-series.
Voice Interface: "What happened overnight with the reactor?" while staring at a dashboard at 3 AM.
Custom Tool Creation: Let advanced users define analysis functions that Harbor AI learns to call appropriately.
But the core philosophy remains: Harbor AI is a colleague, not a chatbot. It guides, it teaches, it collaborates. It doesn't just answer questions, it helps you ask better ones.
Conclusion
Building production AI taught us that the model is 20% of the problem. The other 80% is engineering.
Security (database scoping), performance (aggregation, caching), memory (summarization), visualization (generation, storage, serving), streaming (multi-event protocol), these are what make AI actually work at scale.
LangChain got us to demo. Engineering got us to production.
The hardest part wasn't the AI model, GPT-4 is genuinely impressive. The hardest part was everything around it: managing database connections at scale, streaming reliably, controlling costs, handling failures gracefully, and making the system transparent enough that users trust it with their critical data.
Building serious AI systems is not about stuffing everything into a prompt. It’s about discipline: knowing what information the model actually needs, what should be summarized, what should be permanent context, and what should never be shown at all. Context is a scarce resource. Treating it as infinite is one of the fastest ways to build something expensive, slow, and unreliable.
Persona matters just as much. We didn’t ask the model to “be smart”, we taught it how to think. Hypotheses before analysis. Clarification instead of guessing. “So what?” before conclusions. Those behaviors didn’t emerge from better prompts; they came from intentional system design.
And perhaps most importantly, we stopped letting the AI do math.
Models shouldn’t calculate Z-scores, handle missing data, choose aggregation windows, or guess statistical methods. Those are solved problems. The AI’s job is orchestration: deciding when an analysis is needed, which tool applies, and how to explain the result in human terms. The tools handle correctness. The model handles reasoning.
Once we embraced that separation, everything clicked. The AI became more reliable. Costs dropped. Answers became sharper. And user trust increased.
The takeaway is simple: good AI systems are designed, not prompted. They’re built from boundaries, memory, tools, and intent, with the model acting as a coordinator, not a calculator.
That’s the difference between a clever demo and something people can rely on.
