Show HN: DDL to Data – Generate realistic test data from a SQL schema

Hacker News

A new tool called "DDL to Data" launched on Hacker News lets users generate realistic test data directly from SQL CREATE TABLE statements. It aims to preserve foreign key relationships, generate data that conforms to constraints and column types, and support PostgreSQL and MySQL with no additional setup.

Paste your CREATE TABLE statements, get realistic test data back. It parses your schema, preserves foreign key relationships, and generates data that looks real: emails look like emails, timestamps are reasonable, and uniqueness constraints are honored.

No setup, no config. Works with PostgreSQL and MySQL.

https://ddltodata.com
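To make the idea concrete, here's a rough stdlib-only sketch of the kind of generation involved (hypothetical internals, not the actual implementation; the table shape is an assumed example):

```python
import random
import string
from datetime import datetime, timedelta

# Hypothetical sketch of the generation step, not the tool's actual code.
# Assumes a table like:
#   CREATE TABLE users (id INT PRIMARY KEY, email VARCHAR UNIQUE,
#                       created_at TIMESTAMP);
random.seed(42)

def fake_email(seen):
    # Retry until the UNIQUE constraint is honored.
    while True:
        name = "".join(random.choices(string.ascii_lowercase, k=8))
        email = f"{name}@example.com"
        if email not in seen:
            seen.add(email)
            return email

def fake_timestamp():
    # A "reasonable" timestamp: some moment during 2024.
    start = datetime(2024, 1, 1)
    return (start + timedelta(seconds=random.randint(0, 364 * 86400))).isoformat()

seen = set()
rows = [(i, fake_email(seen), fake_timestamp()) for i in range(1, 6)]
for r in rows:
    print(r)
```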

Would love feedback from anyone who deals with test data or staging environments. What's missing?

I like the concept, but the pain point has never been creating realistic-looking emails and the like; it's creating data that is realistic in terms of the business domain and in terms of volume.

If one were able to use metrics as a source, then, depending on the quality of the metrics, it might be possible to distribute data in a manner similar to what's observed in production: some users far more active than others, for example. A major issue with testing is that you can't accurately benchmark changes or migrations against a staging environment that is 1% the size of your prod one, so that would be a huge win even if the data is, for the most part, nonsensical. As long as referential integrity is intact, the specifics matter less.

Domain-specific stuff is harder to describe, I think. For example, in my setup I'd want seeds of valid train journeys over multiple legs. There's a lot of detail in that, and the shortcut is basically to try to source it from prod in some way.
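That kind of activity skew can be sketched with a heavy-tailed distribution; here the Pareto shape parameter and the counts are made up, whereas real weights would come from production metrics:

```python
import random
from collections import Counter

# Sketch of production-like skew: give each user a Pareto-distributed
# activity weight, so a few users generate most events. The alpha value
# and counts here are invented; real ones would come from prod metrics.
random.seed(0)

NUM_USERS, NUM_EVENTS = 1_000, 100_000
weights = [random.paretovariate(1.16) for _ in range(NUM_USERS)]

# Assign each event to a user proportionally to that user's weight.
events = random.choices(range(NUM_USERS), weights=weights, k=NUM_EVENTS)

counts = Counter(events)
top_20pct = sum(n for _, n in counts.most_common(NUM_USERS // 5))
print(f"top 20% of users own {top_20pct / NUM_EVENTS:.0%} of events")
```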

I've written seed data scripts a number of times, so I get the need. How do you think about creating larger amounts of data?

E.g., I'm building a statistical product where the seed data needs to be 1M rows; performance differences between implementations start to matter.

Streaming: Can't hold it all in memory. Generate in chunks, write, release, repeat.
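A minimal sketch of that chunked pattern (the row shape and sink are stand-ins):

```python
import csv
import io
import random
import string

def gen_rows(n):
    """Lazily yield rows so the full dataset never sits in memory at once."""
    for i in range(1, n + 1):
        name = "".join(random.choices(string.ascii_lowercase, k=8))
        yield (i, f"{name}@example.com")

def write_chunks(total, chunk_size, writer):
    """Generate a chunk, write it, release it, repeat."""
    buf = []
    for row in gen_rows(total):
        buf.append(row)
        if len(buf) >= chunk_size:
            writer.writerows(buf)
            buf.clear()  # free the chunk before generating the next one
    if buf:
        writer.writerows(buf)

out = io.StringIO()  # stands in for a real file, pipe, or socket
write_chunks(total=10_000, chunk_size=1_000, writer=csv.writer(out))
print(len(out.getvalue().splitlines()))
```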

Format choice: Parquet with row groups is fast and compresses well. SQL needs batched inserts (~1000 rows/statement). Direct DB writes via COPY skip SQL serialization entirely and are usually fastest.
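The batched-INSERT part can be sketched like this (illustrative table and columns; the quoting here is deliberately minimal, a real generator would escape per dialect):

```python
def sql_literal(v):
    # Minimal quoting for the sketch; a real generator escapes per dialect.
    if isinstance(v, (int, float)):
        return str(v)
    return "'" + str(v).replace("'", "''") + "'"

def batched_inserts(table, columns, rows, batch_size=1000):
    """Yield multi-row INSERT statements, ~batch_size rows per statement."""
    col_list = ", ".join(columns)
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        values = ",\n".join(
            "(" + ", ".join(sql_literal(v) for v in row) + ")" for row in batch
        )
        yield f"INSERT INTO {table} ({col_list}) VALUES\n{values};"

rows = [(i, f"user{i}@example.com") for i in range(1, 2501)]
stmts = list(batched_inserts("users", ["id", "email"], rows))
print(len(stmts))  # 2,500 rows -> 3 statements (1000 + 1000 + 500)
```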

FK relationships: The real bottleneck. Pre-generate parent PKs, hold in memory, reference for children. Gets tricky with complex graphs at scale.
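The pre-generated-PK idea, sketched with made-up table names:

```python
import random

random.seed(1)

# Pre-generate parent primary keys, hold them in memory, and have every
# child row pick one, so foreign keys resolve by construction.
# (Table and column names are illustrative.)
parent_pks = list(range(1, 101))                     # 100 rows for "users"
pk_set = set(parent_pks)

orders = [
    {"id": i, "user_id": random.choice(parent_pks)}  # FK into users.id
    for i in range(1, 1001)                          # 1,000 rows for "orders"
]

ok = all(o["user_id"] in pk_set for o in orders)
print(ok)
```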

Parallelization: Row generation is embarrassingly parallel, but writes are serial. Chunk-then-merge is on our radar but not shipped yet.
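A toy sketch of chunk-then-merge (this is our rough idea, not shipped code; for CPU-bound Python, processes would be the real win, threads just keep the sketch short):

```python
from concurrent.futures import ThreadPoolExecutor

def gen_chunk(args):
    """Generate one independent chunk of rows (embarrassingly parallel)."""
    start, size = args
    return [(i, f"user{i}@example.com") for i in range(start, start + size)]

CHUNK = 1_000
ranges = [(s, CHUNK) for s in range(1, 10_001, CHUNK)]

# Generate chunks concurrently, then merge and write serially.
with ThreadPoolExecutor() as pool:
    chunks = list(pool.map(gen_chunk, ranges))  # map preserves chunk order

rows = [row for chunk in chunks for row in chunk]
print(len(rows))
```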

What does your stat product need: realistic distributions, or pure volume/stress testing?

The pricing seems extremely high for what's basically a call to https://github.com/faker-ruby/faker but that makes sense if it has to pay for OpenAI tokens.

(who knows though, plenty of B2B deals signed for sillier things than this - good luck, OP)

The difference from Faker: you don't write any code. Paste your CREATE TABLE, get data back. Faker is a library you have to integrate, configure field-by-field, and maintain as your schema changes. Different use case — more like "I need a seeded database in 30 seconds" vs "I'm building a test suite."

Fair point on pricing though, still figuring that out. Appreciate the feedback.

Related articles

  1. Show HN: SQL-tap – Live SQL traffic viewer for PostgreSQL and MySQL

    2 months ago

  2. Show HN: ShapedQL – SQL engine for multi-stage ranking and RAG

    3 months ago

  3. Do you really need a database?

    8 days ago

  4. Sqldef: Idempotent schema management for MySQL, PostgreSQL, and SQLite

    3 months ago

  5. Pure-SQL chess: a playable board built without JavaScript

    25 days ago