[Tagdb] Hi Everyone
Jeremy Dunck
jdunck at gmail.com
Tue Jun 28 15:52:29 GMT 2005
On 6/28/05, pow <jingxunpow at gmail.com> wrote:
> Hi all,
>
> I took the liberty of reading up a little on data warehousing (got
> meself kimball/inmon standard texts) and have finally opened up my eyes
> to the datawarehousing.
>
> The idea of being able to slice and dice the information with whatever
> ad hoc queries any one could come up with is perfect for Data 2.0, with
> possibilites to filter tags/users/counts/urls/domains/CDs etc, by just
> as many dimensions.
>
> Some points to note are:
> 1. Conventional datawarehousing assumes ETL, and that data will be
> coming in batches from the sources be it hourly/daily/nightly/weekly
> etc. Tag applications, however, require that the fact/dimension tables
> are updated on the fly as and when new tags/objects/users are created.
> The warehouse IS the actual DB that live "transactions" are occuring in.
Large systems often trade off accuracy for speed. What if popular
tags, etc, were driven off of fact tables, but the user interaction
lived off of views over pending transactions + history or some similar
strategy?
> 2. Tag queries often involve a lot of self joins, such as "related tag"
> queries, intersections, etc. Although I myself have not done any actual
> tests, would self joining a billion row fact table with itself multiple
> times cause big problems?
As long as you don't need to self-join whole tables, you could create
subqueries to reduce the effect, probably.
>This way, at any point in time, I can easily do a : SELECT
> tagcounter FROM tag ORDER by tagcounter DESC LIMIT 0,25. This will be an
> pretty instant query with minimal strain on the DB.
Actually, ordering a large table is very expensive, assuming you don't
have an index on the column. What if physical storage was partitioned
and you just took large datafile sizes to mean popular? ;-)
> 4. Finally, there also exist issues of labelling, subscriptions, etc. If
> we follow delicious's method of labelling, I have found that it is quite
> a challenge to implement.
I'm not familiar with this labelling bit. Can you point me at some
discussion of it?
>But soon abandoned the idea after realizing
> that writing a query that returned me a list of All labels/tags under
> those labels (with object counts), along with all tags that did not
> currently have any label mapped to them, was waaaaayyyy too complex.
Again, trade off accuracy for performance. You don't really need up
to the minute answers on stuff like this. Even google says
"Results 1 - 10 of about 270,000,000 "
So what you want is a separate table with periodic updates of these
statistics, or similar.
> I
> managed to achieve this via 2 queries (still quite complex, involving
> subqueries + joins + unions) which performed fine with a db of ~10k. but
> once i populated the DBs to about 1m, it was way too slow.
How are you populating the DB? If you're stuffing random data, this
can skew results, especially if real data is normal, and the columns
you're interested in are indexed (or physically clustered).
Anyway, can you give links to the books you're referring to?
Reminder: I'm talking out of my ass on this stuff; I've not run large DBs much.
More information about the Tagdb
mailing list