[Tagdb] Hi Everyone
pow
jingxunpow at gmail.com
Tue Jun 28 15:30:55 GMT 2005
Hi all,
I took the liberty of reading up a little on data warehousing (got
meself kimball/inmon standard texts) and have finally opened up my eyes
to the datawarehousing.
The idea of being able to slice and dice the information with whatever
ad hoc queries any one could come up with is perfect for Data 2.0, with
possibilites to filter tags/users/counts/urls/domains/CDs etc, by just
as many dimensions.
Some points to note are:
1. Conventional datawarehousing assumes ETL, and that data will be
coming in batches from the sources be it hourly/daily/nightly/weekly
etc. Tag applications, however, require that the fact/dimension tables
are updated on the fly as and when new tags/objects/users are created.
The warehouse IS the actual DB that live "transactions" are occuring in.
2. Tag queries often involve a lot of self joins, such as "related tag"
queries, intersections, etc. Although I myself have not done any actual
tests, would self joining a billion row fact table with itself multiple
times cause big problems?
3. There still is the problem of doing counts over the entire database.
Queries such as "Popular tags" (not filterd down to any particular
object or user) require a count of all tagids over the possibly billion
row long fact table. There doesn't seem to be any real solution that
will allow this to happen quickly w/o much taxing of the db. The only
solution i can think of is to have an indexed "counter" field for the
Tag table. Anytime a tag is used to map something, there will be an
UPDATE to increase the counter by one.
Like wise, if a user deletes a tag mapping, the counter will be
decreased. This way, at any point in time, I can easily do a : SELECT
tagcounter FROM tag ORDER by tagcounter DESC LIMIT 0,25. This will be an
pretty instant query with minimal strain on the DB. However, this will
increase the insert times of tag mappings by many times as every tag
mapping /deletion needs more update queries to increase the counters.
Whether this is worth the ability to instantly see which tags in the
billion row table are most used, is subject to question.
4. Finally, there also exist issues of labelling, subscriptions, etc. If
we follow delicious's method of labelling, I have found that it is quite
a challenge to implement. This is because: A) A user creates a label,
and can assign ANY tag he wants to be under that label, regardless of
whether he is currently using the tag or not. B) Tags can be members of
more than one label. I came up with a simple schema that had another
"tag-label mapping" table. But soon abandoned the idea after realizing
that writing a query that returned me a list of All labels/tags under
those labels (with object counts), along with all tags that did not
currently have any label mapped to them, was waaaaayyyy too complex. I
managed to achieve this via 2 queries (still quite complex, involving
subqueries + joins + unions) which performed fine with a db of ~10k. but
once i populated the DBs to about 1m, it was way too slow.
Hopefully we will be able to have some discussion over the above
mentioned issues.
Powster
More information about the Tagdb
mailing list