[Tagdb] Hi Everyone

pow jingxunpow at gmail.com
Tue Jun 28 15:30:55 GMT 2005


Hi all,

I took the liberty of reading up a little on data warehousing (got 
meself kimball/inmon standard texts) and have finally opened up my eyes 
to the datawarehousing.

The idea of being able to slice and dice the information with whatever 
ad hoc queries any one could come up with is perfect for Data 2.0, with 
possibilites to filter tags/users/counts/urls/domains/CDs etc, by just 
as many dimensions.

Some points to note are:
1. Conventional datawarehousing assumes ETL, and that data will be 
coming in batches from the sources be it hourly/daily/nightly/weekly 
etc. Tag applications, however, require that the fact/dimension tables 
are updated on the fly as and when new tags/objects/users are created. 
The warehouse IS the actual DB that live "transactions" are occuring in.

2. Tag queries often involve a lot of self joins, such as "related tag" 
queries, intersections, etc. Although I myself have not done any actual 
tests, would self joining a billion row fact table with itself multiple 
times cause big problems?

3. There still is the problem of doing counts over the entire database. 
Queries such as "Popular tags" (not filterd down to any particular 
object or user) require a count of all tagids over the possibly billion 
row long fact table. There doesn't seem to be any real solution that 
will allow this to happen quickly w/o much taxing of the db. The only 
solution i can think of is to have an indexed "counter" field for the 
Tag table. Anytime a tag is used to map something, there will be an 
UPDATE to increase the counter by one.
Like wise, if a user deletes a tag mapping, the counter will be 
decreased. This way, at any point in time, I can easily do a : SELECT 
tagcounter FROM tag ORDER by tagcounter DESC LIMIT 0,25. This will be an 
pretty instant query with minimal strain on the DB. However, this will 
increase the insert times of tag mappings by many times as every tag 
mapping /deletion needs more update queries to increase the counters. 
Whether this is worth the ability to instantly see which tags in the 
billion row table are most used, is subject to question.

4. Finally, there also exist issues of labelling, subscriptions, etc. If 
we follow delicious's method of labelling, I have found that it is quite 
a challenge to implement. This is because: A) A user creates a label, 
and can assign ANY tag he wants to be under that label, regardless of 
whether he is currently using the tag or not. B) Tags can be members of 
more than one label. I came up with a simple schema that had another 
"tag-label mapping" table. But soon abandoned the idea after realizing 
that writing  a query that returned me a list of All labels/tags under 
those labels (with object counts), along with all tags that did not 
currently have any label mapped to them, was waaaaayyyy too complex. I 
managed to achieve this via 2 queries (still quite complex, involving 
subqueries + joins + unions) which performed fine with a db of ~10k. but 
once i populated the DBs to about 1m, it was way too slow.


Hopefully we will be able to have some discussion over the above 
mentioned issues.
Powster


More information about the Tagdb mailing list