Hacking Drupal views and taxonomy: looking for 100% matching of terms

| drupal, geek

I’m working on a Drupal 6 site that helps match volunteers to speaking opportunities, or sessions. I use Taxonomy to keep track of the qualifications so that I can maintain the qualification hierarchy. Given a list of subject areas that a person is interested in, I need to find all sessions that match any of those subject areas. The quirk: the session must have at least one of the person’s terms, and the person must also have all the session’s terms.

Let’s say that our volunteer is interested in speaking about biology and physics. I couldn’t use a straightforward AND search. If I searched for biology AND physics, I wouldn’t get sessions for just biology. It also means I can’t use a straightforward OR search, because I shouldn’t list sessions that require both biology AND another subject the person hadn’t listed, such as chemistry.

Views didn’t seem to have a built-in way to do it. I couldn’t think of a standard-ish way to describe my challenge in order to find relevant posts on drupal.org. Content recommendation modules seemed similar, but I wasn’t familiar with any of them enough to know which one would be the closest to hack for my cross-type comparisons and 100% match requirements. So it was time to hack my Views query.

After several attempts, I settled on the approach of precalculating how many terms were associated with each session node. I created a table with the information and used the following query to populate it in my install file.

db_query("INSERT INTO {node_term_count} 
  SELECT nid, vid, count(tid) AS term_count 
  FROM {term_node} GROUP BY nid, vid");

I also used hook_nodeapi to update the table on insert, update, and delete operations.

Then I started experimenting through the SQL console. I used COUNT and GROUP BY to find out how many terms the session had in common with the person. Selecting from that MySQL subquery let me filter the list to the nodes where the total number of terms equaled the number of terms the session had. I ended up with a query that looked like this:

SELECT nid, vid FROM (SELECT tns.nid, tns.vid, 
  COUNT(tns.tid) AS match_count, 
  c.term_count FROM term_node tns 
  INNER JOIN node_term_count c ON tns.vid=c.vid 
  WHERE tns.tid in (55, 56, 42, 39, 41) 
  GROUP BY tns.vid) AS result 
WHERE term_count = match_count;

When I was happy with the query, I used hook_views_pre_execute to change my $view->build_info['query'] and $view->build_info['count_query']. With all the other filters I needed, it eventually looked like this:

    $view->build_info['query'] = "SELECT * FROM (
SELECT tns.nid, tns.vid, count(tns.tid) AS match_count, c.term_count, workflow_node.sid FROM node n 
INNER JOIN term_node tns ON (n.vid=tns.vid AND n.nid=tns.nid)
LEFT JOIN workflow_node workflow_node ON n.nid = workflow_node.nid 
INNER JOIN node_term_count c ON tns.vid=c.vid
INNER JOIN content_type_session session ON (n.nid=session.nid AND n.vid=session.vid)
INNER JOIN node school_node ON (session.field_session_school_nid=school_node.nid)
INNER JOIN content_type_school school ON (school_node.nid=school.nid AND school_node.vid=school.vid)
INNER JOIN content_field_session_dates date ON (n.nid=date.nid AND n.vid=date.vid AND date.delta=0)
WHERE (n.type in ('%s')
AND workflow_node.sid=%d
AND session.field_session_request_mode_value = '%s'
AND (n.status <> %d) 
AND (DATE_FORMAT(ADDTIME(date.field_session_dates_value, SEC_TO_TIME(-14400)), '%Y-%m-%%d') >= '" . date('Y-m-d') . "')
AND school.field_school_district_nid IN ($district_where)
AND tns.tid in ($tid_where))
GROUP BY tns.vid
) as result WHERE term_count = match_count AND match_count > 0";

I used variables like $tid_where and $district_where to simplify the query. They use array_fill to create placeholders for the arguments.

Result: I think it works the way it’s supposed to. It passes my unit tests and manual testing, anyway. If performance becomes an issue, I might precalculate the results and store them in a table. I hope I don’t have to do that, though.

Views 3 is supposed to have arbitrary data stores that let you write views on top of any sort of query or function, but I’m going to stay with Views 2 for now.

Whenever I write about stuff we’re doing with Drupal, I often hear about even awesomer ways to do things. =) Is this one of those times? Is there a little-known module that Does the Right Thing?

You can view 5 comments or e-mail me at sacha@sachachua.com.

5 comments

These are the fun challenges!

It won't help with your PHP and SQL, but I recently learned of the set operations in Python in the course I am taking. I was excited to read your post because I was wondering about additional applications of those methods that I could try. This looks like a good application of the .issubset() method.

I thought I found a PHP example at StackOverflow, but it doesn't seem to meet the needs raised in the question. http://stackoverflow.com/qu...

Thank you for sharing!

I made a similar query to yours; but I used a lot of joins to accomplish the goal.
http://drupalcode.org/proje...

mikeytown2: I see, you're looking for a full match for one set of items. If you felt like rewriting this query, you could try the "count distinct matching items" approach to simplify your joins. You would use something like COUNT(DISTINCT filename_md5) FROM ... WHERE filename_md5 in (....) and then check if that count matches the number of items you're looking for. I wonder how it compares in terms of performance...

My current query
SELECT root.bundle_md5 FROM advagg_bundles AS root
INNER JOIN advagg_bundles AS a USING(bundle_md5)
INNER JOIN advagg_bundles AS b USING(bundle_md5)
INNER JOIN advagg_bundles AS c USING(bundle_md5)
INNER JOIN advagg_bundles AS d USING(bundle_md5)
INNER JOIN advagg_bundles AS e USING(bundle_md5)
INNER JOIN advagg_bundles AS f USING(bundle_md5)
INNER JOIN advagg_bundles AS g USING(bundle_md5)
INNER JOIN advagg_bundles AS h USING(bundle_md5)
INNER JOIN advagg_bundles AS i USING(bundle_md5)
INNER JOIN advagg_bundles AS j USING(bundle_md5)
INNER JOIN advagg_bundles AS ba USING(bundle_md5)
INNER JOIN advagg_bundles AS bb USING(bundle_md5)
WHERE a.filename_md5 = '0dc1450974648a310b26653371f9f487'
AND b.filename_md5 = 'd4f55cb7b3b57b49a99c1147e576d8f8'
AND c.filename_md5 = '62c082f1d04b034856f75ec62d5ea86b'
AND d.filename_md5 = 'ccca38d39379c4951aaadc7a1271880b'
AND e.filename_md5 = 'b80b4eba08abdda2819f2de3b8b05031'
AND f.filename_md5 = '485be5283ce508ad492c513fa39c5e6a'
AND g.filename_md5 = '03b26593d0c248d98bb01cc1fc723678'
AND h.filename_md5 = 'f80f5f67e42850cc967bab8d99f1fa49'
AND i.filename_md5 = '393f98e0b1996cbccb7f297e3eb848ca'
AND j.filename_md5 = 'cf692d77bebd591019f7ee40b55f8fae'
AND ba.filename_md5 = 'f4481865eee1dd1acd14193e18570a9e'
AND bb.filename_md5 = 'd7c2b997cc37719706892882d8432c17'
GROUP BY bundle_md5
ORDER BY bundle_md5 ASC

Returns 49 Records and runs in 0.0313 sec

Your proposed query

SELECT DISTINCT bundle_md5
FROM advagg_bundles
WHERE filename_md5 IN (
'0dc1450974648a310b26653371f9f487',
'd4f55cb7b3b57b49a99c1147e576d8f8',
'62c082f1d04b034856f75ec62d5ea86b',
'ccca38d39379c4951aaadc7a1271880b',
'b80b4eba08abdda2819f2de3b8b05031',
'485be5283ce508ad492c513fa39c5e6a',
'03b26593d0c248d98bb01cc1fc723678',
'f80f5f67e42850cc967bab8d99f1fa49',
'393f98e0b1996cbccb7f297e3eb848ca',
'cf692d77bebd591019f7ee40b55f8fae',
'f4481865eee1dd1acd14193e18570a9e',
'd7c2b997cc37719706892882d8432c17'
)
ORDER BY bundle_md5 ASC

Returns 55 Records and runs in 0.0044 sec

The 6 different records do not contain all 12 filename_md5 values.

My query returns
0246ef2a4bc53cf822450f2f9d4679ee
0c370899c8c57683353b28927afe7ca2
13f44cf748cafb743f0bd02d4783300e
145da19439b03e6aabb6494490fb4336
1a9f1285434560290c6e0f3bcd34ff9e
1c6455bac4d28089e6ad81288fd8b03e
21c151cca5232c94e566e19c13fcb39b
2c63ebf4b5e1df139caacbf9eb464588
2d90807d56ac8dcd45217eda2f34b9d7
3afb8c5c264d82f8559429828b6e37c2
3d7705aedc537e663b27634e2e6a79e2
417cc93782fa27ce93485688aa3420ed
44c5edb1e890ba9e074cd47602d26639
...
Yours returns
0246ef2a4bc53cf822450f2f9d4679ee
0c370899c8c57683353b28927afe7ca2
13f44cf748cafb743f0bd02d4783300e
145da19439b03e6aabb6494490fb4336
1a9f1285434560290c6e0f3bcd34ff9e
1c6455bac4d28089e6ad81288fd8b03e
21c151cca5232c94e566e19c13fcb39b
2c63ebf4b5e1df139caacbf9eb464588
2d90807d56ac8dcd45217eda2f34b9d7
3afb8c5c264d82f8559429828b6e37c2
3d7705aedc537e663b27634e2e6a79e2
417cc93782fa27ce93485688aa3420ed
4294c8a0f8233eb55c42e867059f0b20
44c5edb1e890ba9e074cd47602d26639
...

4294c8a0f8233eb55c42e867059f0b20 only contains 4 files in it; where as 44c5edb1e890ba9e074cd47602d26639 contains 19.

So your query is more like an OR match where mine is an AND match. By the way I do not use this function any more; explanation is here http://drupal.org/node/1196...

Ah, try that as a SELECT ... FROM (SELECT root.bundle_md5, COUNT(DISTINCT filename_md5) AS count WHERE ... GROUP BY root.bundle_md5) AS sub WHERE sub.count = .... But glad you figured out the bigger problem!