The "Gapfinder" Study: Using Schemes SG's data to identify unmet social needs in our users' queries and improve our search model

The SchemesSG database has recently grown to >300 schemes and >1,500 user queries. It’s not quite a data lake yet, but perhaps a wading pool from which we can uncover insights on the top social needs in Singapore. We share some of these insights.

#technology

In this article, our very own ML engineer Quintus Lim performs a data-driven study to analyse the topology of the >300 schemes in our Bank and >1,500 user queries we have received and served so far. Quintus then analyses the fit between the listings and the queries, and explains how our team leverages such insights to improve Schemes Pal's search model.

Quintus's work underscores how the data collated by Schemes SG can serve to provide a critical sensing of social needs on the ground.

--------

My blog post today first briefly explains the analytical techniques used to profile the schemes and user queries we've collected, then analyses the results to identify which social areas frequently see unmet needs, and how schemes may be addressing them. This is still preliminary work, and we hope that as our user base grows and the Schemes Bank becomes a more comprehensive snapshot of the entire social service sector, it can help serve as a “gapfinder” for unmet social needs in Singapore.

Profiling user queries

I start by teasing out the main topics that user queries tend to fall under. With >1,500 queries to date collected via Schemes Pal, it’s inconvenient to read them manually and we need data science techniques. There are at least 3 ways to approach this:
• Classification/supervised learning (e.g. k-Nearest Neighbours) which requires humans to define all relevant topics beforehand;
• Self-supervised learning (e.g. Doc2Vec) which picks labels from within the training data itself;
• Clustering/unsupervised learning (e.g. k-Means) which solely analyses the features intrinsic to unlabelled data.

Sometimes even I don't fully remember how these models work, but what’s important is to recognise their pros and cons. For instance, supervised learning uses explicit labels which make our results more interpretable. But this approach has its drawbacks. First, creating those labels is labour-intensive and they may not be accurate. Second, it's hard to label cross-domain queries (e.g. “Financial aid for caregiver respite services, or failing which, transportation to eldercare facilities”) without overfitting and/or taking an exponential amount of time to solve. Third, it’s hard to ensure models don’t just “spin the lottery”, i.e. blindly pick the most frequently-made query to maximise the chances of being correct. Too much work for no guarantee of results – basically the typical Singaporean’s work day!
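To make the “spin the lottery” problem concrete, here is a minimal sketch of the baseline any classifier has to beat. The topic labels and their counts are made up purely for illustration, and `majority_baseline_accuracy` is a hypothetical helper, not part of our actual pipeline.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy of always guessing it."""
    if not labels:
        raise ValueError("labels must not be empty")
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)


# Hypothetical hand-labelled queries (illustrative counts only).
labels = (
    ["financial assistance"] * 6
    + ["mental health"] * 3
    + ["funeral services"] * 1
)

label, accuracy = majority_baseline_accuracy(labels)
print(f"Always guessing '{label}' scores {accuracy:.0%} accuracy")
```

If a trained model's accuracy is no better than this number, it has probably learnt to guess the most popular topic rather than read the query.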

This made us consider unsupervised learning (e.g. k-Means, LSI, LDA etc). These models are much less labour-intensive because there’s no need to manually label user queries. But they entail their own set of challenges (e.g. with many possible topics and the high-dimensionality of text data, tuning k-Means is like trying to get a promotion – literally nothing works. Also, k-Means and LSI topics aren’t always interpretable, while LDA has so many hyperparameters I didn’t even try).
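For readers who have not met k-Means before, the sketch below shows its two alternating steps on toy data. It is a bare-bones illustration under generous assumptions: the points are invented 2-D stand-ins for query vectors, the number of clusters and the starting centroids are picked by hand, and none of the tuning headaches described above show up at this scale.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, iterations=10):
    """Plain k-Means (Lloyd's algorithm) on lists or tuples of floats."""
    centroids = [list(c) for c in initial_centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Hypothetical 2-D "query vectors"; real text vectors have far more dimensions.
points = [(0.1, 0.2), (0.0, 0.0), (0.2, 0.1),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]

assignments, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
print(assignments)
```

Note that the number of clusters has to be chosen up front, and the output is just cluster numbers: a human still has to look inside each cluster to decide what topic it represents.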

The above considerations directed our attention to models based on word vectors, such as Doc2Vec and pre-trained transformers, which essentially find ways to convert text into numbers. These are getting increasingly accurate. We further found that Doc2Vec and LSI underperformed pre-trained BERTs in our evaluations. It’s hard for a “shallow deep” net to capture the full complexity of a language and multiple domains, plus Doc2Vec can’t handle words with multiple meanings. The word “health” for instance, is often used metaphorically in business and economic contexts. This necessitated a more sophisticated array of algorithms.

I won't bore the readers with more details, but the general idea is to use thicc transformer models for clustering. Then, to visualise our results, we convert numerical representations of each query from 768 numbers into 2 (for dimensionality reduction, I much prefer UMAP to PCA, but if you don't trust noobs, look at this, then read this and this). We arrive at this:

Image placeholder
An interactive visualisation is hosted here for now, where readers can hover their mouse over any bar or bubble to highlight the topic of the query and display the share of queries falling under that topic.

Some key points to be gleaned from this:
• Search terms are generated by users, but topics are hand-labelled by me after looking at what tends to fall under the topic, and are neither fully objective nor comprehensive.
• The embedding quality is pretty good even in just 2 dimensions. Not only are same-topic queries closely spaced, similar topics are reasonably close to each other. E.g. suicide, mental health, divorce, and counselling all neighbour each other. Health-related queries are mostly located in the bottom half, while financial assistance is in the top right. Queries related to retrenchment and unemployment are in the top left.
• As is always the case with dimensionality reduction, the X and Y axes have no interpretation - it's 768 dimensions squashed into 2. Imagine trying to draw a die on a piece of paper - you simply can't show all 6 faces in 1 drawing.
• Clusters may seem artificially large because people try multiple variations on their queries. Either the results don't suit their needs, or they already know of the schemes returned, and want to see if there's anything new.
• When I sampled 100 queries and manually read through them, I identified around 20 topics. My algorithms landed on 26 topics (post cleaning), which is pleasantly close.
• Mislabelling commonly occurs when people "bao ga liao" – lumping disparate issues into 1 search, like your parents reciting all your flaws in 1 breath. Also, some topics like youth at risk and cyberbullying have been subsumed under other topics even though they should have their own.
• These categorizations only indicate the likeliest individual topic of a query. Certainly, queries can and do have more than 1 topic, but visualising this is a nightmare, so I'm reserving this more for internal analysis.
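The die analogy in the list above can be tried out in a few lines. This is only a sketch of why squashing dimensions loses information: it uses the crudest possible projection (dropping one coordinate), which is not how UMAP works. UMAP tries much harder to keep neighbours together, but some loss is unavoidable either way.

```python
from itertools import product

# The 8 corners of a unit cube stand in for a die (3 dimensions).
corners = list(product([0, 1], repeat=3))


def project_to_2d(point):
    """Naive projection: keep x and y, throw away z."""
    x, y, _z = point
    return (x, y)


flattened = [project_to_2d(corner) for corner in corners]
distinct = set(flattened)

print(len(corners), "corners in 3-D ->", len(distinct), "distinct points in 2-D")
```

Pairs of corners that differ only in the discarded coordinate land on the same spot, so the drawing can no longer tell them apart. Going from 768 dimensions to 2 has the same problem on a much larger scale, which is why the axes of the chart carry no meaning on their own.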

One might ask: would it be better to use transformers trained on Singlish? (Yes, they exist.) I’d say no; our users mostly type in formal English, and BERT models have seen enough data to not be tripped up when people type "cuz" or "cos" instead of "because". Broken English can in fact be better for machines, because it strips out words that are mostly grammatical filler, thereby providing clearer signals to algorithms which don’t really care about linguistic formalities.

Profiling schemes listings

Meeting these user queries are the various public, private, and non-profit schemes across Singapore which form the Schemes Bank. But queries and schemes have some key differences – queries tend to focus on the problem, while schemes focus on the solution. For instance, schemes in health care tend to describe their financial assistance, caregiving/nursing services, caregiver respite services, transportation services, field trips and wellness programs etc.

Conversely, queries can mismatch schemes in at least 3 ways:
1. The searcher elaborates heavily on the user's condition and situation, and the future repercussions. Technically, this is suboptimal as the model is trained to identify the solution you want, not the problem you are experiencing. (Granted, not everyone, including SchemesSG ourselves, knows what solutions are out there. The model would still work decently, just not at its best.)
2. The searcher barely types anything. Now, I get it – when people experience difficulties in life, their first instinct isn't to go write autobiographies on random search engines. Nonetheless, your query remains anonymous, plus, being specific gives better results. We do not collect identifiable data because we neither fill out nor approve applications for schemes.
3. The searcher just wants to look see look see. We do appreciate y'all “sliding into our DMs” like this, but if you want proper results, remember to type a proper query =)

What these differences in language ultimately mean is that the topics identified for schemes will be different from the topics identified for queries – we cannot use the same model. Also, keep in mind that because all schemes are heavily multi-domain in nature, parking them under topics strips them of their nuances and richness.

As with queries, schemes are also clustered reasonably well. This is what the distribution of schemes looks like:

Image placeholder
An interactive visualisation is hosted here for now.

I'd point out that having many schemes focused on youth at risk etc. does not mean the area is well-addressed – it's likely the opposite. Volunteers and charitable organisations don't expend their resources on areas that are already thriving – they seek out areas of high unmet need where they can do the most good. It's also not clear-cut that more schemes mean better availability of services, as fragmentation and bureaucracy can easily outgrow their solutions, and some schemes target only specific demographics. As such, having many schemes for one area simply means this is a common problem with some known solutions.

Analysing the fit between queries and schemes – insights and where do we go from here

We can further match queries and schemes to see which query topics go unanswered:

Image placeholder
An interactive visualisation is hosted here for now.

Queries with low matches either encompass too many different problems/schemes such that matching is imprecise (e.g. retrenchment), or simply have few/no schemes available (e.g. funeral services). Queries with high matches tend to be narrow and specific (e.g. cancer finance), but again, do take note that high matches do not always mean an area is well-addressed.
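As a rough illustration of what "low match" and "high match" mean, the sketch below scores each query against every scheme description and flags queries whose best score falls under a cut-off. Everything here is an assumption made for the example: the scheme names and descriptions are invented, word counts stand in for the transformer embeddings, and the `THRESHOLD` value is arbitrary.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Very crude text vector: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query, schemes):
    """Return (score, scheme name) for the closest scheme description."""
    query_vec = bag_of_words(query)
    return max(
        (cosine(query_vec, bag_of_words(desc)), name)
        for name, desc in schemes.items()
    )


# Hypothetical scheme listings, invented for this example.
schemes = {
    "Scheme A": "financial assistance for cancer treatment",
    "Scheme B": "counselling for youth at risk",
}
THRESHOLD = 0.3  # arbitrary cut-off for calling something a match

for query in ["cancer treatment financial help", "funeral services"]:
    score, name = best_match(query, schemes)
    verdict = name if score >= THRESHOLD else "no good match"
    print(f"{query!r}: {verdict} (score {score:.2f})")
```

A query about funeral services shares no words with either listing, so it surfaces as unmatched. With real embeddings the scores are less binary, but the idea of comparing each query against the closest scheme is the same.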

Some further caveats: merely comparing the text of queries to the online descriptions of schemes does not do enough legwork in evaluating adequacy. Simply counting the number of organizations which say that they are providing something gives little to no indication of the resources they can bring to bear, how well they are addressing needs on the ground, or how broad and sustainable their efforts are. Moreover, our Schemes Bank is still growing.

As of now, the basic information we have displayed is a budding feature that gives us some sense of growing needs, but it is still not a foolproof indication of social gaps. However, it does give us critical guidance on how best to improve Schemes Pal's model. For now, that is to work on the accuracy for short and generic queries, or where queries cross into multiple domains.

Don’t get me wrong – there are pretty easy ways to evaluate the relevance of schemes, just that they can open other cans of worms. For instance, it’s certainly possible to ask for more in-depth information that allows us deeper insights into the user’s profile. But we’re reluctant to collect granular user data due to high sensitivities in certain areas, and the ongoing push for data privacy. As it stands, Schemes SG is completely blind to the profile of users who type in queries, even though there are clear and immediate use cases for improving SchemesSG services.

As research progresses, we’ll keep thinking of ways to identify unmet needs and social gaps while maximising data privacy. Of course, we ourselves are learning earnestly, and we welcome public contributions and feedback on our website.

We hope you've enjoyed this article and our services in general, and we’re only just getting started! The dream is to get listed on NASDAQ, and acquire FAANG (just kidding – you can read about our vision here. We aim to democratise information on social sector assistance and to make navigation easier). But in the meantime, if you like what we've done, it'd help us a lot if you spread the word, especially to people working in the social sector or those who deliver National Day Rallies. The more our search engine is used, the better it will be!

Quintus is an ex-MPS volunteer working in a policy think tank while pursuing a master’s in data science. He likes fluffy corgis more than they like him. All writing reflects the author's own thoughts, not those of better.sg or any of the organizations listed in the Schemes Bank.