Search Platform/Weekly Updates/2024-02-09
Summary
Further investigation of failed queries on the WDQS main graph shows that most come from a few sources, which gives us some confidence that we can improve the situation significantly by focusing on a small number of use cases.
Other projects are moving along nicely.
What we've accomplished
Improve multilingual zero-results rate
- The ICU token repair corpus is built and daily diffs are running. Reviewing diffs from enabling the ICU tokenizer. They mostly look good, but there are a few things to track down. (Malayalam has the most unusual results, and I'm having a little trouble figuring out what's going on: diffs from my regression test set aren't reproducing easily in focused testing. I'll get to the bottom of it eventually.)
WDQS graph splitting
- A draft of the analysis is available on wiki. We confirm the numbers from last week that showed some categories of queries having a high failure rate on a split graph. This is mitigated by the large majority of failed queries coming from a small number of user agents (the top 5 user agents account for >90% of failures), which suggests that a targeted effort can reduce the number of failures significantly. Confirming this will need a qualitative analysis, which is planned as a next step. https://wikitech.wikimedia.org/wiki/User:DCausse/WDQS_Graph_Split_Impact_Analysis
- Intermediary report on query performance is published. This analysis covers only queries that run on the main graph (without requiring federation) and shows a modest performance improvement compared to running on the full graph. This confirms our initial assumptions. https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update/Graph_split_IGUANA_performance
- February update is published, inviting feedback from our communities: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update/February_2024_scaling_update
- Project page is published: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split
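As a rough illustration of the user-agent concentration figure mentioned above, the following sketch computes the share of failures attributable to the top N user agents. It is not the code used in the analysis; the function name `top_n_failure_share` and the sample user agent labels are hypothetical, and the sample counts are made up for the example.

```python
from collections import Counter


def top_n_failure_share(failed_user_agents, n=5):
    """Return the fraction of failed queries sent by the n most frequent user agents.

    `failed_user_agents` is an iterable with one user agent string per failed query.
    """
    counts = Counter(failed_user_agents)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(n))
    return top / total


# Made-up sample: one entry per failed query (hypothetical user agent labels).
sample = (
    ["bot-a"] * 50
    + ["bot-b"] * 20
    + ["bot-c"] * 10
    + ["bot-d"] * 8
    + ["bot-e"] * 5
    + ["other-1"] * 4
    + ["other-2"] * 3
)

share = top_n_failure_share(sample, n=5)
print(f"Top 5 user agents account for {share:.0%} of failures")
```

A high share for a small N is what makes a targeted effort plausible: working with a handful of clients could address most of the failures.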
Misc
- Investigated, restarted, and backfilled a failed data pipeline. https://phabricator.wikimedia.org/T356030
- We participated in a Unicode Consortium meeting about the Foundation's membership. Nothing concrete yet, but a lot of goodwill and promises to make introductions and work together in the future. This is especially timely given our current work on ICU token repair.