Two weeks into building the wikireview system, after performing a data structure and dataflow analysis, I noticed problems with the SQL schema changes and the complexity they brought along. The database model was getting cluttered with added roles and tables that could probably be abstracted away.
We realized that the model could be simplified considerably. This involved switching from SQL to a NoSQL approach built on the file system. We did away with user registrations for the time being (I’ll explain later how we intend to add optional registrations to the system) and rebuilt the entire codebase from scratch. This blog post is dedicated to explaining the new system that we built.
NOTE: Project code is now hosted here
File System Structure
Moving away from SQL gave us much more freedom in how we store our data; we are no longer bound by the rigid structures SQL imposes. In the newly devised NoSQL approach, we use a purely file-based system for storing questions and answers, with a separate file for every reviewer action (ask, answer, recommend).
The files can be of 6 types:
- q – question
- a – answer
- e – endorses original answer
- o – opposes original answer
- t – tiebreaker
- d – diff
The system currently assigns a 9-digit code (ranging from 000000001 to 999999999) as the name of every question asked. These numbers aren’t assigned randomly; they follow a round-robin approach. The code is followed by ‘q’ to denote that the file contains a question, so the first question asked has the name ‘000000001q’.

When a reviewer wishes to answer a question, they are presented with the question and, if present, its answer. If the question hasn’t been answered before, the reviewer answers it; the answer file has the same 9-digit code as the question, followed by an ‘a’. If the reviewer is presented with a question and a corresponding answer, they can choose to endorse or oppose the answer, which creates an ‘e’ or ‘o’ file respectively. If the answer is endorsed, it is ready to be opened by a recommender, who implements the changes suggested in the answer file. If the original answer has been opposed, the question, the answer, and the comments left by the opposer are presented to a third reviewer, who acts as tiebreaker and makes the final call to endorse or oppose the original answer.
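The naming scheme above can be sketched in a few lines of Python. The function names and the flat data directory here are my assumptions for illustration, not the project’s actual code:

```python
import os

# Suffixes from the list above: question, answer, endorse, oppose, tiebreaker, diff
SUFFIXES = {"q", "a", "e", "o", "t", "d"}

def next_question_id(data_dir):
    """Pick the next 9-digit id by scanning existing question files.
    Ids run from 000000001 upward, wrapping around after 999999999
    (the round-robin behaviour described in the post)."""
    existing = [int(f[:9]) for f in os.listdir(data_dir)
                if f.endswith("q") and f[:9].isdigit()]
    nxt = (max(existing) % 999999999) + 1 if existing else 1
    return "%09d" % nxt

def filename(question_id, kind):
    """Build a file name such as '000000001q' for a given action type."""
    assert kind in SUFFIXES
    return question_id + kind
```

Because every action file shares the question’s id, finding all files for one question is a simple prefix match.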
The application has four action endpoints.
The first endpoint is for posting questions into the system. The template consists of a text space to type in the question’s content, followed by a space for an optional URL related to the question, which then shows up in the iframe below. Clicking ‘Submit’ creates a file named with a unique 9-digit code (the question id) followed by ‘q’. Subsequent handling of this question is done by referencing this id, so the answer, comment, and diff files (more details about these ahead) share the same 9-digit code as the question.
We have also added URL detection to the question’s content: any URL in the question is automatically highlighted when the question is displayed. This is done with a regular expression that matches URLs within text. The same functionality is applied to all input text boxes in /answer and /recommend as well.
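A minimal version of this URL detection might look like the following. The project’s exact pattern isn’t shown in the post, so this regex is only an approximation; real-world URL matching has many edge cases:

```python
import re

# Match http/https URLs up to the next whitespace or angle bracket.
URL_RE = re.compile(r'(https?://[^\s<>"]+)')

def highlight_urls(text):
    """Wrap every detected URL in an anchor tag before display."""
    return URL_RE.sub(r'<a href="\1">\1</a>', text)
```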
Question creation scripts
Apart from building the review system itself, the aim of the project was to devise new ways to analyze outdated/inaccurate content in the wikis. This analysis would then be applied to feed auto-generated questions to the review system. I wrote several scripts for question creation. I will now explain them in some detail.
Questions from articles containing the word ‘recent’
The idea here was to scan wiki articles for the words ‘recent’ or ‘recently’ and use page view counts to extract articles of high importance. To get articles containing the word ‘recent’, I used Wikipedia’s Special Search tool. I then used BeautifulSoup to extract relevant tag values (such as the article title and link) from the HTML source. This gave me all the articles that potentially fell into the ‘outdated’ category, and I created questions from them. Figure 1 shows a screenshot of such a question.
Figure 1: Question generated from the ‘recent’ script
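The extraction step can be sketched as follows. The post uses BeautifulSoup; to keep this example self-contained it uses Python’s standard-library `html.parser` instead, and it assumes the result links sit inside divs with the `mw-search-result-heading` class found on Special:Search result pages:

```python
from html.parser import HTMLParser

class SearchResultParser(HTMLParser):
    """Collect (title, link) pairs from anchors inside search-result headings.
    The 'mw-search-result-heading' class is an assumption about the markup."""

    def __init__(self):
        super().__init__()
        self.in_result = False      # currently inside a result-heading div?
        self.current_href = None    # href of the anchor we just opened
        self.results = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and "mw-search-result-heading" in attrs.get("class", ""):
            self.in_result = True
        elif tag == "a" and self.in_result:
            self.current_href = attrs.get("href")

    def handle_data(self, data):
        if self.current_href:
            # Text directly inside the anchor is the article title.
            self.results.append((data.strip(), self.current_href))
            self.current_href = None

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_result = False
```

Each `(title, link)` pair can then be turned into a question file using the naming scheme described earlier.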
Questions from poor readability scores
In the section on Readability Measurements, I had introduced the SMOG readability index and the Flesch-Kincaid readability scores (referred to as FK from here on). I now use the FK scores on the WP Backlog Category ‘Articles in need of copy editing’ to find the articles scoring poorly on readability. The next idea was to focus on the articles that needed immediate attention. One metric that determines the relative importance of updating an article is the number of page views it receives. To get this information, I used the Wikimedia Pageviews API and wrote a small script that sends a request to the API with the article name and a time span covering the last month, which returns the article’s page view count over that period. Finally, the articles are ordered based on a combination of poor readability scores and high page view counts.
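The two pieces of this pipeline can be sketched as follows: building the per-article Pageviews API request URL, and ordering articles by combined score. The multiplicative weighting of FK grade and views is an illustrative choice, not necessarily the script’s actual metric:

```python
from datetime import date, timedelta

def pageviews_url(article, end=None, days=30):
    """Build a Wikimedia Pageviews API URL covering the last `days` days.
    en.wikipedia / all-access / all-agents are the common defaults."""
    end = end or date.today()
    start = end - timedelta(days=days)
    fmt = "%Y%m%d00"  # the API expects YYYYMMDDHH timestamps
    return ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
            "en.wikipedia/all-access/all-agents/{}/daily/{}/{}"
            .format(article.replace(" ", "_"), start.strftime(fmt), end.strftime(fmt)))

def rank_articles(articles):
    """Order articles so that poor readability (high FK grade level)
    combined with high page views comes first."""
    return sorted(articles, key=lambda a: a["fk_grade"] * a["views"], reverse=True)
```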
Questions from WP backlog Categories
This script simply creates questions from the backlog categories specified in it.
When a user lands at this endpoint, a randomly chosen question is displayed to them, along with the answer and any comments. As explained in the section on File System Structure, a reviewer can be presented with just a question to answer, be asked to provide comments, or even act as the tiebreaker.
Questions that have been resolved, i.e. whose answer has been endorsed or opposed by common consensus, are displayed to the recommender, whose job is to address the issue at hand and make the changes needed to close it. This could involve updating the erroneous wiki article and pasting the diff of the changes made.
For example, say the question displayed was related to the word ‘recent’ added in 2008 in an article. The answer and comments would have analyzed the question and suggested necessary changes to be made. It is now the recommender’s job to implement these changes and paste a diff of the changes for legitimacy. After this is done, the question, answer, comments and diff files are all archived in a separate folder.
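The archiving step could be done with a small helper like this; the function name and directory layout are assumptions for illustration:

```python
import os
import shutil

def archive_question(data_dir, archive_dir, question_id):
    """Move every file sharing the question's 9-digit id (question, answer,
    comments, diff) into the archive folder once the issue is closed."""
    os.makedirs(archive_dir, exist_ok=True)
    moved = []
    for name in os.listdir(data_dir):
        if name.startswith(question_id):
            shutil.move(os.path.join(data_dir, name),
                        os.path.join(archive_dir, name))
            moved.append(name)
    return sorted(moved)
```

Because all of a question’s files share one id prefix, a single pass over the data directory is enough to close out the issue.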
This endpoint is for seeing all records of questions, answers, reviews, and implementations. It provides a well-rounded summary of the number of files of each type that were created as well as the mean times of modification.
More importantly, it serves as a handy tool for analyzing the question files for specific content. For this purpose there are two options: searching for strings in questions, and searching for a reviewer’s token id. Based on the query, a summary is drawn up. This tool can be used to analyze, say, the questions generated from the ‘recent’ script, or the questions answered by a particular reviewer.
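The string-search option might look like this sketch; again, the flat file layout is assumed:

```python
import os

def search_questions(data_dir, needle):
    """Return the ids of question files whose content contains `needle`.
    A similar scan over answer files could match a reviewer's token id."""
    hits = []
    for name in sorted(os.listdir(data_dir)):
        if not name.endswith("q"):
            continue
        with open(os.path.join(data_dir, name)) as f:
            if needle in f.read():
                hits.append(name[:-1])  # strip the 'q' suffix to get the id
    return hits
```

Searching for the word ‘recent’, for instance, would pull up exactly the questions produced by the ‘recent’ script.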