Benutzer:NetAction/WikiTrust/SURE

The SURE web service connects WikiTrust with the browser. It consists of a cache database and an API that provides the following information about a Wikipedia page:

Sequences: Which revision introduced which part of the page's HTML
Users: User IDs and bot status for every user name in the page
Revisions: User name, timestamp, trust number, minor flag, anonymous flag for every revision id in the page
Extra: Additional data such as the page title

If you are interested in the software itself, look at the SURE documentation and check out everything from SVN.

Input parameters

revid: The revision id you want information about.
debug: If you set this to "wiki" you get the raw WikiTrust wiki text. "html" returns the HTML code generated by SURE.
project: Either "de" or "en". Other projects like Wikibooks could follow.
callback: Name of callback function.

Example: https://toolserver.org/~netaction/sure/?revid=234234&project=en&callback=123
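
As an illustration, a request URL like the one above can be assembled from the input parameters. This is a minimal sketch in Python; the helper name build_sure_url and its validation rules are assumptions made for this example and are not part of SURE itself.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL above.
SURE_ENDPOINT = "https://toolserver.org/~netaction/sure/"


def build_sure_url(revid, project="en", callback=None, debug=None):
    """Build a SURE request URL (hypothetical helper, not part of SURE)."""
    if project not in ("de", "en"):
        raise ValueError("project must be 'de' or 'en'")
    params = {"revid": revid, "project": project}
    if callback is not None:
        params["callback"] = callback
    if debug is not None:
        if debug not in ("wiki", "html"):
            raise ValueError("debug must be 'wiki' or 'html'")
        params["debug"] = debug
    # urlencode keeps the insertion order of the dictionary.
    return SURE_ENDPOINT + "?" + urlencode(params)


print(build_sure_url(234234, project="en", callback="123"))
```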

Output

The API returns JSONP.

pageinfo is an array with some basic data from the MediaWiki API (see its documentation).

pageid the page's id
title the page's title
touched Unix timestamp of the page's last change (not the timestamp of the revision you asked for)
lastrevid id of the latest revision
length page size

sequences is an array of revision ids, each followed by the count of non-whitespace Unicode characters in the page's HTML.

revisions is an array of revisions ordered by revision id. The data comes from the revisions property of the MediaWiki API.

u User name
m minor change flag
a edit by anonymous user
t trust number
ts unix timestamp

users is an array of user names. It contains only the users that are not anonymous. The data comes from the MediaWiki Users API.

id User id, 0 if unknown
b bot flag

The size of the whole output is typically 10%-40% of the page HTML.
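
A client that is not a browser has to strip the callback wrapper before it can read the JSON. The following Python sketch shows one way to do that. The sample response is made up for illustration, and the flat layout of sequences (revision id, count, revision id, count, ...) is an assumption based on the description above, not a confirmed format.

```python
import json
import re


def parse_jsonp(body, callback):
    """Strip the 'callback(...)' wrapper and decode the JSON inside it."""
    pattern = r"^\s*" + re.escape(callback) + r"\s*\((.*)\)\s*;?\s*$"
    match = re.match(pattern, body, re.DOTALL)
    if match is None:
        raise ValueError("response is not wrapped in the expected callback")
    return json.loads(match.group(1))


def pair_sequences(sequences):
    """Turn [revid, count, revid, count, ...] into (revid, count) pairs.

    Assumption: the array alternates revision ids and character counts.
    """
    return list(zip(sequences[0::2], sequences[1::2]))


# Made-up response, only for illustration.
sample = 'cb({"pageinfo": {"pageid": 1, "title": "Berlin"}, "sequences": [100, 12, 101, 5]});'
data = parse_jsonp(sample, "cb")
print(data["pageinfo"]["title"], pair_sequences(data["sequences"]))
```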

How SURE works

When a client calls the WikiTrustSURE API, the API uses a simple prepared SQL statement to fetch the corresponding JSON code from the database.

   SELECT sequences FROM pages WHERE revid=:revid

Usually the result is exactly what the client needs. It is returned via echo and the job is done. If SURE does not have the sequences in the database, it generates and returns them.
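
The lookup with fallback can be sketched as follows. This is not SURE's own code: it uses Python with SQLite as a stand-in database, the column types of the pages table are assumptions, and generate is a hypothetical callback that stands for the generation process.

```python
import sqlite3


def get_sequences(conn, revid, generate):
    """Return the cached JSON for revid, generating and caching it on a miss."""
    row = conn.execute(
        "SELECT sequences FROM pages WHERE revid=:revid", {"revid": revid}
    ).fetchone()
    if row is not None:
        return row[0]  # cache hit: return the stored JSON as-is
    sequences = generate(revid)  # hypothetical: runs the slow generation
    conn.execute(
        "INSERT INTO pages (revid, sequences) VALUES (:revid, :sequences)",
        {"revid": revid, "sequences": sequences},
    )
    conn.commit()
    return sequences


# Assumed schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (revid INTEGER PRIMARY KEY, sequences TEXT)")
print(get_sequences(conn, 234234, lambda revid: "[]"))
```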

Generation of sequences

If a browser or a bot requests the output of WikiTrustSURE, the API first checks whether its own database already has what is needed. If not, the sequences and the other data have to be generated.

The API calls Wikipedia's MediaWiki Query API and collects some basic information about the revision id. WikiTrustSURE finds out whether the revision id is valid, whether it is in namespace 0 and which page title it belongs to.
WikiTrustSURE requests the WikiTrust wiki text from the WikiTrust API. This is wiki text with some extra tags that note who wrote which part. This takes 1-2 seconds.
The wiki text is sent to the Wikipedia API. The result is HTML-like output that is well suited for generating the sequences. This takes up to 40 seconds.
WikiTrustSURE parses the HTML and counts the number of characters every user wrote. Whitespace and HTML tags are skipped.
Finally we fetch meta information about the involved revisions and users. The whole process needs up to 50 seconds and is nothing more than waiting for other servers. SURE converts everything into JSON and stores it in its cache database.
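
The counting step can be sketched like this in Python. The way authorship is marked in the HTML is an assumption: the sketch pretends that each authored run is wrapped in a span with a hypothetical data-rev attribute. The real markup produced from the WikiTrust tags may look different.

```python
from collections import Counter
from html.parser import HTMLParser


class SequenceCounter(HTMLParser):
    """Count non-whitespace characters per revision, skipping HTML tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.counts = Counter()
        self._revs = []  # revision ids of the currently open <span> elements

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            # Hypothetical marker: <span data-rev="..."> names the revision.
            # Unmarked spans inherit the revision of the enclosing span.
            inherited = self._revs[-1] if self._revs else None
            self._revs.append(dict(attrs).get("data-rev", inherited))

    def handle_endtag(self, tag):
        if tag == "span" and self._revs:
            self._revs.pop()

    def handle_data(self, data):
        rev = self._revs[-1] if self._revs else None
        count = sum(1 for ch in data if not ch.isspace())
        if rev is not None and count:
            self.counts[rev] += count


def count_sequences(html):
    parser = SequenceCounter()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return dict(parser.counts)


html = ('<p><span data-rev="10">Hello <b>big</b> world</span> '
        '<span data-rev="11">a&amp;b<br> c</span></p>')
print(count_sequences(html))
```

Entities such as &amp;amp; are decoded before counting, so they count as one character each.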

You can see the status on the SURE screen page.

Why SURE

Everything WikiTrustSURE does could be done by the browser. For large pages like Heiliges Römisches Reich or Berlin, however, this would take over one minute. SURE provides all the information in under a second if the page has been cached before.

SURE takes the WikiTrust wiki markup, posts it to Wikipedia's API, receives the HTML code back and talks a lot to other APIs. This traffic can easily exceed 2 MB. The traffic sent to the browser, however, will not exceed 300 kB even in the worst case of extremely large pages. Those pages already cause more than 1.5 MB of traffic on their own. Using the WikiTrustSURE API raises the traffic for typical pages by 4%.

To do the work SURE will need lots of disk space and traffic. If it has many users, a proxy server would be great too. All this is easier to scale if WikiTrustSURE is not just another WikiTrust API but a dedicated service. It can then run independently of the WikiTrust servers and the Toolserver, on its own box behind the Wikipedia proxy.

The DE Wikipedia has 1.3M pages in namespace 0. Assume that only three revisions per page are cached and that an entry needs 5 kB on average in the database. Then the database takes up about 19.5 GB.

5\,\text{kB} \cdot 1.3\,\text{M} \cdot 3 = 19.5\,\text{GB}
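
The estimate is easy to repeat for other assumptions, for example more cached revisions per page. A small Python sketch follows, using decimal units (1 GB = 10^9 bytes). The function name is made up for this example.

```python
def cache_size_gb(entry_bytes, pages, revisions_per_page):
    """Estimated cache size in GB (decimal units, 1 GB = 10**9 bytes)."""
    return entry_bytes * pages * revisions_per_page / 1e9


# Figures from the text: 5 kB per entry, 1.3M pages, 3 revisions per page.
print(cache_size_gb(5_000, 1_300_000, 3))
```

With the figures from the text this comes to about 19.5 GB, in line with the estimate above.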

Why not WikiTrust's code

There are some issues with the regular expressions in WikiTrustBase::color_Wiki2Html from includes/WikiTrustBase.php. The same code is in the Firefox Addon.

Some templates do not work because the regular expressions damage their markup. Example: Rudersdorf (Burgenland)
HTML in the wiki code (like NOWIKI sections) destroys the resulting HTML. Example: Trinity Hall
Bold text and other markup get lost.
Special characters are destroyed, even some whitespace.
Image descriptions do not carry the information about the author. It is in WikiTrust's API response, but it does not survive the regular expressions.
The code is sent from WikiCode to the browser, from the browser to Wikipedia and back. This should be done on the server. WikiTrust already has the needed code in WikiTrustBase.php.
Heavy traffic. The browser already has the page HTML and does not need it again.
Too little information. No metadata for revisions and users.
No real parser; hard to maintain because of unwieldy regular expressions.
Vulnerable to XSS attacks.

Updates

Sometimes the name of a page changes or templates generate new HTML code. Users may also change their names. At the moment WikiTrustSURE never updates its cache. There are two ways to trigger an update: either Wikipedia initiates the update when it updates its own HTML, or the browsers use checksums or something similar and trigger the update when they think it is necessary. Neither method works if user names are modified or if other things change that affect the revision or user data but not the HTML code and sequences.
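
The checksum idea could look roughly like this. It is only a sketch of one possible approach, written in Python: the choice of SHA-256 and both function names are assumptions, and nothing like this exists in WikiTrustSURE yet.

```python
import hashlib


def html_checksum(html):
    """Checksum of the page HTML (SHA-256 is an arbitrary choice here)."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def needs_update(cached_checksum, current_html):
    """True if the HTML seen by the browser differs from the cached version."""
    return cached_checksum != html_checksum(current_html)


cached = html_checksum("<p>old text</p>")
print(needs_update(cached, "<p>old text</p>"), needs_update(cached, "<p>new text</p>"))
```

As noted above, such a check only detects changes in the HTML. A renamed user would leave the checksum unchanged.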