GSOC 2012/Student Application Gugu93/Geeklog Crowdsourcing Translations(422): Difference between revisions

Revision as of 11:48, 2 May 2013

Proposal Title :

Geeklog - Crowd-sourcing Translations

Additional Info :

http://web.iiit.ac.in/~rishabh.raj/pluginuiplan.png

A Short Description :

Geeklog has been translated into over 30 languages, but most of that work was done some time ago, so the translations need feedback and improvements.

Crowd-sourcing is the practice of soliciting contributions from a large community of people rather than a select few [cited from Wikipedia, The Free Encyclopedia], and I intend to develop a platform for crowd-sourcing Geeklog translations.

Contact Information :

Rishabh Raj

+917842797467

Email : rishabhr123@gmail.com

IRC - gugu freenode.net #geeklog

Experience, Interests

Most interesting programming / scripting languages - C, C++, PHP, JS, Python.

Developer tools I work with - Vim, debuggers, git (now mercurial too), Firebug for Chromium, Clementine :)

Some time back Richard Stallman (who needs no introduction) visited our college during a short trip to India. I attended his talk, and I must say he inspired me. When I really got started, it was no less fun. This, I think, will be my way of giving back to society.

Quite some time ago I developed a basic group chat application using AJAX (http://web.iiit.ac.in/~rishabh.raj/chat2/form.php) and had a lot of fun designing it. The first release lacked a proper sign-in feature, so bots could crawl it and generate heavy load on the server. Having realized my mistake, I added an in-house CAPTCHA solution, and along with it support for some cool smileys! The application currently has a small session bug, which I have not yet had the time to fix.

I made the Indian top 20 in the Long Contest (11 days) held in December on Codechef! A long contest tests your ability to write efficient code and good implementations, apart from the usual problem-solving ability. Indian rank 16, world rank 57 - December Long @ Codechef - http://www.codechef.com/DEC12/#rank-tab-2 Username - rishabhraj93

I have completed courses such as Data Structures, Operating Systems and Graphics, each with an extensive amount of coding, and have successfully written programs of up to 5-6k LOC. I love to read lots of code and to conduct experiments with a defined logic.

I am determined to take on a steep learning curve this summer (the time when we have the most free time). I would like to use this Summer of Code period to learn more, gain a deeper insight into the wonderful open-source community, and build good relations with the Geeklog and FOSS communities, while developing a useful functionality.

Having already been associated with the community and having submitted patches, I am comfortable with the coding guidelines and the review procedure. I am already liking it, even after such a short period of association.

Is this my first time participating in the SoC - YES

Project Insight

Geeklog has been translated into over 30 languages, but most of that work was done some time ago, so the translations need feedback and improvements. When a new feature or a bug fix requires a message to be displayed to the user (almost always the case), such as an error message or some instructions to follow, the current way is to edit a PHP file and add content to a LANG array. We do not actually see the “context” in which this string gets used. This is cumbersome for someone without knowledge of PHP, and the amount of text we are dealing with is huge. Following an idea from Wim Niemans, we aim to address these issues by crowd-sourcing. Crowd-sourcing is the practice of soliciting contributions from a large community of people rather than a select few [cited from Wikipedia, The Free Encyclopedia], and I intend to develop a platform for crowd-sourcing Geeklog translations.

Project Goals

The crowd-sourcing platform enables a user to contribute to translating Geeklog into a different language. This may be by promoting or demoting existing translations, by providing a new translation, or by adding a translation for something which is missing.

Since users may lose interest in the daunting task of translation, we gamify the whole thing into a game of learning a new language (continued in More Details).

Development of an intuitive user interface for administering suggested translations, parts of which can be accessed by users as they gain privileges by becoming more trusted.

The system is largely autonomous, but it is monitored by a language admin who watches for spammers and for translations which may require further review, and who can ban spammers.

To develop the platform in a reusable way, so that it may be used by other organisations.

Future Goal

To develop a system of pull requests from the base installation to merge it with the existing Geeklog main repository, so that the new translations can be released in a newer version of Geeklog. Integrate a similar crowd-sourcing theme with the CAPTCHAs built into Geeklog.

More details follow

The basic idea involves developing the crowd-sourcing platform as a plugin which existing users can install to their Geeklog installation.

Each “piece of text” has some review points associated with it for each language, which indicate the quality of the translation.

One of the ideas was to display a string that's to be translated "in context", so that you - as the translator - would get a better idea what that word or sentence is really about.

Each user has a rating associated with him, better known as “trust points”, which indicates his level of proficiency.

Each user can choose whether to participate in the crowd-sourcing by changing his status in the plugin.

Each element pulled from the LANG array has an ID associated with it.

The main text translation table has the following fields: the ID of the text (just once), translate_language (for each language), language_reviewpoints (for each language), and a field for minimalistic context. The context is pulled from currently existing data; for example, values from the $LANG04 array are used in the user-settings form.

There is a ban list maintained in a table, which tracks users who have been banned by the admin. Before recording any activity, there is a check whether the user is banned, in which case his response is refused.
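The tables described above could be sketched roughly as follows. This is only an illustration in Python with SQLite, not the final Geeklog schema: it assumes one row per text and language instead of one column pair per language, and the table names, column names, the activity_log table and the helper functions are all hypothetical.

```python
import sqlite3

# Hypothetical schema: one row per (text, language) pair.
SCHEMA = """
CREATE TABLE translations (
    text_id       TEXT NOT NULL,      -- ID of the element pulled from the LANG array
    language      TEXT NOT NULL,
    translation   TEXT,               -- NULL while the translation is missing
    review_points INTEGER NOT NULL DEFAULT 0,
    context       TEXT,               -- minimalistic context
    PRIMARY KEY (text_id, language)
);
CREATE TABLE banned_users (
    user_id INTEGER PRIMARY KEY
);
CREATE TABLE activity_log (
    user_id INTEGER NOT NULL,
    action  TEXT NOT NULL
);
"""


def open_db():
    """Create an in-memory database holding the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def is_banned(conn, user_id):
    """Return True if the admin has put this user on the ban list."""
    row = conn.execute(
        "SELECT 1 FROM banned_users WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row is not None


def record_activity(conn, user_id, action):
    """Check the ban list first; refuse the response of a banned user."""
    if is_banned(conn, user_id):
        return False
    conn.execute(
        "INSERT INTO activity_log (user_id, action) VALUES (?, ?)",
        (user_id, action),
    )
    return True
```

Keeping the ban check inside record_activity means every kind of response (vote or suggestion) passes through the same gate.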

Control Flow

The admin installs the crowd-sourcing plugin and enables it.

If it is an already registered user with no specific language defined, we take his language choice as input or suggest a language based on the geographic location of his IP address.

From then on, upon logging in, the user receives some text pulled from the database, either for review or for translation, based upon the language he selects (via an AJAX call, maybe in a smooth popup?). This is part of the retrieval subsystem.

The Retrieval :

Suppose the user has selected the language “X”. Retrieval returns one text, together with its current rating, from among the texts with the lowest ratings in language “X”.
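A minimal sketch of this retrieval step, in plain Python over a list of rows. The row keys, the function name and the pool_size parameter (how many of the lowest-rated texts count as candidates) are assumptions made for illustration only.

```python
import random


def retrieve(translations, language, pool_size=5, rng=random):
    """Pick one text at random from the lowest-rated texts of a language.

    `translations` is a list of dicts with the (hypothetical) keys
    "text_id", "language" and "review_points".
    """
    candidates = [t for t in translations if t["language"] == language]
    if not candidates:
        return None
    # Lowest ratings first, then keep only the bottom `pool_size` entries.
    candidates.sort(key=lambda t: t["review_points"])
    pool = candidates[:pool_size]
    return rng.choice(pool)
```

Choosing at random within a small pool, instead of always returning the single lowest-rated text, avoids showing every user the same string.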

Based on the rating of the retrieved translation, the user now has three options, which to simplify things a bit are: up-vote the translation, down-vote it (both change the current rating of that translation), or suggest a new translation. In any case the response is sent by POST to the plugin on the server, which records the request and makes the required changes to the database as follows.

If the user up-votes or down-votes, his vote is registered by increasing or decreasing the rating of the translation by an amount equal to the user’s trust points. The concept of “trust points” is defined later on; simply put, a more trustworthy user has more “trust points”.

If the user suggests a new translation, the data is entered into the database in a separate table (named the pending_review table) with the following fields:

Text, Translated, Language, Userid_which_translated
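The handling of the three responses could look roughly like this. It is a sketch only: the function names and the in-memory list standing in for the pending_review table are hypothetical, while the fields follow the list above.

```python
def apply_vote(translation, trust_points, up):
    """Change the rating by the voter's trust points, up or down."""
    delta = trust_points if up else -trust_points
    translation["review_points"] += delta
    return translation["review_points"]


def suggest_translation(pending_review, text, translated, language, user_id):
    """Queue a new suggestion for the language admin to review."""
    record = {
        "text": text,
        "translated": translated,
        "language": language,
        "user_id": user_id,  # Userid_which_translated
    }
    pending_review.append(record)
    return record
```

For example, an up-vote from a user with 3 trust points moves a rating of 10 to 13, and a later down-vote from a user with 5 trust points moves it to 8.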

We start with the “minimalistic context” and then ask the "crowd" to improve it when necessary. This interface for providing alternate contexts (advanced options) is available to more “trusted” users (those with more “trust points”), i.e. basically when they have already done more translations. The user may also be allowed to provide an example of the use case or explain how the translation changes with the new context, thus making the job of the admin easier.

What is the 'context issue' ?

One of the ideas was to display a string that's to be translated "in context", so that you - as the translator - would get a better idea what that word or sentence is really about. The problem is that you can't really provide the context for a text string that you pick from the language file at random. There may be a template file involved, or the text depends on the user's role, or it may only be displayed when an error occurs. Such contextual issues are not expected to be too many in number.

Another possible solution to the context issue: display an option to see the context. On clicking "Show Context", that page is displayed in a modal box with the current user settings, possibly highlighting the context for that translation in the page inside the modal box, or with the majority of the other content stripped out by JavaScript. For example, "Show Context" for $LANG04 would reveal the user-settings form in the modal box, but formatted in a good way.

Note: This may not always be accessible (different users have different rights on pages). That is a limitation, but it is still a better and more direct solution than asking users to think up and submit a context themselves.

We would probably have to come up with a list of text strings for which this works or does not work, i.e. we would have to remember somewhere that you can not use this approach for $LANG24 (story editor) unless you have story.edit permissions. That information is not currently available in Geeklog, so we would have to invest some time into that first.

We can map the permissions into the main translation table, and check with the permissions of the user currently trying to access the context.

The community bonding period is when I plan to do this.
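The permission mapping could be sketched as a simple lookup. The mapping below is hypothetical and would have to be compiled by hand: only the $LANG24 / story.edit pairing comes from the discussion above, and treating $LANG04 as needing no special permission is an assumption.

```python
# Hypothetical mapping: LANG array name -> permission needed to view its page.
# None means any logged-in user can see the context page.
CONTEXT_PERMISSIONS = {
    "LANG04": None,          # user-settings form (assumed open to every user)
    "LANG24": "story.edit",  # story editor
}


def can_show_context(lang_array, user_permissions):
    """Return True if "Show Context" may be offered for this LANG array."""
    if lang_array not in CONTEXT_PERMISSIONS:
        # No information recorded yet: fall back to not showing the context.
        return False
    required = CONTEXT_PERMISSIONS[lang_array]
    return required is None or required in user_permissions
```

Strings from arrays that are not in the mapping simply fall back to the minimalistic context.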

The up-votes and down-votes are also logged in a table which the admin can view. If he feels some user has made a mistake, he can warn him or send him a message using this interface. The table has the following fields:

Text, Translation, Language, Userid_whichvoted

A language admin may not have terminal access to the database, so we can show users that their privacy is not being invaded: the language admin cannot see which user suggested a translation but only has an interface to accept or reject it, and the back-end handles sending a notification to the concerned user.

More Details

The language admin view of the plugin consists of data fetched from this pending_review table, grouped by language. Further, if the same text has been translated in different ways for the same language, those suggestions are also displayed to the admin in a grouped fashion. (Part of the UI)

Here the admin has several options. He can approve or reject a translation: approval increases the review rating of the translation by a very large amount (think of the admin as a user with a very, very high trust rating), and the trust points of the user who suggested the change increase.

Moreover sometimes we just need to correct a typo, so before approving the admin can edit the translation to fix such minor mistakes.

From here the admin can even “ban” a user (for spammers), or remove the ban on him.

There is an option to “Undo” the latest change; the details of this functionality will be decided later on (it may be time based, count based, etc.).

We provide him with an interface that allows him to merge suggested translations (based on different contexts) and make the result presentable before entering it back into the database, so that the next time it is pulled, other users are able to understand it well. If the user can provide a use case for some context, that is even more helpful! The important point here would be "merging two similar sounding arguments" from two or more different users.

In case of a rejection, a warning is sent to the user that the admin did not like the translation, a penalty is issued as per his history (a user with a lot of recent rejections is penalized more), and the user's trust points decrease.

In all cases the record is deleted from the pending_review table and a notification is sent to the user (this is optional, and could also be sent as a daily digest).
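The approve / reject flow above can be sketched as follows. The constants are placeholders, the flat rejection penalty is a simplification of "as per his history", and the in-memory dicts and lists stand in for the database tables; none of these names are final.

```python
ADMIN_APPROVAL_BONUS = 1000  # placeholder for "a very large amount"
TRUST_REWARD = 5             # placeholder reward for an accepted suggestion
TRUST_PENALTY = 5            # placeholder flat penalty for a rejection


def review(pending_review, index, approve, translations, trust, notifications):
    """Process one pending suggestion and remove it from the queue."""
    record = pending_review.pop(index)  # deleted in all cases
    user = record["user_id"]
    if approve:
        key = (record["text"], record["language"])
        entry = translations.setdefault(
            key, {"translation": None, "review_points": 0}
        )
        entry["translation"] = record["translated"]
        entry["review_points"] += ADMIN_APPROVAL_BONUS
        trust[user] = trust.get(user, 0) + TRUST_REWARD
        notifications.append((user, "accepted"))
    else:
        trust[user] = trust.get(user, 0) - TRUST_PENALTY
        notifications.append((user, "rejected"))
    return record
```

The admin never needs to see the user ID: the back-end reads it from the record and queues the notification itself.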

At the end of the day, or at some other fixed time set on the server, a cron job browses the language text translation table, fetches translated text within a defined translation rating range and writes it to the language file. This is automated but optional; it can also be done manually by exporting. The export also includes the user who suggested the translation, if that user has chosen this option.

The concept of review points for a translation ensures that even if the admin is not sure of a change, the users can bring it about if they are sure. Moreover, it gives the language admin a chance to correct a mistake (if he makes one) by manually lowering the rating (a special interface for this will be provided).
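As a rough illustration of the export step, the sketch below turns rows within a rating range into lines of a PHP language file. The row keys ("array", "key", "credit_user") and the exact output layout are assumptions; the real format of Geeklog's language files would have to be followed.

```python
def php_quote(value):
    """Wrap a string in single quotes, escaping backslashes and quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def export_lines(rows, min_points, max_points):
    """Build language-file lines for rows inside the rating range."""
    lines = []
    for row in rows:
        if not (min_points <= row["review_points"] <= max_points):
            continue
        line = "$%s[%s] = %s;" % (
            row["array"],
            php_quote(row["key"]),
            php_quote(row["translation"]),
        )
        # Only credit the translator if that user opted in.
        if row.get("credit_user"):
            line += " // translated by " + row["credit_user"]
        lines.append(line)
    return lines
```

A cron job or a manual export button could both call the same function and write the returned lines to the file.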

The major part that ensures the long term survival of crowd-sourcing is incentives to the crowd. We need to make better use of the data of translations we already have, hence the game of learning a new language.

For already defined translations approved by the admin, we display pop-ups with a message something like:

“Did you know <insert text> is known as <translated text> in <language>?”, with an option for “I wanna learn more”, which then decreases the interval at which the user is asked for translations or taught a new language!

With that, “How do you say it in <the language of the logged in user> ?”

More detail on this :

Considering the user is from Brazil. He comes to know that "an Apple" in German is known as "an Apfel", something he doesn't know. He is then asked "What is "an Apple" in Portuguese ?", for which his response is regarded as the Portuguese translation.

Now say a response is already stored in the database, as could be the case here. The question is then a quiz question, worth say “y” points to be added to the trust rating of the user.

We use fuzzy matching to compare the user's input with the data in the system. A match of more than 80% is regarded as a correct answer and the user is rewarded. When the match is not so strong, say less than 90%, a notification is sent to the admin that a better translation might exist (this notification system will have to be enabled by the admin and could be made a future goal, restricting the scope of the SoC project).

Fuzzy matching works best with phrases and may not work with individual words. To tackle this, we store multiple translations for single words in a new table named multiple_choice, linked to the main translations table by ID.
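A small sketch of the quiz grading, using difflib.SequenceMatcher from the Python standard library as one possible fuzzy matcher (the proposal does not fix a particular algorithm). The 80% and 90% thresholds come from the text above; the function names are hypothetical, and the list of accepted answers stands in for the stored translation plus any multiple_choice entries. Notifying the admin for every match below 90% is one reading of the rule.

```python
from difflib import SequenceMatcher

CORRECT_THRESHOLD = 0.80  # above this the answer counts as correct
NOTIFY_THRESHOLD = 0.90   # below this the admin may be notified


def similarity(a, b):
    """Similarity ratio in [0, 1], ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def grade_answer(answer, accepted):
    """Compare an answer with all accepted translations of a text.

    Returns (correct, notify_admin).
    """
    best = max(similarity(answer, candidate) for candidate in accepted)
    correct = best > CORRECT_THRESHOLD
    notify_admin = best < NOTIFY_THRESHOLD
    return correct, notify_admin
```

Taking the best score over several accepted answers is what lets single words with multiple valid translations be graded fairly.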

Users with fewer trust points are shown “smaller text”; the length increases as they improve. With more trust points the user is given more privileges too: the first promotion would involve being able to see other users' suggestions, one more promotion and he can approve or reject, etc.
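The promotion steps could be expressed as a small table of thresholds. The threshold values and privilege names below are placeholders chosen only to illustrate the idea; the real values would be tuned later.

```python
# Hypothetical tiers: (minimum trust points, privilege unlocked).
PRIVILEGE_TIERS = [
    (50, "see_suggestions"),  # first promotion
    (200, "approve_reject"),  # next promotion
]


def privileges(trust_points):
    """Return the privileges unlocked at a given number of trust points."""
    return [name for threshold, name in PRIVILEGE_TIERS
            if trust_points >= threshold]
```

With these placeholder values, a user with 60 trust points can see other suggestions but cannot yet approve or reject them.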

This can also be shown to users who opted not to specify their native language in the beginning. (since we are showing them the benefit, they can learn a new language easily now).

Some motivated users could lead this, which could later turn into a chain reaction. The language which we are teaching to the user can be chosen from a list of languages somewhat similar to the native language of the user, with which he can connect more.

Where a translation is used, we could use a tool-tip that indicates which user helped with the translation. This provides an added incentive to users, as they would now see their name displayed along with the translation! (This is the user's choice: if selected, the user’s name is exported along with the language files, otherwise not.) If these changes are further imported into the Geeklog repository, they will also acknowledge the users for their contributions.

Since user history is being tracked, we use it to reward a “star of the week”, who appears on the side panel of the plugin page for the users who have installed it. This is the user with the largest contribution to the task at hand; for the current scope, contribution refers only to the most accepted translations in recent times.

Along with this we have a leaderboard; users with the most contributions remain on top.
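Both the leaderboard and the “star of the week” can be derived from the history of accepted translations. The sketch below assumes that history is available as (user, accepted_at) pairs and that "recent times" means the last seven days; both are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def leaderboard(accepted, top=10):
    """Rank users by number of accepted translations, highest first."""
    counts = Counter(user for user, _ in accepted)
    return counts.most_common(top)


def star_of_the_week(accepted, now):
    """User with the most accepted translations in the last seven days."""
    since = now - timedelta(days=7)
    recent = [(user, when) for user, when in accepted if since <= when <= now]
    board = leaderboard(recent, top=1)
    return board[0][0] if board else None
```

Because both views read the same history, a global leaderboard would only need the histories of several installations to be combined.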

Future Scope for Leader-board / Star of the Week: Can be developed into a global leader-board that tracks contributions from people all across the world, Star of the week in a similar way

After being reviewed by the admin, and getting a specific number of votes in favor from the community, the translation is removed from the translations table and is hard-written into the language files. It is these changes which can be sent for pull requests into the main Geeklog repository by the admin but is a future goal for this project.

For a glance at a rough mock-up, see the link under "Additional Info", above the "Short Description" of the proposal.

Scheduling (Rough Plan)

The project has been divided into phases for ease of operation and monitoring by the mentor. During the entire SoC I will be committing code to [a repository, this will be assigned later on].

At the end of each day, I intend to write a very small daily blog about the work done during the day, the problems I faced, how they were solved, including suggestions from the mentor, and create a rough TO-DO for the next day. This will help me keep the target in sight, so that it is always within reach. A good way to do just that is http://idonethis.com .

I plan to give a rating to each task and proceed from easy to difficult. This, I believe, will help me be more comfortable when I reach the difficult portions.

Community Bonding Period [May 27 - June 17]

Read up on documentation, create a much clearer draft and design for the methods to be implemented, catch up on new technologies, and discuss further with Dirk. Finalize the database designs, applying any normalization or denormalization needed.

Work on the other solution for solving the context issue as mentioned above, coming up with a list of strings based on permissions for users who can access them. This information is currently not available in Geeklog.

Week 1 -

Basic plugin design, Database creations, fetching data from LANG arrays, any other table modifications to incorporate the plugin.

Week 2 - Setting up the AJAX back-end to handle different queries.

Week 3 - Set up the front-end for the plugin. Developing automated tests for the database, AJAX back-end.

Week 4 - Testing front-end communication with back-end AJAX handlers, vice versa.

Week 5 - Setting up fuzzy matching, pushing data in for “pending_reviews”, system for promoting or demoting.

Week 6 [Catch Up Period][July 22 - July 28] No matter how good the plan is, there could always be some delays, this is the week for making up for those delays, before the Mid-Term evaluation. In case of no delay, we are ahead of time.

Week 7 -

Design view for the admin, taking into account user privileges, users with some defined trust points can access different parts of the ADMIN user-interface.

Week 8 - Integrate admin responses with the back-end for handling them, such as response “x” results in “query y” to the database.

Week 9 - Further integration and testing with the user account, test user notifications.

Week 10 - Set-up of intricate user settings in the main user-interface for the plugin, such as “time interval upon which you want to answer the quiz”, “setting for exporting the username along with the translation” etc.

Week 11 - Develop a system for exporting the changes, test with writing language files. Develop tests for the export function.

Week 12 - Improve the interface, animations for popups, notifications etc. This later part involves the developer community actually testing the plugin, reporting bugs, suggesting a better UI, etc.

Week 13 [Catch Up Period][Sept 9 - Sept 16]

Any further re-factoring, tweaking and debugging of code (apart from the debugging during the actual development) is to be done during this period, i.e. before the suggested pencils-down date.

[Sept 16 - Sept 23] [Final week] Build on documentation, add tutorials so that users can make full use of the features. Merge sections of the blog, since they already contain a large portion of the work that has been done.

Any time later - Continue association with the Geeklog community, possibly work on extending this system, fulfilling future goals.

Documentation and testing are an integral part of this schedule. All unit tests are run with Jenkins.

Other obligations:

I will be travelling to my other house on June 6th, I would need a day off to settle there. This is during the Community Bonding Period.

I have 5 exams over a period of 3 days (5th - 7th Sept.). I would want to take 2-3 days off for preparation.

I wish to attend PyCon '13 - http://in.pycon.org/2013/ (30th Aug - 1st Sept), a one-of-a-kind conference held in India that brings together developers from all corners; there are not too many such opportunities otherwise.

This depends on mentor reviews about my progress.

No other obligations during the SoC period.

Patches for Geeklog :

http://project.geeklog.net/tracking/view.php?id=1580

http://project.geeklog.net/tracking/view.php?id=1490 - (minor fix to patch remains)
