Dušan Kreheľ (bot)

The bot URLs


@Lustiger seth: You can look at my source code.

I disabled URLs containing ";" because of semantic differences:

Wiki code:
 "<p>https://www.example.com/here&x67;text</p>"

For humans and for HTML/XML, the URL is:
 https://www.example.com/here&x67;text

For MediaWiki, the URL is:
 https://www.example.com/here

But, if i now testing, so i don't have the true. Ah. ✍️ Dušan Kreheľ (talk) 20:16, 12. Jul. 2022 (CEST)

hi!

thanks for continuing. i don't understand your last line. i guess, your english is even worse than mine, haha. :-)

however, i don't understand your example. https://www.example.com/here&x67;text is rendered as one url, same with https://www.example.com/heregtext. maybe that was different in an older mediawiki-version. i guess a semicolon is always considered as part of the url unless it's the last character and not part of an encoded char. -- seth 00:17, 13. Jul. 2022 (CEST)

Unfortunately, the bot also removes content

Spezial:Diff/222416525/224713336 Aoi Miyazaki. Biography. The italics and the name were removed
Spezial:Diff/223109321/224713552 Ariel belongs to the link text, as does the italic tag [https://www.imdb.com/name/nm1659650/bio? ref_=nm_ov_bio_sm''Ariel Waller.Biography'']
Spezial:Diff/224458719/224713840 the same there, and presumably in more places

This should not happen. Please build in a check for italic markup in the URL and stop before it. --Kind regards, Lómelinde (talk) 11:51, 22. Jul. 2022 (CEST)

@Lómelinde: More of these syntax errors can be found with, among other things, insource:/\[https?:\/\/[^\] ]+''/. Isn't there already an error list for this? --Leyo 15:20, 22. Jul. 2022 (CEST)
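
The same check can be sketched in Python for anyone who wants to test wikitext locally rather than through the search. This is only an illustration: the function name `has_glued_italics` and the two sample links are made up, and only the pattern itself comes from the search above.

```python
import re

# The pattern from the insource: search above, rewritten for Python's re module.
GLUED_ITALICS = re.compile(r"\[https?://[^\] ]+''")


def has_glued_italics(wikitext: str) -> bool:
    """Return True if an external link's URL runs straight into '' markup."""
    return GLUED_ITALICS.search(wikitext) is not None


# Hypothetical sample links, not taken from any article.
broken = "[https://example.org/page''Some title'']"
fine = "[https://example.org/page ''Some title'']"
print(has_glued_italics(broken), has_glued_italics(fine))
```

The pattern only fires when no space separates the URL from the italic markup, which is exactly the syntax error discussed here.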

Let me put it this way: we have in fact already raised the problem as such here, because some of these cases also lead to linter errors, like the three above, which is why I edited them in the first place. Wurgl says his APPERBot regularly tries to separate the italic tags from URLs, but unfortunately it keeps running into timeouts. There will unfortunately still be hundreds of such cases, and new entries keep being added. --Kind regards, Lómelinde (talk) 15:30, 22. Jul. 2022 (CEST)

Scolding or whining (or was it just grumbling?) helped. For the last few days the bot has not found anything more. --Wurgl (talk) 15:39, 22. Jul. 2022 (CEST)

The Cirrus search above still finds a few dozen, though. --Leyo 15:50, 22. Jul. 2022 (CEST)

Yes, and the search patterns that PC noted on the help page still find some as well. As I said, there could still be hundreds. But I am busy on a different construction site. Missing tags are the ones that are the least fun, which is why they are still the biggest construction site among the linter errors. Around 70,000 errors. --Kind regards, Lómelinde (talk) 16:03, 22. Jul. 2022 (CEST)

It will work again tomorrow. I had to extend something there and … well. --Wurgl (talk) 16:13, 22. Jul. 2022 (CEST)

Done. The URL string now ends at the apostrophe character ('). (@Lómelinde, Wurgl: Ping.) ✍️ Dušan Kreheľ (talk) 17:27, 22. Jul. 2022 (CEST)
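
One way to picture such a cut is the following Python sketch. It is not the bot's actual code: the function name `split_at_italics` is hypothetical, and it assumes the cut happens only at two consecutive apostrophes (wiki italic markup), since a single apostrophe can legitimately occur inside a URL.

```python
def split_at_italics(link_target: str) -> tuple[str, str]:
    """Split a raw link target into (url, trailing_markup) at the first ''.

    Assumption: only a run of two apostrophes ends the URL; a single
    apostrophe is kept as part of the URL.
    """
    idx = link_target.find("''")
    if idx == -1:
        return link_target, ""
    return link_target[:idx], link_target[idx:]


# Hypothetical inputs for illustration.
print(split_at_italics("https://example.org/page''Title''"))
print(split_at_italics("https://example.org/a'b"))
```

With this split, a cleaning step would only ever see the part before the markup and would leave the link text untouched.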

Thanks a lot. --Kind regards, Lómelinde (talk) 17:30, 22. Jul. 2022 (CEST)

@Dušan Kreheľ: FYI: InternetArchiveBot had (or still has) the same problem with the apostrophe, so you are not alone ;^) --Wurgl (talk) 19:57, 22. Jul. 2022 (CEST)

yes, this is not nice. imho the mediawiki software should rather fail on this instead of trying to parse it. however. we should eliminate the remaining syntax errors. i fixed a few now. -- seth 23:23, 23. Jul. 2022 (CEST)

i found a case where the current mediawiki quirks mode seems to fail (even though it doesn't matter in this particular case, because the link is dead anyway): in https://de.wikipedia.org/w/index.php?title=Geschichte_des_Ruhrgebiets&oldid=223679815#cite_note-LandtagIntern-227 the link reads:

LIN02912++)+and+((HNR%20ph%20like%203)%20and%20(JAHR%20=%2038))')&order=native('ID(1)%2FDescend+')&view=detail LANDTAG INTERN 3/2007

and is currently rendered as:

LIN02912++)+and+((HNR%20ph%20like%203)%20and%20(JAHR%20=%2038))')&order=native('ID(1)%2FDescend+')&view=detail LANDTAG INTERN 3/2007

instead of

LANDTAG INTERN 3/2007

-- seth 23:58, 23. Jul. 2022 (CEST)

Issues with finding archived versions in some cases


(Although this is the German Wikipedia, I'm writing in English for the bot operator's sake): Hi Dušan, it seems that in some cases, the bot makes it a bit difficult to find an archived version for a weblink that is offline, if the Internet Archive has only archived the URL including the tracking parameter (or session ID). For example, in Anton Rotzetter, there was this link in a reference:

http://www.kapuziner.ch/index.php?PHPSESSID=0ppcanms5hkm550qos100cehv1&na=11,0,0,0,d,,4939

This was a list of the members of the Capuchin convent in Fribourg, Switzerland, as of 2013 (as a source for the last convent where Anton Rotzetter lived), but now redirects to the Swiss Capuchins main page, so we need the archived version from 2013. Your bot removed the session ID and changed the URL to

http://www.kapuziner.ch/index.php?na=11,0,0,0,d,,4939

The Internet Archive hasn't archived anything under that URL. Only with the full URL originally used, including the session ID, we find the archived version we need:

https://web.archive.org/web/20131017102548/http://www.kapuziner.ch/index.php?PHPSESSID=0ppcanms5hkm550qos100cehv1&na=11,0,0,0,d,,4939

But I suppose it would be too complicated to find a way to somehow include this aspect into the activity of your bot? Gestumblindi 12:09, 22. Jul. 2022 (CEST)
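
The mismatch described above can be reproduced with a short Python sketch. The helper names `drop_params` and `wayback_url` are made up for illustration and are not the bot's code; the URLs and the snapshot timestamp are the ones quoted in this comment.

```python
from urllib.parse import urlsplit, urlunsplit


def drop_params(url: str, names: set[str]) -> str:
    """Remove the named query parameters, leaving the rest of the URL as written."""
    parts = urlsplit(url)
    kept = [
        pair for pair in parts.query.split("&")
        if pair and pair.split("=", 1)[0] not in names
    ]
    return urlunsplit(parts._replace(query="&".join(kept)))


def wayback_url(timestamp: str, url: str) -> str:
    """Build a Wayback Machine snapshot address in the form quoted above."""
    return f"https://web.archive.org/web/{timestamp}/{url}"


original = ("http://www.kapuziner.ch/index.php"
            "?PHPSESSID=0ppcanms5hkm550qos100cehv1&na=11,0,0,0,d,,4939")
cleaned = drop_params(original, {"PHPSESSID"})

print(cleaned)
print(wayback_url("20131017102548", original))
print(wayback_url("20131017102548", cleaned))
```

The last two lines differ, and according to the comment above only the first of them corresponds to a snapshot that actually exists.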

@Gestumblindi: It is best to have both kinds of URL: one as url and one as archive-url.

archive.org can handle some parameters.

For more, read this. --Dušan Kreheľ (talk) 16:34, 22. Jul. 2022 (CEST)

I'm not sure that this answer is helpful. Of course, it might be good to have the original URL and an archive URL as well, but this is about existing, older references where only a certain non-archived URL is in the reference, and about your bot making it harder to find the archived version in some cases, such as the described one. Gestumblindi 20:39, 22. Jul. 2022 (CEST)

@Gestumblindi: My solution is: I ignore the past. Because if I wanted to respect it constantly, the world would only get more complicated in the future. Change is change; sometimes you have to pay a price. Today is like "age zero". We simply cannot predict some changes. That is, if the link does not work, then maybe it would be good to have a template that marks it as a dead link. My bot doesn't change archive links. --Dušan Kreheľ (talk) 20:55, 22. Jul. 2022 (CEST)

Well, personally I can't really share an "ignore the past" approach when it comes to the massive existing article (and weblink) base of Wikipedia. Every edit should improve things, not make them worse, this also applies to bot edits, I'm sure you agree - but if a bot edit changes a non-working link where an archive version can be found quite easily to a non-working link where you have to dig out the previous version to find an archive version, this is not an improvement, obviously. And it's a simple fact that we have, unfortunately, still many dead links where we have to find an archive version. In the specific case, I have now inserted the archive link. Gestumblindi 21:35, 22. Jul. 2022 (CEST)

@Gestumblindi: "Every edit should improve things, not make them worse, this also applies to bot edits" Yes, you are right.

In short: the new workflow on Wikipedia would be to clean the URLs first and then archive them.

Why clean the URLs?

Yes
- No personal data (as URL parameters).
- Clean URLs to share with the world.
- The bot's syntax changes would be syntax changes only. No archiving task. No real link analysis.
- Archive bots should not establish an unofficial dictatorship (over URLs) without officially announced rules and consensus, as long as they function more like a supplement to the existing Wikipedias (as a supplement, the whole can also function without it).
No
- One must search for the archived page's URL in the Wikipedia history.
  - Why ignore this?
    - A dead link will remain just a dead link.
    - To be sure? The user can clean the URL and archive it manually.
    - Other bots clean URLs as well.
    - The primary purpose of Wikipedia is to share information, not to archive it. That is, cleaning URLs has a higher priority than archiving. Archiving should not block the sharing of information.
    - The current link may not be correct. Someone could have changed it in the history. For certainty, verify manually against the source (the author of the link, who first added it). Even the current version may not be correct. --Dušan Kreheľ (talk) 10:51, 23. Jul. 2022 (CEST), ✍️ Dušan Kreheľ (talk) 11:43, 23. Jul. 2022 (CEST)

@Gestumblindi and others: Which answer is right? Perhaps this one: in the Wikimedia movement, 2 bots on enwiki and 1 bot (besides mine) on dewiki are active with a similar or identical bot task. So this question could be settled there, or one could ask the other bot operators or the community to change the current (global) community practice. --Dušan Kreheľ (talk) 21:25, 22. Jul. 2022 (CEST)

@Cyberpower678, Harej: Please read this discussion. ✍️ Dušan Kreheľ (talk) 11:02, 23. Jul. 2022 (CEST)

I share Gestumblindi's concern. If your bot effectively makes it impossible for my bot to fix certain links, then it's being disruptive. I see no reason to be cleaning up dead URLs. If it's still alive, great, clean it up, and send a request to the Wayback Machine to archive it. If you can't accurately determine if the URL is alive, then leave it alone. You can't re-archive a dead URL. --CYBERPOWER (talk) 11:32, 23. Jul. 2022 (CEST)

@Cyberpower678: "If your bot effectively makes it impossible for my bot to fix certain links, then it's being disruptive." Fuer Your bot, no, but there is certainly a decrease in convenience for the user. --Dušan Kreheľ (talk) 11:51, 23. Jul. 2022 (CEST)

"Fuer Your bot, no"? Cyberpower's bot relies on searching in the Internet Archive for the link currently found in the article, too, not a previous version in the history. Gestumblindi 13:12, 23. Jul. 2022 (CEST)

Your response makes no sense to me. There is no point cleaning up dead links. Their archive versions won't even track since the trackers don't work on snapshot pages. By cleaning up a dead link, you make it impossible for InternetArchiveBot to find a working archive for it. Such behavior goes against the universal rule of making sources as accessible as possible. --CYBERPOWER (talk) 16:01, 25. Jul. 2022 (CEST)

@Cyberpower678 and others: Not every URL has a copy in the archive. So changing the URL does not always mean a loss in the archive.

If archive.org does not have exactly such a URI in its history, it can offer a similar one (i.e. also with other parameters).

What should an archive bot do? Archive the current URI (or access it), or find the oldest archived version of the URI content? If we want the oldest archived version, then the idea is a mistake: zero changes by my bot does not mean the URL was never changed in the page's history. --Dušan Kreheľ (talk) 16:58, 25. Jul. 2022 (CEST)

gudn tach!

i'm not sure, whether we talk about the same thing. imagine the following scenario:

last year:
- https://example.org?PHPSESSION=123 leads to the same page as https://example.org,
- both urls return 200
- somebody archived https://example.org?PHPSESSION=123 at archive.org, such that it is reachable via https://web.archive.org/web/20210706120000/https://example.org?PHPSESSION=123 from then on.
- https://example.org is not archived.
now:
- https://example.org?PHPSESSION=123 and https://example.org, both return 404
- https://web.archive.org/web/20210706120000/https://example.org?PHPSESSION=123 still leads to the old content
- https://web.archive.org/web/*/https://example.org returns 404, because the url without php session id was not archived.

this is the scenario i mentioned at [1] and what was explained by Gestumblindi in this thread (and what was repeated by Cyberpower678 later in this thread).

is this scenario clear up to this point?

in this case the deletion of the php session id would break a working link. that's why you should not touch those archived urls (although there is an old tracking param inside).

such tracking params in archived urls may only be deleted, if you are sure that the resulting pages are the same. do you have a mechanism to ensure this? -- seth 01:03, 26. Jul. 2022 (CEST)

Why this discussion? I thought that this bot has been deactivated and cannot be used anymore. -- Ulanwp (talk) 09:22, 26. Jul. 2022 (CEST)

we are talking about concerns that are actually independent of this special bot. furthermore the bot is deactivated for the present, but the problem might occur any time again. -- seth 10:33, 26. Jul. 2022 (CEST)

@Lustiger seth: "do you have a mechanism to ensure this?" No. And your bot? --Dušan Kreheľ (talk) 22:00, 26. Jul. 2022 (CEST)

i just skip archived urls. [2] -- seth 00:50, 27. Jul. 2022 (CEST)
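
A skip rule of that kind could look roughly like the following Python sketch. It is not CamelBot's code: the function name `is_archive_url` and the host list `ARCHIVE_HOSTS` are assumptions for illustration, and a real bot would probably know more archive services.

```python
from urllib.parse import urlsplit

# Assumed list of archive hosts; extend as needed.
ARCHIVE_HOSTS = {"web.archive.org"}


def is_archive_url(url: str) -> bool:
    """Return True if the URL points at a known archive host and should be skipped."""
    host = (urlsplit(url).hostname or "").lower()
    return host in ARCHIVE_HOSTS


# The two kinds of URL from the scenario above.
snapshot = "https://web.archive.org/web/20210706120000/https://example.org?PHPSESSION=123"
live = "https://example.org?PHPSESSION=123"
print(is_archive_url(snapshot), is_archive_url(live))
```

Only the host of the outer URL is inspected, so a session ID embedded in the archived address is never touched.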

The bot breaks links to New Zealand Herald web pages

The bot must be stopped immediately, because it breaks links to the New Zealand Herald's web pages: Original: http://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=10810672 - broken: http://www.nzherald.co.nz/nz/news/article.cfm?objectid=10810672. Regards -- Ulanwp (talk) 13:02, 22. Jul. 2022 (CEST)

These are the broken links: [3]. -- Regards, aka 13:35, 22. Jul. 2022 (CEST)

Thanks for the info, but those are not all of them; I have already repaired almost 50 as well. The bot must not start again like this!!! Regards -- Ulanwp (talk) 14:06, 22. Jul. 2022 (CEST)

Also things like this: [4]. I suppose one now has to go through all of today's article changes and check them one by one. Or simply revert everything from today and then see how to proceed.--Maphry (talk) 14:24, 22. Jul. 2022 (CEST)

Perhaps there are a few more among these: [5]. -- Regards, aka 14:34, 22. Jul. 2022 (CEST)

Thanks for the report. I am repairing the pages and my bot. (@Aka, Ulanwp, Maphry: Ping.) ✍️ Dušan Kreheľ (talk) 15:00, 22. Jul. 2022 (CEST)

@Dušan Kreheľ: Please stop developing bot scripts. If you can't make sure that the bot doesn't produce errors, please stop it. Many links are now damaged, not only those to the New Zealand Herald; the New York Times is also a victim of your script. -- Ulanwp (talk) 15:11, 22. Jul. 2022 (CEST)

@Ulanwp: New York Times: Do you have an example? --Dušan Kreheľ (talk) 15:16, 22. Jul. 2022 (CEST)

Christopher Small ... and there might be more ... -- Ulanwp (talk) 12:04, 23. Jul. 2022 (CEST)

@Ulanwp: The URL change is okay. The content at the URL is reachable either way, with or without my URL change. --Dušan Kreheľ (talk) 12:12, 23. Jul. 2022 (CEST)

OK, that's fine; yesterday it produced a 404 error. But your changes produce a lot of errors and extra work if you do not check for a valid URL path. That check is compulsory before making changes. -- Ulanwp (talk) 12:33, 23. Jul. 2022 (CEST)

Done: The domain nzherald.co.nz is on the blacklist. ✍️ Dušan Kreheľ (talk) 16:23, 22. Jul. 2022 (CEST)
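
What such a per-domain exclusion might look like is sketched below in Python. This is an assumption about the general idea, not the bot's actual implementation: the names `DOMAIN_BLACKLIST` and `is_blacklisted` are made up, and matching subdomains such as www is a design choice of this sketch.

```python
from urllib.parse import urlsplit

# Domains whose URLs are never rewritten (only the entry named above).
DOMAIN_BLACKLIST = {"nzherald.co.nz"}


def is_blacklisted(url: str) -> bool:
    """Return True if the URL's host is a blacklisted domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in DOMAIN_BLACKLIST)


print(is_blacklisted("http://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=10810672"))
print(is_blacklisted("https://example.org/?c_id=1"))
```

Checking the host before any parameter rule runs keeps a parameter like c_id intact on sites where it is part of the page address.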

General proposals


Hi, let me share some proposals for this task:

- check whether the new (truncated) link returns http 404 or equal, and if so, keep it as it was (to avoid insertion of broken links)
- also check if the old (untruncated) link returns http 404 or equal, and if so, keep it as it was (to address the issue with the web archive brought up by Gestumblindi)
- there should also be a domain blacklist, because simple parameter names like c_id could be used also in other contexts than tracking.
- rule of thumb for edit rate in dewiki is 5 edits/min, so slow down a bit.

Regards, -- hgzh 15:56, 22. Jul. 2022 (CEST)
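
The two status checks and the edit-rate rule from the proposals above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `choose_url` and `seconds_between_edits` are hypothetical, the status lookup is passed in as a function so the sketch makes no network requests, and "404 or equal" is read here as any status of 400 or above.

```python
from typing import Callable


def choose_url(old_url: str, new_url: str, get_status: Callable[[str], int]) -> str:
    """Return the URL that should end up in the article."""
    if old_url == new_url:
        return old_url
    # A dead original stays untouched so archive lookups still match it.
    if get_status(old_url) >= 400:
        return old_url
    # Never insert a shortened link that is itself broken.
    if get_status(new_url) >= 400:
        return old_url
    return new_url


def seconds_between_edits(edits_per_minute: float) -> float:
    """Minimum pause between two edits for a given edit rate."""
    return 60.0 / edits_per_minute


# Hypothetical status table standing in for real HTTP requests.
statuses = {
    "https://example.org/a?utm=1": 200,
    "https://example.org/a": 200,
    "https://example.org/b?c_id=1": 200,
    "https://example.org/b": 404,
}
print(choose_url("https://example.org/a?utm=1", "https://example.org/a", statuses.__getitem__))
print(choose_url("https://example.org/b?c_id=1", "https://example.org/b", statuses.__getitem__))
print(seconds_between_edits(5))
```

Injecting the status lookup keeps the decision logic testable on its own; a real bot would plug in an HTTP client and would also have to decide how to treat timeouts and redirects.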

@Hgzh: My answer.

- For the first and second points, read this.
- Domain blacklist for the parameter: OK.
- 5 edits per minute is for an unflagged test bot. For a normal bot, the rule on metawiki is one page edit per 5 seconds. --Dušan Kreheľ (talk) 16:29, 22. Jul. 2022 (CEST)

I support the proposals and cannot follow the objection. The bot obviously causes some problems, and/so there is no reason to run this with high speed. (Already suggested in Special:Diff/224412796.) --Krd 17:34, 22. Jul. 2022 (CEST)

i fully agree with Krd (and don't understand the objections against the http status checks, too).

CamelBot for example still uses the 5edits/min rate.[6] and normally it checks the http status code before changing links. -- seth 23:18, 23. Jul. 2022 (CEST)
