CCC: Identify processes to increase speed and files checked per minute
Open, Needs Triage, Public

Description

The check currently runs at a rate of 800 files per half hour. At that rate, it will take 4.2 years to check all files.
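
For scale, here is a rough back-of-the-envelope version of those numbers. It is only a sketch: it assumes a 365-day year and uninterrupted running, and the implied total file count is inferred from the two figures above rather than taken from this task. The function name `years_at_speedup` is made up for illustration.

```python
# Figures quoted in the task description.
FILES_PER_HALF_HOUR = 800
YEARS_AT_CURRENT_RATE = 4.2

# Assumption: the bot runs around the clock, 365 days a year.
files_per_day = FILES_PER_HALF_HOUR * 2 * 24  # 38,400 files per day

# Total number of files implied by the two figures above (an inference).
implied_total_files = round(files_per_day * 365 * YEARS_AT_CURRENT_RATE)


def years_at_speedup(speedup):
    """Years needed if overall throughput is multiplied by `speedup`."""
    return YEARS_AT_CURRENT_RATE / speedup


if __name__ == "__main__":
    print(f"files per day: {files_per_day}")
    print(f"implied total files: {implied_total_files}")
    for factor in (2, 4, 10):
        print(f"{factor}x faster: {years_at_speedup(factor):.2f} years")
```

Under these assumptions, even doubling throughput still leaves a run of about 2.1 years, so the speedup needs to be large to matter.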

Based on these numbers, we should identify and implement ways to speed up the check and reduce the time needed to cover all files.

Event Timeline

TheSandDoctor changed the visibility from "All Users" to "Public (No Login Required)". Feb 9 2020, 10:28 PM

It's hard to know exactly what is slowing things down without a profile. @TheSandDoctor, could you run corrupt.py through cProfile? I'd expect the largest sources of delay to be these three:

  1. Downloading the file (network)
  2. Checking the file (computation)
  3. Communicating with the wiki for edits and lists (network)
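
For a quick first look, `python -m cProfile -s cumulative corrupt.py` works without touching the code. To profile just a few files, something like the sketch below could be used. The three stage functions are hypothetical stand-ins (I don't know how corrupt.py is actually structured), with sleeps and a busy loop simulating network and computation time.

```python
import cProfile
import io
import pstats
import time


# Hypothetical stand-ins for the three suspected stages in corrupt.py.
def download_file():
    time.sleep(0.02)  # simulated network download


def check_file():
    sum(i * i for i in range(20000))  # simulated image verification


def talk_to_wiki():
    time.sleep(0.01)  # simulated API call for edits and lists


def process_one_file():
    download_file()
    check_file()
    talk_to_wiki()


def profile_run(n_files=3):
    """Profile n_files iterations and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(n_files):
        process_one_file()
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the slowest stage appears near the top.
    stats.sort_stats("cumulative").print_stats(10)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_run())
```

The `cumtime` column in the report shows how much wall time each stage accounts for, which should tell us whether the network or the check itself dominates.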

I don't know much about Pillow, so I can't speak to how fast it is or how to speed it up. There are two ways to speed up network-bound work: bypassing the network and parallelizing. You could use a database connection to get page data instead of relying on PWB and going out over the network, which could speed things up a tiny bit. Unfortunately, that gain is small compared with the time spent downloading the often-large Commons files.

You could try checking a smaller thumbnail of the image for corruption instead of the full size, which would download faster.
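
A rough sketch of the thumbnail idea, assuming the standard MediaWiki `imageinfo` query with `iiurlwidth` is used to get a scaled URL. The helper names and the 320px width are made up for illustration, and the Pillow check is my guess at what a corruption test might look like, not necessarily what corrupt.py does. One caveat: a thumbnail is rendered server-side, so this tests the rendered image rather than the original bytes.

```python
import io


def thumbnail_query_params(title, width=320):
    """Parameters for an action=query request that returns a thumbnail URL."""
    return {
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        "iiprop": "url",
        "iiurlwidth": width,  # ask the API for a scaled-down rendering
        "titles": title,
    }


def extract_thumb_url(response):
    """Pull the first thumburl out of a parsed API response, or None."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        for info in page.get("imageinfo", []):
            if "thumburl" in info:
                return info["thumburl"]
    return None


def looks_corrupt(data):
    """Return True if Pillow cannot verify and fully decode the image bytes."""
    from PIL import Image  # imported here so the helpers above work without Pillow

    try:
        with Image.open(io.BytesIO(data)) as img:
            img.verify()  # structural check; the image must be reopened afterwards
        with Image.open(io.BytesIO(data)) as img:
            img.load()  # force a full decode of the pixel data
    except Exception:
        return True
    return False
```

The params dict would be sent to the wiki's `api.php` with whatever HTTP client the bot already uses, and the bytes fetched from the returned URL passed to `looks_corrupt`.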

Running a second process is probably the best way to speed things up overall.
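
If a second process is started, the two need some way to avoid checking the same files. One simple option, sketched below, is to split titles by a stable hash. The function names and the idea of passing a worker index to each process are assumptions for illustration, not something corrupt.py supports today.

```python
import zlib


def belongs_to_worker(title, worker_index, worker_count):
    """Decide whether this worker should handle the given file title.

    zlib.crc32 is used instead of the built-in hash() because string
    hashes are randomized per Python process, so two separate processes
    would otherwise disagree about who owns which title.
    """
    checksum = zlib.crc32(title.encode("utf-8"))
    return checksum % worker_count == worker_index


def titles_for_worker(titles, worker_index, worker_count):
    """Filter an iterable of titles down to this worker's share."""
    return [t for t in titles if belongs_to_worker(t, worker_index, worker_count)]


if __name__ == "__main__":
    sample = ["File:A.jpg", "File:B.png", "File:C.gif", "File:D.svg"]
    for index in range(2):
        print(index, titles_for_worker(sample, index, 2))
```

Each process would be launched with its own index (0 or 1 for two workers) and would skip any title that belongs to the other one. The split is not guaranteed to be perfectly even, but over millions of titles it should be close.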