ArzymKoteyko

joined 1 week ago

Finally got my hands on the original DS9 OPT file and have started downloading the files it lists. I don't know how long it will take. I also made a git repo with the stats and index files from the DOJ website and the OPT from the archive: https://github.com/ArzymKoteyko/JEDatasets In short, the only difference is that I got an additional 1753 links to video files and a strange .docx file with a size of 0 bytes [EFTA00335487.docx].
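For anyone who wants to repeat this kind of comparison, here is a minimal sketch of a set difference between two lists of file names. It is not the script used for the repo; the function name `extra_entries` and every file name except EFTA00335487.docx are made up for illustration, and it assumes both indexes have already been reduced to plain file names.

```python
def extra_entries(opt_names, site_names):
    """Return the names found in the first list but not in the second, sorted."""
    return sorted(set(opt_names) - set(site_names))


# Hypothetical inputs: names from the OPT file vs. names from the website index.
opt_names = ["EFTA00000001.pdf", "EFTA00000002.mp4", "EFTA00335487.docx"]
site_names = ["EFTA00000001.pdf"]

for name in extra_entries(opt_names, site_names):
    print(name)
```

Using sets means duplicates inside either list do not affect the result.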

ArzymKoteyko@lemmy.world 4 points 1 week ago (3 children)

Hi everyone, maybe I'm a bit late to this, but I wanted to share my findings. I parsed every page up to 40k in DS9 three times, and the results matched the distribution in PeoplesElbow's findings (no content after page 14k and a lot of duplicates), BUT I parsed 4 times more unique URLs: 246,079 (still 2x short of the official size). A strange thing is that on the second pass (one day after the first) I started receiving new URLs on old pages.

Here is stat by file type:

 count  | file type
--------+-----------
      1 | ts
      8 | mov
    236 | mp4
 244326 | pdf
     73 | m4a
      1 | vob
      1 | docx
      1 | doc
      9 | m4v
   1422 | avi
      1 | wmv
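
A stat table like the one above can be produced from a list of scraped URLs with something like the sketch below. This is only an illustration, not the actual parser: the function name `count_by_extension` and the example.com URLs are assumptions, and it only covers deduplicating the URLs and grouping them by file extension.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse


def count_by_extension(urls):
    """Deduplicate the URLs, then count them by lowercase file extension."""
    counts = Counter()
    for url in set(urls):  # set() drops exact duplicate URLs
        suffix = PurePosixPath(urlparse(url).path).suffix
        counts[suffix.lstrip(".").lower() or "(none)"] += 1
    return counts


# Hypothetical sample; real input would be the URLs collected from all passes.
sample = [
    "https://example.com/files/EFTA00000001.pdf",
    "https://example.com/files/EFTA00000001.pdf",  # duplicate, counted once
    "https://example.com/files/EFTA00000002.pdf",
    "https://example.com/files/EFTA00000003.MP4",
    "https://example.com/files/EFTA00000004.avi",
]

for ext, n in count_by_extension(sample).most_common():
    print(f"{n:>7} | {ext}")
```

Merging the URL lists from every pass into one input before counting is what makes new URLs that only appear on a later pass show up in the totals.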