This is something that keeps me worried at night. Unlike other historical artefacts like pottery, vellum writing, or stone tablets, information on the Internet can just blink into nonexistence when the server hosting it goes offline. This makes it difficult for future anthropologists who want to study our history and document the different Internet epochs. For my part, I always try to send any news article I see to an archival site (like archive.ph) to help collectively preserve our present so it can still be seen by others in the future.
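
If you want to script that habit, the Internet Archive’s Wayback Machine exposes a public save endpoint that works well for this. A minimal sketch in Python (the endpoint is rate-limited and its exact redirect behavior is an assumption on my part, so treat this as illustrative):

```python
# Sketch: ask the Internet Archive's Wayback Machine to snapshot a URL.
# The /save/ endpoint is public but rate-limited, and its exact redirect
# behavior is an assumption here; treat this as illustrative.
import urllib.request

def archive_url(url: str) -> str:
    req = urllib.request.Request(
        "https://web.archive.org/save/" + url,
        headers={"User-Agent": "personal-archiver/0.1"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.url  # urlopen follows the redirect to the snapshot

print(archive_url("https://example.com/some-news-article"))
```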

  • thejml@lemm.ee · 1 year ago

    It’s important here to think about a few large issues with this data.

    First, data storage. Other people in here are talking about decentralizing and creating fully redundant arrays so multiple copies are always online and can easily be migrated from one storage technology to the next. There’s a lot of work here, not just in gathering all the data but in making sure it keeps moving forward as we develop new technologies and new storage techniques. This won’t be a cheap endeavor, but it’s one we should try to keep up with. Hard drives die and bit rot happens: even powered off, a spinning drive will eventually fail, as will an SSD. CDs I burned 15+ years ago are no longer 100% readable.
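
    On the bit rot point: even without a self-healing file system, silent corruption can be caught with a checksum manifest that gets re-verified on a schedule. A minimal sketch in Python (the archive root and manifest name are made up for illustration):

    ```python
    # Sketch: detect bit rot by keeping a SHA-256 manifest and re-verifying it.
    import hashlib
    import json
    import pathlib

    ARCHIVE = pathlib.Path("/mnt/archive")        # hypothetical archive root
    MANIFEST = ARCHIVE / "manifest.sha256.json"   # hypothetical manifest name

    def sha256(path: pathlib.Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def build_manifest() -> None:
        digests = {
            str(p.relative_to(ARCHIVE)): sha256(p)
            for p in ARCHIVE.rglob("*")
            if p.is_file() and p != MANIFEST
        }
        MANIFEST.write_text(json.dumps(digests, indent=2))

    def verify_manifest() -> None:
        for rel, old in json.loads(MANIFEST.read_text()).items():
            p = ARCHIVE / rel
            if not p.exists():
                print("MISSING:", rel)
            elif sha256(p) != old:
                print("CORRUPTED:", rel)  # silent bit rot caught here

    build_manifest()   # run once after writing the archive
    verify_manifest()  # re-run periodically, e.g. from a monthly cron job
    ```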

    Second, data organization. How can you find what you want later when all you have are system images, database backups, and static flat files of websites? A lot of sites now require JavaScript and other browser features just to be viewed or used. If all you have is a flat file full of rendered HTML, can you really still find the page you want? Search boxes won’t work, and API calls will fail without the real site up and running. Databases have to be restored before they can be queried, and if they’re relational, who will know how to connect those dots?
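
    To make that concrete: once the live site is gone, the only search you get is one you build yourself over the flat files. A crude sketch of an offline full-text index (the dump directory is hypothetical, and this ignores stemming, scripts, and styles):

    ```python
    # Sketch: a crude offline full-text index over a static HTML dump, so pages
    # can still be found once the live site's search box and APIs are gone.
    import pathlib
    import re
    from collections import defaultdict
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect visible-ish text (crudely; script/style contents slip in)."""
        def __init__(self):
            super().__init__()
            self.chunks = []
        def handle_data(self, data):
            self.chunks.append(data)

    def page_text(html: str) -> str:
        p = TextExtractor()
        p.feed(html)
        return " ".join(p.chunks)

    index = defaultdict(set)  # word -> files containing it
    for page in pathlib.Path("site-dump").rglob("*.html"):  # hypothetical dir
        text = page_text(page.read_text(errors="ignore")).lower()
        for word in re.findall(r"[a-z0-9]+", text):
            index[word].add(str(page))

    # Query: which archived pages mention "bitrot"?
    print(sorted(index["bitrot"]))
    ```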

    Third, formats. Similar to the previous point, but what happens when JPG is deprecated in favor of something better? Can you currently open that file you wrote in 1985? Will there still be a program available to decode it? We’ll have to back those programs up as well, along with the OSes they run on. And if there are no processors left that can run them, we’ll need emulators. Standards obviously help here: we may not forget how to read a PCX or GIF or JPG file for a while, but more niche formats will definitely fall by the wayside.
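
    One practical defense is to periodically audit whether today’s tools can still decode everything in the archive. A sketch using the Pillow library (my assumption for the decoder; the directory is hypothetical):

    ```python
    # Sketch: audit an archive for images that current tools can't decode.
    # Assumes the Pillow library (pip install Pillow); the directory is made up.
    import pathlib
    from PIL import Image, UnidentifiedImageError

    EXTS = {".jpg", ".jpeg", ".png", ".gif", ".pcx", ".tif", ".tiff"}

    for path in pathlib.Path("archive/images").rglob("*"):
        if path.suffix.lower() in EXTS:
            try:
                with Image.open(path) as im:
                    im.verify()  # cheap decodability/integrity check
            except (UnidentifiedImageError, OSError) as err:
                print(f"cannot decode {path}: {err}")
    ```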

    Fourth, timescale. Can we keep this stuff for 50 years? 100? 1,000? What happens when our great*30-grandchildren want to find this info? We regularly find things from a few thousand years ago here on Earth at archaeological dig sites. There’s a difference between backing something up for use in a few months and for use in a few years, let alone a few hundred or thousand: data storage will be vastly different by then, as will processors and displays. Or what happens in a Horizon Zero Dawn scenario, where all the secrets are locked up in a vault of technology left to rot, which no one knows how to use because we’ve nuked ourselves into regression?

    • digitallyfree@kbin.social · 1 year ago

      I guess I can talk a bit about the first and third points for my personal archiving (certainly not on a global scale).

      • For data storage, data should be checked regularly for bit rot and corruption, preferably on a file system that can heal itself when corruption occurs. Personally I use ZFS RAIDZ with regular scrubs to make sure my data stays bit-perfect (see the scrub sketch after this list). Disks that repeatedly show issues are trashed, even if they appear to run fine and report good SMART status. For optical discs in a safe or similar, I reburn them every ten years or so, even if they’re still readable, to keep the medium fresh.

      • I’ve actually known someone who had to painfully set up a Windows 95 computer to convert old digital pictures, stored in a proprietary format, from an equally old digital camera. Obviously that’s a no-go. For my archives I try to use standard open formats like PNG, PDF, etc. that won’t go away for a long time and can be reconverted as part of an archive update if the format starts to become obsolete (see the conversion sketch after this list). You can’t just digitally archive everything and expect it to be easily readable after a hundred years. I don’t do this myself, but if space is limitless, lossless formats could be used (PNG for photos, FLAC for audio, etc.) so any conversions remain true to the original capture.
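
      For the scrub driver mentioned in the first point, here’s a minimal sketch that could run from cron. The pool name “tank” is hypothetical; zpool itself does the real work:

      ```python
      # Sketch: drive a ZFS scrub from a scheduled script and dump the result.
      # "tank" is a hypothetical pool name; zpool does the real work.
      import subprocess

      POOL = "tank"

      def start_scrub() -> None:
          # Returns immediately; the scrub itself runs in the background.
          subprocess.run(["zpool", "scrub", POOL], check=True)

      def scrub_report() -> str:
          # Grep this for "scrub repaired" / "errors: No known data errors".
          out = subprocess.run(["zpool", "status", POOL],
                               capture_output=True, text=True, check=True)
          return out.stdout

      start_scrub()
      print(scrub_report())
      ```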
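
      And for the archive-update conversion in the second point, a sketch of a lossless re-encode pass into PNG (Pillow assumed again; which source formats count as at-risk is a judgment call):

      ```python
      # Sketch: an "archive update" pass re-encoding at-risk formats into PNG.
      # Pillow is assumed; the directory and the at-risk list are hypothetical.
      import pathlib
      from PIL import Image

      AT_RISK = {".pcx", ".bmp", ".tga"}

      for path in pathlib.Path("archive/images").rglob("*"):
          if path.suffix.lower() in AT_RISK:
              out = path.with_suffix(".png")
              with Image.open(path) as im:
                  im.save(out)  # PNG is lossless, so nothing is thrown away
              print(f"converted {path} -> {out}")
      ```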

        • digitallyfree@kbin.social · 1 year ago

          TIFF is a classic storage format, but PNG is common for web images and isn’t going away either. DNG is for RAW sensor output from professional cameras and isn’t used for edited and published images. However, if you’re archiving your photo collection or something similar, then keep the DNGs!