A leading newspaper, XYZ, is set to make its old public domain articles (from 1920-1980) freely available on its website. Because the older articles are stored as scanned TIFF images, the paper needs a scalable image-processing system that can combine the pieces of each article into a single file in the desired PDF format. These articles previously sat behind a paywall and therefore received little traffic.

XYZ initially chose a real-time approach: scale, glue, and convert the TIFF images on demand. They soon realized that while this works well enough for a low volume of requests, it will not scale to handle the significant traffic increase expected once the articles become free. They therefore decided to pre-generate all the articles as PDF files and serve them like any other static content. XYZ already has the code to convert TIFF images to PDF files, so it looks like a simple matter of batch processing all the articles in one sitting instead of handling each article as a request comes in. The challenging part of this solution became apparent when they realized that the archive holds 1100 million articles comprising 400 TB of data.
You have been hired as a consultant to solve this problem. Suggest the most suitable solution, and explain the technological architecture of that solution.