Problem 1 Automatically collect 10,000 unique documents from memphis.edu. Collect only .html, .txt, and .pdf web files, convert each one to plain text, and keep a document only if it is proper, i.e., it contains more than 50 valid tokens after conversion. Make sure the converted text keeps none of the presentation markup, such as HTML tags. You may use third-party tools to convert the original files to text, but you must write your own code to collect the documents - DO NOT use an existing or third-party crawler. Your output should be a set of 10,000 text files (not the original HTML, TXT, or PDF documents) of more than 50 textual tokens each. For each proper file, store the original URL, as you will need it later when displaying results to the user.
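A minimal sketch of the collection step might look like the following. The names (extract_links, is_target, strip_tags) and the choice to treat extension-less paths as HTML pages are illustrative assumptions, not requirements of the assignment; the fetch loop, politeness delays, and PDF-to-text conversion (allowed via third-party tools such as pdftotext) are left out.

```python
# Sketch of the pieces of a hand-written crawler: link extraction,
# URL filtering, and crude HTML-to-text stripping. All names here are
# illustrative; only the standard library is used.
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

ALLOWED_EXTS = (".html", ".htm", ".txt", ".pdf")

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href=...> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html_text, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    return parser.links

def is_target(url):
    """Keep only memphis.edu URLs with an allowed extension; a bare
    path such as /dept/ is assumed to serve HTML (an assumption)."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host != "memphis.edu" and not host.endswith(".memphis.edu"):
        return False
    path = parsed.path.lower()
    last = path.rsplit("/", 1)[-1]
    return path.endswith(ALLOWED_EXTS) or "." not in last

def strip_tags(html_text):
    """Crude fallback converter: drop script/style blocks and all
    remaining tags, keeping only the text content."""
    html_text = re.sub(r"(?is)<(script|style).*?</\1>", " ", html_text)
    text = re.sub(r"(?s)<[^>]+>", " ", html_text)
    return re.sub(r"\s+", " ", text).strip()
```

A breadth-first loop over a queue of URLs, a visited set, and a dictionary mapping each saved filename to its original URL (persisted to disk, since the URL is needed later for display) would complete the crawler; counting tokens in the stripped text decides whether a document is proper.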
Problem 2 Preprocess all the files using your solution to assignment #4: a Python program that preprocesses a collection of documents following the recommendations given in the Text Operations lecture. The input to the program is a directory containing the 10,000 unique text files collected in Problem 1 (the documents must already be converted to text).
Remove the following during the preprocessing:
- stop words (use the generic list available at ...ir-websearch/papers/english.stopwords.txt)
- URLs and other HTML-like strings
- morphological variations (i.e., apply stemming)
Save all preprocessed documents in a single directory. Then build an inverted index of the set of preprocessed files. Use the raw term frequency (tf) of each term in a document, without normalizing it, and save the generated index, including the document frequency (df) of each term, in a file so that you can retrieve it later.
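The preprocessing and indexing steps above can be sketched as follows. The stopword set is assumed to be loaded from the list linked above; light_stem is a deliberately crude placeholder for a real stemmer (e.g. the Porter algorithm), and the nested-dictionary index layout is one possible format, chosen because it serializes directly to JSON.

```python
# Sketch: preprocess token streams and build an inverted index with
# raw tf per document and df per term. Names and index layout are
# illustrative assumptions.
import re
from collections import Counter, defaultdict

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # URLs
TAG_RE = re.compile(r"<[^>]+>")                  # HTML-like strings
TOKEN_RE = re.compile(r"[a-z]+")                 # alphabetic tokens

def light_stem(token):
    """Placeholder stemmer: strips a few common suffixes. A real
    solution would use a proper stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

def preprocess(text, stopwords):
    """Remove URLs and HTML-like strings, lowercase, tokenize,
    drop stopwords, and reduce morphological variations."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text.lower())
    return [light_stem(t) for t in tokens if t not in stopwords]

def build_index(docs):
    """docs: {doc_id: token list}. Returns
    {term: {"df": doc_count, "tf": {doc_id: raw_frequency}}}."""
    index = defaultdict(lambda: {"df": 0, "tf": {}})
    for doc_id, tokens in docs.items():
        for term, freq in Counter(tokens).items():
            index[term]["tf"][doc_id] = freq
            index[term]["df"] += 1
    return dict(index)
```

Because the structure is plain dictionaries, saving and reloading the index is a one-liner each way with json.dump and json.load, which satisfies the requirement that the index (with df) be retrievable later.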