# Processing HTML Files

We will use **docling** to convert the crawled HTML files to markdown.

References
- [docling](https://github.com/DS4SD/docling)

## Step-1: Data

We will process data that is downloaded using [1_crawl_site.ipynb](1_crawl_site.ipynb).

We have a couple of crawled HTML files in the `input` directory.

## Step-2: Configuration

In [1]:
## All config is defined here
from my_config import MY_CONFIG
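
The `my_config` module itself is not shown in this notebook. As a rough sketch of what it might contain, assuming it only needs the two attributes used here (`CRAWL_DIR` and `PROCESSED_DATA_DIR`): the class name `MyConfig` and the `CRAWL_DIR` value are assumptions for illustration, while `workspace/processed` matches the output printed in the next cell.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MyConfig:
    # Assumed value, taken from the `input` directory mentioned in Step-1
    CRAWL_DIR: str = "input"
    # Matches the path printed by the cleanup cell in this notebook
    PROCESSED_DATA_DIR: str = "workspace/processed"


# Single shared instance, imported elsewhere as `from my_config import MY_CONFIG`
MY_CONFIG = MyConfig()
```

Keeping all paths in one frozen object means every notebook in the pipeline reads the same locations and cannot change them by accident.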

In [2]:
import os
import shutil

# Start from an empty output directory so files from earlier runs are not mixed in
shutil.rmtree(MY_CONFIG.PROCESSED_DATA_DIR, ignore_errors=True)
os.makedirs(MY_CONFIG.PROCESSED_DATA_DIR, exist_ok=True)
print (f"✅ Cleared processed data directory : {MY_CONFIG.PROCESSED_DATA_DIR}")

✅ Cleared processed data directory : workspace/processed
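
Because the cleanup cell deletes the whole processed directory, a misconfigured path could remove unrelated files. One possible guard, sketched here with a hypothetical helper `reset_dir` and an assumed workspace root of `workspace`, is to refuse to reset anything outside that root:

```python
import shutil
from pathlib import Path


def reset_dir(target: str, workspace_root: str = "workspace") -> Path:
    """Delete and recreate `target`, but only if it lies inside `workspace_root`."""
    root = Path(workspace_root).resolve()
    path = Path(target).resolve()
    # Reject the root itself and any path that is not below it
    if path == root or root not in path.parents:
        raise ValueError(f"Refusing to reset '{path}': not inside '{root}'")
    shutil.rmtree(path, ignore_errors=True)
    path.mkdir(parents=True, exist_ok=True)
    return path
```

With this helper, the cleanup step would become `reset_dir(MY_CONFIG.PROCESSED_DATA_DIR)`.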


## Step-3: Convert FILES --> MD

Convert the crawled HTML documents (and any PDFs found in the crawl directory) to markdown.

In [None]:
%%time 

import os
from pathlib import Path
from docling.document_converter import DocumentConverter

# Default converter; docling selects a pipeline per input format (HTML, PDF, ...)
converter = DocumentConverter()

input_path = Path(MY_CONFIG.CRAWL_DIR)
input_files = list(input_path.glob('*.html')) + list(input_path.glob('*.htm')) + list(input_path.glob('*.pdf'))
print (f"Found {len(input_files)} files to convert")

files_processed = 0
errors = 0
for input_file in input_files:
    try:
        result = converter.convert(input_file)
        markdown_content = result.document.export_to_markdown()

        md_file_name = os.path.join(MY_CONFIG.PROCESSED_DATA_DIR, f"{input_file.stem}.md")
        with open(md_file_name, "w", encoding="utf-8") as md_file:
            md_file.write(markdown_content)

        print (f"Converted '{input_file}' --> '{md_file_name}'")
        files_processed += 1
    except Exception as e:
        errors += 1
        print (f"Error processing {input_file}: {e}")

print (f"✅ Processed {files_processed} files. Errors: {errors}")
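
Output names in the loop above come from each input file's stem, so `index.html` and `index.pdf` would both be written to `index.md`, and the later one would overwrite the earlier. A minimal sketch of a pre-flight check, using a hypothetical helper `find_stem_collisions` and made-up file names:

```python
from collections import defaultdict
from pathlib import Path


def find_stem_collisions(paths):
    """Group input paths by stem and return only the stems shared by several files."""
    by_stem = defaultdict(list)
    for p in paths:
        by_stem[Path(p).stem].append(str(p))
    return {stem: names for stem, names in by_stem.items() if len(names) > 1}


# Illustrative file names, not taken from the actual crawl
collisions = find_stem_collisions(
    ["input/index.html", "input/index.pdf", "input/about.htm"]
)
print(collisions)  # {'index': ['input/index.html', 'input/index.pdf']}
```

Running `find_stem_collisions(input_files)` before the conversion loop would show whether any outputs are at risk of being overwritten.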