Spaces:
Runtime error
Runtime error
File size: 3,640 Bytes
a7d2416 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 |
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Processing HTML Files\n",
"\n",
"We will be using **docling**\n",
"\n",
"References\n",
"- [docling](https://github.com/DS4SD/docling)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step-1: Data\n",
"\n",
"We will process data that is downloaded using [1_crawl_site.ipynb](1_crawl_site.ipynb).\n",
"\n",
"We have a couple of crawled HTML files in `input` directory. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step-2: Configuration"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"## All config is defined here\n",
"from my_config import MY_CONFIG"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"✅ Cleared processed data directory : workspace/processed\n"
]
}
],
"source": [
"import os, sys\n",
"import shutil\n",
"\n",
"shutil.rmtree(MY_CONFIG.PROCESSED_DATA_DIR, ignore_errors=True)\n",
"shutil.os.makedirs(MY_CONFIG.PROCESSED_DATA_DIR, exist_ok=True)\n",
"print (f\"✅ Cleared processed data directory : {MY_CONFIG.PROCESSED_DATA_DIR}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step-3: Convet FILES --> MD\n",
"\n",
"Process HTML documents and extract the text in markdown format"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time \n",
"\n",
"import os\n",
"import sys\n",
"from pathlib import Path\n",
"from docling.document_converter import DocumentConverter\n",
"\n",
"converter = DocumentConverter(format_options={\"preserve_links\": True})\n",
"\n",
"input_path = Path(MY_CONFIG.CRAWL_DIR)\n",
"input_files = list(input_path.glob('*.html')) + list(input_path.glob('*.htm')) + list(input_path.glob('*.pdf'))\n",
"print (f\"Found {len(input_files)} files to convert\")\n",
"\n",
"files_processed = 0\n",
"errors = 0\n",
"for input_file in input_files:\n",
" try:\n",
" result = converter.convert(input_file)\n",
" markdown_content = result.document.export_to_markdown()\n",
" \n",
" md_file_name = os.path.join(MY_CONFIG.PROCESSED_DATA_DIR, f\"{input_file.stem}.md\")\n",
" with open(md_file_name, \"w\", encoding=\"utf-8\") as md_file:\n",
" md_file.write(markdown_content)\n",
" \n",
" print (f\"Converted '{input_file}' --> '{md_file_name}'\")\n",
" files_processed += 1\n",
" except Exception as e:\n",
" errors += 1\n",
" print (f\"Error processing {input_file}: {e}\")\n",
"\n",
"print (f\"✅ Processed {files_processed} files. Errors: {errors}\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "allycat-1",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
|