
How to extract text from a PDF?

1. Install the required packages:

!pip install PyMuPDF
!pip install pillow
!pip install pymupdf-fonts 
!apt-get install poppler-utils
!pip install pdf2image
!pip install easyocr
PyMuPDF, Pillow, pymupdf-fonts and poppler-utils are already present on the Colab image, so only pdf2image-1.16.0 and easyocr-1.4 (plus its python-bidi dependency) actually get downloaded and installed.
from pdf2image import convert_from_path   # renders PDF pages to PIL images (needs poppler-utils)
import easyocr                             # OCR engine used for text detection and recognition
import numpy as np
import PIL
from PIL import ImageDraw                  # used later to draw the detected text boxes
import spacy

# Build an English reader; the detection and recognition models are downloaded on first use.
reader = easyocr.Reader(['en'])
CUDA not available - defaulting to CPU. Note: This module is much faster with a GPU.
Downloading detection model, please wait. This may take several minutes depending upon your network connection.

Downloading recognition model, please wait. This may take several minutes depending upon your network connection.

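The detection and recognition models are cached after this first download, so re-running the cell skips it. If you want to make the CPU choice explicit or control where the weights are stored, easyocr.Reader accepts a couple of extra arguments. A minimal sketch, where the easyocr_models directory is an assumed path not used elsewhere in this notebook:

reader = easyocr.Reader(
    ['en'],
    gpu=False,                                 # this runtime has no GPU, so state the CPU choice explicitly
    model_storage_directory='easyocr_models',  # assumed cache location for the downloaded weights
)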
# Render every page of the paper as a PIL image (200 dpi by default).
images = convert_from_path('attention.pdf')
from IPython.display import display
display(images[0])   # show the first page
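If the recognised text comes back noisy, one knob worth trying before touching the reader itself is the rendering resolution: convert_from_path rasterises pages at 200 dpi by default, and a higher value gives easyocr a sharper input at the cost of more memory. A small sketch, with 300 dpi as an arbitrary choice rather than a tuned value:

# Hypothetical variant: re-render the pages at a higher resolution for sharper OCR input.
images_hires = convert_from_path('attention.pdf', dpi=300)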
# Run OCR on the first page; the *_ths parameters loosen easyocr's box-merging heuristics.
bounds = reader.readtext(np.array(images[0]), min_size=0, slope_ths=0.2,
                         ycenter_ths=0.7, height_ths=0.6, width_ths=0.8)
bounds
[([[586, 274], [1116, 274], [1116, 322], [586, 322]],
  'Attention Is All You Need',
  0.5345810919336591),
 ([[368, 514], [558, 514], [558, 544], [368, 544]],
  'Ashish Vaswani',
  0.9173051771683187),
 ([[662, 514], [838, 514], [838, 544], [662, 544]],
  'Noam Shazeer',
  0.8793502329607361),
 ([[938, 516], [1094, 516], [1094, 542], [938, 542]],
  'Niki Parmar',
  0.935491384408255),
 ([[1176, 513], [1383, 513], [1383, 546], [1176, 546]],
  'Jakob Uszkoreit*',
  0.6710093745918249),
 ([[383, 545], [541, 545], [541, 581], [383, 581]],
  'Google Brain',
  0.9998002881679157),
 ([[672, 546], [828, 546], [828, 578], [672, 578]],
  'Google Brain',
  0.9990559995329158),
 ([[917, 545], [1113, 545], [1113, 581], [917, 581]],
  'Google Research',
  0.7325136297360176),
 ([[1177, 545], [1373, 545], [1373, 581], [1177, 581]],
  'Google Research',
  0.9187121523133165),
 ([[322, 578], [602, 578], [602, 608], [322, 608]],
  'avaswani@google . com',
  0.7111753136116307),
 ([[638, 578], [862, 578], [862, 608], [638, 608]],
  'noam@google. com',
  0.6576573857256617),
 ([[896, 577], [1134, 577], [1134, 607], [896, 607]],
  'nikipegoogle . com',
  0.9172951907534743),
 ([[1168, 576], [1378, 576], [1378, 609], [1168, 609]],
  'usz@google . com',
  0.7882798825568623),
 ([[42, 596], [98, 596], [98, 716], [42, 716]], '3', 0.9783897523011689),
 ([[398, 651], [550, 651], [550, 682], [398, 682]],
  'Llion Jones *',
  0.81909539602757),
 ([[689, 650], [903, 650], [903, 686], [689, 686]],
  'Aidan N. Gomez*',
  0.9102146632803905),
 ([[1090, 652], [1282, 652], [1282, 682], [1090, 682]],
  'Eukasz Kaiser*',
  0.8289530821006998),
 ([[368, 682], [567, 682], [567, 720], [368, 720]],
  'Google Research',
  0.999920335219019),
 ([[676, 681], [923, 681], [923, 719], [676, 719]],
  'University of Toronto',
  0.9788376275771502),
 ([[1100, 682], [1259, 682], [1259, 720], [1100, 720]],
  'Google Brain',
  0.999715231406975),
 ([[350, 716], [588, 716], [588, 746], [350, 746]],
  'llion@google . com',
  0.6921915064430954),
 ([[652, 716], [948, 716], [948, 742], [652, 742]],
  'aidan@cs. toronto. edu',
  0.5912333463365905),
 ([[1010, 716], [1350, 716], [1350, 748], [1010, 748]],
  'lukaszkaiser@google . com',
  0.6630610129099045),
 ([[44, 716], [98, 716], [98, 820], [44, 820]], '8', 0.706954842287022),
 ([[744, 792], [948, 792], [948, 822], [744, 822]],
  'Illia Polosukhin *',
  0.8111600782126449),
 ([[48, 826], [94, 826], [94, 858], [48, 858]], 'CO', 0.1609912122487263),
 ([[656, 821], [1043, 821], [1043, 857], [656, 857]],
  'illia polosukhin@gmail . com',
  0.8083479656374708),
 ([[43, 876], [97, 876], [97, 1056], [43, 1056]], '8', 0.4089501389494252),
 ([[398, 1004], [818, 1004], [818, 1034], [398, 1034]],
  'The dominant sequence transduction',
  0.5307300207114422),
 ([[788, 934], [916, 934], [916, 966], [788, 966]],
  'Abstract',
  0.7721555702071696),
 ([[820, 1004], [1304, 1004], [1304, 1034], [820, 1034]],
  'models are based on complex recurrent or',
  0.7767568230904064),
 ([[396, 1034], [1304, 1034], [1304, 1064], [396, 1064]],
  'convolutional neural networks that include an encoder and a decoder:  The best',
  0.734635562741831),
 ([[395, 1063], [1303, 1063], [1303, 1099], [395, 1099]],
  'performing models also connect the encoder and decoder through an attention',
  0.5749336075523187),
 ([[395, 1091], [1307, 1091], [1307, 1129], [395, 1129]],
  'mechanism.   We propose a new simple network architecture, the Transformer;',
  0.7621255986921258),
 ([[395, 1121], [1305, 1121], [1305, 1161], [395, 1161]],
  'based solely on attention mechanisms, dispensing with recurrence and convolutions',
  0.7850719094304044),
 ([[42, 1074], [98, 1074], [98, 1280], [42, 1280]], '3', 0.6249629781220314),
 ([[396, 1156], [1302, 1156], [1302, 1188], [396, 1188]],
  'entirely.   Experiments on two machine translation tasks show these models to',
  0.6730372265230755),
 ([[395, 1180], [1307, 1180], [1307, 1222], [395, 1222]],
  'be superior in quality while being more parallelizable and requiring significantly',
  0.968514839564006),
 ([[395, 1213], [1311, 1213], [1311, 1249], [395, 1249]],
  'less time to train:  Our model achieves 28.4 BLEU 0n the WMT 2014 English-',
  0.6867570388820313),
 ([[395, 1241], [1306, 1241], [1306, 1281], [395, 1281]],
  'to-German translation task, improving over the existing best results, including',
  0.8110534563140633),
 ([[394, 1273], [1307, 1273], [1307, 1311], [394, 1311]],
  'ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task,',
  0.690006493236371),
 ([[41, 1269], [101, 1269], [101, 1377], [41, 1377]], 'g', 0.883222628744651),
 ([[395, 1305], [1305, 1305], [1305, 1341], [395, 1341]],
  'our model establishes a new single-model state-of-the-art BLEU score of 41.8 after',
  0.623964909430979),
 ([[395, 1331], [1303, 1331], [1303, 1374], [395, 1374]],
  'training for 3.5 days on eight GPUs, a small fraction of the training costs of the',
  0.6892982594247865),
 ([[395, 1363], [1303, 1363], [1303, 1401], [395, 1401]],
  'best models from the literature. We show that the Transformer generalizes well to',
  0.8877125779373923),
 ([[395, 1393], [1303, 1393], [1303, 1433], [395, 1433]],
  'other tasks by applying it successfully to English constituency parsing both with',
  0.9280293131237901),
 ([[42, 1365], [97, 1365], [97, 1546], [42, 1546]], '1', 0.29850959259656307),
 ([[392, 1422], [739, 1422], [739, 1464], [392, 1464]],
  'large and limited training data:',
  0.7625638553773784),
 ([[345, 1517], [535, 1517], [535, 1553], [345, 1553]],
  'Introduction',
  0.9848649591997218),
 ([[296, 1585], [1175, 1585], [1175, 1624], [296, 1624]],
  'Recurrent neural networks, long short-term memory [13] and gated recurrent',
  0.9075834255515828),
 ([[1216, 1588], [1402, 1588], [1402, 1614], [1216, 1614]],
  'neural networks',
  0.8222652625583747),
 ([[295, 1611], [1403, 1611], [1403, 1653], [295, 1653]],
  'in particular; have been firmly established as state of the art approaches in sequence modeling and',
  0.9473955013352517),
 ([[341, 1665], [1403, 1665], [1403, 1703], [341, 1703]],
  'Equal contribution. Listing order is random: Jakob proposed replacing RNNs with self-attention and started',
  0.6905737200201844),
 ([[296, 1697], [1402, 1697], [1402, 1729], [296, 1729]],
  'the effort to evaluate this idea. Ashish; with Illia, designed and implemented the first Transformer models and',
  0.6892796914034701),
 ([[295, 1720], [1403, 1720], [1403, 1757], [295, 1757]],
  'has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head',
  0.8518118752302728),
 ([[296, 1751], [1403, 1751], [1403, 1783], [296, 1783]],
  'attention and the parameter-free position representation and became the other person involved in nearly every',
  0.8881246746259066),
 ([[296, 1782], [1402, 1782], [1402, 1810], [296, 1810]],
  'detail: Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and',
  0.7163966436950893),
 ([[296, 1806], [1402, 1806], [1402, 1839], [296, 1839]],
  'tensorZtensor: Llion also experimented with novel model variants, was responsible for Our initial codebase, and',
  0.675663232988096),
 ([[295, 1833], [1403, 1833], [1403, 1869], [295, 1869]],
  'efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and',
  0.8460050547443896),
 ([[295, 1859], [1405, 1859], [1405, 1895], [295, 1895]],
  'implementing tensorZtensor; replacing our earlier codebase, greatly improving results and massively accelerating',
  0.7100502049881963),
 ([[296, 1891], [428, 1891], [428, 1919], [296, 1919]],
  'our research:',
  0.8374079262385009),
 ([[344, 1920], [742, 1920], [742, 1952], [344, 1952]],
  'Work performed while at Google Brain:',
  0.8561383716965495),
 ([[331, 1947], [779, 1947], [779, 1983], [331, 1983]],
  'tWork performed while at Google Research:',
  0.7374319451603422),
 ([[295, 2028], [1277, 2028], [1277, 2066], [295, 2066]],
  '3lst Conference on Neural Information Processing Systems (NIPS 2017), Long Beach; CA, USA',
  0.6544223920198357)]
def draw_boxes(image, bounds, color='yellow', width=2):
  # Draw every detected box (the first element of each bounds entry) onto the page image.
  draw = ImageDraw.Draw(image)
  for bound in bounds:
    p0, p1, p2, p3 = bound[0]   # four corner points of the text box
    draw.line([*p0, *p1, *p2, *p3, *p0], fill=color, width=width)
  return image

draw_boxes(images[0], bounds)
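Each entry in bounds is a (box, text, confidence) tuple, so once the boxes look right the recognised strings can be collapsed into plain text. A minimal sketch, where the 0.4 confidence cut-off is an arbitrary assumption rather than a tuned value:

# Keep reasonably confident detections and join them in the order easyocr returned them.
page_text = '\n'.join(text for _box, text, conf in bounds if conf > 0.4)
print(page_text[:500])   # preview the first few hundred characters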