Release v1.8.0. (Installation)
Welcome to the documentation for the internetarchive Python library. internetarchive is a command-line and Python interface to archive.org. Please report any issues on GitHub.
If you’re not sure where to begin, the quickest and easiest way to get started is downloading a binary and taking a look at the command-line interface documentation.
Installing the internetarchive library globally on your system can be done with pip. This is the recommended method for installing internetarchive (see below for details on installing pip):
$ sudo pip install internetarchive
or, with easy_install:
$ sudo easy_install internetarchive
Either of these commands will install the internetarchive Python library and ia command-line tool on your system.
Note: Some versions of Mac OS X come with Python libraries that are required by internetarchive (e.g. the Python package six). This can cause installation issues. If your installation is failing with a message that looks something like:
OSError: [Errno 1] Operation not permitted: '/var/folders/bk/3wx7qs8d0x79tqbmcdmsk1040000gp/T/pip-TGyjVo-uninstall/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six-1.4.1-py2.7.egg-info'
You can use the --ignore-installed parameter in pip to ignore the libraries that are already installed, and continue with the rest of the installation:
$ sudo pip install --ignore-installed internetarchive
More details on this issue can be found here: https://github.com/pypa/pip/issues/3165
The easiest way to install pip is probably using your operating system's package manager.
Mac OS, with homebrew:
$ brew install pip
Ubuntu, with apt-get:
$ sudo apt-get install python-pip
If your OS doesn’t have a package manager, you can also install pip with get-pip.py:
$ curl -LOs https://bootstrap.pypa.io/get-pip.py
$ python get-pip.py
If you don’t want to, or can’t, install the package system-wide you can use virtualenv to create an isolated Python environment.
First, make sure virtualenv is installed on your system. If it’s not, you can do so with pip:
$ sudo pip install virtualenv
With easy_install:
$ sudo easy_install virtualenv
Or your system's package manager, apt-get for example:
$ sudo apt-get install python-virtualenv
Once you have virtualenv installed on your system, create a virtualenv:
$ mkdir myproject
$ cd myproject
$ virtualenv venv
New python executable in venv/bin/python
Installing setuptools, pip............done.
Activate your virtualenv:
$ . venv/bin/activate
Install internetarchive into your virtualenv:
$ pip install internetarchive
You can install the latest ia snap, and help test the most recent changes from the master branch on all supported Linux distros, with:
$ sudo snap install ia --edge
Every time a new version of ia is pushed to the store, you will get it updated automatically.
Binaries are also available for the ia command-line tool:
$ curl -LOs https://archive.org/download/ia-pex/ia
$ chmod +x ia
Binaries are generated with PEX. The only requirement for using the binaries is that you have Python installed on a Unix-like operating system.
For more details on the command-line interface please refer to the README, or ia help.
Internetarchive is actively developed on GitHub.
You can either clone the public repository:
$ git clone https://github.com/jjjake/internetarchive.git
Download the tarball:
$ curl -OL https://github.com/jjjake/internetarchive/tarball/master
Or, download the zipball:
$ curl -OL https://github.com/jjjake/internetarchive/zipball/master
Once you have a copy of the source, you can install it into your site-packages easily:
$ python setup.py install
Certain functionality of the internetarchive Python library requires your archive.org credentials. Your IA-S3 keys are required for uploading, searching, and modifying metadata, and your archive.org logged-in cookies are required for downloading access-restricted content and viewing your task history. To automatically create a config file with your archive.org credentials, you can use the ia command-line tool:
$ ia configure
Enter your archive.org credentials below to configure 'ia'.
Email address: user@example.com
Password:
Config saved to: /home/user/.config/ia.ini
Your config file will be saved to $HOME/.config/ia.ini, or $HOME/.ia if you do not have a .config directory in $HOME. Alternatively, you can specify your own path to save the config to via ia --config-file '~/.ia-custom-config' configure.
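The lookup rule described above can be sketched in a few lines of Python. This only illustrates the documented rule and is not the library's own code; the function name default_config_path is made up for this example.

```python
import os


def default_config_path(home):
    """Return where `ia configure` is documented to save its config file.

    Hypothetical helper: it mirrors the rule stated above and is not part
    of the internetarchive API.
    """
    config_dir = os.path.join(home, ".config")
    if os.path.isdir(config_dir):
        # A .config directory exists, so the file goes to $HOME/.config/ia.ini.
        return os.path.join(config_dir, "ia.ini")
    # Otherwise fall back to $HOME/.ia.
    return os.path.join(home, ".ia")


if __name__ == "__main__":
    print(default_config_path(os.path.expanduser("~")))
```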
If you have a netrc file with your archive.org credentials in it, you can simply run ia configure --netrc. Note that Python’s netrc library does not currently support passphrases, or passwords with spaces in them, so these are not currently supported here either.
Creating a new item on archive.org and uploading files to it is as easy as:
>>> from internetarchive import upload
>>> md = dict(collection='test_collection', title='My New Item', mediatype='movies')
>>> r = upload('<identifier>', files=['foo.txt', 'bar.mov'], metadata=md)
>>> r[0].status_code
200
You can set the remote filename using a dictionary:
>>> r = upload('<identifier>', files={'remote-name.txt': 'local-name.txt'})
You can upload file-like objects:
>>> from io import StringIO
>>> r = upload('iacli-test-item301', {'foo.txt': StringIO(u'bar baz boo')})
If the item already has a file with the same filename, the existing file within the item will be overwritten.
upload can also upload directories. For example, the following command will upload my_dir and all of its contents to https://archive.org/download/my_item/my_dir/:
>>> r = upload('my_item', 'my_dir')
To upload only the contents of the directory, but not the directory itself, simply append a slash to your directory:
>>> r = upload('my_item', 'my_dir/')
This will upload all of the contents of my_dir to https://archive.org/download/my_item/. upload accepts relative or absolute paths.
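To make the trailing-slash rule concrete, the sketch below computes the remote names a directory upload would produce under that rule. It is a local model written for this page, not the library's implementation, and the helper name remote_names is hypothetical.

```python
import os


def remote_names(local_dir):
    """Map remote file names to local paths for a directory upload.

    Hypothetical helper that models the trailing-slash rule described above;
    the library's own implementation may differ in details.
    """
    keep_dir_name = not local_dir.endswith("/")
    root = os.path.abspath(local_dir)  # abspath also drops a trailing slash
    # Without a trailing slash, remote names are taken relative to the parent
    # directory, so they keep the "my_dir/" prefix.
    base = os.path.dirname(root) if keep_dir_name else root
    mapping = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            local_path = os.path.join(dirpath, name)
            remote = os.path.relpath(local_path, base).replace(os.sep, "/")
            mapping[remote] = local_path
    return mapping
```

A dictionary of this shape matches the {'remote-name': 'local-name'} form that upload is shown accepting earlier in this section.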
Note: metadata can only be added to an item using the upload function on item creation. If an item already exists and you would like to modify its metadata, you must use modify_metadata.
You can access all of an item’s metadata via the Item object:
>>> from internetarchive import get_item
>>> item = get_item('iacli-test-item301')
>>> item.item_metadata['metadata']['title']
'My Title'
get_item retrieves all of an item’s metadata via the Internet Archive Metadata API. This metadata can be accessed via the Item.item_metadata attribute:
>>> item.item_metadata.keys()
dict_keys(['created', 'updated', 'd2', 'uniq', 'metadata', 'item_size', 'dir', 'd1', 'files', 'server', 'files_count', 'workable_servers'])
All of the top-level keys in item.item_metadata are available as attributes:
>>> item.server
'ia801507.us.archive.org'
>>> item.item_size
161752024
>>> item.files[0]['name']
'blank.txt'
>>> item.metadata['identifier']
'iacli-test-item301'
Adding new metadata to an item can be done using the modify_metadata function:
>>> from internetarchive import modify_metadata
>>> r = modify_metadata('<identifier>', metadata=dict(title='My Stuff'))
>>> r.status_code
200
Modifying metadata can also be done via the Item object. For example, changing the title we set in the example above can be done like so:
>>> r = item.modify_metadata(dict(title='My New Title'))
>>> item.metadata['title']
'My New Title'
To remove a metadata field from an item’s metadata, set the value to 'REMOVE_TAG':
>>> r = item.modify_metadata(dict(foo='new metadata field.'))
>>> item.metadata['foo']
'new metadata field.'
>>> r = item.modify_metadata(dict(foo='REMOVE_TAG'))
>>> print(item.metadata.get('foo'))
None
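The effect of 'REMOVE_TAG' can be pictured with a small local model. The function below is a hypothetical illustration that works on a plain dictionary; the real change is carried out by archive.org when modify_metadata is called.

```python
REMOVE_TAG = "REMOVE_TAG"


def apply_changes(metadata, changes):
    """Return a copy of `metadata` with `changes` applied.

    Local model only: keys whose new value is 'REMOVE_TAG' are dropped, and
    all other keys are added or overwritten.
    """
    result = dict(metadata)
    for key, value in changes.items():
        if value == REMOVE_TAG:
            result.pop(key, None)  # removing a missing key is not an error here
        else:
            result[key] = value
    return result
```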
The default behaviour of modify_metadata is to modify item-level metadata (i.e. title, description, etc.). If we want to modify different kinds of metadata, say the metadata of a specific file, we have to change the metadata target in the call to modify_metadata:
>>> r = item.modify_metadata(dict(title='My File Title'), target='files/foo.txt')
>>> f = item.get_file('foo.txt')
>>> f.title
'My File Title'
Refer to Internet Archive Metadata for more specific details regarding metadata and archive.org.
Downloading files can be done via the download function:
>>> from internetarchive import download
>>> download('nasa', verbose=True)
nasa:
downloaded nasa/globe_west_540.jpg to nasa/globe_west_540.jpg
downloaded nasa/NASAarchiveLogo.jpg to nasa/NASAarchiveLogo.jpg
downloaded nasa/globe_west_540_thumb.jpg to nasa/globe_west_540_thumb.jpg
downloaded nasa/nasa_reviews.xml to nasa/nasa_reviews.xml
downloaded nasa/nasa_meta.xml to nasa/nasa_meta.xml
downloaded nasa/nasa_archive.torrent to nasa/nasa_archive.torrent
downloaded nasa/nasa_files.xml to nasa/nasa_files.xml
By default, the download function sets the mtime for downloaded files to the mtime of the file on archive.org. If we retry downloading the same set of files we downloaded above, no requests will be made. This is because the filename, mtime and size of the local files match the filename, mtime and size of the files on archive.org, so we assume that the file has already been downloaded. For example:
>>> download('nasa', verbose=True)
nasa:
skipping nasa/globe_west_540.jpg, file already exists based on length and date.
skipping nasa/NASAarchiveLogo.jpg, file already exists based on length and date.
skipping nasa/globe_west_540_thumb.jpg, file already exists based on length and date.
skipping nasa/nasa_reviews.xml, file already exists based on length and date.
skipping nasa/nasa_meta.xml, file already exists based on length and date.
skipping nasa/nasa_archive.torrent, file already exists based on length and date.
skipping nasa/nasa_files.xml, file already exists based on length and date.
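The skip decision described above amounts to comparing size and mtime. The sketch below shows one way such a check could look; it is an assumption-laden illustration rather than the library's code, and already_downloaded, remote_size and remote_mtime are names invented for this example.

```python
import os


def already_downloaded(local_path, remote_size, remote_mtime):
    """Return True if the local file matches the remote size and mtime.

    Hypothetical helper: `remote_size` (bytes) and `remote_mtime` (seconds
    since the epoch) stand in for the values archive.org reports for a file.
    """
    if not os.path.exists(local_path):
        return False
    stat = os.stat(local_path)
    same_size = stat.st_size == int(remote_size)
    same_mtime = int(stat.st_mtime) == int(remote_mtime)
    return same_size and same_mtime
```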
Alternatively, you can skip files based on md5 checksums. This will take longer because checksums will need to be calculated for every file already downloaded, but will be safer:
>>> download('nasa', verbose=True, checksum=True)
nasa:
skipping nasa/globe_west_540.jpg, file already exists based on checksum.
skipping nasa/NASAarchiveLogo.jpg, file already exists based on checksum.
skipping nasa/globe_west_540_thumb.jpg, file already exists based on checksum.
skipping nasa/nasa_reviews.xml, file already exists based on checksum.
skipping nasa/nasa_meta.xml, file already exists based on checksum.
skipping nasa/nasa_archive.torrent, file already exists based on checksum.
skipping nasa/nasa_files.xml, file already exists based on length and date.
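A checksum comparison of this kind can be sketched with hashlib from the standard library. This is a minimal illustration under the assumption that the expected md5 is available as a hex string; the function names are hypothetical and the library's own check may be structured differently.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Return True if the local file's md5 equals the expected hex digest."""
    return md5_of_file(path) == expected_md5.lower()
```

Reading in chunks keeps memory use flat for large files, which is why this check is slower but does not need the whole file in memory.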
By default, the download function will download all of the files in an item. However, there are a couple of parameters that can be used to download only specific files. Files can be filtered using the glob_pattern parameter:
>>> download('nasa', verbose=True, glob_pattern='*xml')
nasa:
downloaded nasa/nasa_reviews.xml to nasa/nasa_reviews.xml
downloaded nasa/nasa_meta.xml to nasa/nasa_meta.xml
downloaded nasa/nasa_files.xml to nasa/nasa_files.xml
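To preview which names a pattern such as '*xml' would select, shell-style matching from the standard library's fnmatch module can be applied to a list of file names. This is only a sketch; it assumes glob_pattern follows the same shell-style rules, and filter_by_glob is a made-up helper name.

```python
import fnmatch

# File names of the nasa item, as listed in the download output above.
NASA_FILES = [
    "globe_west_540.jpg",
    "NASAarchiveLogo.jpg",
    "globe_west_540_thumb.jpg",
    "nasa_reviews.xml",
    "nasa_meta.xml",
    "nasa_archive.torrent",
    "nasa_files.xml",
]


def filter_by_glob(names, pattern):
    """Keep the names that match a shell-style pattern (case-sensitive)."""
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]


print(filter_by_glob(NASA_FILES, "*xml"))
```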
Files can also be filtered using the formats parameter. formats can either be a single format provided as a string:
>>> download('goodytwoshoes00newyiala', verbose=True, formats='MARC')
goodytwoshoes00newyiala:
downloaded goodytwoshoes00newyiala/goodytwoshoes00newyiala_meta.mrc to goodytwoshoes00newyiala/goodytwoshoes00newyiala_meta.mrc
Or, a list of formats:
>>> download('goodytwoshoes00newyiala', verbose=True, formats=['DjVuTXT', 'MARC'])
goodytwoshoes00newyiala:
downloaded goodytwoshoes00newyiala/goodytwoshoes00newyiala_meta.mrc to goodytwoshoes00newyiala/goodytwoshoes00newyiala_meta.mrc
downloaded goodytwoshoes00newyiala/goodytwoshoes00newyiala_djvu.txt to goodytwoshoes00newyiala/goodytwoshoes00newyiala_djvu.txt
Some files on archive.org are generated on-the-fly as requested. This currently includes non-original files of the formats EPUB, MOBI, DAISY, and archive.org’s own MARC XML. These files can be downloaded using the on_the_fly parameter:
>>> download('goodytwoshoes00newyiala', verbose=True, formats='EPUB', on_the_fly=True)
goodytwoshoes00newyiala:
downloaded goodytwoshoes00newyiala/goodytwoshoes00newyiala.epub to goodytwoshoes00newyiala/goodytwoshoes00newyiala.epub
The search_items function can be used to iterate through archive.org search results:
>>> from internetarchive import search_items
>>> for i in search_items('identifier:nasa'):
...     print(i['identifier'])
...
nasa
search_items can also yield Item objects:
>>> from internetarchive import search_items
>>> for item in search_items('identifier:nasa').iter_as_items():
...     print(item)
...
Collection(identifier='nasa', exists=True)
search_items will automatically paginate through large result sets.
The ia command-line tool is installed with internetarchive, or available as a binary. ia allows you to interact with various archive.org services from the command-line.
The easiest way to start using ia is downloading a binary. The only requirements of the binary are a Unix-like environment with Python installed. To download the latest binary and make it executable, simply run:
$ curl -LOs https://archive.org/download/ia-pex/ia
$ chmod +x ia
$ ./ia help
A command line interface to archive.org.
usage:
ia [--help | --version]
ia [--config-file FILE] [--log | --debug] [--insecure] <command> [<args>]...
options:
-h, --help
-v, --version
-c, --config-file FILE Use FILE as config file.
-l, --log Turn on logging [default: False].
-d, --debug Turn on verbose logging [default: False].
-i, --insecure Use HTTP for all requests instead of HTTPS [default: false]
commands:
help Retrieve help for subcommands.
configure Configure `ia`.
metadata Retrieve and modify metadata for items on archive.org.
upload Upload items to archive.org.
download Download files from archive.org.
delete Delete files from archive.org.
search Search archive.org.
tasks Retrieve information about your archive.org catalog tasks.
list List files in a given item.
See 'ia help <command>' for more information on a specific command.
You can use ia to read and write metadata from archive.org. To retrieve all of an item’s metadata in JSON, simply:
$ ia metadata TripDown1905
A particularly useful tool to use alongside ia is jq. jq is a command-line tool for parsing JSON. For example:
$ ia metadata TripDown1905 | jq '.metadata.date'
"1906"
Once ia has been configured, you can modify metadata:
$ ia metadata <identifier> --modify="foo:bar" --modify="baz:foooo"
You can remove a metadata field by setting the value of the given field to REMOVE_TAG. For example, to remove the metadata field foo from the item <identifier>:
$ ia metadata <identifier> --modify="foo:REMOVE_TAG"
Note that some metadata fields (e.g. mediatype) cannot be modified, and must instead be set initially on upload.
The default target to write to is metadata. If you would like to write to another target, such as files, you can specify so using the --target parameter. For example, if we had an item whose identifier was my_identifier and we wanted to add a metadata field to a file within the item called foo.txt:
$ ia metadata my_identifier --target="files/foo.txt" --modify="title:My File"
You can also create new targets if they don’t exist:
$ ia metadata <identifier> --target="extra_metadata" --modify="foo:bar"
There is also an --append option, which allows you to append a string to an existing metadata string (note: use --append-list to append elements to a list). For example, if your item’s title was Foo and you wanted it to be Foo Bar, you could simply do:
$ ia metadata <identifier> --append="title: Bar"
If you would like to add a new value to an existing field that is an array (like subject or collection), you can use the --append-list option:
$ ia metadata <identifier> --append-list="subject:another subject"
This command would append another subject to the item's list of subjects, if it doesn’t already exist (i.e. no duplicate elements are added).
Metadata fields or elements can be removed with the --remove option:
$ ia metadata <identifier> --remove="subject:another subject"
This would remove another subject from the item's subject field, regardless of whether or not the field is a single or multi-value field.
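The --append-list and --remove semantics described above can be modelled on a plain dictionary. The functions below are hypothetical and only mirror the stated rules (no duplicates on append, removal from single or multi-value fields); how archive.org stores the result, for example whether a one-element list is kept as a list, is not covered here.

```python
def append_to_list(metadata, key, value):
    """Model of --append-list: add `value` to `key` unless already present."""
    result = dict(metadata)
    current = result.get(key, [])
    # Treat a single string value as a one-element list.
    values = list(current) if isinstance(current, list) else [current]
    if value not in values:
        values.append(value)
    result[key] = values
    return result


def remove_value(metadata, key, value):
    """Model of --remove for single- and multi-value fields."""
    result = dict(metadata)
    current = result.get(key)
    if isinstance(current, list):
        remaining = [v for v in current if v != value]
        if remaining:
            result[key] = remaining
        else:
            result.pop(key, None)  # nothing left, drop the field
    elif current == value:
        result.pop(key, None)
    return result
```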
Refer to Internet Archive Metadata for more specific details regarding metadata and archive.org.
If you have a lot of metadata changes to submit, you can use a CSV spreadsheet to submit many changes with a single command. Your CSV must contain an identifier column, with one item per row. Any other column added will be treated as a metadata field to modify. If no value is provided in a given row for a column, no changes will be submitted. If you would like to specify multiple values for certain fields, an index can be provided: subject[0], subject[1]. Your CSV file should be UTF-8 encoded. See metadata.csv for an example CSV file.
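As a rough illustration of this layout, the sketch below reads a small CSV and shows the per-item changes it describes. The identifiers and values are invented for the example, rows_to_metadata is a hypothetical helper, and the parsing only follows the rules stated above rather than reproducing how ia itself reads the file.

```python
import csv
import io
import re

# Matches indexed column names such as subject[0] or subject[1].
INDEXED = re.compile(r"^(?P<name>.+)\[(?P<index>\d+)\]$")


def rows_to_metadata(csv_text):
    """Turn spreadsheet rows into {identifier: metadata} dictionaries.

    Empty cells are skipped (no change is submitted for them) and indexed
    columns are gathered into a list ordered by index.
    """
    changes = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        metadata = {}
        indexed = {}
        for column, value in row.items():
            if column is None or column == "identifier" or not value:
                continue
            match = INDEXED.match(column)
            if match:
                indexed.setdefault(match["name"], {})[int(match["index"])] = value
            else:
                metadata[column] = value
        for name, by_index in indexed.items():
            metadata[name] = [by_index[i] for i in sorted(by_index)]
        changes[row["identifier"]] = metadata
    return changes


# Made-up example content: one item per row, identifier column first.
example = (
    "identifier,title,subject[0],subject[1]\n"
    "item-one,First Title,cats,dogs\n"
    "item-two,,birds,\n"
)
```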
Once you’re ready to submit your changes, you can submit them like so:
$ ia metadata --spreadsheet=metadata.csv
See ia help metadata for more details.
ia can also be used to upload items to archive.org. After configuring ia, you can upload files like so:
$ ia upload <identifier> file1 file2 --metadata="mediatype:texts" --metadata="blah:arg"
Please note that, unless specified otherwise, items will be uploaded with a data mediatype. This cannot be changed afterwards. Therefore, you should specify a mediatype when uploading, e.g. --metadata="mediatype:movies"
You can upload files from stdin:
$ curl http://dumps.wikimedia.org/kywiki/20130927/kywiki-20130927-pages-logging.xml.gz \
| ia upload <identifier> - --remote-name=kywiki-20130927-pages-logging.xml.gz --metadata="title:Uploaded from stdin."
You can use the --retries parameter to retry on errors (e.g. if IA-S3 is overloaded):
$ ia upload <identifier> file1 --retries 10
Note that ia upload makes a backup of any files that are clobbered. They are saved to a directory in the item named history/files/. The files are named in the format $key.~N~. These files can be deleted like normal files. You can also prevent the backup from happening on clobbers by adding -H x-archive-keep-old-version:0 to your command.
Refer to archive.org Identifiers for more information on creating valid archive.org identifiers. Please also read the Internet Archive Items page before getting started.
Uploading in bulk can be done similarly to Modifying Metadata in Bulk. The only difference is that you must provide a file column which contains a relative or absolute path to your file. Please see uploading.csv for an example.
Once you are ready to start your upload, simply run:
$ ia upload --spreadsheet=uploading.csv
See ia help upload for more details.
Download an entire item:
$ ia download TripDown1905
Download specific files from an item:
$ ia download TripDown1905 TripDown1905_512kb.mp4 TripDown1905.ogv
Download specific files matching a glob pattern:
$ ia download TripDown1905 --glob="*.mp4"
Note that you may have to escape the * differently depending on your shell (e.g. \*.mp4, '*.mp4', etc.).
Download only files of a specific format:
$ ia download TripDown1905 --format='512Kb MPEG4'
Note that --format cannot be used with --glob. You can get a list of the formats of a given item like so:
$ ia metadata --formats TripDown1905
Download an entire collection:
$ ia download --search 'collection:glasgowschoolofart'
Download from an itemlist:
$ ia download --itemlist itemlist.txt
See ia help download for more details.
Some files on archive.org are generated on-the-fly as requested. This currently includes non-original files of the formats EPUB, MOBI, DAISY, and archive.org’s own MARC XML. These files can be downloaded using the --on-the-fly parameter:
$ ia download goodytwoshoes00newyiala --on-the-fly
You can use ia to delete files from archive.org items:
$ ia delete <identifier> <file>
Delete a file and all files derived from the specified file:
$ ia delete <identifier> <file> --cascade
Delete all files in an item:
$ ia delete <identifier> --all
Note that ia delete makes a backup of any files that are deleted. They are saved to a directory in the item named history/files/. The files are named in the format $key.~N~. These files can be deleted like normal files. You can also prevent the backup from happening on deletes by adding -H x-archive-keep-old-version:0 to your command.
See ia help delete for more details.
ia can also be used for retrieving archive.org search results in JSON:
$ ia search 'subject:"market street" collection:prelinger'
By default, ia search attempts to return all items meeting the search criteria, and the results are sorted by item identifier. If you want to just select the top n items, you can specify a page and rows parameter. For example, to get the top 20 items matching the search ‘dogs’:
$ ia search --parameters="page=1&rows=20" "dogs"
You can use ia search to create an itemlist:
$ ia search 'collection:glasgowschoolofart' --itemlist > itemlist.txt
You can pipe your itemlist into a GNU Parallel command to download items concurrently:
$ ia search 'collection:glasgowschoolofart' --itemlist | parallel 'ia download {}'
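If GNU Parallel is not available, a similar effect can be sketched in Python with a thread pool. The helpers below are hypothetical; download_one is whatever callable you supply, so nothing here depends on a particular library signature beyond calling it with an identifier.

```python
from concurrent.futures import ThreadPoolExecutor


def read_itemlist(text):
    """Return the identifiers in an itemlist (one per line, blanks ignored)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def download_all(identifiers, download_one, max_workers=4):
    """Call `download_one(identifier)` for every identifier using a thread pool.

    Results are returned in the same order as `identifiers`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download_one, identifiers))
```

With the library installed, passing the download function from internetarchive as download_one, together with the contents of itemlist.txt, would fetch the listed items a few at a time.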
See ia help search for more details.
You can also use ia to retrieve information about your catalog tasks, after configuring ia. To retrieve the task history for an item, simply run:
$ ia tasks <identifier>
View all of your queued and running archive.org tasks:
$ ia tasks
See ia help tasks for more details.
You can list files in an item like so:
$ ia list goodytwoshoes00newyiala
See ia help list for more details.
You can copy files in archive.org items like so:
$ ia copy <src-identifier>/<src-filename> <dest-identifier>/<dest-filename>
If you’re copying your file to a new item, you can provide metadata as well:
$ ia copy <src-identifier>/<src-filename> <dest-identifier>/<dest-filename> --metadata 'title:My New Item' --metadata collection:test_collection
Note that ia copy makes a backup of any files that are clobbered. They are saved to a directory in the item named history/files/. The files are named in the format $key.~N~. These files can be deleted like normal files. You can also prevent the backup from happening on clobbers by adding -H x-archive-keep-old-version:0 to your command.
ia move works just like ia copy except the source file is deleted after the file has been successfully copied.
Note that ia move makes a backup of any files that are clobbered or deleted. They are saved to a directory in the item named history/files/. The files are named in the format $key.~N~. These files can be deleted like normal files. You can also prevent the backup from happening on clobbers or deletes by adding -H x-archive-keep-old-version:0 to your command.
Archive.org is made up of “items”. An item is a logical “thing” that we represent on one web page on archive.org. An item can be considered as a group of files that deserve their own metadata. If the files in an item have separate metadata, the files should probably be in different items. An item can be a book, a song, an album, a dataset, a movie, an image or set of images, etc. Every item has an identifier that is unique across archive.org.
An item is just a directory of files and possibly subdirectories. Every item has at least two files named in the following format (see metadata page for more context on what an identifier is):
- <identifier>_files.xml
- <identifier>_meta.xml
The _meta.xml file is an XML file containing all of the metadata describing the item. The _files.xml file is an XML file containing all of the file-level metadata. There can only be one _meta.xml file and one _files.xml file per item.
Alongside these metadata files and the original files uploaded to the item, the item may also contain derivative files automatically generated by archive.org.
As a rule of thumb, items should:
- not be over 100GB
- not contain more than 10,000 files.
All items must be part of a collection. A collection is simply an item with special characteristics. Besides an image file for the collection logo, files should never be uploaded directly to a collection item. Items can be assigned to a collection at the time of creation, or after the item has been created by modifying the collection element in an item’s metadata to contain the identifier for the given collection (e.g. ia metadata <identifier> -m collection:<collection-identifier>). Currently collections can only be created by archive.org staff. Please contact info@archive.org if you need a collection.
An item’s “details” page will always be available at:
https://archive.org/details/<identifier>
The item directory is always available at:
https://archive.org/download/<identifier>
A particular file can always be downloaded from:
https://archive.org/download/<identifier>/<filename>
Note: Archival URLs may redirect to an actual server that contains the content. The resultant URL is not a permalink. For example, the archival URL:
https://archive.org/download/popeye_taxi-turvey/popeye_taxi-turvey_meta.xml
currently redirects to:
https://ia802304.us.archive.org/30/items/popeye_taxi-turvey/popeye_taxi-turvey_meta.xml
DO NOT LINK to any archive.org URL that begins with numbers like this. This refers to the particular machine that we’re serving the file from right now, but we move items to new servers all the time. If you link to this sort of URL, instead of the archival URL, your link WILL break at some point.
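The URL patterns above are simple enough to build and check programmatically. The sketch below uses hypothetical helper names, and the host pattern for server-specific URLs is inferred from the single example shown above, so treat it as an assumption.

```python
import re
from urllib.parse import quote, urlsplit


def details_url(identifier):
    """Return the details page URL for an item."""
    return "https://archive.org/details/" + quote(identifier)


def download_url(identifier, filename=None):
    """Return the item directory URL, or a file URL if `filename` is given."""
    url = "https://archive.org/download/" + quote(identifier)
    if filename is not None:
        url += "/" + quote(filename)
    return url


# Assumed shape of a server-specific host, e.g. ia802304.us.archive.org.
_SERVER_HOST = re.compile(r"^ia\d+\.(?:[a-z0-9-]+\.)*archive\.org$")


def is_server_specific(url):
    """Return True if the URL names one particular storage server."""
    host = urlsplit(url).hostname or ""
    return bool(_SERVER_HOST.match(host))
```

A check like is_server_specific could be run over links before publishing them, so that only archival URLs are used.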
Metadata is data about data. In the case of Internet Archive items, the metadata describes the contents of the items. Metadata can include information such as the performance date for a concert, the name of the artist, and a set list for the event.
Metadata is a very important element of items in the Internet Archive. Metadata allows people to locate and view information. Items with little or poor metadata may never be seen and can become lost.
Note that metadata keys must be valid XML tags. Please refer to the XML Naming Rules section here.
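A quick local check for metadata keys might look like the sketch below. It is a deliberately simplified, ASCII-only approximation of the XML naming rules (the full rules allow many non-ASCII characters), and is_valid_metadata_key is a hypothetical helper, not something the library provides.

```python
import re

# Simplified rule: start with a letter or underscore, then letters, digits,
# underscores, hyphens or periods. ASCII only.
_XML_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def is_valid_metadata_key(key):
    """Return True if `key` looks like a usable XML tag name."""
    if not _XML_NAME.fullmatch(key):
        return False
    # Names beginning with "xml" (in any case) are reserved by the XML standard.
    return not key.lower().startswith("xml")
```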
Each item at Internet Archive has an identifier. An identifier is composed of any unique combination of alphanumeric characters, underscore (_) and dash (-). While there are no official limits, it is strongly suggested that identifiers be between 5 and 80 characters in length.
Identifiers must be unique across the entirety of Internet Archive, not simply unique within a single collection.
Once defined, an identifier cannot be changed. It will travel with the item or object and is involved in every manner of accessing or referring to the item.
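Because an identifier cannot be changed later, it can be worth checking a candidate before uploading. The sketch below tests only the character set and suggested length described above; check_identifier is a hypothetical helper, it cannot tell whether an identifier is already taken, and archive.org may apply further rules not covered here.

```python
import re

# Allowed characters as described above: alphanumerics, underscore and dash.
_IDENTIFIER = re.compile(r"[A-Za-z0-9_-]+")


def check_identifier(identifier):
    """Return a list of problems found with a candidate identifier."""
    problems = []
    if not _IDENTIFIER.fullmatch(identifier):
        problems.append("contains characters other than alphanumerics, '_' and '-'")
    if not 5 <= len(identifier) <= 80:
        problems.append("length is outside the suggested 5-80 characters")
    return problems
```

An empty list means the candidate passes both local checks.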
There are several standard metadata fields recognized for Internet Archive items. Most metadata fields are optional.
Contains the date on which the item was added to Internet Archive.
Please use an ISO 8601 compatible format for this date. For instance, these are all valid date formats:
While it is possible to set the addeddate metadata value it is not recommended. This value is typically set by automated processes.
The name of the account which added the item to the Internet Archive.
While it is possible to set the adder metadata value it is not recommended. This value is typically set by automated processes.
A collection is a specialized item used for curation and aggregation of other items. Assigning an item to a collection defines where the item may be located by a user browsing Internet Archive.
A collection must exist prior to assigning any items to it. Currently collections can only be created by Internet Archive staff members. Please contact Internet Archive if you need a collection created.
All items should belong to a collection. If a collection is not specified at the time of upload, it will be added to the opensource collection. For testing purposes, you may upload to the test_collection collection.
The value of the contributor metadata field is information about the entity responsible for making contributions to the content of the item. This is often the library, organization or individual making the item available on Internet Archive.
The value of this metadata field may contain HTML. <script> tags and CSS are not allowed.
The extent or scope of the content of the material available in the item. The value of the coverage metadata field may include geographic place, temporal period, jurisdiction, etc. For items which contain multi-volume or serial content, place the statement of holdings in this metadata field.
An entity primarily responsible for creating the files contained in the item.
The participants in the production of the materials contained in the item.
The value of this metadata field may contain HTML. <script> tags and CSS are not allowed.
The publication, production or other similar date of this item.
Please use an ISO 8601 compatible format for this date.
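Python's datetime module can produce and check ISO 8601 calendar dates. The sketch below covers only the YYYY-MM-DD form and uses hypothetical helper names; other ISO 8601 forms that archive.org may accept are not exercised here.

```python
from datetime import date


def iso_date(year, month, day):
    """Format a calendar date as YYYY-MM-DD."""
    return date(year, month, day).isoformat()


def is_iso_date(text):
    """Return True if `text` parses as an ISO 8601 calendar date."""
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True
```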
A description of the item.
The value of this metadata field may contain HTML. <script> tags and CSS are not allowed.
The primary language of the material available in the item.
While the value of the language metadata field can be any value, Internet Archive prefers they be MARC21 Language Codes.
A URL to the license which covers the works contained in the item.
Internet Archive recommends (but does not require) Creative Commons licensing. Creative Commons provides a license selector for finding the correct license for your needs.
The primary type of media contained in the item. While an item can contain files of diverse mediatypes the value in this field defines the appearance and functionality of the item’s detail page on Internet Archive. In particular, the mediatype of an item defines what sort of online viewer is available for the files contained in the item.
The mediatype metadata field recognizes a limited set of values:
If the mediatype value you set is not in the list above it will be saved but ignored by the system. The item will be treated as though it has a mediatype value of data.
If a value is not specified for this field it will default to data.
All items will have their metadata included in the Internet Archive search engine. To disable indexing in the search engine, include a noindex metadata tag. The value of the tag does not matter. Its presence is enough to trigger not including the metadata in the search engine.
If an item’s metadata has already been indexed in the search engine, setting noindex will remove it from the index.
Items whose metadata is not included in the search engine index are not considered “public” per se and therefore will not have a value in the publicdate metadata field (see below).
Contains user-defined information about the item.
The value of this metadata field may contain HTML. <script> tags and CSS are not allowed.
On the v1 archive.org site, each collection page on Internet Archive may include a “Staff Picks” section. This section will highlight a single item in the collection. This item will be selected at random from the items with a pick metadata value of 1. If there are no items with this pick metadata value the “Staff Picks” section will not appear on the collection page.
By default all new items have no pick metadata value. Note: v2 of the archive.org website does not make use of this value.
Items which have had their metadata included in the Internet Archive search engine index are considered to be public. The date the metadata is added to the index is the public date for the item.
Please use an ISO 8601 compatible format for this date. For instance, these are all valid date formats:
While it is possible to set the publicdate metadata value it is not recommended. This value is typically set by automated processes.
The publisher of the material available in the item.
A statement of the rights held in and over the files in the item.
The value of this metadata field may contain HTML. <script> tags and CSS are not allowed.
Keyword(s) or phrase(s) that may be searched for to find your item. This field can contain multiple values:
$ ia metadata <identifier> --modify='subject:foo' --modify='subject:bar'
Or, in Python:
>>> from internetarchive import modify_metadata
>>> md = dict(subject=['foo', 'bar'])
>>> r = modify_metadata('<identifier>', md)
It is helpful but not necessary for you to use Library of Congress Subject Headings for the value of this metadata header.
The title for the item. This appears in the header of the item’s detail page on Internet Archive.
If a value is not specified for this field it will default to the identifier for the item.
The date on which an update was made to the item. This field is repeatable.
Please use an ISO 8601 compatible format for this date.
While it is possible to set the updatedate metadata value it is not recommended. This value is typically set by automated processes.
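As a minimal sketch of producing ISO 8601 values for date fields, using only Python's standard datetime module (the sample date and variable names are illustrative and not taken from the library):

```python
from datetime import date, datetime, timezone

# A calendar date with no time component.
day_str = date(2018, 6, 28).isoformat()

# A UTC timestamp with an explicit "Z" suffix.
moment = datetime(2018, 6, 28, 14, 30, 0, tzinfo=timezone.utc)
moment_str = moment.strftime('%Y-%m-%dT%H:%M:%SZ')

# Round-trip check: the string parses back to the same moment.
parsed = datetime.strptime(moment_str, '%Y-%m-%dT%H:%M:%SZ').replace(tzinfo=timezone.utc)

print(day_str)     # 2018-06-28
print(moment_str)  # 2018-06-28T14:30:00Z
```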
The name of the account which updated the item. This field is repeatable.
While it is possible to set the updater metadata value it is not recommended. This value is typically set by automated processes.
The name of the account which uploaded the file(s) to the item.
The uploader has ownership over the item and is allowed to maintain it.
This value is set by automated processes.
Internet Archive strives to be metadata agnostic, enabling users to define the metadata format which best suits the needs of their material. In addition to the standard metadata fields listed above you may also define as many custom metadata fields as you require. These metadata fields can be defined ad hoc at item creation or metadata editing time and do not have to be defined in advance.
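To make this concrete, here is a sketch of assembling standard and custom fields into one metadata dictionary of the kind passed to modify_metadata in the subject example above. The helper build_metadata and the custom field names (scanner_model, shelf_location) are hypothetical, chosen only for illustration:

```python
def build_metadata(standard, custom):
    """Merge standard and custom fields into one metadata dict.

    Raises ValueError if a custom field would silently overwrite
    a standard field of the same name.
    """
    clash = set(standard) & set(custom)
    if clash:
        raise ValueError('custom fields shadow standard fields: %s' % sorted(clash))
    merged = dict(standard)
    merged.update(custom)
    return merged


md = build_metadata(
    standard={'title': 'My Item', 'subject': ['foo', 'bar']},
    custom={'scanner_model': 'hypothetical-9000', 'shelf_location': 'B12'},
)

# The result could then be written with, for example:
#   from internetarchive import modify_metadata
#   r = modify_metadata('<identifier>', md)
```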
Certain functions of the internetarchive library require your archive.org credentials (e.g. uploading, modifying metadata, searching). Your credentials and other configuration can be provided via a dictionary when instantiating an ArchiveSession or Item object, or in a config file.
The easiest way to create a config file is with the configure function:
>>> from internetarchive import configure
>>> configure('user@example.com', 'password')
Config files are stored in either $HOME/.ia or $HOME/.config/ia.ini by default. You can also specify your own path:
>>> from internetarchive import configure
>>> configure('user@example.com', 'password', config_file='/home/jake/.config/ia-alternate.ini')
Custom config files can be specified when instantiating an ArchiveSession object:
>>> from internetarchive import get_session
>>> s = get_session(config_file='/home/jake/.config/ia-alternate.ini')
Or an Item object:
>>> from internetarchive import get_item
>>> item = get_item('nasa', config_file='/home/jake/.config/ia-alternate.ini')
Your IA-S3 keys are required for uploading and modifying metadata. You can retrieve your IA-S3 keys at https://archive.org/account/s3.php.
They can be specified in your config file like so:
[s3]
access = mYaccEsSkEY
secret = mYs3cREtKEy
Or, using the ArchiveSession object:
>>> from internetarchive import get_session
>>> c = {'s3': {'access': 'mYaccEsSkEY', 'secret': 'mYs3cREtKEy'}}
>>> s = get_session(config=c)
>>> s.access_key
'mYaccEsSkEY'
Your archive.org logged-in cookies are required for downloading access-restricted files that you have permission to access, and for retrieving information about archive.org catalog tasks.
Your cookies can be specified like so:
[cookies]
logged-in-user = user%40example.com
logged-in-sig = <redacted>
Or, using the ArchiveSession object:
>>> from internetarchive import get_session
>>> c = {'cookies': {'logged-in-user': 'user%40example.com', 'logged-in-sig': 'foo'}}
>>> s = get_session(config=c)
>>> s.cookies['logged-in-user']
'user%40example.com'
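The logged-in-user value in these examples appears to be a percent-encoded email address (@ becomes %40). Assuming that is the expected form, the Python 3 standard library can produce and reverse it. This sketch does not touch the network or the internetarchive library:

```python
from urllib.parse import quote, unquote

email = 'user@example.com'

# Percent-encode reserved characters such as "@".
encoded = quote(email, safe='')

# Decode it again to recover the original address.
decoded = unquote(encoded)

print(encoded)  # user%40example.com
```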
You can specify logging levels and the location of your log file like so:
[logging]
level = INFO
file = /tmp/ia.log
Or, using the ArchiveSession object:
>>> from internetarchive import get_session
>>> c = {'logging': {'level': 'INFO', 'file': '/tmp/ia.log'}}
>>> s = get_session(config=c)
By default logging is turned off.
By default all requests are HTTPS in Python versions 2.7.10 or newer. You can change this setting in your config file in the general section:
[general]
secure = False
Or, using the ArchiveSession object:
>>> from internetarchive import get_session
>>> s = get_session(config={'general': {'secure': False}})
In the example above, all requests will be made via HTTP.
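The config file sections above mirror the nested dictionary passed to get_session(config=...). As a sketch of that correspondence, the following parses ini-style text with the standard library's configparser and builds the equivalent dict. The helper ini_to_config is hypothetical, and the library's own config loading may differ in detail. Interpolation is disabled because values such as user%40example.com contain a literal % sign:

```python
import configparser

INI_TEXT = """
[s3]
access = mYaccEsSkEY
secret = mYs3cREtKEy

[cookies]
logged-in-user = user%40example.com
logged-in-sig = foo

[logging]
level = INFO
file = /tmp/ia.log

[general]
secure = False
"""


def ini_to_config(text):
    """Convert ini-style text into a nested {section: {key: value}} dict."""
    # interpolation=None keeps "%" characters in values untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    config = {section: dict(parser[section]) for section in parser.sections()}
    # ini values are strings; convert the boolean flag explicitly.
    if parser.has_option('general', 'secure'):
        config['general']['secure'] = parser.getboolean('general', 'secure')
    return config


config = ini_to_config(INI_TEXT)
print(config['general'])  # {'secure': False}
```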
The ArchiveSession object is subclassed from requests.Session. It collects together your credentials and config.
Return a new ArchiveSession object. The ArchiveSession object is the main interface to the internetarchive lib. It allows you to persist certain parameters across tasks.
Returns: ArchiveSession object.
Usage:
>>> from internetarchive import get_session
>>> config = dict(s3=dict(access='foo', secret='bar'))
>>> s = get_session(config)
>>> s.access_key
'foo'
From the session object, you can access all of the functionality of the internetarchive lib:
>>> item = s.get_item('nasa')
>>> item.download()
nasa: ddddddd - success
>>> s.get_tasks(task_ids=31643513)[0].server
'ia311234'
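Because one session carries the credentials and config, it can be reused across many items. The helper below, item_sizes, is a hypothetical sketch: it relies only on the get_item method and item_size attribute shown in this section, and a stand-in object replaces the real session so the example runs without network access:

```python
from types import SimpleNamespace


def item_sizes(session, identifiers):
    """Return {identifier: item_size} using a single shared session."""
    sizes = {}
    for identifier in identifiers:
        item = session.get_item(identifier)
        sizes[identifier] = item.item_size
    return sizes


# Offline stand-in for a session, used only to demonstrate the helper.
fake_session = SimpleNamespace(
    get_item=lambda identifier: SimpleNamespace(item_size=121084)
)
sizes = item_sizes(fake_session, ['nasa'])
print(sizes)  # {'nasa': 121084}

# With a real session (requires network access):
#   from internetarchive import get_session
#   sizes = item_sizes(get_session(), ['nasa'])
```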
Item objects represent Internet Archive items. From the Item object you can create new items, upload files to existing items, read and write metadata, and download or delete files.
Get an Item object.
>>> from internetarchive import get_item
>>> item = get_item('nasa')
>>> item.item_size
121084
Uploading to an item can be done using Item.upload():
>>> from internetarchive import get_item
>>> item = get_item('my_item')
>>> r = item.upload('/home/user/foo.txt')
Or with the upload function:
>>> from internetarchive import upload
>>> r = upload('my_item', '/home/user/foo.txt')
The item will automatically be created if it does not exist.
Refer to archive.org Identifiers for more information on creating valid archive.org identifiers.
Remote filenames can be defined using a dictionary:
>>> from io import BytesIO
>>> fh = BytesIO()
>>> fh.write(b'foo bar')
>>> item.upload({'my-remote-filename.txt': fh})
Upload files to an item. The item will be created if it does not exist.
Returns: A list of requests.Response objects.
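Since upload returns a list of requests.Response objects, a caller can inspect each one. A minimal sketch using the standard status_code and url attributes of requests.Response. The helper failed_uploads is hypothetical, the URLs are placeholders, and stand-in objects replace real responses so the example runs offline:

```python
from types import SimpleNamespace


def failed_uploads(responses):
    """Return the URLs of responses with an HTTP error status (400 or above)."""
    return [r.url for r in responses if r.status_code >= 400]


# Stand-ins for the objects returned by item.upload(...).
responses = [
    SimpleNamespace(status_code=200, url='https://example.org/my_item/foo.txt'),
    SimpleNamespace(status_code=503, url='https://example.org/my_item/bar.txt'),
]
failures = failed_uploads(responses)
print(failures)  # ['https://example.org/my_item/bar.txt']
```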
Modify the metadata of an existing item on Archive.org.
Returns: requests.Response object, or requests.Request object if debug is True.
The default target to write to is metadata. To write to another target, such as files, use the target parameter. For example, if you had an item whose identifier was my_identifier and wanted to add a metadata field to a file within the item called foo.txt:
>>> from internetarchive import modify_metadata
>>> r = modify_metadata('my_identifier', metadata=dict(title='My File'), target='files/foo.txt')
>>> from internetarchive import get_files
>>> f = list(get_files('my_identifier', 'foo.txt'))[0]
>>> f.title
'My File'
You can also create new targets if they don’t exist:
>>> r = modify_metadata('my_identifier', metadata=dict(foo='bar'), target='extra_metadata')
>>> from internetarchive import get_item
>>> item = get_item('my_identifier')
>>> item.item_metadata['extra_metadata']
{'foo': 'bar'}
Download files from an item.
Return type: bool
Returns: True if all files were downloaded successfully.
Delete files from an item. Note: Some system files, such as <itemname>_meta.xml, cannot be deleted.
Get File objects from an item.
>>> from internetarchive import get_files
>>> fnames = [f.name for f in get_files('nasa', glob_pattern='*xml')]
>>> print(fnames)
['nasa_reviews.xml', 'nasa_meta.xml', 'nasa_files.xml']
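To preview which names a glob_pattern would select before calling get_files, the standard fnmatch module can be applied to a list of filenames. This sketch assumes the library's pattern matching behaves like shell-style globbing, which the text above does not state explicitly. The filename list is illustrative:

```python
from fnmatch import fnmatch

# Illustrative filenames; the last one is a made-up non-XML file.
filenames = ['nasa_reviews.xml', 'nasa_meta.xml', 'nasa_files.xml', 'nasa_logo.jpg']

# Keep only the names that match the shell-style pattern.
xml_names = [name for name in filenames if fnmatch(name, '*xml')]
print(xml_names)  # ['nasa_reviews.xml', 'nasa_meta.xml', 'nasa_files.xml']
```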
Search for items on Archive.org.
Returns: A Search object, yielding search results.
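A sketch of consuming search results. It assumes each yielded result is a dict with an identifier key, which the text above does not spell out. The helper collect_identifiers is hypothetical, and a plain list stands in for the Search object so the example runs offline:

```python
def collect_identifiers(results, limit=None):
    """Collect identifiers from an iterable of result dicts, up to limit."""
    identifiers = []
    for result in results:
        if limit is not None and len(identifiers) >= limit:
            break
        identifiers.append(result['identifier'])
    return identifiers


# Stand-in for the results yielded by a Search object.
fake_results = [
    {'identifier': 'nasa'},
    {'identifier': 'my_item'},
    {'identifier': 'other'},
]
first_two = collect_identifiers(fake_results, limit=2)
print(first_two)  # ['nasa', 'my_item']
```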
Get tasks from the Archive.org catalog. internetarchive must be configured with your logged-in-* cookies to use this function. If no arguments are provided, all queued tasks for the user will be returned.
Returns: A set of CatalogTask objects.
Features and Improvements
Use scrape API for getting total number of results rather than the advanced search API.
Improved error messages for IA-S3 (upload) related errors.
Added retry support to delete.
ia delete no longer exits if a single request fails when deleting multiple files, but continues onto the next file. If any file fails, the command will exit with a non-zero status code.
All search requests now require authentication via IA-S3 keys. You can run ia configure to generate a config file that will be used to authenticate all search requests automatically. For more details refer to the following links:
http://internetarchive.readthedocs.io/en/latest/quickstart.html?highlight=configure#configuring
http://internetarchive.readthedocs.io/en/latest/api.html#configuration
Added ability to specify your own filepath in ia configure and internetarchive.configure().
The internetarchive library uses the HTTPS protocol for making secure requests by default. This can cause issues when using versions of Python earlier than 2.7.9:
Certain Python platforms (specifically, versions of Python earlier than 2.7.9) have restrictions in their ssl module that limit the configuration that urllib3 can apply. In particular, this can cause HTTPS requests that would succeed on more featureful platforms to fail, and can cause certain security features to be unavailable.
See https://urllib3.readthedocs.org/en/latest/security.html for more details.
If you are using a Python version earlier than 2.7.9, you might see InsecurePlatformWarning and SNIMissingWarning warnings and your requests might fail. There are a few options to address this issue:
- Upgrade your Python to version 2.7.9 or more recent.
- Install or upgrade the following Python modules as documented here: PyOpenSSL, ndg-httpsclient, and pyasn1.
- Use HTTP to make insecure requests in one of the following ways:
  - Adding the following lines to your ia.ini config file (usually located at ~/.config/ia.ini or ~/.ia.ini):
    [general]
    secure = false
  - In the Python interface, using a config dict:
    >>> from internetarchive import get_item
    >>> config = dict(general=dict(secure=False))
    >>> item = get_item('<identifier>', config=config)
  - In the command-line interface, use the --insecure option:
    $ ia --insecure download <identifier>
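The version threshold above can also be checked at runtime. A minimal sketch, assuming 2.7.9 is the cutoff as stated in this section. The helper https_fully_supported is hypothetical and not part of the library:

```python
import sys


def https_fully_supported(version_info=None):
    """Return True if the interpreter is Python 2.7.9 or newer."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:3]) >= (2, 7, 9)


# Build a config dict that falls back to HTTP only on old interpreters.
config = {'general': {'secure': https_fully_supported()}}
```

The resulting dict has the same shape as the config dict shown in the list above, so it could be passed to get_item or get_session in the same way.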
On some 32-bit systems you may run into issues uploading files larger than 2 GB. You may see an error that looks something like OverflowError: long int too large to convert to int. You can get around this by upgrading requests:
$ pip install --upgrade requests
You can find more details about this issue at the following links:
https://github.com/sigmavirus24/requests-toolbelt/issues/80
https://github.com/kennethreitz/requests/issues/2691
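A sketch of detecting the situation described above before starting an upload. It assumes a 32-bit Python build can be recognised from sys.maxsize and uses the 2 GB threshold from the text. The helper may_hit_overflow is hypothetical:

```python
import sys

TWO_GB = 2 * 1024 ** 3


def may_hit_overflow(size_bytes, maxsize=None):
    """Return True for a file over 2 GB on a 32-bit Python build."""
    if maxsize is None:
        maxsize = sys.maxsize
    is_32_bit = maxsize <= 2 ** 31 - 1
    return is_32_bit and size_bytes > TWO_GB


# For a real file, pass os.path.getsize(path) as size_bytes.
```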
Thank you for considering contributing. All contributions are welcome and appreciated!
Please don’t use the GitHub issue tracker for asking support questions. All support questions should be emailed to info@archive.org.
GitHub issues are used for tracking bugs. Please consider the following when opening an issue:
All pull requests and patches are welcome, but please consider the following:
The minimal requirements for running tests are pytest, pytest-pep8 and responses:
$ pip install pytest pytest-pep8 responses
Clone the internetarchive lib:
$ git clone https://github.com/jjjake/internetarchive
Install the internetarchive lib as an editable package:
$ cd internetarchive
$ pip install -e .
Run the tests:
$ py.test --pep8
Note that this only tests against the Python version you are currently using. The internetarchive test suite, however, runs against every Python version defined in tox.ini, and tests must pass on all of those versions for every pull request.
To test against all supported Python versions, first make sure you have all of the required versions of Python installed. Then install tox and run it from the root directory of the repo:
$ pip install tox
$ tox
Even easier is simply creating a pull request. Travis is used for continuous integration, and is set up to run the full test suite whenever a pull request is submitted or updated.
The Internet Archive Python library and command-line tool is written and maintained by Jake Johnson and various contributors: